# Financial Informatics –XVI: Supervised Backpropagation Learning

Khurshid Ahmad, Professor of Computer Science, Department of Computer Science, Trinity College, Dublin-2, IRELAND. November 19th, 2008.

Preamble. Neural networks 'learn' by adapting in accordance with a training regimen. There are five key algorithms:

- Error-correction or performance learning
- Hebbian or coincidence learning
- Boltzmann learning (stochastic net learning)
- Competitive learning
- Filter learning (Grossberg's nets)

ANN Learning Algorithms


Back-propagation Algorithm: Supervised Learning
Backpropagation (BP) is amongst the most popular algorithms for ANNs: Paul Werbos, the person who first worked on the algorithm in the 1970s, has estimated that between 40% and 90% of real-world ANN applications use the BP algorithm. Werbos traces the algorithm to the psychologist Sigmund Freud's theory of psychodynamics, and applied it himself in political forecasting. David Rumelhart, Geoffrey Hinton and others applied the BP algorithm in the 1980s to problems related to supervised learning, particularly pattern recognition. The most useful examples of the BP algorithm have been in dealing with problems related to prediction and control. From Paul Werbos (1995). 'Backpropagation: Basics and New Developments'. In Michael A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks. Cambridge (Mass., USA) & London: The MIT Press. pp.

Back-propagation Algorithm: Architecture of a BP system

Back-propagation Algorithm
BASIC DEFINITIONS. Backpropagation is a procedure for efficiently calculating the derivatives of some output quantity of a non-linear system, with respect to all inputs and parameters of that system, through calculations proceeding backwards from outputs to inputs. Backpropagation is also any technique for adapting the weights or parameters of a non-linear system by using such derivatives or their equivalent. According to Paul Werbos there is no such thing as a "backpropagation network"; he used an ANN design called the multilayer perceptron.

Back-propagation Algorithm
Paul Werbos provided a rule for updating the weights of a multi-layered network undergoing supervised learning. It is this weight-adaptation rule which is called backpropagation. Typically, a fully connected feedforward network is trained using the BP algorithm: activation in such networks travels from the input layer to the output layer, and each unit in one layer is connected to every unit in the next layer. There are two sweeps of the fully connected network: a forward sweep and a backward sweep. From Robert Callan (1999). The Essence of Neural Networks. London: Prentice Hall Europe. pp.

Back-propagation Algorithm
There are two sweeps of the fully connected network: a forward sweep and a backward sweep. Forward sweep: this sweep is similar to any other feedforward ANN. The input stimulus is presented to the network; each unit computes the weighted sum of its inputs and passes that sum through a squash function, and the ANN subsequently generates an output. The ANN may have a number of hidden layers, as in a multi-layer perceptron, and the output from each hidden layer becomes the input to the next layer forward.
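The forward sweep can be sketched in a few lines of Python; a minimal sketch assuming a logistic squash function and layers given as lists of per-unit weight vectors (the function names and the weight layout are illustrative, not taken from the slides):

```python
import math

def sigmoid(x):
    # logistic squash function
    return 1.0 / (1.0 + math.exp(-x))

def forward_sweep(inputs, layers):
    """Propagate an input vector through successive layers.

    `layers` is a list of layers; each layer is a list of per-unit
    weight vectors whose first entry is the bias weight (bias input 1.0).
    The output of each layer becomes the input to the next layer forward.
    """
    activation = list(inputs)
    for layer in layers:
        activation = [
            sigmoid(w[0] + sum(wi * a for wi, a in zip(w[1:], activation)))
            for w in layer
        ]
    return activation
```

For a 2-2-1 network, `forward_sweep([0.1, 0.9], [hidden_weights, output_weights])` returns a one-element list holding the network output.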

Back-propagation Algorithm
Backward sweep: this sweep is similar to the forward sweep, except that what is 'swept' are the error values: essentially the differences between the actual output and the desired output. In the backward sweep the output unit sends its error back to the proximate hidden layer, which in turn passes the error back to the next hidden layer. No error signal is sent to the input units.

Back-propagation Algorithm
The algorithm in outline:

- For each input vector, associate a target output vector.
- While not STOP:
  - STOP = TRUE
  - For each input vector:
    - perform a forward sweep to find the actual output;
    - obtain an error vector by comparing the actual and target output;
    - if the actual output is not within tolerance, set STOP = FALSE;
    - use the backward sweep to determine weight changes;
    - update the weights.

Step by step:

Step 1. Read the first input pattern and its associated output pattern; set CONVERGE = TRUE.

Step 2. For the input layer, assign as net input to each unit its corresponding element in the input vector; the output of each unit is its net input.

Step 3. For the first hidden-layer units, calculate the net input and the output: $net_j = \sum_i w_{ij} o_i$ and $o_j = 1/(1 + e^{-net_j})$. Repeat step 3 for all subsequent hidden layers.

Step 4. For the output-layer units, calculate the net input and output (same formulae as step 3).

Step 5. Is the difference between the target and output pattern within tolerance? If no, then CONVERGE = FALSE.

Step 6. For each output unit calculate its error: $\delta_j = (d_j - o_j)\,o_j(1 - o_j)$.

Step 7. For the last hidden layer calculate the error for each unit: $\delta_j = o_j(1 - o_j)\sum_k \delta_k w_{jk}$. Repeat step 7 for the remaining hidden layers.

Step 8. For all layers, update the weights for each unit: $w_{ij}(n+1) = \eta\,\delta_j o_i + w_{ij}(n)$.

If this was not the last pattern, read the next input pattern and its associated output pattern and repeat from step 2. After the last pattern, stop if CONVERGE = TRUE.
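The steps above can be turned into a runnable sketch; a minimal Python implementation for a single hidden layer, assuming a logistic activation and no momentum term (the function and variable names are my own, not from the slides):

```python
import math
import random

def sigmoid(x):
    # logistic squash function
    return 1.0 / (1.0 + math.exp(-x))

def train_bp(patterns, n_in, n_hid, eta=0.25, tol=0.05, max_epochs=10000):
    """Steps 1-8 above, for a one-hidden-layer network.

    `patterns` is a list of (input vector, target output) pairs;
    index 0 of each weight vector is the bias weight (bias input 1.0).
    """
    random.seed(0)
    w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_hid)]
    w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
    for _ in range(max_epochs):
        converged = True
        for x, d in patterns:
            # forward sweep (steps 2-4)
            o_hid = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
                     for w in w_hid]
            o = sigmoid(w_out[0] + sum(wi * oh for wi, oh in zip(w_out[1:], o_hid)))
            # step 5: tolerance check
            if abs(d - o) > tol:
                converged = False
            # step 6: output error, delta_j = (d_j - o_j) o_j (1 - o_j)
            delta_o = (d - o) * o * (1 - o)
            # step 7: hidden errors, delta_j = o_j (1 - o_j) sum_k delta_k w_jk
            delta_h = [oh * (1 - oh) * delta_o * w_out[j + 1]
                       for j, oh in enumerate(o_hid)]
            # step 8: weight updates, w_ij(n+1) = eta * delta_j * o_i + w_ij(n)
            w_out[0] += eta * delta_o
            for j, oh in enumerate(o_hid):
                w_out[j + 1] += eta * delta_o * oh
            for j, dh in enumerate(delta_h):
                w_hid[j][0] += eta * dh
                for i, xi in enumerate(x):
                    w_hid[j][i + 1] += eta * dh * xi
        if converged:
            return True, (w_hid, w_out)
    return False, (w_hid, w_out)
```

Note that the hidden errors (step 7) are computed before any weights are updated, so the backward sweep uses the same weights as the forward sweep.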

Back-propagation Algorithm
Example: perform a complete forward and backward sweep of a network with 2 input units, 2 hidden-layer units and 1 output unit. The target output is d = 0.9 and the input vector is [0.1 0.9]. The input units are labelled {1} and {2}, the hidden-layer units {4} and {5}, and the output unit {6}, with unit {0} acting as the bias unit for each layer. The weights from the bias, input 1 and input 2 to hidden unit 4 are 0.1, −0.2 and 0.1; to hidden unit 5 they are 0.1, −0.1 and 0.3; and from the bias, unit 4 and unit 5 to output unit 6 they are 0.2, 0.2 and 0.3. The computation is as follows:

net4 = (1.0 × 0.1) + (0.1 × −0.2) + (0.9 × 0.1) = 0.170, o4 = 1/(1 + e^−0.170) = 0.542
net5 = (1.0 × 0.1) + (0.1 × −0.1) + (0.9 × 0.3) = 0.360, o5 = 1/(1 + e^−0.360) = 0.589

Similarly, from the hidden to the output layer:

net6 = (1.0 × 0.2) + (0.542 × 0.2) + (0.589 × 0.3) = 0.485, o6 = 1/(1 + e^−0.485) = 0.619

The error for the output node is:

δ6 = (0.9 − 0.619) × 0.619 × (1 − 0.619) = 0.066

Note that the hidden-unit errors are used to update the first layer of weights. There is no error calculated for the bias unit, as no weights from the first layer connect to the hidden bias. The errors for the hidden nodes are:

δ5 = 0.589 × (1.0 − 0.589) × (0.066 × 0.3) = 0.005
δ4 = 0.542 × (1.0 − 0.542) × (0.066 × 0.2) = 0.003

The rate of learning for this example is taken as 0.25. There is no need for a momentum term on this first pattern, since there are no previous weight changes:

Δw5,6 = 0.25 × 0.066 × 0.589 = 0.010; the new weight is 0.3 + 0.010 = 0.31
Δw4,6 = 0.25 × 0.066 × 0.542 = 0.009; then 0.2 + 0.009 = 0.209
Δw0,6 = 0.25 × 0.066 × 1.0 = 0.017; finally 0.2 + 0.017 = 0.217

The calculation of the new weights for the first layer is left as an exercise.
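The arithmetic of this worked example can be checked directly; a short Python verification of the sweep (the rounded values match those quoted above):

```python
import math

def sig(x):
    # logistic squash function
    return 1.0 / (1.0 + math.exp(-x))

# forward sweep (bias input is 1.0; input vector is [0.1, 0.9])
net4 = 1.0 * 0.1 + 0.1 * -0.2 + 0.9 * 0.1   # 0.17
o4 = sig(net4)                               # 0.542
net5 = 1.0 * 0.1 + 0.1 * -0.1 + 0.9 * 0.3   # 0.36
o5 = sig(net5)                               # 0.589
net6 = 1.0 * 0.2 + o4 * 0.2 + o5 * 0.3      # 0.485
o6 = sig(net6)                               # 0.619

# backward sweep
delta6 = (0.9 - o6) * o6 * (1 - o6)          # 0.066
delta5 = o5 * (1 - o5) * delta6 * 0.3        # 0.005
delta4 = o4 * (1 - o4) * delta6 * 0.2        # 0.003

# weight updates with learning rate 0.25
w56_new = 0.3 + 0.25 * delta6 * o5           # 0.31
w46_new = 0.2 + 0.25 * delta6 * o4           # 0.209
w06_new = 0.2 + 0.25 * delta6 * 1.0          # 0.217
```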

Back-propagation Algorithm
Example of a forward pass and a backward pass through a feedforward network (in the original slide, inputs, outputs and errors are shown in boxes on a network diagram). The forward pass produces an output of 0.288 at the output unit. The target value is 0.9, so the error for the output unit is:

δ = (0.900 − 0.288) × 0.288 × (1 − 0.288) = 0.125

The backward pass then yields errors of 0.040 and 0.025 for the units in the second hidden layer, and −0.007 and −0.001 for the units in the first hidden layer.

Back-propagation Algorithm
New weights calculated following the errors derived above, using a learning rate of 0.8 (new weight = old weight + 0.8 × error × input to the weight):

- −2 + (0.8 × 0.125) × 1 = −1.9
- 3 + (0.8 × 0.125) × 0.122 = 3.012
- 1 + (0.8 × 0.125) × 0.728 = 1.073
- 3 + (0.8 × 0.04) × 1 = 3.032
- −2 + (0.8 × 0.025) × 1 = −1.98
- −2 + (0.8 × 0.04) × 0.5 = −1.984
- 2 + (0.8 × 0.025) × 0.993 = 2.019
- 2 + (0.8 × 0.025) × 0.5 = 2.010
- −4 + (0.8 × 0.04) × 0.993 = −3.968
- 2 + (0.8 × −0.007) × 1 = 1.994
- 2 + (0.8 × −0.001) × 1 = 1.999
- −2 + (0.8 × −0.007) × 0.1 = −2.001
- 3 + (0.8 × −0.001) × 0.9 = 2.999
- 3 + (0.8 × −0.001) × 0.1 = 3.000
- −2 + (0.8 × −0.007) × 0.9 = −2.005

Back-propagation Algorithm
A derivation of the BP algorithm. The error signal at the output of neuron j at the nth training cycle is given as $e_j(n) = d_j(n) - y_j(n)$. The instantaneous value of the error energy for neuron j is $E_j(n) = \tfrac{1}{2} e_j^2(n)$. The total error energy E(n) can be computed by summing the instantaneous energy over all the neurons in the output layer: $E(n) = \tfrac{1}{2}\sum_{j \in C} e_j^2(n)$.
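These definitions translate directly into code; a small sketch of the error signal and the instantaneous error energy (the function names are illustrative):

```python
def error_signal(desired, actual):
    # e_j(n) = d_j(n) - y_j(n), one value per output neuron
    return [d - y for d, y in zip(desired, actual)]

def total_error_energy(desired, actual):
    # E(n) = 1/2 * sum of e_j(n)^2 over the output layer
    return 0.5 * sum(e * e for e in error_signal(desired, actual))
```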

Back-propagation Algorithm Derivative or differential coefficient
For a function y = f(x), the differential coefficient at the argument x is the limit, as $\Delta x \to 0$, of the gradient $\Delta y / \Delta x$ of the chord PQ joining the points $P = (x, y)$ and $Q = (x + \Delta x, y + \Delta y)$ on the curve.

Back-propagation Algorithm Derivative or differential coefficient
Typically defined for a function of a single variable: if the left- and right-hand limits exist and are equal, the derivative is the gradient of the curve at x, and is the limit of the gradient of the chord adjoining the points $(x, f(x))$ and $(x + \Delta x, f(x + \Delta x))$. The function of x defined as this limit for each argument x is the first derivative of y = f(x):

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$
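This limit can be approximated numerically; a sketch using a symmetric difference quotient (the function name is my own, not from the slides):

```python
def derivative(f, x, dx=1e-6):
    # gradient of the chord adjoining (x - dx, f(x - dx)) and (x + dx, f(x + dx));
    # as dx -> 0 this tends to the gradient of the curve at x
    return (f(x + dx) - f(x - dx)) / (2 * dx)
```

For f(x) = x², `derivative(f, 3.0)` is close to 6, the exact first derivative at x = 3.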

Back-propagation Algorithm Partial derivative or partial differential coefficient
The derivative of a function of two or more variables with respect to one of these variables, the others being regarded as constant; written as $\partial f / \partial x$.

Back-propagation Algorithm Total Derivative
The derivative of a function of two or more variables with regard to a single parameter in terms of which these variables are expressed. If z = f(x, y) with parametric equations x = U(t) and y = V(t), then under appropriate conditions the total derivative is:

$$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$$

Back-propagation Algorithm Total or Exact Differential
The differential of a function of two or more variables with regard to a single parameter in terms of which these variables are expressed, equal to the sum of the products of each partial derivative of the function with the corresponding increment. If z = f(x, y), x = U(t) and y = V(t), then under appropriate conditions the total differential is:

$$dz = \frac{\partial z}{\partial x}\,dx + \frac{\partial z}{\partial y}\,dy$$

Back-propagation Algorithm Chain rule of calculus
A theorem that may be used in the differentiation of a function of a function: $\frac{dy}{dx} = \frac{dy}{dt}\frac{dt}{dx}$, where y is a differentiable function of t, and t is a differentiable function of x. This enables a function y = f(x) to be differentiated by finding a suitable function u such that f is a composition of y and u, y being a differentiable function of u and u a differentiable function of x.

Back-propagation Algorithm Chain rule of calculus
Similarly for partial differentiation: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x}$, where f is a function of u and u is a function of x.

Back-propagation Algorithm Chain rule of calculus
Now consider the error signal at the output of a neuron j at iteration n, i.e. the presentation of the nth training example: $e_j(n) = d_j(n) - y_j(n)$, where $d_j$ is the desired output and $y_j$ is the actual output. The total error energy over all the neurons in the output layer is $E(n) = \tfrac{1}{2}\sum_{j \in C} e_j^2(n)$, where C is the set of all the neurons in the output layer.

Back-propagation Algorithm Chain rule of calculus
If N is the total number of patterns, the average squared error energy is $E_{av} = \frac{1}{N}\sum_{n=1}^{N} E(n)$. Note that $e_j$ is a function of $y_j$ and $w_{ij}$ (the weights of the connections between neurons in two adjacent layers i and j).

Back-propagation Algorithm
A derivation of the BP algorithm. The total error energy E(n) can be computed by summing the instantaneous energy over all the neurons in the output layer: $E(n) = \tfrac{1}{2}\sum_{j \in C} e_j^2(n)$, where the set C includes all the neurons in the output layer. The instantaneous error energy E(n), and therefore the average energy $E_{av}$, is a function of the free parameters of the network, including the synaptic weights and bias levels.

Back-propagation Algorithm
The originators of the BP algorithm suggest that the weight change be

$$\Delta w_{ji}(n) = -\eta\,\frac{\partial E(n)}{\partial w_{ji}(n)}$$

where $\eta$ is the learning-rate parameter of the BP algorithm. The minus sign indicates the use of gradient descent in weight space: seeking a direction for weight change that reduces the value of E(n).
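Gradient descent on a single weight can be illustrated in a few lines; a sketch minimising a simple quadratic error (the names and the example function are mine, not the slides'):

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly apply delta_w = -eta * dE/dw."""
    w = w0
    for _ in range(steps):
        w -= eta * grad(w)   # minus sign: step against the gradient to reduce E
    return w

# minimise E(w) = (w - 2)^2, whose gradient is dE/dw = 2(w - 2)
w_star = gradient_descent(lambda w: 2 * (w - 2), w0=0.0)
```

After a hundred steps the weight has descended to the minimiser w = 2.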

Back-propagation Algorithm
A derivation of the BP algorithm. A note on partial derivatives. Partial derivative: the derivative of a function of two or more variables with respect to one of them whilst the others are regarded as constant; for a function f of three variables, f(x, y, z), the partial derivative of f with respect to x is $\partial f/\partial x$, taken with y and z held constant. Chain rule: a theorem in calculus which states that a composite function can be differentiated as a function of a function: $\frac{dy}{dx} = \frac{dy}{dt}\frac{dt}{dx}$, where y is a differentiable function of t, and t is a differentiable function of x. This enables a function y = f(x) to be differentiated by finding a suitable function u, such that f is a composition of y and u, y is a differentiable function of u, and u is a differentiable function of x. E(n) is a function of a function of a function of a function of $w_{ji}(n)$.

Back-propagation Algorithm
A derivation of the BP algorithm. The back-propagation algorithm trains a multilayer perceptron by propagating back some measure of responsibility to a hidden (non-output) unit. Back-propagation is a local rule for synaptic adjustment; it takes into account the position of a neuron in a network to indicate how the neuron's weights are to change.

Back-propagation Algorithm
A derivation of the BP algorithm. Layers in a back-propagating multi-layer perceptron: the first layer comprises fixed input units; the second (and possibly several subsequent) layers comprise trainable 'hidden' units carrying an internal representation; the last layer comprises the trainable output units of the multi-layer perceptron.

Back-propagation Algorithm
A derivation of the BP algorithm. Modern back-propagation algorithms are based on a formalism for propagating back the changes in the error energy E with respect to all the weights $w_{ij}$ from a unit (i) to its inputs (j). More precisely, what is being computed during the backwards propagation is the rate of change of the error energy E with respect to the network's weights. This computation is also called the computation of the gradient, $\partial E/\partial w_{ij}$.


Back-propagation Algorithm
A derivation of the BP algorithm. The back-propagation learning rule is formulated to change the connection weights $w_{ij}$ so as to reduce the error energy E by gradient descent: $\Delta w_{ij} = -\eta\,\partial E/\partial w_{ij}$.

Back-propagation Algorithm
A derivation of the BP algorithm. The error energy E is a function of the error e, the output y, the weighted sum of all the inputs v, and the weights $w_{ji}$. According to the chain rule, then:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)}\,\frac{\partial v_j(n)}{\partial w_{ji}(n)}$$

Back-propagation Algorithm
A derivation of the BP algorithm. Further applications of the chain rule allow each of the four factors in this product to be evaluated separately from the definitions of $e_j$, $y_j$ and $v_j$.

Back-propagation Algorithm
A derivation of the BP algorithm. Each of the partial derivatives can be simplified as: $\frac{\partial E(n)}{\partial e_j(n)} = e_j(n)$; $\frac{\partial e_j(n)}{\partial y_j(n)} = -1$; $\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n))$; and $\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$.

Back-propagation Algorithm
A derivation of the BP algorithm. The rate of change of the error energy E with respect to changes in the synaptic weights is therefore:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = -\,e_j(n)\,\varphi_j'(v_j(n))\,y_i(n)$$

Back-propagation Algorithm
A derivation of the BP algorithm. The so-called delta rule suggests that

$$\Delta w_{ji}(n) = \eta\,\delta_j(n)\,y_i(n)$$

where $\delta_j(n)$ is called the local gradient.

Back-propagation Algorithm
A derivation of the BP algorithm. The local gradient $\delta_j$ is given by

$$\delta_j(n) = e_j(n)\,\varphi_j'(v_j(n))$$

Back-propagation Algorithm
A derivation of the BP algorithm. The case of the output neuron j: the weight adjustment requires the computation of the error signal $e_j(n) = d_j(n) - y_j(n)$, which is directly available because the desired response $d_j(n)$ is known for an output neuron.

Back-propagation Algorithm
A derivation of the BP algorithm. The case of the output neuron j: the so-called delta rule suggests that

$$\Delta w_{ji}(n) = \eta\,\delta_j(n)\,y_i(n), \qquad \delta_j(n) = \bigl(d_j(n) - y_j(n)\bigr)\,\varphi_j'(v_j(n))$$


Back-propagation Algorithm
A derivation of the BP algorithm. The case of the hidden neuron (j): the delta rule again suggests $\Delta w_{ji}(n) = \eta\,\delta_j(n)\,y_i(n)$. The weight adjustment requires the computation of the error signal, but the situation is complicated in that we do not have a (set of) desired output(s) for the hidden neurons, and consequently we will have difficulty in computing the error signal $e_j$ for the hidden neuron j.

Back-propagation Algorithm
A derivation of the BP algorithm. The case of the hidden neurons (j): recall that the local gradient of the hidden neuron j is given as

$$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)}\,\varphi_j'(v_j(n))$$

and that $y_j$ is the output and equals $\varphi_j(v_j(n))$. We have used the chain rule of calculus here.

Back-propagation Algorithm
A derivation of the BP algorithm. The case of the hidden neurons (j): the error energy related to the hidden neurons is given as $E(n) = \tfrac{1}{2}\sum_{k \in C} e_k^2(n)$, where k indexes the output neurons. The rate of change of the error energy with respect to the hidden-neuron output $y_j(n)$ is:

$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n)\,\frac{\partial e_k(n)}{\partial y_j(n)}$$

Back-propagation Algorithm
A derivation of the BP algorithm. The case of the hidden neurons (j): since $e_k(n) = d_k(n) - \varphi_k(v_k(n))$ and $v_k(n) = \sum_j w_{kj}(n)\,y_j(n)$, we have $\frac{\partial e_k(n)}{\partial y_j(n)} = -\varphi_k'(v_k(n))\,w_{kj}(n)$, so that

$$\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k \delta_k(n)\,w_{kj}(n)$$

Back-propagation Algorithm
A derivation of the BP algorithm. The case of the hidden neurons (j): the delta rule then gives

$$\delta_j(n) = \varphi_j'(v_j(n))\sum_k \delta_k(n)\,w_{kj}(n)$$

The error signal for a hidden neuron has thus to be computed recursively in terms of the error signals of ALL the neurons to which the hidden neuron j is directly connected. This is what makes the back-propagation algorithm complicated.
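This recursive computation can be sketched for the logistic activation, for which $\varphi'(v_j) = o_j(1 - o_j)$; a minimal helper (the names are illustrative):

```python
def hidden_delta(o_j, deltas_k, w_kj):
    """delta_j = phi'(v_j) * sum_k delta_k * w_kj, assuming a logistic phi
    so that phi'(v_j) = o_j * (1 - o_j)."""
    return o_j * (1 - o_j) * sum(dk * w for dk, w in zip(deltas_k, w_kj))
```

With the numbers from the earlier worked example, `hidden_delta(0.589, [0.066], [0.3])` reproduces the hidden-unit error of about 0.005.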

Back-propagation Algorithm
A derivation of the BP algorithm. The case of the hidden neurons (j): the local gradient $\delta_j$ is given by

$$\delta_j(n) = \varphi_j'(v_j(n))\sum_k \delta_k(n)\,w_{kj}(n)$$


Back-propagation Algorithm
A derivation of the BP algorithm. The case of the hidden neurons (j): the error energy for the hidden neuron can be given as $E(n) = \tfrac{1}{2}\sum_{k \in C} e_k^2(n)$. Note that we have used the index k instead of the index j in order to avoid confusion with the hidden neuron's own index.