ERROR BACK-PROPAGATION LEARNING ALGORITHM Zohreh B. Irannia
Single-Layer Perceptron
x_i: components of the input vector; t = c(x): the target value; o: the perceptron output; η: the learning rate (a small constant), here assume η = 1.
Update rule: w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i
Error-Back-Propagation, Baharvand, Ahmadi, Rahaie
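A minimal runnable sketch of this update rule (the AND task, the function names, and the epoch count below are illustrative assumptions, not from the slides):

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, epochs=10):
    """Single-layer perceptron with the update w_i <- w_i + eta*(t - o)*x_i."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) + b >= 0 else 0   # threshold output
            w += eta * (target - o) * x             # delta update per input weight
            b += eta * (target - o)                 # bias as a weight on a constant input 1
    return w, b

# AND is linearly separable, so the rule converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, t)
preds = [1 if np.dot(w, x) + b >= 0 else 0 for x in X]
```

Note that the rule only changes the weights on misclassified samples, since (t − o) is zero otherwise.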
Single-Layer Perceptron
Sigmoid function as activation function: f(x) = 1 / (1 + e^(−x))
Delta Rule ⇐ Gradient Descent?
The delta rule says: Δw_i = η (t − o) x_i. But WHY?
Steepest Descent Method
Take a step against the gradient: (w1, w2) → (w1 + Δw1, w2 + Δw2)
Delta Rule
Define the error: E(w) = ½ Σ_j (y_j − f(V_j))²
Steepest descent: Δw_ji = −η ∂E/∂w_ji
Delta Rule
Define the error signal δ_j = −∂E/∂V_j = (y_j − f(V_j)) f′(V_j)
Finally: Δw_ji = η δ_j x_i
Delta Rule
Notation: V_j = Σ_i w_ji x_i is the net input to node j, f(V_j) is its output, and y_j is the desired target.
Delta Rule
So we have: Δw_ji = η (y_j − f(V_j)) f′(V_j) x_i
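The final formula can be checked with a tiny sketch for a single sigmoid unit (the weights, input, target, learning rate, and step count below are made-up values):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def delta_rule_step(w, x, y, eta=0.5):
    """One gradient-descent step: Δw_i = η (y - f(V)) f'(V) x_i, with V = w·x."""
    V = np.dot(w, x)
    out = sigmoid(V)
    delta = (y - out) * out * (1.0 - out)   # f'(V) = f(V)(1 - f(V)) for the logistic
    return w + eta * delta * x

w = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])
y = 1.0
err_before = (y - sigmoid(np.dot(w, x))) ** 2
for _ in range(100):
    w = delta_rule_step(w, x, y)
err_after = (y - sigmoid(np.dot(w, x))) ** 2
```

Since each step follows the negative gradient of the squared error, the error shrinks over the iterations.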
Perceptron Learning Problem
Only suitable if the inputs are linearly separable. Consider the XOR problem:
Non-linearly separable problems
Solution: Multi-Layer Networks
New problem: how to train the weights of the different layers in such networks?
Idea of Error Back-Propagation
Weights in the output layer are updated with the delta rule. The delta rule is not directly applicable to hidden layers, because we do not know the desired values for the hidden nodes. Solution: propagate the errors at the output nodes back to the hidden nodes.
Intuition by Illustration
3 layers / 2 inputs / 1 output:
Intuition by Illustration
Each neuron is composed of 2 units: one computes the weighted sum of the inputs, the other applies the non-linear activation function.
Intuition by Illustration
Training starts with signal propagation through the input layer. The same happens for y2 and y3.
Intuition by Illustration
Propagation of signals through the hidden layer. The same happens for y5.
Intuition by Illustration
Propagation of signals through the output layer.
Intuition by Illustration
Error signal of the output layer neuron:
Intuition by Illustration
The error signal is propagated back to all neurons.
Intuition by Illustration
If the propagated errors come from several neurons, they are summed. The same happens for neuron 2 and neuron 3.
Intuition by Illustration
Weight updating starts. The same happens for all neurons.
Intuition by Illustration
Weight updating terminates at the last neuron.
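The walkthrough above can be condensed into a NumPy sketch of the same 2-3-2-1 topology (two inputs, hidden nodes y1..y3 and y4..y5, output y6). The XOR task, seed, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(0)

def f(v):                                    # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

# The first column of each weight matrix is the bias weight.
W1 = rng.uniform(-0.5, 0.5, (3, 3))          # inputs -> y1..y3
W2 = rng.uniform(-0.5, 0.5, (2, 4))          # y1..y3 -> y4, y5
W3 = rng.uniform(-0.5, 0.5, (1, 3))          # y4, y5 -> y6

def forward(x):
    a0 = np.concatenate(([1.0], x))
    y123 = f(W1 @ a0)                        # first hidden layer
    a1 = np.concatenate(([1.0], y123))
    y45 = f(W2 @ a1)                         # second hidden layer
    a2 = np.concatenate(([1.0], y45))
    return a0, y123, a1, y45, a2, f(W3 @ a2)[0]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
Z = np.array([0.0, 1.0, 1.0, 0.0])           # XOR targets (illustrative task)
mse_before = np.mean([(z - forward(x)[-1]) ** 2 for x, z in zip(X, Z)])

eta = 0.5
for _ in range(5000):
    for x, z in zip(X, Z):
        a0, y123, a1, y45, a2, y6 = forward(x)
        d6 = (z - y6) * y6 * (1 - y6)                  # output error signal
        d45 = y45 * (1 - y45) * (W3[0, 1:] * d6)       # errors propagated back
        d123 = y123 * (1 - y123) * (W2[:, 1:].T @ d45)
        W3 += eta * d6 * a2                            # delta-rule updates
        W2 += eta * np.outer(d45, a1)
        W1 += eta * np.outer(d123, a0)

mse_after = np.mean([(z - forward(x)[-1]) ** 2 for x, z in zip(X, Z)])
```

Each backward step mirrors the illustration: the output delta is computed first, then each hidden node sums the deltas it receives through its outgoing weights.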
Some Questions
How often to update? After each training case? After a full sweep through the training data? How many epochs?
How much to update? Use a fixed or a variable learning rate?
Is steepest descent the right method? Does it necessarily converge to the global minimum? How long does it take to converge to some minimum?
Etc.
Batch Mode Training
Batch mode of weight updates: weights are updated once per epoch (gradients accumulated over all P samples).
Smooths out training-sample outliers.
Learning is independent of the order of sample presentation.
Usually slower than sequential mode.
Sometimes more likely to get stuck in local minima.
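A sketch contrasting the two update schedules on a toy linear unit (the data and learning rate are made up; the point is only where the update happens):

```python
import numpy as np

def grad(w, x, t):
    """Gradient of the squared error 1/2 (t - w·x)^2 for a linear unit."""
    return -(t - np.dot(w, x)) * x

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T = np.array([1.0, 2.0, 3.0])
eta = 0.1

# Sequential (pattern) mode: update after every sample.
w_seq = np.zeros(2)
for _ in range(200):
    for x, t in zip(X, T):
        w_seq -= eta * grad(w_seq, x, t)

# Batch mode: accumulate over all P samples, update once per epoch.
w_bat = np.zeros(2)
for _ in range(200):
    g = np.zeros(2)
    for x, t in zip(X, T):
        g += grad(w_bat, x, t)
    w_bat -= eta * g
```

On this consistent toy problem both modes converge to the same solution w = (1, 2); the difference shows up in noise sensitivity and sample-order dependence, as the slide notes.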
Major Problems of EBP
Problems with a constant learning rate:
Too small → slow convergence.
Too large → overshooting the minimum.
Steepest Descent's Problems
Convergence to local minima (a local minimum may be found instead of the global minimum).
Steepest Descent's Problems
Slow convergence (zigzag path). One solution: replace steepest descent with conjugate gradient.
Modifications to EBP Learning
Speed It Up: Momentum
Momentum adds a fraction of the last weight movement to the current movement.
GD with momentum: Δw(t) = −η ∇E(t) + α Δw(t−1)
Speed It Up: Momentum
The direction of the weight change is a combination of the current and the previous gradient.
Advantage: reduces the influence of outliers (smoother search), but it does not adjust the learning rate directly (an indirect method).
Disadvantages: may result in overshooting; does not always reduce the number of iterations.
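A sketch of the momentum update (the quadratic test surface and the constants are illustrative, chosen so that plain gradient descent would zigzag along the steep axis):

```python
import numpy as np

def momentum_step(w, velocity, gradient, eta=0.1, alpha=0.9):
    """Δw(t) = -η ∇E(t) + α Δw(t-1): current gradient plus a fraction of the last move."""
    velocity = alpha * velocity - eta * gradient
    return w + velocity, velocity

# Minimise the badly scaled quadratic E(w) = 0.5*(10*w0^2 + w1^2).
w = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(100):
    g = np.array([10.0 * w[0], w[1]])        # gradient of E at w
    w, v = momentum_step(w, v, g, eta=0.05, alpha=0.8)
```

The accumulated velocity damps the oscillation across the narrow valley while building speed along the shallow direction.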
But Some Problems Remain!
Remaining problem: the same learning rate is used for all weights!
Delta-Bar-Delta
Allows each weight to have its own learning rate, and lets learning rates vary over time. Two heuristics determine the appropriate changes:
If the weight change is in the same direction for several time steps, the learning rate for that weight should be increased.
If the direction of the weight change alternates, the learning rate should be decreased.
Note: these heuristics will not always improve performance.
Delta-Bar-Delta
Learning rates increase linearly (by a constant κ) and decrease exponentially (by a factor 1 − β).
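A sketch of these two heuristics, assuming the Jacobs-style update (grow by a constant κ when the gradient agrees with a running average of past gradients, shrink by a factor 1 − β when it disagrees). The gradient sequence below is synthetic, chosen so one weight sees a constant direction and the other an alternating one:

```python
import numpy as np

def delta_bar_delta(eta, dbar, grad, kappa=0.2, beta=0.2, xi=0.1):
    """Update per-weight learning rates from the current gradient and the
    running average dbar of past gradients."""
    eta = np.where(dbar * grad > 0, eta + kappa, eta)        # same direction: grow linearly
    eta = np.where(dbar * grad < 0, eta * (1 - beta), eta)   # alternating: shrink exponentially
    dbar = (1 - xi) * grad + xi * dbar                       # running gradient average
    return eta, dbar

eta = np.array([0.1, 0.1])
dbar = np.zeros(2)
for t in range(10):
    grad = np.array([1.0, (-1.0) ** t])      # weight 0: constant sign; weight 1: alternating
    eta, dbar = delta_bar_delta(eta, dbar, grad)
```

After ten steps the first learning rate has grown linearly while the second has shrunk exponentially, exactly the behaviour the slide describes.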
Training Samples
Quality and quantity of the training samples determine the quality of the learning results. The samples must represent the problem space well: random sampling, or proportional sampling (with prior knowledge of the problem).
Number of training patterns needed: there is no theoretically ideal number. Baum and Haussler (1989) suggest P = W / e, where W is the total number of weights and e the acceptable classification error rate. If the net can be trained to correctly classify (1 − e/2)·P of the P training samples, then the classification accuracy of the net is 1 − e for input patterns drawn from the same sample space.
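The Baum-Haussler estimate is simple arithmetic; the 9-3-1 network below is a hypothetical example, not a recommendation from the slides:

```python
def required_samples(num_weights, error_rate):
    """Baum & Haussler rule of thumb: P = W / e training patterns."""
    return num_weights / error_rate

# Hypothetical 9-3-1 net: 9*3 + 3*1 = 30 weights plus 3 + 1 = 4 biases.
W = 9 * 3 + 3 * 1 + 3 + 1
P = required_samples(W, 0.1)     # target classification error rate e = 10%
```

So a 34-parameter network aiming for 10% error would need on the order of 340 training patterns by this rule of thumb.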
Activation Functions
Sigmoid activation function: saturation regions. When some incoming weights become very large, the input to a node may fall into a saturation region during learning. Possible remedies: use non-saturating activation functions, or periodically normalize all weights.
Activation Functions
Another sigmoid function with a slower saturation rate: change the range of the logistic function from (0, 1) to (a, b).
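The range change amounts to an affine rescaling of the logistic function; a minimal sketch (the default range (−1, 1) is an illustrative choice):

```python
import numpy as np

def scaled_sigmoid(x, a=-1.0, b=1.0):
    """Logistic function rescaled from the range (0, 1) to (a, b)."""
    return a + (b - a) / (1.0 + np.exp(-x))

mid = scaled_sigmoid(0.0)        # midpoint of (a, b); 0.0 for a=-1, b=1
```

With a = −1 and b = 1 this equals tanh(x/2), a zero-centred sigmoid.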
Activation Functions
Changing the slope of the logistic function:
Larger slope → quicker movement into the saturation regions → faster convergence.
Smaller slope → slower movement into the saturation regions, allowing refined weight adjustment → slower convergence.
Solution: adaptive slope (each node learns its own slope).
Practical Considerations
Many parameters must be carefully selected to ensure good performance. Although the deficiencies of BP nets cannot be completely cured, some of them can be eased by practical means. Two important issues: hidden layers and hidden nodes, and the effect of the initial weights.
Hidden Layers & Hidden Nodes
Theoretically, one hidden layer (possibly with many hidden nodes) is sufficient to approximate any function. There are no theoretical results on the minimum necessary number of hidden nodes. Practical rule of thumb (n = # of input nodes, m = # of hidden nodes): for binary/bipolar data, m = 2n; for real-valued data, m >> 2n. Multiple hidden layers with fewer nodes may train faster for similar quality in some applications.
Effect of Initial Weights (and Biases)
Fully random, e.g. in [−0.05, 0.05], [−0.1, 0.1], or [−1, 1]. Problems: small values → slow learning; large values → saturation (f′(x) → 0) → slow learning.
Normalize the hidden-layer weights (Widrow): choose random initial weights for all hidden nodes in [−0.5, 0.5], then for each hidden node j normalize its weight vector (m: # of input neurons, n: # of hidden nodes), and choose a random value for the bias.
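A sketch of this initialisation, assuming the "Widrow" reference means the Nguyen-Widrow scheme (the scale factor β = 0.7 · n^(1/m) comes from that scheme and is not stated on the slide; the 9-input, 3-hidden shape is a hypothetical example):

```python
import numpy as np
rng = np.random.default_rng(1)

def init_hidden_weights(m, n):
    """Nguyen-Widrow-style initialisation: draw in [-0.5, 0.5], then rescale
    each hidden node's weight vector to norm beta = 0.7 * n**(1/m).
    m: # of input neurons, n: # of hidden nodes."""
    beta = 0.7 * n ** (1.0 / m)
    W = rng.uniform(-0.5, 0.5, (n, m))
    W = beta * W / np.linalg.norm(W, axis=1, keepdims=True)  # per-node normalisation
    biases = rng.uniform(-beta, beta, n)                     # random bias in [-beta, beta]
    return W, biases

W, b = init_hidden_weights(m=9, n=3)
```

The rescaling keeps each hidden node's net input in the active (non-saturated) region of the sigmoid at the start of training.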
Effect of Initial Weights (and Biases)
Initialization of the output weights should not result in small weights. If they are small, the contribution of the hidden-layer neurons to the output error is small, and the effect of the hidden-layer weights is not visible enough; the deltas of the hidden layer then become very small, producing only small changes in the hidden-layer weights.
NOW: A Comparison of Different BP Variants
Comparison of Different BP Variants
Versions compared:
BP pattern mode learning algorithm
BP batch mode learning algorithm
BP delta-bar-delta learning algorithm
Problem: breast cancer classification. 9 attributes; 699 examples (458 benign, 241 malignant); 16 instances with missing attribute values were rejected; attributes normalized with respect to their highest value.
BP pattern mode results for different η
BP pattern mode results for different η
BP pattern mode results for different α
BP pattern mode results for different α
BP pattern mode results for different network structure
BP pattern mode results for different range values
BP pattern mode results for 9-2-1 net & range [-0.1, 0.1]
BP pattern mode results for 9-2-1 net & range [-1, 1]
BP batch mode results for different η and α
BP batch mode results for different network structure
BP batch mode results for different range values
BP batch mode results for 9-3-1 net, η = α = 0.1, range [-1, 1]
BP Delta-Bar-Delta results
For a 9-13-1 network: α = ξ = 0.1, κ = β = 0.2, training epochs = 100.
Range of random values for the synaptic weights and thresholds: [-0.1, 0.1].
Range for the learning-rate parameters η_ji of the synaptic weights and thresholds: [0, 0.2].
BP Delta-Bar-Delta results
Conclusions on the Error-Back-Propagation Learning Algorithm
Summary of BP Nets
Architecture: multi-layer feed-forward network (full connection between nodes in adjacent layers, no connections within a layer); one or more hidden layers with a non-linear activation function (most commonly sigmoid functions).
Summary of BP Nets
Back-propagation learning algorithm: supervised learning. Approach: gradient descent to reduce the total error (which is why it is also called the generalized delta rule). Error terms at the output nodes are propagated back to form error terms at the hidden nodes (which is why it is called error back-propagation). Ways to speed up the learning process: adding momentum terms, adaptive learning rates (delta-bar-delta), Quickprop.
Conclusions
Conclusions
Strengths of EBP learning: wide practical applicability; easy to implement; good generalization power; great representation power; etc.
Conclusions
Problems of EBP learning: often takes a long time to converge; the gradient-descent approach only guarantees a local minimum of the error; selection of the learning parameters can only be done by trial and error; network paralysis may occur (learning stops when nodes saturate); BP learning is non-incremental (to include new training samples, the network must be retrained with all old and new samples).
References
Dilip Sarkar, "Methods to Speed Up Error Back-Propagation Learning Algorithm", ACM Computing Surveys, Vol. 27, No. 4, December 1995.
Sergios Theodoridis, Konstantinos Koutroumbas, "Pattern Recognition", 2nd Edition.
Laurene Fausett, "Fundamentals of Neural Networks".
M. Jiang, G. Gielen, B. Zhang, Z. Luo, "Fast Learning Algorithms for Feedforward Neural Networks", Applied Intelligence 18, 37-54, 2003.
Konstantinos Adamopoulos, "Application of Back Propagation Learning Algorithms on Multi-Layer Perceptrons", Final Year Project, Department of Computing, University of Bradford.
And many more related articles.
Any Questions?
Thanks for your attention.