# On Simple Adaptive Momentum (Dr Richard Mitchell, presented at CIS 2008)

## Presentation transcript:

Slide 1: On Simple Adaptive Momentum. Presented at CIS 2008. © Dr Richard Mitchell 2008.

Dr Richard Mitchell, Cybernetics Intelligence Research Group, Cybernetics, School of Systems Engineering, University of Reading, UK. R.J.Mitchell@reading.ac.uk

Slide 2: Overview

Simple Adaptive Momentum (SAM) speeds the training of multi-layer perceptrons (MLPs). It adapts the normal momentum term depending on the angle between the current and previous changes in the weights of the MLP. In the original paper, the weight changes of the whole network are used in determining this angle. This paper considers adapting the momentum term using certain subsets of these weights. It is inspired by the author's object-oriented approach to programming MLPs, successfully used in teaching. It is concluded that the angle is best determined using the weight changes in each layer separately.

Slide 3: Nomenclature in Multi Layer Net

$x_r(i)$ is the output of node $i$ in layer $r$; $w_r(i,j)$ is weight $i$ of the link to node $j$ in layer $r$.

[Figure: network diagram from inputs to outputs, with inputs $x_1(1)$, $x_1(2)$; second-layer nodes $x_2(1)$, $x_2(2)$, $x_2(3)$ with weights $w_2(0,j)$, $w_2(1,j)$, $w_2(2,j)$; outputs $x_3(1)$, $x_3(2)$ with weights $w_3(0,j)$ to $w_3(3,j)$.]

Change weights:

$$\Delta_t w_r(i,j) = \eta\,\delta_r(j)\,x_{r-1}(i) + \alpha\,\Delta_{t-1} w_r(i,j)$$

$\delta$ is a function of the error; it varies with $f(z)$; the error also varies.
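
The weight-change rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `momentum_update`, the list-of-lists weight layout and the default values of `eta` and `alpha` are not from the slides, and any bias input is assumed to be already included in `inputs`.

```python
def momentum_update(weights, prev_change, delta, inputs, eta=0.1, alpha=0.9):
    """Apply one momentum update to a single layer, in place.

    weights[i][j]     : weight i of the link to node j, w_r(i,j)
    prev_change[i][j] : previous weight change, Delta_{t-1} w_r(i,j)
    delta[j]          : delta term of node j in this layer
    inputs[i]         : output of node i in the previous layer, x_{r-1}(i)
    Returns the new weight changes so they can be reused next time.
    """
    new_change = [
        [eta * delta[j] * inputs[i] + alpha * prev_change[i][j]
         for j in range(len(delta))]
        for i in range(len(inputs))
    ]
    for i in range(len(inputs)):
        for j in range(len(delta)):
            weights[i][j] += new_change[i][j]
    return new_change
```

The returned changes are passed back in as `prev_change` on the next call, which is what carries the momentum from one update to the next.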

Slide 4: Simple Adaptive Momentum

Swanston, Bishop, & Mitchell, R.J. (1994), "Simple adaptive momentum: new algorithm for training multilayer perceptrons", Elect. Lett., Vol 30, No 18, pp 1498-1500.

Concept: adapt the momentum term depending on whether the weight change this time is in the same direction as last time.

Direction? The weight changes are held in an array, so they form a vector. There are two vectors, for the current and previous changes: $\Delta w_c$ and $\Delta w_p$. The angle $\theta$ between the two vectors can then be seen.

[Figure: e.g. in 2D, axes $w_1$ and $w_2$; $\Delta w_p$ drawn with components $\Delta w_{p1}$ and $\Delta w_{p2}$; angle $\theta$ between $\Delta w_p$ and $\Delta w_c$.]
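
The slide does not spell out how the angle is obtained. The standard dot-product identity is one way to do it, shown here as a supporting step rather than as a quotation from the paper:

$$\cos\theta = \frac{\Delta w_c \cdot \Delta w_p}{\lVert \Delta w_c \rVert \, \lVert \Delta w_p \rVert}$$

For example, in 2D with $\Delta w_p = (1, 0)$ and $\Delta w_c = (1, 1)$, this gives $\cos\theta = 1/\sqrt{2} \approx 0.707$, so $\theta = 45^\circ$. Only $\cos\theta$ is needed, so the angle itself never has to be computed.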

Slide 5: Implementing SAM

The simple idea is to replace the momentum constant $\alpha$ by $\alpha(1 + \cos\theta)$, where $\theta$ is the angle between the vectors of current and previous deltaWeights, $\Delta w_c$ and $\Delta w_p$. In the original paper the $\Delta w$s apply to all weights in the network. In this paper, we consider adapting $\alpha$ at the network level, layer level and neuron level. This is inspired by object-oriented programming of the MLP, which gives students a good example of, and practice in, the properties of OOP, albeit on an old ANN.
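
A minimal sketch of the adapted momentum coefficient follows, assuming the two weight-change vectors are given as flat Python lists. The function name `sam_momentum` is illustrative. The handling of a zero-length vector, where the angle is undefined, is an assumption and is not taken from the slides or the paper.

```python
import math

def sam_momentum(alpha, dw_current, dw_previous):
    """Return alpha * (1 + cos(theta)) for two weight-change vectors."""
    dot = sum(c * p for c, p in zip(dw_current, dw_previous))
    norm_c = math.sqrt(sum(c * c for c in dw_current))
    norm_p = math.sqrt(sum(p * p for p in dw_previous))
    if norm_c == 0.0 or norm_p == 0.0:
        # Assumption: with no direction to compare, treat cos(theta) as 0
        # and fall back to the unmodified momentum constant.
        return alpha
    cos_theta = dot / (norm_c * norm_p)
    return alpha * (1.0 + cos_theta)
```

With this form the momentum is doubled when the two changes point the same way, left at `alpha` when they are at right angles, and removed when they point in opposite directions.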

Slide 6: OO Approach, Network Layers

An MLP can be programmed with an object for each neuron. But as each neuron needs inputs from the previous layer and deltas from the next, many pointers are needed, which is problematic for students. So it is easier to have an object for a layer of neurons (all with the same inputs), which gets its inputs and weighted deltas in an array.

The base object is a layer of linearly activated neurons, LinActLayer: a single-layer network of neurons with $f(z) = z$. For neurons with sigmoidal activation, only two functions need to differ: those for calculating the output and the delta. So there is SigActLayer, an object inheriting from LinActLayer, which uses the existing members and adds the 2 different ones.

Slide 7: Network For Hidden Layers

Hidden layers need an enhanced SigActLayer with its own calculate-error function (using the weighted deltas in the next layer). The existing objects are each a whole net. So SigActHidLayer is a multiple-layer network: it inherits from SigActLayer but also has a pointer to the next layer. Most functions have 2 lines: process own layer, then the next.

[Figure: class hierarchy with base class LinActLayer, then SigActLayer, then SigActHidLayer.]
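
The layer hierarchy described on these slides might look roughly like the following Python sketch. Only the class names LinActLayer, SigActLayer and SigActHidLayer come from the slides. The method names `activation` and `compute`, and the weight layout (bias first), are assumptions made for illustration. Only the forward pass is shown; the delta and error functions are omitted.

```python
import math

class LinActLayer:
    """Single layer of linearly activated neurons, f(z) = z."""

    def __init__(self, weights):
        # Assumed layout: weights[j] = [bias, w_1, ..., w_n] for neuron j.
        self.weights = weights
        self.outputs = []

    def activation(self, z):
        return z

    def compute(self, inputs):
        self.outputs = [
            self.activation(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
            for w in self.weights
        ]
        return self.outputs


class SigActLayer(LinActLayer):
    """Same layer, but with sigmoidal activation replacing f(z) = z."""

    def activation(self, z):
        return 1.0 / (1.0 + math.exp(-z))


class SigActHidLayer(SigActLayer):
    """Hidden layer that also holds a reference to the next layer."""

    def __init__(self, weights, next_layer):
        super().__init__(weights)
        self.next_layer = next_layer

    def compute(self, inputs):
        hidden = super().compute(inputs)         # process own layer
        return self.next_layer.compute(hidden)   # then the next layer
```

The two-line `compute` in `SigActHidLayer` mirrors the "process own layer and next" pattern, so a call on the first layer runs the whole network.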

Slide 8: SAM and Hierarchy

Given this approach, the momentum can be adjusted using the weight changes:

- a) over the whole network
- b) separately by layer
- c) separately for each neuron

For a), calculate η * delta * inputs for all layers, then globally set $\alpha(1 + \cos\theta)$. For b), calculate η * delta * inputs for each layer and set $\alpha(1 + \cos\theta)$ for each layer separately. For c), do the same, but for each neuron in each layer. This works easily in the hierarchy.
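
The three options differ only in which weights are gathered into the two vectors before the angle is taken. The sketch below assumes the current terms (η * delta * inputs) and the previous weight changes are stored as nested lists indexed `[layer][neuron][weight]`. The names `cos_angle`, `flatten` and `sam_alphas` are hypothetical, and so is the choice to treat the cosine as 0 for a zero-length vector.

```python
import math

def cos_angle(u, v):
    """Cosine of the angle between two flat vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0

def flatten(nested):
    """Flatten arbitrarily nested lists of numbers into one flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

def sam_alphas(alpha, current, previous, mode):
    """Return the adapted momentum for every neuron, as result[layer][neuron]."""
    if mode == "network":    # a) one angle over all weights in the network
        a = alpha * (1.0 + cos_angle(flatten(current), flatten(previous)))
        return [[a for _ in layer] for layer in current]
    if mode == "layer":      # b) one angle per layer
        return [
            [alpha * (1.0 + cos_angle(flatten(lc), flatten(lp)))] * len(lc)
            for lc, lp in zip(current, previous)
        ]
    if mode == "neuron":     # c) one angle per neuron
        return [
            [alpha * (1.0 + cos_angle(nc, np_)) for nc, np_ in zip(lc, lp)]
            for lc, lp in zip(current, previous)
        ]
    raise ValueError("mode must be 'network', 'layer' or 'neuron'")
```

In a layer-object design, the per-layer case needs no knowledge of other layers, which is one reason it sits naturally in the hierarchy.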

Slide 9: Experimentation

There are 3 problems, each with training, validation and unseen data. Training stops when the error on the validation set rises. Each problem is run 6 times with different initial weights.

Problem 1: 2 inputs, 10 nodes in the hidden layer, 1 output.

| SAM mode | None | Neuron | Layer | Network |
|---|---|---|---|---|
| Mean epochs taken | 867 | 227 | 202 | 257 |

| SAM mode | Train SSE | Valid SSE | Unseen SSE |
|---|---|---|---|
| None | 0.0081985 | 0.0065965 | 0.0092535 |
| Neuron | 0.0100445 | 0.0084395 | 0.0107985 |
| Layer | 0.0103265 | 0.0086805 | 0.0106505 |
| Network | 0.0077125 | 0.0071095 | 0.0084845 |
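
The stopping rule used in these experiments (stop when the validation error rises) can be sketched as below. The slides do not give the exact criterion, so stopping on the first epoch-to-epoch rise is an assumption. `train_one_epoch` and `validation_sse` are hypothetical callables standing in for whatever network code is used.

```python
def train_with_early_stopping(train_one_epoch, validation_sse, max_epochs=10000):
    """Train until the validation SSE rises; return the number of epochs run."""
    previous = validation_sse()        # validation error before any training
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        current = validation_sse()
        if current > previous:         # validation error has risen: stop
            return epoch
        previous = current
    return max_epochs
```

The returned epoch count corresponds to the kind of figure averaged over runs in the "Mean epochs" rows.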

Slide 10: Problem 2

5 inputs, 15 nodes in the hidden layer and 1 output.

| SAM mode | None | Neuron | Layer | Network |
|---|---|---|---|---|
| Mean epochs | 1712 | 315 | 262 | 312 |

| SAM mode | Train SSE | Valid SSE | Unseen SSE |
|---|---|---|---|
| None | 0.0004725 | 0.0005625 | 0.0006665 |
| Neuron | 0.0006585 | 0.0007635 | 0.0009525 |
| Layer | 0.0007685 | 0.0008745 | 0.0011055 |
| Network | 0.0006215 | 0.0007655 | 0.0009505 |

Training was much quicker, but the SSE was worse. There is very little difference between one layer and the whole network, so...

Slide 11: Problem 3

5 inputs, 15 nodes in the hidden layer and 3 outputs.

| SAM mode | None | Neuron | Layer | Network |
|---|---|---|---|---|
| Mean epochs | 1133 | 497 | 638 | 977 |

| SAM mode | Train SSE | Valid SSE | Unseen SSE |
|---|---|---|---|
| None | 0.0044735 | 0.0043835 | 0.0054605 |
| Neuron | 0.0048205 | 0.0045685 | 0.0057955 |
| Layer | 0.0045675 | 0.0044105 | 0.0053225 |
| Network | 0.0045465 | 0.0044055 | 0.0053445 |

SSEs are averaged over the 3 outputs; here Layer is best.

Slide 12: Conclusions and Further Work

- The object-oriented hierarchy works neatly here.
- SAM clearly reduces the number of epochs taken to learn, with little extra overhead per epoch.
- In one example it increased the sum squared errors. This needs investigating.
- It needs to be tested on other problems, but it looks as if SAM at the layer level may be best (particularly with multiple outputs).
- Momentum is used in other learning problems; SAM could be investigated for these.
