Variations on Backpropagation.


Variations
Heuristic modifications:
- Momentum
- Variable learning rate
Standard numerical optimization:
- Conjugate gradient
- Newton's method (Levenberg-Marquardt)

Performance Surface Example: network architecture, nominal function, and nominal parameter values.

Squared Error vs. w^1_{1,1} and w^2_{1,1}

Squared Error vs. w^1_{1,1} and b^1_1

Squared Error vs. b^1_1 and b^1_2

Convergence Example: trajectory plotted against w^1_{1,1} and w^2_{1,1}.

Learning Rate Too Large: trajectory plotted against w^1_{1,1} and w^2_{1,1}.

Momentum Filter Example: momentum acts as a first-order low-pass filter, y(k) = γ y(k-1) + (1 - γ) w(k); larger momentum coefficients γ produce smoother output.

Momentum Backpropagation
Steepest Descent Backpropagation (SDBP) vs. Momentum Backpropagation (MOBP): trajectories plotted against w^1_{1,1} and w^2_{1,1}. MOBP filters the steepest-descent updates:
ΔW^m(k) = γ ΔW^m(k-1) - (1 - γ) α s^m (a^{m-1})^T
Δb^m(k) = γ Δb^m(k-1) - (1 - γ) α s^m
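A minimal sketch of one MOBP update (my own illustration; the function name grad and the default alpha and gamma are assumed, not from the slide):

def mobp_step(x, delta_prev, grad, alpha=0.05, gamma=0.9):
    """One momentum-backpropagation update:
    delta_x = gamma * delta_prev - (1 - gamma) * alpha * grad(x)."""
    delta_x = gamma * delta_prev - (1.0 - gamma) * alpha * grad(x)
    return x + delta_x, delta_x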

Variable Learning Rate (VLBP)
1. If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is set to zero.
2. If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has previously been set to zero, it is reset to its original value.
3. If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
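A hedged sketch of this accept/reject rule (my own code; the default values zeta = 0.04, rho = 0.7, eta = 1.05 are typical choices, not taken from the slide):

def vlbp_update(F_old, F_new, alpha, gamma, gamma0,
                zeta=0.04, rho=0.7, eta=1.05):
    """Decide whether to keep a tentative weight update.
    Returns (accept, alpha, gamma)."""
    if F_new > F_old * (1.0 + zeta):
        # error grew by more than zeta: discard the step,
        # shrink the learning rate, switch momentum off
        return False, alpha * rho, 0.0
    elif F_new < F_old:
        # error decreased: keep the step, grow the learning rate,
        # restore momentum if it had been switched off
        return True, alpha * eta, (gamma0 if gamma == 0.0 else gamma)
    else:
        # error grew, but by less than zeta: keep the step, change nothing
        return True, alpha, gamma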

VLBP Example: trajectory plotted against w^1_{1,1} and w^2_{1,1}.

Conjugate Gradient
1. The first search direction is the steepest descent direction: p_0 = -g_0.
2. Take a step, choosing the learning rate α_k to minimize the function along the search direction: x_{k+1} = x_k + α_k p_k.
3. Select the next search direction according to p_k = -g_k + β_k p_{k-1}, with Δg_{k-1} = g_k - g_{k-1} and
β_k = (Δg_{k-1}^T g_k) / (Δg_{k-1}^T p_{k-1})  (Hestenes-Stiefel), or
β_k = (g_k^T g_k) / (g_{k-1}^T g_{k-1})  (Fletcher-Reeves), or
β_k = (Δg_{k-1}^T g_k) / (g_{k-1}^T g_{k-1})  (Polak-Ribiere).
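A short sketch of step 3 with the three choices of β_k (my own code; the gradients and previous direction are assumed to be 1-D NumPy vectors):

import numpy as np

def next_direction(g_new, g_old, p_old, variant="fletcher-reeves"):
    """p_k = -g_k + beta_k * p_{k-1}; inputs are 1-D numpy arrays."""
    dg = g_new - g_old
    if variant == "hestenes-stiefel":
        beta = (dg @ g_new) / (dg @ p_old)
    elif variant == "fletcher-reeves":
        beta = (g_new @ g_new) / (g_old @ g_old)
    else:                                   # "polak-ribiere"
        beta = (dg @ g_new) / (g_old @ g_old)
    return -g_new + beta * p_old

# e.g. next_direction(np.array([0.2, -0.1]), np.array([1.0, 0.5]),
#                     np.array([-1.0, -0.5]))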

Interval Location: starting from an initial step size, the step is repeatedly doubled and the function evaluated until an increase is observed, which brackets a minimum along the search direction.

Interval Reduction: the bracketing interval is then repeatedly narrowed by evaluating the function at interior points, as in the golden section search below.

Golden Section Search (t = 0.618)
Set c_1 = a_1 + (1-t)(b_1 - a_1), F_c = F(c_1)
    d_1 = b_1 - (1-t)(b_1 - a_1), F_d = F(d_1)
For k = 1, 2, ... repeat
    If F_c < F_d then
        Set a_{k+1} = a_k; b_{k+1} = d_k; d_{k+1} = c_k
            c_{k+1} = a_{k+1} + (1-t)(b_{k+1} - a_{k+1})
            F_d = F_c; F_c = F(c_{k+1})
    else
        Set a_{k+1} = c_k; b_{k+1} = b_k; c_{k+1} = d_k
            d_{k+1} = b_{k+1} - (1-t)(b_{k+1} - a_{k+1})
            F_c = F_d; F_d = F(d_{k+1})
    end
until b_{k+1} - a_{k+1} < tol
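A runnable Python translation of the pseudocode above (my own rendering, not code from the slides):

def golden_section(F, a, b, tol=1e-4):
    """Minimize F on the bracketing interval [a, b]."""
    t = 0.618                      # golden ratio factor from the slide
    c = a + (1 - t) * (b - a); Fc = F(c)
    d = b - (1 - t) * (b - a); Fd = F(d)
    while (b - a) > tol:
        if Fc < Fd:
            # minimum lies in [a, d]: reuse c as the new d
            b, d, Fd = d, c, Fc
            c = a + (1 - t) * (b - a); Fc = F(c)
        else:
            # minimum lies in [c, b]: reuse d as the new c
            a, c, Fc = c, d, Fd
            d = b - (1 - t) * (b - a); Fd = F(d)
    return 0.5 * (a + b)

# e.g. golden_section(lambda x: (x - 2.0)**2, 0.0, 5.0) returns approximately 2.0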

Conjugate Gradient BP (CGBP): intermediate steps and complete trajectory, plotted against w^1_{1,1} and w^2_{1,1}.

Newton's Method
Newton's method updates the parameters as x_{k+1} = x_k - A_k^{-1} g_k, where A_k = ∇²F(x_k) is the Hessian and g_k = ∇F(x_k) is the gradient. If the performance index is a sum of squares function,
F(x) = Σ_{i=1}^{N} v_i²(x) = v^T(x) v(x),
then the j-th element of the gradient is
[∇F(x)]_j = ∂F(x)/∂x_j = 2 Σ_{i=1}^{N} v_i(x) ∂v_i(x)/∂x_j.

Matrix Form
The gradient can be written in matrix form:
∇F(x) = 2 J^T(x) v(x),
where J(x) is the Jacobian matrix:
J(x) = [ ∂v_1/∂x_1   ∂v_1/∂x_2   ...   ∂v_1/∂x_n
         ∂v_2/∂x_1   ∂v_2/∂x_2   ...   ∂v_2/∂x_n
         ...
         ∂v_N/∂x_1   ∂v_N/∂x_2   ...   ∂v_N/∂x_n ]
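A small numerical check of the matrix form (my own example; the residual function v and the test point are arbitrary choices): the gradient 2 J^T(x) v(x) matches a finite-difference gradient of F(x) = v^T(x) v(x).

import numpy as np

def v(x):                                   # an arbitrary residual vector (N = 3, n = 2)
    return np.array([x[0] - 1.0, x[1] - 2.0, x[0] * x[1]])

def jacobian(x):                            # its analytic Jacobian, [J]_{i,j} = dv_i/dx_j
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [x[1], x[0]]])

def F(x):                                   # sum-of-squares performance index
    return v(x) @ v(x)

x = np.array([0.5, -0.3])
g_formula = 2.0 * jacobian(x).T @ v(x)      # gradient from the matrix form above

eps = 1e-6                                  # central finite differences on F
g_numeric = np.array([(F(x + eps * np.eye(2)[j]) - F(x - eps * np.eye(2)[j])) / (2 * eps)
                      for j in range(2)])
print(g_formula, g_numeric)                 # the two agree to roundoff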

Hessian
The k,j element of the Hessian matrix is
[∇²F(x)]_{k,j} = 2 Σ_{i=1}^{N} { (∂v_i/∂x_k)(∂v_i/∂x_j) + v_i(x) ∂²v_i/(∂x_k ∂x_j) },
so the Hessian can be written
∇²F(x) = 2 J^T(x) J(x) + 2 S(x),  where  S(x) = Σ_{i=1}^{N} v_i(x) ∇²v_i(x).

Gauss-Newton Method
If S(x) is assumed to be small, the Hessian matrix can be approximated as
∇²F(x) ≅ 2 J^T(x) J(x),
and Newton's method becomes
x_{k+1} = x_k - [2 J^T(x_k) J(x_k)]^{-1} 2 J^T(x_k) v(x_k)
        = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k).
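A minimal sketch of one Gauss-Newton step (my own code; residuals and jacobian are assumed callables returning v(x) and J(x)):

import numpy as np

def gauss_newton_step(x, residuals, jacobian):
    """One Gauss-Newton update: x - [J^T J]^{-1} J^T v."""
    J = jacobian(x)
    r = residuals(x)
    # solve the normal equations instead of forming the inverse explicitly
    dx = np.linalg.solve(J.T @ J, J.T @ r)
    return x - dx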

Levenberg-Marquardt
Gauss-Newton approximates the Hessian by
H = J^T J.
This matrix may be singular, but it can be made invertible as follows:
G = H + μI.
If the eigenvalues and eigenvectors of H are {λ_1, λ_2, ..., λ_n} and {z_1, z_2, ..., z_n}, then
G z_i = [H + μI] z_i = H z_i + μ z_i = (λ_i + μ) z_i,
so G has the same eigenvectors as H, with eigenvalues (λ_i + μ). G can therefore be made positive definite, and hence invertible, by increasing μ until λ_i + μ > 0 for all i. This gives the Levenberg-Marquardt update
x_{k+1} = x_k - [J^T(x_k) J(x_k) + μ_k I]^{-1} J^T(x_k) v(x_k).
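A quick numerical illustration of the eigenvalue shift (my own example with an arbitrary rank-deficient J):

import numpy as np

J = np.array([[1.0, 2.0],
              [2.0, 4.0]])                 # rank 1, so H = J^T J is singular
H = J.T @ J
print(np.linalg.eigvalsh(H))               # eigenvalues: 0 and 25
mu = 0.1
G = H + mu * np.eye(2)
print(np.linalg.eigvalsh(G))               # eigenvalues shifted to 0.1 and 25.1
print(np.linalg.solve(G, np.ones(2)))      # G is now invertible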

Adjustment of μ_k
As μ_k → 0, LM becomes Gauss-Newton. As μ_k → ∞, LM becomes steepest descent with a small learning rate. Therefore, begin with a small μ_k to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased μ_k until F(x) is decreased. F(x) must decrease eventually, since we will be taking a very small step in the steepest descent direction.

Application to Multilayer Network
The performance index for the multilayer network is
F(x) = Σ_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = Σ_{q=1}^{Q} e_q^T e_q = Σ_{q=1}^{Q} Σ_{j=1}^{S^M} (e_{j,q})² = Σ_{i=1}^{N} v_i².
The error vector is
v^T = [v_1 v_2 ... v_N] = [e_{1,1} e_{2,1} ... e_{S^M,1} e_{1,2} ... e_{S^M,Q}].
The parameter vector is
x^T = [x_1 x_2 ... x_n] = [w^1_{1,1} w^1_{1,2} ... w^1_{S^1,R} b^1_1 ... b^1_{S^1} w^2_{1,1} ... b^M_{S^M}].
The dimensions of the two vectors are
N = Q × S^M  and  n = S^1(R+1) + S^2(S^1+1) + ... + S^M(S^{M-1}+1).

Jacobian Matrix
J(x) = [ ∂e_{1,1}/∂w^1_{1,1}    ∂e_{1,1}/∂w^1_{1,2}    ...   ∂e_{1,1}/∂b^1_1    ...
         ∂e_{2,1}/∂w^1_{1,1}    ∂e_{2,1}/∂w^1_{1,2}    ...   ∂e_{2,1}/∂b^1_1    ...
         ...
         ∂e_{S^M,1}/∂w^1_{1,1}  ∂e_{S^M,1}/∂w^1_{1,2}  ...   ∂e_{S^M,1}/∂b^1_1  ...
         ∂e_{1,2}/∂w^1_{1,1}    ∂e_{1,2}/∂w^1_{1,2}    ...   ∂e_{1,2}/∂b^1_1    ...
         ... ]
Each row of the Jacobian corresponds to one individual error v_h = e_{k,q}; each column corresponds to one weight or bias x_l.

Computing the Jacobian
SDBP computes derivatives of the squared error for a single input, terms like
∂(e_q^T e_q)/∂x_l,
using the chain rule:
∂(e_q^T e_q)/∂w^m_{i,j} = (∂(e_q^T e_q)/∂n^m_{i,q}) × (∂n^m_{i,q}/∂w^m_{i,j}),
where the sensitivity s^m_{i,q} = ∂(e_q^T e_q)/∂n^m_{i,q} is computed using backpropagation. For the Jacobian we instead need derivatives of the individual errors:
[J]_{h,l} = ∂v_h/∂x_l = ∂e_{k,q}/∂x_l.

Marquardt Sensitivity
If we define a Marquardt sensitivity
s~^m_{i,h} ≡ ∂v_h/∂n^m_{i,q} = ∂e_{k,q}/∂n^m_{i,q},  where h = (q-1) S^M + k,
we can compute the Jacobian elements as follows:
for a weight: [J]_{h,l} = ∂e_{k,q}/∂w^m_{i,j} = s~^m_{i,h} × a^{m-1}_{j,q},
for a bias:   [J]_{h,l} = ∂e_{k,q}/∂b^m_i = s~^m_{i,h}.

Computing the Sensitivities
Initialization (at the output layer), where F'^m(n^m_q) denotes the diagonal matrix of transfer-function derivatives:
S~^M_q = -F'^M(n^M_q).
Backpropagation through the earlier layers:
S~^m_q = F'^m(n^m_q) (W^{m+1})^T S~^{m+1}_q.
The individual matrices are then augmented into the Marquardt sensitivity matrices
S~^m = [ S~^m_1  S~^m_2  ...  S~^m_Q ].
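To make the bookkeeping concrete, here is a hedged sketch (my own code, not the book's) that assembles the error vector v and the Jacobian J for a small 1-S1-1 network with a tanh hidden layer and a linear output; the network shape, variable names, and transfer functions are illustrative assumptions.

import numpy as np

def lm_jacobian(W1, b1, W2, b2, P, T):
    """Error vector and Jacobian for a 1-S1-1 network (tanh hidden layer,
    linear output).  P: (1, Q) inputs, T: (1, Q) targets.
    With one output, each input pattern q contributes one error, so row h
    of the Jacobian corresponds to pattern q = h."""
    S1, Q = W1.shape[0], P.shape[1]
    n = 3 * S1 + 1                             # parameters: W1, b1, W2, b2
    v = np.zeros(Q)
    J = np.zeros((Q, n))
    for q in range(Q):
        p = P[:, q:q+1]
        n1 = W1 @ p + b1                       # forward pass
        a1 = np.tanh(n1)
        a2 = W2 @ a1 + b2                      # linear output layer
        v[q] = (T[0, q] - a2).item()
        # Marquardt sensitivities: initialize with -F'(n) = -1 at the linear
        # output, then backpropagate through the tanh layer
        s2 = -np.ones((1, 1))
        s1 = (1.0 - a1**2) * (W2.T @ s2)
        # Jacobian row: sensitivity times the input reaching that weight,
        # or the sensitivity itself for a bias
        J[q, 0:S1]      = (s1 @ p.T).ravel()   # d e_q / d W1
        J[q, S1:2*S1]   = s1.ravel()           # d e_q / d b1
        J[q, 2*S1:3*S1] = (s2 @ a1.T).ravel()  # d e_q / d W2
        J[q, 3*S1]      = s2.item()            # d e_q / d b2
    return v, J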

LMBP Algorithm
1. Present all inputs to the network and compute the corresponding network outputs and the errors. Compute the sum of squared errors over all inputs, F(x).
2. Compute the Jacobian matrix: calculate the sensitivities with the backpropagation algorithm, after initializing at the output layer; augment the individual matrices into the Marquardt sensitivities; then compute the elements of the Jacobian matrix.
3. Solve Δx_k = -[J^T(x_k) J(x_k) + μ_k I]^{-1} J^T(x_k) v(x_k) to obtain the change in the weights.
4. Recompute the sum of squared errors with the new weights. If this new sum of squares is smaller than that computed in step 1, then divide μ_k by the factor υ, update the weights, and go back to step 1. If the sum of squares is not reduced, then multiply μ_k by υ and go back to step 3.
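A hedged sketch of the overall update loop (my own code; residuals and jacobian are treated as black-box callables, so the same loop applies whether J comes from Marquardt sensitivities or any other least-squares model; mu, factor, and the iteration limits are assumed defaults):

import numpy as np

def levenberg_marquardt(x, residuals, jacobian,
                        mu=0.01, factor=10.0, n_iter=50, mu_max=1e10):
    """Generic LM loop following the four steps above: solve for dx, accept
    the step and shrink mu if the squared error drops, otherwise grow mu
    and re-solve."""
    v = residuals(x)
    F = v @ v                                   # sum of squared errors
    for _ in range(n_iter):
        J = jacobian(x)
        g, H = J.T @ v, J.T @ J
        while True:
            dx = np.linalg.solve(H + mu * np.eye(len(x)), g)
            x_new = x - dx
            v_new = residuals(x_new)
            F_new = v_new @ v_new
            if F_new < F:                       # step accepted
                x, v, F = x_new, v_new, F_new
                mu /= factor
                break
            mu *= factor                        # step rejected: raise mu, retry
            if mu > mu_max:                     # give up near a stationary point
                return x
    return x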

Example LMBP Step: trajectory plotted against w^1_{1,1} and w^2_{1,1}.

LMBP Trajectory: plotted against w^1_{1,1} and w^2_{1,1}.