1 Variations on Backpropagation

2 Variations
Heuristic Modifications
- Momentum
- Variable Learning Rate
Standard Numerical Optimization
- Conjugate Gradient
- Newton's Method (Levenberg-Marquardt)

3 Performance Surface Example
[Figure: network architecture, nominal function, and parameter values]

4 Squared Error vs. w^1_{1,1} and w^2_{1,1}
[Figure: plots of squared error; axes w^1_{1,1} and w^2_{1,1}]

5 Squared Error vs. w^1_{1,1} and b^1_1
[Figure: plots of squared error; axes w^1_{1,1} and b^1_1]

6 Squared Error vs. b^1_1 and b^1_2
[Figure: plots of squared error over the two bias parameters]

7 Convergence Example
[Figure: contour plot; axes w^1_{1,1} and w^2_{1,1}]

8 Learning Rate Too Large
[Figure: contour plot; axes w^1_{1,1} and w^2_{1,1}]

9 Momentum Filter Example

10 Momentum Backpropagation
[Figure: contour plots comparing Steepest Descent Backpropagation (SDBP) and Momentum Backpropagation (MOBP); axes w^1_{1,1} and w^2_{1,1}]
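To make the MOBP update concrete, here is a minimal MATLAB sketch of the momentum-filtered step Δx_k = γ Δx_{k-1} - (1-γ) α g_k, applied to a generic parameter vector rather than layer-by-layer weight and bias matrices. The quadratic test function, the parameter values, and the variable names are illustrative assumptions, not taken from the slides.

    % Minimal sketch of a momentum (MOBP-style) update on a generic
    % parameter vector x. gradF, alpha, gamma and the test function
    % below are illustrative assumptions.
    gradF = @(x) [2*x(1); 50*x(2)];   % gradient of F(x) = x1^2 + 25*x2^2
    alpha = 0.02;                     % learning rate
    gamma = 0.9;                      % momentum coefficient
    x  = [1; 1];                      % initial parameters
    dx = zeros(size(x));              % previous update, starts at zero

    for k = 1:200
        g  = gradF(x);
        dx = gamma*dx - (1 - gamma)*alpha*g;   % momentum-filtered step
        x  = x + dx;
    end
    disp(x)   % should approach the minimum at the origin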

11 Variable Learning Rate (VLBP)
1. If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is set to zero.
2. If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has previously been set to zero, it is reset to its original value.
3. If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
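The three rules above translate directly into an accept/reject test on each candidate update. The following MATLAB function is a minimal sketch of that decision only (not a full training loop); the function name, argument list, and the parameter values ζ = 0.04 (4%), ρ = 0.7, η = 1.05 are illustrative assumptions.

    function [x, alpha, gamma] = vlbp_update(x, dx, F_old, F_new, alpha, gamma, gamma0)
    % One VLBP accept/reject decision for a trial update dx.
    % zeta, rho, eta below are illustrative parameter choices.
    zeta = 0.04;   % allowed relative increase in squared error (4%)
    rho  = 0.7;    % learning-rate decrease factor, 0 < rho < 1
    eta  = 1.05;   % learning-rate increase factor, eta > 1

    if F_new > (1 + zeta)*F_old
        % Error grew by more than zeta: discard the update,
        % shrink the learning rate, switch momentum off.
        alpha = rho*alpha;
        gamma = 0;
    elseif F_new < F_old
        % Error decreased: accept the update and grow the learning rate.
        x = x + dx;
        alpha = eta*alpha;
        if gamma == 0
            gamma = gamma0;   % restore momentum if it had been switched off
        end
    else
        % Error increased by less than zeta: accept the update,
        % leave alpha and gamma unchanged.
        x = x + dx;
    end
    end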

12 Example
[Figure: contour plot; axes w^1_{1,1} and w^2_{1,1}]

13 Conjugate Gradient
1. The first search direction is steepest descent:
   p_0 = -g_0, where g_k ≡ ∇F(x)|_{x = x_k}
2. Take a step and choose the learning rate α_k to minimize the function along the search direction:
   x_{k+1} = x_k + α_k p_k
3. Select the next search direction according to:
   p_k = -g_k + β_k p_{k-1}
   where Δg_{k-1} ≡ g_k - g_{k-1} and
   β_k = (Δg_{k-1}^T g_k)/(Δg_{k-1}^T p_{k-1})   or   β_k = (g_k^T g_k)/(g_{k-1}^T g_{k-1})   or   β_k = (Δg_{k-1}^T g_k)/(g_{k-1}^T g_{k-1})
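A minimal MATLAB sketch of step 3, computing the new search direction from the current and previous gradients. The three β_k options correspond to the standard Hestenes-Stiefel, Fletcher-Reeves, and Polak-Ribière formulas; the function name and interface are assumptions for illustration.

    function p = cg_direction(g, g_prev, p_prev, method)
    % New conjugate-gradient search direction p = -g + beta*p_prev.
    % g, g_prev are column gradient vectors, p_prev is the previous direction.
    dg = g - g_prev;                       % delta g_{k-1} = g_k - g_{k-1}
    switch method
        case 'hestenes-stiefel'
            beta = (dg'*g)/(dg'*p_prev);
        case 'fletcher-reeves'
            beta = (g'*g)/(g_prev'*g_prev);
        case 'polak-ribiere'
            beta = (dg'*g)/(g_prev'*g_prev);
        otherwise
            error('unknown method');
    end
    p = -g + beta*p_prev;
    end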

14 Interval Location

15 Interval Reduction

16 Golden Section Search
τ = 0.618
Set c_1 = a_1 + (1-τ)(b_1-a_1), Fc = F(c_1)
    d_1 = b_1 - (1-τ)(b_1-a_1), Fd = F(d_1)
For k = 1, 2, ... repeat
    If Fc < Fd then
        Set a_{k+1} = a_k; b_{k+1} = d_k; d_{k+1} = c_k
            c_{k+1} = a_{k+1} + (1-τ)(b_{k+1} - a_{k+1})
            Fd = Fc; Fc = F(c_{k+1})
    else
        Set a_{k+1} = c_k; b_{k+1} = b_k; c_{k+1} = d_k
            d_{k+1} = b_{k+1} - (1-τ)(b_{k+1} - a_{k+1})
            Fc = Fd; Fd = F(d_{k+1})
    end
end until b_{k+1} - a_{k+1} < tol
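The pseudocode above maps directly onto a short MATLAB function. This is a minimal sketch under the same assumption (F is unimodal on the initial interval [a, b]); the function name and the example call are illustrative.

    function [a, b] = golden_section(F, a, b, tol)
    % Reduce the interval [a, b] containing the minimizer of F
    % until it is shorter than tol, using the golden section ratio.
    tau = 0.618;
    c = a + (1 - tau)*(b - a);  Fc = F(c);
    d = b - (1 - tau)*(b - a);  Fd = F(d);
    while (b - a) >= tol
        if Fc < Fd
            % Minimum is in [a, d]: reuse c as the new interior point d.
            b = d;  d = c;  Fd = Fc;
            c = a + (1 - tau)*(b - a);  Fc = F(c);
        else
            % Minimum is in [c, b]: reuse d as the new interior point c.
            a = c;  c = d;  Fc = Fd;
            d = b - (1 - tau)*(b - a);  Fd = F(d);
        end
    end
    end

    % Example use (minimum of the assumed test function is at x = 2):
    % [a, b] = golden_section(@(x) (x - 2).^2, 0, 5, 1e-4)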

17 Conjugate Gradient BP (CGBP)
[Figure: contour plots of intermediate steps and the complete trajectory; axes w^1_{1,1} and w^2_{1,1}]

18 Newton's Method
If the performance index is a sum of squares function:
F(x) = Σ_{i=1}^{N} v_i^2(x) = v^T(x) v(x)
then the jth element of the gradient is
[∇F(x)]_j = ∂F(x)/∂x_j = 2 Σ_{i=1}^{N} v_i(x) ∂v_i(x)/∂x_j

19 Matrix Form
The gradient can be written in matrix form:
∇F(x) = 2 J^T(x) v(x)
where J is the Jacobian matrix:
J(x) = [ ∂v_1/∂x_1   ∂v_1/∂x_2   ...   ∂v_1/∂x_n
         ∂v_2/∂x_1   ∂v_2/∂x_2   ...   ∂v_2/∂x_n
            ...          ...               ...
         ∂v_N/∂x_1   ∂v_N/∂x_2   ...   ∂v_N/∂x_n ]
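As a quick sanity check of ∇F(x) = 2 J^T(x) v(x), here is a small MATLAB sketch using an assumed linear residual v(x) = A x - b, for which the Jacobian is simply A and F(x) = v^T v. The matrices and variable names are illustrative.

    % Numeric check of gradF = 2*J'*v for a linear residual v(x) = A*x - b.
    A = [1 2; 3 4; 5 6];          % assumed 3x2 example
    b = [1; 0; 2];
    x = [0.5; -1];

    v = A*x - b;                  % residual vector v(x)
    J = A;                        % Jacobian of v with respect to x
    gradF = 2*J'*v;               % gradient of F(x) = v'*v

    % Compare against a finite-difference gradient of F.
    F = @(x) (A*x - b)'*(A*x - b);
    h = 1e-6;
    fd = zeros(2,1);
    for j = 1:2
        e = zeros(2,1);  e(j) = h;
        fd(j) = (F(x + e) - F(x - e))/(2*h);
    end
    disp([gradF fd])              % the two columns should agree closely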

20 Hessian
For the sum of squares performance index, the (k, j) element of the Hessian is
[∇²F(x)]_{k,j} = 2 Σ_{i=1}^{N} { (∂v_i/∂x_k)(∂v_i/∂x_j) + v_i(x) ∂²v_i/(∂x_k ∂x_j) }
so in matrix form
∇²F(x) = 2 J^T(x) J(x) + 2 S(x),   where   S(x) = Σ_{i=1}^{N} v_i(x) ∇²v_i(x)

21 Gauss-Newton Method
Approximate the Hessian matrix as:
∇²F(x) ≅ 2 J^T(x) J(x)
Newton's method, x_{k+1} = x_k - [∇²F(x_k)]^{-1} ∇F(x_k), then becomes:
x_{k+1} = x_k - [2 J^T(x_k) J(x_k)]^{-1} 2 J^T(x_k) v(x_k)
        = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k)

22 Gauss-Newton Method
So, in the Gauss-Newton method x_{k+1} is approximated by:
x_{k+1} = x_k - [J_k^T J_k]^{-1} J_k^T v_k
Note that this contains the pseudo-inverse of J_k ≡ J(x_k) applied to the vector v_k ≡ v(x_k), which can be computed efficiently in Matlab as:
xk+1 = xk - Jk\vk
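To make the step concrete, here is a minimal MATLAB sketch of a few Gauss-Newton iterations on an assumed nonlinear least-squares problem (fitting y ≅ p1·exp(p2·t)). The data, the residual definition, and all names are illustrative, not from the slides.

    % Gauss-Newton on an assumed curve-fitting problem: y ~ p1*exp(p2*t).
    t = (0:0.5:3)';                       % sample points
    y = 2*exp(-0.8*t);                    % noiseless synthetic data
    p = [1; -0.1];                        % initial guess [p1; p2]

    for k = 1:10
        e = exp(p(2)*t);
        v = p(1)*e - y;                   % residual vector v(p)
        J = [e, p(1)*t.*e];               % Jacobian: dv/dp1, dv/dp2
        p = p - J\v;                      % Gauss-Newton step (pseudo-inverse via \)
    end
    disp(p)   % should be close to [2; -0.8]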

23 Gauss-Newton Method
Problem with Gauss-Newton: the approximation of the Hessian, H_k ≡ J_k^T J_k, can be singular, and then the inverse does not exist.
Solution: the Levenberg-Marquardt algorithm.

24 Levenberg-Marquardt
Gauss-Newton approximates the Hessian by:
H = J^T J
This matrix may be singular, but it can be made invertible as follows:
G = H + μI
If the eigenvalues and eigenvectors of H are {λ_1, λ_2, ..., λ_n} and {z_1, z_2, ..., z_n}, then
G z_i = (H + μI) z_i = H z_i + μ z_i = λ_i z_i + μ z_i = (λ_i + μ) z_i
so the eigenvalues of G are λ_i + μ. G can therefore be made positive definite, and invertible, by increasing μ. This leads to the Levenberg-Marquardt iteration:
x_{k+1} = x_k - [J_k^T J_k + μ_k I]^{-1} J_k^T v_k
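A tiny MATLAB check of the eigenvalue argument: adding μI to a singular H shifts every eigenvalue by μ and makes the matrix invertible. The rank-deficient Jacobian below is an assumed example.

    % Assumed example: J has linearly dependent columns, so H = J'*J is singular.
    J  = [1 2; 2 4; 3 6];
    H  = J'*J;
    mu = 0.1;
    G  = H + mu*eye(2);

    disp(eig(H))           % eigenvalues of H: 0 and 70 -> H is singular
    disp(eig(G))           % eigenvalues of G: 0.1 and 70.1 -> shifted by mu
    disp(rank(H))          % 1
    disp(rank(G))          % 2, so G is invertible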

25 Adjustment of μk
As μk → 0, LM becomes Gauss-Newton.
As μk → ∞, LM becomes steepest descent with a small learning rate.
Therefore, begin with a small μk to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased μk until F(x) is decreased. F(x) must decrease eventually, since we will be taking a very small step in the steepest descent direction.
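The following MATLAB sketch illustrates the two limits numerically on an assumed Jacobian and residual: for small μ the LM step approaches the Gauss-Newton step, and for large μ it approaches a short step along the steepest-descent direction -J^T v. All values are illustrative.

    % Assumed example Jacobian and residual at the current iterate.
    J = [1 0; 1 1; 1 2];
    v = [0.5; -0.2; 0.3];

    gn_step = -((J'*J)\(J'*v));                    % Gauss-Newton step
    sd_dir  = -J'*v;                               % steepest-descent direction (up to scale)

    for mu = [1e-6 1 1e6]
        lm_step = -((J'*J + mu*eye(2))\(J'*v));    % Levenberg-Marquardt step
        fprintf('mu = %-8g  lm_step = [% .6f  % .6f]\n', mu, lm_step);
    end
    disp(gn_step')        % compare: small mu reproduces this
    disp((sd_dir/1e6)')   % compare: large mu gives approximately this tiny step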

26 Application to Multilayer Network
The performance index for the multilayer network is:
F(x) = Σ_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = Σ_{q=1}^{Q} e_q^T e_q = Σ_{q=1}^{Q} Σ_{j=1}^{S^M} (e_{j,q})^2 = Σ_{i=1}^{N} v_i^2
The error vector is:
v^T = [v_1 v_2 ... v_N] = [e_{1,1} e_{2,1} ... e_{S^M,1} e_{1,2} ... e_{S^M,Q}]
The parameter vector is:
x^T = [x_1 x_2 ... x_n] = [w^1_{1,1} w^1_{1,2} ... w^1_{S^1,R} b^1_1 ... b^1_{S^1} w^2_{1,1} ... b^M_{S^M}]
The dimensions of the two vectors are:
N = Q × S^M   and   n = S^1(R + 1) + S^2(S^1 + 1) + ... + S^M(S^{M-1} + 1)

27 Jacobian Matrix
J(x) = [ ∂e_{1,1}/∂w^1_{1,1}    ∂e_{1,1}/∂w^1_{1,2}    ...   ∂e_{1,1}/∂w^1_{S^1,R}    ∂e_{1,1}/∂b^1_1    ...
         ∂e_{2,1}/∂w^1_{1,1}    ∂e_{2,1}/∂w^1_{1,2}    ...   ∂e_{2,1}/∂w^1_{S^1,R}    ∂e_{2,1}/∂b^1_1    ...
                ...                     ...                          ...                     ...
         ∂e_{S^M,1}/∂w^1_{1,1}  ∂e_{S^M,1}/∂w^1_{1,2}  ...   ∂e_{S^M,1}/∂w^1_{S^1,R}  ∂e_{S^M,1}/∂b^1_1  ...
         ∂e_{1,2}/∂w^1_{1,1}    ∂e_{1,2}/∂w^1_{1,2}    ...   ∂e_{1,2}/∂w^1_{S^1,R}    ∂e_{1,2}/∂b^1_1    ...
                ...                     ...                          ...                     ...          ]
Each row corresponds to one error element v_h = e_{k,q} and each column to one weight or bias element of x.

28 Computing the Jacobian
SDBP computes terms like:
∂F̂(x)/∂x_l
(where F̂ is the squared error for one input/target pair) using the chain rule:
∂F̂/∂w^m_{i,j} = (∂F̂/∂n^m_i)(∂n^m_i/∂w^m_{i,j})
where the sensitivity
s^m_i ≡ ∂F̂/∂n^m_i
is computed using backpropagation.
For the Jacobian we need to compute terms like:
[J]_{h,l} = ∂v_h/∂x_l = ∂e_{k,q}/∂x_l

29 Marquardt Sensitivity
If we define a Marquardt sensitivity:
s̃^m_{i,h} ≡ ∂v_h/∂n^m_{i,q},   where h = (q - 1)S^M + k
We can compute the Jacobian as follows:
weight:  [J]_{h,l} = ∂e_{k,q}/∂w^m_{i,j} = s̃^m_{i,h} × a^{m-1}_{j,q}
bias:    [J]_{h,l} = ∂e_{k,q}/∂b^m_i = s̃^m_{i,h}

30 Computing the Sensitivities
Initialization:
S̃^M_q = -Ḟ^M(n^M_q)
(where Ḟ^m(n^m_q) is the diagonal matrix of transfer function derivatives for layer m)
Backpropagation:
S̃^m_q = Ḟ^m(n^m_q) (W^{m+1})^T S̃^{m+1}_q
The individual matrices are then augmented into the Marquardt sensitivities:
S̃^m = [ S̃^m_1 | S̃^m_2 | ... | S̃^m_Q ]

31 LMBP
1. Present all inputs to the network and compute the corresponding network outputs and the errors. Compute the sum of squared errors over all inputs, F(x).
2. Compute the Jacobian matrix: calculate the sensitivities with the backpropagation recurrence after initializing, augment the individual matrices into the Marquardt sensitivities, and compute the elements of the Jacobian matrix.
3. Solve Δx_k = -[J_k^T J_k + μ_k I]^{-1} J_k^T v_k to obtain the change in the weights.
4. Recompute the sum of squared errors with the new weights. If this new sum of squares is smaller than that computed in step 1, then divide μ_k by ϑ (ϑ > 1), update the weights (x_{k+1} = x_k + Δx_k), and go back to step 1. If the sum of squares is not reduced, then multiply μ_k by ϑ and go back to step 3.
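The steps above can be prototyped with a generic Levenberg-Marquardt loop. The MATLAB sketch below assumes user-supplied functions errvec(x) (returning the error vector v) and jacobian(x) (returning J, e.g. via the Marquardt sensitivities or finite differences); the function and parameter names, and the update factor ϑ = 10, are illustrative assumptions rather than the slides' exact implementation.

    function x = lm_train(errvec, jacobian, x, maxit)
    % Generic Levenberg-Marquardt loop following steps 1-4 above.
    % errvec(x) returns the error vector v(x); jacobian(x) returns J(x).
    mu    = 0.01;    % initial mu: small, so we start close to Gauss-Newton
    theta = 10;      % factor used to increase/decrease mu (assumed value)

    v = errvec(x);
    F = v'*v;                                   % sum of squared errors
    for k = 1:maxit
        J = jacobian(x);
        while true
            dx    = -((J'*J + mu*eye(numel(x)))\(J'*v));   % LM step
            x_new = x + dx;
            v_new = errvec(x_new);
            F_new = v_new'*v_new;
            if F_new < F
                % Successful step: accept it and move toward Gauss-Newton.
                x = x_new;  v = v_new;  F = F_new;
                mu = mu/theta;
                break
            else
                % Unsuccessful step: increase mu and try again.
                mu = mu*theta;
            end
        end
    end
    end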

32 Example LMBP Step
[Figure: contour plot; axes w^1_{1,1} and w^2_{1,1}]

33 LMBP Trajectory
[Figure: contour plot; axes w^1_{1,1} and w^2_{1,1}]

