Section 3: Second Order Methods

Second Order Methods
In contrast to first order gradient methods, second order methods make use of second derivatives to improve optimization. The methods covered in this section are:
– Newton's Method
– Conjugate Gradients
– Davidon-Fletcher-Powell (DFP)
– BFGS

Second Order Methods: Why do they perform better?
(Figure courtesy: https://seas.ucla.edu/~kao/nndl/lectures/optimization.pdf)

Second Order Methods: Why do they perform better?
(Figure courtesy: Introduction to Optimization, Marc Toussaint)

Newton's Method
It is based on a Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$, incorporating second order derivative terms and ignoring derivatives of higher order:
$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H (\theta - \theta_0),$
where $H$ is the Hessian of $J$ evaluated at $\theta_0$. Solving for the critical point of this function we obtain the Newton parameter update rule:
$\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$
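For completeness, the one-line derivation behind that update: set the gradient of the quadratic approximation (denoted $\hat{J}$ here; this notation is not on the slide) to zero and solve for $\theta$:
$\nabla_\theta \hat{J}(\theta) = \nabla_\theta J(\theta_0) + H(\theta - \theta_0) = 0 \;\Rightarrow\; \theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0).$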

Newton's Method
Thus, for a quadratic function (with positive definite $H$), Newton's method jumps directly to the minimum by rescaling the gradient by $H^{-1}$. If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm given next.

Newton's Method: The Algorithm
For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton's method can be applied iteratively. This implies a two-step procedure:
– First, update or compute the inverse Hessian (by updating the quadratic approximation).
– Second, update the parameters according to $\theta \leftarrow \theta - H^{-1} \nabla_\theta J(\theta)$.
A minimal sketch of this procedure is given below.
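The sketch below is illustrative, not the lecture's reference implementation; the quadratic test function, the fixed iteration count, and solving a linear system instead of forming $H^{-1}$ explicitly are assumptions of this sketch.

```python
import numpy as np

def newtons_method(grad, hess, theta0, num_iters=10):
    """Iterative Newton update: theta <- theta - H^{-1} grad(theta).

    grad: function returning the gradient of J at theta
    hess: function returning the Hessian of J at theta
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        g = grad(theta)
        H = hess(theta)
        # Solve H * step = g rather than inverting H explicitly.
        step = np.linalg.solve(H, g)
        theta = theta - step
    return theta

# Example: quadratic J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite Hessian
b = np.array([1.0, -2.0])
grad = lambda th: A @ th - b
hess = lambda th: A
theta_star = newtons_method(grad, hess, theta0=np.zeros(2))
# For a quadratic with positive definite H, a single Newton step already
# lands on the minimum A^{-1} b; further iterations leave it unchanged.
```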

Problems and Solutions
Newton's method is appropriate only when the Hessian is positive definite. In deep learning, the surface of the objective function is generally nonconvex, with many saddle points: these are problematic for Newton's method. This can be avoided by regularizing the Hessian:
– Adding a constant $\alpha$ along the Hessian diagonal: $\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0)$.
Hmmm... why does this help?

Can be avoided by regularizing the Hessian . . .
When we take a step of size $\epsilon$ in the direction of the gradient $g$, the loss function becomes
$f(w_0 - \epsilon g) \approx f(w_0) - \epsilon\, g^T g + \frac{1}{2} \epsilon^2\, g^T H g.$
The Hessian affects convergence through this second order term. Its eigenvalues determine how fast the parameters converge along the direction of the corresponding eigenvector. Directions with large eigenvalues need smaller step sizes because they converge much faster, and will diverge if the step size is too large. Directions with small eigenvalues need larger step sizes because otherwise they converge far too slowly. The ratio between the largest and smallest eigenvalues is therefore very important, which is why it has its own name: the condition number. Adding $\alpha$ to the diagonal decreases this ratio.
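A small numerical sketch of that last point (the toy Hessian and the value of $\alpha$ are chosen purely for illustration): adding $\alpha I$ shifts every eigenvalue up by $\alpha$, which shrinks the ratio of largest to smallest eigenvalue.

```python
import numpy as np

# Toy ill-conditioned Hessian (illustrative values, not from the lecture)
H = np.diag([100.0, 1.0, 0.01])
alpha = 1.0

def condition_number(M):
    eigvals = np.linalg.eigvalsh(M)   # eigenvalues of a symmetric matrix
    return eigvals.max() / eigvals.min()

print(condition_number(H))                      # 10000.0
print(condition_number(H + alpha * np.eye(3)))  # 101 / 1.01 ~= 100.0
```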

Conjugate Gradient Method
A method to efficiently avoid calculating $H^{-1}$:
– By iteratively descending along conjugate directions.
It arises from the observation that steepest descent on a quadratic bowl follows an ineffective zig-zag pattern. Let the previous search direction be $d_{t-1}$. At the minimum, where the line search terminates, the directional derivative is zero in the direction $d_{t-1}$: $\nabla_\theta J(\theta) \cdot d_{t-1} = 0$. Since the gradient at this point defines the current search direction, $d_t = \nabla_\theta J(\theta)$ will have no contribution in the direction $d_{t-1}$. Thus $d_t$ is orthogonal to $d_{t-1}$.

Imposing Conjugate Gradients
We seek a search direction that is conjugate to the previous line search direction. At iteration $t$ the next search direction takes the form
$d_t = \nabla_\theta J(\theta) + \beta_t d_{t-1}.$
Directions $d_t$ and $d_{t-1}$ are conjugate if $d_t^T H d_{t-1} = 0$.

Imposing Conjugate Gradients: Methods
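Two standard ways of computing $\beta_t$ without evaluating the Hessian (presumably the methods this slide title refers to) are the Fletcher-Reeves and Polak-Ribière formulas:
Fletcher-Reeves: $\beta_t = \dfrac{\nabla_\theta J(\theta_t)^T \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^T \nabla_\theta J(\theta_{t-1})}$
Polak-Ribière: $\beta_t = \dfrac{(\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1}))^T \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^T \nabla_\theta J(\theta_{t-1})}$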

Conjugate Gradient Method: The Algorithm
At each iteration $t$: compute the gradient $g_t = \nabla_\theta J(\theta_t)$; compute $\beta_t$ (e.g. by the Polak-Ribière rule); set the search direction $d_t = -g_t + \beta_t d_{t-1}$; perform a line search for $\epsilon^* = \arg\min_\epsilon J(\theta_t + \epsilon\, d_t)$; update $\theta_{t+1} = \theta_t + \epsilon^* d_t$. For surfaces that are not quadratic (nonlinear conjugate gradients), the same loop is applied, with occasional resets of the search direction to the unaltered gradient. A sketch of this loop follows below.
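A minimal sketch of nonlinear conjugate gradients with the Polak-Ribière $\beta$ and a simple backtracking line search; the test function, the clipping of $\beta$ at zero as a restart rule, and the step-size constants are assumptions of this sketch rather than part of the lecture.

```python
import numpy as np

def backtracking_line_search(J, theta, d, g, step=1.0, shrink=0.5, c=1e-4):
    """Shrink the step until a sufficient-decrease (Armijo) condition holds."""
    while J(theta + step * d) > J(theta) + c * step * (g @ d):
        step *= shrink
    return step

def nonlinear_cg(J, grad, theta0, num_iters=50):
    theta = np.asarray(theta0, dtype=float)
    g = grad(theta)
    d = -g                                   # first direction: steepest descent
    for t in range(num_iters):
        eps = backtracking_line_search(J, theta, d, g)
        theta = theta + eps * d
        g_new = grad(theta)
        # Polak-Ribiere beta, clipped at zero (acts as an automatic restart)
        beta = max(0.0, (g_new - g) @ g_new / (g @ g))
        d = -g_new + beta * d
        g = g_new
    return theta

# Example: minimize a quadratic bowl J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
J = lambda th: 0.5 * th @ A @ th - b @ th
grad = lambda th: A @ th - b
print(nonlinear_cg(J, grad, np.zeros(2)))    # approaches A^{-1} b
```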

BFGS (Broyden-Fletcher-Goldfarb-Shanno)
Newton's method without the computational burden: the primary difficulty in Newton's method is the computation of $H^{-1}$. BFGS is a quasi-Newton method: it approximates $H^{-1}$ by a matrix $M_t$ that is iteratively refined by low-rank (rank 2) updates. Once the inverse Hessian approximation $M_t$ is updated, the direction of descent is determined by $\rho_t = M_t g_t$. A line search is performed in this direction to determine the size of the step, $\epsilon^*$, taken in this direction. The final update to the parameters is $\theta_{t+1} = \theta_t + \epsilon^* \rho_t$.
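For reference, the standard rank-2 update of the inverse-Hessian approximation, written in the usual quasi-Newton notation ($s_t$ and $y_t$ are not introduced on the slide):
$s_t = \theta_{t+1} - \theta_t, \qquad y_t = g_{t+1} - g_t,$
$M_{t+1} = \Big(I - \dfrac{s_t y_t^T}{y_t^T s_t}\Big) M_t \Big(I - \dfrac{y_t s_t^T}{y_t^T s_t}\Big) + \dfrac{s_t s_t^T}{y_t^T s_t}.$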

BFGS: The Algorithm
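A compact sketch of the BFGS loop, standing in for the algorithm figure; the backtracking line search, the skip rule for unusable curvature, and the starting point $M_0 = I$ are assumptions of this sketch.

```python
import numpy as np

def bfgs(J, grad, theta0, num_iters=50):
    """BFGS: maintain an approximation M of the inverse Hessian."""
    theta = np.asarray(theta0, dtype=float)
    n = theta.size
    M = np.eye(n)                            # initial inverse-Hessian approximation
    g = grad(theta)
    for _ in range(num_iters):
        d = -M @ g                           # descent direction: negative gradient rescaled by M
        # Simple backtracking (Armijo) line search for the step size eps*
        eps = 1.0
        while J(theta + eps * d) > J(theta) + 1e-4 * eps * (g @ d):
            eps *= 0.5
        theta_new = theta + eps * d
        g_new = grad(theta_new)
        s = theta_new - theta                # parameter change
        y = g_new - g                        # gradient change
        sy = y @ s
        if sy > 1e-10:                       # skip the update if curvature is unusable
            I = np.eye(n)
            V = I - np.outer(s, y) / sy
            M = V @ M @ V.T + np.outer(s, s) / sy   # rank-2 inverse-Hessian update
        theta, g = theta_new, g_new
    return theta

# Example on the same illustrative quadratic bowl as before
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
J = lambda th: 0.5 * th @ A @ th - b @ th
grad = lambda th: A @ th - b
print(bfgs(J, grad, np.zeros(2)))            # approaches A^{-1} b
```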

L-BFGS
The memory cost of the BFGS algorithm can be significantly decreased by avoiding storing the complete inverse Hessian approximation $M$. The L-BFGS algorithm computes the approximation $M$ using the same method as BFGS, but begins with the assumption that $M_{t-1}$ is the identity matrix, rather than storing the approximation from one step to the next. In practice it does not store all past updates, only those from the last $m$ iterations. If used with exact line searches, the directions defined by L-BFGS are mutually conjugate.
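In practice one rarely hand-codes L-BFGS. As an illustration (assuming SciPy is available; the quadratic test objective is again an assumption of this sketch), the limited-memory variant can be called through scipy.optimize:

```python
import numpy as np
from scipy.optimize import minimize

# Same illustrative quadratic objective as in the earlier sketches
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
J = lambda th: 0.5 * th @ A @ th - b @ th
grad = lambda th: A @ th - b

result = minimize(J, x0=np.zeros(2), jac=grad, method="L-BFGS-B",
                  options={"maxcor": 10})   # maxcor = number m of stored corrections
print(result.x)                             # approaches A^{-1} b
```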