Section 3: Second Order Methods


1 Section 3: Second Order Methods

2 Second Order Methods
In contrast to first-order gradient methods, second-order methods make use of second derivatives to improve optimization. The methods covered in this section are:
Newton’s Method
Conjugate Gradients
Davidon-Fletcher-Powell (DFP)
BFGS

3 Second Order Methods: Why do they perform better?
[Figure omitted.]

4 Second Order Methods: Why do they perform better?
[Figure omitted. Courtesy: Introduction to Optimization, Marc Toussaint]

5 Newton’s Method
Newton’s method is based on a Taylor series expansion that approximates 𝐽(𝜃) near some point 𝜃₀, incorporating second-order derivative terms and ignoring derivatives of higher order:
𝐽(𝜃) ≈ 𝐽(𝜃₀) + (𝜃 − 𝜃₀)ᵀ ∇𝜃𝐽(𝜃₀) + ½ (𝜃 − 𝜃₀)ᵀ 𝐻 (𝜃 − 𝜃₀),
where 𝐻 is the Hessian of 𝐽 evaluated at 𝜃₀. Solving for the critical point of this approximation, we obtain the Newton parameter update rule:
𝜃∗ = 𝜃₀ − 𝐻⁻¹ ∇𝜃𝐽(𝜃₀)
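To make the update rule concrete, here is a minimal NumPy sketch of a single Newton step. The quadratic objective, the matrix A, and the vector b are assumptions made up for the example, not from the slides.

```python
import numpy as np

# Quadratic objective J(theta) = 0.5 * theta^T A theta - b^T theta;
# A plays the role of the Hessian H.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def grad_J(theta):
    return A @ theta - b                  # gradient of the quadratic

theta0 = np.array([5.0, 5.0])             # arbitrary starting point
H = A                                      # Hessian of a quadratic is constant

# Newton update rule: rescale the gradient by the inverse Hessian.
theta_star = theta0 - np.linalg.solve(H, grad_J(theta0))

# theta_star coincides with the exact minimizer A^{-1} b.
print(theta_star, np.linalg.solve(A, b))
```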

6 Newton’s Method
Thus, for a quadratic function (with positive definite 𝐻), rescaling the gradient by 𝐻⁻¹ lets Newton’s method jump directly to the minimum.
If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm given next.

7 Newton’s Method: The Algorithm
For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton’s method can be applied iteratively. This implies a two-step procedure:
– First, update or compute the inverse Hessian (by updating the quadratic approximation).
– Second, update the parameters according to 𝜃 ← 𝜃 − 𝐻⁻¹ ∇𝜃𝐽(𝜃).
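A minimal sketch of this two-step iteration on a convex but non-quadratic objective; the objective, gradient, and Hessian below are illustrative assumptions, not the slide’s algorithm.

```python
import numpy as np

# Objective: J(theta) = sum(theta^4) + sum(theta^2), minimized at the origin.
def grad_J(theta):
    return 4 * theta**3 + 2 * theta        # gradient

def hessian_J(theta):
    return np.diag(12 * theta**2 + 2)      # positive definite everywhere

theta = np.array([2.0, -1.5])
for _ in range(10):
    H = hessian_J(theta)                               # step 1: (re)compute H
    theta = theta - np.linalg.solve(H, grad_J(theta))  # step 2: Newton update

print(theta)   # converges toward the minimum at the origin
```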

8 Problems and Solutions
Newton’s method is appropriate only when the Hessian is positive definite. In deep learning the surface of the objective function is generally nonconvex, with many saddle points, which are problematic for Newton’s method.
This can be avoided by regularizing the Hessian, i.e., adding a constant 𝛼 along the Hessian diagonal:
𝜃∗ = 𝜃₀ − [𝐻(𝑓(𝜃₀)) + 𝛼𝐼]⁻¹ ∇𝜃𝑓(𝜃₀)
Why does this help?
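A minimal sketch of the regularized Newton step; the indefinite Hessian and gradient below are made-up stand-ins for what a saddle point might produce.

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, -0.5]])   # one negative eigenvalue: a saddle
g = np.array([1.0, 1.0])                   # gradient at theta0
theta0 = np.zeros(2)
alpha = 1.0                                # regularization strength (assumed)

# Adding alpha*I shifts every eigenvalue of H up by alpha, restoring positive
# definiteness as long as alpha exceeds the magnitude of the most negative one.
H_reg = H + alpha * np.eye(2)
theta_star = theta0 - np.linalg.solve(H_reg, g)

print(np.linalg.eigvalsh(H_reg))   # all eigenvalues positive after regularizing
print(theta_star)
```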

9 Can be avoided by regularizing the Hessian . . .
When we take a step of size 𝜖 in the direction of the gradient 𝑔, the loss function becomes
𝑓(𝑤₀ − 𝜖𝑔) ≈ 𝑓(𝑤₀) − 𝜖𝑔ᵀ𝑔 + ½𝜖²𝑔ᵀ𝐻𝑔.
The Hessian affects convergence by introducing the second-order term. The eigenvalues of 𝐻 represent how fast the parameters converge along the direction of the corresponding eigenvector. Directions with large eigenvalues need smaller step sizes because they converge much faster and will diverge if the step size is too large. Directions with smaller eigenvalues need larger step sizes because otherwise they converge far too slowly. The ratio between the largest and smallest eigenvalues is therefore very important, which is why it has its own name: the condition number. Adding 𝛼 decreases this ratio.
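A small numeric illustration of this effect; the ill-conditioned Hessian and the value of 𝛼 are made up for the example.

```python
import numpy as np

H = np.diag([100.0, 1.0])          # ill-conditioned: condition number = 100
alpha = 9.0                         # assumed regularization strength

def condition_number(M):
    eigvals = np.linalg.eigvalsh(M)
    return eigvals.max() / eigvals.min()

print(condition_number(H))                      # 100.0
print(condition_number(H + alpha * np.eye(2)))  # (100 + 9) / (1 + 9) = 10.9
```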

10 Conjugate Gradient Method
A method to efficiently avoid calculating 𝐻⁻¹ by iteratively descending along conjugate directions. It arises from the observation that steepest descent applied to a quadratic bowl follows an ineffective zig-zag pattern.
Let the previous search direction be 𝑑ₜ₋₁. At the minimum, where the line search terminates, the directional derivative in direction 𝑑ₜ₋₁ is zero: ∇𝜃𝐽(𝜃) · 𝑑ₜ₋₁ = 0. Since the gradient at this point defines the current search direction, 𝑑ₜ = ∇𝜃𝐽(𝜃) will have no contribution in the direction 𝑑ₜ₋₁. Thus 𝑑ₜ is orthogonal to 𝑑ₜ₋₁.
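A minimal sketch, assuming the quadratic 𝐽(𝜃) = ½𝜃ᵀ𝐴𝜃 with a made-up matrix A, showing that steepest descent with exact line search makes each new gradient orthogonal to the previous one, which is exactly the source of the zig-zag pattern.

```python
import numpy as np

A = np.array([[10.0, 0.0], [0.0, 1.0]])   # elongated quadratic bowl
theta = np.array([1.0, 10.0])

for _ in range(5):
    g = A @ theta                          # gradient of 0.5 * theta^T A theta
    eps = (g @ g) / (g @ A @ g)            # exact line-search step size
    theta = theta - eps * g
    g_next = A @ theta
    print(g @ g_next)                      # ~0: consecutive gradients orthogonal
```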

11 Imposing Conjugate Gradients
We seek a search direction that is conjugate to the previous line-search direction. At iteration 𝑡 the next search direction 𝑑ₜ takes the form
𝑑ₜ = ∇𝜃𝐽(𝜃) + 𝛽ₜ𝑑ₜ₋₁
Directions 𝑑ₜ and 𝑑ₜ₋₁ are conjugate if 𝑑ₜᵀ𝐻𝑑ₜ₋₁ = 0.

12 Imposing Conjugate Gradients: Methods
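Two standard rules for computing 𝛽ₜ from successive gradients are Fletcher-Reeves and Polak-Ribière. A minimal sketch of both; the function names are illustrative, not from the slides.

```python
import numpy as np

def beta_fletcher_reeves(grad_t, grad_prev):
    # Fletcher-Reeves: ratio of squared gradient norms.
    return (grad_t @ grad_t) / (grad_prev @ grad_prev)

def beta_polak_ribiere(grad_t, grad_prev):
    # Polak-Ribiere: uses the change in gradient between iterations.
    return ((grad_t - grad_prev) @ grad_t) / (grad_prev @ grad_prev)
```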

13 Conjugate Gradient Method: The Algorithm
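A minimal sketch of the conjugate gradient iteration described above, assuming a quadratic stand-in objective, a Polak-Ribière 𝛽ₜ, and a SciPy scalar line search; this is an illustration, not the algorithm from the original slide.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative quadratic stand-in for J(theta); A and b are made up.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
J = lambda theta: 0.5 * theta @ A @ theta - b @ theta
grad_J = lambda theta: A @ theta - b

theta = np.array([2.0, 1.0])
g_prev = None
d = None
for t in range(10):
    g = grad_J(theta)
    if np.linalg.norm(g) < 1e-10:
        break                                          # converged
    if d is None:
        d = -g                                         # first step: steepest descent
    else:
        beta = (g - g_prev) @ g / (g_prev @ g_prev)    # Polak-Ribiere
        d = -g + beta * d                              # conjugate search direction
    eps = minimize_scalar(lambda e: J(theta + e * d)).x  # line search along d
    theta = theta + eps * d
    g_prev = g

print(theta, np.linalg.solve(A, b))                    # both approach the minimizer
```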

14 BFGS: Newton’s Method without the Computational Burden
The primary difficulty of Newton’s method is the computation of 𝐻⁻¹. BFGS (Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method: it approximates 𝐻⁻¹ by a matrix 𝑀ₜ that is iteratively refined by low-rank (rank-2) updates.
Once the inverse Hessian approximation 𝑀ₜ is updated, the direction of descent 𝜌ₜ is determined by 𝜌ₜ = 𝑀ₜ𝑔ₜ. A line search is performed along this direction to determine the step size 𝜀∗ taken in this direction. The final update to the parameters is
𝜃ₜ₊₁ = 𝜃ₜ + 𝜀∗𝜌ₜ
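A minimal sketch of the rank-2 inverse-Hessian update used by BFGS, written in the standard quasi-Newton form; the variable names s, y and the test matrix are my own, not from the slides.

```python
import numpy as np

def bfgs_update(M, s, y):
    """Rank-2 update of the inverse Hessian approximation M.
    s = theta_{t+1} - theta_t, y = g_{t+1} - g_t (requires y @ s > 0)."""
    n = len(s)
    r = 1.0 / (y @ s)
    I = np.eye(n)
    return (I - r * np.outer(s, y)) @ M @ (I - r * np.outer(y, s)) \
           + r * np.outer(s, s)

# Usage on a quadratic with Hessian A: the updated M satisfies the secant
# condition M_new @ y == s.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([0.5, -0.25])
y = A @ s                              # gradient difference for a quadratic
M_new = bfgs_update(np.eye(2), s, y)
print(M_new @ y, s)                    # both (approximately) equal
```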

15 BFGS: The Algorithm

16 L-BFGS
The memory cost of the BFGS algorithm can be significantly decreased by avoiding storing the complete inverse Hessian approximation 𝑀. The L-BFGS algorithm computes the approximation 𝑀 using the same method as BFGS, but begins with the assumption that 𝑀ₜ₋₁ is the identity matrix, rather than storing the approximation from one step to the next. It does not store all past iterations, only the last 𝑚. If used with exact line searches, the directions defined by L-BFGS are mutually conjugate.
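In practice L-BFGS is usually called through a library rather than implemented by hand. A minimal usage sketch with SciPy’s minimize and the L-BFGS-B method; the objective below is an assumed example, and the maxcor option corresponds to the number 𝑚 of stored updates.

```python
import numpy as np
from scipy.optimize import minimize

def f(theta):
    # Rosenbrock function, a standard non-quadratic test objective.
    return (1 - theta[0])**2 + 100 * (theta[1] - theta[0]**2)**2

def grad_f(theta):
    return np.array([
        -2 * (1 - theta[0]) - 400 * theta[0] * (theta[1] - theta[0]**2),
        200 * (theta[1] - theta[0]**2),
    ])

result = minimize(f, x0=np.array([-1.0, 2.0]), jac=grad_f,
                  method="L-BFGS-B",
                  options={"maxcor": 10})   # keep only the last 10 update pairs
print(result.x)                             # approaches the minimum at (1, 1)
```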
