# Unconstrained Optimization

Rong Jin



Recap  Gradient ascent/descent Simple algorithm, only requires the first order derivative Problem: difficulty in determining the step size  Small step size  slow convergence  Large step size  oscillation or bubbling

## Recap: Newton's Method
- Univariate Newton's method: x_{t+1} = x_t − f′(x_t) / f″(x_t)
- Multivariate Newton's method: x_{t+1} = x_t − H⁻¹ ∇f(x_t), where H is the Hessian matrix
- Guaranteed to converge when the objective function is convex/concave
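A minimal sketch of both updates on toy convex objectives of our own choosing. In the multivariate case we solve the linear system H d = ∇f rather than forming H⁻¹ explicitly, which is the standard implementation trick.

```python
import numpy as np

# Univariate Newton on f(x) = x^2 - 2x (minimum at x = 1):
# x <- x - f'(x) / f''(x).
def newton_1d(x, iters=5):
    for _ in range(iters):
        x -= (2.0 * x - 2.0) / 2.0
    return x

# Multivariate Newton on f(x) = 1/2 x'Ax - b'x:
# gradient = Ax - b, Hessian = A; solve H d = gradient instead of inverting H.
def newton_nd(A, b, x, iters=5):
    for _ in range(iters):
        x = x - np.linalg.solve(A, A @ x - b)
    return x
```

For a quadratic objective a single Newton step lands exactly on the minimizer, which is why the method is so fast near the optimum.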

Recap  Problem with standard Newton method Computing inverse of Hessian matrix H is expensive (O(n^3)) The size of Hessian matrix H can be very large (O(n^2))  Quasi-Newton method (BFGS): Approximate the inverse of Hessian matrix H with another matrix B Avoid the difficulty in computing inverse of H However, still have problem when the size of B is large  Limited memory Quasi-Newton method (L-BFGS) Storing a set of vectors instead of matrix B Avoid the difficulty in computing the inverse of H Avoid the difficulty in storing the large-size B

## Recap: Comparison

| Method | Cost per iteration | Number of variables | Convergence rate |
|---|---|---|---|
| Standard Newton method | O(n³) | Small | Very fast |
| Quasi-Newton method (BFGS) | O(n²) | Medium | Fast |
| Limited-memory quasi-Newton method (L-BFGS) | O(n) | Large | Reasonably fast |

## Empirical Study: Learning a Conditional Exponential Model

| Dataset | Instances | Features |
|---|---|---|
| Rule | 29,602 | 246 |
| Lex | 42,509 | 135,182 |
| Summary | 24,044 | 198,467 |
| Shallow | 8,625,782 | 264,142 |

| Dataset | Gradient ascent: iterations | Gradient ascent: time (s) | L-BFGS: iterations | L-BFGS: time (s) |
|---|---|---|---|---|
| Rule | 350 | 4.80 | 81 | 1.13 |
| Lex | 1,545 | 114.21 | 176 | 20.02 |
| Summary | 3,321 | 190.22 | 69 | 8.52 |
| Shallow | 14,527 | 85,962.53 | 421 | 2,420.30 |

Free Software  http://www.ece.northwestern.edu/~nocedal/so ftware.html http://www.ece.northwestern.edu/~nocedal/so ftware.html L-BFGS L-BFGSB

Conjugate Gradient  Another Great Numerical Optimization Method !

## Linear Conjugate Gradient Method
- Consider optimizing the quadratic function f(x) = (1/2) xᵀAx − bᵀx
- Conjugate vectors: the set of vectors {p₁, p₂, …, p_l} is said to be conjugate with respect to a matrix A if pᵢᵀApⱼ = 0 for all i ≠ j
- Important property: the quadratic function can be optimized by simply minimizing along each individual direction in the conjugate set
- Optimal solution: x* = α₁p₁ + α₂p₂ + … + α_l p_l, where α_k is the minimizer along the k-th conjugate direction

Example  Minimize the following function  Matrix A  Conjugate direction  Optimization First direction, x 1 = x 2 =x: Second direction, x 1 =- x 2 =x: Solution: x 1 = x 2 =1

## How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure: given conjugate directions {p₁, p₂, …, p_{k−1}}, set p_k = −r_k + β_k p_{k−1}, where r_k is the current gradient (residual) and β_k = (r_kᵀ A p_{k−1}) / (p_{k−1}ᵀ A p_{k−1})
- Theorem: the direction generated in the above step is conjugate to all previous directions {p₁, p₂, …, p_{k−1}}, i.e., p_kᵀ A p_i = 0 for all i < k
- Note: computing the k-th direction p_k requires only the previous direction p_{k−1}
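Putting the pieces together gives the standard linear conjugate gradient solver: each iteration needs only the current residual and the single previous direction, and for an n-dimensional quadratic it terminates in at most n steps. A minimal sketch:

```python
import numpy as np

def conjugate_gradient(A, b, x=None, tol=1e-10):
    """Minimize f(x) = 1/2 x'Ax - b'x (A symmetric positive definite)."""
    n = len(b)
    x = np.zeros(n) if x is None else np.asarray(x, dtype=float)
    r = A @ x - b                  # gradient of the quadratic
    p = -r                         # first direction: steepest descent
    for _ in range(n):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)           # exact minimizer along p
        x = x + alpha * p
        r_new = r + alpha * Ap               # updated residual
        beta = (r_new @ r_new) / (r @ r)     # makes the new p A-conjugate
        p = -r_new + beta * p                # only p_{k-1} is needed
        r = r_new
    return x
```

The β formula here is the algebraically equivalent residual-ratio form commonly used in implementations; it avoids an extra matrix-vector product.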

## Nonlinear Conjugate Gradient
- Although conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions
  - Convergence is guaranteed when the objective is convex/concave
- Variants:
  - Fletcher-Reeves conjugate gradient (FR-CG)
  - Polak-Ribiere conjugate gradient (PR-CG): more robust than FR-CG
- Compared to Newton's method:
  - It is a first-order method
  - Usually less efficient than Newton's method
  - However, it is simpler to implement
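A sketch of Polak-Ribiere nonlinear CG on a smooth convex toy objective of our own choosing, with a basic Armijo backtracking line search (production implementations use a Wolfe line search). The `max(0, ·)` is the common PR+ safeguard that restarts with steepest descent when β would go negative.

```python
import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] - 2.0) ** 2   # minimum at (1, 2)

def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] - 2.0)])

def pr_cg(x, iters=100, tol=1e-8):
    g = grad(x)
    d = -g
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        if g @ d >= 0:                 # safeguard: restart if not a descent direction
            d = -g
        t = 1.0                        # Armijo backtracking line search
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d) and t > 1e-12:
            t *= 0.5
        x = x + t * d
        g_new = grad(x)
        beta = max(0.0, (g_new @ (g_new - g)) / (g @ g))   # Polak-Ribiere (PR+)
        d = -g_new + beta * d
        g = g_new
    return x
```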

## Empirical Study: Learning a Conditional Exponential Model

| Dataset | Instances | Features |
|---|---|---|
| Rule | 29,602 | 246 |
| Lex | 42,509 | 135,182 |
| Summary | 24,044 | 198,467 |
| Shallow | 8,625,782 | 264,142 |

| Dataset | Conjugate gradient (PR): iterations | Conjugate gradient (PR): time (s) | L-BFGS: iterations | L-BFGS: time (s) |
|---|---|---|---|---|
| Rule | 142 | 1.93 | 81 | 1.13 |
| Lex | 281 | 21.72 | 176 | 20.02 |
| Summary | 537 | 31.66 | 69 | 8.52 |
| Shallow | 2,813 | 16,251.12 | 421 | 2,420.30 |

Free Software  http://www.ece.northwestern.edu/~nocedal/so ftware.html http://www.ece.northwestern.edu/~nocedal/so ftware.html CG+

## When Should We Use Which Optimization Technique?
- Use Newton's method if you can find a package
- Use conjugate gradient if you have to implement it yourself
- Use gradient ascent/descent if you are lazy

## Logarithm Bound Algorithms
- To maximize an objective f(x):
  - Start with a guess x₀
  - For t = 1, 2, …, T:
    - Compute the current objective value f(x_t)
    - Find a decoupling function that lower-bounds f and touches it at the current point (the touch point)
    - Find the optimal solution of the bound and take it as the next iterate

## Logarithm Bound Algorithm
- Start with an initial guess x₀
- Come up with a lower-bound function Φ(x) ≤ f(x) − f(x₀)
- Touch point: Φ(x₀) = 0
- Find the optimal solution x₁ for Φ(x)
- Repeat the above procedure
- The iterates converge to the optimal point
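A minimal numeric sketch of the bound idea, on a toy problem of our own choosing: maximize f(x) = log(x) − x (optimum at x = 1) by repeatedly maximizing a quadratic lower bound that touches f at the current point. On x ≥ 1/2 we have f″(x) = −1/x² ≥ −4, so Φ(x) = f′(x_t)(x − x_t) − 2(x − x_t)² satisfies Φ(x) ≤ f(x) − f(x_t) with touch point Φ(x_t) = 0, and its maximizer gives the update.

```python
# Bound optimization (MM) for f(x) = log(x) - x, starting in [1/2, 1].
def bound_maximize(x=0.5, iters=200):
    for _ in range(iters):
        fp = 1.0 / x - 1.0     # f'(x_t)
        x = x + fp / 4.0       # argmax of the quadratic lower bound
    return x
```

Each step maximizes the bound, so f never decreases; the iterates climb monotonically to the optimum x = 1.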

## Property of Concave Functions
- For any concave function f and weights λᵢ ≥ 0 with Σᵢ λᵢ = 1 (Jensen's inequality): f(Σᵢ λᵢ xᵢ) ≥ Σᵢ λᵢ f(xᵢ)

## Important Inequality
- log(x) and −exp(x) are concave functions
- Therefore, for weights λᵢ ≥ 0 with Σᵢ λᵢ = 1: log(Σᵢ λᵢ xᵢ) ≥ Σᵢ λᵢ log(xᵢ)
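A quick numeric spot-check of this inequality for the concave log, on weights and points of our own choosing:

```python
import math

# Jensen's inequality for log: log(sum_i w_i x_i) >= sum_i w_i log(x_i),
# with w_i >= 0 and sum w_i = 1.
w = [0.2, 0.3, 0.5]
x = [1.0, 4.0, 9.0]
lhs = math.log(sum(wi * xi for wi, xi in zip(w, x)))
rhs = sum(wi * math.log(xi) for wi, xi in zip(w, x))
```

This gap between the log of a weighted sum and the weighted sum of logs is exactly what EM exploits to build its lower bound.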

## Expectation-Maximization Algorithm
- Derive the EM algorithm for the hierarchical mixture model
- (Diagram: input x is routed by a gating function r(x) to expert models m₁(x) and m₂(x), which produce the output y)
- Log-likelihood of training data
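The slide's formula was lost in extraction; assuming the standard two-expert hierarchical mixture from the diagram (gating r(x) choosing between experts m₁ and m₂), the log-likelihood of the training data would take the form:

```latex
\mathcal{L} = \sum_{i=1}^{n} \log\Big[\, r(x_i)\, m_1(y_i \mid x_i) + \big(1 - r(x_i)\big)\, m_2(y_i \mid x_i) \,\Big]
```

The logarithm of a sum is what makes direct maximization hard; EM lower-bounds it via the concavity of log (Jensen's inequality above) and maximizes the bound instead.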

