Optimization
Issues
What is optimization?
What real-life situations give rise to optimization problems?
When is it easy to optimize?
What are we trying to optimize?
What can cause problems when we try to optimize?
What methods can we use to optimize?
One-Dimensional Minimization: golden section search; Brent's method
One-Dimensional Minimization
Golden section search: successively narrow a bracket of lower and upper bounds
Start with x1 < x2 < x3, where f2 = f(x2) is smaller than both f1 and f3
Iteration: choose x4 somewhere in the larger of the two subintervals, say x4 in (x2, x3)
Two cases for f4 = f(x4):
f4 > f2: the new bracket is [x1, x2, x4]
f4 < f2: the new bracket is [x2, x4, x3]
Terminating condition: |x3 - x1| < e
Initial bracketing…
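The narrowing loop above can be sketched in Python (a minimal illustration with my own function names and test function, not GSL's implementation):

```python
def golden_section_min(f, x1, x3, tol=1e-8):
    """Minimize a unimodal f on [x1, x3] by golden section search:
    keep two interior points at golden-ratio positions and discard
    the outer subinterval that cannot contain the minimum."""
    invphi = (5 ** 0.5 - 1) / 2              # 1/phi ~ 0.618
    while abs(x3 - x1) > tol:                # terminating condition |x3 - x1| < e
        x2 = x3 - invphi * (x3 - x1)         # interior points at golden positions
        x4 = x1 + invphi * (x3 - x1)
        if f(x2) < f(x4):
            x3 = x4                          # minimum bracketed in [x1, x2, x4]
        else:
            x1 = x2                          # minimum bracketed in [x2, x4, x3]
    return (x1 + x3) / 2

# Example: the minimum of (x - 2)^2 on [0, 5] is at x = 2.
xmin = golden_section_min(lambda x: (x - 2) ** 2, 0.0, 5.0)
```

Each pass shrinks the bracket by the same factor 0.618, which is the guaranteed linear convergence discussed on the next slides.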
From GSL
Lower bound a, upper bound b, initial estimate x, with f(a) > f(x) < f(b)
This condition guarantees that a minimum is contained somewhere within the interval. On each iteration a new point x' is selected using one of the available algorithms. If the new point is a better estimate of the minimum, i.e. where f(x') < f(x), then the current estimate of the minimum x is updated. The new point also allows the size of the bounded interval to be reduced, by choosing the most compact set of points which satisfies the constraint f(a) > f(x) < f(b). The interval is reduced until it encloses the true minimum to a desired tolerance. This provides a best estimate of the location of the minimum and a rigorous error estimate.
Golden Section Search
Guaranteed linear convergence [GSL]: choosing the golden section as the bisection ratio can be shown to provide the fastest convergence for this type of algorithm.
[x1,x3]/[x1,x4] = 1.618 (the golden ratio)
Golden Section (figure)
Fibonacci Search (ref)
Related to golden section search: the interval-reduction ratios F(n-1)/F(n) approach 1/phi ≈ 0.618
Fi: 0, 1, 1, 2, 3, 5, 8, 13, …
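A quick sketch of the relation (pure Python; the ratio of consecutive Fibonacci numbers tends to the golden ratio):

```python
def fib(n):
    """First n Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, 13, ..."""
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

f = fib(20)
ratio = f[-1] / f[-2]          # tends to the golden ratio (1 + sqrt(5)) / 2
```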
Parabolic Interpolation (Brent)
Brent Details (From GSL) The minimum of the parabola is taken as a guess for the minimum. If it lies within the bounds of the current interval then the interpolating point is accepted, and used to generate a smaller interval. If the interpolating point is not accepted then the algorithm falls back to an ordinary golden section step. The full details of Brent's method include some additional checks to improve convergence.
Brent (details)
The abscissa x that is the minimum of a parabola through three points (a, f(a)), (b, f(b)), (c, f(c)):
x = b - (1/2) [ (b-a)^2 (f(b)-f(c)) - (b-c)^2 (f(b)-f(a)) ] / [ (b-a) (f(b)-f(c)) - (b-c) (f(b)-f(a)) ]
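This formula translates directly into code (a sketch of the interpolation step only; Brent's full method adds the safeguards described above):

```python
def parabola_min(a, fa, b, fb, c, fc):
    """Abscissa of the vertex of the parabola through (a, fa),
    (b, fb), (c, fc).  Assumes the three points are not collinear,
    so the denominator is nonzero."""
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

# Three points on f(x) = (x - 2)^2: the vertex is recovered exactly.
x = parabola_min(0.0, 4.0, 1.0, 1.0, 3.0, 1.0)
```

Because the fit is exact for any quadratic, a single interpolation step recovers the minimum of a parabola; for general smooth f it converges superlinearly near the minimum.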
Multi-Dimensional Minimization Gradient Descent Conjugate Gradient
Gradient and Hessian
f: R^n -> R. If f(x) is of class C^2, f can serve as an objective function.
Gradient of f: ∇f(x) = (∂f/∂x1, …, ∂f/∂xn)^T
Hessian of f: H(x), with entries H_ij = ∂^2 f / ∂x_i ∂x_j (symmetric for C^2 functions)
Optimality
Taylor's expansion: f(x* + d) = f(x*) + ∇f(x*)^T d + (1/2) d^T H(x*) d + o(|d|^2)
At a local minimum x*: ∇f(x*) = 0 and the Hessian H(x*) is positive semi-definite
For one-dimensional f(x): f'(x*) = 0 and f''(x*) >= 0
Multi-Dimensional Optimization
Higher-dimensional root finding is no easier than minimization; in fact it is more difficult
Quasi-Newton Method
Taylor series of f(x) around xk: f(xk + d) ≈ f(xk) + ∇f(xk)^T d + (1/2) d^T B d
B: an approximation to the Hessian matrix
The gradient of this approximation: ∇f(xk) + B d
Setting this gradient to zero provides the Newton step: d = -B^{-1} ∇f(xk)
The various quasi-Newton methods (DFP, BFGS, Broyden) differ in how they update B.
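A minimal sketch of one Newton step in 2-D (the quadratic and helper names are illustrative; with B equal to the exact Hessian of a quadratic, a single step lands on the minimum):

```python
def newton_step_2d(grad, hess, x):
    """One Newton step x - B^{-1} grad f(x), with the 2x2 system
    B d = grad f(x) solved by Cramer's rule."""
    g1, g2 = grad(x)
    (a, b), (c, d) = hess(x)
    det = a * d - b * c                    # assumes B is nonsingular
    d1 = (d * g1 - b * g2) / det
    d2 = (a * g2 - c * g1) / det
    return (x[0] - d1, x[1] - d2)

# f(x, y) = 10 (x - 1)^2 + 20 (y - 2)^2, minimum at (1, 2).
grad = lambda p: (20 * (p[0] - 1), 40 * (p[1] - 2))
hess = lambda p: ((20.0, 0.0), (0.0, 40.0))
step = newton_step_2d(grad, hess, (5.0, 7.0))
```

Quasi-Newton methods avoid computing the Hessian; they build up B (or its inverse) from successive gradient differences.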
Gradient Descent
Are the successive search directions always orthogonal? Yes: with an exact line search, each new gradient is orthogonal to the previous search direction.
Example (figure: minimizing a function; the minimum is marked)
…
Gradient is perpendicular to level curves and surfaces (proof)
Weakness of Gradient Descent
Narrow valley: when the Hessian is ill-conditioned, the mutually orthogonal steps zigzag across the valley, so convergence is slow
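The zigzag is easy to reproduce (pure-Python sketch on an illustrative 2-D quadratic with condition number 25; note that consecutive gradients come out orthogonal, as claimed above):

```python
def steepest_descent(A, b, x, steps):
    """Exact-line-search steepest descent on f(x) = 1/2 x^T A x - b^T x
    for a 2x2 symmetric positive definite A.  Returns all iterates."""
    def mat_vec(M, v):
        return (M[0][0] * v[0] + M[0][1] * v[1],
                M[1][0] * v[0] + M[1][1] * v[1])
    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]

    path = [x]
    for _ in range(steps):
        Ax = mat_vec(A, x)
        g = (Ax[0] - b[0], Ax[1] - b[1])           # gradient Ax - b
        if dot(g, g) == 0.0:
            break
        alpha = dot(g, g) / dot(g, mat_vec(A, g))  # exact line search
        x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
        path.append(x)
    return path

# Narrow valley: eigenvalues 1 and 25; the minimum is at the origin.
A = ((1.0, 0.0), (0.0, 25.0))
path = steepest_descent(A, (0.0, 0.0), (25.0, 1.0), 20)
```

Each step shrinks the error only by a factor of about (κ-1)/(κ+1) = 24/26 here, so even after 20 steps the iterate is still well away from the minimum.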
Any function f(x) can be locally approximated by a quadratic function f(x) ≈ (1/2) x^T A x - b^T x + c, where A is the (symmetric) Hessian at the expansion point
The conjugate gradient method works well on problems of this kind
Conjugate Gradient
An iterative method for solving linear systems Ax = b, where A is symmetric and positive definite
Guaranteed to converge in n steps (in exact arithmetic), where n is the system size
Symmetric A is positive definite if it has (any of these equivalent properties):
All n eigenvalues are positive
All n leading principal (upper-left) determinants are positive
All n pivots are positive
x^T A x is positive except at x = 0
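Of the four tests, the pivot test is the cheapest to code by hand (a sketch for small symmetric matrices given as nested lists):

```python
def is_positive_definite(A):
    """Test positive definiteness of a symmetric matrix by checking
    that every pivot of Gaussian elimination without row exchanges
    is positive.  Works on a copy; A is a list of row lists."""
    n = len(A)
    M = [row[:] for row in A]
    for k in range(n):
        if M[k][k] <= 0:
            return False                     # nonpositive pivot
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= factor * M[k][j]
    return True

# [[2, 1], [1, 2]] has pivots 2 and 3/2 (positive definite);
# [[1, 2], [2, 1]] has pivots 1 and -3 (indefinite).
```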
Details (from Wikipedia)
Two nonzero vectors u and v are conjugate w.r.t. A if u^T A v = 0
{pk}: n mutually conjugate directions; {pk} form a basis of R^n
x*, the solution to Ax = b, can be expressed in this basis: x* = Σk αk pk
Multiplying Ax* = b by pj^T gives αj = (pj^T b) / (pj^T A pj)
Therefore: find the pk's, then solve for the αk's
The Iterative Method
Equivalent problem: find the minimum of the quadratic function f(x) = (1/2) x^T A x - b^T x
Take the first basis vector p1 to be the negative gradient of f at x = x0; the other vectors in the basis will be conjugate to the gradient
rk: the residual at the kth step, rk = b - A xk
Note that rk is the negative gradient of f at x = xk
The Algorithm
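The algorithm can be sketched in pure Python for small dense systems (list-based, using the standard residual/direction recurrences; the test system is a small illustrative example):

```python
def conjugate_gradient(A, b, x, tol=1e-10):
    """Solve Ax = b for symmetric positive definite A.
    Converges in at most n steps in exact arithmetic."""
    def mat_vec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    r = [bi - ai for bi, ai in zip(b, mat_vec(A, x))]   # residual = -gradient
    p = r[:]                                            # first search direction
    rs_old = dot(r, r)
    for _ in range(len(b)):
        Ap = mat_vec(A, p)
        alpha = rs_old / dot(p, Ap)                     # exact line search
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]  # next conjugate direction
        rs_old = rs_new
    return x

# Illustrative system: A = [[4, 1], [1, 3]], b = [1, 2]; solution (1/11, 7/11).
sol = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```

For this 2x2 system the loop terminates after two iterations, matching the n-step guarantee.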
Example Stationary point at [-1/26, -5/26]
Solving Linear Equations
The optimality condition suggests that CG can be used to solve linear equations, but CG is only applicable for symmetric positive definite A
For an arbitrary linear system, solve the normal equations A^T A x = A^T b, since A^T A is symmetric and positive semi-definite for any A
But κ(A^T A) = κ(A)^2: slower convergence, worse accuracy
BiCG (biconjugate gradient) is the approach to use for general A
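The squaring of the condition number is easiest to see for a diagonal matrix, where κ is just the ratio of the largest to the smallest positive diagonal entry (illustrative sketch):

```python
def cond_diag(d):
    """Condition number of a diagonal matrix with positive diagonal d:
    the ratio of its largest to its smallest entry."""
    return max(d) / min(d)

d = [1.0, 10.0]                  # A = diag(1, 10), kappa(A) = 10
dtd = [x * x for x in d]         # A^T A = diag(1, 100), kappa = 100 = 10^2
```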
Multidimensional Minimizers [GSL]
Conjugate gradient: Fletcher-Reeves, Polak-Ribiere
Quasi-Newton: Broyden-Fletcher-Goldfarb-Shanno (BFGS), utilizes a 2nd-order approximation
Steepest descent: inefficient (for demonstration purposes)
Simplex algorithm (Nelder and Mead): works without derivatives
GSL Example Objective function: paraboloid Starting from (5,7)
Conjugate gradient: converges in 12 iterations
Steepest descent: converges in 158 iterations
[Solutions in Numerical Recipes]
Sec. 2.7 linbcg (biconjugate gradient): general A; references A implicitly through atimes
Sec. 10.6 frprmn (minimization)
Model test problem: spacetime, …