Unconstrained Optimization Rong Jin

Logistic Regression  The optimization problem is to find the weights w and threshold b that maximize the log-likelihood of the training data (shown below). How can we do this efficiently?
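A standard form of the log-likelihood being maximized (a reconstruction, assuming training pairs (x_i, y_i) with labels y_i in {0, 1}):

```latex
\ell(w, b) = \sum_{i=1}^{n} \Big[\, y_i \,(w^{\top} x_i + b)
            - \log\!\big(1 + e^{\,w^{\top} x_i + b}\big) \Big]
```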

Gradient Ascent  Compute the gradient of the log-likelihood  Update the weights w and threshold b by moving in the direction of the gradient
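A minimal NumPy sketch of this procedure (assuming labels y_i in {0, 1} and a fixed step size eta; names and defaults are illustrative, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, eta=0.1, n_iters=1000):
    """Maximize the logistic log-likelihood by gradient ascent.

    X: (n, m) feature matrix; y: (n,) labels in {0, 1}.
    Returns the weights w and threshold b.
    """
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)      # predicted P(y = 1 | x)
        grad_w = X.T @ (y - p)      # d/dw of the log-likelihood
        grad_b = np.sum(y - p)      # d/db of the log-likelihood
        w += (eta / n) * grad_w     # step in the gradient direction
        b += (eta / n) * grad_b
    return w, b
```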

Problem with Gradient Ascent  Difficult to find an appropriate step size η: a small step size gives slow convergence, while a large one causes oscillation or "bubbling"  Convergence conditions: the Robbins-Monro conditions on the step sizes (below), together with a "regular" (well-behaved) objective function, ensure convergence
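The Robbins-Monro conditions on the step-size sequence {η_t} are:

```latex
\sum_{t=1}^{\infty} \eta_t = \infty,
\qquad
\sum_{t=1}^{\infty} \eta_t^{2} < \infty
\qquad \text{(satisfied, e.g., by } \eta_t = 1/t\text{)}
```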

Newton's Method  Utilize the second-order derivative  Expand the objective function to second order around x_0  The minimum point of this quadratic approximation gives the update  Iterating this update is Newton's method for optimization (see below)  Guaranteed to converge when the objective function is convex
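In one dimension, the second-order expansion around x_0 and the resulting update take the standard form:

```latex
f(x) \approx f(x_0) + f'(x_0)\,(x - x_0) + \tfrac{1}{2}\, f''(x_0)\,(x - x_0)^{2},
\qquad
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}
```

The update is obtained by setting the derivative of the quadratic approximation to zero.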

Multivariate Newton Method  The objective function comprises multiple variables Example: logistic regression model for text categorization, where thousands of words mean thousands of variables  For a multivariate function, the first-order derivative is a vector (the gradient) and the second-order derivative is the Hessian matrix  The Hessian is an m x m matrix  Each element of the Hessian is a second partial derivative, defined below
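The standard definitions referenced here: the gradient collects the first partial derivatives, and each Hessian entry is a second partial derivative:

```latex
\nabla f = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_m} \right)^{\!\top},
\qquad
H_{ij} = \frac{\partial^{2} f}{\partial x_i \, \partial x_j}, \quad i, j = 1, \ldots, m
```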

Multivariate Newton Method  Updating equation (see below)  Hessian matrix for the logistic regression model  Can be expensive to compute Example: for text categorization with 10,000 words, the Hessian is of size 10,000 x 10,000, i.e., 100 million entries  Even worse, we have to compute the inverse of the Hessian matrix H^-1
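A standard form of the multivariate Newton update and, assuming labels y_i in {0, 1} and p_i = σ(w^T x_i + b), of the log-likelihood's Hessian with respect to the weights (the threshold b adds one extra row and column):

```latex
x_{k+1} = x_k - H^{-1}(x_k)\, \nabla f(x_k),
\qquad
\frac{\partial^{2} \ell}{\partial w \, \partial w^{\top}}
  = -\sum_{i=1}^{n} p_i \,(1 - p_i)\, x_i x_i^{\top}
```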

Quasi-Newton Method  Approximate the inverse Hessian H^-1 with another matrix B  B is updated iteratively (BFGS), utilizing gradients from previous iterations (see the update below)
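The standard BFGS update of the inverse-Hessian approximation, with s_k = x_{k+1} − x_k, y_k = ∇f(x_{k+1}) − ∇f(x_k), and ρ_k = 1 / (y_k^T s_k):

```latex
B_{k+1} = \big( I - \rho_k\, s_k y_k^{\top} \big)\, B_k\, \big( I - \rho_k\, y_k s_k^{\top} \big)
        + \rho_k\, s_k s_k^{\top}
```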

Limited-Memory Quasi-Newton  Quasi-Newton avoids computing the inverse of the Hessian matrix, but it still requires storing the B matrix, which needs large storage  Limited-Memory Quasi-Newton (L-BFGS) avoids even explicitly forming the B matrix: B can be expressed as a product of vectors, and only the most recent vector pairs (about 3~20) are kept
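As an illustration (not part of the slides), SciPy's L-BFGS-B routine can maximize the logistic log-likelihood by minimizing its negation; the synthetic data and option values below are only examples:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """Negative logistic log-likelihood; theta packs [w, b]."""
    w, b = theta[:-1], theta[-1]
    z = X @ w + b
    return np.sum(np.logaddexp(0.0, z) - y * z)   # log(1 + e^z) - y*z, computed stably

def neg_gradient(theta, X, y):
    w, b = theta[:-1], theta[-1]
    p = sigmoid(X @ w + b)
    return np.concatenate([X.T @ (p - y), [np.sum(p - y)]])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.ones(5) + rng.normal(size=200) > 0).astype(float)

result = minimize(neg_log_likelihood, x0=np.zeros(6), args=(X, y),
                  jac=neg_gradient, method="L-BFGS-B",
                  options={"maxcor": 10})   # keep roughly 10 recent vector pairs
print(result.x)
```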

Efficiency

Number of variables   Method                                               Convergence rate
Small                 Standard Newton method: O(n^3)                       V-Fast
Medium                Quasi-Newton method (BFGS): O(n^2)                   Fast
Large                 Limited-memory quasi-Newton method (L-BFGS): O(n)    R-Fast

Empirical Study: Learning Conditional Exponential Model

Dataset     Instances    Features
Rule        29,…
Lex         42,509       135,182
Summary     24,044       198,467
Shallow     8,625,782    264,142

Iterations and running time (s) were compared for the limited-memory quasi-Newton method and gradient ascent on each of the four datasets.

Free Software  …ftware.html  L-BFGS  L-BFGS-B

Linear Conjugate Gradient Method  Consider optimizing a quadratic function  Conjugate vectors: the set of vectors {p_1, p_2, …, p_l} is said to be conjugate with respect to a matrix A if p_i^T A p_j = 0 for all i ≠ j  Important property: the quadratic function can be optimized by simply minimizing it along each individual direction in the conjugate set  Optimal solution: x* = Σ_k α_k p_k, where α_k is the minimizer along the k-th conjugate direction
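Assuming the standard quadratic form with a symmetric positive-definite matrix A, the objects referenced on this slide are:

```latex
f(x) = \tfrac{1}{2}\, x^{\top} A x - b^{\top} x,
\qquad
p_i^{\top} A p_j = 0 \;\; \text{for all } i \neq j \quad \text{(conjugacy)},
\qquad
x^{*} = \sum_{k=1}^{l} \alpha_k\, p_k, \quad
\alpha_k = \frac{p_k^{\top} b}{p_k^{\top} A p_k}
```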

Example  Minimize the following function  Matrix A  Conjugate direction  Optimization First direction, x 1 = x 2 =x: Second direction, x 1 =- x 2 =x: Solution: x 1 = x 2 =1

How to Efficiently Find a Set of Conjugate Directions  Iterative procedure: given conjugate directions {p_1, p_2, …, p_(k-1)}, set p_k as shown below  Theorem: the direction generated in this step is conjugate to all previous directions {p_1, p_2, …, p_(k-1)}, i.e., p_k^T A p_j = 0 for all j < k  Note: computing the k-th direction p_k only requires the previous direction p_(k-1)
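A standard form of this construction for the quadratic case, with residual r_k = A x_k − b = ∇f(x_k):

```latex
p_k = -\,r_k + \beta_k\, p_{k-1},
\qquad
\beta_k = \frac{r_k^{\top} r_k}{r_{k-1}^{\top} r_{k-1}},
\qquad
p_k^{\top} A p_j = 0 \;\; \text{for all } j < k
```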

Nonlinear Conjugate Gradient  Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions  Several variants: Fletcher-Reeves conjugate gradient (FR-CG) and Polak-Ribière conjugate gradient (PR-CG), the latter being more robust than FR-CG  Compared to Newton's method: no need to compute the Hessian matrix, and no need to store the Hessian matrix
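For a general nonlinear objective, the two variants differ only in how β_k is computed from the gradients g_k = ∇f(x_k):

```latex
\beta_k^{\mathrm{FR}} = \frac{g_k^{\top} g_k}{g_{k-1}^{\top} g_{k-1}},
\qquad
\beta_k^{\mathrm{PR}} = \frac{g_k^{\top} \left( g_k - g_{k-1} \right)}{g_{k-1}^{\top} g_{k-1}}
```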