Chapter 2-OPTIMIZATION

Chapter 2-OPTIMIZATION G.Anuradha

Contents: Derivative-based Optimization (Descent Methods; The Method of Steepest Descent; Classical Newton's Method; Step Size Determination) and Derivative-free Optimization (Genetic Algorithms; Simulated Annealing; Random Search; Downhill Simplex Search).

What is Optimization? Choosing the best element from some set of available alternatives; solving problems in which one seeks to minimize or maximize a real function.

Notation of Optimization. Optimize y = f(x1, x2, …, xn) …(1), subject to gj(x1, x2, …, xn) ≤ / ≥ / = bj, where j = 1, 2, …, n …(2). Equation (1) is the objective function; equation (2) is a set of constraints imposed on the solution; x1, x2, …, xn are the decision variables. Note: the problem is either to maximize or to minimize the value of the objective function.
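For example (an illustrative problem of my own, not taken from the slides), a two-variable instance of this notation is:

    \begin{aligned}
    \text{minimize}\quad & y = f(x_1, x_2) = (x_1 - 3)^2 + (x_2 + 1)^2 \\
    \text{subject to}\quad & g_1(x_1, x_2) = x_1 + x_2 \le 4, \\
                           & g_2(x_1, x_2) = x_1 \ge 0 .
    \end{aligned}

Here the quadratic plays the role of equation (1), and the two inequalities play the role of the constraint set (2).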

Complicating factors in optimization: the existence of multiple decision variables; the complex nature of the relationships between the decision variables and the associated income; and the existence of one or more complex constraints on the decision variables.

Types of optimization. Constrained: the objective function is maximized or minimized subject to constraints on the decision variables. Unconstrained: no constraints are imposed on the decision variables, and differential calculus can be used to analyse them.

Least-Square Methods for System Identification. System identification: determining a mathematical model for an unknown system by observing its input-output data pairs. System identification is required to predict a system's behaviour, to explain the interactions and relationships between inputs and outputs, and to design a controller. It comprises two steps: structure identification and parameter identification.

Structure identification: apply a priori knowledge about the target system to determine a class of models y = f(u; θ), within which the search for the most suitable model is conducted; here y is the model's output, u is the input vector, and θ is the parameter vector.

Parameter identification: the structure of the model is known, and optimization techniques are applied to determine the parameter vector θ so that the model's output matches the target system's output as closely as possible.

Block diagram of parameter identification

Parameter identification: an input ui is applied to both the system and the model. The difference between the target system's output yi and the model's output ŷi is used to update the parameter vector θ so as to minimize the difference. System identification is not a one-pass process; structure and parameter identification need to be carried out repeatedly.

Classification of optimization algorithms: derivative-based algorithms and derivative-free algorithms.

Characteristics of derivative-free algorithms. Derivative freeness: they rely on repeated evaluation of the objective function rather than on derivative information. Intuitive guidelines: their concepts are based on nature's wisdom, such as evolution and thermodynamics. They are slower, but flexible. Randomness: they are stochastic, which helps them act as global optimizers. Analytic opacity: knowledge about them is based largely on empirical studies. Iterative nature: they progress by repeated iterations.

Characteristics of derivative-free algorithms (contd.). Stopping condition of the iteration: let k denote the iteration count and fk the best objective function value obtained at count k. The stopping condition typically depends on the computation time, the optimization goal, a minimal improvement in fk, or a minimal relative improvement in fk.

Basics of Matrix Manipulation and Calculus (two slides of supporting matrix and calculus identities; the equations themselves were not captured in this transcript).

Gradient of a Scalar Function  

Jacobian of a Vector Function  
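Since the formulas on the two preceding slides were not captured, the standard definitions are restated here for completeness (f : R^n → R is a scalar function and y = f(x) : R^n → R^m is a vector function):

    \nabla f(\mathbf{x}) =
      \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^{T},
    \qquad
    \mathbf{J}(\mathbf{x}) = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
      \begin{bmatrix}
        \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
        \vdots & \ddots & \vdots \\
        \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
      \end{bmatrix}.

The gradient is the column vector of first partial derivatives of a scalar function; the Jacobian is the m × n matrix of first partial derivatives of a vector function.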

Least Square Estimator. The method of least squares is a standard approach to the approximate solution of overdetermined systems: the least-squares solution minimizes the sum of the squares of the errors made in solving every single equation. A typical application is data fitting.

Types of Least Squares. Linear: the model is a linear combination of the parameters; it may represent a straight line, a parabola, or any other linear combination of basis functions. Non-linear: the parameters appear inside functions, e.g. β², e^{βx}. If the derivatives of the model with respect to the parameters are either constant or depend only on the values of the independent variable, the model is linear; otherwise it is non-linear.
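As a concrete contrast (illustrative models, not taken from the slides):

    \begin{aligned}
    \text{linear in the parameters:}\quad & y = \beta_0 + \beta_1 x + \beta_2 x^2 \\
    \text{non-linear in the parameters:}\quad & y = \beta_1 e^{\beta_2 x}
    \end{aligned}

The first model is quadratic in x, yet linear least squares still applies because the derivatives ∂y/∂βj do not depend on the parameters; in the second, ∂y/∂β2 = β1 x e^{β2 x} does depend on them, so the problem is non-linear.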

Differences between Linear and Non-Linear Least Squares:

    Linear                                                      | Non-Linear
    Algorithms do not require initial values                    | Algorithms require initial values
    Globally convex, so non-convergence is not an issue         | Non-convergence is a common issue
    Normally solved using direct methods                        | Usually an iterative process
    The solution is unique                                      | There can be multiple minima in the sum of squares
    Yields unbiased estimates, provided the errors are          | Generally yields biased estimates
    uncorrelated with the predictor values                      |

Linear regression with one variable: Model representation (Machine Learning)

Housing Prices (Portland, OR): a scatter plot of price (in 1000s of dollars) against size (feet²). Supervised learning: we are given the “right answer” for each example in the data. Regression problem: predict a real-valued output.

Training set of housing prices (Portland, OR):

    Size in feet² (x) | Price ($) in 1000's (y)
    2104              | 460
    1416              | 232
    1534              | 315
    852               | 178
    …                 | …

Notation: m = number of training examples; x's = “input” variable / features; y's = “output” variable / “target” variable.

The training set is fed to a learning algorithm, which outputs a hypothesis h; h takes the size of a house as input and returns an estimated price. How do we represent h? hθ(x) = θ0 + θ1x. This is linear regression with one variable, also called univariate linear regression.

Linear regression with one variable: Cost function (Machine Learning)

Training set: as in the table above. Hypothesis: hθ(x) = θ0 + θ1x. The θi's are the parameters. How do we choose the θi's?

Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).

Linear regression with one variable: Cost function intuition I (Machine Learning)

Simplified problem (set θ0 = 0). Hypothesis: hθ(x) = θ1x. Parameter: θ1. Cost function: J(θ1) = (1/2m) Σi (hθ(x^(i)) − y^(i))². Goal: minimize J(θ1) over θ1.

[Plots for several values of θ1: on the left, hθ(x) as a function of x for that fixed θ1, overlaid on the training points; on the right, the corresponding value of J(θ1) as a function of the parameter θ1.]

Linear regression with one variable: Cost function intuition II (Machine Learning)

Hypothesis: hθ(x) = θ0 + θ1x. Parameters: θ0, θ1. Cost function: J(θ0, θ1) = (1/2m) Σi (hθ(x^(i)) − y^(i))². Goal: minimize J(θ0, θ1) over θ0, θ1.
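A minimal Octave sketch of this cost function (the function name compute_cost and the convention that X carries a leading column of ones are my own assumptions, not from the slides):

    function J = compute_cost(X, y, theta)
      % X: m x 2 design matrix [ones(m,1), x];  y: m x 1 targets;  theta: 2 x 1 parameters
      m = length(y);                       % number of training examples
      predictions = X * theta;             % h_theta(x) for every example
      sq_errors = (predictions - y) .^ 2;  % squared prediction errors
      J = (1 / (2 * m)) * sum(sq_errors);  % the cost J(theta0, theta1)
    end

The same function works unchanged later in the chapter for multiple features, where X has n + 1 columns and theta has n + 1 entries.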

[Plots for the two-parameter case: for fixed θ0, θ1, hθ(x) is shown as a function of x against the housing data (price in $1000's vs. size in feet²), alongside the corresponding point on surface and contour plots of J(θ0, θ1) as a function of the parameters.]

Linear regression with one variable: Gradient descent (Machine Learning)

Have some function J(θ0, θ1); want to minimize it over θ0, θ1. Outline: start with some θ0, θ1 (say θ0 = 0, θ1 = 0); keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum.

J(0,1) 1 0

J(0,1) 1 0

Gradient descent algorithm: repeat until convergence { θj := θj − α ∂/∂θj J(θ0, θ1), for j = 0 and j = 1 }. Correct (simultaneous update): temp0 := θ0 − α ∂/∂θ0 J(θ0, θ1); temp1 := θ1 − α ∂/∂θ1 J(θ0, θ1); θ0 := temp0; θ1 := temp1. Incorrect: assigning θ0 before computing the θ1 update, so the θ1 update is evaluated with the already-changed θ0.
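To illustrate the difference (a sketch; grad_J0 and grad_J1 are hypothetical helpers standing for ∂J/∂θ0 and ∂J/∂θ1):

    % Correct: compute both updates from the OLD parameter values, then assign
    temp0  = theta0 - alpha * grad_J0(theta0, theta1);
    temp1  = theta1 - alpha * grad_J1(theta0, theta1);
    theta0 = temp0;
    theta1 = temp1;

    % Incorrect: theta0 is overwritten before the theta1 update is computed,
    % so grad_J1 is evaluated at a mixture of old and new parameter values
    theta0 = theta0 - alpha * grad_J0(theta0, theta1);
    theta1 = theta1 - alpha * grad_J1(theta0, theta1);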

Linear regression with one variable: Gradient descent intuition (Machine Learning)

Gradient descent algorithm

If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

At a local optimum the derivative (d/dθ1) J(θ1) is zero, so the update θ1 := θ1 − α · 0 leaves the current value of θ1 unchanged: gradient descent stays at the local optimum.

Gradient descent can converge to a local minimum, even with the learning rate α fixed. As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.

Linear regression with one variable: Gradient descent for linear regression (Machine Learning)

Apply the gradient descent algorithm, repeat { θj := θj − α ∂/∂θj J(θ0, θ1) }, to the linear regression model hθ(x) = θ0 + θ1x with cost J(θ0, θ1) = (1/2m) Σi (hθ(x^(i)) − y^(i))².

Gradient descent algorithm for linear regression: repeat { θ0 := θ0 − α (1/m) Σi (hθ(x^(i)) − y^(i)); θ1 := θ1 − α (1/m) Σi (hθ(x^(i)) − y^(i)) · x^(i) }, updating θ0 and θ1 simultaneously.
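A compact Octave sketch of this loop, reusing the compute_cost sketch above and assuming a design matrix X = [ones(m,1), x] (these conventions are my own, not prescribed by the slides):

    function [theta, J_history] = gradient_descent(X, y, theta, alpha, num_iters)
      % X: m x (n+1) design matrix with a leading column of ones;  y: m x 1;  theta: (n+1) x 1
      m = length(y);
      J_history = zeros(num_iters, 1);
      for iter = 1:num_iters
        errors = X * theta - y;                       % h_theta(x^(i)) - y^(i) for every example
        grad   = (1 / m) * (X' * errors);             % all partial derivatives at once
        theta  = theta - alpha * grad;                % simultaneous update of every theta_j
        J_history(iter) = compute_cost(X, y, theta);  % track the cost per iteration
      end
    end

For the univariate case, theta is 2 x 1; because the update is vectorized, the identical code also handles many features.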

J(0,1) 1 0

J(0,1) 1 0

[A sequence of paired plots: for the current fixed θ0, θ1, the hypothesis hθ(x) as a function of x fitted to the housing data, next to the corresponding point on the contour plot of J(θ0, θ1) as a function of the parameters, showing gradient descent stepping toward the minimum.]

“Batch” gradient descent: each step of gradient descent uses all the training examples.

Linear regression with multiple variables: Multiple features (Machine Learning)

Multiple features (variables). Previously, a single feature:

    Size (feet²) | Price ($1000)
    2104         | 460
    1416         | 232
    1534         | 315
    852          | 178
    …            | …

Now, multiple features:

    Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
    2104         | 5                  | 1                | 45                  | 460
    1416         | 3                  | 2                | 40                  | 232
    1534         |                    |                  | 30                  | 315
    852          |                    |                  | 36                  | 178
    …

Notation: n = number of features; x^(i) = the input (features) of the i-th training example; xj^(i) = the value of feature j in the i-th training example. (Pop-up quiz.)

Hypothesis. Previously (one variable): hθ(x) = θ0 + θ1x. Now (multiple variables): hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn.

For convenience of notation, define x0 = 1. Then x = [x0, x1, …, xn]^T, θ = [θ0, θ1, …, θn]^T, and the hypothesis can be written compactly as hθ(x) = θ^T x. This is multivariate linear regression.

Linear regression with multiple variables: Gradient descent for multiple variables (Machine Learning)

Hypothesis: hθ(x) = θ^T x = θ0x0 + θ1x1 + … + θnxn. Parameters: θ0, θ1, …, θn. Cost function: J(θ) = (1/2m) Σi (hθ(x^(i)) − y^(i))². Gradient descent: repeat { θj := θj − α ∂/∂θj J(θ) }, simultaneously updating θj for every j = 0, …, n.

Gradient descent. Previously (n = 1): repeat { θ0 := θ0 − α (1/m) Σi (hθ(x^(i)) − y^(i)); θ1 := θ1 − α (1/m) Σi (hθ(x^(i)) − y^(i)) x^(i) } (simultaneously update θ0, θ1). New algorithm (n ≥ 1): repeat { θj := θj − α (1/m) Σi (hθ(x^(i)) − y^(i)) xj^(i) } (simultaneously update θj for j = 0, …, n).
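Because the gradient_descent sketch shown earlier is already vectorized, the same Octave code covers this general case once X is an m x (n+1) design matrix with a leading column of ones; a single update step reads (a sketch with my own variable names):

    theta = theta - alpha * (1 / m) * X' * (X * theta - y);  % simultaneous update of all theta_j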

Linear regression with multiple variables: Gradient descent in practice I, Feature scaling (Machine Learning)

Feature scaling. Idea: make sure features are on a similar scale, e.g. x1 = size (0-2000 feet²) and x2 = number of bedrooms (1-5); when the scales differ this much, the contours of J(θ) become elongated and gradient descent converges slowly.

Feature scaling: get every feature into approximately a −1 ≤ xi ≤ 1 range.

Mean normalization: replace xi with (xi − μi)/si to make the features have approximately zero mean (do not apply this to x0 = 1), where μi is the average value of feature i in the training set and si is its range (max minus min) or its standard deviation; e.g. x1 = (size − μ1)/s1.
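A small Octave sketch of mean normalization (the function name feature_normalize and the choice of the standard deviation for si are assumptions on my part):

    function [X_norm, mu, sigma] = feature_normalize(X)
      % X: m x n matrix of raw feature values (without the column of ones)
      mu     = mean(X);             % 1 x n vector of feature means
      sigma  = std(X);              % 1 x n vector of feature standard deviations
      X_norm = (X - mu) ./ sigma;   % broadcasting: zero-mean, unit-scale features
    end

The returned mu and sigma must be reused to normalize any new example before making a prediction with the learned θ.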

Linear regression with multiple variables: Gradient descent in practice II, Learning rate (Machine Learning)

Gradient descent “debugging”: how to make sure gradient descent is working correctly, and how to choose the learning rate α.

Making sure gradient descent is working correctly: plot J(θ) against the number of iterations; it should decrease steadily. Example automatic convergence test: declare convergence if J(θ) decreases by less than some small threshold (such as 10⁻³) in one iteration.

Making sure gradient descent is working correctly: if J(θ) is increasing, or repeatedly rising and falling, as the number of iterations grows, gradient descent is not working; use a smaller α. For sufficiently small α, J(θ) should decrease on every iteration, but if α is too small, gradient descent can be slow to converge.

Summary: if α is too small, convergence is slow; if α is too large, J(θ) may not decrease on every iteration and may not converge. To choose α, try a sequence of values spaced roughly threefold apart, e.g. 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.
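A sketch of this procedure in Octave, reusing the gradient_descent sketch above (the candidate values and iteration count are illustrative, not prescribed):

    alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];    % candidates spaced roughly 3x apart
    num_iters = 100;
    for k = 1:length(alphas)
      theta_init = zeros(size(X, 2), 1);           % restart from the same initial point
      [~, J_history] = gradient_descent(X, y, theta_init, alphas(k), num_iters);
      plot(1:num_iters, J_history); hold on;       % one J-vs-iteration curve per alpha
    end
    xlabel('No. of iterations'); ylabel('J(\theta)');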

Linear regression with multiple variables: Features and polynomial regression (Machine Learning)

Housing prices prediction: rather than using frontage and depth as two separate features, one can define a new feature, area = frontage × depth, and use the simpler hypothesis hθ(x) = θ0 + θ1 × area.

Polynomial regression (price y vs. size x): fit, for example, a quadratic model θ0 + θ1x + θ2x² or a cubic model θ0 + θ1x + θ2x² + θ3x³ by defining features x1 = x, x2 = x², x3 = x³; feature scaling then becomes important, because the ranges of these features differ enormously.

Choice of features (price y vs. size x): instead of a cubic, an alternative choice is hθ(x) = θ0 + θ1x + θ2√x, which keeps increasing with size without curving back down.
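A hedged Octave sketch of building such polynomial features from the single size column x (the cubic choice, learning rate, and iteration count are illustrative; feature_normalize and gradient_descent are the sketches above):

    X_poly = [x, x .^ 2, x .^ 3];                       % features x, x^2, x^3
    [X_poly, mu, sigma] = feature_normalize(X_poly);    % scaling matters: the ranges differ enormously
    X_design = [ones(length(x), 1), X_poly];            % prepend x0 = 1
    [theta, J_hist] = gradient_descent(X_design, y, zeros(4, 1), 0.1, 400);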

Linear regression with multiple variables: Normal equation (Machine Learning)

Gradient descent solves for θ iteratively. Normal equation: a method to solve for θ analytically, in one step.

Intuition: in the 1D case, with θ a scalar and J(θ) = aθ² + bθ + c, set the derivative dJ/dθ to zero and solve for θ. For θ ∈ R^(n+1), set the partial derivative ∂J/∂θj to zero for every j and solve for θ0, θ1, …, θn.

Examples: add a column x0 = 1 to the feature table, so that each training example becomes a row of the matrix X and the prices form the vector y:

    x0 | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
    1  | 2104         | 5                  | 1                | 45                  | 460
    1  | 1416         | 3                  | 2                | 40                  | 232
    1  | 1534         |                    |                  | 30                  | 315
    1  | 852          |                    |                  | 36                  | 178

(Pop-up quiz: a follow-up slide adds one more training example, with the values 3000, 4, 38, 540, to the same construction.)

With m training examples and n features: stack the training examples (x^(i))^T as the rows of the m × (n+1) design matrix X and the targets y^(i) into the m-vector y; then θ = (X^T X)^{-1} X^T y. E.g., if each x^(i) = [1; x1^(i)], X consists of a column of ones followed by the column of x1 values.

(X^T X)^{-1} is the inverse of the matrix X^T X. Octave: pinv(X'*X)*X'*y
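A self-contained Octave sketch of the normal equation on the single-feature housing table given earlier (the 1650 ft² query point is a hypothetical example of my own):

    x = [2104; 1416; 1534; 852];        % size in feet^2
    y = [460; 232; 315; 178];           % price in $1000's
    m = length(y);
    X = [ones(m, 1), x];                % design matrix with x0 = 1
    theta = pinv(X' * X) * X' * y;      % normal equation: theta = (X^T X)^{-1} X^T y
    price_1650 = [1, 1650] * theta;     % predicted price of a 1650 ft^2 house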

With m training examples and n features. Gradient descent: need to choose α; needs many iterations; works well even when n is large. Normal equation: no need to choose α; no need to iterate; but must compute (X^T X)^{-1}, an (n+1) × (n+1) inverse, which is slow if n is very large.

Linear regression with multiple variables: Normal equation and non-invertibility (optional) (Machine Learning)

Normal equation: what if X^T X is non-invertible (singular/degenerate)? Octave: pinv(X'*X)*X'*y still works, since pinv computes the pseudo-inverse rather than the ordinary inverse.

What if X^T X is non-invertible? Common causes: redundant features (linearly dependent), e.g. x1 = size in feet² and x2 = size in m²; or too many features (e.g. m ≤ n). Remedies: delete some features, or use regularization.

Linear model. The regression function is y = θ1 f1(u) + θ2 f2(u) + … + θn fn(u), where u = (u1, …, up) is the model's input, f1, …, fn are known functions of u, and θ1, …, θn are unknown parameters to be estimated.

Linear model contd… Substituting each observed data pair (ui; yi), i = 1, …, m, into the regression function gives a set of m linear equations, which in matrix notation is written Aθ = y, where A is an m × n matrix whose (i, j) entry is fj(ui), θ is an n × 1 parameter vector, and y is an m × 1 output vector.

Due to noise, a small amount of error is added to each observation, so the model becomes Aθ + e = y, where e is an m × 1 error vector.

Least Square Estimator: the value of θ that minimizes the sum of squared errors ‖y − Aθ‖² = (y − Aθ)^T (y − Aθ) is found by setting the gradient to zero, which gives the normal equations A^T A θ = A^T y and hence the estimate θ = (A^T A)^{-1} A^T y when A^T A is nonsingular.

Problem on Least Square Estimator
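The worked problem itself was not captured in the transcript, so here is a small synthetic illustration of the estimator in Octave (the data are generated in the code from a known line, purely for demonstration):

    u = (1:6)';                          % inputs
    y = 2 + 0.5 * u + 0.1 * randn(6, 1); % noisy observations of y = 2 + 0.5*u
    A = [ones(6, 1), u];                 % regressors f1(u) = 1, f2(u) = u
    theta_hat = (A' * A) \ (A' * y);     % least-squares estimate (A^T A)^{-1} A^T y
    disp(theta_hat');                    % should be close to [2, 0.5]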

Derivative-Based Optimization: deals with gradient-based optimization techniques, which are capable of determining search directions from an objective function's derivative information. They are used in optimizing non-linear neuro-fuzzy models; examples include steepest descent and conjugate gradient.

First-Order Optimality Condition. Expand F(x) in a Taylor series about a candidate point x*: F(x* + Δx) = F(x*) + ∇F(x)^T|_{x=x*} Δx + (1/2) Δx^T ∇²F(x)|_{x=x*} Δx + … For small Δx: F(x* + Δx) ≈ F(x*) + ∇F(x*)^T Δx. If x* is a minimum, this implies ∇F(x*)^T Δx ≥ 0 for every Δx. If ∇F(x*)^T Δx > 0 for some Δx, then F(x* − Δx) ≈ F(x*) − ∇F(x*)^T Δx < F(x*). But this would imply that x* is not a minimum. Therefore ∇F(x*)^T Δx = 0. Since this must be true for every Δx, the gradient must vanish: ∇F(x*) = 0.

Second-Order Condition. If the first-order condition is satisfied (zero gradient), then for small Δx, F(x* + Δx) ≈ F(x*) + (1/2) Δx^T ∇²F(x*) Δx. A strong minimum will exist at x* if Δx^T ∇²F(x*) Δx > 0 for any Δx ≠ 0. Therefore the Hessian matrix must be positive definite. A matrix A is positive definite if z^T A z > 0 for any z ≠ 0. This is a sufficient condition for optimality. A necessary condition is that the Hessian matrix be positive semidefinite: a matrix A is positive semidefinite if z^T A z ≥ 0 for any z.

Basic Optimization Algorithm: starting from an initial guess x0, iterate x_{k+1} = x_k + α_k p_k (equivalently, Δx_k = x_{k+1} − x_k = α_k p_k), where p_k is the search direction and α_k is the learning rate (step size).
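To make the iteration concrete, here is a minimal Octave sketch of steepest descent (p_k set to the negative gradient) on a simple quadratic; the particular quadratic and the fixed learning rate are my own illustrative choices:

    Q = [2 0; 0 10];        % F(x) = 0.5 * x' * Q * x, so grad F(x) = Q * x
    x = [1; 1];             % starting point x_0
    alpha = 0.09;           % fixed learning rate (must be below 2 / max eigenvalue = 0.2)
    for k = 1:50
      p = -Q * x;           % search direction p_k: the negative gradient
      x = x + alpha * p;    % basic update x_{k+1} = x_k + alpha_k * p_k
    end
    disp(x');               % converges toward the minimizer [0, 0]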