Classification and Prediction: Regression Via Gradient Descent Optimization
Bamshad Mobasher, DePaul University


Linear Regression
Linear regression involves a response variable y and a single predictor variable x: y = w0 + w1 x. The weights w0 (the y-intercept) and w1 (the slope) are the regression coefficients. The method of least squares estimates the best-fitting straight line: w0 and w1 are obtained by minimizing the sum of the squared errors (SSE, a.k.a. residuals). w1 can be obtained by setting the partial derivative of the SSE to 0 and solving, ultimately resulting in:

w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,    w0 = ȳ − w1 x̄

where x̄ and ȳ are the means of the x and y values in the training data.
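To make the closed-form estimates concrete, here is a minimal sketch in Python/NumPy that fits w0 and w1 on a small made-up data set (the data values are illustrative only):

import numpy as np

# Illustrative data: y is roughly 2 + 3x plus noise (values are made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares estimates from the formulas above
w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
w0 = y_mean - w1 * x_mean

print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")   # best-fitting line y = w0 + w1*x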

Multiple Linear Regression
Multiple linear regression involves more than one predictor variable. The features are represented as x1, x2, …, xd, and the training data is of the form (x1, y1), (x2, y2), …, (xn, yn), where each xj is a row vector in the data matrix X (i.e., a row in the data). The value of feature xi in data item xj is written xj,i. For example, for 2-D data the regression function is y = w0 + w1 x1 + w2 x2; more generally, y = w0 + w1 x1 + w2 x2 + … + wd xd. (The slide illustrates this with a small data table with columns x1, x2, and y.)

Least Squares Generalization
Multiple dimensions: to simplify the notation, add a new feature x0 = 1 to the feature vector x, so the regression function becomes y = w0 x0 + w1 x1 + … + wd xd = w · x. (The slide shows the data table with the added x0 column containing all 1s.)

Least Squares Generalization
Calculate the error function (SSE) and determine w:

E(w) = Σj (yj − w · xj)² = ‖y − X w‖²

Closed form solution to minimizing E(w):

w = (XᵀX)⁻¹ Xᵀ y
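As a concrete sketch of the closed-form solution, the following Python/NumPy snippet builds the design matrix X (with the added x0 = 1 column) and solves the normal equations; the data values are made up for illustration:

import numpy as np

# Two features plus a bias column x0 = 1 (data values are illustrative)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([4.0, 3.5, 6.0, 9.0])

# Add x0 = 1 as the first column of the design matrix
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Closed-form least squares: w = (X^T X)^(-1) X^T y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w = np.linalg.solve(X.T @ X, X.T @ y)

print("w =", w)                            # [w0, w1, w2]
print("SSE =", np.sum((y - X @ w) ** 2))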

Gradient Descent Optimization
Linear regression can also be solved using the gradient descent (GD) optimization approach. GD can be used in a variety of settings to find the minimum value of functions (including non-linear functions) where a closed form solution is not available or not easily obtained. Basic idea: given an objective function J(w) (e.g., the sum of squared errors), with w a vector of variables w0, w1, …, wd, iteratively minimize J(w) by finding the gradient of the function surface in the variable space and adjusting the weights in the opposite direction. The gradient is a vector in which each element represents the slope of the function in the direction of one of the variables, i.e., the partial derivative of the function with respect to that variable.

Optimization
An example: a quadratic function in 2 variables, f(x) = f(x1, x2) = x1² + x1 x2 + 3 x2². f(x) is at a minimum where the gradient of f(x) is zero in all directions.

Optimization
The gradient is a vector: each element is the slope of the function along the direction of one of the variables, i.e., the partial derivative of the function with respect to that variable. Example, for the quadratic above:

∇f(x) = ( ∂f/∂x1 , ∂f/∂x2 ) = ( 2 x1 + x2 , x1 + 6 x2 )

Optimization
The gradient vector points in the direction of steepest ascent of the function.

Optimization
This two-variable example is still simple enough that we can find the minimum directly: set both elements of the gradient to 0, which gives two linear equations in two variables, and solve for x1 and x2. Here the equations are 2 x1 + x2 = 0 and x1 + 6 x2 = 0, whose only solution is x1 = x2 = 0 (see the sketch below).
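For this particular quadratic the system can also be solved directly with NumPy's linear solver; a minimal sketch (the setup follows the example function above):

import numpy as np

# Gradient of f(x1, x2) = x1^2 + x1*x2 + 3*x2^2 set to zero:
#   2*x1 +   x2 = 0
#     x1 + 6*x2 = 0
A = np.array([[2.0, 1.0],
              [1.0, 6.0]])
b = np.array([0.0, 0.0])

x_min = np.linalg.solve(A, b)
print(x_min)   # [0. 0.] -- the minimum is at x1 = x2 = 0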

Optimization
Finding the minimum directly by a closed-form analytical solution is often difficult or impossible:
- Quadratic functions in many variables: the system of equations for the partial derivatives may be ill-conditioned (example: a linear least squares fit where redundancy among the features is high).
- Other convex functions: a global minimum exists, but there is no closed form solution (example: the maximum likelihood solution for logistic regression).
- Nonlinear functions: the partial derivatives are not linear (examples: f(x1, x2) = x1 sin(x1 x2) + x2², or a sum of transfer functions in neural networks).

Gradient descent optimization
Given an objective (e.g., error) function E(w) = E(w0, w1, …, wd), the process is to follow the gradient downhill:
1) Pick an initial (random) set of weights: w = (w0, w1, …, wd)
2) Determine the descent direction: −∇E(wt)
3) Choose a learning rate: η
4) Update your position: wt+1 = wt − η ∇E(wt)
5) Repeat from 2) until a stopping criterion is satisfied
Typical stopping criteria: ∇E(wt+1) ≈ 0, or some validation metric is optimized. Note: the update step involves simultaneously updating each weight wi. A minimal sketch of this loop is given below.
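Here is a hedged sketch of this loop in Python, applied to the two-variable quadratic from the earlier example; the starting point, learning rate, and stopping threshold are arbitrary choices for illustration:

import numpy as np

def f(x):
    # f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

def grad_f(x):
    # Gradient: (2*x1 + x2, x1 + 6*x2)
    return np.array([2 * x[0] + x[1], x[0] + 6 * x[1]])

eta = 0.1                          # learning rate (illustrative)
x = np.array([3.0, -2.0])          # 1) initial (arbitrary) starting point

for t in range(1000):
    g = grad_f(x)                  # 2) descent direction is -g
    if np.linalg.norm(g) < 1e-6:   # stop when the gradient is ~ 0
        break
    x = x - eta * g                # 4) simultaneous update of all variables

print(t, x, f(x))                  # converges to the minimum at (0, 0)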

Gradient descent optimization
In least squares regression, the process (follow the gradient downhill) is:
1) Select an initial w = (w0, w1, …, wd)
2) Compute −∇E(w)
3) Set the learning rate η
4) Update: w := w − η ∇E(w)
5) Repeat until ∇E(wt+1) ≈ 0
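A hedged sketch of batch gradient descent for least squares regression, using E(w) = ‖y − Xw‖² and its gradient ∇E(w) = −2 Xᵀ(y − Xw); the data, learning rate, and iteration limit are illustrative:

import numpy as np

# Small illustrative data set; the first column of X is the bias feature x0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.2, 6.9, 9.1])

w = np.zeros(X.shape[1])             # initial weights
eta = 0.01                           # learning rate (illustrative)

for t in range(5000):
    grad = -2 * X.T @ (y - X @ w)    # gradient of the SSE with respect to w
    w = w - eta * grad               # simultaneous update of all weights
    if np.linalg.norm(grad) < 1e-8:  # stop when the gradient is ~ 0
        break

print("gradient descent w:", w)
print("closed form w:     ", np.linalg.solve(X.T @ X, X.T @ y))  # should agree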

Illustration of Gradient Descent
(Figures: a sequence of slides showing the error surface E(w) plotted over the weights w0 and w1. The direction of steepest descent is the direction of the negative gradient; each update moves from the original point in weight space to a new point in weight space.)

Gradient descent optimization
Problems:
- Choosing the step size (learning rate): too small → convergence is slow and inefficient; too large → may not converge.
- Can get stuck on “flat” areas of the function.
- Easily trapped in local minima.
A small demonstration of the step-size issue is given below.
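As a small illustration of the step-size problem on the earlier quadratic, the sketch below runs gradient descent with three learning rates: a small rate converges slowly, a moderate rate converges quickly, and a rate that is too large overshoots and diverges (the specific values are illustrative):

import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
    return np.array([2 * x[0] + x[1], x[0] + 6 * x[1]])

def run_gd(eta, steps=50):
    x = np.array([3.0, -2.0])
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

print("eta = 0.05:", run_gd(0.05))   # converges, but slowly
print("eta = 0.25:", run_gd(0.25))   # converges much faster
print("eta = 0.40:", run_gd(0.40))   # overshoots and diverges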

Stochastic gradient descent
Application to training a machine learning model:
1) Choose one sample from the training set: xi
2) Calculate the objective function for that single sample, e.g. Ei(w) = (yi − w · xi)²
3) Calculate the gradient of that single-sample objective: ∇Ei(w)
4) Update the model parameters by a single step based on the gradient and learning rate: w := w − η ∇Ei(w)
5) Repeat from 1) until a stopping criterion is satisfied
Typically the entire training set is processed multiple times (epochs) before stopping. The order in which samples are processed can be fixed or random. A minimal sketch follows.
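A minimal sketch of stochastic gradient descent for the same least squares setting, with one randomly chosen sample per update (the data, learning rate, and epoch count are illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Small illustrative data set; the first column is the bias feature x0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.2, 6.9, 9.1])

w = np.zeros(X.shape[1])
eta = 0.01                                 # learning rate (illustrative)

for epoch in range(200):                   # process the training set many times
    for i in rng.permutation(len(y)):      # random sample order each epoch
        x_i, y_i = X[i], y[i]
        # Single-sample objective E_i(w) = (y_i - w.x_i)^2 and its gradient
        grad_i = -2 * (y_i - w @ x_i) * x_i
        w = w - eta * grad_i               # one update step per sample

print("w =", w)   # close to the batch least squares solution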