Linear Regression.


Linear Regression

Task: Learn a real-valued function f: x -> y, where x = <x1, …, xn>, as a linear function of the input features xi:

hθ(x) = θ0 + θ1*x1 + … + θn*xn

Using x0 = 1, we can write this as:

hθ(x) = Σi θi*xi = θᵀx

Linear Regression

Cost function: We want to penalize deviation from the target values:

J(θ) = (1/2) Σi (hθ(x(i)) − y(i))²

The cost function J(θ) is a convex quadratic function of θ, so there are no local minima.
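As a concrete illustration, here is a minimal NumPy sketch of the hypothesis hθ(x) = θᵀx and the squared-error cost J(θ); the variable names and toy data are illustrative, not from the slides.

```python
import numpy as np

def predict(theta, X):
    # hθ(x) = θᵀx for each row of X (X already includes the x0 = 1 column)
    return X @ theta

def cost(theta, X, y):
    # J(θ) = 1/2 * Σi (hθ(x(i)) - y(i))^2
    residuals = predict(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# Toy data: 3 examples, 1 feature plus the constant x0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([1.0, 1.0]), X, y))  # 0.0: the line y = 1 + x fits exactly
```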

Linear Regression – Cost function

Finding θ that minimizes J(θ). Gradient descent:

θj := θj − α ∂J(θ)/∂θj

Let's consider what happens for a single input pattern (x(i), y(i)):

θj := θj + α (y(i) − hθ(x(i))) xj(i)

Gradient Descent: Stochastic gradient descent (update after each pattern) vs. batch gradient descent, which uses all patterns for each update (below):

θj := θj + α Σi (y(i) − hθ(x(i))) xj(i)   (for every j)
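A minimal Python sketch contrasting the two update schemes, assuming the squared-error cost above and a fixed learning rate alpha; function and variable names are illustrative, not from the slides.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    # One update per pass, using the gradient summed over all patterns
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        errors = y - X @ theta              # y(i) - hθ(x(i)) for all i
        theta += alpha * (X.T @ errors)     # θj += α * Σi errors(i) * xj(i)
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=100):
    # One update per pattern, using only that pattern's gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in range(X.shape[0]):
            error = y[i] - X[i] @ theta
            theta += alpha * error * X[i]
    return theta
```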

Need for scaling input features

Finding θ that minimizes J(θ). Closed-form solution (the normal equation):

θ = (XᵀX)⁻¹ Xᵀ y

where X is the design matrix whose rows are the training inputs x(i)ᵀ, and y is the vector of target values.
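A small NumPy sketch of the normal equation; it solves the linear system XᵀXθ = Xᵀy rather than forming the inverse explicitly, which is the usual numerically safer choice. Names and toy data are illustrative.

```python
import numpy as np

def normal_equation(X, y):
    # θ = (XᵀX)⁻¹ Xᵀ y, computed by solving XᵀX θ = Xᵀ y
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column is x0 = 1
y = np.array([1.0, 2.1, 2.9])
print(normal_equation(X, y))  # approximately [1.05, 0.95]
```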

If we assume y(i) = θᵀx(i) + e(i), with the e(i) being i.i.d. and normally distributed around zero, we can see that least-squares regression corresponds to finding the maximum likelihood estimate of θ.
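A sketch of the standard argument (not verbatim from the slides): under the Gaussian noise model above, with e(i) ~ N(0, σ²), the log-likelihood of θ is

```latex
\ell(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^{2}} \cdot \frac{1}{2}
    \sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{\top} x^{(i)}\bigr)^{2} .
```

The first term does not depend on θ, so maximizing ℓ(θ) is the same as minimizing J(θ) = (1/2) Σi (y(i) − θᵀx(i))², the least-squares cost.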

Underfitting: What if a line isn't a good fit? We can add more features, but too many features lead to overfitting => regularization.

Regularized Linear Regression
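As a hedged sketch, one common form of regularized linear regression (ridge regression) adds a penalty λ Σj θj² to the squared-error cost; the slides' exact formulation may differ. It still has a closed-form solution:

```python
import numpy as np

def ridge_normal_equation(X, y, lam=1.0):
    # θ = (XᵀX + λI)⁻¹ Xᵀ y; the constant term (x0 = 1) is conventionally not penalized
    n = X.shape[1]
    penalty = lam * np.eye(n)
    penalty[0, 0] = 0.0  # do not shrink the intercept θ0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```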

Skipped: Locally weighted linear regression. You can read more in: http://cs229.stanford.edu/notes/cs229-notes1.pdf

Logistic Regression

Logistic Regression - Motivation: Let's now focus on the binary classification problem, in which y can take on only two values, 0 and 1. x is a vector of real-valued features, <x1, …, xn>. We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}.

Logistic Function: g(z) = 1 / (1 + e^(−z)), and we set hθ(x) = g(θᵀx).

Derivative of the Logistic Function: g′(z) = g(z)(1 − g(z))
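A small sketch of the logistic (sigmoid) function together with a numerical check of the derivative identity above; names are illustrative.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # g'(z) = g(z) * (1 - g(z))
    g = sigmoid(z)
    return g * (1.0 - g)

# Numerical check of the identity at z = 0.5
z, eps = 0.5, 1e-6
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.isclose(numerical, sigmoid_derivative(z)))  # True
```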

Interpretation: hθ(x) is the estimate of the probability that y = 1 for a given x:

hθ(x) = P(y = 1 | x; θ)

Thus:

P(y = 1 | x; θ) = hθ(x)
P(y = 0 | x; θ) = 1 − hθ(x)

which can be written more compactly as:

P(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)

Logistic Regression

Mean Squared Error – Not Convex

Alternative cost function?

New cost function: Make the cost function steeper. Intuitively, saying that p(malignant | x) = 0 and being wrong should be penalized severely!
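The steeper cost in question is presumably the standard log loss, Cost(hθ(x), y) = −log(hθ(x)) if y = 1 and −log(1 − hθ(x)) if y = 0. The sketch below contrasts it with squared error for a confidently wrong prediction (illustrative code, not from the slides):

```python
import numpy as np

def log_loss(h, y):
    # -log(h) if y == 1, -log(1 - h) if y == 0
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

def squared_error(h, y):
    return (h - y) ** 2

# A confident but wrong prediction: true label y = 1, predicted probability h -> 0
for h in (0.5, 0.1, 0.01, 0.001):
    print(h, squared_error(h, 1), log_loss(h, 1))
# The squared error saturates near 1, while the log loss grows without bound.
```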

New cost function

Minimizing the New Cost Function: Convex!

Fitting θ

Fitting θ: Working with a single input pattern and remembering hθ(x) = g(θᵀx):

θj := θj + α (y(i) − hθ(x(i))) xj(i)

This is the same update rule as for linear regression, but with a different hypothesis hθ(x).
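A minimal sketch of fitting θ by batch gradient ascent on the log-likelihood using exactly this update rule; the learning rate, iteration count, and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    # Gradient ascent: θj := θj + α * Σi (y(i) - hθ(x(i))) * xj(i)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        errors = y - sigmoid(X @ theta)
        theta += alpha * (X.T @ errors)
    return theta

# Toy linearly separable data (first column is x0 = 1)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic_regression(X, y)
print(sigmoid(X @ theta))  # probabilities close to [0, 0, 1, 1]
```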

Skipped: an alternative is maximizing ℓ(θ) using Newton's method.

From http://www.cs.cmu.edu/~tom/10701_sp11/recitations/Recitation_3

Regularized Logistic Regression

Softmax Regression (Multinomial Logistic Regression, MaxEnt Classifier)

Softmax Regression: The softmax regression model generalizes logistic regression to classification problems where the class label y can take on more than two possible values. The response variable y can take on any one of k values, so y ∈ {1, 2, …, k}.

The parameters θ1, …, θk can be arranged as a k×(n+1) matrix θ, with one row per class.
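Assuming the standard softmax model P(y = j | x; θ) = exp(θjᵀx) / Σl exp(θlᵀx), here is a short sketch of turning that k×(n+1) parameter matrix into class probabilities (with the usual max-subtraction trick for numerical stability; names and numbers are illustrative):

```python
import numpy as np

def softmax_probabilities(Theta, x):
    # Theta is a k x (n+1) matrix whose rows are θ1ᵀ, ..., θkᵀ; x includes x0 = 1
    scores = Theta @ x                      # θjᵀx for each class j
    scores -= np.max(scores)                # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)  # P(y = j | x; θ), sums to 1

Theta = np.array([[0.5, 1.0], [0.0, -1.0], [0.2, 0.3]])  # k = 3 classes, n = 1 feature
x = np.array([1.0, 2.0])
print(softmax_probabilities(Theta, x))  # three probabilities summing to 1
```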

Softmax Derivation from Logistic Regression

One fairly simple way to arrive at the multinomial logit model is to imagine, for K possible outcomes, running K−1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other K−1 outcomes are separately regressed against the pivot outcome. This would proceed as follows, if outcome K (the last outcome) is chosen as the pivot:

ln( P(y = i) / P(y = K) ) = θiᵀx,   for i = 1, …, K−1

Exponentiating and requiring the K probabilities to sum to one then recovers the softmax form for P(y = i).

Cost Function We now describe the cost function that we'll use for softmax regression. In the equation below, 1{.} is the indicator function, so that 1{a true statement} = 1, and 1{a false statement} = 0. For example, 1{2 + 2 = 4} evaluates to 1; whereas 1{1 + 1 = 5} evaluates to 0.

Remember that for logistic regression, we had:

J(θ) = − [ Σi y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]

which can be written similarly as:

J(θ) = − [ Σi Σ_{j=0..1} 1{y(i) = j} log P(y(i) = j | x(i); θ) ]

The softmax cost function is similar, except that we now sum over the k different possible values of the class label:

J(θ) = − [ Σi Σ_{j=1..k} 1{y(i) = j} log( exp(θjᵀx(i)) / Σ_{l=1..k} exp(θlᵀx(i)) ) ]

Note also that in softmax regression, we have:

P(y(i) = j | x(i); θ) = exp(θjᵀx(i)) / Σ_{l=1..k} exp(θlᵀx(i))
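A direct sketch of that cost, written with the indicator 1{y(i) = j} exactly as in the formula above (illustrative names; it recomputes the softmax probabilities so the snippet stands alone):

```python
import numpy as np

def softmax_probabilities(Theta, x):
    scores = Theta @ x
    scores -= np.max(scores)          # max-subtraction for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def softmax_cost(Theta, X, y, k):
    # J(θ) = -Σi Σj 1{y(i) = j} * log P(y(i) = j | x(i); θ)
    total = 0.0
    for x_i, y_i in zip(X, y):
        probs = softmax_probabilities(Theta, x_i)
        for j in range(k):
            indicator = 1.0 if y_i == j else 0.0   # 1{y(i) = j}
            total += indicator * np.log(probs[j])
    return -total

# Toy example: k = 3 classes, 2 examples, 1 feature plus x0 = 1
# (class labels here are 0-indexed: {0, 1, 2})
Theta = np.zeros((3, 2))
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([0, 2])
print(softmax_cost(Theta, X, y, k=3))  # 2 * log(3) ≈ 2.197 for uniform probabilities
```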

Fitting θ

Over-parametrization: Softmax regression has the unusual property that it has a "redundant" set of parameters. Suppose we take each of our parameter vectors θj and subtract some fixed vector ψ from it, so that every θj is now replaced with θj − ψ (for every j = 1, …, k). Our hypothesis now estimates the class label probabilities as

P(y(i) = j | x(i); θ) = exp((θj − ψ)ᵀx(i)) / Σ_{l=1..k} exp((θl − ψ)ᵀx(i)) = exp(θjᵀx(i)) / Σ_{l=1..k} exp(θlᵀx(i))

In other words, subtracting ψ from every θj does not affect the predictions at all.

Over-parametrization: The softmax model is over-parameterized, meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function hθ mapping from inputs x to the predictions. Further, if the cost function J(θ) is minimized by some setting of the parameters (θ1, θ2, …, θk), then it is also minimized by (θ1 − ψ, θ2 − ψ, …, θk − ψ) for any value of ψ.
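A quick numerical check of this invariance with arbitrary toy numbers (illustrative code):

```python
import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))   # max-subtraction for stability
    return exp_scores / np.sum(exp_scores)

Theta = np.array([[0.5, 1.0], [0.0, -1.0], [0.2, 0.3]])   # one row θjᵀ per class
psi = np.array([10.0, -3.0])                              # an arbitrary fixed vector ψ
x = np.array([1.0, 2.0])

p_original = softmax(Theta @ x)
p_shifted = softmax((Theta - psi) @ x)      # every θj replaced by θj − ψ
print(np.allclose(p_original, p_shifted))   # True: the hypothesis is unchanged
```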

Weight Decay: a common fix is to add a weight-decay (regularization) term (λ/2) Σi,j θij² to the cost function, which makes J(θ) strictly convex and resolves the redundancy.

Softmax function: The softmax function will return a value close to 0 whenever xk is significantly less than the maximum of all the values, and a value close to 1 when applied to the maximum value, unless the maximum is extremely close to the next-largest value.

Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the non-smooth function max(x1, …, xn).
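A small numerical illustration: the softmax-weighted average Σk xk·softmax(x)k is smooth and differentiable, and it approaches max(x1, …, xn) as the largest value pulls away from the rest (illustrative code):

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))   # max-subtraction for numerical stability
    return exp_x / np.sum(exp_x)

def softmax_weighted_average(x):
    # Σk xk * softmax(x)k: a smooth, differentiable stand-in for max(x)
    return np.dot(x, softmax(x))

x = np.array([1.0, 2.0, 5.0])
print(max(x), softmax_weighted_average(x))  # 5.0 vs. approximately 4.79
```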

Logit

Let's say that the probability of success of some event is .8. Then the probability of failure is 1 − .8 = .2. The odds of success are defined as the ratio of the probability of success over the probability of failure. In the above example, the odds of success are .8/.2 = 4. That is to say that the odds of success are 4 to 1. The transformation from probability to odds is a monotonic transformation, meaning the odds increase as the probability increases, and vice versa. Probability ranges from 0 to 1. Odds range from 0 to positive infinity.

Why do we take all the trouble doing the transformation from probability to log odds? One reason is that it is usually difficult to model a variable which has a restricted range, such as probability. This transformation is an attempt to get around the restricted-range problem. It maps probability, ranging between 0 and 1, to log odds, ranging from negative infinity to positive infinity. Another reason is that, among all of the infinitely many choices of transformation, the log of odds is one of the easiest to understand and interpret. This transformation is called the logit transformation. The other common choice is the probit transformation, which will not be covered here.

A logistic regression model allows us to establish a relationship between a binary outcome variable and a group of predictor variables. It models the logit-transformed probability as a linear relationship with the predictor variables. More formally, let y be the binary outcome variable indicating failure/success with 0/1, and let p be the probability of y being 1, p = prob(y = 1). Let x1, …, xk be a set of predictor variables. Then the logistic regression of y on x1, …, xk estimates parameter values for β0, β1, …, βk via the maximum likelihood method for the following equation:

logit(p) = log(p/(1−p)) = β0 + β1*x1 + ... + βk*xk

In terms of probabilities, the equation above is translated into:

p = exp(β0 + β1*x1 + ... + βk*xk) / (1 + exp(β0 + β1*x1 + ... + βk*xk))
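A tiny sketch of these transformations and their inverse, using the .8 example from above (illustrative names):

```python
import math

def odds(p):
    # odds of success = p / (1 - p)
    return p / (1.0 - p)

def logit(p):
    # logit(p) = log(p / (1 - p)), maps (0, 1) to (-inf, +inf)
    return math.log(odds(p))

def inverse_logit(z):
    # p = exp(z) / (1 + exp(z)), maps log odds back to a probability
    return math.exp(z) / (1.0 + math.exp(z))

p = 0.8
print(odds(p))                  # 4.0: odds of success are 4 to 1
print(logit(p))                 # about 1.386
print(inverse_logit(logit(p)))  # 0.8 recovered
```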