
Logistic Regression
Prof. Navneet Goyal, CS & IS, BITS Pilani

Perceptron

Logistic Regression
In linear regression, the dependent variable is continuous. What if the dependent variable is dichotomous, i.e. binary?
Will a person vote for Reagan (1) or Carter (0)?
Will a woman give birth to a low-weight baby (1) or not (0)?
Does the person have a disease? Yes (1) or No (0)
Outcome of a baseball game? Win (1) or loss (0)
A linear regression model will not be able to solve this with acceptable error. Moreover, predicted values above 1 and below 0 do not make any sense, and we are more interested in probabilities (or odds ratios) than in a 0/1 output. What can be done?

Logistic Regression
Actuary example: model P(death | X), the probability that a person X will die within the next 10 years.
X = {x1 = age, x2 = sex (M/F), x3 = cholesterol level}
∑ wi xi = wᵀX is not a probability!
Introduce the logistic function: σ(a) = 1 / (1 + e^(−a))
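To make the mapping concrete, here is a minimal Python sketch of the logistic function applied to a linear score wᵀx. The feature values and weights below are made up purely for illustration; they are not fitted coefficients.

```python
import numpy as np

def sigmoid(a):
    """Logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical actuarial features and weights (illustrative values only):
# x = [1 (bias), age, sex (0=F, 1=M), cholesterol level]
x = np.array([1.0, 55.0, 1.0, 240.0])
w = np.array([-9.0, 0.08, 0.5, 0.01])   # made-up coefficients

score = w @ x           # w^T x: an unbounded real number, not a probability
prob = sigmoid(score)   # squashed into (0, 1), interpretable as P(death | x)
print(f"linear score = {score:.2f}, P(death | x) = {prob:.3f}")
```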

Outline
Logistic Regression: 2-class and multi-class
Parameters using Maximum Likelihood
Iterative Reweighted Least Squares (IRLS)
Probit Regression

The Logistic Function
The sigmoid function is S-shaped and is also called the squashing function because it maps the whole real line into a finite interval: it maps a ∈ (−∞, +∞) to the interval (0, 1).
It plays an important role in many classification algorithms.
Symmetry property: σ(−a) = 1 − σ(a)
The inverse of the logistic sigmoid function is a = ln(σ / (1 − σ)), called the logit or log-odds function, because it represents the log of the ratio of the probabilities of the two classes.
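A quick numerical check of these two properties (symmetry, and the logit as the inverse of the sigmoid), assuming nothing beyond NumPy:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    """Inverse of the sigmoid: the log-odds of p."""
    return np.log(p / (1.0 - p))

a = 1.3
p = sigmoid(a)
print(np.isclose(sigmoid(-a), 1.0 - p))  # symmetry: sigma(-a) = 1 - sigma(a)
print(np.isclose(logit(p), a))           # the logit undoes the sigmoid
```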

Probabilistic Discriminative Models: Logistic Regression Example
Data set
Response variable: admission to grad school (ADMIT): 0 if admitted, 1 if not admitted
Predictor variables:
GRE score (gre): continuous
University prestige (topnotch): 1 if prestigious, 0 otherwise
Grade point average (gpa)

Probabilistic Discriminative Models
First 10 observations of the data set:

ADMIT  GRE  TOPNOTCH  GPA
1      380  0         3.61
0      660  1         3.67
0      800  1         4.00
0      640  0         3.19
1      520  0         2.93
0      760  0         3.00
0      560  0         2.98
1      400  0         3.08
0      540  0         3.39
1      700  1         3.92
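As an illustration only, a logistic regression can be fitted to these 10 rows with scikit-learn. With so few observations the coefficients are not meaningful, and the large C merely approximates an unregularized maximum likelihood fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The 10 observations from the slide: columns are GRE, TOPNOTCH, GPA.
X = np.array([
    [380, 0, 3.61], [660, 1, 3.67], [800, 1, 4.00], [640, 0, 3.19],
    [520, 0, 2.93], [760, 0, 3.00], [560, 0, 2.98], [400, 0, 3.08],
    [540, 0, 3.39], [700, 1, 3.92],
])
y = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 1])  # ADMIT column

model = LogisticRegression(C=1e6, max_iter=10_000)  # large C ~ no regularization
model.fit(X, y)
print("intercept:", model.intercept_)
print("coefficients (GRE, TOPNOTCH, GPA):", model.coef_)
print("P(ADMIT = 1):", model.predict_proba(X)[:, 1])
```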

Dot-plot: Data from Table 2

Logistic regression (2)
Table 3: Prevalence (%) of signs of CD according to age group

Dot-plot: Data from Table 3 (y-axis: diseased %; x-axis: age in years)

Logistic Regression
Consider the linear probability model: π(Xi) = β0 + β1 Xi
Issue: π(Xi) can take on values less than 0 or greater than 1, so the predicted probability for some subjects falls outside the [0, 1] range.

Logistic Regression
Consider the logistic regression model: logit(π(Xi)) = β0 + β1 Xi
This is a GLM with a binomial random component and logit link g(μ) = logit(μ).
The range of values for π(Xi) is 0 to 1.

Logistic Regression
Consider the logistic regression model π(Xi) = exp(β0 + β1 Xi) / (1 + exp(β0 + β1 Xi)) and the linear probability model π(Xi) = β0 + β1 Xi.
[Figure: predicted probabilities for different grade point averages under the two models]

What is Logistic Regression? In a nutshell: a statistical method used to model dichotomous (binary) outcomes, though not limited to them, using predictor variables. It is used when the research question focuses on whether or not an event occurred, rather than when it occurred (time-course information is not used).

What is Logistic Regression? What is the “Logistic” component? Instead of modeling the outcome, Y, directly, the method models the log odds(Y) using the logistic function.

Logistic Regression
Simple logistic regression: logistic regression with one predictor variable.
Multiple logistic regression: logistic regression with multiple predictor variables (also called multivariable or multivariate logistic regression).

Logistic Regression
log(odds) = ln(π / (1 − π)) = β0 + β1x1 + β2x2 + … + βkxk
where the left-hand side is the log(odds) of the dichotomous outcome and x1, …, xk are the predictor variables.

Logistic Regression
ln(π / (1 − π)) = β0 + β1x1 + … + βkxk
β0 is the intercept, β1, …, βk are the model coefficients, and the left-hand side is the log(odds) of the outcome.

Odds & Probability
odds = π / (1 − π)   and   π = odds / (1 + odds)
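A minimal sketch of the two conversions; both functions follow directly from the definitions above:

```python
def prob_to_odds(p):
    """odds = p / (1 - p)"""
    return p / (1.0 - p)

def odds_to_prob(odds):
    """p = odds / (1 + odds)"""
    return odds / (1.0 + odds)

print(prob_to_odds(0.5))   # 1.0   (even odds)
print(odds_to_prob(2.0))   # ~0.667 (2 to 1 in favor)
print(odds_to_prob(0.5))   # ~0.333 (2 to 1 against)
```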

Maximum Likelihood
Flipped a fair coin 10 times: T, H, H, T, T, H, H, T, H, H
What is Pr(Heads) given the data? 1/100? 1/5? 1/2? 6/10?

Maximum Likelihood
T, H, H, T, T, H, H, T, H, H
What is Pr(Heads) given the data? The most reasonable data-based estimate would be 6/10. In fact, p̂ = (number of heads) / (number of tosses) = 6/10 is the maximum likelihood estimator of p.

Maximum Likelihood: Example
Discrete distribution, finite parameter space
How biased is an unfair coin? Call the probability of tossing a HEAD p; we want to determine p.
Toss the coin 80 times; the outcome is 49 HEADS and 31 TAILS.
Suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p = 1/3, one with p = 1/2, and another with p = 2/3. There are no labels on these coins.
Using maximum likelihood estimation, the coin that has the largest likelihood can be found, given the data that were observed. Using the probability mass function of the binomial distribution with sample size 80 and number of successes 49, but different values of p (the "probability of success"), the likelihood function takes one of three values:
L(p = 1/3) = C(80, 49) (1/3)^49 (2/3)^31 ≈ 0.000
L(p = 1/2) = C(80, 49) (1/2)^49 (1/2)^31 ≈ 0.012
L(p = 2/3) = C(80, 49) (2/3)^49 (1/3)^31 ≈ 0.054

Maximum Likelihood: Example
Discrete distribution, finite parameter space
The likelihood is maximized when p = 2/3, and so this is the maximum likelihood estimate for p.
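The three likelihood values can be reproduced with SciPy's binomial pmf; the numbers printed should match the approximate values quoted above:

```python
from scipy.stats import binom

n, k = 80, 49  # 80 tosses, 49 heads
for p in (1/3, 1/2, 2/3):
    print(f"p = {p:.3f}: L(p) = P(49 heads | p) = {binom.pmf(k, n, p):.4f}")
```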

Maximum Likelihood: Example
Discrete distribution, continuous parameter space
Now suppose that there was only one coin but its p could have been any value 0 ≤ p ≤ 1. The likelihood function to be maximized is
L(p) = C(80, 49) p^49 (1 − p)^31,
and the maximization is over all possible values 0 ≤ p ≤ 1.
Differentiating with respect to p and setting the derivative to zero gives
d/dp [p^49 (1 − p)^31] = p^48 (1 − p)^30 (49(1 − p) − 31p) = 0,
with solutions p = 0, p = 1, and p = 49/80. The solution which maximizes the likelihood is clearly p = 49/80, since p = 0 and p = 1 give likelihood zero. Thus the maximum likelihood estimator for p is 49/80.
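A numerical check, assuming SciPy is available: minimizing the negative log-likelihood over 0 < p < 1 recovers the closed-form answer 49/80 = 0.6125:

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, k = 80, 49
# Negative log-likelihood of p for 49 heads in 80 tosses.
nll = lambda p: -binom.logpmf(k, n, p)
res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)  # both ~0.6125: numeric optimum matches 49/80
```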

Maximum Likelihood: Example
Continuous distribution, continuous parameter space
Do it for the Gaussian distribution yourself! There are two parameters, μ and σ².
The ML estimator of the mean, μ̂ = (1/n) ∑ xi, has expectation equal to the parameter μ of the given distribution, which means that the maximum likelihood estimator μ̂ is unbiased.
The ML estimator of the variance, σ̂² = (1/n) ∑ (xi − μ̂)², has expectation ((n − 1)/n) σ², which means that this estimator is biased. However, σ̂² is consistent.
In this case the likelihood equations could be solved individually for each parameter; in general, this may not be the case.
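A small simulation illustrating the bias, assuming NumPy: averaging the ML variance estimator over many samples of size n = 10 gives roughly ((n − 1)/n)σ² rather than σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 5.0, 4.0, 10, 100_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1)
var_hat = samples.var(axis=1, ddof=0)  # ML estimator: divide by n

print(mu_hat.mean())         # ~5.0: the mean estimator is unbiased
print(var_hat.mean())        # ~3.6: the variance estimator is biased low
print((n - 1) / n * sigma2)  # 3.6, the predicted expectation (n-1)/n * sigma^2
```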

Maximum Likelihood
The method of maximum likelihood estimation chooses values for the parameter estimates (regression coefficients) which make the observed data "maximally likely."
Standard errors are obtained as a by-product of the maximization process.

The Logistic Regression Model
ln(π / (1 − π)) = β0 + β1x1 + … + βkxk
β0 is the intercept, β1, …, βk are the model coefficients, and the left-hand side is the log(odds) of the outcome.

Maximum Likelihood
We want to choose β's that maximize the probability of observing the data we have:
L(β) = ∏i πi^yi (1 − πi)^(1 − yi), where πi = P(yi = 1 | xi).
Assumption: independent yi's.
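A sketch of the corresponding log-likelihood in Python (a hypothetical helper, written with logaddexp for numerical stability; X is assumed to include a column of 1s for the intercept):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood  sum_i [ y_i log(pi_i) + (1 - y_i) log(1 - pi_i) ],
    where pi_i = sigmoid(x_i . beta)."""
    z = X @ beta
    # log(sigmoid(z)) = -logaddexp(0, -z);  log(1 - sigmoid(z)) = -logaddexp(0, z)
    return np.sum(-y * np.logaddexp(0, -z) - (1 - y) * np.logaddexp(0, z))
```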

Linear probability model
An obvious possibility is to use the traditional linear regression model, but this has problems:
The distribution of the dependent variable is hardly normal.
Probabilities cannot be less than 0 or greater than 1, yet the linear predictions can be.

Linear probability model predictions

Logistic regression model
Instead, use the logistic transformation (logit) of the probability, the log of the odds:
logit(π) = ln(π / (1 − π)) = β0 + β1x

Logistic regression model predictions

Estimation of the logistic regression model
Least squares is no longer the best way of estimating the parameters of the logistic regression model. Instead, use maximum likelihood estimation, which finds the values of the parameters under which the observed data are most probable.

Space shuttle data
Data on 24 space shuttle launches prior to Challenger.
Dependent variable: whether the shuttle flight experienced a thermal distress incident.
Independent variables:
Date: whether shuttle changes or age have an effect
Temperature: whether joint temperature on the booster has an effect

First model: date as the single independent variable
Dependent variable: any thermal distress on launch
Independent variable: date (days since 1/1/60)
SPSS procedure: Regression, Binary Logistic

Predicted probability of thermal distress using date

Exponential of B as change in odds
exp(B) gives the multiplicative change in the odds of the outcome for a one-unit increase in the predictor.
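A one-line check of this interpretation with made-up coefficients b0 and b1 (hypothetical values, not fitted): raising the predictor by one unit multiplies the odds by exactly exp(B):

```python
import numpy as np

b0, b1 = -3.0, 0.05                     # hypothetical intercept and slope
odds = lambda x: np.exp(b0 + b1 * x)    # odds = exp(log-odds)

# Increasing the predictor by one unit multiplies the odds by exp(B):
print(odds(41.0) / odds(40.0))  # ~1.0513
print(np.exp(b1))               # the same value, exp(0.05)
```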

What does "odds" mean?
Odds is the ratio of the probability of success to the probability of failure, like odds on horse races.
Even odds (odds = 1) implies a probability of 0.5.
Odds = 2 means 2 to 1 in favor of success, implying a probability of 0.667.
Odds = 0.5 means 1 to 2 in favor of (or 2 to 1 against) success, implying a probability of 0.333.

Multiple logistic regression
Logistic regression can be extended to use multiple independent variables, exactly like linear regression.

Adding joint temperature to the logistic regression model
Dependent variable: any thermal distress on launch
Independent variables: date (days since 1/1/60); joint temperature, degrees F

Probabilistic Discriminative Models
Find posterior class probabilities directly.
Use the functional form of a generalized linear model and determine its parameters directly by the maximum likelihood principle, via Iterative Reweighted Least Squares (IRLS).
Maximize a likelihood function defined through the conditional distribution p(Ck|x), which represents a form of discriminative training.
Advantages of the discriminative approach: fewer adaptive parameters to be determined (linear in M), and improved predictive performance, particularly when the class-conditional density assumptions give a poor approximation of the true distributions.
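A minimal IRLS sketch for the 2-class case, following the standard update w_new = (ΦᵀRΦ)⁻¹ΦᵀRz with R = diag(yi(1 − yi)) and working response z = Φw − R⁻¹(y − t). This is an unregularized Newton iteration and can diverge on linearly separable data:

```python
import numpy as np

def irls(Phi, t, n_iters=20):
    """Two-class logistic regression via IRLS (Newton-Raphson):
    w_new = (Phi^T R Phi)^{-1} Phi^T R z,
    with R = diag(y_i (1 - y_i)) and z = Phi w - R^{-1} (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))             # current predicted probabilities
        R = y * (1.0 - y)                               # diagonal of the weighting matrix
        z = Phi @ w - (y - t) / np.maximum(R, 1e-10)    # working response, guarded against R ~ 0
        # Weighted least-squares step, without forming R as a full matrix
        w = np.linalg.solve(Phi.T @ (R[:, None] * Phi), Phi.T @ (R * z))
    return w

# Usage: Phi should include a column of 1s for the bias term, e.g.
# w = irls(np.column_stack([np.ones(len(X)), X]), t)
```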

Probabilistic Discriminative Models
Classification methods can work directly with the original input vector. All such algorithms are still applicable if we first make a fixed nonlinear transformation of the inputs using a vector of basis functions ϕ(x); the decision boundaries are then linear in the feature space ϕ.
In linear models of regression, one of the basis functions is typically set to a constant, say ϕ0(x) = 1, so that the corresponding parameter w0 plays the role of a bias.
A fixed basis function transformation ϕ(x) will be used in what follows.

Probabilistic Discriminative Models
[Figure: original input space (x1, x2) mapped to feature space (ϕ1, ϕ2)]
Although we use linear classification models, linear separability in feature space does not imply linear separability in the original input space.