Linear Classification Models: Generative. Prof. Navneet Goyal, CS & IS, BITS Pilani

Perceptron

Approaches to Classification
- Probabilistic
  - Inference stage: use training data to learn a model for p(Ck|x)
  - Decision stage: use the posterior probabilities p(Ck|x) to make optimal class assignments
- Discriminant function
  - Solve both problems together and simply learn a function that maps x directly to a class
- Probabilistic approaches split further into:
  - Generative: first solve the inference problem of determining the class-conditional densities p(x|Ck) for each class Ck, then determine the posterior probabilities
  - Discriminative: model the posterior probabilities p(Ck|x) directly

Probabilistic Models
- A probabilistic view of classification
- Models with linear decision boundaries arise from simple assumptions about the distribution of the data
- Two approaches:
  - Generative models (2 steps)
  - Discriminative models (1 step)
- In both cases, decision theory is used to assign a new x to a class

Probabilistic Generative Models
- A 2-step process:
  - Model the class-conditional densities p(x|Ck) and the class priors p(Ck)
  - Use them to compute the class posterior probabilities p(Ck|x) via Bayes' theorem
- 2-class case:
  - The posterior probability for class C1 is p(C1|x) = σ(a), the logistic sigmoid function σ(a) = 1 / (1 + exp(-a)), with a = ln [ p(x|C1) p(C1) / (p(x|C2) p(C2)) ]
  - The sigmoid is S-shaped and is also called the squashing function because it maps the whole real line into a finite interval (it maps a ∈ (-∞, +∞) to the interval (0, 1))
  - It plays an important role in many classification algorithms
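A minimal numerical sketch (with made-up 1-D Gaussian class-conditionals and priors, none of which come from the slides) showing that the Bayes-rule posterior and σ(a), with a equal to the log-odds above, are the same quantity:

```python
# Sketch: 2-class posterior p(C1|x) computed two ways, with assumed densities.
import numpy as np
from scipy.stats import norm

prior = {1: 0.5, 2: 0.5}                  # class priors p(Ck) (assumed values)
cc = {1: norm(loc=1.0, scale=1.0),        # class-conditional p(x|C1) (assumed)
      2: norm(loc=-1.0, scale=1.0)}       # class-conditional p(x|C2) (assumed)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = 0.3
# Direct Bayes' theorem
num = cc[1].pdf(x) * prior[1]
den = num + cc[2].pdf(x) * prior[2]
posterior_bayes = num / den

# Same quantity written as sigma(a), where a is the log-odds
a = np.log((cc[1].pdf(x) * prior[1]) / (cc[2].pdf(x) * prior[2]))
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)   # both ~0.646, identical
```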

Probabilistic Generative Models
- Symmetry property: σ(-a) = 1 - σ(a)
- The inverse of the logistic sigmoid is a = ln( σ / (1 - σ) ), called the logit or log-odds function because it represents the log of the ratio of the probabilities for the 2 classes, ln[ p(C1|x) / p(C2|x) ]

Probabilistic Generative Models
- The posterior probability has been written in an equivalent form using σ: p(C1|x) = σ(a(x))
- What is the significance of doing so?
- We shall see this shortly, when a(x) takes a simple functional form
- If a(x) is a linear function of x, then the posterior probability is governed by a generalized linear model
- Generalized linear model?

Probabilistic Generative Models: Generalized Linear Model
- In linear regression models, the prediction y(x, w) was given by a linear function of the parameters w
- In the simplest case, the model is also linear in the input variables x, i.e. y(x) = wᵀx + w0, so that y is a real number
- In classification, we wish to predict discrete class labels, or more generally posterior probabilities, which lie in (0, 1)
- To achieve this, we transform the linear function of w using a non-linear function f(·), so that y(x) = f(wᵀx + w0)
- In machine learning, f(·) is known as an activation function, whereas its inverse is called a link function in the statistics literature

Probabilistic Generative Models: Generalized Linear Model
- The decision surface corresponds to y(x) = constant, so that wᵀx + w0 = constant
- Decision surfaces are therefore linear functions of x, even if the function f(·) is non-linear
- Generalized linear models are no longer linear in the parameters, due to the non-linear function f(·)
- They are more complex in terms of analytical and computational properties
- They are still simpler than the more general non-linear models

Probabilistic Generative Models: K > 2 Classes (Softmax Function)
- For K > 2 classes, the posterior probabilities are given by the normalized exponential (softmax): p(Ck|x) = exp(ak) / Σj exp(aj), where ak = ln[ p(x|Ck) p(Ck) ]
- The softmax is a smooth, multi-class generalization of the logistic sigmoid

Probabilistic Generative Models: Softmax Function
The soft maximum g(x, y) = log( exp(x) + exp(y) ) approximates the hard maximum max(x, y) and is a convex function, just like the hard maximum. But the soft maximum is smooth: it has no sudden changes in direction and can be differentiated as many times as you like. These properties make it easy for convex optimization algorithms to work with the soft maximum; in fact, the function may well have been invented for optimization.

Probabilistic Generative Models: Softmax Function
- The accuracy of the soft-maximum approximation depends on scale: multiplying x and y by a large constant brings the soft maximum closer to the hard maximum. For example, g(1, 2) ≈ 2.31, but g(10, 20) ≈ 20.00005.
- The "hardness" of the soft maximum can be controlled by generalizing it to depend on a parameter k: g(x, y; k) = log( exp(kx) + exp(ky) ) / k
- The soft maximum can be made as close to the hard maximum as desired by making k large enough
- For every value of k the soft maximum is differentiable, but the size of the derivatives increases as k increases; in the limit, the derivative becomes infinite as the soft maximum converges to the hard maximum
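A short sketch (assuming the log-sum-exp definition of the soft maximum used above) that reproduces the quoted values and shows the effect of the scale parameter k:

```python
# Soft maximum g(x, y; k) = log(exp(kx) + exp(ky)) / k, hardening as k grows.
import numpy as np

def soft_max(x, y, k=1.0):
    # log-sum-exp written in a numerically stable way
    m = max(k * x, k * y)
    return (m + np.log(np.exp(k * x - m) + np.exp(k * y - m))) / k

print(round(soft_max(1, 2), 2))        # 2.31
print(round(soft_max(10, 20), 5))      # 20.00005
for k in (1, 2, 10, 100):
    print(k, soft_max(1, 2, k))        # approaches the hard maximum 2 as k grows
```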

Probabilistic Generative Models
- Forms of the class-conditional densities:
  - Continuous inputs (x follows a Gaussian distribution)
  - Discrete inputs (for example, binary feature values xi ∈ {0, 1})

Probabilistic Generative Models
- Continuous inputs (x follows a Gaussian distribution); two cases:
  - All classes share the same covariance matrix: the quadratic terms in x cancel and the decision boundaries are linear in x
  - The classes do not share the same covariance matrix: the quadratic terms remain and the decision boundaries are quadratic in x
- A small numerical sketch of the shared-covariance case follows below.
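A minimal sketch, with made-up means, covariance, and priors, illustrating that a shared covariance matrix makes the posterior p(C1|x) = σ(wᵀx + w0) linear in x; the expressions for w and w0 follow the standard shared-covariance result:

```python
# Sketch: Gaussian class-conditionals with a shared covariance give a posterior
# that is a sigmoid of a linear function of x. All parameters are assumed values.
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])   # class means (assumed)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])               # shared covariance (assumed)
p1, p2 = 0.4, 0.6                                         # class priors (assumed)

# Linear coefficients implied by the shared covariance
Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -0.2])
post_linear = sigmoid(w @ x + w0)

# Direct Bayes' rule with the full Gaussian densities
n1 = multivariate_normal(mu1, Sigma).pdf(x) * p1
n2 = multivariate_normal(mu2, Sigma).pdf(x) * p2
post_bayes = n1 / (n1 + n2)

print(post_linear, post_bayes)   # identical up to rounding
```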

Probabilistic Discriminative Models
- Logistic regression
  - 2-class
  - Multi-class
- Parameter estimation using:
  - Maximum likelihood
  - Iterative Reweighted Least Squares (IRLS)
- Probit regression

Probabilistic Discriminative Models: Logistic Regression
- Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous
- Consider a binary response variable: a variable with two outcomes, one represented by 1 and the other by 0
- Examples:
  - Does the person have a disease? Yes or no
  - Who is the person voting for? McCain or Obama
  - Outcome of a baseball game? Win or loss

Probabilistic Discriminative Models: Logistic Regression Example Data Set
- Response variable: admission to grad school (admit)
  - 0 if admitted, 1 if not admitted
- Predictor variables:
  - GRE score (gre): continuous
  - University prestige (topnotch): 1 if prestigious, 0 otherwise
  - Grade point average (gpa): continuous

Probabilistic Discriminative Models
- First 10 observations of the data set, with columns ADMIT, GRE, TOPNOTCH, GPA (table not reproduced here)

Logistic Regression
- Consider the linear probability model: π(Xi) = β0 + β1·Xi, where π(Xi) = P(Yi = 1 | Xi)
- Issue: π(Xi) can take on values less than 0 or greater than 1
- Issue: the predicted probability for some subjects falls outside the [0, 1] range

Logistic Regression
- Consider the logistic regression model: logit(π(Xi)) = log[ π(Xi) / (1 - π(Xi)) ] = β0 + β1·Xi
- This is a GLM with a binomial random component and logit link, g(μ) = logit(μ)
- The range of values for π(Xi) is 0 to 1

Logistic Regression
- Consider the logistic regression model π(gpa) = exp(β0 + β1·gpa) / (1 + exp(β0 + β1·gpa)) and the linear probability model π(gpa) = β0 + β1·gpa
- Plotting the predicted probabilities for different grade point averages shows that the logistic curve stays within [0, 1] while the linear model need not; a small sketch of the same comparison follows below
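A small sketch of the comparison; the coefficients below are hypothetical placeholders, not values fitted to the admissions data:

```python
# Compare predicted probabilities from a logistic model and a linear probability
# model across GPA values (coefficients are made up for illustration).
import numpy as np

b0_logit, b1_logit = -4.0, 1.5     # hypothetical logistic coefficients
b0_lin, b1_lin = -1.2, 0.5         # hypothetical linear-probability coefficients

gpa = np.linspace(2.0, 4.0, 5)
p_logistic = np.exp(b0_logit + b1_logit * gpa) / (1 + np.exp(b0_logit + b1_logit * gpa))
p_linear = b0_lin + b1_lin * gpa

for g, pl, pp in zip(gpa, p_logistic, p_linear):
    print(f"gpa={g:.1f}  logistic={pl:.3f}  linear={pp:.3f}")
# The logistic predictions stay inside (0, 1); the linear ones can fall below 0.
```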

What is Logistic Regression?
- In a nutshell: a statistical method used to model dichotomous or binary outcomes (though not limited to them) using predictor variables
- Used when the research question focuses on whether or not an event occurred, rather than when it occurred (time-course information is not used)

What is Logistic Regression?
- What is the "logistic" component? Instead of modeling the outcome Y directly, the method models the log odds of Y, log[ P(Y=1) / (1 - P(Y=1)) ], using the logistic function

Logistic Regression
- Simple logistic regression: logistic regression with 1 predictor variable
- Multiple logistic regression: logistic regression with multiple predictor variables
- "Multiple", "multivariable", and "multivariate" logistic regression are used interchangeably here

Logistic Regression
logit(P(Y=1)) = log[ P(Y=1) / (1 - P(Y=1)) ] = β0 + β1X1 + … + βkXk, where Y is the dichotomous outcome, X1, …, Xk are the predictor variables, and the left-hand side is the log(odds) of the outcome.

Logistic Regression
In logit(P(Y=1)) = β0 + β1X1 + … + βkXk, β0 is the intercept and β1, …, βk are the model coefficients; the left-hand side is the log(odds) of the outcome.

Odds & Probability
odds = p / (1 - p), and conversely p = odds / (1 + odds)
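A tiny sketch of the two conversions (the helper names are just for illustration); the printed values match the odds examples given later in the deck:

```python
# Convert between probability and odds in both directions.
def odds_from_prob(p):
    return p / (1 - p)

def prob_from_odds(odds):
    return odds / (1 + odds)

for odds in (1.0, 2.0, 0.5):
    print(f"odds={odds}  probability={prob_from_odds(odds):.2f}")
# odds=1 -> 0.50, odds=2 -> 0.67, odds=0.5 -> 0.33
print(odds_from_prob(0.5))   # back to odds = 1.0
```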

Maximum Likelihood  Flipped a fair coin 10 times: T, H, H, T, T, H, H, T, H, H  What is the Pr(Heads) given the data? 1/100? 1/5? 1/2? 6/10?

Maximum Likelihood
- T, H, H, T, T, H, H, T, H, H
- What is Pr(Heads) given the data?
- The most reasonable data-based estimate would be 6/10.
- In fact, p̂ = (number of heads) / (number of flips) is the ML estimator of p.

Maximum Likelihood: Example (discrete distribution, finite parameter space)
- How biased is an unfair coin? Call the probability of tossing a HEAD p; we want to determine p.
- Toss the coin 80 times; the outcome is 49 HEADS and 31 TAILS.
- Suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p = 1/3, one with p = 1/2, and another with p = 2/3. There are no labels on the coins.
- Using maximum likelihood estimation, the coin that has the largest likelihood can be found, given the data that were observed.
- Using the probability mass function of the binomial distribution with sample size 80 and number of successes 49, but different values of p (the "probability of success"), the likelihood takes one of three values: approximately 0.000 for p = 1/3, 0.012 for p = 1/2, and 0.054 for p = 2/3.

Maximum Likelihood: Example (discrete distribution, finite parameter space)
The likelihood is maximized when p = 2/3, and so this is the maximum likelihood estimate of p.

Maximum Likelihood: Example (discrete distribution, continuous parameter space)
- Now suppose there was only one coin, but its p could have been any value 0 ≤ p ≤ 1.
- The likelihood function to be maximized is L(p) = C(80, 49) p^49 (1 - p)^31, and the maximization is over all possible values 0 ≤ p ≤ 1.
- Differentiating with respect to p and setting the derivative to zero gives the solutions p = 0, p = 1, and p = 49/80.
- The solution that maximizes the likelihood is clearly p = 49/80; thus the maximum likelihood estimator for p is 49/80. A small numerical check follows below.
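A small numerical check of this example, assuming only the binomial likelihood stated above:

```python
# Coin-tossing MLE: with 49 heads in 80 tosses, the binomial likelihood is
# largest at p = 49/80, and among the three candidate coins p = 2/3 wins.
import numpy as np
from scipy.stats import binom

n, heads = 80, 49

for p in (1/3, 1/2, 2/3):
    print(f"p={p:.3f}  likelihood={binom.pmf(heads, n, p):.3f}")

# Grid search over 0 <= p <= 1 for the continuous-parameter case
grid = np.linspace(0, 1, 2001)
best_p = grid[np.argmax(binom.pmf(heads, n, grid))]
print("grid-search MLE:", best_p, " analytic MLE:", heads / n)   # both 0.6125
```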

Maximum Likelihood: Example (continuous distribution, continuous parameter space)
- Do it for the Gaussian distribution yourself! Two parameters, μ and σ.
- The ML estimator of the mean is the sample mean, μ̂ = (1/n) Σ xi. Its expectation equals the parameter μ of the given distribution, which means that the maximum-likelihood estimator μ̂ is unbiased.
- The ML estimator of the variance is σ̂² = (1/n) Σ (xi - μ̂)², whose expectation is ((n - 1)/n) σ². This means that the estimator is biased; however, it is consistent.
- In this case the likelihood could be maximized individually for each parameter; in general, that may not be the case.
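A short simulation (with arbitrarily chosen μ, σ, and sample size) illustrating the bias of the ML variance estimator:

```python
# Averaging sigma_hat^2 = (1/n) * sum((x - x_bar)^2) over many samples comes out
# near ((n-1)/n) * sigma^2 rather than sigma^2, showing the ML estimator's bias.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(trials, n))
var_ml = x.var(axis=1, ddof=0)          # ML estimator (divide by n)
print(var_ml.mean())                    # ~3.6
print((n - 1) / n * sigma**2)           # 3.6, the theoretical expectation
```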

Maximum Likelihood
- The method of maximum likelihood estimation chooses values for the parameter estimates (regression coefficients) which make the observed data "maximally likely."
- Standard errors are obtained as a by-product of the maximization process.

The Logistic Regression Model
logit(P(Y=1)) = β0 + β1X1 + … + βkXk, where β0 is the intercept, β1, …, βk are the model coefficients, and the left-hand side is the log(odds) of the outcome.

Maximum Likelihood  We want to choose β ’s that maximizes the probability of observing the data we have: Assumption: independent y’s

Linear Probability Model
- An obvious possibility is to use the traditional linear regression model
- But this has problems:
  - The distribution of the dependent variable is hardly normal
  - Probabilities cannot be less than 0 or greater than 1, yet the model's predicted values can fall outside this range

Linear probability model predictions

Logistic Regression Model
- Instead, use the logistic transformation (logit) of the probability, the log of the odds: logit(p) = log[ p / (1 - p) ] = β0 + β1x

Logistic regression model predictions

Estimation of the Logistic Regression Model
- Least squares is no longer the best way of estimating the parameters of a logistic regression model
- Instead, use maximum likelihood estimation
- It finds the values of the parameters under which the observed data have the greatest probability

Space Shuttle Data
- Data on 24 space shuttle launches prior to Challenger
- Dependent variable: whether the shuttle flight experienced a thermal-distress incident
- Independent variables:
  - Date: whether changes to the shuttle, or its age, have an effect
  - Temperature: whether the joint temperature on the booster has an effect

First Model: Date as the Single Independent Variable
- Dependent variable: any thermal distress on launch
- Independent variable: date (days since 1/1/60)
- SPSS procedure: Regression, Binary Logistic

Predicted probability of thermal distress using date

Exponential of B as change in odds: exp(B) gives the multiplicative change in the odds of the outcome for a one-unit increase in the predictor.

What Does "Odds" Mean?
- Odds is the ratio of the probability of success to the probability of failure, like odds on horse races
- Even odds (odds = 1) implies the probability equals 0.5
- Odds = 2 means 2 to 1 in favor of success, implying a probability of 2/3 ≈ 0.67
- Odds = 0.5 means 1 to 2 in favor of (or 2 to 1 against) success, implying a probability of 1/3 ≈ 0.33

Multiple Logistic Regression
- Logistic regression can be extended to use multiple independent variables, exactly like linear regression

Adding Joint Temperature to the Logistic Regression Model
- Dependent variable: any thermal distress on launch
- Independent variables:
  - Date (days since 1/1/60)
  - Joint temperature, degrees F

Probabilistic Discriminative Models
- Find the posterior class probabilities directly
- Use the functional form of a generalized linear model and determine its parameters directly by the maximum likelihood principle, e.g. via Iterative Reweighted Least Squares (IRLS); a minimal IRLS sketch appears below
- Maximize a likelihood function defined through the conditional distribution p(Ck|x), which represents a form of discriminative training
- Advantages of the discriminative approach: fewer adaptive parameters to determine (linear in M), and improved predictive performance, particularly when the class-conditional density assumptions give a poor approximation of the true distributions
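A minimal IRLS sketch for 2-class logistic regression on a tiny made-up data set; the update follows the standard weighted-least-squares form w ← (ΦᵀRΦ)⁻¹ ΦᵀRz and is an illustration, not the lecture's own code:

```python
# IRLS (Newton-Raphson) for 2-class logistic regression on assumed data.
import numpy as np

Phi = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0],
                [1.0, 1.0], [1.0, 2.6]])      # design matrix with bias column
t = np.array([0, 0, 1, 1, 1, 0])              # made-up targets (not separable)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Phi.shape[1])
for _ in range(10):                           # IRLS typically converges in a few steps
    y = sigmoid(Phi @ w)
    r = y * (1 - y)                           # diagonal of the weighting matrix R
    z = Phi @ w - (y - t) / r                 # working response
    w = np.linalg.solve(Phi.T @ (Phi * r[:, None]), Phi.T @ (r * z))
print("IRLS weights:", w)
```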

Probabilistic Discriminative Models
- So far, the classification methods work directly with the original input vector x
- All such algorithms are still applicable if we first make a fixed non-linear transformation of the inputs using a vector of basis functions ϕ(x)
- The decision boundaries are then linear in the feature space ϕ
- As in linear models for regression, one of the basis functions is typically set to a constant, say ϕ0(x) = 1, so that the corresponding parameter w0 plays the role of a bias
- A fixed basis-function transformation ϕ(x) will be assumed in what follows

Probabilistic Discriminative Models
- Figure: original input space (x1, x2) vs. feature space (φ1, φ2)
- Although we use linear classification models, linear separability in feature space does not imply linear separability in the original input space
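A small sketch with synthetic data showing how a fixed basis-function transformation ϕ(x) = (x1², x2²), chosen here purely for illustration, turns a circular class boundary in input space into a linear one in feature space:

```python
# Points inside vs. outside the unit circle are not linearly separable in (x1, x2),
# but become linearly separable in the feature space (phi1, phi2) = (x1^2, x2^2).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
labels = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)   # class 1 inside circle

Phi = X ** 2                                               # feature space (phi1, phi2)
# In feature space the classes are separated by the line phi1 + phi2 = 1
pred = (Phi[:, 0] + Phi[:, 1] < 1.0).astype(int)
print("separable in feature space:", np.all(pred == labels))   # True
```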