Linear Methods for Classification

Linear Methods for Classification Lecture Notes for CMPUT 466/551 Nilanjan Ray

Linear Classification What is meant by linear classification? The decision boundaries in the feature (input) space are linear. Should the regions be contiguous? [Figure: piecewise linear decision boundaries partitioning a 2D input space (X1, X2) into regions R1–R4.]

Linear Classification… There is a discriminant function δk(x) for each class k. Classification rule: assign x to the class with the largest δk(x). In higher-dimensional spaces the decision boundaries are piecewise hyperplanar. Remember that the 0-1 loss function led to the rule of classifying to the class with the largest posterior Pr(G = k | X = x); so Pr(G = k | X = x) can serve as δk(x).

Linear Classification… All we require here is that the class boundaries {x : δk(x) = δj(x)} be linear for every (k, j) pair. One can achieve this if the δk(x) themselves are linear, or if some monotone transform of δk(x) is linear. An example: the two-class model Pr(G = 1 | X = x) = exp(β0 + βᵀx) / (1 + exp(β0 + βᵀx)), so that the log-odds log[Pr(G = 1 | X = x) / Pr(G = 2 | X = x)] = β0 + βᵀx is linear.

Linear Classification as a Linear Regression 2D input space: X = (X1, X2). Number of classes/categories K = 3, so the output is Y = (Y1, Y2, Y3). Training sample of size N = 5; each row of the indicator matrix Y has exactly one 1, indicating the category/class. Regression output: Ŷ(x) = [(1, x) B̂] with B̂ = (XᵀX)⁻¹XᵀY. Classification rule: assign x to the class k with the largest fitted value Ŷk(x).
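A minimal numpy sketch of this idea on a small synthetic dataset (the data and variable names are illustrative, not from the slides):

```python
import numpy as np

# Synthetic training data: N samples, p=2 features, K=3 classes (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(20, 2)) for m in ([0, 0], [3, 0], [0, 3])])
g = np.repeat([0, 1, 2], 20)                       # class labels

N, K = X.shape[0], 3
Y = np.zeros((N, K))
Y[np.arange(N), g] = 1                             # indicator response matrix

Xb = np.hstack([np.ones((N, 1)), X])               # add intercept column
B = np.linalg.lstsq(Xb, Y, rcond=None)[0]          # least-squares coefficients, shape (p+1, K)

def classify(x_new):
    """Predict the class with the largest fitted value."""
    scores = np.hstack([1.0, x_new]) @ B
    return int(np.argmax(scores))

print(classify(np.array([2.8, 0.2])))              # expected: class 1
```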

The Masking Problem Linear regression of the indicator matrix can lead to masking: with a 2D input space and three classes, one class can be completely dominated (masked) by the other two and never win the argmax. [Figure: three classes in a 2D input space, viewed along a projection direction; the middle class is masked.] LDA can avoid this masking.

Linear Discriminant Analysis Essentially the minimum-error Bayes classifier. Assumes that the class-conditional densities are (multivariate) Gaussian, with equal covariance for every class. Posterior probability by application of Bayes' rule: Pr(G = k | X = x) = πk fk(x) / Σl πl fl(x), where πk is the prior probability for class k and fk(x) is the class-conditional density (likelihood).

LDA… Classification rule: Ĝ(x) = argmaxk Pr(G = k | X = x), which is equivalent to Ĝ(x) = argmaxk δk(x) with the linear discriminant δk(x) = xᵀΣ⁻¹μk − ½ μkᵀΣ⁻¹μk + log πk. The good old Bayes classifier!

LDA… When are we going to use the training data? There are N input–output pairs in total, Nk pairs in class k, and K classes. The training data are used to estimate the prior probabilities π̂k = Nk / N, the class means μ̂k = (1/Nk) Σ_{gi = k} xi, and the pooled covariance matrix Σ̂ = (1/(N − K)) Σk Σ_{gi = k} (xi − μ̂k)(xi − μ̂k)ᵀ.
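A sketch of these estimates and the resulting linear discriminant scores (function and variable names are illustrative; the synthetic data are of the same kind as above):

```python
import numpy as np

def fit_lda(X, g, K):
    """Estimate priors, class means, and the pooled covariance for LDA."""
    N, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    # Pooled within-class covariance, divided by N - K as in the usual estimate
    Sigma = sum((X[g == k] - means[k]).T @ (X[g == k] - means[k]) for k in range(K)) / (N - K)
    return priors, means, Sigma

def lda_scores(x, priors, means, Sigma):
    """Linear discriminants delta_k(x) = x' S^{-1} mu_k - mu_k' S^{-1} mu_k / 2 + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
                     for m, pi in zip(means, priors)])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(30, 2)) for m in ([0, 0], [4, 0], [0, 4])])
g = np.repeat([0, 1, 2], 30)
priors, means, Sigma = fit_lda(X, g, K=3)
# Classification rule: argmax over k of delta_k(x)
print(np.argmax(lda_scores(np.array([3.5, 0.5]), priors, means, Sigma)))  # expected: 1
```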

LDA: Example LDA was able to avoid masking here

Quadratic Discriminant Analysis Relaxes the equal-covariance assumption: the class-conditional probability densities (still multivariate Gaussians) are allowed to have different covariance matrices. The class decision boundaries are then not linear but quadratic.
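A short sketch of the corresponding quadratic discriminant score with a per-class covariance (an illustrative helper, not taken from the slides):

```python
import numpy as np

def qda_score(x, mu_k, Sigma_k, pi_k):
    """Quadratic discriminant for one class: quadratic in x because Sigma_k differs by class."""
    Sinv = np.linalg.inv(Sigma_k)
    _, logdet = np.linalg.slogdet(Sigma_k)
    diff = x - mu_k
    return -0.5 * logdet - 0.5 * diff @ Sinv @ diff + np.log(pi_k)

# Classification: argmax over k of qda_score(x, mu_k, Sigma_k, pi_k),
# with mu_k, Sigma_k, pi_k estimated separately for each class from the training data.
```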

QDA and Masking QDA handles masking better than linear regression of the indicator matrix, but it is usually computationally more expensive than LDA.

Fisher’s Linear Discriminant [DHS] From the training set we want to find a direction along which the separation between the class means is large and the overlap between the classes is small.

Fisher’s LD… Projection of a vector x on a unit vector w: y = wᵀx. Geometric interpretation: from the training set we want to find a direction w along which the projected class means are well separated and the projections of the classes overlap little.

Fisher’s LD… Class means: mi = (1/Ni) Σ_{x ∈ class i} x. Projected class means: m̃i = wᵀmi. Difference between the projected class means: m̃2 − m̃1 = wᵀ(m2 − m1). Scatter of the projected data (this indicates the overlap between the classes): s̃i² = Σ_{x ∈ class i} (wᵀx − m̃i)².

Fisher’s LD… Ratio of the difference of projected means over the total scatter: r(w) = (m̃2 − m̃1)² / (s̃1² + s̃2²) = (wᵀSB w) / (wᵀSW w), a Rayleigh quotient, where SB = (m2 − m1)(m2 − m1)ᵀ is the between-class scatter and SW is the within-class scatter matrix. We want to maximize r(w); the solution is w ∝ SW⁻¹(m2 − m1).

Fisher’s LD: Classifier So far so good, but how do we get the classifier? All we know at this point is that the direction w = SW⁻¹(m2 − m1) separates the projected data very well. Since the projected class means are well separated, we can choose the average of the two projected means as the classification threshold. Classification rule: x ∈ R2 if y(x) > 0, else x ∈ R1, where y(x) = wᵀx − ½ (m̃1 + m̃2).
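A minimal two-class sketch of this construction (names and the synthetic data are illustrative; SW denotes the within-class scatter):

```python
import numpy as np

def fisher_ld(X1, X2):
    """Fisher direction w proportional to SW^{-1} (m2 - m1), plus a midpoint threshold."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)
    threshold = 0.5 * (w @ m1 + w @ m2)          # average of the projected class means
    return w, threshold

def classify(x, w, threshold):
    """Assign class 2 if the projection exceeds the threshold, else class 1."""
    return 2 if w @ x > threshold else 1

rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))
X2 = rng.normal([3, 3], 1.0, size=(50, 2))
w, t = fisher_ld(X1, X2)
print(classify(np.array([2.5, 2.5]), w, t))      # expected: 2
```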

Fisher’s LD and LDA They become the same when: the prior probabilities are equal, the class-conditional densities share a common covariance matrix, and both class-conditional densities are multivariate Gaussian. Exercise: show that Fisher’s LD classifier and LDA produce the same classification rule under the above assumptions. Note: (1) Fisher’s LD does not assume Gaussian densities; (2) Fisher’s LD can also be used for dimension reduction in the multi-class case.

Logistic Regression The output of the regression is the posterior probability, i.e., Pr(output | input). The model always ensures that the outputs are non-negative and sum to 1. It is a linear classification method. We need two concepts to understand logistic regression: the Newton-Raphson method and maximum likelihood estimation.

Newton-Raphson Method A technique for solving a non-linear equation f(x) = 0. Taylor series: f(xn+1) ≈ f(xn) + f′(xn)(xn+1 − xn). If xn+1 is a root or very close to the root, then f(xn+1) ≈ 0, and rearranging gives the iteration rule xn+1 = xn − f(xn)/f′(xn). An initial guess x0 is needed.
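A small sketch of this iteration, using an example equation of my own choosing (f(x) = x² − 2, which is not from the slides):

```python
def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) until |f(x)| is small."""
    x = x0
    for _ in range(max_iter):
        x = x - f(x) / fprime(x)
        if abs(f(x)) < tol:
            break
    return x

# Example: solve x^2 - 2 = 0, starting from the initial guess x0 = 1
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))  # ~1.41421356
```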

Newton-Raphson in Multiple Dimensions We want to solve a system of equations f1(x) = 0, …, fN(x) = 0. A first-order Taylor expansion and some rearrangement give the iteration rule x(n+1) = x(n) − J(x(n))⁻¹ f(x(n)), where J is the Jacobian matrix with entries ∂fi/∂xj (an initial guess is needed).

Newton-Raphson: Example Solve the given system of non-linear equations using the iteration rule above; an initial guess is needed.
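The slide's own system is not reproduced here, so the sketch below applies the multivariate rule to a hypothetical pair of equations (x² + y² = 4 and xy = 1), chosen purely for illustration:

```python
import numpy as np

def newton_raphson_nd(f, jacobian, x0, tol=1e-10, max_iter=50):
    """Multivariate Newton-Raphson: x_{n+1} = x_n - J(x_n)^{-1} f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x = x - np.linalg.solve(jacobian(x), f(x))
        if np.linalg.norm(f(x)) < tol:
            break
    return x

# Hypothetical example: solve x^2 + y^2 - 4 = 0 and x*y - 1 = 0
f = lambda v: np.array([v[0]**2 + v[1]**2 - 4, v[0]*v[1] - 1])
J = lambda v: np.array([[2*v[0], 2*v[1]], [v[1], v[0]]])
print(newton_raphson_nd(f, J, x0=[2.0, 0.5]))    # one root near (1.93, 0.52)
```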

Maximum Likelihood Parameter Estimation Let's start with an example. We want to find the unknown mean and standard deviation of a Gaussian pdf, given N independent samples x1, …, xN from it. Form the likelihood function L(μ, σ) = Πi (1/(σ√(2π))) exp(−(xi − μ)²/(2σ²)) and estimate the parameters that maximize it; the maximizers are μ̂ = (1/N) Σi xi and σ̂² = (1/N) Σi (xi − μ̂)².
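A quick numerical check of these closed-form ML estimates (the sample mean and the 1/N-scaled standard deviation) on simulated data; the true parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
true_mu, true_sigma = 5.0, 2.0
x = rng.normal(true_mu, true_sigma, size=10_000)    # N i.i.d. Gaussian samples

# Maximizing the Gaussian likelihood gives:
mu_hat = x.mean()                                   # sample mean
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())     # note the 1/N (not 1/(N-1)) factor

print(mu_hat, sigma_hat)                            # close to 5.0 and 2.0
```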

Logistic Regression Model The method directly models the posterior probabilities as the output of the regression: Pr(G = k | X = x) = exp(βk0 + βkᵀx) / (1 + Σl exp(βl0 + βlᵀx)) for k = 1, …, K−1, and Pr(G = K | X = x) = 1 / (1 + Σl exp(βl0 + βlᵀx)). Here x is a p-dimensional input vector and each βk is a p-dimensional vector, so the total number of parameters is (K−1)(p+1). Note that the class boundaries are linear. How can we show this linear nature? What is the discriminant function for every class in this model?

Logistic Regression Computation Let's fit the logistic regression model for K = 2, i.e., two classes. Training set: (xi, gi), i = 1, …, N. Log-likelihood: ℓ(β) = Σi [ yi βᵀxi − log(1 + exp(βᵀxi)) ], where the xi are (p+1)-dimensional input vectors with leading entry 1, β is a (p+1)-dimensional vector, and yi = 1 if gi = 1, yi = 0 if gi = 2. We want to maximize the log-likelihood in order to estimate β.

Newton-Raphson for LR Setting the gradient to zero, ∂ℓ/∂β = Σi xi (yi − p(xi; β)) = 0, gives (p+1) non-linear equations in (p+1) unknowns. Solve by the Newton-Raphson method: βnew = βold − (∂²ℓ/∂β∂βᵀ)⁻¹ ∂ℓ/∂β.

Newton-Raphson for LR… In matrix notation the NR rule becomes βnew = βold + (XᵀWX)⁻¹ Xᵀ(y − p), where W is an N-by-N diagonal matrix with ith diagonal entry p(xi; βold)(1 − p(xi; βold)).

Newton-Raphson for LR… Equivalently, βnew = (XᵀWX)⁻¹ XᵀW z with the adjusted response z = Xβold + W⁻¹(y − p). Each step is therefore a weighted least-squares fit: iteratively reweighted least squares (IRLS).
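A compact numpy sketch of these IRLS updates for the two-class model, on synthetic data (all names and the data are illustrative; the intercept is handled via a leading column of ones):

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Fit two-class logistic regression by iteratively reweighted least squares."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])           # leading 1 for the intercept
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))       # fitted probabilities
        W = p * (1 - p)                            # diagonal of the weight matrix
        z = Xb @ beta + (y - p) / W                # adjusted response
        # Weighted least-squares step: beta = (X' W X)^{-1} X' W z
        beta = np.linalg.solve(Xb.T @ (W[:, None] * Xb), Xb.T @ (W * z))
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
true_beta = np.array([-0.5, 2.0, -1.0])
probs = 1 / (1 + np.exp(-(np.hstack([np.ones((200, 1)), X]) @ true_beta)))
y = (rng.random(200) < probs).astype(float)
print(fit_logistic_irls(X, y))                     # roughly recovers true_beta
```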

Example: South African Heart Disease

Example: South African Heart Disease… After fitting the logistic regression model:

              Coefficient   Std. Error   Z Score
(Intercept)      -4.130       0.964      -4.285
sbp               0.006       0.006       1.023
tobacco           0.080       0.026       3.034
ldl               0.185       0.057       3.219
famhist           0.939       0.225       4.178
obesity          -0.035       0.029      -1.187
alcohol           0.001       0.004       0.136
age               0.043       0.010       4.184

Example: South African Heart Disease… After dropping the negligible coefficients and refitting: what happened to systolic blood pressure? To obesity? (Their effects are largely explained by the other, correlated risk factors, so they contribute little once the remaining predictors are in the model.)

Multi-Class Logistic Regression NR update: βnew = βold − (∂²ℓ/∂β∂βᵀ)⁻¹ ∂ℓ/∂β, where β now stacks all (K−1)(p+1) parameters.

Multi-Class LR… Here the stacked indicator response y is an N(K−1)-dimensional vector, 1(z) is an indicator function, and the vector of fitted probabilities p is also N(K−1)-dimensional.

MC-LR…

LDA vs. Logistic Regression

LDA (generative model):
- Assumes Gaussian class-conditional densities and a common covariance.
- Model parameters are estimated by maximizing the full log-likelihood; the parameters for each class are estimated independently of the other classes; Kp + p(p+1)/2 + (K−1) parameters.
- Makes use of the marginal density information Pr(X).
- Easier to train, low variance, more efficient if the model is correct.
- Higher asymptotic error, but converges faster.

Logistic regression (discriminative model):
- Assumes only that the class-conditional densities are members of the (same) exponential family distribution.
- Model parameters are estimated by maximizing the conditional log-likelihood, with simultaneous consideration of all other classes; (K−1)(p+1) parameters.
- Ignores the marginal density information Pr(X).
- Harder to train, robust to uncertainty about the data-generation process.
- Lower asymptotic error, but converges more slowly.

Generative vs. Discriminative Learning

                       Generative                                Discriminative
Example                Linear Discriminant Analysis              Logistic Regression
Objective function     Full log-likelihood                       Conditional log-likelihood
Model assumptions      Class densities (e.g., Gaussian in LDA)   Discriminant functions
Parameter estimation   "Easy" – one single sweep                 "Hard" – iterative optimization
Advantages             More efficient if the model is correct;   More flexible; robust because
                       borrows strength from p(x)                fewer assumptions are made
Disadvantages          Bias if the model is incorrect            May also be biased; ignores
                                                                 information in p(x)