Lecture 8,9 – Linear Methods for Classification Rice ELEC 697 Farinaz Koushanfar Fall 2006

Summary
– Bayes classifiers
– Linear classifiers
– Linear regression of an indicator matrix
– Linear discriminant analysis (LDA)
– Logistic regression
– Separating hyperplanes
– Reading: Ch. 4 of The Elements of Statistical Learning (ESL)

Bayes Classifier
– The marginal distribution of G is specified by the PMF p_G(g), g = 1, 2, …, K
– f_{X|G}(x|G=g) is the conditional density of X given G = g
– The training set (x_i, g_i), i = 1, …, N, consists of independent samples from the joint distribution f_{X,G}(x, g) = p_G(g) f_{X|G}(x|G=g)
– The loss of predicting G* when the truth is G is L(G*, G)
– Classification goal: minimize the expected loss E_{X,G}[L(G(X), G)] = E_X[ E_{G|X} L(G(X), G) ]

Bayes Classifier (Cont'd)
– It suffices to minimize E_{G|X}[L(G(X), G)] pointwise for each X; the optimal classifier (the Bayes classification rule) is G(x) = argmin_g E_{G|X=x}[L(g, G)]
– Under 0-1 loss, the Bayes rule becomes the rule of maximum a posteriori probability: G(x) = argmax_g Pr(G=g|X=x)
– Many classification algorithms estimate Pr(G=g|X=x) and then apply the Bayes rule
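
The Bayes rule above reduces to an argmax over posterior scores. A minimal Python sketch, assuming two illustrative 1-D Gaussian class-conditional densities and priors (not data from the lecture):

```python
# Bayes rule under 0-1 loss: pick the class with the largest posterior,
# Pr(G=g|X=x) proportional to p_G(g) * f_{X|G}(x|G=g).
# The priors and 1-D Gaussian class-conditionals below are illustrative assumptions.
from scipy.stats import norm

priors = {1: 0.6, 2: 0.4}                      # p_G(g)
cond = {1: norm(loc=0.0, scale=1.0),           # f_{X|G}(x | G=1)
        2: norm(loc=2.0, scale=1.0)}           # f_{X|G}(x | G=2)

def bayes_classify(x):
    # Unnormalized posteriors; the normalizing constant does not affect the argmax.
    scores = {g: priors[g] * cond[g].pdf(x) for g in priors}
    return max(scores, key=scores.get)

print(bayes_classify(0.3))   # falls in the class-1 region
print(bayes_classify(1.8))   # falls in the class-2 region
```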

More About Linear Classification
– Since the predictor G(x) takes values in a discrete set G, we can divide the input space into a collection of regions labeled according to the classification
– For K classes (1, 2, …, K), the fitted linear model for the k-th indicator response variable is f_k(x) = β_{k0} + β_k^T x
– The decision boundary between classes k and l is the set {x : f_k(x) = f_l(x)}, i.e. {x : (β_{k0} - β_{l0}) + (β_k - β_l)^T x = 0}, an affine set or hyperplane
– More generally, model a discriminant function δ_k(x) for each class, then classify x to the class with the largest value of δ_k(x)

Linear Decision Boundary
– We require that some monotone transformation of δ_k or of Pr(G=k|X=x) be linear in x
– The decision boundaries are the set of points where the log-odds equal zero
– Two-class case: probability of class 1 is p, probability of class 2 is 1 - p; apply the logit transformation log[p/(1 - p)] = β_0 + β^T x
– Two popular methods that use log-odds: linear discriminant analysis and linear logistic regression
– Alternatively, explicitly model the boundary between the two classes as linear; for a two-class problem in a p-dimensional input space, this means modeling the decision boundary as a hyperplane
– Two methods that use separating hyperplanes: the perceptron (Rosenblatt) and optimally separating hyperplanes (Vapnik)

Generalizing Linear Decision Boundaries
– Expand the variable set X_1, …, X_p by including squares and cross products, adding up to p(p+1)/2 additional variables; linear functions in the augmented space map to quadratic functions in the original space
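
A small sketch of this basis expansion, assuming a generic numeric feature matrix; the helper name expand_quadratic and the column ordering are illustrative choices:

```python
# Quadratic basis expansion: augment (X_1, ..., X_p) with all squares and
# pairwise cross products, giving p(p+1)/2 additional variables.
import numpy as np
from itertools import combinations_with_replacement

def expand_quadratic(X):
    """X: (n, p) array -> (n, p + p(p+1)/2) array with degree-2 terms appended."""
    n, p = X.shape
    quad = [X[:, i] * X[:, j] for i, j in combinations_with_replacement(range(p), 2)]
    return np.column_stack([X] + quad)

X = np.random.randn(5, 3)             # p = 3 -> 3*(3+1)/2 = 6 extra columns
print(expand_quadratic(X).shape)       # (5, 9)
```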

Linear Regression of an Indicator Matrix
– For K classes, define K indicators Y_k, k = 1, …, K, with Y_k = 1 if G = k and Y_k = 0 otherwise
– Collect the indicators into the response vector Y = (Y_1, …, Y_K); the N training responses form the indicator response matrix

Linear Regression of an Indicator Matrix (Cont'd)
– For N training observations, form the N × K indicator response matrix Y, a matrix of 0's and 1's with a single 1 in each row, and fit a linear regression to each column simultaneously
– A new observation x is classified as follows:
  – Compute the fitted output f(x), a K-vector
  – Identify the largest component and classify accordingly: G(x) = argmax_k f_k(x)
– But… how good is the fit?
  – One can verify that Σ_k f_k(x) = 1 for any x
  – However, f_k(x) can be negative or larger than 1
– We can also apply linear regression to a basis expansion h(x); as the size of the training set increases, adaptively add more basis elements
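
A minimal sketch of the whole procedure on assumed synthetic two-class data: build the N × K indicator matrix, fit all K columns by least squares, and classify a new point by its largest fitted component. The data and names are illustrative:

```python
# Indicator-matrix regression on a small synthetic 2-class data set.
import numpy as np

rng = np.random.default_rng(0)
n_per, p, K = 50, 2, 2
X = np.vstack([rng.normal(0, 1, (n_per, p)),           # class 0
               rng.normal(2, 1, (n_per, p))])           # class 1
g = np.repeat([0, 1], n_per)                             # labels 0..K-1

Y = np.eye(K)[g]                                         # N x K indicator matrix
X1 = np.column_stack([np.ones(len(X)), X])               # add intercept column

# Least-squares fit of all K indicator columns at once.
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)

def classify(x_new):
    f = np.concatenate([[1.0], x_new]) @ B               # fitted K-vector
    return int(np.argmax(f))                              # largest component wins

print(classify(np.array([0.1, -0.2])))   # expected: 0
print(classify(np.array([2.2,  1.9])))   # expected: 1
```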

Linear Regression – Drawback
– For K ≥ 3 classes, and especially when K is large, some classes can be masked by others (their fitted values are never the largest) because of the rigid nature of the linear regression fit

Linear Regression – Drawback
– For large K and small p, masking naturally occurs
– Example: vowel recognition data viewed in a 2-D subspace, with K = 11 classes and p = 10 dimensions

Linear Regression and Projection*
– A linear regression function (here in 2-D) projects each point x = [x_1 x_2]^T onto a line parallel to w_1
– We can study how well the projected points {z_1, z_2, …, z_n}, viewed as functions of w_1, are separated across the classes
* Slides courtesy of Tommi S. Jaakkola, MIT CSAIL

Projection and Classification
– By varying w_1 we get different levels of separation between the projected points

Optimizing the Projection
– We would like to find the w_1 that maximizes the separation of the projected points across classes
– We can quantify the separation (overlap) in terms of the means and variances of the resulting 1-D class distributions

Fisher Linear Discriminant: Preliminaries
– Class description in ℝ^d:
  – Class 0: n_0 samples, mean μ_0, covariance Σ_0
  – Class 1: n_1 samples, mean μ_1, covariance Σ_1
– Projected class description in ℝ (along the direction w_1):
  – Class 0: n_0 samples, mean μ_0^T w_1, variance w_1^T Σ_0 w_1
  – Class 1: n_1 samples, mean μ_1^T w_1, variance w_1^T Σ_1 w_1

Fisher Linear Discriminant
– Estimation criterion: find the w_1 that maximizes J(w_1) = (μ_1^T w_1 - μ_0^T w_1)^2 / (w_1^T Σ_0 w_1 + w_1^T Σ_1 w_1)
– The solution, w_1 ∝ (Σ_0 + Σ_1)^{-1}(μ_1 - μ_0), gives a class separation that is decision-theoretically optimal for two normal populations with equal covariances (Σ_1 = Σ_0)
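
A small sketch of the Fisher criterion on assumed synthetic 2-D Gaussian classes; it uses the closed-form maximizer w_1 ∝ (Σ_0 + Σ_1)^{-1}(μ_1 - μ_0) and compares it against random directions:

```python
# Fisher linear discriminant direction on two synthetic Gaussian classes.
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=100)
X1 = rng.multivariate_normal([2, 1], [[1, 0.5], [0.5, 1]], size=100)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)

w1 = np.linalg.solve(S0 + S1, mu1 - mu0)     # Fisher direction (up to scale)
w1 /= np.linalg.norm(w1)

def fisher_J(w):
    # (difference of projected means)^2 / (sum of projected variances)
    return (w @ (mu1 - mu0)) ** 2 / (w @ (S0 + S1) @ w)

# J at the closed-form direction should dominate J at random directions.
print(fisher_J(w1),
      max(fisher_J(v / np.linalg.norm(v)) for v in rng.normal(size=(200, 2))))
```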

Linear Discriminant Analysis (LDA)
– π_k: class prior Pr(G=k); f_k(x): density of X within class G=k
– Bayes theorem: Pr(G=k|X=x) = π_k f_k(x) / Σ_{l=1}^K π_l f_l(x)
– Different models for f_k(x) lead to LDA, QDA, MDA (mixture DA), kernel DA, and naïve Bayes
– Suppose we model each class density as a multivariate Gaussian: f_k(x) = (2π)^{-p/2} |Σ_k|^{-1/2} exp(-(x-μ_k)^T Σ_k^{-1} (x-μ_k)/2)
– LDA arises when we assume the classes share a common covariance matrix, Σ_k = Σ for all k; it is then sufficient to look at the log-odds

LDA
– The log-odds log[Pr(G=k|X=x)/Pr(G=l|X=x)] = log(π_k/π_l) - (1/2)(μ_k+μ_l)^T Σ^{-1}(μ_k-μ_l) + x^T Σ^{-1}(μ_k-μ_l) is linear in x
– This implies that the decision boundary between classes k and l, the set where Pr(G=k|X=x) = Pr(G=l|X=x), is linear in x; in p dimensions, a hyperplane
– Example: three classes and p = 2

LDA (Cont'd)
– Equivalently, classify to the class with the largest linear discriminant function δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + log π_k

– In practice we do not know the parameters of the Gaussian distributions, so we estimate them from the training set:
  – prior: π_k ≈ N_k/N, where N_k is the number of class-k observations
  – mean: μ_k ≈ Σ_{g_i=k} x_i / N_k
  – pooled covariance: Σ ≈ Σ_{k=1}^K Σ_{g_i=k} (x_i - μ_k)(x_i - μ_k)^T / (N - K)
– For two classes, this procedure is closely related to linear regression on an indicator response
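
A compact sketch of LDA with these plug-in estimates on assumed synthetic three-class data, classifying by the largest linear discriminant δ_k(x); the data and names are illustrative:

```python
# LDA with plug-in estimates: priors N_k/N, class means, pooled covariance,
# and classification by the largest discriminant
# delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k.
import numpy as np

rng = np.random.default_rng(2)
means = [np.array([0.0, 0.0]), np.array([2.0, 0.5]), np.array([0.5, 2.5])]
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=60) for m in means])
g = np.repeat([0, 1, 2], 60)
N, p = X.shape
K = 3

pi_hat = np.array([(g == k).mean() for k in range(K)])
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])
Sigma_hat = sum((X[g == k] - mu_hat[k]).T @ (X[g == k] - mu_hat[k])
                for k in range(K)) / (N - K)            # pooled covariance
Sigma_inv = np.linalg.inv(Sigma_hat)

def lda_predict(x):
    deltas = [x @ Sigma_inv @ mu_hat[k]
              - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
              + np.log(pi_hat[k]) for k in range(K)]
    return int(np.argmax(deltas))

print(lda_predict(np.array([2.1, 0.4])))   # expected: 1
```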

QDA
– If the Σ_k are not assumed equal, the quadratic terms in x remain and we get quadratic discriminant functions (QDA): δ_k(x) = -(1/2) log|Σ_k| - (1/2)(x-μ_k)^T Σ_k^{-1}(x-μ_k) + log π_k

QDA (Cont'd)
– The estimates are similar to LDA, but each class has a separate covariance matrix
– For large p this is a dramatic increase in parameters: the LDA boundaries use (K-1)(p+1) parameters, while QDA uses (K-1){1 + p(p+3)/2}
– LDA and QDA both perform well on a wide range of problems; this is not because the data are Gaussian, but rather because the data can usually support only simple decision boundaries, for which the Gaussian estimates are stable (a bias-variance trade-off)
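
A quick worked check of these parameter counts, using the vowel-data sizes quoted earlier (K = 11, p = 10) as the illustration:

```python
# Parameter counts for the LDA and QDA boundary parameterizations above.
K, p = 11, 10
lda_params = (K - 1) * (p + 1)                  # (K-1)(p+1)
qda_params = (K - 1) * (1 + p * (p + 3) // 2)   # (K-1){1 + p(p+3)/2}
print(lda_params, qda_params)                    # 110 vs 660
```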

Regularized Discriminant Analysis
– A compromise between LDA and QDA: shrink the separate covariances of QDA towards a common covariance, similar in spirit to ridge regression
– Σ_k(α) = α Σ_k + (1 - α) Σ, where α ∈ [0, 1] gives a continuum of models between QDA (α = 1) and LDA (α = 0)
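
A minimal sketch of the RDA shrinkage, assuming the per-class and pooled covariance estimates are already available (the matrices below are illustrative); in practice α would be chosen with validation data:

```python
# RDA shrinkage: blend each class covariance with the pooled covariance.
import numpy as np

def rda_covariance(Sigma_k_hat, Sigma_hat, alpha):
    """Return alpha * Sigma_k + (1 - alpha) * Sigma_pooled."""
    return alpha * Sigma_k_hat + (1.0 - alpha) * Sigma_hat

Sigma_k = np.array([[2.0, 0.3], [0.3, 1.0]])   # per-class estimate (illustrative)
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])     # pooled estimate (illustrative)
print(rda_covariance(Sigma_k, Sigma, alpha=0.25))
```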

Example - RDA

Computations for LDA
– Suppose we compute the eigendecomposition of each Σ_k = U_k D_k U_k^T, where U_k is a p × p orthonormal matrix and D_k is a diagonal matrix of positive eigenvalues d_kl; then
  – (x - μ_k)^T Σ_k^{-1} (x - μ_k) = [U_k^T (x - μ_k)]^T D_k^{-1} [U_k^T (x - μ_k)]
  – log|Σ_k| = Σ_l log d_kl
– The LDA classifier can be implemented by sphering the data with respect to the common covariance Σ = U D U^T: X* ← D^{-1/2} U^T X, so that the common covariance estimate of X* is the identity
– Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities π_k
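
A small sketch of the sphering step and the nearest-centroid rule, under the assumption that the pooled covariance and class centroids have already been estimated (as in the earlier LDA sketch); the names are illustrative:

```python
# Sphering X* = D^{-1/2} U^T X via the eigendecomposition Sigma = U D U^T,
# then nearest-centroid classification corrected by the log class priors.
import numpy as np

def sphere(X, Sigma_hat):
    """Each row x_i is mapped to D^{-1/2} U^T x_i; the sphered covariance is the identity."""
    eigvals, U = np.linalg.eigh(Sigma_hat)
    W = U @ np.diag(eigvals ** -0.5)   # right-multiplying rows by W applies D^{-1/2} U^T
    return X @ W

def nearest_centroid_lda(x_star, centroids_star, priors):
    # Smallest (squared distance / 2 - log prior) wins: closest centroid modulo the priors.
    scores = [0.5 * np.sum((x_star - m) ** 2) - np.log(pi)
              for m, pi in zip(centroids_star, priors)]
    return int(np.argmin(scores))

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = np.random.default_rng(3).multivariate_normal([0.0, 0.0], Sigma, size=500)
print(np.cov(sphere(X, Sigma), rowvar=False).round(2))   # approximately the identity
```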

Background: Simple Decision Theory*
– Suppose we know the class-conditional densities p(x|y) for y = 0, 1, as well as the overall class frequencies P(y)
– How do we decide which class a new example x' belongs to so as to minimize the overall probability of error?
* Courtesy of Tommi S. Jaakkola, MIT CSAIL

2-Class Logistic Regression
– The optimal decisions are based on the posterior class probabilities P(y|x); for binary classification we can write the decision in terms of the log-odds: predict y = 1 when log[P(y=1|x)/P(y=0|x)] > 0
– We generally don't know P(y|x), but we can parameterize the possible decisions by modeling the log-odds as a linear function of x: log[P(y=1|x)/P(y=0|x)] = β_0 + β^T x

2-Class Logistic Regression (Cont'd)
– Our log-odds model gives rise to a specific form for the conditional probability over the labels (the logistic model): P(y=1|x) = σ(β_0 + β^T x)
– where σ(z) = 1/(1 + e^{-z}) is a logistic squashing function that turns linear predictions into probabilities
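
A tiny sketch of the logistic model above with illustrative (assumed) parameter values:

```python
# Logistic squashing function and the resulting label probability.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta = -1.0, np.array([2.0, -0.5])      # illustrative parameters
x = np.array([1.2, 0.3])

p1 = sigmoid(beta0 + beta @ x)                 # P(y = 1 | x)
print(p1, p1 > 0.5)                            # probability and the implied decision
```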

2-Class Logistic Regression: Decisions
– Logistic regression models imply a linear decision boundary: the set {x : β_0 + β^T x = 0}, where P(y=1|x) = 1/2

K-Class Logistic Regression
– The model is specified in terms of K-1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one): log[Pr(G=k|X=x)/Pr(G=K|X=x)] = β_{k0} + β_k^T x, for k = 1, …, K-1
– The choice of denominator class is arbitrary; typically the last class, K, is used

K-Class Logistic Regression (Cont'd)
– A simple calculation shows that Pr(G=k|X=x) = exp(β_{k0} + β_k^T x) / [1 + Σ_{l=1}^{K-1} exp(β_{l0} + β_l^T x)] for k = 1, …, K-1, and Pr(G=K|X=x) = 1 / [1 + Σ_{l=1}^{K-1} exp(β_{l0} + β_l^T x)]; these probabilities clearly sum to one
– To emphasize the dependence on the entire parameter set θ = {β_{10}, β_1^T, …, β_{(K-1)0}, β_{K-1}^T}, we denote the probabilities as Pr(G=k|X=x) = p_k(x; θ)
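
A short sketch of these K-class probabilities with class K as the reference; the parameter values are illustrative assumptions:

```python
# K-class logistic probabilities: K-1 linear scores are exponentiated and
# normalized together with a fixed score of 0 for the reference class K.
import numpy as np

def kclass_logistic_probs(x, B0, B):
    """B0: (K-1,) intercepts, B: (K-1, p) coefficients -> (K,) probabilities."""
    scores = np.concatenate([B0 + B @ x, [0.0]])     # reference class scores 0
    expd = np.exp(scores - scores.max())             # stabilized exponentials
    return expd / expd.sum()

B0 = np.array([0.2, -0.1])                           # K = 3, p = 2 example
B = np.array([[1.0, -0.5], [0.3, 0.8]])
p = kclass_logistic_probs(np.array([0.5, 1.0]), B0, B)
print(p, p.sum())                                    # probabilities sum to 1
```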

Fitting Logistic Regression Models
– Fit by maximum likelihood, using the conditional likelihood of G given X; in the two-class case the log-likelihood is ℓ(β) = Σ_{i=1}^N [ y_i log p(x_i; β) + (1 - y_i) log(1 - p(x_i; β)) ]
– The score equations are nonlinear in β and are solved by the Newton-Raphson procedure

IRLS (iteratively reweighted least squares) is equivalent to the Newton-Raphson procedure: each Newton step can be expressed as a weighted least-squares fit

Fitting Logistic Regression Models
– IRLS algorithm (equivalent to Newton-Raphson); a minimal sketch follows this list:
  1. Initialize β.
  2. Form the linearized response z_i = x_i^T β + (y_i - p_i)/(p_i(1 - p_i)), where p_i = p(x_i; β).
  3. Form the weights w_i = p_i(1 - p_i).
  4. Update β by weighted least squares of z_i on x_i with weights w_i.
  5. Repeat steps 2-4 until convergence.
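
A compact sketch of the IRLS loop for the two-class case on assumed synthetic data; the small clip on the weights is a numerical safeguard, not part of the algorithm statement:

```python
# IRLS for two-class logistic regression: each pass is a weighted least-squares
# fit with weights w_i = p_i(1 - p_i) and working response z.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + features
beta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(p + 1)                        # step 1: initialize beta
for _ in range(25):                           # repeat steps 2-4 until convergence
    prob = 1 / (1 + np.exp(-X @ beta))        # p_i = p(x_i; beta)
    w = np.clip(prob * (1 - prob), 1e-10, None)   # step 3 weights (clipped for safety)
    z = X @ beta + (y - prob) / w             # step 2: linearized response
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)   # step 4: weighted least squares
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        break
    beta = beta_new

print(beta.round(2))                          # should land near beta_true
```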

Example – Logistic Regression
– South African Heart Disease:
  – Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas
  – White males between 15 and 64 years of age
  – Response: presence or absence of myocardial infarction
  – Maximum likelihood fit:

Example – Logistic Regression South African Heart Disease:

Logistic Regression or LDA?
– LDA: the log-posterior-odds between classes k and K are linear in x, log[Pr(G=k|X=x)/Pr(G=K|X=x)] = α_{k0} + α_k^T x; this linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix
– Logistic model: log[Pr(G=k|X=x)/Pr(G=K|X=x)] = β_{k0} + β_k^T x, by construction
– They use the same form for the logit function; the difference lies in how the coefficients are estimated

Logistic Regression or LDA?
– Discriminative vs. informative (generative) learning: logistic regression uses the conditional distribution of Y given x to estimate the parameters, while LDA uses the full joint distribution (assuming normality)
– If normality holds, LDA is up to 30% more efficient; otherwise logistic regression can be more robust
– In practice, however, the two methods give very similar results

Separating Hyperplanes

– Perceptrons compute a linear combination of the input features and return the sign
– Properties of the hyperplane L defined by β_0 + β^T x = 0:
  – For x_1, x_2 in L, β^T(x_1 - x_2) = 0, so β* = β/||β|| is the unit vector normal to the surface L
  – For any x_0 in L, β^T x_0 = -β_0
  – The signed distance of any point x to L is given by β*^T(x - x_0) = (β^T x + β_0)/||β||
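
A tiny numerical check of the signed-distance formula for an illustrative hyperplane:

```python
# Signed distance to the hyperplane beta0 + beta^T x = 0 in 2-D.
import numpy as np

beta0, beta = -1.0, np.array([3.0, 4.0])       # ||beta|| = 5

def signed_distance(x):
    return (beta @ x + beta0) / np.linalg.norm(beta)

print(signed_distance(np.array([1.0, 0.5])))   # (3 + 2 - 1)/5 = 0.8
print(signed_distance(np.array([0.0, 0.0])))   # -1/5 = -0.2
```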

Rosenblatt's Perceptron Learning Algorithm
– Finds a separating hyperplane by minimizing the distance of misclassified points to the decision boundary
– If a response y_i = 1 is misclassified, then x_i^T β + β_0 < 0, and the opposite holds for a misclassified point with y_i = -1
– The goal is to minimize D(β, β_0) = -Σ_{i∈M} y_i (x_i^T β + β_0), where M indexes the misclassified points

Rosenblatt's Perceptron Learning Algorithm (Cont'd)
– Stochastic gradient descent: the misclassified observations are visited in some sequence and the parameters updated as (β, β_0) ← (β, β_0) + ρ (y_i x_i, y_i)
– ρ is the learning rate and can be taken to be 1 without loss of generality
– If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps
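
A minimal sketch of the perceptron updates with learning rate ρ = 1 on assumed linearly separable synthetic data:

```python
# Perceptron learning: visit misclassified points and apply
# (beta, beta0) <- (beta, beta0) + (y_i x_i, y_i), with labels y in {-1, +1}.
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.repeat([-1.0, 1.0], 50)

beta, beta0 = np.zeros(2), 0.0
for _ in range(100):                                  # passes over the data
    misclassified = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ beta + beta0) <= 0:             # point is on the wrong side
            beta += yi * xi                           # update beta
            beta0 += yi                               # update intercept
            misclassified += 1
    if misclassified == 0:                            # separating hyperplane found
        break

print(beta, beta0, misclassified)
```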

Optimal Separating Hyperplanes
– Problem: among all separating hyperplanes, find the one that maximizes the margin M to the closest training points of either class: max_{β, β_0, ||β||=1} M subject to y_i(x_i^T β + β_0) ≥ M, i = 1, …, N

Example - Optimal Separating Hyperplanes