
1
Lectures 8-9 – Linear Methods for Classification. Rice ELEC 697, Farinaz Koushanfar, Fall 2006

2
Summary Bayes classifiers; linear classifiers; linear regression of an indicator matrix; linear discriminant analysis (LDA); logistic regression; separating hyperplanes. Reading: Ch. 4, ESL.

3
Bayes Classifier The marginal distribution of G is specified by the PMF p_G(g), g = 1, 2, ..., K. f_{X|G}(x|G=g) is the conditional density of X given G = g. The training set (x_i, g_i), i = 1, ..., N, consists of independent samples from the joint distribution f_{X,G}(x, g) –f_{X,G}(x, g) = p_G(g) f_{X|G}(x|G=g) The loss of predicting G* when the truth is G is L(G*, G). Classification goal: minimize the expected loss –E_{X,G} L(G(X), G) = E_X (E_{G|X} L(G(X), G))

4
Bayes Classifier (cont’d) It suffices to minimize E_{G|X} L(G(X), G) for each X. The optimal classifier is: –G(x) = argmin_g E_{G|X=x} L(g, G) Under 0-1 loss, this Bayes classification rule is also known as the rule of maximum a posteriori probability –G(x) = argmax_g Pr(G=g|X=x) Many classification algorithms estimate Pr(G=g|X=x) and then apply the Bayes rule
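The MAP rule above can be sketched numerically. A minimal sketch, assuming two hypothetical 1-D Gaussian class-conditionals and illustrative priors (all numbers are made up for the example):

```python
import numpy as np

# Hypothetical priors p_G(g) for classes g = 0, 1 (illustrative values)
priors = np.array([0.6, 0.4])

def cond_density(x, g):
    # Assumed 1-D Gaussian class-conditionals with means -1 and +1, unit variance
    mean = [-1.0, 1.0][g]
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def bayes_classify(x):
    # G(x) = argmax_g Pr(G=g | X=x), proportional to p_G(g) * f(x | g)
    posteriors = [priors[g] * cond_density(x, g) for g in range(2)]
    return int(np.argmax(posteriors))
```

Points far to the left get class 0, points far to the right class 1; near the midpoint the larger prior tips the decision.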

5
More About Linear Classification Since the predictor G(x) takes values in a discrete set G, we can divide the input space into a collection of regions labeled according to the classification. For K classes (1, 2, ..., K), the fitted linear model for the k-th indicator response variable is f̂_k(x) = β̂_{k0} + β̂_k^T x. The decision boundary between classes k and l is the set {x : (β̂_{k0} − β̂_{l0}) + (β̂_k − β̂_l)^T x = 0}, an affine set or hyperplane. Model a discriminant function δ_k(x) for each class, then classify x to the class with the largest value of δ_k(x).

6
Linear Decision Boundary We require that some monotone transformation of δ_k or Pr(G=k|X=x) be linear in x. Decision boundaries are then the set of points with log-odds = 0. With probability p for class 1 and 1 − p for class 2, apply the logit transformation: log[p/(1 − p)] = β_0 + β^T x. Two popular methods that use log-odds –Linear discriminant analysis, linear logistic regression Alternatively, explicitly model the boundary between the two classes as linear. For a two-class problem in a p-dimensional input space, this models the decision boundary as a hyperplane. Two methods using separating hyperplanes –Perceptron (Rosenblatt), optimally separating hyperplanes (Vapnik)

7
Generalizing Linear Decision Boundaries Expand the variable set X_1, ..., X_p by including squares and cross products, adding up to p(p+1)/2 additional variables; linear functions in the augmented space map down to quadratic functions in the original space.
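The quadratic basis expansion can be sketched as follows; `quadratic_expand` is an illustrative helper name, not from the text:

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_expand(X):
    # Augment X1..Xp with all squares and cross products:
    # the p(p+1)/2 extra columns X_i * X_j for i <= j
    n, p = X.shape
    quads = [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(p), 2)]
    return np.hstack([X, np.column_stack(quads)])
```

For p = 2 this adds the 3 columns x1², x1·x2, x2², so a linear boundary fit in the 5-column space is a quadratic boundary in the original 2-D space.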

8
Linear Regression of an Indicator Matrix For K classes, define K indicators Y_k, k = 1, ..., K, with Y_k = 1 if G = k, else 0. These are collected in the indicator response matrix Y = (Y_1, ..., Y_K).

9
Linear Regression of an Indicator Matrix (Cont’d) For N training data, form the N × K indicator response matrix Y, a matrix of 0’s and 1’s, and fit B̂ = (X^T X)^{-1} X^T Y. A new observation x is classified as follows: –Compute the fitted output (a K-vector): f̂(x)^T = (1, x^T) B̂ –Identify the largest component and classify accordingly: Ĝ(x) = argmax_k f̂_k(x) But… how good is the fit? –One can verify that Σ_{k∈G} f̂_k(x) = 1 for any x –However, f̂_k(x) can be negative or larger than 1 We can extend linear regression to a basis expansion h(x). As the size of the training set increases, adaptively add more basis functions.
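The fit-then-argmax procedure above can be sketched on assumed toy two-class data (cluster locations and the random seed are illustrative):

```python
import numpy as np

# Toy data: two well-separated Gaussian clusters in 2-D (assumed for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
g = np.repeat([0, 1], 50)                      # class labels

K = 2
Y = np.eye(K)[g]                               # N x K indicator response matrix
Xa = np.hstack([np.ones((100, 1)), X])         # prepend intercept column
B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)     # B-hat via least squares

def classify(x_new):
    f = np.hstack([1.0, x_new]) @ B            # fitted K-vector f-hat(x)
    return int(np.argmax(f))                   # largest component wins
```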

10
Linear Regression - Drawback For K ≥ 3 classes, especially for large K, some classes can be masked by others because of the rigid nature of the regression model.

11
Linear Regression - Drawback For large K and small p, masking can naturally occur. E.g., vowel recognition data viewed in a 2D subspace: K = 11 classes, p = 10 dimensions.

12
Linear Regression and Projection * A linear regression function (here in 2D) projects each point x = [x_1 x_2]^T onto a line parallel to w_1. We can study how well the projected points {z_1, z_2, ..., z_n}, viewed as functions of w_1, are separated across the classes. * Slides courtesy of Tommi S. Jaakkola, MIT CSAIL


14
Projection and Classification By varying w_1 we get different levels of separation between the projected points.

15
Optimizing the Projection We would like to find the w_1 that maximizes the separation of the projected points across classes. We can quantify the separation (overlap) in terms of the means and variances of the resulting 1-D class distributions.

16
Fisher Linear Discriminant: Preliminaries Class descriptions in R^d –Class 0: n_0 samples, mean μ_0, covariance Σ_0 –Class 1: n_1 samples, mean μ_1, covariance Σ_1 Projected class descriptions in R –Class 0: n_0 samples, mean μ_0^T w_1, variance w_1^T Σ_0 w_1 –Class 1: n_1 samples, mean μ_1^T w_1, variance w_1^T Σ_1 w_1

17
Fisher Linear Discriminant Estimation criterion: find the w_1 that maximizes J(w_1) = (μ_1^T w_1 − μ_0^T w_1)^2 / (w_1^T Σ_0 w_1 + w_1^T Σ_1 w_1). The solution (class separation) is decision-theoretically optimal for two normal populations with equal covariances (Σ_1 = Σ_0).
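The maximizing direction has the closed form w_1 ∝ (Σ_0 + Σ_1)^{-1}(μ_1 − μ_0). A minimal sketch, assuming illustrative means and identity covariances (the criterion is scale-invariant in w):

```python
import numpy as np

# Assumed class parameters (illustrative numbers)
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
S0 = S1 = np.eye(2)

# Fisher direction: (Sigma0 + Sigma1)^{-1} (mu1 - mu0)
w1 = np.linalg.solve(S0 + S1, mu1 - mu0)

def separation(w):
    # J(w) = (w^T(mu1 - mu0))^2 / (w^T(Sigma0 + Sigma1)w)
    num = (w @ (mu1 - mu0)) ** 2
    return num / (w @ (S0 + S1) @ w)
```

Any other direction scores no higher than `w1`, and rescaling `w1` leaves J unchanged.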

18
Linear Discriminant Analysis (LDA) π_k: prior of class k, Pr(G=k). f_k(x): density of X in class G=k. Bayes theorem: Pr(G=k|X=x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l. Depending on the density model, this leads to LDA, QDA, MDA (mixture DA), kernel DA, or naive Bayes. Suppose we model each density as a multivariate Gaussian: f_k(x) = (2π)^{-p/2} |Σ_k|^{-1/2} exp(−(1/2)(x − μ_k)^T Σ_k^{-1} (x − μ_k)). LDA arises when we assume the classes share a common covariance matrix: Σ_k = Σ for all k. It then suffices to look at the log-odds.

19
LDA The log-odds function log[Pr(G=k|X=x)/Pr(G=l|X=x)] = log(π_k/π_l) − (1/2)(μ_k + μ_l)^T Σ^{-1} (μ_k − μ_l) + x^T Σ^{-1} (μ_k − μ_l) implies that the decision boundary between classes k and l, the set where Pr(G=k|X=x) = Pr(G=l|X=x), is linear in x; in p dimensions, a hyperplane. Example: three classes and p = 2.

20
LDA (Cont’d) An equivalent description of the decision rule uses the linear discriminant functions δ_k(x) = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k, with G(x) = argmax_k δ_k(x).

21
In practice, we do not know the parameters of the Gaussian distributions. Estimate with the training set: –π̂_k = N_k / N, where N_k is the number of class-k data points –μ̂_k = Σ_{g_i = k} x_i / N_k –Σ̂ = Σ_{k=1}^K Σ_{g_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (N − K) For two classes, this is like linear regression.
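The plug-in estimates and discriminant functions above can be sketched as follows, on assumed toy Gaussian data (means, seed, and sample sizes are illustrative):

```python
import numpy as np

# Toy two-class Gaussian data (assumed for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
g = np.repeat([0, 1], 40)
N, K = len(g), 2

pi_hat = np.bincount(g) / N                    # pi-hat_k = N_k / N
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])
# Pooled covariance: within-class scatter divided by N - K
Sigma_hat = sum((X[g == k] - mu_hat[k]).T @ (X[g == k] - mu_hat[k])
                for k in range(K)) / (N - K)
Si = np.linalg.inv(Sigma_hat)

def delta(x, k):
    # delta_k(x) = x^T Si mu_k - (1/2) mu_k^T Si mu_k + log pi_k
    return x @ Si @ mu_hat[k] - 0.5 * mu_hat[k] @ Si @ mu_hat[k] + np.log(pi_hat[k])

def lda_classify(x):
    return int(np.argmax([delta(x, k) for k in range(K)]))
```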

22
QDA If the Σ_k are not assumed equal, the quadratic terms in x remain; we get the quadratic discriminant functions (QDA): δ_k(x) = −(1/2) log|Σ_k| − (1/2)(x − μ_k)^T Σ_k^{-1} (x − μ_k) + log π_k.

23
QDA (Cont’d) The estimates are similar to LDA, except each class has a separate covariance matrix. For large p there is a dramatic increase in the number of parameters: LDA has (K−1)(p+1) parameters, while QDA has (K−1){p(p+3)/2 + 1}. LDA and QDA both work remarkably well in practice. This is not because the data are Gaussian; rather, for simple decision boundaries the Gaussian estimates are stable: a bias-variance trade-off.

24
Regularized Discriminant Analysis A compromise between LDA and QDA: shrink the separate covariances of QDA toward a common covariance (similar to ridge regression): Σ̂_k(α) = α Σ̂_k + (1 − α) Σ̂, with α ∈ [0, 1].
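The shrinkage is a one-line convex combination; a minimal sketch with illustrative matrices:

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha):
    # Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled
    # alpha = 1 recovers QDA, alpha = 0 recovers LDA
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_pooled

# Illustrative class-specific and pooled covariance estimates
Sigma_k = np.array([[2.0, 0.0], [0.0, 0.5]])
Sigma_pooled = np.eye(2)
```

In practice α is chosen by cross-validation.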

25
Example - RDA

26
Computations for LDA Suppose we compute the eigendecomposition of each Σ̂_k = U_k D_k U_k^T, where U_k is p × p orthonormal and D_k is a diagonal matrix of positive eigenvalues d_kl. Then –(x − μ̂_k)^T Σ̂_k^{-1} (x − μ̂_k) = [U_k^T (x − μ̂_k)]^T D_k^{-1} [U_k^T (x − μ̂_k)] –log|Σ̂_k| = Σ_l log d_kl The LDA classifier is implemented by sphering the data: X* ← D^{-1/2} U^T X, where Σ̂ = U D U^T. The common covariance estimate of X* is then the identity. Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities π_k.
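The sphering step can be sketched directly; the covariance matrix below is an illustrative stand-in for Σ̂:

```python
import numpy as np

# Illustrative common covariance estimate Sigma = U D U^T
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
d, U = np.linalg.eigh(Sigma)                   # eigenvalues d, eigenvectors U (columns)
W = np.diag(d ** -0.5) @ U.T                   # sphering matrix D^{-1/2} U^T

# In the transformed coordinates X* = W X, the covariance becomes the identity
Sigma_star = W @ Sigma @ W.T
```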

27
Background: Simple Decision Theory * Suppose we know the class-conditional densities p(X|y) for y=0,1 as well as the overall class frequencies P(y) How do we decide which class a new example x’ belongs to so as to minimize the overall probability of error? * Courtesy of Tommi S. Jaakkola, MIT CSAIL


29
2-Class Logistic Regression The optimal decisions are based on the posterior class probabilities P(y|x). For binary classification problems, we can write these decisions as: choose y = 1 if log[P(y=1|x)/P(y=0|x)] > 0, else y = 0. We generally don’t know P(y|x), but we can parameterize the possible decisions according to a linear log-odds model: log[P(y=1|x)/P(y=0|x)] = β_0 + β^T x.

30
2-Class Logistic Regression (Cont’d) Our log-odds model log[P(y=1|x)/P(y=0|x)] = β_0 + β^T x gives rise to a specific form for the conditional probability over the labels (the logistic model): P(y=1|x) = g(β_0 + β^T x), where g(z) = 1/(1 + e^{−z}) is a logistic squashing function that turns linear predictions into probabilities.
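The squashing function and the resulting posterior can be sketched in a few lines:

```python
import numpy as np

def logistic(z):
    # g(z) = 1 / (1 + e^{-z}); maps the real line into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def posterior(x, beta0, beta):
    # P(y=1 | x) = g(beta0 + beta^T x)
    return logistic(beta0 + beta @ x)
```

Note that g(0) = 1/2, so the decision boundary P(y=1|x) = 1/2 is exactly the hyperplane β_0 + β^T x = 0.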

31
2-Class Logistic Regression: Decisions Logistic regression models imply a linear decision boundary: P(y=1|x) = 1/2 exactly on the hyperplane β_0 + β^T x = 0.

32
K-Class Logistic Regression The model is specified in terms of K−1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one): log[Pr(G=k|X=x)/Pr(G=K|X=x)] = β_{k0} + β_k^T x, k = 1, ..., K−1. The choice of denominator class is arbitrary; typically the last class is used.

33
K-Class Logistic Regression (Cont’d) A simple calculation shows that Pr(G=k|X=x) = exp(β_{k0} + β_k^T x) / [1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)] for k = 1, ..., K−1, and Pr(G=K|X=x) = 1 / [1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)]. To emphasize the dependence on the entire parameter set θ = {β_{10}, β_1^T, ..., β_{(K−1)0}, β_{(K−1)}^T}, we denote the probabilities as Pr(G=k|X=x) = p_k(x; θ).
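The probabilities p_k(x; θ) implied by the K−1 logits can be sketched as follows, with class K as the reference class (the parameter array B is illustrative):

```python
import numpy as np

def k_class_probs(x, B):
    # B is a (K-1) x (p+1) array whose k-th row is [beta_k0, beta_k^T]
    logits = B @ np.hstack([1.0, x])           # beta_k0 + beta_k^T x, k = 1..K-1
    e = np.exp(np.append(logits, 0.0))         # the reference class K gets logit 0
    return e / e.sum()                         # p_k(x; theta); sums to one
```

With all coefficients zero the model returns the uniform distribution over the K classes.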

34
Fitting Logistic Regression Models Fit by maximum likelihood. For two classes coded y_i ∈ {0, 1}, the log-likelihood is ℓ(β) = Σ_{i=1}^N { y_i β^T x_i − log(1 + e^{β^T x_i}) } (with x_i including the constant 1). Setting its derivative to zero gives the score equations Σ_i x_i (y_i − p(x_i; β)) = 0, which are solved iteratively.

35
IRLS is equivalent to the Newton-Raphson procedure: β^{new} = β^{old} + (X^T W X)^{-1} X^T (y − p) = (X^T W X)^{-1} X^T W z, where W = diag{p_i(1 − p_i)} and z = X β^{old} + W^{-1}(y − p) is the adjusted (linearized) response.

36
Fitting Logistic Regression Models IRLS algorithm (equivalent to Newton-Raphson) –1. Initialize β. –2. Form the linearized response z = Xβ + W^{-1}(y − p). –3. Form the weights w_i = p_i(1 − p_i). –4. Update β by weighted least squares of z_i on x_i with weights w_i. –Steps 2-4 are repeated until convergence.
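The steps above can be sketched as a short IRLS loop, on assumed toy one-feature data (means, seed, and the fixed iteration count are illustrative; a real implementation would test for convergence):

```python
import numpy as np

# Toy non-separable two-class data (assumed for illustration)
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(-1, 1, (40, 1)), rng.normal(1, 1, (40, 1))])
X = np.hstack([np.ones((80, 1)), x])           # include intercept column
y = np.repeat([0.0, 1.0], 40)

beta = np.zeros(2)                             # step 1: initialize beta
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))        # current fitted probabilities
    w = np.clip(p * (1.0 - p), 1e-10, None)    # step 3: weights (clipped for safety)
    z = X @ beta + (y - p) / w                 # step 2: linearized response
    XtW = X.T * w                              # step 4: weighted least squares
    beta = np.linalg.solve(XtW @ X, XtW @ z)   #   beta = (X^T W X)^{-1} X^T W z
```

At convergence the score equations X^T(y − p) = 0 hold, matching the Newton-Raphson fixed point.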

37
Example – Logistic Regression South African Heart Disease: –Coronary risk factor study (CORIS) baseline survey, carried out in three rural areas. –White males between 15 and 64 years of age. –Response: presence or absence of myocardial infarction. –Maximum likelihood fit:

38
Example – Logistic Regression South African Heart Disease:

39
Logistic Regression or LDA? LDA: log[Pr(G=k|X=x)/Pr(G=K|X=x)] = α_{k0} + α_k^T x. This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix. Logistic model: log[Pr(G=k|X=x)/Pr(G=K|X=x)] = β_{k0} + β_k^T x. The two use the same form for the logit function; they differ in how the linear coefficients are estimated.

40
Logistic Regression or LDA? Discriminative vs. informative learning: logistic regression uses the conditional distribution of Y given X to estimate the parameters, while LDA uses the full joint distribution (assuming normality). If normality holds, LDA is up to 30% more efficient; otherwise, logistic regression can be more robust. In practice, the two methods tend to give similar results.

41
Separating Hyperplanes

42
Perceptrons compute a linear combination of the input features and return the sign: f(x) = sign(β_0 + β^T x). Properties of the hyperplane L = {x : β_0 + β^T x = 0}: –For x_1, x_2 in L, β^T (x_1 − x_2) = 0, so β* = β/||β|| is the unit normal to the surface L. –For x_0 in L, β^T x_0 = −β_0. –The signed distance of any point x to L is given by β*^T (x − x_0) = (β^T x + β_0)/||β||.

43
Rosenblatt's Perceptron Learning Algorithm Finds a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. If a response y_i = 1 is misclassified, then x_i^T β + β_0 < 0, and the opposite for a misclassified point with y_i = −1. The goal is to minimize D(β, β_0) = −Σ_{i∈M} y_i (x_i^T β + β_0), where M is the set of misclassified points.

44
Rosenblatt's Perceptron Learning Algorithm (Cont’d) Stochastic gradient descent: the misclassified observations are visited in some sequence and the parameters are updated as β ← β + ρ y_i x_i, β_0 ← β_0 + ρ y_i. ρ is the learning rate and can be taken to be 1 without loss of generality. It can be shown that, if the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.
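The update rule can be sketched on an assumed separable toy set, with labels y_i ∈ {−1, +1} and ρ = 1 (cluster locations, seed, and the epoch count are illustrative):

```python
import numpy as np

# Toy linearly separable data (assumed for illustration)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.repeat([-1.0, 1.0], 30)

beta, beta0 = np.zeros(2), 0.0
for _ in range(100):                           # passes over the data
    for xi, yi in zip(X, y):
        if yi * (xi @ beta + beta0) <= 0:      # misclassified (or on boundary)
            beta = beta + yi * xi              # beta <- beta + rho * y_i * x_i
            beta0 = beta0 + yi                 # beta0 <- beta0 + rho * y_i

errors = int(np.sum(y * (X @ beta + beta0) <= 0))
```

On separable data the loop stops changing once every point is on the correct side.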

45
Optimal Separating Hyperplanes Problem: among all separating hyperplanes, find the one that maximizes the margin M to the closest training points of either class: max_{β, β_0, ||β||=1} M subject to y_i (x_i^T β + β_0) ≥ M, i = 1, ..., N. Equivalently: min_{β, β_0} (1/2)||β||^2 subject to y_i (x_i^T β + β_0) ≥ 1, i = 1, ..., N.

46
Example - Optimal Separating Hyperplanes
