LECTURE 10: DISCRIMINANT ANALYSIS

Slides:



Advertisements
Similar presentations
Component Analysis (Review)
Advertisements

Face Recognition Ying Wu Electrical and Computer Engineering Northwestern University, Evanston, IL
Pattern Classification Chapter 2 (Part 2)0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O.
Object Orie’d Data Analysis, Last Time Finished NCI 60 Data Started detailed look at PCA Reviewed linear algebra Today: More linear algebra Multivariate.
Principal Component Analysis CMPUT 466/551 Nilanjan Ray.
Principal Component Analysis
Chapter 5: Linear Discriminant Functions
L15:Microarray analysis (Classification) The Biological Problem Two conditions that need to be differentiated, (Have different treatments). EX: ALL (Acute.
L15:Microarray analysis (Classification). The Biological Problem Two conditions that need to be differentiated, (Have different treatments). EX: ALL (Acute.
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Ch. 10: Linear Discriminant Analysis (LDA) based on slides from
Techniques for studying correlation and covariance structure
SVD(Singular Value Decomposition) and Its Applications
1 Linear Methods for Classification Lecture Notes for CMPUT 466/551 Nilanjan Ray.
Summarized by Soo-Jin Kim
Probability of Error Feature vectors typically have dimensions greater than 50. Classification accuracy depends upon the dimensionality and the amount.
Principles of Pattern Recognition
ECE 8443 – Pattern Recognition LECTURE 03: GAUSSIAN CLASSIFIERS Objectives: Normal Distributions Whitening Transformations Linear Discriminants Resources.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 03: GAUSSIAN CLASSIFIERS Objectives: Whitening.
Ch 4. Linear Models for Classification (1/2) Pattern Recognition and Machine Learning, C. M. Bishop, Summarized and revised by Hee-Woong Lim.
ECE 8443 – Pattern Recognition LECTURE 10: HETEROSCEDASTIC LINEAR DISCRIMINANT ANALYSIS AND INDEPENDENT COMPONENT ANALYSIS Objectives: Generalization of.
ECE 8443 – Pattern Recognition LECTURE 08: DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS Objectives: Data Considerations Computational Complexity Overfitting.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
ECE 471/571 – Lecture 6 Dimensionality Reduction – Fisher’s Linear Discriminant 09/08/15.
Discriminant Analysis
Elements of Pattern Recognition CNS/EE Lecture 5 M. Weber P. Perona.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 12: Advanced Discriminant Analysis Objectives:
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 04: GAUSSIAN CLASSIFIERS Objectives: Whitening.
Feature Extraction 主講人:虞台文. Content Principal Component Analysis (PCA) PCA Calculation — for Fewer-Sample Case Factor Analysis Fisher’s Linear Discriminant.
Linear Classifiers Dept. Computer Science & Engineering, Shanghai Jiao Tong University.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 10: PRINCIPAL COMPONENTS ANALYSIS Objectives:
Feature Extraction 主講人:虞台文.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 09: Discriminant Analysis Objectives: Principal.
Objectives: Normal Random Variables Support Regions Whitening Transformations Resources: DHS – Chap. 2 (Part 2) K.F. – Intro to PR X. Z. – PR Course S.B.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Part 3: Estimation of Parameters. Estimation of Parameters Most of the time, we have random samples but not the densities given. If the parametric form.
Lecture 2. Bayesian Decision Theory
Principal Component Analysis (PCA)
Chapter 3: Maximum-Likelihood Parameter Estimation
Object Orie’d Data Analysis, Last Time
LECTURE 11: Advanced Discriminant Analysis
LECTURE 04: DECISION SURFACES
Background on Classification
LECTURE 09: BAYESIAN ESTIMATION (Cont.)
LECTURE 03: DECISION SURFACES
CH 5: Multivariate Methods
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)
Classification Discriminant Analysis
Classification Discriminant Analysis
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Computational Intelligence: Methods and Applications
Principal Component Analysis
Introduction PCA (Principal Component Analysis) Characteristics:
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Feature space tansformation methods
Generally Discriminant Analysis
Symmetric Matrices and Quadratic Forms
Principal Components What matters most?.
LECTURE 09: DISCRIMINANT ANALYSIS
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Principal Component Analysis
LECTURE 11: Exam No. 1 Review
Test #1 Thursday September 20th
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)
Symmetric Matrices and Quadratic Forms
Hairong Qi, Gonzalez Family Professor
Presentation transcript:

LECTURE 10: DISCRIMINANT ANALYSIS • Objectives: Mean-Square Error Objective Function Projection Operator Whitening Transformation Principal Components Analysis Fisher Linear Discriminant Analysis Multiple Discriminant Analysis Examples Resources: D.H.S.: Chapter 3 (Part 2) JL: Layman’s Explanation QT: PCA Introduction JP: Speech Processing KF: Fukunaga, Pattern Recognition

Component Analysis Component analysis is a technique that combines features to reduce the dimension of the feature space. It can also be the basis for a classification algorithm. Features of this approach include: Statistical decorrelation of the data. Linear combinations are simple to compute and tractable. Project a high dimensional space onto a lower dimensional space. Gain insight into the relative importance of each feature. Three classical approaches for finding the optimal transformation: Principal Components Analysis (PCA): projection that best represents the data in a least-square sense. Multiple Discriminant Analysis (MDA): projection that best separates the data in a least-squares sense. Independent Component Analysis (IDA): projection that minimizes the mutual information of the components.

Principal Component Analysis Consider representing a set of n d-dimensional samples x1,…,xn by a single vector, x0. Define a squared-error criterion: The solution to this problem is given by: The sample mean is a zero-dimensional representation of the data set. Consider a one-dimensional solution: In other words, the best one-dimensional projection of the data (in the least mean-squared error sense) is the projection of the data onto a line through the sample mean in the direction of the eigenvector of the scatter matrix having the largest eigenvalue (hence the name Principal Component). For the Gaussian case, the eigenvectors are the principal axes of the hyperellipsoidally shaped support region!

Alternate View: Coordinate Transformations Why is it convenient to convert an arbitrary distribution into a spherical one? (Hint: Euclidean distance) Consider the transformation: Aw= Φ Λ-1/2 where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ and Λ is a diagonal matrix of eigenvalues (Σ= Φ Λ Φt). Note that Φ is unitary. What is the covariance of y=Awx? E[yyt] = (Awx)(Awx)t =(Φ Λ-1/2x) (Φ Λ-1/2x)t = Φ Λ-1/2x xt Λ-1/2 Φt = Φ Λ-1/2 Σ Λ-1/2 Φt = Φ Λ-1/2 Φ Λ Φt Λ-1/2 Φt = (Φ Φt) (Λ-1/2 Λ Λ-1/2 )(ΦΦt) = I Why is this significant?

Examples of Maximum Likelihood Classification For the Gaussian case, the eigenvectors are the principal axes of the hyperellipsoidally shaped support region! Let’s work some examples (class-independent and class-dependent PCA). In class-independent PCA, a global covariance matrix is computed: In this case, the covariance will be an ellipsoid. The first principal component will be the major axis of this ellipsoid. The second principal component will be the minor axis. In class-dependent PCA, which requires labeled training data, transformations are computed independently for each class. The ML decision surface is a line (or a parabola in this case).

Discriminant Analysis Discriminant analysis seeks directions that are efficient for discrimination. Consider the problem of projecting data from d dimensions onto a line with the hope that we can optimize the orientation of the line to minimize error. Consider a set of n d-dimensional samples x1,…,xn in the subset D1 labeled 1 and n2 in the subset D2 labeled 2. Define a linear combination of x: and a corresponding set of n samples y1, …, yn divided into Y1 and Y2. Our challenge is to find w that maximizes separation. This can be done by considering the ratio of the between-class scatter to the within-class scatter.

Separation of the Means and Scatter Define a sample mean for class i: The sample mean for the projected points are: The sample mean for the projected points is just the projection of the mean (which is expected since this is a linear transformation). It follows that the distance between the projected means is: Define a scatter for the projected samples: An estimate of the variance of the pooled data is: and is called the within-class scatter.

Fisher Linear Discriminant and Scatter The Fisher linear discriminant maximizes the criteria: Define a scatter for class I, Si : The total scatter, Sw, is: We can write the scatter for the projected samples as: Therefore, the sum of the scatters can be written as:

Separation of the Projected Means The separation of the projected means obeys: Where the between class scatter, SB, is given by: Sw is the within-class scatter and is proportional to the covariance of the pooled data. SB , the between-class scatter, is symmetric and positive definite, but because it is the outer product of two vectors, its rank is at most one. This implies that for any w, SBw is in the direction of m1-m2. The criterion function, J(w), can be written as:

Linear Discriminant Analysis This ratio is well-known as the generalized Rayleigh quotient and has the well-known property that the vector, w, that maximizes J(), must satisfy: The solution is: This is Fisher’s linear discriminant, also known as the canonical variate. This solution maps the d-dimensional problem to a one-dimensional problem (in this case). From Chapter 2, when the conditional densities, p(x|i), are multivariate normal with equal covariances, the optimal decision boundary is given by: where , and w0 is related to the prior probabilities. The computational complexity is dominated by the calculation of the within- class scatter and its inverse, an O(d2n) calculation. But this is done offline! Let’s work some examples (class-independent PCA and LDA).

Multiple Discriminant Analysis For the c-class problem in a d-dimensional space, the natural generalization involves c-1 discriminant functions. The within-class scatter is defined as: Define a total mean vector, m: and a total scatter matrix, ST, by: The total scatter is related to the within-class scatter (derivation omitted): We have c-1 discriminant functions of the form:

Multiple Discriminant Analysis (Cont.) The criterion function is: The solution to maximizing J(W) is once again found via an eigenvalue decomposition: Because SB is the sum of c rank one or less matrices, and because only c-1 of these are independent, SB is of rank c-1 or less. An excellent presentation on applications of LDA can be found at PCA Fails!

Summary Principal Component Analysis (PCA): represents the data by minimizing the squared error (representing data in directions of greatest variance). PCA is commonly applied in a class-dependent manner where a whitening transformation is computed for each class, and the distance from the mean of that class is measured using this class-specific transformation. PCA gives insight into the important dimensions of your problem by examining the direction of the eigenvectors. Linear DIscriminant Analysis (LDA): attempts to project the data onto a line that represents the direction of maximum discrimination. LDA can be generalized to a multi-class problem through the use of multiple discriminant functions (c classes require c-1 discriminant functions).