Generalized Discriminant Analysis


Generalized Discriminant Analysis
Presenter: 張志豪
Date: 2004/10/29

Outline
- Introduction
- Classification
- Feature Space Transformation
- Criterion

Introduction - Discriminant Analysis
A discriminant serves as a classifier (predictor):
- When the class distributions are known, a newly arriving sample can be assigned to a class using the chosen discriminant method.
- In practice we cannot know the true class distributions, so we estimate them from training data; the more training data, the more accurate the estimate.
- In the purely statistical setting, discriminant analysis is used only as a predictor and involves no feature space transformation.
- (Presenter's open question: are these all two-class discriminations?)

Introduction - Components of Discriminant Analysis
- Classification: linear, quadratic, nonlinear (kernel)
- Feature Space Transformation: linear
- Criterion: ML, MMI, MCE

Introduction - Exposition
- First, estimate the class distributions from the labeled data.
- The class distributions overlap in the feature space, so a feature space transformation is applied to change the feature space.
- A criterion is used to find the most suitable transformation basis.
- When a new pattern arrives, the estimated (transformed) distributions serve as the predictor.
Feature space transformation: LDA is taken as the example.

Introduction - Criterion

  Criterion   Linear     Quadratic   Nonlinear
  ML          LDA        HDA         FDA, Kernel LDA
  MMI         MMI LDA    -           -
  MCE         MCE LDA    -           -

Classification - Outline
- Linear Discriminant Analysis & Quadratic Discriminant Analysis
  - Linear Discriminant Analysis
  - Quadratic Discriminant Analysis
  - Problem
  - Practice
- Flexible Discriminant Analysis
  - Linear Discriminant Analysis → Multi-Response Linear Regression
  - Parametric → Non-Parametric
- Kernel LDA

Classification - Linear Discriminant Analysis
A simple application of Bayes' theorem gives us the class posterior probabilities.
Assumption: each class follows a single Gaussian distribution.
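The transcript drops the slide's equation; this is a hedged reconstruction of the standard form it presumably showed, with class-conditional densities f_k and priors π_k:

```latex
\Pr(G = k \mid X = x) \;=\; \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l},
\qquad
f_k(x) \;=\; \frac{1}{(2\pi)^{p/2}\lvert\Sigma_k\rvert^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)\Big).
```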

Classification - Linear Discriminant Analysis (cont.)
In comparing two classes k and l, it is sufficient to look at the log-ratio of their posterior probabilities.
Assumption: all classes share a common covariance matrix.
Intuition: classification between two classes.
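A reconstruction of the log-ratio the slide refers to, assuming a common covariance Σ for every class (the standard LDA derivation):

```latex
\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)}
\;=\; \log\frac{\pi_k}{\pi_l}
\;-\; \tfrac{1}{2}(\mu_k+\mu_l)^{\top}\Sigma^{-1}(\mu_k-\mu_l)
\;+\; x^{\top}\Sigma^{-1}(\mu_k-\mu_l),
```

an expression that is linear in x.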

Classification - Linear Discriminant Analysis (cont.)
This linear log-odds function implies that the decision boundary between classes k and l is linear in x. This is true for any pair of classes, so all the decision boundaries are linear. From it we obtain the linear discriminant functions.
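The linear discriminant functions referred to above, in their usual form (classify x to the class with the largest score δ_k(x)); this is the standard expression, not copied from the slide image:

```latex
\delta_k(x) \;=\; x^{\top}\Sigma^{-1}\mu_k \;-\; \tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k \;+\; \log\pi_k .
```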

Classification - Quadratic Discriminant Analysis
If the covariances are not assumed to be equal, the convenient cancellations do not occur; in particular, the pieces quadratic in x remain. This gives the quadratic discriminant functions.
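Likewise, the quadratic discriminant functions in their standard form, with a class-specific covariance Σ_k (again a reconstruction of the missing slide equation):

```latex
\delta_k(x) \;=\; -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
\;-\; \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
\;+\; \log\pi_k .
```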

Classification - LDA & QDA
Number of parameters: LDA needs on the order of P^2 (one shared covariance); QDA needs on the order of J*P^2 (one covariance per class, J classes).
Accuracy: LDA is mismatched when the class covariances actually differ.
(Figure: LDA vs. QDA decision regions; the size of each circle indicates how widely that class distribution is spread.)

Classification - Problem
How do we use a linear discriminant when we have more than two classes? There are two approaches:
- Learn one discriminant function for each class.
- Learn a discriminant function for each pair of classes.
If c is the number of classes, the first case gives c functions and the second c(c-1)/2 functions. In both cases we are left with ambiguous regions.

Classification - Problem
(Figure: the ambiguous regions produced by both multi-class approaches.)

Classification - Problem
To avoid the ambiguous regions we can use a linear machine: define c linear discriminant functions and, for a given x, choose the class whose function attains the highest value (sketched below).
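A minimal sketch of such a linear machine, assuming LDA-style discriminant functions with a pooled covariance; all names here are illustrative, not from the slides:

```python
import numpy as np

def fit_linear_machine(X, y, n_classes):
    """Fit c linear discriminant functions delta_k(x) = x . w_k + b_k
    under LDA's common-covariance assumption."""
    means = np.stack([X[y == k].mean(axis=0) for k in range(n_classes)])
    priors = np.array([(y == k).mean() for k in range(n_classes)])
    # pooled within-class covariance (the shared Sigma of LDA)
    Sigma = sum(np.cov(X[y == k].T) * ((y == k).sum() - 1)
                for k in range(n_classes)) / (len(y) - n_classes)
    Sigma_inv = np.linalg.inv(Sigma)
    W = means @ Sigma_inv                               # row k is w_k
    b = -0.5 * np.einsum('kp,pq,kq->k', means, Sigma_inv, means) + np.log(priors)
    return W, b

def predict(X, W, b):
    # linear machine: pick the discriminant function with the highest value
    return np.argmax(X @ W.T + b, axis=1)
```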

Classification - Conclusion: LDA & QDA
LDA: classify x to the k-th class (common covariance) whose posterior probability, equivalently whose linear discriminant function score, is the largest.
QDA: classify x to the k-th class (class-dependent covariances) whose quadratic discriminant function score is the largest.

Classification - Practice: LDA
Idea: after the feature space transformation, the classes should be easier to discriminate linearly.
Components:
- Classification: linear decision boundaries
- Feature Space Transformation: linear
- Criterion: ML

Classification - Practice: LDA
Linear transformation: the likelihood is essentially unchanged; only its scale changes.
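One hedged way to make the "same likelihood, different scale" remark precise (my reconstruction, not stated on the slide): an invertible linear map y = θx rescales every density by the same Jacobian factor, so comparisons between classes are unaffected.

```latex
p_Y(y) \;=\; \frac{p_X(\theta^{-1}y)}{\lvert\det\theta\rvert}
\quad\Longrightarrow\quad
\log p_Y(y \mid \omega_k) \;=\; \log p_X(x \mid \omega_k) \;-\; \log\lvert\det\theta\rvert ,
```

where the offset -log|det θ| is identical for every class ω_k.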

Classification - Practice: LDA
The maximum likelihood criterion assumes:
- A single Gaussian distribution per class
- Equal class prior probabilities
- Diagonal and common covariance (the within-class covariance)
- The dimensions that lack classification information share an identical distribution (the total-class covariance)
Lack of classification information: assume that some of the dimensions carry no classification information; under this assumption those dimensions can be dropped. (The JHU group gives a proof in their Appendix C.)

Classification - Practice: LDA
Intuition: T = B + W, where B is the between-class covariance, W is the within-class covariance, and T the total covariance; the projection is given by a transformation matrix (see the sketch below).
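A minimal sketch of the LDA feature-space transformation just described, assuming the usual scatter-matrix formulation (maximize between-class relative to within-class scatter, solved by the leading generalized eigenvectors of W^{-1}B); the function name is hypothetical:

```python
import numpy as np
import scipy.linalg

def lda_transform(X, y, n_components):
    """LDA projection from within/between-class scatter (T = B + W)."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))            # between-class scatter
    W = np.zeros((p, p))            # within-class scatter (assumed non-singular)
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.outer(mk - overall_mean, mk - overall_mean)
        W += (Xk - mk).T @ (Xk - mk)
    # generalized eigenproblem B v = lambda W v; the leading eigenvectors
    # maximize between-class relative to within-class scatter after projection
    eigvals, eigvecs = scipy.linalg.eigh(B, W)
    order = np.argsort(eigvals)[::-1][:n_components]
    theta = eigvecs[:, order]       # columns form the transformation matrix
    return X @ theta, theta
```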

Classification - Practice: HDA
Idea: after the feature space transformation, the classes should be easier to discriminate quadratically.
Components:
- Classification: quadratic decision boundaries
- Feature Space Transformation: linear
- Criterion: ML

Classification - Practice: HDA
The maximum likelihood criterion assumes:
- A single Gaussian distribution per class
- Equal class prior probabilities
- Diagonal covariance (per class)
- The dimensions that lack classification information share an identical distribution
Optimization: the JHU group uses a steepest-descent algorithm; the Cambridge group uses a semi-tied update, which is guaranteed to find a locally optimal solution and to be stable.

Classification - Practice: HDA
Intuition: T = B + W, where B is the between-class covariance, W is the within-class covariance, and the projection is again given by a transformation matrix. (In the example shown on this slide, HDA performs worse than LDA.)

Classification - Practice: Problems
Why is a linear transformation effective?
- Information theory: it is impossible to create new information by transforming data; transformations can only lose information. One therefore looks for the K < P dimensional subspace of R^P in which the group centroids are most separated.
- Single multi-dimensional Gaussian: if each class classifies well when modeled with a single Gaussian, then intuitively modeling each class with a mixture of Gaussians should classify even better. (Presenter's open question: is the observation probability the same thing as classification?)

Classification - LDA as Linear Regression
Idea: linear discriminant analysis is equivalent to multi-response linear regression using optimal scorings to represent the groups. In this way, any multi-response regression technique can be post-processed to improve its classification performance; nonparametric versions of discriminant analysis are obtained by replacing linear regression with any nonparametric regression method.
Regression analysis: studies the relationships among variables and finds a suitable mathematical equation to express them, so that one variable can be predicted from another. It builds on correlation analysis, since the reliability of any prediction depends on the strength of the relationship between the variables.

Classification - LDA as Linear Regression
Suppose θ is a function that assigns scores to the classes, such that the transformed class labels θ(g) are optimally predicted by linear regression on X. We therefore have to choose θ and the coefficient vector β so that the least-squares error between θ(g_i) and x_i^T β is minimized. This gives a one-dimensional separation between the classes.
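A hedged reconstruction of the least-squares criterion the slide leaves implicit (the optimal-scoring problem of Hastie et al.), with g_i the class label of observation i:

```latex
\min_{\theta,\;\beta}\;\sum_{i=1}^{N}\bigl(\theta(g_i) - x_i^{\top}\beta\bigr)^{2}.
```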

Classification - LDA as Linear Regression
Multi-Response Linear Regression: use K independent scorings of the class labels, θ_1, ..., θ_K, and K linear maps η_k(x) = x^T β_k; the scores and the maps are chosen to minimize the average squared residual, ASR, in equation (1).
- x_i^T β_k: the value of the i-th observation projected onto the k-th dimension.
- θ_k(g_i): the score of the i-th observation's label in the k-th dimension (a mean value?).
The set of scores is assumed to be mutually orthogonal and normalized.
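The average squared residual referred to as equation (1), reconstructed in the standard optimal-scoring form with K scorings θ_k and K coefficient vectors β_k:

```latex
\mathrm{ASR} \;=\; \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{N}\bigl(\theta_k(g_i) - x_i^{\top}\beta_k\bigr)^{2} \qquad (1)
```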

Classification - LDA as Linear Regression
Multi-Response Linear Regression (cont.): it can be shown [Mardia79, Hastie95] that
- the fitted coordinates η_k(x) = x^T β_k are equivalent, up to a constant, to the Fisher discriminant coordinates;
- the Mahalanobis distances can be derived from the ASR solutions;
- LDA can therefore be performed by a sequence of linear regressions, followed by a classification in the space of fits (Mardia, Kent and Bibby, 1979).

Classification - LDA as Linear Regression
Multi-Response Linear Regression (cont.): let Y be the N x J indicator matrix corresponding to the dummy-variable coding of the classes; that is, the ij-th element of Y is 1 if the i-th observation falls in class j, and 0 otherwise. Let Θ be the J x K matrix whose columns are the K score vectors for the J classes, and let YΘ be the N x K matrix of transformed class values, with ik-th element θ_k(g_i).

Classification - LDA as Linear Regression
Solution 1: looking at (1), it is clear that if the scores were fixed we could minimize ASR by regressing the scored labels on x. If we let P_X denote the projection onto the column space of the predictors, this gives

  ASR(Θ) = (1/N) tr[ (YΘ)^T (I - P_X) (YΘ) ].   (2)

Classification - LDA as Linear Regression
Solution 1 (cont.): assume the scores have mean zero, unit variance, and are uncorrelated over the N observations. Minimizing (2) then amounts to finding the K largest eigenvectors Θ of Y^T P_X Y, with normalization Θ^T D_π Θ = I_K, where D_π = Y^T Y / N is the diagonal matrix of the sample class proportions. We could do this by constructing Y^T P_X Y, computing its eigen-decomposition, and reading off its eigenvectors. But a more convenient approach avoids the explicit construction of P_X, which is an N x N matrix and far too large to build, and takes advantage of the fact that P_X Y is exactly a linear regression of Y on X.

Classification - LDA as Linear Regression
Solution 2:
- Y: (N x J), the ground truth, relating observations to classes.
- Ŷ = XB: (N x J), the predicted values, relating observations to classes.
- Y^T Ŷ: (J x J), a covariance-like matrix relating classes to classes.
- B: (P x J), the least-squares coefficient matrix (built from X^T Y), relating dimensions to classes.
- X: (N x P), the training data, relating observations to dimensions.
It turns out that LDA amounts to this regression followed by an eigen-decomposition of Y^T Ŷ.

Classification - LDA as Linear Regression
Solution 2 (cont.): the final coefficient matrix B is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix; the k-th diagonal scaling term is a function of the k-th largest eigenvalue computed in the eigen-decomposition above. This yields the LDA transformation matrix.
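A minimal sketch of this regression-plus-eigendecomposition route to LDA (optimal scoring), following the steps just described under the stated assumptions; the generalized eigenproblem uses the class-proportion matrix D_π for normalization, and every name is hypothetical:

```python
import numpy as np
import scipy.linalg

def lda_via_regression(X, y, n_components):
    """LDA via multi-response regression on class indicators,
    followed by an eigen-decomposition of Y' Yhat."""
    N = X.shape[0]
    classes = np.unique(y)
    Y = np.zeros((N, len(classes)))                 # N x J indicator matrix
    Y[np.arange(N), np.searchsorted(classes, y)] = 1.0
    Xc = np.column_stack([np.ones(N), X])           # add an intercept column
    B, *_ = np.linalg.lstsq(Xc, Y, rcond=None)      # B = (X'X)^{-1} X'Y
    Yhat = Xc @ B                                   # fitted values, P_X Y
    D_pi = Y.T @ Y / N                              # diagonal class proportions
    # optimal scores: leading eigenvectors of Y'Yhat with Theta' D_pi Theta = I
    evals, Theta = scipy.linalg.eigh(Y.T @ Yhat / N, D_pi)
    order = np.argsort(evals)[::-1][:n_components]
    Theta = Theta[:, order]
    coef = B[1:, :] @ Theta     # discriminant directions, up to per-column scaling
    return X @ coef, coef
```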

Classification - Flexible Discriminant Analysis
Nonparametric version: we replace the linear-projection operator P_X by a nonparametric regression procedure, which we denote by the linear operator S. One simple and effective approach is to expand X into a larger set of basis variables h(X), and then simply use the projection onto h(X) in place of P_X. (Wherever an inner product appears, a kernel function can be substituted.)
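A hedged sketch of that basis-expansion route: expand X with a simple quadratic feature map h(X) and reuse the regression-based LDA sketch above on the expanded features. The helper below is illustrative, not from the slides:

```python
import numpy as np

def quadratic_basis(X):
    """h(X): the original features plus squares and pairwise products."""
    N, P = X.shape
    squares = X ** 2
    if P > 1:
        cross = np.column_stack([X[:, i] * X[:, j]
                                 for i in range(P) for j in range(i + 1, P)])
    else:
        cross = np.empty((N, 0))
    return np.column_stack([X, squares, cross])

# Usage, assuming lda_via_regression from the previous sketch:
# Z = quadratic_basis(X)
# projected, coef = lda_via_regression(Z, y, n_components=2)
```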

Classification - Flexible Discriminant Analysis

Classification - Flexible Discriminant Analysis: Non-Parametric Algorithm (cont.)

Classification - Flexible Discriminant Analysis

Classification - Kernel LDA: Linear Discriminant Analysis

Classification - Kernel LDA: Kernel Linear Discriminant Analysis

Classification - Kernel LDA: Kernel Linear Discriminant Analysis (cont.)

Classification - Kernel LDA: Kernel Linear Discriminant Analysis (cont.)
This problem can be solved by finding the leading eigenvector of N^{-1}M. The projection of a new pattern x onto w is then given by the kernel expansion w · φ(x) = Σ_i α_i k(x_i, x).
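A minimal two-class sketch of this kernel LDA (kernel Fisher discriminant) formulation, assuming labels 0/1, an RBF kernel, and the usual regularization of the N matrix; since M has rank one, the leading direction of N^{-1}M is proportional to N^{-1}(m_1 - m_0). All names are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_fisher_discriminant(X, y, gamma=1.0, reg=1e-3):
    """Two-class kernel LDA: build the M and N matrices from the kernel
    matrix and solve for the expansion coefficients alpha."""
    K = rbf_kernel(X, X, gamma)                      # N x N kernel matrix
    m, Nparts = [], []
    for c in (0, 1):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]                               # kernel block for class c
        m.append(Kc.mean(axis=1))                    # "kernel mean" of class c
        nc = len(idx)
        Nparts.append(Kc @ (np.eye(nc) - np.ones((nc, nc)) / nc) @ Kc.T)
    Nmat = Nparts[0] + Nparts[1] + reg * np.eye(len(y))   # regularized N matrix
    alpha = np.linalg.solve(Nmat, m[1] - m[0])       # leading direction of N^{-1}M
    def project(x_new):
        # projection of new patterns: sum_i alpha_i k(x_i, x)
        return rbf_kernel(np.atleast_2d(x_new), X, gamma) @ alpha
    return alpha, project
```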