Computational Intelligence: Methods and Applications
Lecture 22: Linear discrimination - variants
Włodzisław Duch, Dept. of Informatics, UMK
Google: W Duch

LDA Wine example

LDA with WEKA, using the Wine data; classification via regression is used (in WEKA: "Classification via Regression" + Linear Regression). In each crossvalidation fold the regression coefficients may be quite different, and the discriminant functions (one per class) may use different features, e.g.:

C1: ... * alcohol + ...
C2: ... * alcohol + ... (the WEKA "ridge" parameter removes small weights)
C3: 0 * alcohol + ...

Good results: 2-3 errors are expected in 10x CV using one hyperplane per class.

a b c <= classified as | a = 1 | b = 2 | c = 3

Not much knowledge is gained, and the solution is not stable.
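
A comparable experiment can be sketched in Python with scikit-learn (an assumption of this note; the slide itself uses WEKA). LinearDiscriminantAnalysis stands in for the classification-via-regression scheme and StratifiedKFold for the 10x CV:

```python
# Minimal sketch: LDA on the Wine data with 10-fold crossvalidation (scikit-learn assumed).
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import confusion_matrix

X, y = load_wine(return_X_y=True)                    # 178 samples, 13 features, 3 classes
lda = LinearDiscriminantAnalysis()

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(lda, X, y, cv=cv)         # out-of-fold predictions

print(confusion_matrix(y, y_pred))                   # expect only a few off-diagonal errors
print(lda.fit(X, y).coef_)                           # one linear discriminant per class
```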

LDA voting example

Not all LD schemes in WEKA work with more than two classes. LDA with WEKA (choose "Classification via Regression" + Linear Regression), using the Vote data: predict who belongs to the Democratic and who to the Republican party, for all 435 members of the US Congress, using the results of voting on 16 issues.

Largest components: 0.72 * physician-fee-freeze=y, ... * adoption-of-the-budget-resolution=n, ... * synfuels-corporation-cutback=n, ... mx-missiles, ...

a b <= classified as | a = democrat | b = republican. Overall 95.6% accuracy.

Unfortunately the crossvalidated accuracy and its variance are not so easy to calculate this way. C4.5 makes 12 (6+6) errors and selects the same attributes as most important.
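
WEKA's "Classification via Regression" fits one linear regression per class on a 0/1 class indicator and predicts the class with the largest output. A rough Python equivalent (a hypothetical sketch, not the WEKA implementation):

```python
# Sketch of classification via regression: one least-squares regressor per class indicator.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import label_binarize

class ClassificationViaRegression:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        Y = label_binarize(y, classes=self.classes_)   # n_samples x n_classes 0/1 targets
        if Y.shape[1] == 1:                            # binary case: add the complement column
            Y = np.hstack([1 - Y, Y])
        self.models_ = [LinearRegression().fit(X, Y[:, k])
                        for k in range(len(self.classes_))]
        return self

    def predict(self, X):
        scores = np.column_stack([m.predict(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]   # class with the largest output
```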

LDA conditions

Simplification of the conditions defining linear discrimination. Using (d+1)-dimensional vectors extended by X_0 = 1 and W_0, the conditions W^T X > 0 for class ω_1 and W^T X < 0 for class ω_2 are required; taking the negatives of the vectors from the second class leads to the simpler form W^T X > 0 for all training vectors.

Instead of > 0, demand some small positive value, W^T X > b^(i) > 0. This will increase the margin of classification; if b is maximized it will provide the best solution.
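
A small numpy sketch (an assumed helper, not from the lecture) of this construction: augment each vector with a leading 1 and negate the second-class vectors, so that a single condition W^T y > 0 applies to every row of the resulting matrix:

```python
import numpy as np

def augment_and_flip(X1, X2):
    """X1, X2: (n1, d) and (n2, d) samples of class 1 and class 2.
    Returns an (n1+n2) x (d+1) matrix: leading column of ones (the X_0 = 1 extension),
    with rows of class 2 multiplied by -1, so W^T y > 0 should hold for every row y."""
    Y1 = np.hstack([np.ones((len(X1), 1)), X1])
    Y2 = -np.hstack([np.ones((len(X2), 1)), X2])
    return np.vstack([Y1, Y2])
```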

Solving linear equations

The linear equations W^T X = b^T may be solved in the least-squares sense using the pseudoinverse matrix, as in the regression case. If X is singular or not a square matrix then the inverse X^{-1} does not exist, therefore a pseudoinverse matrix is used: X X^T is square (d-dimensional), so the pseudoinverse of X^T is (X X^T)^{-1} X, and multiplying X^T W = b by it leaves W on the left, W = (X X^T)^{-1} X b.

Singular Value Decomposition is another very good method for solving such equations: see the discussion and the algorithm in Numerical Recipes, chapter 2.6 (an on-line version is available).
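
A numpy sketch of both routes (samples stored as rows here, so the system reads Y W = b rather than W^T X = b^T; the function names are assumptions for illustration):

```python
import numpy as np

def solve_lms_pinv(Y, b):
    """Least-squares solution via the pseudoinverse: W = pinv(Y) b."""
    return np.linalg.pinv(Y) @ b

def solve_lms_svd(Y, b):
    """Equivalent SVD-based least-squares solve (np.linalg.lstsq uses SVD internally)."""
    W, residuals, rank, singular_values = np.linalg.lstsq(Y, b, rcond=None)
    return W
```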

LDA perceptron algorithm

Many algorithms for creating linear decision surfaces have been inspired by perceptrons, very simplified models of neurons.

Criterion: J_p(W) = Σ_{X ∈ Er} (-W^T X), where the summation goes over the set Er of misclassified samples, for which W^T X < 0. This criterion J_p(W) measures the sum of the distances of the misclassified samples from the decision boundary.

Minimization by gradient descent may be used: W(k+1) = W(k) + η_k Σ_{X ∈ Er} X, where a learning constant η_k is introduced, dependent on the number of iterations k.

This solves any linearly separable problem! No large matrices, no problems with singularity, and on-line learning works fine.
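
A minimal batch-perceptron sketch in numpy (assuming the augmented, sign-flipped matrix from the earlier augment_and_flip helper, so every row should end up with W^T y > 0):

```python
import numpy as np

def perceptron_train(Y, eta=0.1, max_iter=1000):
    """Batch perceptron: gradient descent on J_p(W) = sum over misclassified rows of -W.y."""
    W = np.zeros(Y.shape[1])
    for k in range(max_iter):
        misclassified = Y[Y @ W <= 0]                 # rows still on the wrong side
        if len(misclassified) == 0:                   # linearly separable data: done
            break
        W = W + eta * misclassified.sum(axis=0)       # gradient step: add misclassified vectors
    return W
```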

Perceptron J_p(W)

Note that the perceptron criterion is piecewise linear. Left side: number of errors; right side: perceptron criterion; zero J_p(W) values in the solution region are possible only for linearly separable data (after Duda et al., Fig. 5.11).

Back to Fisher DA

Fisher linear discrimination (FDA) was used to find canonical coordinates for visualization; they may also be used for classification.

Fisher projection: Y = W^T X, with the criterion J(W) = (W^T S_B W) / (W^T S_W W), where the between-class scatter matrix is S_B = (m_1 - m_2)(m_1 - m_2)^T and the within-class scatter matrix is S_W = Σ_c Σ_{X ∈ ω_c} (X - m_c)(X - m_c)^T.

How is this connected to LDA? W^T X defines a projection on a line; this line is perpendicular to the discrimination hyperplane going through 0 (since here W_0 = 0, and W is d-dimensional).
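
A short numpy sketch (a two-class illustration, not the lecture's code) of the closed-form Fisher direction W ∝ S_W^{-1}(m_1 - m_2) that maximizes this criterion:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher discriminant direction W ~ S_W^{-1} (m1 - m2) for two classes given as row matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    W = np.linalg.solve(S_W, m1 - m2)     # solve instead of forming the explicit inverse
    return W / np.linalg.norm(W)
```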

Relation to FDA

Linear discrimination with augmented (d+1)-dimensional vectors, with -X taken in place of X for the second class, and with the d × n data matrix X, amounts to solving W^T X = b^T with a special choice for b: b = n/n_1 for all n_1 samples from class ω_1 and b = n/n_2 for all n_2 samples from class ω_2 (1_i denotes a row with n_i elements equal to 1).

With this choice one can show that W is the Fisher solution, W = α S_W^{-1}(m_1 - m_2), where α is a scaling constant and the threshold W_0 is determined by the mean of all X; in practice the W_0 threshold is estimated from the data, giving the decision rule W^T X > W_0 for class ω_1.

LD solutions

FDA provides one particular way of finding the linear discriminating plane; detailed proofs are in the Duda, Hart and Stork (2000) book.

Method: LMS with pseudoinverse. Criterion: J_s(W) = ||X^T W - b||^2. Solution: W = (X X^T)^{-1} X b.
Method: Fisher. Criterion: J(W) = (W^T S_B W) / (W^T S_W W). Solution: W ∝ S_W^{-1}(m_1 - m_2).
Method: Relaxation. Criterion: J_r(W) = 1/2 Σ (W^T X - b)^2 / ||X||^2, using only vectors closer than b/||W|| to the border. Solution: iterative gradient updates.
Method: Perceptron. Criterion: J_p(W) = Σ_{X ∈ Er} (-W^T X). Solution: iterative gradient updates.

Differences

LMS solution: orange line; two perceptron solutions with different random starts: blue lines.

The optimal linear solution should place W as far as possible from the data on both sides; it should have a large margin b.

Quadratic discrimination

The Bayesian approach to optimal decisions for Gaussian distributions leads to quadratic discriminant analysis (QDA), with discriminant functions g_i(X) = -1/2 (X - μ_i)^T Σ_i^{-1} (X - μ_i) - 1/2 ln|Σ_i| + ln P(ω_i) and hyperquadratic decision boundaries g_i(X) = g_j(X). For equal a priori probabilities the Mahalanobis distance to the class center is used as the discriminant. Decision regions do not have to be simply connected.

Estimate the covariance matrices from data, or fit d(d+2)/2 parameters to the data directly to obtain QDA; both ways are rather inaccurate.
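
The equal-priors case can be sketched directly in numpy (a hypothetical helper, with class means and covariances assumed to be already estimated): each sample is assigned to the class whose center is nearest in Mahalanobis distance.

```python
import numpy as np

def mahalanobis_classify(X, means, covs):
    """Assign each row of X to the class with the smallest Mahalanobis distance
    to its center (equal a priori probabilities assumed)."""
    inv_covs = [np.linalg.inv(C) for C in covs]
    d2 = np.stack([np.einsum('ij,jk,ik->i', X - m, Ci, X - m)   # squared Mahalanobis distances
                   for m, Ci in zip(means, inv_covs)], axis=1)  # shape (n_samples, n_classes)
    return np.argmin(d2, axis=1)
```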

QDA example

QDA works quite well for some datasets. Left: LDA solution with 5 features X_1, X_2, X_1^2, X_2^2 and X_1 X_2; right: QDA solution with X_1, X_2. The differences are rather small; in both cases 6 parameters are estimated, but in real calculations QDA was usually worse than LDA. Why? Perhaps it is already too complex.
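
The comparison can be reproduced in spirit with scikit-learn (the synthetic data and names are assumptions of this sketch): LDA on the explicitly expanded features X_1, X_2, X_1^2, X_2^2, X_1 X_2 versus QDA on the raw X_1, X_2.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
Xa = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=150)
Xb = rng.multivariate_normal([1.5, 1.5], [[0.5, 0.2], [0.2, 1.5]], size=150)
X = np.vstack([Xa, Xb])
y = np.array([0] * 150 + [1] * 150)

# LDA on the 5 expanded features X1, X2, X1^2, X1*X2, X2^2 (degree-2 expansion, no bias column).
lda_quadratic_features = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                                       LinearDiscriminantAnalysis())
qda = QuadraticDiscriminantAnalysis()

print("LDA on quadratic features:", cross_val_score(lda_quadratic_features, X, y, cv=10).mean())
print("QDA on raw features:      ", cross_val_score(qda, X, y, cv=10).mean())
```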

Regularized discrimination (RDA)

Sometimes a linear solution is almost sufficient, while the quadratic discriminant has too many degrees of freedom and may overfit the data, preventing generalization.

RDA interpolates between the class covariance matrices Σ_k and the average covariance matrix Σ for the whole data: Σ_k(λ) = λ Σ_k + (1 - λ) Σ. The best λ is estimated using crossvalidation; the classification rules are based on QDA, and for λ = 0 QDA is reduced to LDA, giving simpler decision borders.

RDA gives good results and is implemented in RM, S and R.
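
A minimal numpy sketch of the covariance interpolation at the heart of RDA (the helper name and the use of a weighted average of class covariances as Σ are assumptions; λ = 0 recovers the shared-covariance LDA case, λ = 1 the per-class QDA case):

```python
import numpy as np

def rda_covariances(X, y, lam):
    """Sigma_k(lam) = lam * Sigma_k + (1 - lam) * Sigma, with Sigma taken here as the
    weighted average of the class covariance matrices (pooled covariance)."""
    classes = np.unique(y)
    class_covs = {k: np.cov(X[y == k], rowvar=False) for k in classes}
    weights = {k: np.mean(y == k) for k in classes}
    pooled = sum(weights[k] * class_covs[k] for k in classes)
    return {k: lam * class_covs[k] + (1 - lam) * pooled for k in classes}
```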