Basics of discriminant analysis

Basics of discriminant analysis: Purpose; Various situations; Examples; R commands; Demonstration

Purpose Assume that we have several groups and we know that each observation must belong to one of these groups. For example, we may know that there are several possible diseases and, from the symptoms, we want to decide which disease we are dealing with. Or we may have several species of plants: when we observe various characteristics of a specimen, we want to know to which species it belongs. The idea is to divide the observation space into regions, each assigned to one of the classes. If an observation falls into region number k, we say that it belongs to class number k. In the picture there are three regions; an observation falling into region 1 is declared a member of class 1. Discriminant analysis is widely used in many fields; for example, it is an integral part of neural networks. [Figure: 2D example with the plane divided into three regions labelled 1, 2 and 3]

Various situations There can be several situations: 1) We know the distribution for each class (an unrealistic assumption). Then the problem is easy: for a given observation, calculate its probability under each class's distribution; whichever class gives the maximum value wins. 2) We know the form of the distributions but not their parameters. For example, we may know that the distribution of each class is normal but not know the means and variances. Then we need representative observations for each class; once we have them, we can estimate the parameters of the distributions (mean and variance in the normal case). For a new observation we treat these estimates as the true parameters and calculate the probabilities; again the largest probability wins. 3) We may also have prior probabilities. E.g. in the case of diseases we may know that one disease has prior probability 0.7 and the other 0.3. In this case we multiply the probability of the observation under each class by the corresponding prior before comparing. A small R sketch of cases 1) and 3) is given below.
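A minimal sketch of the "largest (prior times) likelihood wins" rule, assuming two univariate normal classes; the means, standard deviations and priors used here are illustrative and not taken from the slides:

# Classify one observation: prior * normal density for each class, largest score wins
classify <- function(x, means, sds, priors = rep(1/length(means), length(means))) {
  scores <- priors * dnorm(x, mean = means, sd = sds)   # prior times likelihood, per class
  which.max(scores)                                      # index of the winning class
}
classify(6.0, means = c(5, 8), sds = c(1, 1))                        # equal priors: pure ML rule
classify(6.0, means = c(5, 8), sds = c(1, 1), priors = c(0.3, 0.7))  # priors can change the decision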

Various situations: Unknown parameters If we know that the probability distributions are normal then there are two cases. 1) The variances of the distributions are the same. In this case the space is divided by hyperplanes. In the one dimensional case with two classes there is one point that divides the line into two regions; it lies halfway between the means of the two distributions. In the two dimensional case with two classes there is a straight line dividing the plane into two regions; it crosses the segment joining the two means at its midpoint. In three dimensions the dividing surfaces are planes. 2) The variances are different. In this case the regions are bounded by quadratic forms. In the one dimensional case there are two boundary points. In the two dimensional case the boundary may be an ellipse, a hyperbola, a parabola or a pair of lines, depending on the differences between the variances. In the three dimensional case it can be an ellipsoid, a hyperboloid, etc.

Maximum likelihood discriminant analysis Let us assume that we have g populations (groups), and that population i has probability density Li(x). For an observation the likelihood under every population is calculated and the population with the largest likelihood is chosen; if two populations give the same likelihood, either of them may be taken. Now assume the populations are one dimensional with normal distributions, and that there are only two of them. Allocating x to group 1 when L1(x) ≥ L2(x) gives, after taking logarithms, the quadratic inequality −(x−μ1)²/(2σ1²) − log σ1 ≥ −(x−μ2)²/(2σ2²) − log σ2. This quadratic inequality divides the real line into regions: where it is satisfied the observation belongs to class 1, otherwise to class 2. When the variances are equal it reduces to a linear inequality: if μ1 > μ2, the rule puts x into group 1 whenever x > (μ1+μ2)/2. Multidimensional cases are similar except that the inequalities are multidimensional; when the variances are equal the space is divided by a hyperplane (a line in the two dimensional case). If the parameters of the distributions are not known, they are estimated from the given observations. A sketch of the boundary calculation is given below.
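A minimal sketch of the boundary calculation for two univariate normal classes: the coefficients of the quadratic above are formed and its real roots returned (the midpoint of the means when the variances are equal). The parameter values in the example calls are illustrative; the first matches the example on the next slide.

boundary <- function(mu1, sd1, mu2, sd2) {
  a  <- 1/(2*sd2^2) - 1/(2*sd1^2)                          # coefficient of x^2
  b  <- mu1/sd1^2 - mu2/sd2^2                              # coefficient of x
  c0 <- mu2^2/(2*sd2^2) - mu1^2/(2*sd1^2) + log(sd2/sd1)   # constant term
  if (abs(a) < 1e-12) return(-c0/b)      # equal variances: single point, (mu1 + mu2)/2
  Re(polyroot(c(c0, b, a)))              # unequal variances: in general two boundary points
}
boundary(5, 1, 8, 1)    # equal variances: 6.5, the midpoint of the two means
boundary(5, 2, 8, 1)    # unequal variances (illustrative values): two boundary points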

Distributions with equal and known variances: 1D example The probability distributions for the classes are known: both are normal with variance 1, one with mean 5 and the other with mean 8. Anything below 6.5 belongs to class 1 and anything above 6.5 belongs to class 2; an observation exactly at 6.5 could belong to either class. The new observations a and b are assigned to class 1 and the observations c and d to class 2. In general, anything smaller than the midpoint of the two means is assigned to class 1 and anything larger to class 2. [Figure: the two normal densities for class 1 and class 2, the discrimination point at 6.5, and the new observations a, b, c, d]

Distributions with known but different variances: 1D example Assume we have two classes whose distributions are both normal, with known means and variances, and that one distribution is much sharper (smaller variance) than the other. In this case the density of observation b under class 2 is higher than under class 1, the density of c is higher under class 1 than under class 2, and the density of a, although very small, is still higher under class 1 than under class 2. Thus the observations a, c and d are assigned to class 1 and the observation b to class 2. Very small and very large observations belong to class 1 and medium observations to class 2, so class 2 is assigned an interval surrounded on both sides by class 1. [Figure: the broad class 1 density and the sharp class 2 density, the interval assigned to class 2, and the new observations a, b, c, d]

Two dimensional example In the two dimensional case we want to divide the whole plane into two (or more) regions; when a new observation falls into one of these regions we decide its class accordingly. The red dot lies in the region corresponding to class 1 and the blue dot in the region belonging to class 2. The parameters of the distributions are estimated from the sample points (shown as small black dots), with 50 observations for each class. If it turns out that the variances of the distributions are equal then the discrimination boundary is linear; if the variances were unequal the boundary would be quadratic (a curve rather than a straight line). A simulated version of this example is sketched below. [Figure: scatter of the two training samples, the linear discrimination line, and the two new observations]
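A minimal simulated version of this picture, assuming the MASS package; the class means and common covariance chosen here are illustrative, with 50 points per class as in the slide:

library(MASS)
set.seed(1)
x1 <- mvrnorm(50, mu = c(0, 0), Sigma = diag(2))    # training sample for class 1
x2 <- mvrnorm(50, mu = c(3, 3), Sigma = diag(2))    # training sample for class 2
data      <- rbind(x1, x2)
groupings <- factor(rep(c(1, 2), each = 50))
z <- lda(data, grouping = groupings)                # linear rule (equal covariances assumed)
predict(z, rbind(c(1, 1), c(2.5, 2.5)))$class       # classify two new observations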

Likelihood ratio discriminant analysis The likelihood ratio discriminant rule adds the given observation, in turn, to each group under consideration and re-estimates the parameters with the observation included; this is done for every group. The observation is then allocated to the group giving the largest resulting likelihood. This technique tends to put an observation into the population with the larger sample size.

Fisher’s discriminant function Fisher’s discrimination rule maximises the ratio of the between-groups sum of squares to the within-groups sum of squares, i.e. it finds the direction a maximising a'Ba / a'Wa, where W is the within-groups sum of squares matrix W = Σi Σj (xij − mi)(xij − mi)'. Here n is the total number of observations, g is the number of groups, mi is the mean of group i and ni is the number of observations in group i. There are several ways of defining the between-groups sum of squares B; one popular choice is the weighted form B = Σi ni (mi − m)(mi − m)', where m is the overall mean. The problem of finding the discrimination rule then reduces to finding the maximum eigenvalue and corresponding eigenvector a of the matrix W⁻¹B. A new observation x is put into group i if |a'(x − mi)| ≤ |a'(x − mj)| for all j ≠ i, i.e. if its projection a'x is closest to the projected mean of group i.
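A minimal sketch of this computation: forming W and the weighted B, taking the leading eigenvector of W⁻¹B, and allocating a new observation to the group with the nearest projected mean. The use of the built-in iris data in the last line is purely illustrative.

fisher_rule <- function(X, groups, xnew) {
  X <- as.matrix(X); groups <- factor(groups)
  m <- colMeans(X)                                    # overall mean
  W <- matrix(0, ncol(X), ncol(X)); B <- matrix(0, ncol(X), ncol(X))
  for (g in levels(groups)) {
    Xg <- X[groups == g, , drop = FALSE]
    mg <- colMeans(Xg)
    W  <- W + crossprod(sweep(Xg, 2, mg))             # within-groups sum of squares
    B  <- B + nrow(Xg) * tcrossprod(mg - m)           # weighted between-groups sum of squares
  }
  a <- Re(eigen(solve(W) %*% B)$vectors[, 1])         # direction maximising a'Ba / a'Wa
  pm <- sapply(levels(groups),
               function(g) sum(a * colMeans(X[groups == g, , drop = FALSE])))
  levels(groups)[which.min(abs(sum(a * xnew) - pm))]  # group with the nearest projected mean
}
fisher_rule(iris[, 1:4], iris$Species, xnew = c(5.0, 3.5, 1.4, 0.2))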

When parameters of distributions are unknown In general the problem consists of two parts. 1) Classification: at this stage the space is divided into regions and each region is assigned to one class. In other words, we need to find a function or a set of inequalities that partitions the space, usually derived from the probability distribution of each class. This stage can be regarded as rule generation. 2) Discrimination: once the space has been partitioned, or rules have been generated, new observations are assigned to classes using these rules. Note that if each observation is assigned to exactly one class, the rule is deterministic. There are other kinds of rules too, for example fuzzy rules, where each observation has a degree of membership in each class: an observation may belong to class 1 with degree 0.7 and to class 2 with degree 0.3.

Probability of misclassification Let us assume we have g groups (classes). The probability of misclassification pij is defined as the probability of allocating an observation to class i when it is in fact from class j. In particular, the probability of correct allocation for class i is pii and the probability of misclassification for this class is 1−pii. Assume we have two discriminant rules, d and d’. The rule d is said to be as good as d’ if pii ≥ p’ii for i = 1, ..., g, and d is better than d’ if in addition at least one of these inequalities is strict. If there is no rule better than d, then d is called admissible. In general two rules need not be comparable: it may happen, for example, that p11 > p’11 but p22 < p’22. A sketch of estimating the pij from data is given below.
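A minimal sketch of estimating the matrix of pij from labelled data by cross-tabulating true and predicted classes (this is the resubstitution estimate discussed on the next slide; the built-in iris data is used purely for illustration):

library(MASS)
z    <- lda(iris[, 1:4], grouping = iris$Species)
pred <- predict(z, iris[, 1:4])$class
conf <- table(true = iris$Species, predicted = pred)   # counts: row = true class j, column = predicted class i
prop.table(conf, margin = 1)                           # entry (j, i) estimates p_ij; the diagonal gives p_ii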

Resampling and misclassification Resubstitution: estimate the discriminant rule and then calculate the misclassification probabilities by reclassifying the same observations. The problem with this technique is that, as expected, it gives an optimistic estimate. Jackknife (leave-one-out): each observation in turn is removed, the discriminant rule is estimated without it, and the removed observation is then predicted. The misclassification probability for group 1 is estimated by ni1/n1, where n1 is the number of observations in the first group and ni1 is the number of cases in which an observation from group 1 was classified as belonging to group i; similar misclassification probabilities are calculated for each class. Bootstrap: resample the sample of observations. There are several techniques based on the bootstrap; one of them is as follows. First calculate the misclassification probabilities by resubstitution and denote them eai. Then draw a bootstrap sample, either by resampling all observations simultaneously or by resampling within each group (i.e. taking a sample of n1 points from group 1, etc). Estimate the discrimination rule on the bootstrap sample and calculate the misclassification probabilities both for the bootstrap sample and for the original sample; denote them epib and pib and form the differences dib = epib − pib. Repeat this B times and average the differences; this average is the bootstrap bias correction, and the corrected probability of misclassification is eai − (1/B) Σb dib. A sketch of the jackknife estimate is given below.
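A minimal sketch of the jackknife (leave-one-out) estimate; MASS::lda can perform the leave-one-out loop itself via CV = TRUE, and the built-in iris data is again used only for illustration:

library(MASS)
cv   <- lda(iris[, 1:4], grouping = iris$Species, CV = TRUE)   # leave-one-out predictions
conf <- table(true = iris$Species, predicted = cv$class)       # held-out confusion matrix
1 - sum(diag(conf)) / sum(conf)                                # overall leave-one-out error rate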

R commands for discriminant analysis Commands for discriminant analysis are in the library MASS, which should be loaded first with library(MASS). The necessary commands are: lda – linear discriminant analysis; using the given observations this command calculates the discrimination lines (hyperplanes). qda – quadratic discriminant analysis; it calculates the necessary quadratic boundaries and does not assume equality of the variances. predict – for new observations it decides to which class each belongs. Example of use:
z = lda(data, grouping=groupings)
predict(z, newobservations)
Similarly for quadratic discriminant analysis:
z = qda(data, grouping=groupings)
predict(z, newobservations)$class
Here data is the data matrix used to calculate the discrimination rule (it can be considered the training set) and groupings defines which observation belongs to which class. A complete worked example is given below.
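One possible complete run of these commands, assuming the built-in iris data as the training set (any data matrix plus a vector of class labels would do):

library(MASS)
data      <- iris[, 1:4]                      # training observations
groupings <- iris$Species                     # class label of each observation
z <- lda(data, grouping = groupings)          # linear rule (equal variances assumed)
newobservations <- iris[c(1, 51, 101), 1:4]   # pretend these are new observations
predict(z, newobservations)$class             # predicted classes
predict(z, newobservations)$posterior         # class probabilities (degrees of membership)
zq <- qda(data, grouping = groupings)         # quadratic rule (unequal variances allowed)
predict(zq, newobservations)$class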

References Krzanowski, W.J. and Marriott, F.H.C. (1994) Multivariate Analysis. Kendall's Library of Statistics. Mardia, K.V., Kent, J.T. and Bibby, J.M. (2003) Multivariate Analysis.