ECE 8443 – Pattern Recognition
LECTURE 03: GAUSSIAN CLASSIFIERS
Objectives: Normal Distributions, Whitening Transformations, Linear Discriminants
Resources: D.H.S.: Chapter 2 (Part 3); K.F.: Intro to PR; X. Z.: PR Course; M.B.: Gaussian Discriminants; E.M.: Linear Discriminants

ECE 8443: Lecture 03, Slide 1: Multicategory Decision Surfaces
Define a set of discriminant functions: g_i(x), i = 1, …, c. Define a decision rule: choose ω_i if g_i(x) > g_j(x) for all j ≠ i. For a Bayes classifier, let g_i(x) = -R(ω_i|x), because the maximum discriminant function then corresponds to the minimum conditional risk. For the minimum error rate case, let g_i(x) = P(ω_i|x), so that the maximum discriminant function corresponds to the maximum posterior probability. The choice of discriminant functions is not unique: we can multiply them by a positive constant, add a constant, or replace g_i(x) with a monotonically increasing function f(g_i(x)).
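To make the rule concrete, here is a minimal Python sketch of the max-discriminant decision rule; the 1-D Gaussian parameters and priors are invented for illustration, not taken from the lecture.

```python
# Minimal sketch of the max-discriminant decision rule.
# The class-conditional parameters and priors below are illustrative only.
import numpy as np
from scipy.stats import norm

def classify(x, discriminants):
    """Choose omega_i such that g_i(x) > g_j(x) for all j != i."""
    return int(np.argmax([g(x) for g in discriminants]))

# Minimum-error-rate case: g_i(x) = ln p(x|omega_i) + ln P(omega_i)
g = [
    lambda x: norm.logpdf(x, loc=-1.0, scale=1.0) + np.log(0.5),  # class 0
    lambda x: norm.logpdf(x, loc=+2.0, scale=1.0) + np.log(0.5),  # class 1
]
print(classify(0.0, g))   # 0: x = 0 is closer to the mean of class 0
```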

ECE 8443: Lecture 03, Slide 2: Network Representation of a Classifier
A classifier can be visualized as a connected graph with arcs and weights. What are the advantages of this type of visualization?

ECE 8443: Lecture 03, Slide 3: Log Probabilities
Some monotonically increasing functions can simplify calculations considerably. What are some of the reasons the log form, equation (3) below, is particularly useful?
- Computational complexity (e.g., the log cancels the Gaussian exponential)
- Numerical accuracy (e.g., probabilities tend to zero)
- Decomposition (e.g., likelihood and prior are separated and can be weighted differently)
- Normalization (e.g., likelihoods are channel dependent)
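For reference, the three equivalent minimum-error-rate discriminant functions this slide refers to, written in their standard form (the numbering is assumed to match the slide's equations (1)-(3)):

g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i) P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j) P(\omega_j)} \quad (1)

g_i(x) = p(x \mid \omega_i) P(\omega_i) \quad (2)

g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i) \quad (3)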

ECE 8443: Lecture 03, Slide 4: Decision Surfaces
We can visualize our decision rule several ways: choose ω_i if g_i(x) > g_j(x) for all j ≠ i.

ECE 8443: Lecture 03, Slide 5: Two-Category Case
A classifier that places a pattern in one of two classes is often referred to as a dichotomizer. We can reshape the decision rule around a single discriminant function, and if we use the log of the posterior probabilities, the rule becomes a comparison of a log-likelihood ratio and a log-prior ratio (see the forms below). A dichotomizer can therefore be viewed as a machine that computes a single discriminant function and classifies x according to its sign (e.g., support vector machines).
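The two-category forms referred to above, in their standard presentation (assumed to match the slide's equations):

g(x) = g_1(x) - g_2(x), \quad \text{decide } \omega_1 \text{ if } g(x) > 0, \text{ otherwise } \omega_2

g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)

g(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}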

ECE 8443: Lecture 03, Slide 6: Normal Distributions
Recall the definition of a normal distribution (Gaussian), which is completely specified by its mean and covariance (defined below). Why is this distribution so important in engineering? Consider: statistical independence, higher-order moments, Occam's Razor, entropy, linear combinations of normal random variables, and the Central Limit Theorem.
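For reference, the mean vector and covariance matrix referred to above are defined as:

\mu = E[x], \qquad \Sigma = E\left[(x - \mu)(x - \mu)^t\right]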

ECE 8443: Lecture 03, Slide 7: Univariate Normal Distribution
A normal or Gaussian density is a powerful model for continuous-valued feature vectors corrupted by noise, owing to its analytical tractability. The univariate normal distribution, the definitions of its mean and variance, and its entropy are given below.
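The standard expressions (assumed to be the ones shown on the slide):

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]

\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx, \qquad \sigma^2 = E[(x - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx

H(p) = \frac{1}{2} \ln\left(2\pi e \sigma^2\right) \ \text{nats}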

ECE 8443: Lecture 03, Slide 8: Mean and Variance
A normal distribution is completely specified by its mean and variance. It achieves the maximum entropy of all distributions having a given mean and variance. Central Limit Theorem: the sum of a large number of small, independent random variables tends to a Gaussian distribution. The peak of the density occurs at x = μ, where p(μ) = 1/(√(2π)σ). Approximately 68% of the area lies within one σ of the mean, 95% within two σ, and 99.7% within three σ.

ECE 8443: Lecture 03, Slide 9: Multivariate Normal Distributions
A multivariate normal distribution is defined as shown below, where μ represents the mean (vector) and Σ represents the covariance (matrix). Note that the exponent term is really a dot product, or weighted Euclidean distance. The covariance matrix is always symmetric and positive semidefinite. How does the shape of the distribution vary as a function of the covariance?
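The standard d-dimensional form (assumed to match the slide's equation):

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right]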

ECE 8443: Lecture 03, Slide 10: Support Regions
A support region is obtained by the intersection of a Gaussian distribution with a plane. For a horizontal plane, this generates an ellipse whose points are of equal probability density. The shape of the support region is determined by the covariance matrix.

ECE 8443: Lecture 03, Slide 11: Derivation

ECE 8443: Lecture 03, Slide 12: Identity Covariance

ECE 8443: Lecture 03, Slide 13: Unequal Variances

ECE 8443: Lecture 03, Slide 14: Nonzero Off-Diagonal Elements

ECE 8443: Lecture 03, Slide 15: Unconstrained or "Full" Covariance

ECE 8443: Lecture 03, Slide 16: Coordinate Transformations
Why is it convenient to convert an arbitrary distribution into a spherical one? (Hint: Euclidean distance.) Consider the transformation A_w = Φ Λ^(-1/2), where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ and Λ is the diagonal matrix of the corresponding eigenvalues (Σ = Φ Λ Φ^t). Note that Φ is unitary. What is the covariance of y = A_w^t x? Assuming zero-mean x, so that E[xx^t] = Σ:
E[yy^t] = E[(A_w^t x)(A_w^t x)^t] = A_w^t E[xx^t] A_w = Λ^(-1/2) Φ^t Σ Φ Λ^(-1/2) = Λ^(-1/2) Φ^t (Φ Λ Φ^t) Φ Λ^(-1/2) = Λ^(-1/2) Λ Λ^(-1/2) = I
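A minimal numpy sketch of this whitening transformation; the covariance matrix and sample size below are invented for illustration.

```python
# Whitening: eigendecompose Sigma = Phi Lambda Phi^T, form A_w = Phi Lambda^{-1/2},
# and verify that y = A_w^T x has (approximately) identity covariance.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])          # example covariance (assumed)
mu = np.zeros(2)

eigvals, Phi = np.linalg.eigh(Sigma)     # columns of Phi are orthonormal eigenvectors
A_w = Phi @ np.diag(eigvals ** -0.5)     # whitening transform

x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = x @ A_w                              # row convention: each y_i = A_w^T x_i
print(np.cov(y, rowvar=False))           # close to the 2x2 identity matrix
```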

ECE 8443: Lecture 03, Slide 17: Mahalanobis Distance
The weighted Euclidean distance r² = (x − μ)^t Σ^(-1) (x − μ) is known as the (squared) Mahalanobis distance, and represents a statistically normalized distance calculation that results from our whitening transformation. Consider an example using our Java Applet.
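A short numpy sketch (with invented values) computing the Mahalanobis distance directly and via whitened coordinates, to show that the two agree:

```python
# Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu); the same value is
# obtained as an ordinary Euclidean distance after whitening.
import numpy as np

Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
mu = np.array([1.0, 0.0])
x  = np.array([3.0, 1.0])

d  = x - mu
r2 = d @ np.linalg.solve(Sigma, d)       # avoids forming Sigma^{-1} explicitly
print(np.sqrt(r2))                        # Mahalanobis distance

eigvals, Phi = np.linalg.eigh(Sigma)
y = (d @ Phi) * eigvals ** -0.5           # whitened difference vector
print(np.linalg.norm(y))                  # same value as above
```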

ECE 8443: Lecture 03, Slide 18: Discriminant Functions
Recall our discriminant function for minimum error rate classification, g_i(x) = ln p(x|ω_i) + ln P(ω_i), and consider its form when the class-conditional densities are multivariate normal (see below). We will examine the case Σ_i = σ²I (statistical independence, equal variance, class-independent variance).
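Substituting a multivariate normal class-conditional density into g_i(x) = ln p(x|ω_i) + ln P(ω_i) gives the standard form (assumed to match the slide's equation):

g_i(x) = -\frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)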

ECE 8443: Lecture 03, Slide 19: Gaussian Classifiers
For the case Σ_i = σ²I, the terms (d/2) ln 2π and ½ ln|Σ_i| are constant w.r.t. the maximization, so the discriminant function reduces to a scaled Euclidean distance from each class mean plus the log prior. Expanding this quadratic (see below), the term x^t x is a constant w.r.t. i, and μ_i^t μ_i is a constant that can be precomputed.
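The reduced and expanded forms (standard presentation, assumed to match the slide):

g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i) = -\frac{1}{2\sigma^2} \left[ x^t x - 2 \mu_i^t x + \mu_i^t \mu_i \right] + \ln P(\omega_i)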

ECE 8443: Lecture 03, Slide 20: Linear Machines
We can use an equivalent linear discriminant function, g_i(x) = w_i^t x + w_i0, where w_i0 is called the threshold or bias for the i-th category. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are pieces of hyperplanes defined by the equation g_i(x) = g_j(x) for the two categories with the highest discriminant values.
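A minimal Python sketch of such a linear machine for the Σ_i = σ²I case; the class means, variance, and priors are invented for illustration.

```python
# Linear machine: g_i(x) = w_i^T x + w_i0 with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(omega_i).
import numpy as np

means  = np.array([[0.0, 0.0],
                   [3.0, 1.0],
                   [1.0, 4.0]])        # mu_i, one row per class (assumed values)
priors = np.array([0.5, 0.3, 0.2])     # P(omega_i)
sigma2 = 1.0                            # common variance sigma^2

W  = means / sigma2                                              # weight vectors w_i
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)   # biases w_i0

def classify(x):
    """Choose the class with the largest linear discriminant g_i(x)."""
    return int(np.argmax(W @ x + w0))

print(classify(np.array([2.5, 1.2])))   # 1: nearest class mean, adjusted by priors
```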

ECE 8443: Lecture 03, Slide 21: Threshold Decoding
This has a simple geometric interpretation: when the priors are equal and the support regions are spherical, the decision boundary lies simply halfway between the means (a minimum Euclidean distance classifier).
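In the standard form (following D.H.S., Chapter 2; assumed to match the slide), the decision surface between classes i and j for Σ_i = σ²I is the hyperplane

w^t (x - x_0) = 0, \qquad w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)} \, (\mu_i - \mu_j)

For equal priors the log term vanishes and x_0 is exactly the midpoint of the two means.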

ECE 8443: Lecture 03, Slide 22: Summary
Decision surfaces: geometric interpretation of a Bayesian classifier.
Gaussian distributions: how is the shape of the distribution influenced by the mean and covariance?
Bayesian classifiers for Gaussian distributions: how does the decision surface change as a function of the mean and covariance?