LECTURE 04: DECISION SURFACES


LECTURE 04: DECISION SURFACES

Objectives:
• Multi-category Decision Surfaces
• Discriminant Functions
• Mahalanobis Distance
• Principal Component Analysis
• Gaussian Distributions
• Threshold Decoding

Resources:
• D.H.S.: Chapter 2 (Part 3)
• K.F.: Intro to PR
• X.Z.: PR Course
• A.N.: Gaussian Discriminants
• M.R.: Chernoff Bounds
• S.S.: Bhattacharyya

Likelihood

By employing Bayes' formula, we can replace the posteriors with the prior probabilities and conditional densities. Choose ω_1 if:

(λ_21 − λ_11) p(x|ω_1) P(ω_1) > (λ_12 − λ_22) p(x|ω_2) P(ω_2);

otherwise decide ω_2.

If λ_21 − λ_11 is positive, our rule becomes a likelihood ratio test: choose ω_1 if

p(x|ω_1) / p(x|ω_2) > [(λ_12 − λ_22) P(ω_2)] / [(λ_21 − λ_11) P(ω_1)]

If the loss factors are identical and the prior probabilities are equal, this reduces to a standard likelihood ratio: choose ω_1 if p(x|ω_1) / p(x|ω_2) > 1.
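
As a quick sketch of this rule, the following code implements the two-class likelihood ratio test with one-dimensional Gaussian class-conditional densities; the densities, priors, and loss values are illustrative assumptions, not values from the lecture:

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities: p(x|w1) ~ N(0,1), p(x|w2) ~ N(2,1)
p1 = norm(loc=0.0, scale=1.0)
p2 = norm(loc=2.0, scale=1.0)

P1, P2 = 0.6, 0.4                        # hypothetical priors
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0  # hypothetical losses: lij = loss for choosing wi when wj is true

def decide(x):
    """Choose w1 when the likelihood ratio exceeds the loss-weighted prior ratio."""
    ratio = p1.pdf(x) / p2.pdf(x)
    threshold = ((l12 - l22) * P2) / ((l21 - l11) * P1)
    return "w1" if ratio > threshold else "w2"

print(decide(0.5))   # near the w1 mean -> "w1"
print(decide(1.8))   # near the w2 mean -> "w2"
```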

Multicategory Decision Surfaces

Define a set of discriminant functions: g_i(x), i = 1, …, c.

Define a decision rule: choose ω_i if g_i(x) > g_j(x) ∀ j ≠ i.

For a Bayes classifier, let g_i(x) = −R(ω_i|x), because the maximum discriminant function will then correspond to the minimum conditional risk.

For the minimum error rate case, let g_i(x) = P(ω_i|x), so that the maximum discriminant function corresponds to the maximum posterior probability.

The choice of discriminant function is not unique: we can multiply all g_i(x) by the same positive constant, add the same constant to each, or replace g_i(x) with any monotonically increasing function f(g_i(x)) without changing the decision.
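
The decision rule above is just an argmax over the discriminants. A minimal sketch for the minimum error rate case, using g_i(x) = ln p(x|ω_i) + ln P(ω_i) and a made-up three-class Gaussian problem:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D, 3-class problem: class-conditional Gaussians and priors
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs = [np.eye(2)] * 3
priors = [0.5, 0.3, 0.2]

def g(x, i):
    """Minimum error rate discriminant: log p(x|wi) + log P(wi)."""
    return multivariate_normal.logpdf(x, means[i], covs[i]) + np.log(priors[i])

def classify(x):
    """Choose wi if g_i(x) > g_j(x) for all j != i."""
    return int(np.argmax([g(x, i) for i in range(3)]))

print(classify(np.array([2.5, 0.5])))  # -> 1 (closest to the second mean)
```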

Decision Surfaces

We can visualize our decision rule several ways: choose ω_i if g_i(x) > g_j(x) ∀ j ≠ i.

Network Representation of a Classifier

A classifier can be visualized as a connected graph with arcs and weights. What are the advantages of this type of visualization?

Multivariate Normal Distributions

A multivariate normal distribution is defined as:

p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^t Σ^{-1} (x − μ))

where μ represents the mean (vector) and Σ represents the covariance (matrix). Note that the exponent is really a dot product, a weighted Euclidean distance. The covariance matrix is always symmetric and positive semidefinite. How does the shape of the distribution vary as a function of the covariance?
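
To make the formula concrete, here is a sketch that evaluates the density directly and checks it against scipy's implementation; the mean, covariance, and test point are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def gaussian_pdf(x, mu, Sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^t Sigma^-1 (x-mu))."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** -0.5
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

x = np.array([0.5, 1.5])
print(gaussian_pdf(x, mu, Sigma))                      # direct evaluation
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # should match
```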

Coordinate Transformations

Why is it convenient to convert an arbitrary distribution into a spherical one? (Hint: Euclidean distance)

Consider the whitening transformation A_w = Φ Λ^{-1/2}, where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ is the diagonal matrix of corresponding eigenvalues (so that Σ = Φ Λ Φ^t). Note that Φ is orthogonal: Φ^t Φ = Φ Φ^t = I.

What is the covariance of y = A_w^t x (for zero-mean x)?

E[y y^t] = E[(A_w^t x)(A_w^t x)^t]
         = A_w^t E[x x^t] A_w
         = Λ^{-1/2} Φ^t Σ Φ Λ^{-1/2}
         = Λ^{-1/2} Φ^t (Φ Λ Φ^t) Φ Λ^{-1/2}
         = Λ^{-1/2} Λ Λ^{-1/2}
         = I

This eigendecomposition is the basis of Principal Component Analysis (PCA): examining the eigenvectors of the covariance matrix provides information about the relationships between features.
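
A numerical sketch of this whitening transformation on synthetic data (the covariance matrix below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean data with a non-spherical covariance
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)

# Eigendecomposition of the sample covariance: Sigma = Phi Lambda Phi^t
cov = np.cov(X, rowvar=False)
eigvals, Phi = np.linalg.eigh(cov)        # eigh: for symmetric matrices
A_w = Phi @ np.diag(eigvals ** -0.5)      # A_w = Phi Lambda^{-1/2}

# Whitened data: each row of Y is y^t = (A_w^t x)^t = x^t A_w
Y = X @ A_w
print(np.round(np.cov(Y, rowvar=False), 2))  # ~ identity matrix
```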

Mahalanobis Distance

The weighted Euclidean distance

d²(x, μ) = (x − μ)^t Σ^{-1} (x − μ)

is known as the (squared) Mahalanobis distance. It represents a statistically normalized distance calculation: it is the Euclidean distance computed after our whitening transformation. Consider an example using our Java Applet. This distance measure plays an important role in Gaussian mixture modeling of data, hidden Markov models, etc.
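
A short sketch (all parameter values invented) verifying that the Mahalanobis distance equals the Euclidean distance computed after whitening:

```python
import numpy as np

mu = np.array([1.0, 0.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])
x = np.array([2.0, 1.0])

# Mahalanobis distance: sqrt((x - mu)^t Sigma^-1 (x - mu))
diff = x - mu
d_mahal = np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# Same quantity via whitening: Euclidean distance between whitened points
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)
d_white = np.linalg.norm(A_w.T @ x - A_w.T @ mu)

print(d_mahal, d_white)  # identical up to rounding
```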

Discriminant Functions

Recall our discriminant function for minimum error rate classification:

g_i(x) = ln p(x|ω_i) + ln P(ω_i)

For a multivariate normal distribution:

g_i(x) = −(1/2)(x − μ_i)^t Σ_i^{-1} (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ω_i)

Consider the case Σ_i = σ²I (statistically independent features with equal, class-independent variance).

Gaussian Classifiers

For Σ_i = σ²I, the discriminant function can be reduced. Since the terms −(d/2) ln 2π and −(1/2) ln |Σ_i| are constant w.r.t. the maximization over i:

g_i(x) = −||x − μ_i||² / (2σ²) + ln P(ω_i)

We can expand this:

g_i(x) = −(x^t x − 2 μ_i^t x + μ_i^t μ_i) / (2σ²) + ln P(ω_i)

The term x^t x is constant w.r.t. i and can be dropped, and μ_i^t μ_i is a constant that can be precomputed.

Linear Machines

We can therefore use an equivalent linear discriminant function:

g_i(x) = w_i^t x + w_i0, where w_i = μ_i / σ² and w_i0 = −μ_i^t μ_i / (2σ²) + ln P(ω_i)

w_i0 is called the threshold or bias for the i-th category. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are defined by the equations g_i(x) = g_j(x) for the categories with the highest discriminant values; in this case they are hyperplanes.
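
A sketch of such a linear machine for the Σ_i = σ²I case; the means, variance, and priors below are hypothetical:

```python
import numpy as np

# Hypothetical 3-class problem with Sigma_i = sigma^2 I
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
priors = np.array([0.5, 0.25, 0.25])
sigma2 = 1.0

# Linear machine: g_i(x) = w_i^t x + w_i0
W = means / sigma2                                               # w_i = mu_i / sigma^2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)   # bias terms w_i0

def classify(x):
    return int(np.argmax(W @ x + w0))

print(classify(np.array([3.0, 0.5])))  # -> 1
print(classify(np.array([0.5, 0.5])))  # -> 0
```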

Threshold Decoding

This has a simple geometric interpretation: when the priors are equal and the support regions are spherical (Σ_i = σ²I), the decision boundary lies simply halfway between the means, and classification amounts to choosing the class whose mean is closest in Euclidean distance.
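
A quick numerical check of the midpoint property under equal priors (the means are synthetic):

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
sigma2, prior = 1.0, 0.5  # equal priors, spherical covariance

def g(x, mu):
    """Reduced discriminant for Sigma_i = sigma^2 I."""
    return -np.sum((x - mu)**2) / (2 * sigma2) + np.log(prior)

midpoint = (mu1 + mu2) / 2
print(np.isclose(g(midpoint, mu1), g(midpoint, mu2)))  # True: boundary passes through the midpoint
```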

General Case for Gaussian Classifiers

For arbitrary Gaussian densities, the minimum error rate discriminant function is quadratic:

g_i(x) = x^t W_i x + w_i^t x + w_i0

where W_i = −(1/2) Σ_i^{-1}, w_i = Σ_i^{-1} μ_i, and w_i0 = −(1/2) μ_i^t Σ_i^{-1} μ_i − (1/2) ln |Σ_i| + ln P(ω_i).

The special cases below differ only in the assumptions made about Σ_i.

Identity Covariance Case: Σ_i = σ²I

The quadratic term is class-independent, so:

g_i(x) = −||x − μ_i||² / (2σ²) + ln P(ω_i)

This can be rewritten as a linear discriminant:

g_i(x) = w_i^t x + w_i0, with w_i = μ_i / σ² and w_i0 = −μ_i^t μ_i / (2σ²) + ln P(ω_i)

Equal Covariances Case: Σ_i = Σ

When all classes share the same covariance matrix, the term x^t Σ^{-1} x is constant across classes and drops out, so the discriminant is again linear:

g_i(x) = w_i^t x + w_i0, with w_i = Σ^{-1} μ_i and w_i0 = −(1/2) μ_i^t Σ^{-1} μ_i + ln P(ω_i)

The decision surfaces are still hyperplanes, but they are generally not orthogonal to the line joining the means.
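
A sketch of the shared-covariance linear discriminant with invented parameters:

```python
import numpy as np

# Hypothetical 2-class problem with a shared covariance matrix
means = np.array([[0.0, 0.0], [2.0, 1.0]])
priors = np.array([0.5, 0.5])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant: g_i(x) = w_i^t x + w_i0
W = means @ Sigma_inv                                   # row i is w_i^t = mu_i^t Sigma^-1
w0 = -0.5 * np.einsum('ij,jk,ik->i', means, Sigma_inv, means) + np.log(priors)

def classify(x):
    return int(np.argmax(W @ x + w0))

print(classify(np.array([1.5, 1.0])))  # -> 1
```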

Arbitrary Covariances

When each class has its own covariance matrix Σ_i, the quadratic term x^t W_i x no longer cancels, and the discriminant functions remain quadratic. The resulting decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids), and the decision regions need not even be connected.
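
Finally, a sketch of the full quadratic discriminant with per-class covariances (all parameters invented for illustration):

```python
import numpy as np

# Hypothetical 2-class problem with different covariance matrices
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.array([[1.0, 0.0], [0.0, 1.0]]),
        np.array([[3.0, 0.0], [0.0, 0.3]])]
priors = [0.5, 0.5]

def g(x, i):
    """Quadratic discriminant: x^t W_i x + w_i^t x + w_i0."""
    Si = np.linalg.inv(covs[i])
    Wi = -0.5 * Si
    wi = Si @ means[i]
    wi0 = (-0.5 * means[i] @ Si @ means[i]
           - 0.5 * np.log(np.linalg.det(covs[i]))
           + np.log(priors[i]))
    return x @ Wi @ x + wi @ x + wi0

def classify(x):
    return int(np.argmax([g(x, i) for i in range(2)]))

print(classify(np.array([1.0, 0.0])))  # -> 1
print(classify(np.array([1.0, 2.0])))  # -> 0: class 1's small y-variance penalizes points far from y = 0
```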

Typical Examples of 2D Classifiers

Summary

• Decision surfaces: the geometric interpretation of a Bayesian classifier.
• Gaussian distributions: how is the shape of the distribution influenced by the mean and covariance?
• Threshold decoding: data modeled as a multivariate Gaussian distribution can be classified using a simple threshold decoding scheme.
• Bayesian classifiers for Gaussian distributions: how does the decision surface change as a function of the mean and covariance?