Lecture 2. Bayesian Decision Theory

Lecture 2. Bayesian Decision Theory
Outline:
- Bayes decision rule
- Loss function
- Decision surface
- Multivariate normal density and discriminant functions

Bayes Decision
Bayesian decision making applies when all the underlying probability distributions are known; given those distributions, it is optimal. For two classes ω1 and ω2, the prior probabilities for an unknown new observation are:
- P(ω1): the probability that the new observation belongs to class 1
- P(ω2): the probability that the new observation belongs to class 2
- P(ω1) + P(ω2) = 1
The priors reflect our prior knowledge, and they give the decision rule when no feature of the new object is available: classify as class 1 if P(ω1) > P(ω2).

Bayes Decision
We observe a feature vector x for each object. P(x|ω1) and P(x|ω2) are the class-specific (class-conditional) densities. The Bayes rule:
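In standard notation, the rule combines the class-conditional densities with the priors to give the posteriors:

    P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}, \qquad
    p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j),

and we classify the observation as the class with the larger posterior.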

Bayes Decision
The class-conditional density p(x|ωj) is the likelihood of observing x given the class label ωj.

Bayes Decision
Posterior probabilities P(ωj|x): the probability of each class after observing x.

Loss function
The loss function turns a probability statement into a decision; some classification mistakes can be more costly than others. We define:
- the set of c classes {ω1, …, ωc};
- the set of possible actions {α1, …, αc}, where αi is deciding that an observation belongs to class i;
- the loss λ(αi | ωj) incurred when taking action αi while the observation actually belongs to the hidden class ωj.

Loss function
The expected loss: given an observation with covariate vector x, the conditional risk of taking action αi is the loss averaged over the posterior probabilities. At every x a decision α(x) is made by minimizing this expected loss; the final goal is to minimize the total risk over all x.
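Written out in standard notation, the conditional risk and the total risk are

    R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x),
    \qquad
    R = \int R(\alpha(x) \mid x)\, p(x)\, dx .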

Loss function
The zero-one loss: all errors are equally costly. The conditional risk becomes one minus the posterior of the selected class, so the risk corresponding to this loss function is the average probability of error.
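In symbols, the zero-one loss and its conditional risk are

    \lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}
    \qquad\Rightarrow\qquad
    R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x),

so minimizing the conditional risk is the same as choosing the class with the largest posterior.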

Loss function
Let λij denote the loss for deciding class i when the true class is j. In minimizing the risk, we decide class 1 if the expected loss of action α1 is smaller than that of α2; rearranging gives a threshold test on the likelihood ratio.
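In formulas: decide ω1 when

    (\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid x),

which, after substituting the Bayes rule and dividing, becomes

    \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} >
    \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} .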

Loss function Example:
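The worked example from the slide is not preserved in the transcript; a minimal substitute sketch in Python, with invented priors, class-conditional densities, and losses, shows how the likelihood-ratio test above is evaluated:

    from scipy.stats import norm

    # All numbers below are invented for illustration only.
    prior = {1: 0.6, 2: 0.4}                        # P(w1), P(w2)
    dens  = {1: norm(0.0, 1.0), 2: norm(2.0, 1.0)}  # p(x|w1), p(x|w2)
    lam   = {(1, 1): 0.0, (1, 2): 2.0,              # lam[(i, j)] = loss of deciding class i
             (2, 1): 1.0, (2, 2): 0.0}              # when the true class is j

    def decide(x):
        # Decide w1 if the likelihood ratio exceeds the loss- and prior-dependent threshold.
        ratio = dens[1].pdf(x) / dens[2].pdf(x)
        theta = ((lam[(1, 2)] - lam[(2, 2)]) / (lam[(2, 1)] - lam[(1, 1)])) * prior[2] / prior[1]
        return 1 if ratio > theta else 2

    print(decide(0.5), decide(1.5))  # -> 1 2

Because misclassifying ω2 is given the larger loss here, the threshold is higher than the zero-one threshold P(ω2)/P(ω1), so class 1 is chosen less readily.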

Loss function
The decision can be read off the likelihood ratio p(x|ω1)/p(x|ω2). With the zero-one loss function the threshold is P(ω2)/P(ω1); if misclassifying ω2 is penalized more heavily, the threshold rises and we decide ω1 less often.

Discriminant function & decision surface
Features → discriminant functions gi(x), i = 1, …, c.
Assign class i if gi(x) > gj(x) for all j ≠ i.
The decision surface is defined by gi(x) = gj(x).
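A minimal Python sketch of this rule (the helper name and example discriminants are hypothetical, not from the slides): represent each gi as a callable and take the arg-max.

    import numpy as np

    def classify(x, discriminants):
        # discriminants: list of callables g_i; return the index i with the largest g_i(x).
        scores = [g(x) for g in discriminants]
        return int(np.argmax(scores))

    # Two invented linear discriminants on a 2-D feature vector.
    g1 = lambda x: np.dot([1.0, -0.5], x) + 0.2
    g2 = lambda x: np.dot([-0.3, 0.8], x) - 0.1
    print(classify(np.array([0.4, 0.7]), [g1, g2]))  # -> 1 (i.e., g2 wins)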

Decision surface The discriminant functions help partition the feature space into c decision regions (not necessarily contiguous). Our interest is to estimate the boundaries between the regions.

Minimax
Minimax: minimize the maximum possible risk. What happens to the decision rule when the priors change?
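For the zero-one loss, the minimax rule places the boundary so that the two kinds of error are equally probable, which makes the resulting risk independent of the priors:

    \int_{\mathcal{R}_2} p(x \mid \omega_1)\, dx = \int_{\mathcal{R}_1} p(x \mid \omega_2)\, dx .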

Normal density
Reminder: the covariance matrix is symmetric and positive semidefinite. Entropy is the measure of uncertainty; the normal distribution has the maximum entropy over all distributions with a given mean and variance.
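For reference, the multivariate normal density in d dimensions, with mean vector μ and covariance matrix Σ, is

    p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
           \exp\!\left[-\tfrac{1}{2}(x-\mu)^{t}\Sigma^{-1}(x-\mu)\right].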

Reminder of some results for random vectors
Let Σ be a k×k symmetric matrix; then it has k pairs of eigenvalues and eigenvectors, and it can be decomposed accordingly. A positive-definite matrix is one whose quadratic form is strictly positive for every nonzero vector.
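In symbols (standard linear-algebra results):

    \Sigma = \sum_{i=1}^{k} \lambda_i\, e_i e_i^{t} = \Phi \Lambda \Phi^{t},
    \qquad \Phi = [e_1, \ldots, e_k],\ \ \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k),

and Σ is positive definite when x^{t}Σx > 0 for every x ≠ 0, i.e., when all λi > 0.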

Normal density Whitening transform:
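The whitening transform A_w = Φ Λ^{-1/2} maps the data so that the transformed covariance is the identity, since A_w^{t} Σ A_w = Λ^{-1/2} Φ^{t} Σ Φ Λ^{-1/2} = I. A small numpy sketch (covariance values invented) verifies this:

    import numpy as np

    sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])          # example covariance matrix (invented numbers)
    lam, phi = np.linalg.eigh(sigma)        # eigenvalues and orthonormal eigenvectors
    a_w = phi @ np.diag(lam ** -0.5)        # whitening transform A_w = Phi * Lambda^{-1/2}
    print(a_w.T @ sigma @ a_w)              # approximately the identity matrix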

Normal density
To make a minimum error rate classification (zero-one loss), we use the discriminant functions gi(x) = ln p(x|ωi) + ln P(ωi). This is the log of the numerator in the Bayes formula; the log is used because we are only comparing the gi's, and the log is monotone. Now assume a normal density for p(x|ωi).
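Substituting the normal density and taking the log gives

    g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i)
             - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).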

Discriminant function for normal density
Case 1: Σi = σ²I. The discriminant function is linear. Note: the blue boxes mark terms that are identical for all classes and therefore irrelevant to the comparison.
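Dropping those class-independent terms leaves a linear function of x:

    g_i(x) = w_i^{t} x + w_{i0}, \qquad
    w_i = \frac{\mu_i}{\sigma^{2}}, \qquad
    w_{i0} = -\frac{\mu_i^{t}\mu_i}{2\sigma^{2}} + \ln P(\omega_i).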

Discriminant function for normal density
The decision surface is where gi(x) = gj(x). With equal priors, x0 is the midpoint between the two means, and the decision surface is a hyperplane perpendicular to the line between the means.
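Setting gi(x) = gj(x) gives the hyperplane

    w^{t}(x - x_0) = 0, \qquad w = \mu_i - \mu_j, \qquad
    x_0 = \tfrac{1}{2}(\mu_i + \mu_j)
          - \frac{\sigma^{2}}{\lVert \mu_i - \mu_j \rVert^{2}}
            \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j).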

Discriminant function for normal density
"Linear machine": the decision surfaces are hyperplanes.

Discriminant function for normal density
With unequal prior probabilities, the decision boundary shifts toward the less likely mean.


Discriminant function for normal density
Case 2: Σi = Σ (all classes share the same covariance matrix). Setting gi(x) = gj(x) gives the decision boundary.
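Dropping the terms that are the same for every class, the discriminant and the resulting boundary are

    g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) + \ln P(\omega_i),

    w^{t}(x - x_0) = 0, \qquad w = \Sigma^{-1}(\mu_i - \mu_j), \qquad
    x_0 = \tfrac{1}{2}(\mu_i + \mu_j)
          - \frac{\ln\!\left[P(\omega_i)/P(\omega_j)\right]}
                 {(\mu_i - \mu_j)^{t}\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j).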

Discriminant function for normal density The hyperplane is generally not perpendicular to the line between the means.

Discriminant function for normal density
Case 3: Σi is arbitrary. The decision boundaries are hyperquadrics (hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids).
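In this general case the discriminant is quadratic in x:

    g_i(x) = x^{t} W_i x + w_i^{t} x + w_{i0}, \qquad
    W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad
    w_i = \Sigma_i^{-1}\mu_i, \qquad
    w_{i0} = -\tfrac{1}{2}\mu_i^{t}\Sigma_i^{-1}\mu_i - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).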


Discriminant function for normal density
Extension to the multi-class case.

Discriminant function for discrete features
Discrete features: x = [x1, x2, …, xd]^t, with each xi ∈ {0, 1}. Let pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2). The likelihood will be:
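Assuming, as is standard for this model, that the components are conditionally independent given the class, the likelihoods are products of Bernoulli terms:

    P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1-p_i)^{1-x_i}, \qquad
    P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1-q_i)^{1-x_i}.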

Discriminant function for discrete features
The discriminant function is obtained from the log of the likelihood ratio together with the log prior ratio.
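Taking logs gives a function that is linear in the xi (decide ω1 when g(x) > 0):

    g(x) = \sum_{i=1}^{d} w_i x_i + w_0, \qquad
    w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)}, \qquad
    w_0 = \sum_{i=1}^{d}\ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}.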

Discriminant function for discrete features So the decision surface is again a hyperplane.

Optimality
Consider the two-class case. There are two ways to make a mistake in the classification:
- misclassifying an observation from class 2 as class 1;
- misclassifying an observation from class 1 as class 2.
Any classifier partitions the feature space into two regions, R1 and R2.
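The two error contributions give the total probability of error:

    P(\text{error}) = P(x \in \mathcal{R}_1, \omega_2) + P(x \in \mathcal{R}_2, \omega_1)
                    = \int_{\mathcal{R}_1} p(x \mid \omega_2)P(\omega_2)\, dx
                    + \int_{\mathcal{R}_2} p(x \mid \omega_1)P(\omega_1)\, dx.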


Optimality
In the multi-class case there are many ways to make mistakes, so it is easier to calculate the probability of correct classification. The Bayes classifier maximizes P(correct); any other partitioning yields a higher probability of error. This result does not depend on the form of the underlying distributions.
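In symbols,

    P(\text{correct}) = \sum_{i=1}^{c} \int_{\mathcal{R}_i} p(x \mid \omega_i)P(\omega_i)\, dx,

which is maximized by choosing each region Ri to be where p(x|ωi)P(ωi) is largest, i.e., the Bayes rule.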