Bayesian Decision Theory Chapter 2 (Duda et al.) – Sections 2.1-2.10 CS479/679 Pattern Recognition Dr. George Bebis

Bayesian Decision Theory Design classifiers to recommend decisions that minimize some total expected "risk". The simplest risk is the classification error (i.e., all errors cost the same). More generally, the risk includes the cost associated with different decisions.

Terminology State of nature ω (random variable): e.g., ω1 for sea bass, ω2 for salmon Probabilities P(ω1) and P(ω2) (priors): e.g., prior knowledge of how likely it is to get a sea bass or a salmon Probability density function p(x) (evidence): e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)

Terminology (cont’d) Conditional probability density p(x/ωj) (likelihood): e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj; e.g., the lightness distributions of the salmon and sea-bass populations.

Terminology (cont’d) Conditional probability P(ωj/x) (posterior): e.g., the probability that the fish belongs to class ωj given measurement x.

Decision Rule Using Prior Probabilities Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2. The probability of error is P(error) = min[P(ω1), P(ω2)]. This rule favours the most likely class, so it makes the same decision every time; it is optimal only if no other information is available.

Decision Rule Using Conditional Probabilities Using Bayes’ rule, the posterior probability of category ωj given measurement x is given by: P(ωj/x) = p(x/ωj)P(ωj) / p(x), where p(x) = p(x/ω1)P(ω1) + p(x/ω2)P(ω2) (i.e., a scale factor that makes the posteriors sum to 1). Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2. Equivalently, decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2.
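
A minimal sketch of this rule in Python (the numeric likelihoods and priors below are made-up values standing in for p(x/ωj) and P(ωj) at a single measurement x):

```python
import numpy as np

def posterior(likelihoods, priors):
    """Compute P(w_j / x) from p(x / w_j) and P(w_j) via Bayes' rule."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors      # p(x / w_j) * P(w_j)
    return joint / joint.sum()        # divide by the evidence p(x)

# Hypothetical values at a single measurement x
p_x_given_w = [0.30, 0.05]   # p(x/w1), p(x/w2)
p_w = [0.4, 0.6]             # P(w1), P(w2)

post = posterior(p_x_given_w, p_w)
decision = int(np.argmax(post)) + 1  # decide the class with the larger posterior
print(post, "-> decide w%d" % decision)
```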

Decision Rule Using Conditional pdf (cont’d) [Figures: the class-conditional densities p(x/ωj) and the resulting posteriors P(ωj/x) plotted as functions of x.]

Probability of Error For a given x, the probability of error is P(error/x) = P(ω1/x) if we decide ω2, and P(ω2/x) if we decide ω1; under the Bayes rule this becomes P(error/x) = min[P(ω1/x), P(ω2/x)]. The average probability of error is P(error) = ∫ P(error/x) p(x) dx. The Bayes rule is optimal, that is, it minimizes the average probability of error!

Where do Probabilities Come From? There are two competing answers to this question: (1) The relative-frequency (objective) approach: probabilities can only come from experiments. (2) The Bayesian (subjective) approach: probabilities may reflect degrees of belief and can be based on opinion.

Example (objective approach) Classify cars according to whether they cost more or less than $50K: Classes: C1 if price > $50K, C2 if price <= $50K. Feature: x, the height of the car. Use Bayes’ rule to compute the posterior probabilities: P(Ci/x) = p(x/Ci)P(Ci) / p(x). We need to estimate p(x/C1), p(x/C2), P(C1), P(C2).

Example (cont’d) Collect data: ask drivers how much their car cost and measure its height. Determine the prior probabilities P(C1), P(C2) from the class counts; e.g., with 1209 samples, #C1 = 221 and #C2 = 988, giving P(C1) ≈ 221/1209 ≈ 0.18 and P(C2) ≈ 988/1209 ≈ 0.82.

Example (cont’d) Determine the class-conditional probabilities (likelihoods): discretize the car height into bins and use a normalized histogram for each class.

Example (cont’d) Calculate the posterior probability for each bin, e.g., P(C1/x) = p(x/C1)P(C1) / (p(x/C1)P(C1) + p(x/C2)P(C2)).
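
A sketch of the whole car example in Python, assuming made-up height measurements (heights_c1 and heights_c2 are hypothetical arrays standing in for the collected data):

```python
import numpy as np

# Hypothetical raw data: heights (in meters) of cars in each class
heights_c1 = np.array([1.55, 1.62, 1.70, 1.74, 1.80])                # price > $50K
heights_c2 = np.array([1.40, 1.45, 1.48, 1.52, 1.58, 1.60, 1.63])    # price <= $50K

# Priors from class counts
n1, n2 = len(heights_c1), len(heights_c2)
P_C1, P_C2 = n1 / (n1 + n2), n2 / (n1 + n2)

# Class-conditional likelihoods: normalized histograms over common bins
bins = np.linspace(1.3, 1.9, 13)
p_x_C1, _ = np.histogram(heights_c1, bins=bins, density=True)
p_x_C2, _ = np.histogram(heights_c2, bins=bins, density=True)

# Posterior P(C1/x) for each bin (where the evidence is non-zero)
evidence = p_x_C1 * P_C1 + p_x_C2 * P_C2
post_C1 = np.divide(p_x_C1 * P_C1, evidence,
                    out=np.zeros_like(evidence), where=evidence > 0)
print(np.round(post_C1, 2))
```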

A More General Theory Use more than one feature. Allow more than two categories. Allow actions other than classifying the input into one of the possible categories (e.g., rejection). Employ a more general error function (i.e., a "risk" function) by associating a "cost" (a "loss" function) with each error (i.e., each wrong action).

Terminology Features form a vector x. A finite set of c categories ω1, ω2, …, ωc. Bayes rule (using vector notation): P(ωj/x) = p(x/ωj)P(ωj) / p(x), where p(x) = Σj p(x/ωj)P(ωj). A finite set of l actions α1, α2, …, αl. A loss function λ(αi/ωj): the cost associated with taking action αi when the correct classification category is ωj.

Conditional Risk (or Expected Loss) Suppose we observe x and take action αi. If the cost of taking action αi when ωj is the correct category is λ(αi/ωj), then the conditional risk (or expected loss) of taking action αi is: R(αi/x) = Σj λ(αi/ωj) P(ωj/x), where the sum runs over the c categories.
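
A minimal sketch of the conditional-risk computation, assuming a hypothetical loss matrix and posterior values (the reject action and all numbers are illustrative, not from the slides):

```python
import numpy as np

def conditional_risk(loss, posteriors):
    """R(a_i / x) = sum_j loss[i, j] * P(w_j / x), for every action a_i."""
    return np.asarray(loss, dtype=float) @ np.asarray(posteriors, dtype=float)

# Hypothetical loss matrix: rows = actions (decide w1, decide w2, reject),
# columns = true categories (w1, w2)
loss = np.array([[0.0,  2.0],
                 [1.0,  0.0],
                 [0.25, 0.25]])   # a fixed cost for rejecting

posteriors = np.array([0.4, 0.6])   # P(w1/x), P(w2/x) at some x
R = conditional_risk(loss, posteriors)
best_action = int(np.argmin(R))     # Bayes rule: take the minimum-risk action
print(R, "-> action", best_action)
```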

Overall Risk Suppose α(x) is a general decision rule that determines which of the actions α1, α2, …, αl to take for every x; the overall risk is then defined as: R = ∫ R(α(x)/x) p(x) dx. The optimum decision rule is the Bayes rule.

Overall Risk (cont’d) The Bayes decision rule minimizes R by: (i) computing R(αi/x) for every αi given an x, and (ii) choosing the action αi with the minimum R(αi/x). The resulting minimum overall risk is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved.

Example: Two-category classification (c=2) Define α1: decide ω1, α2: decide ω2, and λij = λ(αi/ωj). The conditional risks are: R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x) and R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x).

Example: Two-category classification (cont’d) Minimum-risk decision rule: decide ω1 if R(α1/x) < R(α2/x); equivalently, decide ω1 if (λ21 − λ11) P(ω1/x) > (λ12 − λ22) P(ω2/x); or, using the likelihood ratio, decide ω1 if p(x/ω1)/p(x/ω2) > (λ12 − λ22)P(ω2) / ((λ21 − λ11)P(ω1)), i.e., likelihood ratio > threshold.
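
A sketch of the likelihood-ratio test, assuming hypothetical 1-D Gaussian class-conditional densities and made-up loss values λij:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical 1-D class-conditional densities and priors
mu1, s1, mu2, s2 = 0.0, 1.0, 2.0, 1.0
P1, P2 = 0.7, 0.3

# Hypothetical losses: lam_ij = cost of deciding w_i when w_j is the true class
lam11, lam12, lam21, lam22 = 0.0, 3.0, 1.0, 0.0
threshold = (lam12 - lam22) * P2 / ((lam21 - lam11) * P1)

x = 1.2
ratio = gaussian_pdf(x, mu1, s1) / gaussian_pdf(x, mu2, s2)
print("decide w1" if ratio > threshold else "decide w2")
```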

Special Case: Zero-One Loss Function Assign the same loss to all errors: λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j. The conditional risk corresponding to this loss function is: R(αi/x) = Σ(j≠i) P(ωj/x) = 1 − P(ωi/x).

Special Case: Zero-One Loss Function (cont’d) The decision rule becomes: decide ω1 if P(ω1/x) > P(ω2/x); equivalently, decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1). In this case, the overall risk is the average probability of error!

Example Assuming zero-one loss: decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1); otherwise decide ω2. Assuming a general loss: the threshold becomes (λ12 − λ22)P(ω2) / ((λ21 − λ11)P(ω1)) instead. [Figure: the likelihood ratio p(x/ω1)/p(x/ω2) and the decision regions induced by the two thresholds.]

Discriminant Functions A useful way to represent classifiers is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if: gi(x) > gj(x) for all j ≠ i.

Discriminants for Bayes Classifier Assuming a general loss function: gi(x) = −R(αi/x). Assuming the zero-one loss function: gi(x) = P(ωi/x).

Discriminants for Bayes Classifier (cont’d) Is the choice of gi unique? No: replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results. Equivalent forms include gi(x) = P(ωi/x), gi(x) = p(x/ωi)P(ωi), and gi(x) = ln p(x/ωi) + ln P(ωi); we’ll use this last form extensively!

Case of two categories It is more common to use a single discriminant function (dichotomizer) instead of two: g(x) = g1(x) − g2(x), and decide ω1 if g(x) > 0. Examples: g(x) = P(ω1/x) − P(ω2/x), or g(x) = ln[p(x/ω1)/p(x/ω2)] + ln[P(ω1)/P(ω2)].

Decision Regions and Boundaries Decision rules divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries. A decision boundary between two regions is defined by: gi(x) = gj(x), e.g., g1(x) = g2(x).

Discriminant Function for Multivariate Gaussian Density Consider the discriminant function gi(x) = ln p(x/ωi) + ln P(ωi) with Gaussian class-conditional densities p(x/ωi) ~ N(μi, Σi): gi(x) = −(1/2)(x − μi)^T Σi^(−1) (x − μi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(ωi).
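
A sketch of this discriminant in Python (the means, covariances and priors are hypothetical; gaussian_log_discriminant is just an illustrative helper name):

```python
import numpy as np

def gaussian_log_discriminant(x, mu, Sigma, prior):
    """g_i(x) = ln p(x/w_i) + ln P(w_i) for p(x/w_i) ~ N(mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    mahal2 = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^-1 (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * mahal2 - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + np.log(prior)

# Hypothetical two-class, two-feature problem
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.6, 0.4]

x = np.array([1.5, 2.0])
g = [gaussian_log_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("decide w%d" % (int(np.argmax(g)) + 1))
```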

Multivariate Gaussian Density: Case I Σi = σ²I (diagonal): the features are statistically independent and each feature has the same variance σ². The discriminant simplifies to gi(x) = −||x − μi||²/(2σ²) + ln P(ωi); the ln P(ωi) term favours the a-priori more likely category.

Multivariate Gaussian Density: Case I (cont’d) The discriminant is linear, gi(x) = wi^T x + wi0, with wi = μi/σ² and wi0 = −μi^T μi/(2σ²) + ln P(ωi). The decision boundary gi(x) = gj(x) is the hyperplane w^T(x − x0) = 0, where w = μi − μj and x0 = (μi + μj)/2 − [σ²/||μi − μj||²] ln[P(ωi)/P(ωj)] (μi − μj).

Multivariate Gaussian Density: Case I (cont’d) Properties of the decision boundary: It passes through x0. It is orthogonal to the line linking the means. What happens when P(ωi) = P(ωj)? If P(ωi) = P(ωj), then x0 is the midpoint of the means; if P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean. If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).

Multivariate Gaussian Density: Case I (cont’d) [Figures: decision boundaries for several prior settings. If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean, enlarging the region of the more likely category.]

Multivariate Gaussian Density: Case I (cont’d) Minimum-distance classifier: when the priors P(ωi) are equal, the rule reduces to maximizing gi(x) = −||x − μi||², i.e., assign x to the category with the nearest mean (minimum Euclidean distance).

Multivariate Gaussian Density: Case II Σi = Σ (all classes share the same covariance matrix). The discriminant becomes gi(x) = −(1/2)(x − μi)^T Σ^(−1)(x − μi) + ln P(ωi).

Multivariate Gaussian Density: Case II (cont’d) Expanding the quadratic term, the discriminant is again linear: gi(x) = wi^T x + wi0, with wi = Σ^(−1)μi and wi0 = −(1/2)μi^T Σ^(−1)μi + ln P(ωi). The decision boundary is the hyperplane w^T(x − x0) = 0, where w = Σ^(−1)(μi − μj) and x0 = (μi + μj)/2 − [ln(P(ωi)/P(ωj)) / ((μi − μj)^T Σ^(−1)(μi − μj))] (μi − μj).

Multivariate Gaussian Density: Case II (cont’d) Properties of the hyperplane (decision boundary): It passes through x0. It is in general not orthogonal to the line linking the means. What happens when P(ωi) = P(ωj)? If P(ωi) = P(ωj), then x0 is the midpoint of the means; if P(ωi) ≠ P(ωj), then x0 shifts away from the more likely mean.

Multivariate Gaussian Density: Case II (cont’d) [Figures: decision boundaries for several prior settings. If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.]

Multivariate Gaussian Density: Case II (cont’d) Mahalanobis-distance classifier: when the priors P(ωi) are equal, the rule reduces to maximizing gi(x) = −(x − μi)^T Σ^(−1)(x − μi), i.e., assign x to the category whose mean is closest in Mahalanobis distance.
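
A minimal sketch of the Mahalanobis minimum-distance rule under equal priors, with a hypothetical shared covariance matrix and class means:

```python
import numpy as np

def mahalanobis2(x, mu, Sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu)."""
    diff = x - mu
    return diff @ Sigma_inv @ diff

# Hypothetical classes sharing one covariance matrix (Case II)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([0.0, 3.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([1.0, 1.5])
# Equal priors: pick the class whose mean is closest in Mahalanobis distance
d2 = [mahalanobis2(x, mu, Sigma_inv) for mu in mus]
print("decide w%d" % (int(np.argmin(d2)) + 1))
```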

Multivariate Gaussian Density: Case III Σi = arbitrary. The discriminant is quadratic: gi(x) = x^T Wi x + wi^T x + wi0, with Wi = −(1/2)Σi^(−1), wi = Σi^(−1)μi, and wi0 = −(1/2)μi^T Σi^(−1)μi − (1/2)ln|Σi| + ln P(ωi). The decision boundaries are hyperquadrics, e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.

Example - Case III Even with P(ω1) = P(ω2), the decision boundary does not pass through the midpoint of μ1 and μ2, because the covariances differ. [Figure: the resulting decision boundary for the example densities.]

Multivariate Gaussian Density: Case III (cont’d) [Figures: more examples of the non-linear (hyperquadric) decision boundaries that arise with arbitrary covariance matrices.]

Error Bounds Exact error calculations can be difficult; it is often easier to estimate error bounds! P(error) = ∫ min[P(ω1/x), P(ω2/x)] p(x) dx = ∫ min[p(x/ω1)P(ω1), p(x/ω2)P(ω2)] dx. Using min(a, b) ≤ a^β b^(1−β) for 0 ≤ β ≤ 1 gives P(error) ≤ P(ω1)^β P(ω2)^(1−β) ∫ p(x/ω1)^β p(x/ω2)^(1−β) dx.

Error Bounds (cont’d) If the class-conditional distributions are Gaussian, then ∫ p(x/ω1)^β p(x/ω2)^(1−β) dx = e^(−k(β)), where k(β) = [β(1−β)/2] (μ2 − μ1)^T [βΣ1 + (1−β)Σ2]^(−1) (μ2 − μ1) + (1/2) ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) ).

Error Bounds (cont’d) The Chernoff bound corresponds to the β that minimizes e^(−k(β)). This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities. [Figure: the bound as a function of β; it is loose at the extreme values of β and tightest at the minimizing β.]

Error Bounds (cont’d) Bhattacharyya bound: approximate the error bound using β = 0.5. It is easier to compute than the Chernoff bound but looser. The Chernoff and Bhattacharyya bounds may not be good bounds if the distributions are not Gaussian.
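
A sketch that evaluates k(β) at β = 0.5 and the resulting Bhattacharyya bound for hypothetical Gaussian classes (chernoff_k is an illustrative helper, not a library function):

```python
import numpy as np

def chernoff_k(beta, mu1, mu2, S1, S2):
    """k(beta) for Gaussian class-conditional densities."""
    dmu = mu2 - mu1
    Sb = beta * S1 + (1 - beta) * S2
    quad = beta * (1 - beta) / 2 * dmu @ np.linalg.solve(Sb, dmu)
    _, logdet_Sb = np.linalg.slogdet(Sb)
    _, logdet_S1 = np.linalg.slogdet(S1)
    _, logdet_S2 = np.linalg.slogdet(S2)
    return quad + 0.5 * (logdet_Sb - beta * logdet_S1 - (1 - beta) * logdet_S2)

# Hypothetical Gaussian classes and priors
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 2.0]])
P1, P2 = 0.5, 0.5

k_half = chernoff_k(0.5, mu1, mu2, S1, S2)
bhattacharyya_bound = np.sqrt(P1 * P2) * np.exp(-k_half)   # P(error) <= this value
print(k_half, bhattacharyya_bound)
```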

Example For the example densities, k(0.5) = 4.06, so the Bhattacharyya bound is P(error) ≤ sqrt(P(ω1)P(ω2)) e^(−4.06) ≈ 0.017 sqrt(P(ω1)P(ω2)).

Receiver Operating Characteristic (ROC) Curve Every classifier employs some kind of threshold. Changing the threshold affects the performance of the system. ROC curves allow us to evaluate system performance across different thresholds.

Example: Person Authentication Authenticate a person using biometrics (e.g., fingerprints). There are two possible distributions (i.e., classes): Authentic (A) and Impostor (I). [Figure: the score distributions of the impostor (I) and authentic (A) classes.]

Example: Person Authentication (cont’d) Possible decisions: (1) correct acceptance (true positive): X belongs to A, and we decide A; (2) incorrect acceptance (false positive): X belongs to I, and we decide A; (3) correct rejection (true negative): X belongs to I, and we decide I; (4) incorrect rejection (false negative): X belongs to A, and we decide I. [Figure: the four outcome regions relative to the decision threshold on the I and A distributions.]

Error vs Threshold [Figure: false-negative and false-positive error rates as a function of the decision threshold x*.]

False Negatives vs Positives [Figure: the ROC curve obtained by plotting false negatives against false positives as the threshold x* varies.]
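
A sketch of how such an error-vs-threshold / ROC table can be computed, using made-up Gaussian score distributions for the impostor and authentic classes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matching scores: impostors (I) score low, authentic users (A) score high
scores_I = rng.normal(loc=0.0, scale=1.0, size=1000)
scores_A = rng.normal(loc=2.0, scale=1.0, size=1000)

thresholds = np.linspace(-3, 5, 81)
false_pos = [(scores_I >= t).mean() for t in thresholds]   # impostor accepted
false_neg = [(scores_A < t).mean() for t in thresholds]    # authentic rejected

# Each threshold gives one operating point (false_pos, false_neg) on the ROC curve
for t, fp, fn in zip(thresholds[::20], false_pos[::20], false_neg[::20]):
    print(f"threshold={t:+.1f}  FP={fp:.3f}  FN={fn:.3f}")
```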

Bayes Decision Theory: Case of Discrete Features When the features are discrete, replace the integrals over p(x/ωj) with sums over the probabilities P(x/ωj); the Bayes decision rule is otherwise unchanged. See Section 2.9.
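
A minimal sketch with a single discrete feature taking four values (the class-conditional PMFs and priors are made-up):

```python
import numpy as np

# Hypothetical class-conditional PMFs P(x/w_j) over a discrete feature x in {0, 1, 2, 3}
P_x_given_w1 = np.array([0.1, 0.2, 0.3, 0.4])
P_x_given_w2 = np.array([0.4, 0.3, 0.2, 0.1])
priors = np.array([0.5, 0.5])

def decide(x):
    # Bayes rule with sums instead of integrals: compare P(x/w_j) P(w_j)
    joint = np.array([P_x_given_w1[x], P_x_given_w2[x]]) * priors
    return int(np.argmax(joint)) + 1

print([decide(x) for x in range(4)])   # decision for each possible feature value
```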

Missing Features Consider a Bayes classifier trained on uncorrupted data. Suppose x = (x1, x2) is a test vector where x1 is missing and the measured value of x2 is x̂2: how can we classify it? If we set x1 equal to its average value, we will classify x as ω3. But p(x̂2/ω2) is larger; maybe we should classify x as ω2?

Missing Features (cont’d) Suppose x = [xg, xb] (xg: good, i.e., observed features; xb: bad, i.e., missing features). Derive the Bayes rule using only the good features by marginalizing the posterior over the bad features: P(ωi/xg) = ∫ p(ωi, xg, xb) dxb / p(xg) = ∫ P(ωi/x) p(x) dxb / ∫ p(x) dxb.
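
A sketch of this marginalization with hypothetical 2-D Gaussian class models, numerically integrating out the missing feature x1 over a grid:

```python
import numpy as np

def gauss2d(x, mu, Sigma):
    """2-D Gaussian density N(mu, Sigma) evaluated at x."""
    diff = x - mu
    norm = 2 * np.pi * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Hypothetical 2-D Gaussian class models and priors
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

x2_hat = 1.8                        # observed (good) feature
x1_grid = np.linspace(-6.0, 8.0, 400)   # grid over the missing (bad) feature
dx = x1_grid[1] - x1_grid[0]

# P(w_i / x2_hat) proportional to P(w_i) * integral of p(x1, x2_hat / w_i) over x1
post = []
for mu, S, P in zip(mus, Sigmas, priors):
    density = [gauss2d(np.array([x1, x2_hat]), mu, S) for x1 in x1_grid]
    post.append(np.sum(density) * dx * P)
post = np.array(post) / np.sum(post)
print(post, "-> decide w%d" % (int(np.argmax(post)) + 1))
```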

Compound Bayesian Decision Theory Sequential decision: decide as each fish emerges. Compound decision: (1) wait for n fish to emerge, (2) make all n decisions jointly. This can improve performance when consecutive states of nature are not statistically independent.

Compound Bayesian Decision Theory (cont’d) Suppose Ω = (ω(1), ω(2), …, ω(n)) denotes the n states of nature, where each ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c categories). Suppose P(Ω) is the prior probability of the n states of nature, and X = (x1, x2, …, xn) are the n observed vectors.

Compound Bayesian Decision Theory (cont’d) Compute the posterior P(Ω/X) ∝ p(X/Ω) P(Ω) and choose the Ω that maximizes it. If the observations are conditionally independent given the states, p(X/Ω) = Π p(xi/ω(i)) and the computation becomes acceptable; note that this does not restrict the prior P(Ω), i.e., consecutive states of nature may not be statistically independent!
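
A sketch of a compound decision by brute-force enumeration, assuming a hypothetical Markov prior over the state sequence and made-up per-observation likelihoods:

```python
import numpy as np
from itertools import product

# Hypothetical setup: c = 2 categories, n = 3 observations
# Per-observation likelihoods p(x_i / w_j), already evaluated at the observed x_i
lik = np.array([[0.60, 0.40],    # p(x1/w1), p(x1/w2)
                [0.30, 0.70],
                [0.55, 0.45]])

# Hypothetical Markov prior over the state sequence (consecutive states dependent)
P_first = np.array([0.5, 0.5])
P_trans = np.array([[0.9, 0.1],  # P(w(i+1) = col / w(i) = row)
                    [0.1, 0.9]])

best, best_score = None, -1.0
for omega in product(range(2), repeat=3):          # enumerate all candidate sequences
    prior = P_first[omega[0]] * np.prod([P_trans[omega[i], omega[i + 1]] for i in range(2)])
    likelihood = np.prod([lik[i, omega[i]] for i in range(3)])
    score = prior * likelihood                     # proportional to P(Omega / X)
    if score > best_score:
        best, best_score = omega, score
print("jointly decide", [w + 1 for w in best])
```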