Bayesian Decision Theory: Basic Concepts, Discriminant Functions, The Normal Density, ROC Curves

Figure 1.1

Basic Concepts
State of nature. Let c denote the state of nature; c is a random variable (e.g., c = c1 for sea bass or c = c2 for salmon).
A priori probabilities. Let P(c1) and P(c2) denote the a priori probabilities of c1 and c2 respectively. We know P(c1) + P(c2) = 1.
Decision rule. Decide c1 if P(c1) > P(c2); otherwise decide c2. But with this rule alone we know there will be many mistakes.

Basic Concepts
Class-conditional probability density function. Let x be a continuous random variable. p(x|c) is the probability density for x given the state of nature c. For example, what is the density of lightness given that the class is salmon, p(lightness | salmon)? Or given that the class is sea bass, p(lightness | sea bass)?

Figure 2.1

Bayes Formula
How do we combine a priori and class-conditional probabilities to obtain the probability of a state of nature given an observation?
Bayes formula: P(cj|x) = p(x|cj) P(cj) / p(x)
where P(cj|x) is the posterior probability, p(x|cj) the likelihood, P(cj) the prior probability, and p(x) the evidence.
Bayes decision: Choose c1 if P(c1|x) > P(c2|x); otherwise choose c2.
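As an illustration (not from the original slides), here is a minimal Python sketch that applies Bayes formula to turn assumed priors and class-conditional densities into posteriors; the Gaussian likelihoods and all numbers are invented for the example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and class-conditional (likelihood) models for two classes.
priors = np.array([0.6, 0.4])                # P(c1), P(c2) -- assumed values
likelihoods = [norm(loc=2.0, scale=1.0),     # p(x|c1) -- assumed Gaussian
               norm(loc=4.0, scale=1.0)]     # p(x|c2) -- assumed Gaussian

def posteriors(x):
    """Return P(cj|x) via Bayes formula: p(x|cj) P(cj) / p(x)."""
    joint = np.array([L.pdf(x) * P for L, P in zip(likelihoods, priors)])
    evidence = joint.sum()                   # p(x) = Σj p(x|cj) P(cj)
    return joint / evidence

x = 3.1
post = posteriors(x)
print(post, "-> decide c1" if post[0] > post[1] else "-> decide c2")
```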

Figure 2.2

Simplifying Bayes Rule
Bayes formula: P(cj|x) = p(x|cj) P(cj) / p(x)
The evidence p(x) is the same for all states or classes, so we can dispense with it when making a decision.
Rule: Choose c1 if p(x|c1)P(c1) > p(x|c2)P(c2); otherwise decide c2.
Observations: If p(x|c1) = p(x|c2), the decision depends on the prior probabilities only. If P(c1) = P(c2), the decision depends on the likelihoods only.
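A tiny self-contained sketch (with made-up likelihood and prior values) confirming that dropping the evidence term does not change the decision:

```python
# Assumed example numbers: likelihood densities evaluated at one observation x.
p_x_given_c1, p_x_given_c2 = 0.30, 0.10   # p(x|c1), p(x|c2) -- hypothetical values
P_c1, P_c2 = 0.4, 0.6                     # priors -- hypothetical values

evidence = p_x_given_c1 * P_c1 + p_x_given_c2 * P_c2   # p(x), same for both classes

# Full posteriors and the evidence-free comparison give the same decision.
post_c1 = p_x_given_c1 * P_c1 / evidence
post_c2 = p_x_given_c2 * P_c2 / evidence
assert (post_c1 > post_c2) == (p_x_given_c1 * P_c1 > p_x_given_c2 * P_c2)
print("decide c1" if post_c1 > post_c2 else "decide c2")
```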

Thomas Bayes
Born in London (1701). Studied logic and theology (University of Edinburgh). Fellow of the Royal Society (1742).
Given white and black balls in an urn, what is the probability of drawing one or the other? And conversely, given one or more drawn balls, what can be said about the number of balls in the urn?

Minimizing Error
What is the probability of error?
P(error|x) = P(c1|x) if we decide c2; P(error|x) = P(c2|x) if we decide c1.
Does the Bayes rule minimize the probability of error?
P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx
If for every x we choose the class with the larger posterior, then P(error|x) = min[P(c1|x), P(c2|x)] is as small as it can be, so the integral is minimized. The answer is "yes".
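As a rough numerical illustration (all distributions assumed for the example), the Bayes error can be approximated by integrating the smaller of the two joint densities p(x|ci)P(ci) over x:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem: equal priors, Gaussian class-conditional densities.
priors = [0.5, 0.5]                                   # P(c1), P(c2) -- assumed
pdfs = [norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf]       # p(x|c1), p(x|c2) -- assumed

xs = np.linspace(-8.0, 10.0, 20001)
dx = xs[1] - xs[0]
joint = np.vstack([priors[i] * pdfs[i](xs) for i in range(2)])   # p(x|ci) P(ci)

# P(error) = ∫ P(error|x) p(x) dx = ∫ min_i [ p(x|ci) P(ci) ] dx when we always
# pick the class with the larger posterior.
bayes_error = np.sum(joint.min(axis=0)) * dx
print(f"Bayes error ≈ {bayes_error:.4f}")
```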

Loss Function
Let {c1, c2, …} be the possible states of nature and {a1, a2, …} the possible actions.
The loss function λ(ai|cj) is the loss incurred for taking action ai when the state of nature is cj.
Suppose we observe a particular x and contemplate taking action ai. If the true state is cj, the loss will be λ(ai|cj).
Expected loss: R(ai|x) = Σj λ(ai|cj) P(cj|x) (this is also called the conditional risk).
Decision: Select the action that minimizes the conditional risk (** best possible performance **).
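A minimal sketch of the conditional-risk computation, with a hypothetical loss matrix and hypothetical posteriors:

```python
import numpy as np

# Hypothetical loss matrix and posteriors for a 2-action, 2-state problem.
# loss[i, j] = λ(ai|cj): loss for taking action ai when the true state is cj.
loss = np.array([[0.0, 5.0],
                 [1.0, 0.0]])
posteriors = np.array([0.3, 0.7])   # P(c1|x), P(c2|x) -- assumed values

# Conditional risk R(ai|x) = Σj λ(ai|cj) P(cj|x), for every action at once.
R = loss @ posteriors
best_action = np.argmin(R)          # minimum-risk (Bayes) action
print(R, "-> take action a%d" % (best_action + 1))
```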

Two Categories
How does this apply when we have only two categories? Write λij = λ(ai|cj).
R(a1|x) = λ11 P(c1|x) + λ12 P(c2|x)
R(a2|x) = λ21 P(c1|x) + λ22 P(c2|x)
Decide a1 if (λ21 − λ11) P(c1|x) > (λ12 − λ22) P(c2|x)

Two Categories
Another way of stating this is as follows.
Decide a1 if p(x|c1) / p(x|c2) > [(λ12 − λ22) / (λ21 − λ11)] · P(c2) / P(c1)
Stated in words: choose a1 if the likelihood ratio (the left-hand side) exceeds a threshold value (the right-hand side) that is independent of the observation x.
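A hedged sketch of the likelihood-ratio test; the losses, priors, and Gaussian class-conditional densities are all assumed values chosen only for illustration:

```python
from scipy.stats import norm

# Hypothetical losses, priors, and Gaussian class-conditional densities.
lam11, lam12, lam21, lam22 = 0.0, 5.0, 1.0, 0.0   # λij = λ(ai|cj) -- assumed
P1, P2 = 0.4, 0.6                                  # priors -- assumed
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)            # p(x|c1), p(x|c2) -- assumed

# The threshold is independent of x: [(λ12 - λ22)/(λ21 - λ11)] * P(c2)/P(c1)
threshold = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)

def decide(x):
    likelihood_ratio = p1.pdf(x) / p2.pdf(x)
    return "a1" if likelihood_ratio > threshold else "a2"

print(decide(-0.5), decide(1.8))   # a1 for x near the c1 mean, a2 near the c2 mean
```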

Figure 2.3

Zero-One Loss
We will normally be concerned with the symmetrical or zero-one loss function:
λ(ai|cj) = 0 if i = j, and 1 if i ≠ j.
In this case the conditional risk is R(ai|x) = Σj λ(ai|cj) P(cj|x) = 1 − P(ci|x), so minimizing the risk is the same as choosing the class with the maximum posterior.

Exercise
This question refers to Bayes classifiers. Let {C1, C2, …, Cn} be the possible states of nature (classes) and {a1, a2, …, ak} the possible actions. We define the loss function λ(ai|Cj) as the loss incurred for taking action ai when the state of nature is Cj. Now define a new loss function as follows:
λ(ai|Cj) = 0 if i = j, and λ(ai|Cj) = 3 if i ≠ j.
Assume the posterior distribution is uniform (i.e., P(Cj|x) = 1/n for every j). What is the expected risk R(ai|x)?

Exercise
R(ai|x) = Σj λ(ai|Cj) P(Cj|x) = Σj λ(ai|Cj) (1/n) = (1/n) · 3(n − 1) = 3 − 3/n = 3(1 − 1/n)
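A quick numerical check of this result (not part of the original exercise), for an arbitrary choice of n:

```python
n = 5                                             # number of classes -- arbitrary choice
loss = [[0 if i == j else 3 for j in range(n)] for i in range(n)]
posterior = [1.0 / n] * n                         # uniform posterior P(Cj|x) = 1/n

# Conditional risk for any action ai (the same for all i by symmetry).
R = sum(loss[0][j] * posterior[j] for j in range(n))
assert abs(R - 3 * (1 - 1 / n)) < 1e-12
print(R)   # 2.4 for n = 5
```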

Discriminant Functions
How do we represent pattern classifiers? The most common way is through discriminant functions.
Recall that {c1, c2, …} are the possible states of nature. For each class we create a discriminant function gi(x). The classifier assigns class ci if gi(x) > gj(x) for all j ≠ i.
Our classifier is a network or machine that computes c discriminant functions, one per class.

Figure 2.5

Simplifying the Discriminant Functions
Notice the decision is unchanged if we replace every gi(x) with f(gi(x)), provided f(·) is a monotonically increasing function.
Alternatives: gi(x) = P(ci|x); gi(x) = p(x|ci) P(ci); gi(x) = ln p(x|ci) + ln P(ci) (where ln is the natural logarithm).
The net effect is to divide the feature space into c decision regions (one for each class), separated by decision boundaries.
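To make this concrete, a small sketch (with invented priors and Gaussian class-conditional models) classifies by taking the arg-max of the log-form discriminants gi(x) = ln p(x|ci) + ln P(ci):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical three-class problem with Gaussian class-conditional densities.
priors = np.array([0.5, 0.3, 0.2])                          # P(ci) -- assumed
models = [norm(0.0, 1.0), norm(3.0, 1.0), norm(6.0, 2.0)]   # p(x|ci) -- assumed

def classify(x):
    """Assign the class whose discriminant gi(x) = ln p(x|ci) + ln P(ci) is largest."""
    g = np.array([m.logpdf(x) + np.log(P) for m, P in zip(models, priors)])
    return int(np.argmax(g)) + 1     # classes numbered c1, c2, c3

for x in (0.5, 2.8, 7.0):
    print(x, "-> c%d" % classify(x))
```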

Figure 2.6

Normal Density
Recall that with Bayes rule a discriminant function needs two probabilities: gi(x) = p(x|ci) P(ci). How do we model the likelihood, or class-conditional probability? The most common approach is to assume a (multivariate) normal density. In the univariate case:
p(x) = (1 / (σ √(2π))) exp( −(1/2) ((x − μ) / σ)² )
μ = E[x] = ∫ x p(x) dx
σ² = E[(x − μ)²] = ∫ (x − μ)² p(x) dx
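A minimal implementation of this univariate density, checked against scipy.stats.norm for made-up parameters:

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Univariate normal density p(x) = 1/(σ√(2π)) exp(-((x-μ)/σ)²/2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Sanity check against scipy for arbitrary (made-up) parameters.
mu, sigma = 1.5, 0.8
xs = np.linspace(-2.0, 5.0, 7)
assert np.allclose(normal_pdf(xs, mu, sigma), norm(mu, sigma).pdf(xs))
print(normal_pdf(xs, mu, sigma))
```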

Figure 2.7

Normal Density
If the features are statistically independent and the variance is the same for all features, the discriminant functions are simple and linear in x. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are pieces of hyperplanes defined by linear equations.
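A minimal sketch of such a linear machine, assuming equal isotropic variance σ² for every class and made-up means and priors: expanding gi(x) = −‖x − μi‖²/(2σ²) + ln P(ci) and dropping the class-independent ‖x‖² term leaves a linear function wi·x + wi0.

```python
import numpy as np

# Hypothetical 2-D, 3-class problem: shared isotropic variance, assumed means and priors.
sigma2 = 1.0
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # μi -- assumed
priors = np.array([0.5, 0.25, 0.25])                     # P(ci) -- assumed

# Linear discriminant gi(x) = wi·x + wi0 with
#   wi  = μi / σ²
#   wi0 = -μi·μi / (2σ²) + ln P(ci)
W = means / sigma2
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    g = W @ np.asarray(x) + w0
    return int(np.argmax(g)) + 1

print(classify([2.5, 0.2]), classify([0.1, 2.9]), classify([-1.0, -1.0]))
```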

Figure 2.10

Figure 2.11

Signal Detection Theory
Suppose we want to detect a single pulse from a signal, and assume the signal is corrupted by random noise. When the signal is present we observe a normal distribution with mean u2; when it is not present we observe a normal distribution with mean u1. We assume the same standard deviation σ in both cases.
Can we measure the discriminability of the problem? Can we do this independently of the threshold x*?
Discriminability: d′ = |u2 − u1| / σ
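For example (with made-up means and a shared standard deviation):

```python
# Hypothetical parameters of the two distributions.
u1, u2, sigma = 1.0, 2.5, 0.75    # noise-only mean, signal mean, shared std -- assumed

d_prime = abs(u2 - u1) / sigma    # discriminability d' = |u2 - u1| / σ
print(f"d' = {d_prime:.2f}")      # 2.00
```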

Figure 2.19

Signal Detection Theory
How do we find d′ if we do not know u1, u2, or x*? From the data we can compute:
1. P(x > x* | c2): a hit.
2. P(x > x* | c1): a false alarm.
3. P(x < x* | c2): a miss.
4. P(x < x* | c1): a correct rejection.
If we plot a point in a space representing the hit and false-alarm rates (one point per threshold), we obtain a ROC (receiver operating characteristic) curve. With it we can distinguish between discriminability and bias.
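A hedged sketch tying this together, with all distribution parameters assumed: it sweeps the threshold x* to trace ROC operating points, then recovers d′ from a single (hit, false-alarm) pair via the inverse normal CDF, using d′ = z(hit) − z(false alarm) when both distributions share σ.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical signal-detection setup: shared σ, different means.
u1, u2, sigma = 0.0, 1.5, 1.0
noise, signal = norm(u1, sigma), norm(u2, sigma)

# Sweep the threshold x* to trace the ROC curve (hit rate vs. false-alarm rate).
thresholds = np.linspace(-3.0, 4.5, 16)
hits = signal.sf(thresholds)          # P(x > x* | c2)
false_alarms = noise.sf(thresholds)   # P(x > x* | c1)

# Recover d' from any single operating point: d' = z(hit) - z(false alarm).
i = 8
d_prime = norm.ppf(hits[i]) - norm.ppf(false_alarms[i])
print(f"recovered d' = {d_prime:.2f}  (true value {abs(u2 - u1) / sigma:.2f})")
```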

Figure 2.20

Points to Remember
Bayes formula and Bayes decision theory
Loss functions (e.g., zero-one loss)
Discriminant functions
ROC curves