Bayesian Decision Theory (Classification). Lecturer: 虞台文

Contents: Introduction; Generalized Bayesian Decision Rule; Discriminant Functions; The Normal Distribution; Discriminant Functions for the Normal Populations; Minimax Criterion; Neyman-Pearson Criterion.

Bayesian Decision Theory (Classification) Introduction

What is Bayesian Decision Theory? A mathematical foundation for decision making. It uses a probabilistic approach to make decisions (e.g., classification) so as to minimize the risk (cost).

Preliminaries and Notations: ωi, a state of nature; P(ωi), the prior probability; x, the feature vector; p(x|ωi), the class-conditional density; P(ωi|x), the posterior probability.

Bayes Rule: P(ωi|x) = p(x|ωi)P(ωi) / p(x), where the evidence p(x) = Σ_j p(x|ωj)P(ωj).

Decision: the evidence p(x) is a common scale factor, so it is unimportant in making the decision; only the products p(x|ωi)P(ωi) matter.

Decision: Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i; equivalently, decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i. Special cases: 1. P(ω1) = P(ω2) = ... = P(ωc) (equal priors); 2. p(x|ω1) = p(x|ω2) = ... = p(x|ωc) (equal likelihoods).

Two Categories: Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2. Equivalently, decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2. Special cases: 1. P(ω1) = P(ω2): decide ω1 if p(x|ω1) > p(x|ω2); otherwise decide ω2. 2. p(x|ω1) = p(x|ω2): decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.

Example: [figure] the decision regions R1 and R2 for the case P(ω1) = P(ω2).

Example: [figure] the decision regions R1 and R2 for the case P(ω1) = 2/3, P(ω2) = 1/3. Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2.
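A minimal Python sketch of the two-category rule above. The two 1-D Gaussian class-conditional densities are assumptions for illustration (the slide's actual densities appear only in its figure); the priors 2/3 and 1/3 match the example.

```python
# Decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2); otherwise decide w2.
import numpy as np
from scipy.stats import norm

prior = {1: 2.0 / 3.0, 2: 1.0 / 3.0}          # P(w1), P(w2) from the example
likelihood = {1: norm(loc=-1.0, scale=1.0),    # p(x|w1), assumed
              2: norm(loc=+1.5, scale=0.7)}    # p(x|w2), assumed

def decide(x):
    """Return 1 or 2 according to the Bayes decision rule."""
    score1 = likelihood[1].pdf(x) * prior[1]
    score2 = likelihood[2].pdf(x) * prior[2]
    return 1 if score1 > score2 else 2

for x in np.linspace(-4, 4, 9):
    print(f"x = {x:+.1f} -> decide w{decide(x)}")
```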

Classification Error: Consider two categories and decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2. Then P(error|x) = min[P(ω1|x), P(ω2|x)], and the average probability of error is P(error) = ∫ P(error|x) p(x) dx.

Bayesian Decision Theory (Classification) Generalized Bayesian Decision Rule

The Generalization: Ω = {ω1, ..., ωc}, a set of c states of nature; A = {α1, ..., αa}, a set of a possible actions; λ(αi|ωj), the loss incurred for taking action αi when the true state of nature is ωj. We want to minimize the expected loss in making a decision; the loss can be zero (e.g., for a correct decision).

Conditional Risk: Given x, the expected loss (risk) associated with taking action αi is R(αi|x) = Σ_j λ(αi|ωj) P(ωj|x).
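A minimal sketch of computing the conditional risk and picking the minimum-risk action; the 2 x 2 loss matrix and the posterior values are assumed numbers, not taken from the slides.

```python
# R(a_i|x) = sum_j L[i, j] * P(w_j|x); take the action with minimum risk.
import numpy as np

# Loss matrix L[i, j]: loss for taking action a_i when the state is w_j (assumed values).
L = np.array([[0.0, 2.0],     # action a1
              [1.0, 0.0]])    # action a2

def conditional_risk(posteriors):
    """posteriors: array of P(w_j|x); returns R(a_i|x) for every action."""
    return L @ posteriors

def best_action(posteriors):
    return int(np.argmin(conditional_risk(posteriors)))

post = np.array([0.3, 0.7])            # P(w1|x), P(w2|x) for some x (assumed)
print(conditional_risk(post))          # risks of a1 and a2
print("take action a%d" % (best_action(post) + 1))
```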

0/1 Loss Function: λ(αi|ωj) = 0 if i = j and 1 if i ≠ j. Under this loss the conditional risk is R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 - P(ωi|x), so minimizing the risk is equivalent to maximizing the posterior (minimum-error-rate classification).

Decision: Bayesian decision rule: take the action that minimizes the conditional risk, α(x) = argmin_i R(αi|x).

Overall Risk: For a decision function α(x), the overall risk is R = ∫ R(α(x)|x) p(x) dx. The Bayesian decision rule, which minimizes R(α(x)|x) at every x, is the optimal one to minimize the overall risk; its resulting overall risk is called the Bayes risk.

Two-Category Classification: actions α1 (decide ω1) and α2 (decide ω2); states of nature ω1 and ω2; loss function λij = λ(αi|ωj).

Two-Category Classification: Perform α1 if R(α2|x) > R(α1|x); otherwise perform α2.

Two-Category Classification: Since R(αi|x) = λi1 P(ω1|x) + λi2 P(ω2|x), perform α1 if (λ21 - λ11) P(ω1|x) > (λ12 - λ22) P(ω2|x); otherwise perform α2. Both factors (λ21 - λ11) and (λ12 - λ22) are ordinarily positive, so the posterior probabilities are merely scaled before comparison.

Two-Category Classification: Substituting P(ωi|x) ∝ p(x|ωi)P(ωi), perform α1 if (λ21 - λ11) p(x|ω1)P(ω1) > (λ12 - λ22) p(x|ω2)P(ω2); otherwise perform α2. The evidence p(x) is irrelevant to the comparison.

Two-Category Classification: Perform α1 if the likelihood ratio p(x|ω1)/p(x|ω2) exceeds the threshold [(λ12 - λ22) P(ω2)] / [(λ21 - λ11) P(ω1)]; otherwise perform α2. This slide will be recalled later.
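A minimal sketch of the rule in likelihood-ratio form; the losses, priors, and Gaussian densities below are assumed for illustration only.

```python
# Perform a1 if p(x|w1)/p(x|w2) > (l12 - l22) P(w2) / ((l21 - l11) P(w1)).
from scipy.stats import norm

l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0        # losses, assumed
P1, P2 = 0.6, 0.4                              # priors, assumed
p1 = norm(loc=0.0, scale=1.0)                  # p(x|w1), assumed
p2 = norm(loc=2.0, scale=1.0)                  # p(x|w2), assumed

threshold = (l12 - l22) * P2 / ((l21 - l11) * P1)

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)              # likelihood ratio
    return "a1" if ratio > threshold else "a2"

for x in (-1.0, 0.5, 1.0, 2.5):
    print(x, decide(x))
```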

Bayesian Decision Theory (Classification) Discriminant Functions

The Multicategory Classification: [figure: the feature vector x feeds a bank of discriminant functions g1(x), g2(x), ..., gc(x), whose outputs drive the action (e.g., classification) α(x)]. Assign x to ωi if gi(x) > gj(x) for all j ≠ i. The gi(x)'s are called the discriminant functions. How should discriminant functions be defined?

Simple Discriminant Functions: Minimum-risk case: gi(x) = -R(αi|x). Minimum-error-rate case: gi(x) = P(ωi|x). If f(·) is a monotonically increasing function, then the f(gi(·))'s are also discriminant functions.

Decision Regions Two-category example Decision regions are separated by decision boundaries.

Bayesian Decision Theory (Classification) The Normal Distribution

Basics of Probability: Discrete random variable X (assume integer-valued): probability mass function (pmf) P(x) = P(X = x); cumulative distribution function (cdf) F(x) = P(X ≤ x) = Σ_{k ≤ x} P(k). Continuous random variable X: probability density function (pdf) p(x), which is not itself a probability; cumulative distribution function (cdf) F(x) = ∫_{-∞}^{x} p(t) dt.

Expectations: Let g be a function of the random variable X; its expectation is E[g(X)] = Σ_x g(x)P(x) in the discrete case and ∫ g(x)p(x) dx in the continuous case. The kth moment is E[X^k]; the kth central moment is E[(X - E[X])^k]; the 1st moment is the mean E[X].

Important Expectations: Mean μ = E[X]; variance Var[X] = σ² = E[(X - μ)²]. Fact: Var[X] = E[X²] - (E[X])².

Entropy: H[X] = -∫ p(x) ln p(x) dx. The entropy measures the fundamental uncertainty in the value of points selected randomly from the distribution.

Univariate Gaussian Distribution: X ~ N(μ, σ²) with density p(x) = (1/(√(2π) σ)) exp(-(x - μ)²/(2σ²)); E[X] = μ, Var[X] = σ². [figure: the bell-shaped density centered at μ, with tick marks at σ, 2σ, and 3σ from the mean.] Properties: 1. It maximizes the entropy among all densities with the given mean and variance. 2. Central limit theorem: sums of many independent random variables tend toward a Gaussian.

Random Vectors: A d-dimensional random vector X = (X1, ..., Xd)ᵀ. Vector mean: μ = E[X] = (E[X1], ..., E[Xd])ᵀ. Covariance matrix: Σ = E[(X - μ)(X - μ)ᵀ].

Multivariate Gaussian Distribution: For a d-dimensional random vector, X ~ N(μ, Σ) has density p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-½ (x - μ)ᵀ Σ⁻¹ (x - μ)), with E[X] = μ and E[(X - μ)(X - μ)ᵀ] = Σ.
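A minimal sketch of evaluating this density; the mean vector and covariance matrix are assumed values.

```python
# p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, -1.0])                            # assumed mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                        # assumed covariance
print(gaussian_pdf(np.array([0.5, 0.0]), mu, Sigma))
```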

Properties of N(μ, Σ): Let X ~ N(μ, Σ) be a d-dimensional random vector and Y = AᵀX, where A is a d × k matrix. Then Y ~ N(Aᵀμ, AᵀΣA); any linear transform of a Gaussian vector is again Gaussian.

On the Parameters of N(μ, Σ): For X ~ N(μ, Σ), the components satisfy μi = E[Xi] and σij = E[(Xi - μi)(Xj - μj)]; the diagonal entries of Σ are the component variances and the off-diagonal entries are the covariances.

More on the Covariance Matrix: Σ is symmetric and positive semidefinite, so it can be factored as Σ = ΦΛΦᵀ, where Φ is an orthonormal matrix whose columns are the eigenvectors of Σ and Λ is a diagonal matrix of the corresponding eigenvalues.

Whitening Transform: For X ~ N(μ, Σ) and Y = AᵀX, we have Y ~ N(Aᵀμ, AᵀΣA). Let A_w = ΦΛ^(-1/2), i.e., a projection onto the eigenvectors (linear transform) followed by scaling (whitening); then A_wᵀ Σ A_w = Λ^(-1/2) Φᵀ Σ Φ Λ^(-1/2) = I, so the transformed vector has identity covariance.
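A minimal sketch of the whitening transform via the eigendecomposition Σ = ΦΛΦᵀ; the covariance matrix below is an assumed example.

```python
# A_w = Phi Lambda^{-1/2}; then A_w^T Sigma A_w = I.
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])                # assumed covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)          # Lambda (as a vector) and Phi
A_w = Phi @ np.diag(eigvals ** -0.5)          # whitening matrix

print(np.round(A_w.T @ Sigma @ A_w, 6))       # should print the identity matrix
```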

Mahalanobis Distance: For X ~ N(μ, Σ), the quantity r² = (x - μ)ᵀ Σ⁻¹ (x - μ) is the squared Mahalanobis distance from x to μ. The contours p(x) = constant are hyperellipsoids of constant r²; their size depends on the value of r².
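A minimal sketch of computing r², the quantity that appears in the Gaussian exponent; μ and Σ are assumed values.

```python
# r^2 = (x - mu)^T Sigma^{-1} (x - mu)
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])                     # assumed mean
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])                # assumed covariance
for x in (np.array([1.0, 1.0]), np.array([2.0, -1.0])):
    print(x, mahalanobis_sq(x, mu, Sigma))
```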

Bayesian Decision Theory (Classification) Discriminant Functions for the Normal Populations

Minimum-Error-Rate Classification: With class-conditional densities Xi ~ N(μi, Σi), take gi(x) = ln p(x|ωi) + ln P(ωi) = -½ (x - μi)ᵀ Σi⁻¹ (x - μi) - (d/2) ln 2π - ½ ln|Σi| + ln P(ωi).

Three Cases: Case 1: Σi = σ²I; the classes are centered at different means, and their feature components are pairwise independent with the same variance. Case 2: Σi = Σ; the classes are centered at different means but share the same covariance. Case 3: Σi arbitrary.

Case 1. Σi = σ²I: |Σi| = σ^(2d) and Σi⁻¹ = (1/σ²)I, so the terms that do not depend on i are irrelevant and gi(x) = -‖x - μi‖²/(2σ²) + ln P(ωi).

Case 1. Σi = σ²I: Expanding the quadratic and dropping ‖x‖², which is the same for every class, gives a linear discriminant gi(x) = wiᵀx + wi0 with wi = μi/σ² and wi0 = -μiᵀμi/(2σ²) + ln P(ωi).

Boundary between ωi and ωj: the set of points where gi(x) = gj(x), which can be written as wᵀ(x - x0) = 0.

Case 1. Σi = σ²I. Boundary between ωi and ωj: wᵀ(x - x0) = 0, with w = μi - μj and x0 = ½(μi + μj) - [σ²/‖μi - μj‖²] ln[P(ωi)/P(ωj)] (μi - μj). The decision boundary is a hyperplane perpendicular to the line between the means, located somewhere along it; the shift term is 0 if P(ωi) = P(ωj), in which case x0 is the midpoint of the means.

Case 1. Σi = σ²I: When the priors are equal, the rule assigns x to the class with the nearest mean, i.e., a minimum-distance classifier (template matching).
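A minimal sketch of the nearest-mean rule; the class means are assumed values.

```python
# Assign x to the class with the nearest mean (minimum-distance classifier).
import numpy as np

means = np.array([[0.0, 0.0],      # mu_1 (assumed)
                  [3.0, 0.0],      # mu_2 (assumed)
                  [0.0, 3.0]])     # mu_3 (assumed)

def nearest_mean(x):
    d2 = np.sum((means - x) ** 2, axis=1)     # squared Euclidean distances
    return int(np.argmin(d2)) + 1             # class index 1..c

print(nearest_mean(np.array([2.5, 0.4])))     # -> 2
print(nearest_mean(np.array([0.2, 1.8])))     # -> 3
```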

Case 1. Σi = σ²I. [figure]

Demo

Case 2. Σi = Σ: gi(x) = -½ (x - μi)ᵀ Σ⁻¹ (x - μi) + ln P(ωi); the terms common to all classes are irrelevant, and the prior term is also irrelevant if P(ωi) = P(ωj) for all i, j, in which case classification reduces to minimizing the squared Mahalanobis distance to the class means.

Case 2. Σi = Σ: Expanding the quadratic, the term xᵀΣ⁻¹x is the same for every class and hence irrelevant, leaving the linear discriminant gi(x) = wiᵀx + wi0 with wi = Σ⁻¹μi and wi0 = -½ μiᵀΣ⁻¹μi + ln P(ωi).

Case 2. Σi = Σ: The boundary between ωi and ωj is again a hyperplane wᵀ(x - x0) = 0, now with w = Σ⁻¹(μi - μj); it passes through x0 but is generally not perpendicular to the line between the means. [figure]

Demo

Case 3. Σi ≠ Σj: Only the evidence is irrelevant; the quadratic term, which dropped out in Cases 1 and 2, remains, giving gi(x) = xᵀWi x + wiᵀx + wi0 with Wi = -½ Σi⁻¹, wi = Σi⁻¹μi, and wi0 = -½ μiᵀΣi⁻¹μi - ½ ln|Σi| + ln P(ωi). The decision surfaces are hyperquadrics, e.g., hyperplanes, hyperspheres, hyperellipsoids, and hyperhyperboloids.
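A minimal sketch of the general-case discriminant (the common (d/2) ln 2π term is dropped); the means, covariances, and priors are assumed values.

```python
# g_i(x) = -0.5 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 0.5 ln|Sigma_i| + ln P(w_i)
import numpy as np

classes = [  # assumed parameters for two classes
    dict(mu=np.array([0.0, 0.0]), Sigma=np.array([[1.0, 0.0], [0.0, 1.0]]), prior=0.5),
    dict(mu=np.array([2.0, 2.0]), Sigma=np.array([[2.0, 0.6], [0.6, 1.0]]), prior=0.5),
]

def g(x, c):
    diff = x - c["mu"]
    quad = diff @ np.linalg.solve(c["Sigma"], diff)
    return -0.5 * quad - 0.5 * np.log(np.linalg.det(c["Sigma"])) + np.log(c["prior"])

def classify(x):
    return int(np.argmax([g(x, c) for c in classes])) + 1

print(classify(np.array([0.5, 0.2])))   # near class 1
print(classify(np.array([2.2, 1.8])))   # near class 2
```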

Case 3. Σi ≠ Σj: Non-simply connected decision regions can arise even in one dimension for Gaussians having unequal variance.

Case 3. Σi ≠ Σj. [figure]

Demo

Multi-Category Classification

Bayesian Decision Theory (Classification) Minimax Criterion

Bayesian Decision Rule, Two-Category Classification: Decide ω1 if the likelihood ratio p(x|ω1)/p(x|ω2) exceeds the threshold [(λ12 - λ22) P(ω2)] / [(λ21 - λ11) P(ω1)]. The minimax criterion deals with the case where the prior probabilities are unknown.

Basic Concept of Minimax: consider the worst-case prior probabilities (those giving the maximum loss) and then pick the decision rule that minimizes the overall risk under them; in other words, minimize the maximum possible overall risk.

Overall Risk: R = ∫_{R1} [λ11 P(ω1)p(x|ω1) + λ12 P(ω2)p(x|ω2)] dx + ∫_{R2} [λ21 P(ω1)p(x|ω1) + λ22 P(ω2)p(x|ω2)] dx.

For a fixed decision boundary, the overall risk is a linear function of the prior: R(P(ω1)) = a P(ω1) + b, where the coefficients a and b depend on the setting of the decision boundary.

Overall Risk: The minimax solution chooses the decision boundary so that the slope a = 0; the overall risk then equals b = R_mm, the minimax risk, and is independent of the value of P(ωi).

Minimax Risk: R_mm = λ22 + (λ12 - λ22) ∫_{R1} p(x|ω2) dx = λ11 + (λ21 - λ11) ∫_{R2} p(x|ω1) dx.

Error Probability: With the 0/1 loss function, the overall risk is the probability of error, P(error) = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx.

Minimax Error Probability: With the 0/1 loss function, the minimax condition becomes P(1|2) = P(2|1), i.e., ∫_{R1} p(x|ω2) dx = ∫_{R2} p(x|ω1) dx; the minimax boundary equalizes the two conditional error probabilities.

Minimax Error Probability: [figure: the densities of ω1 and ω2 with regions R1 and R2; the shaded error areas P(1|2) and P(2|1) are equal at the minimax boundary.]
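A minimal sketch of locating the minimax boundary numerically for the 0/1 loss: the threshold is chosen so that P(1|2) = P(2|1), which makes the error rate independent of the unknown prior. The two 1-D Gaussian densities are assumptions for illustration.

```python
# Find the threshold t* with R1 = (-inf, t*) and R2 = (t*, +inf) such that
# P(1|2) = cdf2(t*) equals P(2|1) = 1 - cdf1(t*).
from scipy.stats import norm
from scipy.optimize import brentq

p1 = norm(loc=0.0, scale=1.0)     # p(x|w1), assumed
p2 = norm(loc=2.0, scale=1.5)     # p(x|w2), assumed

def error_gap(t):
    return p2.cdf(t) - (1.0 - p1.cdf(t))

t_star = brentq(error_gap, -5.0, 5.0)
print("minimax threshold:", t_star)
print("P(1|2) =", p2.cdf(t_star), " P(2|1) =", 1 - p1.cdf(t_star))
```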

Bayesian Decision Theory (Classification) Neyman-Pearson Criterion

Bayesian Decision Rule, Two-Category Classification: Decide ω1 if the likelihood ratio exceeds the threshold. The Neyman-Pearson criterion deals with the case where both the loss functions and the prior probabilities are unknown.

Signal Detection Theory: Signal detection theory evolved from the development of communications and radar equipment in the first half of the last century. It migrated to psychology, initially as part of sensation and perception, in the 1950s and 1960s, as an attempt to understand features of human behavior in detecting very faint stimuli that were not explained by traditional threshold theories.

The Situation of Interest: A person is faced with a stimulus (signal) that is very faint or confusing. The person must make a decision: is the signal there or not? What makes this situation confusing and difficult is the presence of other activity that resembles the signal; let us call this activity noise.

Example Noise is present both in the environment and in the sensory system of the observer. The observer reacts to the momentary total activation of the sensory system, which fluctuates from moment to moment, as well as responding to environmental stimuli, which may include a signal.

Example: A radiologist is examining a CT scan, looking for evidence of a tumor. It is a hard job, because there is always some uncertainty. There are four possible outcomes: hit (tumor present and doctor says "yes"), miss (tumor present and doctor says "no"), false alarm (tumor absent and doctor says "yes"), and correct rejection (tumor absent and doctor says "no"). Misses and false alarms are the two types of error.

The Four Cases: the signal (tumor) is absent (ω1) or present (ω2), and the decision is "no" (α1) or "yes" (α2). Correct rejection: absent, "no", probability P(1|1). False alarm: absent, "yes", P(2|1). Miss: present, "no", P(1|2). Hit: present, "yes", P(2|2). Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.

Decision Making: [figure: the noise (ω1) and noise-plus-signal (ω2) distributions along the internal response axis; their separation d' is the discriminability, and the criterion, set by expectancy (decision bias), divides responses into "no" (α1) and "yes" (α2). The area of ω2 beyond the criterion is the hit rate P(2|2); the area of ω1 beyond it is the false-alarm rate P(2|1).]

ROC Curve (Receiver Operating Characteristic): a plot of the hit rate P_H = P(α2|ω2) against the false-alarm rate P_FA = P(α2|ω1) as the criterion is varied.
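A minimal sketch of tracing an ROC curve by sweeping the criterion; the noise and signal distributions are assumed 1-D Gaussians.

```python
# Sweep the criterion and record hit rate P_H = P(a2|w2) and false-alarm
# rate P_FA = P(a2|w1) for the "yes" region {x > criterion}.
import numpy as np
from scipy.stats import norm

noise = norm(loc=0.0, scale=1.0)          # p(x|w1), assumed
signal = norm(loc=1.0, scale=1.0)         # p(x|w2), assumed (d' = 1)

for c in np.linspace(-4, 5, 10):
    p_fa = 1 - noise.cdf(c)
    p_h = 1 - signal.cdf(c)
    print(f"criterion {c:+.1f}:  P_FA = {p_fa:.3f},  P_H = {p_h:.3f}")
```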

Neyman-Pearson Criterion: with hit rate P_H = P(α2|ω2) and false-alarm rate P_FA = P(α2|ω1), the NP rule maximizes P_H subject to P_FA ≤ a.

Likelihood Ratio Test: decide α2 ("yes") if p(x|ω2)/p(x|ω1) > T, where T is a threshold chosen to meet the P_FA constraint (≤ a). How to determine T?
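A minimal sketch of one way to determine T numerically: with assumed 1-D Gaussian densities of equal variance the likelihood ratio is monotone in x, so the P_FA constraint fixes a cut point x_T and hence T.

```python
# Choose x_T so that P_FA = 1 - F1(x_T) = a, then read off T at x_T.
from scipy.stats import norm

p1 = norm(loc=0.0, scale=1.0)      # noise:          p(x|w1), assumed
p2 = norm(loc=1.5, scale=1.0)      # noise + signal: p(x|w2), assumed
a = 0.05                           # allowed false-alarm rate

# Equal variances and mu2 > mu1 make the likelihood ratio increase with x,
# so the "yes" region is {x > x_T}.
x_T = p1.ppf(1.0 - a)
T = p2.pdf(x_T) / p1.pdf(x_T)      # the corresponding likelihood-ratio threshold

print("x_T =", x_T, " T =", T)
print("P_FA =", 1 - p1.cdf(x_T), " P_H =", 1 - p2.cdf(x_T))
```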

Likelihood Ratio Test: [figure: the threshold partitions the x-axis into R1 and R2; P_FA is the area of p(x|ω1) over R2 and P_H is the area of p(x|ω2) over R2.]

Neyman-Pearson Lemma: Consider the aforementioned rule φ with T chosen to give P_FA(φ) = a. There is no decision rule φ' such that P_FA(φ') ≤ a and P_H(φ') > P_H(φ). Proof sketch: let δ(x) and δ'(x) be the indicators of deciding "yes" under φ and φ'. At every x, (δ(x) - δ'(x)) (p(x|ω2) - T p(x|ω1)) ≥ 0: wherever the bracket is positive, δ(x) = 1 ≥ δ'(x), and wherever it is zero or negative, δ(x) = 0 ≤ δ'(x). Integrating over x gives P_H(φ) - P_H(φ') ≥ T [P_FA(φ) - P_FA(φ')] = T [a - P_FA(φ')] ≥ 0, so P_H(φ') ≤ P_H(φ) and no such φ' can exist.