Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.


Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

Bayes decision theory (Feb. 17)

Classification
Supervised learning: based on training examples (E), learn a model that performs well on previously unseen examples.
Classification: a supervised learning task of categorizing entities into a predefined set of classes.


Posterior, likelihood, evidence
P(ωj | x) = P(x | ωj) · P(ωj) / P(x)
Posterior = (Likelihood · Prior) / Evidence
where, in the case of two categories, the evidence is
P(x) = Σj=1..2 P(x | ωj) P(ωj)


Bayesian decision given the posterior probabilities
x is an observation for which:
if P(ω1 | x) > P(ω2 | x), decide true state of nature = ω1
if P(ω1 | x) < P(ω2 | x), decide true state of nature = ω2
This rule minimizes the probability of error.
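A minimal numeric sketch of this rule; the likelihood and prior values below are hypothetical, chosen only for illustration:

```python
# Minimal sketch of the Bayes decision rule for two categories.
# The likelihoods and priors are hypothetical illustration values.
likelihoods = {"w1": 0.6, "w2": 0.2}   # P(x | w_j) for the observed x
priors      = {"w1": 0.4, "w2": 0.6}   # P(w_j)

evidence = sum(likelihoods[w] * priors[w] for w in likelihoods)   # P(x)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in likelihoods}

decision = max(posteriors, key=posteriors.get)   # decide the class with the larger P(w_j | x)
print(posteriors, "->", decision)
```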

Bayesian Decision Theory – Generalization
Use of more than one feature
Use of more than two states of nature
Allowing actions, not only deciding on the state of nature
Introducing a loss function that is more general than the probability of error

Let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”)
Let {α1, α2, …, αa} be the set of possible actions
Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj

Bayes decision theory – example: automatic trading (on stock exchanges)
ω1: the prices will increase (in the future!)
ω2: the prices will decrease
ω3: the prices won’t change much
We cannot observe ω (it is latent)!
α1: buy
α2: sell
x: current prices (and historical prices); x is observed
λ: how much we lose with an action

Conditional risk
R(αi | x) = Σj λ(αi | ωj) P(ωj | x), for i = 1, …, a
Overall risk R: the expected loss associated with the decision rule, R = ∫ R(α(x) | x) p(x) dx
Minimizing R is achieved by minimizing R(αi | x) for every x, i = 1, …, a

Select the action αi for which R(αi | x) is minimum. Then the overall risk R is minimized, and the resulting R is called the Bayes risk: the best performance that can be achieved!

Two-category classification
α1: deciding ω1; α2: deciding ω2
λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
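A sketch of this two-category conditional risk; the loss matrix and posteriors are assumed illustration values, not from the slides:

```python
# Conditional risk for two actions: R(a_i | x) = sum_j lambda(a_i | w_j) * P(w_j | x).
# The loss matrix and posteriors are hypothetical illustration values.
loss = [[0.0, 2.0],     # lambda_11, lambda_12
        [1.0, 0.0]]     # lambda_21, lambda_22
posterior = [0.3, 0.7]  # P(w1 | x), P(w2 | x)

risk = [sum(loss[i][j] * posterior[j] for j in range(2)) for i in range(2)]
action = min(range(2), key=lambda i: risk[i])   # take the action with minimum conditional risk
print("R(a1|x)=%.2f  R(a2|x)=%.2f  -> decide w%d" % (risk[0], risk[1], action + 1))
```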

Our rule is the following:
if R(α1 | x) < R(α2 | x), action α1 (“decide ω1”) is taken.
This results in the equivalent rule: decide ω1 if
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise.

Likelihood ratio: the preceding rule is equivalent to the following rule:
if P(x | ω1) / P(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
then take action α1 (decide ω1); otherwise take action α2 (decide ω2).

Exercise
Select the optimal decision where:
Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5) (normal distribution)
P(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3, P(ω2) = 1/3
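A sketch of how the exercise could be checked numerically, assuming the second argument of N(·, ·) denotes the variance (adjust the scale if it denotes the standard deviation) and that scipy is available:

```python
# Numerical check of the exercise: decide between w1 and w2 for a few observations x.
import numpy as np
from scipy.stats import norm

prior = {"w1": 2/3, "w2": 1/3}
cond = {"w1": norm(loc=2.0, scale=np.sqrt(0.5)),   # N(2, 0.5) read as (mean, variance)
        "w2": norm(loc=1.5, scale=np.sqrt(0.2))}   # N(1.5, 0.2)

def decide(x):
    scores = {w: cond[w].pdf(x) * prior[w] for w in prior}  # proportional to P(w | x)
    return max(scores, key=scores.get)

for x in [1.0, 1.6, 2.2]:   # illustrative observations
    print(x, "->", decide(x))
```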

Zero-one loss function (Bayes classifier)
λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j (i, j = 1, …, c)
Therefore, the conditional risk is:
R(αi | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
“The risk corresponding to this loss function is the average probability of error.”

Classifiers, Discriminant Functions and Decision Surfaces
The multi-category case: a set of discriminant functions gi(x), i = 1, …, c
The classifier assigns a feature vector x to class ωi if:
gi(x) > gj(x) for all j ≠ i

Let gi(x) = −R(αi | x) (maximum discriminant corresponds to minimum risk!)
For the minimum error rate, we take gi(x) = P(ωi | x) (maximum discriminant corresponds to maximum posterior!)
Equivalently, gi(x) ∝ P(x | ωi) P(ωi), or
gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm!)

The feature space is divided into c decision regions:
if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means: assign x to ωi)
The two-category case: a classifier is a “dichotomizer” with two discriminant functions g1 and g2
Let g(x) ≡ g1(x) − g2(x)
Decide ω1 if g(x) > 0; otherwise decide ω2
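A sketch of a dichotomizer built from the log-form discriminants gi(x) = ln P(x | ωi) + ln P(ωi); the one-dimensional Gaussian class-conditional densities and the priors are illustrative assumptions:

```python
# Dichotomizer: g(x) = g1(x) - g2(x); decide w1 if g(x) > 0, otherwise w2.
# The class-conditional densities and priors are hypothetical illustration values.
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def g(x, mu=(0.0, 1.0), var=(1.0, 1.0), prior=(0.5, 0.5)):
    g1 = math.log(gaussian_pdf(x, mu[0], var[0])) + math.log(prior[0])
    g2 = math.log(gaussian_pdf(x, mu[1], var[1])) + math.log(prior[1])
    return g1 - g2

x = 0.3
print("decide w1" if g(x) > 0 else "decide w2")
```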

The computation of g(x):
g(x) = g1(x) − g2(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]

Discriminant functions of the Bayes Classifier with Normal Density

The Normal Density
Univariate density: an analytically tractable, continuous density
Many processes are asymptotically Gaussian; handwritten characters and speech sounds can be viewed as an ideal prototype corrupted by a random process (central limit theorem)
p(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
where: μ = mean (or expected value) of x, σ² = expected squared deviation, or variance


Multivariate density
The multivariate normal density in d dimensions is:
p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) · exp(−½ (x − μ)ᵗ Σ⁻¹ (x − μ))
where:
x = (x1, x2, …, xd)ᵗ (t stands for the transpose)
μ = (μ1, μ2, …, μd)ᵗ is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
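A sketch of evaluating this density directly from the formula (numpy assumed available; the 2-D mean and covariance are illustrative):

```python
# Multivariate normal density:
# p(x) = exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))
import numpy as np

def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(mvn_pdf(np.array([0.5, -0.2]), mu, sigma))
```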

Discriminant Functions for the Normal Density
We saw that minimum error-rate classification can be achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi).
Case of multivariate normal:
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)

Case Σi = σ²·I (I stands for the identity matrix)
The discriminant reduces to gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi), a linear discriminant function gi(x) = wiᵗx + wi0 with wi = μi / σ² and wi0 = −μiᵗμi / (2σ²) + ln P(ωi).

A classifier that uses linear discriminant functions is called a “linear machine”.
The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x).
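A sketch of such a linear machine for the Σi = σ²·I case, using gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi); the means, σ², and priors are illustrative assumptions:

```python
# Linear machine for the case Sigma_i = sigma^2 * I:
# g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i); assign x to the class with the largest g_i.
import numpy as np

means  = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])   # mu_i (illustrative)
priors = np.array([0.5, 0.3, 0.2])
sigma2 = 1.0

def classify(x):
    g = -np.sum((x - means) ** 2, axis=1) / (2 * sigma2) + np.log(priors)
    return int(np.argmax(g))   # index of the winning class

print(classify(np.array([1.0, 2.5])))
```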

The hyperplane is always orthogonal to the line linking the means!

The hyperplane separating Ri and Rj is always orthogonal to the line linking the means!


Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!


Case Σi = arbitrary
The covariance matrices are different for each category; the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
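A sketch of the corresponding quadratic discriminant, gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − ½ ln |Σi| + ln P(ωi) (the term −(d/2) ln 2π is common to all classes and dropped); the per-class parameters are illustrative assumptions:

```python
# Quadratic discriminant for arbitrary per-class covariance matrices.
import numpy as np

classes = [
    {"mu": np.array([0.0, 0.0]), "sigma": np.array([[1.0, 0.0], [0.0, 1.0]]), "prior": 0.6},
    {"mu": np.array([2.0, 2.0]), "sigma": np.array([[2.0, 0.5], [0.5, 1.0]]), "prior": 0.4},
]

def g(x, c):
    diff = x - c["mu"]
    inv = np.linalg.inv(c["sigma"])
    return (-0.5 * diff @ inv @ diff
            - 0.5 * np.log(np.linalg.det(c["sigma"]))
            + np.log(c["prior"]))

x = np.array([1.2, 0.8])
print(int(np.argmax([g(x, c) for c in classes])))   # index of the decided class
```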


Exercise (revisited)
Select the optimal decision where:
Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5) (normal distribution)
P(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3, P(ω2) = 1/3

Parameter estimation

Data availability in a Bayesian framework
We could design an optimal classifier if we knew:
P(ωi) (the priors)
P(x | ωi) (the class-conditional densities)
Unfortunately, we rarely have this complete information!
Instead, we design a classifier from a training sample:
estimating the priors is not a problem, but the samples are often too small to estimate the class-conditional densities (large dimensionality of the feature space!)

A priori information about the problem
E.g., assume normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi), characterized by two parameters
Estimation techniques: Maximum-Likelihood (ML) estimation and Bayesian estimation
The results are nearly identical, but the approaches are different

In ML estimation the parameters are fixed but unknown! The best parameters are obtained by maximizing the probability of obtaining the samples observed.
Bayesian methods view the parameters as random variables having some known prior distribution.
In either approach, we use P(ωi | x) for our classification rule!

Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category.
Suppose that D contains n samples, x1, x2, …, xn.
The ML estimate of θ is, by definition, the value that maximizes P(D | θ):
“It is the value of θ that best agrees with the actually observed training samples.”

Optimal estimation
Let θ = (θ1, θ2, …, θp)ᵗ and let ∇θ be the gradient operator.
We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)
New problem statement: determine the θ̂ that maximizes the log-likelihood, θ̂ = arg maxθ l(θ)
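For the Gaussian case this maximization has a closed-form solution: the ML estimates are the sample mean and the (1/n, biased) sample covariance. A sketch, with synthetic data standing in for the training sample D:

```python
# ML estimates for a multivariate Gaussian:
# mu_hat = sample mean, Sigma_hat = (1/n) * scatter matrix.
import numpy as np

rng = np.random.default_rng(0)
D = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.3], [0.3, 0.5]], size=500)

n = len(D)
mu_hat = D.mean(axis=0)            # maximizes l(theta) with respect to mu
diff = D - mu_hat
sigma_hat = diff.T @ diff / n      # note 1/n (ML, biased), not 1/(n-1)

print(mu_hat)
print(sigma_hat)
```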

Bayesian Estimation
In MLE, θ was assumed to be fixed; in BE, θ is a random variable.
The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification.
Goal: compute P(ωi | x, D).
Given the sample D, Bayes formula can be written as:
P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / Σj P(x | ωj, D) P(ωj | D)


Bayesian Parameter Estimation: Gaussian Case
Goal: estimate θ using the a-posteriori density P(θ | D)
The univariate case, P(μ | D): μ is the only unknown parameter; P(x | μ) ~ N(μ, σ²) and the prior is P(μ) ~ N(μ0, σ0²) (μ0 and σ0 are known!)

Reproducing density: the posterior P(μ | D) is again Gaussian, P(μ | D) ~ N(μn, σn²).
Identifying the two Gaussian forms yields:
μn = [n σ0² / (n σ0² + σ²)] · μ̂n + [σ² / (n σ0² + σ²)] · μ0, where μ̂n is the sample mean
σn² = σ0² σ² / (n σ0² + σ²)
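A sketch of these update formulas for μn and σn² under the stated univariate assumptions (known σ², Gaussian prior N(μ0, σ0²)); the prior parameters, the variance, and the toy sample are illustrative:

```python
# Bayesian estimate of the mean of a univariate Gaussian with known variance sigma2.
# Posterior P(mu | D) ~ N(mu_n, sigma_n^2), using the formulas from the slide.
import numpy as np

mu0, sigma0_2 = 0.0, 1.0                   # prior N(mu0, sigma0^2), illustrative
sigma2 = 0.5                               # known data variance, illustrative
D = np.array([1.9, 2.3, 2.1, 1.7, 2.0])    # toy sample

n = len(D)
sample_mean = D.mean()
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * sample_mean \
       + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n_2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

print(mu_n, sigma_n_2)   # the posterior concentrates around the sample mean as n grows
```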

The univariate case, P(x | D)
P(μ | D) has been computed; P(x | D) remains to be computed:
P(x | D) = ∫ P(x | μ) P(μ | D) dμ ~ N(μn, σ² + σn²)
This provides the desired class-conditional density P(x | Dj, ωj).
Therefore, P(x | Dj, ωj) together with P(ωj), using Bayes formula, gives the Bayesian classification rule:
decide the ωj that maximizes P(x | Dj, ωj) · P(ωj)

Bayesian Parameter Estimation: General Theory
The computation of P(x | D) can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
The form of P(x | θ) is assumed known, but the value of θ is not known exactly
Our knowledge about θ is assumed to be contained in a known prior density P(θ)
The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown P(x)

The basic problem is: “compute the posterior density P(θ | D)”, then “derive P(x | D)”.
Using Bayes formula, we have:
P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ
and, by the independence assumption:
P(D | θ) = Πk=1..n P(xk | θ)
Finally, P(x | D) = ∫ P(x | θ) P(θ | D) dθ
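A sketch of this general recipe carried out by brute-force grid (numerical) integration over a one-dimensional parameter θ (here the unknown mean of a Gaussian with known variance); the model, prior, and toy sample are illustrative assumptions:

```python
# General Bayesian estimation by grid integration over a 1-D parameter theta (the mean).
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

D = np.array([1.8, 2.2, 2.0, 1.9])          # toy sample
theta = np.linspace(-5, 5, 2001)            # grid over the unknown mean
prior = gauss(theta, 0.0, 4.0)              # P(theta), illustrative

# P(D | theta) = prod_k P(x_k | theta)  (independence assumption), known variance 0.5
lik = np.prod([gauss(xk, theta, 0.5) for xk in D], axis=0)

post = lik * prior
post /= np.trapz(post, theta)               # normalize to get P(theta | D)

x_new = 2.1
p_x_given_D = np.trapz(gauss(x_new, theta, 0.5) * post, theta)   # P(x | D)
print(p_x_given_D)
```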

MLE vs. Bayesian estimation
As n → ∞ the two estimates coincide!
MLE: simple and fast (a direct optimization, often with a closed-form solution, vs. numerical integration)
Bayesian estimation: we can express our prior uncertainty about the parameters through P(θ)

Summary
Bayes decision theory: a general framework for probabilistic decision making
Bayes classifier: classification as a special case of decision making (α1: choose ω1)
With the zero-one loss function, the loss terms drop out of the decision rule
Bayes classifier with zero-one loss and Normal densities

Summary (continued)
Parameter estimation: general procedures for estimating the parameters of densities from a sample (applicable beyond the Bayes classifier)
Bayesian machine learning: the marriage of Bayesian decision theory and parameter estimation from a training sample