1 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, with the permission of the authors and the publisher

2 Chapter 2 (Part 1): Bayesian Decision Theory (Sections 1-4)
1. Introduction – Bayesian Decision Theory: pure statistics, probabilities known, optimal decision
2. Bayesian Decision Theory – Continuous Features
3. Minimum-Error-Rate Classification
4. Classifiers, Discriminant Functions and Decision Surfaces

3 1. Introduction: the sea bass/salmon example; state of nature, prior
The state of nature is a random variable. The catch of salmon and sea bass is equiprobable: P(ω1) = P(ω2) (uniform priors), with P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity).

4 Decision rule with only the prior information
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2. Use of the class-conditional information: p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.

6 Posterior, likelihood, evidence
P(ωj | x) = P(x | ωj) P(ωj) / P(x)   (Bayes rule)
where, in the case of two categories, the evidence is P(x) = P(x | ω1) P(ω1) + P(x | ω2) P(ω2).
In words: Posterior = (Likelihood × Prior) / Evidence.
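As an illustration, a minimal NumPy sketch of Bayes rule for two categories; the likelihood values at the observed x are invented, and the priors are uniform as in the sea bass/salmon example:

import numpy as np

# Hypothetical class-conditional likelihoods p(x | w1), p(x | w2) at one observed x
likelihoods = np.array([0.6, 0.2])
priors = np.array([0.5, 0.5])                 # uniform priors, as in the example

evidence = np.sum(likelihoods * priors)       # P(x) = sum_j P(x | wj) P(wj)
posteriors = likelihoods * priors / evidence  # Bayes rule: P(wj | x)
print(posteriors)                             # [0.75 0.25]; always sums to 1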

8 Decision given the posterior probabilities
x is an observation for which: if P(ω1 | x) > P(ω2 | x), the true state of nature is ω1; if P(ω1 | x) < P(ω2 | x), the true state of nature is ω2. Therefore, whenever we observe a particular x, the probability of error is: P(error | x) = P(ω1 | x) if we decide ω2, and P(error | x) = P(ω2 | x) if we decide ω1.

9 Minimizing the probability of error
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2. Therefore: P(error | x) = min [P(ω1 | x), P(ω2 | x)]   (Bayes decision)
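Continuing the sketch above with the same (hypothetical) posterior values, the Bayes decision and its conditional error probability are one line each:

import numpy as np

posteriors = np.array([0.75, 0.25])  # hypothetical P(w1 | x), P(w2 | x)
decision = np.argmax(posteriors)     # index 0 = decide w1, index 1 = decide w2
p_error = np.min(posteriors)         # P(error | x) = min[P(w1 | x), P(w2 | x)]
print(decision, p_error)             # 0 (decide w1), 0.25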

10 2. Bayesian Decision Theory – Continuous Features
Generalization of the preceding ideas:
Use of more than one feature
Use of more than two states of nature
Allowing actions other than deciding the state of nature
Introducing a loss function which is more general than the probability of error

11 Refusing to make a decision in close or bad cases!
Allowing actions other than classification primarily allows the possibility of rejection: refusing to make a decision in close or bad cases. The loss function states how costly each action taken is.

12 Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories")
Let {α1, α2, …, αa} be the set of a possible actions. Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.

13 Overall risk and conditional risk
The conditional risk of taking action αi given x is R(αi | x) = Σj λ(αi | ωj) P(ωj | x), for i = 1, …, a. The overall risk R is the expected loss of the decision rule over all x: R = ∫ R(α(x) | x) p(x) dx. Minimizing R is achieved by minimizing the conditional risk R(αi | x) at every x.
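A minimal sketch of this computation; the loss matrix and posterior values below are invented placeholders:

import numpy as np

# lam[i, j] = loss lambda(a_i | w_j) for taking action a_i when the state is w_j
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.7, 0.3])  # hypothetical P(w_j | x)

cond_risk = lam @ posteriors       # R(a_i | x) = sum_j lam(a_i | w_j) P(w_j | x)
best_action = np.argmin(cond_risk) # choose the action with minimum conditional risk
print(cond_risk, best_action)      # [0.6 0.7], action a1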

14 Select the action αi for which R(αi | x) is minimum
Then R is minimum, and R in this case is called the Bayes risk: the best performance that can be achieved!

15 Two-category classification: α1 = deciding ω1, α2 = deciding ω2
λij = λ(αi | ωj): the loss incurred for deciding ωi when the true state of nature is ωj. Conditional risk: R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x); R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x).

16 Our rule is the following: if R(α1 | x) < R(α2 | x), action α1 ("decide ω1") is taken. This results in the equivalent rule: decide ω1 if (λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2), and decide ω2 otherwise.

17 Likelihood ratio
The preceding rule is equivalent to the following: if the likelihood ratio P(x | ω1) / P(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1), then take action α1 (decide ω1); otherwise take action α2 (decide ω2).
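The rearrangement can be checked numerically; in this sketch the losses, likelihoods, and priors are all invented:

import numpy as np

lam = np.array([[0.0, 2.0],   # lam11, lam12
                [1.0, 0.0]])  # lam21, lam22
px_w = np.array([0.6, 0.2])   # hypothetical p(x | w1), p(x | w2)
priors = np.array([0.5, 0.5])

ratio = px_w[0] / px_w[1]     # likelihood ratio
threshold = ((lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0])) * (priors[1] / priors[0])
decision = 1 if ratio > threshold else 2  # take action a1 (decide w1) if ratio exceeds threshold
print(ratio, threshold, decision)         # 3.0, 2.0, decide w1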

18 Optimal decision property
“If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions.”

19 3. Minimum-Error-Rate Classification
Actions are decisions on classes. If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j. Seek a decision rule that minimizes the probability of error, which is the error rate.

20 Introduction of the zero-one loss function:
λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c. Therefore, the conditional risk is R(αi | x) = Σj≠i P(ωj | x) = 1 − P(ωi | x). "The risk corresponding to this loss function is the average probability of error."

21 Minimizing the risk requires maximizing P(ωi | x)
(since R(αi | x) = 1 − P(ωi | x)). For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i.
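Under the zero-one loss the machinery collapses to picking the largest posterior; a minimal sketch for three hypothetical classes:

import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])  # hypothetical P(w_i | x), i = 1..3
cond_risk = 1.0 - posteriors            # R(a_i | x) = 1 - P(w_i | x)
assert np.argmin(cond_risk) == np.argmax(posteriors)  # min risk == max posterior
print("decide w%d" % (np.argmax(posteriors) + 1))     # decide w2 here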

22 Decision regions under the zero-one loss function
If λ is the zero-one loss function, the threshold on the likelihood ratio becomes θλ = P(ω2) / P(ω1): decide ω1 if P(x | ω1) / P(x | ω2) > P(ω2) / P(ω1).

24 4. Classifiers, Discriminant Functions and Decision Surfaces
The multi-category case: a set of discriminant functions gi(x), i = 1, …, c. The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.

26 Max. discriminant corresponds to min. risk!
Let gi(x) = −R(αi | x) (max. discriminant corresponds to min. risk!). For the minimum error rate, we take gi(x) = P(ωi | x) (max. discriminant corresponds to max. posterior!). Equivalently, gi(x) ∝ P(x | ωi) P(ωi), or gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm).

27 Feature space divided into c decision regions
If gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means: assign x to ωi). The two-category case: a classifier is a "dichotomizer" that has two discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x). Decide ω1 if g(x) > 0; otherwise decide ω2.

28 The computation of g(x)
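The figure originally on this slide is not reproduced here; as a stand-in, a minimal sketch of the computation it depicts, using the log-discriminant form from slide 26 (the likelihoods and priors are invented):

import numpy as np

px_w = np.array([0.6, 0.2])        # hypothetical p(x | w1), p(x | w2)
priors = np.array([0.4, 0.6])      # hypothetical priors

g = np.log(px_w) + np.log(priors)  # g_i(x) = ln p(x | w_i) + ln P(w_i)
gx = g[0] - g[1]                   # dichotomizer: g(x) = g1(x) - g2(x)
print("decide w1" if gx > 0 else "decide w2")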
30 Chapter 2 (Part 2): Bayesian Decision Theory (Sections 5, 6, 9)
5. The Normal Density
6. Discriminant Functions for the Normal Density
9. Bayes Decision Theory – Discrete Features

31 5. The Normal Density: univariate density
An analytically tractable, continuous density. Many processes are asymptotically Gaussian: handwritten characters and speech sounds can be viewed as an ideal prototype corrupted by a random process (central limit theorem). p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − μ) / σ)² ], where: μ = mean (or expected value) of x, and σ² = expected squared deviation, or variance.
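A direct transcription of this density into code, as a minimal sketch (the evaluation point, μ, and σ are arbitrary test values):

import numpy as np

def univariate_normal(x, mu, sigma):
    """p(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)**2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

print(univariate_normal(0.0, mu=0.0, sigma=1.0))  # peak value, about 0.3989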

33 Multivariate normal density p(x) ~ N(μ, Σ)
The multivariate normal density in d dimensions is: p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2) (x − μ)t Σ⁻¹ (x − μ) ], where: x = (x1, x2, …, xd)t (t stands for the transpose vector form); μ = (μ1, μ2, …, μd)t is the mean vector; Σ is the d×d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse, respectively.
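The same density in code, as a minimal NumPy sketch; x, μ, and Σ below are invented test values:

import numpy as np

def multivariate_normal(x, mu, Sigma):
    """p(x) for x ~ N(mu, Sigma) in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

x = np.array([0.5, 1.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
print(multivariate_normal(x, mu, Sigma))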

34 Properties of the Normal Density: the covariance matrix Σ
Always symmetric. Always positive semi-definite, but for our purposes Σ is positive definite: its determinant is positive, so Σ is invertible. The eigenvalues are real and positive, and the principal axes of the hyperellipsoidal loci of points of constant density are the eigenvectors of Σ.

37 6. Discriminant Functions for the Normal Density
We saw that minimum error-rate classification can be achieved by the discriminant function gi(x) = ln p(x | ωi) + ln P(ωi). For the case of multivariate normal densities, p(x | ωi) ~ N(μi, Σi), this gives gi(x) = −(1/2) (x − μi)t Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi).

38 Case 1: Σi = σ²I (I stands for the identity matrix)
The discriminant functions reduce to linear functions: gi(x) = wit x + wi0, with weight vector wi = μi / σ² and threshold (bias) wi0 = −μit μi / (2σ²) + ln P(ωi).

39 A classifier that uses linear discriminant functions is called a "linear machine"
The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x). If the priors are equal for all classes, this reduces to the minimum-distance classifier, where an unknown x is assigned to the class of the nearest mean.
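A minimal sketch of the minimum-distance special case (equal priors, Σi = σ²I); the class means and query point are invented:

import numpy as np

means = np.array([[0.0, 0.0],   # hypothetical mu_1
                  [3.0, 3.0]])  # hypothetical mu_2
x = np.array([1.0, 0.5])

# With equal priors and Sigma_i = sigma^2 I, maximizing g_i(x) is equivalent to
# minimizing ||x - mu_i||: assign x to the class of the nearest mean.
dists = np.linalg.norm(means - x, axis=1)
print("decide w%d" % (np.argmin(dists) + 1))  # decide w1 here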

41 The hyperplane separating Ri and Rj is always orthogonal to the line linking the means!

44 Case 2: Σi = Σ (the covariance matrices of all classes are identical but arbitrary)
The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! If the priors are equal, the rule reduces to assigning x to the class whose mean minimizes the squared Mahalanobis distance (x − μi)t Σ⁻¹ (x − μi).
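A minimal sketch of the equal-priors rule for this case; the shared covariance matrix, class means, and query point are invented:

import numpy as np

means = np.array([[0.0, 0.0],
                  [2.0, 2.0]])      # hypothetical class means
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.5]])      # shared covariance, identical for all classes
Sigma_inv = np.linalg.inv(Sigma)
x = np.array([1.2, 0.4])

# Squared Mahalanobis distance (x - mu_i)^t Sigma^-1 (x - mu_i) per class:
d2 = np.array([(x - m) @ Sigma_inv @ (x - m) for m in means])
print("decide w%d" % (np.argmin(d2) + 1))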

48 Case 3: Σi = arbitrary
The covariance matrices are different for each category. The discriminant functions are quadratic in x, and the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
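A minimal sketch of the general quadratic discriminant for this case, using the log form from slide 37; the per-class means, covariances, and priors are invented:

import numpy as np

def g(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^t Sigma^-1 (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([0.8, -0.3])
params = [  # (mean, covariance, prior) per class, all hypothetical
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 1.0]), np.array([[2.0, 0.7], [0.7, 1.0]]), 0.5),
]
scores = [g(x, m, S, p) for m, S, p in params]
print("decide w%d" % (int(np.argmax(scores)) + 1))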
