
1 Bayesian Decision Theory: Basic Concepts, Discriminant Functions, The Normal Density, ROC Curves

2 Figure 1.1

3 Basic Concepts Definitions. State of nature. Let c denote the state of nature. c is a random variable. (e.g., c = c1 for sea bass or c = c2 for salmon ) A priori probabilities. Let P(c1) and P(c2) denote the a priori probability of c1 and c2 respectively. We know P(c1) + P(c2) = 1. Decision rule. Decide c1 if P(c1) > P(c2); otherwise choose c2. But we know there will be many mistakes ….

4 Basic Concepts Class-conditional probability density function. Let x be a continuous random variable. p(x|c) is the probability density for x given the state of nature c. For example, what is the probability density of lightness given that the class is salmon? p(lightness | salmon) ? Or what is the probability density of lightness given that the class is sea bass? p(lightness | sea bass) ?

5 Figure 2.1

6 Bayes Formula How do we combine a priori and class-conditional probabilities to know the probability of a state of nature? Bayes Formula: P(cj|x) = p(x|cj) P(cj) / p(x), where P(cj|x) is the posterior probability, p(x|cj) the likelihood, P(cj) the prior probability, and p(x) the evidence. Bayes Decision: Choose c1 if P(c1|x) > P(c2|x); otherwise choose c2.
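As a rough sketch of the formula in code, the snippet below computes the posteriors for two classes from assumed priors and Gaussian class-conditional densities (the priors, means, and standard deviations are only illustrative, not taken from the slides):

```python
# Minimal illustration of Bayes' formula for two classes (illustrative numbers).
from scipy.stats import norm

priors = {"c1": 0.7, "c2": 0.3}                 # P(c1), P(c2): assumed priors
likelihoods = {"c1": norm(loc=4.0, scale=1.0),  # p(x|c1): assumed class-conditional density
               "c2": norm(loc=6.0, scale=1.0)}  # p(x|c2)

def posterior(x):
    """Return P(cj|x) for both classes using Bayes' formula."""
    joint = {c: likelihoods[c].pdf(x) * priors[c] for c in priors}  # p(x|cj) P(cj)
    evidence = sum(joint.values())                                   # p(x)
    return {c: joint[c] / evidence for c in joint}

post = posterior(5.2)
decision = max(post, key=post.get)   # Bayes decision: pick the larger posterior
print(post, "->", decision)
```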

7 Figure 2.2

8 Simplifying Bayes Rule Bayes Formula. P(cj | x) = p(x|cj) P(cj) / p(x) The evidence p(x) is the same for all states or classes so we can dispense with it to make a decision. Rule: Choose c1 if p(x|c1)P(c1) > p(x|c2)P(c2); otherwise decide c2 Observations: If p(x|c1) = p(x|c2) then the decision depends on the prior probabilities only. If P(c1) = P(c2) then the decision depends on the likelihoods only.

9 Thomas Bayes Born in London (1701). Studied logic and theology (Univ. of Edinburgh). Fellow of the Royal Society (year 1742). Given white and black balls in an urn, what is the prob. of drawing one or the other? Given one or more balls, what can be said about the number of balls in the urn?

10 Minimizing Error What is the probability of error? P(error|x) = P(c1|x) if we decide c2, and P(error|x) = P(c2|x) if we decide c1. Does the Bayes rule minimize the probability of error? Yes: P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx, and the Bayes rule makes P(error|x) as small as it can be for every x, so the integral is minimized.
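As a quick numerical illustration of this integral, the sketch below estimates the Bayes error for two assumed Gaussian classes; under the Bayes rule the per-x error is the smaller of the two joint densities (all parameters are made up for the example):

```python
# Rough numerical estimate of the Bayes error for two Gaussian classes (assumed parameters).
import numpy as np
from scipy.stats import norm

P1, P2 = 0.5, 0.5            # assumed priors
p1 = norm(4.0, 1.0)          # assumed p(x|c1)
p2 = norm(6.0, 1.0)          # assumed p(x|c2)

x = np.linspace(-2.0, 12.0, 20001)
# P(error) = integral of min( p(x|c1)P(c1), p(x|c2)P(c2) ) dx under the Bayes rule.
integrand = np.minimum(p1.pdf(x) * P1, p2.pdf(x) * P2)
bayes_error = np.trapz(integrand, x)
print(bayes_error)
```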

11 Loss Function Let {c1, c2, …} be the possible states of nature. Let {a1, a2, …} be the possible actions. The loss function λ(ai|cj) is the loss incurred for taking action ai when the state of nature is cj. Suppose we observe a particular x and consider taking action ai. If the true state is cj, the loss will be λ(ai|cj). The expected loss is R(ai|x) = Σj λ(ai|cj) P(cj|x) (this is also called the conditional risk). Decision: Select the action that minimizes the conditional risk ( ** best possible performance ** )
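A minimal sketch of the minimum-risk decision, with a made-up 2x2 loss matrix and made-up posteriors (none of these numbers come from the slides):

```python
# Conditional risk R(ai|x) = sum_j loss[i][j] * P(cj|x); choose the action with the smallest risk.
import numpy as np

loss = np.array([[0.0, 2.0],    # λ(a1|c1), λ(a1|c2)  (illustrative losses)
                 [5.0, 0.0]])   # λ(a2|c1), λ(a2|c2)
posteriors = np.array([0.6, 0.4])          # P(c1|x), P(c2|x) for the observed x

risks = loss @ posteriors                  # R(ai|x) for each action ai
best_action = int(np.argmin(risks))        # minimum-risk (Bayes) decision
print(risks, "-> choose a%d" % (best_action + 1))
```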

12 Two Categories How does this apply when we only have two categories? Write λij = λ(ai|cj). R(a1|x) = λ11 P(c1|x) + λ12 P(c2|x) R(a2|x) = λ21 P(c1|x) + λ22 P(c2|x) Decide a1 if R(a1|x) < R(a2|x), i.e., if (λ21 − λ11) P(c1|x) > (λ12 − λ22) P(c2|x)

13 Two Categories Another way of stating this is as follows. Decide a1 if p(x|c1) / p(x|c2) > [(λ12 − λ22) / (λ21 − λ11)] · P(c2) / P(c1) Stated as: choose a1 if the likelihood ratio (the left-hand side) exceeds a threshold value (the right-hand side), a constant that is independent of the observation x.
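The same rule written as a small sketch, using assumed losses, priors, and Gaussian densities (all illustrative):

```python
# Two-category decision via the likelihood-ratio test.
from scipy.stats import norm

lam11, lam12, lam21, lam22 = 0.0, 2.0, 5.0, 0.0     # illustrative loss values
P1, P2 = 0.5, 0.5                                    # assumed priors
p1, p2 = norm(4.0, 1.0), norm(6.0, 1.0)              # assumed class-conditional densities

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)                             # likelihood ratio
    threshold = (lam12 - lam22) / (lam21 - lam11) * P2 / P1   # constant, independent of x
    return "a1" if ratio > threshold else "a2"

print(decide(4.5), decide(5.5))
```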

14 Figure 2.3

15 Zero-One Loss We will normally be concerned with the symmetrical or zero-one loss function: λ(ai|cj) = 0 if i = j, and 1 if i ≠ j. In this case the conditional risk is: R(ai|x) = Σj λ(ai|cj) P(cj|x) = Σj≠i P(cj|x) = 1 – P(ci|x)
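A tiny sketch of what this means in practice: under zero-one loss, minimizing the conditional risk is the same as picking the class with the largest posterior (the posteriors below are made up):

```python
# Under zero-one loss, R(ai|x) = 1 - P(ci|x), so the minimum-risk action is the max-posterior class.
posteriors = {"c1": 0.2, "c2": 0.5, "c3": 0.3}     # illustrative P(ci|x)

risks = {c: 1.0 - p for c, p in posteriors.items()}   # R(ai|x) = 1 - P(ci|x)
decision = min(risks, key=risks.get)                   # same class as max posterior
print(risks, "->", decision)
```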

16 Exercise This question refers to Bayes classifiers. Let {C1, C2, …, Cn} be the possible states of nature or classes. Let {a1, a2, …, ak} be the possible actions. We define a loss function λ(ai|Cj) as the loss incurred for taking action ai when the state of nature is Cj. Let's define a new loss function as follows: λ(ai|Cj) = 0 if i = j and λ(ai|Cj) = 3 if i ≠ j. Assume the posterior distribution is uniform (i.e., every class has the same posterior probability, P(Cj|x) = 1/n). What is the expected risk R(ai|x)?

17 Exercise R(ai|x) = Σj λ(ai|Cj) P(Cj|x) = Σj λ(ai|Cj) (1/n) = (1/n) · 3(n − 1) = 3 − 3/n = 3(1 − 1/n)
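A quick numerical check of this result (purely illustrative values of n):

```python
# Verify that with uniform posteriors 1/n and loss 3 for i != j, R(ai|x) = 3(1 - 1/n).
def expected_risk(n, i=0):
    posteriors = [1.0 / n] * n                          # P(Cj|x) = 1/n for every class
    loss = [0.0 if j == i else 3.0 for j in range(n)]   # λ(ai|Cj)
    return sum(l * p for l, p in zip(loss, posteriors))

for n in (2, 4, 10):
    print(n, expected_risk(n), 3 * (1 - 1 / n))   # the two columns should agree
```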

18 Discriminant Functions How do we represent pattern classifiers? The most common way is through discriminant functions. Recall that {c1, c2, …} denotes the possible states of nature. For each class we create a discriminant function gi(x). The classifier assigns class ci if gi(x) > gj(x) for all j ≠ i. Our classifier is a network or machine that computes c discriminant functions.
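A minimal sketch of this "bank of discriminant functions" view; the three gi below are arbitrary placeholders just to show the argmax structure, not functions from the slides:

```python
# A classifier as a set of discriminant functions: assign the class whose gi(x) is largest.
discriminants = [
    lambda x: -(x - 2.0) ** 2,   # g1(x)  (placeholder)
    lambda x: -(x - 5.0) ** 2,   # g2(x)  (placeholder)
    lambda x: -(x - 9.0) ** 2,   # g3(x)  (placeholder)
]

def classify(x):
    scores = [g(x) for g in discriminants]
    return scores.index(max(scores)) + 1   # index i of the largest gi(x)

print(classify(4.1), classify(8.0))
```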

19 Figure 2.5

20 Simplifying the Discriminant Functions Notice the decision is the same if we replace every gi(x) with f(gi(x)), assuming f(·) is a monotonically increasing function. Alternatives: gi(x) = P(ci|x) gi(x) = p(x|ci) P(ci) gi(x) = ln p(x|ci) + ln P(ci) (where ln is the natural logarithm) The net effect is to divide the feature space into c regions (one for each class). We then have c decision regions separated by decision boundaries.
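A short sketch of why the log form is equivalent: applying the (monotone) logarithm changes the scores but not which class wins (the Gaussian likelihoods and priors are assumed for illustration):

```python
# Using p(x|ci)P(ci) or ln p(x|ci) + ln P(ci) as gi(x) picks the same class.
import math
from scipy.stats import norm

priors = [0.3, 0.7]                          # assumed P(ci)
densities = [norm(4.0, 1.0), norm(6.0, 1.0)] # assumed p(x|ci)

def argmax(scores):
    return scores.index(max(scores))

x = 5.4
g_plain = [d.pdf(x) * P for d, P in zip(densities, priors)]
g_log   = [d.logpdf(x) + math.log(P) for d, P in zip(densities, priors)]
print(argmax(g_plain) == argmax(g_log))   # True: the monotone transform preserves the decision
```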

21 Figure 2.6

22 Normal Density Recall that using the Bayes rule a discriminant function needs two probabilities: gi(x) = p(x|ci) P(ci) How do we model the likelihood or class-conditional probability? The most common approach is to assume a multivariate normal density; in the univariate case: p(x) = 1 / (σ √(2π)) exp( −(1/2) [(x − u) / σ]² ) where u = E[x] = ∫ x p(x) dx and σ² = E[(x − u)²] = ∫ (x − u)² p(x) dx
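The univariate density written out directly, with a check against a library implementation (the mean and standard deviation are assumed values):

```python
# Univariate normal density: p(x) = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - u)/sigma)**2)
import math
from scipy.stats import norm

def normal_pdf(x, u, sigma):
    return math.exp(-0.5 * ((x - u) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

print(normal_pdf(1.3, u=0.0, sigma=2.0), norm(0.0, 2.0).pdf(1.3))   # the two values should match
```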

23 Figure 2.7

24 Normal Density If the features are statistically independent and the variance is the same for all features, the discriminant function is simple and linear in nature. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are pieces of hyperplanes defined by linear equations.
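A sketch of why this gives a linear machine: with independent features and a shared variance σ², gi(x) = ln p(x|ci) + ln P(ci) reduces (after dropping terms common to all classes) to a linear function wi·x + wi0. The class means, priors, and σ below are assumed for illustration:

```python
# Linear discriminant for Gaussian classes with shared covariance sigma^2 * I.
import numpy as np

sigma = 1.5
means = [np.array([0.0, 0.0]), np.array([3.0, 2.0])]   # assumed class means
priors = [0.5, 0.5]                                     # assumed priors

def linear_discriminant(i, x):
    # The term quadratic in x is the same for every class, so it is dropped.
    w = means[i] / sigma**2                                          # weight vector wi
    w0 = -means[i] @ means[i] / (2 * sigma**2) + np.log(priors[i])   # bias term wi0
    return w @ x + w0

x = np.array([1.0, 1.5])
scores = [linear_discriminant(i, x) for i in range(2)]
print(int(np.argmax(scores)))   # class chosen by the linear machine
```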

25 Figure 2.10

26 Figure 2.11

27 Signal Detection Theory Suppose we want to detect a single pulse from a signal. We assume the signal contains some random noise. When the signal is present we observe a normal distribution with mean u2; when the signal is not present we observe a normal distribution with mean u1. We assume the same standard deviation σ in both cases. Can we measure the discriminability of the problem? Can we do this independently of the threshold x*? Discriminability: d’ = | u2 – u1 | / σ
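A small sketch computing d’ for two assumed Gaussian response distributions, along with the hit and false-alarm rates implied by one particular threshold (all numbers are illustrative):

```python
# Discriminability of two Gaussian response distributions with the same sigma.
from scipy.stats import norm

u1, u2, sigma = 0.0, 1.5, 1.0            # assumed means (signal absent / present) and shared std
d_prime = abs(u2 - u1) / sigma
print("d' =", d_prime)

x_star = 0.75                            # an arbitrary threshold
hit = norm(u2, sigma).sf(x_star)         # P(x > x* | signal present)
false_alarm = norm(u1, sigma).sf(x_star) # P(x > x* | signal absent)
print(hit, false_alarm)
```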

28 Figure 2.19

29 Signal Detection Theory How do we find d’ if we do not know u1, u2, or x*? From the data we can compute: 1. P(x > x* | c2), a hit. 2. P(x > x* | c1), a false alarm. 3. P(x < x* | c2), a miss. 4. P(x < x* | c1), a correct rejection. If we plot the hit rate against the false-alarm rate as the threshold x* varies, we obtain a ROC (receiver operating characteristic) curve. With it we can distinguish between discriminability and bias.
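A sketch that traces ROC points by sweeping the threshold and recovers d’ from each (hit, false alarm) pair via the inverse normal CDF; the underlying Gaussian model and its parameters are assumed, as in the previous example:

```python
# ROC points from a swept threshold; d' = ppf(hit) - ppf(false alarm) under the Gaussian model.
import numpy as np
from scipy.stats import norm

u1, u2, sigma = 0.0, 1.5, 1.0             # assumed means and shared standard deviation
for x_star in np.linspace(-3.0, 5.0, 9):
    hit = norm(u2, sigma).sf(x_star)      # P(x > x* | c2)
    fa = norm(u1, sigma).sf(x_star)       # P(x > x* | c1)
    d_prime = norm.ppf(hit) - norm.ppf(fa)   # equals (u2 - u1)/sigma regardless of x*
    print(f"x*={x_star:+.1f}  hit={hit:.3f}  false alarm={fa:.3f}  d'={d_prime:.3f}")
```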

30 Figure 2.20

31 Points to Remember: Bayes formula and Bayes decision theory; loss functions (e.g., zero-one loss); discriminant functions; ROC curves.

