
1 Bayesian decision theory: A framework for making decisions when uncertainty exists. Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e © 2010 The MIT Press (V1.0)

2 Modeling data as random variables
Example: coin toss.
Given sufficient knowledge, we could use Newton's laws of motion to calculate the result of each toss with minimal uncertainty. In conjunction with such a model, analysis of experimental trajectories would probably reveal why the coin is unfair if heads and tails do not occur with equal probability.
Alternative: accept doubt about the result of the toss. Treat the result as a random variable X subject to P(X = x), and use P(X = x) to make a rational decision about the result of the next toss. Assume we are not interested in why the coin is unfair, if that is the case: "the reason is in the data."

3 Statistical Analysis of Coin-Toss Data
Let heads = 1 and tails = 0. Boolean random variables obey Bernoulli statistics: P(x) = p_o^x (1 − p_o)^(1 − x), where p_o is the probability of heads.
Given a sample of N tosses, an unbiased estimator of p_o is the fraction of tosses that show heads.
Prediction of the next toss: heads if p_o > 1/2, tails otherwise.
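As an illustration (not part of the original slides), a minimal Python sketch of this estimator and prediction rule; the sample of tosses below is hypothetical.

```python
import numpy as np

def estimate_p_heads(tosses):
    """Unbiased estimate of p_o: the fraction of tosses that came up heads (1)."""
    return np.asarray(tosses).mean()

def predict_next(p_heads):
    """Predict heads (1) if p_o > 1/2, tails (0) otherwise."""
    return 1 if p_heads > 0.5 else 0

sample = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]    # hypothetical record of N = 10 tosses
p_hat = estimate_p_heads(sample)            # 0.7
print(p_hat, predict_next(p_hat))           # 0.7 1  -> predict heads
```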

4 Review: Bayes' Rule for binary classification
P(C|x) = p(x|C) P(C) / p(x), i.e. posterior = class likelihood × prior / normalization.
The prior is information relevant to classification that is independent of the attributes x.
The class likelihood is the probability that a member of class C has attributes x.
Assign an instance with attributes x to class C if P(C|x) > 0.5.
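A minimal sketch of this rule (added here for illustration; the prior and likelihood values are hypothetical, not from the slides):

```python
def posterior_c1(prior_c1, lik_c1, lik_c2):
    """P(C1|x) = p(x|C1) P(C1) / [p(x|C1) P(C1) + p(x|C2) P(C2)]."""
    prior_c2 = 1.0 - prior_c1
    evidence = lik_c1 * prior_c1 + lik_c2 * prior_c2   # p(x), the normalization term
    return lik_c1 * prior_c1 / evidence

p_c1 = posterior_c1(prior_c1=0.3, lik_c1=0.8, lik_c2=0.2)   # ~0.632
print("assign to C1" if p_c1 > 0.5 else "assign to C2")
```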

5 Review: Bayes' Rule for K > 2 classes
P(C_i|x) = p(x|C_i) P(C_i) / p(x), with normalization p(x) = Σ_k p(x|C_k) P(C_k).
Assign x to the class with the highest posterior: choose C_i if P(C_i|x) = max_k P(C_k|x).

6 Review: Estimating priors and class likelihoods from data
The fraction of examples belonging to a class is an estimate of its prior. If we assume members of a class are Gaussian distributed, then a mean vector and a covariance matrix parameterize its class likelihood.
With class labels r_i^t (r_i^t = 1 if example x^t belongs to C_i, 0 otherwise), the estimators are
P̂(C_i) = Σ_t r_i^t / N,  m_i = Σ_t r_i^t x^t / Σ_t r_i^t,  S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)^T / Σ_t r_i^t.
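A sketch of these estimators (illustrative only; it assumes X is an N×d array of attribute vectors and r an N×K one-hot label matrix):

```python
import numpy as np

def estimate_gaussian_classes(X, r):
    """Estimate class priors, means, and covariances from labeled data."""
    X, r = np.asarray(X, dtype=float), np.asarray(r, dtype=float)
    counts = r.sum(axis=0)                        # N_i, number of examples per class
    priors = counts / X.shape[0]                  # P_hat(C_i) = N_i / N
    means = (r.T @ X) / counts[:, None]           # m_i = sum_t r_i^t x^t / N_i
    covs = []
    for i in range(r.shape[1]):
        dev = X - means[i]                        # deviations from the class mean
        covs.append((r[:, i][:, None] * dev).T @ dev / counts[i])   # S_i
    return priors, means, np.array(covs)
```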

7 Review: naïve Bayes classification
A simpler model results from assuming that the components of x are independent random variables: the covariance matrix is then diagonal, and p(x|C) is the product of the probabilities of the individual components of x.
Each class is thus characterized by a set of means and variances, one pair per component of the attribute vector.
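A sketch of naïve Bayes scoring under this independence assumption (illustrative; the parameter arrays would come from estimators like the ones above):

```python
import numpy as np

def naive_bayes_log_posteriors(x, priors, means, variances):
    """Unnormalized log P(C_i|x) with p(x|C_i) a product of 1-D Gaussians.

    priors: (K,); means, variances: (K, d) per-class, per-component (diagonal covariance).
    """
    x, means, variances = (np.asarray(a, dtype=float) for a in (x, means, variances))
    log_lik = -0.5 * (np.log(2 * np.pi * variances)
                      + (x - means) ** 2 / variances).sum(axis=1)
    return np.log(np.asarray(priors, dtype=float)) + log_lik

# Assign x to the class with the largest (log) posterior:
# predicted = np.argmax(naive_bayes_log_posteriors(x, priors, means, variances))
```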

8 Minimizing risk given attributes x
Actions: α_i is the action of assigning x to C_i, one of K classes. Loss λ_ik occurs if we take α_i when x actually belongs to C_k.
Expected risk (Duda and Hart, 1973): R(α_i|x) = Σ_k λ_ik P(C_k|x). Take the action with minimum expected risk.

9 Special case: correct decisions incur no loss and all errors have equal cost (the "0/1 loss function"): λ_ik = 0 if i = k and 1 otherwise, so R(α_i|x) = Σ_{k≠i} P(C_k|x) = 1 − P(C_i|x).
For minimum risk, choose the most probable class.

10 Add a rejection option: don't assign a class
Introduce an additional action, reject, with a fixed loss λ, 0 < λ < 1. Under 0/1 loss, the risk of no assignment is λ, and the risk of choosing C_i (the risk of making some assignment) is 1 − P(C_i|x).
Choose the class with the highest posterior if its risk 1 − P(C_i|x) is below λ; otherwise reject.

11 Example of risk minimization with λ_11 = λ_22 = 0, λ_12 = 10, and λ_21 = 1 (loss λ_ik occurs if we take α_i when x belongs to C_k)
R(α_1|x) = λ_11 P(C1|x) + λ_12 P(C2|x) = 10 P(C2|x)
R(α_2|x) = λ_21 P(C1|x) + λ_22 P(C2|x) = P(C1|x)
Choose C1 if R(α_1|x) < R(α_2|x), i.e. if 10 P(C2|x) < P(C1|x), which becomes P(C1|x) > 10/11 using the normalization P(C1|x) + P(C2|x) = 1.
The consequence of erroneously assigning an instance to C1 is so bad that we choose C1 only when we are virtually certain it is correct.
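A sketch of risk-minimizing classification with a loss matrix and an optional reject action (illustrative; the loss values reproduce the example above):

```python
import numpy as np

def choose_action(posteriors, loss, reject_loss=None):
    """Choose the action with minimum expected risk R(a_i|x) = sum_k loss[i, k] P(C_k|x).

    posteriors: (K,) class posteriors; loss: (K, K) with loss[i, k] = lambda_ik.
    reject_loss: optional fixed cost of making no assignment. Returns a class index, or -1 to reject.
    """
    risks = loss @ np.asarray(posteriors, dtype=float)
    best = int(np.argmin(risks))
    if reject_loss is not None and reject_loss < risks[best]:
        return -1
    return best

loss = np.array([[0.0, 10.0],      # lambda_11 = 0, lambda_12 = 10
                 [1.0,  0.0]])     # lambda_21 = 1, lambda_22 = 0
print(choose_action([0.90, 0.10], loss))   # 1 -> C2, since 0.90 < 10/11
print(choose_action([0.95, 0.05], loss))   # 0 -> C1, since 0.95 > 10/11
```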


13 Bayes' classifier based on neighbors
Consider a data set with N examples, N_i of which belong to class i, so the prior estimate is P(C_i) = N_i / N.
Given a new example x, draw a hyper-sphere of volume V in attribute space, centered on x and containing precisely k training examples, irrespective of their class.
If this sphere contains n_i examples from class i, then p(x|C_i) ≈ n_i / (N_i V), and so p(x|C_i) P(C_i) ≈ (n_i / (N_i V)) (N_i / N) = n_i / (N V).

14 Bayes' classifier based on neighbors (continued)
Using Bayes' rule, with p(x) ≈ k / (N V), we find the posteriors P(C_i|x) = n_i / k.
Assign x to the class with the highest posterior, i.e. the class with the highest representation among the k training examples in the hyper-sphere centered on x.
With k = 1 (the nearest-neighbor rule), assign x to the class of its nearest neighbor in the training data.

15 Bayes' classifier based on k nearest neighbors (KNN)
Usually we choose k from a range of values based on validation error.
In 2D, we can visualize the classification by applying KNN to every point in the (x1, x2) plane. As k increases, expect fewer islands and smoother decision boundaries.
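A minimal k-NN sketch of the rule described on these slides (illustrative; it assumes Euclidean distances and arrays X_train, y_train of training attributes and labels):

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Assign x to the most common class among its k nearest training examples."""
    X_train, y_train = np.asarray(X_train, dtype=float), np.asarray(y_train)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest examples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                # class with highest representation

# With k = 1 this reduces to the nearest-neighbor rule.
```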


17 Analysis of binary classification: beyond the confusion matrix

18 Quantities defined by the binary confusion matrix
Let C1 be the positive class, C2 the negative class, and N the number of instances.
Error rate = (FP + FN) / N = 1 − accuracy
False positive rate = FP / (FP + TN) = fraction of C2 instances misclassified
True positive rate = TP / (TP + FN) = fraction of C1 instances correctly classified
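These definitions translate directly into code; a small sketch with hypothetical counts:

```python
def binary_rates(tp, fp, tn, fn):
    """Error rate, false-positive rate, and true-positive rate from confusion-matrix counts."""
    n = tp + fp + tn + fn
    error_rate = (fp + fn) / n      # 1 - accuracy
    fp_rate = fp / (fp + tn)        # fraction of C2 (negative) instances misclassified
    tp_rate = tp / (tp + fn)        # fraction of C1 (positive) instances correctly classified
    return error_rate, fp_rate, tp_rate

print(binary_rates(tp=40, fp=5, tn=50, fn=5))   # hypothetical counts -> (0.1, ~0.091, ~0.889)
```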

19 Receiver operating characteristic (ROC) curve
Let C1 be the positive class and let θ be the threshold on P(C1|x) for assigning x to C1.
If θ is near 1, assignments to C1 are rare but have a high probability of being correct: both the FP-rate and the TP-rate are small.
As θ decreases, both the FP-rate and the TP-rate increase.
For every value of θ, (FP-rate, TP-rate) is a point on the ROC curve.

20 ROC curves (figure): the diagonal corresponds to chance alone; a curve only slightly above the diagonal indicates marginal success.

21 Drawing ROC curves
Assume C1 is the positive class. Rank all examples by decreasing P(C1|x).
In decreasing rank order, move up by 1/N+ (one over the number of positive examples) for each positive example, and move right by 1/N− (one over the number of negative examples) for each negative example.
If all examples are correctly ranked (every positive above every negative), the ROC curve reaches the upper-left corner. If P(C1|x) is uncorrelated with the class labels, the ROC curve lies close to the diagonal.
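A sketch of this ranking procedure (illustrative; the posteriors and labels below are hypothetical):

```python
import numpy as np

def roc_points(scores, labels):
    """ROC curve points: ranked by decreasing score, step up 1/N+ per positive,
    right 1/N- per negative. Returns arrays of FP-rates and TP-rates."""
    order = np.argsort(-np.asarray(scores, dtype=float))     # rank by decreasing P(C1|x)
    labels = np.asarray(labels)[order]
    n_pos, n_neg = (labels == 1).sum(), (labels == 0).sum()
    tpr = np.cumsum(labels == 1) / n_pos
    fpr = np.cumsum(labels == 0) / n_neg
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

fpr, tpr = roc_points([0.9, 0.8, 0.7, 0.6, 0.4], [1, 1, 0, 1, 0])
print(list(zip(fpr, tpr)))   # (0,0), (0,0.33), (0,0.67), (0.5,0.67), (0.5,1.0), (1.0,1.0)
```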

22 Performance with a reduced attribute set is slightly improved: the number of misclassified malignant cases decreased by 2.

