
1 Statistical Learning: Bayesian and ML
COMP155, Sections 20.1-20.2
May 2, 2007

2 Definitions
a posteriori: derived from observed facts
a priori: based on hypothesis or theory rather than experiment

3 Bayesian Learning
Make predictions using all hypotheses, weighted by their probabilities.
Bayes' rule: P(a | b) = α P(b | a) P(a)
For each hypothesis h_i and observed data d: P(h_i | d) = α P(d | h_i) P(h_i)
P(d | h_i) is the likelihood of d under hypothesis h_i
P(h_i) is the hypothesis prior
α is a normalization constant: α = 1 / ∑_i P(d | h_i) P(h_i)

4 Bayesian Learning
We want to predict some quantity X:
P(X | d) = ∑_i P(X | d, h_i) P(h_i | d) = ∑_i P(X | h_i) P(h_i | d)
The predictions are weighted averages over the predictions of the individual hypotheses.
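
To make the two formulas above concrete, here is a minimal Python sketch (not from the original slides; the function and variable names are illustrative) of full Bayesian updating and prediction over a discrete hypothesis space.

```python
# Minimal sketch of full Bayesian learning over a discrete hypothesis space.
# priors[i] is P(h_i); likelihood(data, i) returns P(data | h_i).

def posterior(priors, likelihood, data):
    """Return P(h_i | data) for every hypothesis, via Bayes' rule."""
    unnorm = [likelihood(data, i) * p for i, p in enumerate(priors)]
    alpha = 1.0 / sum(unnorm)        # normalization constant from slide 3
    return [alpha * u for u in unnorm]

def predict(posteriors, pred_given_h):
    """P(X | data) = sum_i P(X | h_i) * P(h_i | data), as on slide 4."""
    return sum(p_h * pred_given_h(i) for i, p_h in enumerate(posteriors))
```

The candy example on the following slides instantiates the priors, likelihoods, and per-hypothesis predictions for five bag types.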

5 Example
Suppose we know that there are 5 kinds of bags of candy:
Type 1: 100% cherry, 0% lime; 10% of all bags
Type 2: 75% cherry, 25% lime; 20% of all bags
Type 3: 50% cherry, 50% lime; 40% of all bags
Type 4: 25% cherry, 75% lime; 20% of all bags
Type 5: 0% cherry, 100% lime; 10% of all bags

6 Example: priors
Given a new bag of candy, predict the type of the bag.
Five hypotheses:
h_1: bag is type 1, P(h_1) = 0.1
h_2: bag is type 2, P(h_2) = 0.2
h_3: bag is type 3, P(h_3) = 0.4
h_4: bag is type 4, P(h_4) = 0.2
h_5: bag is type 5, P(h_5) = 0.1
With no evidence, we use the hypothesis priors.

7 Example: one lime candy
Suppose we unwrap one candy and determine that it is lime.
Here α = 1 / ∑_i P(onelime | h_i) P(h_i) = 1 / 0.5 = 2.
P(h_1 | onelime) = α P(onelime | h_1) P(h_1) = 2 * (0 * 0.1) = 0
P(h_2 | onelime) = α P(onelime | h_2) P(h_2) = 2 * (0.25 * 0.2) = 0.1
P(h_3 | onelime) = α P(onelime | h_3) P(h_3) = 2 * (0.5 * 0.4) = 0.4
P(h_4 | onelime) = α P(onelime | h_4) P(h_4) = 2 * (0.75 * 0.2) = 0.3
P(h_5 | onelime) = α P(onelime | h_5) P(h_5) = 2 * (1.0 * 0.1) = 0.2
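
A quick numeric check of the one-candy update (a sketch, not part of the slides; the list names are illustrative, with values taken from slides 5 and 6):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1) .. P(h_5), from slide 6
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i), from slide 5

unnorm = [l * p for l, p in zip(p_lime, priors)]   # P(onelime | h_i) P(h_i)
alpha = 1.0 / sum(unnorm)                          # the sum is 0.5, so alpha = 2
print([alpha * u for u in unnorm])                 # [0.0, 0.1, 0.4, 0.3, 0.2] (up to rounding)
```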

8 Example: two lime candies
Suppose we unwrap another candy and it is also lime.
Here α = 1 / ∑_i P(twolime | h_i) P(h_i) = 1 / 0.325 ≈ 3.08.
P(h_1 | twolime) = α P(twolime | h_1) P(h_1) = 3.08 * (0 * 0.1) = 0
P(h_2 | twolime) = α P(twolime | h_2) P(h_2) = 3.08 * (0.0625 * 0.2) ≈ 0.038
P(h_3 | twolime) = α P(twolime | h_3) P(h_3) = 3.08 * (0.25 * 0.4) ≈ 0.308
P(h_4 | twolime) = α P(twolime | h_4) P(h_4) = 3.08 * (0.5625 * 0.2) ≈ 0.346
P(h_5 | twolime) = α P(twolime | h_5) P(h_5) = 3.08 * (1.0 * 0.1) ≈ 0.308

9 Example: n lime candies
Suppose we unwrap n candies and they are all lime.
P(h_1 | nlime) = α_n (0^n * 0.1)
P(h_2 | nlime) = α_n (0.25^n * 0.2)
P(h_3 | nlime) = α_n (0.5^n * 0.4)
P(h_4 | nlime) = α_n (0.75^n * 0.2)
P(h_5 | nlime) = α_n (1^n * 0.1)
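
A sketch of the same update for an arbitrary number of all-lime draws (again, the function name and lists are illustrative, not from the slides):

```python
def posteriors_after_n_limes(n):
    """P(h_i | n lime candies) for the five bag types."""
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
    unnorm = [(p ** n) * prior for p, prior in zip(p_lime, priors)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

# posteriors_after_n_limes(1) -> [0.0, 0.1, 0.4, 0.3, 0.2]
# posteriors_after_n_limes(2) -> roughly [0.0, 0.038, 0.308, 0.346, 0.308]
```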

10 (figure slide)

11 Prediction: what candy is next?
P(nextlime | nlime) = ∑_i P(nextlime | h_i) P(h_i | nlime)
= P(nextlime | h_1) P(h_1 | nlime) + P(nextlime | h_2) P(h_2 | nlime) + P(nextlime | h_3) P(h_3 | nlime) + P(nextlime | h_4) P(h_4 | nlime) + P(nextlime | h_5) P(h_5 | nlime)
= 0 * α_n (0^n * 0.1) + 0.25 * α_n (0.25^n * 0.2) + 0.5 * α_n (0.5^n * 0.4) + 0.75 * α_n (0.75^n * 0.2) + 1 * α_n (1^n * 0.1)
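
A sketch of this prediction as a function of n (illustrative names, not from the slides); the value is a weighted average of the per-hypothesis predictions and climbs toward 1 as more limes are observed:

```python
def p_next_lime(n):
    """P(next candy is lime | n lime candies observed)."""
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
    unnorm = [(p ** n) * prior for p, prior in zip(p_lime, priors)]
    alpha = 1.0 / sum(unnorm)
    # weighted average of each hypothesis's own prediction
    return sum(p * alpha * u for p, u in zip(p_lime, unnorm))

# p_next_lime(1) ~ 0.65, p_next_lime(10) ~ 0.97; the value approaches 1.0
```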

12 (figure slide; value shown: 0.97)

13 Analysis: Bayesian Prediction
The true hypothesis eventually dominates.
The posterior probability of any false hypothesis will eventually vanish: the probability that it keeps generating data uncharacteristic of it becomes vanishingly small.
Bayesian prediction is optimal.
Bayesian prediction is expensive: the hypothesis space may be very large (or infinite).

14 MAP Approximation
To avoid the expense of Bayesian learning, one approach is to simply choose the most probable hypothesis and assume it is correct.
MAP = maximum a posteriori
h_MAP = the h_i with the highest value of P(h_i | d)
In the candy example, after 3 limes have been drawn, a MAP learner will always predict that the next candy is lime with 100% probability.
Less accurate, but much cheaper.
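
A minimal MAP sketch for the same candy setup (illustrative names, not from the slides): pick the single most probable hypothesis and predict with it alone.

```python
def map_p_next_lime(n):
    """MAP prediction of P(next is lime) after n lime candies."""
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
    unnorm = [(p ** n) * prior for p, prior in zip(p_lime, priors)]
    h_map = max(range(len(priors)), key=lambda i: unnorm[i])  # argmax of P(h_i | d)
    return p_lime[h_map]                                      # predict with h_MAP alone

# map_p_next_lime(0) -> 0.5  (type 3 has the largest prior)
# map_p_next_lime(3) -> 1.0  (type 5 becomes the MAP hypothesis after 3 limes)
```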

15 (figure slide)

16 Avoiding Complexity
As we've seen earlier, allowing overly complex hypotheses can lead to overfitting.
Bayesian and MAP learning use the hypothesis prior to penalize complex hypotheses.
Complex hypotheses typically get lower priors, since there are far more complex hypotheses than simple ones to share the probability mass.
We get the simplest hypothesis consistent with the data (as per Ockham's razor).

17 ML Approximation
For large data sets the priors become irrelevant; in that case we may use maximum likelihood (ML) learning.
Choose h_ML, the hypothesis that maximizes P(d | h_i), i.e., the hypothesis most likely to have produced the observed data.
Identical to MAP for uniform priors.
ML is the standard (non-Bayesian) statistical learning method.
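
The ML version is the same sketch with the priors dropped (illustrative names, not from the slides): choose the hypothesis with the highest likelihood.

```python
def ml_p_next_lime(n):
    """ML prediction of P(next is lime): ignore the priors entirely."""
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
    likelihoods = [p ** n for p in p_lime]                       # P(d | h_i)
    h_ml = max(range(len(p_lime)), key=lambda i: likelihoods[i])
    return p_lime[h_ml]

# With all-lime data, h_5 maximizes the likelihood for any n >= 1,
# so the ML prediction is 1.0.
```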

18 (figure slide)

19 (figure slide)

20 Exercise
Suppose we were pulling candy from a 50/50 bag (type 3) or a 25/75 bag (type 4).
With full Bayesian learning, what would the posterior probability and prediction plots look like after 100 candies?
What would the prediction plots look like for MAP and ML learning after 1000 candies?
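
One way to generate such plots is by simulation. Below is a hedged sketch (not from the original slides; names and the seed parameter are illustrative) that draws candies from a chosen true bag and tracks the full-Bayesian posteriors and predictions.

```python
import random

def simulate(true_p_lime, n_draws, seed=0):
    """Track full-Bayesian posteriors and predictions while drawing candies."""
    random.seed(seed)
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
    post = priors[:]
    history = []
    for _ in range(n_draws):
        candy_is_lime = random.random() < true_p_lime
        like = [p if candy_is_lime else 1.0 - p for p in p_lime]   # P(candy | h_i)
        unnorm = [l * q for l, q in zip(like, post)]               # incremental Bayes update
        total = sum(unnorm)
        post = [u / total for u in unnorm]
        pred_lime = sum(p * q for p, q in zip(p_lime, post))       # P(next is lime)
        history.append((post[:], pred_lime))
    return history

# e.g. simulate(0.5, 100) for the 50/50 bag, simulate(0.75, 100) for the 25/75 bag
```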

21 Bayesian 50/50 bag (figure slide)

22 (figure slide)

23 Bayesian 25/75 bag (figure slide)

24 (figure slide)

25 MAP 50/50 bag (figure slide)

26 ML 50/50 bag (figure slide)

27 MAP 25/75 bag (figure slide)

28 ML 25/75 bag (figure slide)

29 Exercise

30 Answer

