
1 Bayesian Learning Rong Jin

2 Outline
MAP learning vs. ML learning; minimum description length principle; Bayes optimal classifier; bagging

3 Maximum Likelihood Learning (ML)
Find the model that best fits the data by maximizing the log-likelihood of the training data. Logistic regression: the parameters are found by maximizing the likelihood of the training data.
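As a sketch of the formulas that were images on the original slide (the notation θ, w, and labels y_i ∈ {−1, +1} are assumptions, not from the transcript):

```latex
% ML estimate over training data D = {(x_i, y_i)}_{i=1}^n
\hat{\theta}_{ML} = \arg\max_{\theta} \; \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)

% Logistic regression likelihood for labels y_i \in \{-1, +1\}
P(y_i \mid x_i; w) = \frac{1}{1 + \exp(-y_i \, w^\top x_i)}
```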

4 Maximum A Posteriori Learning (MAP)
In ML learning, models are determined solely by the training examples. Very often we have prior knowledge or preferences about the parameters/models, and ML learning is unable to incorporate them. In maximum a posteriori (MAP) learning, knowledge/preference about the parameters/models is incorporated through a prior on the parameters.
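A compact way to write the difference (same assumed notation as above): the prior enters the objective as an additive log term.

```latex
\hat{\theta}_{MAP}
  = \arg\max_{\theta} \; P(\theta \mid D)
  = \arg\max_{\theta} \; \big[ \log P(D \mid \theta) + \log P(\theta) \big]
```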

5 Example: Logistic Regression
ML learning plus prior knowledge/preference: no feature should dominate all other features → prefer small weights. This corresponds to a Gaussian prior on the parameters/models:
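The prior shown as an image on the slide likely has this standard form (σ is an assumed hyperparameter controlling how strongly small weights are preferred):

```latex
P(w) \propto \exp\!\left( -\frac{\lVert w \rVert^2}{2\sigma^2} \right)
\qquad\Longrightarrow\qquad
\log P(w) = -\frac{\lVert w \rVert^2}{2\sigma^2} + \text{const}
```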

7 Example (cont’d)
MAP learning for logistic regression, compared to regularized logistic regression: the log of the Gaussian prior plays the role of the L2 penalty term.
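A minimal numerical sketch of this equivalence (function and parameter names such as lam and lr are assumptions, not from the slides): MAP training with a Gaussian prior is gradient ascent on the L2-penalized log-likelihood, i.e. regularized logistic regression with λ = 1/(2σ²).

```python
import numpy as np

def map_logistic_regression(X, y, lam=0.1, lr=0.1, iters=1000):
    """MAP estimate for logistic regression with a zero-mean Gaussian prior.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    lam corresponds to 1 / (2 * sigma^2) and acts as the L2 regularization weight.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient of  sum_i log sigmoid(y_i w^T x_i)  -  lam * ||w||^2
        grad = X.T @ (y / (1.0 + np.exp(margins))) - 2.0 * lam * w
        w += lr * grad / n
    return w

# usage (hypothetical data): w = map_logistic_regression(X_train, y_train, lam=0.01)
```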

9 Minimum Description Length Principle
Occam’s razor: prefer the simplest hypothesis. The simplest hypothesis is the one with the shortest description length. The minimum description length (MDL) principle therefore prefers the hypothesis with the shortest total encoding, where L_C(x) is the description length of message x under coding scheme C: the # of bits to encode hypothesis h (complexity of the model) plus the # of bits to encode data D given h (# of mistakes).
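In symbols (reconstructed from the two terms named on the slide):

```latex
h_{MDL} = \arg\min_{h \in H} \;
  \big[ \underbrace{L_{C_1}(h)}_{\text{bits to encode } h}
      + \underbrace{L_{C_2}(D \mid h)}_{\text{bits to encode } D \text{ given } h} \big]
```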

10 Minimum Description Length Principle
Sender/receiver view: to transmit the training labels D, the sender can send D directly, send only h (if h classifies D perfectly), or send h plus the exceptions D|h; MDL picks the cheapest option.

11 Example: Decision Tree
H = decision trees, D = training data labels. L_C1(h) is the # of bits to describe tree h; L_C2(D|h) is the # of bits to describe D given tree h. Note that L_C2(D|h) = 0 if the examples are classified perfectly by h; otherwise only the exceptions need to be described. h_MDL trades off tree size against training errors.
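A worked illustration of the tradeoff (the bit counts below are invented for the example, not taken from the slide):

```latex
% invented bit counts, for illustration only
L_{C_1}(h_1) + L_{C_2}(D \mid h_1) = 120 + 0 = 120 \ \text{bits (large tree, no errors)}
L_{C_1}(h_2) + L_{C_2}(D \mid h_2) = 40 + 8 \times 6 = 88 \ \text{bits (small tree, 8 exceptions)}
```

Even though h_2 makes 8 training errors (each costing 6 bits to flag as an exception), its total description is shorter, so h_MDL prefers the smaller tree.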

12 MAP vs. MDL
MAP learning picks the hypothesis maximizing P(D|h)P(h). Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses −log2 p bits. Interpreting MAP via the MDL principle: −log2 P(h) is the description length of h under the optimal code, and −log2 P(D|h) is the description length of the exceptions under the optimal code.
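Putting the two facts together (a standard derivation, consistent with the terms named on the slide):

```latex
h_{MAP} = \arg\max_h \; P(D \mid h)\,P(h)
        = \arg\min_h \; \big[ -\log_2 P(D \mid h) - \log_2 P(h) \big]
```

Under the optimal codes, the two terms are precisely the description lengths named above, so h_MAP coincides with h_MDL.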

13 Problems with Maximum Approaches
Consider three possible hypotheses. Maximum approaches (ML/MAP) will pick h1. Given a new instance x, maximum approaches will output +. However, is this the most probable result?
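The posterior values on the original slide were images and are not in the transcript; the following numbers are purely illustrative:

```latex
P(h_1 \mid D) = 0.4,\ h_1(x) = {+} \qquad
P(h_2 \mid D) = 0.3,\ h_2(x) = {-} \qquad
P(h_3 \mid D) = 0.3,\ h_3(x) = {-}
```

The maximum approach picks h_1 and outputs +, yet the total posterior mass behind − is 0.6 > 0.4, so − is actually the more probable label.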

14 Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification averages the predictions of all hypotheses, weighted by their posteriors (formula below). In the example above, the most probable class is −.
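The classification rule referred to above (a standard statement of the Bayes optimal classifier):

```latex
c^{*} = \arg\max_{c \in C} \; \sum_{h \in H} P(c \mid x, h)\, P(h \mid D)
```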

16 When Do We Need Bayesian Averaging?
Bayes optimal classification requires the full weighted average over hypotheses. When do we need the Bayesian average? When the posterior has multiple modes, or when the optimal mode is flat. When should we NOT use the Bayesian average? When Pr(h|D) cannot be estimated accurately.

17 Computational Issues with the Bayes Optimal Classifier
Bayes optimal classification needs to sum over all possible models/hypotheses h, which is expensive or impossible when the model/hypothesis space is large (example: decision trees). Solution: sampling!

18 Gibbs Classifier
Gibbs algorithm: choose one hypothesis at random according to P(h|D) and use it to classify the new instance. Surprising fact: this simple scheme can be improved by sampling multiple hypotheses from P(h|D) and averaging their classification results, e.g. via Markov chain Monte Carlo (MCMC) sampling or importance sampling.
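A minimal sketch of the sampling idea (the containers hypotheses and posterior, and the helper name gibbs_predict, are assumptions for illustration):

```python
import numpy as np

def gibbs_predict(x, hypotheses, posterior, rng, n_samples=1):
    """Gibbs-style classification.

    hypotheses: list of callables h(x) -> label; posterior: their P(h|D) weights.
    n_samples=1 is the plain Gibbs classifier; n_samples > 1 averages several
    sampled hypotheses, moving toward the Bayes optimal (Bayesian average) answer.
    """
    idx = rng.choice(len(hypotheses), size=n_samples, p=posterior)
    votes = [hypotheses[i](x) for i in idx]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]          # majority vote over sampled hypotheses

# usage (hypothetical objects):
# rng = np.random.default_rng(0)
# label = gibbs_predict(x_new, hypotheses, posterior, rng, n_samples=25)
```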

19 Bagging Classifiers
In general, sampling from P(h|D) is difficult because P(h|D) itself is difficult to compute. Example: how would we compute P(h|D) for a decision tree? P(h|D) is impossible to compute for a non-probabilistic classifier such as an SVM, and it is extremely small when the hypothesis space is large. Bagging classifiers realize sampling from P(h|D) through sampling of the training examples.

20 Bootstrap Sampling
Bagging = bootstrap aggregating. Bootstrap sampling: given a set D containing m training examples, create D_i by drawing m examples at random with replacement from D. On average, each D_i leaves out about 37% (≈ 1/e ≈ 0.368) of the examples in D (see the sketch below).
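A minimal sketch of one bootstrap draw (function name assumed); the ≈ 0.37 figure comes from the fact that each example is missed with probability (1 − 1/m)^m → 1/e:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw m examples uniformly at random, with replacement, from (X, y)."""
    m = len(X)
    idx = rng.integers(0, m, size=m)               # duplicates allowed
    out_of_bag = np.setdiff1d(np.arange(m), idx)   # on average ~36.8% of D
    return X[idx], y[idx], out_of_bag
```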

21 Bagging Algorithm
Create k bootstrap samples D1, D2, …, Dk. Train a distinct classifier h_i on each D_i. Classify a new instance by a vote of the classifiers with equal weights (see the sketch below).
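A runnable sketch of the three steps (scikit-learn decision trees are an assumption; the slide does not prescribe a base learner, and any other could be substituted):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit_predict(X_train, y_train, X_test, k=50, seed=0):
    """Train k trees on bootstrap samples and classify by equal-weight majority vote."""
    rng = np.random.default_rng(seed)
    m = len(X_train)
    votes = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)                  # bootstrap sample D_i
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))                # h_i's predictions
    votes = np.stack(votes)                               # shape (k, n_test)
    preds = []
    for col in votes.T:                                   # one test instance at a time
        labels, counts = np.unique(col, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

scikit-learn's sklearn.ensemble.BaggingClassifier packages the same pattern, if a library implementation is preferred.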

22 Bagging ≈ Bayesian Average
Bayesian average: sample hypotheses h1, h2, …, hk from the posterior P(h|D). Bagging: draw bootstrap samples D1, D2, …, Dk from D and train h1, h2, …, hk on them. Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D).

23 Empirical Study of Bagging
Bagging decision trees: draw 50 bootstrap samples from the original training data, learn a decision tree on each bootstrap sample, and predict the class labels of test instances by the majority vote of the 50 trees. The bagged decision tree performs better than a single decision tree.

24 Bias-Variance Tradeoff
Why does bagging work better than a single classifier? The bias-variance tradeoff. Real-valued case: the output y for x follows y ~ f(x) + ε, with ε ~ N(0, σ²), and ŷ(x|D) is a predictor learned from the training data D. The bias-variance decomposition (below) splits the expected error into three parts. Model variance: the simpler ŷ(x|D) is, the smaller the variance. Model bias: the simpler ŷ(x|D) is, the larger the bias. Irreducible variance: σ².
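The decomposition itself did not survive the transcript; a standard form of it, written in the notation above, is:

```latex
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{y}(x \mid D))^2\big]
 = \underbrace{\sigma^2}_{\text{irreducible}}
 + \underbrace{\big(f(x) - \mathbb{E}_D[\hat{y}(x \mid D)]\big)^2}_{\text{model bias}^2}
 + \underbrace{\mathbb{E}_D\big[(\hat{y}(x \mid D) - \mathbb{E}_D[\hat{y}(x \mid D)])^2\big]}_{\text{model variance}}
```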

25 Bias-Variance Tradeoff
Fit with complicated models: small model bias, large model variance (figure: complex fit vs. the true model).

26 Bias-Variance Tradeoff
Fit with simple models: large model bias, small model variance (figure: simple fit vs. the true model).

27 Bagging
Bagging performs better than a single classifier because it effectively reduces the model variance (chart: variance and bias of a single decision tree vs. a bagged decision tree).

