
1 Bayesian Learning Rong Jin

2 Outline
MAP learning vs. ML learning; minimum description length principle; Bayes optimal classifier; bagging

3 Maximum Likelihood Learning (ML)
Find the model that best fits the data by maximizing the log-likelihood of the training data. Logistic regression: the parameters are found by maximizing the likelihood of the training data.
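As a sketch of the formulas that were images on the original slide (the notation θ, w, and labels y_i ∈ {−1, +1} are assumptions, not from the transcript):

```latex
% ML estimate over training data D = {(x_i, y_i)}_{i=1}^n
\hat{\theta}_{ML} = \arg\max_{\theta} \; \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)

% Logistic regression likelihood for labels y_i \in \{-1, +1\}
P(y_i \mid x_i; w) = \frac{1}{1 + \exp(-y_i \, w^\top x_i)}
```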

4 Maximum A Posteriori Learning (MAP)
In ML learning, models are determined solely by the training examples. Very often we have prior knowledge or preferences about the parameters/models, and ML learning is unable to incorporate them. In maximum a posteriori (MAP) learning, knowledge/preference about the parameters/models is incorporated through a prior on the parameters.
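A compact way to write the difference (same assumed notation as above): the prior enters the objective as an additive log term.

```latex
\hat{\theta}_{MAP}
  = \arg\max_{\theta} \; P(\theta \mid D)
  = \arg\max_{\theta} \; \big[ \log P(D \mid \theta) + \log P(\theta) \big]
```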

5 Example: Logistic Regression
ML learning plus prior knowledge/preference: no feature should dominate all other features → prefer small weights. This corresponds to a Gaussian prior on the parameters/models:
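The prior shown as an image on the slide likely has this standard form (σ is an assumed hyperparameter controlling how strongly small weights are preferred):

```latex
P(w) \propto \exp\!\left( -\frac{\lVert w \rVert^2}{2\sigma^2} \right)
\qquad\Longrightarrow\qquad
\log P(w) = -\frac{\lVert w \rVert^2}{2\sigma^2} + \text{const}
```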

7 Example (cont’d)
MAP learning for logistic regression, compared to regularized logistic regression: the log of the Gaussian prior plays the role of the L2 penalty term.
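A minimal numerical sketch of this equivalence (function and parameter names such as lam and lr are assumptions, not from the slides): MAP training with a Gaussian prior is gradient ascent on the L2-penalized log-likelihood, i.e. regularized logistic regression with λ = 1/(2σ²).

```python
import numpy as np

def map_logistic_regression(X, y, lam=0.1, lr=0.1, iters=1000):
    """MAP estimate for logistic regression with a zero-mean Gaussian prior.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    lam corresponds to 1 / (2 * sigma^2) and acts as the L2 regularization weight.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        # gradient of  sum_i log sigmoid(y_i w^T x_i)  -  lam * ||w||^2
        grad = X.T @ (y / (1.0 + np.exp(margins))) - 2.0 * lam * w
        w += lr * grad / n
    return w

# usage (hypothetical data): w = map_logistic_regression(X_train, y_train, lam=0.01)
```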

9 Minimum Description Length Principle
Occam’s razor: prefer the simplest hypothesis. The simplest hypothesis is the one with the shortest description length. The minimum description length (MDL) principle therefore prefers the hypothesis with the shortest total encoding, where L_C(x) is the description length of message x under coding scheme C: the # of bits to encode hypothesis h (complexity of the model) plus the # of bits to encode data D given h (# of mistakes).
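In symbols (reconstructed from the two terms named on the slide):

```latex
h_{MDL} = \arg\min_{h \in H} \;
  \big[ \underbrace{L_{C_1}(h)}_{\text{bits to encode } h}
      + \underbrace{L_{C_2}(D \mid h)}_{\text{bits to encode } D \text{ given } h} \big]
```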

10 Minimum Description Length Principle
Sender/receiver view: to transmit the training labels D, the sender can send D directly, send only h (if h classifies D perfectly), or send h plus the exceptions D|h; MDL picks the cheapest option.

11 Example: Decision Tree
H = decision trees, D = training data labels. L_C1(h) is the # of bits to describe tree h; L_C2(D|h) is the # of bits to describe D given tree h. Note that L_C2(D|h) = 0 if the examples are classified perfectly by h; otherwise only the exceptions need to be described. h_MDL trades off tree size against training errors.
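A worked illustration of the tradeoff (the bit counts below are invented for the example, not taken from the slide):

```latex
% invented bit counts, for illustration only
L_{C_1}(h_1) + L_{C_2}(D \mid h_1) = 120 + 0 = 120 \ \text{bits (large tree, no errors)}
L_{C_1}(h_2) + L_{C_2}(D \mid h_2) = 40 + 8 \times 6 = 88 \ \text{bits (small tree, 8 exceptions)}
```

Even though h_2 makes 8 training errors (each costing 6 bits to flag as an exception), its total description is shorter, so h_MDL prefers the smaller tree.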

12 MAP vs. MDL
MAP learning picks the hypothesis maximizing P(D|h)P(h). Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses −log2 p bits. Interpreting MAP via the MDL principle: −log2 P(h) is the description length of h under the optimal code, and −log2 P(D|h) is the description length of the exceptions under the optimal code.
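Putting the two facts together (a standard derivation, consistent with the terms named on the slide):

```latex
h_{MAP} = \arg\max_h \; P(D \mid h)\,P(h)
        = \arg\min_h \; \big[ -\log_2 P(D \mid h) - \log_2 P(h) \big]
```

Under the optimal codes, the two terms are precisely the description lengths named above, so h_MAP coincides with h_MDL.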

13 Problems with Maximum Approaches
Consider three possible hypotheses. Maximum approaches (ML/MAP) will pick h1. Given a new instance x, maximum approaches will output +. However, is this the most probable result?
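The posterior values on the original slide were images and are not in the transcript; the following numbers are purely illustrative:

```latex
P(h_1 \mid D) = 0.4,\ h_1(x) = {+} \qquad
P(h_2 \mid D) = 0.3,\ h_2(x) = {-} \qquad
P(h_3 \mid D) = 0.3,\ h_3(x) = {-}
```

The maximum approach picks h_1 and outputs +, yet the total posterior mass behind − is 0.6 > 0.4, so − is actually the more probable label.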

14 Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification averages the predictions of all hypotheses, weighted by their posteriors (formula below). In the example above, the most probable class is −.
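The classification rule referred to above (a standard statement of the Bayes optimal classifier):

```latex
c^{*} = \arg\max_{c \in C} \; \sum_{h \in H} P(c \mid x, h)\, P(h \mid D)
```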

16 When Do We Need Bayesian Averaging?
Bayes optimal classification requires the full weighted average over hypotheses. When do we need the Bayesian average? When the posterior has multiple modes, or when the optimal mode is flat. When should we NOT use the Bayesian average? When Pr(h|D) cannot be estimated accurately.

17 Computational Issues with the Bayes Optimal Classifier
Bayes optimal classification needs to sum over all possible models/hypotheses h, which is expensive or impossible when the model/hypothesis space is large (example: decision trees). Solution: sampling!

18 Gibbs Classifier
Gibbs algorithm: choose one hypothesis at random according to P(h|D) and use it to classify the new instance. Surprising fact: this simple scheme can be improved by sampling multiple hypotheses from P(h|D) and averaging their classification results, e.g. via Markov chain Monte Carlo (MCMC) sampling or importance sampling.
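A minimal sketch of the sampling idea (the containers hypotheses and posterior, and the helper name gibbs_predict, are assumptions for illustration):

```python
import numpy as np

def gibbs_predict(x, hypotheses, posterior, rng, n_samples=1):
    """Gibbs-style classification.

    hypotheses: list of callables h(x) -> label; posterior: their P(h|D) weights.
    n_samples=1 is the plain Gibbs classifier; n_samples > 1 averages several
    sampled hypotheses, moving toward the Bayes optimal (Bayesian average) answer.
    """
    idx = rng.choice(len(hypotheses), size=n_samples, p=posterior)
    votes = [hypotheses[i](x) for i in idx]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]          # majority vote over sampled hypotheses

# usage (hypothetical objects):
# rng = np.random.default_rng(0)
# label = gibbs_predict(x_new, hypotheses, posterior, rng, n_samples=25)
```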

19 Bagging Classifiers
In general, sampling from P(h|D) is difficult because P(h|D) itself is difficult to compute. Example: how would we compute P(h|D) for a decision tree? P(h|D) is impossible to compute for a non-probabilistic classifier such as an SVM, and it is extremely small when the hypothesis space is large. Bagging classifiers realize sampling from P(h|D) through sampling of the training examples.

20 Bootstrap Sampling
Bagging = bootstrap aggregating. Bootstrap sampling: given a set D containing m training examples, create D_i by drawing m examples at random with replacement from D. On average, each D_i leaves out about 37% (≈ 1/e ≈ 0.368) of the examples in D (see the sketch below).
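A minimal sketch of one bootstrap draw (function name assumed); the ≈ 0.37 figure comes from the fact that each example is missed with probability (1 − 1/m)^m → 1/e:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw m examples uniformly at random, with replacement, from (X, y)."""
    m = len(X)
    idx = rng.integers(0, m, size=m)               # duplicates allowed
    out_of_bag = np.setdiff1d(np.arange(m), idx)   # on average ~36.8% of D
    return X[idx], y[idx], out_of_bag
```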

21 Bagging Algorithm
Create k bootstrap samples D1, D2, …, Dk. Train a distinct classifier h_i on each D_i. Classify a new instance by a vote of the classifiers with equal weights (see the sketch below).
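A runnable sketch of the three steps (scikit-learn decision trees are an assumption; the slide does not prescribe a base learner, and any other could be substituted):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit_predict(X_train, y_train, X_test, k=50, seed=0):
    """Train k trees on bootstrap samples and classify by equal-weight majority vote."""
    rng = np.random.default_rng(seed)
    m = len(X_train)
    votes = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)                  # bootstrap sample D_i
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))                # h_i's predictions
    votes = np.stack(votes)                               # shape (k, n_test)
    preds = []
    for col in votes.T:                                   # one test instance at a time
        labels, counts = np.unique(col, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

scikit-learn's sklearn.ensemble.BaggingClassifier packages the same pattern, if a library implementation is preferred.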

22 Bagging ≈ Bayesian Average
Bayesian average: sample hypotheses h1, h2, …, hk from the posterior P(h|D). Bagging: draw bootstrap samples D1, D2, …, Dk from D and train h1, h2, …, hk on them. Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D).

23 Empirical Study of Bagging
Bagging decision trees: draw 50 bootstrap samples from the original training data, learn a decision tree on each bootstrap sample, and predict the class labels of test instances by the majority vote of the 50 trees. The bagged decision tree performs better than a single decision tree.

24 Bias-Variance Tradeoff
Why does bagging work better than a single classifier? The bias-variance tradeoff. Real-valued case: the output y for x follows y ~ f(x) + ε, with ε ~ N(0, σ²), and ŷ(x|D) is a predictor learned from the training data D. The bias-variance decomposition (below) splits the expected error into three parts. Model variance: the simpler ŷ(x|D) is, the smaller the variance. Model bias: the simpler ŷ(x|D) is, the larger the bias. Irreducible variance: σ².
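The decomposition itself did not survive the transcript; a standard form of it, written in the notation above, is:

```latex
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{y}(x \mid D))^2\big]
 = \underbrace{\sigma^2}_{\text{irreducible}}
 + \underbrace{\big(f(x) - \mathbb{E}_D[\hat{y}(x \mid D)]\big)^2}_{\text{model bias}^2}
 + \underbrace{\mathbb{E}_D\big[(\hat{y}(x \mid D) - \mathbb{E}_D[\hat{y}(x \mid D)])^2\big]}_{\text{model variance}}
```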

25 Bias-Variance Tradeoff
Fit with complicated models: small model bias, large model variance (figure: complex fit vs. the true model).

26 Bias-Variance Tradeoff
Fit with simple models: large model bias, small model variance (figure: simple fit vs. the true model).

27 Bagging
Bagging performs better than a single classifier because it effectively reduces the model variance (chart: variance and bias of a single decision tree vs. a bagged decision tree).

