 # Bayesian Learning Rong Jin. Outline MAP learning vs. ML learning Minimum description length principle Bayes optimal classifier Bagging.

Bayesian Learning Rong Jin

Outline: MAP learning vs. ML learning; minimum description length principle; Bayes optimal classifier; bagging.

Maximum Likelihood Learning (ML): Find the best model by maximizing the log-likelihood of the training data.
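The slide's formula did not survive in the transcript. A standard way to write the objective, assuming i.i.d. training examples $d_1, \dots, d_n$ and a hypothesis space $H$ (notation introduced here, not taken from the slide), is:

```latex
h_{ML} = \arg\max_{h \in H} \log p(D \mid h)
       = \arg\max_{h \in H} \sum_{i=1}^{n} \log p(d_i \mid h)
```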

Maximum A Posteriori Learning (MAP). ML learning: models are determined by the training data alone, so there is no way to incorporate prior knowledge or preferences about models. Maximum a posteriori (MAP) learning: knowledge or preference is incorporated through a prior over models; the prior encodes that knowledge or preference.
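In symbols, MAP learning can be written as follows. This is the standard textbook form, with $p(h)$ denoting the prior over hypotheses; the notation is assumed here rather than taken from the slide:

```latex
h_{MAP} = \arg\max_{h \in H} p(h \mid D)
        = \arg\max_{h \in H} \frac{p(D \mid h)\, p(h)}{p(D)}
        = \arg\max_{h \in H} \left[ \log p(D \mid h) + \log p(h) \right]
```

With a uniform prior, $\log p(h)$ is constant and the MAP solution coincides with the ML solution.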

MAP with an uninformative prior: regularized logistic regression.
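The formula on this slide was not preserved. One common reading, assumed here, is a zero-mean Gaussian prior on the weights, which turns MAP learning into L2-regularized logistic regression. The sketch below is only an illustration of that reading; the function name `map_logistic_regression` and its parameters are hypothetical.

```python
import numpy as np

def map_logistic_regression(X, y, prior_variance=1.0, lr=0.1, n_iters=500):
    """Gradient ascent on log p(y | X, w) + log p(w).

    Assumes labels y in {0, 1} and a prior p(w) = N(0, prior_variance * I).
    The log-prior contributes the L2 penalty term -w / prior_variance.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))          # predicted P(y = 1 | x, w)
        grad = X.T @ (y - p) - w / prior_variance   # log-likelihood + log-prior gradient
        w += lr * grad / n
    return w

# Toy data: a tighter prior (smaller variance) shrinks the weights toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

w_weak_prior = map_logistic_regression(X, y, prior_variance=100.0)
w_strong_prior = map_logistic_regression(X, y, prior_variance=0.01)
```

A very large `prior_variance` makes the prior nearly flat, so the result approaches the ML estimate.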

MAP: Consider text categorization, where $w_i$ is the importance of the i-th word in classification. Prior knowledge: the more common the word, the less important it is. How can a prior be constructed from this knowledge?

MAP: An informative prior for text categorization, defined in terms of the occurrence of the i-th word in the training data.

MAP: Two correlated classification tasks, $C_1$ and $C_2$. How can an appropriate prior be introduced to capture this prior knowledge?

MAP: Construct priors that capture the dependence between $w_1$ and $w_2$.

Minimum Description Length (MDL) Principle. Occam's razor: prefer a simple hypothesis. A simple hypothesis corresponds to a short description length. Minimum description length: $L_C(x)$ is the description length of message $x$ under coding scheme $C$. The total cost is the number of bits for encoding hypothesis $h$ plus the number of bits for encoding the data given $h$.
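The MDL formula itself is missing from the transcript. Using the code lengths $L_{C_1}(h)$ and $L_{C_2}(D \mid h)$ that appear in the decision tree example, the usual statement of the principle is:

```latex
h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]
```

The first term is the cost of encoding the hypothesis and the second is the cost of encoding the data given that hypothesis.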

MDL: A sender must transmit the data D to a receiver. Send only D? Send only h? Or send h plus D given h?

Example: Decision Tree. $H$ = decision trees, $D$ = training data labels. $L_{C_1}(h)$ is the number of bits to describe tree $h$. $L_{C_2}(D \mid h)$ is the number of bits to describe $D$ given tree $h$; $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by $h$, so only the exceptions need to be described. $h_{MDL}$ trades off tree size against training errors.

MAP vs. MDL: MAP learning compared with MDL learning.
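The two formulas being compared are not preserved in the transcript. The standard connection, sketched here with assumed notation, comes from taking negative base-2 logarithms of the MAP objective:

```latex
\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} \; p(D \mid h)\, p(h)
         = \arg\min_{h \in H} \left[ -\log_2 p(h) - \log_2 p(D \mid h) \right] \\
h_{MDL} &= \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]
\end{aligned}
```

If the codes are optimal for the respective distributions, so that $L_{C_1}(h) = -\log_2 p(h)$ and $L_{C_2}(D \mid h) = -\log_2 p(D \mid h)$, the two criteria select the same hypothesis.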

Problems with Maximum Approaches. Consider three possible hypotheses $h_1$, $h_2$, $h_3$. Maximum approaches will pick $h_1$. Given a new instance $x$, maximum approaches will output +. However, is this the most probable result?

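Bayes Optimal Classifier (Bayesian Average). Bayes optimal classification weights the prediction of every hypothesis by its posterior probability: $c^{*} = \arg\max_{c} \sum_{h \in H} p(c \mid h)\, p(h \mid D)$. Example: the most probable class is -.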
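The numbers from the slide's example are not preserved. The sketch below uses illustrative posteriors (0.4, 0.3, 0.3) and predictions (+, -, -), which are assumptions chosen to match the slide's conclusions: the maximum approach picks $h_1$ and outputs +, while the Bayesian average outputs -. The function name `bayes_optimal_class` is hypothetical.

```python
def bayes_optimal_class(posteriors, predictions):
    """Return the class maximizing sum_h p(class | h) * p(h | D), plus all class scores."""
    scores = {}
    for h, p_h in posteriors.items():
        for label, p_label in predictions[h].items():
            scores[label] = scores.get(label, 0.0) + p_label * p_h
    return max(scores, key=scores.get), scores

# Assumed illustrative values, not taken from the slide.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # p(h | D)
predictions = {                                          # p(class | h) for the new instance x
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

map_hypothesis = max(posteriors, key=posteriors.get)     # the single most probable hypothesis
map_label = max(predictions[map_hypothesis], key=predictions[map_hypothesis].get)
bayes_label, scores = bayes_optimal_class(posteriors, predictions)
```

Under these assumed values the single best hypothesis votes +, but the hypotheses voting - together hold more posterior mass.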

Computational Issues: Bayes optimal classification needs to sum over all possible hypotheses. This is expensive or impossible when the hypothesis space is large, e.g., decision trees. Solution: sampling!

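Gibbs Classifier. Gibbs algorithm: (1) choose one hypothesis at random, according to $p(h \mid D)$; (2) use this hypothesis to classify the new instance. Surprising fact: under certain conditions, the expected error of the Gibbs classifier is at most twice that of the Bayes optimal classifier. Improvement: sample multiple hypotheses from $p(h \mid D)$ and average their classification results.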
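A minimal sketch of the Gibbs algorithm and its averaged variant, assuming the posterior over a small finite hypothesis set is available as a dictionary. The posterior values, the hypothesis names, and both function names are hypothetical illustrations.

```python
import random

def gibbs_classify(posterior, classifiers, x, rng):
    """Draw one hypothesis according to p(h | D) and use it to classify x."""
    names = list(posterior)
    weights = [posterior[name] for name in names]
    h = rng.choices(names, weights=weights, k=1)[0]
    return classifiers[h](x)

def averaged_gibbs_classify(posterior, classifiers, x, n_samples, rng):
    """Draw several hypotheses from p(h | D) and return the majority label."""
    counts = {}
    for _ in range(n_samples):
        label = gibbs_classify(posterior, classifiers, x, rng)
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

# Assumed illustrative posterior and hypotheses.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
classifiers = {"h1": lambda x: "+", "h2": lambda x: "-", "h3": lambda x: "-"}

rng = random.Random(0)
single = gibbs_classify(posterior, classifiers, x=None, rng=rng)
averaged = averaged_gibbs_classify(posterior, classifiers, x=None, n_samples=1001, rng=rng)
```

A single draw can return either label, whereas the majority over many draws approaches the Bayes optimal answer for this toy posterior.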

Bagging Classifiers. In general, sampling from $p(h \mid D)$ is difficult: $p(h \mid D)$ is difficult to compute, and it is impossible to compute for non-probabilistic classifiers such as SVMs. Bagging classifiers: approximate sampling from $p(h \mid D)$ by sampling training examples.

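Bootstrap Sampling. Bagging = bootstrap aggregating. Bootstrap sampling: given a set $D$ containing $m$ training examples, create $D_i$ by drawing $m$ examples at random with replacement from $D$. Each $D_i$ is expected to leave out about 0.37 of the examples in $D$, since the probability that a given example is never drawn is $(1 - 1/m)^m \approx e^{-1} \approx 0.37$.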
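The 0.37 figure can be checked empirically. The sketch below draws one bootstrap sample of indices and measures the fraction of original examples that never appear in it; the function names are hypothetical.

```python
import numpy as np

def bootstrap_indices(m, rng):
    """Draw m indices uniformly at random with replacement from range(m)."""
    return rng.integers(0, m, size=m)

def left_out_fraction(m, rng):
    """Fraction of the m original examples that do not appear in one bootstrap sample."""
    idx = bootstrap_indices(m, rng)
    return 1.0 - np.unique(idx).size / m

rng = np.random.default_rng(0)
m = 10_000
fraction = left_out_fraction(m, rng)
expected = (1.0 - 1.0 / m) ** m      # approaches exp(-1), roughly 0.368, as m grows
```

For large `m`, `fraction` should land close to `expected`, with small random fluctuation from sample to sample.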

Bagging Algorithm: Create $k$ bootstrap samples $D_1, D_2, \ldots, D_k$. Train a distinct classifier $h_i$ on each $D_i$. Classify a new instance by a classifier vote with equal weights.
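The three steps can be sketched as follows. The base learner here is a simple decision stump chosen only to keep the example self-contained; it is an assumption, not the learner used in the slides, and all function names are hypothetical. Labels are assumed to be in {-1, +1}.

```python
import numpy as np

def train_stump(X, y):
    """Fit a one-feature threshold classifier by exhaustive search over splits."""
    best = None  # (training error, feature index, threshold, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > t, sign, -sign)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda Z: np.where(Z[:, j] > t, sign, -sign)

def bagging_fit(X, y, k, rng, train=train_stump):
    """Train k classifiers, each on its own bootstrap sample D_i of (X, y)."""
    m = len(y)
    classifiers = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)        # draw m examples with replacement
        classifiers.append(train(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    """Equal-weight vote; with an odd number of classifiers there are no ties."""
    votes = np.sum([h(X) for h in classifiers], axis=0)
    return np.where(votes >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 1, -1)                # toy labels determined by the first feature
ensemble = bagging_fit(X, y, k=11, rng=rng)
accuracy = np.mean(bagging_predict(ensemble, X) == y)
```

Any other base learner can be passed through the `train` argument, as long as it returns a callable that maps a feature matrix to labels in {-1, +1}.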

Bagging ≈ Bayesian Average. Bayesian average: sample hypotheses $h_1, h_2, \ldots, h_k$ from $P(h \mid D)$. Bagging: draw bootstrap samples $D_1, D_2, \ldots, D_k$ from $D$ and train $h_1, h_2, \ldots, h_k$ on them. Bootstrap sampling is almost equivalent to sampling from the posterior $P(h \mid D)$.

Empirical Study of Bagging: Bagging decision trees. Bootstrap 50 different samples from the original training data. Learn a decision tree over each bootstrap sample. Predict the class labels for test instances by the majority vote of the 50 decision trees. The bagged decision trees outperform a single decision tree.

Why does bagging work better than a single classifier? Real-valued case: $y = f(x) + \epsilon$, $\epsilon \sim N(0, \sigma)$, and $\hat{f}(x \mid D)$ is a predictor learned from training data $D$. Bias-variance tradeoff: the expected error splits into an irreducible variance, a model bias, and a model variance. Model bias: the simpler $\hat{f}(x \mid D)$, the larger the bias. Model variance: the simpler $\hat{f}(x \mid D)$, the smaller the variance.
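The decomposition itself is missing from the transcript. A standard form is given below, writing $\hat{f}(x \mid D)$ for the learned predictor (a symbol chosen here) and assuming the noise $\epsilon$ has zero mean, variance $\sigma^2$, and is independent of $D$:

```latex
\mathbb{E}_{D,\epsilon}\!\left[ \left( y - \hat{f}(x \mid D) \right)^2 \right]
  = \underbrace{\sigma^2}_{\text{irreducible variance}}
  + \underbrace{\left( f(x) - \mathbb{E}_D\!\left[ \hat{f}(x \mid D) \right] \right)^2}_{\text{model bias (squared)}}
  + \underbrace{\mathbb{E}_D\!\left[ \left( \hat{f}(x \mid D) - \mathbb{E}_D\!\left[ \hat{f}(x \mid D) \right] \right)^2 \right]}_{\text{model variance}}
```

Averaging predictors trained on different samples leaves the bias term roughly unchanged while shrinking the variance term.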

Bagging: Bagging performs better than a single classifier because it effectively reduces the model variance. (Figure: bias and variance of a single decision tree vs. bagged decision trees.)
