
1 End of Chapter 8 Neil Weisenfeld March 28, 2005

2 Outline
8.6 MCMC for Sampling from the Posterior
8.7 Bagging
8.7.1 Examples: Trees with Simulated Data
8.8 Model Averaging and Stacking
8.9 Stochastic Search: Bumping

3 MCMC for Sampling from the Posterior
Markov chain Monte Carlo (MCMC) methods estimate the parameters of a Bayesian model by sampling from the posterior distribution.
Gibbs sampling, a form of MCMC, is closely related to EM, except that it samples from the conditional distributions rather than maximizing over them.

4 Gibbs Sampling
We wish to draw a sample from the joint distribution of random variables $U_1, U_2, \ldots, U_K$.
If this is difficult, but it is easy to simulate from the conditional distributions $\Pr(U_k \mid U_1, \ldots, U_{k-1}, U_{k+1}, \ldots, U_K)$, the Gibbs sampler simulates from each of these in turn.
The process produces a Markov chain whose stationary distribution is the desired joint distribution.

5 Algorithm 8.3: Gibbs Sampler
1. Take some initial values $U_k^{(0)}$, $k = 1, 2, \ldots, K$.
2. Repeat for $t = 1, 2, \ldots$: for $k = 1, 2, \ldots, K$, generate $U_k^{(t)}$ from $\Pr\big(U_k^{(t)} \mid U_1^{(t)}, \ldots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \ldots, U_K^{(t-1)}\big)$.
3. Continue step 2 until the joint distribution of $(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)})$ does not change.
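As an illustration (not from the original slides), here is a minimal Python sketch of Algorithm 8.3 for a toy case: sampling a bivariate normal with correlation rho, where both conditional distributions are univariate normals. The function name and all parameter values are illustrative.

```python
# Minimal Gibbs sampler sketch: bivariate normal (zero means, unit variances,
# correlation rho), drawing each coordinate from its conditional in turn.
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    u1, u2 = 0.0, 0.0                      # step 1: initial values U_k^(0)
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):                # step 2: repeat for t = 1, 2, ...
        # U_1 | U_2 ~ N(rho * U_2, 1 - rho^2)
        u1 = rng.normal(rho * u2, np.sqrt(1 - rho**2))
        # U_2 | U_1 ~ N(rho * U_1, 1 - rho^2)
        u2 = rng.normal(rho * u1, np.sqrt(1 - rho**2))
        samples[t] = (u1, u2)
    return samples                         # step 3: in practice, monitor convergence

draws = gibbs_bivariate_normal()
print(np.corrcoef(draws[1000:].T))         # close to rho after burn-in
```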

6 Gibbs Sampling
We only need to be able to sample from the conditional distributions. But if the explicit form of the conditional density $\Pr(U_k \mid U_\ell,\ \ell \neq k)$ is known, then a better estimate of the marginal density of $U_k$ is
$\widehat{\Pr}_{U_k}(u) = \frac{1}{M - m} \sum_{t = m+1}^{M} \Pr\big(u \mid U_\ell^{(t)},\ \ell \neq k\big)$.
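A sketch of this improved marginal estimate, continuing the toy bivariate-normal example above (it assumes the `draws` array and `rho` from that sketch): average the known conditional density over the post-burn-in Gibbs draws instead of histogramming the samples.

```python
# Improved marginal-density estimate for U_1: average the conditional density
# N(rho * U_2^(t), 1 - rho^2) over the retained Gibbs iterations.
import numpy as np
from scipy.stats import norm

rho, burn_in = 0.8, 1000
u2_draws = draws[burn_in:, 1]              # U_2^(t) from the Gibbs run above
grid = np.linspace(-3, 3, 61)
cond_sd = np.sqrt(1 - rho**2)

# density estimate at each grid point: mean over t of the conditional pdf
dens = norm.pdf(grid[:, None], rho * u2_draws[None, :], cond_sd).mean(axis=1)
print(dens.sum() * (grid[1] - grid[0]))    # integrates to roughly 1
```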

7 Gibbs sampling for mixtures
Consider latent data from EM procedure to be another parameter: See algorithm (next slide), same as EM except sample instead of maximize Additional steps can be added to include other informative priors

8 Algorithm 8.4: Gibbs sampling for mixtures
1. Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$.
2. Repeat for $t = 1, 2, \ldots$
   (a) For $i = 1, 2, \ldots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with $\Pr(\Delta_i^{(t)} = 1) = \hat{\gamma}_i(\theta^{(t)})$.
   (b) Set $\hat{\mu}_1 = \frac{\sum_i (1 - \Delta_i^{(t)})\, y_i}{\sum_i (1 - \Delta_i^{(t)})}$ and $\hat{\mu}_2 = \frac{\sum_i \Delta_i^{(t)}\, y_i}{\sum_i \Delta_i^{(t)}}$, and generate $\mu_1^{(t)} \sim N(\hat{\mu}_1, \hat{\sigma}_1^2)$ and $\mu_2^{(t)} \sim N(\hat{\mu}_2, \hat{\sigma}_2^2)$.
3. Continue step 2 until the joint distribution of $(\boldsymbol{\Delta}^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ does not change.
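A minimal sketch of Algorithm 8.4 for the simplified two-component case with the variances and mixing proportion held fixed. It assumes a flat prior on each mean, so the conditional for $\mu_j$ is Normal with the component sample mean and variance $\sigma_j^2 / n_j$; the data and parameter values are illustrative.

```python
# Gibbs sampling for a two-component Gaussian mixture, variances and mixing
# proportion fixed (flat prior on the means assumed for illustration).
import numpy as np
from scipy.stats import norm

def gibbs_mixture(y, pi=0.5, sigma1=1.0, sigma2=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    mu1, mu2 = float(np.min(y)), float(np.max(y))   # step 1: initial values
    trace = np.empty((n_iter, 2))
    for t in range(n_iter):                         # step 2
        # (a) sample latent indicators Delta_i given the current means
        p1 = (1 - pi) * norm.pdf(y, mu1, sigma1)
        p2 = pi * norm.pdf(y, mu2, sigma2)
        gamma = p2 / (p1 + p2)                      # responsibility of component 2
        delta = rng.random(y.size) < gamma          # Delta_i = 1 -> component 2
        # (b) sample the means given the indicators
        n1, n2 = int((~delta).sum()), int(delta.sum())
        if n1 > 0:   # keep the previous mean if a component is empty this round
            mu1 = rng.normal(y[~delta].mean(), sigma1 / np.sqrt(n1))
        if n2 > 0:
            mu2 = rng.normal(y[delta].mean(), sigma2 / np.sqrt(n2))
        trace[t] = (mu1, mu2)
    return trace                                    # step 3: inspect for convergence

# illustrative data: two well-separated Gaussian clusters
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(4.0, 1.0, 50)])
print(gibbs_mixture(y)[500:].mean(axis=0))          # approximate posterior means
```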

9 Figure 8.8: Gibbs Sampling from Mixtures
Simplified case with fixed variances and mixing proportion

10 Outline
8.6 MCMC for Sampling from the Posterior
8.7 Bagging
8.7.1 Examples: Trees with Simulated Data
8.8 Model Averaging and Stacking
8.9 Stochastic Search: Bumping

11 8.7 Bagging
Uses the bootstrap to improve the estimate itself.
The bootstrap mean is approximately a posterior average.
Consider a regression problem: fit a model to training data $Z$, obtaining the prediction $\hat{f}(x)$. Bagging averages the estimates over bootstrap samples to produce
$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$.
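A minimal sketch of the bagging estimate: fit the same base learner to $B$ bootstrap samples and average its predictions. The choice of a regression tree from scikit-learn as the base learner is illustrative.

```python
# Bagging a regression estimator: average f*b(x) over B bootstrap fits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_test, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.zeros((B, X_test.shape[0]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)             # bootstrap sample Z*b
        model = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = model.predict(X_test)             # f*b(x) at the test points
    return preds.mean(axis=0)                        # average over the B fits
```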

12 Bagging, cont'd
The point is to reduce the variance of the estimate while leaving the bias unchanged.
This is a Monte Carlo estimate of the "true" bagging estimate, approaching it as $B \to \infty$.
The bagged estimate differs from the original estimate only when the latter is an adaptive or non-linear function of the data.

13 Bagging B-Spline Example
Bagging would average the curves in the lower left-hand corner at each x value.

14 Quick Tree Intro
(Figure panels:) a partition that recursive binary splitting cannot produce; a recursive binary partition; the corresponding tree; the resulting prediction surface $\hat{f}$.

15 Spam Example

16 Bagging Trees
Each bootstrap run produces a different tree, and each tree may have different terminal nodes.
The bagged estimate is the average prediction at $x$ from the $B$ trees.
The prediction can be a 0/1 indicator function for each class, in which case bagging gives the proportion $p_k$ of trees predicting class $k$ at $x$.
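A sketch of this voting view of bagged classification trees: each bootstrap tree casts a 0/1 vote per class at each test point, and the bagged estimate is the proportion $p_k$ of trees voting for class $k$. The scikit-learn classifier is an illustrative choice.

```python
# Bagged classification trees: return the proportion p_k of trees voting
# for each class at each test point.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_class_proportions(X, y, X_test, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    classes = np.unique(y)
    votes = np.zeros((X_test.shape[0], classes.size))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        pred = tree.predict(X_test)                      # one vote per test point
        for k, c in enumerate(classes):
            votes[pred == c, k] += 1
    return votes / B   # row m holds p_k at test point m; argmax gives the bagged class
```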

17 8.7.1: Example Trees with Simulated Data
Original tree and five bootstrap-grown trees.
Two classes, five features, each with a Gaussian distribution; $Y$ generated from the features with Bayes error 0.2.
Trees fit to 200 bootstrap samples.

18 Example Performance
There is high variance among the trees because the features have pairwise correlation 0.95.
Bagging successfully smooths out this variance and reduces the test error.

19 Where Bagging Doesn't Help
The classifier is a single axis-oriented split, chosen along either $x_1$ or $x_2$ to minimize training error.
Boosting is shown on the right for comparison.

20 Outline
8.6 MCMC for Sampling from the Posterior
8.7 Bagging
8.7.1 Examples: Trees with Simulated Data
8.8 Model Averaging and Stacking
8.9 Stochastic Search: Bumping

21 Model Averaging and Stacking
A more general Bayesian model averaging: given candidate models $\mathcal{M}_m$, $m = 1, \ldots, M$, a training set $Z$, and a quantity of interest $\zeta$, the Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model:
$\mathrm{E}(\zeta \mid Z) = \sum_{m=1}^{M} \mathrm{E}(\zeta \mid \mathcal{M}_m, Z)\, \Pr(\mathcal{M}_m \mid Z)$.

22 Other Averaging Strategies
Simple unweighted average of the predictions (each model considered equally likely).
BIC: use it to estimate the posterior model probabilities, weighting each model by how well it fits and how many parameters it uses.
Full Bayesian strategy: $\Pr(\mathcal{M}_m \mid Z) \propto \Pr(\mathcal{M}_m) \int \Pr(Z \mid \theta_m, \mathcal{M}_m)\, \Pr(\theta_m \mid \mathcal{M}_m)\, d\theta_m$.
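A sketch of the BIC-based weights: with equal prior model probabilities, the posterior probability of model $m$ is approximately proportional to $e^{-\mathrm{BIC}_m / 2}$. The BIC values below are made up for illustration.

```python
# BIC-based model weights: proportional to exp(-BIC_m / 2) under equal priors.
import numpy as np

def bic_weights(bic_values):
    bic = np.asarray(bic_values, dtype=float)
    # subtract the minimum for numerical stability before exponentiating
    w = np.exp(-0.5 * (bic - bic.min()))
    return w / w.sum()

print(bic_weights([102.3, 100.1, 110.7]))   # three illustrative candidate models
```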

23 Frequentist Viewpoint of Averaging
Given predictions $\hat{f}_1(x), \ldots, \hat{f}_M(x)$ from $M$ models, under squared-error loss we seek the optimal weights $w$:
$\hat{w} = \operatorname*{argmin}_{w} \mathrm{E}_{\mathcal{P}} \Big[ Y - \sum_{m=1}^{M} w_m \hat{f}_m(x) \Big]^2$.
The input $x$ is fixed and the $N$ observations in $Z$ (and the target $Y$) are distributed according to $\mathcal{P}$.
The solution is the population linear regression of $Y$ on the vector of model predictions $\hat{F}(x) = [\hat{f}_1(x), \ldots, \hat{f}_M(x)]^T$:
$\hat{w} = \mathrm{E}_{\mathcal{P}}\big[\hat{F}(x)\hat{F}(x)^T\big]^{-1} \mathrm{E}_{\mathcal{P}}\big[\hat{F}(x)\, Y\big]$.
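A sketch of the sample analogue of this solution: with $\hat{F}$ stored as an $N \times M$ matrix of model predictions, the weights solve the normal equations, i.e. a linear regression of $y$ on the predictions (the next slides note the caveats of doing this on the training set).

```python
# Sample analogue of the optimal averaging weights: w = (F^T F)^{-1} F^T y.
import numpy as np

def averaging_weights(F, y):
    # F: N x M array, column m holds model m's predictions; y: length-N response
    return np.linalg.solve(F.T @ F, F.T @ y)   # combined prediction is F @ w
```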

24 Notes on the Frequentist Viewpoint
At the population level, adding models with arbitrary weights can only help. But the population is, of course, not available.
Regression over the training set can be used instead, but this may not be ideal: model complexity is not taken into account.

25 Stacked Generalization, Stacking
Cross-validated predictions avoid giving unfairly high weight to models with high complexity.
If $w$ is restricted to vectors with one unit weight and the rest zero, stacking reduces to model choice: it selects the model with the smallest leave-one-out cross-validation error.
In practice we use the combined models with the optimal weights: better prediction, but less interpretability.
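A sketch of stacking with leave-one-out predictions: build the $N \times M$ matrix of predictions $\hat{f}_m^{-i}(x_i)$, each obtained by refitting model $m$ without observation $i$, then regress $y$ on that matrix. The list of model factories with scikit-learn-style fit/predict is an assumed interface.

```python
# Stacking weights from leave-one-out cross-validated predictions.
import numpy as np

def stacking_weights(model_factories, X, y):
    n, M = X.shape[0], len(model_factories)
    loo_preds = np.empty((n, M))
    for i in range(n):
        mask = np.arange(n) != i                    # drop observation i
        for m, make_model in enumerate(model_factories):
            fit = make_model().fit(X[mask], y[mask])
            loo_preds[i, m] = fit.predict(X[i:i + 1])[0]
    w, *_ = np.linalg.lstsq(loo_preds, y, rcond=None)
    return w
# final stacked prediction at x: sum_m w_m * f_m(x), each f_m refit on all the data
```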

26 Outline
8.6 MCMC for Sampling from the Posterior
8.7 Bagging
8.7.1 Examples: Trees with Simulated Data
8.8 Model Averaging and Stacking
8.9 Stochastic Search: Bumping

27 Stochastic Search: Bumping
Rather than averaging models, bumping tries to find a better single model. It is good for avoiding local minima in the fitting method.
As in bagging, draw bootstrap samples and fit the model to each, but then choose the model that best fits the original training data.

28 Stochastic Search: Bumping
Given $B$ bootstrap samples $Z^{*1}, \ldots, Z^{*B}$, fitting the model to each yields predictions $\hat{f}^{*b}(x)$, $b = 1, \ldots, B$.
For squared-error loss, choose the model from the bootstrap sample
$\hat{b} = \operatorname*{argmin}_{b} \sum_{i=1}^{N} \big[ y_i - \hat{f}^{*b}(x_i) \big]^2$.
Bumping tries to move around the model space by perturbing the data.
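A minimal sketch of bumping: fit the model to $B$ bootstrap samples (plus the original data, so that the original fit can also be chosen), and keep the single fit with the smallest squared error on the original training data. The `make_model` callable with scikit-learn-style fit/predict is an assumed interface.

```python
# Bumping: keep the bootstrap fit with the smallest error on the ORIGINAL data.
import numpy as np

def bump(make_model, X, y, B=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    best_fit, best_err = None, np.inf
    for b in range(B + 1):
        # b == 0 uses the original sample; otherwise draw a bootstrap sample
        idx = np.arange(n) if b == 0 else rng.integers(0, n, size=n)
        fit = make_model().fit(X[idx], y[idx])
        err = np.mean((y - fit.predict(X)) ** 2)   # squared error on original data
        if err < best_err:
            best_fit, best_err = fit, err
    return best_fit
```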

29 A contrived case where bumping helps
The greedy tree-based algorithm tries to split on each dimension separately, first one and then the other; bumping stumbles upon the right answer.

