End of Chapter 8. Neil Weisenfeld, March 28, 2005.

Outline:
8.6 MCMC for Sampling from the Posterior
8.7 Bagging
8.7.1 Example: Trees with Simulated Data
8.8 Model Averaging and Stacking
8.9 Stochastic Search: Bumping

MCMC for Sampling from the Posterior: Markov chain Monte Carlo (MCMC) methods estimate the parameters of a Bayesian model by sampling from their posterior distribution. Gibbs sampling, one form of MCMC, is analogous to EM except that it samples from the conditional distributions rather than maximizing over them.

Gibbs Sampling: We wish to draw a sample from the joint distribution of random variables U_1, U_2, ..., U_K. If this is difficult, but it is easy to simulate from the conditional distributions Pr(U_k | U_1, ..., U_{k-1}, U_{k+1}, ..., U_K), the Gibbs sampler simulates from each of these in turn. The process produces a Markov chain whose stationary distribution equals the desired joint distribution.

Algorithm 8.3: Gibbs Sampler
1. Take some initial values U_k^(0), k = 1, 2, ..., K.
2. Repeat for t = 1, 2, ...: for k = 1, 2, ..., K, generate U_k^(t) from Pr(U_k^(t) | U_1^(t), ..., U_{k-1}^(t), U_{k+1}^(t-1), ..., U_K^(t-1)).
3. Continue step 2 until the joint distribution of (U_1^(t), U_2^(t), ..., U_K^(t)) does not change.
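Below is a minimal, self-contained Python sketch of Algorithm 8.3 (not from the slides). The target is a toy standard bivariate normal with correlation rho, whose two full conditionals are univariate normals, so each step of the algorithm is a single draw; the function name and defaults are my own.

```python
# A minimal sketch of a Gibbs sampler (Algorithm 8.3) on a toy target:
# a standard bivariate normal with correlation rho, whose full conditionals
# U1 | U2 ~ N(rho*U2, 1-rho^2) and U2 | U1 ~ N(rho*U1, 1-rho^2) are easy to simulate.
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, burn_in=500, seed=0):
    rng = np.random.default_rng(seed)
    u1, u2 = 0.0, 0.0                       # initial values U_1^(0), U_2^(0)
    samples = []
    for t in range(n_iter):
        # U_1^(t) from Pr(U_1 | U_2 = u2)
        u1 = rng.normal(rho * u2, np.sqrt(1 - rho**2))
        # U_2^(t) from Pr(U_2 | U_1 = u1)
        u2 = rng.normal(rho * u1, np.sqrt(1 - rho**2))
        samples.append((u1, u2))
    return np.array(samples[burn_in:])      # discard the burn-in draws

draws = gibbs_bivariate_normal()
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])  # roughly (0, 0) and rho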

Gibbs Sampling (cont'd): We only need to be able to sample from the conditional distributions. If, however, the conditional density Pr(U_k | U_l, l ≠ k) is known in closed form, then the average of the conditional densities over the later iterations, (1/(M − m)) Σ_{t=m+1}^{M} Pr(u | U_l^(t), l ≠ k), is a better estimate of the marginal density of U_k than one built directly from the samples U_k^(t).

Gibbs sampling for mixtures: Treat the latent data Δ_i from the EM procedure as additional parameters and sample them along with the component means. The resulting algorithm (next slide) is the same as EM except that we sample from the conditional distributions instead of maximizing. Additional steps can be added to incorporate other informative priors.

Algorithm 8.4: Gibbs sampling for mixtures
1. Take some initial values θ^(0) = (μ_1^(0), μ_2^(0)).
2. Repeat for t = 1, 2, ...:
   (a) For i = 1, 2, ..., N generate Δ_i^(t) ∈ {0, 1} with Pr(Δ_i^(t) = 1) equal to the responsibility γ_i evaluated at the current parameters.
   (b) Set m_1 = Σ_i (1 − Δ_i^(t)) y_i / Σ_i (1 − Δ_i^(t)) and m_2 = Σ_i Δ_i^(t) y_i / Σ_i Δ_i^(t), and generate μ_1^(t) ~ N(m_1, σ_1²) and μ_2^(t) ~ N(m_2, σ_2²).
3. Continue step 2 until the joint distribution of (Δ^(t), μ_1^(t), μ_2^(t)) doesn't change.
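The following sketch (not from the slides) implements this sampler for the simplified two-component setting of Figure 8.8, with the variances and the mixing proportion held fixed, so only the latent indicators Δ_i and the two means are sampled. The means are drawn from their full conditionals under a flat prior, N(component mean, σ_k²/n_k), which is my own choice; the function name and defaults are also my own.

```python
# A sketch of Gibbs sampling for a two-component Gaussian mixture with fixed
# variances sigma1, sigma2 and fixed mixing proportion pi (cf. Figure 8.8).
# Only the latent indicators Delta_i and the means mu1, mu2 are sampled; the
# means use their flat-prior conditionals N(component mean, sigma_k^2 / n_k).
import numpy as np
from scipy.stats import norm

def gibbs_mixture(y, pi=0.5, sigma1=1.0, sigma2=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    mu1, mu2 = y.min(), y.max()             # crude initial values theta^(0)
    trace = []
    for t in range(n_iter):
        # (a) responsibilities gamma_i = Pr(Delta_i = 1 | y_i, mu1, mu2),
        #     then sample each latent indicator as Bernoulli(gamma_i)
        p2 = pi * norm.pdf(y, mu2, sigma2)
        p1 = (1 - pi) * norm.pdf(y, mu1, sigma1)
        gamma = p2 / (p1 + p2)
        delta = rng.binomial(1, gamma)
        # (b) sample mu1, mu2 given the current assignment of points
        n1, n2 = int((delta == 0).sum()), int(delta.sum())
        if n1 > 0:
            mu1 = rng.normal(y[delta == 0].mean(), sigma1 / np.sqrt(n1))
        if n2 > 0:
            mu2 = rng.normal(y[delta == 1].mean(), sigma2 / np.sqrt(n2))
        trace.append((mu1, mu2))
    return np.array(trace)                  # discard an initial burn-in before use
```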

Figure 8.8: Gibbs Sampling from Mixtures Simplified case with fixed variances and mixing proportion

8.7 Bagging: Use the bootstrap to improve the estimate itself; the bootstrap mean is approximately a posterior average. Consider a regression problem: fit a model to training data Z = {(x_1, y_1), ..., (x_N, y_N)} to obtain a prediction f(x) at input x. Bagging fits the model to each bootstrap sample Z*b, b = 1, ..., B, and averages the resulting estimates to produce f_bag(x) = (1/B) Σ_{b=1}^{B} f*b(x).

Bagging, cont'd: The point is to reduce the variance of the estimate while leaving its bias unchanged. The average above is a Monte Carlo estimate of the "true" bagging estimate (the expectation of the bootstrap fit under the empirical distribution), approaching it as B → ∞. The bagged estimate differs from the original estimate only when the latter is an adaptive or nonlinear function of the data.
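A minimal sketch of the bagging average in Python (not from the slides), using scikit-learn's DecisionTreeRegressor as a stand-in for an adaptive, nonlinear fit; the function name is my own.

```python
# Bagging for regression: average the fits f*b from B bootstrap samples.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_test, B=200, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)            # bootstrap sample Z*b (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)             # f*b evaluated at the test points
    return preds.mean(axis=0)                       # Monte Carlo average over the B fits
```

Because a regression tree is a highly adaptive, nonlinear function of the data, the bagged average can differ substantially from a single tree's fit; for a linear estimator the two would coincide.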

Bagging B-spline example: bagging would average the curves shown in the lower left-hand panel at each x value.

Quick tree intro (figure): a partition of the feature space that recursive binary splitting cannot produce; a recursive binary subdivision; the corresponding tree; and the resulting piecewise-constant prediction surface f-hat.

Spam Example

Bagging Trees: Each bootstrap run produces a different tree, and each tree may have different terminal nodes. The bagged estimate is the average prediction at x from the B trees. The prediction can be a 0/1 indicator function, in which case bagging gives p_k(x), the proportion of trees predicting class k at x, and the bagged classifier picks the class with the most votes.
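A sketch of bagging for classification (my own illustration, assuming scikit-learn): each bootstrap tree casts a 0/1 vote per class, the votes are averaged to give the proportions p_k(x), and the bagged classifier picks the class with the most votes.

```python
# Bagging trees for classification: average 0/1 indicator predictions over B trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_class_proportions(X, y, X_test, B=200, seed=0):
    rng = np.random.default_rng(seed)
    N, classes = len(y), np.unique(y)
    votes = np.zeros((len(X_test), len(classes)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)                    # bootstrap sample Z*b
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        pred = tree.predict(X_test)
        for k, c in enumerate(classes):
            votes[:, k] += (pred == c)                      # 0/1 indicator for class c
    p_k = votes / B                                         # proportion of trees voting class k
    return classes[p_k.argmax(axis=1)], p_k                 # majority vote and proportions
```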

8.7.1 Example: Trees with Simulated Data. The figure shows the original tree and five trees grown on bootstrap samples. There are two classes and five features, each with a standard Gaussian distribution; Y is generated from the first feature so that the Bayes error is 0.2. Trees were fit to 200 bootstrap samples.

Example performance: There is high variance among the trees because the features have pairwise correlation 0.95. Bagging successfully smooths out this variance and thereby reduces the test error.

Where Bagging Doesn't Help: The classifier is a single axis-oriented split, chosen along either x_1 or x_2 to minimize training error. Boosting is shown in the right-hand panel.

Model Averaging and Stacking: More generally, Bayesian model averaging works as follows. Given candidate models M_m, m = 1, ..., M, a training set Z, and a quantity of interest ζ (for example, a prediction f(x)), the posterior distribution of ζ is Pr(ζ | Z) = Σ_m Pr(ζ | M_m, Z) Pr(M_m | Z), with posterior mean E(ζ | Z) = Σ_m E(ζ | M_m, Z) Pr(M_m | Z). The Bayesian prediction is thus a weighted average of the individual predictions, with weights proportional to the posterior probability of each model.

Other Averaging Strategies: (1) A simple unweighted average of the predictions, treating each model as equally likely a posteriori. (2) BIC: use the BIC criterion to estimate the posterior model probabilities, weighting each model according to how well it fits and how many parameters it uses. (3) The full Bayesian strategy: Pr(M_m | Z) ∝ Pr(M_m) ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m.
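A small sketch of the BIC-based weighting (my own bookkeeping, assuming the maximized log-likelihood and parameter count of each candidate model are available): the posterior model probability Pr(M_m | Z) is approximated by exp(-BIC_m / 2), normalized over the candidates, and the averaged prediction is the weighted sum of the models' predictions.

```python
# BIC-weighted model averaging: weights proportional to exp(-BIC_m / 2).
import numpy as np

def bic_weights(log_likelihoods, n_params, N):
    """log_likelihoods[m]: maximized log-likelihood of model m on the N observations;
    n_params[m]: number of free parameters of model m."""
    bic = -2 * np.asarray(log_likelihoods) + np.asarray(n_params) * np.log(N)
    w = np.exp(-0.5 * (bic - bic.min()))    # subtract the min BIC for numerical stability
    return w / w.sum()                      # normalized approximate Pr(M_m | Z)

def bic_average(predictions, weights):
    # predictions: array of shape (M, n_test), one row of predictions per model
    return np.einsum('m,mn->n', weights, np.asarray(predictions))
```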

Frequentist Viewpoint of Averaging: Given predictions f_1(x), ..., f_M(x) from M models, we seek the weights w minimizing E_P[Y − Σ_m w_m f_m(x)]², where the input x is fixed and the N observations in Z (and the target Y) are distributed according to P. The solution is the population linear regression of Y on the vector of model predictions F(x) = [f_1(x), ..., f_M(x)]: w = E_P[F(x) F(x)^T]^{-1} E_P[F(x) Y].

Notes on the Frequentist Viewpoint: At the population level, adding models with arbitrary weights can only help. But the population is, of course, not available. The regression can be carried out over the training set instead, but this may not be ideal: model complexity is not taken into account, so complex models that overfit the training data receive too much weight.

Stacked Generalization (Stacking): Using cross-validated predictions f_m^{-i}(x_i) avoids giving unfairly high weight to models with high complexity; the stacking weights minimize Σ_i [y_i − Σ_m w_m f_m^{-i}(x_i)]². If w is restricted to vectors with one unit weight and the rest zero, stacking reduces to choosing the model with the smallest leave-one-out cross-validation error. In practice we use the combined model with the optimal weights: better prediction, but less interpretability. (A code sketch follows.)
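A sketch of stacking (assuming scikit-learn-style estimators; K-fold splits stand in for leave-one-out, and the function name is my own): build the matrix of cross-validated predictions and regress y on it to obtain the stacking weights.

```python
# Stacking: least-squares weights on cross-validated predictions f_m^{-i}(x_i).
import numpy as np
from sklearn.model_selection import KFold

def stacking_weights(models, X, y, n_splits=5, seed=0):
    cv_preds = np.zeros((len(y), len(models)))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        for m, model in enumerate(models):
            fitted = model.fit(X[train], y[train])        # fit without the held-out fold
            cv_preds[test, m] = fitted.predict(X[test])   # predict the held-out points
    # least-squares weights: argmin_w sum_i [y_i - sum_m w_m f_m^{-i}(x_i)]^2
    w, *_ = np.linalg.lstsq(cv_preds, y, rcond=None)
    return w
```

The final stacked prediction is Σ_m w_m f_m(x), with each model refit on the full training set; nothing constrains the weights to be nonnegative or to sum to one unless such constraints are added to the least-squares problem.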

Stochastic Search: Bumping. Rather than averaging models, bumping tries to find a single better model. It is good for avoiding local minima in the fitting method. As in bagging, we draw bootstrap samples and fit a model to each, but we then choose the model that best fits the original training data.

Stochastic Search: Bumping (cont'd). Given B bootstrap samples Z*1, ..., Z*B, fitting the model to each yields predictions f*b(x), b = 1, ..., B. For squared error we choose the model from the bootstrap sample b that minimizes Σ_i [y_i − f*b(x_i)]². The original training sample is included among the bootstrap samples, so the method is free to return the original model. Bumping tries to move around the model space by perturbing the data.
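A minimal sketch of bumping (not from the slides, assuming scikit-learn; a shallow regression tree stands in for any fitting procedure prone to poor greedy choices): fit to each bootstrap sample, score every candidate on the original training data, and keep the best.

```python
# Bumping: fit to bootstrap samples, keep the fit with the lowest training error
# on the ORIGINAL data.  The original sample is included so the original fit can win.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bump(X, y, B=20, seed=0, make_model=lambda: DecisionTreeRegressor(max_depth=2)):
    rng = np.random.default_rng(seed)
    best_model, best_err = None, np.inf
    for b in range(B + 1):
        # b = 0 uses the original sample; b >= 1 uses bootstrap sample Z*b
        idx = np.arange(len(y)) if b == 0 else rng.integers(0, len(y), size=len(y))
        model = make_model().fit(X[idx], y[idx])
        err = np.mean((y - model.predict(X)) ** 2)   # squared error on the original data
        if err < best_err:
            best_model, best_err = model, err
    return best_model
```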

A contrived case where bumping helps: the greedy tree-based algorithm tries to split on each dimension separately, first one and then the other, and gets stuck with a poor initial split; bumping stumbles upon the right answer.