Model Comparison.

Overview: What is Model Comparison; Cross-Validation; Likelihood Ratio Test; Bayesian Model Selection; Bayes Factor; Sampling; Bayesian Information Criterion; Akaike Information Criterion; Free Energy

What is Model Comparison? Different models represent different hypotheses. A model is good to the extent that it captures the repeatable aspects of the data, allowing good, generalisable predictions. Model comparison asks: to what extent does the data support each candidate model? Models are mathematical representations of hypotheses, and model comparison is the process of evaluating how well each model that has been fitted to the data can predict other data sets concerning the processes or phenomena it sets out to explain.

There is a balance to be struck. A good model should make accurate predictions about the population from which the data were sampled. This means having enough flexibility to fit the data accurately, but not so much flexibility that it ends up fitting noise in the sample. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

Cross-Validation One method of comparing models is cross validation. If we have a data set…… Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross-Validation ……We can fit various models to the data. Here we have three model classes: a linear model, a quadratic model, and ‘join the dots.’ Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Test Set Method In cross validation, we can use the data set itself to test how well each model class makes predictions about the population that the data came from. We do this by splitting the data into a ‘training set’, to which we fit the model, and a ‘test set’, with which we test how well the fitted model predicts. Here, the blue dots will be the training set to which each candidate model will be fitted….. Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Test Set Method …..Like this Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Test Set Method Now you ‘test’ by measuring the mean squared error between the fitted model’s predictions and the test set. Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Test Set Method You then repeat this for each model class. The model class with the lowest mean squared error is the preferred model. Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Test Set Method Pros: simple. Cons: wastes data: the estimate uses only 70% of the data, and if there is not much data the test set might be lucky or unlucky. Taken From presentation by Prof. Moore, Carnegie Mellon University
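As a concrete illustration (not from the slides), here is a minimal Python sketch of the test-set method, assuming a one-dimensional regression problem, a 70/30 train/test split, and polynomial model classes; the data and the three candidate degrees are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = 2.0 + 1.5 * x + rng.normal(0, 2.0, size=x.size)   # noisy, roughly linear data

# 70/30 train/test split
idx = rng.permutation(x.size)
train, test = idx[:28], idx[28:]

for degree in (1, 2, 9):                               # three candidate model classes
    coeffs = np.polyfit(x[train], y[train], degree)    # fit on the training set only
    pred = np.polyval(coeffs, x[test])                 # predict the held-out points
    mse = np.mean((y[test] - pred) ** 2)               # test-set mean squared error
    print(f"degree {degree}: test MSE = {mse:.2f}")
```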

Cross Validation-Leave one out Another cross-validation method is ‘leave one out’. Here you take out one data point. Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Leave one out Fit the model to the remaining data Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Leave one out And then find the squared error for the held-out data point. Taken From presentation by Prof. Moore, Carnegie Mellon University

You then repeat the process for each data point and work out the mean squared error Taken From presentation by Prof. Moore, Carnegie Mellon University

You then repeat the process for each model class Taken From presentation by Prof. Moore, Carnegie Mellon University

And, again, the model class with the lowest mean squared error is the preferred model Taken From presentation by Prof. Moore, Carnegie Mellon University

Cross Validation-Leave one out Pros: doesn’t waste data. Cons: computationally expensive.
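The same illustrative setup as the previous sketch, now with leave-one-out: each point is held out in turn, the model is refitted to the rest, and the squared error at the held-out point is averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 25)
y = 2.0 + 1.5 * x + rng.normal(0, 2.0, size=x.size)

def loo_mse(degree):
    """Leave-one-out cross-validation error for a polynomial of the given degree."""
    errors = []
    for i in range(x.size):
        mask = np.arange(x.size) != i                  # leave point i out
        coeffs = np.polyfit(x[mask], y[mask], degree)  # fit to the remaining data
        pred = np.polyval(coeffs, x[i])                # predict the held-out point
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

for degree in (1, 2, 9):
    print(f"degree {degree}: LOO MSE = {loo_mse(degree):.2f}")
```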

K-fold Cross Validation K-fold is like a mixture of the test set method and the leave one out method. The data set is divided up into ‘k’ sub-sets (in this case, three sub-sets). Taken From presentation by Prof. Moore, Carnegie Mellon University

K-fold Cross Validation One sub-set is taken out, and the model is fitted to the remaining data. This is repeated for each sub-set. The squared error is then measured for each data point using the version of the model that was fitted without it, and the mean squared error can then be found. Taken From presentation by Prof. Moore, Carnegie Mellon University

K-fold Cross Validation This is then repeated for each model class Taken From presentation by Prof. Moore, Carnegie Mellon University

K-fold Cross Validation And, again, the model class with the lowest mean squared error is the preferred model. Taken From presentation by Prof. Moore, Carnegie Mellon University
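A minimal sketch of k-fold cross-validation under the same illustrative setup, with k = 3 folds as on the slide; numpy's array_split is used to form the sub-sets.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 2.0, size=x.size)

def kfold_mse(degree, k=3):
    """k-fold cross-validation error for a polynomial of the given degree."""
    folds = np.array_split(rng.permutation(x.size), k)   # k disjoint sub-sets
    errors = []
    for fold in folds:
        mask = np.ones(x.size, dtype=bool)
        mask[fold] = False                                # hold out this sub-set
        coeffs = np.polyfit(x[mask], y[mask], degree)     # fit to the rest
        pred = np.polyval(coeffs, x[fold])
        errors.extend((y[fold] - pred) ** 2)
    return np.mean(errors)

for degree in (1, 2, 9):
    print(f"degree {degree}: 3-fold MSE = {kfold_mse(degree):.2f}")
```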

Likelihood Ratio Test Used to compare nested models, i.e. one model is a special case of the other in which some of the parameters have been fixed to zero, e.g. weight = b1*height (null) vs weight = b1*height + b2*age (alternative). The likelihood ratio can be used to decide whether to reject the null model in favour of the alternative.

Likelihood Probability: if we know the coin is fair, i.e. P(H) = P(T) = 0.5, what is the probability of getting two heads in a row? P(HH | P(H)=0.5) = P(H)*P(H) = 0.5*0.5 = 0.25. Likelihood: if two coin tosses give two heads in a row, what is the likelihood that the coin is fair? L(θ|X) = P(X|θ), and so L(P(H)=0.5 | HH) = P(HH | P(H)=0.5) = 0.25.

Maximum Likelihood Estimate (MLE): the setting of the parameters that maximizes the likelihood function.
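A tiny sketch (not from the slides) that evaluates the likelihood L(p | HH) = p² over a grid of candidate values for P(heads) and picks the maximizer; it shows that while the fair-coin value p = 0.5 has likelihood 0.25, the MLE given two heads is p = 1.

```python
import numpy as np

p_grid = np.linspace(0, 1, 101)        # candidate values for P(heads)
likelihood = p_grid ** 2               # L(p | HH) = P(HH | p) = p * p

p_mle = p_grid[np.argmax(likelihood)]
print(f"L(p=0.5 | HH) = {0.5**2:.2f}")  # 0.25, as on the slide
print(f"MLE of p given HH: {p_mle}")    # 1.0, the value that maximizes the likelihood
```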

Likelihood Ratio Test Find the maximum likelihood for each model, given the data. Compute the test statistic D = 2(ln L_alt − ln L_null). Assess significance by comparing D with a chi-squared distribution with DF_alt − DF_null degrees of freedom. Taken from pdf ‘likelihood ratio tests’ by prof. G. White
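A hedged sketch of the test in Python using scipy; the maximized log-likelihoods and the single extra parameter are made-up numbers standing in for the fitted null and alternative models above.

```python
from scipy.stats import chi2

# Illustrative maximized log-likelihoods (made-up numbers):
logL_null = -420.3   # e.g. weight = b1*height
logL_alt  = -415.8   # e.g. weight = b1*height + b2*age
df = 1               # the alternative has one extra free parameter (b2)

D = 2 * (logL_alt - logL_null)   # likelihood ratio test statistic
p_value = chi2.sf(D, df)         # upper tail of the chi-squared distribution

print(f"D = {D:.2f}, p = {p_value:.4f}")
# Reject the null model in favour of the alternative if p is below the chosen alpha.
```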

Bayesian Model Comparison This is Bayes’ rule applied to models and data: P(m|y) = P(y|m) P(m) / P(y). P(m) is the prior probability of the model; P(y|m) is the probability of the data given the model, which equals the likelihood of the model given the data (the model evidence); and P(y) is the marginal likelihood over models. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

Likelihood Function (Model Evidence) The probability that the data came from a particular model is the ‘average’ performance of the model, weighted by the prior probability of its parameters; or, equivalently, the probability that parameter values drawn at random from that model class would generate the data set y. It is an integral over the parameters: P(y|m) = ∫ P(y|θ,m) P(θ|m) dθ, i.e. the probability of the data given each parameter setting, weighted by the prior probability of those parameters given the model. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

Bayesian Occam’s Razor This has an inbuilt Occam’s razor (i.e. a preference for parsimony). Whereas a model that is too simple may be very bad at explaining the data, a very flexible, complex model will be fairly good at explaining many data sets given the appropriate parameters, but it is relatively unlikely to generate any particular data set at RANDOM. The model evidence therefore tends to favour parsimonious models (that can explain the data adequately) over complex ones. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

Bayes Factor Ratio of the posterior probabilities of two models; measures the relative fit of one model vs another. The Bayes Factor is the ratio of posterior probabilities for two models. The denominator P(y) cancels out and, if the priors are the same for each model, it becomes the ratio of the model evidences: BF = P(y|m1) / P(y|m2). Bayes Factors by Kass and Raftery
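As an illustration (not from the slides), here is a sketch of a Bayes factor that can be computed exactly: a coin-flip example comparing a ‘fair coin’ model (θ fixed at 0.5) with a ‘biased coin’ model (θ given a Beta(1, 1) prior), for which the marginal likelihood has a closed form. The data, 7 heads in a particular sequence of 10 tosses, are made up.

```python
import numpy as np
from scipy.special import betaln

# Illustrative data: a particular sequence of 10 tosses containing 7 heads
n, k = 10, 7

# Model 1: fair coin, theta fixed at 0.5 (no free parameters)
log_evidence_fair = n * np.log(0.5)

# Model 2: unknown bias, theta ~ Beta(1, 1) prior; the marginal likelihood
# integral has a closed form: B(k + 1, n - k + 1) / B(1, 1)
log_evidence_biased = betaln(k + 1, n - k + 1) - betaln(1, 1)

bayes_factor = np.exp(log_evidence_biased - log_evidence_fair)
print(f"Bayes factor (biased vs fair) = {bayes_factor:.2f}")
```

With these illustrative numbers the Bayes factor comes out just below 1, so despite 7 heads in 10 tosses the simpler fair-coin model is still slightly preferred: the Occam's razor effect described on the previous slide.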

However, this integral is often intractable and so, instead of solving it directly, one has to approximate it. There are several methods for doing this. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

Approximations Sampling Bayesian Information Criterion Akaike Information Criterion Free Energy These are some (but not all) of the methods that one can use to approximate the model evidence

Sampling Draw candidate parameter settings θ from the prior P(θ|m) Compute P(D|m, θ) for each draw Calculate the mean Can perform poorly if the number of parameters is large
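A minimal sketch of this simple Monte Carlo estimator, reusing the illustrative coin example from above: draw θ from the Beta(1, 1) prior, evaluate the likelihood at each draw, and average. The 100,000 draws are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 7                                  # same illustrative coin data as above

# Simple Monte Carlo: draw parameters from the prior and average the likelihood
theta = rng.beta(1, 1, size=100_000)          # samples from the Beta(1, 1) prior
likelihood = theta**k * (1 - theta)**(n - k)  # P(data | theta) for each draw
evidence_mc = likelihood.mean()

print(f"Monte Carlo estimate of P(data | biased model): {evidence_mc:.6f}")
# Should be close to the exact value (~0.000758). With many parameters, most
# prior draws have tiny likelihood and the estimate becomes very noisy,
# which is why this can perform poorly.
```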

Laplace Approximation The Laplace approximation fits a Gaussian around the posterior mode of the model’s parameters (the MAP estimate, which coincides with the MLE when the prior is flat). Taken from ‘Approximate Inference’ by Ruslan Salachutdinov
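A rough sketch of a one-dimensional Laplace approximation (not from the slides), reusing the coin example with a flat prior on θ: find the posterior mode numerically, measure the curvature of the log joint there by finite differences, and approximate the evidence with the resulting Gaussian integral. The data (7 heads in 10 tosses) and the step size are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 10, 7                                   # same illustrative coin data

def neg_log_joint(theta):
    # -log p(y, theta) with a flat prior on theta: -(k*ln(theta) + (n-k)*ln(1-theta))
    return -(k * np.log(theta) + (n - k) * np.log(1 - theta))

# 1. Find the posterior mode (the MAP; here it equals the MLE because the prior is flat)
res = minimize_scalar(neg_log_joint, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = res.x

# 2. Curvature of -log p(y, theta) at the mode, by a central finite difference
h = 1e-4
curvature = (neg_log_joint(theta_map + h) - 2 * neg_log_joint(theta_map)
             + neg_log_joint(theta_map - h)) / h**2

# 3. Laplace approximation: p(y) ~ p(y, theta_map) * sqrt(2*pi / curvature)
evidence_laplace = np.exp(-neg_log_joint(theta_map)) * np.sqrt(2 * np.pi / curvature)
print(f"Laplace estimate of the evidence: {evidence_laplace:.6f}")   # exact value ~0.000758
```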

Bayesian Information Criterion BIC simplifies the Laplace approximation by assuming that the sample size N approaches ∞, retaining only the terms that grow as the sample size grows: BIC = k ln N − 2 ln L̂, where k is the number of parameters and L̂ is the maximized likelihood. The BIC is derived from the Laplace approximation: as N approaches infinity, the terms that grow with N contribute far more to the overall value than those that do not, so the latter become negligible. The penalty term arises from the derivation, unlike with the AIC.

What is a significant BIC?

BIC: An Example This is an example of the BIC for two models, one with 3 parameters, the other with 5. Each participant has 100 trials.

BIC: An Example If the models are to have equal BICs, the difference in their log-likelihood terms must exactly offset the difference in their penalty terms: 2(ln L_B − ln L_A) = (5 − 3) ln 100.

BIC- An Example So Model B only has to perform 4.7% better than Model A on average per trial in order to offset the complexity term

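A sketch that reproduces the arithmetic behind this example, assuming BIC = k ln N − 2 ln L̂ with N = 100 trials and k = 3 vs 5 parameters; it recovers the roughly 4.7% per-trial advantage that model B needs in order to offset its extra complexity.

```python
import numpy as np

N = 100                      # trials per participant
k_A, k_B = 3, 5              # number of free parameters in models A and B

# BIC = k * ln(N) - 2 * ln(L). Setting BIC_A = BIC_B and solving for the
# log-likelihood advantage that exactly offsets B's extra complexity:
delta_logL = (k_B - k_A) * np.log(N) / 2      # required ln L_B - ln L_A

# Spread over N trials, the required per-trial likelihood ratio is:
per_trial_ratio = np.exp(delta_logL / N)
print(f"Required log-likelihood advantage: {delta_logL:.3f}")
print(f"Per-trial likelihood ratio: {per_trial_ratio:.3f}")   # about 1.047, i.e. ~4.7%
```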

K-L Divergence The Kullback-Leibler divergence is a measure of the disparity between P, the true model (the probability density function that the data are drawn from), and Q, the approximating model(s), i.e. the candidate models. The AIC is derived from the KL divergence: it tries to select the model Q that minimizes the KL divergence. Because P is fixed (it is determined by the actual data-generating process), the AIC is derived only from the component of the KL divergence that involves Q (i.e. the integral of −p(x) log q(x)). Since the KL divergence cannot be less than zero, minimizing this term (equivalently, maximizing the expected log-likelihood ∫ p(x) log q(x) dx) minimizes the KL divergence. Taken from Wikipedia page ‘Kullback-Leibler Divergence’
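A minimal sketch of the KL divergence for discrete distributions (the distributions here are made up): it is zero when the approximating model Q matches P and grows as Q departs from P.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

p = [0.5, 0.3, 0.2]          # "true" distribution
q1 = [0.5, 0.3, 0.2]         # perfect approximation
q2 = [0.8, 0.1, 0.1]         # poor approximation

print(kl_divergence(p, q1))  # 0.0
print(kl_divergence(p, q2))  # > 0: larger divergence, worse approximating model
```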

Akaike Information Criterion The estimator that comes from the derivation is biased: you end up allowing approximately k (the number of parameters) degrees of freedom more than the number of data points should allow. The AIC has a penalty term that corrects for this bias: AIC = 2k − 2 ln L̂. Unlike with the BIC, the penalty term does not come about automatically as a result of the derivation.

AIC vs BIC AIC and BIC share the same goodness-of-fit term, but the penalty term of the BIC is potentially much more stringent than that of the AIC, so the BIC tends to favour simpler models than the AIC does. The difference in stringency between the two gets larger as the sample size gets larger.
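As a quick illustration, assuming the standard forms AIC = 2k − 2 ln L̂ and BIC = k ln N − 2 ln L̂ (which share the goodness-of-fit term −2 ln L̂), this sketch compares just the penalty terms as the sample size N grows; k = 5 is an arbitrary choice.

```python
import numpy as np

k = 5                                     # number of free parameters
for N in (10, 100, 1_000, 10_000):
    aic_penalty = 2 * k                   # AIC penalty: 2k (independent of N)
    bic_penalty = k * np.log(N)           # BIC penalty: k * ln(N)
    print(f"N = {N:>6}: AIC penalty = {aic_penalty:5.1f}, "
          f"BIC penalty = {bic_penalty:5.1f}")
# The BIC penalty exceeds the AIC penalty once ln(N) > 2, i.e. N above about 7.4,
# and the gap keeps growing, so the BIC favours simpler models ever more strongly.
```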

Free Energy Start with Bayes’ rule Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

Free Energy The free energy is derived by rearranging Bayes’ rule (from the previous slide). The likelihood p(y|θ) and the prior p(θ) are combined to give the joint p(y,θ); both sides are then multiplied by the marginal likelihood p(y) (previously the denominator) and divided by the posterior, and both sides are logged. The formula is then manipulated, first by introducing an arbitrary distribution q(θ) that integrates to 1, then by introducing a q(θ)/q(θ) term, which equals one and so does not change the equation. This allows the original equation to be separated into two terms: ln p(y) = F(q) + KL(q(θ) || p(θ|y)), where F is the free energy and KL is the Kullback-Leibler divergence between q and the true posterior. We want the KL divergence to be as small as possible, i.e. for q to match the posterior over the parameters as closely as possible. Since ln p(y) is fixed, we can minimize the KL divergence by maximizing the free energy. This is done by using variational calculus to find the q(θ) that maximises the free energy (typically assuming that the components of θ are independent of each other under q). The free energy is useful so long as the assumptions about q are correct; if, for example, the components of θ are strongly dependent, it can be quite inaccurate. Taken from Variational Bayesian Inference by Kay Broderson
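To make the decomposition concrete, here is a small numerical sketch (not from the slides) for a parameter that takes only three discrete values, so all integrals become sums; the prior, the likelihood, and the approximate posterior q are made-up illustrative numbers. It verifies that ln p(y) = F(q) + KL(q || p(θ|y)), so the free energy is always a lower bound on the log evidence.

```python
import numpy as np

# Discrete parameter with three possible values (all numbers illustrative)
prior      = np.array([0.5, 0.3, 0.2])        # p(theta)
likelihood = np.array([0.10, 0.40, 0.25])     # p(y | theta) for the observed y

joint     = likelihood * prior                # p(y, theta)
evidence  = joint.sum()                       # p(y)
posterior = joint / evidence                  # p(theta | y)

q = np.array([0.2, 0.5, 0.3])                 # an arbitrary approximate posterior

free_energy = np.sum(q * (np.log(joint) - np.log(q)))      # F = E_q[ln p(y,theta) - ln q(theta)]
kl          = np.sum(q * (np.log(q) - np.log(posterior)))  # KL(q || p(theta|y)) >= 0

print(np.log(evidence))       # ln p(y)
print(free_energy + kl)       # identical: ln p(y) = F + KL
print(free_energy)            # below ln p(y); equal only when q is the true posterior
```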

In Conclusion Model comparison is the comparison of hypotheses. It can be done with frequentist or Bayesian methods. Approximations are often necessary, and each approximation has its own advantages and disadvantages.