Model Comparison.


1 Model Comparison

2 Overview
What is Model Selection
Cross Validation
Likelihood Ratio Test
Bayesian Model Selection
Bayes Factor
Sampling
Bayesian Information Criterion
Akaike Information Criterion
Free Energy

3 What is Model Comparison
Different models represent different hypotheses. A model is good to the extent that it captures the repeatable aspects of the data, allowing good, generalisable predictions. Model comparison asks: to what extent do the data support the different candidate models? Models are mathematical representations of hypotheses, and model comparison is the process of evaluating how well each model, once fitted to the data, can make predictions about other data sets concerning the processes/phenomena it sets out to explain.

4 There is a balance to be struck.
A good model should be able to make accurate predictions about the population from which the data was sampled. This means having enough flexibility to fit the data accurately, but not so much flexibility that it ends up fitting noise in the sample. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

5 Cross-Validation One method of comparing models is cross validation. If we have a data set…… Taken From presentation by Prof. Moore, Carnegie Mellon University

6 Cross-Validation ……We can fit various models to the data.
Here we have three model classes: a linear model, a quadratic, and ‘join the dots.’ Taken From presentation by Prof. Moore, Carnegie Mellon University

7 Cross Validation-Test Set Method
In cross-validation, we can use the data set itself to test how well each model class makes predictions about the population that the data came from. We can do this by splitting the data into a ‘training set’, to which we fit the model, and a test set, with which we test how well it predicts. Here, the blue dots will be the training set to which each candidate model will be fitted….. Taken From presentation by Prof. Moore, Carnegie Mellon University

8 Cross Validation-Test Set Method
…..Like this Taken From presentation by Prof. Moore, Carnegie Mellon University

9 Cross Validation-Test Set Method
Now you ‘test’ by measuring the mean squared error between the model’s predictions and the test set. Taken From presentation by Prof. Moore, Carnegie Mellon University

10 Cross Validation-Test Set Method
You then repeat for each model class. The model class with the least mean squared error is the best model. Taken From presentation by Prof. Moore, Carnegie Mellon University

11 Cross Validation-Test Set Method
Pros: Simple. Cons: Wastes data. The estimate uses only 70% of the data, and if there is not much data, the test set might be lucky or unlucky. Taken From presentation by Prof. Moore, Carnegie Mellon University
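As a rough illustration (not from the original slides), here is a minimal Python sketch of the test-set method: a small synthetic data set is assumed, roughly 30% of the points are held out, and linear and quadratic polynomial fits stand in for two of the model classes.

```python
# Minimal test-set cross-validation sketch (illustrative data and split assumed).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + 0.3 * np.sin(6 * x) + rng.normal(0, 0.1, x.size)

# Hold out ~30% of the points as the test set; train on the remaining ~70%.
test_idx = rng.choice(x.size, size=6, replace=False)
train_idx = np.setdiff1d(np.arange(x.size), test_idx)

for degree in (1, 2):  # model classes: linear, quadratic
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    preds = np.polyval(coeffs, x[test_idx])
    mse = np.mean((y[test_idx] - preds) ** 2)
    print(f"degree {degree}: test MSE = {mse:.4f}")
# The model class with the smallest test MSE is preferred.
```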

12 Cross Validation-Leave one out
Another cross-validation method is the ‘leave one out’ method. Here you take out one data point. Taken From presentation by Prof. Moore, Carnegie Mellon University

13 Cross Validation-Leave one out
Fit the model to the remaining data Taken From presentation by Prof. Moore, Carnegie Mellon University

14 Cross Validation-Leave one out
And then find the square error for the missing data point. Taken From presentation by Prof. Moore, Carnegie Mellon University

15 You then repeat the process for each data point and work out the mean squared error
Taken From presentation by Prof. Moore, Carnegie Mellon University

16 You then repeat the process for each model class
Taken From presentation by Prof. Moore, Carnegie Mellon University

17 And, again, the model class with the least mean squared error is the best model
Taken From presentation by Prof. Moore, Carnegie Mellon University

18 Cross Validation-Leave one out
Pros: Doesn’t waste data. Cons: Computationally expensive.
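A minimal sketch of leave-one-out cross-validation along the same lines (again with assumed synthetic data and polynomial model classes):

```python
# Leave-one-out cross-validation sketch (illustrative data assumed).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
y = 1.5 * x ** 2 + rng.normal(0, 0.1, x.size)

def loocv_mse(x, y, degree):
    """Fit to all points but one, score on the held-out point, and average."""
    errors = []
    for i in range(x.size):
        keep = np.arange(x.size) != i
        coeffs = np.polyfit(x[keep], y[keep], degree)
        errors.append((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return np.mean(errors)

for degree in (1, 2):  # model classes: linear, quadratic
    print(f"degree {degree}: LOOCV MSE = {loocv_mse(x, y, degree):.4f}")
```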

19 K-fold Cross Validation
K-fold is like a mixture of the test-set method and the leave-one-out method. The data set is divided up into ‘k’ sub-sets (in this case, three sub-sets). Taken From presentation by Prof. Moore, Carnegie Mellon University

20 K-fold Cross Validation
One sub-set is taken out, and the model is fitted to the remaining data. This is repeated for each sub-set. The squared error is then measured for each data point against the version of the model that was fitted without using it. The mean squared error can then be found. Taken From presentation by Prof. Moore, Carnegie Mellon University

21 K-fold Cross Validation
This is then repeated for each model class Taken From presentation by Prof. Moore, Carnegie Mellon University

22 K-fold Cross Validation
And, again, the model class with the least mean squared error is the preferred model. Taken From presentation by Prof. Moore, Carnegie Mellon University
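And a corresponding sketch for k-fold cross-validation with k = 3 (data and model classes again assumed for illustration):

```python
# 3-fold cross-validation sketch (illustrative data assumed).
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 21)
y = 2 * x - x ** 2 + rng.normal(0, 0.1, x.size)

def kfold_mse(x, y, degree, k=3):
    """Hold out each of the k folds in turn, fit on the rest, score on the held-out fold."""
    folds = np.array_split(rng.permutation(x.size), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(x.size), fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((y[fold] - np.polyval(coeffs, x[fold])) ** 2))
    return np.mean(errors)

for degree in (1, 2):  # model classes: linear, quadratic
    print(f"degree {degree}: 3-fold MSE = {kfold_mse(x, y, degree):.4f}")
```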

23 Likelihood Ratio Test
Used to compare nested models, i.e. where one model is a special case of the other in which some of the parameters have been fixed to zero, e.g. weight = b1*height vs weight = b1*height + b2*age. The likelihood ratio can be used to decide whether to reject the null model in favour of the alternative.

24 Likelihood
Probability: if we know the coin is fair, i.e. P(H) = P(T) = 0.5, what is the probability of getting two heads in a row? P(HH | P(H)=0.5) = P(H) * P(H) = 0.5 * 0.5 = 0.25.
Likelihood: if two coin tosses give two heads in a row, what is the likelihood that the coin is fair? L(θ|X) = P(X|θ), and so L(P(H)=0.5 | HH) = P(HH | P(H)=0.5) = 0.25.

25 Maximum Likelihood Estimate
MLE (Maximum Likelihood Estimate): the setting of the parameters that maximizes the likelihood function.
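The coin example from the previous slide can be checked numerically; the grid over P(H) below is just for illustration.

```python
# Likelihood of the coin's bias after observing two heads (HH).
import numpy as np

p_heads = np.linspace(0.0, 1.0, 101)   # candidate values of P(H)
likelihood = p_heads ** 2              # L(p | HH) = P(HH | p) = p * p

print(f"L(p = 0.5 | HH) = {0.5 ** 2}")                          # 0.25, as on the slide
print(f"MLE of p given HH = {p_heads[np.argmax(likelihood)]}")  # 1.0 maximises p**2
```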

26 Likelihood Ratio Test
Find the maximum likelihood for each model, given the data.
Compute the test statistic: 2 * (ln L_alt - ln L_null).
Assess significance by comparison with a chi-squared distribution with df_alt - df_null degrees of freedom.
Taken from pdf ‘Likelihood Ratio Tests’ by Prof. G. White
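A minimal sketch of a likelihood ratio test for the nested models on slide 23 (weight ~ height vs weight ~ height + age), using made-up data and Gaussian linear models; the maximised log-likelihood comes from the usual closed form for a least-squares fit.

```python
# Likelihood ratio test sketch for nested Gaussian linear models (made-up data).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 200
height = rng.normal(170, 10, n)
age = rng.normal(40, 12, n)
weight = 0.5 * height + 0.2 * age + rng.normal(0, 5, n)

def max_loglik(X, y):
    """Maximised Gaussian log-likelihood of a least-squares fit (sigma^2 at its MLE, RSS/n)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

X_null = np.column_stack([height])        # weight = b1 * height
X_alt = np.column_stack([height, age])    # weight = b1 * height + b2 * age

lr_stat = 2 * (max_loglik(X_alt, weight) - max_loglik(X_null, weight))
p_value = chi2.sf(lr_stat, df=1)          # df = df_alt - df_null = 1 extra parameter
print(f"LR statistic = {lr_stat:.2f}, p = {p_value:.4g}")
```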

27 Bayesian Model Comparison
P(m|y) = P(y|m) P(m) / P(y)
This is Bayes’ rule as applied to models and data. P(m|y) is the probability of a model given the data, P(m) is the prior probability of the model, and P(y|m) is the probability of the data given the model, which equals the likelihood of the model given the data (the marginal likelihood of model m). P(y) is the marginal likelihood over all the models. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

28 Likelihood Function
The probability that the data came from a particular model is the ‘average’ performance of the model, weighted by the prior probability of its parameters. Equivalently, it is the probability that randomly selected parameter values from the model class would generate data set y. It is an integral, over the parameter settings, of the probability of the data given the parameters multiplied by the prior probability of those parameters given the model: P(y|m) = ∫ P(y|θ,m) P(θ|m) dθ. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani
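As a concrete (assumed) example of that integral, the sketch below computes the marginal likelihood of a coin model with a uniform prior on its bias, for a made-up data set of 7 heads in 10 tosses, using a simple grid approximation.

```python
# Marginal likelihood P(y | m) = integral of P(y | theta, m) P(theta | m) dtheta, on a grid.
import numpy as np
from scipy.stats import binom

heads, tosses = 7, 10                       # assumed data
theta = np.linspace(0.001, 0.999, 999)      # grid over the coin's bias
prior = np.ones_like(theta)                 # uniform prior density on [0, 1]
likelihood = binom.pmf(heads, tosses, theta)

dtheta = theta[1] - theta[0]
evidence = np.sum(likelihood * prior) * dtheta   # simple Riemann sum
print(f"P(y | m) ~ {evidence:.4f}")              # about 1/11 for a uniform prior
```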

29 Bayesian Occam’s Razor
This has a built-in Occam’s razor (i.e. a preference for parsimony). A model that is too simple may be very bad at explaining the data, whereas a very flexible, complex model will be fairly good at explaining lots of data sets given the appropriate parameters, but it is relatively unlikely to generate any particular data set at random. The marginal likelihood therefore tends to favour parsimonious models that can explain the data adequately over complex ones. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

30 Bayes Factor
Ratio of the posterior probabilities of two models; measures the relative fit of one model vs another. In this ratio the denominator P(y) cancels out and, if the prior probabilities of the models are the same, it reduces to the ratio of the models’ marginal likelihoods, which is the Bayes factor. Bayes Factors by Kass and Raftery
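Continuing the assumed coin example from above, the sketch below compares a fixed fair-coin model against a flexible model with a uniform prior on the bias, for the same made-up 7 heads in 10 tosses.

```python
# Bayes factor sketch: fair coin (theta fixed at 0.5) vs a coin with a uniform prior on theta.
import numpy as np
from scipy.stats import binom

heads, tosses = 7, 10
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

evidence_fair = binom.pmf(heads, tosses, 0.5)                     # P(y | M0): no free parameters
evidence_flex = np.sum(binom.pmf(heads, tosses, theta)) * dtheta  # P(y | M1): uniform prior

bayes_factor = evidence_flex / evidence_fair
print(f"BF(M1 vs M0) = {bayes_factor:.2f}")   # > 1 favours the flexible model, < 1 the fair coin
```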

31 However, the marginal likelihood integral is often intractable and so, instead of solving it directly, one has to approximate it. There are several methods for doing this. Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

32 Approximations
Sampling
Bayesian Information Criterion
Akaike Information Criterion
Free Energy
These are some (but not all) of the methods that one can use to approximate the marginal likelihood.

33 Sampling
Draw candidate parameter settings θm with high P(θm|M)
Compute P(D|M, θm) for each
Calculate the mean
Can perform poorly if the number of parameters is large
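A minimal sketch of the sampling approach, again using the assumed coin example: parameter values are drawn from the prior, the likelihood is evaluated at each draw, and the mean approximates the marginal likelihood.

```python
# Simple Monte Carlo approximation of the marginal likelihood (assumed coin example).
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(4)
heads, tosses, n_samples = 7, 10, 100_000

theta_samples = rng.uniform(0, 1, n_samples)   # draws from the prior P(theta | m)
evidence_mc = np.mean(binom.pmf(heads, tosses, theta_samples))
print(f"Monte Carlo estimate of P(y | m) ~ {evidence_mc:.4f}")   # about 1/11 here
```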

34 Laplace Approximation
The Laplace approximation fits a Gaussian around the mode of the posterior (the MAP estimate of the parameters) and uses it to approximate the marginal likelihood. Taken from ‘Approximate Inference’ by Ruslan Salakhutdinov
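For reference, a standard statement of the Laplace approximation to the marginal likelihood (the symbols below are not from the slide): θ̂ is the posterior mode (MAP estimate), d the number of parameters, and A the Hessian of the negative log joint density at θ̂.

\[
p(y \mid m) \;\approx\; p(y \mid \hat{\theta}, m)\, p(\hat{\theta} \mid m)\, (2\pi)^{d/2}\, |A|^{-1/2},
\qquad
A = -\nabla \nabla \ln \big[ p(y \mid \theta, m)\, p(\theta \mid m) \big] \Big|_{\theta = \hat{\theta}}
\]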

35 Bayesian Information Criterion
BIC simplifies the Laplace approximation by assuming that the sample size approaches ∞, retaining only the terms that grow as the sample size grows. The BIC is derived from the Laplace approximation: as N approaches infinity, the terms that grow with N contribute far more to the overall value than those that don’t, so the latter become negligible. The penalty term arises from the derivation, unlike with the AIC.
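A minimal sketch of the resulting criterion, BIC = -2 ln(L_hat) + k ln(n), with hypothetical log-likelihood values chosen only to illustrate the calculation:

```python
# BIC = -2 ln(L_hat) + k ln(n): L_hat = maximised likelihood, k = free parameters, n = observations.
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# Hypothetical maximised log-likelihoods for two models of the same 100 observations.
print(f"Model A (3 params): BIC = {bic(-120.0, 3, 100):.2f}")
print(f"Model B (5 params): BIC = {bic(-115.4, 5, 100):.2f}")
```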

36 What is a significant BIC?

37 BIC- An Example This is an example of the BIC for two models, one with 3 parameters, the other with 5. Each participant has 100 trials.

38 BIC- An Example If the Models are to have equal BICs:
In the case where these models have equal BICs, the difference in their log-likelihood terms must equal the difference in their penalty terms.

39 BIC- An Example So Model B only has to perform 4.7% better than Model A on average per trial in order to offset the complexity term
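A quick check of the arithmetic behind the 4.7% figure (3 vs 5 parameters, 100 trials): equal BICs require -2 ln L_A + 3 ln(n) = -2 ln L_B + 5 ln(n), i.e. a total log-likelihood advantage of ln(n) for Model B, spread over n trials.

```python
# Equal BICs imply ln(L_B) - ln(L_A) = ln(n), spread over n = 100 trials.
import numpy as np

n = 100                                     # trials per participant
log_lik_gap = np.log(n)                     # total log-likelihood advantage Model B needs
per_trial_ratio = np.exp(log_lik_gap / n)   # average likelihood ratio needed per trial
print(f"Model B must do {100 * (per_trial_ratio - 1):.1f}% better per trial")   # ~4.7%
```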

40 BIC- An Example

41 K-L Divergence
The Kullback-Leibler divergence is a measure of the disparity between P, the true model (the probability density function that the data are drawn from), and Q, the approximating model(s), i.e. the candidate models. The AIC is derived from the KL divergence: it tries to select the model Q that minimizes the KL divergence. Because P is fixed (i.e. it is determined by the actual data), the AIC is derived only from the component of the KL divergence relating to Q (i.e. the integral of -p(x) log q(x)); minimizing this term minimizes the KL divergence, which cannot be less than zero. Taken from the Wikipedia page ‘Kullback-Leibler Divergence’
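A small sketch of the discrete form of the KL divergence, with made-up distributions; the closer Q is to P, the smaller the divergence.

```python
# Discrete KL divergence: KL(P || Q) = sum over x of p(x) * log(p(x) / q(x)), always >= 0.
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.5, 0.3, 0.2]      # "true" distribution P
q1 = [0.8, 0.1, 0.1]     # a poor approximating model
q2 = [0.45, 0.35, 0.2]   # a better approximating model

print(kl_divergence(p, q1), kl_divergence(p, q2))   # the better model gives the smaller value
```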

42 Akaike Information Criterion
The formula that comes from the derivation is biased, i.e. you end up allowing approximately k (the number of parameters) degrees of freedom more than the data should allow. The AIC has a penalty term that accounts for this bias; unlike with the BIC, the penalty term does not come about automatically as a result of the derivation.

43 AIC vs BIC
AIC and BIC share the same goodness-of-fit term, but the penalty term for the BIC is potentially much more stringent than for the AIC. So the BIC tends to favour simpler models than the AIC. The difference in stringency between the two gets larger as the sample size gets larger.
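A small numerical illustration of that difference in stringency (the log-likelihood and parameter count below are hypothetical): AIC charges 2 per parameter, while BIC charges ln(n) per parameter, so BIC's penalty overtakes AIC's once n exceeds about 7 and keeps growing with the sample size.

```python
# AIC = -2 ln(L_hat) + 2k vs BIC = -2 ln(L_hat) + k ln(n): same fit term, different penalties.
import numpy as np

log_lik, k = -115.4, 5   # hypothetical maximised log-likelihood and parameter count
for n in (10, 100, 1000):
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * np.log(n)
    print(f"n={n:4d}: AIC = {aic:.1f}, BIC = {bic:.1f} "
          f"(penalty per parameter: 2 vs {np.log(n):.2f})")
```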

44 Free Energy Start with Bayes Rule
Taken From ‘Bayesian Model Comparison’ by Zoubin Ghahramani

45 Free Energy
The free energy is derived by rearranging Bayes’ rule (from the previous slide). The likelihood of y given θ and the prior are combined to give the joint probability in the numerator on the right; both sides are then multiplied by the marginal likelihood (previously the denominator) and divided by the posterior probability, and both sides are logged. The formula is then manipulated, first by introducing an integral over an arbitrary density q(θ), which integrates to 1, then by introducing a q(θ)/q(θ) term, which also equals one and so does not change the equation. This allows the original equation to be separated into two terms: the KL divergence between q(θ) and the posterior, and the free energy. We want the KL divergence to be as small as possible, i.e. for the disparity between q(θ) and the true posterior to be as small as possible. As ln p(y) is fixed, we can minimize the KL divergence by maximizing the free energy. This is done by using variational calculus to find the q(θ) that maximises the free energy (typically assuming that the different components of θ are independent of each other). The free energy is useful so long as your assumptions about q are correct, but if, for example, the components of θ are strongly dependent, it can be quite inaccurate. Taken from Variational Bayesian Inference by Kay Brodersen
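The decomposition described above can be written compactly as follows (a standard variational Bayes identity, not copied from the slide): since the KL term is non-negative, the free energy F(q) is a lower bound on ln p(y), and maximising F(q) over q both tightens the bound and drives q(θ) towards the posterior.

\[
\ln p(y) \;=\; \underbrace{\int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta}_{F(q)\ \text{(free energy)}}
\;+\;
\underbrace{\int q(\theta)\,\ln\frac{q(\theta)}{p(\theta \mid y)}\,d\theta}_{\mathrm{KL}\left(q(\theta)\,\|\,p(\theta \mid y)\right)}
\]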

46 In Conclusion
Model Comparison is the Comparison of Hypotheses
Can be done with Frequentist or Bayesian Methods
Often Approximations are Necessary
Each Approximation has Advantages and Disadvantages

