
The Triangle of Statistical Inference: Likelihood


1 The Triangle of Statistical Inference: Likelihood
[Diagram: the triangle of statistical inference, linking Data, Scientific Model, and Probability Model to Inference.] Alright, now that we have covered the most common probability distributions you are likely to encounter, we are going to turn our attention to likelihood and how it relates to the statistical inference framework we have just presented to you.

2 An example... The Data: xi = measurements of DBH on 50 trees
yi = measurements of crown radius on those trees. The Scientific Model: yi = a + b·xi + e (a linear relationship, with 2 parameters (a, b) and an error term e, the residuals). The Probability Model: e is normally distributed, with E[e] = 0 and variance estimated from the observed variance of the residuals... We are going to go back to our example, the model that describes crown radius as a function of DBH and an added error.
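To make the three pieces concrete, here is a minimal R sketch of how they combine into a likelihood; the data vectors (dbh, crown_radius) and parameter values below are simulated placeholders, not the data behind these slides.

    set.seed(1)
    dbh <- runif(50, 10, 60)                                  # hypothetical DBH measurements
    crown_radius <- 0.5 + 0.08 * dbh + rnorm(50, sd = 0.4)    # hypothetical crown radii

    loglik <- function(a, b, sd_e) {
      pred <- a + b * dbh                                     # scientific model: y = a + b*x
      sum(dnorm(crown_radius, mean = pred, sd = sd_e, log = TRUE))  # probability model: e ~ Normal(0, sd_e)
    }
    loglik(0.5, 0.08, 0.4)    # log-likelihood of one candidate parameter set
    loglik(1.0, 0.05, 0.4)    # a poorer candidate set scores lower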

3 So what is likelihood – and what is it good for?
Probability based (“inverse probability”). A “mathematical quantity that appears to be appropriate for measuring our order of preference among different possible populations but does not in fact obey the laws of probability” — R.A. Fisher. Foundation of the theory of statistics. Enables comparison of alternate models. Likelihood is a mathematical quantity that allows us to measure the order of preference for different possible populations (models) or hypotheses, but it does not obey the laws of probability.

4 So what is likelihood – and what is it good for?
Scientific hypotheses cannot be treated as outcomes of trials (probabilities) because we will never have the full set of possible outcomes. However, we can calculate the probability of obtaining the results, given our model (scientific hypothesis): P(data | model). Likelihood is proportional to this probability. Likelihood is a mathematical quantity that allows us to measure the order of preference for different possible populations (models) or hypotheses, but it does not obey the laws of probability.

5 Likelihood is proportional to probability
L(hypothesis | data) ∝ P(data | hypothesis(θ)), i.e. P(data | θ) = k · L(θ | data). In plain English: “The likelihood (L) of the set of parameters (θ) (in the scientific model), given the data (x), is proportional to the probability of observing the data, given the parameters...” {and this probability is something we can calculate, using the appropriate underlying probability model (i.e. a PDF)}. If we are using likelihood to compare the strength of evidence for several models (hypotheses or parameter values), we can assume that k is 1. Because a likelihood value is meaningful only when comparing one model to another, we can set the constant k to 1. Unlike a probability, a likelihood is not constrained to lie between 0 and 1 (and log-likelihoods can be negative or positive). A note about terminology: sometimes people turn this around, and you will see both kinds of arrangements. Likelihood is the probability of observing the data given some ecological model for the process (including specific values for the parameters) — so essentially we are calculating the probability that what actually happened did happen, which sounds a little odd at first.

6 Parameter values can specify your hypotheses
P(data | θ) = k · L(θ | data). Likelihood: the parameters are variable and the data are fixed — what is the likelihood of the parameters given the data? Probability: the parameters are fixed and the data are variable — what is the probability of observing the data if our model and parameters are correct? It is important that you understand the difference between likelihood and probability. Probability by definition must always take a value between 0 and 1; the model and its parameters are held fixed and we ask about the data. Likelihood is the opposite: the data are fixed — we let the data tell us — and the parameters are variable. So we try different values of the parameters and look for the ones that maximize the probability of observing the data given those parameter values.

7 General Likelihood Function
[Figure: a probability density curve plotted over the data (xi).] L(θ | x) = c · g(x | θ), where g is a probability density function (or discrete density function) evaluated at the data, θ denotes the parameters in the probability model, and c is a constant. So the likelihood function is composed of a constant, a probability density function, and data. c is a constant, and thus unimportant in comparison of alternate hypotheses or models, as long as we use the same dataset (i.e. the data remain constant) for all the models.

8 General Likelihood Function
[Figure: the same probability density curve plotted over the data (xi).] L(θ | x) = g(x | θ) — the constant c can be dropped (set to 1) because it does not affect comparisons. The parameters of the pdf are determined by the data and by the value of the parameters in the scientific model!!

9 Likelihood Axiom
“Within the framework of a statistical model, a set of data supports one statistical hypothesis better than another if the likelihood of the first hypothesis, on the data, exceeds the likelihood of the second hypothesis.” (Edwards 1972) We are advocating a philosophy of science-based, A PRIORI modeling. Hypothesis testing is a means of TESTING a model.

10 How to derive a likelihood function: Binomial
Event 10 trees die out of a population of 50 Question: What is the mortality rate (p)? Probability Density Function We want to estimate the value of the parameter theta, given that there are 10 out of 50 trees in the population that were windthrown. Likelihood The most likely parameter value is 10/50 = 0.20

11 Likelihood Profile: Binomial
[Figure: the likelihood p^x (1 − p)^(n − x) plotted against the value of the estimated parameter (p); the curve peaks at p = 0.2, at a likelihood of roughly 1.4E-11.] The model (parameter p) is defined by the data!! We want to estimate the value of the parameter p, given that 10 out of 50 trees in the population were windthrown. Emphasize that this is an iterative, often numerical, process that attempts to define the model in terms of what the data say.

12 An example: Can we predict tree fecundity as a function of tree size?
The Data: xi = measurements of DBH on 50 trees; yi = counts of seeds produced by those trees. The Scientific Model: yi = DBH^b + e (a power-function relationship, with 1 parameter (b) and an error term (e)). The Probability Model: the data follow a Poisson distribution, with E[x] = variance = λ. Emphasize and note that the expected value and the variance of the Poisson probability distribution are both lambda.

13 Iterative process
[Diagram: Data — Scientific Model (hypothesis) — Probability Model — Inference.]
1. Pick a value for the parameter in your scientific model, b (recall the scientific model is yi = DBH^b).
2. For each data point, calculate the expected (predicted) value for that value of b.
3. Calculate the probability of observing what you observed, given that parameter value and your probability model.
4. Multiply the probabilities of the individual observations.
5. Go back to step 1 until you find the maximum likelihood estimate for parameter b (see the R sketch below).
Emphasize and note that the expected value and the variance of the Poisson probability distribution are both lambda.
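The loop above can be collapsed into a simple grid search. The sketch below is a hedged illustration in R with simulated data (the vector names dbh and seeds are placeholders); it sums log-probabilities rather than multiplying probabilities, which is numerically safer and gives the same maximum.

    set.seed(1)
    dbh   <- runif(50, 10, 60)                     # hypothetical DBH values
    seeds <- rpois(50, lambda = dbh^1.3)           # simulated seed counts (true b = 1.3)

    b_grid <- seq(0.5, 2, by = 0.01)
    logL <- sapply(b_grid, function(b) {
      pred <- dbh^b                                # step 2: expected value for this b
      sum(dpois(seeds, lambda = pred, log = TRUE)) # steps 3-4: summed log-probabilities
    })
    b_grid[which.max(logL)]                        # step 5: maximum likelihood estimate of b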

14 Likelihood Poisson Process
E[x] = λ. Often we want to estimate the likelihood of several observations; we can estimate it as the product of the individual likelihoods. This becomes impractical because the probabilities are very small numbers, so we take the log of both sides and calculate the log-likelihood, summing the log-probabilities of the individual observations. NOTE: this terminology is the opposite of what Hilborn and Mangel use. In fact, we often cannot work with a particular value of a distribution directly, so we must know how to calculate the parameters of the distribution; often this leads to parameterization. Choice of model requires being precise about the motivations for building a model and focusing sharply on the question at hand. To calculate the likelihood we would try all the different values of the parameters of the model (the thetas) and find the one that explains the observations best. In the binomial case, for each of the trees we would calculate p^x (1 − p)^(n − x), or on the log scale x·log(p) + (n − x)·log(1 − p), and find the most suitable value.

15 First pass…
Model: yi = DBH^b, with b = 2. For the first observation, Predicted = 0.0617 and Observed = 2; the observation is treated as a Poisson random variable with E[x1] = 0.0617. Do this for each of the n observations…… Changing the parameter changes the shape of the pdf. The likelihood of the full dataset is the product of these individual likelihoods — in practice the sum of their logs, as described on the previous slide.

16 Pick a new value of beta...
Model: yi = DBH^b. For the same first observation, Predicted = 0.498 and Observed = 2; the observation is now a Poisson random variable with E[x1] = 0.498. Do this for each of the n observations…… Again, changing the parameter changes the shape of the pdf, and the likelihood of the full dataset is recalculated as the product (sum of logs) of the individual likelihoods.

17 Probability and Likelihood
Multiplying probabilities is not convenient from a computational point of view, so we take the log of the probabilities and maximize the sum of the logs. The parameter value that maximizes this sum is the Maximum Likelihood Estimate. If we are using likelihood to compare the strength of evidence for several models (hypotheses or parameter values), we can assume that k is 1, because a likelihood value is meaningful only when comparing one model to another. Unlike a probability, a likelihood is not constrained to lie between 0 and 1 (and log-likelihoods can be negative or positive). Likelihood is the probability of observing the data given some ecological model for the process (including specific values for the parameters) — so essentially we are calculating the probability that what actually happened did happen.
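One way to write the step described here, extending the notation of slides 7–8 to n observations:

    L(\theta \mid x) = \prod_{i=1}^{n} g(x_i \mid \theta)
    \qquad\Longrightarrow\qquad
    \log L(\theta \mid x) = \sum_{i=1}^{n} \log g(x_i \mid \theta)

Maximizing the log-likelihood (or, equivalently, minimizing the negative log-likelihood) gives the same parameter estimates as maximizing the likelihood itself, because the logarithm is a monotonic function.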

18 Likelihood Profile
Model: yi = DBH^b. [Figure: the log-likelihood plotted against candidate values of beta; the maximum likelihood estimate of beta is at the peak of the profile.] To build the profile we try all the different values of the parameter and find the one that explains the observations best, summing the log-probabilities of the individual observations for each candidate value (see the grid-search sketch after slide 13).

19 Model comparison
The Data: xi = measurements of DBH on 50 trees; yi = counts of seeds produced by those trees. The Scientific Models: yi = DBH^b + e (a power-function relationship, with 1 parameter (b)) OR yi = γ·DBH + e (a linear relationship with 1 parameter (γ)). The Probability Model: the data follow a Poisson distribution, with E[x] = variance = λ. This very same approach can also be used to compare models (as long as we adjust for the number of parameters), or even to compare probability density functions, e.g. the negative binomial. For instance, if you are modeling seed counts, we know that seeds and seedlings tend to be clumped more than expected by chance; however, this may differ depending on the dispersal mode of the particular species (wind- vs. animal-dispersed). We can compare pdfs in the same way that we compare scientific models or competing values of parameters within the same model. State that one must adjust for the number of parameters estimated.

20 Model comparison
The Data: xi = measurements of DBH on 50 trees; yi = counts of seeds produced by those trees. The Scientific Model: yi = DBH^b + e (a power-function relationship, with 1 parameter (b)). The Probability Models: the data follow a Poisson distribution, with E[x] = variance = λ, OR the data follow a negative binomial distribution with E[x] = m and clumping parameter k (the variance is defined by m and k, both estimated). We can compare the two probability models in the same way that we compare scientific models or competing parameter values, as long as we adjust for the number of parameters estimated (a sketch follows below).
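A hedged R sketch of this comparison, continuing the simulated dbh and seeds vectors from the earlier grid-search example; the helper names nll_pois and nll_nb are placeholders, and the extra parameter of the negative binomial is penalized here with AIC as one possible adjustment.

    nll_pois <- function(b) -sum(dpois(seeds, lambda = dbh^b, log = TRUE))
    nll_nb <- function(par) {                      # par = c(b, log k); the log keeps k positive
      -sum(dnbinom(seeds, mu = dbh^par[1], size = exp(par[2]), log = TRUE))
    }

    fit_pois <- optimize(nll_pois, interval = c(0.5, 2))
    fit_nb   <- optim(c(b = 1, logk = 0), nll_nb)

    c(AIC_pois = 2 * fit_pois$objective + 2 * 1,   # one estimated parameter (b)
      AIC_nb   = 2 * fit_nb$value + 2 * 2)         # two estimated parameters (b and k)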

21 Determination of appropriate likelihood function
From first principles:
- Proportions → binomial
- Several categories → multinomial
- Counts of events → Poisson, negative binomial
- Continuous data from additive processes → normal
- Quantities arising from multiplicative processes → lognormal, gamma
Empirically: examine the residuals, and test different probability distributions for the model errors.
We are advocating a philosophy of science-based, A PRIORI modeling. Hypothesis testing is a means of TESTING a model. Probability models can be thought of as competing hypotheses in exactly the same way that different parameter values (structural models) are competing hypotheses.

22 Likelihood functions: An aside about logarithms
Taking the logarithm in base a of a number is the inverse of raising a to that power: log_a(x) = y means a^y = x. Example: log10(1000) = 3. Basic log operations: log(x·y) = log(x) + log(y); log(x/y) = log(x) − log(y); log(x^n) = n·log(x). When a random variable is the sum of other continuous random variables, its distribution is usually normal. Note the distinction between likelihood, log-likelihood and negative log-likelihood: the value that maximizes the likelihood is the one that minimizes the deviation between observed and predicted. But the variance is the variance of the predicted, right? The value of the variance depends on the value of m (in other words, if m − x is smallest, so will the variance be). For the Poisson, the log-likelihood is accumulated as: lhood := lhood + observed·ln(predicted) − predicted − ln(observed!).
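The accumulation line above translates directly into R; this is a hedged one-liner with small hypothetical vectors (lgamma(x + 1) is log(x!), which avoids overflow for large counts):

    observed  <- c(2, 0, 5, 3, 1)                  # hypothetical counts
    predicted <- c(1.8, 0.6, 4.2, 2.9, 1.1)        # hypothetical model predictions
    sum(observed * log(predicted) - predicted - lgamma(observed + 1))
    sum(dpois(observed, lambda = predicted, log = TRUE))   # identical result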

23 Poisson Likelihood Function
Discrete density function and likelihood (equations shown on the slide; see below). When a random variable is the sum of other continuous random variables, its distribution is usually normal. Note the distinction between likelihood, log-likelihood and negative log-likelihood.
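The equations the slide refers to are the standard Poisson forms, written out here for reference since the slide showed them as images:

    P(x \mid \lambda) = \frac{e^{-\lambda}\,\lambda^{x}}{x!},
    \qquad
    \log L(\lambda \mid x_1,\dots,x_n) = \sum_{i=1}^{n}\left[\,x_i \log\lambda_i - \lambda_i - \log(x_i!)\,\right]

where λ_i is the value predicted by the scientific model for observation i (here, DBH_i^b).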

24 Negative Binomial Distribution Likelihood Function
Discrete density function and likelihood (equations shown on the slide; see below). k is an estimated parameter!!
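For reference, the standard ecological parameterization of the negative binomial, with mean m and clumping (dispersion) parameter k:

    P(x \mid m, k) = \frac{\Gamma(k + x)}{\Gamma(k)\,x!}
    \left(\frac{k}{k + m}\right)^{\!k}\left(\frac{m}{k + m}\right)^{\!x},
    \qquad \mathrm{Var}[x] = m + \frac{m^{2}}{k}

Small k means strong clumping; as k → ∞ the distribution approaches the Poisson.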

25 Normal Distribution Likelihood Function
Probability density function and likelihood (see below). E[x] = μ, Variance = σ². This is in fact a shortcut: we could be plugging in the value of observed − predicted up there, and in the denominator we could just plug in the standard deviation of the variable (x), and add up all the (log) probabilities. Calculating the likelihood of the error (the residual) provides the shortcut.
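The density the slide refers to is the usual normal form; in the likelihood, μ is replaced by the value predicted by the scientific model, so x − μ is simply the residual:

    f(x \mid \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
    \exp\!\left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right)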

26 Lognormal Distribution Likelihood Function
Probability density function and likelihood (see below). See the handout for the parameterization in terms of E[x] and Var[x]; it is ln(x) that is normally distributed.
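For reference, the standard lognormal density, with μ and σ the mean and standard deviation of ln(x):

    f(x \mid \mu, \sigma) = \frac{1}{x\,\sigma\sqrt{2\pi}}
    \exp\!\left(-\frac{(\ln x - \mu)^{2}}{2\sigma^{2}}\right), \qquad x > 0,
    \qquad E[x] = e^{\mu + \sigma^{2}/2}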

27 Gamma Distribution Likelihood Function
Probability density function (see below).
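For reference, the gamma density in shape–scale form (shape a, scale s), a flexible right-skewed distribution for positive continuous quantities:

    f(x \mid a, s) = \frac{x^{\,a-1} e^{-x/s}}{s^{a}\,\Gamma(a)}, \qquad x > 0,
    \qquad E[x] = a s, \quad \mathrm{Var}[x] = a s^{2}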

28 Exponential Distribution Likelihood Function
Probability density function and likelihood (see below). The negative exponential function describes the amount of time (or space) until some event occurs; lambda is 1/predicted (the rate is the reciprocal of the predicted mean).
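For reference, the exponential density with rate λ:

    f(x \mid \lambda) = \lambda\, e^{-\lambda x}, \qquad x \ge 0,
    \qquad E[x] = \frac{1}{\lambda}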

29 Evaluating the strength of evidence for the MLE
Now that you have an MLE, how should you evaluate it?

30 Two purposes of support/confidence intervals
Measure of support for alternate parameter estimates. Help with fitting when something goes wrong.

31 Methods of calculating support intervals
Bootstrapping; likelihood curves and profiles.

32 Bootstrapping Resample the data with replacement and record the number of times that the parameter estimate fell within an interval. Frequentist approach: If I sampled my data a large number of times, what would my confidence in the estimate be?
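A minimal bootstrap sketch for the binomial example from slide 10 (10 dead trees out of 50); the variable names are placeholders:

    set.seed(1)
    status <- rep(c(1, 0), times = c(10, 40))      # 1 = dead, 0 = alive
    boot_p <- replicate(5000, mean(sample(status, replace = TRUE)))
    quantile(boot_p, c(0.025, 0.975))              # bootstrap 95% interval for the mortality rate p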

33 General method Draw the likelihood curve (one parameter) or surface (two parameters) or n-dimensional space (n-parameters). Figure out how much the likelihood changes as the parameter of interest moves away from the MLE.

34 Strength of evidence for particular parameter estimates – “Support”
Log-likelihood = “Support” (Edwards 1992). Likelihood provides an objective measure of the strength of evidence for different parameter estimates... Plot the likelihood curve for any given parameter. Holding the other parameters fixed at their MLEs gives a slice through the likelihood surface; re-optimizing the other parameters at each value gives a profile.

35 Asymptotic vs. Simultaneous M-Unit Support Limits
Hold all other parameters at their MLE values, and systematically vary the remaining parameter until the likelihood declines by a chosen amount (m). What should “m” be? (1.92 is a good number, and is roughly analogous to a 95% CI.) Add a bit about chi-square and the LRT test here. This can also be done in more than one dimension, but it is generally wrong because we only vary one parameter at a time; in two dimensions we would have to consider a likelihood ridge, which provides a sense of how the likelihood changes as the other parameters change.
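A hedged R sketch of the one-parameter case, using the binomial example again: find the range of p whose log-likelihood is within 1.92 units of the maximum.

    p_grid <- seq(0.001, 0.999, by = 0.001)
    logL <- dbinom(10, size = 50, prob = p_grid, log = TRUE)
    keep <- logL >= max(logL) - 1.92               # within 1.92 log-likelihood units of the MLE
    range(p_grid[keep])                            # approximate 95% support interval for p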

36 An aside on the Likelihood Ratio Test
Twice the difference in log-likelihoods (R, equal to twice the log of the likelihood ratio) follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between models A and B. This is what underlies the likelihood profile, where we fix one parameter with respect to all the other parameters. The LRT is only correct for a large number of data points, and when the value of the parameter is not at the edge of its allowable range.
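Written out, with k_A and k_B the numbers of estimated parameters in the two models:

    R = 2\left[\log L_{A} - \log L_{B}\right] \;\sim\; \chi^{2}_{\,k_A - k_B}

For a single parameter, the 5% critical value is χ²₁ = 3.84, so a drop of 3.84/2 ≈ 1.92 log-likelihood units marks the approximate 95% limits — the m = 1.92 used on the previous slide.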

37 Asymptotic vs. Simultaneous M-Unit Support Limits
Resampling method: draw a very large number of random sets of parameters and calculate the log-likelihood for each; the m-unit simultaneous support limits for parameter i are the upper and lower values that do not differ from the maximum by more than m units of support. Alternatively, set the focal parameter to a range of values and, for each value, optimize the likelihood over all the other parameters. Likelihood Ratio Test: 2 times the difference in log-likelihoods is distributed as a chi-squared statistic with degrees of freedom equal to the difference in the number of parameters between the two models. In the case of asymptotic support intervals, use the critical value with 1 degree of freedom, because there is just 1 fitted value (the parameter of interest); the p = 0.05 value for a chi-squared with 1 df is 3.84, so the critical difference in log-likelihood with 1 parameter is 1.92. In practice, it can require an enormous number of iterations to do this if there are more than a few parameters.

38 Asymptotic vs. Simultaneous Support Limits
[Figure: a hypothetical likelihood surface for 2 parameters (Parameter 1 vs. Parameter 2), showing the 2-unit drop in support and both the simultaneous and the asymptotic 2-unit support limits for P1.] In general, the asymptotic limits will almost always be narrower than the simultaneous limits, thus giving you a false sense of confidence around the parameters… On the other hand, I find the asymptotic limits to be more intuitively informative: given that we know what the MLE values are for the other parameters, why shouldn’t we use them to express our strength of support for a parameter of interest? If you calculate it for just one parameter, then you just change its value until the likelihood decreases by 1.92 units. Mention the negative correlation of parameters and the Hessian.

39 A powerful, general, but approximate shortcut is to examine the second derivative(s) of the log-likelihood as a function of the parameter(s). The second derivatives provide information about the curvature of the surface, which tells us how rapidly the log-likelihood gets worse, which allows us to estimate the confidence intervals. This procedure involves a second level of approximation (like the LRT, becoming more accurate as the number of data points increases), but it can be useful when you run into numerical difficulties calculating the profile confidence limits, when you want to compute bivariate confidence regions for complex models, or more generally to explore correlations in high-dimensional parameter spaces. The second partial derivatives with respect to the same variable twice (e.g. ∂²L/∂μ²) represent the curvature of the likelihood surface along a particular axis; the cross-derivatives (e.g. ∂²L/∂μ∂σ) describe how the slope in one direction changes as you move along another direction. For example, for the log-likelihood L of the normal distribution with parameters μ and σ, the Hessian is the 2×2 matrix of these second derivatives (written out below). In the simplest case of a one-parameter model, the Hessian reduces to a single number (i.e. d²L/dp²), the curvature of the likelihood curve at the MLE, and the estimated standard deviation of the parameter is √(−1 / (d²L/dp²)). In simple two-parameter models such as the normal…
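A reconstruction of the Hessian the paragraph refers to (a standard result, written out here because the original equation did not survive transcription): for n observations x_1, …, x_n and log-likelihood L(μ, σ) = −n log σ − (n/2) log 2π − Σ(x_i − μ)²/(2σ²), the second derivatives are

    \frac{\partial^{2} L}{\partial \mu^{2}} = -\frac{n}{\sigma^{2}},
    \qquad
    \frac{\partial^{2} L}{\partial \mu\,\partial \sigma} = -\frac{2\sum_i (x_i - \mu)}{\sigma^{3}},
    \qquad
    \frac{\partial^{2} L}{\partial \sigma^{2}} = \frac{n}{\sigma^{2}} - \frac{3\sum_i (x_i - \mu)^{2}}{\sigma^{4}}

At the MLE the residuals sum to zero, so the cross-derivative vanishes and ∂²L/∂σ² = −2n/σ̂²; the negative inverse of this matrix gives the approximate variances of μ̂ and σ̂ (σ̂²/n and σ̂²/(2n), respectively).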

40

41 Other measures of strength of evidence for different parameter estimates
Edwards (1992; Chapter 5) Various measures of the “shape” of the likelihood surface in the vicinity of the MLE... How pointed is the peak?...

42 Evaluating Support for Parameter Estimates
Traditional confidence intervals and standard errors of the parameter estimates can be generated from the Hessian matrix. Hessian = matrix of second partial derivatives of the log-likelihood function with respect to the parameters, evaluated at the maximum likelihood estimates; also called the “Information Matrix” by Fisher. It provides a measure of the steepness of the likelihood surface in the region of the optimum; evaluated at the MLE it is the observed information matrix. It can be generated in R using optim. Illustrate on the board… the square roots of the diagonals of the inverse of the negative of the Hessian give the S.E.s, and the 95% CI is ±1.96 × S.E.

43 An example from R
The Hessian matrix (when maximizing a log-likelihood) is a numerical approximation of Fisher's information matrix (i.e. the matrix of second partial derivatives of the log-likelihood function), evaluated at the point of the maximum likelihood estimates. Thus, it is a measure of the steepness of the drop in the likelihood surface as you move away from the MLE.
> res$hessian
      a    b    sd
a     …    …    …
b     …    …    …
sd    …    …    …

44 The Hessian CI
Now invert the negative of the Hessian matrix to get the variance–covariance matrix of the parameters. The square roots of the diagonals of the inverted negative Hessian are the standard errors. Are we reverting to a frequentist framework?
> solve(-1*res$hessian)
      a        b        sd
a     …        …        …e-06
b     …        …        …e-07
sd    …        …        …e-03
> sqrt(diag(solve(-1*res$hessian)))
      a        b        sd
This is a little bit like reverting to a frequentist framework. (And ±1.96 × S.E. gives a 95% CI.)
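A self-contained R sketch of the workflow on these two slides, using simulated data (the parameter names a, b and sd mirror the output above, but the numbers are not from the original example). Note that this version minimizes the negative log-likelihood, so res$hessian is already the negative of the log-likelihood Hessian and is inverted directly; solve(-1*res$hessian), as on the slide, is the equivalent step when the log-likelihood itself has been maximized.

    set.seed(1)
    x <- runif(50, 10, 60)
    y <- 0.5 + 0.08 * x + rnorm(50, sd = 0.4)

    nll <- function(par, x, y) {                   # negative log-likelihood of the linear model
      -sum(dnorm(y, mean = par[1] + par[2] * x, sd = par[3], log = TRUE))
    }

    res <- optim(c(a = 0, b = 0.1, sd = 1), nll, x = x, y = y,
                 method = "L-BFGS-B", lower = c(-Inf, -Inf, 1e-4), hessian = TRUE)

    vc <- solve(res$hessian)                       # variance-covariance matrix of the estimates
    se <- sqrt(diag(vc))                           # standard errors
    cbind(MLE = res$par, lower = res$par - 1.96 * se, upper = res$par + 1.96 * se)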

45 Some references
Edwards, A.W.F. 1972. Likelihood. Cambridge University Press.
Feller, W. An Introduction to Probability Theory and Its Applications. Wiley & Sons.

