1 INFERENCE Nidhi Kohli, Ph.D. Quantitative Methods in Education (QME) Department of Educational Psychology 1

2 Objects of Inference  The different objects of inference in linear mixed-effects (LME) models are: 1) the fixed effects structure, 2) the random effects structure, and 3) the random error structure  Of the three, the fixed effects structure is the focus of most applied research, though the other two structures are important to include when analyzing longitudinal data 2

3 Objects of Inference  The random effects and random error are random variables and typically are not estimated  It is the variances and covariances of the random effects and the variance of the random error that are typically estimated. We use the term variance component to refer to any of these 3

4 Objects of Inference  The variance components of the random effects contain potentially useful information regarding individual differences  However, questions regarding individual differences are usually secondary to questions of aggregate (population) effects as represented by the fixed effects 4

5 Objects of Inference  In other words, the fixed effects are the focus and the other two structures (random effects and error) are primarily included to account for the dependency due to the repeated measures 5

6 Objects of Inference  Regression models for repeated measures can generally be estimated only by iterative procedures  The most satisfactory way to obtain estimates for this kind of problem is with a general method such as maximum likelihood 6

7 General Elements of ML Estimation 1) Assume the distribution of the variables has a particular form, for example, y_i ~ N(X_i β, Σ_i), where Σ_i is an (n_i x n_i) covariance matrix 2) Assume the data are a random sample from this distribution 3) Define the likelihood function in light of the data 7

8 General Elements of ML Estimation 4) Given the data, obtain values of the parameters that are most likely from among all possibilities 5) These particular parameters are called maximum likelihood estimates of the parameters under the distribution assumption 8

9 The Likelihood Principle  Main Idea:  Given that we have observed the data, y, we want to learn about the model, specifically, the model's parameters  In other words, we want to know the distribution of the unknown parameter(s) conditional on the observed data, i.e., p(Model|Data). This is known as the "inverse probability problem" 9

10 The Likelihood Principle  To solve this inverse probability problem, we define the likelihood function by reversing the roles of the data vector y and the parameter vector θ in f(y|θ), i.e., L(θ|y)  The best estimator, θ̂, is whatever value of θ maximizes: L(θ|y) = f(y|θ)  Thus, we are looking for the θ̂ that maximizes the likelihood of observing the sample data 10

11 The Likelihood Principle  IF the y i are all independent (or conditionally independent given X i ), then the likelihood of the whole sample is the product of the individual likelihoods over all the observations 11

12 The Likelihood Principle  Instead of writing down the likelihood function, we often write down the log-likelihood function. That is, ℓ(θ|y) = log L(θ|y) = ∑_i log f(y_i|θ) NOTE: Both the likelihood and the log-likelihood are maximized at the same value. This is because the log is a monotonic function. It tends to be much simpler to work with the log-likelihood since we get to sum things up. 12
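
As a quick numerical illustration of why working on the log scale is harmless, the sketch below (a made-up coin-flip sample, not an example from the slides) evaluates both the likelihood and the log-likelihood over a grid of candidate parameter values; both criteria pick the same maximizer.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads.
flips = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Candidate values of the success probability theta.
theta = np.linspace(0.01, 0.99, 99)

# Likelihood: product of Bernoulli densities over independent observations.
lik = np.array([np.prod(t**flips * (1 - t)**(1 - flips)) for t in theta])

# Log-likelihood: sum of log densities -- simpler and numerically safer.
loglik = np.array([np.sum(flips * np.log(t) + (1 - flips) * np.log(1 - t)) for t in theta])

# Both criteria pick the same maximizer (here, the sample proportion 0.7).
print(theta[np.argmax(lik)], theta[np.argmax(loglik)])
```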

13 Basic Idea: Maximum Likelihood (ML)  The purpose of ML is to find the estimate of the parameter(s) that maximizes the probability of observing the data that we have  Suppose our dependent variable follows a normal distribution: y ~ N(μ, σ²)  Thus, we have: f(y | μ, σ²) = (2πσ²)^(-1/2) exp[ -(y - μ)² / (2σ²) ] 13

14 Basic Idea: ML  In general, we will have some observations on Y and we want to estimate μ and σ²  Suppose that we have the following five observations on Y:  Intuitively, we might wonder about the odds of getting these five data points if they were from a normal distribution with μ = 100 14

15 Basic Idea: ML  The answer here is that it is not very likely -- all of the data points are a long way from 100  But what are the odds of getting the five data points from a normal distribution whose mean is close to the observed values? Now this seems much more reasonable  ML estimation is just a systematic way of searching for the parameter values of our chosen distribution that maximize the probability of observing the data that we have in hand 15

16 Detailed Explanation of ML  The following ten scores were sampled from a normal distribution:  It is known that the variance of the normal distribution is σ² = 1. The mean of the distribution is unknown  What value of μ is most reasonable in light of the fact that 1) the population is y ~ N(μ, 1) and 2) these particular ten observations have been obtained? 16

17 Detailed Explanation of ML 17  The ten data values are graphed as a dot plot. Above the data are three representative normal distributions which have σ ²=1. They differ in terms of μ

18 Detailed Explanation of ML  Things to Comprehend in this Example:  Picking different values of μ essentially slides the curve left or right over the data  Although we are interested in μ, the parameter is connected to the distribution  Deciding that μ equals a particular value means that we bring along the entire distribution. In other words, it is the mean of a normal distribution that is sought 18

19 Detailed Explanation of ML  Thus in ML estimation, parameter estimation is tied to a particular distribution  The data are the reference point, the anchor. Graphically and statistically, the data are fixed  The candidate normal distributions are shifted to match the data. The objective is to find the distribution that best matches the data 19

20 Detailed Explanation of ML  Review Possible Distributions:  The distribution with μ = 15 does not seem very promising as a candidate for this sample data  The distribution with μ = 7 is more attractive (at least the curve covers some of the scores)  It seems visually that N (10,1) is centered nearest the scores. A few data points fall to the left of the distribution's mode and a few to the right 20

21 Detailed Explanation of ML  In sum, we only checked three normal distributions from among many. It turns out, considering normal distributions with all possible choices of the mean, that μ = 10 is best by most criteria of similarity of curve to sample data 21
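
The search described on slides 16–21 can be mimicked in a few lines; the ten scores below are made-up stand-ins (the original values are not reproduced in the transcript), and the grid search simply slides the N(μ, 1) curve over the data and keeps the μ with the highest log-likelihood.

```python
import numpy as np
from scipy import stats

# Ten hypothetical scores standing in for the sample; the variance is known to be 1.
y = np.array([8.7, 9.3, 9.8, 10.1, 10.4, 9.6, 10.9, 10.2, 9.9, 10.6])

# Candidate means: each one "slides" the N(mu, 1) curve over the data.
mus = np.linspace(5, 15, 1001)
loglik = np.array([stats.norm.logpdf(y, loc=mu, scale=1.0).sum() for mu in mus])

best = mus[np.argmax(loglik)]
print(best, y.mean())   # the ML estimate of mu coincides with the sample mean
```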

22 ML Estimation  The regression parameters in β, of dimension p, characterize the mean vector. Denote the parameters in Σ_i as ω  Data vectors hold the responses y_i, i = 1,…, m, such that y_i = (y_i1, …, y_in_i)′ 22

23 ML Estimation  Although it is assumed in repeated measures data that observations within a unit are correlated, the subjects' responses should be unrelated to one another. That is, y_i, i = 1,…, m, are independent  We may represent the probability of observing the data as a function of the values of the parameters β and ω 23

24 ML Estimation  In particular, if we believe y_i ~ N(X_i β, Σ_i(ω)), then the probability that this data vector takes on the particular value y_i is represented by the joint density function for the multivariate normal distribution 24

25 ML Estimation  The individuals, y_i, are independent, and therefore the joint density function of y is the product of the m individual joint densities  Letting f(y) be the joint density for the entire data y, we can then write f(y) = ∏_{i=1}^{m} f(y_i)  Subsequently, the likelihood function can be written as: L(θ|y) = f(y|θ) 25
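
In code, this product-of-densities structure becomes a sum of multivariate normal log-densities over subjects; a minimal sketch (the argument names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(beta, Sigma_list, X_list, y_list):
    """log L(theta|y) = sum_i log f(y_i | X_i beta, Sigma_i), by independence of subjects."""
    return sum(multivariate_normal.logpdf(y, mean=X @ beta, cov=S)
               for X, y, S in zip(X_list, y_list, Sigma_list))
```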

26 ML Estimation  The method of ML thus boils down to maximizing L(θ|y) for the unknown parameters θ = (β′, ω′)′  Carrying out the maximization is typically done on the log-likelihood function 26

27 ML Estimation  The maximizing values will be functions of y. These functions applied to the random vector y yield what are known as maximum likelihood (ML) estimators  There are several ways to obtain MLEs  The most common way to tackle the maximization of the log-likelihood on the previous slide is to use numerical methods such as Newton-Raphson or the EM algorithm 27

28 ML Estimation  In their seminal work, Laird and Ware (1982) obtained an ML estimator for β conditional on ω, the generalized least squares form β̂(ω) = (∑_i X_i′ Σ_i⁻¹ X_i)⁻¹ ∑_i X_i′ Σ_i⁻¹ y_i  When ω is not known but an estimate ω̂ is available, we can set Σ̂_i = Σ_i(ω̂) and estimate β by using the above expression in which Σ_i is replaced by Σ̂_i 28

29 ML Estimation  The regression coefficients β̂ are MLEs  Two frequently used methods for estimating ω are MLE and restricted MLE (REML) 29

30 Restricted MLE (REML)  A widely acknowledged problem with MLE has to do with the estimation of the parameters in ω, i.e., the ML estimators of the variance components  The estimates of β are approximately unbiased; however, the estimators for ω have been observed to be biased when m is not too large 30

31 REML  These estimates are usually underestimates of the true population values  Recall from linear regression (independent errors) that there are several possible estimates of σ² 31

32 REML  We use the MSE instead of the ML estimator in linear regression because the MSE is unbiased  An adjusted form of maximum likelihood along the same lines for our model involves replacing the usual likelihood with a restricted likelihood, L_reml(θ), from which estimates of the variance components in ω are obtained 32

33 REML  Fixed effects regression coefficients can be subsequently obtained by plugging these estimates into the generalized least squares estimator  Thus, ML and REML can yield different values for fixed effects and variance components 33

34 REML  The estimation process follows an iterative algorithm outlined below: 1) Start with initial values for ω and β 2) Obtain ML or REML estimates of ω, ω̂_ml or ω̂_reml, through maximizing L_ml(θ) or L_reml(θ), respectively 3) Using the newly obtained ML or REML estimate of ω, compute β̂ from the generalized least squares expression with Σ_i(ω̂_ml) or Σ_i(ω̂_reml) plugged in, respectively 4) Iterate between steps 2 and 3 until some convergence criterion has been met 34
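
The alternation between the GLS step for β and the maximization over ω can be sketched for a simple special case. Below is a minimal, illustrative implementation assuming a random-intercept model, Σ_i(ω) = σ_b² J + σ_e² I, on simulated balanced data (the data and names are hypothetical, not from the slides); β is profiled out inside the objective via the Laird–Ware GLS expression, so the optimizer's iterations carry out the cycling between steps 2 and 3, and a `reml` flag adds the extra REML term.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical balanced data: m subjects, n repeated measures each,
# random-intercept model so Sigma_i(omega) = sig_b2 * J + sig_e2 * I.
rng = np.random.default_rng(0)
m, n = 50, 4
t = np.arange(n, dtype=float)
X_list = [np.column_stack([np.ones(n), t]) for _ in range(m)]
beta_true = np.array([10.0, 2.0])
y_list = [X @ beta_true + rng.normal(0.0, 1.0) + rng.normal(0.0, 0.5, n)
          for X in X_list]

def sigma_i(omega):
    """Marginal covariance of one subject's responses under the random-intercept model."""
    sig_b2, sig_e2 = np.exp(omega)      # log-scale parameters keep the variances positive
    return sig_b2 * np.ones((n, n)) + sig_e2 * np.eye(n)

def gls_beta(omega):
    """Step 3 of the algorithm: beta-hat conditional on omega (generalized least squares)."""
    S_inv = np.linalg.inv(sigma_i(omega))
    A = sum(X.T @ S_inv @ X for X in X_list)
    b = sum(X.T @ S_inv @ y for X, y in zip(X_list, y_list))
    return np.linalg.solve(A, b)

def neg2_loglik(omega, reml=False):
    """-2 x (profile) log-likelihood; the GLS step for beta is folded into the objective,
    so the optimizer's iterations play the role of cycling between steps 2 and 3."""
    S = sigma_i(omega)
    S_inv = np.linalg.inv(S)
    _, logdet_S = np.linalg.slogdet(S)
    beta = gls_beta(omega)
    val = 0.0
    for X, y in zip(X_list, y_list):
        r = y - X @ beta
        val += logdet_S + r @ S_inv @ r
    if reml:  # REML adds the log-determinant of the information matrix for beta
        A = sum(X.T @ S_inv @ X for X in X_list)
        val += np.linalg.slogdet(A)[1]
    return val

fit = minimize(neg2_loglik, x0=np.log([1.0, 1.0]), args=(True,), method="Nelder-Mead")
print("variance components (REML):", np.exp(fit.x))
print("fixed effects:", gls_beta(fit.x))
```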

35 REML  Note:  It is common practice to use the REML function in place of the usual likelihood to form likelihood ratio tests and the AIC and BIC criteria (see upcoming slides) for comparing nested models for the covariance, where the mean model is the same 35

36 REML  However, it is not recommended to use the REML function to form likelihood ratio tests and the AIC and BIC criteria to compare nested models for the mean (fixed effects), as it is not clear that the "restricted likelihood ratio" test statistic has a χ² distribution when m is large  Thus, it has been recommended to carry out tests involving the components of β using ML to fit the model 36

37 Sampling distributions  In our model ω is unknown, so the matrices Σ_i are replaced by Σ̂_i = Σ_i(ω̂) in expressions like the generalized least squares estimator of β  Thus, it is no longer possible to calculate the mean, covariance matrix, or anything else for β̂ exactly, e.g., E(β̂) has no closed form 37

38 Sampling distributions  Note that β̂ depends on ω̂, which in turn depends on y_i. Generally speaking, it is not possible to do this calculation analytically (in closed form)  Similarly, it is no longer necessarily the case that β̂ has exactly a p-variate normal sampling distribution 38

39 Sampling distributions  Hence, we try to approximate these needed quantities under some simplifying conditions  The usual simplifying conditions involve letting the sample size (i.e., number of units, m in our case) get arbitrarily large. That is, the behavior of β̂ is evaluated under the mathematical condition that m → ∞ 39

40 Sampling distributions  Under this condition, mathematically, it is possible to evaluate the sampling distribution of β̂ and show that β̂ is unbiased  Such results are not exact. Rather, they are approximations. In other words, we find out what happens in the "ideal" situation where the sample size grows infinitely large. We then hope that this will be approximately true if the sample size m is finite 40

41 Sampling distributions  Often, if m is moderately large, the approximation is very good; however, how "large" is "large" is difficult to determine  It may be shown that, approximately, for m "large," β̂ ~ N(β, V_β), where V_β = (∑_i X_i′ Σ_i⁻¹ X_i)⁻¹ 41

42 Sampling distributions  The covariance matrix of the sampling distribution of β̂ is approximated by V̂_β = (∑_i X_i′ Σ̂_i⁻¹ X_i)⁻¹, where Σ̂_i denotes the matrices with the ML or REML estimated value for ω plugged in 42

43 Sampling distributions  Standard errors for the components of β̂, for constructing interval estimates or carrying out hypothesis tests, can be computed for the j-th component as SE(β̂_j) = √([V̂_β]_jj)  It is important to recognize that these standard errors and other inferences based on this approximation are in turn approximations! 43
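
A hedged sketch of how these approximate standard errors and the corresponding large-sample Wald statistics might be computed; the argument names (X_list for the design matrices, Sigma_hat for Σ_i evaluated at ω̂) are illustrative, and for simplicity every subject is assumed to share the same Σ̂_i, as in a balanced design.

```python
import numpy as np
from scipy import stats

def beta_covariance(X_list, Sigma_hat):
    """Approximate Cov(beta-hat) = (sum_i X_i' Sigma_hat^{-1} X_i)^{-1}."""
    S_inv = np.linalg.inv(Sigma_hat)
    A = sum(X.T @ S_inv @ X for X in X_list)
    return np.linalg.inv(A)

def wald_summary(beta_hat, X_list, Sigma_hat):
    """Approximate standard errors, z statistics, and large-sample p-values for each beta_j."""
    se = np.sqrt(np.diag(beta_covariance(X_list, Sigma_hat)))
    z = beta_hat / se
    p = 2 * stats.norm.sf(np.abs(z))
    return se, z, p
```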

44 Sampling distributions  Standard errors in likelihood inference are measures of the uncertainty associated with the MLEs, and are directly tied to the roundedness of the likelihood function  When the likelihood function is peaked, the standard error is small, meaning that there is high precision in the data relating to the estimate 44

45 Sampling distributions  When the likelihood function is rounded, the standard error is large. The data do not contain much specific information about which value of the parameter is optimal  Another way to interpret the standard errors is from the perspective of replication. If the study were replicated with a sample of the same size, how likely is it that an estimate similar to the current MLE would be obtained? 45

46 Sampling distributions  Standard errors from MLE are not highly accurate in small samples  If m is small, the standard errors are still useful as rough measures of variability, but again they are just approximations  Therefore, when a test statistic gives marginal evidence of a difference at a particular significance level, one should not get too excited; perhaps it is best to state that the evidence is inconclusive 46

47 Advantages of MLE  MLE has many optimal properties in estimation (Myung, 2003): 1) Sufficiency: complete information about the parameter of interest is contained in its ML estimator; 2) Consistency: the true parameter value that generated the data is recovered asymptotically, i.e., for data from sufficiently large samples 47

48 Advantages of MLE 3) Efficiency: the lowest possible variance of parameter estimates is achieved asymptotically (i.e., the lowest possible mean squared error is achieved asymptotically) 4) Invariance: the same MLE solution is obtained regardless of the parameterization used (i.e., if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any transformation g) 48
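
The invariance property can be checked numerically; the sketch below (a hypothetical normal sample, not data from the slides) finds the MLE once in closed form under the (μ, σ²) parameterization and once numerically under the (μ, log σ) parameterization, and the two answers agree after transforming back.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=200)           # hypothetical sample

# MLE under the (mu, sigma^2) parameterization has a closed form:
mu_hat = y.mean()
var_hat = np.mean((y - mu_hat) ** 2)          # note: the ML estimator divides by n

# MLE under the (mu, log sigma) parameterization, found numerically:
neg_loglik = lambda p: -stats.norm.logpdf(y, loc=p[0], scale=np.exp(p[1])).sum()
res = optimize.minimize(neg_loglik, x0=np.array([5.0, 1.0]))  # deliberately rough start

# Invariance: transforming the numerical solution back recovers the closed-form answer.
print(np.sqrt(var_hat), np.exp(res.x[1]))
```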

49 MLE vs. LSE  ML estimation differs from least squares estimation (LSE) in terms of the criterion for estimating parameters. That is, 1) The least squares method minimizes the sum of squared errors, whereas the maximum likelihood method maximizes the probability of a model fitting the data (i.e., selects the set of values of the model parameters that maximizes the likelihood function) 49

50 MLE vs. LSE 2) In using ML, one must always make some assumption about the distribution of the data. Sometimes these distributional assumptions are made in least squares (e.g., MANOVA), but at other times they are not necessary (e.g., estimating regression parameters) 50

51 MLE vs. LSE 3) One of the optimal properties of MLE is parameterization invariance. In contrast, LSE does not have this property. 51

52 Model Comparison  Often, the process of theory development includes the examination of many competing hypotheses  The evaluation of competing hypotheses employs the use of “model comparison” methods  These methods can be either “Null Hypothesis Significance Testing (NHST)” based or “information criteria” based 52

53 NHST Method- Likelihood Ratio Test  Likelihood ratio test  This is a classic test for the comparison of nested models with different mean structures (with covariance structures fixed) or different, nested covariance structures (with mean structure fixed)  Recall that β is estimated by maximizing the log-likelihood 53

54 NHST Method- Likelihood Ratio Test  Once estimated, we can compute the value of the log-likelihood for a second model, with changes in the mean model made by dropping some terms (removing columns of X)  Note: Provided that the covariance structure is not changed, we can carry out inference through a test statistic of the form T_LRT = 2[ℓ_full − ℓ_reduced] 54

55 NHST Method- Likelihood Ratio Test In the above equation, T_LRT ~ χ²_s under H_0, where s is the degrees of freedom (number of parameters in the full model's mean structure – number of parameters in the reduced model's mean structure)  Through large sample theory, arguments can be made to show that for m → ∞, T_LRT ~ χ²_s 55
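
A small helper along these lines (illustrative, not from the slides) turns two maximized ML log-likelihoods into the test statistic and its large-sample p-value:

```python
from scipy import stats

def likelihood_ratio_test(loglik_full, loglik_reduced, s):
    """T_LRT = 2 * (loglik_full - loglik_reduced), compared to a chi-square with s df,
    where s is the number of mean-structure parameters dropped from the full model."""
    t_lrt = 2.0 * (loglik_full - loglik_reduced)
    p_value = stats.chi2.sf(t_lrt, df=s)
    return t_lrt, p_value

# Example: hypothetical ML log-likelihoods for full and reduced mean models, 1 df.
print(likelihood_ratio_test(-250.3, -253.9, s=1))
```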

56 NHST Method- Likelihood Ratio Test  If the likelihood ratio test statistic is equal to or greater than the critical value of the χ² distribution with s degrees of freedom, we reject H 0 in favor of H 1  Suppose, for example, we are interested in testing whether the slopes for the treatment (e.g., lecithin) and placebo groups are the same, i.e., H 0 : the treatment and placebo slopes are equal, versus H 1 : the slopes differ 56

57 NHST Method- Likelihood Ratio Test  It is clear that the hypotheses allow the intercepts of the two groups to be free parameters, focusing only on the slopes  If we think of the alternative hypothesis H 1 as specifying the full model, i.e., no restrictions on any of the values of the intercepts or slopes, then the null hypothesis H 0 (on previous slide) represents a reduced model in the sense that it requires the slope parameters from the two groups to be the same 57

58 NHST Method- Likelihood Ratio Test  The likelihood ratio test can be used only for nested models  The reduced model is just a special case of the full model  Operationally, the restriction that the two slope parameters are the same means that the regression parameter vector β r for the reduced model contains one fewer element than does the full model's – that is, the matrices X i must be adjusted accordingly 58

59 NHST Method- Likelihood Ratio Test 59

60 NHST Method- Likelihood Ratio Test 60  The same type of likelihood ratio test can be implemented for testing two models with nested covariance structures  However, a key difference between testing the covariance structure compared with the mean is that we will use REML functions to construct the test statistic instead of ML functions  Remember, the models must be nested and the mean structure in both reduced and full models must remain the same

61 Multimodel Inference  One disadvantage of the likelihood ratio tests is that the models to be compared have to be nested  Often times in research settings many competing hypotheses are considered, which in turn aid in the development of new knowledge  Most likely, the competing reasonable hypotheses are not nested 61

62 Multimodel Inference  Thus, the consideration of multiple working hypotheses requires a "multimodel" approach  The multimodel approach involves the translation of the working hypotheses into mathematical models that are directly evaluated based on sample data  In this context, the mathematical models are LME models and the direct evaluation is via information criteria 62

63 Multimodel Inference  The goal is to formulate multiple LME models that reflect research questions of interest and then evaluate the models  Once the set of LME models is formed, the models are evaluated based on sample data  The goal of the evaluation is two-fold: 1. rank order the models according to their fit to the data, 2. determine their relative effect sizes 63

64 Multimodel Inference  Based on these results, the researcher can make judgments regarding the relative support (fit) of the models  Inferences can be based on the single best fitting model, but it is recommended to present at least limited information - such as global model fit - for ALL the models 64

65 Multimodel Inference  The detailed results of a single best fitting model need to be reported  This information includes parameter estimates, standard errors (SEs), t-ratios of estimates, etc. 65

66 Multimodel Inference - Methods  These methods are based on a penalized version of the log-likelihoods obtained under the full and reduced models, where the penalty adjusts each log-likelihood for over-parameterization of the model  It is a fact that, the more parameters we add to the model, the larger the log-likelihood becomes 66

67 Multimodel Inference - Methods  The penalized log-likelihood is a one- number summary that can be compared with several alternative models  Depending on how these penalized versions are defined (see the next slides) one prefers the model that gives either the smaller or larger value 67

68 Multimodel Inference - Methods  Akaike’s information criterion (AIC):  This index gives information about whether a more complicated model fits better than a simpler model over and above their difference in complexity  Alternatively, it gives information about whether a model is efficient, that is, one that fits well but needs only a few parameters to do so 68

69 Multimodel Inference - Methods AIC_k = −2 ℓ_k + 2 q_k, where for model k, ℓ_k is the maximized log-likelihood and q_k is the number of parameters  In comparing models with this expression, one would prefer the model with the smaller AIC value 69

70 Multimodel Inference - Methods  Schwarz’s Bayesian information criterion (BIC):  This criterion penalizes models which are overparameterized and adjusts for the number of observations  If N is the total number of observations and w is the number of model parameters, then BIC = −2 ℓ + w log(N) 70

71 Multimodel Inference - Methods  BIC values which are smaller are preferred 71
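
Both criteria are straightforward to compute from a fitted model's maximized log-likelihood; a minimal sketch with illustrative function names:

```python
import numpy as np

def aic(loglik, q):
    """Akaike's criterion for model k: AIC_k = -2 * loglik_k + 2 * q_k."""
    return -2.0 * loglik + 2.0 * q

def bic(loglik, w, N):
    """Schwarz's criterion: BIC = -2 * loglik + w * log(N), N = total number of observations."""
    return -2.0 * loglik + w * np.log(N)

# Example: two hypothetical competing LME fits; smaller values are preferred.
print(aic(-250.3, 6), aic(-253.9, 5))
print(bic(-250.3, 6, 200), bic(-253.9, 5, 200))
```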

