Université d’Ottawa / University of Ottawa, 2001. Bio 4118 Applied Biostatistics. Lecture 14: Contingency tables and log-linear models


1 Lecture 14: Contingency tables and log-linear models. Outline: appropriate questions; the null hypothesis; tests of independence; subdividing tables; multiway tables and log-linear models; power analysis in goodness of fit and contingency tables.

2 Contingency analysis: types of questions. Involves two (or more) categorical variables, each with two or more categories. Considers the number of observations (observed frequencies) in each category of the variables. The test is for lack of independence. [Table: results of tests on the efficacy of two sprays (1, 2) in reducing apple blight infection in orchards.]

3 Contingency analysis: types of questions. Does the species composition of bird communities differ among habitats? Two categorical variables: species and habitat type. H0: the proportion of individuals of each species is independent of habitat (i.e. more or less the same in each habitat).

4 Components of the test: null hypothesis; observations (observed frequencies); statistic (chi-square or G); assumptions.

5 Null hypothesis. In contingency analysis, the null hypothesis is that the distribution of observed frequencies among categories of one variable (e.g. A) is independent of the category of the other variables (B, C, ...), i.e. that there is no interaction. The null hypothesis is always intrinsic, i.e. the expected frequencies are computed from the data themselves (the marginal totals).

6 Testing H0: goodness of fit. In contingency analysis (as in all statistical procedures) we fit a model to the data. H0 specifies particular values for particular terms (coefficients) in the model, and is evaluated by assessing how well the fitted model, with parameter values as specified by H0, fits the data, i.e. by evaluating goodness of fit.

7 Reminder: goodness of fit. Measures the extent to which some empirical distribution “fits” the distribution expected under the null hypothesis. [Figure: observed and expected frequency against fork length.]

8 Testing goodness of fit: the chi-square statistic (χ²). Used for frequency data, i.e. the number of observations/results in each of n categories compared to the number expected under the null hypothesis. [Figure: observed and expected frequency by category/class.]
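As a minimal sketch of the statistic (the counts and the 3:1 expectation below are made-up illustration values, not data from the lecture), the chi-square goodness-of-fit statistic sums (observed − expected)² / expected over the categories:

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts in two categories, tested against a 3:1 expectation.
observed = [70, 30]
total = sum(observed)
expected = [total * 0.75, total * 0.25]

chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1  # expected proportions come from outside the data
print(f"chi-square = {chi2:.3f}, df = {df}")
```

The statistic is then compared with the chi-square distribution on the stated degrees of freedom.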

9 Two-way tables: H0 accepted. H0: the proportion of infected versus non-infected trees is the same for both sprays. In this case, we accept H0. [Figure: proportion infected, spray 1 vs. spray 2.]

10 Two-way tables: H0 rejected. H0: the proportion of infected versus non-infected trees is the same for both sprays. In this case, we reject H0. [Figure: proportion infected, spray 1 vs. spray 2.]

11 Two-way tables: the general model-fitting procedure. Fit two models: one in which the interaction is included, the other with it removed. Evaluate goodness of fit (GOF) for each model. Evaluate the reduction (Δ) in GOF associated with dropping the interaction, i.e. under H0 that the interaction is zero. [Diagram: Model 1 (interaction in) and Model 2 (interaction out) give ΔGOF (e.g. Δχ²); accept H0 if Δ is small, reject H0 if Δ is large.]

12 Two-way tables: H0 and model fit. For two-way tables, the general model includes a constant, two main effects, and an interaction. Thus, independence implies that the goodness of fit of a model with the interaction deleted is not significantly different from that of a model with the interaction included. [Figure: goodness of fit (e.g. G) with the interaction in and out; accept H0 when ΔGOF is small, reject H0 when it is large.]

13 Two-way tables: what does the general model mean anyway? The model attempts to predict the observed frequencies in each category. So, if all frequencies are equal, then the appropriate model is: [equation not preserved in transcript]. N = 80, μ = 80/4 = 20.

14 Two-way tables: what does the general model mean anyway? If the number of trees differs between the two sprays, then there will be a “main effect” due to spray. So the appropriate model includes a main effect due to spray (row, i). N = 80, μ = 80/4 = 20. f1·/2 = 30 = 1.5μ; f2·/2 = 10 = 0.5μ; so α1 = 1.5, α2 = 0.5.

15 Two-way tables: what does the general model mean anyway? If the total number of trees infected differs from the number not infected, then there will be a “main effect” due to “infection level”. So the appropriate model includes main effects due to both spray type and infection level. N = 80, μ = 80/4 = 20. f1·/2 = 30 = 1.5μ; f2·/2 = 10 = 0.5μ; so α1 = 1.5, α2 = 0.5. f·1/2 = 16 = 0.8μ; f·2/2 = 24 = 1.2μ; so β1 = 0.8, β2 = 1.2.
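The arithmetic on these slides can be checked with a short sketch. It assumes a multiplicative reading of the model (constant × row effect × column effect), which is an interpretation of the slide's numbers rather than a formula stated in the transcript; the marginal totals (60 and 20 trees per spray, 32 and 48 per infection class) are implied by f1·/2 = 30, f2·/2 = 10, f·1/2 = 16 and f·2/2 = 24.

```python
# Marginal totals implied by the slides: N = 80 trees in a 2 x 2 table.
row_totals = [60, 20]   # spray 1, spray 2
col_totals = [32, 48]   # the two infection classes
n_rows, n_cols = len(row_totals), len(col_totals)

N = sum(row_totals)
mu = N / (n_rows * n_cols)                      # constant: 80 / 4 = 20
alpha = [r / n_cols / mu for r in row_totals]   # row (spray) effects
beta = [c / n_rows / mu for c in col_totals]    # column (infection) effects

# Frequencies predicted by the two main effects alone, with no interaction
# (assumed multiplicative form: mu * alpha_i * beta_j).
expected = [[mu * a * b for b in beta] for a in alpha]
for row in expected:
    print([round(f, 1) for f in row])
```

With these totals the predicted frequencies are 24 and 36 for spray 1, and 8 and 12 for spray 2, which sum to N = 80.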

16 Two-way tables: what does the general model mean anyway? Since the expected frequency in cell (i, j) under H0 is: [equation not preserved in transcript] ... we can calculate the interaction by: [equation not preserved in transcript]. N = 80, μ = 80/4 = 20.
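The equations on this slide did not survive extraction. One reconstruction that is consistent with the main effects worked out on the preceding slides is given here; the multiplicative form and the symbols μ, α, β and (αβ) are assumptions, not the original slide content:

```latex
\hat{f}_{ij} = \mu \, \alpha_i \, \beta_j = \frac{f_{i\cdot}\, f_{\cdot j}}{N},
\qquad
(\alpha\beta)_{ij} = \frac{f_{ij}}{\mu \, \alpha_i \, \beta_j}
```

Under this form, an interaction term of 1 in every cell means the observed frequencies match their expectations under independence.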

17 Tests of independence: the chi-square statistic (χ²). Calculate the expected frequency for each cell in the table. Calculate the squared difference between observed and expected frequencies, divide it by the expected frequency, and sum over all cells. [Figure: observed and expected frequency by cell.]
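A minimal sketch of this calculation for an R × C table follows. The cell counts are hypothetical, chosen only to match the marginal totals used in the spray example (60 and 20 trees per spray, 32 and 48 per infection class).

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an R x C table."""
    expected = expected_counts(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical spray x infection counts (rows: spray 1, spray 2).
table = [[20, 40], [12, 8]]
chi2, df = chi_square_independence(table)
print(f"chi-square = {chi2:.3f}, df = {df}")
```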

18 Testing independence: the log likelihood-ratio chi-square statistic (G). Similar to χ², and usually gives very similar results. In some cases, G is more conservative (i.e. will give higher p values). [Figure: observed and expected frequency by cell.]
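For comparison with χ², here is a minimal sketch of G using the standard form G = 2 Σ O ln(O/E); the formula itself is not shown in the transcript, and the table is the same kind of hypothetical 2 × 2 count data as above.

```python
import math


def g_statistic(table):
    """Log likelihood-ratio statistic: G = 2 * sum(O * ln(O / E))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # a cell with zero observations contributes nothing
                g += observed * math.log(observed / expected)
    return 2.0 * g


table = [[20, 40], [12, 8]]  # hypothetical counts
print(f"G = {g_statistic(table):.3f}")
```

G is referred to the same chi-square distribution, with (R − 1)(C − 1) degrees of freedom, as the Pearson statistic.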

19 An example: sex ratios of eider ducks in different habitats in Hudson’s Bay. Cell counts are observed numbers (raw frequencies) of males and females in different habitats.

20 Computing expected frequencies. Use the intrinsic null hypothesis and compute the probability of an observation falling into a cell in the table under this hypothesis. Partition the total number of observations according to these probabilities. p(A) = 64/160 = .40; p(male) = 97/160 = .6063; p(A, male) under H0 = p(A) × p(male) = .2425; f(A, male) = p(A, male) × 160 = 38.8.

21 Assumptions (χ² and G). n is larger than 30. Expected frequencies are all larger than 5. The test is quite robust except when there are only 2 categories (df = 1). For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal.

22 What if n is too small, there are only 2 categories, etc.? Increase n. If there are more than 2 categories, combine categories. Use a correction factor. Use another test.

23 An example: combining categories. With three habitat categories, expected frequencies are too small in 2 cells. Therefore, combine habitats B and C.

24 Corrections for 2 categories. For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal. Continuity correction: adjust each observed frequency by 0.5 towards its expected value (i.e. reduce |observed − expected| by 0.5). Williams’ correction: divide the test statistic (G or χ²) by q = 1 + (k² − 1)/(6n(k − 1)).
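A minimal sketch of Williams' correction as written on the slide, assuming k is the number of categories and n the total number of observations (the transcript does not define either symbol); the input values are hypothetical.

```python
def williams_q(k, n):
    """Williams' correction factor q = 1 + (k^2 - 1) / (6 n (k - 1))."""
    return 1 + (k ** 2 - 1) / (6 * n * (k - 1))


def corrected_statistic(statistic, k, n):
    """Divide the test statistic by q; since q > 1, the statistic shrinks."""
    return statistic / williams_q(k, n)


# Hypothetical values: G = 4.0 from k = 2 categories and n = 50 observations.
q = williams_q(k=2, n=50)
print(f"q = {q:.4f}, corrected G = {corrected_statistic(4.0, 2, 50):.4f}")
```

Because q approaches 1 as n grows, the correction matters mainly for small samples.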

25 Subdividing tables. When the null hypothesis is rejected, you may wish to determine which categories are contributing substantially to the overall significant test statistic. General procedure: find the set of largest homogeneous subtables. Start with the smallest homogeneous table, then add rows or columns until the null hypothesis is rejected.

26 Subdividing tables. Significant interaction.

27 Subdividing tables. Significant interaction.

28 Subdividing tables. No significant interaction.

29 Subdividing tables. Conclusion: B and C are homogeneous, with both differing significantly from A.

30 Conclusion. Contingency tables are one of the most common methods of analyzing biological data. They provide robust tests (chi-square or G) of independence for categorical data, if sample sizes are adequate and expected frequencies are not too small.

31 Multiway tables and log-linear models. The notion of interaction is extended to consideration of the effects of several different variables (factors) simultaneously, exactly as in multiple-classification ANOVA.

32 Two-way tables: H0 and model fit. For two-way tables, the general model includes a constant, two main effects, and an interaction. Thus, independence implies that the goodness of fit of a model with the interaction deleted is not significantly different from that of a model with the interaction included. [Figure: goodness of fit (e.g. G) with the interaction in and out; accept H0 when ΔGOF is small, reject H0 when it is large.]

33 Multiway tables and log-linear models. For 3-way tables, the general model includes a constant, three main effects, three 2-way interactions, and one 3-way interaction. Thus, independence implies that the goodness of fit of a model with the interaction deleted is not significantly different from that of a model with the interaction included. [Figure: goodness of fit (e.g. G) with the interaction in and out; accept H0 when ΔGOF is small, reject H0 when it is large.]

34 Multiway tables and log-linear models. Effects of temperature (H, L) and humidity (H, L) on plant yield (H, L). No 3-way interaction, as the interaction between yield and temperature does not depend on humidity. [Figure: frequency of low-yield and high-yield plants by temperature (H, L) and humidity (H, L).]

35 Multiway tables and log-linear models. Effects of temperature (H, L) and humidity (H, L) on plant yield (H, L). 3-way interaction, since the effect of temperature on yield depends on humidity. [Figure: frequency of low-yield and high-yield plants by temperature (H, L) and humidity (H, L).]

36 The procedure. Test the highest-order interaction by comparing the goodness of fit of the full model and of the model with that interaction removed. If it is non-significant, test the next-lower-order interactions individually (i.e. with the others included). Where interactions are significant, do separate tests within each category of the factor(s) involved.
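The first step of this procedure can be sketched without statistical software. The example below is an illustration under stated assumptions, not the SYSTAT procedure used in the lecture: it fits the model containing all 2-way interactions by iterative proportional fitting (a standard technique swapped in here) to hypothetical 2 × 2 × 2 counts, and uses G as the goodness-of-fit measure. Because the full model reproduces the observed counts exactly (G = 0), the G of the reduced model is the change in fit attributable to the 3-way interaction.

```python
import math
from itertools import product


def fit_no_three_way(table, iterations=50):
    """Fit the model with all 2-way interactions but no 3-way interaction.

    table[i][j][k] holds observed counts. Iterative proportional fitting
    rescales the fitted values until they match every two-way marginal total.
    """
    n_i, n_j, n_k = len(table), len(table[0]), len(table[0][0])
    cells = list(product(range(n_i), range(n_j), range(n_k)))
    obs = {(i, j, k): table[i][j][k] for i, j, k in cells}
    fit = {cell: 1.0 for cell in cells}
    for _ in range(iterations):
        # Each pair of axes defines one two-way margin to match in turn.
        for axes in ((0, 1), (0, 2), (1, 2)):
            obs_margin, fit_margin = {}, {}
            for cell in cells:
                key = tuple(cell[a] for a in axes)
                obs_margin[key] = obs_margin.get(key, 0.0) + obs[cell]
                fit_margin[key] = fit_margin.get(key, 0.0) + fit[cell]
            for cell in cells:
                key = tuple(cell[a] for a in axes)
                fit[cell] *= obs_margin[key] / fit_margin[key]
    return obs, fit


def g_statistic(obs, fit):
    """G = 2 * sum(O * ln(O / E)) over all cells with O > 0."""
    return 2.0 * sum(o * math.log(o / fit[c]) for c, o in obs.items() if o > 0)


# Hypothetical 2 x 2 x 2 counts, indexed table[i][j][k].
table = [[[30, 10], [10, 30]], [[10, 30], [30, 10]]]
obs, fit = fit_no_three_way(table)
g = g_statistic(obs, fit)
df = 1  # (2 - 1)(2 - 1)(2 - 1) for the 3-way interaction
print(f"G for dropping the 3-way interaction = {g:.2f}, df = {df}")
```

In these made-up counts the association between the first two factors reverses between levels of the third, so G is large and the 3-way interaction would be retained. The sketch assumes all two-way marginal totals are positive.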

37 An example: sex ratio of sturgeon in the lower Saskatchewan River. What is the “best” model that can be fitted to these data? Does sex ratio depend on location? On year? On location × year?

38 Questions/null hypotheses. Does the sex ratio vary among years? H0: (αβ)ij = 0. Does the sex ratio vary between locations? H0: (αγ)ik = 0. Does the sex ratio vary among (year, location) combinations? H0: (αβγ)ijk = 0.

39 Fitting log-linear models with SYSTAT. Test the 3-way interaction by specifying a model with 6 terms. Conclusion: accept H0.

40 Fitting log-linear models (cont’d)

41 Residuals (in contingency tables and log-linear models). The difference between observed and expected cell frequencies. There is one residual for each cell in the table. If the fitted model is “good”, all residuals should be relatively small and there should be no obvious pattern in the table.

42 Power and sample size in goodness of fit. An external null hypothesis is specified, which gives a set of expected frequencies or, alternatively, a set of expected proportions: [equation not preserved in transcript]. The effect size is given by: [equation not preserved in transcript]. [Figure: observed and expected frequency by cell.]
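The effect-size formula is missing from the transcript. Assuming it is Cohen's w, which the lecture goes on to use with Cohen's (1988) tables, it can be sketched as w = sqrt(Σ (p1 − p0)² / p0), where p0 and p1 are the cell proportions under the null and alternative hypotheses. The proportions below are hypothetical.

```python
import math


def effect_size_w(p_null, p_alt):
    """Cohen's w: sqrt(sum((p1 - p0)^2 / p0)) over all cells."""
    return math.sqrt(sum((p1 - p0) ** 2 / p0 for p0, p1 in zip(p_null, p_alt)))


# Hypothetical example: H0 is a 3:1 ratio, the alternative a 0.70 : 0.30 split.
w = effect_size_w([0.75, 0.25], [0.70, 0.30])
print(f"w = {w:.3f}")
```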

43 Calculating power given w. Given α, w and N, we can read 1 − β from suitable tables or curves (e.g. Cohen (1988), Tables 7-3). [Figure: power (1 − β) against w (.1 to .4) for α = .05 and α = .01, with curves shifting as N decreases.]
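As an alternative to reading printed tables, power can be computed directly. The sketch below assumes the usual approach of referring the test statistic to a noncentral chi-square distribution with noncentrality N·w²; this is not described in the transcript, and the function names are illustrative. It uses only the standard library.

```python
import math


def chi2_cdf(x, df):
    """Central chi-square CDF via the series for the incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_critical(alpha, df):
    """Critical value with upper-tail probability alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1 - alpha:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def power_chi2(w, n, df, alpha=0.05):
    """Power = P(noncentral chi-square(df, lambda = n * w^2) > critical value).

    The noncentral distribution is a Poisson(lambda / 2) mixture of central
    chi-squares with df + 2j degrees of freedom; 200 terms is adequate for
    moderate noncentrality.
    """
    crit = chi2_critical(alpha, df)
    lam = n * w ** 2
    power = 0.0
    weight = math.exp(-lam / 2)  # Poisson probability of j = 0
    for j in range(200):
        power += weight * (1 - chi2_cdf(crit, df + 2 * j))
        weight *= (lam / 2) / (j + 1)
    return power


# Hypothetical inputs: w = 0.3, N = 100, df = 1, alpha = .05.
print(f"power = {power_chi2(0.3, 100, 1):.3f}")
```

The probability of a Type II error is then 1 minus the returned power.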

44 Power in goodness of fit: an example. Biological hypothesis: plumage colour in snow geese is controlled by a single autosomal locus with 2 alleles; aa = white; Aa, AA = blue. So an Aa × Aa cross should yield segregation ratios of 1 (AA) : 2 (Aa) : 1 (aa).

45 Power in goodness of fit: an example (cont’d). H0 accepted, and the effect size is given by: [equation not preserved in transcript]. From the table: [value not preserved in transcript]. So there is a > 84% chance of a Type II error, i.e. the probability of detecting a true effect size of .076 is very small.

46 Power and sample size in contingency tables. Calculate the expected cell proportions p0,ij under H0 of independence, given by the marginal proportions: [equation not preserved in transcript]. The effect size is given by: [equation not preserved in transcript]. df = (R − 1)(C − 1).
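A minimal sketch of this calculation, assuming the effect size is Cohen's w applied to cell proportions, with the null proportions formed as products of the marginal proportions (the slide's own equations are not preserved). The cell proportions are hypothetical.

```python
import math


def contingency_effect_size(proportions):
    """Cohen's w and df for an R x C table of cell proportions."""
    row_p = [sum(row) for row in proportions]
    col_p = [sum(col) for col in zip(*proportions)]
    w_squared = 0.0
    for i, row in enumerate(proportions):
        for j, p1 in enumerate(row):
            p0 = row_p[i] * col_p[j]  # expected proportion under independence
            w_squared += (p1 - p0) ** 2 / p0
    df = (len(row_p) - 1) * (len(col_p) - 1)
    return math.sqrt(w_squared), df


# Hypothetical cell proportions (they sum to 1).
proportions = [[0.25, 0.50], [0.15, 0.10]]
w, df = contingency_effect_size(proportions)
print(f"w = {w:.3f}, df = {df}")
```

When the proportions are observed counts divided by N, w equals the square root of χ²/N, which links the effect size to the test statistic.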

47 Power and sample size in contingency tables: an example. Age structure of two different field mice populations. Cell proportions, N = 120. So, about 75% chance of a Type II error.

