1 Contingency Tables: Tests for Independence and Homogeneity (§10.5)
How to test hypotheses of independence (association) and homogeneity (similarity) for general two-way cross-classifications of count data.
Terms:
  Contingency table
  Cross-classification table
  Measure of association
  Independence in two-way tables
  Chi-square test for independence or homogeneity
2 Test of Independence or Association
A university conducted a study of how students evaluate faculty teaching. A random sample of 467 faculty members is selected, and each person is classified according to rank (Instructor, Assistant Professor, etc.) and teaching evaluation (Above, Average, Below). Each person therefore has two categorical responses, and the data can be arranged in a cross-tabulation, or contingency table.
3 Is the level of teaching evaluation related to rank? Are Professors more likely to be judged above average than other ranks?
Two variables that have been categorized in a two-way table are independent if, for every cell of the table, the probability that a measurement is classified into that cell equals the probability of being classified into its row times the probability of being classified into its column.
What are we interested in from this two-way classification table?
H0: Teaching evaluation and rank are independent variables.
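4 The independence assumption: under H0, $P(\text{row } i, \text{column } j) = P(\text{row } i)\,P(\text{column } j)$ for every cell.
Expected: $E_{ij} = \dfrac{(\text{row } i \text{ total})\,(\text{column } j \text{ total})}{n}$
Observed: $O_{ij}$, the count in row $i$, column $j$.
r = #rows = 3, c = #cols = 4, so this is a 3 × 4 table with df = (r-1)(c-1) = 6.
Test Statistic: $\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}$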
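The expected counts and test statistic can be sketched in plain Python as a cross-check. This is a minimal illustration, not the handout's own method: the function names `expected_counts` and `pearson_chi_square` are made up here, and the counts are those listed in the SAS data step of this handout, arranged with evaluation in rows and rank in columns.

```python
# Observed counts: rows = evaluation (Above, Average, Below),
# columns = rank (Instructor, Assistant, Associate, Professor).
observed = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]


def expected_counts(table):
    """Expected count under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


chi2, df = pearson_chi_square(observed)
print(f"chi-square = {chi2:.2f}, df = {df}")
```

The statistic is then compared with a chi-square distribution on `df` degrees of freedom.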
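5 Observed Counts
          Instructor  Assistant  Associate  Professor  Total
Above         36         62         45         50       193
Average       48         50         35         43       176
Below         30         13         20         35        98
Total        114        125        100        128       467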
6 Expected Counts
Assumptions: no $E_{ij} < 1$, and no more than 20% of the $E_{ij} < 5$.
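The rule of thumb on expected counts can be checked mechanically. The sketch below is only an illustration: `check_expected_counts` is a hypothetical helper, and the thresholds (1, 5, and 20%) are the ones stated on this slide.

```python
# Observed counts: rows = evaluation (Above, Average, Below),
# columns = rank (Instructor, Assistant, Associate, Professor).
observed = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]


def expected_counts(table):
    """Expected count under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def check_expected_counts(expected):
    """Return (smallest E, share of cells with E < 5, whether the rule holds)."""
    cells = [e for row in expected for e in row]
    smallest = min(cells)
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    ok = smallest >= 1 and share_below_5 <= 0.20
    return smallest, share_below_5, ok


smallest, share_below_5, ok = check_expected_counts(expected_counts(observed))
print(f"smallest E = {smallest:.2f}, share below 5 = {share_below_5:.0%}, ok = {ok}")
```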
7 Individual Cell Chi-Square Values
Reject H0: there is evidence of an association between rank and evaluation. Note that we observed fewer Assistant Professors getting below-average evaluations (13) than we would expect under independence (26.2). The chi-square contribution of this cell is 6.67.
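The cell-by-cell contributions and the p-value behind the decision can be sketched as follows. The names are illustrative, and the p-value uses the closed-form upper tail of the chi-square distribution that holds only for an even number of degrees of freedom (here df = 6); a statistics package would normally supply this.

```python
import math

evals = ["Above", "Average", "Below"]
ranks = ["Instructor", "Assistant", "Associate", "Professor"]
observed = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# For each cell keep (observed, expected, Pearson residual, chi-square contribution).
cells = {}
for i, ev in enumerate(evals):
    for j, rk in enumerate(ranks):
        o = observed[i][j]
        e = row_totals[i] * col_totals[j] / n
        residual = (o - e) / math.sqrt(e)
        cells[(ev, rk)] = (o, e, residual, residual ** 2)

chi2 = sum(c[3] for c in cells.values())


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for chi-square with EVEN df: exp(-x/2) * sum_{k<df/2} (x/2)^k / k!."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


p_value = chi2_upper_tail_even_df(chi2, 6)
largest = max(cells, key=lambda cell: cells[cell][3])
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}, largest cell = {largest}")
```

A negative residual marks a cell with fewer observations than expected under independence, which is the case for the Assistant Professor / Below cell.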
8 Minitab
Input data in this way, one row per cell, with columns: rank  eval  count
STAT > TABLES > Cross Tabs
Classification Variables: rank eval
Frequencies are in: count
Check Chi-square Analysis, and Above and Std. residual
9 Tabulated Statistics: eval, rank
Rows: eval   Columns: rank
Chi-Square = , DF = 6, P-Value =
Cell Contents -- Count, Exp Freq, Std. Resid
Std. Resid: square roots of the individual chi-square values.
10 SAS
options ls=79 ps=40 nocenter;
/* One record per (rank, rating) cell; number is the cell count. */
data eval;
  input job $ rating $ number;
  datalines;
Instructor Above 36
Instructor Average 48
Instructor Below 30
Assistant Above 62
Assistant Average 50
Assistant Below 13
Associate Above 45
Associate Average 35
Associate Below 20
Professor Above 50
Professor Average 43
Professor Below 35
;
run;
/* weight tells PROC FREQ that each record stands for "number" observations. */
proc freq data=eval;
  weight number;
  table job*rating / chisq;
run;

Table of job by rating
Cell contents: Frequency, Percent, Row Pct, Col Pct

job                  Above    Average    Below    Total
Assistan  Frequency     62         50       13      125
          Percent                         2.78
Associat  Frequency     45         35       20      100
          Percent     9.64       7.49     4.28
Instruct  Frequency     36         48       30      114
          Percent     7.71                6.42
Professo  Frequency     50         43       35      128
          Percent                9.21     7.49
Total
11 The FREQ Procedure
Statistics for Table of job by rating
Statistic                       DF    Value    Prob
Chi-Square
Likelihood Ratio Chi-Square
Mantel-Haenszel Chi-Square
Phi Coefficient
Contingency Coefficient
Cramer's V
Sample Size = 467
12 SPSS First you need to tell SPSS that each observation must be weighted by the cell count. DATA > WEIGHT CASES Then you choose the analysis. ANALYZE > DESCRIPTIVE STATISTICS > CROSS TABS
13
14 R
> score <- c(36,48,30,62,50,13,45,35,20,50,43,35)
> mscore <- matrix(score,3,4)
> mscore
     [,1] [,2] [,3] [,4]
[1,]   36   62   45   50
[2,]   48   50   35   43
[3,]   30   13   20   35
> chisq.test(mscore)
Pearson's Chi-squared test
data: mscore
X-squared = , df = 6, p-value =
> out <- chisq.test(mscore)
> out[1:length(out)]
$statistic
X-squared
$parameter
df
 6
$p.value
[1]
15 $method
[1] "Pearson's Chi-squared test"
$data.name
[1] "mscore"
$observed
     [,1] [,2] [,3] [,4]
[1,]   36   62   45   50
[2,]   48   50   35   43
[3,]   30   13   20   35
$expected
     [,1] [,2] [,3] [,4]
[1,]
[2,]
[3,]
$residuals (square roots of the individual chi-square values)
     [,1] [,2] [,3] [,4]
[1,]
[2,]
[3,]
16 Test of Homogeneity Suppose we wish to determine if there is an association between a rare disease and another more common categorical variable (e.g. smoking). We can’t just take a random sample of subjects and hope to get enough cases (subjects with the disease). One solution is to choose a fixed number of cases, and a fixed number of controls, and classify each according to whether they are smokers or not. The same chi square test of independence applies here, but since we are sampling within subpopulations (have fixed margin totals), this is now called a chi square test of homogeneity (of distributions).
17 Homogeneity Null Hypothesis
In general, if the column categories represent c distinct subpopulations, random samples of size $n_1, n_2, \ldots, n_c$ are selected from each and classified into the r values of a categorical variable represented by the rows of the contingency table. The hypothesis of interest is whether there is a difference in the distribution of subpopulation units among the r levels of the categorical variable, i.e., are the subpopulations homogeneous or not?
$$\begin{array}{c|cccc}
 & \text{Subpop 1} & \text{Subpop 2} & \cdots & \text{Subpop } c \\ \hline
1 & \pi_{11} & \pi_{12} & \cdots & \pi_{1c} \\
2 & \pi_{21} & \pi_{22} & \cdots & \pi_{2c} \\
\vdots & \vdots & \vdots & & \vdots \\
r & \pi_{r1} & \pi_{r2} & \cdots & \pi_{rc}
\end{array}$$
$\pi_{ij}$ = proportion of subpop j subjects (j=1,…,c) that fall in category i (i=1,…,r).
18 Null hypothesis of homogeneity
H0: $\pi_{i1} = \pi_{i2} = \cdots = \pi_{ic}$ for every category i = 1,…,r.
19 Example: Myocardial Infarction (MI)
Data were collected to determine whether there is an association between myocardial infarction and smoking in women. 262 women suffering from MI were classified according to whether they had ever smoked or not. Two controls (patients with other acute disorders) were matched to every case. Is the incidence of smoking the same for MI and non-MI sufferers?
H0: the incidence of smoking is homogeneous with respect to MI status
H0: $\pi_{11} = \pi_{12}$ and $\pi_{21} = \pi_{22}$
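The proportions compared under this null hypothesis can be estimated within each subpopulation. The sketch below assumes the counts quoted on the odds-ratio slide of this handout (172 ever-smokers among 262 MI sufferers, 173 among 519 non-sufferers); the non-smoker counts are obtained by subtraction, and the names are illustrative.

```python
# Each subpopulation (column of the table) has a fixed sample size.
mi_table = {
    "MI Yes": {"smoker": 172, "nonsmoker": 262 - 172},
    "MI No": {"smoker": 173, "nonsmoker": 519 - 173},
}


def column_proportions(counts):
    """Estimate the within-subpopulation proportions for one column."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


props = {group: column_proportions(counts) for group, counts in mi_table.items()}
for group, p in props.items():
    print(f"{group}: smoker = {p['smoker']:.3f}, nonsmoker = {p['nonsmoker']:.3f}")
```

Under homogeneity the two estimated smoking proportions should differ only by sampling variation.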
20 Example: MI results in MTB
Stat -> Tables -> Chi-Square Test
Chi-Square Test: MI Yes, MI No
Expected counts are printed below observed counts
         MI Yes   MI No   Total
Total
Chi-Sq = , DF = 1, P-Value =
Conclude: there is evidence of a lack of homogeneity in the incidence of smoking with respect to MI status.
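The same test can be sketched by hand for the 2 × 2 table. This illustration assumes the counts quoted on the odds-ratio slide of this handout (172 of 262 and 173 of 519 ever-smokers), uses no continuity correction, and relies on the identity P(X > x) = erfc(sqrt(x/2)) for a chi-square variable with 1 degree of freedom. Note that R's `chisq.test` applies a continuity correction to 2 × 2 tables by default, so its statistic would be slightly smaller.

```python
import math

# Rows = smoker / nonsmoker, columns = MI Yes / MI No.
observed = [
    [172, 173],
    [262 - 172, 519 - 173],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts if smoking is distributed identically in both groups.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form is valid for df = 1 only.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.3g}")
```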
21 Odds and Odds Ratios
Sometimes probabilities are expressed as odds, e.g.:
  Gambling circles. (Why?)
  Biomedical studies. (Easy interpretation in logistic regression, etc.)
Odds of Event A = P(A) / (1 - P(A))
P(A) = Odds of A / (1 + Odds of A)
Ex: A horse has odds of 3 to 2 of winning. This means that in every 3+2=5 races the horse wins 3 and loses 2, so P(Wins) = 3/5. To use the above formula, express the odds as d to 1, which is 1.5 to 1 in this case. Thus P(Wins) = 1.5 / (1+1.5) = 1.5 / 2.5 = 3/5.
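The two conversion formulas can be written as a pair of small functions. The function names below are illustrative, and the usage reproduces the horse-racing example.

```python
def odds_from_probability(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def probability_from_odds(odds):
    """Probability of an event whose odds are 'odds to 1': odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


horse_odds = 3 / 2  # "3 to 2" expressed as d to 1
p_win = probability_from_odds(horse_odds)
print(f"P(Wins) = {p_win:.2f}")
```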
22 Example: MI and Odds Ratios
For women sufferers of MI, the proportion who ever smoked is 172/262 = 0.656. In other words, the odds that a woman MI sufferer is a smoker are 0.656/(1 - 0.656) = 1.91.
For women non-sufferers of MI, the proportion who ever smoked is 173/519 = 0.333. In other words, the odds that a woman non-MI sufferer is a smoker are 0.333/(1 - 0.333) = 0.50.
We can now calculate the odds ratio of being a smoker for MI sufferers relative to non-sufferers: OR = 1.91/0.50 = 3.82.
Among MI sufferers, the odds of being a smoker are about 4 times the odds of being a smoker among non-MI sufferers. Put another way: a randomly selected MI sufferer is about twice as likely (0.656/0.333) to be a smoker as a randomly selected non-MI sufferer.
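The arithmetic on this slide can be reproduced directly from the four counts. This is a minimal sketch with illustrative variable names. Working from the unrounded proportions gives an odds ratio of about 3.82, while the ratio of the two proportions is about 2.

```python
# Ever-smokers and group sizes quoted on this slide.
smokers_mi, total_mi = 172, 262      # MI sufferers (cases)
smokers_ctl, total_ctl = 173, 519    # non-MI sufferers (controls)

p_mi = smokers_mi / total_mi
p_ctl = smokers_ctl / total_ctl

odds_mi = p_mi / (1 - p_mi)       # odds of being a smoker among MI sufferers
odds_ctl = p_ctl / (1 - p_ctl)    # odds of being a smoker among non-sufferers

odds_ratio = odds_mi / odds_ctl
relative_proportion = p_mi / p_ctl  # ratio of smoking proportions, not of odds

print(f"odds (MI) = {odds_mi:.2f}, odds (non-MI) = {odds_ctl:.2f}")
print(f"odds ratio = {odds_ratio:.2f}, ratio of proportions = {relative_proportion:.2f}")
```

The odds ratio and the ratio of proportions answer different questions, which is why the first is close to 4 and the second is close to 2.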