Copyright (c) Bani K. Mallick1 STAT 651 Lecture #17.


1 Copyright (c) Bani K. Mallick1 STAT 651 Lecture #17

2 Topics in Lecture #17
Chi-squared tests for independence

3 Book Sections Covered in Lecture #17
Chapter 10.6

4 Lecture 17 Review: Comparison of Two Population Proportions
In some cases, we may want to compare two population proportions $\pi_1$ and $\pi_2$. The null hypothesis is $H_0: \pi_1 = \pi_2$. This is the same as $H_0: \pi_1 - \pi_2 = 0$.

5 Lecture 17 Review: Comparison of Two Population Proportions
The null hypothesis is $H_0: \pi_1 - \pi_2 = 0$. Form a CI for the difference in population proportions $\pi_1 - \pi_2$. The estimate of this difference is simply the difference in the sample fractions: $\hat{\pi}_1 - \hat{\pi}_2$.

6 Lecture 17 Review: Comparison of Two Population Proportions
The estimated standard error of the difference in the sample fractions, where $n_1$ and $n_2$ are the two sample sizes:
$$\widehat{\mathrm{SE}} = \sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$$
The $(1-\alpha)100\%$ CI then is
$$\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2}\,\widehat{\mathrm{SE}}$$
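As a rough illustration of the interval on this slide, here is a minimal Python sketch. The function name `two_proportion_ci` is made up for this example, and the counts are the beer counts quoted later in the lecture (6 of 11 very good beers and 6 of 24 fair-or-good beers are nationally available). With samples this small, the large-sample interval should be treated as approximate only.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, alpha=0.05):
    """Large-sample (1 - alpha) CI for pi_1 - pi_2 from success counts x1 and x2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2                                    # difference in sample fractions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # estimated standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)           # z_{alpha/2}
    return diff - z * se, diff + z * se

# Beer counts from the lecture: very good (6 of 11), fair or good (6 of 24).
low, high = two_proportion_ci(6, 11, 6, 24)
print(f"95% CI for pi_1 - pi_2: ({low:.3f}, {high:.3f})")
```

If the interval contains 0, the data do not rule out $H_0: \pi_1 - \pi_2 = 0$ at that level.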

7 Lecture 17 Review: Comparison of Two Population Proportions
Remarkably, but perhaps not surprisingly, you do not have to compute these confidence intervals by hand! The idea: simply pretend, and I do mean pretend, that the binary outcomes are real numbers and run your ordinary t-test CI, reading the unequal variance line.
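To see why the "pretend" trick works, the sketch below codes each beer as 1 (nationally available) or 0 and compares the unequal-variance standard error with the proportion formula. The variable names are illustrative, and the counts are the beer counts used later in the lecture. The two standard errors differ only because the sample variance divides by n - 1 rather than n.

```python
from math import sqrt
from statistics import mean, variance

# Binary outcomes treated as real numbers: 1 = nationally available, 0 = not.
very_good = [1] * 6 + [0] * 5     # 6 of 11 very good beers
fair_good = [1] * 6 + [0] * 18    # 6 of 24 fair-or-good beers

# Difference in means is exactly the difference in sample fractions.
diff = mean(very_good) - mean(fair_good)

# Standard error used by the unequal-variance t interval.
se_welch = sqrt(variance(very_good) / len(very_good)
                + variance(fair_good) / len(fair_good))

# Standard error from the two-proportion formula.
p1, p2 = 6 / 11, 6 / 24
se_prop = sqrt(p1 * (1 - p1) / 11 + p2 * (1 - p2) / 24)

print(f"difference = {diff:.4f}")
print(f"unequal-variance SE = {se_welch:.4f}, proportion SE = {se_prop:.4f}")
```

The t interval also uses a t critical value instead of a normal one, so the two intervals agree closely but not exactly.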

8 Chisquared Tests for Independence and Homogeneity
In the previous lecture, we asked whether two populations have the same fraction (proportion). The populations were (1) very good beers and (2) good or fair beers. We looked at the proportion of beers that were widely available in the U.S.

9 Chisquared Tests for Independence and Homogeneity
The populations were (1) very good beers and (2) good or fair beers. We looked at the proportion of beers that were widely available in the U.S. If the proportions of the two populations that are widely available are the same, then whether a beer is widely available is independent of whether the beer is very good or just good/fair.

10 Chisquared Tests for Independence and Homogeneity
If the proportions of the two populations that are widely available are the same, then whether a beer is widely available is independent of whether the beer is very good or just good/fair. Thus, we can think about testing whether the outcomes (widely available or not) are independent of the populations (very good or fair/good).

11 Chisquared Tests for Independence and Homogeneity
We can test whether two categorical factors (availability and beer rating) are independent or not. The null hypothesis is that they are independent. The alternative hypothesis is that they are not independent. This can be tested using a chisquared test.

12 Chisquared Tests for Independence and Homogeneity
The chisquared test for independence of two categorical factors: the factors can have more than 2 levels, and this is the main advantage of the chisquared test. Thus, for example, you could define three populations of beers (fair, good, very good) and three levels of availability (special, regional, national), and ask whether quality is independent of availability.

13 Chisquared Tests for Independence and Homogeneity
In SPSS, you get a table with counts in each “cell” and along the rows and columns:
# of observations in row i, column j = $n_{ij}$
# of observations in row i = $n_{i\cdot}$
# of observations in column j = $n_{\cdot j}$
# of observations = $n$

14 This describes the population table in its most general form in terms of counts:
                        Factor A
Factor B       Level 1     Level 2     Total
Level 1        n_11        n_12        n_1.
Level 2        n_21        n_22        n_2.
Total          n_.1        n_.2        n

15 US Availability and Rating: Are Better Beers More Widely Available?
There are 6 Very Good, National beers in this sample. There are 11 Very Good beers in the sample. The total sample is of size 35.

16 Chisquared Tests for Independence and Homogeneity
For purposes of explanation, let’s make up a fake example. Consider two categorical factors for males: height (short, tall) and favorite sport (golf, baseball). We probably would not expect these to be related. Here is the data table (next slide!)

17 This describes the population table in its most general form in terms of row and column counts only. There are 400 short men, 200 golfers, etc.
                 Sport Preference
Height        Golf     Baseball     Total
Short                                 400
Tall                                  600
Total          200          800      1000

18 Under the null hypothesis that height and sport preference are independent, how many short men who play golf would you expect? Think hard about this!
Totals: Short 400, Tall 600; Golf 200, Baseball 800; overall 1000.

19 Note that 40% of men are short. Under the null hypothesis that height and sport preference are independent, of the 200 men who prefer golf, you would expect 40% = 80 to be short.
Expected Cell Counts under the null hypothesis
                 Sport Preference
Height        Golf     Baseball     Total
Short           80                    400
Tall                                  600
Total          200          800      1000

20 Note that 40% of men are short. Under the null hypothesis that height and sport preference are independent, of the 800 men who prefer baseball, you would also expect 40% = 320 to be short.
Expected Cell Counts under the null hypothesis
                 Sport Preference
Height        Golf     Baseball     Total
Short           80          320       400
Tall                                  600
Total          200          800      1000

21 Under the null hypothesis that height and sport preference are independent, you can fill out the rest of the table of expected counts.
Expected Cell Counts under the null hypothesis
                 Sport Preference
Height        Golf     Baseball     Total
Short           80          320       400
Tall           120          480       600
Total          200          800      1000

22 Now you have to ask yourself: are the observed counts and the expected counts (under independence) sufficiently different as to make the null hypothesis very unlikely?
Expected Cell Counts under the null hypothesis and the Observed Counts (expected & observed)
                 Sport Preference
Height        Golf          Baseball       Total
Short         80 & 100      320 & 300        400
Tall          120 & 100     480 & 500        600
Total         200           800             1000

23 Chisquared Tests for Independence and Homogeneity
The expected number of observations in any cell of the table is
$$E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$$
where $n_{i\cdot}$ is the total for row i, $n_{\cdot j}$ is the total for column j, and $n$ is the total sample size. You simply multiply the row and column totals and divide by the total sample size. The chisquared test for independence and homogeneity simply compares the actual table fractions/numbers (observed) to the table fractions/numbers you would expect (expected) under the null hypothesis of independence.
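The row-total-times-column-total rule can be sketched in a few lines of Python. The function name `expected_counts` is made up for this illustration; the observed counts are those of the fake height and sport example.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts from the height / sport example.
# Rows: Short, Tall. Columns: Golf, Baseball.
observed = [[100, 300],
            [100, 500]]

expected = expected_counts(observed)
for label, row in zip(["Short", "Tall"], expected):
    print(label, row)
```

The same function works for tables with more than two rows or columns, since it only uses the margins.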

24 Note how the expected count for the number of men who are tall and prefer baseball is 600 x 800 / 1000 = 480.
Expected Cell Counts under the null hypothesis and the Observed Counts (expected & observed)
                 Sport Preference
Height        Golf          Baseball       Total
Short         80 & 100      320 & 300        400
Tall          120 & 100     480 & 500        600
Total         200           800             1000

25 Chisquared Tests for Independence and Homogeneity
The test statistic is computed as follows. First get the expected counts under independence. Then compute
$$\chi^2 = \sum_{i,j} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$

26 Chisquared Tests for Independence and Homogeneity
The test statistic is
$$\chi^2 = \sum_{i,j} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$
This has the counts in the cells, $n_{ij}$, and their expected values under the null hypothesis, $E_{ij}$.

27 Chisquared Tests for Independence and Homogeneity
If the table has r rows and c columns, the test statistic is
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$
You reject the null hypothesis at Type I error (level) $\alpha$ if $\chi^2 > \chi^2_{\alpha}$. Here $\chi^2_{\alpha}$ “cuts off” area $\alpha$ in the right tail of the chisquared distribution with $(r-1)\times(c-1)$ degrees of freedom (Table 7).
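Here is a minimal sketch of the statistic and the rejection rule, applied to the fake height and sport table. The function name `chi_squared_statistic` is an assumption of this example, and the critical value 3.8416 (1 degree of freedom, level 0.05) is the Table 7 value quoted later in the lecture rather than something computed here.

```python
def chi_squared_statistic(observed):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E_ij under independence
            stat += (count - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Rows: Short, Tall. Columns: Golf, Baseball.
stat, df = chi_squared_statistic([[100, 300], [100, 500]])

critical_value = 3.8416            # chi-squared cutoff for df = 1, alpha = 0.05
reject = stat > critical_value
print(f"chi-squared = {stat:.3f}, df = {df}, reject H0: {reject}")
```

For tables with more degrees of freedom, the cutoff has to be looked up for the matching df; the hard-coded value above applies only to the 2 x 2 case at level 0.05.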

28 Chisquared Tests for Independence and Homogeneity
The chisquared statistic can be computed in SPSS by going to “analyze”, “descriptives”, “crosstabs”. Then click on “statistics” and ask for “chisquared”. The p-value is slightly different from the p-value using the t-test method, but is generally pretty close. If in dispute, use the Pearson chisquared reading, with one exception (next slide).

29 Chisquared Tests for Independence and Homogeneity
SPSS will print out a message if the expected count is < 5 in any cell, i.e., $E_{ij} < 5$. In this case, use the Fisher exact value. If Fisher’s exact value does not exist in a package, use the likelihood ratio test p-value.
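For a 2 x 2 table, Fisher's exact p-value can be sketched directly from the hypergeometric distribution. This is an illustrative implementation, not SPSS's own code: the function name `fisher_exact_2x2` is made up, the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention, and the one-sided value is the upper tail. The counts are the beer counts from the lecture's crosstabulation.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Fisher exact p-values (two-sided, upper one-sided) for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    denom = comb(n, c1)

    def prob(k):
        # Probability that the top-left cell equals k with all margins fixed.
        return comb(r1, k) * comb(r2, c1 - k) / denom

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Two-sided: every table that is no more likely than the observed one.
    two_sided = sum(prob(k) for k in range(lo, hi + 1)
                    if prob(k) <= p_obs * (1 + 1e-9))
    # One-sided: top-left cell at least as large as observed.
    one_sided = sum(prob(k) for k in range(a, hi + 1))
    return two_sided, one_sided

# Rows: National, Regional. Columns: Very Good, Fair or Good.
two_sided, one_sided = fisher_exact_2x2(6, 6, 5, 18)
print(f"two-sided p = {two_sided:.3f}, one-sided p = {one_sided:.3f}")
```

Because the calculation conditions on the row and column totals, it does not rely on the large-sample chisquared approximation, which is why it is preferred when an expected count is below 5.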

30 US Availability and Rating: Are Better Beers More Widely Available?
This is the table of observed counts. Under the null hypothesis that availability and beer quality are independent, how many very good, national beers would you expect?

31 US Availability and Rating: Are Better Beers More Widely Available?
This is the table of observed counts. Under the null hypothesis that availability and beer quality are independent, how many very good, national beers would you expect? 12 x 11 / 35 = 3.77

32 US Availability and Rating: Are Better Beers More Widely Available?
This is the table of observed counts, with the expected counts under the null hypothesis that availability and beer quality are independent (observed & expected).
Availability in the U.S. * Very Good versus Other Crosstabulation (Count)
                              Very Good versus Other
Availability in the U.S.      Very Good     Fair or Good     Total
National                      6 & 3.77      6 & 8.23            12
Regional                      5 & 7.23      18 & 15.77          23
Total                         11            24                  35

33 US Availability and Rating: Are Better Beers More Widely Available?
(Same table as on slide 32.) The chisquared statistic is
$$\chi^2 = \frac{(6-3.77)^2}{3.77} + \frac{(6-8.23)^2}{8.23} + \frac{(5-7.23)^2}{7.23} + \frac{(18-15.77)^2}{15.77} \approx 2.9$$

34 US Availability and Rating: Are Better Beers More Widely Available?
(Same table as on slide 32.) The chisquared statistic is 2.9. Here r = 2, c = 2, (r-1) x (c-1) = 1, and the critical value from Table 7 is 3.8416. Note though that an expected cell count is < 5, so you have to use Fisher’s exact value.

35 US Availability and Rating: Are Better Beers More Widely Available?: p = 0.130
Chi-Square Tests
                                 Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                 (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square               2.922 (b)   1   .087
Continuity Correction (a)        1.758       1   .185
Likelihood Ratio                 2.854       1   .091
Fisher's Exact Test                                            .130         .094
Linear-by-Linear Association     2.839       1   .092
N of Valid Cases                 35
a. Computed only for a 2x2 table.
b. 1 cells (25.0%) have expected count less than 5. The minimum expected count is 3.77.
Note the warning message (footnote b), indicating the need to use Fisher’s exact test.
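The Pearson and Likelihood Ratio lines of this output can be checked by hand. The sketch below is only an illustration: the helper name `p_value_df1` is made up, and it uses the fact that a chisquared variable with 1 degree of freedom is the square of a standard normal, so its right-tail area is `erfc(sqrt(x / 2))`. That shortcut is valid only for 1 degree of freedom.

```python
from math import erfc, log, sqrt

# Rows: National, Regional. Columns: Very Good, Fair or Good.
observed = [[6, 6],
            [5, 18]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pair each observed count with its expected count.
cells = [(o, e) for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]

pearson = sum((o - e) ** 2 / e for o, e in cells)
likelihood_ratio = 2 * sum(o * log(o / e) for o, e in cells)

def p_value_df1(stat):
    """Right-tail area of the chisquared distribution with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))

print(f"Pearson: {pearson:.3f}, p = {p_value_df1(pearson):.3f}")
print(f"Likelihood ratio: {likelihood_ratio:.3f}, p = {p_value_df1(likelihood_ratio):.3f}")
```

Note that the likelihood ratio sum as written assumes no observed count is zero, since `log(0)` is undefined.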

36 Chisquared Tests for Independence and Homogeneity
SPSS also gives you expected counts and percentages in the table. You ask for “Cells” and then click on what you want. SPSS demo.

37 Chisquared Tests for Independence

38 Raw Counts
Availability in the U.S. * Very Good versus Other Crosstabulation
                                                      Very Good versus Other
Availability in the U.S.                              Very Good   .00 (Fair or Good)   Total
National   Count                                      6           6                    12
           Expected Count                             3.8         8.2                  12.0
           % within Availability in the U.S.          50.0%       50.0%                100.0%
           % within Very Good versus Other            54.5%       25.0%                34.3%
Regional   Count                                      5           18                   23
           Expected Count                             7.2         15.8                 23.0
           % within Availability in the U.S.          21.7%       78.3%                100.0%
           % within Very Good versus Other            45.5%       75.0%                65.7%
Total      Count                                      11          24                   35
           Expected Count                             11.0        24.0                 35.0
           % within Availability in the U.S.          31.4%       68.6%                100.0%
           % within Very Good versus Other            100.0%      100.0%               100.0%
The raw counts are the “Count” rows: 6, 6 (National) and 5, 18 (Regional).

39 Expected Under Independence
In the same crosstabulation, the “Expected Count” rows give the counts expected under independence: 3.8, 8.2 (National) and 7.2, 15.8 (Regional).

40 % Within Rows
In the same crosstabulation, the “% within Availability in the U.S.” rows give the row percentages: 50.0%, 50.0% (National) and 21.7%, 78.3% (Regional).

41 % Within Columns
In the same crosstabulation, the “% within Very Good versus Other” rows give the column percentages: 54.5%, 25.0% (National) and 45.5%, 75.0% (Regional).

42 Chisquared Tests for Independence and Homogeneity
SPSS also allows you to have categorical factors with more than two levels.

43 Chisquared Tests for Independence and Homogeneity
Note the warning message. Here you use the Likelihood Ratio p-value, since there is no Fisher exact value.

44 Education and Raises in Construction: do you see any structure?

45 Education and Raises in Construction

46 # Companies & Raises in Construction
Notice how those who have worked for a lot of companies have a small number of promotions.

47 47.5% of those with <= 5 companies have zero promotions; 14.3% have 3 or more. 73.7% of those with 11+ companies have zero promotions; 5.3% have 3 or more. This may suggest a trend.

48 # Companies & Raises in Construction
The general chisquared test is not significant. Note the significant “Linear-by-Linear Association”. What is this?

49 # Companies & Raises in Construction
Linear-by-Linear Association (Crosstabs): a measure of linear association between the row and column variables. This statistic should not be used for nominal (unordered) data. Also known as the Mantel-Haenszel chi-square test. This makes sense: there appears to be some ordered inverse relationship between # of promotions and # of companies.
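The linear-by-linear statistic is commonly computed as $(n-1)r^2$, where $r$ is the Pearson correlation between row scores and column scores, and it is referred to a chisquared distribution with 1 degree of freedom. The sketch below illustrates this under stated assumptions: the function name `linear_by_linear` is made up, the integer scores 1, 2, ... for the ordered levels are an assumed scoring, and since the companies and promotions counts are not given in this transcript, the beer table is used instead.

```python
from math import sqrt

def linear_by_linear(observed, row_scores, col_scores):
    """Return ((n - 1) * r**2, r) for a table of counts with numeric level scores."""
    n = sum(sum(row) for row in observed)
    # One (row score, column score, count) triple per cell.
    cells = [(row_scores[i], col_scores[j], count)
             for i, row in enumerate(observed)
             for j, count in enumerate(row)]
    mean_x = sum(x * k for x, _, k in cells) / n
    mean_y = sum(y * k for _, y, k in cells) / n
    sxy = sum(k * (x - mean_x) * (y - mean_y) for x, y, k in cells)
    sxx = sum(k * (x - mean_x) ** 2 for x, _, k in cells)
    syy = sum(k * (y - mean_y) ** 2 for _, y, k in cells)
    r = sxy / sqrt(sxx * syy)          # Pearson correlation of the scores
    return (n - 1) * r * r, r

# Beer table. Rows: National, Regional. Columns: Very Good, Fair or Good.
m2, r = linear_by_linear([[6, 6], [5, 18]], row_scores=[1, 2], col_scores=[1, 2])
print(f"linear-by-linear = {m2:.3f}, r = {r:.3f}")
```

Because the statistic depends on the scores only through the correlation, the sign of `r` shows the direction of the trend, which the general chisquared test ignores.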

50 # Companies & Raises in Construction
Note the negative trend.

