Presentation 12 Chi-Square test.

What does it mean for two categorical variables to be related?
Remember that the chi-square test is used to test for a relationship between two categorical variables.
H0: There is no relationship between the variables.
Ha: There is a relationship between the variables.
If two categorical variables are related, the chance that an individual falls into a particular category of one variable depends on the category they fall into for the other variable. Suppose we want to determine whether there is a relationship between religion (Christian, Jew, Muslim, Other) and smoking. When we test for a relationship between these two variables, we are asking whether the likelihood of being a smoker differs from one religion to another. If it does, we can say that religion and smoking are related, or associated.

Chi-Square test for 2-way tables
Suppose we are studying two categorical variables in a population, where the first variable has r levels (i.e. possible outcomes) and the second has c levels. We can summarize a sample from this population using a table with r rows and c columns. A two-way table, also called a contingency table, displays the counts of how many individuals fall into each possible combination of categories of the two variables. So each cell of the table (the total number of cells is r x c) represents one combination of categories. The following table presents data on race and smoking. The two variables of interest, race and smoking, have r = 4 and c = 2 levels, resulting in 4 x 2 = 8 combinations of categories.
Race        NSmoke   Smoke
Caucasian      620      75
Black          240      41
Hispanic       130      29
Other          190      38

Chi-Square test for 2-way tables
By considering the number of observations falling into each category, we will see how to test hypotheses of the form:
H0: The two variables are not associated.
Ha: The two variables are associated.
Two different experimental situations lead to contingency tables:
1. We have two populations under study, and each individual is classified according to one categorical variable. In this case the null hypothesis is a statement of homogeneity among the two populations.
2. We have one population under study, and we want to check the relationship between two categorical variables. In this case the null hypothesis is a statement of independence between the two variables.
For sufficiently large samples, the same test is appropriate for both situations. This test is called the chi-square test, and in what follows we go over the steps for testing the relationship between two variables.

Some Notation!
For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:
Ri = total count of observations in the i-th row.
Cj = total count of observations in the j-th column.
n = total number of observations in the table.
Oij = observed count for the cell in the i-th row and the j-th column.
Eij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 were true. These counts are calculated as
Eij = (Ri x Cj) / n

Example
Race        NSmoke       Smoke       Total
Caucasian   O11 = 620    O12 = 75    R1 = 695
Black       O21 = 240    O22 = 41    R2 = 281
Hispanic    O31 = 130    O32 = 29    R3 = 159
Other       O41 = 190    O42 = 38    R4 = 228
Total       C1 = 1180    C2 = 183    n = 1363
E11 = (695 x 1180)/1363    E12 = (695 x 183)/1363
E21 = (281 x 1180)/1363    E22 = (281 x 183)/1363
E31 = (159 x 1180)/1363    E32 = (159 x 183)/1363
E41 = (228 x 1180)/1363    E42 = (228 x 183)/1363
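The expected counts in this example can be computed mechanically from the row and column totals. The following is a minimal Python sketch, not part of the original slides; the function name expected_counts is hypothetical and chosen only for illustration.

```python
def expected_counts(observed):
    """Return the table of expected counts Eij = (Ri x Cj) / n."""
    row_totals = [sum(row) for row in observed]          # R1..Rr
    col_totals = [sum(col) for col in zip(*observed)]    # C1..Cc
    n = sum(row_totals)                                  # total sample size
    return [[r * c / n for c in col_totals] for r in row_totals]


# Race-by-smoking counts from the example (columns: NSmoke, Smoke).
race_smoking = [[620, 75], [240, 41], [130, 29], [190, 38]]

expected = expected_counts(race_smoking)
for label, row in zip(["Caucasian", "Black", "Hispanic", "Other"], expected):
    print(label, [round(e, 2) for e in row])
```

Each row of expected counts sums to the same row total as the observed counts, which is a quick way to check the arithmetic by hand.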

Chi-Square Analysis Details
The 5 Steps in a Chi-Square Test:
Step 1: Write the null and alternative hypotheses.
H0: There is no relationship between the variables.
Ha: There is a relationship between the variables.
Step 2: Check conditions.
A) All expected counts should be > 1.
B) At least 80% of expected counts should be > 5.
Step 3: Calculate the test statistic and p-value. The test statistic measures the difference between the observed counts and the counts expected under independence:
χ2 = sum over all cells of (Oij - Eij)^2 / Eij
This is called the chi-square statistic because, if the null hypothesis is true, it has a chi-square distribution with (r-1) x (c-1) degrees of freedom.
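Steps 2 and 3 can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names (conditions_met, chi_square_statistic, degrees_of_freedom) are hypothetical, and the thresholds are the ones stated in Step 2 above. The data are the race-by-smoking counts from the earlier example.

```python
def expected_counts(observed):
    """Expected counts under independence: Eij = (Ri x Cj) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def conditions_met(expected):
    """Step 2: all expected counts > 1 and at least 80% of them > 5."""
    cells = [e for row in expected for e in row]
    share_above_five = sum(e > 5 for e in cells) / len(cells)
    return all(e > 1 for e in cells) and share_above_five >= 0.8


def chi_square_statistic(observed, expected):
    """Step 3: sum of (Oij - Eij)^2 / Eij over all cells."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


def degrees_of_freedom(observed):
    """(r - 1) x (c - 1) for a table with r rows and c columns."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


race_smoking = [[620, 75], [240, 41], [130, 29], [190, 38]]
expected = expected_counts(race_smoking)
print(conditions_met(expected))
print(chi_square_statistic(race_smoking, expected))
print(degrees_of_freedom(race_smoking))
```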

Chi-Square Analysis Details
Step 3 Cont.: Find the p-value.
If the χ2-statistic is large, the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a "large" χ2 gives evidence against the null hypothesis and supports the alternative.
The p-value of the chi-square test is the probability, computed assuming H0 is true, that the χ2-statistic is as large as or larger than the value we obtained. If H0 is true, the χ2-statistic has a chi-square distribution with (r-1) x (c-1) df. Thus, the p-value for the chi-square test is ALWAYS the area to the right of the test statistic under the curve, i.e. p-value = P(X > χ2), where X has a chi-square distribution with (r-1) x (c-1) df.
To get this probability we use the table for the chi-square distribution with (r-1) x (c-1) df (Table A.4). Using Minitab, or any other statistical software, you can obtain the p-value from the output. Otherwise, you can report a range for the p-value using Table A.4 (since usually you will not be able to find the exact p-value in the table).
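For readers without Minitab or a printed table, the right-tail area can also be computed directly. The sketch below is an assumption-laden illustration, not the method used in the slides: it uses the known closed-form expressions for the chi-square tail probability with integer df and only the Python standard library. The function name chi2_sf is hypothetical.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for X chi-square distributed with a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2))
    #         + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i - 1/2) / Gamma(i + 1/2)
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)  # the i = 1 term
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)              # step from term i to term i + 1
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# The familiar 5% critical values give a tail area of about 0.05.
for df, critical in [(1, 3.841), (2, 5.991), (3, 7.815)]:
    print(df, round(chi2_sf(critical, df), 3))
```

In practice a statistics library would normally be used instead; the point here is only that the p-value is the area to the right of the test statistic.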

Chi-Square Analysis Details
Step 4: Decide whether or not the result is statistically significant. The result is statistically significant if the p-value is less than α, where α is the significance level (usually α = 0.05).
Step 5: Report the conclusion in the context of the situation.
If the p-value is ______, which is < α, the result is statistically significant. Reject H0 and conclude that (the two variables) are related.
If the p-value is ______, which is > α, the result is NOT statistically significant. We cannot reject H0 and cannot conclude that (the two variables) are related.

Detailed Example
Derek wants to know if the geographical area that a student grew up in is associated with whether or not the student drinks alcohol. Below are the results he obtained from a random sample of PSU students.
              No    Yes   Total
Big City      21     65      86
Rural         11    130     141
Small Town    18    198     216
Suburban      37    345     382
Total         87    738     825

Detailed Example
1. H0: There is no relationship between the geographical area where a student grew up and whether or not the student drinks alcohol.
Ha: There is a relationship between the geographical area where a student grew up and whether or not the student drinks alcohol.
2. To check the conditions we need to calculate the expected count for each cell.
E11 = (R1 x C1)/n = (86 x 87)/825 = 9.07,
E12 = (R1 x C2)/n = (86 x 738)/825 = 76.93, …
E32 = (R3 x C2)/n = ___________________, …

Detailed Example
               No      Yes      All
Big_City       21       65       86
             9.07    76.93    86.00
Rural          11      130      141
            14.87   126.13   141.00
SmallTow       18      198      216
            22.78   193.22   216.00
Suburban       37      345      382
            40.28   341.72   382.00
All            87      738      825
            87.00   738.00   825.00
This is the Minitab output with the observed count (first line) and expected count (second line) for each cell. We can see that the conditions are satisfied!

Detailed Example
3. Chi-square statistic and p-value:
χ2 = sum of (Observed - Expected)^2 / Expected over all cells
   = (21-9.07)^2/9.07 + (65-76.93)^2/76.93
   + (11-14.87)^2/14.87 + (130-126.13)^2/126.13
   + (18-22.78)^2/22.78 + (198-193.22)^2/193.22
   + (37-40.28)^2/40.28 + (345-341.72)^2/341.72
   = 20.091
df = (4-1) x (2-1) = 3
p-value = P(X > 20.091) < P(X > 16.27) = 0.001 (Table A.4)
4. Since the p-value < 0.05, the result is statistically significant, and we reject the null hypothesis.
5. We conclude that there is a relationship between the geographical area where a student grew up and whether or not the student drinks alcohol.
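The calculation in this example can be reproduced with a short Python sketch. It is offered as a check on the hand computation rather than as the slides' own method (the slides use Minitab and a printed table). The helper chi2_sf_3df is hypothetical and uses a closed form that holds only for 3 degrees of freedom.

```python
import math

# Observed counts; rows: Big City, Rural, Small Town, Suburban; columns: No, Yes.
observed = [[21, 65], [11, 130], [18, 198], [37, 345]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under H0: Eij = (Ri x Cj) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_sf_3df(x):
    """P(X > x) for a chi-square variable with exactly 3 df (not valid otherwise)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_3df(chi_sq)
print(chi_sq, df, p_value)
```

The unrounded expected counts are used here, which is why the result can differ slightly in the last digits from a sum built on the two-decimal values shown in the slide.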

Special Case - Analyzing 2x2 tables
In a lot of cases the categorical variables of interest have two levels each. In this case, we can summarize the data using a contingency table with two rows and two columns (i.e. r = c = 2). The general form of a 2x2 table is
         Column 1   Column 2   Total
Row 1       A          B        R1
Row 2       C          D        R2
Total       C1         C2       n
In this case, the chi-square statistic has the following simplified form:
χ2 = n (AD - BC)^2 / (R1 x R2 x C1 x C2)
Under the null hypothesis, the χ2-statistic has a chi-square distribution with (2-1) x (2-1) = 1 degree of freedom.
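A commonly used shortcut for the 2x2 case is χ2 = n (AD - BC)^2 / (R1 x R2 x C1 x C2); it is assumed here that this is the simplified form the slide refers to. The Python sketch below compares it with the general sum over cells, using the gender-by-smoking counts from the example that follows. Variable names are illustrative only.

```python
# 2x2 table cells, using the gender-by-smoking counts:
#          NSmoke  Smoke
# Male       A       B
# Female     C       D
A, B, C, D = 540, 52, 325, 31

R1, R2 = A + B, C + D      # row totals
C1, C2 = A + C, B + D      # column totals
n = R1 + R2                # total sample size

# Shortcut form for a 2x2 table (assumed to be the slide's simplified formula).
shortcut = n * (A * D - B * C) ** 2 / (R1 * R2 * C1 * C2)

# General form: sum of (O - E)^2 / E over the four cells.
observed = [[A, B], [C, D]]
expected = [[r * c / n for c in (C1, C2)] for r in (R1, R2)]
general = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

print(shortcut, general)
```

Both routes give the same value up to floating-point rounding, so the shortcut is only a convenience for hand calculation.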

Example for 2x2 table: Is there a relationship between gender and smoking habits?
Gender    NSmoke   Smoke   Total
Male         540      52     592
Female       325      31     356
Total        865      83     948
Minitab Output
             C1       C2    Total
1           540       52      592
         540.17    51.83
2           325       31      356
         324.83    31.17
Total       865       83      948
Chi-Sq = 0.000 + 0.001 + 0.000 + 0.001 = 0.002
DF = 1, P-Value = 0.968
Minitab uses the general formula for the χ2 test statistic.

Relationship Between Chi-Square and 2 Proportions Tests
When do we use chi-square and when do we use 2 proportions?
Situation 1: Both categorical variables of interest have exactly 2 levels.
Question - Is there a relationship between the variables, or is there a difference in the proportions?
Answer - Either the chi-square test or the two-sided test of 2 proportions will lead to the same conclusion! In this case, the χ2-statistic = (z-statistic)^2, and the p-values of the two tests are equal, i.e. P(X(1 df) > χ2-stat) = 2 P(Z > |z-stat|).
Situation 2: Both categorical variables of interest have exactly 2 levels.
Question - Is one proportion greater/smaller than the other?
Answer - This is a one-sided test and you MUST use a test of 2 proportions.
Situation 3: At least one of the two categorical variables of interest has MORE than 2 levels.
Question - Is there a relationship between the variables?
Answer - You MUST use a chi-square test.
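The equivalence in Situation 1 can be checked numerically. The sketch below, which is an illustration and not part of the original slides, runs a pooled two-proportion z-test and a chi-square test on the gender-by-smoking counts and compares the results. The pooled standard error is assumed, since that is the version of the z-test for which χ2 = z^2 holds exactly.

```python
import math

# Gender-by-smoking counts: group sizes and number of smokers.
male_n, male_smoke = 592, 52
female_n, female_smoke = 356, 31

# Two-sided test of 2 proportions with a pooled standard error.
p_m = male_smoke / male_n
p_f = female_smoke / female_n
p_pool = (male_smoke + female_smoke) / (male_n + female_n)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / male_n + 1 / female_n))
z = (p_m - p_f) / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # 2 * P(Z > |z|)

# Chi-square test on the same data (columns: NSmoke, Smoke).
observed = [[male_n - male_smoke, male_smoke],
            [female_n - female_smoke, female_smoke]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
chi_sq = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
p_chi = math.erfc(math.sqrt(chi_sq / 2))          # P(X > chi_sq) for 1 df

print(z ** 2, chi_sq)
print(p_two_sided, p_chi)
```

The two printed pairs agree up to floating-point rounding, which is the numerical counterpart of the identity stated in Situation 1.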

Examples of Chi-Square and 2-Proportions
Gender   NSmoke   Smoke
Male        540      52
Female      325      31
Q1: Is there a difference in the proportion of males and females that smoke?
Solution: Either a chi-square test or a test of 2 proportions is fine.
2-proportions: H0: pm - pf = 0 vs Ha: pm - pf ≠ 0
Chi-Square: H0: There is no relationship between Gender and Smoking. Ha: There is a relationship between Gender and Smoking.
Q2: Is the proportion of males who smoke greater than the proportion of females who smoke?
Solution: Test of 2 proportions, because the alternative is one-sided!
2-proportions: H0: pm - pf = 0 vs Ha: pm - pf > 0

Examples of Chi-Square and 2-Proportions
Race        NSmoke   Smoke
Caucasian      620      75
Black          240      41
Hispanic       130      29
Other          190      38
Q: Is there a relationship between Race and Smoking? Is there a difference in the proportion of smokers among the different races?
Solution: Chi-square, because Race has more than 2 levels!
Chi-Square Test
H0: There is no relationship between Race and Smoking.
Ha: There is a relationship between Race and Smoking.