Chi-Square Distributions


1 Chi-Square Distributions

2 Recap
Analyze data and test hypotheses. The type of test depends on:
Data available
Question we need to answer
What do we use to examine patterns between categorical variables, such as gender, location, or preferences?
We’ve learned to use inferential statistics to analyze data and test hypotheses. The type of test, or distribution, we use depends on the data we have available and the type of question we need to answer. When we wanted to make inferences based on samples, we used a t-test. To examine multiple treatments, we used ANOVA (F-tests). What if we want to examine patterns between categorical (qualitative) variables?

3 t-distribution
[Figure: t-distribution curves for df = 4 and df = 100]
Recall that the shape of Student's t-distribution is determined by its degrees of freedom (n − 1), and that as the degrees of freedom increase, the distribution gets closer to the z (standard normal) distribution.

4 F-distribution We used the F-distribution to compare a ratio of variances in the ANOVA. The shape of the F-distribution is determined by 2 values: the d.f. of the numerator and the d.f. of the denominator (in this case, 6 and 10)

5 χ² distribution
[Figure: chi-square curves for df = 2, df = 4, and df = 10]
The shape of the chi-square distribution is also determined by its degrees of freedom (we’ll discuss these later). Properties of the χ² distribution:
Mean of the distribution = d.f.
Variance = 2 × d.f.
The curve reaches its maximum height at χ² = d.f. − 2 (when d.f. ≥ 2)
As d.f. increases, the χ² curve approaches a normal distribution.
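The properties listed on this slide can be checked numerically. The sketch below is only an illustration: the function names `chi2_pdf` and `summarize` are made up for this example, and the mean, variance, and peak are approximated with a simple midpoint sum over the density rather than taken from a statistics library.

```python
import math

def chi2_pdf(x, df):
    """Density of the chi-square distribution with df degrees of freedom (x > 0)."""
    half = df / 2.0
    return x ** (half - 1.0) * math.exp(-x / 2.0) / (2.0 ** half * math.gamma(half))

def summarize(df, upper=100.0, steps=100_000):
    """Approximate the mean, variance, and peak location with a midpoint sum."""
    width = upper / steps
    xs = [(i + 0.5) * width for i in range(steps)]
    ys = [chi2_pdf(x, df) for x in xs]
    mean = sum(x * y for x, y in zip(xs, ys)) * width
    var = sum((x - mean) ** 2 * y for x, y in zip(xs, ys)) * width
    peak = xs[ys.index(max(ys))]  # x value where the curve is highest
    return mean, var, peak

for df in (4, 10):
    mean, var, peak = summarize(df)
    # Expect roughly: mean = df, variance = 2 * df, peak at df - 2
    print(df, round(mean, 2), round(var, 2), round(peak, 2))
```

For df = 4 and df = 10 the printed values should land close to (4, 8, 2) and (10, 20, 8), matching the three properties above.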

6 Cumulative Probability
Total area under the curve = 1
The area between 0 and some chi-square value A is the cumulative probability associated with that value (the probability that χ² falls between 0 and A).
We have the chi-square calculator to figure that out.
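For readers without the calculator at hand, the cumulative probability can be sketched in a few lines. This is an assumption-laden illustration, not the calculator's actual method: `chi2_sf` and `chi2_cdf` are hypothetical names, and the code relies on the closed-form expressions that exist when the degrees of freedom are a positive integer.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half_x = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half_x / j
            total += term
        return math.exp(-half_x) * total
    # Odd df: complementary error function plus a finite sum of half-integer powers
    total = math.erfc(math.sqrt(half_x))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half_x) * half_x ** (j - 0.5) / math.gamma(j + 0.5)
    return total

def chi2_cdf(x, df):
    """Cumulative probability: area under the curve between 0 and x."""
    return 1.0 - chi2_sf(x, df)

# Area between 0 and A = 5.991 for 2 degrees of freedom (about 0.95)
print(round(chi2_cdf(5.991, 2), 4))
```

The upper-tail function is the quantity used as the P-value in the tests that follow; the cumulative probability is simply one minus it.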

7 χ² distribution
Three tests: goodness of fit, test for homogeneity, test for independence.
Goodness of Fit: one categorical variable from a single population.
Test for Homogeneity: a single categorical variable from 2 populations. Tests whether frequency counts are distributed identically across both populations.
Test for Independence: 2 categorical variables from a single population. Determines whether there is a significant association between the 2 variables.

8 χ² Goodness of Fit
Testing one categorical variable from a single population.
Example: A manufacturer of baseball cards claims that 30% of all cards feature rookies, 60% feature veterans, and 10% feature all-stars.
Reference: Remember what a categorical variable is. We discussed it before, but this is the first time we’ve used it.

9 𝜒 2 Assumptions Data is collected from a simple random sample (SRS)
Population is at least 10 times larger than sample Variable is categorical Expected value for each level of the variable is at least 5 We can expect these assumptions to be met in our in-class work, but be careful of the last one. Example will follow. These assumptions will apply to all the X2 tests we will perform

10 Steps in the Process State the hypothesis Form an analysis plan
Analyze sample data Interpret results These should look familiar. It’s the same process we’ve used for other statistical tests

11 χ² Goodness of Fit
State the hypothesis
Null: The data are consistent with a specified distribution.
Alternative: The data are not consistent with a specified distribution (at least one of the expected values is not accurate).
Baseball card example:
$H_0: P_R = 0.3,\ P_V = 0.6,\ P_{AS} = 0.1$
$H_a$: At least one of the probabilities is inaccurate.
For the null, we could state this as it is here, or (better) we could say what the expected values are at each level. The alternative almost has to be worded as it is. This is not the same as repeating the null with not-equal signs in one or all of the equations.

12 𝜒 2 Goodness of Fit Analysis Plan Determine the test method
Specify the significance level Determine the test method Goodness of fit Independence Homogeneity Chi-square tests will typically be one-tailed (finding the confidence interval for a population variance is a 2-tail example) Chi-square is also commonly used to perform a gof test for normality. We will not be doing that.

13 χ² Goodness of Fit
Analyze the sample data
Find the degrees of freedom: d.f. = k − 1, where k = the number of levels of the categorical variable.
Determine the expected frequency counts: expected frequency (E) = sample size × hypothesized proportion,
$E_i = n \times p_i$
Determine the test statistic:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
Interpret the results
Excel doesn’t have a chi-square option in the data analysis section, BUT we can use Excel to figure out our test statistic. We interpret the results the same way we’ve done in the past: P < alpha means there is evidence to reject the null hypothesis. It’s best to use an example for these.

14 Goodness of fit example
Problem: Acme Toy Company prints baseball cards. The company claims that 30% of the cards are rookies, 60% are veterans, and 10% are All-Stars. The cards are sold in packages of 100. Suppose a randomly selected package of cards has 50 rookies, 45 veterans, and 5 All-Stars. Is this consistent with Acme's claim? Use a 0.05 level of significance.
Solution:
State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Null hypothesis: The proportions of rookies, veterans, and All-Stars are 30%, 60%, and 10%, respectively.
Alternative hypothesis: At least one of the proportions in the null hypothesis is false.
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using sample data, we will conduct a chi-square goodness of fit test of the null hypothesis.
Analyze sample data. Applying the chi-square goodness of fit test to sample data, we compute the degrees of freedom, the expected frequency counts, and the chi-square test statistic. Based on the chi-square statistic and the degrees of freedom, we determine the P-value.
DF = k − 1 = 3 − 1 = 2
$E_i = n \times p_i$
$E_1 = 100 \times 0.30 = 30$
$E_2 = 100 \times 0.60 = 60$
$E_3 = 100 \times 0.10 = 10$
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
$\chi^2 = \frac{(50 - 30)^2}{30} + \frac{(45 - 60)^2}{60} + \frac{(5 - 10)^2}{10}$
$\chi^2 = \frac{400}{30} + \frac{225}{60} + \frac{25}{10} = 13.33 + 3.75 + 2.50 = 19.58$
where DF is the degrees of freedom, k is the number of levels of the categorical variable, n is the number of observations in the sample, $E_i$ is the expected frequency count for level i, $O_i$ is the observed frequency count for level i, and χ² is the chi-square test statistic.
The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more extreme than 19.58. We use the Chi-Square Distribution Calculator to find P(χ² > 19.58) = 0.0001.
Interpret results. Since the P-value (0.0001) is less than the significance level (0.05), we reject the null hypothesis.
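The arithmetic in this worked example can be reproduced with a short script. This is a minimal sketch using only the counts and proportions from the problem statement; the variable names are illustrative, and the P-value line uses the shortcut exp(−χ²/2), which is an assumption that holds only because this example has exactly 2 degrees of freedom.

```python
import math

# Observed counts in the package and the proportions claimed by the company
observed = {"rookies": 50, "veterans": 45, "all-stars": 5}
claimed = {"rookies": 0.30, "veterans": 0.60, "all-stars": 0.10}

n = sum(observed.values())                       # sample size (100 cards)
expected = {level: n * p for level, p in claimed.items()}

# Chi-square statistic: sum of (O - E)^2 / E over all levels
chi_square = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1                           # k - 1 levels

# Upper-tail probability; exp(-x/2) is exact only when df == 2
p_value = math.exp(-chi_square / 2.0)

print(f"chi-square = {chi_square:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

For other degrees of freedom the P-value line would need a general chi-square tail function (for example from a statistics library) in place of the exponential shortcut.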

15 Using Excel to find χ²
Determine $E_i$
Create 2 columns, n and p, and enter the appropriate values. In the 3rd column, compute $E_i = n \times p_i$.
n      p      E_i
100    0.6    60
       0.3    30
       0.1    10

16 Using Excel to find χ²
Determine χ²
Add a 4th column to the spreadsheet: $O_i$
In the 5th column, calculate each element of the χ² statistic: =(D2-C2)^2/C2
Sum the values of the 5th column. This is the χ² value, the test statistic.
Use the χ² calculator to find the value of p, and interpret the test results.
n      p      E_i    O_i    (O_i − E_i)²/E_i
100    0.6    60     45     3.75
       0.3    30     50     13.333
       0.1    10     5      2.50
                     Sum    19.583

17 Another G of F problem: Poisson Distribution
Automobiles leaving the paint department of an assembly plant are subjected to a detailed examination of all exterior painted surfaces. For the most recent 380 automobiles produced, the number of blemishes per car is summarized below. Level of significance: α = 0.05
Blemishes 1 2 3 4
# of cars 242 94 38
From p. 471 of the text
Null hypothesis: the sample was drawn from a population that is Poisson distributed. Alternative: the population is not Poisson distributed.
Using Excel:
Determine $E_i$.
Find the mean number of blemishes per car and use it as an estimate of the population mean, lambda.
Determine the probabilities using =POISSON.DIST.
Compute $E_i = n \times p_i$.
Because the expected counts for 3, 4, and 5 blemishes are less than 5, add them together to create a single "3 or more" category.
Continue with the G of F procedure.
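The Excel steps above can be sketched as a small function. Everything here is illustrative: `poisson_gof` is a made-up name, the counts in `hypothetical_counts` are invented for demonstration and are not the textbook data, only the upper tail is pooled, and the degrees of freedom follow the k − 1 − m rule with m = 1 because lambda is estimated from the sample.

```python
import math

def poisson_gof(observed, min_expected=5.0):
    """Chi-square goodness-of-fit pieces for a Poisson model.

    observed[x] is the number of units with exactly x events; the last
    cell is treated as "x or more" when probabilities are assigned.
    """
    n = sum(observed)
    lam = sum(x * count for x, count in enumerate(observed)) / n  # sample mean
    probs = [math.exp(-lam) * lam ** x / math.factorial(x)
             for x in range(len(observed) - 1)]
    probs.append(1.0 - sum(probs))               # open-ended last category
    obs = list(observed)
    exp = [n * p for p in probs]
    # Pool the upper tail until the last expected count is at least min_expected
    while len(exp) > 1 and exp[-1] < min_expected:
        tail_obs, tail_exp = obs.pop(), exp.pop()
        obs[-1] += tail_obs
        exp[-1] += tail_exp
    chi_square = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    df = len(obs) - 1 - 1                        # k - 1 - m, one estimated parameter
    return lam, obs, exp, chi_square, df

# Hypothetical data: number of units with 0, 1, 2, and 3-or-more events
hypothetical_counts = [60, 30, 8, 2]
lam, obs, exp, chi_square, df = poisson_gof(hypothetical_counts)
print(f"lambda = {lam:.2f}, pooled observed = {obs}, df = {df}")
print("pooled expected =", [round(e, 2) for e in exp])
print(f"chi-square = {chi_square:.3f}")
```

Pooling changes k, so the degrees of freedom must be computed after the categories are combined, as the function does.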

18 Goodness-of-Fit: An Example
Problem 13.18: It has been reported that 10.3% of U.S. households do not own a vehicle, with 34.2% owning 1 vehicle, 38.4% owning 2 vehicles, and 17.1% owning 3 or more vehicles. The data for a random sample of 100 households in a resort community are summarized below. At the 0.05 level of significance, can we reject the possibility that the vehicle-ownership distribution in this community is the same as that of the nation as a whole?
# Vehicles Owned    # Households
0                   20
1                   35
2                   23
3 or more           22
© 2011 Cengage Learning. All Rights Reserved. May not be scanned, copied, or duplicated, or posted to a publicly accessible website, in whole or in part.

19 Goodness-of-Fit: Problem 13.18, cont.
H0: p0 = 0.103, p1 = 0.342, p2 = 0.384, p3+ = 0.171. Vehicle-ownership distribution in this community is the same as it is in the nation as a whole.
H1: At least one of the proportions does not equal the stated value. Vehicle-ownership distribution in this community is not the same as it is in the nation as a whole.

20 Goodness-of-Fit: Problem 13.18, cont.
II. Rejection Region: α = 0.05, df = k − 1 − m = 4 − 1 − 0 = 3. If χ² > 7.815, reject H0.
III. Test Statistic: χ² = 16.734
IV. Conclusion: Since the test statistic of χ² = 16.734 falls well above the critical value of χ² = 7.815, we reject H0 at the 0.05 level of significance.
V. Implications: There is enough evidence to show that vehicle ownership in this community differs from that in the nation as a whole.
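The test statistic for Problem 13.18 follows directly from the household counts and national proportions given in the problem. The sketch below recomputes it and applies the critical-value rule; the variable names are illustrative, and the critical value 7.815 is taken from the slide rather than computed.

```python
# Observed households by number of vehicles owned: 0, 1, 2, 3 or more
observed = [20, 35, 23, 22]
national = [0.103, 0.342, 0.384, 0.171]   # proportions under H0

n = sum(observed)
expected = [n * p for p in national]

# Contribution of each category to the chi-square statistic
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

m = 0                                     # no parameters estimated from the sample
df = len(observed) - 1 - m
critical_value = 7.815                    # chi-square table, df = 3, alpha = 0.05

print("contributions:", [round(c, 3) for c in contributions])
print(f"chi-square = {chi_square:.3f}, df = {df}")
print("reject H0" if chi_square > critical_value else "fail to reject H0")
```

The first and third categories (no vehicle and two vehicles) account for most of the statistic, which shows where the community departs from the national pattern.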

21 Test for homogeneity Single categorical variable from 2 populations
Test if frequency counts are distributed identically across both populations Example: Survey of TV viewing audiences. Do viewing preferences of men and women differ significantly? We make the same assumptions we did for the goodness of fit test Data is collected from a simple random sample (SRS) Population is at least 10 times larger than sample Variable is categorical Expected value for each level of the variable is at least 5 We use the same approach to testing

22 State the hypothesis Data collected from r populations
Categorical variable has c levels
The null hypothesis is that each population has the same proportion of observations at every level, i.e.:
H0: P(level 1, pop 1) = P(level 1, pop 2) = … = P(level 1, pop r)
H0: P(level 2, pop 1) = P(level 2, pop 2) = … = P(level 2, pop r)
…
H0: P(level c, pop 1) = P(level c, pop 2) = … = P(level c, pop r)
Alternative hypothesis: at least one of the null statements is false.

23 Analyze the sample data
Find Degrees of freedom Expected frequency counts Test statistic ( 𝜒 2 ) p-value or critical value

24 Analyze the sample data
Degrees of freedom
$d.f. = (r - 1) \times (c - 1)$
Where:
r = number of populations
c = number of levels of the categorical variable

25 Analyze the sample data
Expected frequency counts
Computed separately for each population at each level of the categorical variable:
$E_{r,c} = \frac{n_r \times n_c}{n}$
Where:
$E_{r,c}$ = expected frequency count for population r at level c of the categorical variable
$n_r$ = number of observations from population r
$n_c$ = number of observations at level c (category/treatment level)
$n$ = total number of observations in the sample

26 Analyze the sample data
Determine the test statistic
$\chi^2 = \sum_{r,c} \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$
Determine the p-value or critical value

27 Test for homogeneity
Problem: In a study of the television viewing habits of children, a developmental psychologist selects a random sample of 300 fifth graders: 100 boys and 200 girls. Each child is asked which of the following TV programs they like best.
         Family Guy   South Park   The Simpsons   Total
Boys     50           30           20             100
Girls    50           80           70             200
Total    100          110          90             300
Do the boys' preferences for these TV programs differ significantly from the girls' preferences? Use a 0.05 level of significance.

28 State the hypotheses Null hypothesis: The proportion of boys who prefer Family Guy is identical to the proportion of girls. Similarly, for the other programs. Thus: H0: Pboys who like Family Guy = Pgirls who like Family Guy H0: Pboys who like South Park = Pgirls who like South Park H0: Pboys who like The Simpsons = Pgirls who like The Simpsons Alternative hypothesis: At least one of the null hypothesis statements is false.

29 Analysis plan
Compute:
Degrees of freedom
Expected frequency counts
Chi-square test statistic
$d.f. = (r - 1) \times (c - 1)$
Where:
r = number of populations
c = number of categories/treatment levels
In this case, $d.f. = (2 - 1) \times (3 - 1) = 2$

30 Analysis plan
Compute the expected frequency counts
$E_{r,c} = (n_r \times n_c) / n$
E1,1 = (100 * 100) / 300 = 10000/300 = 33.3
E1,2 = (100 * 110) / 300 = 11000/300 = 36.7
E1,3 = (100 * 90) / 300 = 9000/300 = 30.0
E2,1 = (200 * 100) / 300 = 20000/300 = 66.7
E2,2 = (200 * 110) / 300 = 22000/300 = 73.3
E2,3 = (200 * 90) / 300 = 18000/300 = 60.0
Again, it’s easiest to do this in Excel. For example, we know that 100 children like Family Guy. If boys and girls had the same preferences, we would expect about 33 of the 100 boys and 67 of the 200 girls to like this show, and so on for the other programs.

31 Analysis plan
Determine the test statistic
$\chi^2 = \sum_{r,c} \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$
Observed counts, with expected counts in parentheses:
         Family Guy   South Park   The Simpsons   Total
Boys     50 (33.3)    30 (36.7)    20 (30.0)      100
Girls    50 (66.7)    80 (73.3)    70 (60.0)      200
Total    100          110          90             300
Note: We could do goodness of fit for either boys or girls, but this is homogeneity: we want to see whether the 2 populations are equal in their preferences. Use Excel to compute the chi-square statistic (it should be about 19.32 when the expected counts are not rounded).

32 Analysis plan
p-value
Use the Chi-Square Distribution Calculator to find the P-value: the probability that a chi-square statistic with 2 degrees of freedom exceeds the test statistic.
Interpret the results
The actual P-value, of course, is not exactly zero. It is a very small number, less than 0.0001 and greater than zero. Because it is less than the significance level (0.05), we reject the null hypothesis that boys and girls have the same preferences.
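The homogeneity calculation for the TV-preference table can be sketched as one reusable function. The name `chi_square_table` is hypothetical, the counts are the boys' and girls' rows from the problem, and the P-value again uses the exp(−χ²/2) shortcut, which is an assumption valid only for 2 degrees of freedom.

```python
import math

def chi_square_table(table):
    """Return (statistic, df, expected) for a table of observed counts.

    Each row is one population; each column is one level of the
    categorical variable. Expected counts use E = n_r * n_c / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(table, expected)
                    for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

# Rows: boys, girls. Columns: Family Guy, South Park, The Simpsons.
counts = [[50, 30, 20],
          [50, 80, 70]]
statistic, df, expected = chi_square_table(counts)
p_value = math.exp(-statistic / 2.0)      # exact only because df == 2

print("expected:", [[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.6f}")
```

With unrounded expected counts the statistic comes out at about 19.32; rounding the expected counts to one decimal before summing shifts it slightly, which is one reason hand and spreadsheet results can differ in the second decimal place.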

33 Test for independence Almost identical to test for homogeneity Example
Test for homogeneity: Single categorical variable from 2 populations Test for independence: 2 categorical variables from a single population Determine if there is a significant association between the 2 variables Example Voters are classified by gender and by party affiliation (D,R,I). Use X2 test to determine if gender is related to voting preference (are the variables independent?)

34 Test for independence Same assumptions Same approach to testing
Hypotheses Suppose variable A has r levels and variable B has c levels. The null hypothesis states that knowing the level of A does not help you predict the level of B. The variables are independent. H0: Variables A and B are independent Ha: Variables A and B are not independent Knowing A will help you predict B Note: Relationship does not have to be causal to show dependence

35 Test for independence
Problem: A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent). Do men’s preferences differ significantly from women’s?
         Republican   Democrat   Independent   Total
Men      200          150        50            400
Women    250          300        50            600
Total    450          450        100           1000
Alpha = 0.05

36 Test for independence
Hypotheses
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.
Analyze sample data: degrees of freedom, expected frequency counts, chi-square statistic, p-value or critical value. Then interpret the results.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
$E_{r,c} = (n_r \times n_c) / n$
E1,1 = (400 * 450) / 1000 = 180000/1000 = 180
E1,2 = (400 * 450) / 1000 = 180000/1000 = 180
E1,3 = (400 * 100) / 1000 = 40000/1000 = 40
E2,1 = (600 * 450) / 1000 = 270000/1000 = 270
E2,2 = (600 * 450) / 1000 = 270000/1000 = 270
E2,3 = (600 * 100) / 1000 = 60000/1000 = 60
$\chi^2 = \sum_{r,c} \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$
$\chi^2 = \frac{(200 - 180)^2}{180} + \frac{(150 - 180)^2}{180} + \frac{(50 - 40)^2}{40} + \frac{(250 - 270)^2}{270} + \frac{(300 - 270)^2}{270} + \frac{(50 - 60)^2}{60}$
$\chi^2 = \frac{400}{180} + \frac{900}{180} + \frac{100}{40} + \frac{400}{270} + \frac{900}{270} + \frac{100}{60}$
$\chi^2 = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2$
where DF is the degrees of freedom, r is the number of levels of gender, c is the number of levels of the voting preference, $n_r$ is the number of observations from level r of gender, $n_c$ is the number of observations from level c of voting preference, n is the number of observations in the sample, $E_{r,c}$ is the expected frequency count when gender is level r and voting preference is level c, and $O_{r,c}$ is the observed frequency count when gender is level r and voting preference is level c.
The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more extreme than 16.2. We use the Chi-Square Distribution Calculator to find P(χ² > 16.2) = 0.0003. Since the P-value is less than the significance level (0.05), we reject the null hypothesis that gender and voting preference are independent.
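As a cross-check on the hand calculation, the expected counts, the minimum-expected-count assumption, and the statistic can be computed from the poll table. This is a sketch with illustrative variable names; the P-value uses the exp(−χ²/2) shortcut, an assumption that applies only because the table has 2 degrees of freedom.

```python
import math

# Rows: men, women. Columns: Republican, Democrat, Independent.
observed = [
    [200, 150, 50],
    [250, 300, 50],
]

row_totals = [sum(row) for row in observed]            # 400, 600
col_totals = [sum(col) for col in zip(*observed)]      # 450, 450, 100
n = sum(row_totals)                                    # 1000 voters

# Expected count for each cell under independence: n_r * n_c / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Assumption check from earlier slides: every expected count at least 5
assumption_ok = all(e >= 5 for row in expected for e in row)

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = math.exp(-chi_square / 2.0)                  # exact only for df == 2

print("expected counts:", expected, "assumption met:", assumption_ok)
print(f"chi-square = {chi_square:.1f}, df = {df}, p-value = {p_value:.4f}")
```

The same code works for any table with two rows and three columns; tables with other shapes would need a general chi-square tail function for the P-value.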

37 Chi-Square Tests of Independence
An Example, Problem 13.35: Researchers in a California community have asked a sample of 175 automobile owners to select their favorite from three popular automotive magazines. Of the 111 import owners in the sample, 54 selected Car and Driver, 25 selected Motor Trend, and 32 selected Road & Track. Of the 64 domestic-make owners in the sample, 19 selected Car and Driver, 22 selected Motor Trend, and 23 selected Road & Track. At the 0.05 level, is import/domestic ownership independent of magazine preference? Based on the chi-square table, what is the most accurate statement that can be made about the p-value for the test?

38 Chi-Square Tests of Independence
First, arrange the data in a table.
                  Car and Driver (1)   Motor Trend (2)   Road & Track (3)   Totals
Import (Imp)      54                   25                32                 111
Domestic (Dom)    19                   22                23                 64
Totals            73                   47                55                 175
Second, compute the expected values and contributions to χ² for each of the six cells. Then on to the hypothesis test.

39 Chi-Square Tests of Independence
I. Hypotheses: H0: Type of magazine and auto ownership are independent. H1: Type of magazine and auto ownership are not independent.
II. Rejection Region: α = 0.05, df = (r – 1)(k – 1) = (2 – 1)(3 – 1) = 1 × 2 = 2. If χ² > 5.991, reject H0.

40 Chi-Square Tests of Independence
III. Test Statistic: χ² = 6.2747
IV. Conclusion: Since the test statistic of 6.2747 falls beyond the critical value of 5.991, we reject the null hypothesis at the 0.05 level of significance.
V. Implications: There is enough evidence to show that magazine preference is not independent of import/domestic auto ownership.
p-value: In a cell on a Microsoft Excel spreadsheet, type: =CHIDIST(6.2747,2). The answer is: p-value = 0.0434
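The per-cell contributions mentioned in the solution steps can be listed explicitly. The sketch below uses the counts from Problem 13.35 with illustrative variable names; the last lines mimic the Excel upper-tail lookup with exp(−χ²/2), an assumption that is exact only for 2 degrees of freedom.

```python
import math

magazines = ["Car and Driver", "Motor Trend", "Road & Track"]
observed = {
    "Import": [54, 25, 32],
    "Domestic": [19, 22, 23],
}

row_totals = {owner: sum(counts) for owner, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

chi_square = 0.0
for owner, counts in observed.items():
    for magazine, o, col_total in zip(magazines, counts, col_totals):
        e = row_totals[owner] * col_total / n        # expected count for the cell
        contribution = (o - e) ** 2 / e              # this cell's share of chi-square
        chi_square += contribution
        print(f"{owner:8s} {magazine:14s} O={o:3d} E={e:6.2f} "
              f"contribution={contribution:.4f}")

df = (len(observed) - 1) * (len(magazines) - 1)
p_value = math.exp(-chi_square / 2.0)                # upper tail, df == 2 only
print(f"chi-square = {chi_square:.4f}, df = {df}, p-value = {p_value:.4f}")
```

The largest contributions come from the Car and Driver column, where import owners choose it more often and domestic owners less often than independence would predict.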

