Presentation is loading. Please wait.

Presentation is loading. Please wait.

Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan SECTION 7.1 Testing the distribution of a single categorical variable : 

Similar presentations


Presentation on theme: "Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan SECTION 7.1 Testing the distribution of a single categorical variable : "— Presentation transcript:

1 Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan SECTION 7.1 Testing the distribution of a single categorical variable :  2 goodness of fit (7.1) Testing Goodness-of- Fit for a Single Categorical Variable

2 Statistics: Unlocking the Power of Data Lock 5 Statistics! Statistics might be the most important class you take in college  http://college.usatoday.com/2015/04/08/voices- statistics-might-be-the-most-important-class-you- take-in-college/ http://college.usatoday.com/2015/04/08/voices- statistics-might-be-the-most-important-class-you- take-in-college/  (4/8/15) Why you need to study statistics  https://www.youtube.com/watch?v=wV0Ks7aS7YI https://www.youtube.com/watch?v=wV0Ks7aS7YI  (4/2/15)

3 Statistics: Unlocking the Power of Data Lock 5 Multiple Categories So far, we’ve learned how to do inference for categorical variables with only two categories Today, we’ll learn how to do hypothesis tests for categorical variables with multiple categories

4 Statistics: Unlocking the Power of Data Lock 5 Genetic Variants for Fast-Twitch Muscles A gene called ACTN3 encodes a protein which functions in fast twitch muscles Three different variants of the gene: RR, RX, and XX In a sample, we observe 130 RR, 226 RX, and 80 XX. If both R and X are equally likely, then by the Hardy- Weinberg principle about 50% of the population should be heterozygotes (RX) and about 25% should be each of the homozygotes (25% RR, 25% XX) Do our data contradict these hypothesized proportions? Yang, N. et. al. (2003). “ACTN3 genotype is associated with human elite athletic performance,” American Journal of Human Genetics, 73: 627-631.

5 Statistics: Unlocking the Power of Data Lock 5 Hypothesis Testing 1.State Hypotheses 2.Calculate a statistic, based on your sample data 1.Create a distribution of this statistic, as it would be observed if the null hypothesis were true 2.Measure how extreme your test statistic from (2) is, as compared to the distribution generated in (3)

6 Statistics: Unlocking the Power of Data Lock 5 Hypotheses Define the null hypothesized proportions in each category: H 0 : p RR = 0.25, p RX = 0.5, p xx = 0.25 H a : At least one p i is not as specified in H 0

7 Statistics: Unlocking the Power of Data Lock 5 Observed Counts The observed counts are the actual counts observed in the study RRRXXX Observed13022680

8 Statistics: Unlocking the Power of Data Lock 5 Test Statistic Why can’t we use the familiar formula to get the test statistic? We need something a bit more complicated…

9 Statistics: Unlocking the Power of Data Lock 5 Expected Counts The expected counts are the expected counts if the null hypothesis were true For each cell, the expected count is the sample size (n) times the null proportion, p i

10 Statistics: Unlocking the Power of Data Lock 5 Expected Counts RRRXXX Null Proportion0.250.50.25 Expected n = 436

11 Statistics: Unlocking the Power of Data Lock 5 Chi-Square Statistic Need a way to measure how far the observed counts are from the expected counts… Use the chi-square statistic : RRRXXX Observed13022680 Expected109218109

12 Statistics: Unlocking the Power of Data Lock 5 Chi-Square Statistic RRRXXX Observed13022680 Expected109218109

13 Statistics: Unlocking the Power of Data Lock 5 What Next? We have a test statistic. What else do we need to perform the hypothesis test? How do we get this? Two options: 1)Simulation 2)Distributional Theory

14 Statistics: Unlocking the Power of Data Lock 5 Upper-Tail p-value To calculate the p-value for a chi-square test, we always look in the upper tail Why?  Values of the χ 2 are always positive  The higher the χ 2 statistic is, the farther the observed counts are from the expected counts, and the stronger the evidence against the null

15 Statistics: Unlocking the Power of Data Lock 5 Simulation p-value

16 Statistics: Unlocking the Power of Data Lock 5 Chi-Square (χ 2 ) Distribution If each of the expected counts are at least 5, AND if the null hypothesis is true, then the χ 2 statistic follows a χ 2 –distribution, with degrees of freedom equal to df = number of categories – 1 Gene variants: df = 3 – 1 = 2

17 Statistics: Unlocking the Power of Data Lock 5 Chi-Square Distribution

18 Statistics: Unlocking the Power of Data Lock 5 p-value using χ 2 distribution

19 Statistics: Unlocking the Power of Data Lock 5 Conclusion Do our data provide evidence that the population proportions differ from 25% RR, 50% RX, and 25% XX? a) Yes b) No

20 Statistics: Unlocking the Power of Data Lock 5 Chi-Square Test for Goodness of Fit 1.State null hypothesized proportions for each category, p i. Alternative is that at least one of the proportions is different than specified in the null. 2.Calculate the expected counts for each cell as np i. 3.Calculate the χ 2 statistic: 4.Compute the p-value as the proportion above the χ 2 statistic for either a randomization distribution or a χ 2 distribution with df = (# of categories – 1) if expected counts all > 5 5.Interpret the p-value in context.

21 Statistics: Unlocking the Power of Data Lock 5 Mendel’s Pea Experiment Source: Mendel, Gregor. (1866). Versuche über Pflanzen-Hybriden. Verh. Naturforsch. Ver. Brünn 4: 3–47 (in English in 1901, Experiments in Plant Hybridization, J. R. Hortic. Soc. 26: 1–32) In 1866, Gregor Mendel, the “father of genetics” published the results of his experiments on peas He found that his experimental distribution of peas closely matched the theoretical distribution predicted by his theory of genetics (involving alleles, and dominant and recessive genes)

22 Statistics: Unlocking the Power of Data Lock 5 Mendel’s Pea Experiment Mate SSYY with ssyy:  1 st Generation: all Ss Yy Mate 1 st Generation: => 2 nd Generation  Second Generation S, Y: Dominant s, y: Recessive PhenotypeTheoretical Proportion Round, Yellow9/16 Round, Green3/16 Wrinkled, Yellow3/16 Wrinkled, Green1/16

23 Statistics: Unlocking the Power of Data Lock 5 Mendel’s Pea Experiment PhenotypeTheoretical Proportion Observed Counts Round, Yellow9/16315 Round, Green3/16101 Wrinkled, Yellow3/16108 Wrinkled, Green1/1632 Let’s test this data against the null hypothesis of each p i equal to the theoretical value, based on genetics

24 Statistics: Unlocking the Power of Data Lock 5 Mendel’s Pea Experiment PhenotypeNull p i Observed Counts Expected Counts Round, Yellow 9/16315 Round, Green 3/16101 Wrinkled, Yellow 3/16108 Wrinkled, Green 1/1632 The expected count for the round, yellow phenotype is a)177.2 b)310.5 c)312.75 d)318.25

25 Statistics: Unlocking the Power of Data Lock 5 Mendel’s Pea Experiment PhenotypeNull p i Observed Counts Expected Counts Contribution to χ 2 Round, Yellow 9/16315312.75 Round, Green 3/16101104.25 Wrinkled, Yellow 3/16108104.25 Wrinkled, Green 1/163234.75 The contribution to the χ 2 statistic for the round, yellow phenotype is a)0.012 b)0.014 c)0.016 d)0.018 PhenotypeNull p i Observed Counts Expected Counts Contribution to χ 2 Round, Yellow 9/16315 Round, Green 3/16101104.250.101 Wrinkled, Yellow 3/16108104.250.135 Wrinkled, Green 1/163234.750.1218

26 Statistics: Unlocking the Power of Data Lock 5 Mendel’s Pea Experiment χ 2 = 0.47 Two options: o Simulate a randomization distribution o Compare to a χ 2 distribution with 4 – 1 = 3 df

27 Statistics: Unlocking the Power of Data Lock 5 Mendel’s Pea Experiment p-value = 0.925 Does this prove Mendel’s theory of genetics? Or at least prove that his theoretical proportions for pea phenotypes were correct? a)Yes b)No

28 Statistics: Unlocking the Power of Data Lock 5 To Do Read Section 7.1 Do HW 7.1 (due Friday, 4/17)


Download ppt "Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan SECTION 7.1 Testing the distribution of a single categorical variable : "

Similar presentations


Ads by Google