
Slide 1: Categorical Data (Chapter 10)

Topics:
- Inference about one population proportion (§10.2).
- Inference about two population proportions (§10.3).
- Chi-square goodness-of-fit test (§10.4).
- Contingency tables: tests of independence and homogeneity (§10.5).
- Generalized linear models: logistic regression (§12.8) and Poisson regression.

Problem: The response variable is now categorical.
Goals: (i) Extend the analyses used to compare means in quantitative data (one-sample t-test, two-sample t-test, ANOVA) to comparisons of proportions in categorical data. (ii) Build regression models for predicting a categorical response, in the same spirit as linear regression for a quantitative response.

Slide 2: Chi-Square Goodness-of-Fit Test (§10.4)

We want to compare several (k) observed proportions to hypothesized proportions (π_i0). Do the observed proportions agree with the hypothesized ones, i.e. is π_i = π_i0 for every category i = 1, 2, ..., k?

H0: π_i = π_i0, for i = 1, 2, ..., k.
Ha: At least two of the true cell proportions differ from the hypothesized proportions.

Slide 3: Example: Do Birds Forage Randomly?

Mannan & Meslow (1984) studied bird foraging behavior in a forest in Oregon. In a managed forest, 54% of the canopy volume was Douglas fir, 40% was ponderosa pine, 5% was grand fir, and 1% was western larch. They made 156 observations of foraging by red-breasted nuthatches: 70 observations (45%) in Douglas fir, 79 (51%) in ponderosa pine, 3 (2%) in grand fir, and 4 (3%) in western larch.

H0: The birds forage randomly.
Ha: The birds do NOT forage randomly (they prefer certain trees).

Slide 4: Bird Example: Summary

If the birds forage randomly, we would expect to find them in the following proportions, versus what was actually observed:

Tree             Expected (canopy)   Observed
Douglas fir           54%            45% (70)
Ponderosa pine        40%            51% (79)
Grand fir              5%             2% (3)
Western larch          1%             3% (4)

Do the birds prefer certain trees? Perform a test using Pr(Type I error) = 0.05.

Slide 5: Questions to Ask

What are the key characteristics of the sample data collected? The data represent counts in different categories.

What is the basic experiment? The type of tree where each of the 156 birds was observed foraging was noted. Before observing, each bird has a certain probability of being in one of the four tree types; after observing, it is placed in the appropriate class.

We call any experiment of n trials, where each trial can have one of k possible outcomes, a multinomial experiment. For individual j, the response y_j indicates which outcome was observed; the possible outcomes are the integers 1, 2, ..., k.

Slide 6: The Multinomial Experiment

1. The experiment consists of n identical trials.
2. Each trial results in one of k possible outcomes.
3. The probability that a single trial results in outcome i is π_i, i = 1, 2, ..., k (with Σ π_i = 1), and this probability remains constant from trial to trial.
4. The trials are independent (the response of one trial does not depend on the response of any other).
5. The response of interest is n_i, the number of trials resulting in outcome i (with Σ n_i = n).

The multinomial distribution provides the probability distribution for the number of observations resulting in each of the k outcomes. It tells us the probability of observing exactly n_1, n_2, ..., n_k:

    P(n_1, n_2, ..., n_k) = [ n! / (n_1! n_2! ... n_k!) ] π_1^n_1 π_2^n_2 ... π_k^n_k    (recall 0! = 1)

Slide 7: From the Bird Foraging Example

Class            Hypothesized        Observed
Douglas fir      π_1 = 0.54 (54%)    n_1 = 70
Ponderosa pine   π_2 = 0.40 (40%)    n_2 = 79
Grand fir        π_3 = 0.05 ( 5%)    n_3 = 3
Western larch    π_4 = 0.01 ( 1%)    n_4 = 4

If the multinomial probability of observing these counts under the hypothesized proportions is high, then there is a good likelihood that the observed data come from a multinomial experiment with the hypothesized probabilities. Otherwise we have the probabilities wrong. How do we measure the goodness of fit between the hypothesized probabilities and the observed data?
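
The multinomial probability referred to above can be evaluated directly in R; a minimal sketch using the base-R function dmultinom():

    # Probability of observing exactly (70, 79, 3, 4) in 156 trials
    # under the hypothesized proportions (0.54, 0.40, 0.05, 0.01).
    dmultinom(x = c(70, 79, 3, 4), prob = c(0.54, 0.40, 0.05, 0.01))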

Slide 8: Expected Counts and the Chi-Square Statistic

In a multinomial experiment of n trials with hypothesized cell probabilities π_i0, i = 1, 2, ..., k, the expected number of responses in outcome class i is

    E_i = n π_i0.

A reasonable measure of goodness of fit compares the observed cell counts n_i to the expected cell counts E_i. It turns out (Pearson, 1900) that the statistic

    χ² = Σ_i (n_i − E_i)² / E_i

is one of the best for this purpose. Under H0 it has (approximately) a chi-square distribution with df = k − 1, provided there are no sparse counts: (i) no E_i is less than 1, and (ii) no more than 20% of the E_i are less than 5.
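
A minimal R sketch of this computation for the bird data, reproducing by hand what chisq.test() reports on the later slides:

    obs <- c(70, 79, 3, 4)             # observed cell counts n_i
    p0  <- c(0.54, 0.40, 0.05, 0.01)   # hypothesized cell probabilities
    E   <- sum(obs) * p0               # expected counts E_i = n * pi_i0
    X2  <- sum((obs - E)^2 / E)        # Pearson chi-square statistic
    pchisq(X2, df = length(obs) - 1, lower.tail = FALSE)   # approximate p-value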

Slide 9

Test at Pr(Type I error) = α = 0.05.

Class            Hypothesized        Observed   Expected
Douglas fir      π_1 = 0.54 (54%)       70        84.24
Ponderosa pine   π_2 = 0.40 (40%)       79        62.40
Grand fir        π_3 = 0.05 ( 5%)        3         7.80
Western larch    π_4 = 0.01 ( 1%)        4         1.56

χ² = 13.59. Since 13.59 > 7.81 (the critical value for 3 df at α = 0.05), we reject H0 and conclude that it is unlikely the birds are foraging randomly. (But: more than 20% of the E_i are less than 5; use an exact test.)
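
One way to avoid relying on the chi-square approximation when counts are sparse is a simulation-based (Monte Carlo) p-value; a sketch, using the built-in option of chisq.test():

    # Simulate many multinomial samples under H0 and compare their
    # chi-square statistics with the observed one.
    chisq.test(x = c(70, 79, 3, 4), p = c(0.54, 0.40, 0.05, 0.01),
               simulate.p.value = TRUE, B = 10000)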

Slide 10: R

> birds = chisq.test(x=c(70,79,3,4), p=c(.54,.40,.05,.01))

        Chi-squared test for given probabilities

data:  c(70, 79, 3, 4)
X-squared = 13.5934, df = 3, p-value = 0.003514

Warning message:
In chisq.test(x = c(70, 79, 3, 4), p = c(0.54, 0.4, 0.05, 0.01)) :
  Chi-squared approximation may be incorrect

> birds$resid
[1] -1.551497  2.101434 -1.718676  1.953563

Looks like the birds prefer ponderosa pine and western larch to the other two! (Positive residuals indicate more observations than expected in that class.)

Slide 11: Summary: Chi-Square Goodness-of-Fit Test

H0: π_i = π_i0 for categories i = 1, 2, ..., k (specified cell proportions for k categories).
Ha: At least two of the true population cell proportions differ from the specified proportions.

Test statistic:  χ² = Σ_i (n_i − E_i)² / E_i,  where E_i = n π_i0.

Rejection region: Reject H0 if χ² exceeds the tabulated critical value of the chi-square distribution with df = k − 1 and Pr(Type I error) = α.

Slide 12: Example: Genotype Frequencies in Oysters (A Nonstandard Chi-Square GOF Problem)

McDonald et al. (1996) examined variation at the CVJ5 locus in the American oyster (Crassostrea virginica). There were two alleles, L and S, and the genotype frequencies observed in a sample of 60 were:

    LL: 14    LS: 21    SS: 25

Using an estimate of the L allele proportion of p = 0.408, the Hardy-Weinberg formula gives the following expected genotype proportions:

    LL: p² = 0.167    LS: 2p(1 − p) = 0.483    SS: (1 − p)² = 0.350

Here there are 3 classes (LL, LS, SS), but all the classes are functions of only one estimated parameter (p). Hence the chi-square distribution has only one (1) degree of freedom, and NOT 3 − 1 = 2.

Slide 13: R

> chisq.test(x=c(14,21,25), p=c(.167,.483,.350))

        Chi-squared test for given probabilities

data:  c(14, 21, 25)
X-squared = 4.5402, df = 2, p-value = 0.1033

This p-value is WRONG for our problem: the statistic must be compared with a chi-square distribution with 1 df. Since 4.5402 > 3.841 (the 0.05 critical value for 1 df), we reject H0 and conclude that the genotype frequencies do NOT follow the Hardy-Weinberg formula.
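
A small sketch of how the correct p-value can be recomputed from the R output (the statistic itself is fine; only the reference distribution changes):

    oysters <- chisq.test(x = c(14, 21, 25), p = c(0.167, 0.483, 0.350))
    # Re-evaluate the same statistic against a chi-square with 1 df
    # (one parameter, p, was estimated from the data):
    pchisq(oysters$statistic, df = 1, lower.tail = FALSE)   # about 0.033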

Slide 14: Power Analysis in Chi-Square GOF Tests

Suppose you want to do a genetic cross of snapdragons with an expected 1:2:1 ratio, and you want to be able to detect a pattern with 5% more heterozygotes than expected under Hardy-Weinberg.

Class   Hypothesized       Want to detect (from data)
aa      25% (π_1 = 0.25)   22.5%
aA      50% (π_2 = 0.50)   55%
AA      25% (π_3 = 0.25)   22.5%

The necessary sample size (n) to be able to detect this difference can be computed by the more comprehensive packages (SAS, SPSS, R). There is also a free package, G*Power 3 (correct as of Spring 2010):
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
Inputs: (0.25, 0.50, 0.25) for the hypothesized proportions, (0.225, 0.55, 0.225) for the proportions to detect, and df = 1. You should get n of approximately 1,000.
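
For R users, the pwr package offers an alternative to G*Power; a sketch, where the power of 0.80 and α of 0.05 are assumptions, since the slide does not state them, and the required n it reports depends on those choices:

    library(pwr)
    p0 <- c(0.25, 0.50, 0.25)     # hypothesized 1:2:1 proportions
    p1 <- c(0.225, 0.55, 0.225)   # proportions we want to be able to detect
    w  <- ES.w1(p0, p1)           # Cohen's effect size w (here 0.1)
    # df = 1, matching the G*Power input on the slide
    pwr.chisq.test(w = w, df = 1, sig.level = 0.05, power = 0.80)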

Slide 15: Tests and Confidence Intervals for One and Two Proportions (§10.2, §10.3)

We first look at the case of a single population proportion (π). A random sample of size n is taken, and the number of "successes" (y) is noted.

Slide 16: Binomial Experiment = Multinomial Experiment with Two Classes

If π is the probability of a "success" and y is the number of successes in n trials, then, since the cell proportions sum to 1, the probability of a "failure" is 1 − π, and, since the cell frequencies sum to the total sample size, the number of failures is n − y. The estimate of the success probability is

    π̂ = y / n.

Slide 17: Normal Approximation to the Binomial and CI for π

In general, the probability of observing y or more successes can be approximated using an appropriate normal distribution (see §4.13).

What about a confidence interval (CI) for π? Using a similar argument as for y, we obtain the (1 − α)100% CI

    π̂ ± z_{α/2} √( π̂(1 − π̂) / n ),

where π̂ is used in the standard error because π is unknown.
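
A minimal R sketch of this interval, using as illustration the experimental-group counts from the algebra example later in the deck (39 successes out of 60):

    y <- 39; n <- 60                  # 39 "successes" out of 60 trials
    phat <- y / n                     # estimated success probability
    z <- qnorm(0.975)                 # z_{alpha/2} for a 95% CI
    phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)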

Slide 18: Approximate Statistical Test for π

H0: π = π_0 (π_0 specified).
Ha: 1. π > π_0    2. π < π_0    3. π ≠ π_0

Test statistic:  z = (π̂ − π_0) / √( π_0(1 − π_0) / n ).

Rejection region:
1. Reject if z > z_α.
2. Reject if z < −z_α.
3. Reject if |z| > z_{α/2}.

Note: Under H0 we have π = π_0, so π_0 is used in the standard error, and z has approximately a standard normal distribution.
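
A short R sketch of this test; π_0 = 0.5 here is just a hypothetical null value for illustration:

    y <- 39; n <- 60; pi0 <- 0.5           # pi0 is a hypothetical null value
    phat <- y / n
    z <- (phat - pi0) / sqrt(pi0 * (1 - pi0) / n)
    pnorm(z, lower.tail = FALSE)           # p-value for Ha: pi > pi0
    # prop.test(y, n, p = pi0, alternative = "greater", correct = FALSE)
    # gives the same (score) test, reported as X-squared = z^2.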

Slide 19: Sample Size Needed to Meet a Pre-specified Confidence in π

Suppose we wish to estimate π to within ±E with confidence 100(1 − α)%. What sample size should we use?

    n = z_{α/2}² π(1 − π) / E²

Since π is unknown, do one of the following:
1. Substitute our best guess for π.
2. Use π = 0.5 (the worst case, which maximizes π(1 − π)).

Example: We have been contracted to perform a survey to determine what fraction of students eat lunch on campus. How many students should we interview if we wish to be 95% confident of being within ±2% of the true proportion?

Worst case (π = 0.5):  n = (1.96)²(0.5)(0.5) / (0.02)² = 2401.
Best guess (π = 0.2):  n = (1.96)²(0.2)(0.8) / (0.02)² ≈ 1537.
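
The same arithmetic as a small R sketch (the helper function name is just for this sketch):

    sample_size <- function(E, conf = 0.95, pi = 0.5) {
      z <- qnorm(1 - (1 - conf) / 2)          # z_{alpha/2}
      ceiling(z^2 * pi * (1 - pi) / E^2)      # round up to a whole person
    }
    sample_size(E = 0.02)                # worst case, pi = 0.5: 2401
    sample_size(E = 0.02, pi = 0.2)      # best guess, pi = 0.2: 1537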

Slide 20: Comparing Two Binomial Proportions

Situation: Two sets of 60 ninth-graders were taught algebra I by different methods (self-paced versus formal lectures). At the end of the 4-month period, a comprehensive, standardized test was given to both groups, with results:

Experimental group: n = 60, 39 scored above 80%.
Traditional group:  n = 60, 28 scored above 80%.

Is this sufficient evidence to conclude that the experimental group performed better than the traditional group? Each student is a Bernoulli trial with probability π_1 of success (high test score) if they are in the experimental group, and π_2 of success if they are in the traditional group. We test

    H0: π_1 = π_2   versus   Ha: π_1 > π_2.

Slide 21: 100(1 − α)% Confidence Interval for π_1 − π_2

                          Population 1   Population 2
Population proportion     π_1            π_2
Sample size               n_1 = 60       n_2 = 60
Number of successes       y_1 = 39       y_2 = 28
Sample proportion         π̂_1 = 0.65     π̂_2 = 0.467

The 100(1 − α)% confidence interval for π_1 − π_2 is

    (π̂_1 − π̂_2) ± z_{α/2} √( π̂_1(1 − π̂_1)/n_1 + π̂_2(1 − π̂_2)/n_2 ),

using π̂_i in place of the unknown π_i. Example: the 90% CI is

    0.183 ± 1.645(0.089), i.e. (0.036, 0.330).

Interpret this interval.
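
A minimal R sketch of this interval for the algebra example:

    y <- c(39, 28); n <- c(60, 60)
    phat <- y / n                                # 0.650 and 0.467
    diff <- phat[1] - phat[2]
    se   <- sqrt(sum(phat * (1 - phat) / n))     # unpooled standard error
    diff + c(-1, 1) * qnorm(0.95) * se           # 90% CI, about (0.037, 0.330)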

Slide 22: Statistical Test for Comparing Two Binomial Proportions

H0: π_1 − π_2 = 0  (i.e., π_1 = π_2 = π).
Ha: 1. π_1 − π_2 > 0    2. π_1 − π_2 < 0    3. π_1 − π_2 ≠ 0

Test statistic:  z = (π̂_1 − π̂_2) / √( π̂_1(1 − π̂_1)/n_1 + π̂_2(1 − π̂_2)/n_2 ).

Rejection region:
1. Reject if z > z_α.
2. Reject if z < −z_α.
3. Reject if |z| > z_{α/2}.

Note: Under H0, z has approximately a standard normal distribution.

Slide 23

                          Population 1   Population 2
Population proportion     π_1            π_2
Sample size               n_1 = 60       n_2 = 60
Number of successes       y_1 = 39       y_2 = 28
Sample proportion         π̂_1 = 0.65     π̂_2 = 0.467

Test statistic:  z = (0.65 − 0.467) / 0.089 = 2.056.

Since 2.056 is greater than z_{0.05} = 1.645, we reject H0 and conclude Ha: π_1 > π_2.
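
A sketch of the same test in R; note that prop.test() pools the two samples when estimating the standard error under H0, so its chi-square statistic is the square of a slightly different z than the 2.056 above:

    y <- c(39, 28); n <- c(60, 60)
    phat <- y / n
    z <- (phat[1] - phat[2]) / sqrt(sum(phat * (1 - phat) / n))
    pnorm(z, lower.tail = FALSE)       # one-sided p-value, about 0.02
    # Built-in alternative (pooled SE, no continuity correction):
    # prop.test(y, n, alternative = "greater", correct = FALSE)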

