Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions.

Similar presentations


Presentation on theme: "Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions."— Presentation transcript:

1 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions

2 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.2 Counts and Proportions  Coin flipping  What’s the chance that heads comes up 8+ times in 10 flips of a coin?  Binomial distribution  Batting average  Is Ichiro a.400+ hitter?  One-sample test of proportion  Which is the better offensive team, Mariners or Red Sox?  Two-sample test of two proportions  Biomedical Research…

3 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.3 Where we’ve been… and where we’re going Variables of Interest? One Variable Continuous (Methods from Before Midterm) Binary One-sample test of proportion Two-sample test t of proportion Exact Methods Normal Approximation Two Variables Both Continuous Interested in prediction Simple Linear Regression Interested in association Both variables normal Pearson Correlation Not normal Spearman Correlation One Continuous, one categorical ANOVA Both Binary (in two weeks) More than Two Variables Multiple Linear Regression

4 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.4 Binary Data  Often in medical and public health studies, our endpoint of interest is binary or dichotomous  Examples  disease vs. no disease  response vs. no response  death vs. no death  success vs. failure

5 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.5 Binary Data  Continuous endpoints often dichotomized into a binary endpoint as well  For example, in a study of the effect of a drug on LDL levels, for each subject, the LDL measurement at the end of the study (a continuous measure) may be dichotomized into “response” vs. “no response” based on a cut-point defining whether the LDL level has been reduced to acceptable, normal, or safe levels.

6 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.6 Binary Data 1.There are only two possible (mutually exclusive) outcomes, often called a “success” and a “failure”. 2.Each experiment is identical to all the others, and the probability of a “success” is p. Thus, the probability of a “failure” is 1-p.  Each experiment is called a Bernoulli trial. Such experiments include throwing the die and observing whether or not it comes up six, tossing a coin, or investigating the survival of a cancer patient, etc.

7 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.7 Binary Data: Smoking Status Example  Define:  X=1 for a smoker and  X=0 for a non-smoker.  If “success” is the event that a randomly selected individual is a smoker, and from previous research it is known that about 29% of the adults in the United States smoke, then: and  This is an example of a Bernoulli trial (we select just one individual at random, each selection is carried out independently, and each time the probability of that individual being a “success” is constant).

8 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.8 Binary Data: Smoking Status Example  If we selected two adults, and looked at their smoking status, then the possible outcomes are:  Neither is a smoker  Only one is a smoker  Both are smokers  If we define X as the number of smokers between these two individuals, then  X=0: Neither is a smoker  X=1: Only one is a smoker  X=2: Both are smokers

9 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.9 Binary Data: Smoking Status Example P(X=0) = (1-p) 2 = (0.71) 2 = 0.5041 P(X=1) = P( 1st individual is a smoker OR 2nd individual is a smoker ) =p(1-p)+(1-p)p = 2p(1-p) = 0.4118 P(X=2) = p 2 = (0.29) 2 = 0.0841  Notice that: P(X=0)+P(X=1)+P(X=2)=0.5041+0.4118+0.0841=1.000.

10 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.10 Binary Data: Smoking Status Example  Bar chart  A plot of the probability distribution of X, covering all of the possible numbers that X can attain (in the previous example those were n=0, 1, and 2).

11 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.11 Binomial Distribution  The binomial distribution is a special distribution that closely models the behavior of variables that corresponds to repeated Bernoulli experiments.  When describing the binomial distribution, we need to specify two parameters:  The probability of “success” (p)  The number of Bernoulli experiments (n)  One way of looking at p is as the proportion of time that an experiment is successful when repeated a large number of times.

12 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.12 Binomial Distribution  Given these parameters, the mean and standard deviation of a binomial distribution are:  If n is sufficiently large, then the statistic is approximately distributed as normal with mean 0 and standard deviation 1 (a std. normal distribution). A better approximation to the normal distribution is given by when X np. This is called a continuity correction. In general, we say n is “large” if np and n(1-p) are both ≥ 5

13 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.13 Binary Data: Smoking Status Example  For example, suppose that we want to find the proportion of samples of size n=30 in which at most six individuals smoke.  With p=0.29 and n=30, we have:  np=8.75 and n(1-p)=21.3>5  X=6<np=8.7  should apply the continuity correction…

14 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.14 Binary Data: Smoking Status Example  Normal approximation to binomial with continuity correction:  The exact binomial probability is 0.190, which is very close to the approximate value given above.

15 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.15 Sampling Distribution of Proportions  Our thinking in terms of estimation (including confidence intervals) and hypothesis testing does not change when dealing with proportions.  We may use the Central Limit Theorem for binary data also (as the CLT applies to all distributions)  But note that the CLT is an asymptotic result (as n  ∞ )  Thus, we must be careful when n is small

16 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.16 Sampling Distribution of Proportions  Since the proportion of a success in the general population will not be known, we must estimate it. In general, such an estimate is derived by calculating the proportion of successes in a sample of n experiments (trials) as follows: where x is the number of successes in the sample of size n.

17 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.17 Sampling Distribution of Proportions  The sampling distribution of a proportion has mean, μ, and standard deviation, σ: Mean: μ = p Standard deviation: σ = And the statistic: is distributed according to the standard normal distribution (again, the normal approximation is particularly good when np>5 and n(1-p)>5).  NOTE: Dividing our previous parameters by n (x  x/n)

18 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.18 Sampling Distribution of Proportions  Note that if we multiply the numerator and denominator of Z by n we have, which is the familiar form encountered in our study of the binomial distribution.

19 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.19 Hypothesis Testing  Similar to hypothesis testing with continuous data, one may perform hypothesis tests on binary data:  1-sample test of a proportion  H 0 : p=p 0  H A : p ≠ p 0  2-sample test comparing proportions  H 0 : p 1 =p 2  H A : p 1 ≠ p 2

20 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.20 Lung Cancer Example  Consider the five-year survival among patients under 40 who have been diagnosed with lung cancer. The mean proportion of individuals surviving is p=0.10 -- implying that the standard deviation of the 5-year survival is:.

21 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.21 Lung Cancer Example  If we select repeated samples of size n=50 patients diagnosed with lung cancer, what fraction of the samples will have 20% or more survivors? That is, “what percent of the time 50(0.20)=10 or more patients will be alive after 5 years”?  Since np = 50(0.1) = 5 and n(1-p) = 50(0.9) =45>5 the normal approximation should be adequate. Then,  Only 0.9% of the time will the proportion of lung cancer patients surviving past five years be 20% or more.

22 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.22 Lung Cancer Example  In the previous example, we did not know the true proportion of 5-year survivors among individuals under 40 years of age that have been diagnosed with lung cancer.  If it is known from previous studies that the five-year survival rate of lung cancer patients that are older than 40 years old is 8.2%, we might want to test whether the five-year survival among the younger lung-cancer patients is the same as that of the older ones.

23 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.23 Lung Cancer Example  From a sample of n=52 patients under the age of 40 that have been diagnosed with lung cancer, the proportion surviving after five years is.  Is this within sampling variability of the known 5-year survival of older patients? 0.082

24 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.24 Lung Cancer Example  The test of hypothesis is constructed as follows:

25 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.25  The test statistic is: Lung Cancer Example P(|Z|>0.87)=P(Z>0.87)+P(Z<-0.87) = 0.384 0.082

26 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.26  Since the p value associated with the two-sided test is 0.380 > α = 0.05, we do NOT reject the null hypothesis.  That is, there is not sufficient evidence to indicate that the five-year survival of lung cancer patients who are younger than 40 years of age is different than that of the older patients. Lung Cancer Example

27 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.27 Lung Cancer Example  STATA output for this normal approximation:  Again, we do not reject the null hypothesis. Note that any differences with the hand calculations are due to round-off error. (NOTE: See Lab #8 for appropriate STATA code.)

28 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.28 Lung Cancer Example  Carrying out an exact binomial test in STATA we have:  We see that the p value associated with the two-sided test is 0.318, which is close to that calculated previously.

29 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.29 Confidence Intervals  Similar to the testing of hypothesis involving proportions, we can construct confidence intervals for a population proportion (or for a difference between two proportions).  Again these intervals will be based on the statistic: where and are the estimates of the proportion and its associated standard deviation, respectively.

30 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.30 Confidence Intervals  One-sided Confidence Interval:  Two-sided Confidence Intervals:

31 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.31  In the previous example, if 6 out of 52 lung cancer patients under 40 years of age were alive after five years, and using the normal approximation (which is justified since np=52(0.115)=5.98>5, and n(1-p)= 52(1-0.115)=46.02>5), an approximate 95% confidence interval for the true proportion p is given by Lung Cancer Example

32 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.32  In other words, we are 95% confident that the true five-year survival of lung-cancer patients < 40 years of age is between 2.8% and 20.2%.  Note that this interval contains 8.2% (the five-year survival rate among lung cancer patients that are older than 40 years of age). Thus, it is equivalent to a hypothesis test that did not reject the null hypothesis of equal five-year survival between lung cancer patients that are older than 40 years old versus younger subjects. Lung Cancer Example

33 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.33 Lung Cancer Example  In the previous example, using exact binomial confidence intervals we have which is close to our calculations that used the normal approximation.

34 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.34 Two Proportions  Comparing two proportions is similar to a two- mean comparison…  First and second group proportions: and  Under assumptions of equality of the two population proportions, we may want to derive a pooled estimate of the sample proportion:

35 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.35 Two Proportions  Using this pooled estimate, we can derive a pooled estimate of the standard deviation of the unknown proportion (assumed equal between the two groups) as:  The hypothesis testing of comparisons between two proportions is based on the statistic:

36 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.36 Two Proportions: Hypothesis Test  Hypothesis test for difference in two proportions:

37 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.37 Car Accident Example  In a study investigating morbidity and mortality among pediatric victims of motor vehicles accidents, information regarding the effectiveness of seat belts was collected.  Two random samples were selected, one of size n 1 =123 from a population of children that were wearing seat belts at the time of the accident, and another of size n 2 =290 from a group of children that were not wearing seat belts at the time of the accident.

38 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.38 Car Accident Example  In the first case, x 1 =3 children died, while in the second x 2 =13 died.  Consequently, and.  The estimated difference between these proportions:

39 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.39 Car Accident Example  We wish to determine if the death rate is different in the two groups and carry out the test of hypothesis as proposed earlier:  Thus, there is not sufficient evidence to conclude that children not wearing seat belts are safer (die in different rates) than children wearing seat belts.

40 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.40 Two Proportions: Confidence Intervals  Confidence intervals of the difference of two proportions are also based on the statistic:  The standard deviation estimate is

41 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.41 Two Proportions: Confidence Intervals  Note discrepancy between hypothesis testing and CI’s:  We no longer need to assume that the two proportions are equal, so the estimate of the standard deviation in the denominator is not a pooled estimate, but simply the sum of the standard deviations in each group.  This deviation from hypothesis testing may lead to disagreements between decisions reached through usual hypothesis testing versus hypothesis testing performed using confidence intervals (infrequent issue).

42 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.42 Two Proportions: Confidence Intervals

43 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.43 Car Accident Example  A two-sided 95% confidence interval for the true difference in death rates among children wearing seat belts versus those that did not is given by:

44 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.44 Car Accident Example  STATA output:

45 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.45 Car Accident Example  That is, the true difference between the two groups will be between 5.7% in favor of those children wearing seat belts, to 1.6% in favor of those children not wearing seat belts. In this regard, since the zero (hypothesized under the null hypothesis) difference is included in the confidence interval we fail to reject the null hypothesis. There is no evidence to suggest a benefit of seat belts  Click it, or Ticket!!

46 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.46 Exact Confidence Intervals  “Exact” confidence intervals for a binomial parameter are possible  These do not rely on the normal approximation to the binomial (i.e., use of the CLT)  Computationally very intensive (particularly for large N)  May require special programming/software

47 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.47 Exact Confidence Intervals  General Rule:  Use exact confidence intervals whenever software is available and is feasible given the computing resources  If N is large…  It is OK to use normal approximation (as CLT kicks in)  If N is small…  The normal approximation may not be appropriate  Use exact CIs if possible

48 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.48 Special Case  What if we observe 0 events or responses?  How do we get a CI for the response rate when the variability is 0?  Example: ACTG A5129, “Absence of Sustained Hyperlactatemia Among HIV- Infected Patients with Risk Factors for Mitochondrial Toxicity”

49 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.49 Special Case  From abstract:

50 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.50 Special Case  From abstract:

51 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.51 Special Case  There were no episodes of symptomatic hyperlactatemia or lactic acidosis during the study.  A 95% confidence interval for the prevalence of hyperlactatemia given our data (i.e., 0 out of 83) is:  In other words, the true prevalence of hyperlactatemia as defined by this study is likely to be less than 3.54%. 95% CI = (0,1-α 1/n ) = (0, 0.0354)

52 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.52 Review: Coin Flipping  What’s the chance that heads comes up 8+ in 10 flips of a coin?  Since x=8>np=10(0.5)=5, we apply the normal approximation with continuity correction:  What’s the area in the upper tail of the standard normal, above 1.581?

53 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.53 Review: Coin Flipping  Thus, there’s a 5.7% chance that we could have 8, 9, or 10 heads on 10 flips of the coin.

54 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.54 Review: Baseball Statistics  As of April 16 th, Ichiro has had 9 hits out of 31 at bats:  How likely is it that he could still be a.400 hitter (just having a bad start)?  Ho: Ichiro bats 0.400 (p=0.400)  Ha: Ichiro bats <0.400 (p<0.400)  Set α=0.05

55 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.55 Review: Baseball Statistics  One-sample test of proportion (one-sided)  10.4% chance we would observe this data (Ichiro’s present batting average) given he is a.400 hitter.  Since P>0.05, we can’t say Ichiro is NOT a.400 hitter!  We don’t reject the null that Ichiro is a.400 hitter. 0.400

56 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.56 Review: Baseball Statistics  Another (better?) Japanese Mariner…  Johjima has 10 hits out of 21 at bats:  Carry out same hypothesis test – but now test if Johjima may be a.500 hitter…  Don’t reject… Johjima’s average falls nicely within a.500+ hitter distribution!! 0.500

57 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.57 Review: Baseball Statistics  Maybe a better test would be to see if he’s just been lucky…and is really just a moderately good hitter (.300):  Ho: p j =.300  Ha: p j >.300  We reject… Johjima’s average is significantly above.300!! 0.300

58 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.58 Review: Baseball Statistics  Comparing batting averages of Mariners and Red Sox (overall, as of April 16 th ):  Mariners:  Red Sox:  Combined:  And pooled estimate of S.D.:

59 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.59 Review: Baseball Statistics  Ho: and Ha:  Two-sample test of proportions:  Two-sided test this time…  No statistically significant difference between Mariners and Red Sox batting averages… 0

60 Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.60 Where we’ve been… and where we’re going Variables of Interest? One Variable Continuous (Methods from Before Midterm) Binary One-sample test of proportion Two-sample test t of proportion Exact Methods Normal Approximation Two Variables Both Continuous Interested in prediction Simple Linear Regression Interested in association Both variables normal Pearson Correlation Not normal Spearman Correlation One Continuous, one categorical ANOVA Both Binary (in two weeks) More than Two Variables Multiple Linear Regression


Download ppt "Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions."

Similar presentations


Ads by Google