
1 Statistical Inference I: Hypothesis testing; sample size

2 Statistics Primer
Statistical Inference
Hypothesis testing
P-values
Type I error
Type II error
Statistical power
Sample size calculations

3 What is a statistic? A statistic is any value that can be calculated from the sample data. Sample statistics are calculated to give us an idea about the larger population.

4 Examples of statistics:
Mean: The average cost of a gallon of gas in the US is $2.65.
Difference in means: The difference between the average gas price in Los Angeles ($2.96) and in Des Moines, Iowa ($2.32) is 64 cents.
Proportion: 67% of high school students in the U.S. exercise regularly.
Difference in proportions: The difference between the proportion of Democrats who approve of Obama (83%) and the proportion of Republicans who do (14%) is 69%.

5 What is a statistic? Sample statistics are estimates of population parameters.

6 Sample statistics estimate population parameters:
Truth (not observable): the mean IQ of some population of 100,000 people is 100.
Sample (observation): the mean IQ of 5 sampled subjects.
We use the sample statistic to make guesses about the whole population.

7 What is sampling variation?
Statistics vary from sample to sample due to random chance. Example: A population of 100,000 people has an average IQ of 100 (If you actually could measure them all!) If you sample 5 random people from this population, what will you get?

8 Sampling Variation
[Figure: repeated samples drawn from the truth (not observable): a population with mean IQ = 100.]

9 Sampling Variation and Sample Size
Do you expect more or less sampling variability in samples of 10 people? Of 50 people? Of 1000 people? Of 100,000 people?

10 Sampling Distributions
Most experiments are one-shot deals. So how do we know whether an observed effect from a single experiment is real or just an artifact of sampling variability (chance variation)? Answering that requires a priori knowledge of how sampling variability works. This is why you learned about probability distributions and how to calculate and manipulate expected value and variance: they form the basis for describing the distribution of a sample statistic.

11 Standard error
Standard error is a measure of sampling variability: the standard deviation of a sample statistic.
It's a theoretical quantity: what would the distribution of my statistic be if I could repeat my experiment many times (with fixed sample size)? How much chance variation is there?
Standard error decreases with increasing sample size and increases with increasing variability of the outcome (e.g., IQ).
Standard errors can be predicted by computer simulation or by mathematical theory (formulas); the formula is different for every type of statistic (e.g., mean, difference in means, odds ratio). A simulation sketch follows below.
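To make the "theoretical quantity" concrete, here is a minimal simulation sketch (not from the original slides; the population values are illustrative assumptions matching the IQ example): it estimates the standard error of a sample mean by actually repeating the experiment many times.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population matching the IQ example: mean 100, SD 15 (assumed).
population = rng.normal(loc=100, scale=15, size=100_000)

n = 5                   # subjects per experiment
n_experiments = 1_000   # how many times we "repeat" the experiment

# Mean IQ from each simulated experiment.
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_experiments)
])

# The SD of the sample statistic across experiments is the standard error;
# theory predicts sigma / sqrt(n) = 15 / sqrt(5), roughly 6.7.
print(f"simulated SE:   {sample_means.std(ddof=1):.2f}")
print(f"theoretical SE: {15 / np.sqrt(n):.2f}")
```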

12 What is statistical inference?
The field of statistics provides guidance on how to make conclusions in the face of chance variation (sampling variability).

13 Example 1: Difference in proportions
Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents? Example modified from: Olfson M, et al. “Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults.” Arch Gen Psychiatry. 2006;63.

14 Example 1 Design: Case-control study
Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression). Statistical question: Is a history of use of antidepressants more common among cases than controls?

15 Example 1
Statistical question: Is a history of use of particular antidepressants more common among suicide-attempt cases than controls?
What will we actually compare? The proportion of cases who used antidepressants in the past vs. the proportion of controls who did.

16 Results
Any antidepressant drug ever: cases 120/263 (46%) vs. controls 448/1241 (36%).
Difference = 10%.

17 What does a 10% difference mean?
Before we perform any formal statistical analysis on these data, we already have a lot of information. Look at the basic numbers first; THEN consider statistical significance as a secondary guide.

18 Is the association statistically significant?
This 10% difference could reflect a true association or it could be a fluke in this particular sample. The question: is 10% bigger or smaller than the expected sampling variability?

19 What is hypothesis testing?
Statisticians try to answer this question with a formal hypothesis test.

20 Hypothesis testing Step 1: Assume the null hypothesis. Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (= the difference is 0%)

21 Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true—math theory (formula).
The standard error of the difference in two proportions is:
SE = √[ p̂(1 − p̂)(1/n₁ + 1/n₂) ], where p̂ is the pooled proportion.
Here p̂ = (120 + 448)/(263 + 1241) ≈ 0.38, giving SE ≈ 3.3%. Thus, we expect to see differences between the groups as big as about 6.6% (2 standard errors) just by chance…

22 Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true—computer simulation.
In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability. I used computer simulation to take 1000 samples of 263 cases and 1241 controls assuming the null hypothesis is true (i.e., no difference in antidepressant use between the groups).
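A minimal sketch of that simulation in Python (an illustration, not the author's original code; it assumes both groups share the pooled antidepressant-use rate from the observed totals, 568/1504 ≈ 0.38):

```python
import numpy as np

rng = np.random.default_rng(0)

p_null = (120 + 448) / (263 + 1241)   # pooled proportion under the null (~0.378)
n_cases, n_controls, n_sims = 263, 1241, 1_000

# Simulated difference in proportions for each of 1000 "null" studies.
diffs = (rng.binomial(n_cases, p_null, n_sims) / n_cases
         - rng.binomial(n_controls, p_null, n_sims) / n_controls)

# The SD of the simulated differences is the standard error (roughly 3.3%),
# and almost no null study produces a difference as extreme as 10%.
print(f"simulated standard error:    {diffs.std(ddof=1):.3f}")
print(f"fraction with |diff| >= 10%: {(np.abs(diffs) >= 0.10).mean():.3f}")
```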

23 Computer Simulation Results

24 What is standard error?
Standard error: a measure of the variability of a sample statistic. Here the standard error is about 3.3%.

25 Hypothesis Testing
Step 3: Do an experiment. We observed a difference of 10% between cases and controls.

26 Hypothesis Testing Step 4: Calculate a p-value P-value=the probability of your data or something more extreme under the null hypothesis.

27 Hypothesis Testing
Step 4: Calculate a p-value—mathematical theory.
Z = (observed difference between the groups) / (standard error) = 10% / 3.3% ≈ 3.0.
The difference in proportions follows a normal distribution, and a Z-value of 3.0 corresponds to a p-value of .003.
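The same arithmetic as a few lines of Python (a sketch using SciPy's normal distribution; the inputs are the slide's rounded values):

```python
from scipy.stats import norm

diff = 0.46 - 0.36   # observed difference in proportions
se = 0.033           # standard error from the formula above

z = diff / se                       # roughly 3.0
p_two_sided = 2 * norm.sf(abs(z))   # two-tailed area beyond |Z|
print(f"Z = {z:.1f}, two-sided p = {p_two_sided:.3f}")   # about .003
```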

28 The p-value from computer simulation…
When we ran this study 1000 times (under the null hypothesis), we got 1 result as big as or bigger than 10%. We also got 2 results as small as or smaller than −10%.

29 P-value P-value=the probability of your data or something more extreme under the null hypothesis. From our simulation, we estimate the p-value to be: 3/1000 or .003

30 Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis. Here we reject the null.
Alternative hypothesis: There is an association between antidepressant use and suicide attempts in the target population.

31 What does a 10% difference mean?
Is it “statistically significant”? YES
Is it clinically significant?
Is this a causal association?

32 What does a 10% difference mean?
Is it “statistically significant”? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
Statistical significance does not necessarily imply clinical significance.
Statistical significance does not necessarily imply a cause-and-effect relationship.

33 What would a lack of statistical significance mean?
If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher—as shown in this computer simulation…

34 263 cases and 1241 controls vs. 50 cases and 50 controls
With 263 cases and 1241 controls, the standard error is about 3.3%.
With 50 cases and 50 controls, the standard error is about 10%.

35 With only 50 cases and 50 controls…
The standard error is about 10%. If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time).

36 Two-tailed p-value
Two-tailed p-value = 17% × 2 = 34%.

37 What does a 10% difference mean (50 cases/50 controls)?
Is it “statistically significant”? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
No evidence of an effect ≠ evidence of no effect.

38 Example 2: Difference in means
Example: Rosenthal, R. and Jacobson, L. (1966). Teachers’ expectancies: Determinants of pupils’ I.Q. gains. Psychological Reports, 19.

39 The Experiment (note: exact numbers have been altered)
Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90). Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as “academic bloomers” (n=18). BUT: the children on the teachers’ lists had actually been randomly assigned to the list. At the end of the year, the same IQ test was re-administered.

40 Example 2 Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group? What will we actually compare? One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.

41 Results
Change in IQ score, mean (SD): “academic bloomers” (n=18): 12.2 (2.0); controls (n=72): 8.2 (2.0).
Difference = 4 points.
The standard deviation of change scores was 2.0 in both groups. This affects statistical significance…

42 What does a 4-point difference mean?
Before we perform any formal statistical analysis on these data, we already have a lot of information. Look at the basic numbers first; THEN consider statistical significance as a secondary guide.

43 Is the association statistically significant?
This 4-point difference could reflect a true effect or it could be a fluke. The question: is a 4-point difference bigger or smaller than the expected sampling variability?

44 Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between “academic bloomers” and normal students (= the difference is 0 points).

45 Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true—math theory.
The standard error of the difference in two means is:
SE = √[ σ²(1/n₁ + 1/n₂) ] = √[ 2.0² × (1/18 + 1/72) ] ≈ 0.52.
We expect to see differences between the groups as big as about 1.0 (2 standard errors) just by chance…
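A quick numeric check of this formula (a sketch using the study's values):

```python
import numpy as np

sd, n1, n2 = 2.0, 18, 72   # SD of change scores; "bloomers" and controls
se = np.sqrt(sd**2 * (1 / n1 + 1 / n2))
print(f"SE of the difference in means: {se:.3f}")   # about 0.527, i.e., ~0.52
```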

46 Hypothesis Testing Step 2: Predict the sampling variability assuming the null hypothesis is true—computer simulation: In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability. I used computer simulation to take 1000 samples of 18 treated and 72 controls, assuming the null hypothesis (that the treatment doesn’t affect IQ).

47 Computer Simulation Results

48 What is the standard error?
Standard error: a measure of the variability of a sample statistic. Here the standard error is about 0.52.

49 Hypothesis Testing
Step 3: Do an experiment. We observed a difference of 4 points between treated and controls.

50 Hypothesis Testing Step 4: Calculate a p-value P-value=the probability of your data or something more extreme under the null hypothesis.

51 Hypothesis Testing
Step 4: Calculate a p-value—mathematical theory.
t = (observed difference between the groups) / (standard error) = 4 / 0.52 ≈ 8.0.
The difference in means follows a T distribution (which is very similar to a normal distribution except with very small samples). Here there are 88 degrees of freedom (18 + 72 − 2); a t-curve with 88 df has slightly wider cut-offs for 95% area (t = 1.99) than a normal curve (Z = 1.96). A t88-value of 8.0 corresponds to a p-value of <.0001.
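And the t-based p-value as code (a sketch; note that 4/0.52 is closer to 7.7, which the slides round to 8.0):

```python
from scipy.stats import t

diff, se = 4.0, 0.52
df = 18 + 72 - 2   # 88 degrees of freedom

t_stat = diff / se                        # ~7.7 (slides round to 8.0)
p_two_sided = 2 * t.sf(abs(t_stat), df)   # two-tailed area beyond |t|
print(f"t({df}) = {t_stat:.1f}, two-sided p = {p_two_sided:.1e}")   # < .0001
```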

52 Getting the P-value from computer simulation…
If we ran this study 1000 times under the null hypothesis, we wouldn’t expect to get even one result as big as a difference of 4.

53 P-value P-value=the probability of your data or something more extreme under the null hypothesis. Here, p-value<.0001

54 Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis. Here we reject the null.
Alternative hypothesis: There is an association between being labeled as gifted and subsequent academic achievement.

55 What does a 4-point difference mean?
Is it “statistically significant”? YES
Is it clinically significant?
Is this a causal association?

56 What does a 4-point difference mean?
Is it “statistically significant”? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
Statistical significance does not necessarily imply clinical significance.
Statistical significance does not necessarily imply a cause-and-effect relationship.

57 What if our standard deviation had been higher?
The standard deviation for change scores in both treatment and control was 2.0. What if change scores had been much more variable—say a standard deviation of 10.0?

58 Std. dev. in change scores: 2.0 vs. 10.0
With a std. dev. in change scores of 2.0, the standard error is 0.52.
With a std. dev. in change scores of 10.0, the standard error is 2.58.

59 With a std. dev. of 10.0…
The standard error is 2.58. If we ran this study 1000 times, we would expect to get +4.0 or −4.0 about 12% of the time. P-value = .12.

60 What would a 4.0 difference mean (std. dev=10)?
Is it “statistically significant”? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
No evidence of an effect ≠ evidence of no effect.

61 Hypothesis testing summary
Null hypothesis: the hypothesis of no effect (usually the opposite of what you hope to prove); the straw man you are trying to shoot down. Example: antidepressants have no effect on suicide risk.
P-value: the probability of your observed data (or something more extreme) if the null hypothesis is true. Example: the probability that the study would have found 10% more antidepressant use among cases than controls if antidepressants had no effect on suicide risk (i.e., just by chance).
If the p-value is low enough (i.e., if our data are very unlikely given the null hypothesis), this is evidence that the null hypothesis is wrong. If the p-value is low enough (typically <.05), we reject the null hypothesis and conclude that antidepressants do have an effect.

62 Summary: The Underlying Logic of hypothesis tests…
Hypothesis tests follow this logic: Assume A. If A, then B. Not B. Therefore, not A. But throw in a bit of uncertainty: if A, then probably B…

63 Error and power
Type I error rate (or significance level): the probability of finding an effect that isn’t real (a false positive). If we require p-value <.05 for statistical significance, this means that 1 in 20 times we will find a positive result just by chance.
Type II error rate: the probability of missing an effect (a false negative).
Statistical power: the probability of finding an effect if it is there (the probability of not making a type II error). When we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%).
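These definitions can be checked by simulation. Below is a minimal sketch (an illustration under assumed inputs, reusing study 1's group sizes): it runs a two-proportion Z-test many times, once with the null true and once with a real 10% difference, and counts how often the test rejects at α = .05.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n1, n2, n_sims = 263, 1241, 10_000

def rejection_rate(p1, p2, alpha=0.05):
    """Fraction of simulated studies whose two-proportion Z-test rejects H0."""
    x1 = rng.binomial(n1, p1, n_sims) / n1
    x2 = rng.binomial(n2, p2, n_sims) / n2
    pooled = (x1 * n1 + x2 * n2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 - x2) / se
    return (2 * norm.sf(np.abs(z)) < alpha).mean()

# Null true: the rejection rate is the type I error rate (should be ~.05).
print(f"type I error: {rejection_rate(0.36, 0.36):.3f}")
# Real 10% difference: the rejection rate is the power (here > 80%).
print(f"power:        {rejection_rate(0.46, 0.36):.3f}")
```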

64 Type I and Type II Error in a box

Your statistical decision | H0 true (e.g., antidepressants do not increase suicide risk) | H0 false (e.g., antidepressants do increase suicide risk)
Reject H0 (e.g., conclude antidepressants increase suicide risk) | Type I error (α) | Correct
Do not reject H0 (e.g., conclude there is insufficient evidence that antidepressants increase suicide risk) | Correct | Type II error (β)

65 Reminds me of… Pascal’s Wager

Your decision | God exists | God doesn’t exist
Reject God | BIG MISTAKE | Correct
Accept God | Correct—big pay off | MINOR MISTAKE

66 Type I and Type II Error in a box

Your statistical decision | H0 true | H0 false
Reject H0 | Type I error (α) | Correct
Do not reject H0 | Correct | Type II error (β)

67 Review Question 1
If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we’re wrong (i.e., that the hypothesis test gave us a false positive)?
a. .03
b. .06
c. Cannot tell
d. 1.96
e. 95%

68 Review Question 1 (answer)
If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we’re wrong (i.e., that the hypothesis test gave us a false positive)?
Answer: c. Cannot tell. The p-value is the probability of the data given the null hypothesis, not the probability that a significant result is a false positive.

69 Review Question 2
Standard error is:
a. For a given variable, its standard deviation divided by the square root of n.
b. A measure of the variability of a sample statistic.
c. The inverse of sample size.
d. A measure of the variability of a characteristic.
e. All of the above.

70 Review Question 2 (answer)
Standard error is:
Answer: b. A measure of the variability of a sample statistic. (Choice a is the formula for the standard error of a mean only; it does not hold for other statistics.)

71 Review Question 3
A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:
a. The treatments are equally effective.
b. Neither treatment is effective.
c. The study lacked sufficient power to detect a difference.
d. The null hypothesis should be rejected.
e. There is not enough evidence to reject the null hypothesis.

72 Review Question 3 (answer)
A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:
Answer: e. There is not enough evidence to reject the null hypothesis.

73 Review Question 4
Following the introduction of a new treatment regime in a rehab facility, alcoholism “cure” rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value <.005). It follows that:
a. The improvement in treatment outcome is clinically important.
b. The new regime cannot be worse than the old treatment.
c. Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.
d. All of the above.
e. None of the above.

74 Review Question 4 (answer)
Following the introduction of a new treatment regime in a rehab facility, alcoholism “cure” rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value <.005). It follows that:
Answer: e. None of the above. Statistical significance alone establishes neither clinical importance nor causality.

75 Statistical Power Statistical power is the probability of finding an effect if it’s real. What things are going to help statistical power?

76 Can we quantify how much power we have for given sample sizes?

77 Study 1: 263 cases, 1241 controls
Null distribution: difference = 0. For a 5% significance level, the one-tail area is 2.5% (Zα/2 = 1.96).
Rejection region: any value ≥ 6.5 (0 + 3.3 × 1.96).
Power = the chance of being in the rejection region if the alternative is true = the area (in yellow) to the right of this cutoff under the clinically relevant alternative: difference = 10%.

78 Study 1: 263 cases, 1241 controls
Rejection region: any value ≥ 6.5 (0 + 3.3 × 1.96).
Power = the chance of being in the rejection region if the alternative is true = the area (in yellow) to the right of the cutoff.
Power here is >80%.

79 Study 1: 50 cases, 50 controls
Critical value = 0 + 10 × 1.96 ≈ 20 (one-tail area = 2.5%, Zα/2 = 1.96).
Power is closer to 20% now.

80 Study 2: 18 treated, 72 controls, std. dev. = 2
Critical value = 0 + 0.52 × 1.96 ≈ 1.
Clinically relevant alternative: difference = 4 points.
Power is nearly 100%!

81 Study 2: 18 treated, 72 controls, std. dev. = 10
Critical value = 0 + 2.58 × 1.96 ≈ 5.
Power is about 40%.

82 Study 2: 18 treated, 72 controls, effect size = 1.0
Critical value = 0 + 0.52 × 1.96 ≈ 1.
Clinically relevant alternative: difference = 1 point.
Power is about 50%.

83 Factors Affecting Power
1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Significance level desired
(Recall: over repeated sampling, most sample statistics follow a normal distribution, which is characterized by two parameters: a mean and a variability (SD). The standard deviation of the population is its natural variability; the standard deviation of a sample statistic—the standard error of the mean, of the odds ratio, of the difference of two means, etc.—is its standard error.)

84 1. Bigger difference from the null mean
[Figure: distributions of average weight from samples of 100 under the null and under a clinically relevant alternative.]

85 2. Bigger standard deviation
[Figure: the same distributions of average weight from samples of 100, with a bigger standard deviation.]

86 3. Bigger sample size
[Figure: the same distributions of average weight from samples of 100, with a bigger sample size.]

87 4. Higher significance level
[Figure: the same distributions of average weight from samples of 100, with a larger rejection region.]

88 Sample size calculations
Based on these elements, you can write a formal mathematical equation that relates power, sample size, effect size, standard deviation, and significance level…

89 Simple formula for difference in proportions
n = 2 p̄(1 − p̄)(Zα/2 + Zβ)² / (p₁ − p₂)²
where:
n = sample size in each group (assumes equal-sized groups);
Zβ represents the desired power (typically .84 for 80% power);
Zα/2 represents the desired level of statistical significance (typically 1.96);
p̄(1 − p̄) is a measure of variability (similar to standard deviation), with p̄ the average of the two proportions;
p₁ − p₂ is the effect size (the difference in proportions).
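As code, under the same assumptions (equal group sizes, two-sided α = .05, 80% power; a sketch, not a replacement for consulting a statistician):

```python
import math

def n_per_group_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Sample size per group for detecting p1 vs. p2 (simple formula above)."""
    p_bar = (p1 + p2) / 2                    # average proportion
    variability = 2 * p_bar * (1 - p_bar)    # the p(1 - p) variability term
    return math.ceil(variability * (z_alpha + z_beta)**2 / (p1 - p2)**2)

# Example: detect 46% vs. 36% antidepressant use with 80% power at alpha = .05.
print(n_per_group_proportions(0.46, 0.36))   # ~380 per group
```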

90 Simple formula for difference in means
n = 2σ²(Zα/2 + Zβ)² / (μ₁ − μ₂)²
where:
n = sample size in each group (assumes equal-sized groups);
Zβ represents the desired power (typically .84 for 80% power);
Zα/2 represents the desired level of statistical significance (typically 1.96);
σ is the standard deviation of the outcome variable;
μ₁ − μ₂ is the effect size (the difference in means).
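And the analogous sketch for a difference in means (same caveats):

```python
import math

def n_per_group_means(delta, sd, z_alpha=1.96, z_beta=0.84):
    """Sample size per group for detecting a difference in means of `delta`."""
    return math.ceil(2 * sd**2 * (z_alpha + z_beta)**2 / delta**2)

# Example: detect a 4-point difference in IQ change when the SD is 10.
print(n_per_group_means(delta=4.0, sd=10.0))   # 2·100·(2.8)²/16 = 98 per group
```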

91 Sample size calculators on the web…

92 These sample size calculations are idealized
They do not account for losses to follow-up (prospective studies).
They do not account for non-compliance (intervention trials and RCTs).
They assume that individuals are independent observations (not true in clustered designs).
Consult a statistician!

93 Review Question 5
Which of the following elements does not increase statistical power?
a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05
d. A larger effect size

94 Review Question 5 (answer)
Which of the following elements does not increase statistical power?
Answer: c. A significance level of .01 rather than .05. (A more stringent significance level shrinks the rejection region and therefore reduces power.)

95 Review Question 6
Most sample size calculators ask you to input a value for σ. What are they asking for?
a. The standard error
b. The standard deviation
c. The standard error of the difference
d. The coefficient of variation
e. The variance

96 Review Question 6 (answer)
Most sample size calculators ask you to input a value for σ. What are they asking for?
Answer: b. The standard deviation.

97 Review Question 7
For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is 10 in your sample size formula?
a. Standard deviation
b. Mean change
c. Effect size
d. Standard error
e. Significance level

98 Review Question 7 (answer)
For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is 10 in your sample size formula?
Answer: c. Effect size.

99 Homework
Problem Set 3
Reading: continue reading the textbook
Reading: p-value article
Journal article/article review sheet

