Statistical Inference: Hypothesis Testing

Presentation transcript: "Statistical Inference: Hypothesis Testing"

1 Statistical Inference: Hypothesis Testing
© Scott Evans, Ph.D. and Lynne Peeples, M.S.

2 The Big Picture: Populations and Samples
Population parameters: μ, σ, σ²
Sample statistics: x̄, s, s²

3 Statistical Inference
Inferences about a population are made based on a sample.
Inferences about population parameters (e.g., the population mean μ) are made by examining sample statistics (e.g., the sample mean and sample standard deviation).

4 Statistical Inference
Two primary (and closely related) approaches:
Hypothesis testing
Estimation
  Point estimation: the sample mean is an estimate of the population mean
  Interval estimation: confidence intervals

5 Scientific Method
Hypothesis testing is rooted in the scientific method.
To prove something: assume that its complement is true, then look for evidence to disprove the complement.

6 Hypotheses
A prediction about:
A population parameter, or
The relationship between parameters from two or more populations
Studies are either:
Hypothesis generating (exploratory studies), or
Hypothesis confirming, for which hypotheses are stated BEFORE beginning the study

7 Hypothesis Testing
Used to evaluate hypotheses
Often performed using p-values

8 Hypotheses
Null hypothesis (denoted H0): often "no effect" or "no difference", e.g., H0: μ1 = μ2. We assume it is true but look for evidence to disprove it.
Alternative hypothesis (denoted HA): often indicates the presence of an effect or a difference between groups. We try to identify supporting evidence.

9 Hypotheses
Well-formulated hypotheses are testable using quantifiable measures.
Identifying specific hypotheses is often challenging; a common problem is that hypotheses are too vague.
What is the research question of interest? Answering this requires discussion and careful thought.

10 Hypothesis Testing
Make a decision to reject (or fail to reject) H0 by comparing what is observed to what is expected if H0 is true (i.e., via p-values).
Note that the hypotheses concern (unknown and unobservable) population parameters; we make decisions about these parameters based on the (observable) sample statistics.

11 Hypothesis Testing
Analogous to a court trial, where people are assumed innocent until proven guilty:
H0: Innocent
HA: Guilty
If the verdict is Guilty (i.e., reject H0), then we can say that enough evidence was found to prove guilt (to some standard of proof).

12 Hypothesis Testing
If the verdict is Not Guilty (i.e., do not reject H0):
We cannot say that we have proven innocence.
We say only that we failed to find enough evidence to prove guilt.
There is a subtle but important difference between the two: "Absence of evidence is not evidence of absence."

13 Hypothesis Testing
Evidence is used to disprove H0.
We can prove the alternative hypothesis to some standard of proof.
We cannot prove the null hypothesis; we can only fail to reject it, because we cannot prove what we have already assumed to be true.
Thus we do not "accept" H0; we simply fail to reject it.

14 Hypothesis Testing
The gap between the population and the sample always requires a leap of faith.
The only statements that can be made with certainty are of the form:
"Our conclusions are supported by the data."
"Our data are consistent with the theory that ..."

15 P-values
Used to make decisions concerning hypotheses
A (conditional) probability, ranging from 0 to 1
Based upon the notion of repeated sampling

16 P-values
The probability of observing a result (data) as extreme as or more extreme than the one observed, due to chance alone, given that H0 is true: P(data | H0).
It is NOT P(hypothesis | data).
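The definition above can be made concrete with a small simulation. This is an illustrative sketch, not from the slides: we observe 16 heads in 20 coin flips and ask how often chance alone (a fair coin, i.e., H0 true) produces a result at least that extreme in either direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_flips, observed_heads, n_sims = 20, 16, 200_000

# Simulate many experiments under H0: P(heads) = 0.5
heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)

# Two-sided p-value: fraction of simulated experiments as far or farther
# from the expected 10 heads than the observed 16 (i.e., <= 4 or >= 16)
extreme = np.abs(heads - n_flips / 2) >= abs(observed_heads - n_flips / 2)
p_value = float(extreme.mean())
print(round(p_value, 3))  # close to the exact binomial value of about 0.012
```

Note that the simulation conditions on H0 being true throughout, which is exactly why a p-value is P(data | H0) and not P(hypothesis | data).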

17 P-values
If a p-value is small, then the probability of seeing data like those observed is very small if H0 is true.
In other words, either:
We are seeing a very rare event (due to chance), or
H0 is not true.

18 P-values
P-value is small → reject H0
P-value is not small → fail to reject H0
What is "small"? An arbitrary (but strategic) choice by the investigator; often 0.05 (or 0.10 or 0.01).

19 P-values
Although commonly reported, p-values have limitations:
They draw an arbitrary line in the sand, e.g., p = 0.051 (fail to reject) vs. p = 0.049 (reject).
Rare events do occur by chance.

20 P-values
Do not reflect the strength of a relationship
Do not provide information concerning the magnitude of the effect
It is important to check clinical relevance regardless of the p-value

21 P-values
P-values are a function of sample size:
Be careful when N is very large (everything is significant): p-values are small, so it is important to check for clinical relevance.
Be careful when N is very small (everything is non-significant): it is difficult to obtain small p-values unless effect sizes are very large.
A p-value is a random variable that varies from sample to sample, so it is not appropriate to compare p-values from different experiments.

22 P-values
P-values are NOT binary statistics; they are continuous. Remember the definition.
They are only one tool for assessing the evidence.
Example: FDA Advisory Committee Meeting, NeuroStar System for Major Depressive Disorder, p = 0.057.

23
                      Truth
Test result           H0 true             H0 false
Reject H0             Type I error (α)    Correct
Do not reject H0      Correct             Type II error (β)

24 α (Type I Error)
α = P(rejecting H0 | H0 is true)
The probability of rejecting when we should not reject
Also called the significance level
This is under our control; conventionally, we often choose α = 0.05

25 β (Type II Error)
β = P(failing to reject H0 | HA is true)
Power = 1 − β = P(rejecting H0 | HA is true), the probability of rejecting when we should reject
Controlled with sample size: larger sample size → higher power
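The sample-size/power relationship can be checked by simulation. This is a minimal sketch with illustrative assumptions (a true effect of 0.5 standard deviations, α = 0.05, a one-sample t-test); none of these numbers come from the slides.

```python
import numpy as np
from scipy import stats

def simulated_power(n, effect=0.5, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power: the fraction of simulated studies that reject H0
    when HA is actually true (true mean = effect, sd = 1)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = rng.normal(loc=effect, scale=1.0, size=n)  # HA is true
        if stats.ttest_1samp(sample, popmean=0.0).pvalue < alpha:
            rejections += 1
    return rejections / n_sims  # estimated power = 1 - beta

# Larger sample size -> higher power
print(simulated_power(n=10), simulated_power(n=40))
```

Quadrupling the sample size here raises the estimated power substantially, which is exactly the "control with sample size" point above.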

26 Determining α and β
α is chosen by the investigator (or FDA).
β is chosen by the investigator in prospectively designed studies (e.g., clinical trials) where the investigator determines the sample size.
If the investigator does not control the sample size (e.g., when using historical data), then β is not controlled for and is simply a function of the sample size.

27 Determining α and β
Weigh the cost of an α error against that of a β error.
α and β are inversely related, similar to the sensitivity/specificity trade-off.
Examples of similar trade-offs: car alarms, biosurveillance.

28 Statistical vs. Practical Significance
Statistical significance (e.g., p < 0.05) does not imply practical relevance.
Results should be both (1) statistically and (2) practically significant in order to influence policy.
Example: a drug may induce a statistically significant reduction in blood pressure. However, if this reduction is 1 mmHg in systolic BP, then it is not a useful (practical and clinically relevant) drug.
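The blood-pressure example can be sketched with simulated numbers: a hypothetical drug whose true effect is only a 1 mmHg drop in systolic BP (sd 15 mmHg, both values illustrative assumptions). With a huge sample the effect is statistically significant anyway, which is why the magnitude must be checked separately.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# 100,000 patients: true mean change -1 mmHg, sd 15 mmHg
bp_changes = rng.normal(loc=-1.0, scale=15.0, size=100_000)

result = stats.ttest_1samp(bp_changes, popmean=0.0)
print(result.pvalue < 0.05)         # statistically significant...
print(round(float(bp_changes.mean()), 2))  # ...but only about a 1 mmHg drop
```

The p-value alone says nothing about whether a 1 mmHg reduction matters clinically; the estimated effect size does.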

29 Hypothesis Testing
Test only relevant hypotheses.
Multiple testing → multiple opportunities for error.
Example: a food supplement for sexual dysfunction.

30 Example - Interpretation
NCEP recommends LDL < 100 mg/dL for patients with two or more CHD risk factors.
A new drug is being evaluated to investigate whether it lowers LDL levels in such patients.
A group of 25 patients with high LDL levels and ≥2 CHD risk factors is recruited for a study of the new drug.

31 Example - Interpretation
The LDL level of each patient is measured before the drug is given and again after 12 weeks of treatment with the drug.
The change (i.e., post − pre, Δ) is calculated for each patient.

32 Example - Interpretation
The study investigators made the following statement: "Data from this study are consistent with the hypothesis that this drug lowers LDL in patients with elevated LDL and ≥2 CHD risk factors (mean(Δ) = -30, p = 0.008)."
What does this mean?

33 Example - Interpretation
Wherever there is a p-value, there is a hypothesis test: determine H0 and HA.
In this case:
H0: mean(Δ) = 0, i.e., the drug has no effect
HA: mean(Δ) ≠ 0
Although these hypotheses are not stated explicitly, they are implied by the language. This can be tricky when reading journal articles.
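The test implied by the quoted statement is a one-sample t-test of H0: mean(Δ) = 0 against HA: mean(Δ) ≠ 0. A sketch follows; the study's raw data are not given, so the 25 Δ values below are simulated stand-ins with a true mean near the reported -30 (the per-patient sd of 25 is an arbitrary assumption).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated post - pre LDL changes for 25 hypothetical patients
delta = rng.normal(loc=-30.0, scale=25.0, size=25)

# One-sample t-test of H0: mean(delta) = 0 vs. HA: mean(delta) != 0
result = stats.ttest_1samp(delta, popmean=0.0)
print(round(float(delta.mean()), 1), round(float(result.pvalue), 4))
```

With a true effect this large relative to its standard error, the test rejects H0, mirroring the reported p = 0.008.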

34 Example - Interpretation
There are two possible reasons for observing mean(Δ) = -30:
Chance (sampling variation)
The drug truly decreases LDL
P-values help distinguish between the two by indicating the probability that chance alone produces data as extreme as those observed.

35 Interpreting the p-value
Assume that H0 is true (i.e., the drug has no effect).
Hypothetically repeat the study a very large number of times and observe the changes.

36 Interpreting the p-value
Trial 1: mean(Δ) = -25
Trial 2: mean(Δ) = 10
Trial 3: mean(Δ) = 0
Trial 4: mean(Δ) = -5
Trial 5: mean(Δ) = 18
...

37 Interpreting the p-value
Note that some differences are small, some are large, and some are near 0. This is called sampling variability.
Differences between the trials are simply due to the random samples chosen for each study; e.g., observed differences between studies are simply a function of who enrolled.
Remember, we assume H0 is true: the drug has no effect.
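The repeated-trials thought experiment can be run directly: assume H0 is true (no drug effect) and rerun the 25-patient study many times. The per-patient sd of 56.6 below is an assumption back-solved so that |mean(Δ)| ≥ 30 occurs about 0.8% of the time, matching the reported p = 0.008; it is not given in the slides.

```python
import numpy as np

rng = np.random.default_rng(7)
n_patients, n_trials = 25, 200_000

# Each row is one hypothetical 25-patient trial under H0 (true mean 0)
deltas = rng.normal(loc=0.0, scale=56.6, size=(n_trials, n_patients))
trial_means = deltas.mean(axis=1)  # one mean(delta) per hypothetical trial

# Some means are negative, some positive, some near 0: sampling variability.
# The fraction as extreme as the observed -30 approximates the p-value.
frac_extreme = float((np.abs(trial_means) >= 30).mean())
print(round(frac_extreme, 3))  # roughly 0.008, the p-value in the example
```

This makes the repeated-sampling definition of a p-value concrete: it is literally the proportion of null-hypothesis trials producing a result as extreme as the one observed.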

38 Remember the Big Picture
Populations and samples
Population parameters: μ, σ, σ²
Sample statistics: x̄, s, s²

39 Interpreting the p-value
p = 0.008 says that if the drug has no effect (i.e., H0 is true), then the probability of observing a result as extreme as or more extreme than the one we observed in this study is 0.008.
That is, 0.8% of the hypothetical trials on our list would have differences that large.

40 Interpreting the p-value
We often reject at 0.05 (our standard of proof).
If p < 0.05, then the probability of observing a result this extreme when the drug has no effect is too small (a very rare event).
Thus we reject the null hypothesis that the drug has no effect and conclude that the evidence suggests a drug effect.

41 Interpreting the p-value
Note that when the drug truly has no effect, we will incorrectly conclude that it does 5% of the time.

42 Hypothesis Testing – Step by Step
1. State H0 (i.e., what you are trying to disprove)
2. State HA
3. Determine α (at your discretion)
4. Determine the test statistic and associated p-value
5. Decide whether to reject H0 or fail to reject H0
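The five steps above can be sketched end to end with a two-sample t-test. All the data here are simulated, and the group means, sds, and sizes are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Steps 1-2: state the hypotheses. H0: mu1 = mu2; HA: mu1 != mu2.
control = rng.normal(loc=120.0, scale=10.0, size=30)  # hypothetical group 1
treated = rng.normal(loc=105.0, scale=10.0, size=30)  # hypothetical group 2

# Step 3: determine alpha (at our discretion)
alpha = 0.05

# Step 4: determine the test statistic and associated p-value
t_stat, p_value = stats.ttest_ind(control, treated)

# Step 5: reject H0 or fail to reject H0
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision)
```

The same five-step skeleton applies no matter which test fills step 4.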

43 Hypothesis Tests
There are thousands of statistical hypothesis tests. Examples:
t-tests
Chi-square tests
Fisher's exact test
Log-rank test
Mann-Whitney test
Score test
Wald test
Wei-Johnson test
Evans-Li test
Wilcoxon test
Hosmer-Lemeshow test
...
It is impossible for even an experienced statistician to know more than a fraction of them (and more are developed over time), but all use this hypothesis-testing framework as a foundation.

51 When p-values are low...
...thus indicating an effect, the natural question is "what is the effect?"
Answer this question with confidence intervals (next class).

52 When p-values are high...
This does not imply "no effect."
It only implies that you failed to exclude the possibility of "no effect."
Construct confidence intervals to identify the effect sizes that can be "ruled out" (i.e., are inconsistent with your data).
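A sketch of pairing a p-value with a confidence interval, so that a high p-value is not misread as "no effect." The data are simulated with a small true effect; `confidence_interval()` on the t-test result requires a recent SciPy (1.10 or later).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# 15 hypothetical change scores: small true effect, large variability
changes = rng.normal(loc=-2.0, scale=20.0, size=15)

result = stats.ttest_1samp(changes, popmean=0.0)
ci = result.confidence_interval(confidence_level=0.95)
print(round(float(result.pvalue), 3))
print(round(float(ci.low), 1), round(float(ci.high), 1))
# The interval shows which effect sizes the data do and do not rule
# out, which is far more informative than the p-value alone
```

With only 15 noisy observations the interval is wide: many effect sizes, including clinically important ones, remain consistent with the data even when p > 0.05.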

53 Thus...
ALWAYS PRESENT CONFIDENCE INTERVALS WITH P-VALUES.
To be continued...

