Understanding p-values Annie Herbert Medical Statistician Research and Development Support Unit email@example.com 0161 2064567
Outline
- Population & Sample
- What is a p-value?
- P-values vs. Confidence Intervals
- One-sided and two-sided tests
- Multiplicity
- Common types of test
- Computer outputs
Timetable
Time      Task
60 mins   Presentation
20 mins   Coffee Break
90 mins   Practical Tasks in IT Room
‘Population’ and ‘Sample’
- Studying population of interest
- Usually would like to know typical value and spread of outcome measure in population
- Data from entire population usually impossible or inefficient/expensive, so take a sample (even census data can have missing values)
- Want sample to be ‘representative’ of population
- Randomise
Randomised Controlled Trial (RCT) [Diagram: Population, then Sample, then Randomisation into Group 1 and Group 2, then Outcome]
5 Key Questions
- What is the target population?
- What is the sample, and is it representative of the target population?
- What is the main research question?
- What is the main outcome?
- What is the main explanatory factor?
Example – Dolphin Study
Population: people suffering from mild to moderate depression
Sample: outpatients diagnosed with mild to moderate depression, recruited through internet, radio, newspapers and hospitals
Question: does animal-facilitated therapy help treatment of depression?
Outcome: Hamilton depression score at baseline and end of treatment
Explanatory Factor: whether patients participated in dolphin programme (treatment) or outdoor nature programme (control)
Dolphin Study - Making Comparisons
Hamilton Depression Score   Treatment Group (N=15)   Control Group (N=15)
Baseline, Mean (SD)         14.5 (2.6)               14.5 (2.2)
2 Weeks, Mean (SD)          7.3 (2.5)                10.9 (3.4)
Reduction, Mean (SD)        7.3 (3.5)                3.6 (3.4)
Source: BMJ - Antonioli & Reveley, 2005;331:1231 (26 November)
Dolphin Study - does the treatment make a difference? For both groups the Hamilton depression score decreased between baseline and 2 weeks Clearly for our sample the treatment group has a better mean reduction by: 7.3 - 3.6 = 3.7 points What does this tell us about the target population?
What is a p-value? Assume that there is really no difference in the target population (this is the null hypothesis) p-value: how likely is it that we would see at least as much difference as we did in our sample? Dolphin study example: if treatments are equally effective, how likely is it that we would see a difference in mean reduction between the treatment and control groups of at least 3.7 points? P=0.007
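As a rough check on the dolphin study p-value, the sketch below recomputes it from the published summary statistics (reduction mean and SD, N=15 per group). It assumes an equal-variance two-sample t-test; the slides do not say which test produced P=0.007, so this is an illustration, not a reconstruction of the authors' analysis. The helper names t_pdf and two_sided_p are made up for this example, and the tail area is found by simple numerical integration so that only the Python standard library is needed.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: probability of |T| >= |t| under the null.

    Integrates the density from 0 to |t| with Simpson's rule
    (steps must be even) and subtracts twice that area from 1.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area


# Summary statistics for the reduction in Hamilton score (from the slide).
mean_trt, sd_trt, n_trt = 7.3, 3.5, 15
mean_ctl, sd_ctl, n_ctl = 3.6, 3.4, 15

diff = mean_trt - mean_ctl

# Assumption: equal-variance (pooled) two-sample t-test.
df = n_trt + n_ctl - 2
pooled_var = ((n_trt - 1) * sd_trt**2 + (n_ctl - 1) * sd_ctl**2) / df
se = sqrt(pooled_var * (1 / n_trt + 1 / n_ctl))

t_stat = diff / se
p_value = two_sided_p(t_stat, df)

print(f"difference = {diff:.1f}, t = {t_stat:.2f}, df = {df}, p = {p_value:.4f}")
```

Under these assumptions the result should land close to the reported P=0.007; small discrepancies are expected because the inputs are rounded summary figures.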
Assessing the p-value Large p-value: –Quite likely to see these results by chance –Cannot be sure of a difference in the target population Small p-value: –Unlikely to see these results by chance –There may be a difference in the target population
What is a small/large p-value? Cut-off point (‘significance level’) is arbitrary Significance level set to 5% (0.05) by convention Regard the p-value as the ‘weight of evidence’ P < 5%: strong evidence of a difference P ≥ 5%: no evidence of a difference (does not mean evidence of no difference)
Types of Statistical Error Type I Error = rejecting the null hypothesis when it is in fact true; the probability of this is the significance level. Type II Error = not rejecting the null hypothesis when it is false.
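A small simulation can make the Type I error rate concrete. The sketch below is an illustration only: it assumes normally distributed outcomes with a known standard deviation of 1, draws both groups from the same population (so the null hypothesis is true), and applies a two-sided z-test at the 5% level. The function name simulate_type_i_rate and all of its parameters are hypothetical choices for this example.

```python
import random
from math import sqrt


def simulate_type_i_rate(n_trials=4000, n_per_group=15, z_crit=1.96, seed=1):
    """Fraction of simulated null experiments that are (wrongly) 'significant'."""
    rng = random.Random(seed)
    # Standard error of a difference in means when each group has SD 1.
    se = sqrt(2 / n_per_group)
    rejections = 0
    for _ in range(n_trials):
        # Both groups come from the same population: no true difference.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        mean_diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        z = mean_diff / se
        if abs(z) > z_crit:
            rejections += 1  # a Type I error
    return rejections / n_trials


rate = simulate_type_i_rate()
print(f"Proportion of false positives: {rate:.3f}")
```

The proportion should sit near 0.05, which is the significance level: about 1 in 20 experiments rejects the null hypothesis even though there is no difference in the population.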
Confidence Intervals Confidence interval = “range of values that we can be confident will contain the true value of the population” The “give or take a bit” for best estimate Dolphin study example: what is the range of values that we can be confident contains the true difference of mean reduction between treatment and control group? (95% CI: 1.1 to 6.2)
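The interval for the dolphin study can be approximated from the summary statistics as estimate plus or minus a multiplier times the standard error. This sketch assumes a two-sample t interval with 28 degrees of freedom and uses the tabulated multiplier 2.048; the slides do not state how the published interval was calculated, so treat it as an approximation.

```python
from math import sqrt

# Reduction in Hamilton score: mean (SD), N=15 per group (from the slide).
mean_trt, sd_trt = 7.3, 3.5
mean_ctl, sd_ctl = 3.6, 3.4
n = 15

diff = mean_trt - mean_ctl                # best estimate of the difference
se = sqrt(sd_trt**2 / n + sd_ctl**2 / n)  # standard error of the difference

# Assumption: t interval with 2n - 2 = 28 degrees of freedom.
# 2.048 is the tabulated 97.5th percentile of that t distribution.
t_crit = 2.048

ci_low = diff - t_crit * se
ci_high = diff + t_crit * se

print(f"difference = {diff:.1f}, 95% CI: {ci_low:.1f} to {ci_high:.1f}")
```

The result should be close to the reported interval of 1.1 to 6.2. A small difference in the last digit is plausible because the means and SDs used here are already rounded.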
p-values vs. Confidence Intervals
p-value:
- Weight of evidence to reject null hypothesis
- No clinical interpretation
Confidence Interval:
- Can be used to reject null hypothesis
- Clinical interpretation
- Effect size
- Direction of effect
- Precision of population estimate
Statistical Significance vs. Clinical Importance p-value < 0.05, CI doesn’t contain 0: indicates a statistically significant difference. What is the size of this difference, and is it enough to change current practice? E.g. Dolphin study: - P=0.007 - 95% CI = (1.1, 6.2) Expense? Side-effects? Ease of use? Consider clinically important difference when making sample size calculations/interpreting results
One-sided & Two-sided Tests One-sided test: a difference is only possible in one particular direction. Two-sided test: interested in any difference between groups, whether worse or better. Dolphin study example: is the treatment reduction mean less or greater than the control reduction mean? In real life, tests are almost always two-sided.
Multiplicity
Number of tests   Chance of at least one significant value
1                 0.05
2                 0.10
3                 0.14
5                 0.23
10                0.40
20                0.64
E.g. Significance level = 0.05: on average 1 in 20 tests will be ‘significant’, even when there is no difference in the target population
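The multiplicity figures follow from one formula: if each test has a 5% chance of a false positive and the tests are independent (an assumption; real outcomes in one study are often correlated), the chance of at least one 'significant' result in n tests is 1 - (1 - 0.05)^n. A minimal sketch, with the function name chance_of_false_positive chosen for this example:

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Chance of at least one significant result among n independent tests
    when there is no true difference for any of them."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 2, 3, 5, 10, 20):
    chance = chance_of_false_positive(n_tests)
    print(f"{n_tests:>2} tests: {chance:.2f}")
```

With 20 tests the chance is already well over one half, which is why keeping the number of tests small matters.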
Reducing Multiplicity Problems
- Pick one outcome to be primary
- Specify tests in advance
- Focus on research question and keep number of tests to a minimum
- Do not necessarily believe a single significant result (repeat experiment, use meta-analysis)
Types of Outcome Data
Categorical
- Example: Yes/No
- Graphs: Bar/Pie Chart
- Summary: Frequency/Proportion
- Test: Chi-squared
Numerical/Continuous
- Example: Weight
- Graphs: Histogram/Boxplot
- Summary: Mean (SD) or Median (IQR)
- Test (two groups): t-test or Mann-Whitney U
Notable Exceptions
- Comparing more than two groups
- Continuous explanatory factors
- Paired Data: Paired t-test, Wilcoxon, McNemar
- Time-to-event Data: Log-rank test
(For all of the above, seek statistical advice)
Final Pointers
- Plan analyses in advance
  – Seek statistical advice
- Start with graphs and summary statistics
- Keep number of tests to a minimum
- Include confidence intervals
- ‘Absence of evidence is not evidence of absence’