
Multiple Testing in Impact Evaluations: Discussant Comments IES Research Conference June 11, 2008 Larry L. Orr.


1 Multiple Testing in Impact Evaluations: Discussant Comments IES Research Conference June 11, 2008 Larry L. Orr


3 The Guidelines
- Just a couple of comments on Peter's presentation: sensible advice, masterfully presented; I urge you to adopt the guidelines
- On the open issue of adjusting exploratory tests, I come down on the side of adjusting (with a lower significance threshold)
- The remainder of my remarks focuses on an issue on which Peter was (appropriately) agnostic: which adjustment should we use, and what is its effect on power?
Disclaimer: I was a member of the working group that developed the guidelines Peter presented. My remarks today represent my own views, not those of the working group.

4 Different adjustments deal with different issues
- Many procedures (e.g., Bonferroni, Holm, Tukey-Kramer) control the Family-wise Error Rate (FWER), the probability of making any false positive across the family of tests. That's not usually what concerns us
- Typical situation: we have some set of estimates that are significant by conventional standards; we want to be assured that most of them reflect real effects – i.e., we're concerned with the False Discovery Rate
- Benjamini-Hochberg attempts to control the false discovery rate

5 The False Discovery Rate (FDR)
- FDR = proportion of significant estimates that are false positives (Type I errors)
- Example: suppose we have 20 statistically significant estimates, of which 8 are true nonzero impacts and 12 are false positives. Then FDR = 0.6 (= 12/20)
- Low FDR is good
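The arithmetic of the example can be checked directly. A minimal sketch, using the slide's hypothetical numbers:

```python
# Hypothetical numbers from the slide: 20 significant estimates,
# of which 8 reflect true nonzero impacts and 12 are false positives.
significant = 20
true_nonzero = 8
false_positives = significant - true_nonzero  # 12

fdr = false_positives / significant
print(fdr)  # 0.6
```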

6 An Example
Suppose we estimate impacts on 4 outcomes for each of the following subgroups:
- Gender (2 groups)
- Ethnicity (4 groups)
- Region (4 groups)
- School size (2 groups)
- Central city/Suburban/Rural (3 groups)
- SES (3 groups)
- Number of siblings (4 groups)
- Pretest score (3 groups)
That's 100 estimates – not atypical for an education study
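The tally can be verified against the 25-subgroup total cited in the simulations later in the deck (gender's group count of 2 is implied by that total rather than stated on the slide):

```python
# Group counts as listed on the slide; gender = 2 is implied by the
# 25-subgroup total used in the simulations.
subgroup_counts = {
    "gender": 2, "ethnicity": 4, "region": 4, "school size": 2,
    "central city/suburban/rural": 3, "SES": 3,
    "number of siblings": 4, "pretest score": 3,
}
outcomes = 4

subgroups = sum(subgroup_counts.values())
print(subgroups)             # 25 subgroups
print(subgroups * outcomes)  # 100 estimates
```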


8 Example (cont'd)
Suppose 10 estimates are significant at the .05 level. That might reflect:
- 10 true nonzero impacts
- 9 true nonzero impacts and 1 false positive
- 8 true nonzero impacts and 2 false positives
- …
Expected mix = 5 true nonzero impacts, 5 false positives; this would imply FDR = 50%. But you can never know what the actual mix is, and you cannot know which estimate is which

9 [Chart] Expected FDR as a function of the proportion of true impacts that are nonzero (assumes no MC adjustment; significance level = .05; power = .80)

10 Implications
- When only 5% of all true impacts are nonzero, FDR ≈ .5 – i.e., half of the significant estimates are likely to be Type I errors (but you cannot know which ones they are!)
- FDR is quite high until the proportion of true impacts that are nonzero rises above 25%
- Only when more than 50% of true impacts are nonzero is the FDR relatively low (< .06)
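These figures follow from a simple expression: with per-test significance level α, power p against the true effects, and a fraction π of tests whose true impact is nonzero, the expected FDR is α(1 − π) / (α(1 − π) + pπ). A sketch (the function name is mine):

```python
# Expected FDR with no multiple-comparisons adjustment, assuming
# significance level alpha and power `power` against true effects:
#   FDR = alpha*(1 - pi) / (alpha*(1 - pi) + power*pi)
def expected_fdr(pi, alpha=0.05, power=0.80):
    false_pos = alpha * (1 - pi)  # expected false positives per test
    true_pos = power * pi         # expected true positives per test
    return false_pos / (false_pos + true_pos)

for pi in (0.05, 0.25, 0.50):
    print(f"pi = {pi:.2f}  expected FDR = {expected_fdr(pi):.3f}")
```

With π = .05 this gives an FDR of about .54, and with π = .50 it drops below .06, matching the values on the slide.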

11 Simulations
- Real education data from the ECLS-K Demo
- 4 outcomes: reading, math, attendance, peers
- 25 subgroups (see earlier list)
- Imputed zero or nonzero (ES = .2) impacts for varying proportions of subgroups
- Measured FDR with and without B-H correction
- 500 replications of 100 estimates
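For reference, the Benjamini-Hochberg step-up rule applied in these simulations works as follows: sort the m p-values, find the largest k such that p(k) ≤ (k/m)·q, and reject the hypotheses with the k smallest p-values. A minimal sketch (not the actual simulation code):

```python
# Benjamini-Hochberg step-up procedure: returns a list of booleans,
# True where the corresponding hypothesis is rejected at FDR level q.
def benjamini_hochberg(pvalues, q=0.05):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose sorted p-value meets its threshold (k/m)*q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
print(benjamini_hochberg(pvals))
```

Note the step-up character: a p-value can be rejected even if a smaller-ranked one misses its own threshold, as long as some larger rank k still satisfies p(k) ≤ (k/m)·q.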

12 [Chart] Simulation results: FDR as a function of the true zero-impact rate, unadjusted vs. B-H adjusted. Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration

13 Implications
- B-H does indeed control the FDR in real-world education data (at least, in these real-world education data)
- Even at very low nonzero impact rates, FDR is well below 5%
- This comes at a price, however…

14 [Chart] The effect of the B-H adjustment on Type II errors (adjusted vs. unadjusted series). Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration

15 The cost of adjusting for multiple comparisons within a fixed sample
- For a given sample, reducing the chance of Type I errors (false positives) increases the chance of Type II errors (missing true effects)
- In this case, for very low nonzero impact rates, the Type II error rate for a typical subgroup (the probability of missing a true effect when there is one) went from .28 to .70 (i.e., power fell from .72 to .30!)
- For high nonzero impact rates, the power loss is much smaller – when the nonzero impact rate is 95%, adjustment increases the Type II error rate only from .27 to .33 (i.e., power falls from .73 to .67)

16 Does this mean we must sacrifice power to deal with multiple comparisons?
- Yes – if you have already designed your sample ignoring the MC problem
- BUT… if you take the adjustment into account at the design stage, you can build the power loss associated with MC adjustments into the sample size calculation and maintain power
- This means, of course, larger samples and more expensive studies (sorry about that, Phoebe)
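The design-stage adjustment can be roughed out with a standard normal-approximation power formula. This is my own illustration, not from the presentation: for a two-sided test, the required n scales with (z(1−α/2) + z(power))², so tightening the per-test significance level from α to some adjusted level inflates n by the ratio of those factors.

```python
from statistics import NormalDist  # Python 3.8+

def n_inflation(alpha, alpha_adj, power=0.80):
    """Factor by which n must grow when the per-test significance
    level tightens from alpha to alpha_adj, holding power fixed
    (two-sided test, normal approximation)."""
    z = NormalDist().inv_cdf
    base = (z(1 - alpha / 2) + z(power)) ** 2
    adjusted = (z(1 - alpha_adj / 2) + z(power)) ** 2
    return adjusted / base

# Illustration: a Bonferroni-style move from .05 to .005 (10 tests)
# inflates the required sample size by roughly 70%.
print(round(n_inflation(0.05, 0.005), 2))
```

(B-H's effective thresholds are data-dependent, so this Bonferroni-style ratio is only a conservative planning bound, not the exact B-H power cost.)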

17 For a copy of this presentation, send an e-mail to: Larry.Orr@comcast.net

