1. Comparing Results from RCTs and Quasi-Experiments That Share the Same Intervention Group
Thomas D. Cook, Northwestern University
2. Why RCTs Are to Be Preferred
- Statistical theory re expectations
- Relative advantage over other bias-free methods, e.g., regression-discontinuity (RDD) and instrumental variables (IV)
- Ad hoc theory and research on implementation
- Privileged credibility in science and policy
- Claim that non-experimental alternatives routinely fail to produce similar causal estimates
3. Dissimilar Estimates
- Come from empirical studies comparing experimental and non-experimental results on the same topic
- Strongest are within-study comparisons
- These take an experiment, throw out the control group, and substitute a non-equivalent comparison group
- Given that the intervention group is a constant, this is a test of the different control groups
4. The Within-Study Comparison Literature
- 20 studies, mostly in job training. Of the 14 in job training, reviews contend:
  (1) No study produces a clearly similar causal estimate, including Dehejia & Wahba
  (2) Some design and analysis features are associated with less bias, but bias remains
  (3) The average of the experiments is not different from the average of the non-experiments--but be careful here, and note that the variance of the effect sizes differs by design type
5. Brief History of the Literature on Within-Study Comparisons
- LaLonde; Fraker & Maynard
- 12 subsequent studies in job training
- Extension to examples in education in the USA and social welfare in Mexico, never yet reviewed
6. Policy Consequences
- Department of Labor, as early as 1985
- Health and Human Services, job training and beyond
- National Academy of Sciences
- Institute of Education Sciences
- Do within-study comparisons deserve all this?
7. We Will:
- Deconstruct "non-experiment" and compare experimental estimates to:
  1. Regression-discontinuity estimates
  2. Estimates from difference-in-differences (fixed effects) designs
- Ask: Is the general conclusion about the inadequacy of non-experiments true across at least these different kinds of non-experiment?
8. Criteria of a Good Within-Study Comparison Design
1. Variation in mode of assignment--random or not
2. No third variables correlated with both assignment and outcome--e.g., measurement
3. Randomized experiment properly executed
4. Quasi-experiment a good instance of its "type"
5. Both design types estimate the same causal entity--e.g., LATE in regression-discontinuity
6. Acceptable criteria of correspondence between design types--effect sizes seem similar; do not formally differ; statistical significance patterns do not differ; etc.
9. Experiments vs. Regression-Discontinuity Design Studies
10. Three Known Within-Study Comparisons of Experiments and R-D
- Aiken, West et al. (1998): R-D study; experiment; LATE; analysis; results
- Buddelmeyer & Skoufias (2003): R-D study; experiment; LATE; analysis; results
- Black, Galdo & Smith (2005): R-D study; experiment; LATE; analysis; results
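All three comparisons turn on the quantity an RDD identifies: the local average treatment effect (LATE) at the assignment cutoff. As a minimal sketch, assuming a sharp design, a fixed bandwidth, and illustrative variable names (this is not the estimator any of the three studies actually used):

```python
import numpy as np

def rdd_late(score, outcome, cutoff, bandwidth):
    """Sharp-RDD sketch: one local linear regression with a treatment
    indicator, the centered running variable, and their interaction.
    The coefficient on the indicator is the estimated jump at the
    cutoff -- the LATE an RDD identifies."""
    keep = np.abs(score - cutoff) <= bandwidth   # keep units near the cutoff
    s = score[keep] - cutoff                     # centered running variable
    y = outcome[keep]
    t = (s >= 0).astype(float)                   # treated at/above the cutoff
    X = np.column_stack([np.ones_like(s), t, s, t * s])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]                               # discontinuity estimate
```

Because this estimate is local to the cutoff, comparing it with an experiment's average effect is only fair when the experimental sample is drawn from the same region of the assignment variable, as criterion 5 above requires.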
11. Comments on R-D vs. Experiments
- Cumulative correspondence demonstrated over three cases
- Is this theoretically trivial, though?
- Is it pragmatically significant, given variation in implementation in both the experiment and R-D?
- As an "existence proof," it belies the over-generalized argument that non-experiments don't work
- As a practical issue, does it mean we should support RDD when treatments are assigned by need or merit?
- Emboldens us to deconstruct "non-experiment" further
12. Experiments vs. Difference-in-Differences
- Most frequent non-experimental design by far across many fields of study
- Also modal in within-study comparisons in job training, so it provides the major basis for the past opinion that non-experiments are routinely biased
- We review: 3 studies with comparable estimates; 14 job training studies with dissimilar estimates; 2 education examples with dissimilar estimates
13. Bloom et al.
- Bloom et al. (2002; 2005)--job training the topic
- Experiment: 11 sites; 8 pre-program earnings waves; 20 post
- Non-experiment: 5 within-state comparisons, 4 within-city; all comparison subjects enrolled in welfare
- We present only the control/comparison contrast, because the treatment time series is a constant
14. The Issue Is:
- Is there an overall difference between control groups randomly or non-randomly formed?
- If yes, can statistical controls (OLS, IV including Heckman models, propensity scores, random growth models) eliminate this difference?
- Tested 10 modes of adjustment, but only one longitudinal
- Why we treat this as difference-in-differences rather than ITS (see the sketch below)
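Since the slide frames this as difference-in-differences, here is a minimal sketch of that estimator applied to the control/comparison contrast; the function name and the two-period simplification are illustrative assumptions, not Bloom et al.'s actual specification:

```python
import numpy as np

def did_bias(pre_control, post_control, pre_comp, post_comp):
    """Difference-in-differences on the control/comparison contrast:
    the over-time change in mean earnings for the randomized control
    group minus the change for the non-equivalent comparison group.
    With the intervention group held constant, a nonzero value
    estimates selection bias rather than a treatment effect."""
    return (np.mean(post_control) - np.mean(pre_control)) - \
           (np.mean(post_comp) - np.mean(pre_comp))
```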
17. Implications of Bloom et al.
- Averaging across the 4 within-city sites showed no difference; also true if the 5th, between-city site is added
- Selecting within-city comparisons obviated the need for statistical adjustments for non-equivalence; design alone did it
- Bloom et al. tested differential effects of statistical adjustments in between-state comparisons, where there were large differences
- None worked or did better than OLS
18. Aiken et al. (1998) Revisited
- The experiment: remember that the sample was selected on a narrow range of test-score values
- Quasi-experiment: sample selection limited to students who registered late or could not be found in summer, but who scored in the same range as the experiment
- No differences between experiment and non-experiment on test scores or pretest writing tests
- Measurement identical in experiment and non-experiment
19. Results for Aiken et al.
- Writing standardized test: effect sizes .59 and .57 (both significant)
- Rated essay: .06 and .16 (both non-significant)
- High degree of comparability in statistical test results and effect-size estimates
20. Implications of Aiken et al.
- Like Bloom et al., careful selection of the sample gets close correspondence on important observables
- Little need for statistical adjustment; non-equivalence limited only to unobservables
- Statistical adjustment is minor compared to using the sampling design to construct initial correspondence
21. What Happens If There Is an Initial Selection Difference?
- Shadish, Luellen & Clark (2006)
22. Figure 1: Design of Shadish et al. (2006)
- N = 445 undergraduate psychology students: pretests, then random assignment to one of two arms
- Randomized experiment (n = 235): randomly assigned to mathematics training (n = 119) or vocabulary training (n = 116)
- Nonrandomized experiment (n = 210): self-selected into mathematics training (n = 79) or vocabulary training (n = 131)
- All participants measured on both mathematics and vocabulary outcomes
23. What's Special in Shadish et al.
- Variation in mode of assignment
- Holds constant most other factors through the first random assignment--population, measures, activity patterns
- Good experiment? Pretests; short term and attrition; no chance for contamination
- Good quasi-experiment? Selection process; quality of measurement; analysis and the role of Rosenbaum
25. Implications of Shadish et al.
- Here the sampling design produced non-equivalent groups on observables, unlike Bloom
- Here the statistical adjustments worked when computed as propensity scores
- However, big overlap in experimental and non-experimental scores, due to first-stage random assignment, made the propensity scores more valid
- Extensive, unusually valid measurement of a relatively simple, though not homogeneous, selection process
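As a minimal sketch of the kind of propensity-score adjustment the slide credits: inverse-probability weighting is shown here as one common variant, and the interface is an illustrative assumption; Shadish et al. used their own set of propensity-score analyses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_effect(X, treated, y):
    """Propensity-score adjustment via inverse-probability weighting:
    model Pr(treated | covariates), then reweight each group so the
    covariate distributions match before comparing outcome means."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)        # trim to guard against extreme weights
    treated = np.asarray(treated, dtype=float)
    w_treat = treated / ps              # weights for self-selected trainees
    w_comp = (1 - treated) / (1 - ps)   # weights for the comparison group
    return np.average(y, weights=w_treat) - np.average(y, weights=w_comp)
```

The slide's point about overlap matters here: the weights are only well behaved when estimated propensities stay away from 0 and 1, which the first-stage random assignment helped ensure.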
26. Limitations of Shadish et al.
- What about more complex settings?
- What about more complex selection processes?
- What about OLS and other analyses?
- This is not a unique test of propensity scores!
27. Examining Within-Study Comparisons with Different Results
- The bulk of the job training comparisons
- Two examples from education
28. Earliest Job Training Studies: Adding to the Smith/Todd Critique
- Mode of assignment clearly varied
- We assume the RCT was implemented reasonably well
- But third-variable irrelevancies were not controlled, especially location and measurement, given the dependence on matching from extant data sets
- Large initial differences between randomly and non-randomly formed comparison groups
- Reliance on statistical adjustment to reduce selection bias, rather than on initial design
30. Agodini & Dynarski (2004)
- Drop-out prevention experiment, 16 middle/high schools
- Individual students, likely dropouts, were randomly assigned within schools: 16 replicates
- Quasi-experiment: students matched from 2 quite different sources, middle-school controls in another study and national NELS data
- Matching on individual and school demographic factors
- 4 outcomes examined, and so in the non-experiment 128 propensity scores (16 x 4 x 2), computed basically from demographic background variables
31. Results
- In only 29 of 128 cases were balanced matches obtained
- Why was quality matching so rare? In the non-experiment, the groups hardly overlap: the treatment group spans high and middle schools, but the comparisons are middle-school only or from a very non-local national data set
- Mixed pattern of outcome correspondences in the 29 cases with computable propensity scores. Not good
- OLS did as well as propensity scores
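The "balanced matches" verdict rests on covariate balance diagnostics. The paper's exact test is not reproduced here; a common diagnostic, sketched under that assumption, is the standardized mean difference per covariate:

```python
import numpy as np

def standardized_difference(x_treat, x_comp):
    """Balance diagnostic for one covariate: difference in group means
    divided by the pooled standard deviation. A common rule of thumb
    (an assumption here, not necessarily the paper's criterion) calls
    |d| < 0.1 balanced after matching."""
    pooled_sd = np.sqrt((np.var(x_treat, ddof=1) + np.var(x_comp, ddof=1)) / 2)
    return (np.mean(x_treat) - np.mean(x_comp)) / pooled_sd
```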
32. Critique
- Who would design a quasi-experiment this way? Is a mediocre non-experiment being compared to a good experiment?
- Alternative designs might have been:
  1. Regression-discontinuity
  2. Local comparison schools, with the same selection mechanism used to select similar comparison students
  3. Use of multi-year prior achievement data
33. Wilde & Hollister (2005)
- The experiment: reducing class size in 11 sites; no pretest used at the individual level
- Quasi-experimental design: individuals in reduced classes matched to individual cases from the other 10 sites
- Propensity scores, mostly demographic
- Analysis treats each site as a separate experiment
- And so 11 replicates comparing an experimental and a non-experimental effect size
34. Results
- Low level of correspondence between experimental and non-experimental effect sizes across the 11 sites
- So for each site it makes a causal difference whether the estimate comes from the experiment or the quasi-experiment
- When aggregated across sites, results are closer: experimental ES = .68; non-experimental ES = 1.07
- But they do reliably differ
35. Critique
- Who would design a quasi-experiment on this topic without a pretest on the same scale as the outcome?
- Who would design it with these controls? Instead, select controls from one or more schools matched on prior achievement history
- Again, a good experiment is being compared to a bad quasi-experiment
- Who would treat this as 11 separate experiments rather than a more stable pooled experiment? Even for the authors, pooled results are much more congruent
36. The Hypothesis Is That...
- The job training and educational examples that produce different conclusions from the experiment are examples of poor quasi-experimental design
- To compare a good experiment to a poor quasi-experiment is to confound a design type with the quality of its implementation: a logical fallacy
- But I reach this conclusion ex post facto, knowing the randomized experimental results in advance
37. Big Conclusions
- R-D has given results not much different from the experiment in three of three cases
- Simpler quasi-experiments tend to give the same results as the experiment if (a) there is population matching in the sampling design, as in the Bloom and Aiken studies, or (b) there is careful conceptualization and measurement of the selection model, as in Shadish et al.
38. What I Am Not Concluding
- That a well-designed quasi-experiment is as good as an experiment. They differ in:
  - Number and transparency of assumptions
  - Statistical power
  - Knowledge of implementation
  - Social and political acceptance
- If you have the option, do an experiment, because you can rarely put right by statistics what you have messed up by design
39. What I Am Suggesting You Consider
- Whether this should be a unit on RCTs or on quality causal studies
- Whether you want to do RDD studies in cases where an experiment is not possible because resources are distributed by other means
- Whether you want to do quasi-experiments if group matching on the pretest is possible, as in many school-level interventions
40. More Contentiously, Consider Quasi-Experiments If:
- The selection process can be conceptualized, observed, and measured very well
- An abbreviated ITS analysis is possible, as in Bloom et al.
- The instinct to avoid quasi-experiments is correct, but it reduces the scope of the causal issues that can be examined
45. Results: Aiken et al.
- Pretest values on SAT/CAT and 2 writing measures; measurement framework the same
- Pretest ACTs and writing: non-significant differences, experiment vs. non-experiment
- OLS tests
- Results for writing test: ES = .59, significant
- Results for essay: ES = .06, non-significant
46. Bloom et al. Revisited
- Analysis at the individual level
- Within city, within welfare-to-work center, same measurement design
- Absolute bias: yes
- Average bias: none across the 5 within-state sites, even without statistical tests
- Average bias limited to the small site and the non-within-city site (Detroit vs. Grand Rapids)
47. Correspondence Criteria
- Random error means no exact agreement
- Shared statistical-significance pattern (vs. zero): 68%
- Two effect sizes not statistically different (see the sketch below)
- "Comparable" magnitude estimates
- One as a percent of the other
- Indulgence, common sense, and a mix
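For the criterion that the two effect sizes should not be statistically different, here is a minimal sketch of the usual large-sample test, assuming independent estimates with known standard errors (the interface is illustrative, not the talk's procedure):

```python
import math

def es_difference_test(d1, se1, d2, se2):
    """z-test for the difference between two independent effect-size
    estimates (e.g., experimental vs. non-experimental standardized
    mean differences), given each estimate's standard error."""
    z = (d1 - d2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided p
    return z, p
```

Note the asymmetry the criteria list implies: failing to reject the difference is weak evidence of correspondence when either estimate is imprecise.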
48. Our Research Issues
- Deconstructing "non-experiment": do experimental and non-experimental effect sizes correspond differently for R-D, for ITS, and for simple non-equivalent designs?
- How far can we generalize results about the invalidity of non-experiments beyond job training?
- Do these within-study comparison studies bear the weight ascribed to them in evaluation policy at DoL and IES?