Meta-analysis and the Synthetic Approach Luke Plonsky Current Developments in Quantitative Research Methods Day 2
Traditional Literature Reviews What do they look like? Think of a recent one you wrote: What was your process like? What are their strengths? Weaknesses? (As we discuss the meta-analytic process, keep a topic or domain of yours in mind.)
Meta-analysis as “the way forward”? (Rousseau, 2008, p. 9)
Systematic, transparent, & quantitative means to
- Summarize (all) previous studies (A B; M x N)
- Provide a quantitative indication of a relationship
- Prevent over/under-interpreting results (Norris & Ortega, 2006; Rousseau, 2008)
- Increase statistical power and generalizability across learners, contexts, L2 features, outcomes, etc. (Plonsky, 2012)
- Examine relationships not visible in primary research (A on B when C vs. D)
- Identify substantive and methodological trends, weaknesses, and gaps (Plonsky & Gass, 2011)
Meta-analysis is here! (See Norris & Ortega, 2010; Oswald & Plonsky, 2010) +visibility +impact +citation (Cooper & Hedges, 2009) Understanding/evaluating the choices involved helps advance theory, research, and practice
Judgment and Decision-Making Art and Science Oswald & McCloy (2003) Norris & Ortega (2007) “There doesn’t seem to be a big role in this kind of work for much intelligent statistics, opposed to much wise thought” (Wachter, 1990, p. 182). vs.
Four major stages (parallel to primary research) 1. Defining the domain / locating primary studies 2. Developing and implementing a coding scheme 3. (Meta-)Analysis 4. Interpreting meta-analytic results
1. DEFINING THE DOMAIN / LOCATING PRIMARY STUDIES
1. Defining the domain / locating primary studies: Methodological considerations
- “Best evidence synthesis” (Eysenck, 1995): Truscott (2007), strict criteria (e.g., only “long-term” treatments)
- vs. Inclusiveness (preferred) (Norris & Ortega, 2006; Plonsky & Oswald, 2012): weaknesses mitigated by volume and assessed empirically (e.g., Russell & Spada, 2006)
- Reliability reported? Yes, d = 0.65; No, d = 0.42 (Plonsky, 2011)
- Control for bias? Tight, d = 0.51; Loose, d = 0.38 (Adesope et al., 2010)
(Are there studies with certain methodological features that you would exclude?)
1. Defining the domain / locating primary studies: Publication status (& bias)
- Exclude unpublished studies (e.g., Keck et al., 2006; Lyster & Saito, 2010; Mackey & Goo, 2007)
- Failsafe n (Abraham, 2008; Ross, 1998): lacking precision (e.g., Becker, 2005)
- Funnel plot (Li, 2010; Norris & Ortega, 2000; Plonsky, 2011)
- Include unpublished studies (e.g., Li, 2010; Masgoret & Gardner, 2003; Won, 2008)
- Compare published (g = 0.43) vs. unpublished (g = 0.56) (Taylor et al., 2006)
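To make the failsafe n idea concrete, here is a minimal sketch of one common formulation (Orwin's variant). Which variant the cited syntheses used is not stated on the slide, so treat this choice, the function name, and the example numbers as assumptions for illustration only.

```python
def orwin_failsafe_n(k, mean_d, criterion_d, missing_d=0.0):
    """Number of unretrieved studies, averaging `missing_d`, that would be
    needed to pull a mean effect of `mean_d` (from k studies) down to
    `criterion_d`. All argument names are hypothetical."""
    if criterion_d <= missing_d:
        raise ValueError("criterion_d must be larger than missing_d")
    return k * (mean_d - criterion_d) / (criterion_d - missing_d)


# Hypothetical numbers: 20 studies, mean d = 0.60, "trivial" threshold d = 0.20,
# and missing studies assumed to average d = 0.
n_fs = orwin_failsafe_n(k=20, mean_d=0.60, criterion_d=0.20)
print(round(n_fs, 1))
```

The result depends heavily on the chosen threshold and on the assumed mean of the missing studies, which is one reason the statistic is described as lacking precision.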
1. Defining the domain / locating primary studies: Substantive considerations Broad: Strategy instruction (all skills; Plonsky, 2011); Multi-word instruction (all types) (Han, in preparation) Narrow (local): Strategy instruction (reading only; Taylor et al., 2006); Collocation instruction + tech. (Nurmukhamedov, in preparation) (Would you describe your domain as relatively broad or more narrow? If narrow, what broader domain does yours belong to?)
Strict / convenient? quality criteria: The Effectiveness of Bilingual Education
- Willig (1985): K = 23; d = .63
- Rossell & Baker (1996): K = 72 (the “naysayers”; 228 unacceptable); Vote: % of studies helpful (22%), no diff (45%), harmful (33%)
- Greene (1998): K = 11; g = .18 (quasi-exp) / .26 (experiments); no Canada
- Slavin & Cheung (2003): K = 42; “best-evidence synthesis”; no overall d; many subgroups
- Roessingh (2004): K = 12; qual. synthesis; HS learners only; Canadian focus
- Rolstad, Mahoney, & Glass (2005): K = 17 (all post-Willig, 1985); d L2 = .23 (usually English); d L1 = .86
- Reljić (2011): K = 7; European studies only; d = ?
(See also Rossell & Kuder’s meticulous critique and re-analysis of these studies.)
How effective is feedback? (Well, it depends…) Corrective Feedback [Figure: d = -.15; d = 1.16; ? (effects of CF not calculated)]
1. Defining the domain / locating primary studies: Search Strategies
a. Database searches (e.g., LLBA, ERIC, PsycInfo) (see In’nami & Koizumi, 2010; Plonsky & Brown, under review)
b. Forward citations (Google/Scholar, Web of Science) (Plonsky, 2011)
c. Manual journal searches (Keck et al., 2006; Plonsky & Gass, 2011)
d. Textbooks and edited volumes
e. Conference proceedings (15 in Lee et al., in press)
f. Reference digging (‘ancestry’)
g. Dissertations/theses (10 in Li, 2010; 19 in Lee et al., in press)
h. Previous reviews (e.g., ARAL)
i. Researchers’ websites, online bibliographies, listservs
j. Contacting authors
k. Others?
l. All of the above
1. Defining the domain / locating primary studies: Search Strategies (in Plonsky & Brown, under review) Narrow range of search techniques completeness+redundancy > incompleteness
2. Developing and implementing a coding scheme (the data collection instrument)
Knowledge of…
- Substantive issues, relevant models, variables (moderators), e.g., taxonomies of instruction, CF moderators; e.g., what constitutes a multi-word unit? Collocation? (Han, in prep; Nurmukhamedov, in prep.)
- Research design(s) used (more moderators): Pre-post? Control-experimental only? Classroom/lab, FL/SL, correlational/experimental, length of treatment, researcher- or teacher-led, outcome measures…
- Methodological features (for analysis of study quality)
2. Developing and implementing a coding scheme
Typically 5 different types of data are coded:
1. Identification (year, author)
2. Sample and context (age, L1, L2, proficiency)
3. Design (pre-post/control-experimental, treatment features)
4. Outcome features (free response, constrained response)
5. Outcomes / effect sizes (r, d)
Coding scheme example: Lee, Jang, & Plonsky (in press)
Recommendations:
- Code variables numerically/categorically whenever possible
- Revise and add new variables as they emerge from coding
(What types of substantive and methodological features would you code for?) (Which type of index would be most appropriate for your research/domain?)
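As an illustration of what one coded row might look like, the sketch below stores a few of the five data types numerically/categorically and derives a between-groups d from the reported descriptives using the standard pooled-SD formula. The field names, category codes, and example values are hypothetical and are not taken from any published coding scheme.

```python
import math
from dataclasses import dataclass


@dataclass
class CodedStudy:
    # 1. Identification
    author: str
    year: int
    # 2-4. Sample, design, outcome features (hypothetical categorical codes)
    setting: int        # 0 = lab, 1 = classroom
    outcome_type: int   # 0 = constrained response, 1 = free response
    # 5. Descriptives needed to compute an effect size
    n_exp: int
    mean_exp: float
    sd_exp: float
    n_ctl: int
    mean_ctl: float
    sd_ctl: float

    def cohens_d(self) -> float:
        """Standardized mean difference using the pooled standard deviation."""
        pooled_var = (
            (self.n_exp - 1) * self.sd_exp ** 2
            + (self.n_ctl - 1) * self.sd_ctl ** 2
        ) / (self.n_exp + self.n_ctl - 2)
        return (self.mean_exp - self.mean_ctl) / math.sqrt(pooled_var)


# Hypothetical study: experimental group M = 75, control group M = 70, both SD = 10.
study = CodedStudy("Hypothetical", 2010, setting=1, outcome_type=0,
                   n_exp=20, mean_exp=75.0, sd_exp=10.0,
                   n_ctl=20, mean_ctl=70.0, sd_ctl=10.0)
d = study.cohens_d()
print(d)
```

Keeping the raw descriptives in the record, rather than only the final d, makes it possible to recompute or re-weight effect sizes later if the coding scheme is revised.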
2. Developing and implementing a coding scheme (cont’d)
Decisions about…
- Interrater reliability: especially for high-inference items (e.g., L2 proficiency; task-essentialness); percentage agreement; Cohen’s kappa
- Missing data (e.g., SDs; VERY common: 31% in Plonsky & Gass, 2011)
1. Ignore/exclude (most common)
2. Impute (i.e., estimate)
3. Request (5/15 and 5/16 sent data in Plonsky, 2011, and Lee et al., in press, respectively)
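The two interrater reliability indices named above can be sketched as follows for two coders assigning one categorical code per study. The function names and the example codes are made up for illustration.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of items on which the two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of items")
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    categories = set(codes_a) | set(codes_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if p_expected == 1.0:
        return 1.0  # both coders used a single identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical codes for the "setting" variable across 10 studies.
coder_a = ["lab", "lab", "class", "class", "lab", "class", "lab", "class", "lab", "lab"]
coder_b = ["lab", "lab", "class", "lab", "lab", "class", "lab", "class", "class", "lab"]
agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(agreement, round(kappa, 2))
```

Kappa comes out lower than raw agreement here because, with only two categories, a fair amount of agreement would be expected by chance alone.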
3. (Meta-)Analysis
- Potentially very simple: Overall d = M(study 1, study 2, …)
- Level of analysis (e.g., study? sample? within vs. between groups?); pre-post ESs generally larger than control-experimental ones
- Weighting/adjusting ESs for quality, statistical artifacts: N (Norris & Ortega, 2000; Plonsky, 2011); inverse variance (Won, 2008); “Schmidt & Hunter” corrections (Jeon & Yamashita, under review; Masgoret & Gardner, 2003); quality/control (e.g., random assignment, pretesting)
- Example/template for ES weighting (N; inverse variance)
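The difference between the simple mean and the two weighting options (N; inverse variance) can be sketched as below. The variance formula is the usual large-sample approximation for a between-groups d; the three studies are invented for illustration and the function names are hypothetical.

```python
def d_variance(d, n1, n2):
    """Approximate sampling variance of a between-groups d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical studies: (d, n experimental, n control)
studies = [(0.20, 10, 10), (0.50, 30, 30), (0.80, 60, 60)]
ds = [d for d, _, _ in studies]

# Simple (unweighted) mean: every study counts equally.
unweighted = sum(ds) / len(ds)

# N-weighted mean: larger samples count more.
n_weighted = weighted_mean(ds, [n1 + n2 for _, n1, n2 in studies])

# Inverse-variance weighted mean: more precise estimates count more.
iv_weighted = weighted_mean(ds, [1 / d_variance(d, n1, n2) for d, n1, n2 in studies])

print(round(unweighted, 2), round(n_weighted, 2), round(iv_weighted, 2))
```

In this made-up example the largest study also has the largest effect, so both weighted means sit above the simple mean; with real data the direction of the shift depends on how sample size and effect size happen to covary.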
3. (Meta-)Analysis “adds as well as summarizes knowledge” (Hall et al., 1994, p. 24)
Overall / mean (d, r)
Moderator analyses (explain variance across studies): Totally essential! (and awesome)
- Ross, 1998: listening; reading
- Norris & Ortega, 2000: +explicitness; +constrained measures
- Mackey & Goo, 2007: vocab > grammar
- Li, 2010: labs > classrooms
- Plonsky, 2011: longer treatments; fewer strategies; R & S
- Lee et al., in press: instruction + feedback; longer treatments
(Example of moderator analyses using SPSS)
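A moderator analysis of the simplest kind compares weighted mean effects across levels of a coded variable. The sketch below uses Python rather than SPSS, assumes fixed-effect (inverse-variance) weights, and runs on invented data; it is an illustration of the logic, not a reproduction of any of the analyses listed above.

```python
from collections import defaultdict


def subgroup_summary(effects):
    """effects: iterable of (d, variance, moderator_level).
    Returns {level: (weighted mean d, sum of weights, k)}."""
    groups = defaultdict(list)
    for d, variance, level in effects:
        groups[level].append((d, 1 / variance))
    summary = {}
    for level, pairs in groups.items():
        weight_sum = sum(w for _, w in pairs)
        mean_d = sum(d * w for d, w in pairs) / weight_sum
        summary[level] = (mean_d, weight_sum, len(pairs))
    return summary


def q_between(summary):
    """Between-groups heterogeneity; compare to chi-square with (groups - 1) df."""
    total_weight = sum(w for _, w, _ in summary.values())
    grand_mean = sum(m * w for m, w, _ in summary.values()) / total_weight
    return sum(w * (m - grand_mean) ** 2 for m, w, _ in summary.values())


# Hypothetical effect sizes, each with sampling variance 0.05.
effects = [
    (0.90, 0.05, "lab"), (0.70, 0.05, "lab"),
    (0.50, 0.05, "classroom"), (0.30, 0.05, "classroom"),
]
summary = subgroup_summary(effects)
qb = q_between(summary)
for level, (mean_d, _, k) in summary.items():
    print(level, round(mean_d, 2), k)
print("Q_between =", round(qb, 2))
```

With two groups there is 1 degree of freedom, so this Q_between would be compared against the chi-square critical value of 3.84 at the .05 level.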
3. (Meta-)Analysis: Treatment types as moderators Plonsky, 2011
3. (Meta-)Analysis: Multiple Moderators Spada & Tomita, 2010
3. (Meta-)Analysis: Treatment length as a moderator (Jeon & Kaya, 2006; Norris & Ortega, 2000; Lyster & Saito, 2010) [Figure: effect sizes by treatment-length category in each synthesis]
3. (Meta-)Analysis: More advanced (meta-)analytic techniques
- Fixed vs. random effects modeling
- Bayesian meta-analysis (see Ross, 2013)
- Meta-regression
- Meta-SEM
(See Borenstein et al., 2009; Cooper, Hedges, & Valentine, 2009)
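The fixed vs. random effects contrast can be illustrated with a short sketch. It assumes the DerSimonian-Laird estimator for the between-study variance (tau squared), which is one common choice among several; the effect sizes and variances are invented.

```python
def fixed_effect(ds, variances):
    """Inverse-variance weighted mean and the variance of that mean."""
    weights = [1 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    return mean, 1 / sum(weights)


def dersimonian_laird_tau2(ds, variances):
    """Method-of-moments estimate of between-study variance, floored at 0."""
    weights = [1 / v for v in variances]
    mean, _ = fixed_effect(ds, variances)
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, ds))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    return max(0.0, (q - (len(ds) - 1)) / c)


def random_effects(ds, variances):
    """Same weighted mean, but each study's variance is inflated by tau^2."""
    tau2 = dersimonian_laird_tau2(ds, variances)
    mean, var = fixed_effect(ds, [v + tau2 for v in variances])
    return mean, var, tau2


# Hypothetical effect sizes with equal sampling variances.
ds = [0.2, 0.6, 1.0]
variances = [0.04, 0.04, 0.04]
fe_mean, fe_var = fixed_effect(ds, variances)
re_mean, re_var, tau2 = random_effects(ds, variances)
print(round(fe_mean, 2), round(fe_var, 4))
print(round(re_mean, 2), round(re_var, 4), round(tau2, 2))
```

Because the sampling variances are equal here, both models give the same mean; what changes is the uncertainty around it, which is larger under the random-effects model once between-study variability is taken into account.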
4. INTERPRETING RESULTS
What do they mean anyway? What implications do these effects have for future research, theory, and practice? What does d = 0.50 (or 0.10, or 1.00…) mean? How big is ‘big’? And how small is ‘small’?
4. Interpreting findings (Plonsky & Oswald, under review)
- General and field-specific benchmarks (Cohen, 1988; Plonsky & Oswald, under review)
- Previous/similar meta-analyses in AL (e.g., Abraham, 2008; Lee et al., this colloquium; Mackey & Goo, 2007); meta-analyses in other fields (Plonsky, 2011)
- SD units (Taylor et al., 2006)
- Setting: lab vs. classroom (e.g., Li, 2010; Mackey & Goo, 2007)
- Length/intensity, practicality (Lee & Huang, 2008; Lee et al., in press; Lyster & Saito, 2010; Norris & Ortega, 2000)
- Study quality (Plonsky, 2011, 2013, in press; Plonsky & Gass, 2011)
Cohen’s (1988) “t-shirt” effect sizes: small (d = 0.20), medium (d = 0.50), large (d = 0.80). ESs are best understood in relation to a particular discipline and, ideally, within a particular sub-domain of that discipline (e.g., Cohen, 1988; Valentine & Cooper, 2003)
d linguistics = economics = social work = …?
d values across 77 L2 meta-analyses (1,733 studies, N = 452,000+; Plonsky & Oswald, under review) 0.40 ≈ Small(ish) 0.70 ≈ Medium(ish) 1.00 ≈ Large(ish) M = 0.63
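To see how much the choice of benchmark matters, the sketch below labels the same d under Cohen's (1988) general values (0.20, 0.50, 0.80) and under the field-specific values on this slide (0.40, 0.70, 1.00). Treating each benchmark as a lower cutoff is a simplifying assumption made here for illustration; the benchmarks themselves are approximate reference points, not strict thresholds.

```python
COHEN_1988 = (0.20, 0.50, 0.80)   # small, medium, large
L2_FIELD = (0.40, 0.70, 1.00)     # small(ish), medium(ish), large(ish)


def label_effect(d, benchmarks):
    """Label |d| using the benchmarks as lower cutoffs (an assumption)."""
    small, medium, large = benchmarks
    size = abs(d)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "below small"


# The mean across L2 meta-analyses reported on the slide.
mean_d = 0.63
print(label_effect(mean_d, COHEN_1988))  # medium
print(label_effect(mean_d, L2_FIELD))    # small
```

The same value that counts as medium by the general benchmarks is only small-ish relative to what L2 research typically finds.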
d values across 236 primary L2 studies [Figure: distribution of d values with percentile cutoffs marked small-ish, medium-ish, and large-ish]
Additional Considerations: Theoretical Maturity Example: d = 0.42, SD = 0.24, k = 46 [Figure: ES (d) by Year; -fine-grained analyses vs. +fine-grained analyses]
Additional Considerations: Methodological Maturity Example: d = 0.42, SD = 0.24, k = 46 [Figure: ES (d) by Year; -refined methods and instruments vs. +refined methods and instruments]
Additional Considerations: Theoretical & Methodological Maturity Example: d = 0.42, SD = 0.24, K = 92 [Figure: ES (d) by Year; -/+ refined methods and instruments; -/+ fine-grained analyses] Where is your study?
ESs Over Time (Plonsky & Gass, 2011) [Figure: Average Effect Sizes across Three Decades; x-axis: Decade; y-axis: Effect Size (d)]
(Literal/Mathematical) SD Units Example: d = 0.73; the average EG participant outscored the average CG participant by about 3/4 of an SD
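One way to make the SD-unit reading more tangible is to convert d into the share of control-group scores that fall below the average experimental-group score. This conversion is an added illustration, not part of the slide, and it assumes normally distributed scores with equal variances in both groups.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def share_of_cg_below_average_eg(d):
    """Proportion of the control group scoring below the experimental-group
    mean, under the normality and equal-variance assumptions stated above."""
    return normal_cdf(d)


share = share_of_cg_below_average_eg(0.73)
print(round(share, 2))  # 0.77
```

Under those assumptions, d = 0.73 means the average EG participant scored higher than roughly three quarters of the CG.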
Additional Considerations: Research Setting Lab vs. Classroom (Mackey & Goo, 2007; Plonsky, 2011; Li, 2010); FL vs. SL (Taylor et al., 2006) *Setting may change over time: L2 interaction (Plonsky & Gass, 2011): …s ≈ 80% lab-based; …s-2000s ≈ 50/50% lab/classroom
Additional Considerations: Manipulation of IVs (Practicality?) Lee & Huang (2008) The effect of input enhancement on L2 grammar learning: d = 0.22 Numerically small, but practically large/significant?
Additional Considerations: Publication Bias, Sample Sizes, & Sampling Error Pub. bias: The tendency to publish only studies with statistically significant (or theoretically appealing) findings (Rothstein, Sutton, & Borenstein, 2005; see Plonsky, 2013; Lee, Jang, & Plonsky, in press, for evidence of publication bias in L2 research.) Two related statistical artifacts: 1. Smaller Ns: more sampling error, more variance/distance from population mean 2. Low instrument reliability: smaller effects
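Both artifacts can be shown with a few lines of arithmetic. The sketch uses the usual large-sample approximation for the sampling variance of d and the classical attenuation formula for unreliability in the outcome measure; the sample sizes, true effect, and reliability value are hypothetical.

```python
import math


def d_sampling_variance(d, n1, n2):
    """Approximate sampling variance of a between-groups d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def attenuated_d(true_d, reliability):
    """Expected observed d when the outcome measure has the given reliability."""
    return true_d * math.sqrt(reliability)


def disattenuated_d(observed_d, reliability):
    """Observed d corrected for unreliability in the outcome measure."""
    return observed_d / math.sqrt(reliability)


# Artifact 1: the same d = 0.50 is estimated far less precisely with small groups.
se_small = math.sqrt(d_sampling_variance(0.50, 10, 10))
se_large = math.sqrt(d_sampling_variance(0.50, 100, 100))
print(round(se_small, 2), round(se_large, 2))

# Artifact 2: a true d of 0.50 measured with reliability .64 shrinks on average.
observed = attenuated_d(0.50, 0.64)
print(round(observed, 2), round(disattenuated_d(observed, 0.64), 2))
```

The correction can only be applied when reliability is actually reported, which is one reason reporting practices matter for later syntheses.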
Challenges to meta-analysis
1) Domain maturity: age, breadth, and depth of research; danger of premature closure
2) Poor reporting practices (SDs, ESs): missing data (K = 19 in Nekrasova & Becker, 2009; 22 in Plonsky, 2011)
3) Instrument reliability low or unreported: reported in 6% of studies (Nekrasova & Becker, 2009)
4) Idiosyncratic/inconsistent research activity
5) Very few replications (see Polio & Gass, 1997; Porte, 2002, 2012)
What challenges might one encounter in conducting a meta-analysis in your target domain and/or generally?
Challenges to meta-analysis (cont.)
6) Disagreement over definitions and operationalizations, e.g., noticing; perhaps more “adversarial collaboration” is needed (see Tetlock & Mitchell, 2009)
7) Overreliance on individual studies (see Norris & Ortega, 2007)
8) Bias of primary (and secondary) researchers toward particular types of findings (e.g., in favor/against theory X; p < .05)
9) Tradition of overreliance on NHST (see Schmidt & Hunter, 2002): crude, uninformative, unreliable
A synthetic approach to primary research? What might this look like generally and in terms of… Research agendas? Reporting practices and interpretations of findings? Researcher training? Journal calls and acceptance policies?
Conclusion: Judgment and decision-making play a major role in all meta-analyses. Understanding the choices leads to more appropriate execution and interpretation of meta-analytic findings, and in turn to more precise advances in theory, more efficient L2 research, and more accurately informed practice.
Further Reading
- Synthesizing research on language learning and teaching (Norris & Ortega, 2006)
- Research synthesis and meta-analysis: A step-by-step approach (Cooper, 2010)
- Practical meta-analysis (Lipsey & Wilson, 2001)
Connections to Other Topics to be Discussed this Week NHST, effect sizes (MONDAY) Study Quality (WEDNESDAY) Replication (THURSDAY) Reporting practices (FRIDAY)
Tomorrow: Study Quality What does this mean? How can we operationalize study quality? What findings exist for studies of study quality in AL? Where and how can the findings of quality analyses be implemented?