Measurement Challenges in Growth and Value Added Models Joseph A. Martineau Executive Director of Assessment & Accountability Michigan Department of Education.


1 Measurement Challenges in Growth and Value Added Models Joseph A. Martineau Executive Director of Assessment & Accountability Michigan Department of Education Presentation at the Florida State University Dean’s Consortium July 16, 2010

2 Slide 2 October 24, 2011 What is a Construct?
Theoretical definition?
– Content experts usually define constructs with high levels of dimensionality for sub-constructs with theoretically important differences in meaning
– Each dimension or sub-construct can typically be considered its own construct worthy of individual study
Statistical abstraction?
– Psychometricians and statisticians usually define constructs with low levels of dimensionality
– Depends on correlations among sub-constructs
A combination?
– I fall into this camp
– A construct is a characteristic that is…
  • Theoretically distinguishable
  • Statistically distinguishable
– High correlations do not mean the constructs are indistinguishable
– Highly correlated constructs may be affected differently by different interventions

3 Construct Definition
Traditional psychometric assumptions about constructed measures include that measures are…
– Unidimensional
  • Sensitive to only a single construct
  • Measure only a single thing
– On an interval-level scale
  • Changes of the same magnitude on different parts of a scale indicate the same amount of change
  • Implies that the measures are also linear

4 Counterexamples in the Simple Case
Start out with counterexamples within a single grade level
– Grade 8
– Mathematics
– Based on Michigan Grade Level Content Expectations
– Based on the Michigan Educational Assessment Program (or MEAP)

5 A Theoretical Counterexample to the Unidimensionality Assumption
Theoretically, math is multidimensional.
Traditional statistical tests say math is unidimensional. That conclusion depends on the claim that because the sub-scales are highly correlated, it’s all just undifferentiated mathematics.
Question: Are there interventions (including teachers) that help greatly with number & operations, but not so much with algebra? If so, can we claim that math is a unidimensional, undifferentiable construct?
From a theoretical perspective, high correlations do not mean that subscales should be modeled as a single construct.
[Chart of strand coverage: taken from the percentage of Michigan mathematics grade level content expectations covering each strand.]

6 But, those are just theoretical concerns. The statistics show that you can just treat the subscales as a single overall scale. Can you show me an empirical example of where this matters?

7 An Empirical Counterexample (from Martineau et al, 2007)
Created a composite math scale, placed subscales on the “same scale.”
Ordered students on the composite math scale.
Created 100 groups of about 1,100 students each.
Obtained the average composite math score of each of the 100 groups.
Obtained the average sub-scale scores of the 100 groups on the three subscales.
Plotted the group averages in a three-dimensional achievement space.
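
The grouping procedure on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code behind Martineau et al. (2007): the data are synthetic, and the subscale names (`number_ops`, `geometry`, `algebra`), the `group_means` helper, and the use of a simple mean as the composite are all assumptions made for the example.

```python
import random
from statistics import mean


def group_means(students, n_groups=100):
    """Order students by composite score, cut them into n_groups nearly
    equal groups, and return each group's mean on every score."""
    ordered = sorted(students, key=lambda s: s["composite"])
    size, extra = divmod(len(ordered), n_groups)
    rows, start = [], 0
    for g in range(n_groups):
        stop = start + size + (1 if g < extra else 0)
        chunk = ordered[start:stop]
        rows.append({key: mean(s[key] for s in chunk) for key in chunk[0]})
        start = stop
    return rows


# Synthetic students (hypothetical): three noisy subscales of one ability.
rng = random.Random(0)
students = []
for _ in range(2000):
    ability = rng.gauss(0, 1)
    scores = {name: ability + rng.gauss(0, 0.5)
              for name in ("number_ops", "geometry", "algebra")}
    scores["composite"] = mean(scores.values())  # assumed composite rule
    students.append(scores)

rows = group_means(students, n_groups=20)
```

Each row of `rows` is one point in the three-dimensional achievement space. If the composite were unidimensional and linear, the subscale means would rise in lockstep with the composite mean from the lowest group to the highest.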

8 An Empirical Counterexample (from Martineau et al, 2007)
[Figure callouts: highest achieving group of about 1,100 students on the composite mathematics scale; lowest achieving group of about 1,100 students on the composite mathematics scale.]

9 An Empirical Counterexample
3-D Composite scale: passes all traditional unidimensionality tests.
Projections of 3-D composite onto 2-D composite.
If the composite scale is unidimensional and linear, all four plots should be linear and identical.
However, changes near the lower end of the scale mostly represent improvement in number & operations and geometry.

10 An Empirical Counterexample
Important results: the composite scale changes meaning over its range, and is multidimensional, non-linear, and non-interval.
Statistical models that rely on the scale being unidimensional, linear, and interval will result in distorted interpretations.

11 Implications
If scales traditionally considered unidimensional, linear, and interval are sometimes none of the three…
– How badly affected are the results of statistical models that use those scales as outcomes?
– Can powerful statistical models that require those scale characteristics still be used?
Essentially, why should I care whether the assumptions are violated?

12 Theoretically, how badly can value-added be affected?
Simplest case theoretical thought experiment
Scenario
– Teacher A vs. Teacher B as a reading intervention
– A true experiment, assigning students randomly to either Teacher A or Teacher B
– A composite reading measure
  • Sensitive to both decoding and comprehension
  • More sensitive to decoding than comprehension
  • Does not change meaning over its range
– Known impacts
  • Teacher A increases gains in comprehension
  • Teacher B increases gains in decoding by the same amount
  • Teacher A has no impact on decoding
  • Teacher B has no impact on comprehension
– Results should identify both as equally effective, but on different parts of the reading construct
– Next slides: graphical representations of the thought experiment

13-18 Simple Thought Experiment (from Martineau et al, 2007) [six graphic-only slides; the figures are not reproduced in this transcript]

19 Simple Thought Experiment
Comparisons
– Results
  • Accurate: equal impact of Teacher A and Teacher B, but on different dimensions of reading achievement
  • Observed: Teacher B is better than Teacher A at improving reading achievement
– Policy recommendations
  • Accurate: assign Teacher A to take PD on instruction in decoding; assign Teacher B to take PD on instruction in reading comprehension
  • Observed: give Teacher A PD in reading
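
The arithmetic behind the thought experiment can be made concrete with a small sketch. The weights (0.75 for decoding, 0.25 for comprehension) and the 10-point true gain are hypothetical numbers chosen only to satisfy the scenario's conditions (a composite more sensitive to decoding than to comprehension, and equal true impacts); they do not come from the presentation.

```python
# Hypothetical composite weights: more sensitive to decoding than comprehension.
WEIGHTS = {"decoding": 0.75, "comprehension": 0.25}


def composite_gain(true_gains):
    """Gain seen on the composite scale for a set of true per-dimension gains."""
    return sum(WEIGHTS[dim] * true_gains[dim] for dim in WEIGHTS)


# Equal true impacts (10 points each), but on different dimensions.
teacher_a = {"decoding": 0.0, "comprehension": 10.0}
teacher_b = {"decoding": 10.0, "comprehension": 0.0}

gain_a = composite_gain(teacher_a)  # 2.5
gain_b = composite_gain(teacher_b)  # 7.5
```

Under these assumed weights the composite reports Teacher B's effect as three times Teacher A's, even though the true impacts are equal by construction.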

20 Theoretically, how badly can value-added be affected?
Next simplest case theoretical thought quasi-experiment
Scenario
– Teacher A vs. Teacher B as a mathematics intervention
– A quasi-experiment
  • Existing groups (e.g., classes taught by Teacher A vs. Teacher B)
  • Select Teacher A and B to assure matching samples on pre-test mathematics measure
– A composite mathematics measure
  • Sensitive to both algebra and geometry
  • Scale changes meaning over its range
– Known impacts
  • Teacher A is more effective in eliciting growth in geometry achievement
  • Teacher B is of average effectiveness in eliciting growth in geometry
  • Teacher A and B are both of average effectiveness in eliciting growth in algebra
– Results should identify Teacher A as the more effective teacher
– Next slides: graphical representations of the thought quasi-experiment

21 Simple Thought Quasi-Experiment (from Martineau et al, 2007)
Used the Geometry/Algebra composite from the empirical example as the scale in this thought experiment

22-26 Simple Thought Quasi-Experiment [five graphic-only slides; the figures are not reproduced in this transcript]

27 Simple Thought Quasi-Experiment
Comparisons
– Results
  • Accurate: Teacher A is more effective in eliciting geometry growth; Teacher A and B are equally effective in eliciting algebra growth
  • Observed: Teacher B is more effective in eliciting mathematics growth
– Policy recommendations
  • Accurate: reward Teacher A
  • Observed: reward Teacher B

28 Summary to this point
Limited thus far to within-grade measures (or horizontal scales)
– Theoretical demonstration that content standards within a content area are multidimensional
– Empirical demonstration that a content achievement measure contains multiple dimensions (sub-scales) that behave differently
– Theoretical demonstration that ignoring multidimensionality can distort the results of experiments and quasi-experiments in value-added, even to the point of reversing a finding

29 Multidimensionality in Cross-Grade (Vertical) Scales: Adding Another Layer of Complexity
To this point, the presentation has been limited to within-grade (or horizontal) scales
Value Added Models in education tend to…
– Cover multiple years
– Cover multiple grades
– Cover broad ranges of achievement
– Cover changing foci of instruction
To allow for such broad coverage, we need…
– Cross-grade (or vertical) scales

30 A Further Theoretical Counterexample to the Unidimensionality Assumption
Mathematics is not only multidimensional, but the proportional coverage of dimensions changes across grades.
Note, especially, the change from grade 6 to grade 7: the coverage of algebra rises from 0% in grade 6 to about 35% in grade 7.
It seems unreasonable to claim that we are measuring the same thing across grades.
This cross-grade change in coverage/meaning is called “construct shift.”

31 Cross-Grade (Vertical) Scale Terminology
Types of vertical scale
– Purely unidimensional scales
  • Measure one and only one construct
  • Non-construct-shifted, non-composite scales
– Empirically unidimensional scales
  • Measure more than one construct
  • The proportional representation of the multiple constructs in the overall scale is the same across grades
  • The scale does not change meaning across grades
  • Non-construct-shifted, composite scales
– Empirically multidimensional scales
  • Measure more than one construct
  • The proportional representation of the multiple constructs in the overall scale varies across grades
  • The scale changes meaning across grades
  • Construct-shifted, composite scales
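
The three categories above can be expressed as a small decision rule over per-grade construct weights. The sketch below is only an illustration of the terminology: the function name `classify_vertical_scale`, the dictionary layout, and the example weights are assumptions (the grade 6 to grade 7 algebra figures echo the 0% and roughly 35% coverage mentioned on slide 30, and "other" is a hypothetical catch-all strand).

```python
def classify_vertical_scale(weights_by_grade, tol=1e-9):
    """Label a vertical scale with one of the three categories above.

    weights_by_grade maps grade -> {construct: proportional weight}.
    """
    grades = sorted(weights_by_grade)
    constructs = set().union(*(weights_by_grade[g] for g in grades))
    base = weights_by_grade[grades[0]]
    # Construct shift: any construct's weight differs from the first grade's.
    shifted = any(
        abs(weights_by_grade[g].get(c, 0.0) - base.get(c, 0.0)) > tol
        for g in grades
        for c in constructs
    )
    if shifted:
        return "empirically multidimensional"
    measured = [c for c in constructs if base.get(c, 0.0) > tol]
    if len(measured) == 1:
        return "purely unidimensional"
    return "empirically unidimensional"


# Hypothetical examples.
single = {6: {"algebra": 1.0}, 7: {"algebra": 1.0}}
stable = {6: {"algebra": 0.35, "other": 0.65}, 7: {"algebra": 0.35, "other": 0.65}}
shifting = {6: {"algebra": 0.0, "other": 1.0}, 7: {"algebra": 0.35, "other": 0.65}}

labels = [classify_vertical_scale(s) for s in (single, stable, shifting)]
```

The rule only inspects the intended weights; it says nothing about whether the items actually behave that way, which is an empirical question.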

32 How Does Construct Shift Impact the Results of Growth and Value-Added Models?
Mathematical derivation
– Growth models
– Value-added models
Empirical demonstration
– Growth models

33 Let’s say…
We want to measure the impact of a single teacher (or group of teachers using the same intervention) on student growth
Let x represent whether a student is instructed by a certain teacher (or group of teachers)

34 Mathematical Derivation of Impact of Construct Shift on Growth Models (from Martineau 2004)
Simplest growth model (2-level HLM, measurements within students, linear gains)
– What we think we are modeling (with a purely unidimensional measure as the outcome).
– Best case of what we are actually modeling (with an empirically unidimensional measure as the outcome). Results become more complex, less like what we think we are modeling.
– Most likely case of what we are actually modeling (with an empirically multidimensional measure as the outcome). Even more complex and less like what we think we are modeling.

35 Mathematical Derivation of Impact of Construct Shift on Growth Models
Analogous terms
– Overall intercept (starting point)
– Effect of teacher x on intercept
– Overall slope (growth rate)
– Effect of teacher x on growth rate
Problem: red and blue should be in the intercept equation ($\beta_{0j}$), green and black should be in the slope equation ($\beta_{1j}$). All four are in both for the model using an empirically multidimensional scale!
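
For readers without the slide graphics, the simplest growth model named on slide 34 can be written in standard two-level HLM notation. The following is a textbook sketch under assumed notation, not a reconstruction of the slide equations: only $\beta_{0j}$, $\beta_{1j}$, and $x$ appear in the presentation text, while $y_{ij}$ (score of student $j$ at occasion $i$), $t_{ij}$ (time), the $\gamma$ coefficients, and the residuals $e_{ij}$, $u_{0j}$, $u_{1j}$ are introduced here for illustration.

```latex
\begin{aligned}
y_{ij}       &= \beta_{0j} + \beta_{1j}\, t_{ij} + e_{ij} \\
\beta_{0j}   &= \gamma_{00} + \gamma_{01}\, x_j + u_{0j} \\
\beta_{1j}   &= \gamma_{10} + \gamma_{11}\, x_j + u_{1j}
\end{aligned}
```

In this notation $\gamma_{00}$ is the overall intercept, $\gamma_{01}$ the effect of teacher x on the intercept, $\gamma_{10}$ the overall slope, and $\gamma_{11}$ the effect of teacher x on the growth rate. With a purely unidimensional outcome, each of the four terms appears only in its own equation.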

36 Mathematical Derivation of Impact of Construct Shift on Growth Models
Another problem: the intercept and slope equations from construct-shifted (empirically multidimensional) scales contain totally irrelevant terms from the regression of proportional construct representation ($p_c$) on time and the regression of (time multiplied by $p_c$) on time

37 Mathematical Derivation of Impact of Construct Shift on Growth-Based Value-Added Models (from Martineau, 2006)
Simple value-added model (two-level model with measurement occasions cross-nested within both teachers and students)
– Teacher effect we think we are modeling (with a purely unidimensional measure as the outcome).
– Best case of what we are actually modeling (with an empirically unidimensional measure as the outcome). More complex and less like what we think we are modeling.
– Most likely case of what we are actually modeling (with an empirically multidimensional measure as the outcome). Even more complex and less like what we think we are modeling.

38 Mathematical Derivation of Impact of Construct Shift on Growth-Based Value-Added Models
Labels on the equation terms:
– Change in proportional representation of construct c from the previous grade
– Impact of all teachers previous to teacher a in year i on student gains on construct c
– Impact of teacher a on year i student gains on construct c
– Year i proportional representation of construct c
– Impact of teacher a on unidimensional student gains in year i (what we want)
– Proportional representation of construct c
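
One way to see why prior achievement leaks into a gain measured on a construct-shifted composite is a simple algebraic identity. This is a simplified sketch, not the full derivation in Martineau (2006). It assumes the composite score in year $i$ is a weighted sum $y_i = \sum_c p_{ci}\,\theta_{ci}$, where $p_{ci}$ is the year $i$ proportional representation of construct $c$ and $\theta_{ci}$ (assumed notation) is the student's standing on construct $c$ in year $i$.

```latex
\begin{aligned}
y_i - y_{i-1}
  &= \sum_c p_{ci}\,\theta_{ci} - \sum_c p_{c,i-1}\,\theta_{c,i-1} \\
  &= \sum_c p_{ci}\,\bigl(\theta_{ci} - \theta_{c,i-1}\bigr)
   + \sum_c \bigl(p_{ci} - p_{c,i-1}\bigr)\,\theta_{c,i-1}
\end{aligned}
```

The first sum is the gain on each construct, weighted by its current representation. The second sum multiplies the change in representation by where the student stood at the end of the previous year, which reflects everything that happened before year $i$, including earlier teachers. It vanishes only when the weights do not change from grade to grade.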

39 Mathematical Derivation of Impact of Construct Shift on Growth-Based Value-Added Models
Labels on the equation terms: can be considered relevant (with problems); definitely irrelevant; definitely relevant.
We can calculate the proportion of variance in teacher effects that is construct relevant in construct-shifted (empirically multidimensional) scales.

40 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)

41 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)
Proportion of teacher effects not attributable to prior teachers

42 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)
The proportion of the estimate of my effectiveness as a teacher that I can impact depends on the balance of construct representation in the current grade level test, …

43 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)
…on change in construct representation from grade to grade, …

44 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)
…on the number of teachers who precede me in the analysis, and …

45 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)
…on the population-wide correlation in value-added impacts on the multiple constructs (not on the correlation of the constructs themselves)

46 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)
The proportion of the estimate of my effectiveness that I can impact can drop quickly and dramatically depending on my circumstances.

47 Mathematical Derivation of Percent of Variance in Teacher Effects from Construct Relevant Sources (from Martineau 2006)
How much construct irrelevant variance is acceptable in teacher effects?
– In a research study?
– In a pay-for-performance measure?
– In a teacher evaluation measure used for hiring, firing, promotion, and tenure decisions?

48 Empirical Demonstration of the Importance of Which Scale is Used
Lockwood, J.R., et al. (2007)
– Ran VAM on two mathematics subscales
– Variation in VAM measures across subscales was greater than variation across model specifications
– Variation within teacher across subscales was greater than variation within subscales across teachers
– Correlation between value-added on the two subscales was low
Based on the percent variance from construct relevant sources chart, reliability of value-added based on unidimensional mathematics would be low

49 From Doran and Cohen (2005)
Bias in vertical linking is so great that they recommended:
– Include uncertainty in gains arising from vertical linking bias in the results of Value Added Models
– Consider not using value-added models [based on vertical scales] to make causal inferences [about individual teachers or schools]; the data are too noisy

50 Summary to this point
When using cross-grade (vertical) scales
– Mathematical demonstration that results of growth models are seriously distorted
  • Slopes contain intercept terms
  • Intercept terms contain slope terms
  • Both intercept and slope terms contain terms from totally irrelevant regressions of proportional construct weights on assessment occasions
– VAM sensitive to what sub-construct is measured
– Cautions against causal interpretations based on measurement issues

51 Summary to this point
When using cross-grade (vertical) scales
– Mathematical demonstration that results of value added models are seriously distorted
  • Intent of growth-based value-added models is to isolate individual teacher effects from each other, but…
  • Individual teacher effect terms include the impact of all teachers that preceded the individual teacher!
  • The impact of all preceding teachers in individual teacher effect terms is amplified by the degree of changes in construct representation in the achievement measure from grade to grade
  • You can be harmed (by following either poor or excellent teachers) or helped (by following either poor or excellent teachers) through no fault of your own
– Depends on how the construct representation changes on the test
– Difficult to tell whether you will be harmed or helped

52 Summary to this point
Everything to this point has been either
– Theoretical
  • Thought experiment using “unidimensional” measures as outcomes
  • Thought quasi-experiment using “unidimensional” measures as outcomes
  • Based on content-expert judgment of multidimensionality
– Empirical, but in a limited sense
  • Counts of content standards covering different sub-constructs
  • Identification of non-linear composite (multidimensional) scales within grades
– Mathematical
  • Derivations of distortions in growth models using construct shifting vertical scales as outcomes
  • Derivations of distortions in value-added models using construct shifting vertical scales as outcomes

53 What’s Missing?
Demonstration of major variations in outcomes resulting from…
– Growth models using real construct-shifted vertical scales
– Value-added models using real construct-shifted vertical scales
Overall Summary

54 Instrument & Sample (from Martineau, Wyse, & Zeng, 2010)
Michigan English Language Proficiency Assessment (ELPA)
– Level III (Grades 3-5)
– Measures four domains
  • Reading, Writing, Listening, Speaking
  • Individual domains treated as purely unidimensional
– Dimensionality analysis based on Zeng & Martineau (2010)
– More sensitive than traditional dimensionality analyses
– Number of dimensions between 3 and 5, inclusive
– Likely nearly purely unidimensional

55 Constructing Vertical Scales
Grades 3-5 receive the same assessment
Equating
– Followed the same cohort of students across grades 3 through 5
– Common items from grade 3 to 4 and from grade 4 to 5
– Common item vertical equating based on non-equivalent groups
– Allows for best-case development of vertical scales
  • Same level of the test in all three grades
  • No major differences in item difficulty across grades
Calibration and scaling
– Calibrated all items to the same base scale across years using WINSTEPS
– Used the IWEIGHT command to weight items to create different types of vertical scales with differing domain contributions

56 Constructing Vertical Scales, continued…
Types of scales
– Purely unidimensional
  • (R)eading, (W)riting, (L)istening, (S)peaking
  • Weights for R are R=1.00, W=0.00, L=0.00, S=0.00
– Empirically unidimensional
  • For example, (E)qual, (T)ext, (O)ral
  • Weights for E are R=0.25, W=0.25, L=0.25, S=0.25
  • Weights for T are R=0.40, W=0.40, L=0.10, S=0.10
– Empirically multidimensional
  • Text to Oral (T-O), Oral to Text (O-T), etc…
  • Weights for T-O are
    Grade 3: R=0.40, W=0.40, L=0.10, S=0.10
    Grade 4: R=0.25, W=0.25, L=0.25, S=0.25
    Grade 5: R=0.10, W=0.10, L=0.40, S=0.40
  • Weights for O-T reverse grades 3 and 5 weights for T-O
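
The weighting schemes above can be applied to a single hypothetical student to show what construct shift does to an apparent growth trajectory. The weights are the ones listed on the slide. The student's domain scores, the `composite` helper, and the assumption that a composite is a plain weighted sum of domain scores are illustrative only (the presentation built its scales by weighting items in WINSTEPS, which this sketch does not attempt to reproduce).

```python
DOMAINS = ("R", "W", "L", "S")

# Domain weights taken from the slide.
T = {"R": 0.40, "W": 0.40, "L": 0.10, "S": 0.10}  # (T)ext
E = {"R": 0.25, "W": 0.25, "L": 0.25, "S": 0.25}  # (E)qual
O = {"R": 0.10, "W": 0.10, "L": 0.40, "S": 0.40}  # (O)ral

# Weights by grade for four of the vertical scales.
SCALES = {
    "E": {3: E, 4: E, 5: E},
    "T": {3: T, 4: T, 5: T},
    "T-O": {3: T, 4: E, 5: O},
    "O-T": {3: O, 4: E, 5: T},
}


def composite(scores, weights):
    """Weighted sum of domain scores (assumed composite rule)."""
    return sum(weights[d] * scores[d] for d in DOMAINS)


# Hypothetical student whose domain scores do not change from grade 3 to 5.
scores = {"R": 500, "W": 500, "L": 540, "S": 540}

trajectory = {
    name: [composite(scores, by_grade[g]) for g in (3, 4, 5)]
    for name, by_grade in SCALES.items()
}
```

With these made-up scores, the E and T scales are flat across grades, while T-O rises by about 24 points and O-T falls by about 24 points, even though the student has not changed on any domain. The apparent growth comes entirely from the shift in weights.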

57 Constructing Vertical Scales, continued…
Fifteen resulting scales
– Purely unidimensional
  • R, W, L, S
– Empirically unidimensional
  • (E)qual: most commonly used
  • (T)ext: weighted toward R & W
  • (O)ral: weighted toward L & S
  • (C)omprehension: weighted toward R & L
  • (P)roduction: weighted toward W & S
– Empirically multidimensional
  • Text to Oral (T-O): transition from T weights to O weights
  • Oral to Text (O-T): transition from O weights to T weights
  • Comprehension to Production (C-P)
  • Production to Comprehension (P-C)
  • Speaking to R/W/L (S-3)
  • R/W/L to Speaking (3-S)

58 Correlations Among Vertical Scales
[Correlation table: raw correlations above the diagonal, disattenuated correlations below; the table is not reproduced in this transcript]
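
The slide does not say how the disattenuated correlations were computed. A common choice is Spearman's correction for attenuation, which divides the observed correlation by the square root of the product of the two scales' reliabilities. The sketch below assumes that formula; the function name and the example numbers are hypothetical.

```python
from math import sqrt


def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation.

    r_xy  : observed (raw) correlation between two scales
    rel_x : reliability of the first scale, in (0, 1]
    rel_y : reliability of the second scale, in (0, 1]
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in (0, 1]")
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical values: a raw correlation of 0.72 between two scales
# that each have reliability 0.81.
corrected = disattenuate(0.72, 0.81, 0.81)
```

Because reliabilities are at most 1, the corrected value is never smaller in magnitude than the raw one, which is why the entries below the diagonal would run higher than those above it.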

59 Correlations Among Vertical Scales
Moderate (0.40 to 0.59) to high (0.60 to 0.69) correlations

60 Correlations Among Vertical Scales
Very high correlations (0.70 to 0.89) with a few high (0.60 to 0.69) and a few extreme (0.90 to 1.00) correlations

61 Correlations Among Vertical Scales
Extreme (0.90 to 1.00) correlations with a few very high (0.70 to 0.89) correlations

62 Growth Model
Grade 3-5 linear growth model with random (student) effects
Student-level intercept and slope predicted by demographics

63 Growth Model Results: Purely Unidimensional Scales

64 Growth Model Results: Purely Unidimensional Scales
[Table callouts: consistent results; inconsistent results]

65 Growth Model Results: Purely Unidimensional Scales
Range of Values (Max minus Min)

66 Growth Model Results: Purely Unidimensional Scales
Indicates whether the sign (+/-) of the coefficient differed depending on the scale used as the outcome in the model

67 Growth Model Results: Purely Unidimensional Scales
Indicates whether at least one scale’s model had statistical significance and another’s did not at the common alpha levels of 0.05, 0.01, and 0.001, respectively
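
The three comparison columns described on slides 65 to 67 (range, sign change, and significance disagreement) are simple to compute once a coefficient and p-value are available for each scale. The sketch below shows one way to do it. The function name, the input layout, and every number in the example are made up for illustration and are not results from the presentation.

```python
ALPHAS = (0.05, 0.01, 0.001)


def compare_across_scales(results):
    """Summarize how one coefficient varies across outcome scales.

    results maps scale name -> (coefficient, p_value).
    """
    coefs = [coef for coef, _ in results.values()]
    p_values = [p for _, p in results.values()]
    return {
        # Max minus min of the coefficient across scales.
        "range": max(coefs) - min(coefs),
        # True if some scales give a negative and others a positive estimate.
        "sign_differs": min(coefs) < 0 < max(coefs),
        # True if, at a given alpha, one scale is significant and another is not.
        "significance_differs": {
            alpha: any(p < alpha for p in p_values)
            and any(p >= alpha for p in p_values)
            for alpha in ALPHAS
        },
    }


# Fabricated example: one demographic effect estimated on four scales.
example = {
    "R": (-1.5, 0.0004),
    "W": (-0.5, 0.03),
    "L": (0.25, 0.40),
    "S": (-1.0, 0.004),
}
summary = compare_across_scales(example)
```

In this fabricated example the range is 1.75, the sign differs across scales, and the significance conclusion differs at all three alpha levels.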

68 Growth Model Results: Composite Scales

69 Growth Model Results: Composite Scales
Large range changes, even with extreme correlations among the scales

70 Growth Model Results: Composite Scales
Changes in signs of coefficients and in statistical significance interpretations, even with extreme correlations among scales

71 Growth Model Results: Composite Scales
Impact of being an Arabic speaker on growth rate was consistently statistically significantly negative with all four purely unidimensional scales. Shows up as not statistically significant in one of the empirically multidimensional scales.

72 Policy Recommendations?
Translating statistical model results into policy recommendations
– Negative effect on intercept (starting out lower) = focused startup resources
– Negative effect on slope (growing more slowly) = continued resources
[Slide callouts: All 4; Missing!; Only on writing; On 7 of 11]

73 Findings
The real (multidimensional) picture?
– Messy and nuanced
– Not consistent across subscales
– Real differences in statistical significance and interpretation
The distorted (composite) picture?
– Can’t reflect the real picture (well, yeah, if the results aren’t consistent across subscales!)
– Inconsistent even with extremely highly correlated scales
– One construct-shifted scale even reverses a finding that is consistent across all four purely unidimensional scales
– Sends policy and theory in the wrong directions

74 Findings
Construct-shifted vertical scales can cause serious and practical distortions in the results of growth-based models (including value-added models)
Non-construct-shifted, but composite, scales also can cause serious and practical distortions in the results of growth-based models (including value-added models)
Growth model (including value-added model) results are highly sensitive to un-modeled dimensionality, construct-shifted or not, whether the constructs are highly correlated or not

75 Overall Summary
Growth and value-added models requiring linear scales that are used for theory development, policy development, and policy decisions should avoid the use of composite scales as outcomes
Growth and value-added models requiring linear scales should be based on as nearly purely unidimensional scales as possible
– Use more sensitive measures of dimensionality
– Trust content expert judgment on what constitutes a dimension (construct)
If one desires to use a composite scale, one should use methods that do not make the assumptions of linearity, unidimensionality, and interval-level scaling; for example…
– Michigan’s grade 3-8 “growth” model
– Michigan’s Race to the Top submission for teacher evaluations (including “growth”-based value-added measures)

76 References
Doran, H., & Cohen, J. (2005). The confounding effect of linking bias on gains estimated from value-added models. In Lissitz, R. (Ed.), Value Added Models in Education: Theory and Applications. JAM Press: Maple Grove, MN.
Lockwood, J.R., McCaffrey, D.F., Hamilton, L.S., Stecher, B., Le, V.-N., & Martinez, J.F. (2007). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement, 44(1), 47-67.
Martineau, J.A. (2004). The effects of construct shift on growth and accountability models. Unpublished dissertation. Michigan State University.
Martineau, J.A. (2006). Distorting value added: the use of longitudinal, vertically scaled achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31(1), 35-62.
Martineau, J.A., Subedi, D.R., Ward, K.H., Li, T., Lu, Y., Diao, Q., Pang, F-H., Drake, S., Song, T., Lao, S-C., Zheng, Y., & Li, X. (2007). Non-linear unidimensional trajectories through multidimensional content spaces: a critical examination of the common psychometric claims of unidimensionality, linearity, and interval-level measurement. In Lissitz, R. (Ed.), Assessing and Modeling Cognitive Development in Schools: Intellectual Growth and Standard Setting. JAM Press: Maple Grove, MN.
Martineau, J.A., Wyse, A.E., & Zeng, J. (2010, May). Distortions in empirical measures of growth arising from using traditional (vertical) scales as outcomes. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Denver, CO.

77 Contact Information
Joseph A. Martineau, Ph.D.
– Executive Director of Assessment & Accountability
– Michigan Department of Education
– martineauj@michigan.gov

