PRESENTATION AT THE 12TH ANNUAL MARYLAND ASSESSMENT CONFERENCE, COLLEGE PARK, MD, OCTOBER 18, 2012
JOSEPH A. MARTINEAU AND JI ZENG, MICHIGAN DEPARTMENT OF EDUCATION
Borrowing the Strength of Unidimensional Scaling to Produce Multidimensional Educational Effectiveness Profiles

2 Background
Prior research showing that using unidimensional measures of multidimensional achievement constructs can distort value-added:
- Martineau, J. A. (2006). Distorting Value Added: The Use of Longitudinal, Vertically Scaled Student Achievement Data for Value-Added Accountability. Journal of Educational and Behavioral Statistics, 31(1).
- Construct-irrelevant variance can become considerable in value-added measures when a construct is multidimensional but is modeled in value-added as unidimensional.
- A common misunderstanding is that if the multiple constructs are highly correlated, value-added should not be distorted.
- The correct understanding is that if value-added on the multiple constructs is highly correlated, value-added should not be distorted (illustrated in the sketch below).
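A minimal toy sketch of that last distinction, using simulated school effects and hypothetical numbers (not data from the studies cited here): when value-added on the two dimensions is highly correlated, a unidimensional composite barely changes even if the test's content mix shifts; when it is weakly correlated, the composite depends heavily on the mix.

```python
# Toy illustration (hypothetical values): distortion of a unidimensional
# composite depends on how correlated the dimension-level value-added is,
# not on how correlated the achievement scores themselves are.
import numpy as np

rng = np.random.default_rng(0)
n_schools = 500

def composite_stability(va_corr, mix_a=0.9, mix_b=0.1):
    """Correlation of the composite VA with itself when the content mix changes."""
    cov = [[1.0, va_corr], [va_corr, 1.0]]
    va = rng.multivariate_normal([0.0, 0.0], cov, size=n_schools)  # VA on dims 1 and 2
    comp_a = mix_a * va[:, 0] + (1 - mix_a) * va[:, 1]
    comp_b = mix_b * va[:, 0] + (1 - mix_b) * va[:, 1]
    return np.corrcoef(comp_a, comp_b)[0, 1]

for r in (0.95, 0.50, 0.05):
    print(f"corr(VA_dim1, VA_dim2) = {r:.2f} -> "
          f"composite VA stability across content mixes = {composite_stability(r):.2f}")
```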

3 Background
Prior research showing that the choice of dimension/domain within a construct changes value-added significantly:
- Lockwood, J. R., et al. (2007). The Sensitivity of Value-Added Teacher Effect Estimates to Different Mathematics Achievement Measures. Journal of Educational Measurement, 44(1).
- Depending on choices made in value-added modeling, the correlation between teacher value-added on Procedures and Problem Solving ranged from as low as 0.01.
- This is a surprisingly low correlation, indicating that, at least in this situation, one needs to be concerned about modeling value-added on both dimensions rather than unidimensionally.
- This is the only work I am aware of to date that has inspected inter-construct value-added correlations.

4 Background
Prior research showing that commonly used factor-analytic techniques underestimate the number of dimensions in a multidimensional construct:
- Zeng, J. (2010). Development of a Hybrid Method for Dimensionality Identification Incorporating an Angle-Based Approach. Unpublished doctoral dissertation, University of Michigan.
- Common dimensionality identification procedures make the unwarranted assumption that all shared variance among indicator variables arises because the indicator variables measure the same construct (shared variance can also arise because the indicator variables are influenced by a common exogenous variable).
- Because of this unwarranted assumption, commonly used dimensionality identification techniques underestimate the number of dimensions in a data set (a toy illustration follows).
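A hypothetical simulation of the mechanism described above (not drawn from Zeng, 2010): two genuinely distinct dimensions whose indicators also share variance from a common exogenous influence can look unidimensional to a simple eigenvalue-based retention rule.

```python
# Hypothetical illustration: a strong common exogenous influence plus two
# genuinely distinct dimensions makes the data look unidimensional to an
# "eigenvalues greater than 1" retention rule.
import numpy as np

rng = np.random.default_rng(1)
n = 5000

exogenous = rng.normal(size=n)                      # common cause, not a measured construct
dim1 = 0.9 * exogenous + 0.35 * rng.normal(size=n)  # two distinct dimensions that share
dim2 = 0.9 * exogenous + 0.35 * rng.normal(size=n)  # variance only via the exogenous cause

# Six indicators per dimension, each with its own measurement error.
items = np.column_stack(
    [0.8 * dim1 + 0.7 * rng.normal(size=n) for _ in range(6)]
    + [0.8 * dim2 + 0.7 * rng.normal(size=n) for _ in range(6)]
)

eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(items, rowvar=False)))[::-1]
print("largest eigenvalues:", np.round(eigvals[:4], 2))
# Only one eigenvalue exceeds 1 here, so this common retention rule reports a
# single dimension even though the data were generated from two.
```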

5 Background
Scaling constructs as multidimensional is a difficult task:
- Multidimensional Item Response Theory (MIRT) is time-consuming and costly to run.
- Replicating MIRT analyses can be challenging (there are multiple subjective decision points along the way).
- Identifying the number of dimensions in MIRT can be challenging.
- Once the number of dimensions is identified, identifying which items load on which dimensions in MIRT can also be challenging.
- The factor analysis techniques underlying MIRT are techniques for data reduction, not dimension identification.

6 Background
Short of resolving the considerable difficulties in analytically identifying dimensions within a construct (and replicating such analyses), can another approach be used?
Propose using/trusting content experts' identifications of dimensions within constructs (e.g., the divisions agreed upon by the writers of content standards) as the best currently available identification of dimensions, for example:
- Within English language proficiency, producing reading, writing, listening, and speaking scales.
- Within mathematics, producing number & operations, algebra, geometry, measurement, and data analysis/statistics scales.

7 Background
However, separately scaling each dimension can also be difficult and costly compared to running a traditional unidimensional IRT calibration:
- Confirmatory MIRT
- Bi-factor IRT model
- Separate unidimensional calibration and year-to-year equating of each dimension score
Another option (sketched below):
- Unidimensionally calibrate the total score
- Unidimensionally equate the total score from year to year
- Use (fixed) item parameters from the unidimensional calibration to create the multiple dimension scores as specified by content experts
- Use of this method needs to be investigated
This approach is a practical necessity for the Smarter Balanced Assessment Consortium.
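A minimal sketch of the fixed-calibration option listed above, using a dichotomous Rasch (1-PL) model with hypothetical item difficulties and item-to-domain assignments; the operational work described later uses PCM/GPCM parameters and the content experts' actual mapping, so this only illustrates the mechanics.

```python
# Sketch of the fixed-calibration option: item difficulties come from ONE
# unidimensional (Rasch/1-PL) calibration of the total test; each domain is then
# scored separately with those parameters held fixed, using only its own items.
# Item names, difficulties, and the item-to-domain mapping are hypothetical.
import numpy as np
from scipy.optimize import minimize_scalar

item_difficulty = {"R1": -0.8, "R2": 0.1, "R3": 0.9,   # Reading items
                   "W1": -0.3, "W2": 0.6, "W3": 1.2}   # Writing items
domains = {"Reading": ["R1", "R2", "R3"], "Writing": ["W1", "W2", "W3"]}

def domain_theta(responses, items):
    """Maximum-likelihood theta for one domain under the Rasch model,
    with difficulties fixed at their unidimensional-calibration values."""
    b = np.array([item_difficulty[i] for i in items])
    x = np.array([responses[i] for i in items], dtype=float)
    def neg_loglik(theta):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return minimize_scalar(neg_loglik, bounds=(-4, 4), method="bounded").x

student = {"R1": 1, "R2": 1, "R3": 0, "W1": 1, "W2": 0, "W3": 0}
profile = {d: round(domain_theta(student, its), 2) for d, its in domains.items()}
print(profile)  # a two-score domain profile from a single unidimensional calibration
```

In the separate-calibration alternative examined later, each domain's items would instead be recalibrated on their own before scoring; that is the contrast evaluated in the results.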

8 Purpose
Investigate the feasibility and validity of relying on unidimensional total-score calibration as a basis for creating multidimensional profile scores:
- For reporting multidimensional student achievement scores
- For reporting multidimensional value-added measures
Investigate the impact of separate versus fixed calibration of multidimensional achievement scores on:
- Student achievement scores
- Value-added scores
...as compared to the impact of other common decisions in scaling, outcome selection, and value-added modeling.

9 Methods
Decisions modeled in the analyses:
- Psychometric decisions
  - Choice of psychometric model: 1-PL vs. 3-PL; PCM vs. GPCM
  - Estimation of sub-scores: separate calibration for each dimension vs. fixed calibration based on unidimensional parameters
  - Choice of outcome metric: which sub-score is modeled
- Value-added modeling decisions
  - Inclusion of demographics in models
  - Number of pre-test covariates (for covariate-adjustment models)

10 Methods
Outcomes:
- Correlations in student achievement metrics compared across each psychometric choice and outcome choice
- Correlations in value-added modeling compared across each choice
- Classification consistency in value-added compared across each choice (see the sketch below) for:
  - Three-category classification decisions: based on confidence intervals around point estimates placing programs/schools into three categories: (1) above average, (2) statistically indistinguishable from the average, and (3) below average
  - Four-category classification decisions: based on sorting programs'/schools' point estimates into quartiles, representing arbitrary cut points for classification
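A minimal sketch of the two classification schemes, with made-up value-added estimates; it assumes the estimates are centered so that zero represents the average program/school.

```python
# Sketch of the two classification schemes compared above (hypothetical inputs):
# (1) three categories from a confidence interval around each value-added point
#     estimate, relative to the average; (2) four categories by quartile.
import numpy as np

def three_category(estimates, std_errors, z=1.96):
    """Above average / indistinguishable from average / below average."""
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return np.where(lower > 0, "above",
           np.where(upper < 0, "below", "indistinguishable"))

def four_category(estimates):
    """Quartile membership of the point estimates (arbitrary cut points)."""
    cuts = np.percentile(estimates, [25, 50, 75])
    return np.digitize(estimates, cuts) + 1   # 1 = bottom quartile, 4 = top

va = np.array([0.35, -0.10, 0.02, -0.40, 0.18])  # VA estimates centered on the average
se = np.array([0.10, 0.12, 0.09, 0.15, 0.20])
print(three_category(va, se))
print(four_category(va))
```

Classification consistency is then the agreement between these category assignments when the same programs/schools are run through different cells of the design.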

11 Methods
Data: Michigan English Language Proficiency Assessment (ELPA)
- Level III (Grades 3-5)
- 3,391 students, each with 3 measurement occasions (10,173 total scores)
- Measures: Total, Reading (domain), Writing (domain), Listening (domain), Speaking (domain)
- Calibrated the ELPA as a unidimensional measure using both the 1-PL/Partial Credit Model and the 3-PL/Generalized Partial Credit Model
- Created domain scores both from fixed parameters from the unidimensional calibration and in separate calibrations for each domain

12 Methods
Data: Michigan Educational Assessment Program (MEAP) Mathematics
- Grades 7 and 8 (not on a vertical scale)
- Over 110,000 students per grade
- Measures: Total (using items from the two domains), Number & Operations (domain), Algebra (domain)
- Calibrated the MEAP Math tests as unidimensional measures using both 1-PL and 3-PL models
- Created domain scores both from fixed parameters from the unidimensional calibration and in separate calibrations for each domain

13 Methods

14 Methods
Value-added modeling of the ELPA: VAMs were run in a fully crossed design (enumerated below) with...
- All four outcomes (Reading, Writing, Listening, Speaking)
- PCM- and GPCM-calibrated outcomes
- Fixed and separately calibrated outcomes
- With and without demographics in the VAMs
...for 32 real-data applications across design factors.
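The design can be read directly off the bullets above; a quick enumeration (labels only, nothing estimated) confirms the 4 x 2 x 2 x 2 = 32 cells.

```python
# Enumerating the fully crossed ELPA design described above: 4 x 2 x 2 x 2 = 32 runs.
from itertools import product

outcomes = ["Reading", "Writing", "Listening", "Speaking"]
models = ["PCM", "GPCM"]
calibrations = ["fixed", "separate"]
demographics = [True, False]

runs = list(product(outcomes, models, calibrations, demographics))
print(len(runs))   # 32
print(runs[0])     # ('Reading', 'PCM', 'fixed', True)
```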

15 Methods

16 Methods
Value-added modeling of MEAP mathematics: VAMs were run in a fully crossed design with...
- Both outcomes (Algebra and Number & Operations)
- 1-PL- and 3-PL-calibrated outcomes
- Fixed and separately calibrated outcomes
- With and without demographics
- With either one or two pre-test covariates
...for 32 real-data applications across design factors (a simplified covariate-adjustment sketch follows).
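A simplified covariate-adjustment VAM of the kind varied in this design, shown only to make the moving parts concrete. The column names (math_g8, math_g7, math_g6, econ_disadvantaged, ethnicity, female, school_id) are hypothetical, and this is not the study's operational model.

```python
# Simplified covariate-adjustment VAM sketch (hypothetical column names; not the
# study's operational model): regress the grade 8 outcome on one or two prior
# scores, optionally adding demographics, then average residuals by school.
import pandas as pd
import statsmodels.formula.api as smf

def school_value_added(df, pretests=("math_g7",), demographics=False):
    rhs = " + ".join(pretests)
    if demographics:
        rhs += " + econ_disadvantaged + C(ethnicity) + female"
    fit = smf.ols(f"math_g8 ~ {rhs}", data=df).fit()
    return (df.assign(residual=fit.resid)
              .groupby("school_id")["residual"]
              .agg(["mean", "count"]))   # mean residual = crude school VA estimate

# Two of the 32 design cells, e.g.:
# va_a = school_value_added(meap, pretests=("math_g7",), demographics=False)
# va_b = school_value_added(meap, pretests=("math_g7", "math_g6"), demographics=True)
# print(va_a["mean"].corr(va_b["mean"]))  # how sensitive VA is to these choices
```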

17 Results ELPA

18 Results: ELPA Student-Level Outcomes Correlations across fixed vs. separate calibrations

19 Results: ELPA Student-Level Outcomes Correlations across model choice (PCM vs. GPCM)

20 Results: ELPA Student-Level Outcomes
Correlations across content areas: low to moderate inter-dimension correlations.
However, Rasch dimensionality analysis from WINSTEPS identified the total score as unidimensional.

21 Results: ELPA Program District-Level Value-Added Outcomes Impact of fixed versus separate calibration

22 Results: ELPA Program District-Level Value-Added Outcomes
Correlations between Listening and Reading VA: Min = 0.228, Mean = 0.322, SD = 0.037

23 Results: ELPA Program District-Level Value-Added Outcomes
Correlations between Listening and Writing VA: Min = 0.342, Mean = 0.373, SD = 0.019

24 Results: ELPA Program District-Level Value-Added Outcomes
Correlations between Listening and Speaking VA: Mean = 0.046, SD = 0.035

25 Results: ELPA Program District-Level Value-Added Outcomes
Correlations between Reading and Writing VA: Min = 0.335, Mean = 0.412, SD = 0.047

26 Results: ELPA Program District-Level Value-Added Outcomes
Correlations between Reading and Speaking VA: Min = 0.121, Mean = 0.151, SD = 0.026

27 Results: ELPA Program District-Level Value-Added Outcomes
Correlations between Speaking and Writing VA: Min = 0.150, Mean = 0.199, SD = 0.029

28 Results: ELPA Program District-Level Value-Added Outcomes Impact of choice of psychometric model

29 Results: ELPA Program District-Level Value-Added Outcomes Impact of Including/Not Including Demographics

30 Results MEAP Mathematics

31 Results: MEAP Math Student-Level Outcomes Correlations among variables based on psychometric decisions

32 Results: MEAP Math Student-Level Outcomes Very high correlations based on fixed versus separate calibrations

33 Results: MEAP Math Student-Level Outcomes Very high correlations based on fixed versus separate calibrations

34 Results: MEAP Math Student-Level Outcomes
Correlations based on 1-PL versus 3-PL calibrations are not as high

35 Results: MEAP Math Student-Level Outcomes Moderate to high correlations across dimensions

36 Results: MEAP Mathematics School-Level Value-Added Outcomes Impact of fixed versus separate calibration

37 Results: MEAP Mathematics School-Level Value-Added Outcomes Impact of choice of outcome (Algebra vs. Number)

38 Results: MEAP Mathematics School-Level Value-Added Outcomes Impact of choice of psychometric model

39 Results: MEAP Mathematics School-Level Value-Added Outcomes Impact of Including/Not Including Demographics

40 Results: MEAP Mathematics School-Level Value-Added Outcomes Impact of covarying on one vs. two pre-test scores

41 Conclusions
Practically important impacts on value-added metrics and value-added classifications:
- Choice of psychometric model
- Including/not including demographics
- Including/not including multiple pre-test values
Prohibitive impacts on value-added metrics and value-added classifications:
- Choice of outcome (i.e., domain within construct)
Practically negligible impacts on value-added metrics and value-added classifications:
- Separate versus fixed calibrations of domains within construct

42 Conclusions, continued…
- Need to pay attention to modeling domains within constructs if constructs can reasonably be considered multidimensional.
- Of the common psychometric and statistical modeling decisions one can make, the choice of which subscore to use as an outcome is the most influential.
- Because subscores give different profiles of both student achievement and program/school value-added, each subscore should be modeled to the degree possible.
- Four-category (i.e., quartile) classifications of value-added are appreciably impacted by every psychometric and statistical modeling choice evaluated here, but three-category classifications are not.
  - Discourage more than three categories.
  - However, RTTT (Race to the Top) requires at least four categories.

43 Conclusions, continued…
The 3- vs. 4-category distinction is actually a proxy for:
- Statistical decision categories (3 categories)
- Arbitrary cut-point categories (4 categories)
Can leverage unidimensional calibrations of multidimensional achievement scales to create multidimensional profiles of value-added, except where using four categories of classification.

44 Limitations
Inductive reasoning:
- Results are likely to hold in similar circumstances.
- Will still need to investigate the feasibility of using fixed parameters from a unidimensional calibration in specific circumstances if those circumstances are high stakes.
- This is a proof of concept.
PCM and GPCM models were run using different software (WINSTEPS vs. PARSCALE).

45 Contact Information
Joseph A. Martineau, Ph.D., Executive Director, Bureau of Assessment & Accountability, Michigan Department of Education
Ji Zeng, Ph.D., Psychometrician, Bureau of Assessment & Accountability, Michigan Department of Education