Scaling and Equating
Joe Willhoft, Assistant Superintendent of Assessment and Student Information
Yoonsun Lee, Director of Assessment and Psychometrics
Office of Superintendent of Public Instruction

Overview
– Scaling: definition, purposes
– Equating: definition, purposes, designs, procedures
– Vertical scale

What is Scaling?
Scaling is the process of associating numbers with the performance of examinees. What does 400 mean on the WASL? It is not a raw score but a scaled score.

Primary Score Scale
Many educational tests use one primary score scale for reporting scores: raw scores, scaled scores, or percentiles. WASL and WLPT-II use scaled scores.

Activity Grade 3 Mathematics Items

G3 Math Items

Form | Points Possible | Test Difficulty | Cut Score
Form A | 6 | Easy | 5
Form B | 6 | Difficult | 3

Why Use a Scaled Score?
Minimizing misinterpretations. e.g., "Emmy got 30 points last year and met the standard. I got 31 points this year but did not meet the standard. Why? The cut score last year was 30 points and the cut score this year is 32 points. Did you raise the standard?"

Why Use a Scale Score?
Facilitate meaningful interpretation:
– Comparison of examinees’ performance on different forms
– Tracking of trends in group performance over time
– Comparison of examinees’ performance on different difficulty levels of a test

Raw Score and Scaled Score
Linearly (monotonically) related, based on the Item Response Theory ability scale:
– Each observed performance corresponds to an ability value (theta)
– Scaled score = a + b*(theta)

Linear Transformation
Simple linear transformation: Scaled Score = a + b*(ability). Two parameters, a and b, describe that relationship. We obtain some sample data and find the values of a and b that best fit the data to the linear regression model.

WASL
400 = a + b*(theta 1)
375 = a + b*(theta 2)
Theta 1 and theta 2 are established by the standard setting committees; a and b are determined by solving the equations above.
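The two anchor equations above determine a and b uniquely. A minimal sketch; the theta values used below are hypothetical, since the committee's actual cut thetas are not given in these slides:

```python
def scale_coefficients(theta1, ss1, theta2, ss2):
    """Solve SS = a + b*theta so the line passes through two fixed (theta, SS) points."""
    b = (ss2 - ss1) / (theta2 - theta1)
    a = ss1 - b * theta1
    return a, b

# Hypothetical committee thetas; the real WASL values are not in these slides.
a, b = scale_coefficients(0.2, 400.0, -0.5, 375.0)

def scaled_score(theta):
    return a + b * theta
```

With these coefficients, `scaled_score` reproduces 400 and 375 at the two cut thetas by construction.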

WLPT-II
Min Scaled Score = 300
Max Scaled Score = 900
300 = a + b*(theta 1)
900 = a + b*(theta 2)

WASL Scaling
375 is the cut between Level 1 and Level 2 for all grade levels and content areas; 400 is the cut between Level 2 and Level 3 for all grade levels and content areas. Each grade/content has a separate scale (WASL). All grade levels are on the same scale (WLPT-II) – vertically linked.

WASL: separate scales for G3, G4, G5, G6, G7, G8, and HS.

WLPT-II (Vertical Scale): a single scale spanning all grade levels from K up.

Equating

Purpose of Equating
Large-scale testing programs use multiple forms of the same test. Differences in item and test difficulties across forms must be controlled. Equating is used to ensure that scale scores are equivalent across tests.

Requirements of Equating
Four necessary conditions for equating (Lord, 1980):
– Ability: equated tests must measure the same construct (ability)
– Equity: after transformation, the conditional frequencies for each test are the same
– Population invariance
– Symmetry

Ability: Equated Tests Must Measure the Same Construct
Item and test specifications are based on definitions of the abilities to be assessed:
– Item specifications define how the abilities are shown
– Test specifications ensure representation of all aspects of the construct
Tests to be equated should measure the same abilities in the same ways.

Equity
Scales on the tests to be equated should be strictly parallel after equating. Frequency distributions should be roughly equivalent after transformation.

Population Invariance
The outcome of the transformation must be the same regardless of which group is used as the anchor. If score Y1 on form Y is equated to score X1 on form X, the result should be the same as if score X1 is equated to score Y1. If a score of 10 on 2007 Mathematics is equivalent to a score of 11 on 2006 Mathematics (when 2006 is used as the anchor), then a score of 11 on 2006 Mathematics should be equivalent to a score of 10 on 2007 Mathematics (when 2007 is used as the anchor).

Symmetry
The function used to transform the Y scale to the X scale is the inverse of the function used to transform the X scale to the Y scale. If the 2007 Mathematics scale is equated to the 2006 Mathematics scale, the function used to do the equating should be the inverse of the function used when the 2006 Mathematics scale is equated to the 2007 Mathematics scale.
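For the linear transformations used here, symmetry can be checked directly: equating a score from Y to X and then applying the X-to-Y function must return the original score. A sketch with illustrative coefficients (not actual equating constants):

```python
def equate_y_to_x(y, a, b):
    """Linear equating of a scale-Y score onto scale X: x = a + b*y."""
    return a + b * y

def equate_x_to_y(x, a, b):
    """Symmetry requires the X-to-Y function to be the exact inverse."""
    return (x - a) / b

# Illustrative coefficients only; real constants come from the equating procedure.
a, b = 1.3, 0.95
y = 27.0
x = equate_y_to_x(y, a, b)
assert abs(equate_x_to_y(x, a, b) - y) < 1e-9  # round trip recovers the score
```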

Equating Design Used in WASL
Common-Item Nonequivalent Groups Design (Kolen & Brennan, 1995):
1. A set of items in common (anchor items)
2. Different groups of examinees (in different years)

Equating Method
Item Response Theory equating uses a transformation from one scale to the other:
1. to make score scales comparable
2. to make item parameters comparable

Equating of WASL
The items on a WASL test differ from year to year (within grade and content area). Some items on the WASL have appeared in earlier forms of the test, and item calibrations (“b” difficulty/step values) were established; these are called “anchor items”. Each year’s WASL is equated to the previous year’s scale using these anchor items.

Equating Procedure
1. Identify anchor item difficulties from the bank.
2. Calibrate all items on the current test form without fixing anchor item difficulties.
3. Calculate the mean of the anchor items using bank difficulties.
4. Calculate the mean of the anchor items using calibrated difficulties from the current test.
5. Add a constant to the current test difficulties so the mean equals the mean from the bank values.

Equating Procedure
6. For each anchor item, subtract the current difficulty from the bank difficulty (after adding the constant).
7. Drop the item with the largest absolute difference greater than 0.3 from consideration as an anchor item.
8. Repeat steps 3–7 using the remaining anchor items.
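Steps 1–8 above can be sketched as an additive mean shift with iterative anchor screening; this is one reading of the slides' procedure, using hypothetical item difficulties:

```python
def anchor_equate(bank, current, tol=0.3):
    """Shift current-form item difficulties onto the bank scale using anchor
    items, iteratively dropping the worst-fitting anchor (steps 3-8).
    `bank` and `current` map item IDs to difficulties; only IDs present in
    both act as anchors. Returns the additive constant and surviving anchors."""
    anchors = sorted(set(bank) & set(current))
    while True:
        bank_mean = sum(bank[i] for i in anchors) / len(anchors)
        cur_mean = sum(current[i] for i in anchors) / len(anchors)
        const = bank_mean - cur_mean          # steps 3-5: additive shift
        # Step 6: bank difficulty minus shifted current difficulty, per anchor.
        diffs = {i: bank[i] - (current[i] + const) for i in anchors}
        worst = max(anchors, key=lambda i: abs(diffs[i]))
        if abs(diffs[worst]) <= tol:
            return const, anchors
        anchors.remove(worst)                 # step 7: drop it, then repeat (step 8)

# Hypothetical difficulties in logits, for illustration only.
bank = {"item1": -1.0, "item2": 0.2, "item3": 1.1}
current = {"item1": -1.2, "item2": 0.0, "item3": 1.7, "item4": 0.5}
const, kept = anchor_equate(bank, current)   # item3 misfits and is dropped
```

With this sample data, item3's difference exceeds the 0.3 tolerance on the first pass, so the constant is recomputed from the two remaining anchors.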

Equating Example
Item calibrations before equating (anchor items flagged on the right with “Y”).

Equating Example
Item #17 was removed as an anchor item; other anchors were kept.

Equating Example
Item calibrations after equating (anchor items fixed, with “A” in the Measure column).

Transformed Scores: Raw-to-Theta-to-Scale Procedures
1. Calibration software provides a Raw-to-Theta look-up table.
2. The Theta-to-Scale Score transformation is applied, derived from theta at the 3 cut-points from the Standard Setting committee:
θ(L2) → 375
θ(L3) → 400
θ(L4) → SS(L4), obtained by solving for θ(L4) in SS = m*θ + b, with m and b derived from θ(L2) and θ(L3)
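The raw-to-theta lookup and the linear theta-to-scale step compose into a single raw-to-scale-score table. A sketch with hypothetical thetas and coefficients (SS = m*theta + b, as on the slide; none of these are the real WASL values):

```python
def raw_to_scale_table(raw_to_theta, m, b):
    """Compose the calibration software's raw-to-theta lookup with SS = m*theta + b,
    rounding to whole scale scores."""
    return {raw: round(m * theta + b) for raw, theta in raw_to_theta.items()}

# Hypothetical lookup and coefficients -- not the real WASL values.
raw_to_theta = {0: -4.0, 1: -3.0, 2: -2.2, 3: -1.6}
print(raw_to_scale_table(raw_to_theta, m=25.0, b=400.0))
# {0: 300, 1: 325, 2: 345, 3: 360}
```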

Transformed Scores Example
In Grade 4 Mathematics, the Standard Setting Committee established the cut-point thetas. Setting θ(L2) = 375 and θ(L3) = 400 establishes the Theta-to-SS formula SS = m*θ + b; solving this at θ(L4) gives SS(L4).

Theta-to-SS Transformations: table of the current transformations.

Transformed Scores: Raw-to-Scale Score table from the equating report.

How to Determine the Cut Score (Until 2006)
If there is a 400, the cut score is 400. If 400 does not exist, the nearest score becomes the cut score. e.g.:
– …, 400, 402: 400 is the cut score
– 398, 401, 403: 401 is the cut score
– 399, 402, 405: 399 is the cut score

How to Determine the Cut Score (2007)
If there is a 400, the cut score is 400. If 400 does not exist, the next lowest score becomes the cut score. e.g.:
– …, 400, 402: 400 is the cut score
– 398, 401, 403: 398 is the cut score
– 399, 402, 405: 399 is the cut score
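The two rules differ only in the fallback when no obtainable scale score equals 400. A sketch (the slides do not specify tie-breaking under the pre-2007 "nearest" rule; `min` here keeps the first nearest score):

```python
def cut_score_until_2006(scores, target=400):
    """Pre-2007 rule: use the target if present, otherwise the nearest scale score."""
    if target in scores:
        return target
    return min(scores, key=lambda s: abs(s - target))  # ties: first nearest wins

def cut_score_2007(scores, target=400):
    """2007 rule: use the target if present, otherwise the next lowest scale score."""
    if target in scores:
        return target
    return max(s for s in scores if s < target)

# The slides' own examples:
assert cut_score_until_2006([398, 401, 403]) == 401
assert cut_score_2007([398, 401, 403]) == 398
assert cut_score_until_2006([399, 402, 405]) == 399
assert cut_score_2007([399, 402, 405]) == 399
```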

Vertical Scaling

Vertical Scale
– Examinee performance across grade levels on a single scale
– Measure individual student growth
– Locate all items across grade levels on a single scale
– Place proficiency standards from different grade levels on a single scale

Vertical Scaling vs. Equating
Equating: scores on different test forms can be used interchangeably within a grade level.
Vertical scaling:
– Performance across all grade levels on the same scale
– Measures students’ growth
– Not equating

Data Collection Design
Common item design:
– Common items between adjacent grade levels
– Select appropriate-level items for each grade
Equivalent groups design:
– Same examinees
– Take an on-grade test or an off-grade test (usually the lower grade’s test)
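The slides do not spell out how the vertical scale is assembled from the common-item design. A minimal sketch, assuming the adjacent-grade common items yield additive linking constants (like the equating constants earlier) that are chained onto a base grade's scale:

```python
def vertical_scale_constants(adjacent_shifts, base_grade=3):
    """Chain adjacent-grade linking constants into one vertical scale.
    adjacent_shifts[(g, g + 1)] moves grade g+1 difficulties onto grade g's
    scale (estimated from the common items the two grades share). Returns the
    cumulative constant that places each grade on the base grade's scale."""
    constants = {base_grade: 0.0}
    g = base_grade
    while (g, g + 1) in adjacent_shifts:
        constants[g + 1] = constants[g] + adjacent_shifts[(g, g + 1)]
        g += 1
    return constants

# Hypothetical linking constants in logits, for illustration only.
print(vertical_scale_constants({(3, 4): 0.5, (4, 5): 0.75}))
# {3: 0.0, 4: 0.5, 5: 1.25}
```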

Common Item Design (WASL)

Base Grade | Test grade
G3 | 4
G4 | 5
G5 | 6
G6 | 7
G7 | 8

Previous Vertical Linking Study
Math in Grades 3, 4, and 5. Purpose of the study:
– How much are students growing over time?
– What is the precision of these estimates?

Data
The data consist of items used in the pilot tests for Grades 3 and 5 in 2004 and 2005, and operational data for Grade 4 in 2005.

Linking Design
– Items across all forms in three grades
– Each form within a grade includes a common block of items
– Common-item nonequivalent groups design

Common Item Design (WASL)

Base Grade | Test grades taken
3 | G3, G4
4 | G3, G4, G5
5 | G4, G5

Item Review (Item Means): table of common-item means by grade.

Item Review: table of item means by grade.

Results
– Comparing the p-values for the linking items across grades suggests some instability
– Growth is larger from grade 3 to 4 than from grade 4 to 5
– Pilot data vs. operational data
– Motivation factor (G4 to G5)
– Backward equating

Future Plan Vertical linking study will be conducted in January 2008 using the 2007 reading WASL. The results will be presented next year.