Closing the loop: Providing test developers with performance level descriptors so standard setters can do their job
Amanda A. Wolkowitz, Alpine Testing Solutions; James C. Impara, Psychometric Inquiries; Chad W. Buckendahl, Psychometric Consultant

What do standard setters do & how do they do it?  Standard setters recommend cut scores.  An early step in the process is defining what the "borderline" examinee is expected to be able to do in terms of the test content. Specifically, panelists examine the performance level descriptors (PLDs) and define the borderline examinee at each performance level.  Using modifications of the Angoff or Bookmark methods, they review test items and judge how difficult each item would be for examinees at the borderline of one or more performance categories.  Typically, they estimate item difficulty in one or more rounds, sometimes with item statistics or other data provided after their first round of estimates.
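
A minimal sketch of how item-level Angoff-style judgments are typically rolled up into a recommended cut score; the ratings and the use of the panel median below are illustrative assumptions, not details reported in the presentation.

```python
import statistics

# ratings[p][i] = panelist p's estimate of the probability that a borderline
# examinee answers item i correctly (hypothetical round-1 values).
ratings = [
    [0.8, 0.6, 0.4, 0.7, 0.9],
    [0.7, 0.5, 0.5, 0.6, 0.8],
    [0.9, 0.6, 0.3, 0.7, 0.8],
]

# Each panelist's implied cut score is the sum of their item-level estimates.
panelist_cuts = [sum(p) for p in ratings]

# Summarize the panel's round-1 recommendation (the median is one common choice).
recommended_cut = statistics.median(panelist_cuts)
print(panelist_cuts, recommended_cut)
```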

How are tests developed?  Item writers, typically content "experts," draft items that are responsive to the test specifications (or test blueprint).  The test blueprint may include a description of the various performance levels, but most often it does not.  The test blueprint virtually never provides a description of the "borderline" examinee at each performance level.

Example  Suppose you are a teacher who is writing an end of term test that has 12 questions and you will give the grades A, B, or C to your students.  Typically, you write your 12 questions and grade based on some arbitrary score scale.  Suppose that instead of using the arbitrary score scale you define the skills and knowledge of students at each mark. So you declare what a C student should know, what a B student should know and what an A students should know.

Example - Continued  Moreover, knowing that each definition represents a range of skills, you also define the skills and knowledge of a borderline student at each level.  Now you write four questions associated with each performance level, some targeted at the borderline and others above it.  Grading is easy: if a student answers four or fewer questions correctly he/she gets a C, a score between 5 and 8 earns a B, and 9 or more earns an A. Does it matter which questions the student answers correctly? Maybe, but that's another paper.
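
A toy implementation of the grading rule in this example, using the cut points stated on the slide (4 or fewer correct = C, 5 to 8 = B, 9 or more = A):

```python
def grade(raw_score: int) -> str:
    """Map a raw score on the 12-item test to a letter grade."""
    if raw_score <= 4:
        return "C"
    elif raw_score <= 8:
        return "B"
    return "A"

print([grade(s) for s in (3, 4, 5, 8, 9, 12)])  # ['C', 'C', 'B', 'B', 'A', 'A']
```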

How does the example fit?  In the example, the teacher is both the test developer and the standard setter.  In the case of large-scale assessments, these tasks are done by different groups.  If the test developers don't know the PLDs, they may make the job of the standard setting panelists very difficult.  Let's see what happened.

The study design  This was not a designed study, but an ad hoc study. That is, we developed the research question and looked for data that would provide some answers. Thus, there are some limitations.  The first data collection was in 2009 and the second was in Both in the same southeastern state in the USA.  Both related to the same assessment.

The study design - 2  2009  Performance level descriptors (PLDs) defined initially.  Borderline performance described for each PLD.  Standard setting done for alternate assessments (for students with severe cognitive disabilities) in:  English Language Arts (ELA) grades 4 – 8  Mathematics grades 3 – 8  All tests had 15 items scored dichotomously (0 or 2 points for each item).  Four performance levels were defined, thus three cut scores.  There were separate panels for each content area.

The study design - 3  2013 Standard Setting was the same as 2009 except:  PLDs developed in 2009 were examined and refined. The original PLDs were known to test developers and drove the development process.  Scoring was modified from dichotomous to three-point scoring for each item – partial credit was permitted, so scores were 0, 1, or 2.  Slightly fewer panelists (17 – 20 for each grade span in 2009, 14 – 15 in 2013).

Study design - 4  A final difference between the two standard setting activities was the method used in the standard setting.  2009 used the Modified Angoff method as described by Impara & Plake (1997), often characterized as the Yes/No method.  2013 used the Extended Angoff method as described by Hambleton & Plake (1995) and Plake & Hambleton (2001).  The reason for this difference was the change from dichotomous scoring to 3-point partial-credit scoring.  Both methods rely on item-level judgments.
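
A minimal sketch contrasting the two judgment tasks as they are commonly operationalized; the ratings below are invented, and the assumption that cut scores are expressed on the 0–30 raw score scale (15 items worth 2 points each) follows the test description earlier in the deck.

```python
# 2009-style Yes/No (Modified Angoff): for each dichotomous item (scored 0/2),
# a panelist judges whether the borderline examinee would answer it correctly.
yes_no = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # 15 hypothetical judgments
cut_yes_no = sum(yes_no) * 2  # each correct answer is worth 2 raw-score points

# 2013-style Extended Angoff: for each partial-credit item (scored 0/1/2),
# a panelist estimates the score the borderline examinee would earn.
expected_scores = [2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 1, 2, 0, 1]
cut_extended = sum(expected_scores)

print(cut_yes_no, cut_extended)
```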

PLDs  There were four performance levels:  Achievement Level 1 (limited command),  Achievement Level 2 (partial command),  Achievement Level 3 (solid command), and  Achievement Level 4 (superior command).  The PLDs further defined each level  Level 1 students would need academic support,  Level 2 students would likely need academic support,  Level 3 students would be prepared, and  Level 4 students would be well prepared to be successful in further studies in that content area.  PLDs also contained specific abilities that students at that given level could demonstrate.

Study Expectations  The principal research question was: Will the consistency of ratings at the end of round 1 of the standard setting process increase? That is, will developing items with known PLDs help panelists be more consistent in their initial ratings and more congruent with the item p-values prior to any feedback?

Why?  Why is this an important question?  If panelists are more consistent in their round 1 ratings, then they may come to closure faster in subsequent rounds, perhaps reducing the number of rounds (sometimes 3 rounds are used) and thus making the process more efficient.  Panelists often become frustrated if no items, or too few items, fall at a performance level, which causes them to question the validity of the process.

How?  How will we know if there is greater consistency among panelists?  Distribution of students across levels will be consistent with expectations – most students will be classified at Levels 2 and 3.  There should be greater congruence between actual item difficulty and the panelists' estimate of item difficulty.  The correlation between actual item difficulty and panelists' item difficulty estimate will be higher.  The range of panelists' cut scores will be lower.  Percentage of panelists who were within one point of the recommended cut score at the end of round 1 will be higher.  The standard deviation of the panelists' cut scores at each level will be lower.
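
A sketch of the panel-level consistency summaries listed above (range, percent within one point of the round-1 median, and standard deviation), applied to a hypothetical set of round-1 cut score recommendations for a single performance level; the numbers are invented.

```python
import statistics

cuts = [5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9]  # one round-1 cut per panelist

cut_range = max(cuts) - min(cuts)
median_cut = statistics.median(cuts)
pct_within_one = 100 * sum(abs(c - median_cut) <= 1 for c in cuts) / len(cuts)
sd = statistics.stdev(cuts)  # sample SD; the deck does not say which formula was used

print(cut_range, median_cut, round(pct_within_one, 1), round(sd, 2))
```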

Result – 1  Distribution of students across levels

Distribution of students - ELA

 In virtually every grade in the 2009 standard setting, many students were assigned to achievement level 4, the highest level.  In 2013, the distribution was much more appropriate, with most students assigned to levels 2 and 3.

Distribution of students - Math

 In 2009, several of the grades showed appropriate distributions, but many still had many students assigned to levels 1 and 4.  In 2013, relatively few students were assigned to levels 1 and 4, and the preponderance of students were classified as level 2 or 3.

Congruence of actual and panelist's item difficulty  It was expected that the actual item difficulty value for an item (i.e., the percent of students in the population who get the item correct) would be greater than or equal to the corresponding Level 2 cut score rating and less than the corresponding Level 4 cut score rating.  Hence, the actual p-value would be between the Level 2 cut score and the Level 4 cut score.  The exceptions should be a relatively small number of items whose difficulties fall outside this range: those items that virtually all examinees answer correctly (the items targeted at Level 1) and those that virtually no one answers correctly (the items targeted at Level 4).
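
Stated as an inequality (the notation is ours, not from the slides): if $p_i$ is the empirical difficulty of item $i$ and $\bar{r}_{i,L}$ is the panel's average rating for item $i$ at achievement level $L$, with both expressed on the same proportion-of-maximum-score scale, the expected pattern for most items is

$$\bar{r}_{i,2} \;\le\; p_i \;<\; \bar{r}_{i,4},$$

with only the items targeted at Level 1 or Level 4 expected to fall outside this interval.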

Congruence of actual and panelist’s item difficulty

Congruence - summary

Correlation Analysis  A correlation analysis compared the relationship between actual item difficulty values and the average item rating at each achievement level.  Expectation: a direct relationship between the item's difficulty value and the average item rating.  As the item difficulty value increases (i.e., the item becomes easier), the greater the chance that a borderline student will respond to the item correctly.  This trend was expected for all three cut scores.
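
A sketch of this check: correlate the items' empirical difficulty values with the panel's average round-1 item ratings at one achievement level. The values below are invented for illustration.

```python
import numpy as np

p_values = np.array([0.35, 0.52, 0.61, 0.70, 0.78, 0.84])  # proportion correct per item
mean_ratings = np.array([0.6, 0.9, 1.1, 1.3, 1.5, 1.7])    # mean 0-2 rating at one level

r = np.corrcoef(p_values, mean_ratings)[0, 1]
print(round(r, 3))  # a strong positive r is the direction the authors expected
```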

Correlation Analysis  Results – the reverse of expectations.  The 2009 item ratings generally had moderate to strong positive correlations with the corresponding item difficulty values.  The 2013 ratings tended to have only moderate correlations at best.  That is, the 2009 ratings correlated more highly with the p-values than did the 2013 ratings.

Correlation Analysis  Why were the 2009 correlations higher?  One possible explanation: the 2009 panel only had to make Yes/No judgments, whereas the 2013 panel had to judge whether a student would score 0, 1, or 2 points on the item.  Another possible explanation: the items on the 2013 exams may have had more similar difficulty values around the intended PLDs than the 2009 items.  Also, it was learned that in 2013 few students were assigned a score of 0, resulting in a restriction in the range of p-values.

Internal Consistency – Range  Internal consistency was evaluated in several ways.  First, by comparing the range of recommended cut scores following round 1 for each level and panel.  A smaller range would indicate that the given year's panel was more consistent in their ratings than the other year's panel.

Internal Consistency – Range  Range – ELA

Internal Consistency – Range  Range – Math

Internal Consistency – Proximity to the median  Internal consistency was evaluated several ways:  Second, by calculating the percent of panelists whose ratings were within one point (plus or minus) of their panel's recommended Round 1 median cut score (all median cut scores ended up as possible scores, i.e., no median cut score ended in ".5").  Thus, if the panelists' cut scores were all relatively close together, a high percentage of them would be close to the median.  For example, the median Level 2 cut score recommendation for the "Math – 4" exam was 6 out of 30 points for 2009 and 7 out of 30 points for 2013.

Internal Consistency – Proximity to the median ELA

Internal Consistency – Proximity to the median Math

Internal Consistency – Standard Deviation  Internal consistency was evaluated several ways:  The third way to look at internal consistency was to compare the standard deviations of the panelists' ratings across years.  Thus, if the panelists are more consistent, then the standard deviations will be smaller.

Internal Consistency – Standard Deviation ELA

Internal Consistency – Standard Deviation Math

Recall this slide  How will we know if there is greater consistency among panelists?  Distribution of students across levels will be consistent with expectations – most students will be classified at Levels 2 and 3.  There will be greater congruence between actual item difficulty and the panelists' estimate of item difficulty.  The correlation between actual item difficulty and panelists' item difficulty estimate will be higher.  The range of panelists' cut scores will be lower.  Percentage of panelists who were within one point of the recommended cut score at the end of round 1 will be higher.  The standard deviation of the panelists' cut scores at each level will be lower.

How did we do in terms of distribution of students?  Expected result:  Distribution of students across levels will be consistent with expectations – most students will be classified at Levels 2 and 3.  Actual result:  In both ELA and Math the results were as expected in virtually every grade and performance level.  Thus, positive results

How did we do in terms of congruence of item difficulty?  Expected result:  There will be greater congruence between actual item difficulty and the panelists' estimate of item difficulty.  Actual result:  There was greater congruence between actual item difficulties and panelists' estimates of item difficulty at all levels and grades.  However, there were few cases in which the actual p-values were outside the cut score boundaries.  Thus, somewhat positive results, but somewhat problematic.

How did we do in terms of correlation of actual and estimated item difficulty?  Expected result:  The correlation between actual item difficulty and panelists' item difficulty estimate will be higher in 2013.  Actual result:  The 2009 ratings correlated more highly with the p-values than did the 2013 ratings.  Thus, negative results.

How did we do in terms of the ranges of panelists' cut scores?  Expected results:  The range of panelists' cut scores will be lower in 2013.  Actual results:  In the majority of grades and levels the range of cut scores was lower in 2013, particularly at Levels 3 and 4.  Thus, mostly positive results.

How did we do in terms of the proximity of panelists' cut scores to the median?  Expected result:  Percent of panelists who were within one point of the recommended cut score at the end of round 1 will be higher in 2013.  Actual result:  In the majority of comparisons, the percent of panelists' ratings that were within one point of the median was higher in 2013.  Thus, mostly positive.

How did we do in terms of the standard deviations of panelists' cut scores?  Expected results:  The standard deviation of the panelists' cut scores at each level will be lower in 2013.  Actual results:  In the majority of comparisons the 2013 panels had lower standard deviations than the 2009 panels.  Thus, mostly positive.

Overall & Conclusion  The results overall supported providing test developers with the PLDs.  This study has many limitations, and more specifically designed research is needed.  If future studies support providing test developers with the PLDs, and developers are instructed to target item development to those PLDs, the result could be a more efficient standard setting process and greater satisfaction among panelists.

Questions?  Thank you for your attention.  Are there any questions?