1
Bringing Performance-Based Assessment into the Summative Space
June 28, 2017
2
Objectives
To understand the foundational steps states and local education agencies must take to build capacity for using performance assessment in the summative space
To understand the technical challenges that must be addressed in order to use performance assessments for summative purposes
To highlight examples from Ohio and other contexts that illustrate the journey toward this goal
3
Our Presenters
Dr. Raymond Pecheone, Stanford Center for Assessment, Learning and Equity (SCALE)
Dr. Stuart Kahl, Measured Progress
Dr. Ruth Chung Wei, SCALE
Dr. Susan Zelman, Ohio Department of Education
4
Agenda
Principles for Designing an Innovative Assessment System FOR Learning
Lessons Learned – What Have We Learned from Past Performance Assessment Initiatives?
Examples of Emerging Approaches for Performance Assessment Use in the Summative Space
A State's Perspective: Insights from Ohio
5
Principles for Designing an Innovative Assessment System FOR Learning
Raymond L. Pecheone, Stanford University
6
Room to Innovate
Innovative assessment systems may make use of:
"competency-based assessments, instructionally embedded assessments, interim assessments, performance-based assessments that combine into an annual summative determination for a student," and "assessments that validate when students are ready to demonstrate mastery or proficiency and allow for differentiated support based on individual learning needs"
(ESSA, 2015, Section 1204(a))
7
Why Principles for Design?
ESSA does not provide an underlying theory of action about the broader purposes assessments should serve (as NCLB did).
States now have the autonomy to be the "master of their ship."
Not all designs will lead to better assessment.
8
Innovative Assessment FOR Learning – Some Principles
Be part of a multiple measures system that is connected to curriculum and instruction
Support both teacher and student learning
Engage teachers throughout the process
Provide timely, useful information to teachers, students, parents, and schools
Be practical in the service of deeper learning
Support language development in diverse classrooms
9
A. Assessments should be part of a multiple measures system that is connected to curriculum and instruction.
Principle A in action (an example): A collection of instructionally connected Classroom Based Assessments (CBAs) comprised of performance assessments and other appropriate item types aligned to the learning outcomes to be measured.
By "instructionally connected" we mean: 1) the CBA is completed in the context of a sequence of relevant learning experiences that prepare students for, and possibly scaffold, their ability to complete the CBA; 2) the CBA is aligned to opportunities to learn, i.e., the assessment measures content understandings and skills that have previously been taught; and 3) the CBA provides immediate, actionable information that teachers can use to make instructional decisions.
In this system, CBAs would be completed by students in all tested grades and subjects throughout the year with the support of their teachers. These CBAs could be comprised of performance assessments, each designed to provide evidence of meeting a cluster of key grade-level competencies delineated in the state's content-area standards (e.g., research); a set of constructed-response questions that measure a smaller number of grade-level competencies (e.g., making inferences); and, if appropriate to the measurement target, shorter selected-response tests that are completed in an on-demand setting and scored in an automated way (e.g., a measure of reading level). They should reflect the kinds of assessment formats that teachers use regularly to support and measure student learning, so that students are familiar with the formats.
10
B. Assessments should be supportive of both teacher and student learning.
Principle B in action (examples):
Rigorous standards-aligned tasks that prompt student thinking and allow students to exercise voice and choice
Teachers score some of their own students' work so they can get results quickly and act immediately on the results
The assessment itself should provide an opportunity for students to learn and should spur student inquiry, experimentation, and reflection. A high-quality assessment also provides opportunities for teachers to learn what is and is not working in their own instructional practice (Stiggins et al., 2004; Stiggins & Chappuis, 2006; Heritage, 2007).
11
C. Assessments should engage teachers throughout the assessment process.
Principle C in action (examples):
Teachers select or design Classroom Based Assessments that fit best within their curriculum
Teachers determine when students are ready to demonstrate mastery of the targeted competencies and administer the CBAs they have selected in any sequence across the academic year, following instruction relevant to the content addressed by the task
Teacher involvement in developing and scoring assessments not only supports the quality of the assessment system, but is, in itself, a high-quality professional learning experience that can increase educator assessment literacy, improve instructional strategies, and lead to deeper student learning (Arnold, 2016; Darling-Hammond & Falk, 2013; Herman, Aschbacher, & Winters, 1992).
12
BUT What about comparability?
State-level CBA bank that includes multiple CBA options for each measurement target per grade level, with common scoring tools
Teachers may design new tasks according to task specifications and submit them for inclusion in the bank in future years
State-level assessment quality review board
1) An authorized state task bank comprised of grade-specific assessment instruments that have been developed, validated, and vetted for quality and alignment to measurement targets. These tasks and accompanying scoring tools, including common rubrics, would be developed by the state education agency (or its contractors) in collaboration with teams of educators (teachers, principals, school leaders) and local education agencies, and reviewed by stakeholders representing students with disabilities, English learners, and other children, parents, and civil rights organizations.
2) Locally developed assessment tasks that have been designed using approved design templates, task models, assessment specifications, and common rubrics, and that have been vetted for quality and alignment to target standards clusters by a state-level assessment quality review board that includes both educators and stakeholders. Locally developed assessments that are approved may be incorporated into the authorized state task bank for future years.
13
Intermittent use of standardized tests in some years (an example)
[Table: Grades 3 through 12 by subject (ELA, Mathematics, Science), showing CBAs in most cells and "CBAs + Standardized Test" in one grade per grade span.]
To evaluate comparability, per the Innovative Assessment Demonstration Authority regulations, high-quality standardized assessments that include performance tasks (e.g., SBAC, PARCC) could be administered once in each of three grade spans (Grades 3-5, 6-9, 10-12). These assessments could be used to measure the claims that are most validly measured by standardized formats. For example, in English language arts, the Reading and Listening claims may be measured validly using standardized assessments, while the Writing and Research claims could still be measured using CBAs. In years when the large-scale assessments are administered, the use of instructionally connected CBAs could be reduced but not eliminated (in order to reduce the burden for teachers and students).
Note 1: In grade levels where state testing is not required by ESSA, formative CBAs not included in the state's accountability system could be developed and used by teachers to support a consistent instructional and assessment approach across grade levels.
Note 2: Testing in History/Social Studies is not required by ESSA, but in states that do assess this content area, CBAs could be used across all grade levels, with formative, teacher-developed CBAs being used in most years except at three grade levels across the Grade 3-12 span.
14
BUT What about validity and reliability?
Multiple measures – performance assessments along with other item types (constructed-response, selected-response) are used and scaled together, from across different assignments and tests
Scoring options:
Local scoring with external audit and score adjustments
Local scoring (10%) with external scoring (90%)
External moderation should include teachers to support ownership
1) Multiple measures – multiple performance tasks and a range of item formats in the CBAs, in addition to performance assessments, allow for sampling across more of the standards and across multiple occasions, providing more opportunities for students to demonstrate their learning without relying on one type of test alone. This strategy supports both content and construct validity, and reliability.
2) Reliable scoring – some proportion of students' work on the selected CBAs would be scored by the teachers themselves, so that immediate feedback could be provided to students and teachers could use information from these assessments to guide ongoing instructional decisions. To reduce the scoring burden for teachers, they could score a sample of the total number of work samples in order to generate immediate, instructionally useful information, with the remaining samples scored by professional raters or through a distributed blind-scoring approach in which teachers may be involved. Teachers could receive training through online webinars or tutorials to score the common CBAs using common scoring tools, with the goal of meeting a calibration standard.
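To make the "local scoring with external audit, score adjustments" option concrete, here is a minimal sketch assuming a hypothetical data model in which an external auditor rescores a sample of each classroom's CBA work and classrooms whose teacher scores drift beyond a tolerance are flagged; the tolerance and adjustment policy are illustrative, not part of the proposal.

```python
# Minimal sketch (hypothetical data model): compare teacher-assigned CBA scores
# against an externally audited sample and flag classrooms whose scores drift.
from statistics import mean

def audit_drift(teacher_scores, audit_scores, tolerance=0.25):
    """Parallel lists of rubric scores for the audited sample from one classroom.
    Returns the mean difference (teacher minus auditor) and whether it exceeds
    the tolerance, i.e., whether a score adjustment should be considered."""
    diffs = [t - a for t, a in zip(teacher_scores, audit_scores)]
    drift = mean(diffs)
    return drift, abs(drift) > tolerance

# Example: a classroom where the teacher scores about 0.6 points higher than auditors.
drift, flag = audit_drift([4, 4, 3, 4, 3], [3, 3, 3, 4, 2])
print(f"mean drift = {drift:+.2f}; flag for score adjustment: {flag}")
```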
15
D. Assessments should provide timely, useful information to teachers, students, parents, and schools.
Principle D in action (examples):
Teachers receive periodic release days to score student work on CBAs
Teachers participate in local "professional moderation boards" to audit local scores
"Data dashboards" of student performance on all CBAs are readily available to teachers and parents
1. Scoring CBAs is built into teachers' professional time. In order to score their students' work immediately after it has been submitted, and to use the assessments to understand where each student's learning is solid and where it is more fragile, teachers need to be supported to take the time needed to score student work carefully. For example, they could be entitled to one school day of release time per CBA, scheduled at their own convenience. These release days for scoring, immediately following collection of each CBA, are crucial for allowing teachers to provide immediate feedback on student learning and to use the data from the assessments effectively to guide decisions about day-to-day instruction.
2. Teachers participate in a "professional moderation board," a rotating committee primarily comprised of trained educators with some proportion of professional raters. This allows teachers to participate in setting the standard for rigorous scoring and deepens their own understanding of the meaning of student results.
3. The state could invest in technology solutions to create a data dashboard, updated by teachers and professional scoring vendors throughout the year, showing each student's current levels of proficiency by content-area domain; at the end of the year, the "rolled up" summative scores would be reported for school-level learning.
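As an illustration of the dashboard and end-of-year roll-up idea, here is a minimal sketch assuming a hypothetical record of rubric scores tagged by content-area domain; the simple averaging is illustrative only, since a real system would apply whatever aggregation rules the state adopts.

```python
# Minimal sketch (hypothetical structure): roll a student's CBA rubric scores up
# by content-area domain into the year-end summative view a dashboard might show.
from collections import defaultdict
from statistics import mean

cba_results = [            # (domain, rubric score on a 1-4 scale), logged as CBAs are scored
    ("Research", 3), ("Research", 4),
    ("Writing", 2), ("Writing", 3), ("Writing", 3),
    ("Reading", 4),
]

by_domain = defaultdict(list)
for domain, score in cba_results:
    by_domain[domain].append(score)

dashboard = {domain: round(mean(scores), 2) for domain, scores in by_domain.items()}
summative = round(mean(dashboard.values()), 2)   # one simple way to "roll up" at year end
print(dashboard, "| rolled-up summative:", summative)
```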
16
E. Assessments should be educative and practical.
Principle E in action (examples):
Assessments "worth doing"
Assessments spread across the year
Assessments administered like classroom assignments, not testing events
Assessments generate actionable feedback
Assessments and assessment systems should not be overly burdensome – logistically, time-wise, or otherwise – to teachers, students, or schools. Classroom Based Assessments should be instructionally connected and structured as learning events rather than siphoning time from instruction or opportunities to learn. Instructionally connected Classroom Based Assessments are administered as teacher-supported learning opportunities in the classroom, so students remain in their usual classroom setting, continuing to learn and demonstrate learning as they would at any other time in the school year. These assessments do not require adjustments to the school schedule or create "dead time" during which some students access the computer labs while others are held in classrooms for long periods waiting for their testing window. In other words, these assessments do not stop learning to start testing.
17
F. Assessments should support language development in diverse classrooms.
Principle F in action (examples):
Assessments are designed and vetted with systematic attention to language development needs and opportunities
Teachers are involved in designing, vetting, and scoring assessments with this design feature, which builds their capacity to attend to their own students' language development
Classroom based assessments that go beyond requiring students to bubble in or write a correct answer, such as constructed-response items, performance tasks, and investigations, are more likely to engage learners of English and disciplinary language in opportunities to develop both their social and disciplinary language skills (Bunch, Kibler, & Pimentel, 2012; Moschkovich, 2012; Quinn, Lee, & Valdés, 2012). These types of assessments may be open-ended and allow for multiple solution strategies, multiple entry points, and varied ways of demonstrating learning. This open-endedness, and administration in a classroom context, provides opportunities for students to interact with each other and use language in context, so that they develop content understandings along with social and disciplinary language skills. Such assessments also provide more substantive and actionable information about students' understandings and skills that teachers and students can use to make decisions about learning.
Design features for supporting and assessing disciplinary understandings and language development are built directly into the task design and scoring tools, including task specifications, quality criteria, task templates or task models, and rubrics. Furthermore, attention to language barriers and language supports is built into the vetting process for tasks submitted as candidates for both the authorized state task bank and the locally developed task bank, described above. The state-level assessment quality review board, which includes both educators and stakeholders, either includes or consults with experts in the field of language development in school contexts.
18
Other Technical Considerations in the Summative Space
Validity: tight alignment to standards, including deeper learning outcomes and more complex cognitive demands
Comparability: common architecture and design criteria (e.g., cognitive complexity); state assessment quality review board that includes educators
Assessments should be tightly aligned to the full range of state and national standards (e.g., CCSS, NGSS, C3 Framework), including deeper learning competencies and more cognitively demanding thinking and doing in the disciplines. Current state assessments do not and cannot measure the full range of the standards. Using a variety of multiple measures, including performance assessment, opens the door for the full range of student learning outcomes to be assessed.
High-quality embedded assessment tasks should be developed using well-vetted design principles. Many research centers and district networks use common frameworks such as Evidence Centered Design (Mislevy, Almond, & Lukas, 2003) and Learning Centered Design (SCALE, 2016) to guide their development of performance tasks. Being transparent and explicit about the design of tasks enables peer and expert review based on core design principles so that tasks can be objectively evaluated and independently certified and curated. Curation and certification should be implemented at all levels of the system (district, regional, and/or state). Using social moderation techniques to judge task quality accomplishes three important goals: (1) it creates a common language and interpretation of student work, (2) it provides a feedback loop for continuous improvement, and (3) it serves to ensure and certify that locally developed curriculum-embedded assessments meet technical standards of quality (trustworthiness).
Traditionally, comparability is an estimate of whether test items/performance tasks are equally difficult (Klein & Hamilton, 1999). Comparability is the degree to which multiple forms of an assessment measure the same learning targets (identified standards) at about the same level of difficulty and produce the same or similar results.
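As a small illustration of the difficulty sense of comparability described above, the sketch below compares the average difficulty (mean score as a proportion of points possible) of two hypothetical parallel CBA forms; the data and the 4-point scale are made up.

```python
# Minimal sketch: the difficulty sense of comparability. Two parallel CBA forms
# targeting the same standards should yield about the same average difficulty
# (mean score as a proportion of points possible). All data are made up.
def difficulty(scores, max_points):
    return sum(scores) / (len(scores) * max_points)

form_a = [3, 4, 2, 3, 3, 4, 2]   # rubric scores (0-4 scale) on Form A
form_b = [2, 3, 3, 4, 3, 3, 2]   # rubric scores (0-4 scale) on parallel Form B
print(f"Form A difficulty: {difficulty(form_a, 4):.2f}")
print(f"Form B difficulty: {difficulty(form_b, 4):.2f}")
```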
19
Other Technical Considerations in the Summative Space
Comparability to Standardized Assessments? Two interpretations:
Strong correlation with standardized test scale scores
Proficiency level determinations are comparable
The former interpretation raises a number of questions, including: 1) if an innovative assessment system is to be innovative, why would the measure of its validity be based on old assessment systems that clearly have limitations in what they can measure? 2) if an innovative assessment system is to be innovative, would we not expect it to generate different and richer kinds of information about what students understand and can do with their content understandings? In that case, why would we want the innovative assessment to demonstrate "statistically valid comparisons" with the old state summative assessments?
The second interpretation of the regulations – that proficiency level determinations should be comparable – is a more reasonable standard to meet, but even then, the assumption is that the old summative assessment systems are generating valid determinations of proficiency based on a limited source of evidence (primarily multiple-choice items that measure a different kind of content knowledge than that measured by the innovative assessments).
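The two readings of comparability can be made concrete with a small sketch on invented data: one computes a correlation between innovative-assessment scores and state-test scale scores, the other checks agreement on proficiency determinations; the scores and cut scores are hypothetical.

```python
# Minimal sketch of the two comparability readings, on invented data:
# (1) correlation between innovative-assessment scores and state-test scale scores;
# (2) agreement on proficiency-level determinations. Cut scores are hypothetical.
from statistics import mean, pstdev

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

innovative = [310, 355, 402, 378, 290, 365, 420, 352]
state_test = [2410, 2490, 2555, 2520, 2380, 2500, 2580, 2450]

r = pearson(innovative, state_test)              # reading 1: scale-score correlation
innov_prof = [s >= 350 for s in innovative]      # reading 2: same proficiency call?
state_prof = [s >= 2480 for s in state_test]
agreement = mean(a == b for a, b in zip(innov_prof, state_prof))

print(f"scale-score correlation r = {r:.2f}; proficiency agreement = {agreement:.0%}")
```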
20
Other Technical Considerations in the Summative Space
Defensible Score Scales
Common rubrics or scoring tools developed and accepted by state-level stakeholders
Hybrid: common state-level core rubrics with local add-on rubrics to address local priorities
A persistent dilemma in using performance assessment within a large-scale assessment system is achieving agreement around the development and use of a scoring metric. Locally developed and designed performance assessments are often coupled with rubrics that are home-grown and developed by local committees. While locally developed score scales have the advantage of local buy-in, they pose real dilemmas for judging the quality and comparability of student performance within a large-scale state accountability system. In effect, they are "rubber rulers" that are difficult to aggregate and are not generalizable across classrooms or schools, given differences in contexts, content, score levels, teacher expectations, and cognitive demand.
To address this formidable challenge, rubrics and/or score scales for state accountability consideration should be developed and agreed to at the regional or state level by content experts and teacher leaders. Rubrics/score scales must be aligned to standards and/or critical abilities (e.g., 21st Century Learning) that can be uniformly applied to scoring student performance tasks across all schools and districts. The structure of the common rubrics could vary across states, but having a common scale is essential to the development of trustworthy, locally developed, curriculum-embedded assessments.
A hybrid approach to the development of common rubrics within or across disciplines is to identify core abilities aligned to content standards, treat the core abilities as the "common anchor," and allow local schools/districts to add scoring criteria that are either not present or under-represented. In effect, this is a hybrid rubric with an anchor set of common rubrics.
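One way to picture the hybrid rubric idea is as a data structure in which the state-level anchor criteria are fixed and local add-on criteria are kept separate from the comparable score; the criterion names below are hypothetical.

```python
# Minimal sketch (hypothetical criteria): a hybrid rubric with state-level anchor
# criteria that feed the comparable score, plus local add-on criteria kept separate.
STATE_ANCHOR_CRITERIA = {"Argument from evidence", "Disciplinary accuracy", "Communication"}

def reportable_scores(student_scores):
    """Keep only anchor criteria for state-level aggregation; local add-ons stay local."""
    return {c: s for c, s in student_scores.items() if c in STATE_ANCHOR_CRITERIA}

scores = {
    "Argument from evidence": 3,   # anchor (common statewide scale)
    "Disciplinary accuracy": 2,    # anchor
    "Communication": 4,            # anchor
    "Community connection": 3,     # local add-on criterion, not aggregated statewide
}
print(reportable_scores(scores))
```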
21
Reinvention of the Summative Space
Working toward a Culture of Revision and Redemption – IT IS NOT "CHEATING"
A competency-based approach to measuring, rather than a "one and done" approach
Allows for opportunities for peer review and collaboration, feedback, and "revise and resubmit"
Focus is on continuous improvement and growth (a "growth mindset")
The paradigm shift in testing – using instructionally embedded CBAs as a core component of a new system of assessment – is a move from an on-demand testing approach to a competency-based system of assessment. In traditional testing, your first attempt at answering the items (i.e., your "first draft") is your only opportunity to demonstrate understanding. There are no opportunities to revise. In a culture of revision and redemption, opportunities for peer review, peer collaboration, feedback, and revision are built into the assessment process, with a focus on student understanding and demonstrated competence. Student achievement is therefore assessed through systematic collection of evidence of learning leading to the completion of a culminating task – a "final draft." Moreover, multiple-measures, competency-based systems connected to instruction are designed to foster continuous improvement and growth. In this case, focusing on improving one's scores is not "test prep" but a worthwhile endeavor, because it leads to more learning through improvement of a work product. It's an "assessment worth teaching to."
22
Let’s Give Performance Assessment a Chance
Stuart Kahl, Measured Progress
Presentation in session "Bringing Performance-Based Assessment into the Summative Space" at NCSA, Austin, June 28, 2017
23
35 years of involvement in large-scale performance assessment
I believe “real” performance assessment can play a significant role in state accountability assessments.
24
What have we learned from past performance assessments?
Performance Assessment Studies
Connecticut Assessment of Educational Progress (CAEP) in Science
Massachusetts Educational Assessment Program (MEAP) – 1989
NAEP – A Pilot Study of the Higher-Order Thinking Skills Assessment Techniques in Science and Math – 1986
Connecticut Assessment of Educational Progress (CAEP) in Science – 1983
* Small statewide sample
* Trained external administrators
* Materials provided
* "Survival" investigation interesting finding: students are better at identifying variables to control in an actual investigation than on a multiple-choice question.
Massachusetts Educational Assessment Program (MEAP) – 1989
* "Cubes" subtask interesting finding: most students can find the volume of a rectangular solid on a multiple-choice item, given the dimensions; but few (actually none) applied that skill when asked how many unit cubes could fit in a box, given only enough cubes to fill a portion of the box.
NAEP – A Pilot Study of the Higher-Order Thinking Skills Assessment Techniques in Science and Math – 1986
* "Large" sample
* Conclusion: performance assessment of this type is not feasible for large-scale assessment.
25
What have we learned from past performance assessments?
Performance Assessments that "Count" (note: all centrally scored)
CAEP Business and Office Education (BOE) – 1985
Rhode Island Distinguished Merit Program –
California Golden State Examination in Chemistry –
Kentucky (KIRIS) Performance Events –
Colorado Alt – 2001-…
NECAP Science
CAEP Business and Office Education (BOE) – 1985
* Teacher-administered
* Secretary: shorthand/transcribing; letter composing, formatting, typing
* General Office: timed typing; completing forms, filing, data entry
* Accounting: journal entries; payroll/taxation; financial statements
Rhode Island Distinguished Merit Program –
* Optional
* "Pass" multiple-choice to qualify for performance
* "Distinguished Merit" designation on diploma
* 18 areas, incl. automotive repair, hairstyling, …, and major academic areas
* Extended constructed-response in academic areas; performance in "Voc-Tech" areas, where local professionals assisted with administration/scoring
California Golden State Examination in Chemistry –
* 3 components: MC, extended CR, performance (lab)
* All teacher-administered
* Lab materials (kits) provided
* Directions for making lab station dividers (for security of individual student lab work)
Kentucky (KIRIS) Performance Events –
* All students, 3 grades, 5 subjects
* Materials provided
* Trained external administrators
* Whole class at a time: students randomly assigned to tables (3-5 per table); different events at tables; administrators and teachers set up table stations before each class; first half period group work, second half individual work to produce scorable products
Colorado Alt – 2001-…
* Sequenced mini-tasks administered by teachers to students individually
* Interesting CSAP-Alt math story: the original intent was to use readily available classroom materials and to work tasks into daily routines. Example: in "clean-up," the student was to put marking implements (pencils, crayons, markers) away, which involved sorting, counting, comparisons of numbers, and graphing. Ultimately, all materials were custom-made and replaced every year.
NECAP Science
* Grade 4 – hands-on, materials provided
* Grade 8 – hands-on or response to data/science information
* Grade 11 – response to data/science information
26
Attributes Limiting Feasibility
Using trained external administrators
Providing materials
100% central scoring
Time and cost!
What are some attributes of the past performance assessments we've just seen that limit the feasibility of performance assessment for statewide assessment programs?
Solutions:
Don't use external administrators; use teachers.
Don't provide materials; use materials readily available in the classroom or online.
Don't do 100% central scoring; let teachers score, with training and an audit procedure.
All of the above, plus the use of instructionally embedded performance testing.
27
What have we learned from state portfolio assessments?
Vermont – early '90s
Kentucky – 1992…
Vermont – early '90s
* Writing and math
* Intent to have no numerical scores
* Issues – whose work, task quality, scoring
Kentucky – 1992…
* Same issues as above
* Scoring consistency of writing achievable over time
* Writing portfolios counted for over a decade, then became optional
* Math portfolios never counted
28
Problems with Portfolio Assessments
Task quality
Scoring consistency
Whose work is it?
(Note: With respect to the first two, writing and math are different animals.)
What are some problems in the portfolio assessments we've just seen that should be considered when designing large-scale performance assessments for state programs?
Writing skills are very generalizable, and teachers can come up with prompts that elicit writing. Local task quality and scoring consistency are bigger problems in math than in writing. (Elaborate)
Solution to the problems: I'll get to that later… an alternative to portfolio assessment.
29
Reliability of Direct Writing Assessments
What have we learned from direct writing assessments?
In the past, comparisons have been made between a comprehensive multiple-choice test and a single performance task. More items or tasks mean higher reliability.
In writing, two tasks might suffice if the results are to be combined with another component of language arts assessment. Given the emphasis in new standards on explanation and argumentation, a prompt for each mode of discourse would be good.
Oftentimes, we combine results from a direct writing task with those from multiple-choice grammar items, with the latter counting much more. That's crazy given how much more evidence is in a student's writing sample.
30
Analytic Scoring of Direct Writing – Average Inter-correlations of Analytic Traits

Prompt        Human   e-rater®   IEA
Ice Age       .947    .614       .753
Reflection    .917    .535       .648

What else have we learned from direct writing assessments? From the table, one should conclude that human analytic scoring of writing samples is useless unless specific feedback is also provided to students – and this is not likely in large-scale assessments of writing. The halo effect exhibited by human analytic scores is much too great. The large number of points from human analytic scoring are not independent and do not improve the reliability (computed correctly, not by internal consistency). You need more tasks or more independent measures to increase reliability. One solution: use two tasks, with single holistic scoring by humans combined with computer analytic scoring.
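For readers unfamiliar with the statistic in the table, the sketch below shows one way the average inter-correlation of analytic traits could be computed from a matrix of trait scores (rows are essays, columns are traits); the scores are invented, and a high average is the signature of the halo effect described above.

```python
# Minimal sketch: average inter-correlation of analytic traits, computed from a
# score matrix (rows = essays, columns = traits). Scores are invented; a high
# average, like the human values in the table, suggests a halo effect.
from itertools import combinations
from statistics import mean, pstdev

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean((x - mx) * (y - my) for x, y in zip(xs, ys)) / (pstdev(xs) * pstdev(ys))

def avg_intercorrelation(score_matrix):
    traits = list(zip(*score_matrix))                    # one score vector per trait
    return mean(pearson(traits[i], traits[j])
                for i, j in combinations(range(len(traits)), 2))

scores = [   # six essays scored on four analytic traits (1-4 scale)
    [4, 4, 3, 4], [2, 2, 2, 3], [3, 3, 3, 3],
    [1, 2, 1, 1], [4, 3, 4, 4], [2, 3, 2, 2],
]
print(f"average inter-correlation: {avg_intercorrelation(scores):.2f}")
```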
31
Covering a Subject Area Domain
The statement that multiple-choice tests are more reliable than constructed-response tests is not true – it’s a question of how many items or tasks there are. Multiple-choice items are scored more accurately, but scoring accuracy is just one of many contributors to reliability. Constructed-response questions scored by humans make up for this in several ways – ways that also enhance validity – a topic for another time. With good 4-point constructed-response questions, one would need 8 to 12 questions to yield the same reliability as a 50-item multiple-choice test – even with human scoring of the former. Far fewer performance tasks would be needed, depending on how many independent points each is worth. Domain coverage (what you want your results to generalize to) still matters…this is why multiple measures are important.
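The "8 to 12 constructed-response questions versus a 50-item multiple-choice test" claim can be illustrated with the Spearman-Brown prophecy formula; the per-item reliabilities below are assumed values chosen only to show the shape of the trade-off, not estimates from any actual test.

```python
# Illustrative arithmetic only (assumed per-item reliabilities): the Spearman-Brown
# prophecy formula shows how a handful of rich constructed-response questions can
# match the reliability of a much longer multiple-choice test.
def spearman_brown(single_item_reliability, n_items):
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)

r_mc = 0.10   # assumed reliability contribution of one multiple-choice item
r_cr = 0.35   # assumed reliability contribution of one good 4-point CR question

print(f"50 multiple-choice items: {spearman_brown(r_mc, 50):.2f}")
print(f"10 constructed-response questions: {spearman_brown(r_cr, 10):.2f}")
```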
32
Key Principles for Implementation of Innovative Accountability Assessments
Teachers should be able to see relatively immediate instructional benefit from at least one component of a program.
Assessment innovation should not place an additional burden on teachers.
Transition to an innovative approach should be phased in over time.
For more on these principles, see:
33
An Optimal Program Design
End-of-Year Summative Test (60 minutes)
* Common: 25 multiple-choice items, 2 four-point constructed-response items
* Matrix-sampled field test: 5 multiple-choice items, 1 four-point constructed-response item
Performance Component
* 3 Curriculum-Embedded Performance Assessments (multi-day instructional units)
All of this makes the design of the optimal program pretty obvious to me! (Elaborate)
34
Curriculum-Embedded Performance Assessment (CEPA) -- Definitions
A CEPA is a multi-day instructional unit or module that consists of a sequence of instructional activities, some of which lead to student work that is used for formative purposes and some of which lead to student work that can be used for summative assessment.
CEPA testing (Curriculum-Embedded Performance Assessment) is the general approach of using CEPAs as a component of an assessment system.
For more on CEPA, go to:
35
Selected Specs for CEPAs (5 of 14)
A CEPA should:
introduce or refresh foundational knowledge and skills,
require students to demonstrate deeper learning by applying the foundational knowledge and skills,
yield multiple, individually produced, scorable student work products for summative use (independent measures worth a total of at least 20 points),
involve the use of technology somewhere in the unit,
involve small-group work somewhere in the unit.
36
A CEPA Packet Includes:
Listing of relevant content standards and learning targets
Descriptions of instructional activities and embedded performance tasks
Scoring rubrics
Sample student work
37
Teacher Input and “Ownership”
Flexibility regarding some CEPA activities
Development of model CEPAs
Submission of teacher-developed CEPAs
Teacher choice of CEPAs for accountability
Teacher scoring
For more on teacher ownership and phasing in the innovative assessment model, see:
* Teachers can have flexibility in implementing instructional activities, except those yielding student work for summative use.
* High-quality CEPAs developed by state teacher committees, vendors, and other specialists can serve as models for teachers developing their own CEPAs.
* For state programs, teacher-developed CEPAs can be submitted to the state for review, revision, and piloting.
* For the CEPA accountability component, teachers can choose the state-approved CEPAs they want to use in place of units they've used in the past, and they can use them when they fit in their scopes and sequences.
* Teachers score their own students' work on CEPAs, making results immediately available. (For state assessment, a score audit and adjustment process would be implemented on a sampling basis.)
38
The Two-Component Accountability Assessment Model
addresses all the major concerns about the use of performance assessment for statewide accountability and about state accountability assessments generally,
is consistent with the principles listed earlier, and
contributes to a balance between:
formative and summative assessment
local and state assessment
attention to foundational knowledge and deeper learning
39
and is FASTER, BETTER, CHEAPER!
For a discussion of shortcomings of other interim assessment approaches for ESSA, see:
40
It’s all about student learning.
Period. Thank you.
41
Discussion – Q&A
42
Examples of Emerging Approaches for Performance Assessment Use in the Summative Space
Ruth Chung Wei, Stanford University
43
Key Criteria for Innovative Assessment System (ESSA)
Same assessments must be used to measure the achievement of all participating students in the innovative assessment system
Must be comparable to the state academic assessment and provide for an equally rigorous and statistically valid comparison between student performance on the innovative assessment and the existing statewide assessment
44
Key Criteria for Innovative Assessment System (ESSA)
Be aligned to the challenging state academic standards and address the depth and breadth of such standards
Express student results or student competencies in terms consistent with the State's aligned academic achievement standards
Generate results that are valid and reliable, and comparable, for all students and for each subgroup of students
45
Working toward comparability
Task Comparability
* Common rubrics and scoring tools
* Clearly defined content and design specifications
* Common templates or task shells
Comparability with State Tests – the challenge
* Address the same measurement targets as the standardized tests
* Address the depth and breadth of the standards
46
Ohio Competency Based Education (CBE) Performance Assessment Pilot
Approach 1 – Mapping a Portfolio of Assessments to a State Test Blueprint
Ohio Competency Based Education (CBE) Performance Assessment Pilot:
6th grade mathematics
Algebra
8th grade science
U.S. History
A common task approach
47
Framing a Portfolio of Assessments to Valued Outcomes
50
Approach 1 - Important Technical Considerations
"Common Task" Approach
Common scoring guides – scoring criteria aligned to specific standards
Teacher-designed with expert review and revision, supplemented with expert-designed tasks
Inclusion of varied item types
Teachers calibrate and score, with external audit
Task security
51
Approach 2 – Mapping a Collection of Performance Tasks to Assessment Claims
Evidence-Centered Design (Mislevy, Almond, & Lukas, 2003)
Standards clustered into Claims, Targets, and Evidence Statements
SMARTER Balanced Claims:
ELA Claims: Reading; Writing; Research; Listening
Mathematics Claims: Concepts & Procedures; Problem Solving; Communicating Reasoning; Modeling & Data Analysis
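A minimal sketch of the evidence-centered-design bookkeeping implied here: claims are broken into targets, each task records which targets it provides evidence for, and a collection of tasks can then be checked for claim coverage. The claim, target, and task labels are hypothetical.

```python
# Minimal sketch (hypothetical labels) of evidence-centered-design bookkeeping:
# each claim is broken into targets, each task records the targets it evidences,
# and a collection of performance tasks can be checked for claim coverage.
ela_claims = {
    "Writing":  ["W.5.2 explanatory texts", "W.5.9 evidence from texts"],
    "Research": ["gather information", "evaluate sources"],
}

tasks = [
    {"name": "Mother to Son essay", "targets": ["W.5.2 explanatory texts",
                                                "W.5.9 evidence from texts"]},
    {"name": "Source-evaluation mini-task", "targets": ["evaluate sources"]},
]

covered = {t for task in tasks for t in task["targets"]}
for claim, targets in ela_claims.items():
    missing = [t for t in targets if t not in covered]
    print(claim, "- fully covered" if not missing else f"- missing evidence for: {missing}")
```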
52
Literacy Design Collaborative Modules http://coretools.ldc.org
Approach 2 – Mapping a Collection of Performance Tasks to Assessment Claims
Literacy Design Collaborative Modules:
Writing/Research tasks – Explanatory and Argumentative Writing or Research
Embedded mini-tasks within each module address a range of CCLS standards
Independent mini-tasks
1,600 writing modules; 2,175 mini-tasks
53
Example – LDC Writing Module
CCLS Writing 5.2: Write informative/explanatory texts to examine a topic and convey ideas and information clearly
Explanatory Writing Prompt (Grade 5): What is the theme of the poem "Mother to Son"? After reading "Mother to Son" (and an informational text on metaphors), write an essay for our class literary magazine in which you analyze how Langston Hughes' use of metaphors contributes to an understanding of the theme of this poem. Support your response with evidence from the text/s. Include several examples from the texts in your response.
Source:
54
Example – LDC Writing Module – Embedded mini-task
CCLS Reading Literature 5.4: Determine the meaning of words and phrases as they are used in a text, including figurative language such as metaphors and similes.
Prompt: Paraphrase each stanza in the poem, demonstrating a clear understanding of the meaning of the metaphors the poet used.
Source:
55
Example – Mapping SMARTER Balanced ELA Assessment Claims to LDC Writing/Research Tasks, Mini-tasks, and other item types
56
Approach 2 - Important Technical Considerations
"Common Task Shell/Template" Approach
Common scoring guides for each Assessment Claim – scoring criteria aligned to specific Targets
Teacher-designed with peer review and feedback; national reviewers certify quality
Inclusion of varied item types (essay, constructed response, on-demand performance, selected response)
Teachers calibrate and score, with external audit
57
Approach 3 – Common Interim Performance Tasks (Math, Science)
Antelope Valley, California / L.A. County Office of Education
Teacher-developed performance tasks with expert review and feedback
Administered as interim assessments, completed in an on-demand setting
Similar to earlier prototypes of SMARTER Balanced mathematics performance tasks
58
Approach 3 - Important Technical Considerations
"Common Task" Approach (sampling topics and content domains from across a grade-level course)
Teacher-designed with expert review and revision to ensure quality and comparability
Limitations of using only performance tasks
Task-specific scoring guides
Teachers calibrate and score after local training, with district-level audit
Task security
59
Performance Assessment – A State’s Perspective
Dr. Susan Tave Zelman – June 28, 2017
60
State Responsibility
If we want our education system to promote authentic problem-based learning, the state has a responsibility to provide authentic performance-based assessments in a summative system. This is what business, higher education, and the Ohio workforce development board want educators to do.
The market tells us there is currently growth in formative assessments. Many new digital tools, however, are not performance based and mirror traditional summative assessments. Our current assessments can measure what students are able to do, at least to a degree, just maybe not to the depth that we might like. Performance assessments will drive change in classroom practice, moving to assessing not only what a student knows, but what they are able to do and how they think. This measure will help us assess a deeper understanding and appreciation of content knowledge, critical thinking, and creative problem solving.
61
Moving Toward A Balanced Summative Assessment System
I personally like the way Paul Leather, Deputy Commissioner of Education in NH, represents Linda Darling-Hammond's continuum on assessment of deeper learning. I will argue that 10 years from now, given new emerging technology and an outcry from the business and higher education communities, state departments of education will move toward performance tasks that require students to formulate and carry out their own inquiries, analyze and present findings, and (sometimes) revise in response to feedback. However, Ohio, as well as other states, is still struggling with how to integrate performance tasks into a system of standardized tests that includes multiple-choice and open-ended items. As you heard from Stuart, there have been many attempts at this since the 1970s, but no state has yet been successful in creating a statewide system integrated into state accountability.
Paul Leather, NH
Linda Darling-Hammond
62
Issues
Design: State's Role; Alignment; Integration into State Accountability; Opportunity to Learn
Resources: Money; Time; Capacity (State, Vendors, Field); Technology Platform; Political Will
Measurement: Validity; Reliability; Timeliness of Results; Comparability; Scalability
There are internal and external pressures that prevent the State from moving forward. First, there are design issues. States must be sure that performance assessments are aligned to the State's standards and to assessments of multiple-choice and constructed-response items. I would suggest we try to perfect stage II of Linda Darling-Hammond's continuum; however, there are stakeholders who would like the state to move more quickly. Some educators would like us to drop assessments completely; others would like us to move to performance assessments exclusively, in place of traditional summative tests, for accountability purposes. I believe it would be cost effective for the state to purchase a performance-based technology platform system and give educators training in creating performance assessments and providing immediate feedback, testing 21st century skills. We would build the capacity of educators, expand resources and skills, and create opportunity to learn.
Next we look at resources as an issue. Developing a performance-based system that integrates into a state summative system requires dollars, time, capacity at the state level, and political will. Educators, legislators, higher education, the Governor's office, and lobbyists need to want to integrate performance assessments into a state accountability system.
Third, we should look at measurement issues if performance assessment is used for summative accountability. First, we must overcome the challenges of combining performance assessment scores with the scores from multiple-choice and open-ended responses. Depending on where we are on Paul Leather's continuum, we need to decide how to get valid and reliable measures of the performance tasks. We need to get clarity on the role of the teacher in the performance assessment, as well as what role peer collaboration plays in the assessment of an individual student's outcome.
63
Ohio Pilot Programs
Ohio Performance Assessment Pilot Project (Phase 1)
Ohio Performance Assessment Pilot Project (Phase 2)
See: The Task Dyad Learning System: Concept and Experiments in Technology Assisted Performance Assessment, by Moore, Monowar-Jones, Xu
New SCALE Project (Competency Based Education Pilot)
STEM School Performance Assessment
In 2008, the Ohio Department of Education, through grants received from the Gates Foundation and the Hewlett Foundation, began a collaboration with SCALE (Stanford Center for Assessment, Learning and Equity) to design, develop, and pilot a performance-based assessment system that could serve as one indicator (among multiple measures) of students' readiness to graduate from high school, as well as their readiness for college and careers. This system includes performance outcomes, scoring rubrics, and performance tasks in three secondary content areas: ELA, mathematics, and science. The tasks were designed to be "curriculum-embedded" and to be completed by 11th-12th grade students over a 1-4 week period in their classes. With Race to the Top dollars, OPAPP conducted its first pilot of these performance assessments across 15 pilot sites in the state, including approximately 150 teachers. The project completed a second pilot year with the same group of pilot sites, using streamlined versions of the performance tasks. As articulated in a paper presented at the CCSSO Conference in June 2012 by Moore, Monowar-Jones, and Xu, ODE staff, the implementation phase of the project had several problems:
1. Tenuous alignment.
2. Scale insensitivity.
3. Excessive time to score.
4. The problematic distilling of individual inferences from group work.
5. Rater (in)consistency.
6. Teacher role ambiguity.
You heard about the new SCALE project from my colleague Ray Pecheone. Since this time, the Federal Government, through ESSA, has created conditions for the inclusion of performance assessment in state assessment and accountability systems. New work has begun through a new Office of Innovation, defining high-level competencies aligned to state standards and supporting districts engaged in developing local performance assessments that may support students in demonstrating these competencies. SCALE is working with 6 districts to design performance assessments and scoring tools, implement their performance assessments in participating teachers' classrooms, and score student work. SCALE works with these educators to improve their designs and reflect on their instructional practices as evidence of student competencies. This project includes over 50 classroom teachers and district leaders and has produced over 23 unique teacher-developed performance assessments. The competency-based pilot would like to work with the Innovation Lab Network, supported by CCSSO, and believes that this assessment project would be a major step forward in creating a universal assessment, going beyond the traditional. It covers 4 content areas: secondary algebra, secondary U.S. history, 8th grade science, and 6th grade mathematics.
In addition, we are piloting the STEM School Performance Assessment. This project attempts to turn the "to know" into the "to know how" through authentic project-based learning, assessed through performance-based measurement. The project uses an online tool which captures evidence of the learner's knowledge, processes, and progress. It produces a rank order of student work, with high reliability.
The method has been used more extensively in England and Australia, and we will test it this summer and early fall in an “Ohio” setting in the context of STEM schools.
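The rank-ordering method described here, used more extensively in England and Australia, reads like a comparative-judgment approach; below is a minimal sketch in which judges' pairwise preferences are rolled up into a rank order by win rate. Real systems typically fit a Bradley-Terry-style model rather than simply counting wins, and the judgments shown are invented.

```python
# Minimal sketch: comparative judgment. Judges repeatedly pick the better of two
# pieces of student work; the pairwise "wins" are rolled up into a rank order.
from collections import Counter

judgments = [            # (winner, loser) pairs from many judges, invented data
    ("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"),
    ("C", "D"), ("B", "D"), ("A", "B"), ("C", "B"),
]

wins = Counter(winner for winner, _ in judgments)
appearances = Counter(p for pair in judgments for p in pair)
rank = sorted(appearances, key=lambda p: wins[p] / appearances[p], reverse=True)
print("rank order (best first):", rank)
```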
64
education.ohio.gov
65
Join the Conversation OHEducation OHEducation @OHEducation
@OHEducationSupt Did you hear the news? We’re now on Instagram! Check out our official account and follow us at instagram.com/OHEducation. There, you can keep up with top news and updates and, this fall, be on the lookout for fun connections with Ohio teachers in their classrooms. You also can stay up to date by following our other social media channels. In addition to Instagram, connect with us on Facebook, Twitter, and YouTube. Our handles are on the slide. Superintendent of Public Instruction Paolo DeMaria even has his own Twitter account. Follow OhioEdDept
66
Discussion – Q&A
67
Thank you!