
1 Ensuring Meaningful Performance Assessment Results: A Reflective Practice Model for Examining Validity and Reliability Cynthia Conn, PhD Assistant Vice Provost, Professional Education Programs Kathy Bohan, EdD Associate Dean, College of Education Sue Pieper, PhD Assessment Coordinator, Office of Curriculum, Learning Design, & Academic Assessment

2 Introduction
Context: NAU Professional Education Programs (PEP) statistics
23 Initial Teacher Preparation Programs: Enrollment: 2,385 (average of last 3 years); Graduates: 607 (average of last 3 years)
18 Advanced Programs: Enrollment: 1,059 (average of last 3 years); Graduates: 456 (average of last 3 years)
NCATE Accredited in 2011
CAEP Self-Study submitted July 2016
CAEP Site Visit occurred in March 2017; Accredited in November 2017
Main Campus in Northern Arizona (Flagstaff)
17 community campus sites, on average 2 to 5 hours in multiple directions from the main campus

3 What is Your Context?
Size of EPP/program/department, including enrollment and number of programs
Timeline for accreditation, if being pursued
We would like a sense of the size of your programs and to allow you to meet others with a potentially similar context. Please form a line, starting on the left, if you consider your institution to be small in terms of number of programs (<10) and program enrollment (<1,000); middle if you consider your institution to be medium sized in terms of programs (10-25) and program enrollment (1,000 to 2,000); large if you consider your institution to be large in terms of programs (25+) and program enrollment (2,000+).
We would also like a sense of the timeline for accreditation or re-accreditation. Please find a place in line, starting on the left, if:
your institution completes program review through a state process
your institution is exploring CAEP accreditation or is in the candidacy process
your institution has recently been re-accredited through NCATE or TEAC and is now transitioning to CAEP
your institution is in the process of writing your CAEP Self-Study in the next 1 to 3 years
your institution has submitted the CAEP Self-Study report and will be having this year or has recently had a Site Visit
your institution has been accredited by CAEP

4 Session Outline Introductions
Overview of Validity Inquiry Process Model
Discuss the implementation of the Validity Inquiry Process Model at NAU
Interactive practice using two Validity Inquiry Model instruments
Discuss lessons learned from the implementation of the Validity Inquiry Process Model at NAU
Group discussion of the process and how the model might apply in participants' settings
Introductions: Why did you choose this session? What do you hope to learn?

5 Purpose of the Validity Inquiry Process (VIP) Model
The purpose of the Validity Inquiry Process (VIP) Model is to assist in examining and gathering evidence to build a validity argument for the interpretation and use of data from locally developed performance assessment instruments.

6 Validity Inquiry Process (VIP) Model

7 Theoretical Foundation & Approach
Theory to practice: Utilized the existing validity and performance assessment literature to develop practical guidelines and instruments for examining performance assessment in relation to validity criteria (Kane, 2013; Linn, Baker, & Dunbar, 1991; Messick, 1994)
Qualitative and reflective: The process and instruments guide the review and facilitate discussion regarding performance assessments and rubrics; evidence gathered is used to develop a validity argument (Kane, 2013)
Efficient: Some steps require documenting foundational information, one involves a survey completed by students, and the other steps involve faculty discussion and review

8 Performance Assessment Validity Criteria
Domain coverage
Content quality
Cognitive complexity
Meaningfulness
Generalizability
Consequences
Fairness
Cost and efficiency
(Linn, Baker, & Dunbar, 1991; Messick, 1994)

Criterion 1: Domain Coverage
Domain coverage is defined as the breadth and depth of content addressed (Messick, 1994). Similarly, Linn et al. (1991) refer to domain coverage in terms of how comprehensive the performance assessments are that make up a program evaluation plan. Involving or showing evidence that a "broad representation of content specialists" (Linn et al., 1991, p. 20) are or were involved in developing learning outcomes is also key to demonstrating domain coverage.

Criterion 2: Content Quality
Content quality is defined as being aligned to the "current understanding of the field" while also reflecting aspects of the discipline that are intended to "stand the test of time" (Linn et al., 1991, p. 19). The authors further define content quality as measuring tasks that "are worthy of the time and efforts of students and raters" (p. 19). To ensure content quality, Messick (1994) recommends a mix of measures including both "extended performance tasks and briefer structured exercises" (p. 15) and notes the importance of assessing both content knowledge and the application of knowledge and skills. Additionally, the performance assessment should not contain "anything irrelevant…leading to minimal construct-irrelevant variance" (Messick, 1994, p. 21).

Criterion 3: Cognitive Complexity
The next criterion, cognitive complexity, is defined as emphasizing "problem solving, comprehension, critical thinking, reasoning, and metacognitive processes" (Linn et al., 1991, p. 19). The authors also state the importance of measuring the "processes students are required to exercise" (p. 19) in order to complete the performance assessment. Messick (1994) adds that the complexity of the performance task should match the construct (i.e., student learning outcome/standard) being measured as well as the competency or "the level of developing expertise of the students" (p. 21).

Criterion 4: Meaningfulness
The value of performance assessments resides with the idea that these types of assessments "get students to deal with meaningful problems that provide worthwhile educational experiences" (Linn et al., 1991, p. 20). Messick (1994) recommends "favoring rich contextualization of problems or tasks … to engage student interest and thereby improve motivation and interest" (p. 19). Another way of defining meaningfulness is considering the authenticity of the assessment and how it contributes to the future success of the student. Documenting that the performance assessment is meaningful or authentic also addresses one of the major threats to validity, construct underrepresentation.

Criterion 5: Generalizability
Generalizability is concisely defined as how the response to the content and context of the performance assessment transfers to other related discipline situations or issues (Linn et al., 1991; Messick, 1994). In order to provide evidence of generalizability, it is important to evaluate the problems, projects, or scenarios presented in terms of whether they address multiple and varied topics or problems. The context or problem situation should be evaluated in terms of its richness and level of detail. Finally, the performance assessment should be reviewed to identify if an exemplar(s) is included that models a potential solution.

Criteria 6 & 7: Consequences and Fairness
Consequences and fairness are two additional criteria to consider when establishing validity. Linn et al. (1991) suggest evaluating consequences in terms of determining if the performance assessment takes a reasonable amount of time to implement, as well as a similar amount of time if it is implemented among multiple sections of the same class. Another perspective to consider with respect to consequences is to determine if the performance assessment takes an excessive amount of time away from other course topics. Identification of the method for establishing the pass-fail or cut score, and the extent to which the benefits of implementing the assessment outweigh any unintended adverse ramifications, should also be considered. Fairness is defined by the authors as ensuring that all students have the same opportunity to gain the knowledge and skills necessary to complete the assessment. Additionally, they define fairness in relation to how student work is evaluated by confirming that the same criteria are used for all students. Messick (1994) ties these ideas of consequences and fairness together under the overarching topic of consequential aspects of construct validity.

Criterion 8: Cost and Efficiency
The final criterion is cost and efficiency. Cost and efficiency is described by Messick (1994) as the idea of "utility" or the "costs and efficiency relative to the benefits" (p. 21). Performance assessments can provide valuable insight related to learning. Thus, these benefits need to be considered in relation to the practicality of implementing the assessment. Costs need to be acceptable and sustainable by the unit responsible, and significant attention needs to be "given to the development of efficient data collection designs and scoring procedures" (Linn et al., 1991, p. 20).

9 Content Analysis Strategies

10 Validity Inquiry Form

11 Metarubric for Examining Performance Assessment Rubrics

12 Student Survey

13 Conducting Review of Reliability
The key aspects of scores being "accurate, reproducible, and consistent" (i.e., reliable) can be supported and investigated through several methods, such as calibration training, inter-rater agreement, and analysis of performance assessment data. Items on the following instruments contribute to the review of reliability:
Content Analysis
Validity Inquiry Form
Metarubric
Resources:
Graham, Milanowski, and Miller (2012) provide guidance on calculating inter-rater agreement, including procedures for computing the percentage of absolute agreement and adjacent agreement.
The Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014) provide further guidance related to investigating reliability.
Guidelines for Constructed-Response and Other Performance Assessments (Baldwin, Fowles, & Livingston, 2005), published by the Educational Testing Service, Office of Professional Standards Compliance, provides recommendations for reviewing reliability.

14 Development of Validity Argument

15 Introduction to VIP Model and Use of Evidence for CAEP Self-Study Report
Standard 5: Provider Quality Assurance and Continuous Improvement Quality and Strategic Evaluation 5.2 The provider’s quality assurance system relies on relevant, verifiable, representative, cumulative and actionable measures, and produces empirical evidence that interpretations of data are valid and consistent.

16 Implementing Validity Inquiry Process Model
Timeline for Implementation:
Identified target programs and faculty-developed performance assessments (1 semester in advance)
Identified lead faculty member(s) for each performance assessment (1 month in advance)
Provided brief announcement and description at department faculty meeting (1 month in advance)
Associate Dean scheduled meeting with lead faculty (at least 2 to 3 weeks in advance), including:
- Sent introduction describing purpose (CAEP Standard 5.2) and what to expect (sample available through website)
- Attached copy of Validity Inquiry Form, Metarubric, and Rigor/Relevance Framework
- Verified most recent copy of performance assessment to be reviewed
- Requested individual review of performance assessment using the Validity Inquiry Form and Metarubric prior to the meeting

17 Performance Assessment Review Meeting
Logistics for Meeting:
Individual review meetings were scheduled for 2 hours
Skype was utilized for connecting with faculty at statewide campuses
Participants included 2 to 3 lead faculty members, facilitators (Associate Dean & Assessment Coordinator), and a part-time employee or Graduate Assistant to take notes

18 Interactive Practice: Performance Assessment Review Model Meeting Agenda
Purpose of performance assessment (Activity #1)
Validity Inquiry Form (Activity #2)
Metarubric (Activity #3)
Feedback on the meeting and overview of next steps
Next Steps (meeting notes) sent to faculty within 1 week (included timelines, responsibilities, plan for writing Validity Argument)

19 Validity Inquiry Form

20 Activity #1: Using the Validity Inquiry Form
Discuss in small groups the stated purpose of this performance assessment and whether it is an effective purpose statement.
Course & Name of Performance Assessment: Student Teaching: Candidate Work Sample
Purpose of Performance Assessment: The purpose of the Candidate Work Sample is to provide evidence of how your teaching impacts student learning.

21 Activity #1: Small Group Discussion
What are the results of your small group discussion?

22 Activity #1: Question Prompts to Promote Deep Discussion
Why are you asking candidates to prepare and deliver a candidate work sample? Why is it important? How does this assignment apply to candidates’ future professional practice? How does this assignment fit with the rest of your course? How does this assignment fit with the rest of your program curriculum?

23 Activity #2: Using the Validity Inquiry Form
Discuss in small groups questions 2 and 3 on the Validity Inquiry Form and how you would rate the assignment.
Questions 2 & 3:
Content Quality: Q2: Does the performance assessment evaluate process or application skills as well as content knowledge?
Cognitive Complexity: Q3: Analyze the performance assessment using the Rigor/Relevance Framework (see framework.php) to provide evidence of cognitive complexity: identify the quadrant that the assessment falls into and provide a justification for this determination.

24 Activity #2: Using the Validity Inquiry Form
Cognitive Complexity: Rigor/Relevance Framework®
Quadrant A, Acquisition: "Students gather and store bits of knowledge and information. Students are primarily expected to remember or understand this knowledge."
Quadrant B, Application: "Students use acquired knowledge to solve problems, design solutions, and complete work. The highest level of application is to apply knowledge to new and unpredictable situations."
Quadrant C, Assimilation: "Students extend and refine their acquired knowledge to be able to use that knowledge automatically and routinely to analyze and solve problems and create solutions."
Quadrant D, Adaptation: "Students have the competence to think in complex ways."
Daggett, W. R. (2018). Rigor/relevance framework®: A guide to focusing resources to increase student performance. International Center for Leadership in Education. Retrieved from

25 Activity #2: Discussion
What are the results of your small group discussion?

26 Activity #2: Question Prompts to Promote Deep Discussion
How well does the assignment or performance assessment evaluate content knowledge?
How well does the assignment or performance assessment evaluate process or application skills?
Using the Rigor/Relevance Framework®: What quadrant did the assessment fall into and why? How did your group establish consensus on the quadrant determination?

27 Metarubric for Examining Performance Assessment Rubrics

28 Activity #3: Using the Metarubric
As a large group, we will go through the following process for Question 2 (Q2) on the Metarubric: Read the example assignment rubric provided as well as the Metarubric questions. Criteria: Q2: Does each rubric criterion align directly with the assignment instructions? (Pieper, 2012)

29 Activity #3: Using the Metarubric
With the person(s) sitting next to you, complete the process again for the following questions: Descriptions: Q8: “Are the descriptions clear and different from each other?” (Stevens & Levi, 2005, p. 94) Overall Qualities: Q11: Do the assignment instructions “encourage students to use the rubric for self- and peer assessment?” (Pieper, 2012) Are there any other questions you wish to discuss?

30 Activity #3: Discussion
What are the results of your discussion?

31 Faculty Feedback Regarding Process
“I wanted to thank you all for providing a really productive venue to discuss the progress and continuing issues with our assessment work. I left the meeting feeling very optimistic about where we have come and where we are going. Thank you.” –Associate Professor, Elementary Education
“Thanks for your facilitation and leadership in this process. It is so valuable from many different perspectives, especially related to continuous improvement! Thanks for giving us permission to use the validity tools as we continue to discuss our courses with our peers. I continue to learn and grow...” –Assistant Clinical Professor, Special Education

32 Implementing Calibration Trainings & Determining Inter-Rater Agreement
Strategies for Implementing Calibration Trainings & Determining Inter-Rater Agreement

33 Definitions
Calibration training is intended to educate raters on how to interpret the criteria and descriptions of the evaluation instrument, as well as potential sources of error, to support consistent, fair, and objective scoring.
Inter-rater agreement is the degree to which two or more evaluators using the same rating scale give the same rating to an identical observable situation (e.g., a lesson, a video, or a set of documents). (Graham, Milanowski, & Miller, 2012)
Calibration training and inter-rater agreement address issues of reliability and enhance confidence in the data collected.
Purpose: Improve and provide evidence of consistency in data collection to enhance confidence in the use of results.

34 Calibration Strategies
Select performance assessment artifacts (with identifying information removed) that can serve as models and whose evaluation scores an expert panel agrees on
Request raters to review and score the example artifacts
Calculate percentages of agreement and utilize the results to focus discussion (a computational sketch follows below)
Discuss the criteria with the lowest agreement among raters to improve consistency of interpretation, including:
- Requesting evaluators to cite evidence from the artifact that supports their rating
- Resolving differences
- Potential sources of rater error
Request evaluators to score another artifact and calculate agreement
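As an illustration of the "calculate percentages of agreement" step above, the following minimal sketch computes a simple pairwise percent agreement per rubric criterion so the calibration discussion can start with the criteria raters interpret least consistently. The criterion names and ratings are hypothetical, and this pairwise calculation is only one way to operationalize the step; the presenters point to Graham, Milanowski, and Miller (2012) for guidance on the specific procedures.

```python
def percent_agreement(ratings):
    """Percentage of rater pairs giving an identical score on one criterion."""
    pairs = [(a, b) for i, a in enumerate(ratings) for b in ratings[i + 1:]]
    if not pairs:
        return 100.0
    return 100.0 * sum(1 for a, b in pairs if a == b) / len(pairs)

# Hypothetical calibration data: each rater's score on one de-identified
# example artifact, keyed by rubric criterion.
scores = {
    "Domain coverage": [3, 3, 3, 2],
    "Content quality": [2, 3, 1, 3],
    "Cognitive complexity": [3, 3, 3, 3],
}

# List criteria from lowest to highest agreement so the discussion can
# focus first on the least consistent interpretations.
for criterion in sorted(scores, key=lambda c: percent_agreement(scores[c])):
    print(f"{criterion}: {percent_agreement(scores[criterion]):.1f}% pairwise agreement")
```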

35 Factors that affect Inter-Rater Agreement
Discussion regarding common types of rater errors (Suskie, 2009):
Leniency errors
Generosity errors
Severity errors
Central tendency errors
Halo effect bias
Contamination effect bias
Similar-to-me bias
First-impression bias
Contrast effect bias
Rater drift
If faculty are scoring many samples of student work, rescore the first few samples after they finish to guard against rater drift. If faculty are scoring large numbers of papers, periodically schedule a refresher scoring practice session in which they all compare their scores and discuss and resolve their differences. (Suskie, 2009)

36 Calculating Inter-rater Agreement
Percentage of Absolute Agreement Calculate number of times raters agree on a rating. Divide by total number of ratings. This measure can vary between 0 and 100%. Values between 75% and 90% demonstrate an acceptable level of agreement. Example: Raters scoring 200 student assignments agreed on the ratings of 160 of the assignments. 160 divided by 200 equals 80%. This is an acceptable level of agreement. (Graham, Milanowski, & Miller, 2012)
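The slide's arithmetic (160 agreements out of 200 ratings, or 80%) reduces to a single division; the short sketch below simply restates it, with the function name and the acceptable-range check added here for illustration only.

```python
def absolute_agreement(n_agreements, n_ratings):
    """Percentage of absolute agreement: agreements divided by total ratings."""
    return 100.0 * n_agreements / n_ratings

pct = absolute_agreement(160, 200)   # slide example: raters agreed on 160 of 200 ratings
acceptable = 75.0 <= pct <= 90.0     # the slide's stated acceptable range (75% to 90%)
print(f"{pct:.1f}% absolute agreement; acceptable: {acceptable}")  # 80.0% ...; acceptable: True
```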

37 Inter-Rater Agreement
Summary of Inter-Rater Agreement Data: Summary of University Supervisor Agreements
Number of score pair agreements: 36
Number of raters: 47
% Score pair agreement: 76.60%
Average % Perfect Agreement: 38.52%
Average % Adjacent Agreement: 46.47%
Overall Average Agreement (Adjacent + Perfect): 84.99%
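For a summary like the one above, perfect and adjacent agreement can be computed from pairs of scores assigned to the same artifact by two raters. The sketch below is illustrative only: the score pairs are hypothetical (not NAU data), and it assumes "adjacent" means the two scores differ by exactly one point on the rubric scale, which is a common convention.

```python
def agreement_summary(score_pairs):
    """Percent perfect, adjacent (scores differ by one point), and overall agreement."""
    total = len(score_pairs)
    perfect = sum(1 for a, b in score_pairs if a == b)
    adjacent = sum(1 for a, b in score_pairs if abs(a - b) == 1)
    return {
        "% Perfect Agreement": 100.0 * perfect / total,
        "% Adjacent Agreement": 100.0 * adjacent / total,
        "Overall Agreement (Adjacent + Perfect)": 100.0 * (perfect + adjacent) / total,
    }

# Hypothetical score pairs: (university supervisor, second rater) on a 1-4 rubric.
pairs = [(3, 3), (2, 3), (4, 4), (1, 3), (3, 2), (4, 3), (2, 2), (3, 4)]
for label, value in agreement_summary(pairs).items():
    print(f"{label}: {value:.2f}%")
```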

38 Activity #4: Implementing Calibration Strategies
Discuss with a partner a particular inter-rater agreement strategy that you might be able to implement at your own institution. What would be the benefits of implementing such a strategy? What challenges do you anticipate? What do you need to do to get started?

39 Development of Validity Argument

40 Validity Argument
Drawing upon guidelines in the literature, the development of a validity argument should address:
the instrument's purpose and the intended interpretation and use of the data collected from the performance assessment,
the quality of the instrument and scoring guide, and
evidence of the reliability of the data collected.
(AERA et al., 2014; Cook et al., 2015; Council for the Accreditation of Educator Preparation, 2015; Downing, 2003; Kane, 2013)
The argument-based approach to validation (Kane, 2013) requires the development of a comprehensive validity argument that provides a clear expression of the purpose of the assessment instrument and how the data collected from the instrument will be used. Confidence in the data collected is connected to the deep review of the instrument in relation to the performance assessment validity criteria outlined in the literature (Linn, Baker, & Dunbar, 1991; Messick, 1994).
The validity argument provides an evaluation of the proposed interpretation and use [(IUA)] of the test scores [i.e., results]. The proposed interpretation and use can be considered valid if the IUA is clear, coherent, and complete, its inferences are reasonable, and its assumptions are plausible. (Kane, 2014, p. 451)
Cook, Brydges, Ginsburg, and Hatala (2015) state that "validation begins with a clear statement of the proposed use of the assessment scores (i.e., interpretations and decisions)" (p. 564). Thus, the validation process must begin with identifying or defining the purpose of the performance assessment, including the interpretation and use of the results (Baldwin, Fowles, & Livingston, 2008; Cook et al., 2015; Mislevy & Haertel, 2006). These elements form the foundation for examining the performance assessment and ultimately building the validity argument.

41 Feedback on Meeting & Overview of Next Steps
Notes from meetings are consolidated
The Assessment Coordinator develops a one-page document outlining:
Who participated
Strengths
Areas for improvement
Next steps
Initial follow-up documentation is utilized to develop the Validity Argument (CAEP Evidence Guide, 2015):
"To what extent does the evaluation measure what it claims to measure? (construct validity)"
"Are the right attributes being measured in the right balance? (content validity)"
"Is a measure subjectively viewed as being important and relevant? (face validity)"

42 Validity Argument Documentation for CAEP Standard 5
Creation of a bundled PDF file with the Validity Inquiry and Metarubric forms
The cover sheet of the PDF should contain the validity argument and information regarding the experts involved in the review process
Store files in a web-based, collaborative program for easy access by leadership, faculty, and site visit team members

43 Improving the Process
Timing the process and meetings so work concludes by Spring Break
Encouraging chairs to be involved in the process to understand and allocate appropriate department faculty meeting time (videotape the meeting for review, or include the chair with instructions to be an observer rather than a participant)
Building capacity and sustaining the process
Value of small group meetings (faculty felt listened to, and the process appeared to improve faculty morale; faculty felt safe to discuss ideas)
As a university with a large number of programs and over 200 faculty-developed instruments, is there any way to retain the value of small meetings through a more efficient process?

44 Resources & Contact Information Website: Contact Information: Cynthia Conn, PhD Assistant Vice Provost, Professional Education Programs Kathy Bohan, EdD Associate Dean, College of Education Sue Pieper, PhD Assessment Coordinator, Office of Curriculum, Learning Design, & Academic Assessment

45 Definitions
Performance Assessment: An assessment tool that requires test takers to perform—develop a product or demonstrate a process—so that the observer can assign a score or value to that performance. A science project, an essay, a persuasive speech, a mathematics problem solution, and a woodworking project are examples. (See also authentic assessment.)
Validity: The degree to which the evidence obtained through validation supports the score interpretations and uses to be made of the scores from a certain test administered to a certain person or group on a specific occasion. Sometimes the evidence shows why competing interpretations or uses are inappropriate, or less appropriate, than the proposed ones.
Reliability: Scores that are highly reliable are accurate, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained.
(National Council on Measurement in Education. (2014). Glossary of important assessment and measurement terms. Retrieved from:

46 References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999; 2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Conn, C., & Pieper, S. (2014, May). Strategies for examining the validity of interpretations and uses of performance assessment data. Presentation at the Association for Institutional Research Annual Conference, Orlando, Florida.
Center for Innovative Teaching & Learning. (2005). Norming sessions ensure consistent paper grading in large course. Retrieved from
Cook, D. A., Brydges, R., Ginsburg, S., & Hatala, R. (2015). A contemporary approach to validity arguments: A practical guide to Kane's framework. Medical Education, 49,
Council for the Accreditation of Educator Preparation. (2013). CAEP accreditation standards. Retrieved from
Daggett, W. R. (2014). Rigor/relevance framework®: A guide to focusing resources to increase student performance. International Center for Leadership in Education. Retrieved from
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education, 37,
Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction (6th ed.). White Plains, NY: Longman Publishers.

47 References (continued)
Graham, M., Milanowski, A., & Miller, J. (2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings. Center for Educator Compensation Reform.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8),
Kane, M. (2013a). The argument-based approach to validation. School Psychology Review, 42(4),
Kane, M. T. (2013b). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1-73.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2),
Pieper, S. (2012, May 21). Evaluating descriptive rubrics checklist. Retrieved from
Stevens, D. D., & Levi, A. J. (2005). Introduction to rubrics: An assessment tool to save grading time, convey effective feedback and promote student learning. Sterling, VA: Stylus Publishing, LLC.
Suskie, L. (2009). Assessing student learning: A common sense guide (2nd ed.). San Francisco, CA: Jossey-Bass.

