Presentation transcript:

1 Padlet Question Board: Why did you select this workshop and what are you hoping to learn from it? http://tinyurl.com/whyattending

2 Developing a Quality Assurance System, Building a Validity Argument for Locally Developed Performance Assessments, and Strategies for Calibrating Instruments. Cynthia Conn, PhD, Assistant Vice Provost, Professional Education Programs; Kathy Bohan, EdD, Associate Dean, College of Education; Sue Pieper, PhD, Assessment Coordinator, Office of Curriculum, Learning Design, & Academic Assessment; Matteo Musumeci, MA-TESL, Instructional Specialist, Professional Education Programs. Goal: Audience members identify with one of our roles or are able to identify individuals on campus with a similar role. Purpose: Who is on your team? Who could be on your team?

3 NAU Professional Education Programs
Context: NAU Professional Education Programs (PEP) statistics. 23 Initial Teacher Preparation Programs: enrollment 2,385 and graduates 607 (averages of the last 3 years). 17 Advanced Programs: enrollment 1,059 and graduates 456 (averages of the last 3 years). NCATE accredited in 2011. CAEP Self-Study submitted July 2016. CAEP Site Visit scheduled for March 2017. Main campus in Northern Arizona (Flagstaff); 17 statewide sites, on average 2 to 5 hours from the main campus in multiple directions.

4 Programs & Enrollment for Your EPP
Please respond to the following questions by raising your hand: Do you consider your institution to be large in terms of initial and/or advanced programs (25+) and enrollment of programs (2,000+)? Do you consider your institution to be medium-sized in terms of initial and/or advanced programs (10-25) and enrollment of programs (1,000 to 2,000)? Do you consider your institution to be small in terms of number of initial and advanced programs (<10) and enrollment of programs (<1,000)? We would like a sense of the size of your programs.

5 Your EPP’s Timeline for Accreditation
Please respond to the following questions by raising your hand: Is your institution exploring CAEP accreditation or in the candidacy process? Has your institution recently been re-accredited through NCATE or TEAC and is now transitioning to CAEP? Is your institution in the process of writing your CAEP Self-Study in the next 1 to 3 years? Has your institution submitted the CAEP Self-Study report and will it be having a Site Visit this year? We would like a sense of your timeline for accreditation or re-accreditation.

6 Workshop Objectives Objectives
Discuss strategies for developing a comprehensive Quality Assurance System (CAEP Standard 5.1). Discuss a framework and strategies for examining the validity and reliability of the use and interpretation of locally developed performance assessments (CAEP Standard 5.3): the Validity Inquiry Process Model and strategies for calibrating performance assessments.

7 Developing a Quality Assurance System
We are going to start by gathering your ideas on “What is a Quality Assurance System?”

8 What is a Quality Assurance System?
Ideas: Process for determining how to collect meaningful assessment data and how candidates help students learn; consistent collection of data; meaningful data collection used by faculty; program improvement; self-reflection to improve the program; checks and balances (if not useful, don't collect data or assess); variety of data, qualitative and quantitative, with results compared across instruments.

9 Quality Assurance System: Components/Structure
A meaningful system accessible by faculty for reporting and continuous improvement of programs, but also used for retention, recruitment, tracking matriculation of candidates, research, grants, and action plans related to curricular changes. NAU PEP Quality Assurance System website: URL listed on table tent. QAS Resources:

10 Quality Assurance System: Definitions
What is a Quality Assurance System? CAEP Standard 5.1: The provider's quality assurance system is comprised of multiple measures that can monitor candidate progress, completer achievements, and provider operational effectiveness. Evidence demonstrates that the provider satisfies all CAEP standards. CAEP Standard 5.3 (required component): The provider regularly and systematically assesses performance against its goals and relevant standards, tracks results over time, tests innovations and the effects of selection criteria on subsequent progress and completion, and uses results to improve program elements and processes. Definitions: Quality Assurance System: Mechanisms (i.e., structures, policies, procedures, and resources) that an educator preparation provider (EPP) has established to promote, monitor, evaluate, and enhance operational effectiveness and the quality of the educator preparation provider's candidates, educators, curriculum, and other program requirements (CAEP Accreditation Handbook, p. 186). Quality Assurance System: A system that ... [relies] on a variety of measures, ..., [seeks] the views of all relevant stakeholders, [shares] evidence widely with both internal and external audiences, and [uses] results to improve policies and practices (CAEP Glossary).

11 Quality Assurance System: Strategies
Reference - QAS: Guiding Questions for Strategies handout

12 Strategy: High Level Needs Analysis
Purpose: Document strengths and issues related to your current quality assurance system that will assist with prioritizing work. Examples: At NAU, this first strategy was conducted by the Assistant Vice Provost of NAU Professional Education Programs. The work provided necessary information for prioritizing efforts and developing a vision for the Quality Assurance System. NAU was collecting data well but needed to improve systematic reporting and access to data. We also recognized we needed to improve the quality of assessment instruments.

13 Strategy: High Level Needs Analysis
Activity (partner-share/large-group share): How have you or could you gather this high-level needs analysis data on your campus? Who did/could you talk to on your campus? What documentation did/could you review? Are there other initial approaches your campus took to develop a quality assurance system? Who could implement this strategy on your campus? With one or two partners sitting next to you, please take 5 minutes to discuss these three questions. We will ask for highlights from your discussions to be shared with the whole group.

14 Strategy: High Level Needs Analysis
Ideas: How have you or could you gather this high-level needs analysis data on your campus? Hold a large meeting with stakeholders; find a leader/point person who could take on the work; the Early Instrument Review (CAEP) could help target efforts; align local program assessments to CAEP/SPA/discipline-specific standards; use the CAEP Site Reviewer Rubric (in the CAEP March 2016 Accreditation Handbook) as a resource for identifying strengths and gaps in the assessment system. Are there other initial approaches your campus took to develop a quality assurance system? Integrate institutional values into CAEP Standards to encourage accreditation work; build a unified data structure with support from faculty and graduate assistants to aggregate and report data. Who could implement this strategy on your campus? Experts or leaders with backgrounds in research and methodology, appraisal and assessment, and statistics; people with strong interpersonal communication skills (to liaise with faculty); an institutional or EPP leader to communicate consistently; balance with people who are working students. With one or two partners sitting next to you, please take 5 minutes to discuss these three questions. We will ask for highlights from your discussions to be shared with the whole group.

15 Strategy: High Level Needs Analysis
With the person(s) next to you, discuss: Who could or should you share these results with? Who should be part of the next action steps? Who might you need to collaborate with or seek support from?

16 Quality Assurance System: Strategies

17 Strategy: Assessment Audit
Purpose: 1) Develop a detailed listing of current assessment instruments; 2) Document alignment to CAEP Standards, the quality of the instruments, and the implementation schedule. Examples: Two of NAU's EPP leaders conducted the initial assessment audit and discussed strengths and gaps with Coordinating Council members. The Student Teaching Evaluation and Candidate Work Sample needed to be improved in terms of validity and reliability. NAU's EPP identified gaps in collecting data regarding graduates. Assessment Audit Template. The next strategy we would like to discuss for the Quality Assurance System is to conduct a detailed audit of the assessments as they align to the CAEP standards; consider the purpose, value, and quality of the instruments; and identify strengths and gaps. Then prioritize efforts and set timelines for documenting validity arguments for the instruments, verifying reliability, and determining how to use the data to inform continuous improvement. A common misconception is that validity resides in the tests or measures themselves: if the measure is determined to be "valid," then any data obtained from the measure must be valid. In fact, evidence of validity must relate to the interpretation and use of the scores. CAEP Standard 5 asks "the provider to maintain a quality assurance system comprised of valid data from multiple measures." The purpose of the Assessment Audit is to provide evidence demonstrating the provider has measures that "monitor candidate progress, completer achievements, and operational effectiveness aligned with all CAEP standards." In 2014 at NAU, two educator preparation program leaders completed the assessment audit using a template (available on our website resources page).

18 Strategy: Assessment Audit
Assessment Audit Template, CAEP Standard #1: Candidate Knowledge, Skills, and Professional Dispositions. Template columns: Standard Component; Evidence/Assessment Instrument; Schedule (Implementation, Reporting, Review, Administrations); Use of Data; Validity/Reliability; CAEP Assessment Review Criteria. Show a sample audit entry for CAEP component 1.1. Standard: understanding of the 10 InTASC standards; assessment instrument: licensure exams. Implementation: data are managed externally since this is a published instrument; data are extracted by NAU PEP staff each Sept 1 to add to report templates for SPA- or EPP-level analysis; reviewed formally in even years for programs and in odd years at the EPP level. Administrations: number of administrations in the reporting cycle. Use of Data: subscale analysis informs TPC candidate support. Validity/Reliability: proprietary. CAEP Review Criteria: disaggregated by licensure area; at least 2 cycles of data reported; provides evidence of each candidate's knowledge. Another example that came from the audit (teacher candidate evaluation or student teaching evaluation): We were using several different, locally developed teacher candidate observation/evaluation instruments. A goal was to identify one instrument better aligned to how we wanted to use the data: to look at how well our candidates are prepared to enter the teaching profession, but also to look at disaggregated data to drill down into specific knowledge, skills, and dispositions strengths and areas needing improvement. For this, a representative committee of faculty, university supervisors, and staff reviewed several options and selected a proprietary (nationally standardized) measure to adapt and use across our programs. QAS Resources:
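The audit template itself is just a small structured table that programs fill in and staff aggregate. As a minimal sketch in Python (the column names mirror the slide above; the file name and the wording of the example row are illustrative assumptions, not NAU's actual records), the template could be kept as a simple CSV:

import csv

# Minimal sketch of the assessment audit as a CSV the EPP fills in and aggregates.
# Column names mirror the template on this slide; the example row is illustrative only.
AUDIT_COLUMNS = [
    "Standard Component",
    "Evidence/Assessment Instrument",
    "Schedule (Implementation/Reporting/Review/Administrations)",
    "Use of Data",
    "Validity/Reliability",
    "CAEP Assessment Review Criteria",
]

example_row = {
    "Standard Component": "1.1 Understanding of the 10 InTASC standards",
    "Evidence/Assessment Instrument": "Licensure exams",
    "Schedule (Implementation/Reporting/Review/Administrations)":
        "Data extracted each Sept 1; program review in even years, EPP review in odd years",
    "Use of Data": "Subscale analysis informs candidate support",
    "Validity/Reliability": "Proprietary (publisher-documented)",
    "CAEP Assessment Review Criteria": "Disaggregated by licensure area; at least 2 cycles reported",
}

with open("assessment_audit.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=AUDIT_COLUMNS)
    writer.writeheader()
    writer.writerow(example_row)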

19 Strategy: Assessment Audit
Questions? Suggestions? If we aren’t able to get to your question, please post it using the following URL. Check back after the presentation for a response.

20 Strategy: Assessment Audit
With the person(s) next to you, discuss: Who could or should you share these results with? Who should be part of the next action steps? Who might you need to collaborate with or seek support from?

21 Quality Assurance System: Strategies

22 Strategy: Identify & Implement Data Tools
Purpose: To identify the data tool functions that need to be present to have systematic collection, reporting, and use of data Vision: Reporting and use of data were seen as the most important functions that we needed to address in relation to our quality assurance system; Performance data integrated with demographic data; university-supported tools; led to the use of an online collaborative space for faculty (e.g., SharePoint) Provide the NAU example: At NAU, we were seeking a quality assurance system that would include multiple data tools to address the complex and large-scale nature of our data: For data collection, we integrated a rubric tool with a data management tool so we could analyze data by academic plan, campus location, etc. For data reporting, we also integrated data collection tools with data reporting and archiving tools The majority of our data tools do interface with each other to achieve a balance of data functions

23 Strategy: Identify & Implement Data Tools
Data Tools and Functions Self-Assessment. Identify and address current areas of strength related to data tools on your campus. Types of data tool functions: data collection (e.g., rubric or survey tools); reporting tools; sustainability and efficiency; audiences. Worksheet columns: Data Tool (name of tool); Data Tool Function (which function is aligned with the data tool?); Audience (who will view and use the data?); Sustainable and efficient? (yes/no). Identify any gaps in relation to data functions in your quality assurance system.

24 What Data Tool Functions are Needed?
Ideas: Examples of data tool functions we’re currently using: Electronically collecting rubric and survey data Reporting and archiving data Transitioning to university-supported tools Connecting to university demographic data Enhancing access to data and instruments for a variety of audiences (e.g., faculty, teachers, university leadership)

25 Strategy: Identify & Implement Data Tools
With the person(s) next to you, discuss: Who could or should you share these results with? Who should be part of the next action steps? Who might you need to collaborate with or seek support from?

26 Developing a Quality Assurance System: Strategies

27 Strategy: Assessment Policies & Procedures
Purpose: To develop a useful, efficient and sustainable Quality Assurance System Examples: Aligning systematic reporting with University and State reporting requirements Policies & Procedures for Program Level Assessment Reporting (developed in collaboration with NAU’s assessment office) Biennial Reporting Chart with expectations Self-Study and SPA report files are maintained and updated to eliminate duplicate work Develop a Master Assessment Plan and Calendar for EPP level assessments Show example files for Master Assessment Plan and Calendar and Biennial Reporting Chart with expectations

28 Strategy: Assessment Policies & Procedures
Questions? Suggestions? If we aren't able to get to your question, please post it using the following URL. Check back after the presentation for a response.

29 Strategy: Assessment Policies & Procedures
With the person(s) next to you, discuss: Who could or should you share these results with? Who should be part of the next action steps? Who might you need to collaborate with or seek support from?

30 CAEP Self-Study Report
Iterative process... Formed CAEP Self-Study Writing Committees. Evidence file templates available on the QAS Resources website. EPP-level faculty meeting held on a biennial basis to formally review data, utilizing a "speed sharing" technique. Re-reviewing assessment instruments (high-level needs analysis strategy) to consider options for streamlining data collection. Conducting an assessment audit with advanced programs to identify existing data and gaps related to the standards approved in June 2016.

31 Developing a Quality Assurance System
The wonders of inflight construction… [Credit: Eugene Kim & Brian Narelle, Prioritizing work... Based on the due date of your CAEP Self-Study Report, does your institution have sources of evidence (or a plan, if allowed) for all standards and components, especially required ones? Will your institution have enough data (3 administrations)? During the high-level needs analysis, what did you identify as the most significant gaps? Are the data being collected valid and reliable?

32 Building a Validity Argument for Locally Developed Performance Assessments

33 The Purpose of the Validity Inquiry Process (VIP) Model
Validity Inquiry Process is a component of the Quality Assurance System. Purpose: The purpose of the Validity Inquiry Process (VIP) Model instruments is to assist in examining and gathering evidence to build a validity argument for the interpretation and use of data from locally or faculty-developed performance assessment instruments. Leads to making the validity argument. Theory-to-practice approach. Qualitative and reflective. Efficient. An important component of a Quality Assurance System is having confidence in the various assessments and sources of data that inform program decision-making efforts. This work aligns to CAEP 5.2: The provider's quality assurance system relies on relevant, verifiable, representative, cumulative and actionable measures, and produces empirical evidence that interpretations of data are valid and consistent. Earlier in this presentation, we talked about conducting an Assessment Audit (see slides 16-18). One "discovery" from the audit was a set of concerns about the quality of the instruments in our capstone experience, student teaching. I described how the audit led us to adopt a proprietary teacher candidate evaluation tool that included extensive training and evidence that raters were competent with reliably scoring the measure. Similarly, NAU was using a Candidate Work Sample modeled off of the work of faculty at Western Washington University. However, our locally developed assignment directions and rubric lacked a detailed description of the purpose and provided only a general description of how the assignment should be completed. The 7-row rubric was vague, listing only brief terms intended to align to learning outcomes or standards, and the rubric descriptors provided little guidance to the candidate or the evaluator on how to rate the candidates' products. Faculty and university supervisors were also telling us they were uncomfortable with how to implement and score the assignment. In 2013, we reviewed the literature related to performance assessments and establishing validity arguments to support their use. In general, we found theoretical models, but little practical guidance or tools to support faculty with developing and/or reviewing their locally developed measures. The VIP presented here is an attempt to fill this void. The model's intention is to use self-reflection, expert group consensus, and qualitative responses in a structured framework to guide making a written validity argument for why the measure produces valid and reliable data for interpretation.

34 The purpose of today's presentation is not to go into detail about how to use the VIP. We have presented at previous CAEP conferences and other venues on this model. Rather, we want to use our work with this model as an example of one component of a Quality Assurance System. In the past three years, we have implemented the model with program faculty reviewing the performance assessments aligned to their SPA standards. Briefly, this process has been found to be efficient. Two or three faculty who are the primary instructors for the course using the assessment are asked to use the Validity Inquiry Form and the Metarubric to independently rate their assignment directions (VIP) and rubric (Metarubric). Then the committee meets and the process is facilitated by assessment staff. One staff member records the discussion. The meeting tends to take 2 hours, and the results are then summarized in a Next Steps document. This document lists the identified strengths, areas needing improvement, and plans/timelines. In some cases, this leads to drafting the validity argument or justification, including evidence of inter-rater agreement and reliability, or a plan for how to establish and maintain reliability. In most cases, the VIP has resulted in at least minor revisions to the instructions and/or rubric and then a follow-up meeting to determine the validity argument. In at least a couple of cases, it has resulted in "starting over," since the faculty agreed the instrument didn't meet the purpose and learning outcomes for the course or program or didn't readily support scaffolding of the intended knowledge or skills. For us, the VIP was used to evaluate our measures, but the model can also be used to guide development of performance assessment measures.

35 Validity Inquiry Process (VIP) Model Criteria
Purpose, Domain coverage, Content quality, Cognitive complexity, Meaningfulness, Generalizability, Consequences, Fairness, Cost and Efficiency (Linn, Baker, & Dunbar, 1991; Messick, 1994). The criteria used in the process are grounded in the literature. Interestingly, we had a place for describing the purpose of the assignment on the form, but we are now highlighting this more intentionally in our VIP meetings. We are finding rich conversations of 20 minutes or so as the faculty discuss WHY this assignment is essential and HOW it provides evidence of the candidates' knowledge and skills. See the VIP and Metarubric Forms. The questions on the various rows guide documenting the evidence (or lack of evidence) for these criteria. In a few minutes, we will look in detail at the Cognitive Complexity criterion from the VIP and the Fairness criterion from the Metarubric. Additional information about the VIP: Domain coverage: breadth and depth of content (Messick, 1994). Evidence: 1) content specialists involved in determining learning outcomes and developing the assignment, faculty expertise; 2) discipline organizations (e.g., SPAs) establish standards or student learning outcomes. Content quality: the assignment addresses content knowledge in the field and the application of that knowledge and those skills; avoid including irrelevant items leading to error variance (threats to validity). Evidence: faculty meeting minutes from development of the assessment through curriculum and assessment mapping; map the entire program of study and determine where various performance assessments, as well as other types of measures, "fit" to assess knowledge and skills (multiple measures, yet efficient without redundancy or gaps; assignments authentic, balanced, and covering standards/learning outcomes). Cognitive complexity: emphasis on problem solving, comprehension, critical thinking, reasoning, and metacognition (Linn et al., 1991); what processes do candidates need to use? Rigor/Relevance (Thinking and Action continuums), source: International Center for Leadership in Education. Meaningfulness: authenticity (avoid the validity threat of construct underrepresentation); evidence from a student survey and feedback from faculty who have taught the course. Generalizability: transfer of knowledge and skills to other related situations, disciplines, and topics. Consequences and fairness: reasonable amount of time to implement; the weight toward the course grade or the stakes of the outcome fit; all candidates have an equitable opportunity to be successful. Cost and efficiency: (external validity) benefits fit the costs; watch for assessments that are cumbersome, difficult to implement, or burdensome.

36 Validity Inquiry Forms
Metarubric Form. Student Survey. Show the forms. VIP Form: focuses on the prompt (directions, instructions) to reflect on the assignment's purpose and then domain coverage, content quality, cognitive complexity, meaningfulness, consequences, fairness, and efficiency. Metarubric: fairness. The Student Survey is used as evidence for the Meaningfulness criterion (authenticity) and can be combined with evidence from faculty supervisors that the experience was meaningful.

37 Using the Validity Inquiry Form
Cognitive Complexity. International Center for Leadership in Education: Thinking Continuum (y-axis): Acquisition----Assimilation of Knowledge (Bloom's Taxonomy). Action Continuum (x-axis): Acquisition----Application of Knowledge (Daggett): use of knowledge to solve increasingly more complex, authentic, and unique problems in real-world situations. Quadrant A: Students gather and store bits of knowledge and information. Students are primarily expected to remember or understand this knowledge. Quadrant B: Students use acquired knowledge to solve problems, design solutions, and complete work. The highest level of application is to apply knowledge to new or unpredictable situations. Quadrant C: Students extend and refine their acquired knowledge to be able to use that knowledge automatically and routinely to analyze and solve problems and create solutions. Quadrant D: Students have the competence to think in complex ways. Daggett, W.R. (2014). Rigor/relevance framework®: A guide to focusing resources to increase student performance. International Center for Leadership in Education. Retrieved from

38 The Validity Inquiry Process: Example
Student Teaching Capstone Assignment: Candidate Work Sample (CWS). Background: Spring 2014: revisions from the 7-row CWS made/used. August 2014: 19-row rubric based on faculty and university supervisor feedback. July 2015: inter-rater reliability session. September 2015: Validity Inquiry Process meeting, further revisions; Next Steps Summary. December 2015: change to CWS Evaluators. February 2016: implementation; committee met to write the Validity Argument. April 2016: CWS Evaluator (with student teachers) debrief, additional Next Steps; summer revisions. August 2016: CWS Evaluator calibration session (reviewed the revised CWS, inter-rater agreement training). Instrument development is continuous. The unit-level measure in student teaching is the CWS. What we want to highlight from our experience is how this framework aligns to continuous improvement. The original 7-row CWS was vague, including double-barreled descriptors that didn't allow us to provide clear feedback to candidates OR inform the program on how to improve. Fall 2013: revisions from the 7-row CWS made/used; moved from TaskStream to Bb Learn using the rubric tool and processes to extract data reports (glitches). Fall 2014: faculty and university supervisor feedback used to make further changes; discrete rows tagged to InTASC standards; candidates must earn a 2 in all categories. Summer 2015: inter-rater reliability session; fifty university supervisors (US) rated a paper prior to attending and submitted ratings; absolute and adjacent agreement was calculated (absolute 38.52%, adjacent 46.47%, overall 84.99%) and qualitative responses were organized for discussion; results showed adequate overall agreement, but some rows had greater variability than others; conversations led to a list of suggestions to revise or add to the directions and the rubric (rows with greater variability: differentiation of instruction, technology integration, assessment). Fall 2015: Validity Inquiry Process meeting; US information shared; VIP followed; further revisions, including breaking the assignment into parts with overall directions and then guiding questions with each part, refinements to descriptor wording, and checks of alignment with InTASC standards; also a decision to change to CWS Evaluators (n = 15); Next Steps Summary. Spring 2016: implementation of the revised CWS with the scale reversed to match the teacher candidate evaluation (NIET TAP); Next Steps notes used to guide making the Validity Argument. April 2016: CWS Evaluator debrief; endorsed four parts with 6 sections; a candidate must have an average score of "2" with no more than one "1" (developing) and no "0" in each of the four parts; allows scaffolding based on explicit feedback to candidates. August 2016: CWS Evaluator calibration session, described later in the presentation. Fall 2016: implementation of the 4-part CWS with two opportunities for candidate improvement and revised scale descriptors (changed from Does Not Meet, Approaches, Meets, Exceeds to Does Not Meet, Developing, Meets, Exceeds); a candidate can get one Developing in each part and an opportunity to improve (this avoids raters' tendency to score Meets and provides the program with information about trends or patterns where more candidates are at a developing level, guiding program improvement as well as candidates' goal setting as they enter their first year of teaching). Summary: The process has led to an assignment with scaffolded directions and feedback procedures, leading to greater confidence in interpretation and use of the results.

39 Activity: Using the Validity Inquiry Form
Discuss in pairs or small groups: What is the stated purpose of this performance assessment and is it an effective purpose statement? Q3: Using the Rigor/Relevance Framework, identify the quadrant the assessment falls into and provide a justification for this determination. What were the results of your small group discussion?

40 Using the Metarubric: Read the example assignment instructions and the Metarubric question. Criteria: Q2: Does the rubric criterion align directly with the assignment instructions? What were the results of your small group discussion?

41 Faculty Feedback Regarding Process
“I wanted to thank you all for providing a really productive venue to discuss the progress and continuing issues with our assessment work. I left the meeting feeling very optimistic about where we have come and where we are going. Thank you.” – Associate Professor, Elementary Education. “Thanks for your facilitation and leadership in this process. It is so valuable from many different perspectives, especially related to continuous improvement! Thanks for giving us permission to use the validity tools as we continue to discuss our courses with our peers. I continue to learn and grow...” – Assistant Clinical Professor, Special Education. Feedback from Faculty.

42 Strategies for Calibrating Instruments

43 Purpose of Calibrating Instruments
Strategies for calibrating instruments (Frame-of-Reference Training). Purpose for calibrating instruments: Why is it important to have consistency and agreement among evaluators? Confidence in the assessment data collected and used for program improvement. Ensure accountability and common expectations for evaluators to score using rubric indicator language. Informed use of assessment results in making decisions about a candidate's ability to complete the program. Need for a valid and reliable EPP-wide instrument (CAEP 5.2). Strategies for calibrating instruments: Frame-of-Reference Training provides a useful framework of strategies for conducting a calibration training.

44 Inter-rater Agreement and Reliability
Agreement: measures the consistency/differences between the absolute values of evaluators' scores. Reliability: measures the variability of scores; the relative ranking/ordering of evaluators' scores. [Table: example scores from Evaluators 1-4 on Students 1-3, contrasting a low-agreement/high-reliability pair of evaluators with a high-agreement/high-reliability pair; agreement ranges from 0.0 to 1.0.] Agreement vs. Reliability (NB: we use the term "evaluator" to be consistent with "CWS Evaluators"; other terms in the literature include raters, judges, etc.). Agreement is a measurement of the consistency between the absolute values of evaluators' scores; it reflects true strengths and weaknesses in candidate performance (a criterion-referenced interpretation of scores). Reliability is a measurement of the relative standing/ranking/ordering of evaluators' scores; it describes the consistency of evaluators' judgments about levels of performance (a norm-referenced interpretation of scores). It is important to report both percentages of agreement and an appropriate statistic for high-stakes assessments; we contend that both are important and provide meaningful data. Evaluators 1 and 2: both evaluators agreed on the candidates' relative ranking (both evaluators' scores increased similarly), but had 0 agreement on absolute level of performance. Evaluators 3 and 4: both evaluators agreed on the candidates' ranking AND absolute level of performance (not typical). Adapted from Graham, Milanowski, & Miller (2012).
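To make the distinction concrete, here is a minimal sketch in Python with hypothetical scores (not the slide's actual table): percent absolute agreement is computed alongside a Spearman rank-order correlation used as a simple reliability index. The first evaluator pair never matches exactly but orders the candidates identically (low agreement, high reliability), while the second pair matches on every score (high agreement, high reliability).

def percent_absolute_agreement(a, b):
    # Share of candidates on whom two evaluators gave the identical score.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def spearman_rho(a, b):
    # Rank-order (Spearman) correlation as a simple reliability index; assumes no tied scores.
    def ranks(values):
        ordered = sorted(values)
        return [ordered.index(v) + 1 for v in values]
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# Hypothetical scores for three candidates (illustrative only).
evaluator_1 = [1, 2, 3]
evaluator_2 = [2, 3, 4]   # same ordering, always one point higher
evaluator_3 = [1, 2, 3]
evaluator_4 = [1, 2, 3]   # identical scores

print(percent_absolute_agreement(evaluator_1, evaluator_2))  # 0.0 -> low agreement
print(spearman_rho(evaluator_1, evaluator_2))                # 1.0 -> high reliability
print(percent_absolute_agreement(evaluator_3, evaluator_4))  # 1.0 -> high agreement
print(spearman_rho(evaluator_3, evaluator_4))                # 1.0 -> high reliability

The output illustrates the slide's point: a reliability coefficient can be perfect while evaluators still disagree on absolute performance levels, which is why both kinds of statistics are reported.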

45 Calibration Training: Frame-of-Reference Training
Elements adapted from Frame-of-Reference Training: Explanation of the rating system to evaluators. Discussion of common evaluator errors and strategies for avoiding them. Advice for making evaluations. Practice calibrating a sample paper. Considerations for selecting evaluators and expert panelists. Common expectations for the implementation of the assessment. Ongoing monitoring of evaluators' ratings during the semester for scoring consistency. Redesign of the instrument based on data, ad hoc focus groups, and evaluator feedback. Strategies for Calibrating Instruments (adapted from the CECR Frame-of-Reference Training outline): Explanation of the rating system (e.g., scoring system, differences in performance level descriptors, detailed description of alignment to InTASC and SPA standards at the indicator level, general overview of the domains of the rubric, number of indicators, parts/sections) to evaluators. Discussion of common evaluator errors and strategies for avoiding them. Central tendency: the evaluator gives the same ratings (e.g., all scores of "2") as an overall evaluation of the paper; the evaluator needs to focus on connecting the evidence in the paper to the rubric indicator. Advice for making evaluations. Practice calibrating a sample paper in person. Summer 2016 calibration training session: the morning session involved familiarizing evaluators with the rating system and common errors, then moved into a calibration exercise with the paper that had the strongest agreement, as determined by the expert panel.

46 Inter-rater Agreement: Calibration Strategies
Select anchor papers previously scored for the expert panel. Select expert panel members to score the anchor papers. Examine data from the anchor papers to determine the strongest paper for the calibration exercise. Train the group of evaluators. Summer 2016 calibration strategies: Selected 3 anchor papers for the expert panel that represented a variety of subjects, grade levels, and quality (previously scored in Spring 2016). Selected expert panel members with broad representation of COE and Secondary Education programs; asked expert panelists to score the 3 anchor papers (Paper 1 received 3 evaluations, Paper 2 received 3 evaluations, and Paper 3 received 4 evaluations; a total of 10 evaluations across 7 expert panelists). Examined data from the anchor papers to determine the strongest paper for the calibration exercise (calculated an overall mean percentage of agreement), as sketched below. Evaluator group training session: morning calibration exercise (familiarization with the rubric tool and scoring system; presentation of common rater errors and strategies for avoiding them; individual scoring of a common paper; small group discussion of 3 questions; large group discussion and deeper discussion).
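As a rough sketch of the "examine data from anchor papers" step (all scores below are hypothetical placeholders, not the panel's actual ratings), the strongest paper can be identified by computing, for each anchor paper, the mean pairwise percentage of exact agreement among the panelists who scored it:

from itertools import combinations

def mean_pairwise_agreement(ratings):
    # Mean percent of rubric rows on which each pair of panelists gave identical scores.
    per_pair = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(ratings, 2)
    ]
    return sum(per_pair) / len(per_pair)

# panel_scores[paper] = one list of rubric-row scores per expert panelist (hypothetical).
panel_scores = {
    "Paper 1": [[3, 2, 3, 2], [3, 2, 3, 3], [3, 2, 2, 2]],
    "Paper 2": [[2, 2, 3, 1], [3, 3, 2, 2], [2, 1, 3, 2]],
    "Paper 3": [[3, 3, 3, 2], [3, 3, 3, 2], [3, 3, 2, 2], [3, 3, 3, 2]],
}

agreement = {paper: mean_pairwise_agreement(scores) for paper, scores in panel_scores.items()}
print(agreement)
print("Strongest paper for the calibration exercise:", max(agreement, key=agreement.get))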

47 Inter-rater Agreement: Calibration Strategies
Questions for small group discussion given to evaluators participating at calibration session (after individual scoring): What evidence (i.e., quantity and quality) can you connect to each indicator of the rubric? What challenges to developing consensus did your group encounter? What qualitative feedback would you provide to help the candidate advance from one performance level to the next higher level?

48 Inter-rater Agreement: Calibration Strategies
Whole group discussion. Analyze inter-rater agreement data. Report inter-rater agreement data to evaluators and program faculty. All programs. Whole group discussion: representatives from the various groups relayed their small group's discussion. Choosing appropriate statistics and reporting inter-rater agreement and descriptive statistics of evaluators' score data: differences from the expert panel, mean percentages of agreement, and standard deviations are calculated; frequency counts of score distributions; mean scores by rater. Significant areas of disagreement and trends with rubric indicators are identified and communicated to evaluators after training. Following up with programs and program coordinators or key faculty: prepare a report with data findings, analysis, and interpretation of the calibration exercise and next steps; share data at faculty assemblies and through your reporting data tools. Musumeci, M., & Bohan, K. (2016). Candidate Work Sample (CWS) Evaluator Inter-rater Agreement Calibration Training. Presentation at the NAU University Supervisor and CWS Evaluator Annual Meeting, Phoenix, AZ.

49 Inter-rater Agreement: Summary of Agreement Data
Summary of inter-rater agreement data (Summer 2016 CWS Evaluator Training & Calibration Session). Summary of CWS Evaluators' percentages of agreement with the expert panel on the calibration exercise paper: Number of evaluators: 15; Average % absolute agreement: 43.33%; Average % adjacent agreement: 50.91%; Overall average agreement (absolute + adjacent): 94.24%; Cronbach's alpha (internal consistency reliability of the scale): .897. Example from NAU PEP: General Candidate Work Sample completed in all ST courses except Math/Science. Definitions of agreement measures: Absolute: perfect agreement between evaluators' scores and expert panel average scores. Adjacent: within 1 point of expert panel average scores. Calculating an overall average percentage of agreement (absolute + adjacent) is acceptable, per the literature. Cronbach's alpha is calculated to measure how well the rubric indicators are measuring the same overall construct (which we call impact on student learning); values above .80 are acceptable, per the literature (Subkoviak).
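A minimal sketch of how these summary statistics can be computed in plain Python, using hypothetical scores for a few evaluators and rubric rows rather than the actual Summer 2016 ratings (absolute = matching the expert-panel mean exactly; adjacent = within one point but not exact):

def agreement_vs_panel(evaluator_scores, panel_means, tolerance):
    # Percent of rubric rows where the evaluator is within `tolerance` of the panel mean.
    hits = sum(abs(e - p) <= tolerance for e, p in zip(evaluator_scores, panel_means))
    return hits / len(panel_means)

def cronbach_alpha(cases):
    # cases[i][j] = case i's score on rubric row j; here each case is one evaluator
    # rating the same calibration paper, matching the summary above.
    k = len(cases[0])  # number of rubric rows (items)
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([c[j] for c in cases]) for j in range(k)]
    total_var = var([sum(c) for c in cases])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

panel_means = [2.0, 3.0, 2.5, 3.0]                       # expert-panel average per rubric row
evaluators = [[3, 3, 3, 3], [2, 2, 2, 3], [1, 2, 1, 2]]  # hypothetical evaluator ratings

for scores in evaluators:
    absolute = agreement_vs_panel(scores, panel_means, tolerance=0.0)
    adjacent = agreement_vs_panel(scores, panel_means, tolerance=1.0) - absolute
    print(f"absolute={absolute:.0%}  adjacent={adjacent:.0%}  overall={absolute + adjacent:.0%}")

print("Cronbach's alpha:", round(cronbach_alpha(evaluators), 3))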

50 Inter-rater Agreement: Activity
Group discussion: Complete the worksheet provided by filling in the details for your EPP. Choose one unit-level (EPP-wide) assessment to which you could apply these strategies and discuss the questions on the worksheet. Spreadsheet template available on the QAS Resources website: First, please work individually to answer the questions on the handout for Part I. Then, please work with the person(s) sitting next to you to answer the questions in Part II on the handout. We will ask for highlights from your discussions to be shared with the whole group.

51 Strategies for Calibrating Instruments
Questions? Suggestions? If we aren't able to get to your question, please post it using the following URL. Check back after the presentation for a response.

52 Resources & Contact Information Quality Assurance System Resources website: Contact Information: Cynthia Conn, PhD Assistant Vice Provost, Professional Education Programs Kathy Bohan, EdD Associate Dean, College of Education Sue Pieper, PhD Assessment Coordinator, Office of Curriculum, Learning Design, & Academic Assessment Matteo Musumeci, MA-TESL Instructional Specialist, Professional Education Programs

53 Definitions Performance Assessment Validity Reliability
An assessment tool that requires test takers to perform—develop a product or demonstrate a process—so that the observer can assign a score or value to that performance. A science project, an essay, a persuasive speech, a mathematics problem solution, and a woodworking project are examples. (See also authentic assessment.) Validity The degree to which the evidence obtained through validation supports the score interpretations and uses to be made of the scores from a certain test administered to a certain person or group on a specific occasion. Sometimes the evidence shows why competing interpretations or uses are inappropriate, or less appropriate, than the proposed ones. Reliability Scores that are highly reliable are accurate, reproducible, and consistent from one testing [e.g., rating] occasion to another. That is, if the testing [e.g., rating] process were repeated with a group of test takers [e.g., raters], essentially the same results would be obtained. (National Council on Measurement in Education. (2014). Glossary of important assessment and measurement terms. Retrieved from:

54 References Daggett, W.R. (2014). Rigor/relevance framework®: A guide to focusing resources to increase student performance. International Center for Leadership in Education. Retrieved from Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction (6th Edition). White Plains, NY: Longman Publishers. Graham, M., Milanowski, A., & Miller, J. (2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings. Center for Educator Compensation Reform. Retrieved from Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), Pieper, S. (2012, May 21). Evaluating descriptive rubrics checklist. Retrieved from Stevens, D. D., & Levi, A. J. (2005). Introduction to rubrics: An assessment tool to save grading time, convey effective feedback and promote student learning. Sterling, VA: Stylus Publishing, LLC.

