Statistical Analysis of Scorer Interrater Reliability
Jenna Porter and David Jelinek, Sacramento State University

Background & Overview
- Piloted since
- Began with our own version of training & calibration
- Overarching question: scorer interrater reliability, i.e., are we confident that the TE scores our candidates receive are consistent from one scorer to the next? If not, what are the reasons? What can we do about it?
- Jenna: analysis of data to answer these questions
- Then we'll open it up to "Now what?" "Is there anything we can/should do about it?" "Like what?"

Presentation Overview
- Is there interrater reliability among our PACT scorers? How do we know?
- Two methods of analysis
- Results
- What do we do about it? Implications

Data Collection
- 8 credential cohorts: 4 Multiple Subject and 4 Single Subject
- 181 Teaching Events total
  - 11 rubrics (excludes the pilot Feedback rubric)
  - 10% randomly selected for double scoring
  - 10% of TEs were failing and double scored
- 38 Teaching Events were double scored (20%)

Scoring Procedures
- Trained and calibrated scorers
  - University faculty
  - Calibrate once per academic year
- Followed the PACT Calibration Standard; scores must:
  - Result in the same pass/fail decision (overall)
  - Have exact agreement with the benchmark at least 6 times
  - Be within 1 point of the benchmark
- All TEs scored independently once
  - If failing, scored by a second scorer and evidence reviewed by the chief trainer
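A minimal sketch of how the calibration standard above could be checked in code, assuming 11 rubric scores per Teaching Event on a 1-4 scale; the function name, the example scores, and the choice to pass the overall pass/fail decisions in as arguments (rather than deriving them from a pass rule not described here) are assumptions for illustration, not the actual PACT tooling.

def meets_calibration_standard(scorer_scores, benchmark_scores,
                               scorer_pass, benchmark_pass):
    """Check a scorer's benchmark scoring against the three calibration criteria."""
    pairs = list(zip(scorer_scores, benchmark_scores))

    # 1. Same overall pass/fail decision as the benchmark.
    same_decision = scorer_pass == benchmark_pass

    # 2. Exact agreement with the benchmark on at least 6 rubrics.
    exact_matches = sum(s == b for s, b in pairs)

    # 3. Every rubric score within 1 point of the benchmark.
    within_one = all(abs(s - b) <= 1 for s, b in pairs)

    return same_decision and exact_matches >= 6 and within_one

# Invented scores on 11 rubrics (1-4 scale assumed):
scorer_scores    = [2, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2]
benchmark_scores = [2, 3, 2, 2, 3, 2, 3, 3, 3, 3, 2]
print(meets_calibration_standard(scorer_scores, benchmark_scores,
                                 scorer_pass=True, benchmark_pass=True))  # True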

Methods of Analysis
- Percent agreement
  - Exact agreement
  - Agreement within 1 point
  - Combined (exact and within 1 point)
- Cohen's kappa (Cohen, 1960)
  - Indicates the proportion of rater agreement above what is expected by chance

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

Percent Agreement
- Benefits
  - Easy to understand
- Limitations
  - Does not account for chance agreement
  - Tends to overestimate true agreement (Berk, 1979; Grayson, 2001)

Berk, R. A. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83.
Grayson, K. (2001). Interrater reliability. Journal of Consumer Psychology, 10.
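A minimal sketch of the three percent-agreement figures named above, computed over paired scores from double-scored Teaching Events; the rater data are invented for illustration, not drawn from the study.

def percent_agreement(scores_a, scores_b):
    pairs = list(zip(scores_a, scores_b))
    n = len(pairs)
    exact = sum(a == b for a, b in pairs) / n                 # exact agreement
    within_one = sum(abs(a - b) == 1 for a, b in pairs) / n   # off by exactly 1 point
    combined = exact + within_one                             # exact or within 1 point
    return exact, within_one, combined

rater_1 = [2, 3, 2, 4, 3, 2, 3, 1, 2, 3]
rater_2 = [2, 3, 3, 3, 3, 2, 2, 1, 3, 3]
exact, within_one, combined = percent_agreement(rater_1, rater_2)
print(f"exact: {exact:.0%}, within 1: {within_one:.0%}, combined: {combined:.0%}")

The chance-agreement limitation is easy to see with a made-up extreme case: if two raters each independently passed 90% of their decisions, they would agree 0.9 × 0.9 + 0.1 × 0.1 = 82% of the time by chance alone.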

Cohen’s Kappa Benefits  Accounts for chance agreement  Can be used to compare across different conditions (Ciminero, Calhoun, & Adams, 1986). Limitations  Kappa may decrease if low base rates, so need at least 10 occurances (Nelson and Cicchetti, 1995) Ciminero, A. R., Calhoun, K. S., & Adams, H. E. (Eds.). (1986). Handbook of behavioral assessment (2nd ed.). New York: Wiley. Nelson, L. D., & Cicchetti, D. V. (1995). Assessment of emotional functioning in brain impaired individuals. Psychological Assessment, 7, 404–413.

Kappa Coefficient

kappa = (proportion of observed agreement − chance agreement) / (1 − chance agreement)

i.e., kappa = (P_o − P_e) / (1 − P_e), where P_o is the proportion of observed agreement and P_e is the proportion of agreement expected by chance. The coefficient ranges from −1.0 (disagreement) to 1.0 (perfect agreement).

Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33.
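A minimal sketch of Cohen's kappa computed directly from the formula above for two raters' scores on the same items; the score data are invented, and scikit-learn's cohen_kappa_score would give the same value.

from collections import Counter

def cohens_kappa(scores_a, scores_b):
    n = len(scores_a)
    categories = set(scores_a) | set(scores_b)

    # P_o: proportion of observed exact agreement.
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n

    # P_e: agreement expected by chance, from each rater's marginal distribution.
    counts_a = Counter(scores_a)
    counts_b = Counter(scores_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    return (p_o - p_e) / (1 - p_e)

rater_1 = [2, 3, 2, 4, 3, 2, 3, 1, 2, 3]
rater_2 = [2, 3, 3, 3, 3, 2, 2, 1, 3, 3]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # ≈ 0.37, versus 0.60 exact agreement for the same pairs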

Percent Agreement

Pass/Fail Disagreement

Cohen’s Kappa

Interrater Reliability Compared

Implications
- Overall interrater reliability was poor to fair
  - Consider/reevaluate the protocol for calibration
  - Calibration protocol may be interpreted differently
  - Drifting scorers?
- Use multiple methods to calculate interrater reliability
- Other?

How can we increase interrater reliability? Your thoughts...
- Training protocol
- Adding "Evidence Based" training (Jeanne Stone, UCI)
- More calibration

Contact Information
- Jenna Porter
- David Jelinek