Coding and Intercoder Reliability


Coding and Intercoder Reliability
Su Li, School of Law, U.C. Berkeley
2/12/2015

Outline
- Basics of data coding
- What is intercoder reliability?
- Why does it matter?
- How to measure and report intercoder reliability?
- How to improve intercoder reliability?
- References

Data Coding Basics
- Start from a codebook.
- Value options for each variable should be exhaustive and mutually exclusive.
- Use multiple variables to code overlapping values or multiple values for one observation.

Example Codebook (white collar lawyer project)

c_graduate_year: year graduated from law school (or year the highest degree was received, if not a law degree)
  - Note: type in the applicable year (YYYY).
  - Code 999 if the information is not available.

c_practice_area: practice area
  1. White collar (includes white collar defense, white collar crime, white collar litigation, etc.)
  2. Government or corporate investigations
  3. White collar and government/corporate investigations (if the practice area is described this way)
  4. Criminal defense (if the practice area is described this way)
  - Note: choose one of the above four options and type in the number. If the practice area has a different title, type in the title.
  - See var14-18 in the WC project codebook.

Example 1: Graduation year (JD, 1989)
Example profile pages:
- http://www.skadden.com/professionals/jon-l-christianson
- http://www.uria.com/en/oficinas/pekin/abogados.html?iniciales=FMB
- http://www.akingump.com/en/lawyers-advisors.html
- http://www.akingump.com/en/lawyers-advisors/michael-a-asaro.html
Workflow: input the data in Stata, label the data in Stata, and recode the data in Stata (see the sketch below).
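A minimal Stata sketch of these three steps, using the two codebook variables above. The example rows and the value-label text are assumptions for illustration, not project data.

```stata
* Enter a few observations by hand (values are made up for illustration)
clear
input id c_graduate_year c_practice_area
1 1989 1
2  999 3
3 2001 4
end

* Attach value labels for the practice-area codes from the codebook
label define practice 1 "White collar" 2 "Gov/corp investigations" ///
    3 "White collar and investigations" 4 "Criminal defense"
label values c_practice_area practice

* Recode the 999 "not available" convention to Stata's missing value
recode c_graduate_year (999 = .)
```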

What Is Intercoder Reliability?
Intercoder reliability is the widely used term for the extent to which independent coders evaluate a characteristic of a message or artifact and reach the same conclusion (also known as intercoder agreement; Tinsley & Weiss, 2000).
Intercoder reliability is not the same as a correlation coefficient, which measures the degree to which "ratings of different judges are the same when expressed as deviations from their means." Rather, it measures "the extent to which the different judges tend to assign exactly the same rating to each object" (Tinsley & Weiss, 2000, p. 98).
See also: http://astro.temple.edu/~lombard/reliability/

Why Does It Matter?
- Coding often involves coder judgment, which varies across individuals.
- The quality of the research depends on the consistency of those coding judgments.
- Monitoring intercoder reliability also helps control coding accuracy.
- Practically, it makes division of labor among multiple coders possible.

Commonly Reported Measures of Intercoder Reliability
Popping (1988) identified 39 different "agreement indices" for coding nominal categories. Commonly used ones, for two coders and n coded units:
- Percent agreement: PA_o = (number of units the two coders code identically) / n.
- Scott's pi: pi = (PA_o - PA_e) / (1 - PA_e), where PA_e = sum over categories of (p_i)^2 and p_i is the proportion of all codings (both coders pooled) falling in category i.
- Cohen's kappa: kappa = (PA_o - PA_e) / (1 - PA_e), where PA_e = sum over categories of p_i1 * p_i2, the product of coder 1's and coder 2's marginal proportions for category i.
- Krippendorff's alpha: computed with the Krippendorff's Alpha 3.12a software.
There is no consensus on a single "best" index. Percent agreement is widely used but misleading: it does not correct for chance and tends to overestimate reliability. Cohen's kappa has been criticized but is still the most frequently used. Hand-calculation examples: http://astro.temple.edu/~lombard/reliability
Interpretation: kappa expresses how much greater the level of agreement is than would be expected by chance. If kappa equals 0, the amount of agreement between the two coders is exactly what one would expect by chance; if kappa equals 1, the coders agree perfectly. (A Stata sketch of the two chance-agreement terms follows below.)
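A rough sketch of how the chance-agreement terms differ, for a binary (0/1) variable coded by two coders. The variable names coder1 and coder2, and the 0/1 coding, are assumptions; Stata's built-in kap command (shown later) is the usual way to get kappa.

```stata
* Observed agreement across all coded units
quietly count
local n = r(N)
quietly count if coder1 == coder2
local PAo = r(N) / `n'

* Each coder's marginal proportion of category 1 (works because codes are 0/1)
quietly summarize coder1
local p1 = r(mean)
quietly summarize coder2
local p2 = r(mean)

* Cohen's kappa: chance agreement from each coder's own marginals
local PAe_k = `p1'*`p2' + (1 - `p1')*(1 - `p2')
display "Cohen's kappa = " (`PAo' - `PAe_k') / (1 - `PAe_k')

* Scott's pi: chance agreement from the pooled (averaged) marginals
local pbar = (`p1' + `p2') / 2
local PAe_p = `pbar'^2 + (1 - `pbar')^2
display "Scott's pi    = " (`PAo' - `PAe_p') / (1 - `PAe_p')
```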

Example: a binary variable coded by two coders (cell percentages are within-row; marginal percentages are shares of the 59 units)

                coder1 = 0     coder1 = 1     Total
coder2 = 0      50 (94.34%)     3 (5.66%)     53 (89.83%)
coder2 = 1       4 (66.67%)     2 (33.33%)     6 (10.17%)
Total           54 (91.53%)     5 (8.47%)     59 (100%)

PA_o = (50 + 2) / 59 = 52/59 ≈ 0.881.
For Scott's pi, PA_e uses the pooled marginals: PA_e = ((53 + 54)/118)^2 + ((6 + 5)/118)^2 ≈ 0.831, so pi ≈ (0.881 - 0.831) / (1 - 0.831) ≈ 0.30.
For Cohen's kappa, PA_e uses each coder's own marginals: PA_e = (54/59)(53/59) + (5/59)(6/59) ≈ 0.831, so kappa ≈ 0.30.
(A Stata sketch reproducing this kappa follows below.)
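The kappa for this table can be reproduced in Stata by expanding the cell counts into unit-level data and running kap. The variable names and 0/1 layout below are assumptions about how the data would be entered; the result should come out to roughly 0.30.

```stata
* Rebuild the 2x2 table above as one row per cell with its frequency
clear
input coder1 coder2 freq
0 0 50
1 0  3
0 1  4
1 1  2
end

expand freq          // one row per coded unit (59 rows in total)
kap coder1 coder2    // Cohen's kappa for the two coders
```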

Use SPSS to Calculate Cohen's Kappa

CROSSTABS
  /TABLES=var1_coder2 BY var1_coder1
  /FORMAT=AVALUE TABLES
  /STATISTICS=KAPPA
  /CELLS=COUNT
  /COUNT ROUND CELL.

Use Stata to Calculate Cohen's Kappa
- kappa varlist: each variable is a rating category, and each cell holds the number of raters who assigned that category to the unit.
- kap coder1 coder2 ...: each variable is one coder's ratings (see the Stata demo and the layout sketch below).
Benchmark scale from Landis and Koch (1977, p. 165):
  below 0.00   Poor
  0.00 - 0.20  Slight
  0.21 - 0.40  Fair
  0.41 - 0.60  Moderate
  0.61 - 0.80  Substantial
  0.81 - 1.00  Almost perfect
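A sketch of the two data layouts these commands expect; the values are made up purely to show the shape of the data.

```stata
* Layout for -kap-: one variable per coder, one row per coded unit
clear
input unit coder1 coder2
1 1 1
2 2 2
3 1 2
4 3 3
end
kap coder1 coder2

* Layout for -kappa-: one variable per category, one row per unit,
* cells = number of raters who assigned that category to the unit
clear
input cat1 cat2 cat3
2 0 0
0 2 0
1 1 0
0 0 2
end
kappa cat1 cat2 cat3
```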

Random vs. Systematic Coder Differences
[Table: codes assigned by four coders to 20 observations]
- Coder 1 vs. coder 2: differences are random.
- Coder 1 vs. coder 3: differences are systematic (e.g., coder 3 always codes a 2 as 1 and a 3 as 4).

Acceptance Standards (Neuendorf, 2002)
There is no single agreed-upon standard. Some rules of thumb:
- "Coefficients of .90 or greater would be acceptable to all, .80 or greater would be acceptable in most situations; below .80, there exists great disagreement" (p. 145).
- A criterion of .70 is often used for exploratory research.
- More liberal criteria are usually applied to indices known to be more conservative (i.e., Cohen's kappa and Scott's pi).

Comparison of Indices (Hughes & Garrett, 1990)
- Percent agreement: does not correct for chance agreement. Not recommended.
- Scott's pi: acceptance level 0.6; addresses chance correction and the systematic coding error problem. Acceptable.
- Cohen's kappa: Landis & Koch (1977) benchmarks: <0.00 Poor; 0.00-0.20 Slight; 0.21-0.40 Fair; 0.41-0.60 Moderate; 0.61-0.80 Substantial; 0.81-1.00 Almost perfect. Acceptable (most extensively discussed).
- Krippendorff's alpha
- Pearson's correlation: does not consider systematic coding bias.

How to Improve Intercoder Reliability (Lombard et al., 2002)
In research design:
- Assess reliability informally during coder training (detailed instructions, close monitoring, etc.).
- Assess reliability formally in a pilot test.
- Assess reliability formally during coding of the full sample.
- Select and follow an appropriate procedure for incorporating the coding of the reliability sample into the coding of the full sample (e.g., master-coder quality control).
In reporting results:
- Select one or more appropriate indices.
- Obtain the tools needed to calculate the selected index or indices.
- Select an appropriate minimum acceptable level of reliability for the index or indices used.
- Report intercoder reliability in a careful, clear, and detailed manner in all research reports.
See also: http://astro.temple.edu/~lombard/reliability/

References
- http://astro.temple.edu/~lombard/reliability/
- Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28, 587-604.
- Tinsley, H. E. A., & Weiss, D. J. (2000). Interrater reliability and agreement. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 95-124). San Diego, CA: Academic Press.
- Popping, R. (1988). On agreement indices for nominal data. In W. E. Saris & I. N. Gallhofer (Eds.), Sociometric research: Volume 1, data collection and scaling (pp. 90-105). New York: St. Martin's Press.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
- Hughes, M. A., & Garrett, D. E. (1990). Intercoder reliability estimation approaches in marketing: A generalizability theory framework for quantitative data. Journal of Marketing Research, 27(2), 185-195.