Item Response Theory Dan Mungas, Ph.D. Department of Neurology


Item Response Theory Dan Mungas, Ph.D. Department of Neurology University of California, Davis

What is it? Why should anyone care?

IRT Basics

Item Response Theory - What Is It: A modern approach to psychometric test development; a mathematical measurement theory with associated numeric and computational methods; widely used in large-scale educational, achievement, and aptitude testing; more than 50 years of conceptual and methodological development.

Item Response Theory - Methods: The dataset is a rectangular table in which rows correspond to examinees and columns correspond to items. IRT applications simultaneously estimate examinee ability and item parameters using iterative maximum likelihood estimation algorithms. These are processor intensive, which is no longer a problem.

Basic Data Structure
Subject   Item1   Item2   Item3   Item4
S1        X11     X12     X13

Item Types: dichotomous; multiple choice; polytomous. Information is greater for a polytomous item than for the same item dichotomized at a cutpoint.

What Is the Item-Level Response? It can be the smallest discrete unit (e.g., Object Naming) or the sum of correct responses (e.g., trials in a word list learning test). For practical reasons, continuous measures might have to be recoded into ordinal scales with reduced response categories (e.g., 10 or 15).

Item Response Theory - Basic Results: Item parameters: difficulty; discrimination; correction for guessing (most applicable for multiple choice items). Subject ability (in the psychometric sense): the capacity to successfully respond to test items (or the propensity to respond in a certain direction); the net result of all genetic and environmental influences; measured by scales composed of homogeneous items. Item difficulty and subject ability are on the same scale.

Item Characteristic Curves

Item Response Theory - Outcomes, Item-Level Results: The Item Characteristic Curve (ICC) is a non-linear function relating ability to the probability of a correct response to the item. The Item Information Curve (IIC) is a non-linear function showing precision of measurement (reliability) at different ability points. Both curves are defined by the item parameters.
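As an illustration of how both item-level curves follow from the item parameters, here is a minimal sketch that assumes the two-parameter logistic (2PL) form. The function names and the example item (discrimination 1.5, difficulty 0.5) are hypothetical and are not taken from the slides.

```python
import math

def icc_2pl(theta, a, b):
    """Item characteristic curve: probability of a correct response (2PL)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Item information curve for a 2PL item: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item: discrimination a = 1.5, difficulty b = 0.5.
for theta in (-2.0, 0.5, 3.0):
    p = icc_2pl(theta, 1.5, 0.5)
    info = item_information_2pl(theta, 1.5, 0.5)
    print(f"theta={theta:+.1f}  P(correct)={p:.3f}  information={info:.3f}")
```

Under this form the probability of a correct response is 0.5 when ability equals the item difficulty, and that is also where the item is most informative.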

Item Characteristic Curves

Information Curves

Item Response Theory - Outcomes, Test-Level Results: The Test Characteristic Curve (TCC) is a non-linear function relating ability to the expected total test score. The Test Information Curve (TIC) is a non-linear function showing precision of measurement (reliability) at different ability points. Both are sums of the item-level functions of the included items.
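The "sum of item-level functions" idea can be sketched as follows, again assuming dichotomous 2PL items. The three-item test and the names `tcc` and `tic` are made up for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected total score = sum of item ICCs."""
    return sum(p_correct(theta, a, b) for a, b in items)

def tic(theta, items):
    """Test information curve: sum of the item information functions."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total

# Hypothetical test: (discrimination, difficulty) pairs.
ITEMS = [(1.0, -1.0), (1.5, 0.0), (2.0, 1.0)]

for theta in (-2.0, 0.0, 2.0):
    print(f"theta={theta:+.1f}  expected score={tcc(theta, ITEMS):.2f}  "
          f"information={tic(theta, ITEMS):.2f}")
```

Because both curves are plain sums, adding or dropping an item changes the test-level curves by exactly that item's contribution.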

Test Characteristic Curve Mini-Mental State Examination

Information Curves

Item Response Theory - Fundamental Assumptions: Unidimensionality: items measure a homogeneous, single domain. Local independence: covariance among items is determined only by the latent dimension measured by the item set.

IRT Models: 1PL (Rasch), 2PL, and 3PL. 1PL (Rasch): only Difficulty and Ability are estimated; Discrimination is assumed to be equal across items. 2PL: Discrimination, Difficulty, and Ability are estimated; Guessing is assumed to not have an effect. 3PL: Discrimination, Difficulty, Guessing, and Ability are estimated (multiple choice items).
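One way to see how the three models nest is to write the 3PL response function and obtain the other two by fixing parameters. This is a sketch using the standard logistic forms. The function names are hypothetical, and fixing the common 1PL discrimination at 1.0 is an assumption made here for simplicity.

```python
import math

def icc_3pl(theta, a, b, c):
    """3PL: discrimination a, difficulty b, guessing (lower asymptote) c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def icc_2pl(theta, a, b):
    """2PL: the 3PL with guessing fixed at zero."""
    return icc_3pl(theta, a, b, 0.0)

def icc_1pl(theta, b, common_a=1.0):
    """1PL (Rasch): the 2PL with one discrimination shared by all items.
    The default of 1.0 is an illustrative assumption."""
    return icc_2pl(theta, common_a, b)

# Hypothetical multiple choice item with a guessing parameter of 0.2.
print(icc_3pl(0.0, 1.2, 0.0, 0.2))   # at theta == b: c + (1 - c) / 2
print(icc_2pl(0.0, 1.2, 0.0))        # at theta == b with no guessing
print(icc_1pl(-1.0, 0.0))            # below-difficulty ability under 1PL
```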

Item Response Theory - Invariance Properties: Invariance requires that the basic assumptions are met. Item parameters are invariant across different samples within the range of overlap of their distributions; the distributions of the samples can differ. Ability estimates are invariant across different item sets, assuming that the ability range of the items spans the ability range of subjects that is of interest.

Why Do We Care - Applications of IRT in Health Care Settings: refined scoring of tests; characterization of psychometric properties of existing tests; construction of new tests.

Test Scoring: IRT permits refined scoring that allows for differential weighting of items based on their item parameters.

Physical Function Scale (Hays, Morales & Reise, 2000)
Item                                                                    Limited a lot   Limited a little   Not limited at all
Vigorous activities, running, lifting heavy objects, strenuous sports   1               2                  3
Climbing one flight                                                     1               2                  3
Walking more than 1 mile                                                1               2                  3
Walking one block                                                       1               2                  3
Bathing / dressing self                                                 1               2                  3
Preparing meals / doing laundry                                         1               2                  3
Shopping                                                                1               2                  3
Getting around inside home                                              1               2                  3
Feeding self                                                            1               2                  3

How to Score the Test: The simple approach: there are numbers that will be circled; total these up, and we have a score. But should “limited a lot” for walking a mile receive the same weight as “limited a lot” in getting around inside the home? Should “limited a lot” for walking one block be twice as bad as “limited a little” for walking one block?

How IRT Can Help: IRT provides us with a data-driven means of rational scoring for such measures. Items that are more discriminating are given greater weight. In practice, the simple sum score is often very good; improvement is at the margins.
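To make the weighting idea concrete, the sketch below scores two response patterns that have the same sum score but differ in which items were passed. It simplifies to dichotomous 2PL items and a bounded grid-search maximum likelihood estimate. The item parameters, the grid bounds, and the function names are all illustrative assumptions, not the scoring used by Hays, Morales & Reise.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a positive response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for x, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p if x == 1 else 1.0 - p)
    return total

def ml_theta(responses, items, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum likelihood ability estimate on [lo, hi]."""
    n = int(round((hi - lo) / step))
    grid = (lo + i * step for i in range(n + 1))
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical items: equal difficulty, increasing discrimination.
ITEMS = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0), (2.5, 0.0)]
pattern_a = [1, 1, 0, 0]   # passes the two least discriminating items
pattern_b = [0, 0, 1, 1]   # passes the two most discriminating items

print("sum scores:", sum(pattern_a), sum(pattern_b))
print("theta estimates:", ml_theta(pattern_a, ITEMS), ml_theta(pattern_b, ITEMS))
```

Both patterns have a sum score of 2, yet the second yields the higher ability estimate because the items it passes carry more weight.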

Description of Psychometric Properties: The Test Information Curve (TIC) shows reliability that varies continuously by ability and depicts the ability levels associated with high and low reliability. The standard error of measurement is directly related to the information value I(θ): SEM(θ) = 1 / sqrt(I(θ)). SEM(θ) and I(θ) also have a direct correspondence to traditional reliability r: r(θ) = 1 - 1 / I(θ).
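The two conversions on this slide are simple enough to check numerically. The sketch below applies them to a few information values; the function names are hypothetical. The reliability formula is taken as given on the slide, with ability expressed in s.d. units.

```python
import math

def sem_from_information(info):
    """Standard error of measurement: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(info)

def reliability_from_information(info):
    """Reliability at a given ability level: 1 - 1 / information."""
    return 1.0 - 1.0 / info

for info in (1, 2, 4, 9, 12, 16, 25, 36):
    print(f"I={info:>2}  SEM={sem_from_information(info):.2f}  "
          f"r={reliability_from_information(info):.2f}")
```

For example, an information value of 4 corresponds to a standard error of 0.50 s.d. units and a reliability of 0.75.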

I(θ), SEM, r
I(θ)   SEM (s.d. units)   r
1      1.00               0.00
2      0.71               0.50
4      0.50               0.75
9      0.33               0.89
12     0.29               0.92
16     0.25               0.94
25     0.20               0.96
36     0.17               0.97

TICs for English and Spanish Language Versions of Two Scales (Mungas et al., 2004)

Construction of New Scales: Items can be selected to create scales with desired measurement properties. This can be used for prospective test development or to create new scales from existing tests/item pools. IRT will not overcome inadequate items.

TICs from an Existing Global Cognition Scale and Re-Calibrated Existing Cognitive Tests (Mungas et al., 2003)

Principles of Scale Construction: Information should correspond to assessment goals. A broad and flat TIC suits a longitudinal change measure in a population with heterogeneous ability. For a selection or diagnostic test, the TIC should peak at the point of the ability continuum where discrimination is most important. But normal cognition spans a 4.0 s.d. range, and the range is even greater in demographically diverse populations.

Other Issues in IRT: Polytomous IRT models are available and are useful for ordinal (Likert) rating scales. Each possible score of the item (minus 1) is treated like a separate item with a different difficulty parameter. Information is greater for a polytomous item than for the same item dichotomized at a cutpoint.
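The slide does not name a specific polytomous model. As one possible illustration, the sketch below uses the graded response model, in which an item with m ordered categories has m - 1 threshold (difficulty) parameters, each behaving like a dichotomous "at or above this category" item. The thresholds and discrimination are made-up values.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under a graded response model (assumed here).

    thresholds: increasing difficulties, one per boundary between
    adjacent categories (number of categories minus 1).
    """
    if any(t2 <= t1 for t1, t2 in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative probabilities of responding at or above each category.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Probability of each category is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical 3-category item (e.g., a lot / a little / not at all).
for theta in (-2.0, 0.0, 2.0):
    probs = grm_category_probs(theta, 1.5, [-1.0, 1.0])
    print(theta, [round(p, 3) for p in probs])
```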

Other Issues in IRT: IRT is applicable to a broad range of content domains. It certainly applies to cognitive abilities, and it also applies to other health outcomes: quality of life, physical function, fatigue, depression, pain.

Other Issues in IRT: Differential Item Functioning - Test Bias. IRT provides explicit methods to evaluate and quantify the extent to which items and tests have different measurement properties in different groups, e.g., racial and ethnic groups, linguistic groups, gender.

English and Spanish Item Characteristic Curves for “Lamb/Cordero” Item

English and Spanish Item Characteristic Curves for “Stone/Piedra” Item

Differential Item Functioning (DIF): DIF refers to systematic bias in measuring “true” ability; it does not address group differences in ability.

Challenges/Limitations of IRT: Large samples are required for stable estimation: 150-200 for 1PL, 400-500 for 2PL, 600-1000 for 3PL. Analytic methods are labor intensive. There are a number of (expensive*) applications readily available for IRT analyses. Evaluation of basic assumptions, identification of the appropriate model, and systematic IRT analysis require considerable expertise and labor. (* but, R!!)

Computerized Adaptive Testing (CAT): An IRT-based, computer-driven method that selects items that most closely match the examinee’s ability and administers only the items needed to achieve a pre-specified level of precision in measurement (information, s.e.m., reliability).
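The select, administer, re-estimate, and stop cycle described above can be sketched as a small loop. Everything in it is an illustrative assumption: 2PL items, maximum-information item selection, a bounded grid-search ability estimate, a target s.e.m. of 0.75, and a simulated examinee who answers deterministically. Operational CAT systems make more refined choices at each step.

```python
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def ml_theta(responses, items, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search ML ability estimate, bounded to [lo, hi]."""
    def loglik(t):
        total = 0.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(t, a, b)
            total += math.log(p if x == 1 else 1.0 - p)
        return total
    n = int(round((hi - lo) / step))
    return max((lo + i * step for i in range(n + 1)), key=loglik)

def run_cat(bank, answer, target_sem=0.75, start_theta=0.0):
    """Administer items until the s.e.m. target is met or the bank runs out."""
    theta_hat, sem = start_theta, float("inf")
    used, responses = [], []
    while len(used) < len(bank) and sem > target_sem:
        remaining = [i for i in range(len(bank)) if i not in used]
        # Pick the unused item that is most informative at the current estimate.
        nxt = max(remaining, key=lambda i: information(theta_hat, *bank[i]))
        used.append(nxt)
        responses.append(answer(bank[nxt]))
        items = [bank[i] for i in used]
        theta_hat = ml_theta(responses, items)
        total_info = sum(information(theta_hat, a, b) for a, b in items)
        sem = 1.0 / math.sqrt(total_info)
    return theta_hat, sem, used

# Hypothetical bank: 13 items, difficulties from -3.0 to 3.0 in steps of 0.5.
BANK = [(1.5, -3.0 + 0.5 * k) for k in range(13)]
TRUE_THETA = 0.8

def simulated_answer(item):
    """Deterministic stand-in for an examinee: correct if P(correct) > 0.5."""
    a, b = item
    return 1 if p_correct(TRUE_THETA, a, b) > 0.5 else 0

theta_hat, sem, used = run_cat(BANK, simulated_answer)
print(f"items used: {len(used)} of {len(BANK)}, "
      f"theta estimate: {theta_hat:.2f}, s.e.m.: {sem:.2f}")
```

The stopping rule is the part that delivers the efficiency gain: testing ends as soon as the precision target is met, so examinees typically see only a subset of the bank.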

Why CAT: Efficiency (administration, scoring); standardization; time efficiency; data collection; scoring (the computer can implement complex scoring algorithms).

CAT Example 1

CAT Example 2

Practical Considerations for CAT

What You Need for CAT: Computer technology for item selection, item administration, and scale scoring; an item bank with IRT parameters; a range of item difficulty relevant to measurement needs.

What Is Straightforward/Easy? Dichotomous items; multiple choice items; ordered polytomous response scales with up to 10-15 response options.

Technical Challenges: Continuous response scales (memory, timed tasks) can be recoded into a smaller number of ordered response ranges, but this loses information.
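Recoding a continuous score into ordered response ranges amounts to a lookup against a list of cut points. The sketch below uses made-up cut points for a hypothetical timed task; how to choose them is a separate question that the slides do not address.

```python
from bisect import bisect_right

def recode(score, cutpoints):
    """Map a continuous score to an ordinal category 0..len(cutpoints).

    A score equal to a cut point falls into the higher category.
    """
    return bisect_right(cutpoints, score)

# Hypothetical cut points (seconds) defining four ordered categories.
CUTPOINTS = [10, 20, 30]

for seconds in (5, 10, 25, 30, 45):
    print(seconds, "->", recode(seconds, CUTPOINTS))
```

Scores of 30 and 45 land in the same category, which is the loss of information the slide refers to.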

Methodological Challenges: Sample size requirements: minimally 300-600 cases for stable estimation of item parameters. Differential Item Functioning and measurement bias: evaluation essentially involves item calibration within groups of interest (e.g., age, education, language, gender, race), and the available literature provides minimal guidance.

References
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Med Care, 38(9 Suppl), II28-42.
Mungas, D., Reed, B. R., & Kramer, J. H. (2003). Psychometrically matched measures of global cognition, memory, and executive function for assessment of cognitive decline in older persons. Neuropsychology, 17(3), 380-392.
Mungas, D., Reed, B. R., Crane, P. K., Haan, M. N., & González, H. (2004). Spanish and English Neuropsychological Assessment Scales (SENAS): Further development and psychometric characteristics. Psychological Assessment, 16(4), 347-359.