1 Choosing Reliable Items: An objective multi-model approach to applied psychometrics Warren Lambert Peabody College & Vanderbilt Kennedy Center Measurement.

Presentation transcript:

1 Choosing Reliable Items: An objective multi-model approach to applied psychometrics Warren Lambert Peabody College & Vanderbilt Kennedy Center Measurement 2.0 Conference, Salt Lake City, February 2009

2 Len Bickman's Peabody Treatment Progress Battery. To give away a suite of tools for evaluating client progress in counseling, we had to develop 17 new tests. This required a formal, systematic approach.

3 Statistical Approach: Imperfect Complementary Models
NETFLIX winners: "We found it was important to utilize a variety of models that complement the shortcomings of each other. ... Lessons Learned ... the best predictive performance came from combining complementary models."
Source: "Chasing $1,000,000: How We Won the Netflix Progress Prize," Robert Bell, Yehuda Koren, and Chris Volinsky, AT&T Labs – Research. Volume 18, No. 2, Dec. 2007.

4 How to Identify Reliable Items
Classical test theory
  Enough for one-shot ad hoc indices
  Floors or ceilings restrict variance
  Look at a PCA
  To increase Cronbach's alpha, avoid low item-total correlations
  Guesstimate test length with the Spearman-Brown formula
Factor analysis (confirmatory if at all possible)
  See how well a 1-factor confirmatory model fits
  Factorial validity: does the factor structure fit theory?
Rasch (IRT) modeling
  Pick items that fit a carefully considered measurement model
  Consider item difficulties more deeply
  Pick items suited to the intended task
(The methods are ordered from informal to formal.)

5 Classical Test Theory (CTT)
Tools of CTT
  Basic description of items and their correlations
  Cronbach's alpha, internal-consistency reliability
  Corrected item-total correlations
  Principal components (PCA)
  Spearman-Brown test length estimation
CTT is good to do routinely with index scores
  OK for informal test development, e.g., a one-shot ad hoc index
  Insufficient for tests that will be published for wide use
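Two of the CTT tools listed above, Cronbach's alpha and corrected item-total correlations, can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names and the simulated data (five items sharing a common factor plus one pure-noise item) are assumptions made up for the example.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (persons x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def corrected_item_total(scores):
    """Correlate each item with the sum of the *other* items."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
        for j in range(scores.shape[1])
    ])

# Hypothetical data: 5 items driven by one common factor, 1 item of pure noise
rng = np.random.default_rng(0)
n = 500
g = rng.normal(size=n)
good = g[:, None] + rng.normal(size=(n, 5))
noise = rng.normal(size=(n, 1))
scores = np.hstack([good, noise])

alpha_all = cronbach_alpha(scores)          # all 6 items
alpha_good = cronbach_alpha(scores[:, :5])  # noise item dropped
r_it = corrected_item_total(scores)         # last entry belongs to the noise item
```

In this simulation the noise item has a corrected item-total correlation near zero, and dropping it raises alpha, which is the pattern the slide's advice to "avoid low item-total correlations" relies on.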

6 Note Floors or Ceilings: The Too Short IQ Test (TS-IQ). Low mean, SD, and variance all indicate floors or ceilings, but outrageous kurtosis is easy to see.
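A quick way to screen for floors and ceilings is to tabulate each item's mean, SD, and excess kurtosis. The sketch below assumes hypothetical right/wrong items and an arbitrary flagging threshold of 10 for absolute excess kurtosis; neither comes from the slides.

```python
import numpy as np

def item_descriptives(scores):
    """Return per-item mean, SD, and excess kurtosis (normal = 0)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    sd = scores.std(axis=0, ddof=1)
    centered = scores - mean
    m2 = (centered ** 2).mean(axis=0)   # second central moment
    m4 = (centered ** 4).mean(axis=0)   # fourth central moment
    kurt = m4 / m2 ** 2 - 3.0
    return mean, sd, kurt

# Hypothetical items: one answered correctly by about half the sample,
# one that almost nobody gets right (a floor effect)
rng = np.random.default_rng(0)
n = 1000
balanced = (rng.random(n) < 0.5).astype(float)
floored = (rng.random(n) < 0.02).astype(float)
scores = np.column_stack([balanced, floored])

mean, sd, kurt = item_descriptives(scores)
flags = np.abs(kurt) > 10   # assumed threshold, for illustration only
```

The floored item shows a low mean and small SD, but its kurtosis is an order of magnitude larger than the balanced item's, which is why kurtosis is the easiest column to scan.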

7 Acorn 10 item scale and 3 item index

8 Retain Flagged Estimates of Item Quality: Too Short IQ Test (TS-IQ)
[Table: Variable, Mean, and Kurtosis for the 10 items; the values were not preserved in this transcript.]

9 Raw Scree Plots for Comparison: Compare Result to Random Shadow
  Simple principal components (Pearson, 1901)
  10,000 PCAs on random numbers
  Same size data set
  Half a page of R code
  Visually distinguish chance effects
  Falls short of a confirmatory factor analysis
Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2),
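The "random shadow" comparison can be sketched as follows. The slide describes 10,000 PCAs in R; this Python version is an assumed equivalent, not the original code, and it defaults to 200 simulations to keep the example fast. The function names, the 95th-percentile cutoff, and the simulated one-factor data are all illustrative choices.

```python
import numpy as np

def pca_eigenvalues(data):
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def random_shadow(n_persons, n_items, n_sims=200, quantile=0.95, seed=0):
    """Eigenvalue quantiles from PCAs of random data of the same size."""
    rng = np.random.default_rng(seed)
    eigs = np.empty((n_sims, n_items))
    for s in range(n_sims):
        eigs[s] = pca_eigenvalues(rng.normal(size=(n_persons, n_items)))
    return np.quantile(eigs, quantile, axis=0)

# Hypothetical data set: 300 people, 10 items sharing one common factor
rng = np.random.default_rng(1)
n, k = 300, 10
g = rng.normal(size=(n, 1))
data = g + rng.normal(size=(n, k))

observed = pca_eigenvalues(data)           # the raw scree values
shadow = random_shadow(n, k)               # what chance alone produces
n_above = int(np.sum(observed > shadow))   # components exceeding chance
```

Plotting `observed` and `shadow` on the same axes gives the scree plot with its random shadow; components whose eigenvalues rise above the shadow are unlikely to be chance effects.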

10 Too Short IQ Spreadsheet
  Items with something in common contribute to a reliable total score
  Cronbach's alpha: internal-consistency reliability
  Reliability increases with high item-total correlations
  Reliability increases with test length
[Table: Item, Mean, Kurtosis, and r(Item-Total) for the 10 items; the values were not preserved in this transcript.]
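The link between reliability and test length is usually quantified with the Spearman-Brown prophecy formula. A minimal sketch, with hypothetical function names and a made-up starting reliability of 0.70:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

# Hypothetical 10-item test with reliability 0.70
doubled = spearman_brown(0.70, 2)            # about 0.82 for a 20-item test
factor = length_factor_needed(0.70, 0.90)    # about 3.9, i.e. roughly 39 items
```

The formula assumes the added items are comparable in quality to the existing ones, so it is a guesstimate rather than a guarantee.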

11 How to Identify Reliable Items 2
Classical test theory
  Enough for one-shot ad hoc indices
  Floors or ceilings restrict variance
  Look at a PCA
  To increase Cronbach's alpha, avoid low item-total correlations
  Guesstimate test length with the Spearman-Brown formula
Factor analysis (confirmatory if at all possible)
  See how well a 1-factor confirmatory model fits
  Factorial validity: does the factor structure fit theory?
Rasch (IRT) modeling
  Pick items that fit a carefully considered measurement model
  Consider item difficulties more deeply
  Pick items suited to the intended task
(The methods are ordered from informal to formal.)

12 Confirmatory Factor Analysis (CFA)
  See how well (ha! how badly!) the data fit a theory-driven model (factorial validity)
  Theory: TS-IQ measures g, a single dimension of intelligence
  Evaluate the fit of a single-factor measurement model
  CFA is popular in psychology but seldom done in non-psychiatric medicine (exception: quality-of-life indices have extensive psychometric analysis using all current methods)

13 Too Short IQ: SAS CFA of single-factor measurement model
  RMSEA [text lost in extraction] 0.95 or 0.96 (high standards of model fit)
  So far, most VU tests early in development fail to meet the high standards for measurement model fit.
  SAS PROC CALIS: old-fashioned but (more or less) usable
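For readers who want to see where fit indices come from, the sketch below computes RMSEA and the comparative fit index (CFI) from chi-square statistics using their standard textbook definitions. It is not output from SAS PROC CALIS; the function names and the numbers are hypothetical, and some programs use N rather than N - 1 in the RMSEA denominator.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention assumed)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_baseline - df_baseline, d_model, 0.0)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Hypothetical single-factor model: chi-square 70 on 35 df, N = 501,
# with a baseline model chi-square of 1045 on 45 df
fit_rmsea = rmsea(70.0, 35, 501)        # sqrt(35 / (35 * 500))
fit_cfi = cfi(70.0, 35, 1045.0, 45)     # 1 - 35 / 1000
```

Lower RMSEA and higher CFI both indicate better fit, so the two indices are read in opposite directions.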

14 Rasch or IRT Model
IRT: Item Response Theory
Rasch: one-parameter logistic IRT model
  Good for practical test development (converges)
Multi-parameter Item Response Theory (IRT)
  2-3 parameter models (discrimination, guessing)
  For measurement research
Software, e.g., R, MPLUS, Parscale, Bilog-MG, user-written procs
P = probability of getting item i right; theta = person's ability; b = item's difficulty on the same scale

15 Rasch Model
  Measure score for person and item in the same units
  If your measure = the item's measure, p(right) = 50%
  If you're better than the item, p(right) > 50%
  1-parameter logistic model (1PLM)
  As (Person - Item) increases, prob(correct) increases in the logistic model.
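The behavior described above follows directly from the one-parameter logistic function, in which the probability of a correct answer depends only on the difference between person ability (theta) and item difficulty (b). A minimal sketch, with a hypothetical function name:

```python
import math

def rasch_probability(theta, b):
    """P(correct) under the Rasch (1PL) model for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

p_equal = rasch_probability(0.0, 0.0)    # person measure equals item measure
p_better = rasch_probability(1.0, 0.0)   # person is 1 logit above the item
p_worse = rasch_probability(-1.0, 0.0)   # person is 1 logit below the item
```

When the two measures are equal the probability is exactly 50%, and it rises smoothly toward 1 as the person's advantage over the item grows.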

16 Rasch (1960/1980) model
  Simple 1PLM; can use conventional total score or table lookup
  Parallel logistic curves for items
  Good for practical test construction (WINSTEPS)
  Software in development > 20 years
  IRT 2PLM, 3PLM may be better for certain kinds of measurement research
Rasch, G. (1960/1980). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press.
Statistician, Danish student of R. A. Fisher
Rasch Model: TS-IQ Items Cover a Range of Difficulties

17 Too Short IQ Items: Information Spread Across Whole Range. Easy items, like #10, are most informative about low-scoring individuals. Hard items, like #1, are most informative about high-scoring individuals. This test's items are spread out to describe the whole range of IQs.
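Under the Rasch model, an item's information at ability theta is p(1 - p), which peaks where theta equals the item's difficulty. The sketch below uses three hypothetical difficulties (easy, medium, hard), not the actual TS-IQ item estimates, to show how spreading items spreads information.

```python
import numpy as np

def rasch_item_information(theta, b):
    """Fisher information of a Rasch item: p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

thetas = np.linspace(-4, 4, 81)              # ability grid in logits
difficulties = np.array([-2.0, 0.0, 2.0])    # hypothetical easy, medium, hard items

# Rows are ability levels, columns are items
info = rasch_item_information(thetas[:, None], difficulties[None, :])
test_info = info.sum(axis=1)                 # information of the whole test
peaks = thetas[info.argmax(axis=0)]          # where each item is most informative
```

Each item is most informative at its own difficulty, so the easy item contributes most for low scorers and the hard item for high scorers, while `test_info` shows the coverage of the whole set.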

18 IRT: Compare Items with People. Clinically Targeted Test (VUMC Greco)
  Items gray, people black
  School sample
  High is bad (sicker)
  Clinical screens focus on sick people
  Classify: treat yes-no
  The job is to be maximally informative at the cutpoint
  This test invests its items in the severe range
Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2),

19 Left: distribution of children (each # = 3)
  Right: distribution of items
  Centerline: measure score, theta for people and difficulty for items
  Self-harm item: a severe outlier
  9 items concentrated in the low-average range
  Are they concentrated near the clinical-normal threshold?
Acorn 10 item scale

20 Putting It All Together: Too Short IQ's Items and Total

21 Putting it all together (Walker's CSI)
  Multiple criteria converge => firm conclusion without definitive cutoffs or perfect models
  Items scored 0-4
  Items 1-35
Walker, Lynn S., Beck, Joy E., Garber, Judy, & Lambert, Warren. The Children's Somatization Inventory: Psychometric Properties of the Revised Form (CSI-24) and Evidence for a Continuum of Symptom Reporting in Youth. In press, J. Pediatric Psychology.

22 Bold items: some concern
  Self-harm, having a low mean, shows some roughness (no fatal flaws).
  Infit/outfit flags are borderline. Good is now [value missing], used to be [value missing].
  A&D items are near the floor, but still seem to work.
Acorn 10 item scale and 3 item index

23 The 10 item scale has excellent overall stats
  It even fits a one-factor model with fit indices good enough for Psych Assessment purists.
The 3 item scale has some problems as a reliable psychological test
  May be too short to act as a scale with a reliable sum score
  A set of 3 warning flags?
Acorn 10 item scale and 3 item index
