1 Choosing Reliable Items: An objective multi-model approach to applied psychometrics Warren Lambert Peabody College & Vanderbilt Kennedy Center Measurement 2.0 Conference, Salt Lake City, February 2009
2 Len Bickman's Peabody Treatment Progress Battery To give away a suite of tools for evaluating client progress in counseling, we had to develop 17 new tests. This required a formal, systematic approach.
3 Statistical Approach: Imperfect Complementary Models
The Netflix Prize winners: "We found it was important to utilize a variety of models that complement the shortcomings of each other.... Lessons Learned... the best predictive performance came from combining complementary models."
Chasing $1,000,000: How We Won the Netflix Progress Prize. Robert Bell, Yehuda Koren, and Chris Volinsky, AT&T Labs - Research. Volume 18, No. 2, Dec. 2007.
4 How to Identify Reliable Items
Informal: Classical test theory
- Enough for one-shot ad hoc indices
- Floors or ceilings restrict variance
- Look at a PCA
- To increase Cronbach's alpha, avoid low item-total correlations
- Guesstimate test length with the Spearman-Brown formula
Formal: Factor analysis (confirmatory if at all possible)
- See how well a 1-factor confirmatory model fits
- Factorial validity: does the factor structure fit theory?
Formal: Rasch (IRT) modeling
- Pick items that fit a carefully considered measurement model
- Consider item difficulties more deeply
- Pick items suited to the intended task
5 Classical Test Theory (CTT)
Tools of CTT:
- Basic description of items and their correlations
- Cronbach's alpha, internal-consistency reliability
- Corrected item-total correlations
- Principal components (PCA)
- Spearman-Brown test-length estimation
CTT is good to do routinely with index scores and is OK for informal test development (e.g., a one-shot ad hoc index), but it is insufficient for tests that will be published for wide use.
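The CTT tools listed above are easy to compute directly. A minimal Python/NumPy sketch (the function names and toy data are illustrative, not the PTPB's actual code):

```python
import numpy as np

def cronbach_alpha(x):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of total)."""
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def corrected_item_total(x):
    """Correlation of each item with the total of the *other* items."""
    total = x.sum(axis=1)
    return np.array([np.corrcoef(x[:, j], total - x[:, j])[0, 1]
                     for j in range(x.shape[1])])

def spearman_brown(alpha, n):
    """Projected reliability if the test were lengthened by a factor of n."""
    return n * alpha / (1 + (n - 1) * alpha)

# Toy persons-by-items data, invented for illustration
x = np.array([[1, 2, 2], [2, 3, 3], [3, 4, 3], [4, 5, 5], [5, 5, 4]], dtype=float)
alpha = cronbach_alpha(x)
```

Spearman-Brown then answers the "how long should the test be?" question: `spearman_brown(alpha, 2)` projects the reliability of a test twice as long.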
6 Note Floors or Ceilings The Too Short IQ Test (TS-IQ) A low mean, SD, or variance can all indicate floors or ceilings, but outrageous kurtosis is the easiest to see.
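A floor or ceiling shows up as a pile-up at one end of the score range: low variance plus extreme kurtosis. One way to screen for it, assuming a persons-by-items score matrix (the function and the category codes are illustrative):

```python
import numpy as np

def item_screen(x, floor=0, ceiling=4):
    """Per-item mean, SD, excess kurtosis, and the share of responses
    sitting exactly at the floor or ceiling category."""
    out = []
    for j in range(x.shape[1]):
        col = x[:, j]
        m, s = col.mean(), col.std(ddof=1)
        z = (col - m) / s
        kurt = (z ** 4).mean() - 3.0  # excess kurtosis; huge values flag pile-ups
        out.append({"mean": m, "sd": s, "kurtosis": kurt,
                    "pct_floor": (col == floor).mean(),
                    "pct_ceiling": (col == ceiling).mean()})
    return out
```

An item with 90% of responses at the floor will show a much larger kurtosis than an item with responses spread across the range, which is why "outrageous kurtosis is the easiest to see."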
7 Acorn 10 item scale and 3 item index
8 Retain Flagged Estimates of Item Quality Too Short IQ Test (TS-IQ)
[Table: per-item mean and kurtosis for Items 1-10]
9 Raw Scree Plots for Comparison: Compare Result to Random Shadow
- Simple principal components (Pearson, 1901)
- 10,000 PCAs on random numbers, same-size data set
- Half a page of R code
- Visually distinguishes chance effects
- Falls short of a confirmatory factor analysis
Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2).
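The "random shadow" comparison described on this slide is Horn-style parallel analysis: run PCAs on many random data sets of the same size and keep only components whose real eigenvalues beat the random ones. The slide mentions half a page of R; the same idea in a Python sketch (function names and the percentile cutoff are my own choices):

```python
import numpy as np

def pca_eigenvalues(x):
    """Eigenvalues of the correlation matrix (variance of each principal component),
    sorted from largest to smallest."""
    return np.linalg.eigvalsh(np.corrcoef(x, rowvar=False))[::-1]

def parallel_analysis(x, n_sims=1000, pctile=95, seed=0):
    """Compare real eigenvalues with those from same-sized random-number data sets.
    Components whose eigenvalue beats the random 95th percentile are retained."""
    rng = np.random.default_rng(seed)
    n, k = x.shape
    sims = np.array([pca_eigenvalues(rng.standard_normal((n, k)))
                     for _ in range(n_sims)])
    threshold = np.percentile(sims, pctile, axis=0)
    real = pca_eigenvalues(x)
    return real, threshold, int((real > threshold).sum())
```

Plotting `real` and `threshold` on the same axes gives exactly the scree-plus-shadow picture the slide describes: chance components hug the random line, real structure rises above it.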
10 Too Short IQ Spreadsheet
- Items with something in common contribute to a reliable total score
- Cronbach's alpha: internal-consistency reliability
- Reliability increases with high item-total correlations
- Reliability increases with test length
[Table: per-item mean, kurtosis, and corrected item-total correlation for Items 1-10]
11 How to Identify Reliable Items 2
Informal: Classical test theory
- Enough for one-shot ad hoc indices
- Floors or ceilings restrict variance
- Look at a PCA
- To increase Cronbach's alpha, avoid low item-total correlations
- Guesstimate test length with the Spearman-Brown formula
Formal: Factor analysis (confirmatory if at all possible)
- See how well a 1-factor confirmatory model fits
- Factorial validity: does the factor structure fit theory?
Formal: Rasch (IRT) modeling
- Pick items that fit a carefully considered measurement model
- Consider item difficulties more deeply
- Pick items suited to the intended task
12 Confirmatory Factor Analysis (CFA)
- See how well (ha! how badly!) the data fit a theory-driven model (factorial validity)
- Theory: TS-IQ measures g, a single dimension of intelligence
- Evaluate the fit of a single-factor measurement model
- CFA, popular in psychology, is seldom done in non-psychiatric medicine (exception: quality-of-life indices have extensive psychometric analysis using all current methods)
13 Too Short IQ SAS CFA of a single-factor measurement model. Fit is judged against high standards of model fit (e.g., comparative fit indices of 0.95 or 0.96, plus RMSEA). So far, most VU tests early in development fail to meet these high standards for measurement-model fit. SAS PROC CALIS: old-fashioned but (more or less) usable.
14 Rasch or IRT Model
- IRT: Item Response Theory
- Rasch: one-parameter logistic IRT model; good for practical test development (converges)
- Multi-parameter IRT: 2-3 parameter models (discrimination, guessing); for measurement research
- Software: e.g., R, Mplus, Parscale, Bilog-MG, user-written procs
Notation: P = probability of getting item i right; theta = the person's ability; b = the item's difficulty, on the same scale.
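In standard notation, the one-parameter logistic (Rasch) model the slide describes is:

```latex
P(X_i = 1 \mid \theta, b_i) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}
```

where \(\theta\) is the person's ability and \(b_i\) is item \(i\)'s difficulty, both expressed on the same logit scale.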
15 Rasch Model
- Measure scores for person and item are in the same units
- If your measure equals the item's measure, p(right) = 50%
- If you're better than the item, p(right) > 50%
- One-parameter logistic model (1PLM): as (Person - Item) increases, prob(correct) increases along a logistic curve
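The three claims above follow directly from the logistic form; a small sketch (the helper name is mine, not from the slides):

```python
import math

def rasch_p(theta, b):
    """Rasch / 1PL probability of a correct response:
    exactly 0.5 when theta == b, rising toward 1 as theta - b grows."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

For example, a person one logit above an item's difficulty answers correctly about 73% of the time, and a person one logit below, about 27%.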
16 Rasch (1960/1980) model
- Simple 1PLM; can use a conventional total score or table lookup
- Parallel logistic curves for items
- Good for practical test construction (WINSTEPS, software in development for more than 20 years)
- IRT 2PLM and 3PLM may be better for certain kinds of measurement research
Rasch, G. (1960/1980). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press. (Rasch was a statistician and a Danish student of R. A. Fisher.)
Rasch Model: TS-IQ items cover a range of difficulties.
17 Too Short IQ: Items' Information Spread Across the Whole Range
- Easy items, like #10, are most informative about low-scoring individuals
- Hard items, like #1, are most informative about high-scoring individuals
- This test's items spread out to describe the whole range of IQs
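The pattern described here is the Rasch item-information function, p(1 - p), which peaks exactly where the person's theta equals the item's difficulty, which is why easy items tell you most about low scorers and hard items about high scorers. An illustrative sketch:

```python
import math

def rasch_information(theta, b):
    """Fisher information of a Rasch item, p * (1 - p).
    Maximal (0.25) where theta == b; falls off as |theta - b| grows."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)
```

Summing this function over all items at each theta gives the test information curve: a wide-range test like the TS-IQ spreads its item difficulties so the sum stays high across the whole scale.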
18 IRT: Compare Items with People Clinically Targeted Test (VUMC Greco)
- Items gray, people black; school sample; high is bad (sicker)
- Clinical screens focus on sick people; classify: treat, yes or no
- The job is to be maximally informative at the cutpoint
- This test invests its items in the severe range
Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2).
19 Acorn 10-item scale
- Left: distribution of children (each # = 3)
- Right: distribution of items
- Centerline: measure score, theta for people and difficulty for items
- The self-harm item is a severe outlier
- 9 items are concentrated in the low-average range; are they concentrated near the clinical-normal threshold?
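A text-mode item-person map like the one described, with one '#' per several people on the left, items on the right, and a shared logit centerline, can be sketched as follows (the function and its formatting are invented for illustration):

```python
def wright_map(thetas, difficulties, lo=-3.0, hi=3.0, step=1.0, per_mark=3):
    """Item-person ("Wright") map: person counts on the left (one '#'
    per `per_mark` people), item labels on the right, one logit scale."""
    edges = []
    x = lo
    while x < hi:
        edges.append((x, x + step))
        x += step
    lines = []
    for a, b in reversed(edges):  # highest logit bin printed first
        n_people = sum(a <= t < b for t in thetas)
        items = [i for i, d in enumerate(difficulties, 1) if a <= d < b]
        left = "#" * (n_people // per_mark)
        right = " ".join(f"I{i}" for i in items)
        lines.append(f"{left:>10} |{a:5.1f} | {right}")
    return "\n".join(lines)
```

Rows where the '#' marks pile up but no items appear, or vice versa, are exactly the mismatches these slides are looking for.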
20 Putting It All Together: Too Short IQ's Items and Total
21 Putting it all together (Walker's CSI)
- Multiple criteria converge => a firm conclusion without definitive cutoffs or perfect models
- Items scored 0-4; items 1-35
Walker, Lynn S., Beck, Joy E., Garber, Judy, & Lambert, Warren. The Children's Somatization Inventory: Psychometric properties of the revised form (CSI-24) and evidence for a continuum of symptom reporting in youth. In press, Journal of Pediatric Psychology.
22 Bold items: some concern
- Self-harm, having a low mean, shows some roughness (no fatal flaws)
- Infit/outfit flags are borderline; the accepted "good" range has widened over time
- A&D items are near the floor, but still seem to work
Acorn 10-item scale and 3-item index
23 Acorn 10-item scale and 3-item index
- The 10-item scale has excellent overall stats; it even fits a one-factor model with fit indices good enough for Psychological Assessment purists
- The 3-item scale has some problems as a reliable psychological test; it may be too short to act as a scale with a reliable sum score. A set of 3 warning flags?