Choose the best items: A basic psychometric toolkit for testmakers Warren Lambert Vanderbilt Kennedy Center February 2007
Examples of Recent Test Development by KC Investigators
Peabody:
- Two different tests of school-based reading ability
- A test of school-based math skills
- A battery of 10 new tests designed to track ongoing mental health treatment of children
- Very early signs of autism spectrum in infants
VUMC:
- Psychological rigidity in children with behavior/emotional problems
- Somatizing in children with recurrent abdominal pain
- Survey of attending MD satisfaction with a hospital department
What Is a “Test”?
- Could be a questionnaire
- A set of items in a structured interview
- Signs and symptoms of something
- Often a “fuzzy” construct with numerous imperfect indicators, e.g. the Beck Depression Inventory or CBCL
- Tests gain reliability by combining imperfect items into a total score
- For today: a test is a set of items that produces a total score
How to Identify the Best Items
A toolkit, not an analytic plan: the job is really to flag the weaker items to drop or revise, using relative, not absolute, criteria.
Classical test theory:
- Floors or ceilings restrict variance
- To increase Cronbach’s alpha, avoid low item-total correlations
- Guesstimate test length with the Spearman-Brown formula
Factor analysis (exploratory and confirmatory):
- Are there items that don’t fit the construct?
- Avoid items that do not load on the main factor
- See how well a confirmatory model fits
Rasch modeling:
- Pick items that fit a carefully considered measurement model
- Consider item difficulties more deeply
- Pick items that suit the intended task
Psychometrics vs. Statistics
- Statistics: look for a statistical model that fits your data.
- Psychometric test construction: look for data that fit your statistical model.
- Choose sound measurement models and pick items that fit the model.
Unresolved Issues for Discussion
- Role of confirmatory factor analysis?
- Other approaches?
Classical Test Theory (CTT)
- Basic description of items
- Can be done with SAS, SPSS, etc.
- Do this routinely with scales, old or new
Note Floors or Ceilings: The “Too Short” IQ Test (TS-IQ)
- Mean, SD, and variance all indicate floors or ceilings, but kurtosis is very easy to spot.
- The “Too Short” IQ Test data set, with SAS and SPSS code, is available for download: http://kc.vanderbilt.edu/quant/Seminar/schedule.htm
Hard, Medium, Easy Items: #1, #6, #10
- Measuring an entire population requires a range of item difficulties.
- Kurtosis: #1 ≈ 11, #6 ≈ −2, #10 ≈ 3
Use Excel Conditional Formatting to Flag Notable Values
Retain Flagged Estimates of Quality: “Too Short IQ Test” (TS-IQ)

Variable   Mean   Kurtosis
Item01     0.06    11.16
Item02     0.22    -0.11
Item03     0.35    -1.61
Item04     0.39    -1.82
Item05     0.45    -1.96
Item06     0.49    -2.01
Item07     0.54    -1.99
Item08     0.58    -1.91
Item09     0.77    -0.29
Item10     0.86     2.60
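The kurtosis flags in the table above can be checked from first principles: for a right/wrong item, the excess kurtosis follows directly from the proportion correct. A minimal Python sketch (the cutoff of 2.0 is an illustrative choice, not from the slides):

```python
def item_flags(p_correct, kurtosis_cutoff=2.0):
    """Excess kurtosis of a 0/1 item with proportion-correct p.

    For a Bernoulli item, excess kurtosis = (1 - 6pq) / (pq), q = 1 - p.
    Large positive kurtosis flags a floor (p near 0) or ceiling (p near 1).
    """
    flags = []
    for p in p_correct:
        q = 1 - p
        kurt = (1 - 6 * p * q) / (p * q)
        flags.append((p, round(kurt, 2), kurt > kurtosis_cutoff))
    return flags

# Item means of the ten TS-IQ items (from the table above)
means = [0.06, 0.22, 0.35, 0.39, 0.45, 0.49, 0.54, 0.58, 0.77, 0.86]
for p, kurt, flagged in item_flags(means):
    print(f"p = {p:.2f}  kurtosis = {kurt:6.2f}  {'FLAG' if flagged else ''}")
```

Only the very easy and very hard items (#1 and #10) get flagged, matching the sample kurtosis pattern in the table (the population formula gives slightly different values than the sample statistics shown).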
Item-Total Correlations
- If an item is uncorrelated with the other items, it doesn’t contribute to internal-consistency reliability.
- Software packages like SAS, SPSS, etc. will compute these easily.
Biological Age Index: Negative Item-Total Correlations Are Bad
(Left column: forgot to “flip” items; right column: after flipping)

r(Total)  High is  Label                           r(Total), flipped
 0.57     Good     Feet Walked in Six Minutes        0.68
 0.42     Good     Rank for Variable Foot            0.59
 0.30     Good     Times Weight Lifting              0.54
-0.40     Bad      Seconds Trail B                   0.44
 0.42     Good     Standing Forward Bend             0.41
 0.38     Good     Tinetti Balance Score             0.38
-0.30     Bad      GDS Depression (High = Sad)       0.38
 0.20     Good     Sum of 3 Exercise Measures        0.35
-0.20     Bad      Charlson Comorbidity Index        0.32
-0.12     Bad      Mini Mental State Examination     0.15
-0.09     Bad      Body Mass Index                   0.02

Make sure all items are scored in the same direction: all high-is-good or all high-is-bad.
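Flipping a reverse-scored item is a one-line transformation: subtract each response from the sum of the scale's endpoints. A minimal sketch, using hypothetical 1-5 ratings:

```python
def flip(values, low, high):
    """Reverse-code an item scored on [low, high] so that high = good."""
    return [low + high - v for v in values]

# Hypothetical 1-5 depression ratings (high = sad); flip so high = good
depression = [5, 4, 2, 1, 3]
print(flip(depression, 1, 5))  # [1, 2, 4, 5, 3]
```

Flipping changes the sign of an item's correlation with the other items but not its magnitude, which is why the unflipped items show up as negative item-total correlations.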
“TS-IQ”: Low Item-Total r’s Are Bad (SPSS Reliability or SAS PROC CORR)
“Too Short” Item-Total Correlations (See SAS and SPSS Code in Handout)
- Items with nothing in common would not have a reliable total score.
- Cronbach’s alpha measures internal-consistency reliability.
- Reliability increases with high item-total correlations.

Item     Mean   Kurtosis   r(Item-Total)
Item01   0.06    11.16        0.30
Item02   0.22    -0.11        0.53
Item03   0.35    -1.61        0.53
Item04   0.39    -1.82        0.43
Item05   0.45    -1.96        0.53
Item06   0.49    -2.01        0.60
Item07   0.54    -1.99        0.59
Item08   0.58    -1.91        0.36
Item09   0.77    -0.29        0.49
Item10   0.86     2.60        0.39
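Corrected item-total correlations and Cronbach's alpha can also be computed outside SAS/SPSS. A pure-Python sketch with toy right/wrong data (the data are illustrative, not the TS-IQ responses):

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def item_total_r(items):
    """Corrected item-total r: each item vs. the total of the OTHER items."""
    totals = [sum(person) for person in zip(*items)]
    return [pearson(item, [t - v for t, v in zip(totals, item)])
            for item in items]

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    return k / (k - 1) * (1 - sum(pvariance(i) for i in items)
                          / pvariance(totals))

# Toy data: 3 right/wrong items answered by 6 people (illustrative only)
items = [[1, 0, 1, 1, 0, 1],
         [1, 0, 1, 0, 0, 1],
         [1, 1, 1, 0, 0, 1]]
print([round(r, 2) for r in item_total_r(items)])  # [0.53, 0.89, 0.53]
print(round(cronbach_alpha(items), 2))             # 0.79
```

Using the total of the *other* items (the "corrected" correlation) avoids inflating the r by correlating an item with a total that contains itself.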
How Many Items? Spearman-Brown Predicts Reliability as a Function of the Number of Items
- Classical test theory: reliability increases with the number of items.
- Put the S-B formula into Excel to see approximately how many items you need for a desired reliability under CTT.

Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 171-195.
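The same calculation the slide suggests doing in Excel can be sketched in a few lines of Python. The 0.70 and 0.90 figures below are an illustrative example, not from the slides:

```python
def spearman_brown(r_current, k):
    """Predicted reliability when the test is lengthened k-fold."""
    return k * r_current / (1 + (k - 1) * r_current)

def lengthening_factor(r_current, r_target):
    """k needed to reach a target reliability (solve S-B for k)."""
    return r_target * (1 - r_current) / (r_current * (1 - r_target))

# A 10-item test with alpha = 0.70; how many items for alpha = 0.90?
k = lengthening_factor(0.70, 0.90)
print(round(10 * k))  # 39
```

Note the diminishing returns: pushing reliability from 0.70 to 0.90 nearly quadruples the test length, assuming the new items are as good as the old ones.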
Factor Analysis (FA)
- We want to produce one or more single-factor tests.
- Use EFA (exploratory factor analysis) and CFA (confirmatory factor analysis).
Scree Plot of the “Too Short IQ Test”
- Run a principal components analysis with SAS, SPSS, etc.
- “Scree” plot of eigenvalues: Cattell’s metaphor, a mountain with useless rubble at the base.
- More than one big component? It is hard to get multiple factors (subtests) from the “Too Short IQ Test.”
- The Kaiser criterion (minimum eigenvalue > 1) is extremely liberal and makes unstable factors.
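The numbers behind a scree plot are just the eigenvalues of the item correlation matrix. A sketch with simulated one-factor data (the loading of 0.7, sample size, and seed are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a one-factor test: 10 items = shared factor + unique noise
n, k = 500, 10
factor = rng.normal(size=(n, 1))
items = 0.7 * factor + rng.normal(size=(n, k))

# Eigenvalues of the item correlation matrix, largest first;
# a scree plot is just these values plotted against their rank
eigvals = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
print(np.round(eigvals, 2))

# Kaiser criterion: count eigenvalues > 1 (known to be liberal)
print("Kaiser retains:", int((eigvals > 1).sum()))
```

With a genuine single factor, one eigenvalue towers over the rest and the remainder form the flat "rubble" at the base of the mountain.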
Exploratory Factor Analysis: VUMC Researcher Dropped Items with Low Loadings on Factor I
- Started with 50 items, picked the best 17
- Learning sample ≈ 330
- Validation sample ≈ 350
VUMC Researcher’s Three Samples
1. N = 181 children rating the understandability of items
2. N = 680 psychometric sample
   2a. Random 50% exploratory sample (CTT, EFA)
   2b. Random 50% confirmatory sample (CTT, CFA)
“Too Short IQ”: SAS CFA of a Single-Factor Measurement Model
- Fit criteria of 0.95 or 0.96 (very high standards of unidimensionality)
- Warning: so far, most VU tests early in their development haven’t met these high standards of measurement-model fit.
Rasch-IRT Model
- Measures persons and items in the same units.
- If you’re better than the item, P(right) > 50%.
- As (Person − Item) increases, P(right) increases in the logistic model.
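The Rasch probability rule stated above fits on one line:

```python
import math

def p_correct(person, item):
    """Rasch model: P(right) = 1 / (1 + exp(-(person - item))),
    with person ability and item difficulty in the same logit units."""
    return 1 / (1 + math.exp(-(person - item)))

# When ability equals difficulty, P = .50; above it, P > .50
print(p_correct(1.0, 1.0))             # 0.5
print(round(p_correct(2.0, 1.0), 2))   # 0.73
```

A person one logit above an item's difficulty has about a 73% chance of getting it right, regardless of where on the scale that happens; this is the sense in which persons and items share one scale.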
WINSTEPS
- One-parameter Rasch program (see http://www.winsteps.com)
- $200 ($99 on summer sale)
Persons & Items on One Scale
- The Rasch model measures each item and each person on the same scale.
- Concentrate your items where they are needed: measure everyone, or measure high clinical cases most efficiently.
- The TS-IQ measures across a wide range.
“TS-IQ” Items: Information Spread Across the Whole Range
- Easy items, like #10, are most informative about low-scoring individuals.
- Hard items, like #1, are most informative about high-scoring individuals.
- This test’s items spread out to describe the whole range of IQs.
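"Informative" has a precise meaning here: under the Rasch model, an item's information at ability theta is p(1 - p), which peaks where ability equals difficulty. A sketch (the difficulties of ±2 logits are illustrative):

```python
import math

def information(theta, b):
    """Rasch item information I(theta) = p(1 - p), where
    p = P(right); maximal (0.25) when theta equals the difficulty b."""
    p = 1 / (1 + math.exp(-(theta - b)))
    return p * (1 - p)

# An easy item (b = -2) tells us most about low scorers;
# a hard item (b = +2) tells us most about high scorers
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(information(theta, -2.0), 3),
          round(information(theta, 2.0), 3))
```

Summing these curves across items gives the test information function, which is what the clinical test on the next slide concentrates at its cutpoint.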
VUMC Clinical Test Focuses on a Cutpoint, Unlike the TS-IQ
- High is bad (sicker); clinical screens focus on sick people.
- The job is to classify (treat: yes/no), so the test should be maximally informative at the cutpoint.
- This test invests its items in the severe range.
Rating Scale Model for Likert Scales
- Separate estimates for Never, Sometimes, ...
- The TS-IQ is right/wrong, but Rasch and IRT also handle rating scales, such as Likert scales.
- The construct measured can be anything, e.g. depression, not just ability.
Putting It All Together
- Many items sit near the floor; the lowest few have excessive kurtosis.
- However, item-total r’s and Rasch fit statistics are generally OK.
- The test maker can shorten this test with considerable latitude, e.g. guided by content analysis.
Putting It All Together
- The last item fits the Rasch model poorly; consider dropping or revising it.
Putting It All Together
- The test has one odd item that measures something else; drop or revise that item.
How to Identify the Best Items
A toolkit, not an analytic plan: the job is really to flag the weaker items to drop or revise, using relative, not absolute, criteria.
Classical test theory:
- Floors or ceilings restrict variance
- To increase Cronbach’s alpha, avoid low item-total correlations
- Guesstimate test length with the Spearman-Brown formula
Factor analysis:
- Are the items reasonably unidimensional?
- Avoid items that do not load on the main factor
- See how well a confirmatory model fits
Rasch modeling:
- Pick items that fit a carefully considered measurement model
- Consider item difficulties more deeply
- Pick items that suit the intended task