Choose the best items: A basic psychometric toolkit for testmakers
  Warren Lambert, Vanderbilt Kennedy Center, February 2007
Examples of Recent Test Development by KC Investigators
  Peabody
    Two different tests of school-based reading ability
    A test of school-based math skills
    Very early signs of autism spectrum in infants
    A battery of 10 new tests for tracking mental health treatment of children
  VUMC
    Somatizing in children with recurrent abdominal pain
    Survey of attending MD satisfaction with a department in hospital
    Psychological rigidity in children
Goal of Today's Session
  Provide tools for people making their first index or test.
What Is a Test?
  A test is a set of items that produces a total score
  Could be a questionnaire
  A set of items in a structured interview
  Signs & symptoms of something
  Often a fuzzy construct with numerous imperfect indicators, e.g. Beck Depression Inventory, SF-36, CBCL
  Tests gain reliability by combining imperfect items into a total score.
  The sum of items will be more reliable than any single item.
How to Identify the Best Items
  A toolkit, not an analytic plan
    Flag weaker items to drop or revise
    Identify the weaker items
    Relative, not absolute criteria
  The three approaches below run from informal to formal.
  Classical test theory (informal)
    Enough for most medical research
    Floors or ceilings restrict variance
    To increase Cronbach's alpha, avoid low item-total correlations
    Guesstimate test length with the Spearman-Brown formula
  Factor analysis (exploratory and confirmatory)
    Are there items that don't fit the construct?
    Avoid items that do not load on the main factor
    See how well a confirmatory model fits
  Rasch modeling (formal)
    Pick items that fit a carefully considered measurement model
    Consider item difficulties more deeply
    Pick items that suit the intended task
Psychometrics vs Statistics
  Statistics: Find a statistical model that fits your data
  Psychometric test construction: Find data that fits your statistical model
  Choose sound measurement models and pick items that fit by dropping weaker items.
Classical Test Theory (CTT)
  Basic description of items
  Can be done with SAS, SPSS, STATA, S+, R, etc.
  Do this routinely with scales old and new
  Informal test development, e.g. a one-shot ad hoc index for an article
  Not enough for tests that will be widely used in many settings
Note Floors or Ceilings
  The Too Short IQ Test (TS-IQ)
  Mean, SD, and variance can all indicate floors or ceilings, but kurtosis is very easy to spot.
  The Too Short IQ Test data set, with SAS and SPSS code, is available for download: http://kc.vanderbilt.edu/quant/Seminar/schedule.htm
Hard, Medium, & Easy Items: #1, #6, #10
  Measuring an entire population requires a range of item difficulties.
  If everyone has the same score, the item gives no information.
  Kurtosis: about 11 for the hard item #1 (floor), -2 for the medium item #6, and 3 for the easy item #10 (ceiling)
Use Excel Conditional Formatting to Flag Problems
Retain Flagged Estimates of Quality
  Too Short IQ Test (TS-IQ)
  Variable   Mean   Kurtosis
  Item01     0.06     11.16
  Item02     0.22     -0.11
  Item03     0.35     -1.61
  Item04     0.39     -1.82
  Item05     0.45     -1.96
  Item06     0.49     -2.01
  Item07     0.54     -1.99
  Item08     0.58     -1.91
  Item09     0.77     -0.29
  Item10     0.86      2.60
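The flagging step can also be scripted instead of done in Excel. The following is a minimal sketch, not the SAS or SPSS code from the seminar: it computes each item's mean and a simple moment-based excess kurtosis, then flags items above a cutoff. The cutoff of 2.0, the function names, and the response data are illustrative assumptions. SAS and SPSS apply a small-sample correction, so their kurtosis values will differ slightly from these.

```python
def mean(xs):
    return sum(xs) / len(xs)

def excess_kurtosis(xs):
    """Moment-based excess kurtosis (0 for a normal distribution)."""
    m = mean(xs)
    m2 = mean([(x - m) ** 2 for x in xs])
    if m2 == 0:
        return float("inf")  # no variance: everyone gave the same answer
    m4 = mean([(x - m) ** 4 for x in xs])
    return m4 / m2 ** 2 - 3.0

def flag_item(xs, kurtosis_cutoff=2.0):
    """Return (mean, kurtosis, flagged). The cutoff is an assumed, relative criterion."""
    k = excess_kurtosis(xs)
    return mean(xs), k, k > kurtosis_cutoff

# Hypothetical 0/1 responses from 20 people (not the TS-IQ data).
hard_item = [1] + [0] * 19         # almost nobody passes: floor
medium_item = [1] * 10 + [0] * 10  # half pass: no floor or ceiling

for name, responses in [("hard", hard_item), ("medium", medium_item)]:
    m, k, flagged = flag_item(responses)
    print(f"{name}: mean={m:.2f} kurtosis={k:.2f} flagged={flagged}")
```

For a 0/1 item, kurtosis is lowest (about -2) when half the sample passes and climbs steeply as the mean approaches 0 or 1, which is why it makes floors and ceilings easy to spot.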
Item-total Correlations
  How can you add unrelated things into a single total?
  If an item is uncorrelated with other items, it doesn't contribute to the internal-consistency reliability of the total score
  Software packages like SAS, SPSS, etc. will do item-total correlations very easily
  Good check to use routinely
Biological Age Index (Frailty)
  Negative Item-Total Correlations Are Bad
  Forgot to flip items on left
  r with Total (left)   High is   Label                            r with Total (right)
   0.57                 Good      Feet Walked In Six Minutes       0.68
   0.42                 Good      Rank For Variable Foot           0.59
   0.30                 Good      Times Weight Lifting             0.54
  -0.40                 Bad       Seconds Trail B                  0.44
   0.42                 Good      Standing Forward Bend            0.41
   0.38                 Good      Tinetti Balance Score            0.38
  -0.30                 Bad       GDS Depression (High=Sad)        0.38
   0.20                 Good      Sum of 3 Exercise Measures       0.35
  -0.20                 Bad       Charlson Comorbidity Index       0.32
  -0.12                 Bad       Mini Mental State Examination    0.15
  -0.09                 Bad       Body Mass Index                  0.02
  Make sure all items are high-is-good or high-is-bad
  Goffaux, J., Friesinger, G. C., Lambert, E. W., et al. (2005). Biological age--a concept whose time has come: a preliminary study. South Med J, 98(10), 985-93.
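Flipping an item so that every item points the same way is simple arithmetic. Here is a minimal sketch, assuming the common convention of reflecting a score around the midpoint of its range (new = low + high - old). The function name and the data are hypothetical, not taken from the Biological Age study.

```python
def reverse_score(values, low=None, high=None):
    """Flip a variable so that high scores become low and vice versa.

    low/high default to the observed minimum and maximum; pass the
    scale's theoretical limits instead if they are known.
    """
    low = min(values) if low is None else low
    high = max(values) if high is None else high
    return [low + high - v for v in values]

# Hypothetical Trail B completion times in seconds (high = worse).
trail_b_seconds = [60, 95, 120, 180]
flipped = reverse_score(trail_b_seconds)
print(flipped)  # [180, 145, 120, 60]
```

Reflecting an item changes only the sign of its correlation with other variables, so a negative item-total correlation caused by a forgotten flip turns positive once the item is reversed.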
TS-IQ: Low Item-Total rs Are Bad
  Use SPSS Reliability or SAS PROC CORR with the ALPHA option
Too Short Item-Total Correlations
  Items with nothing in common would not have a reliable total score
  Cronbach's alpha: internal consistency reliability
  Reliability increases with high item-total correlations
  Item     Mean   Kurtosis   r(Item-Total)
  Item01   0.06     11.16    0.30
  Item02   0.22     -0.11    0.53
  Item03   0.35     -1.61    0.53
  Item04   0.39     -1.82    0.43
  Item05   0.45     -1.96    0.53
  Item06   0.49     -2.01    0.60
  Item07   0.54     -1.99    0.59
  Item08   0.58     -1.91    0.36
  Item09   0.77     -0.29    0.49
  Item10   0.86      2.60    0.39
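For readers without SAS or SPSS, the two statistics on this slide can be sketched in a few lines. This is an illustration under assumptions, not the seminar's code. It uses the standard formula for Cronbach's alpha and the "corrected" item-total correlation (each item against the total of the other items), which may or may not be the variant behind the table above. The data and function names are hypothetical.

```python
from statistics import mean, variance

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(items):
    """items: list of item columns, each with one score per person."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    sum_item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

def corrected_item_total(items):
    """Correlate each item with the total of the other items."""
    totals = [sum(person) for person in zip(*items)]
    return [
        pearson_r(col, [t - x for t, x in zip(totals, col)])
        for col in items
    ]

# Hypothetical 0/1 data: 3 items (columns) answered by 6 people.
items = [
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 1],
]
alpha = cronbach_alpha(items)
item_rs = corrected_item_total(items)
print(f"alpha = {alpha:.3f}")
for i, r in enumerate(item_rs, start=1):
    print(f"Item{i:02d} r(item-rest) = {r:.2f}")
```

Removing the item from its own total avoids inflating the correlation, which matters most on short tests like this one.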
How Many Items?
  Spearman-Brown's predicted reliability as a function of the number of items
  Classical Test Theory: Reliability increases with the number of items
  Put the S-B formula into Excel to see approximately how many items you need for a desired reliability under CTT.
  Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
  Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 171-195.
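The Spearman-Brown calculation is equally easy to script. The sketch below uses the standard prophecy formula, predicted = n * r / (1 + (n - 1) * r), where r is the current reliability and n is the factor by which the test is lengthened, along with its algebraic inverse. The starting values (10 items, reliability 0.70, target 0.85) are made-up illustrations, and the formula assumes any added items are comparable to the existing ones.

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying test length by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

# Hypothetical starting point: a 10-item test with reliability 0.70.
current_items, current_rel, target_rel = 10, 0.70, 0.85

factor = length_factor_needed(current_rel, target_rel)
items_needed = math.ceil(current_items * factor)

print(f"Doubling the test predicts reliability {spearman_brown(current_rel, 2):.3f}")
print(f"Reaching {target_rel} needs about {items_needed} items")  # about 25 items
```

The returns diminish quickly: going from 10 to 20 items raises predicted reliability from 0.70 to about 0.82, yet the last few points up to 0.85 cost another five items.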
How Much Is Enough?
  For local use of an ad hoc research index, CTT may suffice
  Formal tests (available for general use) require more thorough psychometric analysis
  Factor analysis and Item Response Theory modeling
Factor Analysis (FA)
  Beginning formal test development
  Goal is to make sure the test's theoretical foundations agree with the test data
  We want to produce one or more single-factor tests
  Use EFA (exploratory factor analysis) and CFA (confirmatory factor analysis)
Scree Plot of TS-IQ
  Run a principal components analysis with SAS, SPSS, etc.
  Scree plot of eigenvalues
  Cattell's metaphor: a mountain rising above useless rubble
  Is there more than one big component?
  Hard to get multiple factors (subtests) from the Too Short IQ Test
  The Kaiser criterion (minimum eigenvalue > 1) is extremely liberal and makes unstable factors
Formal Test Construction
  VUMC Pediatric Researchers
  Three Samples
    1. N = 181 children rating understandability of items
    2. N = 513 Psychometric sample 1
    3. Psychometric sample 2, N = 675
      2a. Random 50% N = 346 exploratory sample (CTT, EFA)
      2b. Random 50% confirmatory sample (CTT, CFA)
Confirmatory Factor Analysis
  See how well (ha! how badly!) the data fit a theory-driven model: factorial validity
  Theory: TS-IQ measures a single dimension of intelligence.
  Run a measurement model
  Look at fit indices
  Very popular in psychology, rarely done in nonpsychiatric medicine (exception: SF-36 has extensive psychometric analysis)
Too Short IQ
  SAS CFA of single-factor measurement model
  Fit criteria: RMSEA < 0.05, CFI > 0.95 or 0.96 (very high standards of unidimensionality)
  Warning: So far, most VU tests early in their development haven't met the high standards for measurement model fit.
Run Rasch or IRT (Item Response Theory)
  Rasch: one-parameter logistic model
    Good for practical test development (converges)
    E.g. Winsteps ($100 or $200)
  Item Response Theory (IRT): 1-, 2-, and 3-parameter models
    Good for research
    Need large samples
    E.g. Parscale, Bilog-MG, Multilog ($100 VU site license)
  P = probability of getting item i right
  Theta = person's ability
  B = item's difficulty, on the same scale as theta
Rasch Model
  Measures the score for person and item in the same units
  If you're better than the item, p(right) > 50%
  One-parameter logistic model
  As (Person - Item) increases, prob(right) increases along a logistic curve.
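The logistic relationship can be made concrete with the usual form of the one-parameter model, P = 1 / (1 + exp(-(theta - b))), where theta is the person's ability and b is the item's difficulty. The following is a minimal sketch with a hypothetical function name and made-up ability and difficulty values. It only evaluates the model and does not estimate parameters the way Winsteps does.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer under the one-parameter logistic model.

    theta: person ability; b: item difficulty (same logit scale).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical item of average difficulty (b = 0) and three people.
for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  P(right)={rasch_p(theta, 0.0):.3f}")
```

When ability equals difficulty the probability is exactly 50%. It rises above 50% when the person is above the item and falls below 50% when the person is beneath it.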
Rasch Model
  Items spread over a range of difficulties
  [Figure: item curves ordered from easy items to hard items. Source: http://en.wikipedia.org/wiki/Rasch_model]
WINSTEPS
  One-parameter Rasch program (see http://www.winsteps.com)
  $200 ($99 on summer sale)
TS-IQ Items: Information Spread Across Whole Range
  Easy items, like #10, are most informative about low-scoring individuals
  Hard items, like #1, are most informative about high-scoring individuals.
  This test's items are spread out to describe the whole range of IQs
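Why easy items inform about low scorers follows from the Rasch item information function, I(theta) = P * (1 - P), which peaks where ability equals item difficulty. The sketch below illustrates this with hypothetical difficulties of -2 (easy) and +2 (hard). These are not the estimated TS-IQ item difficulties.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information: largest (0.25) when theta equals b."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

EASY_B, HARD_B = -2.0, 2.0  # hypothetical item difficulties in logits

for theta in (-2.0, 0.0, 2.0):
    easy = item_information(theta, EASY_B)
    hard = item_information(theta, HARD_B)
    print(f"theta={theta:+.1f}  info(easy)={easy:.3f}  info(hard)={hard:.3f}")
```

A test that spreads its difficulties across the scale therefore has at least some well-targeted items for every person, while a test that clusters its items concentrates its information in one region.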
Persons & Items on One Scale
  Rasch model measures each item and each person on the same scale
  Concentrate your items where they are needed
    Measure everyone
    Measure high clinical cases most efficiently
  TS-IQ measures across a wide range
VUMC Clinical Test Focuses on Cutpoint
  Unlike the TS-IQ
  School sample
  High is bad (sicker)
  Clinical screens focus on sick people
    Classify: treat yes-no
    Job is to be maximally informative at the cutpoint
  This test invests its items in the severe range
Putting It All Together
  TS-IQ's Items and Total
Putting It All Together
  VUMC Pediatrics
  Items go 0-4
  Many items are near the floor (LE 1, that is, 1 or less)
  The lowest few have excessive kurtosis
  However, many item-total rs and Rasch fit stats are OK
  The test maker can shorten this with considerable latitude, e.g. with content analysis.
Putting It All Together
  Test has one odd item that measures something else.
  Drop or revise that item.
How to Identify the Best Items
  A toolkit, not an analytic plan
    Flag weaker items to drop or revise
    Identify the weaker items
    Relative, not absolute criteria
  The three approaches below run from informal to formal.
  Classical test theory (informal)
    Enough for most medical research
    Floors or ceilings restrict variance
    To increase Cronbach's alpha, avoid low item-total correlations
    Guesstimate test length with the Spearman-Brown formula
  Factor analysis (exploratory and confirmatory)
    Are there items that don't fit the construct?
    Avoid items that do not load on the main factor
    See how well a confirmatory model fits
  Rasch modeling (formal)
    Pick items that fit a carefully considered measurement model
    Consider item difficulties more deeply
    Pick items that suit the intended task