Psychometric analyses of ADNI data Paul K. Crane, MD MPH Department of Medicine University of Washington.

Psychometric analyses of ADNI data Paul K. Crane, MD MPH Department of Medicine University of Washington

Disclaimer Funding for this conference was made possible, in part by Grant R13 AG030995 from the National Institute on Aging. The views expressed do not necessarily reflect official policies of the Department of Health and Human Services; nor does mention by trade names, commercial practices, or organizations imply endorsement by the U.S. Government.

Outline ADNI neuropsychological battery Latent variable approaches –SEM and IRT ADAS-Cog in ADNI Memory in ADNI Executive functioning in ADNI

ADNI Neuropsychological Battery

Handout There is a handout that summarizes these tests and provides the variable names for the variables in the dataset

ADNI Neuropsychological Battery

Alternate Word Lists There are also two versions of the Rey AVLT that are alternated

Summary Repeated administration of a rich neuropsychological battery at 6 month intervals for 2 (AD) or 3 (NC, MCI) years How do we drink from that fire hose?

Strategies for analyzing these data Pick a couple of tests and ignore the others –ADAS-Cog and MMSE –CDR and CDR-SB Modifications of those tests –ADAS-Tree –ADAS-Rasch Composite scores for specific domains –Z score –Something fancier using latent variable approach

Latent variable approach “Items” not intrinsically interesting, only as indicators of the underlying thing measured by the test Many nice properties follow

Parallel development 1: SEM “Measurement part” of the model specifies how latent constructs are modeled “Structural part” of the model specifies relationships between latent constructs and each other, and between latent constructs and other covariates

http://sites.google.com/site/lvmworkshop/home/downloads-general/2010-downloads

Bunch of indicators limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl

“Memory” Underlying single factor with many indicators limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Memory

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Memory Rey word list ADAS word list MMSE words LM story

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Memory Rey word list ADAS word list MMSE words LM story HC volume This is what we care about!

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Memory* HC volume This is what we care about! … and typically don’t care whether memory is modeled this way or with all that secondary structure

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Memory Rey word list ADAS word list MMSE words LM story HC volume This is what we care about! And sometimes we care about it for 600,000 SNPs, or for voxels – we need to move outside of an SEM package for some of the analyses we want to do

Parallel development 2: IRT Models are nested within SEM Single factor confirmatory factor analysis model Initially worked out with binary indicators Extended 1960s to polytomous items (Samejima) It’s only the measurement part Attention to the quality of measurement and the quality of scores

Typical SEM example Construct Indicator 1 Indicator 2 Indicator 3 Indicator 4

Depression Beck Zung CESD PHQ-9

A closer look at PHQ-9 A 9 item depression scale Standard scores totalled up Typical SEM model would take that total score and treat it as a continuous indicator by using a linear link (single loading parameter) PHQ-9

PHQ-9 measurement properties

IRT approach to PHQ-9

SEM and IRT, then and now SEM was initially about total scores as indicators of constructs measured in common across tests IRT was initially about item level data that had to satisfy assumptions More recently: merging the strengths of the two approaches –Computational rather than conceptual advances, or maybe computational advances have fueled conceptual advances

Categorical data in SEM Runmplus code: That little code snippet tells Mplus to treat all of the elements in the local `vlist’ as categorical data Mplus default is WLSMV –Other appropriate ways of handling Major reason Mplus is dominant SEM software used at FH

What about IRT? Array of tools to address measurement precision Explicit focus on measurement properties and measurement precision differentiates it from SEM

Pretend for a moment that a single factor model was appropriate… Item response theory (IRT) developed middle of last century –Lord and Novick / Birnbaum (1968) Polytomous extension Samejima 1969 Lord 1980 Hambleton et al. 1991 XCALIBRE, Parscale, Multilog –All variations on a single factor CFA model

4 items each at 0.5 increments

Comments on that test Essentially linear test characteristic curve –Immaterial whether the standard score or the IRT score is used in analyses No ceiling or floor effect –People at the extremes of the thing measured by the test will get some right and get some wrong Pretty nice test!

2 items each at 0.5 increments

Comments on that test Essentially linear test characteristic curve –Immaterial whether the standard score or the IRT score is used in analyses No ceiling or floor effect –People at the extremes of the thing measured by the test will get some right and get some wrong Pretty nice test! But that’s what we said about the last one and it had twice as many items!

Why might we want twice as many items? Measurement precision / reliability CTT: summarized in a single number: alpha IRT: conceptualized as a quantity that may vary across the range of the test Information Mathematical relationship between information and standard error of measurement Intuitively makes sense that a test with 2x the items will measure more precisely / more reliably than a test with 1x the items

Test information curves for those two tests

Standard errors of measurement for the two tests

Comments about these information and SEM curves Information curves look more different than the SEM curves –Inverse square root relationship –TIC 100  SEM 0.10(1/10) –TIC 25  SEM 0.20(1/5) –TIC 16  SEM 0.25(1/4) –TIC 9  SEM 0.33(1/3) –TIC 4  SEM 0.50(1/2) Trade off between test length and measurement precision

These were highly selected “tests” It would be possible to design such a test if we started with a robust item pool Almost certainly not going to happen by accident / history What are more realistic tests?

Test characteristic curves for 2 26- item dichotomous tests

Comments on these TCCs Same number of items but very different shapes Now it may matter whether you use an IRT score or a standard score in analyses Both ceiling and floor effects

Comments on the TICs and SEMs Comparing the red test and the blue test: the red test is better for people of moderate ability (more items close to where they are) –For people right in the middle, measurement precision is just as good as a test twice as long Items far away from your ability level don’t help your standard error The blue test is better for people at the extremes (more items close to where they are)

Where do information curves come from? Item information curves use the same parameters as the item characteristic curves (difficulty level, b, and strength of association with latent trait or ability, a) (see next slides) Test information is the sum of all of the item information curves –We can do that because of local independence

I(θ) = D 2 a 2 *P(θ)*Q(θ)

More thoughts on IICs The b parameters shift the location of the curve left and right (where is the bead on the string?) The a parameters modify the height of the curve –But not tons; location location location IRT gives us tools to evaluate and manage measurement precision

What about cognitive tests?

Cognitive screening tests

Typical SEM example Construct Indicator 1 Indicator 2 Indicator 3 Indicator 4 Transformation of the scale of the indicator only works if it happens to be the inverse of the function from Cecile’s model! (Blom comment)

How about we fix the tools? Testlet models in IRT –Wainer, Bradlow, Wang (2007) –Li, Bolt, Fu (2006) Embed everything in hierarchical Bayesian framework or SEM framework –Curtis (2010) Modify IRT tools to integrate over nuisance factors for measurement precision –Curtis and Crane (2011 under review) Sample from posterior –Mislevy (1992) –Crane, Feldman, et al., (2011 under development)

ADAS-Cog in ADNI Brief cognitive test Commonly used outcome in drug trials Described in handout Two parts: cognitive part, rater part –Ignored in several manuscripts I have reviewed Eliminating rater items does not reduce respondent burden!

Does it meet IRT assumptions?

ADAS Cog 1 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 ADAS 1 q2 q1 q14

More on assumptions: 1 factor CFA TESTS OF MODEL FIT Chi-Square Test of Model Fit Value 273.016* Degrees of Freedom 39** P-Value 0.0000 Chi-Square Test of Model Fit for the Baseline Model Value 4011.878 Degrees of Freedom 29 P-Value 0.0000 CFI/TLI CFI 0.941 TLI 0.956 Number of Free Parameters 63 RMSEA (Root Mean Square Error Of Approximation) Estimate 0.086 WRMR (Weighted Root Mean Square Residual) Value 1.515

More on assumptions TESTS OF MODEL FIT Chi-Square Test of Model Fit Value 273.016* Degrees of Freedom 39** P-Value 0.0000 Chi-Square Test of Model Fit for the Baseline Model Value 4011.878 Degrees of Freedom 29 P-Value 0.0000 CFI/TLI CFI 0.941 TLI 0.956 Number of Free Parameters 63 RMSEA (Root Mean Square Error Of Approximation) Estimate 0.086 WRMR (Weighted Root Mean Square Residual) Value 1.515 0.90 / 0.95 0.08 / 0.05

BUT the details Item 1 is an average across three learning trials –Already assumes we can just total them up What if we treat the data as more granular, as 3 different indicators

ADAS Cog 2 q1b q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 ADAS 2 q1c q2 q1a q14

ADAS Cog 2 vs standard scores

Typical SEM example Construct Indicator 1 Indicator 2 Indicator 3 Indicator 4 Transformation of the scale of the indicator only works if it happens to be the inverse of the function from Cecile’s model! (Blom comment)

Broad range of scores for each standard score

Typical SEM example Construct Indicator 1 Indicator 2 Indicator 3 Indicator 4 1. Transformation of the scale of the indicator only works if it happens to be the inverse of the function from Cecile’s model! (Blom comment) 2. Transformations will only work at the level of granularity included in the model

Returning to the ADAS story Lot more parameters (20 more) CFI, TLI a bit better RMSEA a bit worse Item 1Item 1a, 1b, 1c 0.90 / 0.95 0.08 / 0.05

Secondary domain structure Methods effects –Same word list multiple times –Rater items Subdomain effects

ADAS Cog 2 bi-factor q1b q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 ADAS 2 q1c q2 q1a q14 Word list

Fit for single vs bi-factor 0.90 / 0.95 0.08 / 0.05

Loadings

ADAS Cog 2 bi-factor q1b q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 ADAS 2 q1c q2 q1a q14 Word list

Single factor vs. secondary factor structure Single factor easier to explain but not consistent with theory, may or may not fit data as well –Single factor model has a nice history of use in educational testing –Nice tools to understand the quality of measurement Bi-factor harder to explain, more consistent with theory, may fit data better –Tools not widely available to understand the quality of measurement –Scores often highly HIGHLY correlated between single factor and bi-factor models

Scores correlate 0.95

ADAS Cog 3 bi-factor q1b q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 ADAS 3 q1c q2 q1a q14 Word list Ratings

Further improves fit 0.90 / 0.95 0.08 / 0.05

Differences in loadings

Information content: single factor

ADAS Single vs. Bi-factor Single factor is sure nice –Easy to explain –But it sure fits less well –And over-estimates measurement precision somewhat Bi-factor is sure nice –Not so hard to explain –More consistent with our theory –Fits better Measurement precision appropriately estimated at least within the SEM framework

I(θ) in presence of nuisance factors

So we integrate over the nuisance dimensions….

What about this?

Multiple versions longitudinally A bit of a challenge But nothing we can’t handle! Does require some assumptions

Different versions of the word list

Mplus ADAS model constraints 1

ADAS model constraints 2

And there’s a cost 320 lines of code

Limitations Only tried this with the single factor model –Thresholds likely to be unaffected –But loadings may vary Also relies on a crucial assumption that may not hold

Essentially co-calibration over time Cross sectional co-calibration assumption: people are people. Relationship of items to the underlying domain, controlling for the latent ability level, is invariant across people This seems problematic with memory vs. other domains when we have people with MCI and AD –Memory declines faster –Maybe not hugely so over 6 months

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Scr/0

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Scr/06

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Scr/0612

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Scr/0612 avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl 18

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Scr/0612 avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl 18 limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl 24

limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl Scr/0612 avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl 18 limmtotal avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 ldeltotal avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl 24 avtot1 avtot2 avtot3 avtot4 avtot5 avtotb avtot6 avdel30min avdeltot cot1scor cot2scor cot3scor cot4totl mmballdl mmflagdl mmtreedl 36 limmtotal ldeltotal

Strategies Ignore (likely a bad idea) Use flexible approach we used for ADAS-Cog –Mplus to Paul: “That model looks hard. Perhaps you could give me an easier model.” –Paul to Mplus: “No, computers don’t date.” –Mplus: “Fine. I’m not talking to you anymore.” Use dummy variables for study visit and regress versions on visit (Rich Jones) –But: single parameter does not address all differences in ADAS-Cog versions ???

The “ignore” strategy

Executive functioning in ADNI Report is in the documentation Bifactor scores

Acknowledgements R01 AG 029672, “Improving cognitive outcome precision & responsiveness with modern psychometrics” UW –Betsy –Elena –Elizabeth –Joey –Jonathan –Laura –McKay –RJ Hebrew SeniorLife, UC Davis, Washington U –Alden –Chengjie –Dan –Danielle –Doug –Frances –Rich

Depression* Item 2 Item 1 Item 4 Item 3 Item 6 Item 5 Item 8 Item 7 Item 9

Psychometric analyses of ADNI data Paul K. Crane, MD MPH Department of Medicine University of Washington.

Similar presentations

Presentation on theme: "Psychometric analyses of ADNI data Paul K. Crane, MD MPH Department of Medicine University of Washington."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Psychometric analyses of ADNI data Paul K. Crane, MD MPH Department of Medicine University of Washington.

Similar presentations

Presentation on theme: "Psychometric analyses of ADNI data Paul K. Crane, MD MPH Department of Medicine University of Washington."— Presentation transcript:

Similar presentations

About project

Feedback