The Patient-Reported Outcome Measurement Information System (PROMIS)

Nan Rothrock, Ph.D. Northwestern University May 22, 2012

2 Agenda
Problems in patient-reported outcome measures
PROMIS approach to PRO instrument development
Available PROMIS instruments
Reliability, validity
PROMIS and the FDA

3 Challenges in PRO Measurement
Many measures of same health concept
Widely varying quality
Difficult to compare and combine data . . . across studies . . . across conditions
Complex
Long
These challenges were appreciated by NIH many years ago.

4 What is wrong with today's static measures?
[Figure: two static questionnaires on an axis with ticks from -3 to 3: one with high precision but a small range, one with a wide range but low precision]
Static questionnaires are either not precise enough or their measurement range is too narrow, which leads to ceiling and floor effects.

5 “The clinical outcomes research enterprise would be enhanced greatly by the availability of a psychometrically validated, dynamic system to measure PROs efficiently in study participants with a wide range of chronic diseases and demographic characteristics.” National Institutes of Health, 2003

6 PROMIS Aims
Attack the Patient-Reported Outcome (PRO) “Tower of Babel”
Harness modern psychometric methods
Improve quality and interpretability of PROs
According to Genesis, the monolingual Babylonians wanted to make a name for themselves by building a mighty city and a tower “with its top in the heavens.” God disrupted the work by confounding their speech: workers spoke different languages and could no longer understand one another. "Then they said, 'Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves; otherwise we shall be scattered abroad upon the face of the whole earth.'" (Genesis 11:4).
Image: Bruegel, 1563. Bruegel was a Flemish painter; this work is currently in Vienna.

7 Resources Nine-year commitment of NIH $80+ million investment
15 funded research sites

8 What is PROMIS? Methodology Measures (Instruments) Software

9 Glossary
Item = question or statement a patient answers
Instrument = collection of items
Legacy = existing instrument that is “gold standard” or a commonly used and widely accepted instrument

10 PROMIS Instruments
Domain focused, not disease focused
Domain = feeling, function or perception you want to measure (e.g., anxiety, physical function, general health perceptions)
Item Banks
A large collection of items measuring one domain
Any and all items can be used to provide a score
Can be administered as Computerized Adaptive Tests (CATs) or fixed-length short forms

11 The Life Story of a PROMIS Item
Focus groups Archival data analysis Binning and winnowing Domain Framework Literature review Expert review/ consensus Literacy level analysis Large-scale testing Cognitive interviews Expert item revision Translation review Statistical analysis Development of new and modified items (Beginning pool 8000 total items) 28 focus groups Binning and winnowing to 1064 items Cognitive interviews (784 items) Intellectual property Calibration decisions Short form CAT Validation studies

12 What instruments were created?

13 PROMIS Domain Framework
Symptoms Physical Health Function Affect Self-Reported Health Mental Health Behavior Cognition Relationships Social Health Function

14 PROMIS Current (2012) Physical Health Banks
Adult Pediatric/Parent Proxy Pain Behavior Pain Interference Pain Intensity Fatigue Pain Interference Upper Extremity Physical Health Fatigue Mobility Sleep Disturbance Asthma Impact Sleep-related Impairment Physical Function Sexual Function

15 PROMIS Current (2012) Mental Health Banks
Adult Pediatric/Parent Proxy Anxiety Anxiety Depression Depression Anger Mental Health Anger Psychosocial Illness Impact Applied Cognition Concerns Applied Cognition Abilities Alcohol Use Alcohol Consequences Alcohol Expectancies

16 PROMIS Current (2012) Social Health Banks
Adult Pediatric/Parent Proxy Ability to Participate in Roles & Activities Peer Relationships Satisfaction with Roles & Activities Social Health Companionship Emotional Support Informational Support Instrumental Support Social Isolation

17 Sample PROMIS Fatigue Short Form
Reprinted with permission of the PROMIS Health Organization and the PROMIS Cooperative Group © 2007.

18 Language Availability
Available Universal Spanish In Process German Portuguese Mandarin Chinese French Italian Norwegian Others – see

19 What is the metric?
T Score
Mean = 50 Standard Deviation = 10 Referenced to the US General Population
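A minimal sketch of the T-score metric: a score standardized against the US general population reference (z) is linearly rescaled to a mean of 50 and a standard deviation of 10. The function names `to_t_score` and `to_z` and the example values are illustrative assumptions, not part of any PROMIS software.

```python
def to_t_score(z: float, mean: float = 50.0, sd: float = 10.0) -> float:
    """Rescale a standardized score (z) onto the T-score metric."""
    return mean + sd * z


def to_z(t: float, mean: float = 50.0, sd: float = 10.0) -> float:
    """Convert a T-score back to standard-deviation units."""
    return (t - mean) / sd


# Hypothetical respondent one SD above the reference-population mean.
print(to_t_score(1.0))   # 60.0
print(to_z(35.0))        # -1.5
```

Read this way, a T-score of 60 means one standard deviation above the reference-population mean on that domain.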

20 Ongoing PROMIS Development
Adult GI Symptoms Self-efficacy for management of chronic disease Pediatric Pain Behavior, Quality, Intensity Physical Activity Experience of Stress Subjective Well-being Impact of Child Illness on Family Family Belongingness

21 How does PROMIS compare to other PRO instruments?

22 Reliability (Precision)
This leads to precise measurement that improves the power and efficiency of clinical research.

23 Physical Function Measurement Precision and Range
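[Figure: measurement error (SE) across the physical function range for PROMIS Short Form 20 items, PROMIS Short Form 10 items, SF items, CAT 10 items, and HAQ 20 items. Reference lines: SE = 3.3 (rel = 0.90) and SE = 2.3 (rel = 0.95). Score distributions shown for rheumatoid arthritis patients and the US general population.]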
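The slide pairs standard errors with reliabilities (SE = 3.3 with rel = 0.90, SE = 2.3 with rel = 0.95). A minimal sketch of one common conversion, assuming scores are on the T-score metric (SD = 10) and that reliability is taken as 1 - (SE/SD)^2; the slide does not state which formula was used, and the function name is hypothetical.

```python
def reliability_from_se(se: float, sd: float = 10.0) -> float:
    """Approximate reliability implied by a standard error on a scale with the given SD.

    Assumes reliability = 1 - (SE / SD)^2, a common convention; not confirmed by the slide.
    """
    return 1.0 - (se / sd) ** 2


for se in (3.3, 2.3):
    print(f"SE = {se}: reliability is roughly {reliability_from_se(se):.2f}")
```

Under this assumption the two reference lines come out at roughly 0.89 and 0.95, close to the values labeled on the slide.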

24 Validity

25 Scores on PROMIS measures should correlate with accepted measures of the same domain. (Concurrent Validity)
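A minimal sketch of how concurrent validity is often checked: the Pearson correlation between scores on a PROMIS measure and scores on an accepted measure of the same domain, taken from the same respondents. The scores below are made-up illustrative values, not study data, and `pearson_r` is a hypothetical helper.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five respondents on two instruments.
promis_scores = [45.0, 52.0, 58.0, 61.0, 70.0]
legacy_scores = [8.0, 12.0, 15.0, 19.0, 25.0]
print(round(pearson_r(promis_scores, legacy_scores), 2))
```

A high positive correlation would support the claim that both instruments measure the same domain.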

26 Depression PROMIS Depression

27 When people experience clinical benefit or decline, their PROMIS scores should also change. (Responsiveness)

28 Effect Size: PROMIS Pain Interference vs. BPI Interference (Patients with pain intensity > 4 at baseline)
Back pain with sciatica for at least 6 weeks
Scheduled for an epidural steroid injection
Baseline, 1 month, 3 month
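A minimal sketch of one common responsiveness statistic, assuming the effect size is the mean change from baseline divided by the baseline standard deviation. The slide does not say which effect size definition was used, and the scores below are made-up illustrative values, not study data.

```python
from statistics import mean, stdev


def standardized_effect_size(baseline: list[float], follow_up: list[float]) -> float:
    """Mean change from baseline divided by the baseline sample SD (an assumed definition)."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up must cover the same patients")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


# Hypothetical pain interference T-scores at baseline and at 1 month.
baseline = [60.0, 62.0, 64.0, 66.0, 68.0]
one_month = [55.0, 57.0, 59.0, 61.0, 63.0]
print(round(standardized_effect_size(baseline, one_month), 2))
```

Computing the same statistic for two instruments in the same patients lets their responsiveness be compared on a common, unit-free scale.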

29 Effect Size: PROMIS Pain Behavior vs. Roland-Morris (Patients with pain intensity > 4 at baseline)

30 PROMIS and the FDA: Agreement
Importance of PRO development to include patient voices
Importance of sound measurement
Confusion in selecting an instrument because of huge array of choices
Ongoing discussions via Interagency Clinical Outcomes Assessment Working Group to qualify PROMIS Fatigue measures, attendance and presentations at PRO Consortium

31 PROMIS and FDA: Differences Concerning Content Validity
FDA Approach: evaluate content validity in each clinical population in which the measure may be used
PROMIS Approach: there is commonality in patients’ experiences of symptoms/outcomes and their impact on QOL
Need to re-validate a well-developed & valid instrument in a target population is questionable
Content Validity = extent to which a scale or questionnaire represents the most relevant and important aspects of a concept in the context of a given measurement application
Magasi, S. et al (2011) Content validity of patient-reported outcome measures: Perspectives from a PROMIS meeting. Quality of Life Research

32 Generic fatigue = MS-specific fatigue (r=0.92) (Cook et al, QOLR, 2011)
Content Validity = extent to which a scale or questionnaire represents the most relevant and important aspects of a concept in the context of a given measurement application
When the most relevant/important aspects of fatigue for MS are included, the result is the same as with the general fatigue measure

33 PROMIS FatigueSFv1.0 and PROMIS FatigueMS Scores by Disability Status, Fatigue Severity, and Vitality Scores
Group | N | PROMIS FatigueSF v1.0 Mean | SD | PROMIS FatigueMS Mean | SD
Expanded Disability Status Scale (EDSS)
Mild (0-4) | 83 | 52.2 | 8.2 | 52.5 | 9.2
Moderate ( ) | 104 | 60.5 | 6.4 | 60.7 | 5.6
Severe ( ) | 43 | | 8.3 | | 8.7
Fatigue Severity (0-10 NRS)
None/Mild (0-1) | 18 | 43.0 | 4.5 | 42.5 | 5.4
Moderate (2-4) | 58 | 51.0 | 6.0 | 51.3 | 6.6
Severe (5-10) | 154 | 61.7 | 5.8 | 61.9 | 5.5
Vitality (item from the MOS)
None/A little | 52 | 63.8 | 5.3 | 64.2 |
Some | 88 | 59.9 | 6.3 | 60.1 |
Quite a lot | 44 | 55.7 | | 56.0 | 6.8
Very Much | 45 | 47.5 | 7.3 | 47.0 | 7.9
Known groups validity equal across 2 PROMIS short forms
Cook et al, QOLR, 2011

34 Use generic measures as the foundation for PRO content validity
Supplement with targeted measures
Item banking allows flexible item choice without loss of a standard scoring base
Alternative is a messy array of contenders that fail to communicate across themselves regarding severity or result interpretation

35 The “Promise” of PROMIS Instruments
Comparability
Provide the ability to compare or combine results from multiple studies.
Reliability and Validity
Reduce response burden.
Improve measurement precision.
Simplify administration via computer-based administration, scoring, and reporting.

36 Questions?
PROMIS website www. nihpromis
Acknowledgements
National Institutes of Health (Grants U54 AR , U05 AR , U01 AR )
PROMIS PIs: David Cella, Richard Gershon, San Keller, Joan Broderick, Arthur Stone, Heidi Crane, Paul Crane, Donald Patrick, Dagmar Amtmann, Karon Cook, Darren DeWalt, Chris Forrest, Jim Fries, Stephen Haley, David Tulsky, Dinesh Khanna, Brennan Spiegel, Paul Pilkonis, Carol Moinpour, Arnold Potosky, Esi Morgan DeWitt, Lisa Shulman, Kevin Weinfurt

37 Thank you!

38 Minimally Important Differences (MID)

39 MID Methods
IRT-based MIDs on a T-Score scale
Multiple cross-sectional and longitudinal anchors (18)
Summarized with nonparametric statistics (median, interquartile range)
Due to the large number of anchor-based MID estimates calculated, we summarized the results using nonparametric statistics, namely medians and interquartile ranges. All anchors were self-reported (not clinical). Longitudinal anchors were prospective (e.g., performance status) or retrospective (e.g., global rating of change).
The MID for a scale should be larger than its measurement error. To ensure this, the lower bound of the anchor-based MID range was not allowed to go below the SEM. The SEM for the IRT-based MIDs was the average standard error for the sample. The SEM for the raw score MIDs was based on the standard formula (SEM = SD*sqrt(1-r)).
In the cross-sectional analysis, anchors were used to categorize patients into multiple clinically distinct groups. Many different anchors can be used for this purpose, provided individuals can be classified into distinct categories that are both clinically relevant and minimally different. Score differences between adjacent, clinically distinct categories represent estimates of the MID. Effect sizes for these estimates were computed by dividing the adjacent category score difference by the overall SD for the sample.
Both prospective and retrospective anchors were used. For prospective anchors, we identified the amount of change in the anchor that represents minimally important change (e.g., 1-pt change in ECOG, 2-pt change on 0-10 pain scale, or change >=MID for a multi-item scale like FACIT-Fatigue). The MID is the mean score change within those categories of minimally important change (decline or improvement). For retrospective anchors, mean change scores on the PROMIS-Cancer scales corresponding to global rating of change (GRC) item responses of +1 or +2 (“a little better,” “moderately better”) and -1 or -2 (“a little worse,” “moderately worse”) were considered estimates of the MID.
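A minimal sketch of the arithmetic described in these notes: the raw-score SEM formula, a median and interquartile range summary of anchor-based MID estimates, and a lower bound that is not allowed to fall below the SEM. Treating the interquartile range as the MID range is an assumption, the function names are hypothetical, and the numbers are made-up illustrative values, not study results.

```python
from math import sqrt
from statistics import median, quantiles


def sem_raw(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * sqrt(1.0 - reliability)


def effect_size(score_difference: float, overall_sd: float) -> float:
    """Adjacent-category score difference divided by the overall sample SD."""
    return score_difference / overall_sd


def summarize_mid(estimates: list[float], sem: float) -> dict[str, float]:
    """Median and interquartile range of MID estimates, with the lower bound floored at the SEM."""
    q1, _, q3 = quantiles(estimates, n=4)  # quartiles (default 'exclusive' method)
    return {"median": median(estimates), "lower": max(q1, sem), "upper": q3}


# Hypothetical anchor-based MID estimates (T-score points) and a hypothetical SEM.
estimates = [2.0, 3.0, 4.0, 5.0, 6.0]
print(summarize_mid(estimates, sem=3.0))
print(round(sem_raw(sd=10.0, reliability=0.91), 2))
print(effect_size(4.0, overall_sd=10.0))
```

The sketch does not handle the case where the SEM exceeds the upper quartile; a real analysis would need a rule for that situation.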

40 Short Form Results
Recommended IRT-based T-Score MIDs and Raw Score MIDs for PROMIS-Cancer Short Forms in Advanced Cancer Patients
Columns: Instrument, T-Score MID Points, T-Score MID Effect Sizes*, Raw Score MID Points, Raw Score MID Effect Sizes‡
Fatigue 3-5 2-3
Pain Interference 4-6 4-7
Physical Function 3-6
Anxiety 3-5 3-4
Depression
*Calculated as the T-Score MID divided by the Assessment 1 T-Score standard deviation
‡Calculated as the Raw Score MID divided by the Assessment 1 Raw Score standard deviation
All lower bounds of the MID ranges were greater than the SEMs for both IRT and raw score MIDs.
Although methods and results for IRT-based T-Scores were presented in detail in this presentation, the exact same methods were used to derive the raw score MIDs. That is, we did not simply take the IRT-based MIDs and transform them to a raw score scale.
7-item fatigue SF

41 CAT Results
Recommended IRT-based T-Score MIDs for PROMIS-Cancer CATs
Columns: CAT, T-Score MID Points, T-Score MID Effect Sizes
Fatigue
Pain Interference
Physical Function
Anxiety
Depression
All lower bounds of the MID ranges were greater than the SEMs for both IRT and raw score MIDs.

