Methodological Considerations in Developing Hospital Composite Performance Measures Sean M. O’Brien, PhD Department of Biostatistics & Bioinformatics Duke.


1 Methodological Considerations in Developing Hospital Composite Performance Measures Sean M. O’Brien, PhD Department of Biostatistics & Bioinformatics Duke University Medical Center

2 Introduction
- A “composite performance measure” is a combination of two or more related indicators
  - e.g. process measures, outcome measures
- Useful for summarizing a large number of indicators
- Reduces a large number of indicators into a single simple summary

3 Example #1 of 3: CMS / Premier Hospital Quality Incentive Demonstration Project source:

4 Example #2 of 3: US News & World Report’s Hospital Rankings
Rankings – Heart and Heart Surgery
Rank | Hospital | Score
#1 | Cleveland Clinic | 100.0
#2 | Mayo Clinic, Rochester, Minn. | 79.7
#3 | Brigham and Women's Hospital, Boston | 50.5
#4 | Johns Hopkins Hospital, Baltimore | 48.6
#5 | Massachusetts General Hospital, Boston | 47.6
#6 | New York-Presbyterian Univ. Hosp. of Columbia and Cornell | 45.6
#7 | Texas Heart Institute at St. Luke's Episcopal Hospital, Houston | 45.0
#8 | Duke University Medical Center, Durham, N.C. |
source:

5 Example #3 of 3: Society of Thoracic Surgeons Composite Score for CABG Quality STS Composite Quality Rating STS Database Participant Feedback Report

6 Why Composite Measures?
- Simplifies reporting
- Facilitates ranking
- More comprehensive than single measure
- More precision than single measure

7 Limitations of Composite Measures
- Loss of information
- Requires subjective weighting
  - No single objective methodology
- Hospital rankings may depend on weights
- Hard to interpret
  - May seem like a “black box”
  - Not always clear what is being measured

8 Goals
- Discuss methodological issues & approaches for constructing composite scores
- Illustrate inherent limitations of composite scores

9 Outline
- Motivating Example: US News & World Report’s “Best Hospitals”
- Case Study: Developing a Composite Score for CABG

10 Motivating Example: US News & World Report – Best Hospitals 2007
Quality Measures for Heart and Heart Surgery
Reputation Score (Based on physician survey. Percent of physicians who list your hospital in the “top 5”)
+ Mortality Index (Risk adjusted 30-day. Ratio of observed to expected number of mortalities for AMI, CABG etc.)
+ Structure Component
  - Volume
  - Nursing index
  - Nurse magnet hosp
  - Advanced services
  - Patient services
  - Trauma center

11 Motivating Example: US News & World Report – Best Hospitals 2007
“structure, process, and outcomes each received one-third of the weight.”
- America’s Best Hospitals 2007 Methodology Report

12 Motivating Example: US News & World Report – Best Hospitals 2007
Example Data – Heart and Heart Surgery
Duke University Medical Center
Reputation | 16.2%
Mortality index | 0.77
Discharges | 6624
Nursing index | 1.6
Nurse magnet hosp | Yes
Advanced services | 5 of 5
Patient services | 6 of 6
Trauma center | Yes
source: usnews.com

13 Which hospital is better?
Measure | Hospital A | Hospital B
Reputation | 5.7% | 14.3%
Mortality index | 0.74 | 1.10
Discharges | 10047 | 2922
Nursing index | 2.0 | 2.0
Nurse magnet hosp | Yes | Yes
Advanced services | 5 of 5 | 5 of 5
Patient services | 6 of 6 | 6 of 6
Trauma center | Yes | Yes

14 Despite Equal Weighting, Results Are Largely Driven By Reputation 2007 RankHospital Overall Score Reputation Score #1 Cleveland Clinic % #2 Mayo Clinic, Rochester, Minn % #3 Brigham and Women's Hospital, Boston % #4 Johns Hopkins Hospital, Baltimore % #5 Massachusetts General Hospital, Boston % #6 New York-Presbyterian Univ. Hosp. of Columbia and Cornell % #7 Texas Heart Institute at St. Luke's Episcopal Hospital, Houston % #8 Duke University Medical Center, Durham, N.C % (source of data:

15 Lesson for Hospital Administrators (?)
- Best way to improve your score is to boost your reputation
  - Focus on publishing, research, etc.
- Improving your mortality rate may have a modest impact

16 Lesson for Composite Measure Developers
- No single “objective” method of choosing weights
- “Equal weighting” may not always behave like it sounds

17 Case Study: Composite Measurement for Coronary Artery Bypass Surgery

18 Background
- Society of Thoracic Surgeons (STS) – Adult Cardiac Database
  - Since 1990
  - Largest quality improvement registry for adult cardiac surgery
  - Primarily for internal feedback
  - Increasingly used for reporting to 3rd parties
- STS Quality Measurement Taskforce (QMTF)
  - Created in 2005
  - First task: Develop a composite score for CABG for use by 3rd party payers

19 Why Not Use the CMS HQID Composite Score?
- Choice of measures
  - Some HQID measures not available in STS (Also, some nationally endorsed measures are not included in HQID)
- Weighting of process vs. outcome measures
  - HQID is heavily weighted toward process measures
  - STS QMTF surgeons wanted a score that was heavily driven by outcomes

20 Our Process for Developing Composite Scores
- Review specific examples of composite scores in medicine
  - Example: CMS HQID
- Review and apply approaches from other disciplines
  - Psychometrics
- Explore the behavior of alternative weighting methods in real data
- Assess the performance of the chosen methodology

21 CABG Composite Scores in HQID (Year 1)
Process Measures (4 items), summarized as the Process Score
- Aspirin prescribed at discharge
- Antibiotics <1 hour prior to incision
- Prophylactic antibiotics selection
- Antibiotics discontinued <48 hours
Outcome Measures (3 items), summarized as the Outcome Score
- Inpatient mortality rate
- Postop hemorrhage/hematoma
- Postop physiologic/metabolic derangement
Process Score and Outcome Score are combined into the Overall Composite

22 CABG Composite Scores in HQID – Calculation of the Process Component Score
- Based on an “opportunity model”
- Each time a patient is eligible to receive a care process, there is an “opportunity” for the hospital to deliver required care
- The hospital’s score for the process component is “the percent of opportunities for which the hospital delivered the required care”

23 CABG Composite Scores in HQID – Calculation of the Process Component Score
Hypothetical example with N = 10 patients
Aspirin at Discharge | 9 / 9 (100%)
Antibiotics Initiated | 9 / 10 (90%)
Antibiotics Selection | 10 / 10 (100%)
Antibiotics Discontinued | 9 / 9 (100%)
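The opportunity model can be worked through on the hypothetical counts from this slide. This is a minimal sketch, not the official HQID calculation; the dictionary keys and the function name `opportunity_score` are made up for illustration.

```python
# Hypothetical counts from the N = 10 example:
# (times required care was delivered, patients eligible for that care process).
opportunities = {
    "aspirin_at_discharge": (9, 9),
    "antibiotics_initiated": (9, 10),
    "antibiotics_selection": (10, 10),
    "antibiotics_discontinued": (9, 9),
}


def opportunity_score(counts):
    """Percent of all opportunities for which required care was delivered."""
    delivered = sum(num for num, _ in counts.values())
    eligible = sum(den for _, den in counts.values())
    return 100.0 * delivered / eligible


process_score = opportunity_score(opportunities)
# 37 of 38 opportunities were met, so the pooled score is about 97.4%.
print(f"Process component score: {process_score:.1f}%")
```

Note that the pooled score (37 / 38) is not the same as the average of the four per-measure percentages; measures with more eligible patients contribute more opportunities.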

24 CABG Composite Scores in HQID – Calculation of Outcome Component
- Risk-adjusted using 3M™ APR-DRG™ model
- Based on ratio of observed / expected outcomes
- Outcomes measures are:
  - Survival index
  - Avoidance index for hematoma/hemorrhage
  - Avoidance index for physiologic/metabolic derangement

25 CABG Composite Scores in HQID – Calculation of Outcome Component – Survival Index
Interpretation:
  - index <1 implies worse-than-expected survival
  - index >1 implies better-than-expected survival
(Avoidance indexes have analogous definition & interpretation)

26 CABG Composite Scores in HQID – Combining Process and Outcomes
Overall Composite Score = 4/7 x Process Score + 1/7 x survival index + 1/7 x avoidance index for hemorrhage/hematoma + 1/7 x avoidance index for physiologic derangement
“Equal weight for each measure”
  - 4 process measures
  - 3 outcome measures
  - each individual measure is weighted 1/7
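The weighting formula on this slide can be written as a small function. This is only a sketch: the function name and the input values are hypothetical, and it assumes the process score is expressed as a proportion (0.974 rather than 97.4) so that it sits on the same scale as the indexes.

```python
def hqid_overall_composite(process_score, survival_index,
                           hemorrhage_index, derangement_index):
    """Equal weight (1/7) for each of 4 process and 3 outcome measures.

    The 4 process measures are already pooled into one process score,
    so that score carries 4/7 of the weight.
    """
    n_process, n_outcome = 4, 3
    weight = 1 / (n_process + n_outcome)
    return (n_process * weight * process_score
            + weight * survival_index
            + weight * hemorrhage_index
            + weight * derangement_index)


# Hypothetical inputs: process score as a proportion, indexes near 1.
example = hqid_overall_composite(0.974, 1.01, 1.00, 0.99)
print(round(example, 4))
```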

27 Strengths & Limitations
- Advantages:
  - Simple
  - Transparent
  - Avoids subjective weighting
- Disadvantages:
  - Ignores uncertainty in performance measures
  - Not able to calculate confidence intervals
- An Unexpected Feature:
  - Heavily weighted toward process measures
  - As shown below…

28 CABG Composite Scores in HQID – Exploring the Implications of Equal Weighting
- HQID performance measures are publicly reported for the top 50% of hospitals
- Used these publicly reported data to study the weighting of process vs. outcomes

29 Publicly Reported HQID Data – CABG Year 1 Process Measures Outcome Measures

30 Process Performance vs. Overall Composite Decile Ranking

31 Outcome Performance vs. Overall Composite Decile Ranking

32 Explanation: Process Measures Have Wider Range of Values
- The amount that outcomes can increase or decrease the composite score is small relative to process measures

33 Process vs. Outcomes: Conclusions
- Outcomes will only have an impact if a hospital is on the threshold between a better and worse classification
- This weighting may have advantages
  - Outcomes can be unreliable
    - Chance variation
    - Imperfect risk-adjustment
  - Process measures are actionable
- Not transparent

34 Lessons from HQID
- Equal weighting may not behave like it sounds
- If you prefer to emphasize outcomes, must account for unequal measurement scales, e.g.
  - standardize the measures to a common scale
  - or weight process and outcomes unequally
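A toy illustration of this lesson, using made-up numbers for five hospitals (none of these values come from HQID data). With nominally equal weights, the component with the wider range of values determines the ranking; dividing each component by its standard deviation lets the narrower component matter.

```python
from statistics import pstdev

# Hypothetical (process score, outcome index) pairs. Process scores vary
# widely across hospitals; outcome indexes vary only slightly.
hospitals = {
    "A": (0.99, 0.990),
    "B": (0.95, 1.000),
    "C": (0.90, 1.005),
    "D": (0.85, 1.010),
    "E": (0.80, 1.015),
}


def ranking(scores):
    """Hospital names ordered from best (highest score) to worst."""
    return sorted(scores, key=scores.get, reverse=True)


# "Equal weighting" on the raw scales.
raw = {h: 0.5 * p + 0.5 * o for h, (p, o) in hospitals.items()}

# Equal weighting after dividing each component by its standard deviation.
sd_process = pstdev([p for p, _ in hospitals.values()])
sd_outcome = pstdev([o for _, o in hospitals.values()])
standardized = {h: 0.5 * p / sd_process + 0.5 * o / sd_outcome
                for h, (p, o) in hospitals.items()}

print("raw:         ", ranking(raw))
print("standardized:", ranking(standardized))
```

In this toy data the raw ranking simply reproduces the process ranking, while the standardized ranking moves hospital A (best process, worst outcome) from first to last.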

35 Goals for STS Composite Measure
- Heavily weight outcomes
  - Use statistical methods to account for small sample sizes & rare outcomes
- Make the implications of the weights as transparent as possible
- Assess whether inferences about hospital performance are sensitive to the choice of statistical / weighting methods

36 Outline
- Measure selection
- Data
- Latent variable approach to composite measures
- STS approach to composite measures

37 The STS Composite Measure for CABG – Criteria for Measure Selection
- Use Donabedian model of quality
  - Structure, process, outcomes
- Address three temporal domains
  - Preoperative, intraoperative, postoperative
- Choose measures that meet various criteria for validity
  - Adequately risk-adjusted
  - Adequate data quality

38 The STS Composite Measure for CABG – Criteria for Measure Selection + Endorsed by NQF Captured In STS

39 Process Measures
- Internal mammary artery (IMA)
- Preoperative betablockers
- Discharge antiplatelets
- Discharge betablockers
- Discharge antilipids

40 Risk-Adjusted Outcome Measures
- Operative mortality
- Prolonged ventilation
- Deep sternal infection
- Permanent stroke
- Renal failure
- Reoperation

41 NQF Measures Not Included In Composite
- Inpatient Mortality
  - Redundant with operative mortality
- Participation in a Quality Improvement Registry
- Annual CABG Volume

42 Other Measures Not Included in Composite
- HQID measures, not captured in STS
  - Antibiotics Selection & Timing
  - Post-op hematoma/hemorrhage
  - Post-op physiologic/metabolic derangement
- Structural measures
- Patient satisfaction
- Appropriateness
- Access
- Efficiency

43 Data
- STS database
- 133,149 isolated CABG operations during 2004
- 530 providers
- Inclusion/exclusion:
  - Exclude sites with >5% missing data on any process measures
  - For discharge meds – exclude in-hospital mortalities
  - For IMA usage – exclude redo CABG
- Impute missing data to negative (e.g. did not receive process measure)

44 Distribution of Process Measures in STS

45 Distribution of Outcomes Measures in STS

46 Latent Variable Approach to Composite Measures
- Psychometric approach
  - Quality is a “latent variable”
    - Not directly measurable
    - Not precisely defined
  - Quality indicators are the observable manifestations of this latent variable
  - Goal is to use the observed indicators to make inferences about the underlying latent trait

47 [Diagram: latent variable (“Quality”) linked to observed indicators X1, X2, X3, X4, X5]

48 Common Modeling Assumptions
- Case #1: A single latent trait
  - All variables measure the same thing (unidimensionality)
  - Variables are highly correlated (internal consistency)
  - Imperfect correlation is due to random measurement error
  - Can compensate for random measurement error by collecting lots of variables and averaging them
- Case #2: More than a single latent trait
  - Can identify clusters of variables that describe a single latent trait (and meet the assumptions of Case #1)
  - NOTE: Measurement theory does not indicate how to reduce multiple distinct latent traits into a single dimension
  - Beyond the scope of measurement theory
  - Inherently normative, not descriptive

49 Models for A Single Latent Trait “Latent Trait Logistic Model” Landrum et al. 2000

50 Example of latent trait logistic model applied to 4 medication measures
[Diagram: latent variable “Quality of Perioperative Medical Management” linked to observed indicators X1 to X4 (Discharge Betablocker, Preop Betablocker, Discharge Antiplatelets, Discharge Antilipids)]

51 Example of latent trait logistic model applied to 4 medication measures
[Diagram: “Quality of Perioperative Medical Management” linked to Discharge Betablocker, Preop Betablocker, Discharge Antiplatelets, Discharge Antilipids]

52 Technical Details of Latent Trait Analysis
- Q is an unobserved latent variable
- Goal is to estimate Q for each participant
- Use observed numerators and denominators

53 Latent trait logistic model

54 Latent Trait Analysis
Advantages:
- Quality can be estimated efficiently
  - Concentrates information from multiple variables into a single parameter
- Avoids having to determine weights

55 Latent Trait Analysis
Disadvantages:
- Hard for sites to know where to focus improvement efforts because weights are not stated explicitly
- Strong modeling assumptions
  - A single latent trait (unidimensionality)
  - Latent trait is normally distributed
  - One major assumption is not stated explicitly but can be derived by examining the model
    - 100% correlation between the individual items
    - A very unrealistic assumption!!
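The implied 100% correlation can be derived in a few lines. The model formula itself did not survive in this transcript, so the sketch below assumes a common form of the latent trait logistic model, in which the log-odds of success for participant i on item j is a linear function of a single latent trait Q_i; the symbols \alpha_j and \beta_j are notation introduced here for illustration.

```latex
\begin{align*}
\operatorname{logit}(p_{ij}) &= \alpha_j + \beta_j Q_i,
  \qquad Q_i \sim N(0, 1), \\
\operatorname{Cov}\bigl(\operatorname{logit}(p_{ij}),\,
  \operatorname{logit}(p_{ik})\bigr)
  &= \beta_j \beta_k \operatorname{Var}(Q_i) = \beta_j \beta_k, \\
\operatorname{Corr}\bigl(\operatorname{logit}(p_{ij}),\,
  \operatorname{logit}(p_{ik})\bigr)
  &= \frac{\beta_j \beta_k}{|\beta_j|\,|\beta_k|} = \pm 1 .
\end{align*}
```

Under this assumed form, every item's log-odds is a deterministic linear function of the same Q_i, so any two items are perfectly correlated across participants once sampling error is removed.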

56 Table 1. Correlation between hospital log-odds parameters under IRT model DISCHARGE ANTILIPIDS DISCHARGE BETABLOCKER PREOPERATIVE BETABLOCKER DISCHARGE ANTIPLATELETS DISCHARGE ANTILIPIDS DISCHARGE BETABLOCKER 0.50 DISCHARGE ANTILIPIDS DISCHARGE BETABLOCKER PREOPERATIVE BETABLOCKER DISCHARGE ANTIPLATELETS 1.00 DISCHARGE ANTILIPIDS 1.00 DISCHARGE BETABLOCKER 1.00 Table 2. Estimated correlation between hospital log-odds parameters Model did not fit the data

57 Model Also Did Not Fit When Applied to Outcomes INFECSTROKERENALREOPMORT VENT INFECT STROKE RENAL REOP 0.61

58 Latent Trait Analysis – Conclusions
- Model did not fit the data!
- Each measure captures something different
  - # latent variables = # of measures?
- Cannot use latent variable models to avoid choosing weights

59 The STS Composite Method

60
Step 1. Quality Measures are Grouped Into 4 Domains
Step 2. A Summary Score is Defined for Each Domain
Step 3. Hierarchical Models Are Used to Separate True Quality Differences From Random Noise and Case Mix Bias
Step 4. The Domain Scores are Standardized to a Common Scale
Step 5. The Standardized Domain Scores are Combined Into an Overall Composite Score by Adding Them

61 Preview: The STS Hospital Feedback Report

62 Step 1. Quality Measures Are Grouped Into Four Domains
Perioperative Medical Care Bundle
- Preop B-blocker
- Discharge B-blocker
- Discharge Antilipids
- Discharge ASA
Risk-Adjusted Morbidity Bundle
- Stroke
- Renal Failure
- Reoperation
- Sternal Infection
- Prolonged Ventilation
Operative Technique
- IMA Usage
Risk-Adjusted Mortality Measure
- Operative Mortality

63 Of Course Other Ways of Grouping Items Are Possible…
Taxonomy of Animals in a Certain Chinese Encyclopedia*
a) Those that belong to the Emperor
b) Embalmed ones
c) Tame ones
d) Suckling pigs
e) Sirens
f) Fabulous ones
g) Stray dogs
h) Those included in the present classification
i) Frenzied ones
j) Innumerable ones
k) Those drawn with a very fine camelhair brush
l) Others
m) Those that have just broken a water pitcher
n) Those that from a long way off look like flies
*According to Michel Foucault, The Order of Things, 1966

64 Step 2. A Summary Measure Is Defined for Each Domain
Domains: Perioperative Medical Care Bundle, Risk-Adjusted Morbidity Bundle, Operative Technique, Risk-Adjusted Mortality Measure
- Medications
  - “all-or-none” composite endpoint
  - Proportion of patients who received ALL four medications (except where contraindicated)
- Morbidities
  - “any-or-none” composite endpoint
  - Proportion of patients who experienced AT LEAST ONE of the five morbidity endpoints

65 All-Or-None / Any-Or-None
Advantages:
- No need to determine weights
- Reflects important values
  - Emphasizes systems of care
  - Emphasizes high benchmark
- Simple to analyze statistically
  - Using methods for binary (yes/no) endpoints
Disadvantages:
- Choice to treat all items equally may be criticized

66 Step 2. A Summary Measure Is Defined for Each Domain
Perioperative Medical Care Bundle | Proportion of patients who received all 4 medications
Risk-Adjusted Morbidity Bundle | Proportion of patients who experienced at least one major morbidity
Operative Technique | Proportion of patients who received an IMA
Risk-Adjusted Mortality Measure | Proportion of patients who experienced operative mortality
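The all-or-none and any-or-none endpoints can be sketched on a handful of made-up patient records. The record layout, the use of `None` for a contraindicated medication, and the function names are assumptions made for this illustration, not the STS data format.

```python
# Hypothetical patient records.
# meds: (preop betablocker, discharge betablocker, discharge antilipid,
#        discharge ASA); None marks a contraindicated medication.
# morbidities: (stroke, renal failure, reoperation, sternal infection,
#               prolonged ventilation).
patients = [
    {"meds": (True, True, True, True),
     "morbidities": (False, False, False, False, False)},
    {"meds": (True, False, True, True),
     "morbidities": (True, False, False, False, False)},
    {"meds": (None, True, True, True),
     "morbidities": (False, False, False, False, False)},
    {"meds": (True, True, True, True),
     "morbidities": (False, True, False, False, True)},
]


def all_or_none(records):
    """Proportion who received every medication they were eligible for."""
    met = sum(all(m for m in r["meds"] if m is not None) for r in records)
    return met / len(records)


def any_or_none(records):
    """Proportion who experienced at least one morbidity endpoint."""
    hit = sum(any(r["morbidities"]) for r in records)
    return hit / len(records)


print(all_or_none(patients))  # 3 of 4 patients received all eligible meds
print(any_or_none(patients))  # 2 of 4 patients had at least one morbidity
```

Both endpoints reduce a bundle to a single yes/no value per patient, which is why methods for binary endpoints apply directly.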

67 Step 3. Use Hierarchical Models to Separate True Quality Differences from Random Noise
- proportion of successful outcomes = numerator / denominator = “true probability” + random error
- Hierarchical models estimate the true probabilities
Variation in performance measures = Variation in true probabilities + Variation caused by random error

68 Example of Hierarchical Models Figure. Mortality Rates in a Sample of STS Hospitals

69 Step 3. Use Hierarchical Models to Separate True Quality Differences from Case Mix
Variation in performance measures = Variation in true probabilities + Variation caused by random error
Variation in true probabilities = Variation caused by the hospital + Variation caused by case mix
“risk-adjusted” mortality/morbidity

70 Advantages of Hierarchical Model Estimates
- Less variable than a simple proportion
  - Shrinkage
- Borrows information across hospitals
  - Our version also borrows information across measures
- Adjusts for case mix differences
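Shrinkage can be illustrated with a deliberately simplified stand-in. The sketch below is not the Bayesian hierarchical model described in this talk: it has no risk adjustment, it does not borrow information across measures, and the `prior_strength` value is an arbitrary assumption rather than something estimated from data. It only shows how pulling each hospital's rate toward the overall rate stabilizes small-sample estimates.

```python
def shrunken_rates(events, cases, prior_strength=200):
    """Pull each hospital's observed rate toward the pooled rate.

    prior_strength acts like a number of pseudo-patients at the pooled
    rate; it is a hypothetical tuning value, not a fitted parameter.
    """
    overall = sum(events) / sum(cases)
    return [(e + prior_strength * overall) / (n + prior_strength)
            for e, n in zip(events, cases)]


# Hypothetical mortality counts for a small, medium and large hospital.
events = [0, 3, 30]
cases = [20, 50, 1000]

raw = [e / n for e, n in zip(events, cases)]
overall = sum(events) / sum(cases)
shrunk = shrunken_rates(events, cases)

for r, s, n in zip(raw, shrunk, cases):
    print(f"n={n:5d}  raw={r:.3f}  shrunken={s:.3f}")
```

The hospital with 20 cases and no deaths is moved most of the way toward the pooled rate, while the hospital with 1000 cases barely moves.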

71 Estimated Distribution of True Probabilities (Hierarchical Estimates)

72 Step 4. The Domain Scores Are Standardized to a Common Scale

73 Step 4a. Consistent Directionality
Directionality… Needs to be consistent in order to sum the measures
Solution… Measure success instead of failure

74 Step 4a. Consistent Directionality

75 Step 4b. Standardization Notation Each measure is re-scaled by dividing by its standard deviation (sd)

76 Step 4b. Standardization Each measure is re-scaled by dividing by its standard deviation (sd)

77 Step 5. The Standardized Domain Scores Are Combined By Adding Them

78 …then rescaled again (for presentation purposes) (This guarantees that final score will be between 0 and 100.)
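Steps 4 and 5 can be sketched as follows. The slide formulas are not in this transcript, so the details are assumptions: each domain is a success rate between 0 and 1, the standard deviation is taken across hospitals, and the final rescaling divides by the sum of the 1/sd factors. That last choice turns the composite into a weighted average of success rates, which is one way to keep it between 0 and 100. All data values and names are hypothetical.

```python
from statistics import pstdev


def composite_scores(domain_scores):
    """domain_scores maps domain name -> list of success rates per hospital.

    Returns (composite score per hospital on a 0-100 scale, implied weights).
    """
    # Step 4b: re-scale each domain by dividing by its standard deviation.
    inverse_sd = {d: 1.0 / pstdev(v) for d, v in domain_scores.items()}
    # Rescale so the implied weights sum to 1 (assumed rescaling).
    total = sum(inverse_sd.values())
    weights = {d: w / total for d, w in inverse_sd.items()}
    # Step 5: add the standardized domain scores.
    n_hospitals = len(next(iter(domain_scores.values())))
    scores = [
        100.0 * sum(weights[d] * domain_scores[d][i] for d in domain_scores)
        for i in range(n_hospitals)
    ]
    return scores, weights


# Hypothetical success rates for four hospitals (success, not failure).
domains = {
    "avoid_mortality": [0.985, 0.980, 0.975, 0.970],
    "avoid_morbidity": [0.90, 0.86, 0.88, 0.82],
    "ima_use": [0.98, 0.90, 0.95, 0.85],
    "all_medications": [0.70, 0.40, 0.60, 0.30],
}

scores, weights = composite_scores(domains)
print([round(s, 1) for s in scores])
print({d: round(w, 3) for d, w in weights.items()})
```

Because mortality varies least across hospitals in this toy data, it receives the largest implied weight.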

79 Distribution of Composite Scores (Fall 2007 harvest data. Rescaled to lie between 0 and 100.)

80 Goals for STS Composite Measure
- Heavily weight outcomes
  - Use statistical methods to account for small sample sizes & rare outcomes
- Make the implications of the weights as transparent as possible
- Assess whether inferences about hospital performance are sensitive to the choice of statistical / weighting methods

81 Exploring the Implications of Standardization
If items were NOT standardized
- Items with a large scale would disproportionately influence the score
  - example: medications would dominate mortality
- A 1% improvement in mortality would have the same impact as 1% improvement in any other domain

82 Exploring the Implications of Standardization
After standardizing
- A 1-point difference in mortality has same impact as:
  - 8% improvement in morbidity rate
  - 11% improvement in use of IMA
  - 28% improvement in use of all medications
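The equivalences on this slide can be turned into approximate relative weights. This is a back-of-the-envelope sketch that reads each percentage as percentage points needed to match a 1-point change in mortality; that reading, and the variable names, are assumptions.

```python
# Points of improvement in each domain with the same impact on the
# composite as a 1-point improvement in mortality (from the slide).
equivalent_points = {
    "mortality": 1,
    "morbidity": 8,
    "ima_use": 11,
    "all_medications": 28,
}

# Impact per point is inversely proportional to the equivalent points.
impact_per_point = {d: 1 / pts for d, pts in equivalent_points.items()}
total = sum(impact_per_point.values())
implied_weights = {d: v / total for d, v in impact_per_point.items()}

for domain, weight in implied_weights.items():
    print(f"{domain:16s} {weight:.1%}")

outcome_share = implied_weights["mortality"] + implied_weights["morbidity"]
print(f"outcomes combined: {outcome_share:.1%}")
```

Under this reading, mortality carries roughly 80% of the weight and the two outcome domains together roughly 90%.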

83 Composite is weighted toward outcomes…

84 Sensitivity Analyses
Key Question
- Are inferences about hospital quality sensitive to the choice of methods?
  - If not, then stakes are not so high…
Analysis
- Calculate composite scores using a variety of different methods and compare results

85 Sensitivity Analysis: Within-Domain Aggregation
Opportunity Model vs. All-Or-None Composite
- Agreement between methods
  - Spearman rank correlation = 0.98
  - Agree w/in 20 %-tile pts = 99%
  - Agree on top quartile = 93%
  - Pairwise concordance = 94%
- 1 hospital’s rank changed by 23 percentile points
- No hospital was ranked in the top quartile by one method and bottom half by the other
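Agreement statistics of this kind can be computed as in the sketch below. The exact definitions used in the analysis are not given in this transcript, so the functions reflect one plausible reading of each metric, assume no tied scores, and run on made-up scores for eight hospitals.

```python
from itertools import combinations


def ranks(scores):
    """Rank 1 = lowest score. Assumes there are no ties."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(a, b):
    """Spearman rank correlation (no-ties formula)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def pairwise_concordance(a, b):
    """Share of hospital pairs ordered the same way by both methods."""
    pairs = list(combinations(range(len(a)), 2))
    same = sum((a[i] - a[j]) * (b[i] - b[j]) > 0 for i, j in pairs)
    return same / len(pairs)


def top_quartile_agreement(a, b):
    """Share of hospitals given the same top-quartile status by both."""
    cutoff = 0.75 * len(a)
    return sum((x > cutoff) == (y > cutoff)
               for x, y in zip(ranks(a), ranks(b))) / len(a)


# Hypothetical composite scores for 8 hospitals under two methods.
method_a = [71, 64, 80, 55, 90, 60, 77, 68]
method_b = [70, 66, 79, 57, 91, 58, 81, 69]

print(round(spearman(method_a, method_b), 3))
print(round(pairwise_concordance(method_a, method_b), 3))
print(top_quartile_agreement(method_a, method_b))
```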

86 Sensitivity Analysis: Method of Standardization
Divide by the range instead of the standard deviation
where range denotes the maximum minus the minimum (across hospitals)

87 Sensitivity Analysis: Method of Standardization Don’t standardize

88 Sensitivity Analysis: Summary
- Inferences about hospital quality are generally robust to minor variations in the methodology
- However, standardizing vs. not standardizing has a large impact on hospital rankings

89 Performance of Hospital Classifications Based on the STS Composite Score
- Bottom Tier
  - ≥ 99% Bayesian probability that provider’s true score is lower than STS average
- Top Tier
  - ≥ 99% Bayesian probability that provider’s true score is higher than STS average
- Middle Tier
  - < 99% certain whether provider’s true score is lower or higher than STS average.
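The tier rule can be sketched given posterior draws of a provider's true composite score. The draws below are simulated from a normal distribution purely for illustration, and the STS average is treated as a fixed number; in a full Bayesian analysis the average would itself vary from draw to draw. The function and variable names are hypothetical.

```python
import random


def classify(posterior_draws, sts_average, threshold=0.99):
    """Assign a tier from the posterior probability of being above or
    below the STS average."""
    n = len(posterior_draws)
    p_higher = sum(d > sts_average for d in posterior_draws) / n
    p_lower = sum(d < sts_average for d in posterior_draws) / n
    if p_higher >= threshold:
        return "top"
    if p_lower >= threshold:
        return "bottom"
    return "middle"


# Simulated stand-ins for MCMC output (hypothetical values).
rng = random.Random(0)


def simulate_draws(mean, sd=0.01, n=2000):
    return [rng.gauss(mean, sd) for _ in range(n)]


sts_average = 0.85
print(classify(simulate_draws(0.95), sts_average))  # far above average
print(classify(simulate_draws(0.85), sts_average))  # indistinguishable
print(classify(simulate_draws(0.75), sts_average))  # far below average
```

A provider lands in the middle tier whenever the data cannot establish, with 99% posterior probability, which side of the average it is on.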

90 Results of Hypothetical Tier System in 2004 Data

91 Ability of Composite Score to Discriminate Performance on Individual Domains

92 Summary of STS Composite Method
- Use of all-or-none composite for combining items within domains
- Combining items was based on rescaling and adding
- Estimation via Bayesian hierarchical models
- Hospital classifications based on Bayesian probabilities

93 Advantages
- Rescaling and averaging is relatively simple
  - Even if estimation method is not
- Hierarchical models help separate true quality differences from random noise
- Bayesian probabilities provide a rigorous approach to accounting for uncertainty when classifying hospitals
  - Control false-positives, etc.

94 Limitations
- Validity depends on the collection of individual measures
  - Choice of measures was limited by practical considerations (e.g. available in STS)
  - Measures were endorsed by NQF
- Weak correlation between measures
  - Reporting a single composite score entails some loss of information
  - Results will depend on choice of methodology
  - We made these features transparent
    - Examined implications of our choices
    - Performed sensitivity analyses

95 Summary
- Composite scores have inherent limitations
- The implications of the weighting method are not always obvious
- Empirical testing & sensitivity analyses can help elucidate the behavior and limitations of a composite score
- The validity of a composite score depends on its fitness for a particular purpose
  - Possibly different considerations for P4P vs. public reporting

96 Extra Slides

97 Comparison of Tier Assignments Based on Composite Score Vs. Mortality Alone

98 EXTRA SLIDES – STAR RATINGS VS VOLUME

99 Frequency of Star Categories By Volume

100 EXTRA SLIDES – HQID METHOD APPLIED TO STS MEASURES

101 Finding #1. Composite Is Primarily Determined by Outcome Component

102 Finding #2. Individual Measures Do Not Contribute Equally to Composite

103 Explanation: Process & Survival Components Have Unequal Measurement Scales
Range

104 EXTRA SLIDES – CHOOSING MEASURES

105 Process or Outcomes? Processes that impact patient outcomes Processes that are currently measured

106 Process or Outcomes? Processes that impact patient outcomes Outcomes Randomness

107 Structural Measures? Structure

108 EXTRA SLIDES – ALTERNATE PERSPECTIVES FOR DEVELOPING COMPOSITE SCORES

109 Perspectives for Developing Composites
- Normative Perspective
  - Concept being measured is defined by the choice of measures and their weighting
    - Not vice versa
  - Weighting different aspects of quality is inherently normative
    - Weights reflect a set of values
    - Whose values?

110 Perspectives for Developing Composites
- Behavioral Perspective
  - Primary goal is to provide an incentive
  - Optimal weights are ones that will cause the desired behavior among providers
  - Issues:
    - Reward outcomes or processes?
    - Rewarding X while hoping for Y

111 SCRAP

112 Domain- specific scores Overall composite score 3-star rating categories Graphical display of STS distribution Score + confidence interval
