Causal Inference Methods for Credible and Reliable Real World Evidence


1 Causal Inference Methods for Credible and Reliable Real World Evidence
Douglas Faries, Sr. Research Fellow, Real World Analytics, Global Statistical Science, Eli Lilly and Company
Xiang Zhang, Sr. Research Scientist, Real World Analytics, Global Statistical Science, Eli Lilly and Company

2 Outline
- An introduction to real world evidence
- Basic causal inference
  - Potential outcome framework
  - Rubin's Causal Model / Pearl's Causal Model
  - Propensity score
- Feasibility and balance assessment
- Methods to address time-independent confounding
  - Matching
  - Propensity score stratification
  - Weighting
- Methods to address time-dependent confounding
  - Marginal Structural Models
- New approach: model averaging method
- Evaluating the impact of unmeasured confounding

3 SAS Press Books: Update to Appear January 2020
Authors: Douglas E. Faries, Xiang Zhang, Zbigniew Kadziola, Robert L. Obenchain, Uwe Siebert, Felicitas Kuehne, Josep Maria Haro

4 Real World Evidence: The Interest is Very REAL
Quiz: how many results would you get if you searched "real world evidence" in Google?
- 661,000,000 results
- 112,000,000 results in Google News
- 26,600,000 results in Google Videos
Headlines in the news:
- Why All The Talk About Real-World Evidence?
- Is There Evidence in Real-World Evidence?
- FDA approval for Ibrance in men with breast cancer sets precedent for use of real-world evidence

5 Use of RWE in Clinical Development
Talking point: RWE can serve many purposes, but the rest of this course focuses on causal inference in analyzing real world data.

6 Growing Interest in RWE Among Regulators
- FDA published a framework for its RWE program.
- NMPA (National Medical Products Administration, China) published a draft guidance on using real-world evidence to support drug development.
- The European Union funded several initiatives to support EMA's regulatory decision-making on medicines, including the Innovative Medicines Initiative (IMI).
- Regulators have a long history of using RWE to monitor and evaluate the safety of drug products after they are approved, for instance through FDA's Sentinel Initiative.
- There is an increasing trend among regulators to evaluate the potential use of RWD to generate RWE in support of product effectiveness for new indications, or to help support or satisfy post-approval study requirements.
- FDA, EMA, PMDA, and NMPA are formally sponsoring programs to conduct research and develop guidance on how RWE can be reliably applied in a broader regulatory context.

7 Real World Data and Real World Evidence
FDA published a framework for its Real-World Evidence program in 2018 and defines real world data and real world evidence as follows:
- Real-World Data (RWD) are data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources.
- Real-World Evidence (RWE) is the clinical evidence about the usage and potential benefits or risks of a medical product derived from analysis of RWD.
Examples of RWD include:
- Electronic health records (EHRs)
- Medical claims and billing data
- Data from product and disease registries
- Patient-generated data, including from in-home-use settings
- Data gathered from other sources that can inform on health status, such as mobile devices

8 Real World Evidence Equation
RW Research Questions + RW Design & Analytics + RW Data = RW Evidence
Talking points:
- With the growing availability and use of RWE, the Executive Committee commissioned a cross-functional team to develop a Lilly strategy for coordinated use to maximize the value of RWE for Lilly.
- See the equation: real world evidence is more than just access to data; it requires a business question and appropriate design and analytics that address the challenges of RW data (e.g., the lack of randomization).
- As part of this effort we are focusing on the capabilities we need for "Real World Analytics". This includes (a) expanding our capabilities to create a competitive advantage for Lilly (RWA within BU fits this); (b) research to establish best practices in targeted key areas of greatest need, especially to support access of our medications with payer customers; and (c) investigation of analytic tools to bring greater value from our big data resources (such as claims data).
- RWD is necessary but not sufficient for generating RWE.
Duke-Margolis Whitepaper: A Framework for Regulatory Use of Real World Evidence; Sept 13, 2017

9 Potential Outcome
Scenario: a patient was ill, he took drug A, and he got better. Does this mean drug A had a causal effect on the patient? Not necessarily!
Let's formalize the scenario: T=1 means taking drug A while T=0 means not; Y=1 means the patient got better while Y=0 means not. The scenario we observed is:
T=1, Y=1
However, we did not (and cannot) observe the "counterfactual outcome" of this patient, which is:
T=0, Y=?

10 Potential Outcome
In fact, considering both treatment options (T=1 or 0) and both possible outcomes (Y=1 or 0), there are 4 possible causal effect scenarios:

Y(T=1) | Y(T=0) | Interpretation
1 | 1 | No causal drug effect on the patient
0 | 0 | No causal drug effect on the patient
1 | 0 | Causal drug effect on the patient
0 | 1 | Causal drug effect on the patient

No causal effect means that whether the patient takes drug A or not, the outcome is the same.

11 Define Causal Treatment Effect
Given the potential outcome notation, we can define the individual causal effect:

$Y(T=1) - Y(T=0)$

Though it can be defined, the individual causal effect is NOT observable, because we can only observe one potential outcome of the same subject while keeping other confounders unchanged. We can, however, define and estimate causal effects at the population level. For instance, the average treatment effect (ATE) is defined as:

$ATE = \mathrm{E}[Y_i(T=1)] - \mathrm{E}[Y_i(T=0)]$

where $Y_i(\cdot)$ represents the potential outcome of the i-th subject. To estimate the ATE, we need to estimate the counterfactual outcomes of both the treatment and control groups. Common causal estimands include the ATE, ATT, and CATE.
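A minimal sketch, not from the slides, illustrating the point above: each subject reveals only one potential outcome, yet under randomization the group difference recovers the ATE. All values here are simulated and illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Potential outcomes for every subject (never jointly observed in practice)
y0 = rng.binomial(1, 0.40, n)   # Y_i(T=0): outcome if untreated
y1 = rng.binomial(1, 0.55, n)   # Y_i(T=1): outcome if treated
true_ate = (y1 - y0).mean()     # E[Y(1)] - E[Y(0)]

# Randomized assignment: each subject reveals only one potential outcome
t = rng.binomial(1, 0.5, n)
y_obs = np.where(t == 1, y1, y0)

est_ate = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(f"true ATE {true_ate:.3f}, randomized estimate {est_ate:.3f}")
```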

12 The Role of Randomization in Causal Inference
R. A. Fisher wrote a series of papers and books in the 1920s and 1930s on randomized experiments, which made a causal interpretation of the relationship between treatments and outcomes possible. When comparing treatment effects between treatment and control groups, randomization removes the systematic distortions that would bias the causal treatment effect estimates.
Back to the ATE case: with perfect randomization, the control group provides the counterfactual outcomes for the observed performance in the treatment group, so the causal effect can be estimated.
Perfect randomization = balance on ALL confounders, observed and unobserved.
For a long time, statisticians, even great pioneers like Francis Galton and Karl Pearson, tended not to talk about causation but rather association or correlation.

13 Statistical Challenges in Observational Studies
[Diagram: measured and unmeasured confounders and time-dependent treatment influence the treatment-outcome relationship; candidate adjustments include regression, propensity matching, propensity stratification, and inverse propensity weighting.]
Causal inference: "differences are due to the treatment and not other factors" (examples to follow).
Unlike in an RCT, treatment assignments in real world observational studies are not randomized but are usually influenced by confounders prior to treatment initiation; therefore the estimated causal treatment effect can be biased without proper confounder adjustment.

14 Rubin's Causal Model
The potential outcome (PO) notation was proposed by Neyman in 1923 to explain causal effects in randomized experiments. The notation saw little use until the late 1970s, when D. Rubin adopted it for causal inference in non-randomized studies ("Rubin's Causal Model", or RCM).
Under RCM, the focus of causal inference is to mimic randomization when randomization is not feasible, and the propensity score can be viewed as the true (but unknown) probability of assignment to treatment in a non-randomized study:

$PS = \Pr(T = 1 \mid \text{observed baseline confounders})$

In an RCT, the assignment probabilities to treatments are known and fixed. For example, in a completely randomized design comparing 2 treatments, the assignment probability to the treatment of interest is 1/2.

15 Rubin's Causal Model
Key assumptions for valid causal inference under RCM:
- Stable Unit Treatment Value Assumption (SUTVA): the potential outcomes for any subject do not vary with the treatments assigned to other subjects, and, for each subject, there are no different forms or versions of each treatment level that would lead to different potential outcomes.
- Positivity: the probability of assignment to each treatment is strictly between 0 and 1 for every subject.
- Unconfoundedness: the assignment to treatment for each subject is independent of the potential outcomes, given a set of pre-treatment confounders.

16 Pearl's Causal Model
The directed acyclic graph (DAG) approach, part of Pearl's Causal Model (PCM), is another method commonly used for causal inference. A DAG encodes the assumptions about the data-generating process: nodes (vertices) are random variables, and directed edges (arrows) indicate causal relationships between variables. Time-dependent confounding can also be incorporated in a DAG as an intermediate step.

17 Pearl's Causal Model
The main usage of a DAG is to identify:
- potential biases,
- variables that need to be adjusted for, and
- methods that need to be applied to obtain unbiased causal effects.
Potential biases include time-independent confounding, time-dependent confounding, unmeasured confounding, and conditioning on a "collider" variable (a node receiving arrows from two variables, e.g. X1 and X2).
The validity of causal inference under PCM also requires key assumptions:
- Consistency
- Positivity
- No unmeasured confounding
- Other, more technical assumptions such as d-separation

18 Example Study
REFLECTIONS Study: Real World Examination of Fibromyalgia: Longitudinal Evaluation of Cost and Treatments (Robinson et al. 2012; Peng et al. 2015)
- Prospective observational study of fibromyalgia patients initiating pharmacological treatment
- Primary outcome: pain severity (BPI) at 1 year post initiation
- Plasmode simulated data: 1000 'patients'

19 Adjustment for Time-independent Confounding
Regression is widely used to assess the association between a set of variables and the outcome of interest, and estimated regression coefficients were sometimes interpreted as causal effects (Yule, 1895, 1897, 1899). Why not just use regression?
- The treatment assignment mechanism is lost in regression modeling.
- If the outcome is rare, the regression model may be unstable.
- If there are substantial differences between treatment and control groups, a meaningful comparison may only be available in a small overlapping subpopulation. A regression model, however, may proceed without flagging these differences, so that the fitted model interpolates between two very distinct populations.
Propensity score methods (Rosenbaum and Rubin, 1983) have been widely accepted as the "gold standard" for inferring causality in non-randomized studies.
New directions: model averaging.

20 Quality Steps: Bind and Rubin (2017)
1. Conceptual: conceptualize as an RCT; define the estimand
2. Design: model / feasibility / balance ('outcome free')
3. Analysis: pre-planned, including sensitivity analyses; ad hoc
4. Conclusions: causal conclusions
Step 2 is where we will spend most of our time today: planning the analysis and looking at baseline data to confirm feasibility, adjustment, and balance. All of this is OUTCOME FREE (no access to the outcome data), which is a key piece of the Rubin quality approach. The other thing to emphasize is Step 1: conceptualization as an RCT helps guide design and analytic decisions, and clearly stating an estimand brings clarity and focus to the analysis plan.

21 Steps for Estimating Propensity Scores
1. Select covariates for the propensity model
2. Address missing covariate values
3. Select the modeling method

22 Selecting Variables: DAG
[Diagram: covariates of type A (predictive of treatment only), B (predictive of treatment and outcome), and C (predictive of outcome only), with arrows into Treatment and/or Outcome.]
Which kinds of covariates should be included in the PS model?

23 Choosing A Propensity Model: Literature
Area of agreement: NO post-index variables (no variables influenced by treatment).
Recommendations from the literature:
- In theory, only true confounders are needed (B only): if a covariate is associated with neither treatment selection nor the outcome, it should not be included in the model (Rubin, 2001). Do NOT include colliders (Pearl, 2000) or instrumental variables (Ding, VanderWeele, and Robins, 2017).
- Inclusive (A, B, and C): avoid excluding a confounder; collinearity is not an issue (Stuart 2010).
- Be selective (B and C): Brookhart et al. (2006) conducted simulation studies confirming B and C as the "optimal" set among the three choices considered (A; B and C; B and C-).
- Sort of selective / sort of inclusive: Imbens and Rubin (2015) recommend B and C and A- (variables that are correlated with treatment options).

24 Current Best Practice
Which covariates to include?
- B & C, guided by a DAG (measured and unmeasured)
- Include interactions! (discussed later)
Be careful:
- Time factor: "channeling bias" (Petri and Urquhart, 1991)
- Cohorts not based on interventions
- Avoid variables influenced by treatment
In applied research, check whether the impacts of these variables are consistent over time; if they are not, a variable indicating different time periods could affect the treatment assignment. In epidemiological research this situation is called "channeling bias" (Petri and Urquhart, 1991), and calendar-time-specific propensity score methods (Mack et al., 2013; Dusetzina et al., 2013) were proposed to incorporate temporal influence on the intervention assignment.

25 Addressing Missing Covariate Values
- Complete cases: exclude patients with missing values.
- Missing category indicator (see the sketch after this list):
  - For a categorical variable, treat the missing value as an additional category.
  - For a continuous variable, impute the missing value with the marginal mean and add a dummy variable indicating an imputed value.
- Missing pattern (MP): fit separate regressions for the propensity score within each distinct missingness pattern (D'Agostino, 2001).
- Multiple imputation (MI): randomly impute missing values multiple times by sampling from the posterior predictive distribution of the missing values given the observed values, creating a series of "complete" data sets. Mitra and Reiter (2011) suggest using the averaged estimated propensity score from the multiply imputed data for a single outcome analysis.
- MIMP: a combination of MI and MP (Qu & Lipkovich, 2010).
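A minimal sketch of the missing-category-indicator approach described above, assuming a pandas DataFrame with one categorical and one continuous covariate; all column names and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["N", None, "S", "S", None],         # categorical covariate
    "bmi":    [24.1, 27.5, np.nan, 31.0, np.nan],  # continuous covariate
})

# Categorical: treat missingness as an extra category
df["region"] = df["region"].fillna("MISSING")

# Continuous: impute the marginal mean and add a dummy flagging imputed values
df["bmi_missing"] = df["bmi"].isna().astype(int)
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
print(df)
```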

26 Estimating the Propensity Scores – What Method?
Logistic regression: a priori vs automated
- Standard automated algorithms (stepwise, ...)
- PS model fitting algorithms (Dehejia and Wahba 1999; Rosenbaum and Rubin 1984; Stuart 2003; Imbens and Rubin 2015): incorporate interactions; designed to optimize balance on main effect and interaction terms
CART – gradient boosting (McCaffrey 2004, 2013; Westreich 2010)
- Outperforms other CART/ML approaches
- No need to specify interactions or model form
- Stopping criteria to maximize balance (minimize ASAM)
Speaker note: will give a high-level overview of using CART for estimating the PS, without detail on why the gradient boosting version of CART is good (it fits one tree, then fits the residuals from the previous tree, etc.).
Recommendation: automated, for interactions and balance. A sketch of both routes follows below.
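A minimal sketch of the two estimation routes above: an a priori logistic regression and a gradient-boosted alternative that captures interactions automatically. The data, hyperparameters, and variable names are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # baseline covariates
# Simulated treatment with a main effect and an interaction
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1] * X[:, 2])))

# A priori logistic model (interaction terms would need to be added by hand)
ps_logit = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# Gradient boosting: no need to pre-specify interactions or model form
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(X, t)
ps_gbm = gbm.predict_proba(X)[:, 1]
```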

27 Feasibility & Balance
- Ensure the planned analysis of the estimand is feasible given the data (population, outcome, inter-current events, summary).
- Ensure the model adequately balances the covariates: "the success of the propensity score modeling is judged by whether balance on pretreatment characteristics is achieved between the treatment and control groups …" (D'Agostino 2007).
How one chooses the propensity score model is not that important. The important things are:
- making sure you include variables that might be confounders, and
- making sure the propensity score produces balanced treatment groups.

28 Assessing the Balance
"the success of the propensity score modeling is judged by whether balance on pretreatment characteristics is achieved between the treatment and control groups …" (D'Agostino 2007)
- Standardized differences: standard and common (Austin 2009); rule of thumb: |sdm| < 0.1 is good.
  $sdm = \dfrac{\bar{x}_A - \bar{x}_B}{\sqrt{(s_A^2 + s_B^2)/2}}$
- Variance ratios: balanced means alone are insufficient; rule of thumb: 0.5 < VR < 2.0 (Austin 2009).
  $vr = \left( \dfrac{s_{x_A}}{s_{x_B}} \right)^2$
- Full distribution assessment (graphical).
Check balance on interactions, not just main effects.
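A minimal sketch of the two balance checks above for a single covariate; the function names are illustrative.

```python
import numpy as np

def sdm(x_a: np.ndarray, x_b: np.ndarray) -> float:
    """Standardized difference of means; |sdm| < 0.1 is the rule of thumb."""
    pooled_sd = np.sqrt((x_a.var(ddof=1) + x_b.var(ddof=1)) / 2)
    return (x_a.mean() - x_b.mean()) / pooled_sd

def variance_ratio(x_a: np.ndarray, x_b: np.ndarray) -> float:
    """Variance ratio; 0.5 < vr < 2.0 is the rule of thumb cited above."""
    return x_a.var(ddof=1) / x_b.var(ddof=1)
```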

29 Balance Assessment Example: Standardized Difference of Means (SDM)

30 Feasibility: Mirrored Histograms (Simulated REFLECTIONS Data)

31 Feasibility (Overlap) Statistics
- Standardized difference of means (SDM) of the propensity scores:
  $sdm = \dfrac{\bar{ps}_{treatment} - \bar{ps}_{control}}{\sqrt{(s_{treatment}^2 + s_{control}^2)/2}}$
- Walker's preference score (equipoise), where P is the overall proportion treated:
  $\ln\dfrac{F}{1-F} = \ln\dfrac{PS}{1-PS} - \ln\dfrac{P}{1-P}$
- Tipton's index:
  $TI = \sum_{j=1}^{k} \sqrt{w_{Aj} w_{Bj}}$
- Proportion of near matches: % with a match (with replacement) within a caliper
- Variance ratios
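A sketch of Tipton's index as defined above: the overlap of the binned propensity-score distributions of the two groups, where 1.0 indicates identical distributions. The bin count is an illustrative choice.

```python
import numpy as np

def tipton_index(ps_treated, ps_control, k: int = 20) -> float:
    bins = np.linspace(0.0, 1.0, k + 1)
    w_a, _ = np.histogram(ps_treated, bins=bins)   # counts per PS bin, group A
    w_b, _ = np.histogram(ps_control, bins=bins)   # counts per PS bin, group B
    w_a = w_a / w_a.sum()                          # convert to proportions
    w_b = w_b / w_b.sum()
    return float(np.sum(np.sqrt(w_a * w_b)))       # TI = sum_j sqrt(w_Aj * w_Bj)
```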

32 What if Little Overlap?
[Figure: SSRI example with limited propensity score overlap.]

33 Basic Methods for Implementing PS
- Matching (the focus of today)
- Stratification
- Inverse probability weighting
- Regression

34 Matching (modification of Stuart 2015)
- Distance measure: (logit) propensity score / Mahalanobis distance / exact / ... ("which one is 'closest' to me?")
- Number of matches: 1:1 or 1:k vs k1:k2; matching with or without replacement
- Calipers: set a minimum quality for each match
- Method: algorithm for determining the match: greedy / optimal / full

35 Nearest Neighbour (Greedy) Matching
- Most frequently used matching algorithm
- Does not optimize any overall measure of balance
- Gives a different match each time you sort the data set
Example propensity scores: Trt A: .57 .40 .34 .31; Trt B: .55 .53 .49 .49 .39. Greedy matching yields an average absolute imbalance of 0.090.

36 Nearest Neighbour (Greedy) Matching with Caliper
Now with a caliper of 0.1, on the same example (Trt A: .57 .40 .34 .31; Trt B: .55 .53 .49 .49 .39): only matches within the caliper are kept, and the average absolute imbalance drops to 0.015, at the cost of fewer matched pairs.
Variance/bias tradeoff: never match without a caliper. A sketch reproducing both results follows below.
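A minimal sketch of greedy 1:1 matching without replacement, using the propensity scores from the two slides above; with no caliper it reproduces the 0.090 average imbalance, and with a 0.1 caliper the 0.015.

```python
import numpy as np

def greedy_match(ps_a, ps_b, caliper=None):
    available = list(range(len(ps_b)))
    pairs = []
    for i, p in enumerate(ps_a):                 # order matters for greedy
        if not available:
            break
        j = min(available, key=lambda j: abs(ps_b[j] - p))  # nearest neighbour
        if caliper is None or abs(ps_b[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)                  # match without replacement
    return pairs

trt_a = [0.57, 0.40, 0.34, 0.31]
trt_b = [0.55, 0.53, 0.49, 0.49, 0.39]
for cal in (None, 0.1):
    pairs = greedy_match(trt_a, trt_b, caliper=cal)
    imb = np.mean([abs(trt_a[i] - trt_b[j]) for i, j in pairs])
    print(f"caliper={cal}: {len(pairs)} pairs, avg abs imbalance {imb:.3f}")
```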

37 Optimal Matching
Optimal matching (Rosenbaum 2002; Hansen 2004):
- Minimizes the sum of absolute differences in the distance measure
- Does not depend on the order of the dataset
Same example (Trt A: .57 .40 .34 .31; Trt B: .55 .53 .49 .49 .39): average imbalance 0.085.

38 Full (Optimal) Matching
Optimal full matching (Hansen 2004) also allows 1:many and many:1 matches.
Same example (Trt A: .57 .40 .34 .31; Trt B: .55 .53 .49 .49 .39): average imbalance 0.051.

39 More (Limited) Guidance: Matching
Analysis of a 2-cohort comparison:
- Outcome analysis: paired (Austin, 2007) or unpaired (Schafer and Kang, 2008; Stuart 2010)
- Regression for residual imbalance
- Variance estimation of the estimated causal effect:
  - Matching without replacement: variance estimator from independent-sampling inference methods, variance estimator from paired-sampling inference methods, or the bootstrap
  - Matching with replacement: wild bootstrap
More than 2 cohorts?
- Yang 2016 [generalized propensity score matching]
- McCaffrey 2013 [generalized propensity score weighting]
- Lopez & Gutman [vector matching]
Speaker note: there is debate in the literature on whether the analysis should be paired or unpaired. I vote for unpaired, as the propensity score is not designed to produce great matches on an individual-pair basis, but on average across a population.

40 Matching Analysis Results
Simulated REFLECTIONS data:

Method | N1 | N2 | Est. Trt Effect | P-Value
Un-Matched | 240 | 758 | -0.34 | .01
1:1 Greedy (linear PS) | 237 | 238 | -0.07 | .68
1:1 Optimal + exact (linear PS) | | | -0.22 | .23
1:1 Optimal + caliper (MH) | | | -0.03 | .88
Variable Ratio Matching (linear PS) | '476' | | -0.02 | .93
Full Optimal Matching (linear PS) | '417' | | | .81

41 Matching Decisions – Guidance (Limited)
Guidance is difficult because choices are situation specific:
- Distance measure: rank-based Mahalanobis distance (Rosenbaum 2010); exact matching on the most important covariates, then logit PS (Imbens & Rubin)
- Number of matches: consider your estimand (ATT or ATE?); best balance with 1:1 or full matching; if multiple, prefer variable ratio over 1:k; match with replacement when N is small or there is competition for controls
- Calipers: 0.2 SD of the logit of the PS (Rosenbaum 2010; Austin 2011)
- Method: greedy is often sufficient, but optimal gives better-matched pairs and suits smaller-N studies (Stuart 2010)
Bias-variance tradeoff throughout.

42 Propensity Score Stratification
1. Estimate the propensity score for each patient.
2. Group propensity scores into homogeneous strata.
3. Analysis: estimate the treatment effect within each stratum, then average across the strata.

43 Propensity Score Stratification
Stratum 1: estimate the treatment effect using regression (to adjust for residual confounding); repeat for strata 2-10, then average across the strata. A sketch of this estimator follows below.
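A minimal sketch of the stratified estimator described above (without the within-stratum regression step), assuming numeric arrays `ps`, `t`, `y`; using the stratum share as the weight gives an ATE-style average, while using the treated share per stratum would give an ATT-style average.

```python
import numpy as np
import pandas as pd

def stratified_effect(ps, t, y, n_strata=10):
    strata = pd.qcut(ps, q=n_strata, labels=False)   # PS deciles
    effects, weights = [], []
    for s in range(n_strata):
        m = strata == s
        # Simple within-stratum difference in means (regression could refine this)
        effects.append(y[m & (t == 1)].mean() - y[m & (t == 0)].mean())
        weights.append(m.mean())                     # ATE weight: stratum share
    return float(np.average(effects, weights=weights))
```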

44 Propensity Strata Example

Stratum | PS Range | Treated | Control | Total
1 | 0.145-0.298 | 378 | 1171 | 1549
2 | 0.299-0.343 | 502 | 1048 | 1550
3 | up to 0.377 | 573 | 977 | 1550
4 | up to 0.410 | 580 | 969 | 1549
5 | up to 0.441 | 633 | 916 | 1549
6 | up to 0.474 | 732 | 816 | 1548
7 | up to 0.508 | 775 | 773 | 1548
8 | up to 0.551 | 853 | 696 | 1549
9 | up to 0.636 | 909 | 638 | 1547
10 | up to 0.820 | 1076 | 472 | 1548

45

46

47 Treatment Effects (TE) & Weighting by Strata

Stratum | Unadjusted TE | Z | Regression TE | Z | ATE Weight | ATT Weight
1 | 314 | 0.48 | -806 | -1.32 | 0.10 | 0.05
2 | 620 | 0.84 | 714 | 1.06 | | 0.07
3 | 1301 | 2.27 | 1194 | 2.24 | | 0.08
4 | -604 | -1.36 | -693 | -1.62 | |
5 | 163 | 0.36 | -67 | -0.15 | | 0.09
6 | 111 | 0.26 | 255 | 0.65 | |
7 | -244 | -0.61 | -243 | | | 0.11
8 | -92 | -0.12 | 81 | | | 0.12
9 | 968 | 1.16 | 125 | | | 0.14
10 | -314 | -0.53 | -391 | -0.68 | | 0.15

ATE = average treatment effect; ATT = average treatment effect among the treated.

48 How Many Strata? (Imbens and Rubin 2015)
- Exact stratification
- Propensity quintiles (5) or deciles (10)
- Automated approach: split strata until balance is achieved (a sketch follows below)
  - Start with a single stratum.
  - Split a stratum in two (at the median) if imbalance is observed (difference in linear PS > c) AND each resulting stratum has sufficient N in each treatment group.
  - Continue splitting until no remaining stratum meets the splitting criteria.
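A minimal sketch of the automated splitting rule above, assuming arrays of linear propensity scores `lin_ps` and treatment indicators `t`; the imbalance cutoff `c` and minimum group size are illustrative tuning choices, not values from Imbens and Rubin.

```python
import numpy as np

def split_strata(lin_ps, t, c=0.2, min_n=20):
    def imbalanced(idx):
        a = lin_ps[idx][t[idx] == 1]
        b = lin_ps[idx][t[idx] == 0]
        return abs(a.mean() - b.mean()) > c          # difference in linear PS

    def recurse(idx):
        med = np.median(lin_ps[idx])
        lo, hi = idx[lin_ps[idx] <= med], idx[lin_ps[idx] > med]
        big_enough = (len(lo) and len(hi) and
                      all(min((t[s] == 1).sum(), (t[s] == 0).sum()) >= min_n
                          for s in (lo, hi)))
        if big_enough and imbalanced(idx):
            return recurse(lo) + recurse(hi)         # keep splitting at the median
        return [idx]                                 # stop: stratum is final

    return recurse(np.arange(len(lin_ps)))           # list of index arrays
```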

49 PS Deciles vs Automated Strata
[Figure: within the propensity score range 0.30-0.38, the decile approach uses PS strata 2-3 while automated strata building uses strata 4-7.]

Propensity deciles (10):
Stratum | N | PS Min | PS Max
1 | 1549 | 0.14 | 0.30
2 | 1550 | | 0.34
3 | | | 0.38
... | | |
9 | 1547 | 0.55 | 0.64
10 | 1548 | | 0.82

Automated strata building (20):
Stratum | N | PS Min | PS Max
1 | 962 | 0.14 | 0.27
2 | 485 | | 0.29
3 | 248 | | 0.30
... | | |
19 | 487 | 0.64 | 0.68
20 | 966 | | 0.82

50 Causal Effect Estimation via Weighting (Austin & Stuart 2015)
Can we find individual patient weights to transform the imbalanced original sample into a balanced weighted sample?

Original sample:
 | Active | Control
Age | 48.0 | 56.0
Female | 40% | 62%

Weighted sample:
 | Active | Control
Age | 52.0 | 52.0
Female | 51% | 51%

51 Inverse Probability of Treatment Weighting
- Treatment: Z = 0 for control vs Z = 1 for active
- Observed outcome: $Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$
- Propensity score: $e_i = \Pr(Z_i = 1 \mid X_i)$
Simple IPW estimator:
$\hat{\Delta}_{IPW} = \frac{1}{n} \sum_{i=1}^{n} \frac{Z_i Y_i}{e_i} - \frac{1}{n} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i}{1 - e_i}$

52 Weighting Options

Weighting Method | Formula
IPTW (ATE) | $w_i = \frac{Z_i}{e_i} + \frac{1 - Z_i}{1 - e_i}$
IPTW-Stabilized (ATE) | $w_i = \frac{Z_i P(Z=1)}{e_i} + \frac{(1 - Z_i) P(Z=0)}{1 - e_i}$
IPTW (ATT) | $w_i = Z_i + \frac{e_i (1 - Z_i)}{1 - e_i}$
Overlap (Li et al. 2016) | $w_i = 1 - e_i$ for treated patients; $w_i = e_i$ for control patients

A sketch computing these weights follows below.
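A minimal sketch computing the weighting options in the table above from a propensity score array `e` and a 0/1 treatment indicator `z`; the function names are illustrative.

```python
import numpy as np

def iptw_ate(z, e):
    """Classic ATE weights: 1/e for treated, 1/(1-e) for controls."""
    return z / e + (1 - z) / (1 - e)

def iptw_ate_stabilized(z, e):
    """Stabilized ATE weights, using the marginal P(Z=1) in the numerator."""
    p = z.mean()
    return z * p / e + (1 - z) * (1 - p) / (1 - e)

def iptw_att(z, e):
    """ATT weights: treated get weight 1, controls get e/(1-e)."""
    return z + (1 - z) * e / (1 - e)

def overlap_weights(z, e):
    """Overlap weights (Li et al. 2016): 1-e for treated, e for controls."""
    return z * (1 - e) + (1 - z) * e
```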

53 Improved Estimators (combining IPTW and Regression)
Doubly robust estimation (Lunceford and Davidian 2004):

$\frac{1}{n} \sum_{i=1}^{n} \frac{Z_i Y_i - (Z_i - e_i)\, m_T(X_i, \alpha_T)}{e_i} - \frac{1}{n} \sum_{i=1}^{n} \frac{(1 - Z_i) Y_i + (Z_i - e_i)\, m_C(X_i, \alpha_C)}{1 - e_i}$

where $m_T$ and $m_C$ are predicted outcomes from regression models of the outcome on the covariate vector X for treatment and control, respectively.
Variance estimation:
- Robust sandwich estimator
- Bootstrap (incorporating variability in weight estimation)
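A minimal sketch of the doubly robust estimator above, assuming you have already fitted the outcome regressions and evaluated their predictions `m_t`, `m_c` at each $X_i$, along with propensity scores `e`; all inputs are numeric arrays and the names are illustrative.

```python
import numpy as np

def doubly_robust_ate(y, z, e, m_t, m_c):
    """Lunceford-Davidian doubly robust ATE estimate."""
    treated = (z * y - (z - e) * m_t) / e          # first sum in the formula
    control = ((1 - z) * y + (z - e) * m_c) / (1 - e)  # second sum
    return float(np.mean(treated) - np.mean(control))
```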

54 A Common Issue: Outliers

55 Outcome/Weight Clouds

56 Bootstrap Distribution of Treatment Effect Estimates

57 Methods Matter – Example (based on Faries et al 2008)
Longitudinal data: time-dependent confounding.

58 Causal Treatment Effect Estimation in Longitudinal Real World Data
Same issues as in the cross-sectional case, plus:
- Medication switching
- Time-dependent confounding: a predictor of subsequent outcome and subsequent treatment that is itself influenced by prior treatment
- Missing data / censoring

59 DAGs for Time-dependent Variables
[Diagram: R (a risk-score confounder) and Treatment both affect Outcome; over time, R0 → Trt0 → R1 → Trt1 → Outcome, so the risk score both confounds later treatment and is a consequence of earlier treatment.]

60 A Causal Analysis Solution: Marginal Structural Models (MSMs)
MSMs = models for the marginal distribution of counterfactual outcomes (Robins 1998; Hernan 2000)
- Weighted repeated-measures model
- Treatment as a time-varying variable
- Weights based on inverse probability of treatment and inverse probability of censoring
- Weighting addresses time-varying confounders

61 Key Principle of MSMs
Weighting transforms the observed population (a confounded data set) into a pseudo-population (an unconfounded data set) in which, in effect, everybody is both treated and not treated.

62 Pseudo Population Tree
From: Causal Inference in Observational Studies and Clinical Trials Affected by Treatment Switching: A Practical Hands-on Workshop, May 3-6, 2016, UMIT, Hall i.T., Austria.
[Figure: an observation tree (n = 100, with branches for treatment and death) is re-weighted into a pseudo-population tree; for example, a branch of 30 patients observed with probability 0.75 is re-weighted to Npseudo = 30 × 1/0.75 = 40.]

63 Weighting for MSMs (Hernan 2000)
- Denominator: probability that patient i received their observed treatment at time k, given their treatment and risk factor history.
- Numerator: probability that patient i received their observed treatment at time k, given their treatment history but not further adjusting for risk factor history.
- Similar weighting applies for censoring.
- Cumulative weights: the stabilized weight at time k is the product of the numerator/denominator ratios over visits up to k (a sketch follows below).
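A minimal sketch of the cumulative stabilized weights described above, assuming a long-format DataFrame with columns `id`, `time`, and fitted per-visit probabilities `p_num` (treatment history only) and `p_den` (treatment plus risk-factor history) of the treatment actually received; all column names are illustrative assumptions.

```python
import pandas as pd

def msm_weights(df: pd.DataFrame) -> pd.Series:
    df = df.sort_values(["id", "time"])
    ratio = df["p_num"] / df["p_den"]          # visit-level stabilized ratio
    # Cumulative product of the ratios within each patient over time
    return ratio.groupby(df["id"]).cumprod()
```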

64

65 What is the Best Strategy for Bias Control?
[Word cloud of candidate strategies:] MSMs, regression within strata, doubly robust, penalized regression, matching 1:1 vs 1:many, inverse propensity weighting, entropy balancing, local control, regression, propensity matching within calipers, matching with replacement, Mahalanobis matching, high-dimensional propensity, propensity stratification, propensity score matching, exact matching, near-far IV matching, greedy matching, prognostic matching, G-estimation, optimal matching.

66 Frequentist Model Averaging (FMA)
Simulations with approximately 50 strategies (Zagar et al. 2017):
- Strategy = combination of treatment and outcome models
- Data generation scenarios include both smooth and tree-structured rules for treatment assignment and outcome models

67 Frequentist Model Averaging (Zagar et al. 2017)
Let the data decide!

68 FMA: Bias Control Simulations
Simulation results for one scenario based on claims data. The weights for each individual strategy reflect a cross-validation loss function.
FMA-based methods: Zagar, Kadziola, Lipkovich, Faries, et al. (work in progress).

69 FMA Example: Simulated REFLECTIONS Data
Frequentist model averaging (weighted ATE):

wATE | 2.5 Percentile | 97.5 Percentile
-0.228 | -0.482 | 0.0424

70 How Do RWE and RCT Results Compare?
"… real-world evidence can only correct for biases that researchers already understand. By randomly assigning patients to one treatment or another, clinical trials rely on chance to cancel out any biases, whether researchers are aware of them or not."
As with any type of evidence, RWE has strengths and limitations, and stakeholder perceptions represent both the "yin and yang" of RWE. Several other experts who reviewed the data had a different reaction than Dr. Schneeweiss, with two saying no amount of new information would convince them that the RWE approach is workable; one of them called the attempt "dangerous". The FDA came down somewhere in the middle: an agency spokeswoman said there is a "stronger scientific justification" for randomized controlled trials, but that "recent efforts to use rigorous design and statistical methods" might lead to a greater chance of obtaining valid results with real-world evidence.
The FDA has contracted with Aetion and the Brigham to try to duplicate the results of 30 completed randomized trials, and has also challenged Aetion to duplicate seven randomized trials that are currently underway. The new data, however, come from a separate attempt by Aetion researchers, in which they initiated a pilot attempt to replicate the CAROLINA study.

71 Adjusting for Measured Confounders
[Diagram: a real world population containing an unmeasured confounder.]
Standard bias control methods force balance on the measured confounders, which may exacerbate the imbalance on the unmeasured confounder.

72 Current State of the Community
What should we do about unmeasured confounding? "Just mention it as a limitation in the Discussion section and move on!"
This simplest solution may have been acceptable in the past when presenting to an individual payer, but with the growing availability of RWD, it is not acceptable for FDAMA and regulators.

73 Unmeasured Confounding
Quantitative analytical methods serve as sensitivity analyses for the "no unmeasured confounding" assumption.

74 Available Analytical Methods for Unmeasured Confounding Assessment
Sensitivity analyses to assess the impact of unmeasured confounding fall into three settings, each offering (I) plausibility assessments and (II) adjusted sensitivity analyses:
- No information on the unmeasured confounder(s)
- Internal information on the unmeasured confounder(s)
- External information on the unmeasured confounder(s)
Methods include: the array approach, instrumental variables, the E-value, regression discontinuity, Rosenbaum-Rubin sensitivity analysis, Bayesian twin regression, Manski's partial identification, difference-in-differences, negative controls, the missing cause approach, multiple imputation, Rosenbaum sensitivity analysis, propensity score calibration, empirical distribution calibration, trend-in-trend analysis, pseudo treatments, and perturbation variables.
Zhang, X., Faries, D. E., Li, H., Stamey, J. D., & Imbens, G. W. (2018). Pharmacoepidemiology and Drug Safety, 27(4).
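Of the "no information" methods listed above, the E-value (VanderWeele and Ding, 2017) has a closed form; a minimal sketch for a risk-ratio estimate (apply it to the confidence limit closest to the null as well):

```python
import math

def e_value(rr: float) -> float:
    """Minimum strength of unmeasured confounding (on the risk-ratio scale)
    needed to fully explain away an observed risk ratio `rr`."""
    if rr < 1:                       # symmetric: invert protective estimates
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

print(e_value(1.8))                  # e.g. RR = 1.8 gives an E-value of 3.0
```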

75 Example: External Data (Bayesian Twin Models)

76 A Hypothetical Example: Real World Comparative Effectiveness Study
- Study objective: compare outcomes between osteoporosis patients who used Drug A versus Drug B
- Data source: a large claims database
- Statistical method: propensity score matching plus exact matching on the most important confounders; the matching algorithm performs well for the baseline confounders
- Outcome comparison (odds ratio of outcome, Drug A vs. Drug B): Drug A seems to increase the risk
- Rationale: confirm established clinical efficacy; demonstrate the value of Forteo for decision makers
Incorrect results!

77 A Hypothetical Example: Real World Comparative Effectiveness Study
Key missing confounder: Bone Mineral Density (BMD) values.
- BMD is one of the most important predictors of fracture risk.
- The literature shows physicians cite low BMD as the most important reason to initiate Forteo treatment.
- Absence of BMD would introduce significant selection bias between the treatment groups.
Solution: Bayesian twin regression modeling (a two-stage model) to combine the claims data with multiple external data sources containing BMD information, yielding an adjusted treatment effect estimate.

78 Sensitivity Analysis Results
The results (odds ratios) reversed after adjusting for the unmeasured BMD! (* Statistically significant at the p < 0.05 level.)

79 Summary
- Comparative analyses of observational data require statistical adjustment for 'baseline' confounding.
- Propensity scores are a commonly used tool for adjusting for baseline confounding.
- Best practice steps discussed: start with a DAG. Improvements are needed.
- Improvements for the future: regulatory use of RWE, model averaging, unmeasured confounding.

80 References
Splawa-Neyman, J., Dabrowska, D. M., and Speed, T. P. "On the application of probability theory to agricultural experiments. Essay on principles. Section 9." Statistical Science (1990).
Holland, P. W. "Statistics and causal inference." Journal of the American Statistical Association (1986).
Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
Robins, J. M., and Greenland, S. "The role of model selection in causal inference from nonexperimental data." American Journal of Epidemiology 123.3 (1986).
Rosenbaum, P. R., and Rubin, D. B. "The central role of the propensity score in observational studies for causal effects." Biometrika 70.1 (1983).
Brookhart, M. A., et al. "Variable selection for propensity score models." American Journal of Epidemiology 163(12) (2006).
D'Agostino, R., Lang, W., Walkup, M., and Morgan, T. "Examining the impact of missing data on propensity score estimation in determining the effectiveness of self-monitoring of blood glucose." (2001).
Dehejia, R. H., and Wahba, S. "Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs." Journal of the American Statistical Association (1999).
McCaffrey, D. F., Ridgeway, G., and Morral, A. R. "Propensity score estimation with boosted regression for evaluating causal effects in observational studies." Psychological Methods 9.4 (2004): 403.
Mitra, R., and Reiter, J. P. "Estimating propensity scores with missing covariate data using general location mixture models." Statistics in Medicine 30.6 (2011).
Petri, H., and Urquhart, J. "Channeling bias in the interpretation of drug effects." Statistics in Medicine 10(4) (1991).
Qu, Y., and Lipkovich, I. "Propensity score estimation with missing values using a multiple imputation missingness pattern (MIMP) approach." Statistics in Medicine 28.9 (2009).

