Stratified Covariate Balancing

Farrokh Alemi, Ph.D.

This presentation describes the concepts behind stratified covariate balancing. This brief presentation was organized by Dr. Alemi.

Purpose of Stratified Covariate Balancing
Stratified covariate balancing uses stratification to balance the data.

Stratified Covariate Balancing | Propensity Scoring
Focuses on interactions        | Main effects
Analytical                     | Statistical
Guaranteed                     | May not work
EHR ready                      |

Stratified covariate balancing automatically balances the interactions among the covariates. Thus, it overcomes one of the most difficult parts of data balancing. Propensity scoring, in contrast, balances mostly the main effects and asks the analyst to search for possible interactions, which may or may not be found. So it often does not balance interactions among the covariates.

Stratified covariate balancing derives its weights analytically, without regression or other parameter estimation procedures. The weights are guaranteed to balance not only the covariates but also every interaction of the covariates.

Stratified covariate balancing does not need access to a statistical package and can be implemented inside an electronic health record using SQL. Thus, it can be part of automated methods of balancing and analyzing data within electronic health records to prepare decision support tools.

Steps in Stratified Covariate Balancing
Software packages (for example, an R package) are available that can do stratified covariate balancing; alternatively, you can do this method using SQL. I will describe the method using SQL code.

1. Divide Data into Strata
The first step is to divide the data into subgroups. In SQL, the GROUP BY clause produces one group for each observed combination of covariate values.

Divide Into Cases

    -- Cases describe residents who are unable to eat
    SELECT COUNT(DISTINCT [ID]) AS nCases      -- Number of residents unable to eat
      , SUM(IIF([Dead6M] = 1, 1., 0.)) AS a    -- Number unable to eat and dead in 6 months
      , SUM(IIF([Dead6M] = 0, 1., 0.)) AS b    -- Number unable to eat and alive
      , [Gender], [OlderThanAvg]
      , [uWalk], [uToilet], [uGroom], [uBathe], [uDress], [uBowel], [uUrine], [uSit]
    INTO #Cases                                -- Save in temporary table called #Cases
    FROM [dbo].[Data]                          -- Name of your table may be different
    WHERE [uEat] = 1                           -- Select only residents who were unable to eat
    GROUP BY                                   -- Create strata from gender, age, and disabilities. Age is matched coarsely
      [Gender], [OlderThanAvg], [uWalk], [uToilet], [uGroom], [uBathe], [uDress], [uBowel], [uUrine], [uSit]

In this example, we are trying to see whether residents who are unable to eat are likely to die in the next 6 months. This code shows how cases are identified using the WHERE clause.

The GROUP BY clause of the same query also shows how the various combinations of covariates (in this case age, gender, and other disabilities) are grouped together into strata.

Within each stratum, we need the total number of cases and the frequency of the outcome, in this case mortality within 6 months. The query returns these as nCases, a, and b.

Divide Into Controls

    -- Controls describe residents who are able to eat
    SELECT COUNT(DISTINCT [ID]) AS nControls   -- Number of residents able to eat
      , SUM(IIF([Dead6M] = 1, 1., 0.)) AS c    -- Number able to eat and dead in 6 months
      , SUM(IIF([Dead6M] = 0, 1., 0.)) AS d    -- Number able to eat and alive in 6 months
      , [Gender], [OlderThanAvg]
      , [uWalk], [uToilet], [uGroom], [uBathe], [uDress], [uBowel], [uUrine], [uSit]
    INTO #Controls                             -- Save in temporary table called #Controls
    FROM [dbo].[Data]
    WHERE [uEat] = 0                           -- Select only residents who were able to eat
    GROUP BY                                   -- Create strata from gender, age, and disabilities. Age is matched coarsely
      [Gender], [OlderThanAvg], [uWalk], [uToilet], [uGroom], [uBathe], [uDress], [uBowel], [uUrine], [uSit]

The same procedure is done for controls.

The same grouping into strata is done for controls.

Outcomes are counted for control patients in the same way, giving the counts c and d.
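As a rough illustration of what the two GROUP BY queries produce, the following Python sketch stratifies a handful of made-up records. The records, the reduced covariate set (gender and age group only), and the helper name `count_strata` are all assumptions for illustration; they are not part of the presentation's SQL.

```python
from collections import defaultdict

# Hypothetical toy records:
# (resident id, unable to eat, dead in 6 months, gender, older than average)
records = [
    (1, 1, 1, "M", 1),
    (2, 1, 0, "M", 1),
    (3, 1, 1, "F", 0),
    (4, 0, 0, "M", 1),
    (5, 0, 1, "M", 1),
    (6, 0, 0, "M", 1),
    (7, 0, 0, "F", 1),
]

def count_strata(rows, exposed):
    """Group rows with the given exposure by covariate combination.

    Returns {stratum: [with_outcome, without_outcome]}, which plays the
    role of (a, b) for cases and (c, d) for controls.
    """
    counts = defaultdict(lambda: [0, 0])
    for _id, u_eat, dead, gender, older in rows:
        if u_eat == exposed:                  # the WHERE condition
            stratum = (gender, older)         # the GROUP BY key
            counts[stratum][0 if dead else 1] += 1
    return dict(counts)

cases = count_strata(records, exposed=1)
controls = count_strata(records, exposed=0)
print(cases)
print(controls)
```

Each dictionary key is one stratum, so a stratum appears only if at least one resident with that combination of covariates exists in the group.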

2. Match Cases & Controls
In the second step, we match cases and controls over the same strata.

    -- Match cases with controls and calculate common odds ratio
    SELECT SUM(a*d/(a+b+c+d)) / SUM(b*c/(a+b+c+d)) AS [Common Odds Ratio]
    FROM #Cases
    INNER JOIN #Controls
      ON  #Cases.[Gender]       = #Controls.[Gender]
      AND #Cases.[OlderThanAvg] = #Controls.[OlderThanAvg]
      AND #Cases.[uWalk]        = #Controls.[uWalk]
      AND #Cases.[uToilet]      = #Controls.[uToilet]
      AND #Cases.[uGroom]       = #Controls.[uGroom]
      AND #Cases.[uBathe]       = #Controls.[uBathe]
      AND #Cases.[uDress]       = #Controls.[uDress]
      AND #Cases.[uBowel]       = #Controls.[uBowel]
      AND #Cases.[uUrine]       = #Controls.[uUrine]
      AND #Cases.[uSit]         = #Controls.[uSit]

This code shows the matching. The two temporary tables for cases and controls are joined, with the requirement that they share the same strata.
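One consequence of the inner join is worth making explicit: a stratum that contains only cases, or only controls, has no match and drops out of the analysis. The Python sketch below mimics the join on made-up stratum counts; the dictionaries and their keys are hypothetical.

```python
# Hypothetical stratum counts keyed by (gender, older_than_avg).
cases = {("M", 1): (1, 1), ("F", 0): (1, 0)}       # (a, b) per stratum
controls = {("M", 1): (1, 2), ("F", 1): (0, 1)}    # (c, d) per stratum

# Inner join: keep only the strata that occur in both groups.
matched = {
    stratum: cases[stratum] + controls[stratum]     # (a, b, c, d)
    for stratum in cases.keys() & controls.keys()
}

# Strata without a counterpart are left out of the matched data.
unmatched_cases = sorted(cases.keys() - controls.keys())
unmatched_controls = sorted(controls.keys() - cases.keys())

print(matched)
print(unmatched_cases, unmatched_controls)
```

It can be useful to count how many cases fall into unmatched strata, since those patients do not contribute to the estimate.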

Patients' Characteristics: same n covariates for cases and controls

                   | Outcome y = 1 | Outcome y = 0
Cases (T = 1)      | a_i           | b_i
Controls (T = 0)   | c_i           | d_i

Here is what the output looks like for each stratum. The stratum includes all the patients with the same combination of covariates. Cases are identified as those who receive treatment, where T is one. Controls are patients who do not receive the treatment. Since both cases and controls are examined over the same set of covariates, outcomes within these strata are independent of the covariates.

For cases, we report their co-occurrence with the outcome. Thus the count a in stratum i shows the number of cases with the outcome. Similarly, the count b shows the number of cases without the outcome. The sum a + b in stratum i is the total count of cases in the stratum.

Similarly, the count c shows the number of controls with the outcome and d shows the number of controls without the outcome. The sum c + d in stratum i is the total count of controls.

Cases: Unable to Eat, X = 1. Matched Controls: Able to Eat, X = 0.

k | Age   | Male | Disabilities | Cases Total, a_i + b_i | Cases Number Dead, Y | Controls Total, c_i + d_i | Controls Number Dead, Y | Weight, wi0
1 | 65–85 | M    | SGTBWDL      | 36,677                 | 12,831               | 17,862                    | 4,253                   | 2.053
2 | 40–65 |      |              | 19,317                 | 9,787                | 10,739                    | 3,512                   | 1.79
3 |       |      | SGTBWD       | 14,494                 | 3,118                | 7,456                     | 1,153                   | 1.944
4 | 85+   |      |              | 11,336                 | 3,951                | 22,220                    | 5,436                   | 0.51
5 |       |      |              | 10,987                 | 3,263                | 6,318                     | 1,358                   | 1.739
6 |       |      | GTBWD        | 6,386                  | 3,275                | 3,032                     | 1,121                   | 2.106
7 |       |      | GTBWDL       | 5,101                  | 2,192                | 9,524                     | 2,544                   | 0.536
8 |       |      |              | 4,592                  | 982                  | 7,283                     | 1,226                   | 0.631

Here is a sample of the strata and the outcomes observed within each stratum.

The combination of the covariates defines the strata. Each combination of age, gender, and disability defines a separate stratum. The strata are mutually exclusive. Hundreds or sometimes thousands of strata are identified in the data.

Within each stratum, we count the total number of cases. These totals are used in the weighting procedure to guarantee that each combination of covariates occurs equally often among cases and controls.

The outcomes are examined within each stratum. Within a stratum, a difference in the probability of the outcome cannot be due to the covariates that define the stratum and is attributed to the difference between cases and controls. The matching provides us with an opportunity to calculate the unconfounded impact of treatment or exposure.

3. Calculate Impact
In the last step, the data within each stratum are used to calculate the unconfounded impact of treatment.

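3. Calculate Impact: Common Odds Ratio
$$OR = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}$$
If the outcome is binary, then the common odds ratio can be estimated across the strata by this formula, where n_i = a_i + b_i + c_i + d_i is the total number of patients in stratum i.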
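The formula has the form of the Mantel-Haenszel estimator of a common odds ratio. The following Python sketch applies it to two made-up strata; the counts and the function name `common_odds_ratio` are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
def common_odds_ratio(strata):
    """Common odds ratio over strata given as (a, b, c, d) tuples.

    a, b: cases with and without the outcome
    c, d: controls with and without the outcome
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d            # stratum size n_i
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

# Hypothetical strata; each one has a within-stratum odds ratio of 3.
strata = [(10, 10, 5, 15), (20, 20, 10, 30)]
print(common_odds_ratio(strata))  # 3.0
```

Here the numerator is 150/40 + 600/80 = 11.25 and the denominator is 50/40 + 200/80 = 3.75, so the common odds ratio equals the shared within-stratum value of 3.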

3. Calculate Impact: Weighted Data
$$w_i = T_i + (1 - T_i)\,\frac{a_i + b_i}{c_i + d_i}$$
If the outcome is continuous, then these weights can be used to balance the data. Note that each case and each control is weighted. Each stratum has its own set of weights, and within a stratum, cases and controls have different weights.

If the patient is treated, i.e., is a case, then the T variable is 1 and the weight simply becomes 1, because the second part of the equation is multiplied by 0.

If the patient is not treated, i.e., is one of the controls, then the T variable is 0 and the weight is just the total number of cases in the stratum divided by the total number of controls in the stratum. Multiplying the controls by this ratio guarantees that controls and cases within the same stratum occur an equal number of times.
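To make the weight formula concrete, this Python sketch evaluates it for the first and fourth strata of the sample table shown earlier. The function name `weight` is an assumption; the stratum totals come from that table.

```python
def weight(treated, n_cases, n_controls):
    """w = T + (1 - T) * (a + b) / (c + d) for one patient in a stratum."""
    return treated + (1 - treated) * n_cases / n_controls

# Controls (T = 0) in stratum 1: 36,677 cases and 17,862 controls.
print(round(weight(0, 36677, 17862), 3))  # 2.053

# Controls (T = 0) in stratum 4: 11,336 cases and 22,220 controls.
print(round(weight(0, 11336, 22220), 3))  # 0.51

# Cases (T = 1) always receive a weight of 1.
print(weight(1, 36677, 17862))            # 1.0
```

The two control weights agree with the weight column of the sample table.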


3. Calculate Impact: Switch Distributions
This weighting procedure translates to switching the distribution of controls to the frequencies of cases. For example, we can switch the count of controls in stratum 4 of the sample table. This is the stratum for patients who are above 85 years old, male, and have 7 disabilities. We would replace 22,220 with 11,336. We can multiply each control by the weights, or we can simply switch the distribution and be done.
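The sketch below applies this switching to the eight strata in the sample table. It assumes that the second and fourth numeric columns of that table are deaths among cases and among controls, respectively, and it compares death rates only as an illustration; the full analysis would use every stratum, not just the eight displayed.

```python
# (cases total, cases dead, controls total, controls dead) per sample stratum
strata = [
    (36677, 12831, 17862, 4253),
    (19317, 9787, 10739, 3512),
    (14494, 3118, 7456, 1153),
    (11336, 3951, 22220, 5436),
    (10987, 3263, 6318, 1358),
    (6386, 3275, 3032, 1121),
    (5101, 2192, 9524, 2544),
    (4592, 982, 7283, 1226),
]

total_cases = sum(n1 for n1, _, _, _ in strata)

# Each control counts w = n1 / n0 times, so the weighted number of
# controls in a stratum equals the number of cases in that stratum.
weighted_controls = sum(n0 * (n1 / n0) for n1, _, n0, _ in strata)

case_death_rate = sum(dead1 for _, dead1, _, _ in strata) / total_cases
weighted_control_death_rate = (
    sum(dead0 * (n1 / n0) for n1, _, n0, dead0 in strata) / weighted_controls
)

print(total_cases, round(weighted_controls))
print(case_death_rate, weighted_control_death_rate)
```

After weighting, the controls have the same distribution across strata as the cases, so the two death rates are compared over the same mix of covariates.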

Odds in Cases & Controls
This plot shows what happens when you switch the distribution or weight the data according to the method of stratified covariate balancing. The odds of the various covariates across cases and controls change to 1-to-1. For example, the odds of being unable to transfer change from 4-to-1 to 1-to-1. Being unable to transfer now occurs an equal number of times among treated cases and untreated controls. In essence, the covariates are balanced across the two groups. In the weighted sample, we can examine the impact of treatment without concern for the covariates.
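A small balance check can be sketched in Python as follows. The three strata and their counts are invented so that the unweighted odds come out at 4-to-1, mirroring the example in the narration; they are not the data behind the plot.

```python
# Hypothetical strata: (unable_to_transfer, number of cases, number of controls)
strata = [
    (1, 60, 10),
    (1, 20, 10),
    (0, 20, 80),
]

def covariate_odds(strata, weighted):
    """Cases per control among patients who have the covariate.

    With weighted=True, each control counts n_cases / n_controls times,
    which is the stratum weight for controls.
    """
    cases = sum(n1 for x, n1, n0 in strata if x == 1)
    controls = sum(
        n0 * (n1 / n0 if weighted else 1.0)
        for x, n1, n0 in strata
        if x == 1
    )
    return cases / controls

print(covariate_odds(strata, weighted=False))  # 4.0
print(covariate_odds(strata, weighted=True))   # 1.0
```

Because the weights equalize cases and controls within every stratum, any covariate that is constant within strata ends up at 1-to-1 after weighting.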

Accuracy
[Figure: accuracy of Propensity Scoring with 2-way Interaction compared with Stratified Covariate Balancing]

In 2016, Alemi, ElRafey, and Avramovic simulated a number of data sets. They then examined the performance of stratified covariate balancing and propensity scoring in situations where there were significant interactions among the covariates. In high-dimensional, massive data, it is not practical to model all interactions among the variables. At best, only pair-wise interactions are modelled.

Initially, when there was not much interaction among the covariates, pair-wise propensity scoring and stratified covariate balancing performed similarly. As higher-order interaction terms were used to generate the outcome, stratified covariate balancing maintained its accuracy, but pair-wise propensity scoring had increasing error. No matter how many interaction terms were used to generate the outcome, stratified covariate balancing was able to estimate the impact of treatment relatively accurately. Propensity scoring was not able to do so.

The reason for the success of stratified covariate balancing is quite simple: its weights are based on combinations of covariates. Thus, the weights are based on observed interactions in the data.

Stratified covariate balancing is easy to implement, requires no statistical analysis, and is more accurate than propensity scoring.
