S. Stanley Young Robert Obenchain Goran Krstic

1 Bias Adjustment in Data Mining: Local Control Analysis of Radon and Ozone
S. Stanley Young, Robert Obenchain, Goran Krstic

2 Abstract Local control analysis of radon and ozone
S. Stanley Young, CGStat LLC
Robert L. Obenchain, Risk Benefit Statistics LLC
Goran Krstic, Fraser Health Authority
Large (observational) data sets typically present research opportunities, but also problems that can lead to false claims. In Big Data, the standard error of an effect estimate goes to zero as sample size increases, so even small biases can lead to declared (but false) claims. In addition, the average treatment effect can be almost meaningless when there are interactions with confounders that create local variation in effect-sizes. Data miners need statistical methods that can deal simply and efficiently with these sources of bias. Here, we demonstrate the use of a JMP add-in, Moving Median, and a new JMP platform, Local Control, for the analysis of two data sets. Our first case study illustrates the reduction of bias in an environmental epidemiology data set. Our second case study uses Local Control on a time series air quality example. By detecting interactions, data miners can produce more realistic and more relevant analyses that reduce the bias typically implied by the variety and heterogeneity of Big Data.
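The abstract's point about standard errors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true effect is zero, every measurement carries a constant bias of 0.05 standard deviations, and the function name `biased_mean_z` and all numeric settings are hypothetical choices, not values from the talk.

```python
import math
import random

def biased_mean_z(n, bias=0.05, sigma=1.0, seed=0):
    """Return (estimate, standard error, z) for n draws whose true effect
    is 0 but whose measurements carry a small constant bias (assumed)."""
    rng = random.Random(seed)
    sample = [rng.gauss(bias, sigma) for _ in range(n)]
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se = math.sqrt(var / n)          # shrinks like 1 / sqrt(n)
    return mean, se, mean / se       # z grows like bias * sqrt(n) / sigma

# The bias stays fixed while the standard error shrinks, so the z statistic
# eventually looks "significant" even though there is no real effect.
for n in (100, 10_000, 400_000):
    est, se, z = biased_mean_z(n)
    print(f"n={n:>8,d}  estimate={est:+.4f}  SE={se:.4f}  z={z:+.1f}")
```

With a fixed bias, the estimate settles near the bias rather than near zero, which is why a larger sample alone does not protect against a false claim.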

3 Plan for Radon Data Set
Radon background
Local Control analysis strategy
Analysis of 2,881 US counties
Review of Local Control results
Summary

4 Figure 1. Spatial distribution of obesity, lung cancer, radon and smoking.
Source: Ever Smoking

5 EPA cited meta analysis (1)
2005

6 EPA cited meta analysis (2)
2004
In epidemiological studies relating low-level, ground-level ambient ozone to population health, how does one account and control for the effects of high levels of indoor ozone intentionally produced by ozone generators? Ozone generators are sold and widely used in North America, Europe and worldwide.

7 Local Control Analysis Process
Large observational data set
A vs B comparison
The steps:
Aggregate: cluster, LTDs
Confirm: randomization test
Explore: sensitivity analysis
Reveal: modeling, MLR, RP

8 Step 0: Select clustering variables

9 Local Control Analysis Radon
“Most Typical” micro-aggregation of 2,881 US counties on 3 primary X-confounders:
Age Over 65 %
Obesity %
Currently Smoke %
Y-outcome: Lung Cancer Mortality
Binary treatment indicator: Radon High (> 2.1 pCi/L) vs. Low

10 Variables selected for clustering.
NB: Regression coefficient for radon is NEGATIVE.

11 Local Control Add-In Russ Wolfinger/Bob Obenchain

12 Step 1: Clustering

13 Within cluster statistics
Local Treatment Difference (LTD)
Local Linear Regression (slope and intercept)
Local Survival Analysis (failure times)
Etc.

14 Local Treatment Difference at the centroid of an informative cluster
E[ (Y|t=1) - (Y|t=0) | X ]
Single df comparison
Given X, local effect
“Fair Treatment Comparison”
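A minimal sketch of the within-cluster LTD computation, assuming each record is a `(cluster_id, treated, outcome)` triple and that a cluster counts as "informative" only when it contains both treated and untreated units. The function name, the record layout, and the toy numbers are illustrative assumptions, not the JMP Local Control implementation.

```python
from collections import defaultdict
from statistics import mean

def local_treatment_differences(rows):
    """rows: iterable of (cluster_id, treated, outcome), treated in {0, 1}.
    Returns {cluster_id: LTD} where LTD = mean(Y | t=1) - mean(Y | t=0),
    computed only for clusters that contain both treatment arms."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for cluster_id, treated, outcome in rows:
        groups[cluster_id][treated].append(outcome)
    ltds = {}
    for cluster_id, arms in groups.items():
        if arms[0] and arms[1]:      # skip uninformative clusters
            ltds[cluster_id] = mean(arms[1]) - mean(arms[0])
    return ltds

# Toy records (made-up numbers): treated = 1 stands for "High Radon".
toy = [
    ("A", 1, 50.0), ("A", 1, 54.0), ("A", 0, 60.0),
    ("B", 1, 70.0), ("B", 0, 66.0), ("B", 0, 68.0),
    ("C", 1, 55.0),                  # no untreated unit: uninformative
]
ltd = local_treatment_differences(toy)
print(ltd)
```

Each informative cluster contributes one LTD, so the analysis works with a distribution of local effects rather than a single overall average.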

15 Aggregate Cycle Observed LTD Distribution (49 Informative Clusters)

16 Step 2: Confirm clustering matters
Panels: Random Distribution (upper), Observed LTD Distribution (lower).
These two distributions are clearly different: the random distribution (upper) is smoother, has longer tails, and is shifted further to the left, while the observed distribution (lower) is much more compact.

17 Step 2: Confirm Cycle
Figure: observed LTD empirical Cumulative Distribution Function (CDF) vs. “LTD-like” random permutation CDF.
These two distributions are clearly different; they differ most in location and kurtosis (see the plots and statistics listed on the next 2 pages). This means that clustering (local conditioning, matching) on 3 primary X-confounders [Age Over 65 %, Obesity %, and Currently Smoke %] has indeed yielded “adjusted” treatment effect-size estimates.
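One way to sketch the Confirm step is to compare the observed LTDs with "LTD-like" differences from random clusters of the same sizes, formed by shuffling the cluster labels. The toy data below is an assumption for illustration: a single confounder level drives both treatment and outcome and there is no true treatment effect, so it will not mirror the shapes seen in the radon plots. All function names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean

def ltds(cluster_ids, treated, outcome):
    """List of within-cluster treated-minus-untreated mean differences."""
    arms = defaultdict(lambda: ([], []))
    for c, t, y in zip(cluster_ids, treated, outcome):
        arms[c][t].append(y)
    return [mean(a1) - mean(a0) for a0, a1 in arms.values() if a0 and a1]

def permuted_ltds(cluster_ids, treated, outcome, n_perm=200, seed=0):
    """Pool LTDs from random clusters that keep the original cluster sizes."""
    rng = random.Random(seed)
    labels = list(cluster_ids)
    pooled = []
    for _ in range(n_perm):
        rng.shuffle(labels)          # breaks the link between clusters and X
        pooled.extend(ltds(labels, treated, outcome))
    return pooled

# Toy data: cluster index c doubles as the confounder level (assumption).
rng = random.Random(1)
cluster_ids, treated, outcome = [], [], []
for c in range(10):
    for _ in range(40):
        cluster_ids.append(c)
        treated.append(1 if rng.random() < 0.1 + 0.08 * c else 0)
        outcome.append(5.0 * c + rng.gauss(0, 1))   # no true treatment effect

observed = ltds(cluster_ids, treated, outcome)
random_ref = permuted_ltds(cluster_ids, treated, outcome)
print("mean observed LTD:", round(mean(observed), 2))
print("mean random LTD:  ", round(mean(random_ref), 2))
```

If clustering on the X-confounders did not matter, the two distributions would look alike; a clear difference in location or spread suggests the local estimates have been adjusted.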

18 Step 3: Explore Cycles
Tried using “Complete Linkage” as well as “Fast Ward”.
Tried using 3 out of 5 potential X-confounders for clustering:
Age Over 65 %
Obesity %
Currently Smoke %
Ever Smoke %
Median Household Income ($1,000s)
Tried using between 50 and 100 clusters.

19 Reveal Cycle
Fitted “Supervised Learning” models for predicting observed LTDs:
JMP 11 “Modeling” Platform -> “Partition” option: single tree (7 terminal nodes)
Bootstrap Forest: model average of 100 trees
JMP “Fit Model” Platform: multi-variable regression (degree at most 2)
Tried using 6 potential X-confounders for predicting observed LTDs:
Age Over 65 %
Obesity %
Currently Smoke %
Ever Smoke %
Median Household Income ($1,000s)
Radon (or Ln[Rn]) level (as either ordinal or continuous measures)
NOTE: Cluster #10 is uninformative about LTDs and contains 11 counties. Thus the following predictions use the data from only 2,870 US counties; i.e., LTD “missingness” is not considered informative of potential treatment effect-sizes.

20 Tree
Three left-most terminal nodes: HIGH "Age Over 65 %" in a county (17.5% or more) MAGNIFIES the advantage of High Radon in keeping Lung Cancer Mortality low (negative LTDs). But LTDs increase as "Currently Smoke %" increases (3-way split).
Four right-most terminal nodes: When "Age Over 65 %" in a county is LOW (less than 17.5%), either high "Currently Smoke %" (27.7% or more) or high "Obesity %" (33% or more) REDUCES the advantage of High Radon in keeping Lung Cancer Mortality low (negative LTDs). Finally, higher "Age Over 65 %" in a county (12.4% or more) is better than lower (less than 12.4%) in PRESERVING the advantage of High Radon in keeping Lung Cancer Mortality low (negative LTDs).

21 Method Two (Bootstrap Forest), R^2 = 0.80
These variables capture about 80% of the signal variation. Age and current smoking are the most important predictors of the Local Treatment Differences.

22 Linear Regression

23 Regression Results

24 Partial Correlations

25 London Smog, 1952

26 EPA and ozone Bad Good ???? 0.075 ppm 0.20 ppm
Ozone Generators that are Sold as Air Cleaners, reviewed by EPA:
The results showed that some ozone generators, when run at a high setting with interior doors closed, would frequently produce concentrations of ppm. A powerful unit set on high with the interior doors opened achieved values of 0.12 to 0.20 ppm in adjacent rooms. When units were not run on high, and interior doors were open, concentrations generally did not exceed public health standards (US EPA, 1995).
It may be useful to include the current USEPA ozone standard in your slide for comparison (i.e., annual fourth-highest daily maximum 8-hr concentration, averaged over 3 years, not to exceed ppm).

27 LC London Ozone
0. Variable selection
1. Cluster to 34/35
2. LR within cluster
3. Append intercept and slope to data set
4. P-value plot
5. Histograms
6. RP on intercepts and slopes
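Steps 2 and 3 above can be sketched as follows: fit a simple least-squares line within each cluster, then attach that cluster's intercept and slope to every row. This is an assumed reading of the steps, not the JMP workflow; the column names `cluster`, `ozone`, and `deaths` and the toy values are hypothetical.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values in a cluster")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def append_cluster_fits(rows):
    """rows: list of dicts with keys 'cluster', 'ozone', 'deaths'.
    Returns new rows with the cluster's 'intercept' and 'slope' appended."""
    by_cluster = defaultdict(list)
    for r in rows:
        by_cluster[r["cluster"]].append(r)
    fits = {
        c: fit_line([m["ozone"] for m in members],
                    [m["deaths"] for m in members])
        for c, members in by_cluster.items()
    }
    return [
        {**r, "intercept": fits[r["cluster"]][0], "slope": fits[r["cluster"]][1]}
        for r in rows
    ]

# Toy rows with made-up values.
toy_rows = [
    {"cluster": 1, "ozone": 1.0, "deaths": 3.0},
    {"cluster": 1, "ozone": 2.0, "deaths": 5.0},
    {"cluster": 1, "ozone": 3.0, "deaths": 7.0},
    {"cluster": 2, "ozone": 1.0, "deaths": 4.0},
    {"cluster": 2, "ozone": 2.0, "deaths": 4.0},
    {"cluster": 2, "ozone": 3.0, "deaths": 4.0},
]
augmented = append_cluster_fits(toy_rows)
print(augmented[0], augmented[3])
```

Once every row carries its cluster's intercept and slope, those two columns can serve as responses in the later steps (histograms, p-value plots, RP).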

28 London time Series (outliers removed)
Daily Deaths Smoothed Subtract Moving Median 21-5
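A minimal sketch of the "subtract moving median" idea, assuming a centered window that is truncated at the ends of the series. The window length is an assumption: the default of 21 merely echoes the first number on the slide, which does not explain its "21-5" label, and the function name is hypothetical rather than the add-in's API.

```python
from statistics import median

def subtract_moving_median(series, window=21):
    """Return each value minus the median of a centered window around it.
    Windows are truncated at the start and end of the series (assumption)."""
    half = window // 2
    deviations = []
    for i, value in enumerate(series):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        deviations.append(value - median(series[lo:hi]))
    return deviations

# Toy daily counts with one spike; a short window keeps the example readable.
daily_deaths = [10, 10, 10, 50, 10, 10, 10]
deviations = subtract_moving_median(daily_deaths, window=3)
print(deviations)
```

Because the median resists isolated spikes, the smoothed baseline stays near the typical level and the spike survives as a deviation, which is the quantity the later regressions would try to explain.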

29 Step 0: Variable selection

30 Time Series Smoother Add-In
Paul Fogel Paris

31 Possible Ozone Effect

32 Step 1: Cluster

33 LR within clusters, Intercept

34 LR Within clusters Slope
No evidence that daily deaths are affected by current ozone.

35 Predict Intercept

36 Predict Slope Need to add words of explanation for this slide.

37 P-value plot for intercepts and slopes, ozone on day -1
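A p-value plot is commonly drawn by sorting the p-values and plotting them against their rank; if nothing is going on, the points fall roughly along a straight line from 0 to 1. The sketch below only prepares the plotted points. The function name and the example p-values are made up for illustration and are not results from the ozone analysis.

```python
def p_value_plot_points(p_values):
    """Return (rank, sorted p-value, expected uniform quantile) triples.
    Under the null, each sorted p-value should track rank / (n + 1)."""
    ordered = sorted(p_values)
    n = len(ordered)
    return [(rank, p, rank / (n + 1)) for rank, p in enumerate(ordered, start=1)]

# Hypothetical per-cluster p-values (not from the talk).
example_p = [0.62, 0.004, 0.35, 0.88, 0.15]
points = p_value_plot_points(example_p)
for rank, p, expected in points:
    print(f"rank={rank}  p={p:.3f}  expected under null={expected:.3f}")
```

Points lying well below the expected line at the low ranks are the ones worth a closer look; a single such point among many otherwise uniform p-values may still be a chance result.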

38 Summary of ozone Local Control Analysis
NB: The outlier time period was removed.
Ozone on the day of death had no effect (LR within clusters).
For ozone on day -1, intercepts and slopes within clusters were consistent with random variation, with one exception showing a strong negative effect (p-value plots).
Deaths on day -1 and temperature were the most important predictors of the deviation of deaths from the time series.

39 Contact Information S. Stanley Young Robert Obenchain Goran Krstic

40 Outlier

41 What is going on?

42 Plan for Radon Data Set
Radon background
Local Control analysis strategy
Analysis of ~3000 US counties
Review of Local Control results
Summary

43 Step Zero: Selecting Variables
Note that radon is not a candidate for the clustering variables.

44 Puzzle, parts and fit Raw Data Politics Cleaned Data Methods

