1 PETER A. LACHENBRUCH OREGON STATE UNIVERSITY DEPARTMENT OF PUBLIC HEALTH

2 Clumping at 0
Some subjects show no response; others have a continuous, or at least ordered, response.
Examples:
– Hospitalization expense in an HMO
– Cell growth on plates
– Urinary output in shock patients
Usual normal theory doesn't apply.

3 Urinary Output (Afifi & Azen)

4 UO Analysis
Survival: 27/70 had UO=0; mean=127.9, s=148.13, skewness=1.13
Deaths: 22/43 had UO=0; mean=31.0, s=71.76, skewness=3.37
For these data:
– t=3.01 (p=0.0032)
– Wilcoxon z=2.794 (p=0.0052)
– Kolmogorov-Smirnov p=0.001
– two-part χ² = 15.86 on 2 d.f. (p < 0.001)

5 Statistical Model
f_i(x, d) = p_i^(1-d) {(1-p_i) h_i(x)}^d
H_0: p_1 = p_2 and h_1 = h_2
Tests:
– t-test on full data set
– Wilcoxon rank sum test
– Kolmogorov-Smirnov
– Two-part models: Bin+Z; Bin+W; Bin+KS
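The three standard tests are all built into Stata; as a minimal sketch (the variable names y, for the response including zeros, and group, for the sample, are hypothetical placeholders):

* Standard tests applied to the full data set, zeros included
ttest y, by(group)       // two-sample t-test
ranksum y, by(group)     // Wilcoxon rank-sum test
ksmirnov y, by(group)    // two-sample Kolmogorov-Smirnov test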

6 What are the relative properties?
Right size? Is α = 0.05 when it is supposed to be? Are the null distributions correct?
What is the power of these procedures under various alternatives? (Use a log-normal model.)
– Difference only in proportions
– Difference only in means
– Difference in both
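As a rough illustration of this simulation setup (not the authors' actual code), the following Stata sketch draws one replicate with 50 observations per group, zero proportions 0.1 and 0.2, and log-normal non-zero values whose log-scale means differ by 0.5; all parameter values here are illustrative assumptions.

* One hypothetical replicate: clump at 0 plus a log-normal non-zero part
clear
set obs 100
gen byte   group = (_n > 50) + 1                     // two groups of 50
gen double p0    = cond(group == 1, 0.1, 0.2)        // probability of a zero response
gen byte   d     = (runiform() > p0)                 // 1 = non-zero response
gen double mu    = cond(group == 1, 0, 0.5)          // log-scale mean of non-zero part
gen double y     = cond(d, exp(rnormal(mu, 1)), 0)   // observed response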

7 Tests

8 Two-part Tests
Define B as the standardized statistic comparing the two proportions of zeros, Z as the two-sample normal (t) statistic on the non-zero values, and W as the standardized Wilcoxon rank-sum statistic on the non-zero values.
Then the two-part tests are: B²+Z² (denoted BZ), B²+W² (denoted BW) and B²+K² (denoted BK), where K² is the chi-squared value corresponding to the p-value of the KS statistic.
Since the two parts are independent, each test statistic is the sum of two 1 d.f. (central) chi-squared statistics under the null, i.e. a chi-squared with 2 d.f.
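A minimal Stata sketch of the BW statistic, under the assumption that B² is taken as the Pearson chi-square from the 2x2 table of zero versus non-zero by group and W is the rank-sum z computed on the non-zero values (y and group are hypothetical variable names):

* Two-part BW test: binomial part plus Wilcoxon part
gen byte pos = (y > 0)
tabulate pos group, chi2        // compare the proportions of zeros
scalar B2 = r(chi2)             // 1 d.f. chi-square for the binomial part
ranksum y if pos, by(group)     // Wilcoxon on the non-zero values only
scalar W2 = r(z)^2              // squared z is a 1 d.f. chi-square
scalar BW = B2 + W2
display "BW = " BW ",  p = " chi2tail(2, BW)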

9 Size of Tests: n1 = n2 = 50, equal means


11 Power: n = 50, 100; p1 = 0.1, p2 = 0.2; mean difference = 0

12 Power: n = 50, 100; differ only in means; p = 0.1, 0.2; mean = 0.5

13 Power: n = 100, p1 = 0.1, p2 = 0.2, mean = 0.3, 0.5. Proportion and mean are consonant.

14 Power: n = 100, p1 = 0.2, p2 = 0.1, mean = 0.3, 0.5. Proportion and mean are dissonant.

15 Conclusions
These results are similar to those for other sample sizes and parameter combinations.
Size is appropriate.
Distributions match expectations, except for the largest values.
For differences only in proportions (low proportions), the BZ, BW and BK methods did well; Z did poorly.

16 Conclusions (2)
For differences only in means, W, K, Z, BW and BK did well.
For consonant differences (mean and proportion in the same direction), W, K, BW and BK did well; Z and BZ did poorly.
For dissonant differences, BW, BK and BZ were far superior to the others.

17 Conclusions (3)
Theoretical results indicate that computing sample size or power with the non-central χ² distribution gives excellent agreement with the simulated powers.
Papers:
– Comparisons: Statistics in Medicine, 2001
– Non-central distribution: Statistics in Medicine, 2001

18 Peter A. Lachenbruch and John Molitor Oregon State University

19 The Two-part Model
Some data have an excess of zero values. These cannot be modeled easily because of the spike at 0.
Use a mixture model if one cannot distinguish a sampling zero from a structural zero. Example: telephone calls in a short period of time. If the phone is turned on, some time periods may have no calls; if the phone is turned off, no calls are registered.
Use a two-part model if all zeros are structural. Examples: hospitalization cost when an insured person was not hospitalized; size of growth on an agar plate if all activity is inhibited.

20 An equation or two
Let y be the response: zero if there is no response, non-zero otherwise.
Let h(y) be the conditional distribution of y given y > 0.
Let d be an indicator of a non-zero response and p = probability that d = 1.
For a two-part model the density is f(y, d) = (1-p)^(1-d) {p h(y)}^d.
The log-likelihood is easy to compute, and the solution is simply the likelihood estimate for p and for the mean (regression) of y among the non-zero values.
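Because the likelihood factors into the two parts, the two pieces can be fitted separately. A minimal Stata sketch, with hypothetical response y and covariates x1 and x2 (logging the positive values, as the aldosterone example later does):

* Two-part fit: logistic model for Pr(y > 0), regression for the non-zero part
gen byte d = (y > 0)
gen lny    = ln(y) if d        // log scale for the positive responses
logistic d x1 x2               // part 1: probability of a non-zero response
regress lny x1 x2 if d         // part 2: conditional (log) mean given y > 0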

21 Inference
One estimates parameters using the individual components of the likelihood. These are standard estimates: a logistic regression for the zero/non-zero part and a multiple regression for the non-zero values.
An issue is how to select variables for inclusion in a model:
– Select variables separately for each part of the model?
– Select variables for the model as a whole, treating the 0 values as if they were regular observations?

22 Variable selection criteria
What criterion?
– R² = 1 - RSS/SST
– R²_adj = 1 - [(n-1)/(n-k-1)] * RSS/SST
– AIC = n*ln(RSS/n) + 2k + n + n*ln(2π)
– BIC = n*ln(RSS/n) + k*ln(n) + n*ln(2π)
(these are for normal distribution models)
Use forward or backward stepping:
– p to enter 0.15, 0.05
– p to remove 0.15, 0.05
Best subsets models?
For generalized linear models, the deviance is proposed.
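In Stata these criteria do not have to be computed by hand: after an estimation command, estat ic reports AIC and BIC for the fitted model (the deck uses it after the stepwise logistic fit on slide 32). A minimal sketch with hypothetical variables:

* AIC and BIC for a fitted regression
regress y x1 x2 x3
estat ic        // reports the log-likelihood, AIC and BIC of the last fit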

23 Variable Selection
For the multiple regression, we can use stepwise regression; there are the usual concerns about stepwise.
We can use AIC, BIC or R² to select the best model. AIC and BIC penalize the selection based on the number of variables in the model. For normal distributions:
– AIC = n*ln(RSS/n) + 2k + n + n*ln(2π)
– BIC = n*ln(RSS/n) + k*ln(n) + n*ln(2π)
Bias-adjusted versions of R² and AIC are also available.

24 More on selection
For the logistic part of the model, we use stepwise logistic regression and specify a p(enter) or p(remove); this is based on the test of the odds ratio for each candidate variable.
Most programs use a stepwise routine that selects on the basis of the test on the odds ratio (basically a normal-theory test).

25 Single model methods
There are two single-model methods we consider:
– Include the 0 values in a multiple regression. This is obviously inappropriate, but users have often done it. In practice it selects more variables and includes the ones selected by the logistic and multiple regression models.
– Conduct a Bayesian analysis of the variable selection problem. This is work in progress.

26 Computing - Stata
We use Stata for computing because it has some convenient selection commands.
The recently developed vselect command, due to Lindsey and Sheather, allows one to do variable selection using AIC, BIC or R² with forward or backward stepping, as well as finding the best set of variables for each number of variables.
The best-subsets option uses the "leaps and bounds" algorithm of Furnival and Wilson, which vastly reduces the amount of computation.

27 More on selection
Unfortunately, at present, vselect works only for multiple regression and not for logistic regression. Thus, we considered two strategies (sketched below):
– Use stepwise logistic regression directly.
– Regress the 0-1 indicator with ordinary regression and perform the variable selection on that fit.
The vselect command first computes a multiple regression on all variables, then computes the stepwise variable selection from the X'X matrix.
It allows the use of R², AIC, BIC, Mallows' Cp, and best subsets regression. In the example, we use the best option, which gives all of the above.
The Bayesian methods will be presented separately.
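A minimal sketch of the two strategies as they are applied later in the deck (slides 32 and 34); aldind is the 0-1 indicator from the example, and x1-x3 stand in for the candidate predictors:

* Strategy 1: stepwise logistic regression on the 0-1 indicator
stepwise, pe(0.15): logistic aldind x1 x2 x3
* Strategy 2: ordinary regression of the indicator, with selection via vselect
vselect aldind x1 x2 x3, best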

28 Example data
We use a data set courtesy of Lisa Rider.
– lald = ln(aldosterone) (response); aldind – 0-1 indicator of zero versus non-zero
– dx2 – polymyositis (1) or dermatomyositis (2)
– agedx – age at diagnosis; yeardx – year of diagnosis
– gender – male (0), female (1)
– ild – interstitial lung disease (Y/N); arthritis (Y/N)
– fever >100 (Y/N); Raynaud's sign (Y/N)
– mechhand – mechanic's hands (Y/N); palpitations (Y/N)
– dysphagia (Y/N); proximal weakness (Y/N)
– race – W/NW; realonspeed – onset speed

29 The prediction problem
We wish to predict laldo. However, 72 out of 420 values are 0, which leads to a clump of zero values.
We may wish to have a single set of predictors for lald, or a set of predictors for the non-zero values and a (possibly distinct) set of predictors for the 0 values.
A related question is how to evaluate the predictive ability of the resulting equations.

30 Example of vselect
. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed
(coefficient table omitted; F(14, 332) = 4.45 for the full model on the non-zero values)
The next slide gives the vselect command and output. Note the restriction to lald > 0 and u80 (an indicator that the patient was first diagnosed after 1980).

31 vselect output
This is the vselect output on the non-zero values. We truncated the listing after the first few models; the actual output runs through all 14 variables.
. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed, best
1 Observations Containing Missing Predictor Values
Response: laldo
Selected Predictors: dx2 realonspeed dysphag raynaud palpita gender racewnw fever arthritis proxweak agedx yeardx mechhand ild
Optimal models by number of predictors (R²adj, Cp, AIC, AICC and BIC values omitted):
1: dx2
2: dx2 realonspeed
3: dx2 realonspeed raynaud
4: dx2 realonspeed dysphag raynaud
5: dx2 realonspeed dysphag raynaud racewnw
6: dx2 realonspeed dysphag raynaud palpita racewnw
In this case, the program computed 27 regressions out of the 16,384 (= 2^14) possible regressions.

32 Selecting predictors for 0 indicator
For the logistic regression we use stepwise logistic regression, which selects variables based on odds ratios. We use forward stepping with a p-to-enter of 0.15.
. stepwise, pe(.15): logistic aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80
note: proxweak dropped because of estimability
note: 1 obs. dropped because of estimability
begin with empty model; adding palpita, then arthritis, then gender (entry p-values omitted)
Logistic regression, Number of obs = 418, LR chi2(3); odds-ratio table for palpita, arthritis and gender omitted.
. estat ic
(AIC and BIC for the fitted model omitted.)
We see that dx2 and onset speed did not enter, so somewhat different variables predict 0-ness than predict the magnitude of the response.

33 Selecting predictors for 0 with regression, ignoring binomial form
We display only results for the first five selected variables.
. regress aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80
(coefficient table omitted; F(14, 404) = 1.84 for the full model)

34 Selecting predictors for 0 with regression, ignoring binomial form, 2
. vselect aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best
2 Observations Containing Missing Predictor Values
Response: aldind
Selected Predictors: palpita arthritis gender fever agedx mechhand dysphag racewnw ild proxweak yeardx dx2 realonspeed raynaud
Optimal models by number of predictors (criterion values omitted):
1: palpita
2: palpita arthritis
3: palpita arthritis gender
4: palpita arthritis gender fever
5: palpita arthritis gender fever agedx
Note that the selected variables are identical to those from the stepwise logistic regression.

35 Multiple regression with 0 in the data set
We now consider the model including the 0 values as part of the data. This is made a bit easier by having taken logs of the non-zero values, so the 0s aren't quite so obviously different.
. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80
(coefficient table omitted; F(14, 404) = 2.84 for the full model)

36 Using vselect on the full data set
Displaying the optimal models for one through seven predictors.
. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best
2 Observations Containing Missing Predictor Values
Response: laldo
Selected Predictors: dx2 palpita realonspeed gender arthritis raynaud dysphag fever mechhand ild agedx yeardx racewnw proxweak
Optimal models by number of predictors (criterion values omitted):
1: dx2
2: dx2 palpita
3: dx2 palpita realonspeed
4: dx2 palpita realonspeed arthritis
5: dx2 palpita realonspeed gender arthritis
6: dx2 palpita realonspeed gender arthritis raynaud
7: dx2 palpita realonspeed gender arthritis raynaud dysphag
There are some differences between the variables selected by the logistic regression and the multiple regression: Raynaud's and dysphagia were selected in the multiple regression.

37 Future Steps
Develop a full Bayesian analysis/model. This may include a model that performs variable selection with the 0 values included in the selection set, or a Bayesian model for the non-zero values together with a model for the zero versus non-zero indicator.
Develop a model using a bootstrap and select based on Wald statistics.
Stay tuned…

