Published by Montana Oak. Modified about 1 year ago.

1
Building useful models: Some new developments and easily avoidable errors
Michael Babyak, PhD

2
What is a model?
$Y = f(x_1, x_2, x_3, \ldots, x_n)$
$Y = a + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n$
$Y = e^{a + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n}$

3
“All models are wrong, some are useful” -- George Box
A useful model is
– Not very biased
– Interpretable
– Replicable (predicts in a new sample)

4

5
Some Premises
“Statistics” is a cumulative, evolving field
Newer is not necessarily better, but should be entertained in the context of the scientific question at hand
Data analytic practice resides along a continuum, from exploratory to confirmatory. Both are important, but the difference has to be recognized.
There’s no substitute for thinking about the problem

6
Statistics is a cumulative, evolving field: How do we know this stuff?
Theory
Simulation

7
Concept of Simulation
[Diagram: population model Y = bX + error; k samples are drawn from it, each giving its own estimate b_s1, b_s2, b_s3, b_s4, …, b_sk-1, b_sk]

8
Concept of Simulation
[Diagram: population model Y = bX + error; the k sample estimates b_s1, b_s2, b_s3, b_s4, …, b_sk-1, b_sk are collected and evaluated]

9
Simulation Example
[Diagram: population model Y = .4X + error; k samples are drawn from it, each giving its own estimate b_s1, b_s2, b_s3, b_s4, …, b_sk-1, b_sk]

10
Simulation Example
[Diagram: population model Y = .4X + error; the k sample estimates b_s1, b_s2, b_s3, b_s4, …, b_sk-1, b_sk are collected and evaluated]
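The simulation idea on these slides can be sketched in a few lines. This is a minimal illustration, not the author's code: the sample size, number of samples, standard normal X, and standard normal error are all assumptions, and the function names are made up for the example.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope of y on x (model with an intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(true_b=0.4, n=100, k=1000, seed=1):
    """Draw k samples of size n from Y = true_b * X + error and estimate b in each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]          # assumed: X ~ N(0, 1)
        y = [true_b * xi + rng.gauss(0, 1) for xi in x]  # assumed: error ~ N(0, 1)
        slopes.append(ols_slope(x, y))
    return slopes


if __name__ == "__main__":
    slopes = simulate_slopes()
    # "Evaluate": compare the k estimates with the known true value of .4
    print("mean of estimates:", round(statistics.fmean(slopes), 3))
    print("SD of estimates:  ", round(statistics.stdev(slopes), 3))
```

Because the true b is known, the spread and average of the k estimates show directly how well a given analysis strategy recovers it.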

11
True Model: Y = .4*x1 + e

12
Ingredients of a Useful Model
Correct probability model
Good measures/no loss of information
Based on theory
Comprehensive
Parsimonious
Flexible
Tested fairly

13
Correct Model
Gaussian: General Linear Model
– Multiple linear regression
Binary (or ordinal): Generalized Linear Model
– Logistic Regression
– Proportional Odds/Ordinal Logistic
Time to event:
– Cox Regression or parametric survival models

14
Generalized Linear Model
Normal: General Linear Model/Linear Regression, ANOVA/t-test, ANCOVA
Binary/Binomial: Logistic Regression, Chi-square
Count, heavy skew, lots of zeros: Poisson, ZIP, negbin, gamma; Regression w/ Transformed DV
Can be applied to clustered data (e.g., repeated measures)

15
Factor Analytic Family
Structural Equation Models
Partial Least Squares
Latent Variable Models (Confirmatory Factor Analysis)
Multiple regression
Principal Components
Common Factor Analysis

16
Use Theory
Theory and expert information are critical in helping sift out artifact
Numbers can look very systematic when they are in fact random
– http://www.tufts.edu/~gdallal/multtest.htm

17
Measure well
Adequate range
Representative values
Watch for ceiling/floor effects

18
Using all the information
Preserving cases in data sets with missing data
Conventional approaches:
– Use only complete cases
– Fill in with mean or median
– Use a missing data indicator in the model

19
Missing Data
Imputation or related approaches are almost ALWAYS better than deleting incomplete cases
– Multiple Imputation
– Full Information Maximum Likelihood

20
Multiple Imputation

21
Modern Missing Data Techniques
Preserve more information from original sample
Incorporate uncertainty about missingness into final estimates
Produce better estimates of population (true) values
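One common way multiple imputation carries the uncertainty about missing values into the final estimate is Rubin's rules for pooling results across imputed data sets. The slides do not name the pooling method, so treating Rubin's rules as the relevant step is an assumption. The function name is hypothetical, and the numbers in the usage line are made up for illustration.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets using Rubin's rules.

    estimates: the coefficient estimated in each imputed data set
    variances: its squared standard error in each imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between     # between-imputation spread inflates the variance
    return pooled, total


if __name__ == "__main__":
    # Hypothetical coefficients and squared standard errors from 5 imputations
    est, var = pool_rubin([0.42, 0.38, 0.45, 0.40, 0.36], [0.010, 0.011, 0.009, 0.010, 0.012])
    print(round(est, 3), round(var ** 0.5, 3))
```

The between-imputation term is what a single filled-in data set (for example, mean substitution) leaves out, which is why single imputation tends to give standard errors that are too small.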

22
Don’t waste information from variables
Use all the information about the variables of interest
Don’t create “clinical cutpoints” before modeling
Model with ALL the data first, then use prediction to make decisions about cutpoints

23
Dichotomizing for Convenience = Dubious Practice (C.R.A.P.*)
*Convoluted Reasoning and Anti-intellectual Pomposity
Streiner & Norman: Biostatistics: The Bare Essentials

24
Implausible measurement assumption
[Figure: depression score axis with points A, B, and C, cut into “not depressed” and “depressed”]

25
Loss of power
Sometimes, through sampling error, you can get a ‘lucky cut.’

26
Dichotomization, by definition, reduces the magnitude of the estimate by a minimum of about 30%
Dear Project Officer,
In order to facilitate analysis and interpretation, we have decided to throw away about 30% of our data. Even though this will waste about 3 or 4 hundred thousand dollars worth of subject recruitment and testing money, we are confident that you will understand.
Sincerely,
Dick O. Tomi, PhD
Prof. Richard Obediah Tomi, PhD

27
Power to detect non-zero b-weight when x is continuous versus dichotomized
True model: y = .4x + e
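The power comparison on this slide can be approximated by simulation. The sketch below is not the author's simulation: the sample size, the standard normal x and e, the median split, and the Fisher z test used in place of the regression t-test are all assumptions made to keep the example self-contained.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rejects_null(x, y, z_crit=1.96):
    """Two-sided test of zero association using Fisher's z approximation."""
    return abs(math.atanh(pearson_r(x, y))) * math.sqrt(len(x) - 3) > z_crit


def power_continuous_vs_split(b=0.4, n=50, reps=1000, seed=1):
    """Share of simulated samples in which x is 'significant', continuous vs median split."""
    rng = random.Random(seed)
    hits_continuous = hits_split = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [b * xi + rng.gauss(0, 1) for xi in x]   # true model: y = .4x + e
        cut = statistics.median(x)
        x_split = [1.0 if xi > cut else 0.0 for xi in x]
        hits_continuous += rejects_null(x, y)
        hits_split += rejects_null(x_split, y)
    return hits_continuous / reps, hits_split / reps


if __name__ == "__main__":
    print(power_continuous_vs_split())
```

With a single predictor, testing the slope and testing the correlation are the same hypothesis, which is why a correlation test stands in for the b-weight test here.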

28
Dichotomizing will obscure non-linearity
[Figure: outcome for Low versus High CESD Score]

29
Dichotomizing will obscure non-linearity: Same data as previous slide modeled continuously

30
Type I error rates for the relation between x2 and y after dichotomizing two continuous predictors.
Maxwell and Delaney calculated the effect of dichotomizing two continuous predictors as a function of the correlation between them. The true model is y = .5x1 + 0x2, where all variables are continuous. If x1 and x2 are dichotomized, the error rate for the relation between x2 and y increases as the correlation between x1 and x2 increases.
[Figure: Type I error rate by correlation between x1 and x2 and by N]
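The effect described here can be reproduced in a rough simulation. Maxwell and Delaney derived their rates analytically, so this is a stand-in under assumptions, not their calculation: standard normal predictors and error, median splits, N = 200, and a normal critical value in place of the exact t.

```python
import math
import random
import statistics


def z_for_second_slope(x1, x2, y):
    """z statistic for the x2 coefficient in the OLS fit y ~ x1 + x2 (with intercept)."""
    n = len(y)
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = sum((yy - b1 * a - b2 * b) ** 2 for a, b, yy in zip(c1, c2, cy))
    se_b2 = math.sqrt(rss / (n - 3) * s11 / det)
    return b2 / se_b2


def type1_rate_x2(rho, n=200, reps=400, seed=1, z_crit=1.96):
    """How often dichotomized x2 looks 'significant' when its true coefficient is 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x1]
        y = [0.5 * a + rng.gauss(0, 1) for a in x1]      # true model: y = .5*x1 + 0*x2 + e
        cut1, cut2 = statistics.median(x1), statistics.median(x2)
        d1 = [1.0 if a > cut1 else 0.0 for a in x1]      # median splits
        d2 = [1.0 if a > cut2 else 0.0 for a in x2]
        rejections += abs(z_for_second_slope(d1, d2, y)) > z_crit
    return rejections / reps


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.5, 0.7):
        print(rho, type1_rate_x2(rho))
```

The mechanism is that the split version of x1 no longer carries all of x1's information, so the part it drops is picked up by the correlated x2.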

31
Is it ever a good idea to categorize quantitatively measured variables?
Yes:
– when the variable is truly categorical
– for descriptive/presentational purposes
– for hypothesis testing, if enough categories are made. However, using many categories can lead to problems of multiple significance tests and still run the risk of misclassification

32
CONCLUSIONS
Cutting:
– Doesn’t always make measurement sense
– Almost always reduces power
– Can fool you with too much power in some instances
– Can completely miss important features of the underlying function
Modern computing/statistical packages can “handle” continuous variables
Want to make good clinical cutpoints? Model first, decide on cuts afterward.

33
Sample size and the problem of underfitting vs overfitting
Model assumption is that “ALL” relevant variables be included: the “antiparsimony principle”
Tempered by the fact that estimating too many unknowns with too little data will yield junk

34
Sample Size Requirements
Linear regression
– minimum of N = :predictor (Green, 1990)
Logistic Regression
– Minimum of N = 10-15/predictor among smallest group (Peduzzi et al., 1990a)
Survival Analysis
– Minimum of N = 10-15/predictor (Peduzzi et al., 1990b)
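The logistic regression rule of thumb turns into simple arithmetic: count the cases in the smaller outcome group and divide by 10 to 15. The helper below is a hypothetical illustration of that arithmetic only, and the counts in the usage lines are made up.

```python
def max_predictors(n_events, n_nonevents, per_predictor=10):
    """Largest number of predictors the 10-15 per predictor rule of thumb would allow.

    The limiting sample size is the smaller of the two outcome groups,
    not the total N.
    """
    if per_predictor <= 0:
        raise ValueError("per_predictor must be positive")
    limiting_group = min(n_events, n_nonevents)
    return limiting_group // per_predictor


if __name__ == "__main__":
    # Hypothetical study: 1000 patients, 60 of whom had the event
    print(max_predictors(60, 940, per_predictor=10))  # lenient end of the rule
    print(max_predictors(60, 940, per_predictor=15))  # strict end of the rule
```

Note that a study with 1000 participants but only 60 events is, for this purpose, a study of size 60.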

35
Consequences of inadequate sample size
Lack of power for individual tests
Unstable estimates
Spurious good fit: lots of unstable estimates will produce spurious ‘good-looking’ (big) regression coefficients

36
All-noise, but good fit
[Figure: R-squares from a population model of completely random variables, by events per predictor ratio]
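The figure itself is not recoverable, but the point it makes follows from a standard result: in ordinary least squares with an intercept and normal errors, when every true coefficient is zero the expected R-squared is p/(N - 1). The sketch below tabulates that expectation. The choice of 10 predictors is an assumption for illustration and is not taken from the slide.

```python
def expected_null_r2(n, p):
    """Expected R-squared when all p predictors are pure noise (OLS with intercept)."""
    if n <= p + 1:
        raise ValueError("need n > p + 1 for the model to be estimable with residual df")
    return p / (n - 1)


if __name__ == "__main__":
    p = 10  # assumed number of noise predictors
    for ratio in (3, 5, 10, 20, 50):
        n = ratio * p
        print(f"N/p = {ratio:>2}: expected R-squared from noise = {expected_null_r2(n, p):.3f}")
```

At low N/p ratios a sizeable R-squared is expected from noise alone, which is why apparent fit says little about a model built on too few observations per predictor.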

37
Simulation: number of events/predictor ratio
Y = .5*x1 + 0*x2 + .2*x3 + 0*x4
-- Where x1 x4 = .4
-- N/p = 3, 5, 10, 20, 50

38
Parameter stability and n/p ratio

39
Peduzzi’s Simulation: number of events/predictor ratio
P(survival) = a + b1*NYHA + b2*CHF + b3*VES + b4*DM + b5*STD + b6*HTN + b7*LVC
-- Events/p = 2, 5, 10, 15, 20, 25
-- % relative bias = ((estimated b – true b)/true b)*100
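The relative bias measure is easy to get wrong without the parentheses, so a small worked version may help. The function name and the example values are hypothetical and are not numbers from Peduzzi's study.

```python
def percent_relative_bias(estimated_b, true_b):
    """Percent relative bias: ((estimated b - true b) / true b) * 100."""
    if true_b == 0:
        raise ValueError("relative bias is undefined when the true coefficient is 0")
    return (estimated_b - true_b) / true_b * 100


if __name__ == "__main__":
    # Hypothetical: true coefficient .5, average simulated estimate .6
    print(percent_relative_bias(0.6, 0.5))
```

In a simulation, `estimated_b` would be the average of the coefficient over all simulated samples at a given events-per-predictor ratio.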

40
Simulation results: number of events/predictor ratio

41

42
Approaches to variable selection
“Stepwise” automated selection
Pre-screening using univariate tests
Combining or eliminating redundant predictors
Fixing some coefficients
Theory, expert opinion and experience
Penalization/Random effects
Propensity Scoring
– “Matches” individuals on multiple dimensions to improve “baseline balance”
Tibshirani’s “Lasso”

43
Any variable selection technique based on looking at the data first will likely be biased

44
“I now wish I had never written the stepwise selection code for SAS.” --Frank Harrell, author of forward and backwards selection algorithm for SAS PROC REG

45
Automated Selection: Derksen and Keselman (1992) Simulation Study
Studied backward and forward selection
Some authentic variables and some noise variables among candidate variables
Manipulated correlation among candidate predictors
Manipulated sample size

46
Automated Selection: Derksen and Keselman (1992) Simulation Study
“The degree of correlation between candidate predictors affected the frequency with which the authentic predictors found their way into the model.”
“The greater the number of candidate predictors, the greater the number of noise variables were included in the model.”
“Sample size was of little practical importance in determining the number of authentic variables contained in the final model.”

47
Simulation results: Number of noise variables included
20 candidate predictors; 100 samples
[Figure: number of noise variables included, by sample size]

48
Simulation results: R-square from noise variables
20 candidate predictors; 100 samples
[Figure: R-square from noise variables, by sample size]

50
SOME of the problems with stepwise variable selection.
1. It yields R-squared values that are badly biased high
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (See Altman and Anderson Stat in Med)
4. It yields P-values that do not have the proper meaning and the proper correction for them is a very difficult problem
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity
7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman)
9. It allows us to not think about the problem
10. It uses a lot of paper

51
Chatfield, C. (1995). Model uncertainty, data mining and statistical inference (with discussion). JRSSA, 158.
Bias by selecting model because it fits the data well; bias in standard errors.
P. 420: ... need for a better balance in the literature and in statistical teaching between techniques and problem solving strategies.
P. 421: It is `well known' to be `logically unsound and practically misleading' (Zhang, 1992) to make inferences as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a `quiet scandal'), they (we) continue to ignore the difficulties because it is not clear what else could or should be done.
P. 421: Estimation errors for regression coefficients are usually smaller than errors from failing to take into account model specification.
P. 422: Statisticians must stop pretending that model uncertainty does not exist and begin to find ways of coping with it.
P. 426: It is indeed strange that we often admit model uncertainty by searching for a best model but then ignore this uncertainty by making inferences and predictions as if certain that the best fitting model is actually true.

52
Phantom Degrees of Freedom
Faraway (1992) showed that any pre-modeling strategy costs a df over and above the df used later in modeling.
Premodeling strategies included: variable selection, outlier detection, linearity tests, residual analysis.
Thus, although not accounted for in the final model, these phantom df will render the model too optimistic

53
Phantom Degrees of Freedom Therefore, if you transform, select, etc., you must include the DF in (i.e., penalize for) the “Final Model”

54
Conventional Univariate Pre-selection
Non-significant tests also cost a DF
Non-significance is NOT necessarily related to importance
Variables may not behave the same way in a multivariable model: a variable “not significant” at univariate test may be very important in the presence of other variables

55
Conventional Univariate Pre-selection
Despite the convention, testing for confounding has not been systematically studied. In many cases it leads to overadjustment and underestimation of the true effect of the variable of interest.
At the very least, pulling variables in and out of models inflates the model fit, often dramatically
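How easily data-driven pre-selection lets noise into a model can be seen in a small simulation. This sketch is an assumption-laden illustration, not a reproduction of any study cited in the slides: it screens 20 pure-noise candidates one at a time at roughly the .05 level, using a Fisher z test on the correlation, with N = 100.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def noise_passing_screen(n=100, n_candidates=20, z_crit=1.96, rng=None):
    """Count pure-noise predictors that pass a univariate 'significance' screen."""
    rng = rng or random.Random()
    y = [rng.gauss(0, 1) for _ in range(n)]
    passed = 0
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n)]   # unrelated to y by construction
        z = math.atanh(pearson_r(x, y)) * math.sqrt(n - 3)
        passed += abs(z) > z_crit
    return passed


def share_of_runs_keeping_noise(reps=300, seed=1):
    """Proportion of simulated studies in which at least one noise variable is kept."""
    rng = random.Random(seed)
    return sum(noise_passing_screen(rng=rng) > 0 for _ in range(reps)) / reps


if __name__ == "__main__":
    print(share_of_runs_keeping_noise())
```

Any variable kept this way enters the final model looking "significant", even though nothing in the data-generating process connects it to the outcome.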

56
Better approach
Pick variables a priori
Stick with them
Penalize appropriately for any data-driven decision about how to model a variable

57
Spending DF wisely
If not enough N/predictor, combine covariates using techniques that do not look at Y in the sample: PCA, FA, conceptual clustering, collapsing, scoring, established indexes.
Save DF for finer-grained look at variables of most interest, e.g., non-linear functions

58
Help is on the way?
Penalization/Random effects
Propensity Scoring
– “Matches” individuals on multiple dimensions to improve “baseline balance”
Tibshirani’s Lasso

59

60
Validation
Apparent fit
– Usually too optimistic
Internal
– cross-validation, bootstrap
– honest estimate for model performance
– provides an upper limit to what would be found on external validation
External validation
– replication with new sample, different circumstances

61
Validation
Steyerberg, et al. (1999) compared validation methods
Found that split-half was far too conservative
Bootstrap was equal or superior to all other techniques
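One widely used form of bootstrap validation is the optimism correction: refit the model in each bootstrap sample, compare its fit there with its fit on the original data, and subtract the average difference from the apparent fit. The sketch below applies that idea to R-squared for a one-predictor linear model. It is an assumption that this is the variant meant on the slide, and the function names are made up for the example.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def r_squared(x, y, intercept, slope):
    """R-squared of a given line evaluated on the data (x, y)."""
    my = statistics.fmean(y)
    ss_res = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


def optimism_corrected_r2(x, y, n_boot=200, seed=1):
    """Return (apparent R-squared, optimism-corrected R-squared)."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(x, y, *fit_line(x, y))
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample cases WITH replacement
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        a, b = fit_line(bx, by)
        # fit in the bootstrap sample minus fit of the same line on the original data
        optimism.append(r_squared(bx, by, a, b) - r_squared(x, y, a, b))
    return apparent, apparent - statistics.fmean(optimism)


if __name__ == "__main__":
    rng = random.Random(7)
    x = [rng.gauss(0, 1) for _ in range(30)]
    y = [0.4 * xi + rng.gauss(0, 1) for xi in x]
    print(optimism_corrected_r2(x, y))
```

Unlike split-half validation, every case is used both to build and to evaluate the model, which is one reason the bootstrap wastes less information.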

62
Conclusions
Measure well
Use all the information
Recognize the limitations based on how much data you actually have
In the confirmatory mode, be as explicit as possible about the model a priori, test it, and live with it
By all means, explore data, but recognize, and state frankly, the limits post hoc analysis places on inference

63
Advanced topics and examples

64
Bootstrap
[Diagram: k resamples are drawn from My Sample WITH REPLACEMENT; the estimates ?1, ?2, ?3, ?4, …, ?k-1, ?k from the resamples are then evaluated]

65
1, 3, 4, 5, 7,

66
Can use data to determine where to spend DF
Use Spearman’s Rho to test “importance”
Not peeking because we have chosen to include the term in the model regardless of relation to Y
Use more DF for non-linearity
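A rank correlation between each pre-chosen predictor and Y gives a rough ordering of where extra DF (for example, spline terms) are most likely to pay off. The sketch below computes plain Spearman's rho with average ranks for ties. The ranking helper and its use for allocating DF are an illustrative reading of the slide, and the variable names in the usage example are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def rank_predictors(predictors, y):
    """Sort predictor names by squared Spearman rho with y, strongest first."""
    scores = {name: spearman_rho(values, y) ** 2 for name, values in predictors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    y = [0, 0, 1, 0, 1, 1, 1, 0]
    predictors = {"fare": [7, 8, 30, 13, 80, 52, 26, 7], "age": [22, 38, 26, 35, 54, 2, 27, 14]}
    print(rank_predictors(predictors, y))
```

Every listed predictor stays in the model whatever its score; the ordering only decides which ones get a more flexible functional form.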

67
Example-Predict Survival from age, gender, and fare on Titanic: example using S-Plus (or R) software

68
If you have already decided to include them (and promise to keep them in the model) you can peek at predictors in order to see where to add complexity

69

70
Non-linearity using splines

71
Linear Spline (piecewise regression) Y = a + b1(x 20)

72
Cubic Spline (non-linear piecewise regression) knots
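A restricted cubic spline is cubic between knots and constrained to be linear beyond the outer knots, so k knots cost only k - 1 DF. The sketch below builds the usual truncated-power basis for one value of x. It leaves out any rescaling that a statistical package may apply to these terms, and the knot locations in the usage example are arbitrary.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single x: [x, nonlinear term 1, ...].

    With k knots this returns k - 1 columns: the linear term plus k - 2
    nonlinear terms, each of which is 0 below the first knot and linear
    above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def cube_plus(u):
        """Truncated cube: u**3 when u is positive, otherwise 0."""
        return u ** 3 if u > 0 else 0.0

    last_gap = t[-1] - t[-2]
    columns = [float(x)]
    for j in range(k - 2):
        term = (
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / last_gap
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / last_gap
        )
        columns.append(term)
    return columns


if __name__ == "__main__":
    knots = [10, 20, 30]  # arbitrary knot locations for illustration
    for value in (5, 15, 25, 40, 50):
        print(value, rcs_basis(value, knots))
```

With 3 knots the basis has two columns, so a predictor modeled this way uses 2 DF: one for the linear part and one for the departure from linearity.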

73
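Logistic regression model, spline with 3 knots
library(rms)  # assumed: in R, lrm() and rcs() are provided by Harrell's rms package
fitfare <- lrm(survived ~ (rcs(fare, 3) + age + sex)^2, x = TRUE, y = TRUE)  # fare as a 3-knot restricted cubic spline, plus all two-way interactions
anova(fitfare)  # Wald tests for each factor, for nonlinearity, and for interactions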

74
Wald Statistics, Response: survived
Columns: Factor, Chi-Square, d.f., P
fare (Factor+Higher Order Factors): P <.0001
  All Interactions
  Nonlinear (Factor+Higher Order Factors)
age (Factor+Higher Order Factors)
  All Interactions
sex (Factor+Higher Order Factors): P <.0001
  All Interactions
fare * age (Factor+Higher Order Factors)
  Nonlinear
  Nonlinear Interaction : f(A,B) vs. AB
fare * sex (Factor+Higher Order Factors)
  Nonlinear
  Nonlinear Interaction : f(A,B) vs. AB
age * sex (Factor+Higher Order Factors)
TOTAL NONLINEAR
TOTAL INTERACTION
TOTAL NONLINEAR + INTERACTION: P <.0001
TOTAL: P <.0001

79

80

81

82
Bootstrap Validation
[Table: columns Index, Training, Corrected; rows Dxy, R, Intercept, Slope]

83
Summary Think about your model Collect enough data

84
Summary Measure well Don’t destroy what you’ve measured

85
Summary
Pick your variables ahead of time and collect enough data to test the model you want
Keep all your variables in the model unless extremely unimportant

86
Summary
Use more df on important variables, fewer df on “nuisance” variables
Don’t peek at Y to combine, discard, or transform variables

87
Summary
Estimate validity and shrinkage with bootstrap

88
Summary
By all means, tinker with the model later, but be aware of the costs of tinkering
Don’t forget to say you tinkered
Go collect more data

89
Web links for references, software, and more
Harrell’s regression modeling text
– http://hesweb1.med.virginia.edu/biostat/rms/
SAS Macros for spline estimation
– http://hesweb1.med.virginia.edu/biostat/SAS/survrisk.txt
Some results comparing validation methods
– http://hesweb1.med.virginia.edu/biostat/reports/logistic.val.pdf
SAS code for bootstrap
– ftp://ftp.sas.com/pub/neural/jackboot.sas
S-Plus home page
– insightful.com
Mike Babyak’s
This presentation
– http://www.duke.edu/~mbabyak

90
duke.edu symptomresearch.nih.gov/chapter_8/
