Stat 6601 Presentation
Presented by: Xiao Li (Winnie), Wenlai Wang, Ke Xu
Nov. 17, 2004
V & R 6.6
Preview of the Presentation
Bootstrapping Linear Models (11/17/2004)
Introduction to Bootstrap
Data and Modeling
Methods on Bootstrapping LM
Results
Issues and Discussion
Summary
What is Bootstrapping?
Invented by Bradley Efron, and further developed by Efron and Tibshirani
A method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample
A method to determine the trustworthiness of a statistic (a generalization of the standard deviation)
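The idea above can be shown in a few lines. This is a minimal sketch on simulated data (all names and values here are our own illustration, not from the slides): the bootstrap standard error of a statistic is just the standard deviation of that statistic across resamples drawn with replacement.

```r
# A minimal sketch (hypothetical data): estimate the standard error of
# the sample median by resampling with replacement from the original sample.
set.seed(1)
x <- rexp(50)                        # the "original sample"
R <- 1000                            # number of bootstrap replicates
meds <- replicate(R, median(sample(x, replace = TRUE)))
se.boot <- sd(meds)                  # bootstrap estimate of SE(median)
se.boot
```

The same logic underlies everything that follows; the `boot` package simply automates the resampling loop and the bookkeeping.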
Why use Bootstrapping?
Start with two questions:
What estimator should be used?
Having chosen an estimator, how accurate is it?
For a linear model with normal random errors having constant variance: least squares.
For generalized non-normal errors and non-constant variance: ???
The Mammals Data
A data frame with average brain and body weights for 62 species of land mammals.
"body": body weight in kg
"brain": brain weight in g
"name": common name of species
Data and Model
Linear Regression Model: y_j = β0 + β1·x_j + ε_j, where j = 1, …, n, and ε_j is considered random
y = log(brain weight)
x = log(body weight)
Summary of Original Fit
Residuals:
    Min      1Q  Median      3Q     Max
Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                < 2e-16 ***
log(body)                                  < 2e-16 ***
Residual standard error: on 60 DF
Multiple R-squared:     Adjusted R-squared:
F-statistic: on 1 and 60 DF, p-value: < 2.2e-16
R Code for Original Modeling

library(MASS)
library(boot)
data(mammals)
par(mfrow = c(1, 2))
plot(mammals$body, mammals$brain, main = 'Original Data',
     xlab = 'body weight', ylab = 'brain weight', col = 'brown')   # plot of raw data
plot(log(mammals$body), log(mammals$brain), main = 'Log-Transformed Data',
     xlab = 'log body weight', ylab = 'log brain weight', col = 'brown')  # log-log plot
mammal <- data.frame(body = log(mammals$body), brain = log(mammals$brain))
attach(mammal)
log.fit <- lm(brain ~ body, data = mammal)
summary(log.fit)
Two Methods
Case-based resampling: randomly sample pairs (X_i, Y_i) with replacement
  No assumption about variance homogeneity
  The design fixes the information content of a sample
Model-based resampling: resample the residuals
  Assumes the model is correct, with homoscedastic errors
  The resampling model has the same "design" as the data
Case-Based Resample Algorithm
For r = 1, …, R:
1. Sample i*_1, …, i*_n randomly with replacement from {1, 2, …, n}
2. For j = 1, …, n, set x*_j = x_{i*_j} and y*_j = y_{i*_j}
3. Fit the least squares regression to (x*_1, y*_1), …, (x*_n, y*_n), giving estimates β̂*_0, β̂*_1, s*²
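The three steps above can be hand-rolled in a short loop. This is a sketch on simulated data (the variables `x`, `y` and the coefficient values are our own stand-ins for log body and log brain weight, not the mammals data):

```r
# Hand-rolled case-based resampling on simulated data
# (x, y stand in for log body and log brain weight).
set.seed(2)
n <- 62; R <- 999
x <- rnorm(n)
y <- 2 + 0.75 * x + rnorm(n, sd = 0.7)
coefs <- matrix(NA, R, 2)
for (r in 1:R) {
  idx <- sample(n, replace = TRUE)           # step 1: resample case indices
  coefs[r, ] <- coef(lm(y[idx] ~ x[idx]))    # steps 2-3: refit to resampled pairs
}
apply(coefs, 2, sd)                          # bootstrap SEs of intercept and slope
```

The `boot()` call on the next code slide does exactly this loop, with the index vector `i` playing the role of `idx`.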
Model-Based Resample Algorithm
For r = 1, …, R:
1. For j = 1, …, n:
   a) Set x*_j = x_j
   b) Randomly sample ε*_j from the centred residuals e_1, …, e_n
   c) Set y*_j = β̂_0 + β̂_1·x*_j + ε*_j
2. Fit the least squares regression to (x*_1, y*_1), …, (x*_n, y*_n), giving estimates β̂*_0, β̂*_1, s*²
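The same loop, written for the model-based algorithm: the design stays fixed and only the residuals are resampled. Again a sketch on simulated data with our own variable names:

```r
# Hand-rolled model-based resampling on simulated data
# (the design x is held fixed; only residuals are resampled).
set.seed(3)
n <- 62; R <- 999
x <- rnorm(n)
y <- 2 + 0.75 * x + rnorm(n, sd = 0.7)
fit0 <- lm(y ~ x)                          # fit once to the original data
e <- resid(fit0) - mean(resid(fit0))       # centred residuals
coefs <- matrix(NA, R, 2)
for (r in 1:R) {
  ystar <- fitted(fit0) + sample(e, replace = TRUE)  # same design, new errors
  coefs[r, ] <- coef(lm(ystar ~ x))
}
apply(coefs, 2, sd)                        # bootstrap SEs of intercept and slope
```

Note that every bootstrap sample reuses the original x values, so the "information content" of the design is identical across replicates, unlike in case-based resampling.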
Case-Based Bootstrap
ORDINARY NONPARAMETRIC BOOTSTRAP
Bootstrap Statistics:
      original   bias   std. error
t1*
t2*
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Intervals:
Level   Normal        Percentile    BCa
95%     ( 1.966,  )   ( 1.963,  )   ( 1.974,  )
95%     (      ,  )   (      ,  )   (      ,  )
Calculations and Intervals on Original Scale
Case-Based Bootstrap
Bootstrap Distribution Plots for Intercept and Slope
Case-Based Bootstrap
Standardized Jackknife-after-Bootstrap Plots for Intercept and Slope
R Code for Case-Based Resampling

# Case-based resampling: refit the model to resampled (body, brain) pairs
fit.case <- function(data) coef(lm(log(data$brain) ~ log(data$body)))
mam.case <- function(data, i) fit.case(data[i, ])
mam.case.boot <- boot(mammals, mam.case, R = 999)
mam.case.boot
boot.ci(mam.case.boot, type = c("norm", "perc", "bca"))             # CIs for intercept
boot.ci(mam.case.boot, index = 2, type = c("norm", "perc", "bca"))  # CIs for slope
plot(mam.case.boot)                # bootstrap distribution of the intercept
plot(mam.case.boot, index = 2)     # bootstrap distribution of the slope
jack.after.boot(mam.case.boot)
jack.after.boot(mam.case.boot, index = 2)
Model-Based Bootstrap
ORDINARY NONPARAMETRIC BOOTSTRAP
Bootstrap Statistics:
      original   bias   std. error
t1*
t2*
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Intervals:
Level   Normal        Percentile    BCa
95%     ( 1.945,  )   ( 1.948,  )   ( 1.941,  )
95%     (      ,  )   (      ,  )   (      ,  )
Calculations and Intervals on Original Scale
Model-Based Bootstrap
Bootstrap Distribution Plots for Intercept and Slope
Model-Based Bootstrap
Standardized Jackknife-after-Bootstrap Plots for Intercept and Slope
R Code for Model-Based Resampling

# Model-based resampling: keep the design fixed and resample the residuals
fit.res <- lm(brain ~ body, data = mammal)
mam.res.data <- data.frame(mammal, res = resid(fit.res), fitted = fitted(fit.res))
mam.res <- function(data, i) {
  d <- data
  d$brain <- d$fitted + d$res[i]   # new responses: fitted values + resampled residuals
  coef(update(fit.res, data = d))
}
fit.res.boot <- boot(mam.res.data, mam.res, R = 999)
fit.res.boot
boot.ci(fit.res.boot, type = c("norm", "perc", "bca"))             # CIs for intercept
boot.ci(fit.res.boot, index = 2, type = c("norm", "perc", "bca"))  # CIs for slope
plot(fit.res.boot)                # bootstrap distribution of the intercept
plot(fit.res.boot, index = 2)     # bootstrap distribution of the slope
jack.after.boot(fit.res.boot)
jack.after.boot(fit.res.boot, index = 2)
Comparisons and Discussion
Comparing fields:
                   Original Model   Case-Based (Fixed)   Model-Based (Random)
Intercept (t1*)
  Standard Error
Slope (t2*)
  Standard Error
Case-Based vs. Model-Based
Model-based resampling enforces the assumption that errors are identically distributed by resampling the residuals from a common distribution.
If the model is not specified correctly, i.e., unmodeled nonlinearity, non-constant error variance, or outliers, these attributes do not carry over to the bootstrap samples.
The effect of outliers is clear in the case-based results, but not in the model-based results.
When Might Bootstrapping Fail?
Incomplete data: the bootstrap assumes that missing data are not problematic, e.g., if multiple imputation is used beforehand.
Dependent data: the ordinary bootstrap imposes mutual independence on the Y_j, so their joint dependence structure is not preserved.
Outliers and influential cases: remove or correct obvious outliers, and avoid letting the simulations depend heavily on particular observations.
Review & More Resampling
Resampling techniques are powerful tools for:
-- estimating SDs from small samples
-- statistics that do not have easily determined SDs
Bootstrapping involves:
-- taking 'new' random samples with replacement from the original data
-- calculating the bootstrap SD and statistical tests from the statistic's values across the bootstrap samples
More resampling techniques:
-- Jackknife resampling
-- Cross-validation
SUMMARY
Introduction to Bootstrap
Data and Modeling
Methods on Bootstrapping LM
Results and Comparisons
Issues and Discussion
References
Anderson, B. "Resampling and Regression." McMaster University.
Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application. Cambridge University Press.
Efron, B. and Gong, G. (February 1983), "A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation," The American Statistician.
Holmes, S. "Introduction to the Bootstrap." Stanford University.
Venables, W.N. and Ripley, B.D. (2002), Modern Applied Statistics with S, 4th ed. Springer.
Extra Stuff…
Jackknife resampling takes new samples of the data by omitting each case in turn and recalculating the statistic each time.
Resamples the data by leaving a single observation out at a time, so the number of jackknife samples used equals the number of cases in the original sample.
Works well for robust estimators of location, but not for the SD.
Cross-validation randomly splits the sample into two groups and compares the model results from one sample to the results from the other.
The first subset is used to estimate a statistical model (screening/training sample).
The findings are then tested on the second subset (confirmatory/test sample).
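The jackknife description above can be sketched in a few lines. This is a minimal illustration on simulated data (the variable names and the choice of the mean as the statistic are ours): one leave-one-out sample per case, so exactly n recomputations.

```r
# A minimal jackknife sketch (hypothetical data): one leave-one-out
# estimate per case, so the number of samples equals the sample size.
set.seed(4)
x <- rexp(30)
n <- length(x)
theta.jack <- sapply(1:n, function(i) mean(x[-i]))   # leave-one-out estimates
# standard jackknife SE formula: sqrt((n-1)/n * sum of squared deviations)
se.jack <- sqrt((n - 1) / n * sum((theta.jack - mean(theta.jack))^2))
se.jack
```

Unlike the bootstrap, the jackknife is deterministic once the data are fixed: there are exactly n resamples, with no random resampling step.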