Notes on Bootstrapping Jeff Witmer 10 February 2016.


2 We want to estimate a parameter (e.g., a population mean µ). We have a sample of data and a statistic (e.g., the sample mean). New sample → new statistic. Many samples → many statistics → sampling distribution. But we can only imagine taking repeated samples of size n and observing how the sample mean behaves – e.g., finding out how far it typically is from µ.

3 Traditional method: Use theory to study how the statistic should behave. Bootstrap method: Use a computer to simulate how the statistic behaves. Brad Efron, 1980s – described as “infinitely intelligent, but he doesn’t know any distribution theory.”

4 The Bootstrap We wonder what would happen (how would a statistic behave) in repeated samples from the population. Basic idea: Simulate the sampling distribution of any statistic (like a sample mean or proportion) by repeatedly sampling from the data. A “pseudo-sample” is to the sample as the sample is to the population. We hope.

5 The Bootstrap Basic idea: Simulate the sampling distribution of any statistic (like a sample mean) by repeatedly sampling from the data. Get a “pseudo-sample”/bootstrap sample and fit a model in order to calculate a statistic (i.e., an estimate of a parameter) from the sample. Do this MANY times, fit the model for each bootstrap sample and collect the estimates.
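The loop on this slide – draw a bootstrap sample, compute the statistic, collect the estimate, repeat – can be sketched in a few lines. (A Python sketch purely for illustration, not the StatKey/R mechanism; the function name and toy data are made up.)

```python
import random

def bootstrap(data, stat, B=1000, seed=0):
    """Return B bootstrap replicates of stat(sample)."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(B):
        # A bootstrap sample: n draws WITH replacement from the data.
        boot_sample = [rng.choice(data) for _ in range(n)]
        reps.append(stat(boot_sample))
    return reps

data = [3, 5, 7, 9, 11]
boot_means = bootstrap(data, lambda s: sum(s) / len(s))
```

The collection `boot_means` plays the role of the bootstrap distribution described on the next slides.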

6 Why “bootstrap”? “Pull yourself up by your bootstraps”: lift yourself in the air simply by pulling up on the laces of your boots – a metaphor for accomplishing an “impossible” task without any outside help. Perhaps an unfortunate name…

7 Sampling Distribution [Diagram: a population, with mean µ, pictured as a tree scattering many seeds – the samples.] BUT, in practice we don’t see the “tree” or all of the “seeds” – we only have ONE seed.

8 Bootstrap Distribution What can we do with just one seed? Grow a new tree! – treat the sample as a bootstrap “population.” Hat tip: Robin Lock.


10 The Bootstrap We hope that the bootstrap statistic is to the bootstrap “population” (our sample) as the sample statistic is to the population. If so, then by studying the distribution of bootstrap statistics we can understand how our sample statistic arose. E.g., we can get a good idea of what the SE is, and thus create a good confidence interval.

11 The Bootstrap Process
Original Sample → Sample Statistic
Original Sample → Bootstrap Sample → Bootstrap Statistic (repeated many times)
The collection of bootstrap statistics forms the Bootstrap Distribution.

12 A small bootstrap example Consider these 7 data points on the variable Y = pulse: 48, 50, 60, 66, 68, 72, 90 The (sample) mean is 64.9 (and the SD is 14.3). Let’s take a bootstrap sample, by writing the 7 values on 7 cards, shuffling, and making 7 draws with replacement. Then calculate the mean of those 7 values. (We can call this a bootstrap mean.) Repeat this a few times. This is tedious. Technology helps…StatKey…
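The card-drawing procedure above is easy to mimic in code. (A Python sketch rather than cards or StatKey; the variable names are ours.)

```python
import random

rng = random.Random(1)
pulse = [48, 50, 60, 66, 68, 72, 90]

# One bootstrap sample: 7 draws WITH replacement (the "card shuffle").
boot = [rng.choice(pulse) for _ in range(len(pulse))]
boot_mean = sum(boot) / len(boot)

# Repeat a few times, as on the slide, collecting a bootstrap mean each time.
boot_means = []
for _ in range(5):
    s = [rng.choice(pulse) for _ in range(len(pulse))]
    boot_means.append(sum(s) / len(s))
```

Each run typically contains repeats of some pulse values and omits others – that is what sampling with replacement does.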

13 Bootstrapping with StatKey We can go to StatKey, choose Bootstrap Confidence Interval and choose CI for Single Mean, Median, Std. Dev., choose any of the built-in data sets and click the Edit Data button, and then type in our data (48, 50, 60, 66, 68, 72, 90). Then the Generate 1 Sample button will create a bootstrap sample. And we can get 1000 bootstrap samples easily.

14 Bootstrapping with StatKey From 5000 bootstrap samples we see that the SE (the “st. dev.” of the bootstrap distribution) is almost exactly 5. We can make a 95% CI for µ via Estimate ± 2*SE: 64.9 ± 2*(5) = 64.9 ± 10 = (54.9, 74.9). Good news: We can bootstrap a difference in means or a regression coefficient or… Bad news: Bootstrapping when n is small does not overcome the problem that a small sample might not mimic the shape and spread of the population.
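The SE-of-about-5 claim is easy to check by brute force. (A Python sketch, not StatKey itself; the seed and exact decimals are illustrative.)

```python
import random
import statistics

rng = random.Random(2)
pulse = [48, 50, 60, 66, 68, 72, 90]
n = len(pulse)

# 5000 bootstrap means of the pulse data.
boot_means = [sum(rng.choices(pulse, k=n)) / n for _ in range(5000)]

se = statistics.stdev(boot_means)      # the bootstrap SE ("st. dev.")
xbar = sum(pulse) / n                  # 64.9
ci = (xbar - 2 * se, xbar + 2 * se)    # estimate ± 2*SE, about (54.9, 74.9)
```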

15 Bootstrap Percentile CIs If the bootstrap distribution is not symmetric then “estimate ± 2*SE” won’t be right. And what if we want something other than 95% confidence? We can take the middle 95% of the bootstrap distribution as our interval. Or the middle 90% for a 90% CI, etc.

16 Recall the pulse data: n=7, data are (48, 50, 60, 66, 68, 72, 90). We made a 95% CI for µ via Estimate ± 2*SE: 64.9 ± 2*(5) = 64.9 ± 10 = (54.9, 74.9). Click the “Two-Tail” button. The middle 95% of the bootstrap distribution is (55.4, 75.1).
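The percentile interval can be reproduced approximately with a quick simulation. (A Python sketch; the endpoints vary a little with the random seed, so they will not exactly match StatKey's (55.4, 75.1).)

```python
import random

rng = random.Random(3)
pulse = [48, 50, 60, 66, 68, 72, 90]
n = len(pulse)
B = 5000

boot_means = sorted(sum(rng.choices(pulse, k=n)) / n for _ in range(B))

# "Two-Tail": cut 125 values (2.5% of 5000) off each end to keep the middle 95%.
lo = boot_means[125]
hi = boot_means[4874]
```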

17 Bootstrapping other statistics: Correlation We can bootstrap a correlation. Consider the Commute Atlanta (Time as a function of Distance) data. The sample correlation is r=0.81. Generate 5000 bootstrap samples. Each time, get the correlation b/w Distance and Time. Click the “Two-Tail” button. The middle 95% of the bootstrap distribution of correlations is (0.72, 0.87). Note that this is not symmetric around 0.81!
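We don't have the Commute Atlanta file here, but resampling (distance, time) pairs works the same way on made-up commute-like data. (A Python sketch; the data, the resulting r, and B = 2000 are all illustrative, not the slide's values.)

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation, computed from first principles."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sxy / (sx * sy)

rng = random.Random(4)
# Made-up stand-in: Time roughly proportional to Distance, plus noise.
dist = [rng.uniform(2, 40) for _ in range(200)]
time = [5 + 1.2 * d + rng.gauss(0, 8) for d in dist]
pairs = list(zip(dist, time))

B = 2000
boot_r = sorted(
    corr(*zip(*[rng.choice(pairs) for _ in range(len(pairs))]))
    for _ in range(B)
)
# Middle 95%: cut 50 values (2.5% of 2000) off each tail.
lo, hi = boot_r[50], boot_r[1949]
```

Note that each resample keeps the (x, y) pairs intact; we never resample distances and times separately.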

18 Atlanta commute correlation bootstrapped

19 Bootstrapping other statistics We can use the bootstrap for almost anything – but not everything works... Consider the Ten Counties (Area) data, which are skewed and have a small n of 10:

20 If we try to bootstrap the median, we don’t get a reasonably smooth and symmetric graph. Instead, we get clumpiness:
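The clumpiness is easy to see in a simulation. The Ten Counties areas aren't reproduced here, so this Python sketch uses made-up skewed values with n = 10. With n = 10, every bootstrap median is the average of the 5th and 6th ordered draws, so it can only land on averages of pairs of the ten observed values – at most 55 distinct numbers, no matter how many resamples we take.

```python
import random
import statistics

rng = random.Random(5)
# Made-up right-skewed "areas," n = 10 (a stand-in for the Ten Counties data).
areas = [120, 150, 180, 210, 260, 320, 410, 560, 910, 2400]

boot_medians = [statistics.median(rng.choices(areas, k=10)) for _ in range(5000)]

# 5000 resamples, but only a handful of distinct medians -> a clumpy histogram.
distinct = sorted(set(boot_medians))
```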

21 Other Bootstrap CIs There are lots of ways to make a bootstrap CI. These include:
(a) Percentile method
(b) Normal method / t with bootstrap SE
(c) “Basic” bootstrap (aka “reverse percentiles,” which is like inverting a randomization test)
(d) T method / bootstrap t interval
(e) BCa – bias corrected, accelerated

22 An unfortunate name? N.B. We use the bootstrap to get CIs. Taking thousands of bootstrap samples does not increase the original sample size, nor, e.g., push the sample mean closer to µ. The center of the bootstrap distribution is not the center of the sampling distribution. We don’t really “get something for nothing.” But we do learn about uncertainty in the sample statistic: bootstrapping informs us about the standard error, it can alert us to bias, etc.

23 Bootstrap and Regression/Correlation Suppose we have a set of (x,y) pairs in a regression setting. If we don’t trust a t-test or t-based CI, how might we bootstrap? (1) Randomly resample (x,y) pairs and fit a regression model to the new “boot-sample.” (2) Randomly resample the residuals from the fitted model and attach them to the fitted values ŷ at the (fixed) x’s; then fit a regression model to this new “boot-sample.” If we think of the x’s as fixed (e.g., a designed experiment), then (2) looks good. If not, then (1).
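The two schemes can be sketched side by side. (A Python sketch on made-up (x, y) data; neither the data nor the helper names come from the slides.)

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for simple regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(6)
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 4.3, 5.9, 8.2, 9.8, 12.4, 13.9, 16.1, 18.2, 20.3]

a, b = fit_line(xs, ys)
fitted = [a + b * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

# Method (1): resample (x, y) pairs, refit each time.
pairs = list(zip(xs, ys))
case_slopes = []
for _ in range(1000):
    s = [rng.choice(pairs) for _ in range(len(pairs))]
    bx, by = zip(*s)
    case_slopes.append(fit_line(bx, by)[1])

# Method (2): keep the x's fixed; resample residuals, attach to fitted values.
resid_slopes = []
for _ in range(1000):
    e = [rng.choice(resid) for _ in range(len(resid))]
    y_star = [f + ei for f, ei in zip(fitted, e)]
    resid_slopes.append(fit_line(xs, y_star)[1])
```

Either way, the spread of the 1000 slopes estimates the SE of the fitted slope.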

24 How does one actually do this? Implementing the bootstrap If it is 1991, then maybe write your own code. Today, use software such as StatKey for a small problem in a standard setting (1 mean, 2 means, simple regression, etc.) In general, use R and either (a) write a function for the statistic you want to bootstrap and use the boot() command or (b) use the do()* command in the mosaic package.

25 Example A consulting client recently needed a CI for a ratio estimator, so I wrote a short script in R. Here is part of it:

ratioBoot <- function(mydata, indices) {
  d <- mydata[indices, ]                  # allows boot to select a sample
  ratio <- sum(d$Y) / sum(d$X)
  return(ratio)
}
mydata <- ratio.data                      # this is the client's data file
library(boot)                             # make the boot package active
bootratios5000 <- boot(mydata, ratioBoot, R = 5000)
boot.ci(bootratios5000, type = "norm")    # gives 0.1236 to 0.2486
boot.ci(bootratios5000, type = "perc")    # gives 0.1221 to 0.2479
hist(bootratios5000$t)                    # a histogram of the 5000 values
mean(bootratios5000$t)                    # gives 0.1847
sd(bootratios5000$t)                      # gives 0.0316
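For readers without R, the same calculation can be mimicked in Python on made-up (X, Y) rows. (The client's ratio.data file isn't available, so the numbers will not match the R output above; everything below is illustrative.)

```python
import random
import statistics

rng = random.Random(7)
# Made-up (X, Y) rows standing in for the client's data file.
rows = []
for _ in range(50):
    x = rng.uniform(10, 100)
    rows.append((x, 0.18 * x + rng.gauss(0, 2)))

def ratio(rows):
    """The ratio estimator: sum of Y over sum of X."""
    return sum(y for _, y in rows) / sum(x for x, _ in rows)

B = 5000
boot = sorted(
    ratio([rng.choice(rows) for _ in range(len(rows))]) for _ in range(B)
)
se = statistics.stdev(boot)       # like sd(bootratios5000$t)
lo, hi = boot[125], boot[4874]    # percentile interval, like type = "perc"
```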

26 Using the do()* command The mosaic package in R has a command that will do something (anything) many times.

library(mosaic)
pulse <- c(48, 50, 60, 66, 68, 72, 90)
mean(pulse)
mean(resample(pulse))
do(5) * resample(pulse)
do(5) * mean(resample(pulse))
mymeans <- do(5) * mean(resample(pulse))
mymeans
mymeans <- do(10000) * mean(resample(pulse))
hist(mymeans$mean, breaks = 100)
mean(mymeans$mean)
sd(mymeans$mean)
quantile(mymeans$mean, probs = c(0.025, 0.975))  # gives 55.7 to 75.1

27 There is more… There are yet other variations on the theme. E.g., the “wild bootstrap,” in which you keep each residual with its observation, but randomly multiply the residual by -1 with probability 1/2. [Bootstrap tilting: reweight the data to satisfy a null hypothesis, sample from the weighted distribution, then invert a hypothesis test to get a CI.] Note: I have been talking about the nonparametric bootstrap. There is also the smoothed bootstrap (take samples from a smooth, kernel-density-estimated population) and the parametric bootstrap (take samples from a parametric distribution after estimating the parameters) [the “grow a new tree” analogy].
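The sign-flipping step of the wild bootstrap is only a line or two of code. (A Python sketch on made-up regression data, using the simplest ±1 "wild" weights; all names and data are illustrative.)

```python
import random

rng = random.Random(8)
xs = list(range(1, 21))
ys = [3 + 0.5 * x + rng.gauss(0, 1) for x in xs]
n = len(xs)

# Ordinary least-squares fit.
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx
fitted = [a + b * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

# Wild bootstrap: each residual stays with its OWN observation,
# but its sign is flipped with probability 1/2.
wild_slopes = []
for _ in range(1000):
    y_star = [f + r * rng.choice([-1, 1]) for f, r in zip(fitted, resid)]
    my_s = sum(y_star) / n
    b_s = sum((x - mx) * (y - my_s) for x, y in zip(xs, y_star)) / sxx
    wild_slopes.append(b_s)
```

Because the residuals are never moved to other x's, this scheme tolerates non-constant residual spread (heteroskedasticity) better than the ordinary residual bootstrap.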

28 References Tim Hesterberg’s 2015 paper “What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum” in The American Statistician. See http://amstat.tandfonline.com/doi/full/10.1080/00031305.2015.1089789. This is a summary of a longer (2014) paper with the same title, available at http://arxiv.org/abs/1411.5279. Efron and Tibshirani (1993) An Introduction to the Bootstrap, Springer.

