Bootstrap – The Statistician’s Magic Wand

Saharon Rosset

An abstract view of statistics
- There is a “world” (an unknown distribution) $F$
- We observe some data from the world, say the heights (z) and weights (y) of 100 random people
- We want to learn about some property of the world $F$, e.g.:
  - Mean of height
  - Correlation between height and weight
  - Variance of the empirical correlation between height and weight

Standard statistical methodology
- Find a way to estimate the property of $F$ of interest directly from the data:
  - Mean height is estimated by the sample average
  - Correlation between height and weight is estimated by the empirical correlation
- How do we estimate the variance of the correlation?
  - There are some formulae under some assumptions, but it gets complicated
- Instead, we want to invent a general approach that will allow estimating every property of $F$ relatively easily (and, hopefully, well)

The general Bootstrap recipe
- We are interested in some property of the world, $\theta = t(F)$
- We create an “alternative world” $\hat{F}$ (usually using our data), and in it we estimate $\hat{\theta} = t(\hat{F})$
  - Usually this is simply done empirically, by drawing data from this world and applying $t$
- The main wisdom lies in how to build the Bootstrap world $\hat{F}$ so that it is “similar” to $F$ in the ways that matter to us
- Secondary problem: how to perform the estimation in the bootstrap world, which is usually straightforward
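The recipe above can be sketched as a small generic function. This is only an illustrative sketch: the names `bootstrap_estimate`, `draw_sample`, `statistic` and `summarize` are hypothetical, the height data are simulated stand-ins, and the bootstrap world used in the toy example is simply the empirical distribution.

```python
import numpy as np

def bootstrap_estimate(draw_sample, statistic, summarize, n_boot=2000):
    """Estimate a property of F by simulating in the bootstrap world.

    draw_sample: callable returning one dataset X* drawn from F-hat
    statistic:   callable s(.) applied to each X*
    summarize:   callable turning the n_boot values s(X*) into theta-hat
    """
    values = np.array([statistic(draw_sample()) for _ in range(n_boot)])
    return summarize(values)

# Toy usage with hypothetical data: standard error of the mean of 100 heights.
rng = np.random.default_rng(0)
heights = rng.normal(170.0, 10.0, size=100)  # stand-in for observed data

def draw_from_empirical():
    # F-hat = empirical distribution: resample the data with replacement
    return rng.choice(heights, size=heights.size, replace=True)

se_boot = bootstrap_estimate(draw_from_empirical, np.mean, np.std)
se_formula = heights.std(ddof=1) / np.sqrt(heights.size)  # textbook formula
print(se_boot, se_formula)
```

For the mean, a closed-form standard error exists, so the two printed numbers should be close; the point of the recipe is that the same function works when no such formula is available.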

Graphical representation
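[Figure: two parallel chains. Real world: Dist. $F$ → Data $X$ → Statistic $s(X)$, with $\theta = t(F)$. Bootstrap world: Dist. $\hat{F}$ → Data $X^*$ → Statistic $s(X^*)$, with $\hat{\theta} = t(\hat{F})$. A double arrow labeled “Determine $\hat{F}$” leads from the real world to the bootstrap world.]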

Is Bootstrap important in practice?

Example: variance of empirical correlation
- $F$ is the bivariate distribution of z = height and y = weight; we are given data $X$ with 100 pairs $(z, y)$
- The statistic of interest is $s(X) = \mathrm{cor}(z, y)$
- The property of $F$ we are interested in is $\theta = \mathrm{var}_F(s)$
- Bootstrap approach:
  - Build the bootstrap world $\hat{F}$
  - Repeatedly draw “bootstrap samples” $X^*$ from $\hat{F}$
  - Compute $s(X^*)$ for each sample
  - Use these values to empirically estimate $\hat{\theta} = \mathrm{var}_{\hat{F}}(s(X^*))$ in the bootstrap world
  - This is your estimate of $\theta$ in the real world

How to build $\hat{F}$?
- The “double arrow” in the diagram (determining $\hat{F}$) is the key to designing a bootstrap algorithm
- The most standard approach: use the empirical distribution of the data
  - Drawing $X^*$ means drawing 100 pairs $(z^*, y^*)$ with replacement from the original dataset
  - This is commonly referred to as “bootstrap sampling” or “nonparametric bootstrap”
- But this is not the only approach, and often not the best one!

Parametric Bootstrap example
- Assume we “know” that $F$ (the joint distribution of height and weight) is bivariate normal
- Then it makes sense to make $\hat{F}$ bivariate normal, with parameters estimated from the data $X$
- Then we can repeat exactly the same stages of drawing $X^*$ and estimating the variance empirically

Concrete example
- Let’s choose $F$ of height and weight to be bivariate normal: $(Z, Y)^\top \sim N(\mu, \Sigma)$ [parameter values missing]
- We start by drawing $10^5$ random samples of size 100 and observing the distribution of $s(X) = \mathrm{cor}(z, y)$; in particular we get $\theta = \mathrm{var}_F(s) =$ [value missing]
- Now we want to try different Bootstrap approaches for estimating $\theta$
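The “ground truth” step can be sketched as a plain Monte Carlo simulation. The mean vector and covariance matrix below are hypothetical placeholders (the values used on the slide are not available), and the function name `true_var_of_corr` is made up for this sketch.

```python
import numpy as np

def true_var_of_corr(mean, cov, n=100, n_rep=100_000, chunk=10_000, rng=None):
    """Monte Carlo estimate of theta = var_F(s), where s is the sample
    correlation of n pairs drawn from a bivariate normal F."""
    rng = np.random.default_rng() if rng is None else rng
    corrs = []
    for start in range(0, n_rep, chunk):
        m = min(chunk, n_rep - start)
        x = rng.multivariate_normal(mean, cov, size=(m, n))  # shape (m, n, 2)
        xc = x - x.mean(axis=1, keepdims=True)               # center each sample
        sxy = (xc[..., 0] * xc[..., 1]).sum(axis=1)
        sxx = (xc[..., 0] ** 2).sum(axis=1)
        syy = (xc[..., 1] ** 2).sum(axis=1)
        corrs.append(sxy / np.sqrt(sxx * syy))               # one correlation per sample
    return np.concatenate(corrs).var(ddof=1)

# Hypothetical world F: heights (cm) and weights (kg) with correlation 0.6.
mean = [170.0, 70.0]
cov = [[100.0, 60.0], [60.0, 100.0]]
theta_true = true_var_of_corr(mean, cov, rng=np.random.default_rng(2))
print(theta_true)
```

As a sanity check under these assumed parameters, the result should be near the large-sample approximation $(1-\rho^2)^2/n$ for the variance of a sample correlation under bivariate normality.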

Approach 1: standard non-parametric Bootstrap
- Define $\hat{F}$ to be the empirical distribution of $X$, then:
  - Sample many $X^*$ (Bootstrap samples)
  - Calculate $s^*(X^*) = \mathrm{cor}(z^*, y^*)$ for each $X^*$
  - Estimate the variance of $s$ by the empirical variance of $s^*$
- In simulation we can repeat this whole exercise many times to get a distribution of Bootstrap estimates
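A minimal sketch of Approach 1, assuming the data are stored as an array with one (height, weight) pair per row. The data here are simulated with hypothetical parameters, and the function names are illustrative only.

```python
import numpy as np

def corr(sample):
    # sample has shape (n, 2): column 0 is z (height), column 1 is y (weight)
    return np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]

def nonparametric_bootstrap_var(data, n_boot=2000, rng=None):
    """Bootstrap estimate of var(s) with F-hat = empirical distribution of data."""
    rng = np.random.default_rng() if rng is None else rng
    n = data.shape[0]
    s_star = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # n row indices, drawn with replacement
        s_star[b] = corr(data[idx])       # each (z, y) pair is kept together
    return s_star.var(ddof=1)

# Hypothetical observed data: 100 pairs from a bivariate normal, correlation 0.6.
rng = np.random.default_rng(1)
data = rng.multivariate_normal([170.0, 70.0], [[100.0, 60.0], [60.0, 100.0]], size=100)
theta_hat_np = nonparametric_bootstrap_var(data, rng=rng)
print(theta_hat_np)
```

Resampling row indices rather than each column separately matters: drawing z and y independently would destroy the very dependence whose variability is being estimated.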

Approach 2: parametric Bootstrap using normal distribution
- Use $X$ to estimate the mean and covariance of $F$, assuming it is normal, and define $\hat{F}$ to be this normal distribution
- The rest proceeds as before:
  - Sample many $X^*$ from this bivariate normal distribution (parametric Bootstrap samples)
  - Calculate $s^*(X^*) = \mathrm{cor}(z^*, y^*)$ for each $X^*$
  - Estimate the variance of $s$ by the empirical variance of $s^*$
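A matching sketch of Approach 2, under the same assumptions about data layout. The only change from the nonparametric version is where $X^*$ comes from: a bivariate normal fitted to the data instead of the data themselves. The data and names are again hypothetical.

```python
import numpy as np

def parametric_bootstrap_var(data, n_boot=2000, rng=None):
    """Bootstrap estimate of var(s) with F-hat = bivariate normal fitted to data."""
    rng = np.random.default_rng() if rng is None else rng
    n = data.shape[0]
    mu_hat = data.mean(axis=0)              # estimated mean vector
    sigma_hat = np.cov(data, rowvar=False)  # estimated 2x2 covariance matrix
    s_star = np.empty(n_boot)
    for b in range(n_boot):
        x_star = rng.multivariate_normal(mu_hat, sigma_hat, size=n)
        s_star[b] = np.corrcoef(x_star[:, 0], x_star[:, 1])[0, 1]
    return s_star.var(ddof=1)

# Hypothetical observed data: 100 pairs from a bivariate normal, correlation 0.6.
rng = np.random.default_rng(3)
data = rng.multivariate_normal([170.0, 70.0], [[100.0, 60.0], [60.0, 100.0]], size=100)
theta_hat_par = parametric_bootstrap_var(data, rng=rng)
print(theta_hat_par)
```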

Which one will be better here?

Does Bootstrap always work?
- Of course not!
- From what we already know, it’s clear that if we fail to build $\hat{F}$ so that $\hat{\theta} = t(\hat{F})$ is “similar” to $\theta = t(F)$, then our approach is useless
  - This can be a result of wrong assumptions on $F$ used in building $\hat{F}$
  - One can easily devise examples where no Bootstrap approach will give reasonable results
- Still, the usefulness of a properly implemented Bootstrap is very general and applies to almost any “reasonable” problem we encounter

Hypothesis testing with Bootstrap
- Recall the components of a hypothesis testing problem:
  - Null hypothesis: $H_0: \theta = \theta_0$
  - Test statistic: $z = s(X)$, the value observed on our data
- Performing a test entails calculating quantities like $\text{p-value} = P_{H_0}(s(X) > z)$ and rejecting if it is small
- The p-value for a given $z$ is also a property of $F$, but how can we use the Bootstrap to estimate it?
  - If $H_0$ uniquely defines the distribution, then it’s trivial: a standard simulation exercise
  - But if $H_0$ contains many possible $F$’s, we can implement the bootstrap paradigm: choose $\hat{F}$ as a member of $H_0$ that is “consistent with our data”, and calculate the p-value under this distribution
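One simple way to pick a member of $H_0$ that is consistent with the data, sketched here for a test of the mean: shift the observed sample so that its mean equals $\theta_0$ and resample from the shifted values. This shifting construction is a common choice but not necessarily the one intended on the slide; the data and the function name are hypothetical.

```python
import numpy as np

def bootstrap_pvalue_mean(x, theta0, n_boot=5000, rng=None):
    """Bootstrap p-value for H0: mean = theta0 against larger alternatives."""
    rng = np.random.default_rng() if rng is None else rng
    z = x.mean()                    # observed test statistic s(X)
    x_null = x - x.mean() + theta0  # F-hat: empirical distribution shifted so H0 holds
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    s_star = x_null[idx].mean(axis=1)  # s(X*) for each bootstrap sample
    return np.mean(s_star > z)         # estimate of P_{H0}(s(X) > z)

rng = np.random.default_rng(4)
x = rng.normal(172.0, 10.0, size=100)  # hypothetical heights
p_far = bootstrap_pvalue_mean(x, theta0=165.0, rng=rng)     # H0 far from the data
p_null = bootstrap_pvalue_mean(x, theta0=x.mean(), rng=rng) # H0 at the sample mean
print(p_far, p_null)
```

When $\theta_0$ is far below the sample mean the p-value is essentially zero, and when $\theta_0$ equals the sample mean it is close to one half, which is the behavior one would expect from a sensible null world.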

Inference on phylogenetic trees
- Dataset of malaria genetic sequences from different organisms (11 species, sequences of length 221) [figure: sequence data]
- Result of applying a standard phylogenetic tree learning approach [figure: resulting tree]
- Our inference goal: assess confidence in the 9-10 clade (subtree). Is it strongly supported by the data?

Felsenstein’s Bootstrap of Phylogenetic trees
- Given this phylogenetic tree built on this dataset, Felsenstein wanted to get an answer to questions like: how certain am I that subtree $T_0$ (say, 9-10) is “real” (i.e. exists in the world and not just in my data)?
- He suggested using the Bootstrap as follows:
  - Draw bootstrap samples of markers
  - Build a tree on each sample (all species, sampled markers)
  - Use the % of time we get the subtree $T_0$ as “confidence” in this subtree
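The resampling scheme above can be sketched as follows, assuming the alignment is an array with one row per species and one column per marker. Real tree building is replaced here by a crude, hypothetical stand-in (`toy_pair_is_clade`, which checks whether two species are mutual nearest neighbours in Hamming distance); it is not a phylogenetic method and only serves to make the sketch runnable. The alignment itself is simulated.

```python
import numpy as np

def clade_support(alignment, has_clade, n_boot=200, rng=None):
    """Fraction of bootstrap datasets (markers resampled with replacement)
    on which has_clade(...) reports the subtree T0."""
    rng = np.random.default_rng() if rng is None else rng
    n_sites = alignment.shape[1]
    hits = 0
    for _ in range(n_boot):
        cols = rng.integers(0, n_sites, size=n_sites)  # resample markers (columns)
        hits += bool(has_clade(alignment[:, cols]))    # all species, sampled markers
    return hits / n_boot

def toy_pair_is_clade(alignment, a=8, b=9):
    # Stand-in for "build a tree and look for T0": are rows a and b
    # each other's closest row in Hamming distance?
    d = (alignment[:, None, :] != alignment[None, :, :]).sum(axis=2).astype(float)
    np.fill_diagonal(d, np.inf)
    return d[a].argmin() == b and d[b].argmin() == a

# Simulated alignment: 11 species, 221 sites, nucleotides coded 0..3.
rng = np.random.default_rng(5)
alignment = rng.integers(0, 4, size=(11, 221))
alignment[9] = alignment[8]                              # species 10 starts as a copy of species 9
mutated = rng.choice(221, size=20, replace=False)
alignment[9, mutated] = (alignment[9, mutated] + 1) % 4  # then differs at 20 sites
support = clade_support(alignment, toy_pair_is_clade, rng=rng)
print(support)
```

In a real analysis `has_clade` would wrap an actual tree-building procedure and a check for the clade in its output.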

Is this Bootstrap legit?
- We want to know whether the subtree exists in $F$, so we estimate this by the % of time it exists in data drawn from $\hat{F}$
  - This is not exactly a Bootstrap recipe (details are not critical)
- But assuming it is a Bootstrap approach, is it a good one?
  - Not at all, because $\hat{F}$ was built based on the sample whose “best tree” contains the subtree
  - This basically means that $\hat{F}$ contains the subtree, so we know we are getting over-optimistic results
- A more correct formulation of this question is as a hypothesis test of $H_0$: the tree does not contain $T_0$
  - If we reject $H_0$ we can conclude that $T_0$ is reliable

Efron’s solution(s)
- In a beautiful paper, Efron et al. (1996, PNAS) reanalyze this problem and show:
  - That under some (quite convoluted) assumptions, Felsenstein’s approach can be considered a legitimate Bootstrap
  - That without any convoluted arguments (but with some complicated math and geometry), an appropriate Bootstrap can be devised for the hypothesis testing view of the problem

Efron’s hypothesis testing view
- First task: build a Bootstrap world $\hat{F}$ where:
  - $H_0$ holds
  - $\hat{F}$ is as similar as possible to the empirical distribution of our data
- Then we can test $H_0$ by examining what percentage of the time $T_0$ gets selected in this world
  - If it is smaller than 5%, we reject $H_0$ at level 0.05 and conclude that $T_0$ is well supported
- The challenge is the first task, and this is what Efron concentrates on
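The decision rule in the second step can be written compactly. The notation is an assumption made for illustration: $X^{*}_{1}, \dots, X^{*}_{B}$ are $B$ datasets drawn from a Bootstrap world $\hat{F}$ in which $H_0$ holds, and $\mathcal{T}(\cdot)$ denotes the tree-building procedure.

```latex
\hat{p} \;=\; \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left\{ T_0 \in \mathcal{T}\!\left(X^{*}_{b}\right) \right\},
\qquad
\text{reject } H_0 \text{ at level } 0.05 \text{ if } \hat{p} < 0.05 .
```

Here $\hat{p}$ is just the fraction of Bootstrap-world trees that contain $T_0$, used as the p-value.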

A peek into Efron’s approach

Comparing Bootstrap results of Felsenstein and Efron
- We recall that Felsenstein’s method gave 96.5% “confidence” for the 9-10 clade
- Efron is rewarded for his hard work with a result that 93.8% of trees in his Bootstrap world do not contain the 9-10 clade, i.e. 6.2% do
  - His Bootstrap p-value for $H_0$ is therefore 0.062
- The results are only slightly different, but if we treat 95% confidence / 5% p-value as the holy grail then we conclude:
  - According to Felsenstein we are “confident” in this clade
  - According to Efron we cannot reject that this clade is a coincidence

Summary
- Bootstrap is an extremely general and flexible paradigm for statistical inference
- It allows us to handle complex situations with minimal assumptions and without complicated math
  - Doing theory (and also devising solutions for some problems) can get very complicated, though
- It has been widely influential in science and industry
- However, despite its conceptual simplicity, it is often misunderstood and misapplied (well beyond Felsenstein)

Thanks!
