
Divide-and-Conquer and Statistical Inference for Big Data Michael I. Jordan University of California, Berkeley September 8, 2012.




1 Divide-and-Conquer and Statistical Inference for Big Data Michael I. Jordan University of California, Berkeley September 8, 2012

2 Statistics Meets Computer Science I Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)

3 Statistics Meets Computer Science II Bring algorithmic principles more fully into contact with statistical inference The principle in today's talk: divide-and-conquer

4 Inference and Massive Data Presumably any methodology that allows us to scale to terabytes and beyond will need to subsample data points

5 Inference and Massive Data Presumably any methodology that allows us to scale to terabytes and beyond will need to subsample data points That's fine for point estimates (although challenging to do well), but for estimates of uncertainty it seems quite problematic

6 Inference and Massive Data Presumably any methodology that allows us to scale to terabytes and beyond will need to subsample data points That's fine for point estimates (although challenging to do well), but for estimates of uncertainty it seems quite problematic With subsampled data sets, estimates of uncertainty will be inflated, and in general we don't know how to rescale

7 Inference and Massive Data Presumably any methodology that allows us to scale to terabytes and beyond will need to subsample data points That's fine for point estimates (although challenging to do well), but for estimates of uncertainty it seems quite problematic With subsampled data sets, estimates of uncertainty will be inflated, and in general we don't know how to rescale Let's think this through in the setting of the bootstrap

8 Part I: The Big Data Bootstrap with Ariel Kleiner, Purnamrita Sarkar and Ameet Talwalkar University of California, Berkeley

9 Assessing the Quality of Inference Machine learning and computer science are full of algorithms for clustering, classification, regression, etc. –what's missing: a focus on the uncertainty in the outputs of such algorithms (error bars) An application that has driven our work: develop a database that returns answers with error bars to all queries The bootstrap is a generic framework for computing error bars (and other assessments of quality) Can it be used on large-scale problems?

10 The Bootstrap and Scale Why the bootstrap? Naively, it would seem to be a perfect match to modern distributed computing environments –the bootstrap involves resampling with replacement –each bootstrap resample can be processed independently in parallel But the expected number of distinct points in a bootstrap resample is ~ 0.632n If the original dataset has size 1 TB, then each resample will have size ~ 632 GB –can't do the bootstrap on massive data!
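The 0.632 figure follows from a short calculation: a given point is missed by one draw with probability $1 - 1/n$, hence by all $n$ draws with probability $(1 - 1/n)^n \to e^{-1}$, so

$$
\mathbb{E}[\#\ \text{distinct points}] \;=\; n\Bigl(1 - \bigl(1 - \tfrac{1}{n}\bigr)^{n}\Bigr) \;\approx\; (1 - e^{-1})\,n \;\approx\; 0.632\,n .
$$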

11 Assessing the Quality of Inference Observe data $X_1, \ldots, X_n$

12 Assessing the Quality of Inference Observe data $X_1, \ldots, X_n$ Form a parameter estimate $\hat{\theta}_n = \theta(X_1, \ldots, X_n)$

13 Assessing the Quality of Inference Observe data $X_1, \ldots, X_n$ Form a parameter estimate $\hat{\theta}_n = \theta(X_1, \ldots, X_n)$ Want to compute an assessment $\xi$ of the quality of our estimate $\hat{\theta}_n$ (e.g., a confidence region)

14 The Unachievable Frequentist Ideal Ideally, we would Observe many independent datasets of size n. Compute $\hat{\theta}_n$ on each. Compute $\xi$ based on these multiple realizations of $\hat{\theta}_n$.

15 The Unachievable Frequentist Ideal Ideally, we would Observe many independent datasets of size n. Compute $\hat{\theta}_n$ on each. Compute $\xi$ based on these multiple realizations of $\hat{\theta}_n$. But, we only observe one dataset of size n.

16 The Underlying Population

17 The Unachievable Frequentist Ideal Ideally, we would Observe many independent datasets of size n. Compute $\hat{\theta}_n$ on each. Compute $\xi$ based on these multiple realizations of $\hat{\theta}_n$. But, we only observe one dataset of size n.

18 Sampling

19 Approximation

20 Pretend The Sample Is The Population

21 The Bootstrap Use the observed data to simulate multiple datasets of size n: Repeatedly resample n points with replacement from the original dataset of size n. Compute $\hat{\theta}_n^*$ on each resample. Compute $\xi$ based on these multiple realizations of $\hat{\theta}_n^*$ as our estimate of $\xi$ for $\hat{\theta}_n$. (Efron, 1979)
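To make the procedure concrete, here is a minimal NumPy sketch of the bootstrap (not code from the talk); the use of the standard deviation of the resample estimates as the quality assessment $\xi$, and the sample-mean estimator in the usage line, are illustrative assumptions.

```python
import numpy as np

def bootstrap(data, estimator, n_resamples=200, seed=None):
    """Plain bootstrap: resample n points with replacement, re-estimate on each
    resample, and summarize the fluctuation of the estimates (here: their
    standard deviation, standing in for the quality assessment xi)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = [estimator(data[rng.integers(0, n, size=n)])  # n draws with replacement
                 for _ in range(n_resamples)]
    return np.std(estimates, axis=0, ddof=1)

# Illustrative usage: standard error of the sample mean
data = np.random.default_rng(0).standard_t(3, size=10_000)
print(bootstrap(data, np.mean, seed=1))
```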

22 The Bootstrap: Computational Issues Seemingly a wonderful match to modern parallel and distributed computing platforms But the expected number of distinct points in a bootstrap resample is ~ 0.632n –e.g., if the original dataset has size 1 TB, then expect each resample to have size ~ 632 GB Can't feasibly send resampled datasets of this size to distributed servers Even if one could, can't compute the estimate locally on datasets this large

23 Subsampling (Politis, Romano & Wolf, 1999)

24 Subsampling

25 There are many subsets of size b < n Choose some sample of them and apply the estimator to each This yields fluctuations of the estimate, and thus error bars But a key issue arises: the fact that b < n means that the error bars will be on the wrong scale (they'll be too large) Need to analytically correct the error bars
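A sketch of the standard correction, under the assumption that the estimator converges at rate $\sqrt{n}$ (the general case replaces $\sqrt{n}$ by the estimator's rate $\tau_n$): fluctuations measured on subsamples of size $b$ are deflated before being reported on the $n$-scale,

$$
\hat{\xi}_n \;\approx\; \sqrt{\tfrac{b}{n}}\;\hat{\xi}_b ,
$$

where $\hat{\xi}_b$ is, e.g., a standard error or confidence-interval width computed across the size-$b$ subsamples.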

26 Subsampling Summary of algorithm: Repeatedly subsample b < n points without replacement from the original dataset of size n Compute $\hat{\theta}_b$ on each subsample Compute $\xi$ based on these multiple realizations of $\hat{\theta}_b$ Analytically correct to produce a final estimate of $\xi$ for $\hat{\theta}_n$ The need for analytical correction makes subsampling less automatic than the bootstrap Still, a much more favorable computational profile than the bootstrap Let's try it out in practice…
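A minimal NumPy sketch of this summary, assuming a $\sqrt{n}$-rate estimator so that the $\sqrt{b/n}$ correction above applies; using a standard deviation as $\xi$ is again an illustrative choice.

```python
import numpy as np

def subsampling(data, estimator, b, n_subsamples=200, seed=None):
    """Subsampling: draw b < n points WITHOUT replacement, re-estimate on each
    subsample, then analytically rescale the b-scale fluctuations to the n-scale."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = [estimator(data[rng.choice(n, size=b, replace=False)])
                 for _ in range(n_subsamples)]
    xi_b = np.std(estimates, axis=0, ddof=1)   # fluctuation on the b-scale
    return np.sqrt(b / n) * xi_b               # analytic correction (sqrt(n)-rate case)

# Illustrative usage with b(n) = n**0.6
data = np.random.default_rng(0).standard_normal(50_000)
print(subsampling(data, np.mean, b=int(len(data) ** 0.6)))
```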

27 Empirical Results: Bootstrap and Subsampling Multivariate linear regression with d = 100 and n = 50,000 on synthetic data. x coordinates sampled independently from Student-t(3). $y = w^T x + \epsilon$, where $w \in \mathbb{R}^d$ is a fixed weight vector and $\epsilon$ is Gaussian noise. Estimate $\hat{\theta}_n = \hat{w}_n \in \mathbb{R}^d$ via least squares. Compute a marginal confidence interval for each component of $\hat{w}_n$ and assess accuracy via relative mean (across components) absolute deviation from the true confidence interval size. For subsampling, use $b(n) = n^{\gamma}$ for various values of $\gamma$. Similar results are obtained with Normal and Gamma data-generating distributions, as well as when estimating a misspecified model.
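A sketch of the data-generating process described above; the noise scale and the particular fixed weight vector are not specified in the transcript, so the values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 50_000
w = rng.standard_normal(d)               # fixed weight vector (values assumed)
X = rng.standard_t(3, size=(n, d))       # covariates: i.i.d. Student-t(3)
y = X @ w + rng.standard_normal(n)       # Gaussian noise (unit variance assumed)

# Least-squares estimate of w
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```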

28 Empirical Results: Bootstrap and Subsampling

29 Bag of Little Bootstraps I'll now present a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds

30 Bag of Little Bootstraps I'll now present a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds It works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms

31 Bag of Little Bootstraps I'll now present a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds It works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms But, like the bootstrap, it doesn't require analytical rescaling

32 Bag of Little Bootstraps I'll now present a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds It works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms But, like the bootstrap, it doesn't require analytical rescaling And it's successful in practice

33 Towards the Bag of Little Bootstraps


35 Approximation

36 Pretend the Subsample is the Population

37 And bootstrap the subsample! This means resampling n times with replacement, not b times as in subsampling

38 The Bag of Little Bootstraps (BLB) The subsample contains only b points, and so the resulting empirical distribution has its support on b points But we can (and should!) resample it with replacement n times, not b times Doing this repeatedly for a given subsample gives bootstrap confidence intervals on the right scale: no analytical rescaling is necessary! Now do this (in parallel) for multiple subsamples and combine the results (e.g., by averaging)
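A minimal sketch of BLB as just described (not the authors' implementation); the standard deviation of the resample estimates again stands in for $\xi$, and the subsample and resample counts are arbitrary.

```python
import numpy as np

def blb(data, estimator, b, n_subsamples=10, n_resamples=50, seed=None):
    """Bag of Little Bootstraps: for each size-b subsample, resample n times
    WITH replacement (so the quality estimate is already on the n-scale),
    then combine the per-subsample quality estimates by averaging."""
    rng = np.random.default_rng(seed)
    n = len(data)
    xi_per_subsample = []
    for _ in range(n_subsamples):
        sub = data[rng.choice(n, size=b, replace=False)]          # b distinct points
        estimates = [estimator(sub[rng.integers(0, b, size=n)])   # n draws from b points
                     for _ in range(n_resamples)]
        xi_per_subsample.append(np.std(estimates, axis=0, ddof=1))
    return np.mean(xi_per_subsample, axis=0)

# Illustrative usage
data = np.random.default_rng(0).standard_normal(50_000)
print(blb(data, np.mean, b=int(len(data) ** 0.6)))
```

Note that this sketch materializes each size-n resample; the weighted representation discussed two slides later avoids that.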

39 The Bag of Little Bootstraps (BLB)

40 Bag of Little Bootstraps (BLB) Computational Considerations A key point: Resources required to compute $\hat{\theta}$ generally scale in the number of distinct data points This is true of many commonly used estimation algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.) Use a weighted representation of resampled datasets to avoid physical data replication Example: if the original dataset has size 1 TB with each data point 1 MB, and we take $b(n) = n^{0.6}$, then expect subsampled datasets to have size ~ 4 GB and resampled datasets (in weighted form) to have size ~ 4 GB (in contrast, bootstrap resamples have size ~ 632 GB)
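A sketch of the weighted representation: each size-n resample is stored as a length-b vector of multinomial counts over the subsample's distinct points, so nothing of size n is ever materialized. The estimator must accept weights for this to work; the weighted mean below is an illustrative stand-in, not an estimator from the talk.

```python
import numpy as np

def blb_weighted(data, weighted_estimator, b, n_subsamples=10, n_resamples=50, seed=None):
    """BLB with weighted resamples: a resample of size n drawn from b distinct
    points is represented by multinomial counts (weights summing to n),
    avoiding physical replication of the data."""
    rng = np.random.default_rng(seed)
    n = len(data)
    xi_per_subsample = []
    for _ in range(n_subsamples):
        sub = data[rng.choice(n, size=b, replace=False)]        # b distinct points
        estimates = []
        for _ in range(n_resamples):
            counts = rng.multinomial(n, np.full(b, 1.0 / b))    # counts sum to n
            estimates.append(weighted_estimator(sub, counts))
        xi_per_subsample.append(np.std(estimates, axis=0, ddof=1))
    return np.mean(xi_per_subsample, axis=0)

# Illustrative weighted estimator: a weighted mean
data = np.random.default_rng(0).standard_normal(50_000)
print(blb_weighted(data, lambda x, w: np.average(x, weights=w), b=int(len(data) ** 0.6)))
```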

41 Empirical Results: Bag of Little Bootstraps (BLB)


43 BLB: Theoretical Results, Higher-Order Correctness

44 BLB: Theoretical Results BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analysis of the bootstrap. Theorem (asymptotic consistency): Under standard assumptions (particularly that $\theta$ is Hadamard differentiable and $\xi$ is continuous), the output of BLB converges to the population value of $\xi$ as $n, b \to \infty$.

45 BLB: Theoretical Results, Higher-Order Correctness Assume: $\hat{\theta}$ is a studentized statistic. $\xi(Q_n(P))$, the population value of $\xi$ for $\hat{\theta}_n$, can be written as
$$\xi(Q_n(P)) = z + \frac{p_1}{\sqrt{n}} + \frac{p_2}{n} + \cdots,$$
where the $p_k$ are polynomials in population moments. The empirical version of $\xi$ based on resamples of size n from a single subsample of size b can also be written as
$$\xi\bigl(Q_n(\mathbb{P}^{(j)}_{n,b})\bigr) = z + \frac{\hat{p}^{(j)}_1}{\sqrt{n}} + \frac{\hat{p}^{(j)}_2}{n} + \cdots,$$
where the $\hat{p}^{(j)}_k$ are polynomials in the empirical moments of subsample j.

46 BLB: Theoretical Results, Higher-Order Correctness Also, a corresponding higher-order correctness guarantee holds if BLB's outer iterations use disjoint chunks of data rather than random subsamples

47 Part II: Large-Scale Matrix Completion and Matrix Concentration with Lester Mackey and Ameet Talwalkar University of California, Berkeley and Richard Chen, Brendan Farrell and Joel Tropp Caltech

