Subspace Embeddings for the L1 norm with Applications. Christian Sohler (TU Dortmund), David Woodruff (IBM Almaden).

Presentation transcript:

Subspace Embeddings for the L1 norm with Applications. Christian Sohler (TU Dortmund), David Woodruff (IBM Almaden)

Subspace Embeddings for the L1 norm with Applications to... Robust Regression and Hyperplane Fitting

3 Outline: Massive data sets, Regression analysis, Our results, Our techniques, Concluding remarks

4 Massive data sets. Examples: Internet traffic logs, financial data, etc. Algorithms: we want nearly linear time or less, usually at the cost of a randomized approximation.

5 Regression analysis. Regression: a statistical method to study dependencies between variables in the presence of noise.

6 Regression analysis. Linear regression: a statistical method to study linear dependencies between variables in the presence of noise.

7 Regression analysis. Linear regression: a statistical method to study linear dependencies between variables in the presence of noise. Example: Ohm's law, V = R·I.

8 Regression analysis. Linear regression: a statistical method to study linear dependencies between variables in the presence of noise. Example: Ohm's law, V = R·I. Find the linear function that best fits the data.

9 Regression analysis. Linear regression: a statistical method to study linear dependencies between variables in the presence of noise. Standard setting: one measured variable b and a set of predictor variables a_1, …, a_d. Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε, where ε is noise and the x_i are the model parameters we want to learn. Can assume x_0 = 0. Now consider n measured variables.
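To make the model concrete, here is a tiny synthetic example in Python; the names A, x_true, b and all numeric choices are illustrative, not from the talk.

```python
# Minimal synthetic illustration of the linear model b = Ax + noise.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5                       # n observations, d predictor variables
A = rng.normal(size=(n, d))          # predictor matrix
x_true = rng.normal(size=d)          # model parameters we want to learn
noise = rng.normal(scale=0.1, size=n)
b = A @ x_true + noise               # measured variable
```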

10 Regression analysis. Matrix form. Input: an n × d matrix A and a vector b = (b_1, …, b_n); n is the number of observations and d is the number of predictor variables. Output: x* so that Ax* and b are close. Consider the over-constrained case, when n ≫ d. Can assume that A has full column rank.

11 Regression analysis. Least squares method: find x* that minimizes Σ_i (b_i - ⟨A_{i*}, x⟩)², where A_{i*} is the i-th row of A; this has certain desirable statistical properties. Method of least absolute deviation (l_1-regression): find x* that minimizes Σ_i |b_i - ⟨A_{i*}, x⟩|; the cost is less sensitive to outliers than least squares.

12 Regression analysis. Geometry of regression: we want to find an x that minimizes |Ax-b|_1. The product Ax can be written as A_{*1} x_1 + A_{*2} x_2 + … + A_{*d} x_d, where A_{*i} is the i-th column of A; this is a linear d-dimensional subspace. The problem is equivalent to computing the point of the column space of A nearest to b in the l_1-norm.

13 Regression analysis. Solving l_1-regression via linear programming: minimize (1, …, 1)·(α⁺ + α⁻) subject to Ax + α⁺ - α⁻ = b and α⁺, α⁻ ≥ 0. Generic linear programming gives poly(nd) time. The best known algorithm runs in nd^5 log n + poly(d/ε) time [Clarkson].
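As a concrete, hedged illustration of this LP formulation, here is a minimal Python sketch using scipy's linprog; the helper name l1_regression_lp and the choice of the "highs" solver are my own, not part of the talk.

```python
# Sketch: solve min_x |Ax - b|_1 as an LP by splitting the residual into
# nonnegative parts alpha_plus and alpha_minus with Ax + a+ - a- = b.
import numpy as np
from scipy.optimize import linprog

def l1_regression_lp(A, b):
    n, d = A.shape
    # variables: [x (free), alpha_plus (>=0), alpha_minus (>=0)]
    c = np.concatenate([np.zeros(d), np.ones(n), np.ones(n)])
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])    # A x + a+ - a- = b
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")
    return res.x[:d]
```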

14 Our Results. A (1+ε)-approximation algorithm for the l_1-regression problem with time complexity nd·poly(d/ε) (Clarkson's is nd^5 log n + poly(d/ε)). The first 1-pass streaming algorithm with small space (poly(d log n / ε) bits). Similar results for hyperplane fitting.

15 Outline: Massive data sets, Regression analysis, Our results, Our techniques, Concluding remarks

16 Our Techniques. Notice that for any d × d change-of-basis matrix U, min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |AUx-b|_1.

17 Our Techniques. Notice that for any y ∈ R^d, min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |Ax-b+Ay|_1. We call b-Ay the residual, denoted b', and so min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |Ax-b'|_1.

18 Rough idea behind the algorithm of Clarkson:
1. Compute a poly(d)-approximation: find y such that |Ay-b|_1 ≤ poly(d) · min_{x in R^d} |Ax-b|_1, and let b' = b-Ay be the residual. (Takes nd^5 log n time.)
2. Compute a well-conditioned basis: find a basis U so that for all x ∈ R^d, |x|_1/poly(d) ≤ |AUx|_1 ≤ poly(d)·|x|_1. (Takes nd^5 log n time.)
3. Sample poly(d/ε) rows of [AU, b'] proportional to their l_1-norm; note min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |AUx-b'|_1. (Takes nd time.)
4. Solve l_1-regression on the sample, obtaining a vector x', and output x'; now generic linear programming is efficient. (Takes poly(d/ε) time.)
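To make the sampling step concrete, here is a hedged Python sketch of steps 3 and 4, assuming the well-conditioned basis AU and the residual b' are already available; the exact sampling probabilities and rescaling of Clarkson's analysis are simplified, and l1_solver stands for any l_1-regression routine (for example, the LP sketch above).

```python
# Rough sketch of the row-sampling step: sample rows of [AU, b'] with
# probability proportional to their l1 norm, then solve the reweighted
# l1-regression on the sample. Simplified for illustration.
import numpy as np

def sample_and_solve(AU, b_res, num_samples, l1_solver):
    n = AU.shape[0]
    rows = np.abs(AU).sum(axis=1) + np.abs(b_res)          # l1 norm of each row of [AU, b']
    p = np.minimum(1.0, num_samples * rows / rows.sum())   # inclusion probabilities
    keep = np.random.rand(n) < p
    w = 1.0 / p[keep]                                      # importance weights
    # solve min_x sum_i w_i |(AU x)_i - b'_i| on the sampled rows
    return l1_solver(w[:, None] * AU[keep], w * b_res[keep])
```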

19 Our Techniques. It suffices to show how to quickly compute: 1. a poly(d)-approximation, and 2. a well-conditioned basis.

20 Our main theorem. Theorem: there is a probability space over (d log d) × n matrices R such that for any n × d matrix A, with probability at least 99/100 we have, for all x, |Ax|_1 ≤ |RAx|_1 ≤ (d log d)·|Ax|_1. The embedding is linear, is independent of A, and preserves the lengths of an infinite number of vectors.
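A quick empirical sanity check of the theorem in Python is below; the row count r ≈ d log d and the 1/(d log d)-style rescaling follow the later slides, constants are glossed over, and all variable names are illustrative.

```python
# Empirically compare |RAx|_1 to |Ax|_1 for a Cauchy sketch R and random x.
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 10
r = int(d * np.log(d)) + 1
A = rng.normal(size=(n, d))
R = rng.standard_cauchy(size=(r, n)) / r          # Cauchy sketch, rescaled
RA = R @ A
ratios = [np.abs(RA @ x).sum() / np.abs(A @ x).sum()
          for x in rng.normal(size=(100, d))]
print(min(ratios), max(ratios))                   # ratios stay in a modest range (theory: within d log d w.h.p.)
```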

21 Application of our main theorem. Computing a poly(d)-approximation: compute RA and Rb, then solve x' = argmin_x |RAx-Rb|_1. The main theorem applied to [A, b] implies x' is a (d log d)-approximation. RA and Rb have only d log d rows, so this l_1-regression can be solved efficiently. The time is dominated by computing RA, a single matrix-matrix product.
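A minimal sketch of this step in Python follows; solve_l1 stands for any small-scale l_1-regression routine (such as the LP sketch earlier), and the row count and scaling are illustrative assumptions.

```python
# Sketch of the poly(d)-approximation step: sketch with a Cauchy matrix R,
# then solve the small (r x d) problem exactly.
import numpy as np

def approx_l1_regression(A, b, solve_l1, rng=None):
    rng = np.random.default_rng(2) if rng is None else rng
    n, d = A.shape
    r = int(d * np.log(d)) + 1
    R = rng.standard_cauchy(size=(r, n)) / r      # Cauchy sketch (rescaled)
    # the optimum of the sketched problem is an O(d log d)-approximation
    return solve_l1(R @ A, R @ b)
```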

22 Application of our main theorem. Computing a well-conditioned basis: 1. Compute RA. 2. Compute U so that RAU is orthonormal (in the l_2 sense). 3. Output AU. AU is well-conditioned because |AUx|_1 ≤ |RAUx|_1 ≤ (d log d)^{1/2} |RAUx|_2 = (d log d)^{1/2} |x|_2 ≤ (d log d)^{1/2} |x|_1, and |AUx|_1 ≥ |RAUx|_1 / (d log d) ≥ |RAUx|_2 / (d log d) = |x|_2 / (d log d) ≥ |x|_1 / (d^{3/2} log d). Life is really simple! Time is dominated by computing RA and AU, two matrix-matrix products.
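A hedged sketch of this step in Python: U is obtained by orthonormalizing RA via a QR factorization, which is one natural way to realize "RAU is orthonormal"; names are illustrative.

```python
# Sketch of the well-conditioned-basis step: make RAU orthonormal and
# change basis on A accordingly (assumes RA has full column rank).
import numpy as np

def well_conditioned_basis(A, R):
    RA = R @ A
    Q, T = np.linalg.qr(RA)      # RA = Q T with Q orthonormal (l2 sense)
    U = np.linalg.inv(T)         # so that (RA) U = Q is orthonormal
    return A @ U, U
```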

23 Application of our main theorem. It follows that we get an nd·poly(d/ε) time algorithm for (1+ε)-approximate l_1-regression.

24 What's left? We should prove our main theorem. Theorem: there is a probability space over (d log d) × n matrices R such that for any n × d matrix A, with probability at least 99/100 we have, for all x, |Ax|_1 ≤ |RAx|_1 ≤ (d log d)·|Ax|_1. R is simple: the entries of R are i.i.d. Cauchy random variables.

25 Cauchy random variables. pdf(z) = 1/(π(1+z²)) for z ∈ (-∞, ∞); infinite expectation and variance. 1-stable: if z_1, z_2, …, z_n are i.i.d. Cauchy, then for a ∈ R^n, a_1·z_1 + a_2·z_2 + … + a_n·z_n ~ |a|_1·z, where z is Cauchy.
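A quick empirical illustration of 1-stability in Python (the vector a and sample sizes are arbitrary choices for the demo):

```python
# a_1 z_1 + ... + a_n z_n should be distributed like |a|_1 times one Cauchy.
import numpy as np

rng = np.random.default_rng(3)
a = np.array([2.0, -1.0, 0.5])
Z = rng.standard_cauchy(size=(100000, a.size))
combo = Z @ a                                         # linear combination of Cauchys
scaled = np.abs(a).sum() * rng.standard_cauchy(size=100000)
# medians of absolute values should roughly agree (means do not exist)
print(np.median(np.abs(combo)), np.median(np.abs(scaled)))
```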

26 Proof of main theorem. By 1-stability, for each row r of R, ⟨r, Ax⟩ ~ |Ax|_1 · Z, where Z is a Cauchy. So RAx ~ (|Ax|_1 · Z_1, …, |Ax|_1 · Z_{d log d}), where Z_1, …, Z_{d log d} are i.i.d. Cauchy, and |RAx|_1 = |Ax|_1 · Σ_i |Z_i|. The |Z_i| are half-Cauchy. Σ_i |Z_i| = Ω(d log d) with probability 1 - exp(-d) by a Chernoff bound. An ε-net argument on {Ax : |Ax|_1 = 1} shows |RAx|_1 = |Ax|_1 · Ω(d log d) for all x; scale R by 1/(d log d). But Σ_i |Z_i| is heavy-tailed.
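The two tails behave very differently, which is easy to see numerically; the following small Python check (with arbitrary d) illustrates the concentrated lower tail and the heavy upper tail that the next slide deals with.

```python
# Look at S = sum_i |Z_i| over r = d log d i.i.d. Cauchys: the lower tail
# concentrates, but the upper tail is heavy.
import numpy as np

rng = np.random.default_rng(4)
d = 10
r = int(d * np.log(d)) + 1
S = np.abs(rng.standard_cauchy(size=(20000, r))).sum(axis=1)
print(np.quantile(S / r, [0.01, 0.5, 0.99]))   # low quantiles stay bounded away from 0; high ones blow up
```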

27 Proof of main theorem. Σ_i |Z_i| is heavy-tailed, so |RAx|_1 = |Ax|_1 · Σ_i |Z_i| / (d log d) may be large. Each |Z_i| has c.d.f. asymptotic to 1 - Θ(1/z) for z in [0, ∞). No problem! We know there exists a well-conditioned basis of A; we can assume the basis vectors are A_{*1}, …, A_{*d}. Then |RA_{*i}|_1 ~ |A_{*i}|_1 · Σ_j |Z_j| / (d log d), and with constant probability, Σ_i |RA_{*i}|_1 = O(log d) · Σ_i |A_{*i}|_1.

28 Proof of main theorem. Suppose Σ_i |RA_{*i}|_1 = O(log d) · Σ_i |A_{*i}|_1 for a well-conditioned basis A_{*1}, …, A_{*d}. We will use the Auerbach basis, which always exists: for all x, |x|_∞ ≤ |Ax|_1, and Σ_i |A_{*i}|_1 = d. I don't know how to compute such a basis, but it doesn't matter! Then Σ_i |RA_{*i}|_1 = O(d log d), and |RAx|_1 ≤ Σ_i |RA_{*i} x_i|_1 ≤ |x|_∞ · Σ_i |RA_{*i}|_1 = |x|_∞ · O(d log d) = O(d log d) · |Ax|_1. Q.E.D.

29 Main Theorem. Theorem: there is a probability space over (d log d) × n matrices R such that for any n × d matrix A, with probability at least 99/100 we have, for all x, |Ax|_1 ≤ |RAx|_1 ≤ (d log d)·|Ax|_1.

30 Outline: Massive data sets, Regression analysis, Our results, Our techniques, Concluding remarks

31 Regression for data streams. The streaming algorithm is given additive updates to the entries of A and b. Pick a random matrix R according to the distribution of the main theorem, and maintain RA and Rb during the stream. Find x' that minimizes |RAx'-Rb|_1 using linear programming, and compute U so that RAU is orthonormal. The hard part is sampling rows from [AU, b'] proportional to their norm: we do not know U and b' until the end of the stream. Surprisingly, there is still a way to do this in a single pass, by treating U and x as formal variables and plugging them in at the end. This uses a noisy sampling data structure (omitted from this talk). The entries of R do not need to be independent.
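The easy part, maintaining RA and Rb under additive updates, looks roughly as follows in Python; the noisy sampling structure mentioned on the slide is omitted, and the class and method names are my own.

```python
# Sketch of maintaining RA and Rb in a stream of additive updates.
import numpy as np

class L1RegressionSketch:
    def __init__(self, n, d, r, seed=0):
        rng = np.random.default_rng(seed)
        self.R = rng.standard_cauchy(size=(r, n)) / r   # fixed Cauchy sketch
        self.RA = np.zeros((r, d))
        self.Rb = np.zeros(r)

    def update_A(self, i, j, delta):
        # A[i, j] += delta, so column j of RA gains delta * R[:, i]
        self.RA[:, j] += self.R[:, i] * delta

    def update_b(self, i, delta):
        # b[i] += delta, so Rb gains delta * R[:, i]
        self.Rb += self.R[:, i] * delta
```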

32 Hyperplane Fitting. Given n points in R^d, find the hyperplane minimizing the sum of l_1-distances of the points to the hyperplane. This reduces to d invocations of l_1-regression.
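One common way to realize such a reduction is sketched below; the talk only states that the reduction exists, so the details here (fixing one normal coordinate to 1, using the l_1 point-to-hyperplane distance |⟨w, p⟩ + c| / |w|_∞, and taking the best of d candidates) are my own reconstruction and should be treated as an assumption.

```python
# Hedged sketch: fit a hyperplane w.x + c = 0 minimizing the sum of
# l1-distances by running one l1-regression per coordinate.
import numpy as np

def l1_hyperplane_fit(P, l1_solver):
    n, d = P.shape
    best = None
    for j in range(d):
        others = [k for k in range(d) if k != j]
        design = np.hstack([P[:, others], np.ones((n, 1))])  # remaining coords + offset
        coef = l1_solver(design, -P[:, j])                   # min sum_i |P[i,j] + <coef, design_i>|
        cost = np.abs(design @ coef + P[:, j]).sum()
        if best is None or cost < best[0]:
            w = np.zeros(d)
            w[j] = 1.0
            w[others] = coef[:-1]
            best = (cost, w, coef[-1])                       # (cost, normal, offset)
    return best
```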

33 Conclusion. Main results: efficient algorithms for l_1-regression and hyperplane fitting. The nd time improves the previous nd^5 log n running time for l_1-regression. The first oblivious subspace embedding for l_1.