Sketching for M-Estimators: A Unified Approach to Robust Regression
Kenneth Clarkson, David Woodruff (IBM Almaden)

Regression
Linear regression: a statistical method to study linear dependencies between variables in the presence of noise.
Example: Ohm's law, V = R ∙ I. Find the linear function that best fits the data.

Regression
Standard setting:
– One measured variable b
– A set of predictor variables a_1, …, a_d
– Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε
– ε is assumed to be noise and the x_i are model parameters we want to learn
– Can assume x_0 = 0
– Now consider n observations of b

Regression
Matrix form
Input: an n x d matrix A and a vector b = (b_1, …, b_n)
n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close
Consider the over-constrained case, when n ≫ d

Fitness Measures
Least squares method
– Find x* that minimizes |Ax-b|_2^2
– Ax* is the projection of b onto the column span of A
– Certain desirable statistical properties
– Closed form solution: x* = (A^T A)^{-1} A^T b
Method of least absolute deviation (l_1-regression)
– Find x* that minimizes |Ax-b|_1 = Σ_i |b_i – ⟨a_i, x⟩|
– Cost is less sensitive to outliers than least squares
– Can solve via linear programming
What about the many other fitness measures used in practice?
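For concreteness, here is a minimal sketch (not from the talk) of both fits in Python: least squares via the normal-equations solution, and l_1-regression via the standard linear-programming reformulation min Σ_i t_i subject to -t ≤ Ax - b ≤ t. The function names and the use of scipy.optimize.linprog are illustrative choices.

    import numpy as np
    from scipy.optimize import linprog

    def least_squares(A, b):
        # Closed form x* = (A^T A)^{-1} A^T b; lstsq is the numerically stable route.
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def l1_regression(A, b):
        # min_x |Ax - b|_1 as an LP over (x, t): minimize sum(t) s.t. -t <= Ax - b <= t.
        n, d = A.shape
        c = np.concatenate([np.zeros(d), np.ones(n)])         # objective: sum of t_i
        A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])   # Ax - t <= b and -Ax - t <= -b
        b_ub = np.concatenate([b, -b])
        bounds = [(None, None)] * d + [(0, None)] * n          # x free, t >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:d]

On data with a few large outliers in b, the l_1 fit typically moves far less than the least-squares fit.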

M-Estimators
Measure function G: R -> R_{≥0}
– G(x) = G(-x), G(0) = 0
– G is non-decreasing in |x|
|y|_M = Σ_{i=1}^n G(y_i)
Solve min_x |Ax-b|_M
Least squares and l_1-regression are special cases
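As a small illustration (not from the slides), the M-estimator cost is immediate to state in code once a measure function G is given; G is assumed to be a vectorized Python callable.

    import numpy as np

    def m_cost(G, A, x, b):
        # |Ax - b|_M = sum_i G((Ax - b)_i) for a (vectorized) measure function G.
        residuals = A @ x - b
        return np.sum(G(residuals))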

Huber Loss Function
G(x) = x^2/(2c) for |x| ≤ c
G(x) = |x| - c/2 for |x| > c
Enjoys the smoothness properties of l_2^2 and the robustness properties of l_1
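A direct transcription of the Huber measure function (vectorized, so it plugs into the m_cost sketch above); the threshold c is the usual tuning parameter.

    import numpy as np

    def huber_G(y, c=1.0):
        # Quadratic for |y| <= c, linear for |y| > c; the two pieces meet at |y| = c.
        y = np.abs(y)
        return np.where(y <= c, y**2 / (2 * c), y - c / 2)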

Other Examples
l_1-l_2 estimator: G(x) = 2((1 + x^2/2)^{1/2} – 1)
Fair estimator: G(x) = c^2 [ |x|/c – log(1 + |x|/c) ]
Tukey estimator: G(x) = (c^2/6)(1 – [1 – (x/c)^2]^3) if |x| ≤ c; G(x) = c^2/6 if |x| > c
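The other measure functions can be written the same way; these are direct transcriptions of the formulas above, with the constant c again an illustrative tuning parameter.

    import numpy as np

    def l1_l2_G(y):
        return 2.0 * (np.sqrt(1.0 + y**2 / 2.0) - 1.0)

    def fair_G(y, c=1.0):
        a = np.abs(y) / c
        return c**2 * (a - np.log1p(a))

    def tukey_G(y, c=1.0):
        return np.where(np.abs(y) <= c,
                        (c**2 / 6.0) * (1.0 - (1.0 - (y / c)**2)**3),
                        c**2 / 6.0)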

Nice M-Estimators
An M-estimator is nice if it has at least linear growth and at most quadratic growth:
there is C_G > 0 so that for all a, a' with |a| ≥ |a'| > 0, |a/a'|^2 ≥ G(a)/G(a') ≥ C_G |a/a'|
Any convex G satisfies the linear lower bound
Any sketchable G satisfies the quadratic upper bound
– sketchable => there is a distribution on t x n matrices S for which |Sx|_M = Θ(|x|_M) with probability 2/3 and t is independent of n
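As a quick, non-rigorous spot-check of the growth condition, one can sample random pairs |a| ≥ |a'| > 0 and test both inequalities for a given G. The call below reuses the huber_G sketch from above; C_G = 1 is only an illustrative guess for the Huber function, not a value claimed in the talk.

    import numpy as np

    def check_growth(G, C_G, trials=100_000, scale=10.0, seed=0):
        # Tests |a/a'|^2 >= G(a)/G(a') >= C_G * |a/a'| on random pairs with |a| >= |a'| > 0.
        rng = np.random.default_rng(seed)
        a = rng.uniform(1e-6, scale, trials)
        ap = rng.uniform(1e-6, 1.0, trials) * a          # guarantees 0 < |a'| <= |a|
        ratio = G(a) / G(ap)
        ok = ((a / ap)**2 >= ratio - 1e-9) & (ratio >= C_G * (a / ap) - 1e-9)
        return bool(ok.all())

    print(check_growth(lambda y: huber_G(y, c=1.0), C_G=1.0))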

Our Results
Let nnz(A) denote the number of non-zero entries of an n x d matrix A
1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm to output x' so that w.h.p. |Ax'-b|_H ≤ (1+ε) min_x |Ax-b|_H
2. [Nice M-Estimators] O(nnz(A)) + poly(d log n) time algorithm to output x' so that for any constant C > 1, w.h.p. |Ax'-b|_M ≤ C ∙ min_x |Ax-b|_M
Remarks:
– For convex nice M-estimators one can solve with convex programming, but slowly – poly(nd) time
– Our algorithm for nice M-estimators is universal

Talk Outline
– Huber result
– Nice M-estimators result

Naive Sampling Algorithm
min_x |Ax - b|_M is replaced by x' = argmin_x |S∙Ax - S∙b|_M
S uniformly samples poly(d/ε) rows – this is a terrible algorithm

Leverage Score Sampling
For l_p-norms, there are probabilities q_1, …, q_n with Σ_i q_i = poly(d/ε) so that sampling works:
min_x |Ax - b|_M is replaced by x' = argmin_x |S∙Ax - S∙b|_M
All q_i can be found in O(nnz(A) log n) + poly(d) time
S is diagonal: S_{i,i} = 1/q_i if row i is sampled, 0 otherwise
– For l_2, the q_i are the squared row norms in an orthonormal basis of A
– For l_p, the q_i are p-th powers of the p-norms of rows in a "well-conditioned basis" [Dasgupta et al.]
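Here is a hedged sketch of the l_2 case only: leverage scores as squared row norms of an orthonormal basis of A obtained from a thin QR factorization, used to build a row-sampling matrix. Rescaling sampled rows by 1/sqrt(p_i) is the usual choice for preserving squared l_2 norms in expectation; the oversampling parameter k and the clipping at probability 1 are illustrative choices, not details from the slide.

    import numpy as np

    def l2_leverage_sampling_regression(A, b, k, seed=0):
        # Leverage scores = squared row norms of an orthonormal basis Q of the column span of A.
        rng = np.random.default_rng(seed)
        Q, _ = np.linalg.qr(A)                       # thin QR: Q is n x d with orthonormal columns
        lev = np.sum(Q**2, axis=1)                   # leverage scores; they sum to d
        p = np.minimum(1.0, k * lev / lev.sum())     # sampling probabilities, ~k rows in expectation
        keep = rng.random(A.shape[0]) < p
        scale = 1.0 / np.sqrt(p[keep])               # rescale the sampled rows
        SA, Sb = A[keep] * scale[:, None], b[keep] * scale
        return np.linalg.lstsq(SA, Sb, rcond=None)[0]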

Huber Regression Algorithm
[Huber inequality]: For z ∈ R^n, Θ(n^{-1/2}) ∙ min(|z|_1, |z|_2^2/(2c)) ≤ |z|_H ≤ |z|_1
– Proof by case analysis
Sample from a mixture of l_1-leverage scores and l_2-leverage scores
– p_i = n^{1/2} ∙ (q_i^{(1)} + q_i^{(2)})
Our nnz(A) log n + poly(d/ε) algorithm:
– After one step, the number of rows is < n^{1/2} poly(d/ε)
– Recursively solve a weighted Huber problem
– Weights do not grow quickly
– Once the size is < n^{0.01} poly(d/ε), solve by convex programming

Talk Outline
– Huber result
– Nice M-estimators result

CountSketch
For l_2 regression, CountSketch with poly(d) rows works [Clarkson, W]:
– Compute S∙A in nnz(A) time
– Compute x' = argmin_x |SAx - Sb|_2 in poly(d) time
(The omitted figure depicts S: a sparse sign matrix with a single ±1 entry per column.)
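A minimal sketch of CountSketch applied to l_2 regression (an illustration under my own parameter choices, not the authors' code): every row of A is hashed to one of t buckets with a random sign, so S∙A is formed in a single pass over the nonzeros, and the small t x d problem is then solved exactly.

    import numpy as np

    def countsketch_regression(A, b, t, seed=0):
        # CountSketch: row i gets a random bucket h(i) in [t] and a random sign s(i);
        # (SA)[h(i)] += s(i) * A[i].  Then solve least squares on the t x d sketch.
        rng = np.random.default_rng(seed)
        n, d = A.shape
        h = rng.integers(0, t, size=n)               # bucket for each row
        s = rng.choice([-1.0, 1.0], size=n)          # random sign for each row
        SA, Sb = np.zeros((t, d)), np.zeros(t)
        np.add.at(SA, h, s[:, None] * A)             # single pass over the rows of A
        np.add.at(Sb, h, s * b)
        return np.linalg.lstsq(SA, Sb, rcond=None)[0]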

M-Sketch [ [ S 1 ¢ R 0 S 2 ¢ R 1 S 3 ¢ R 2 … S log n ¢ R log n S i are independent CountSketch matrices with poly(d) rows R i is n x n diagonal and uniformly samples a 1/b i fraction of [n] -The same M-Sketch works for all nice M-estimators! x’ = argmin x |TAx-Tb| M, w -The same M-Sketch works for all nice M-estimators! x’ = argmin x |TAx-Tb| M, w - Sketch used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang] Note: many uses of this data structure do not work since they involve a median operation

M-Sketch Intuition
Consider a fixed y = Ax - b
For the M-Sketch T, the output is |Ty|_{w,M} = Σ_i w_i G((Ty)_i)
[Contraction] |Ty|_{w,M} ≥ ½ |y|_M w.pr. 1 - exp(-d log d)
[Dilation] |Ty|_{w,M} ≤ 2 |y|_M w.pr. 9/10
Contraction allows for a net argument (no scale-invariance!)
Dilation implies the optimal y* does not dilate much

M-Sketch Analysis
Partition the coordinates into weight classes:
– S_i = {j | G(y_j) ∈ (|y|_M / b^i, |y|_M / b^{i-1}]}
If |S_i| > d log d, there is a "sampling level" containing about d log d elements of S_i (gives exp(-d log d) failure probability)
– Elements from S_j for j ≤ i do not collide
– Elements from S_j for j > i cancel within a bucket (concentrate to the 2-norm)
If |S_i| is small, all its elements are found in the top level, or S_i is not important (relate M to l_2)
If G is close to quadratic growth, need to "clip" the top buckets
– Ky Fan norm

Conclusions
Summary:
1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm
2. [Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm
Questions:
1. Is there a sketch-based estimator achieving a (1+ε)-approximation?
2. (Meta-question) Apply streaming techniques to linear algebra:
– CountSketch -> l_2-regression
– p-stable random variables -> l_p regression for p in [1,2]
– CountSketch + heavy hitters -> nice M-estimators
– Pagh's TensorSketch -> polynomial kernel regression
– …