1 Sketching for M-Estimators: A Unified Approach to Robust Regression
Kenneth Clarkson and David Woodruff, IBM Almaden

2 Regression
Linear regression: a statistical method for studying linear dependencies between variables in the presence of noise.
Example: Ohm's law, V = R ∙ I. Find the linear function that best fits the data.

3 Regression
Standard setting:
One measured variable b.
A set of predictor variables a_1, …, a_d.
Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε, where ε is assumed to be noise and the x_i are model parameters we want to learn.
Can assume x_0 = 0. Now consider n observations of b.

4 Regression
Matrix form. Input: an n × d matrix A and a vector b = (b_1, …, b_n), where n is the number of observations and d is the number of predictor variables.
Output: x* so that Ax* and b are close.
Consider the over-constrained case, when n ≫ d.

5 Fitness Measures
Least squares method: find x* that minimizes |Ax-b|_2^2.
Ax* is the projection of b onto the column span of A.
Certain desirable statistical properties.
Closed form solution: x* = (A^T A)^{-1} A^T b.
Method of least absolute deviation (l_1-regression): find x* that minimizes |Ax-b|_1 = Σ_i |b_i - <a_i, x>|, where a_i is the i-th row of A.
Cost is less sensitive to outliers than least squares.
Can solve via linear programming.
What about the many other fitness measures used in practice?
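To make the two objectives concrete, here is a minimal numpy/scipy sketch (ours, not from the talk): the closed form for least squares, and the standard linear-programming reformulation for least absolute deviation. All names and constants are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Least squares: closed form x* = (A^T A)^{-1} A^T b;
# lstsq evaluates it in a numerically stable way.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# l1 regression as a linear program in the variables (x, t):
# minimize sum_i t_i  subject to  -t <= Ax - b <= t.
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
x_l1 = res.x[:d]
```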

6 M-Estimators
Measure function G: R -> R_{≥0} with:
– G(x) = G(-x), G(0) = 0
– G non-decreasing in |x|
Define |y|_M = Σ_{i=1}^n G(y_i) and solve min_x |Ax-b|_M.
Least squares and l_1-regression are special cases.
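A minimal sketch of this definition (function names are ours):

```python
import numpy as np

def m_norm(y, G):
    """|y|_M = sum_i G(y_i) for a measure function G."""
    return np.sum(G(y))

y = np.array([1.0, -2.0, 0.5])
# G(x) = x^2 recovers the least-squares cost |y|_2^2 ...
assert m_norm(y, lambda x: x**2) == np.sum(y**2)
# ... and G(x) = |x| recovers the l1 cost |y|_1.
assert m_norm(y, np.abs) == np.sum(np.abs(y))
```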

7 Huber Loss Function
G(x) = x^2/(2c) for |x| ≤ c
G(x) = |x| - c/2 for |x| > c
Enjoys the smoothness properties of l_2^2 and the robustness properties of l_1.
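A direct transcription of the formula (the default value of c is illustrative):

```python
import numpy as np

def huber(x, c=1.0):
    """Huber measure function: quadratic on [-c, c], linear outside."""
    a = np.abs(x)
    return np.where(a <= c, a**2 / (2 * c), a - c / 2)

# The two pieces agree at |x| = c (both give c/2), so G is continuous,
# and both one-sided derivatives there equal 1, so it is smooth as well.
assert np.isclose(huber(1.0, c=1.0), huber(1.0 + 1e-12, c=1.0))
```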

8 Other Examples
L1-L2 estimator: G(x) = 2((1 + x^2/2)^{1/2} - 1)
Fair estimator: G(x) = c^2 [ |x|/c - log(1 + |x|/c) ]
Tukey estimator: G(x) = (c^2/6)(1 - [1 - (x/c)^2]^3) for |x| ≤ c; G(x) = c^2/6 for |x| > c
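Direct transcriptions of the three formulas (parameter defaults are illustrative):

```python
import numpy as np

def l1_l2(x):
    return 2.0 * (np.sqrt(1.0 + x**2 / 2.0) - 1.0)

def fair(x, c=1.0):
    a = np.abs(x) / c
    return c**2 * (a - np.log1p(a))  # log1p(a) = log(1 + a)

def tukey(x, c=1.0):
    body = c**2 / 6.0 * (1.0 - (1.0 - (x / c)**2)**3)
    return np.where(np.abs(x) <= c, body, c**2 / 6.0)
```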

9 Nice M-Estimators
An M-estimator is nice if it has at least linear and at most quadratic growth: there is C_G > 0 so that for all a, a' with |a| ≥ |a'| > 0,
|a/a'|^2 ≥ G(a)/G(a') ≥ C_G |a/a'|
Any convex G satisfies the linear lower bound.
Any sketchable G satisfies the quadratic upper bound.
– sketchable => there is a distribution on t x n matrices S for which |Sx|_M = Θ(|x|_M) with probability 2/3, where t is independent of n
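As a sanity check (ours, not from the talk), one can verify the growth condition numerically for Huber, which is convex (hence at least linear growth):

```python
import numpy as np

def huber(x, c=1.0):
    a = np.abs(x)
    return np.where(a <= c, a**2 / (2 * c), a - c / 2)

rng = np.random.default_rng(1)
a = rng.uniform(0.01, 10.0, 100_000)
ap = rng.uniform(0.01, 10.0, 100_000)
a, ap = np.maximum(a, ap), np.minimum(a, ap)       # enforce |a| >= |a'| > 0
ratio = huber(a) / huber(ap)
assert np.all(ratio <= (a / ap)**2 * (1 + 1e-9))   # quadratic upper bound
print("empirical C_G:", (ratio / (a / ap)).min())  # linear lower bound constant
```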

10 Our Results
Let nnz(A) denote the number of non-zero entries of an n x d matrix A.
1. [Huber] An O(nnz(A) log n) + poly(d log n / ε) time algorithm to output x' so that w.h.p. |Ax'-b|_H ≤ (1+ε) min_x |Ax-b|_H
2. [Nice M-Estimators] An O(nnz(A)) + poly(d log n) time algorithm to output x' so that for any constant C > 1, w.h.p. |Ax'-b|_M ≤ C · min_x |Ax-b|_M
Remarks:
– For convex nice M-estimators one can solve with convex programming, but slowly: poly(nd) time.
– Our algorithm for nice M-estimators is universal.

11 Talk Outline
Huber result
Nice M-Estimators result

12 Naive Sampling Algorithm
min_x |Ax - b|_M  ->  x' = argmin_x |S·Ax - S·b|_M
where S uniformly samples poly(d/ε) rows. This is a terrible algorithm: uniform sampling can miss the few rows that dominate the cost.

13 Leverage Score Sampling
For l_p-norms, there are probabilities q_1, …, q_n with Σ_i q_i = poly(d/ε) so that sampling works:
min_x |Ax - b|_M  ->  x' = argmin_x |S·Ax - S·b|_M
All q_i can be found in O(nnz(A) log n) + poly(d) time.
S is diagonal: S_{i,i} = 1/q_i if row i is sampled, 0 otherwise.
– For l_2, the q_i are the squared row norms in an orthonormal basis of A.
– For l_p, the q_i are p-th powers of the p-norms of rows in a "well conditioned basis" [Dasgupta et al.]
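A minimal sketch of the l_2 case (the oversampling constant is illustrative): leverage scores from a QR factorization, then independent row sampling with the diagonal rescaling above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))

Q, _ = np.linalg.qr(A)            # orthonormal basis for the column span
scores = np.sum(Q**2, axis=1)     # l2 leverage scores; they sum to d
q = np.minimum(1.0, 20 * scores)  # oversampling factor 20 is illustrative
sampled = rng.random(n) < q
S_diag = np.where(sampled, 1.0 / q, 0.0)  # S_{i,i} = 1/q_i or 0
SA = S_diag[:, None] * A          # sampled, reweighted rows (most are zero)
```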

14 Huber Regression Algorithm
[Huber inequality]: For z ∈ R^n, Θ(n^{-1/2}) · min(|z|_1, |z|_2^2/(2c)) ≤ |z|_H ≤ |z|_1
Proof by case analysis.
Sample from a mixture of l_1-leverage scores and l_2-leverage scores:
– p_i = n^{1/2} · (q_i^{(1)} + q_i^{(2)})
Our nnz(A) log n + poly(d/ε) algorithm:
– After one step, the number of rows is < n^{1/2} poly(d/ε)
– Recursively solve a weighted Huber problem
– Weights do not grow quickly
– Once the size is < n^{0.01} poly(d/ε), solve by convex programming
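A quick numerical illustration (ours, not a proof) of the Huber inequality: the upper bound |z|_H ≤ |z|_1 holds coordinate-wise, and the printed ratios show the lower bound holding up to the Θ(n^{-1/2}) factor.

```python
import numpy as np

def huber_cost(z, c=1.0):
    a = np.abs(z)
    return np.sum(np.where(a <= c, a**2 / (2 * c), a - c / 2))

rng = np.random.default_rng(0)
for n in [100, 10_000]:
    z = rng.standard_normal(n)
    H = huber_cost(z)
    assert H <= np.sum(np.abs(z))                       # upper bound, exact
    lower = min(np.sum(np.abs(z)), np.sum(z**2) / 2.0)  # c = 1
    print(n, H / (lower / np.sqrt(n)))                  # stays >= a constant
```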

15 Talk Outline Huber result Nice M-Estimators result

16 CountSketch
For l_2 regression, CountSketch S with poly(d) rows works [Clarkson, W]:
Compute S·A in nnz(A) time.
Compute x' = argmin_x |SAx - Sb|_2 in poly(d) time.
Each column of S has a single ±1 in a uniformly random row, e.g.

    S = [ 0  0  1  0  0  1  0  0 ]
        [ 1  0  0  0  0  0  0  0 ]
        [ 0  0  0 -1  1  0 -1  0 ]
        [ 0 -1  0  0  0  0  0  1 ]
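A minimal dense-numpy sketch of CountSketch (for clarity; a real implementation would exploit sparsity of A). Each row of A is hashed to one of t buckets with a random sign, so computing S·A takes one pass over the nonzeros; the row count t = 20·d^2 here is illustrative.

```python
import numpy as np

def countsketch(M, t, rng):
    """Apply a t x n CountSketch to the rows of M in one pass."""
    n = M.shape[0]
    bucket = rng.integers(0, t, size=n)        # hash h: [n] -> [t]
    sign = rng.choice([-1.0, 1.0], size=n)     # random signs
    SM = np.zeros((t, M.shape[1]))
    np.add.at(SM, bucket, sign[:, None] * M)   # SM[h(i)] += sign(i) * M[i]
    return SM

rng = np.random.default_rng(0)
n, d = 10_000, 5
A = rng.standard_normal((n, d))
b = A @ np.ones(d) + 0.1 * rng.standard_normal(n)
# Sketch [A | b] jointly so the same S hits both.
SAb = countsketch(np.column_stack([A, b]), t=20 * d**2, rng=rng)
x_sketch, *_ = np.linalg.lstsq(SAb[:, :d], SAb[:, d], rcond=None)
```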

17 M-Sketch

    T = [ S_1 · R_0        ]
        [ S_2 · R_1        ]
        [ S_3 · R_2        ]
        [ …                ]
        [ S_log n · R_log n ]

The S_i are independent CountSketch matrices with poly(d) rows; R_i is n x n diagonal and uniformly samples a 1/b^i fraction of [n].
x' = argmin_x |TAx - Tb|_{M,w}
The same M-Sketch works for all nice M-estimators!
Sketch used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang].
Note: many uses of this data structure do not work here since they involve a median operation.
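A minimal construction following the picture above (the bucket count t and branching factor b are illustrative; the talk's exact parameters may differ):

```python
import numpy as np

def m_sketch(M, t, b, rng):
    """Stack CountSketches S_i applied to geometrically rarer
    subsamples R_i; return (TM, w), with w undoing the sampling rates."""
    n = M.shape[0]
    levels = int(np.ceil(np.log(n) / np.log(b))) + 1
    blocks, weights = [], []
    for lvl in range(levels):
        rate = float(b) ** (-lvl)                # R_lvl keeps ~ n / b^lvl rows
        rows = np.where(rng.random(n) < rate)[0]
        bucket = rng.integers(0, t, size=len(rows))
        sign = rng.choice([-1.0, 1.0], size=len(rows))
        block = np.zeros((t, M.shape[1]))
        np.add.at(block, bucket, sign[:, None] * M[rows])
        blocks.append(block)
        weights.append(np.full(t, 1.0 / rate))   # weight b^lvl for this level
    return np.vstack(blocks), np.concatenate(weights)

rng = np.random.default_rng(0)
A = rng.standard_normal((4096, 4))
TA, w = m_sketch(A, t=200, b=2, rng=rng)
# One would then solve x' = argmin_x sum_i w_i * G((TAx - Tb)_i).
```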

18 M-Sketch Intuition
Consider a fixed y = Ax - b.
For the M-Sketch T, output |Ty|_{w,M} = Σ_i w_i G((Ty)_i).
[Contraction] |Ty|_{w,M} ≥ ½ |y|_M with probability 1 - exp(-d log d)
[Dilation] |Ty|_{w,M} ≤ 2 |y|_M with probability 9/10
Contraction allows for a net argument (no scale-invariance!).
Dilation implies the optimal y* does not dilate much.

19 M-Sketch Analysis
Partition coordinates into weight classes:
– S_i = {j : G(y_j) ∈ (|y|_M / b^i, |y|_M / b^{i-1}]}
If |S_i| > d log d, there is a "sampling level" containing about d log d elements of S_i (gives exp(-d log d) failure probability):
– elements from S_j for j ≤ i do not collide
– elements from S_j for j > i cancel in a bucket (concentrate to the 2-norm)
If |S_i| is small, all its elements are found in the top level, or S_i is not important (relate M to l_2).
If G is close to quadratic growth, need to "clip" the top buckets:
– Ky Fan norm

20 Conclusions
Summary:
1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm
2. [Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm
Questions:
1. Is there a sketch-based estimator for (1+ε)-approximation?
2. (Meta-question) Apply streaming techniques to linear algebra:
– CountSketch -> l_2-regression
– p-stable random variables -> l_p regression for p in [1,2]
– CountSketch + heavy hitters -> nice M-estimators
– Pagh's TensorSketch -> polynomial kernel regression
– …

