
Slide 1: Numerical Linear Algebra in the Streaming Model
Ken Clarkson (IBM) and David Woodruff (IBM)

Slide 2: The Problems
Given an n x d matrix A and an n x d' matrix B, we want estimators for:
–The matrix product A^T B
–The matrix X* minimizing ||AX - B|| (a slightly generalized version of linear regression)
–Given an integer k, the matrix A_k of rank k minimizing ||A - A_k||
We use the Frobenius matrix norm: the square root of the sum of the squared entries.

Slide 3: General Properties of Our Algorithms
–One pass over the matrix entries, given in any order (multiple updates to an entry are allowed)
–Maintain compressed versions, or sketches, of the matrices
–Do a small amount of work per entry to maintain the sketches
–Output the result using only the sketches
–Randomized approximation algorithms
Since we minimize space complexity, we restrict matrix entries to O(log nc) bits, or O(log nd) bits.

Slide 4: Matrix Compression Methods
In a line of similar efforts:
–Element-wise sampling [AM01], [AHK06]
–Row/column sampling: pick a small random subset of the rows, columns, or both [DK01], [DKM04], [DMM08]
–Sketching / random projection: maintain a small number of random linear combinations of rows or columns [S06]
–These usually need more than one pass
Here: sketching.

Slide 5: Outline
–Matrix product
–Linear regression
–Low-rank approximation

Slide 6: An Optimal Matrix Product Algorithm
A and B have n rows and a total of c columns; we want to estimate A^T B so that ||Est - A^T B|| ≤ ε ||A|| · ||B||.
Let S be an n x m sign (Rademacher) matrix:
–Each entry is +1 or -1 with probability 1/2
–m is small, set to O(log(1/δ)) ε^-2
–Entries are O(log(1/δ))-wise independent
Observation: E[A^T S S^T B / m] = A^T E[S S^T] B / m = A^T B.
Wouldn't it be nice if all the algorithm did was maintain S^T A and S^T B, and output A^T S S^T B / m?
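The observation above can be illustrated with a small NumPy sketch (our own toy code, not from the talk; the dimensions, sketch size m, and seed are arbitrary choices):

```python
import numpy as np

# A^T S S^T B / m is an unbiased estimator of A^T B, computable from the
# small sketches S^T A and S^T B alone; its Frobenius error is a small
# fraction of ||A|| ||B||.
rng = np.random.default_rng(0)
n, d1, d2, m = 1000, 20, 30, 400
A = rng.standard_normal((n, d1))
B = rng.standard_normal((n, d2))

S = rng.choice([-1.0, 1.0], size=(n, m))   # n x m sign (Rademacher) matrix
est = (A.T @ S) @ (S.T @ B) / m            # built from S^T A and S^T B only

err = np.linalg.norm(est - A.T @ B)        # Frobenius error
scale = np.linalg.norm(A) * np.linalg.norm(B)
print(err / scale)                         # small for moderate m
```

Here full independence replaces the O(log(1/δ))-wise independence the space bound needs; a streaming implementation would generate S from a small seed instead of storing it.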

Slide 7: An Optimal Matrix Product Algorithm
This does work, and we improve the previous dependence on m.
New tail estimate: for δ, ε > 0, there is m = O(log(1/δ)) ε^-2 so that
Pr[ ||A^T S S^T B / m - A^T B|| > ε ||A|| ||B|| ] ≤ δ
(again, ||C|| = [Σ_{i,j} C_{i,j}^2]^{1/2}).
This follows from bounding the O(log(1/δ))-th moment of ||A^T S S^T B / m - A^T B||.

Slide 8: Efficiency
It is easy to maintain the sketches under updates:
–O(m) time per update; O(mc log(nc)) bits of space for S^T A and S^T B
–Improves Sarlós' algorithm by a log c factor
–Sarlós' algorithm is based on the JL lemma; JL preserves all entries of A^T B up to an additive error, whereas we only preserve the overall error
–Can compute [A^T S][S^T B] / m via fast rectangular matrix multiplication
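The O(m)-per-update bookkeeping can be sketched as follows (our own toy code; the update rule, that an additive update (i, j, delta) to A changes only column j of S^T A by delta * S[i, :], is our phrasing of the slide's claim):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, m = 100, 8, 16
S = rng.choice([-1.0, 1.0], size=(n, m))

A = np.zeros((n, d))                       # kept only to check correctness
sketch = np.zeros((m, d))                  # running value of S^T A

for _ in range(500):                       # stream of updates, any order
    i, j = rng.integers(n), rng.integers(d)
    delta = rng.standard_normal()
    A[i, j] += delta
    sketch[:, j] += delta * S[i, :]        # O(m) work per update

print(np.allclose(sketch, S.T @ A))        # sketch matches recomputation
```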

Slide 9: Matrix Product Lower Bound
Our algorithm is space-optimal for constant δ; this is a new lower bound.
Reduction from a communication game, Augmented Indexing, with players Alice and Bob:
–Alice has a random x ∈ {0,1}^s
–Bob has a random i ∈ {1, 2, …, s}, and also x_{i+1}, …, x_s
–Alice sends Bob one message
–Bob should output x_i with probability at least 2/3
Theorem [MNSW]: the message must be Ω(s) bits on average.

Slide 10: Lower Bound Proof
Set s := Θ(c ε^-2 log cn).
Alice makes a matrix U, using x_1 … x_s.
Bob makes matrices U' and B, using i and x_{i+1}, …, x_s.
The algorithm's input will be A := U + U' and B, where A and B are n x c/2.
Alice:
–Runs the streaming matrix product algorithm Alg on U
–Sends Alg's state to Bob
–Bob continues Alg with A := U + U' and B
A^T B determines x_i with probability at least 2/3, by the choice of U, U', and B; this solves Augmented Indexing.
So the space of Alg must be Ω(s) = Ω(c ε^-2 log cn) bits.

Slide 11: Lower Bound Details
U = U^(1); U^(2); …; U^(log(cn)); zeros.
–Each U^(k) is a Θ(ε^-2) x c/2 submatrix with entries in {-10^k, 10^k}
–U^(k)_{i,j} = 10^k if the matched entry of x is 0, else U^(k)_{i,j} = -10^k
Bob's index i corresponds to an entry U^(k*)_{i*,j*}.
U' is such that A = U + U' = U^(1); U^(2); …; U^(k*); zeros.
–U' is determined from x_{i+1}, …, x_s
A^T B is the i*-th row of U^(k*).
||A|| ≈ ||U^(k*)||, since the entries of A grow geometrically.
ε^2 ||A||^2 · ||B||^2, the squared error, is small, so most entries of the approximation to A^T B have the correct sign.

Slide 12: Outline
–Matrix product
–Linear regression
–Low-rank approximation

Slide 13: Linear Regression
The problem: min_X ||AX - B||.
The minimizer is X* = A^- B, where A^- is the pseudo-inverse of A.
Every matrix has a singular value decomposition (SVD) A = U Σ V^T:
–If A is n x d of rank k, then U is n x k with orthonormal columns, Σ is a k x k diagonal matrix with positive diagonal, and V is d x k with orthonormal columns
–A^- = V Σ^-1 U^T
Normal equations: A^T A X = A^T B for the optimal X.
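A quick numerical check of these facts on synthetic data (our own code; dimensions are arbitrary): the pseudo-inverse solution matches direct least squares and satisfies the normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 200, 10, 3
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, p))

X_star = np.linalg.pinv(A) @ B                      # X* = A^- B
X_lstsq = np.linalg.lstsq(A, B, rcond=None)[0]      # direct least squares

ok_match = np.allclose(X_star, X_lstsq)
ok_normal = np.allclose(A.T @ A @ X_star, A.T @ B)  # A^T A X* = A^T B
print(ok_match, ok_normal)
```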

Slide 14: Linear Regression
Let S be an n x m sign matrix with m = O(d ε^-1 log(1/δ)).
The algorithm:
–Maintain S^T A and S^T B
–Return the X solving min_X ||S^T(AX - B)||
–Space is O(d^2 ε^-1 log(1/δ)) words
–Improves Sarlós' space bound by a log c factor
–Space is optimal, via a new lower bound
Main claim: with probability at least 1 - δ, ||AX - B|| ≤ (1+ε) ||AX* - B||; that is, the relative error of X is small.
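A toy version of this algorithm (our own code; m, the dimensions, and the noise level are arbitrary): solve the regression on the sketched pair (S^T A, S^T B) and compare its residual to the true optimum.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 2000, 5, 400
A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, 1)) + 0.1 * rng.standard_normal((n, 1))

S = rng.choice([-1.0, 1.0], size=(n, m))
X_sk = np.linalg.lstsq(S.T @ A, S.T @ B, rcond=None)[0]  # sketched solve
X_opt = np.linalg.lstsq(A, B, rcond=None)[0]             # exact solve

res_sk = np.linalg.norm(A @ X_sk - B)    # residual of the sketched solution
res_opt = np.linalg.norm(A @ X_opt - B)  # optimal residual
print(res_sk / res_opt)                  # close to 1, i.e. a (1 + eps) factor
```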

Slide 15: Regression Analysis
Why should the X solving min_X ||S^T(AX - B)|| be good?
–S^T approximately preserves ||AX - B|| for any fixed X
–If this worked for all X simultaneously, we'd be done
–But S^T must preserve norms even for the X chosen using S
First reduce to showing that ||A(X - X*)|| is small.
Use the normal equation A^T A X* = A^T B, which implies ||AX - B||^2 = ||AX* - B||^2 + ||A(X - X*)||^2.
Bounding ||A(X - X*)||^2 is equivalent to bounding ||U^T A(X - X*)||^2, where A = U Σ V^T from the SVD, so that U is an orthonormal basis for the column space of A.

Slide 16: Regression Analysis, Continued
Bound ||β||^2 := ||U^T A(X - X*)||^2.
By the triangle inequality, ||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β||.
The normal equations in the sketch space imply (S^T A)^T (S^T A) X = (S^T A)^T (S^T B), so
U^T S S^T U β = U^T S S^T A(X - X*) = U^T S S^T A(X - X*) + U^T S S^T (B - AX) = U^T S S^T (B - AX*).
Hence ||U^T S S^T U β / m|| = ||U^T S S^T (B - AX*) / m|| ≤ (ε/k)^{1/2} ||U|| · ||B - AX*|| (new tail estimate) = ε^{1/2} ||B - AX*||.

Slide 17: Regression Analysis, Continued
Hence, with ||β||^2 := ||U^T A(X - X*)||^2:
||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β|| ≤ ε^{1/2} ||B - AX*|| + ||U^T S S^T U β / m - β||.
Recall the spectral norm ||A||_2 = sup_x ||Ax|| / ||x||; it implies ||CD|| ≤ ||C||_2 ||D||.
So ||U^T S S^T U β / m - β|| ≤ ||U^T S S^T U / m - I||_2 ||β||.
Subspace JL: for m = Ω(k log(1/δ)), S^T approximately preserves the lengths of all vectors in a k-dimensional subspace, so ||U^T S S^T U β / m - β|| ≤ ||β|| / 2.
Therefore ||β|| ≤ 2 ε^{1/2} ||AX* - B||, and
||AX - B||^2 = ||AX* - B||^2 + ||β||^2 ≤ ||AX* - B||^2 + 4ε ||AX* - B||^2.
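The subspace-JL step can be spot-checked numerically (synthetic data, our own code; the dimensions are arbitrary): for U with orthonormal columns and a sign sketch S, the spectral deviation ||U^T S S^T U / m - I||_2 is well below 1/2.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, m = 2000, 5, 500
U = np.linalg.qr(rng.standard_normal((n, k)))[0]  # orthonormal columns
S = rng.choice([-1.0, 1.0], size=(n, m))

# Spectral-norm deviation of the sketched Gram matrix from the identity.
dev = np.linalg.norm(U.T @ S @ S.T @ U / m - np.eye(k), ord=2)
print(dev)
```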

Slide 18: Regression Lower Bound
A tight Ω(d^2 ε^-1 log(nd)) space lower bound.
Again a reduction from Augmented Indexing, this time more complicated:
–Embed ε^-1 log(nd) independent regression sub-problems into the hard instance
–Use deletions and the geometrically growing property, as in the matrix product lower bound
Choose the entries of A and b so that the algorithm's output x encodes some entries of A.

Slide 19: Regression Lower Bound
A lower bound of Ω(d^2) is already tricky, because of bit complexity.
Natural approach:
–Alice has a random d x d sign matrix A^-1
–b is a standard basis vector e_i
–Alice computes A = (A^-1)^-1 and puts it into the stream
–The solution x to min_x ||Ax - b|| is the i-th column of A^-1
–Bob can isolate entries of A^-1, solving Indexing
Wrong: A can have entries that are exponentially small!

Slide 20: Regression Lower Bound
We design A and b together (for Augmented Indexing). For example, with d = 5:

    [ 1  A_{1,2}  A_{1,3}  A_{1,4}  A_{1,5} ]   [ x_1 ]   [ 0       ]
    [ 0  1        A_{2,3}  A_{2,4}  A_{2,5} ]   [ x_2 ]   [ A_{2,4} ]
    [ 0  0        1        A_{3,4}  A_{3,5} ] · [ x_3 ] = [ A_{3,4} ]
    [ 0  0        0        1        A_{4,5} ]   [ x_4 ]   [ 1       ]
    [ 0  0        0        0        1       ]   [ x_5 ]   [ 0       ]

Back-substitution gives x_5 = 0, x_4 = 1, x_3 = 0, x_2 = 0, x_1 = -A_{1,4}.

Slide 21: Outline
–Matrix product
–Regression
–Low-rank approximation

Slide 22: Best Low-Rank Approximation
For any matrix A and integer k, there is a matrix A_k of rank k that is closest to A among all matrices of rank k.
Since A_k has rank k, it is the product C D^T of two k-column matrices C and D:
–A_k can be found from the SVD (singular value decomposition), with C = U_k and D = V_k Σ_k
–This is a good compression of A
–Applications: LSI, PCA, recommendation systems, clustering
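A minimal illustration on synthetic data: truncating the SVD gives the closest rank-k matrix, with squared Frobenius error equal to the sum of the discarded squared singular values (the Eckart-Young theorem).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 30))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # A_k = C D^T, k columns each

err = np.linalg.norm(A - A_k)                 # Frobenius distance to A
print(np.isclose(err**2, np.sum(s[k:] ** 2)))
```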

Slide 23: Best Low-Rank Approximation
Previously, nothing was known for 1-pass low-rank approximation with relative error:
–Even for k = 1, the best upper bound was O(nd log(nd)) bits
–Problem 28 of [Mut]: can one get sublinear space?
We get 1 pass and O(kε^-2 (n + dε^-2) log(nd)) space.
The update time is O(kε^-4), so the total work is O(Nkε^-4), where N is the number of non-zero entries of A.
A new space lower bound shows this is optimal up to a factor of 1/ε.

Slide 24: Best Low-Rank Approximation and S^T A
The sketch S^T A holds information about A.
In particular, there is a rank-k matrix A'_k in the rowspace of S^T A nearly as close to A as the closest rank-k matrix A_k.
–The rowspace of S^T A is the set of linear combinations of its rows
That is, ||A - A'_k|| ≤ (1+ε) ||A - A_k||.
Why is there such an A'_k?

Slide 25: Low-Rank Approximation via Regression
Apply the regression results with A → A_k and B → A.
The X minimizing ||S^T(A_k X - A)|| has ||A_k X - A|| ≤ (1+ε) ||A_k X* - A||.
But here X* = I, and X = (S^T A_k)^- S^T A.
So the matrix A_k X = A_k (S^T A_k)^- S^T A:
–Has rank k
–Lies in the rowspace of S^T A
–Is within a factor of 1+ε of the smallest distance of any rank-k matrix to A

Slide 26: Low-Rank Approximation in 2 Passes
We can't use A_k (S^T A_k)^- S^T A without finding A_k. Instead, maintain S^T A:
–One can show that if G^T has orthonormal rows, then the best rank-k approximation to A in the rowspace of G^T is A'_k G^T, where A'_k is the best rank-k approximation to AG
–After the 1st pass, compute an orthonormal basis G^T for the rowspace of S^T A
–In the 2nd pass, maintain AG
–Afterwards, compute A'_k and output A'_k G^T

Slide 27: Low-Rank Approximation in 2 Passes
Let A'_k be the best rank-k approximation to AG. For any rank-k matrix Z:
||AGG^T - A'_k G^T|| = ||AG - A'_k|| ≤ ||AG - Z|| ≤ ||AGG^T - ZG^T||
For all Y: (AGG^T - YG^T) · (A - AGG^T)^T = 0, so we can apply the Pythagorean theorem twice:
||A - A'_k G^T||^2 = ||A - AGG^T||^2 + ||AGG^T - A'_k G^T||^2 ≤ ||A - AGG^T||^2 + ||AGG^T - ZG^T||^2 = ||A - ZG^T||^2

Slide 28: 1-Pass Algorithm
With high probability, (*) ||AX - B|| ≤ (1+ε) ||AX* - B||, where
–X* minimizes ||AX - B||
–X minimizes ||S^T A X - S^T B||
Apply (*) with A → AR and B → A, and X minimizing ||S^T(ARX - A)||.
So X = (S^T AR)^- S^T A has ||ARX - A|| ≤ (1+ε) min_X ||ARX - A||.
The columnspace of AR contains a (1+ε)-approximation to A_k.
So ||ARX - A|| ≤ (1+ε) min_X ||ARX - A|| ≤ (1+ε)^2 ||A - A_k||.
Key idea: ARX = AR(S^T AR)^- S^T A is
–easy to compute in 1 pass, with small space and fast update time
–behaves like A (similar to the SVD)
–so use it instead of A in our 2-pass algorithm!

Slide 29: 1-Pass Algorithm
Algorithm:
–Maintain AR and S^T A
–Compute AR(S^T AR)^- S^T A
–Let G^T be an orthonormal basis for the rowspace of S^T A, as before
–Output the best rank-k approximation to AR(S^T AR)^- S^T A in the rowspace of S^T A
This is the same as the 2-pass algorithm, except we don't need a second pass to project A onto the rowspace of S^T A.
The analysis is similar to that for regression.

Slide 30: A Lower Bound
Alice has a binary string x; Bob has an index i and x_{i+1}, x_{i+2}, ….
Alice builds a matrix A from x: blocks of k ε^-1 columns each, with entries growing geometrically (±10, ±100, ±1000, …) across blocks, spanning n-k rows; the last k rows start as zeros.
The error is then dominated by the block of interest.
Bob also inserts a k x k identity submatrix into the block of interest.

Slide 31: Lower Bound Details
The block of interest is an (n-k) x (k ε^-1) block; alongside it, in the k extra rows, Bob inserts a k x k identity submatrix scaled by a large value P, i.e. P · I_k.
Show that any rank-k approximation must err on all of the shaded region.
So a good rank-k approximation likely has the correct sign on Bob's entry.

Slide 32: Concluding Remarks
Space bounds are tight for matrix product and regression:
–We sharpen prior upper bounds
–We prove optimal lower bounds
Space bounds are off by a factor of ε^-1 for low-rank approximation:
–First sublinear (and near-optimal) 1-pass algorithm
–We have better upper bounds for restricted cases
Open: improve the dependence on ε in the update time; lower bounds for multi-pass algorithms?
