# Numerical Linear Algebra in the Streaming Model

## Presentation on theme: "Numerical Linear Algebra in the Streaming Model"— Presentation transcript:

Numerical Linear Algebra in the Streaming Model
Ken Clarkson - IBM David Woodruff - IBM

The Problems Given n x d matrix A and n x d’ matrix B, we want estimators for The matrix product AT B The matrix X* minimizing ||AX-B|| A slightly generalized version of linear regression Given integer k, the matrix Ak of rank k minimizing ||A-Ak|| We consider the Frobenius matrix norm: square root of sum of squares

General Properties of Our Algorithms
1 pass over matrix entries, given in any order (allow multiple updates) Maintain compressed versions or “sketches” of matrices Do small work per entry to maintain the sketches Output result using the sketches Randomized approximation algorithms Since we minimize space complexity, we restrict matrix entries to be O(log nc) bits, or O(log nd) bits

Matrix Compression Methods
In a line of similar efforts… Element-wise sampling [AM01], [AHK06] Row / column sampling: pick small random subset of the rows, columns, or both [DK01], [DKM04], [DMM08] Sketching / Random Projection: maintain a small number of random linear combinations of rows or columns [S06] Usually more than 1 pass Here: sketching

Outline Matrix Product Linear Regression Low-rank approximation

An Optimal Matrix Product Algorithm
A and B have n rows, and a total of c columns, and we want to estimate ATB, so that ||Est-ATB|| · ε||A||¢||B|| Let S be an n x m sign (Rademacher) matrix Each entry is +1 or -1 with probability ½ m small, set to O(log 1/ δ) ε-2 Entries are O(log 1/δ)-wise independent Observation: E[ATSSTB/m] = ATE[SST]B/m = ATB Wouldn’t it be nice if all the algorithm did was maintain STA and STB, and output ATSSTB/m?

An Optimal Matrix Product Algorithm
This does work, and we are able to improve the previous dependence on m: New Tail Estimate: for δ, ε > 0, there is m = O(log 1/ δ) ε-2 so that Pr[||ATSSTB/m-ATB|| > ε ||A|| ||B||] · δ (again ||C|| = [Σi, j Ci, j2]1/2) Follows from bounding O(log 1/δ)-th moment of ||ATSSTB/m-ATB||

Efficiency Easy to maintain sketches given updates
O(m) time/update, O(mc log(nc)) bits of space for STA and STB Improves Sarlos’ algorithm by a log c factor. Sarlos’ algorithm based on JL Lemma JL preserves all entries of ATB up to an additive error, whereas we only preserve overall error Can compute [ATS][STB]/m via fast rectangular matrix multiplication

Matrix Product Lower Bound
Our algorithm is space-optimal for constant δ a new lower bound Reduction from a communication game Augmented Indexing, players Alice and Bob Alice has random x 2 {0,1}s Bob has random i 2 {1, 2, …, s} also xi+1, …, xs Alice sends Bob one message Bob should output xi with probability at least 2/3 Theorem [MNSW]: Message must be (s) bits on average

Lower Bound Proof Set s := £(cε-2log cn) Alice makes matrix U
Uses x1…xs Bob makes matrix U’ and B Uses i and xi+1, …, xs Alg input will be A:=U+U’ and B A and B are n x c/2 Alice: Runs streaming matrix product Alg on U Sends Alg state to Bob Bob continues Alg with A := U + U’ and B ATB determines xi with probability at least 2/3 By choice of U, U’, B Solving Augmented Indexing So space of Alg must be (s) = (cε-2log cn) bits

Lower Bound Details U = U(1); U(2); …, U(log (cn)); 0s
Each U(k) is an £(ε-2) x c/2 submatrix with entries in {-10k, 10k} U(k)i, j = 10k if matched entry of x is 0, else U(k)i, j = -10k Bob’s index i corresponds to U(k*)i*, j* U’ is such that A = U+U’ = U(1); U(2); …, U(k*); 0s U’ is determined from xi+1, …, xs ATB is i*-th row of U(k*) ||A|| ¼ ||U(k*)|| since the entries of A are geometrically increasing ε2||A||2¢ ||B||2, the squared error, is small, so most entries of the approximation to ATB have the correct sign

Outline Matrix Product Linear Regression Low-rank approximation

Linear Regression The problem: minX ||AX-B||
X* minimizing this has X* = A-B, where A- is the pseudo-inverse of A Every matrix A = UΣVT using singular value decomposition (SVD) If A is n x d of rank k, then U is n x k with orthonormal columns Σ is k x k diagonal matrix, diagonal is positive V is d x k with orthonormal columns A- = VΣ-1UT Normal Equations: ATA X = ATB for optimal X

||AX’-B|| · (1+ε)||AX*-B||
Linear Regression Let S be an n x m sign matrix, m = O(dε-1log(1/δ)) The algorithm is Maintain STA and STB Return X’ solving minX ||ST(AX-B)|| Space is O(d2 ε-1log(1/δ)) words Improves Sarlos’ space by log c factor Space is optimal via new lower bound Main claim: With probability at least 1- δ, ||AX’-B|| · (1+ε)||AX*-B|| That is, relative error for X’ is small

Regression Analysis Why should X’ solving minX ||ST(AX-B)|| be good?
ST approximately preserves AX-B for fixed X If this worked for all X, we’re done ST must preserve norms even for X’, chosen using S First reduce to showing that ||A(X*-X’)|| is small Use normal equation ATAX* = ATB Implies ||AX’-B||2 = ||AX*-B||2 + ||A(X’-X*)||2 Bounding ||A(X’-X*)||2 equivalent to bounding ||UTA(X’-X*)||2, where A = UΣVT, from SVD, and U is an orthonormal basis of the columnspace of A

Regression Analysis Continued
Bounding ||¯||2 := ||UTA(X’-X*)||2 ||¯|| · ||UTSSTU¯/m|| + ||UTSSTU¯/m-¯|| Normal equations in sketch space imply (STA)T(STA)X’ = (STA)T(STB) UTSSTU¯ = UTSSTA(X’-X*) = UTSSTA(X’-X*) + UTSST(B-AX’) = UTSST(B-AX*) || UTSSTU¯/m|| = ||UTSST(B-AX*)/m|| · (ε/k)1/2||U||¢||B-AX*|| (new tail estimate) = ε1/2 ||B-AX*||

Regression Analysis Continued
Hence, ||¯||2 := ||UTA(X’-X*)||2 ||¯|| · ||UTSST¯/m|| + ||UTSSTU¯/m-¯|| · ε1/2 ||B-AX*|| + ||UTSSTU¯/m-¯|| Recall the spectral norm: ||A||2 = supx ||Ax||/||x|| Implies ||CD|| · ||C||2 ||D|| ||UTSSTU¯/m-¯|| · ||UTSSTU/m-I ||2 ||¯|| Subspace JL: for m = (k log(1/δ)), ST approximately preserves lengths of all vectors in a k-space ||UTSSTU¯/m-¯|| · ||¯||/2 ||¯|| · 2ε1/2||AX*-B|| ||AX’-B||2 = ||AX*-B||2 + ||¯||2 = ||AX*-B||2 + 4ε||AX*-B||2

Regression Lower Bound
Tight (d2 log (nd) ε-1) space lower bound Again a reduction from augmented indexing This time more complicated Embed log (nd) ε-1 independent regression sub-problems into hard instance Uses deletions and geometrically growing property, as in matrix product lower bound Choose the entries of A and b so that the algorithm’s output x encodes some entries of A

Regression Lower Bound
Lower bound of (d2) already tricky because of bit complexity Natural approach: Alice has random d x d sign matrix A-1 b is a standard basis vector ei Alice computes A = (A-1)-1 and puts it into the stream. Solution x to minx ||Ax=b|| is i-th column of A-1 Bob can isolate entries of A-1, solving indexing Wrong: A has entries that can be exponentially small!

Regression Lower Bound
We design A and b together (Aug. Index) 1 A1, 2 A1, 3 A1, 4 A1,5 A2, 3 A2, 4 A2,5 A3, A3,5 A4,5 x1 x2 x3 x4 x5 A2,4 A3,4 1 = x5 = 0, x4 = 1, x3 = 0, x2 = 0, x1 = -A1,4

Outline Matrix Product Regression Low-rank approximation

Best Low-Rank Approximation
For any matrix A and integer k, there is a matrix Ak of rank k that is closest to A among all matrices of rank k. Since rank of Ak is k, it is the product CDT of two k-column matrices C and D Ak can be found from the SVD (singular value decomposition), where C and D are orthogonal matrices U and VΣk This is a good compression of A LSI, PCA, recommendation systems, clustering

Best Low-Rank Approximation
Previously, nothing was known for 1-pass low-rank approximation and relative error Even for k = 1, best upper bound O(nd log (nd)) bits Problem 28 of [Mut]: can one get sublinear space? We get 1-pass and O(kε-2(n+dε-2)log(nd)) space Update time is O(kε-4), so total work is O(Nkε-4), where N is the number of non-zero entries of A New space lower bound shows optimal up to 1/ε

Best Low-Rank Approximation and STA
The sketch STA holds information about A In particular, there is a rank k matrix Ak’ in the rowspace of STA nearly as close to A as the closest rank k matrix Ak The rowspace of STA is the set of linear combinations of its rows That is, ||A-Ak’|| · (1+ε)||A-Ak|| Why is there such an Ak’?

Low-Rank Approximation via Regression
Apply the regression results with A ! Ak, B ! A The X’ minimizing ||ST(AkX-A)|| has ||AkX’-A|| · (1+ ε)||Ak X*-A|| But here X* = I, and X’ = (ST Ak)- STA So the matrix AkX’ = Ak(STAk)-STA: Has rank k In the rowspace of STA Within 1+ε of smallest distance of any rank-k matrix

Low-Rank Approximation in 2 Passes
Can’t use Ak(STAk)- STA without finding Ak Instead: maintain STA Can show that if GT has orthonormal rows, then the best rank-k approximation to A in the rowspace of GT is A’kGT, where A’k is the best rank-k approximation to AG After 1st pass, compute orthonormal basis GT for rowspace of STA In 2nd pass, maintain AG Afterwards, compute A’k and A’kGT

Low-Rank Approximation in 2 Passes
A’k is best rank-k approximation to AG For any rank-k matrix Z, ||AGGT – A’kGT|| = ||AG-A’k|| · ||AG-Z|| · ||AGGT – ZGT|| For all Y: (AGGT-YGT) ¢ (A-AGGT)T = 0, so we can apply Pythagorean Theorem twice: ||A – A’kG|| = ||A-AGGT|| + ||AGGT – A’kGT|| · ||A-AGGT|| + ||AGGT – ZGT|| = ||A – ZGT||

1 Pass Algorithm With high probability,(*) ||AX’-B|| · (1+ε)||AX*-B||, where X* minimizes ||AX-B|| X’ minimizes ||STAX-STB|| Apply (*) with A ! AR and B ! A and X’ minimizing ||ST(ARX-A)|| So X’= (STAR)-STA has ||ARX’-A|| · (1+ε)minX ||ARX-A|| Columnspace of AR contains a (1+ε)-approximation to Ak So, ||ARX’-A|| · (1+ε)minX ||ARX-A|| · (1+ε)2 minX ||A-Ak|| Key idea: ARX’ = AR(STAR)-STA is easy to compute in 1-pass with small space and fast update time behaves like A (similar to SVD) use it instead of A in our 2-pass algorithm!

1 Pass Algorithm Algorithm: Maintain AR and STA Compute AR(STAR)-STA
Let GT be an orthonormal basis for the rowspace of STA, as before Output the best rank-k approximation to AR(STAR)-STA in the rowspace of STA Same as 2-pass algorithm except we don’t need a second pass to project A onto the rowspace of STA Analysis is similar to that for regression

A Lower Bound binary string x matrix A index i and xi+1, xi+2, …
k ε-1 columns per block 0s k rows 0, 0 -1000, -1000 -1000, 1000 1000, 1000 0, 0 10000, -10000, 10, -10 -10, 10 -10,-10 -100, -100 100, 100 n-k rows Error now dominated by block of interest Bob also inserts a k x k identity submatrix into block of interest

Lower Bound Details * 0s P*Ik 0s Block of interest: k ε-1 columns
k rows 0s P*Ik 0s n-k rows * Bob inserts k x k identity submatrix, scaled by large value P Show any rank-k approximation must err on all of shaded region So good rank-k approximation likely has correct sign on Bob’s entry

Concluding Remarks Space bounds are tight for product, regression
Sharpen prior upper bounds Prove optimal lower bounds Space bounds off by a factor of ε-1 for low-rank approximation First sub-linear (and near-optimal) 1-pass algorithm We have better upper bounds for restricted cases Improve the dependence on ε in the update time Lower bounds for multi-pass algorithms?

Similar presentations