1 Lp Row Sampling by Lewis Weights
Richard Peng (Georgia Tech), joint with Michael Cohen (M.I.T.)

2 Problem: Row Sampling
Given an n × d matrix A and a norm p, pick a few (rescaled) rows of A to form A' so that $\|Ax\|_p \approx_{1+\varepsilon} \|A'x\|_p$ for all x.
Multiplicative error notation $\approx_k$: $a \approx_k b$ if there exist $k_{\min}, k_{\max}$ with $k_{\max}/k_{\min} \le k$ and $k_{\min} a \le b \le k_{\max} a$.

3 Instantiation: Regression
$\min_x \|Mx - b\|_p$: p = 2 is least squares, p = 1 is robust regression.
Simplified view: $Mx - b = [M, b]\,[x; -1]$, i.e. $\min \|Ax\|_p$ with one entry of x fixed.
Since $\|Ax\|_p \approx_{1+\varepsilon} \|A'x\|_p$ for all x, we can solve the regression on A' instead.
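As a concrete check of this simplified view, here is a minimal numpy sketch (the instance is made up; `lstsq` stands in for whichever regression solver is used):

```python
import numpy as np

# Hypothetical small instance: least squares through the [M, b][x; -1] view.
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

A = np.hstack([M, b[:, None]])            # A = [M, b]
x = np.linalg.lstsq(M, b, rcond=None)[0]  # min_x ||Mx - b||_2
y = np.concatenate([x, [-1.0]])           # y = [x; -1], last entry fixed at -1

# ||Ay||_2 equals the regression residual ||Mx - b||_2
assert np.isclose(np.linalg.norm(A @ y), np.linalg.norm(M @ x - b))
```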

4 Row Sampling Algorithms
[CW`12][NN`13][LMP`13][CLMMPS`15] input sparsity time: expensive steps only on size-d matrices, O(nnz(A) + poly(d)) total.

p         | # rows     | by
2         | d log d    | matrix concentration bounds ([Rudelson-Vershynin `07], [Tropp `12])
2         | d          | [Batson-Spielman-Srivastava `09]
1         | d^{2.5}    | [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
1 < p < 2 | d^{p/2+2}  | [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
2 < p     | d^{p+1}    | [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]

Assuming ε = constant.

5 Row Sampling Graphs
$B_G$: edge-vertex incidence matrix of G; $B_H$: subset of (rescaled) edges.
$\|B_G x\|_2 \approx_{1+\varepsilon} \|B_H x\|_2$ for all x: H is a spectral/cut sparsifier of G.
[Naor `11][Matousek `97]: on graphs, a 2-norm approximation implies a p-norm approximation for all 1 ≤ p ≤ 2.

6 L2 vs. L1
A: several copies of the entry 1/2, stacked as a single column; A': the single entry 1. One-dimensional, so the only `interesting' vector is x = [1].
L2 norm: 1 vs. 1 (ratio 1). L1 norm: 2 vs. 1 (ratio 2).
The difference between L2 and L1 row sampling is analogous to the distortion between the two norms: $n^{1/2}$.

7 Our Results

p         | Previous   | Our                   | Uses
1         | d^{2.5}    | d log d               | [Talagrand `90]
1 < p < 2 | d^{p/2+2}  | d log d (log log d)^2 | [Talagrand `95]
2 < p     | d^{p+1}    | d^{p/2} log d         | [Bourgain-Milman-Lindenstrauss `89]

Key components:
Sample by Lewis weights: $w_i^{2/p} = a_i^T(A^T\mathrm{diag}(w)^{1-2/p}A)^{-1}a_i$
Compute Lewis weights in O(nnz(A) + poly(d)) time
Elementary proof of concentration when p = 1

8 Outline
Row Sampling, Lewis Weights, Computation, Proof of Concentration

9 How to sample
Uniform sampling: may drop a key row.
Importance sampling: a probability $p_i$ for each row; rescale so that $\mathbb{E}[\|A'x\|_p^p] = \|Ax\|_p^p$.
Norm/length sampling: $p_i = \|a_i\|_2^2$. Need to decorrelate the columns first: a column with a single nonzero entry can be ignored because of other, heavier columns, yet we need $\|A[1;0;\dots;0]\|_p \ne 0$.
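A one-line check of the rescaling (keep row i with probability $p_i$ and scale the kept copy by $p_i^{-1/p}$):

$$\mathbb{E}\big[\|A'x\|_p^p\big] = \sum_i p_i \left|p_i^{-1/p} a_i^Tx\right|^p = \sum_i |a_i^Tx|^p = \|Ax\|_p^p.$$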

10 Matrix-Chernoff Bounds
τ: L2 statistical leverage scores, $\tau_i = a_i^T(A^TA)^{-1}a_i$ for row i.
Matrix concentration: sampling rows with probability $p_i = \tau_i \log d$ gives $\|Ax\|_2 \approx \|A'x\|_2$ for all x w.h.p.
[Foster `49]: $\sum_i \tau_i = \mathrm{rank} \le d$, so O(d log d) rows in A'.
[CW`12][NN`13][LMP`13][CLMMPS`15]: can estimate τ in $O(\mathrm{nnz}(A) + d^{\omega+\alpha})$ time.
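A minimal dense numpy sketch of leverage-score sampling (illustrative only: the constant `c` is arbitrary, and the cited works estimate τ in input-sparsity time rather than via a dense QR):

```python
import numpy as np

def leverage_scores(A):
    """tau_i = a_i^T (A^T A)^{-1} a_i, computed from a thin QR of A."""
    Q, _ = np.linalg.qr(A)               # columns of Q: orthonormal basis of col(A)
    return np.einsum('ij,ij->i', Q, Q)   # squared row norms of Q

def l2_row_sample(A, c=10.0):
    """Keep row i with probability p_i = min(1, c * tau_i * log d) and rescale
    it by 1/sqrt(p_i), so that E[A'^T A'] = A^T A."""
    n, d = A.shape
    p = np.minimum(1.0, c * leverage_scores(A) * np.log(d))
    keep = np.random.rand(n) < p
    return A[keep] / np.sqrt(p[keep, None])
```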

11 Leverage Scores as Lengths
Approximations are basis independent: $\max_x \|Ax\|_p/\|A'x\|_p = \max_x \|AVx\|_p/\|A'Vx\|_p$ for any invertible V.
So decorrelate the columns to get $A^TA = I$ (isotropic position).
Interpretation of matrix Chernoff: when A is isotropic, length sampling works.

12 Length Sampling
Matrix Chernoff: when A is in isotropic position, length sampling works. Does this work in the p-norm?
Bad case: $k^2$ copies of $\varepsilon/k$ in some column. The total $L_2^2$ contribution is ε, but these entries contribute most of $\|A[0;1]\|_1$.

13 Uniform Isotropic Position
The definition of the position only uses L2: an n × d matrix A where A is isotropic, $A^TA = I$, and the $L_2^2$ lengths of the rows are close to uniform, $\|a_i\|_2^2 < 2d/n$.
Alternate view of length sampling: for such A, $p_i = O(d\log d/n)$ gives $\|Ax\|_2 \approx \|A'x\|_2$ for all x.
[Talagrand `90][Talagrand `95][BML `89]: uniformly sampling such A gives $\|Ax\|_p \approx \|A'x\|_p$ for all x, FOR ANY p.

14 Lewis Weights
w: Lp Lewis weights, $w_i^{2/p} = a_i^T(A^T\mathrm{diag}(w)^{1-2/p}A)^{-1}a_i$.
Split rows of A to form A' in uniform isotropic position, then sample A' uniformly to get A''.
Matrix concentration: for all 1 ≤ q ≤ 2, $\|A'x\|_q \approx \|A''x\|_q$ for all x. Need: $\|Ax\|_p = \|A'x\|_p$ for all x.
The splitting depends on p; the sampling guarantees don't.

15 Splitting-Based View
Split $a_i$ into $w_i$ copies: $\sum_i w_i$ rows in total.
Goal 0: preserve $|a_i^Tx|^p$: each copy is $w_i^{-1/p}a_i$.
Goal 1: every row has the same leverage score (D. R. Lewis).
L2 case: $w_i$ copies of $w_i^{-1/2}a_i$. Effect on $A^TA$: $\sum_i w_i\,(w_i^{-1/2}a_i)(w_i^{-1/2}a_i)^T = \sum_i a_ia_i^T$, so $w_i = \tau_i$ works: same as leverage scores.

16 Splitting When p = 1
Preserve L1: $w_i$ copies of $w_i^{-1}a_i$.
This gives leverage scores w.r.t. a different quadratic form: row $a_i$ contributes $w_i$ copies of $(w_i^{-1}a_i)(w_i^{-1}a_i)^T$ to the new $A^TA$.
Lewis quadratic form: $\sum_i w_i\,(w_i^{-1}a_i)(w_i^{-1}a_i)^T = A^TW^{-1}A$.
(Slide figure: a small worked example with $w_2 = 4$.)

17 Deriving Lewis Weights
Row $a_i$ becomes $w_i$ copies of $w_i^{-1}a_i$. Lewis quadratic form: $A^TW^{-1}A$.
Leverage score of each new row: $(w_i^{-1}a_i)^T(A^TW^{-1}A)^{-1}(w_i^{-1}a_i)$.
Require the new score $= 1$: $w_i^2 = a_i^T(A^TW^{-1}A)^{-1}a_i$.
Note: $w_i$ is fractional; interpret it as the `expected' number of copies.

18 Why = 1?
L1 Lewis weights: new score $= 1$, i.e. $w_i^2 = a_i^T(A^TW^{-1}A)^{-1}a_i$, equivalently $w_i = (w_i^{-1/2}a_i)^T(A^TW^{-1}A)^{-1}(w_i^{-1/2}a_i) = \tau_i(W^{-1/2}A)$.
[Foster `49]: $\sum_i w_i = \sum_i \tau_i(W^{-1/2}A) \le d$.
Sampling rows of A with probability $w_i \times O(\log d)$ = uniformly sampling A' down to $O(d\log d)$ rows.
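Spelled out, the two forms of the definition agree:

$$w_i = \big(w_i^{-1/2}a_i\big)^T\big(A^TW^{-1}A\big)^{-1}\big(w_i^{-1/2}a_i\big) = w_i^{-1}\,a_i^T\big(A^TW^{-1}A\big)^{-1}a_i \;\iff\; w_i^2 = a_i^T\big(A^TW^{-1}A\big)^{-1}a_i.$$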

19 Rest of the Talk
An algorithm: computation of approximate Lewis weights.
And a proof: an elementary proof of 1-norm concentration.
Key ideas: operator-based analysis, isotropic position, Rademacher processes in L1.

20 Outline
Row Sampling, Lewis Weights, Computation, Proof of Concentration

21 Algorithm
w: L1 Lewis weights, $w_i^2 = a_i^T(A^TW^{-1}A)^{-1}a_i$ where $W = \mathrm{diag}(w)$ (a recursive definition).
Algorithm: pretend w is right and iterate $w_i' \leftarrow (a_i^T(A^TW^{-1}A)^{-1}a_i)^{1/2}$. Similar to iteratively reweighted least squares.
Each iteration: compute leverage scores w.r.t. w, $O(\mathrm{nnz}(A) + d^{\omega+\alpha})$ time.
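A minimal dense numpy sketch of this iteration for p = 1 (illustrative: it assumes A has full column rank and no zero rows, and it uses a dense inverse where the fast version would use approximate leverage scores):

```python
import numpy as np

def l1_lewis_weights(A, num_iters=30):
    """Fixed-point iteration w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}."""
    n, d = A.shape
    w = np.ones(n)                                    # start from w_i = 1
    for _ in range(num_iters):
        M = A.T @ (A / w[:, None])                    # Lewis quadratic form A^T W^{-1} A
        Minv = np.linalg.inv(M)
        scores = np.einsum('ij,jk,ik->i', A, Minv, A) # a_i^T M^{-1} a_i for each row
        w = np.sqrt(scores)                           # update w'_i
    return w
```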

22 Operators
w: L1 Lewis weights, $w_i^2 = a_i^T(A^TW^{-1}A)^{-1}a_i$.
Key to the analysis: the Lewis quadratic form $A^TW^{-1}A$.
Spectral similarity: $P \approx_k Q$ if $x^TPx \approx_k x^TQx$ for all x.
Implications of $P \approx_k Q$: $P^{-1} \approx_k Q^{-1}$; $U^TPU \approx_k U^TQU$ for all matrices U; the relative condition number of P and Q is at most k.

23 Convergence Proof Outline
Update rule: $w_i' \leftarrow (a_i^T(A^TW^{-1}A)^{-1}a_i)^{1/2}$, then $w_i'' \leftarrow (a_i^T(A^TW'^{-1}A)^{-1}a_i)^{1/2}$.
Will show: when 0 < p < 4, dist(w, w') decreases rapidly.
Key step: $\mathrm{dist}(A^TW^{-1}A,\, A^TW'^{-1}A) \le \mathrm{dist}(W, W')$.
Then use that $w_i$ is the length of $a_i$ measured against $(A^TW^{-1}A)^{-1}$.

24 Convergence for L1
Assume $w \approx_k w'$, so $W \approx_k W'$ and $W^{-1} \approx_k W'^{-1}$.
Update: $w_i' \leftarrow (a_i^T(A^TW^{-1}A)^{-1}a_i)^{1/2}$, $w_i'' \leftarrow (a_i^T(A^TW'^{-1}A)^{-1}a_i)^{1/2}$.
Composition: $A^TW^{-1}A \approx_k A^TW'^{-1}A$. Invert: $(A^TW^{-1}A)^{-1} \approx_k (A^TW'^{-1}A)^{-1}$.
Apply to the vector $a_i$: $a_i^T(A^TW^{-1}A)^{-1}a_i \approx_k a_i^T(A^TW'^{-1}A)^{-1}a_i$, so $w_i'^2 \approx_k w_i''^2$ and $w_i' \approx_{k^{1/2}} w_i''$.
So $k' \leftarrow k^{1/2}$ per step: very fast convergence.

25 Overall Algorithm
Can also show: if initialized with $w_i = 1$, after 1 step we have $w_i \approx_n w_i'$.
The convergence bound then gives $w^{(t)} \approx_2 w^{(t+1)}$ after $O(\log\log n)$ rounds.
Sample with probabilities $p_i = \min\{1,\, 2\log d \cdot w_i\}$.
Uniqueness: $w_i' \leftarrow (a_i^T(A^TW^{-1}A)^{-1}a_i)^{1/2}$ is a contractive mapping, with the same convergence rate to the fixed point.
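Combining the pieces, a hedged numpy sketch of the overall L1 pipeline (it reuses the `l1_lewis_weights` sketch above; the constant 2 in the probability follows the slide, and the 1/p_i rescaling is the natural choice for preserving the L1 norm in expectation):

```python
import numpy as np

def l1_lewis_row_sample(A, c=2.0):
    """Keep row i with probability p_i = min(1, c * log(d) * w_i) and rescale
    it by 1/p_i, so that E[||A'x||_1] = ||Ax||_1 for every x."""
    n, d = A.shape
    w = l1_lewis_weights(A)                 # fixed-point iteration from the sketch above
    p = np.minimum(1.0, c * np.log(d) * w)
    keep = np.random.rand(n) < p
    return A[keep] / p[keep, None]
```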

26 Optimization Formulation
Lp Lewis weights: $w_i^{2/p} = a_i^T(A^TW^{1-2/p}A)^{-1}a_i$.
Poly-time algorithm: solve $\max \det(M)$ s.t. $\sum_i (a_i^TMa_i)^{p/2} \le d$, M positive semidefinite; the optimal M gives the Lewis quadratic form, and hence the $w_i$.
This is a convex problem when p > 2. Also leads to input-sparsity time algorithms.

27 Outline
Row Sampling, Lewis Weights, Computation, Proof of Concentration

28 Concentration Proofs
Will show: an elementary proof for p = 1.

p         | Citation                                           | # pages
1         | [Talagrand `90] + [Pisier `89, Ch. 2]              | 8 + ~30
1 < p < 2 | [Talagrand `95] + [Ledoux-Talagrand `90, Ch. 15.5] |
2 < p     | [Bourgain-Milman-Lindenstrauss `89]                | 69

Tools used: Gaussian processes; K-convexity for p = 1; majorizing measures + chaining for p > 1.

29 Simplified Setting
Uniform isotropic position: $A^TA = I$ and $\|a_i\|_2^2 < O(1/\log n)$.
Sample step: double a random half. $s_i$ = number of copies of row i, equal to 0 w.p. 1/2 and 2 w.p. 1/2.
$\|Ax\|_1 - \|A'x\|_1 = \sum_i |a_i^Tx| - \sum_i s_i|a_i^Tx| = \sum_i (1 - s_i)|a_i^Tx|$.

30 Rademacher Processes
Need to bound, over the choices of $\sigma = 1 - s$: $\max_{x:\,\|Ax\|_1\le 1} \sum_i (1 - s_i)|a_i^Tx|$.
Rademacher random variables: $\sigma_i = 1 - s_i = \pm 1$ w.p. 1/2 each, so this equals $\max_{x:\,\|Ax\|_1\le 1} \sum_i \sigma_i|a_i^Tx|$.

31 Main Steps
Goal: show that w.h.p. over the choices of $\sigma = \pm 1$, $\max_{x:\,\|Ax\|_1\le 1} \sum_i \sigma_i|a_i^Tx| \le \varepsilon$ (so that $\|A'x\|_1 \in [1-\varepsilon, 1+\varepsilon]$ whenever $\|Ax\|_1 = 1$).
Contraction lemma: turn this into a single sum. Dual norms: reduce to a single operator. Finish with scalar concentration bounds.
(By contrast, L2 matrix concentration needs eigenvalues, matrix exponentials, and trace inequalities.)

32 Contraction Lemma
[Ledoux-Talagrand `89], simplified: $\mathbb{E}_\sigma[\max_{x:\,\|Ax\|_1\le 1} \sum_i \sigma_i|a_i^Tx|] \le 2\,\mathbb{E}_\sigma[\max_{x:\,\|Ax\|_1\le 1} \sum_i \sigma_i a_i^Tx]$.
Proof via Hall's theorem on the hypercube; also holds with moment generating functions.
Need to show w.h.p. over σ: $\max_{x:\,\|Ax\|_1\le 1} \sum_i \sigma_i a_i^Tx \le \varepsilon$.

33 Transformation + Dual Norm
$\max_{x:\,\|Ax\|_1\le 1} \sum_i \sigma_i a_i^Tx = \max_{x:\,\|Ax\|_1\le 1} \sigma^TAx = \max_{x:\,\|Ax\|_1\le 1} \sigma^TAA^TAx$ (using $A^TA = I$) $\le \max_{y:\,\|y\|_1\le 1} \sigma^TAA^Ty$.
Dual norm: $\max_{y:\,\|y\|_1\le 1} b^Ty = \|b\|_\infty$, so the last quantity is $\|\sigma^TAA^T\|_\infty$.

34 Each entry: $(\sigma^TAA^T)_j = \sum_i \sigma_i a_i^Ta_j$. The $\sigma_i$ are the only random variables.
Khintchine's inequality (with the $\log n$-th moment): w.h.p. $\sum_i \sigma_i b_i \le O((\log n \cdot \|b\|_2^2)^{1/2})$.
So it suffices to bound $\sum_i b_i^2 = \sum_i (a_i^Ta_j)^2$.

35 Last Steps
$\sum_i (a_i^Ta_j)^2 = \sum_i a_j^Ta_ia_i^Ta_j = a_j^T\big(\sum_i a_ia_i^T\big)a_j = a_j^TA^TAa_j$.
`Nice case' assumptions: $A^TA = I$ (isotropic position) and $\|a_i\|_2^2 < O(1/\log n)$, so this equals $\|a_j\|_2^2 < O(1/\log n)$.
Consequence of Khintchine's inequality: w.h.p. each entry is $< O((\log n \cdot \|b\|_2^2)^{1/2}) = O(1)$.

36 Unwind the Proof Stack
Finish with scalar concentration bounds: w.h.p. $\|\sigma^TAA^T\|_\infty < \varepsilon$.
Dual norms: this bounds the maximum of a single operator, $\max_{x:\,\|Ax\|_1\le 1} \sigma^TAx < \varepsilon$.
Passing the moment generating function through the contraction lemma then gives the result.

37 Open Problems
Reference: http://arxiv.org/abs/1412.0588
Other proofs based on L1 concentration? An elementary proof for p ≠ 1? O(d) rows for 1 < p < 2? Better algorithms for p ≥ 4? Generalizations to low-rank approximation? SGD using Lewis weights? Fewer rows for structured matrices (e.g. graphs) when p > 2? Conjecture: $O(d\log^{f(p)}d)$ for graphs.

