
1 L_p Row Sampling by Lewis Weights. Richard Peng (M.I.T.), joint with Michael Cohen (M.I.T.).

2 OUTLINE: Row Sampling; Lewis Weights; Computation; Proof of Concentration.

3 DATA. n-by-d matrix A with nnz(A) non-zeros. Columns: components; rows: features. Computational tasks: identify patterns, interpret new data. (Slide graphic: officemate's 'matrix tourism' view of A.)

4 SUBSAMPLING MATRICES. Fundamental problem: row reduction when #features >> #components, i.e. #rows (n) >> #columns (d). Compute a smaller A' = SA, then run more expensive routines on A'. (What applications actually need: reduce both rows and columns.) Approaches: subspace embedding, an S that works for most A; adaptive, build S based on A.

5 LINEAR MODEL. Can add/scale data points. x: coefficients; linear combination Ax = x_1 A_{:,1} + x_2 A_{:,2} + x_3 A_{:,3}. Interpret a new data point b in this model.

6 DISTANCE MINIMIZATION. min_x ║Ax − b║_p. p = 2: Euclidean norm, least squares. p = 1: least absolute deviations, robust regression. Simplified view: Ax − b = [A, b][x; −1], so it suffices to study min ║Ax║_p with one entry of x fixed.
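A minimal numpy sketch of this reformulation (the matrix and vector are made-up illustrations, not from the talk): append b as an extra column and pin the corresponding coefficient to −1, which reproduces the same residual as the original regression.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))   # n-by-d data matrix
b = rng.standard_normal(100)        # targets

# Direct least squares: min_x ||Ax - b||_2
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Simplified view: Ax - b = [A, b] @ [x; -1]
Ab = np.hstack([A, b[:, None]])
y = np.append(x_ls, -1.0)
assert np.allclose(Ab @ y, A @ x_ls - b)
```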

7 ROW SAMPLING. Pick some (rescaled) rows of A to form A' so that ║Ax║_p ≈_{1+ε} ║A'x║_p for all x. Error notation ≈: a ≈_k b if there exist k_min, k_max with k_max/k_min ≤ k and k_min a ≤ b ≤ k_max a. This is feature selection: A' = SA, with S an Õ(d) × n matrix having one non-zero per row.

8 ON GRAPHS. A: edge-vertex incidence matrix, x: labels on vertices; the row for edge uv gives |a_i^T x|^p = |x_u − x_v|^p. p = 1: (fractional) cuts, [Benczur-Karger `96] cut sparsifiers. p = 2: energy of voltages, [Spielman-Teng `04] spectral sparsification. A' with O(d log d) rows in both cases. [Matousek `97][Naor `11]: on graphs, L_2 / spectral sparsifiers, after normalization, work for all 1 ≤ p ≤ 2.
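A tiny numpy illustration (a made-up 4-vertex cycle, not from the talk) of the edge-vertex incidence matrix: each row applied to a vertex labeling x gives x_u − x_v, so for p = 1 the norm of Ax is a (fractional) cut value.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]          # illustrative 4-cycle
A = np.zeros((len(edges), 4))
for i, (u, v) in enumerate(edges):
    A[i, u], A[i, v] = 1.0, -1.0                   # row for edge uv

x = np.array([0.0, 1.0, 1.0, 0.0])                 # labels on vertices
print(np.abs(A @ x))       # |x_u - x_v| per edge; the sum is the cut value for p = 1
```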

9 PREVIOUS POLY-TIME ALGORITHMS (number of rows, assuming ε = constant):
p = 2: d log d, via matrix concentration bounds ([Rudelson-Vershynin `07], [Tropp `12]); also d, by [Batson-Spielman-Srivastava `09].
p = 1: d^2.5, [Dasgupta-Drineas-Harb-Kumar-Mahoney `09].
1 < p < 2: d^{p/2+2}.
2 < p: d^{p+1}.
[DMMW`11][CDMMMW`12][CW`12][MM`12][NN`12][LMP`12]: input-sparsity time, O(nnz(A) + poly(d)).

10 GENERAL MATRICES. 1-dimensional example: the only `interesting' vector is x = [1]. Comparing A to a subsampled A', the L_2 distance ratio is 1 while the L_1 distance ratio is 2; in general the difference is the distortion between L_2 and L_1, which can be as large as n^{1/2}.

11 OUR RESULTS (number of rows, previous vs. ours):
p = 1: previous d^2.5, ours d log d, using [Talagrand `90].
1 < p < 2: previous d^{p/2+2}, ours d log d (log log d)^2, using [Talagrand `95].
2 < p: previous d^{p+1}, ours d^{p/2} log d, using [Bourgain-Milman-Lindenstrauss `89].
Runtime: input-sparsity time, O(nnz(A) + poly(d)); when p < 4, the poly(d) overhead is O(d^ω). For p = 1, an elementary proof that gets most details right. Will focus on p = 1 for this talk.

12 SUMMARY. Goal: sample rows of matrices to preserve ║Ax║_p. Graphs: preserving the p-norm preserves all q < p. Different notion needed for general matrices.

13 OUTLINE: Row Sampling; Lewis Weights; Computation; Proof of Concentration.

14 IMPORTANCE SAMPLING. If only one row is non-zero, we need to keep it. General scheme: a probability p_i for each row; keep row i with probability p_i, and if picked, rescale to keep the expectation. Before: |a_i^T x|^p. Sample, rescale by s_i. After: |a'_i^T x|^p = |s_i a_i^T x|^p w.p. p_i, and 0 w.p. 1 − p_i. Need E[|a'_i^T x|^p] = |a_i^T x|^p, i.e. p_i (s_i)^p = 1, so s_i ← p_i^{−1/p}. Then E[║A'x║_p^p] = ║Ax║_p^p; we still need concentration.
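A minimal numpy sketch of this scheme (the matrix, probabilities, and choice of p below are illustrative assumptions, not from the talk): each row is kept with probability p_i and rescaled by p_i^{−1/p}, so ║A'x║_p^p matches ║Ax║_p^p in expectation.

```python
import numpy as np

def subsample_rows(A, probs, p, rng):
    """Keep row i with probability probs[i]; rescale kept rows by probs[i]**(-1/p)."""
    keep = rng.random(len(probs)) < probs
    return A[keep] * (probs[keep] ** (-1.0 / p))[:, None]

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 5))
x = rng.standard_normal(5)
p, probs = 1.0, np.full(2000, 0.5)          # illustrative uniform probabilities

# E[ ||A'x||_p^p ] should equal ||Ax||_p^p; average over repeated trials.
est = np.mean([np.sum(np.abs(subsample_rows(A, probs, p, rng) @ x) ** p)
               for _ in range(200)])
print(est, np.sum(np.abs(A @ x) ** p))
```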

15 ISSUES WITH SAMPLING BY NORM. Norm sampling: p_i = ║a_i║_2^2. Problem case: a column with a single non-zero entry — we need ║A[1;0;…;0]║_p ≠ 0, so that row must always be kept. Graph analogue: a bridge edge — we need to preserve connectivity.

16 MATRIX-CHERNOFF BOUNDS. τ: L_2 statistical leverage scores, τ_i = a_i^T (A^T A)^{−1} a_i, where a_i is row i of A. [Foster `49]: Σ_i τ_i = rank ≤ d → O(d log d) rows. Sampling rows w.p. p_i = τ_i log d gives ║Ax║_2 ≈ ║A'x║_2 ∀x w.h.p. On graphs: τ_i = weight of edge × effective resistance. [CW`12][NN`13][LMP`13][CLMMPS`15]: can estimate L_2 leverage scores in O(nnz(A) + d^{ω+a}) time.
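A short numpy sketch of exact leverage scores for a small dense matrix (illustrative only; the fast nnz(A)-time estimators cited above work very differently), also checking Foster's theorem on the sum:

```python
import numpy as np

def leverage_scores(A):
    """tau_i = a_i^T (A^T A)^{-1} a_i, computed via a pseudoinverse."""
    G_inv = np.linalg.pinv(A.T @ A)
    return np.einsum('ij,jk,ik->i', A, G_inv, A)

A = np.random.default_rng(2).standard_normal((50, 4))
tau = leverage_scores(A)
print(tau.sum())   # equals rank(A) = 4, by Foster's theorem
```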

17 MATRIX-CHERNOFF BOUNDS. w*: L_1 Lewis weights, defined by w*_i^2 = a_i^T (A^T W*^{−1} A)^{−1} a_i — a recursive definition. Equivalently w*_i = w*_i^{−1} a_i^T (A^T W*^{−1} A)^{−1} a_i, i.e. the leverage score of row i of W*^{−1/2} A. Σ_i w*_i = d by Foster's theorem. Existence and uniqueness hold. Sampling rows w.p. p_i = w_i log d gives ║Ax║_1 ≈ ║A'x║_1 ∀x w.h.p. Will show: can get w ≈ w* using calls to L_2 leverage-score estimators, in O(nnz(A) + d^{ω+a}) time.

18 WHAT IS A LEVERAGE SCORE. τ_i is the length of a_i after a `whitening transform'. Approximations are basis independent: max_x ║Ax║_p / ║A'x║_p = max_x ║AUx║_p / ║A'Ux║_p, so we can recombine the columns of A: transform so that A^T A = I (isotropic position). Interpretation of matrix-Chernoff: when A is in isotropic position, norm sampling works.

19 WHITENING FOR L_1? Example: rows [1, 0], [0, (1−ε^2)^{1/2}], and k^2 copies of [0, ε/k] (i.e. split weight ε of a row into k^2 copies of (ε/k)·a_i). The small rows get total sampling probability < ε·log n — problematic when k > ε^{−1} > log n (this can also happen in a non-orthogonal manner). Yet most of ║A[0;1]║_1 comes from the small rows: k^2 · (ε/k) = kε, which is large when k > ε^{−1}.

20 WHAT WORKS FOR ANY p. If n > f(d), A is isotropic (A^T A = I), and all row norms are < 2d/n, then sampling with p_i = ½ gives ║Ax║_p ≈ ║A'x║_p ∀x, where:
p = 1: f(d) = d log d [Talagrand `90].
1 < p < 2: f(d) = d log d (log log d)^2 [Talagrand `95].
2 < p: f(d) = d^{p/2} log d [Bourgain-Milman-Lindenstrauss `89].
Symmetrization ([Rudelson-Vershynin `07]): uniformly sampling down to f(d) rows gives ║Ax║_p ≈ ║A'x║_p ∀x.

21 SPLITTING-BASED VIEW. `Generalized whitening transformation': split a_i into w_i (fractional) "copies" so that all rows have L_2 leverage score d/n; the matrix concentration bound says uniform sampling works when all L_2 leverage scores are the same. For L_2: split a_i into w_i copies of w_i^{−1/2} a_i; the quadratic form is unchanged, w_i (w_i^{−1/2} a_i)^T (w_i^{−1/2} a_i) = a_i^T a_i, so w_i ← (n/d)·τ_i suffices. To preserve |a_i^T x|^p instead: each copy = w_i^{−1/p} a_i.

22 SPLITTING FOR L_1. Example: A with rows [1, 0] and [0, 2]; with w_2 = 4, split the second row into 4 copies of [0, 1/2]. The new quadratic form is the 2×2 identity — we are measuring leverage scores w.r.t. a different matrix! Preserve L_1: w_i copies of w_i^{−1} a_i, so row a_i → w_i copies of (w_i^{−1} a_i)^T (w_i^{−1} a_i). Lewis quadratic form: Σ_i w_i (w_i^{−1} a_i)^T (w_i^{−1} a_i) = A^T W^{−1} A.
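A tiny numpy check of the 2×2 example above (purely illustrative): splitting the row [0, 2] into w_2 = 4 copies of [0, 1/2] turns the quadratic form into the identity, and this matches the Lewis quadratic form A^T W^{−1} A.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 2.0]])
w = np.array([1.0, 4.0])                     # split row 2 into 4 copies

# Explicit split: w_i copies of w_i^{-1} * a_i.
split = np.vstack([np.tile(A[i] / w[i], (int(w[i]), 1)) for i in range(2)])
print(split.T @ split)                       # the 2x2 identity

# Lewis quadratic form A^T W^{-1} A gives the same matrix.
print(A.T @ np.diag(1.0 / w) @ A)
```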

23 CHOICE OF n. Lewis quadratic form: A^T W^{−1} A. Require all L_2 leverage scores to be the same: d/n = w_i^{−2} a_i^T (A^T W^{−1} A)^{−1} a_i, or w_i = (n/d) · w_i^{−1} a_i^T (A^T W^{−1} A)^{−1} a_i. Sanity check: Σ_i w_i = (n/d) Σ_i w_i^{−1} a_i^T (A^T W^{−1} A)^{−1} a_i = (n/d)·d = n. For n' rows instead, w' ← w·(n'/n) works. Check: A^T W'^{−1} A = (n/n') A^T W^{−1} A, so a_i^T (A^T W'^{−1} A)^{−1} a_i = (n'/n) a_i^T (A^T W^{−1} A)^{−1} a_i, and w_i'^{−2} a_i^T (A^T W'^{−1} A)^{−1} a_i = (w_i n'/n)^{−2} (n'/n) a_i^T (A^T W^{−1} A)^{−1} a_i = (n/n') w_i^{−2} a_i^T (A^T W^{−1} A)^{−1} a_i = d/n'.

24 CHOICE OF n. w_i: weights to split into n rows; for n' rows instead, w' ← w·(n'/n). L_p Lewis weights: the w* that gives n' = d rows — `fractional' copies, w*_i < 1 (still a recursive definition). Number of samples of row i: (f(d)/d)·w*_i, equivalently (f(d)/n)·w_i. So f(d)/d is the sampling overhead, akin to the O(log d) from L_2 matrix Chernoff bounds.

25 LEWIS WEIGHTS. L_p Lewis weights: w* s.t. w*_i^{2/p} = a_i^T (A^T W*^{1−2/p} A)^{−1} a_i; equivalently, w* = the L_2 leverage scores of W*^{1/2−1/p} A, so Σ_i w*_i = d. For p = 2: 1/2 − 1/p = 0, the same as the L_2 leverage scores of A. A recursive definition — existence / computation shown next. 2-approximate L_p Lewis weights: w s.t. w_i^{2/p} ≈_2 a_i^T (A^T W^{1−2/p} A)^{−1} a_i; for L_1, w_i ≈_2 (a_i^T (A^T W^{−1} A)^{−1} a_i)^{1/2}.
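A short derivation filling in the one-line computation behind the equivalence above (a sketch in LaTeX): the leverage score of row i of W*^{1/2−1/p}A, taken w.r.t. its own Gram matrix, equals w*_i exactly when the fixed-point condition holds.

```latex
\tau_i\!\left(W_*^{1/2-1/p}A\right)
  = \left(w_{*,i}^{1/2-1/p} a_i\right)^{\!\top}
    \left(A^\top W_*^{\,1-2/p} A\right)^{-1}
    \left(w_{*,i}^{1/2-1/p} a_i\right)
  = w_{*,i}^{\,1-2/p}\, a_i^\top \left(A^\top W_*^{\,1-2/p} A\right)^{-1} a_i
  = w_{*,i}^{\,1-2/p}\, w_{*,i}^{\,2/p}
  = w_{*,i}.
```

Summing over i and applying Foster's theorem to W*^{1/2−1/p}A then gives Σ_i w*_i = d, as stated.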

26 INVOKING EXISTING RESULTS. Symmetrization ([Rudelson-Vershynin `07]): importance sampling using 2-approximate L_p Lewis weights gives A' s.t. ║Ax║_p ≈ ║A'x║_p ∀x, with:
p = 1: d log d rows [Talagrand `90].
1 < p < 2: d log d (log log d)^2 rows [Talagrand `95].
2 < p: d^{p/2} log d rows [Bourgain-Milman-Lindenstrauss `89].

27 SUMMARY. Goal: sample rows of matrices to preserve ║Ax║_p. Graphs: preserving the p-norm preserves all q < p. Different notion for general matrices. Sampling method: importance sampling. Leverage scores → (p-norm) Lewis weights.

28 OUTLINE: Row Sampling; Lewis Weights; Computation; Proof of Concentration.

29 FINDING L_1 LEWIS WEIGHTS. Need: w_i ≈_2 (a_i^T (A^T W^{−1} A)^{−1} a_i)^{1/2}. Algorithm: pretend w is the right answer, and iterate: w'_i ← (a_i^T (A^T W^{−1} A)^{−1} a_i)^{1/2}. Each iteration computes leverage scores w.r.t. w, in O(nnz(A) + d^{ω+a}) time. We show: if 0 < p < 4, the distance between w and w' decreases rapidly. Resembles iteratively reweighted least squares.
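A minimal numpy sketch of this fixed-point iteration for p = 1 (dense exact leverage scores and a made-up starting point w = 1; the input-sparsity-time version would instead call the fast leverage-score estimators cited earlier):

```python
import numpy as np

def l1_lewis_weights(A, iters=20):
    """Fixed-point iteration: w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}."""
    n, d = A.shape
    w = np.ones(n)
    for _ in range(iters):
        G_inv = np.linalg.pinv(A.T @ (A / w[:, None]))   # (A^T W^{-1} A)^{-1}
        w = np.sqrt(np.einsum('ij,jk,ik->i', A, G_inv, A))
    return w

A = np.random.default_rng(3).standard_normal((500, 6))
w = l1_lewis_weights(A)
print(w.sum())   # close to d = 6, matching the sum rule for Lewis weights
```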

30 CONVERGENCE PROOF OUTLINE. Iteration: w'_i ← (a_i^T (A^T W^{−1} A)^{−1} a_i)^{1/2}, then w''_i ← (a_i^T (A^T W'^{−1} A)^{−1} a_i)^{1/2}. Goal: show the distance between w' and w'' is less than the distance between w and w'. Spectral similarity of matrices: A ≈_k B if x^T A x ≈_k x^T B x ∀x. Implications of P ≈_k Q: P^{−1} ≈_k Q^{−1}, and U^T P U ≈_k U^T Q U for all matrices U; in particular this transfers to quantities of the form (a_i^T P a_i)^{1/2}.

31 CONVERGENCE FOR L_1. Iteration steps: w'_i ← (a_i^T (A^T W^{−1} A)^{−1} a_i)^{1/2}, w''_i ← (a_i^T (A^T W'^{−1} A)^{−1} a_i)^{1/2}. Assume w ≈_k w', i.e. W ≈_k W'. Invert: W^{−1} ≈_k W'^{−1}. Composition: A^T W^{−1} A ≈_k A^T W'^{−1} A, so (A^T W^{−1} A)^{−1} ≈_k (A^T W'^{−1} A)^{−1}. Apply to the vector a_i: a_i^T (A^T W^{−1} A)^{−1} a_i ≈_k a_i^T (A^T W'^{−1} A)^{−1} a_i, i.e. w'_i^2 ≈_k w''_i^2, so w'_i ≈_{k^{1/2}} w''_i. Fixed-point iteration: log(k) halves per step!

32 OVERALL SCHEME. We show: initializing with w_i = 1, after 1 step we have w_i ≈_n w'_i. The convergence bound then gives w^{(t)} ≈_2 w^{(t+1)} in O(log log n) rounds. Input-sparsity time: stop when w^{(t)} ≈_{n^c} w^{(t+1)}; O(log(1/c)) rounds suffice, at the cost of over-sampling by a factor of n^c; then p ← (2 log d)·w are good sampling probabilities. Uniqueness: w'_i ← (a_i^T (A^T W^{−1} A)^{−1} a_i)^{1/2} is a contraction mapping, and the same convergence rate to the fixed point can be shown.

33 OPTIMIZATION FORMULATION. L_p Lewis weights: w*_i^{2/p} = a_i^T (A^T W*^{1−2/p} A)^{−1} a_i. Convex problem when p > 2. Poly-time algorithm: solve max det(M) s.t. Σ_i (a_i^T M a_i)^{p/2} ≤ d, M P.S.D.; at the optimum the constraint terms (a_i^T M a_i)^{p/2} give the weights w*_i. Also leads to input-sparsity-time algorithms.

34 SUMMARY. Goal: sample rows of matrices to preserve ║Ax║_p. Graphs: preserving the p-norm preserves all q < p. Different notion for general matrices. Sampling method: importance sampling. Leverage scores → (p-norm) Lewis weights. Iterative computation when 0 < p < 4. Solutions to max determinant.

35 OUTLINE: Row Sampling; Lewis Weights; Computation; Proof of Concentration.

36 PROOFS OF KNOWN RESULTS (proof lengths, in pages):
p = 1: [Talagrand `90] + [Pisier `89, Ch. 2], 8 + ~30.
1 < p < 2: [Talagrand `95] + [Ledoux-Talagrand `90, Ch. 15.5], 16 + 12.
2 < p: [Bourgain-Milman-Lindenstrauss `89], 69.
Tools used: Gaussian processes; K-convexity for p = 1; majorizing measures for p > 1. Will show: an elementary proof for p = 1.

37 CONCENTRATION. `Nice' case: A^T A = I (isotropic position), ║a_i║_2^2 < ε^2 / log n, sampling with p_i = ½ (pick half the rows, double them). Can use this to show the general case. With s_i the multiplier for row i (0 w.p. 1/2, 2 w.p. 1/2): ║Ax║_1 − ║A'x║_1 = Σ_i |a_i^T x| − Σ_i s_i |a_i^T x| = Σ_i (1 − s_i) |a_i^T x|.

38 RADEMACHER PROCESSES. σ_i = 1 − s_i is ±1 w.p. ½ each — a Rademacher random variable. Need to bound (over choices of σ): max_{x, ║Ax║_1 ≤ 1} Σ_i σ_i |a_i^T x|. Comparison theorem [Ledoux-Talagrand `89]: it suffices to bound (in expectation over σ) max_{x, ║Ax║_1 ≤ 1} Σ_i σ_i a_i^T x = max_{x, ║Ax║_1 ≤ 1} σ^T A x. Proof via Hall's theorem on the hypercube.

39 TRANSFORMATION. max_{x, ║Ax║_1 ≤ 1} σ^T A x = max_{x, ║Ax║_1 ≤ 1} σ^T A A^T A x (using the assumption A^T A = I) ≤ max_{y, ║y║_1 ≤ 1} σ^T A A^T y = ║σ^T A A^T║_∞, by the dual norm identity max_{y, ║y║_1 ≤ 1} b^T y = ║b║_∞.

40 EACH ENTRY. (σ^T A A^T)_j = Σ_i σ_i a_i^T a_j. Khintchine's inequality (with log n-th moment): w.h.p. Σ_i σ_i b_i ≤ O((log n · ║b║_2^2)^{1/2}). Here Σ_i (a_i^T a_j)^2 = Σ_i a_j^T a_i a_i^T a_j = a_j^T A^T A a_j.

41 INITIAL ASSUMPTIONS. A^T A = I (isotropic position) and ║a_i║_2^2 < ε^2 / log n, so a_j^T A^T A a_j = ║a_j║_2^2 < ε^2 / log n. Khintchine's inequality: w.h.p. each entry is < O((log n · ║b║_2^2)^{1/2}) = O(ε). Unwind the proof stack: w.h.p. ║σ^T A A^T║_∞ < ε, so max_{x, ║Ax║_1 ≤ 1} σ^T A x < ε; passing the moment generating function through the comparison theorem gives the result.
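A small numpy experiment (an illustrative random instance, not from the talk) checking the quantity bounded above: for a random A brought into isotropic position, ║σ^T A A^T║_∞ stays well below 1 and is comparable to the Khintchine-style estimate sqrt(log n) · max_j ║a_j║_2.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 4000, 10
A = rng.standard_normal((n, d))
A = A @ np.linalg.inv(np.linalg.cholesky(A.T @ A).T)   # isotropic position: A^T A = I

vals = []
for _ in range(100):
    sigma = rng.choice([-1.0, 1.0], size=n)            # Rademacher signs
    vals.append(np.max(np.abs(sigma @ A @ A.T)))       # ||sigma^T A A^T||_inf
khintchine_estimate = np.sqrt(np.log(n)) * np.max(np.sum(A**2, axis=1)) ** 0.5
print(np.mean(vals), khintchine_estimate)
```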

42 SUMMARY. Goal: sample rows of matrices to preserve ║Ax║_p. Graphs: preserving the p-norm preserves all q < p. Different notion for general matrices. Sampling method: importance sampling. Leverage scores → (p-norm) Lewis weights. Iterative computation when 0 < p < 4. Solutions to max determinant. Concentration: bound the max of a vector; follows from scalar Chernoff bounds.

43 OPEN PROBLEMS. What are Lewis weights on graphs? An elementary proof for p ≠ 1? O(d) rows for 1 < p < 2? Better algorithms for p ≥ 4. Fewer rows for structured matrices (e.g. graphs) when p > 2? Conjecture: O(d log^{f(p)} d) for graphs. Generalize to low-rank approximations. Reference: http://arxiv.org/abs/1412.0588

