
Sampling: an Algorithmic Perspective Richard Peng M.I.T.


1 Sampling: an Algorithmic Perspective Richard Peng M.I.T.

2 OUTLINE Structure preserving sampling Sampling as a recursive ‘driver’ Sampling the inaccessible What can sampling preserve?

3 RANDOM SAMPLING Collection of many objects Pick a small subset of them Goal: Estimate quantities Small approximates Use in algorithms

4 SAMPLING CAN APPROXIMATE Point sets Matrices Graphs Gradients

5 PRESERVING GRAPH STRUCTURES Undirected graph, n vertices, m < n² edges Is n² edges (dense) sometimes necessary? For some information, e.g. connectivity: encoded by a spanning forest, < n edges Deterministic, O(m) time algorithm
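
A minimal sketch (not from the talk) of the spanning-forest point above: a single union-find pass keeps only edges that connect previously separate components, preserving connectivity with fewer than n edges in near-O(m) time.

```python
def spanning_forest(n, edges):
    """Return a subset of `edges` forming a spanning forest on vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:              # path halving keeps the trees shallow
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                       # edge joins two components: keep it
            parent[ru] = rv
            forest.append((u, v))
    return forest

# Example: a triangle plus a pendant vertex; 3 of the 4 edges survive.
print(spanning_forest(4, [(0, 1), (1, 2), (2, 0), (2, 3)]))
```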

6 MORE INTRICATE STRUCTURES k-connectivity: # of disjoint paths between s and t (Menger's theorem / maxflow-mincut) Cut: # of edges leaving a subset of vertices [Benczur-Karger `96]: for ANY G, can sample to get H with O(n log n) edges s.t. G ≈ H on all cuts (≈: multiplicative approximation) Stronger: preserves the weights of all 2^n cuts in the graph

7 MORE GENERAL: ROW SAMPLING Row sampling: given an m × n matrix A with m >> n, sample a few rows to form A' s.t. ║Ax║_2 ≈ ║A'x║_2 ∀x ║Ax║_p: finite dimensional Banach space Sampling: embedding Banach spaces, e.g. [BLM `89], [Talagrand `90]

8 HOW TO SAMPLE? Widely used: uniform sampling Works well when data is uniform e.g. complete graph Problem: long path, removing any edge changes connectivity (can also have both in one graph) More systematic view of sampling?

9 SPECTRAL SPARSIFICATION VIA EFFECTIVE RESISTANCE [Spielman-Srivastava `08]: suffices to sample with probabilities at least O(log n) times weight times effective resistance Effective resistance: commute time / (2m); equals the statistical leverage score in unweighted graphs

10 L_2 MATRIX-CHERNOFF BOUNDS τ: L_2 statistical leverage scores, τ_i = b_i^T (B^T B)^{-1} b_i = ║b_i║²_{L^{-1}} where L = B^T B [Foster `49]: Σ_i τ_i = rank ≤ n ⇒ O(n log n) rows [Rudelson-Vershynin `07], [Tropp `12]: sampling with p_i ≥ τ_i · O(log n) gives B' s.t. ║Bx║_2 ≈ ║B'x║_2 ∀x w.h.p. Near optimal for: L_2 row samples of B, graph sparsifiers In practice O(log n) → 5 usually suffices; can also improve via derandomization
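
A minimal numpy sketch of this sampling rule, assuming a dense matrix small enough to form (B^T B)^+ directly; the constant C = 5 echoes the "O(log n) → 5" remark above, and the toy matrix is a hypothetical example, not from the talk.

```python
import numpy as np

def leverage_scores(B):
    """tau_i = b_i^T (B^T B)^+ b_i  (L2 statistical leverage scores)."""
    G_pinv = np.linalg.pinv(B.T @ B)
    return np.einsum('ij,jk,ik->i', B, G_pinv, B)

def leverage_score_sample(B, C=5.0, rng=np.random.default_rng(0)):
    """Keep row i with prob p_i = min(1, C*log(n)*tau_i) and rescale by 1/sqrt(p_i)."""
    m, n = B.shape
    p = np.minimum(1.0, C * np.log(n) * leverage_scores(B))
    keep = rng.random(m) < p
    return B[keep] / np.sqrt(p[keep])[:, None]

# Sanity check: ||Bx||_2 stays roughly the same after sampling.
B = np.vstack([np.random.randn(2000, 10), 100 * np.eye(10)])   # a few "heavy" rows
Bp = leverage_score_sample(B)
x = np.random.randn(10)
print(Bp.shape[0], np.linalg.norm(B @ x), np.linalg.norm(Bp @ x))
```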

11 THE `RIGHT’ PROBABILITIES Extreme examples: a matrix with only one non-zero row, or a column with a single entry: that row has τ_i = 1, while uniform sampling picks it with probability only n/m Path + clique: path edges have τ_i ≈ 1, clique edges have τ_i ≈ 1/n τ: L_2 statistical leverage scores, τ_i = b_i^T (B^T B)^{-1} b_i Any good upper bounds on τ_i lead to size reductions

12 OUTLINE Structure preserving sampling Sampling as a recursive ‘driver’ Sampling the inaccessible What can sampling preserve?

13 ALGORITHMIC TEMPLATES W-cycle: T(m) = 2T(m/2) + O(m) = O(m log m); instances: sorting, FFT, Voronoi / Delaunay V-cycle: T(m) = T(m/2) + O(m) = O(m); instances: selection, parallel independent set, routing

14 EFFICIENT GRAPH ALGORITHMS Partition via separators Difficulty: many non-separable graphs exist, and it is easy to compose hard instances

15 SIZE REDUCTION Ultra-sparsifier: for any k, can find H ≈_k G that is a tree + O(m log^c n / k) edges e.g. [Koutis-Miller-P `10]: obtain crude estimates of τ_i via a tree; H is equivalent to a graph of size O(m log^c n / k) Picking k > log^c n gives reductions

16 INSTANCE: Lx = b Input: graph Laplacian L, vector b Output: x ≈_ε L^+ b (L^+: pseudo-inverse; approximate solution, omitting log(1/ε) factors) Runtimes: [KMP `10, `11]: O(m log n) work, O(m^{1/3}) depth [CKPPR `14, CMPPX `14]: O(m log^{1/2} n) work, O(m^{1/3}) depth Size reduction + recursive Chebyshev iteration: T(m) = k^{1/2} (T(m log^c n / k) + O(m))
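
A hypothetical toy instance of the Lx = b interface, solved with off-the-shelf conjugate gradient from scipy. This only illustrates the problem being solved, not the cited O(m log^{1/2} n) algorithms, which rely on recursive preconditioning rather than plain CG.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import cg

# Toy instance: L x = b on a path graph.
n = 1000
adj = sp.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1])   # path graph adjacency
L = laplacian(adj)                                           # graph Laplacian
b = np.zeros(n); b[0], b[-1] = 1.0, -1.0                     # b orthogonal to the all-ones nullspace

x, info = cg(L, b, maxiter=20 * n)
x -= x.mean()                                                # pick the mean-zero representative
print(info, np.linalg.norm(L @ x - b))                       # residual should be tiny
```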

17 INSTANCE: INPUT-SPARSITY TIME NUMERICAL ALGORITHMS [Li-Miller-P `13]: create a smaller approximation, recurse on it, bring the solution back Similar: Nystrom method (sample, then post-process)

18 INSTANCE: APPROX MAXFLOW [Sherman `13], [KLOS `14]: structure approximators ⇒ fast maxflow routines [Racke-Shah-Taubig `14]: good approximator by solving maxflows [P `14]: build the approximator on a smaller graph, recurse on instances with smaller total size, and absorb the additional (small) error via more calls to the approximator; total cost O(m log^c n)

19 OUTLINE Structure preserving sampling Sampling as a recursive ‘driver’ Sampling the inaccessible What can sampling preserve?

20 DENSE OBJECTS Matrix inverse Schur complement K-step random walks Cost-prohibitive to store Application of separators Directly access sparse approximates?

21 TWO STEP RANDOM WALKS A: one step of random walk A^2: 2-step random walk Still a graph, can sparsify!

22 WHAT THIS ENABLED [P-Spielman `14] use this to approximate (I - A)^{-1} = (I + A)(I + A^2)(I + A^4)… Similar to multi-level methods Skipping: control / propagation of error Combining known tools: efficiently sparsify I - A^2 without computing A^2 [Cheng-Cheng-Liu-P-Teng `15]: sparsified Newton's method for matrix roots and Gaussian sampling
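
A quick numerical check of the identity above on a hypothetical damped random-walk matrix (damped so the spectral radius is below 1 and the product converges); repeated squaring means only logarithmically many factors are needed.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6)); W = W + W.T                 # random weighted graph
P = W / W.sum(axis=1, keepdims=True)                # random-walk matrix
A = 0.9 * P                                         # damp so the spectral radius is < 1

I = np.eye(6)
prod = I.copy()
Ak = A.copy()
for _ in range(20):                                 # factors (I + A)(I + A^2)(I + A^4)...
    prod = prod @ (I + Ak)
    Ak = Ak @ Ak                                    # squaring: log(#terms) factors suffice
print(np.max(np.abs(prod - np.linalg.inv(I - A))))  # ~ machine precision
```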

23 MATRIX SQUARING Connectivity vs. the more general setting:
Iteration: A_{i+1} ≈ A_i^2 vs. I - A_{i+1} ≈ I - A_i^2, repeated until ║A_d║ is small
Size reduction: low degree vs. sparse graph
Method: derandomized vs. randomized
Solution transfer: connectivity vs. solution vectors
Related: NC algorithm for shortest path; logspace connectivity [Reingold `02]; deterministic squaring [Rozenman-Vadhan `05]

24 LONGER RANDOM WALKS A: one step of random walk A^3: 3 steps of random walk (part of) edge uv in A^3 corresponds to a length-3 path in A: u-y-z-v

25 PSEUDOCODE Repeat O(c m log n ε^{-2}) times: 1. Uniformly randomly pick 1 ≤ k ≤ c and an edge e = uv 2. Perform a (k-1)-step random walk from u 3. Perform a (c-k)-step random walk from v 4. Add a scaled copy of the resulting edge to the sparsifier Resembles: local clustering, approximate triangle counting (c = 3) [Cheng-Cheng-Liu-P-Teng `15]: combine this with repeated squaring to approximate any random walk polynomial in nearly-linear time
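
A runnable sketch of the loop above for an unweighted graph. The per-sample scaling is simplified to a plain 1/num_samples (the actual algorithm rescales by walk-probability ratios), num_samples stands in for the O(c m log n ε^{-2}) count, and all names here are illustrative.

```python
import random
from collections import defaultdict

def sample_walk_polynomial(n, edges, c, num_samples, rng=random.Random(0)):
    """Estimate the c-step walk graph A^c by stitching walks around a random edge."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def walk(u, steps):
        for _ in range(steps):
            u = rng.choice(adj[u])
        return u

    sparsifier = defaultdict(float)
    for _ in range(num_samples):
        k = rng.randint(1, c)                     # position of edge e inside the length-c walk
        u, v = edges[rng.randrange(len(edges))]
        if rng.random() < 0.5:                    # orient the edge uniformly
            u, v = v, u
        a = walk(u, k - 1)                        # (k-1)-step walk from one endpoint
        b = walk(v, c - k)                        # (c-k)-step walk from the other
        sparsifier[(a, b)] += 1.0 / num_samples   # add a (crudely) scaled copy
    return sparsifier

# Toy usage: 3-step walks on a 5-cycle.
edges = [(i, (i + 1) % 5) for i in range(5)]
print(sample_walk_polynomial(5, edges, c=3, num_samples=2000))
```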

26 GAUSSIAN ELIMINATION Partial state of Gaussian elimination: a linear system on a subset of the variables Graph theoretic interpretation: equivalent circuit on the boundary vertices, via Y-Δ transforms [Lee-P-Spielman, in progress]: approximate such circuits in O(m log^c n) time
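
A small numpy illustration of the "equivalent circuit" view: eliminating the interior vertices of a Laplacian leaves the Schur complement on the boundary, which is again a Laplacian; the star-to-triangle example below is the classic Y-Δ transform. Names and the example graph are illustrative.

```python
import numpy as np

def schur_complement(L, boundary):
    """L/interior = L_BB - L_BI L_II^{-1} L_IB, the equivalent circuit on `boundary`."""
    idx = np.arange(L.shape[0])
    B = np.asarray(boundary)
    I = np.setdiff1d(idx, B)                     # interior = eliminated vertices
    L_BB, L_BI = L[np.ix_(B, B)], L[np.ix_(B, I)]
    L_IB, L_II = L[np.ix_(I, B)], L[np.ix_(I, I)]
    return L_BB - L_BI @ np.linalg.solve(L_II, L_IB)

# Y-Delta example: a unit-weight star with center 3 and leaves 0,1,2 becomes a
# triangle on {0,1,2} with weight 1/3 on each edge after eliminating vertex 3.
L = np.array([[ 1,  0,  0, -1],
              [ 0,  1,  0, -1],
              [ 0,  0,  1, -1],
              [-1, -1, -1,  3]], dtype=float)
print(schur_complement(L, [0, 1, 2]))
```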

27 WHAT THIS ENABLES [Lee-P-Spielman, in progress] O(n) time approximate Cholesky factorization for graph Laplacians [Lee-Sun, `15] constructible in nearly-linear work

28 OUTLINE Structure preserving sampling Sampling as a recursive ‘driver’ Sampling the inaccessible What can sampling preserve?

29 MORE GENERAL STRUCTURES Non-linear structures Directed constraints: Ax ≤ b

30 OTHER NORMS Generalization of row sampling: given A and q, find A' s.t. ║Ax║_q ≈ ║A'x║_q ∀x q-norm: ║y║_q = (Σ_i |y_i|^q)^{1/q} 1-norm: standard for representing cuts, used in sparse recovery / robust regression Applications (for general A): feature selection, low rank approximation / PCA

31 L_1 ROW SAMPLING L_1 Lewis weights ([Lewis `78]): w s.t. w_i^2 = a_i^T (A^T W^{-1} A)^{-1} a_i, a recursive definition! Sampling with p_i ≥ w_i · O(log n) gives ║Ax║_1 ≈ ║A'x║_1 ∀x Can check: Σ_i w_i ≤ n ⇒ O(n log n) rows [Talagrand `90, "Embedding subspaces of L_1 into L_1^N"]: can be analyzed as row-sampling / sparsification [Cohen-P `15]: iterate w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}; converges in log log n steps
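
A compact numpy sketch of that fixed-point iteration, assuming a dense A small enough to form (A^T W^{-1} A)^+ directly; the iteration cap and the test matrix are hypothetical choices.

```python
import numpy as np

def l1_lewis_weights(A, iters=10):
    """Fixed-point iteration for L1 Lewis weights:
       w_i <- ( a_i^T (A^T W^{-1} A)^{-1} a_i )^{1/2},  W = diag(w)."""
    m, n = A.shape
    w = np.ones(m)
    for _ in range(iters):                               # log log n steps suffice in theory
        M = np.linalg.pinv(A.T @ (A / w[:, None]))       # (A^T W^{-1} A)^+
        w = np.sqrt(np.einsum('ij,jk,ik->i', A, M, A))
    return w

A = np.vstack([np.random.randn(500, 8), 50 * np.eye(8)])  # a few dominant rows
w = l1_lewis_weights(A)
print(w.sum(), w[-8:])    # sum_i w_i <= n (per the slide); dominant rows get weight near 1
```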

32 WHERE THIS FITS IN
Reference                        #rows for q=2     #rows for q=1      Runtime
Dasgupta et al. `09              -                 n^2.5              mn^5
Magdon-Ismail `10                n log^2 n         -                  mn^2
Sohler-Woodruff `11              -                 n^3.5              mn^{ω-1+θ}
Drineas et al. `12               n log n           -                  mn log n
Clarkson et al. `12              -                 n^4.5 log^1.5 n    mn log n
Clarkson-Woodruff `12            n^2 log n         n^8                nnz
Mahoney-Meng `12                 n^2               n^3.5              nnz + n^6
Nelson-Nguyen `12                n^{1+θ}           -                  nnz
Li et al. `13                    n log n           n^3.66             nnz + n^{ω+θ}
Cohen et al. `14, Cohen-P `15    n log n           n log n            nnz + n^{ω+θ}
[Cohen-P `15]: elementary, optimization-motivated proof of w.h.p. concentration for L_1

33 CONNECTION TO LEARNING THEORY Sparsely-used dictionary learning: given Y, find A, X so that ║Y - AX║ is small and X is sparse [Spielman-Wang-Wright `12]: L_1 regression solves this using about n^2 samples [Luh-Vu `15]: generic chaining, O(n log^4 n) samples suffice Proof in [Cohen-P `15] gives O(n log^2 n) samples Key: if X satisfies the Bernoulli-Subgaussian model, then ║Xy║_1 is close to its expectation for all y 'Right' bound should be O(n log n)

34 UNSPARSIFIABLE INSTANCE Directed complete bipartite graph: removing any edge u → v makes v unreachable from u Preserve less structure?

35 WEAKER REQUIREMENT Sample only needs to make gains in some directions [Cohen-Kyng-Pachocki-P-Rao `14]: point-wise convergence without matrix concentration

36 UNIFORM SAMPLING? Nystrom method (on matrices): pick a random subset of the data, compute on the subset, post-process the result Post-processing: theoretical works before us copy x over; practical versions use projection or least-squares fitting [CLMMPS `15]: half the rows as A' gives good sampling probabilities for A that sum to ≤ 2n How powerful is (recursive) post-processing?
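
A rough numpy sketch of the [CLMMPS `15] statement above as I read it: leverage scores measured against a uniform half of the rows overestimate the true ones while keeping a small total. The clipping at 1 and the toy matrix are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def estimated_leverage_scores(A, rng=np.random.default_rng(0)):
    """Estimate A's leverage scores using only a uniform half A' of its rows:
       tau_hat_i = min(1, a_i^T (A'^T A')^+ a_i)."""
    m, n = A.shape
    half = rng.random(m) < 0.5                       # uniform half of the rows
    M = np.linalg.pinv(A[half].T @ A[half])
    return np.minimum(1.0, np.einsum('ij,jk,ik->i', A, M, A))

A = np.vstack([np.random.randn(4000, 10), 100 * np.eye(10)])
tau_hat = estimated_leverage_scores(A)
print(tau_hat.sum())      # roughly <= 2n, even though half the rows were never examined
```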

37 WHY IS THIS EFFECTIVE? Needle in a haystack: only d dimensions, can’t have too many, easy to find via post-process Hay in a haystack: half the data should still contain some info

38 FUTURE WORK What structures can sampling preserve? What does sampling need to preserve? More concretely: more sparsification-based algorithms (e.g. multi-grid maxflow?), sampling directed graphs, hardness results?

