Fast Regression Algorithms Using Spectral Graph Theory
Richard Peng

OUTLINE
Regression: why and how
Spectra: fast solvers
Graphs: tree embeddings

LEARNING / INFERENCE
Find (hidden) patterns in (noisy) data. Input: a signal s. Output: the inferred pattern x.

REGRESSION
Minimize |x|_p subject to convex constraints on x, e.g. linear equalities. For p ≥ 1 the norm |x|_p is convex, so the whole problem is convex.

APPLICATION 0: LASSO
[Tibshirani `96]: min |x|_1 s.t. Ax = s. Widely used in practice: structured output, robust to noise.
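
A minimal sketch of the closely related penalized form, min ½|Ax − s|_2² + λ|x|_1, solved by iterative soft-thresholding (ISTA); the penalty λ, iteration count, and function name are illustrative assumptions, not from the talk:

```python
import numpy as np

def ista_lasso(A, s, lam=0.1, iters=500):
    """ISTA sketch for min 0.5*||A x - s||_2^2 + lam*||x||_1.
    lam and iters are hypothetical; this is not the talk's algorithm."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - s))          # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-threshold
    return x
```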

APPLICATION 1: IMAGES
Poisson image processing: min Σ_{i~j∈E} (x_i − x_j − s_ij)². (No bears were harmed in the making of these slides.)
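
Since the objective is quadratic, setting its gradient to zero gives a graph Laplacian system. A sketch for a 4-neighbor grid, assuming prescribed horizontal/vertical differences s_h, s_v as inputs (the function and variable names are hypothetical):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def poisson_solve(n, s_h, s_v):
    """Minimize sum over grid edges of (x_i - x_j - s_ij)^2 on an n-by-n grid.
    Zero gradient gives the Laplacian system (B^T B) x = B^T s, B = incidence matrix."""
    idx = np.arange(n * n).reshape(n, n)
    edges = []                                      # (u, v, target difference x_u - x_v)
    for i in range(n):
        for j in range(n - 1):
            edges.append((idx[i, j], idx[i, j + 1], s_h[i, j]))
    for i in range(n - 1):
        for j in range(n):
            edges.append((idx[i, j], idx[i + 1, j], s_v[i, j]))
    m = len(edges)
    rows = np.repeat(np.arange(m), 2)
    cols = np.array([(u, v) for u, v, _ in edges]).ravel()
    vals = np.tile([1.0, -1.0], m)
    B = sp.csr_matrix((vals, (rows, cols)), shape=(m, n * n))
    s = np.array([d for _, _, d in edges])
    L = B.T @ B                                     # graph Laplacian of the grid
    # L is singular (constants in the nullspace); lsqr still finds a solution
    return spla.lsqr(L, B.T @ s)[0]
```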

APPLICATION 2: MIN CUT
Remove the fewest edges to separate vertices s and t: min Σ_{ij∈E} |x_i − x_j| s.t. x_s = 0, x_t = 1. Fractional solution = integral solution.
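
A sketch of this LP using scipy.optimize.linprog with the standard linearization |x_i − x_j| ≤ y_e; unit edge weights and the variable layout are my assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def fractional_min_cut(n, edges, s, t):
    """LP relaxation: min sum_e y_e s.t. y_e >= |x_i - x_j|, x_s = 0, x_t = 1.
    Variables: x_0..x_{n-1} then y_0..y_{m-1}; unit edge weights assumed."""
    m = len(edges)
    c = np.concatenate([np.zeros(n), np.ones(m)])         # objective: sum of y_e
    A_ub, b_ub = [], []
    for k, (i, j) in enumerate(edges):
        for sign in (1.0, -1.0):                          # +/-(x_i - x_j) <= y_e
            row = np.zeros(n + m)
            row[i], row[j], row[n + k] = sign, -sign, -1.0
            A_ub.append(row)
            b_ub.append(0.0)
    A_eq = np.zeros((2, n + m))
    A_eq[0, s], A_eq[1, t] = 1.0, 1.0                     # pin x_s = 0, x_t = 1
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=[0.0, 1.0],
                  bounds=[(None, None)] * (n + m))
    return res.fun, res.x[:n]
```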

REGRESSION ALGORITHMS
Convex optimization:
1940~1960: simplex, tractable
1960~1980: ellipsoid, polynomial time
1980~2000: interior point, efficient: Õ(m^{1/2}) interior steps
(m = # non-zeros; Õ hides log factors)

EFFICIENCY MATTERS
m > 10^6 for most images. Even bigger (10^9): videos, 3D medical data.

KEY SUBROUTINE
Each of the Õ(m^{1/2}) interior point steps finds a step direction by solving a linear system.

MORE REASONS FOR FAST SOLVERS
[Boyd-Vandenberghe `04] (figure caption): "The growth in the average number of Newton iterations (on randomly generated SDPs) … is very small."

LINEAR SYSTEM SOLVERS
[1st century CE] Gaussian elimination: O(m^3)
[Strassen `69]: O(m^2.8)
[Coppersmith-Winograd `90]: O(m^2.376)
[Stothers `10]: O(m^2.374)
[Vassilevska Williams `11]: O(m^2.373)
Total: > m^2

NOT FAST → NOT USED
Preferred in practice: coordinate descent, subgradient methods. Solution quality traded for time.

FAST GRAPH BASED L_2 REGRESSION [SPIELMAN-TENG `04]
Input: linear system Ax = b where A is related to graphs.
Output: solution to Ax = b.
Runtime: nearly linear, Õ(m).

GRAPHS USING ALGEBRA Fast convergence + Low cost per step = state of the art algorithms

LAPLACIAN PARADIGM
[Daitch-Spielman `08]: mincost flow
[Christiano-Kelner-Mądry-Spielman-Teng `11]: approximate maximum flow / minimum cut

EXTENSION 1
[Chin-Mądry-Miller-P `12]: regression, image processing, grouped L_2

EXTENSION 2
[Kelner-Miller-P `12]: k-commodity flow. Dual: k-variate labeling of graphs.

EXTENSION 3 [Miller-P `13] : faster for structured images / separable graphs

NEED: FAST LINEAR SYSTEM SOLVERS
Implications of fast solvers: fast regression routines; parallel, work efficient graph algorithms.

OTHER APPLICATIONS
[Tutte `66]: planar embedding
[Boman-Hendrickson-Vavasis `04]: PDEs
[Orecchia-Sachdeva-Vishnoi `12]: balanced cut / graph separator

OUTLINE
Regression: why and how
Spectra: linear system solvers
Graphs: tree embeddings

PROBLEM
Given: an n-by-n matrix A with m non-zeros and a vector b, solve Ax = b.

SPECIAL STRUCTURE OF A
A = Deg − Adj, where Deg = diag(degree) and Adj is the adjacency matrix: A_ij = deg(i) if i = j, and −w(ij) otherwise.
[Gremban-Miller `96]: extensions to SDD matrices
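
A sketch of constructing this matrix with scipy.sparse; the (u, v, w) edge-list input format is my assumption:

```python
import numpy as np
import scipy.sparse as sp

def laplacian(n, weighted_edges):
    """Build A = Deg - Adj: A_ii = weighted degree of i, A_ij = -w(ij) for i != j."""
    rows = [u for u, v, w in weighted_edges] + [v for u, v, w in weighted_edges]
    cols = [v for u, v, w in weighted_edges] + [u for u, v, w in weighted_edges]
    vals = [w for u, v, w in weighted_edges] * 2          # symmetric: both directions
    Adj = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    Deg = sp.diags(np.asarray(Adj.sum(axis=1)).ravel())
    return (Deg - Adj).tocsr()
```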

UNSTRUCTURED GRAPHS
Social networks; also, the intermediate systems arising inside other algorithms are almost adversarial.

NEARLY LINEAR TIME SOLVERS [SPIELMAN-TENG `04]
Input: n-by-n graph Laplacian A with m non-zeros, vector b, where b = Ax for some x.
Output: approximate solution x' s.t. |x − x'|_A < ε|x|_A, where |y|_A = √(y^T A y) is the A-norm.
Runtime: nearly linear, O(m log^c n log(1/ε)) expected; the log(1/ε) is the cost per bit of accuracy.
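
These solvers are not in standard libraries, so as a stand-in, a sketch using Jacobi-preconditioned conjugate gradient, which at least measures progress in the same A-norm; the preconditioner choice is mine, and the rtol keyword assumes a recent SciPy (older versions call it tol):

```python
import scipy.sparse.linalg as spla

def solve_laplacian(L, b, eps=1e-8):
    """Stand-in for a nearly linear time solver: Jacobi-preconditioned CG.
    CG minimizes the error in the A-norm at each step, matching the guarantee's norm."""
    d = L.diagonal().copy()
    d[d == 0] = 1.0                                   # guard isolated vertices
    M = spla.LinearOperator(L.shape, matvec=lambda v: v / d)
    x, info = spla.cg(L, b, rtol=eps, M=M)            # rtol: SciPy >= 1.12 keyword
    return x
```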

HOW MANY LOGS
Runtime: O(m log^c n log(1/ε)). Value of c: I don't know. Guesses: [Spielman]: c ≤ 70, [Koutis]: c ≤ 15, [Miller]: c ≤ 32, [Teng]: c ≤ 12, [Orecchia]: c ≤ 6.
When n = 10^6, log^6 n > 10^6.

PRACTICAL NEARLY LINEAR TIME SOLVERS [KOUTIS-MILLER-P `10]
Same input and guarantee as above; runtime: O(m log² n log(1/ε)).

PRACTICAL NEARLY LINEAR TIME SOLVERS [KOUTIS-MILLER-P `11]
Same input and guarantee; runtime: O(m log n log(1/ε)).

STAGES OF THE SOLVER
Iterative methods
Spectral sparsifiers
Low stretch spanning trees

ITERATIVE METHODS Numerical analysis: Can solve systems in A by iteratively solving spectrally similar, but easier, B

WHAT IS SPECTRALLY SIMILAR?
A ≺ B ≺ kA for some small k. Ideas from scalars hold! A ≺ B: for any vector x, |x|_A² ≤ |x|_B².
[Vaidya `91]: since G is a graph, H should be too!

`EASIER' H
Goal: an H similar to G that is easier to solve. Ways of being easier: fewer vertices, fewer edges. Can reduce vertex count if edge count is small.

GRAPH SPARSIFIERS
Sparse equivalents of graphs that preserve something. Spanners preserve distances/diameter; cut sparsifiers preserve all cuts. What we need: the spectrum.

WHAT WE NEED: ULTRASPARSIFIERS
[Spielman-Teng `04]: ultrasparsifiers with n−1+O(m log^p n / k) edges imply solvers with O(m log^p n) running time.
Given: G with n vertices, m edges, and a parameter k.
Output: H with n vertices and n−1+O(m log^p n / k) edges.
Goal: G ≺ H ≺ kG

EXAMPLE: COMPLETE GRAPH
O(n log n) random edges (with scaling) suffice w.h.p.

GENERAL GRAPH SAMPLING MECHANISM
For each edge e, flip a coin with Pr(keep) = P(e); rescale kept edges to maintain expectation. Expected number of edges kept: Σ_e P(e). Also need to prove concentration.
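
A sketch of the mechanism; the (u, v, w) edge-list format and function name are assumptions, and the concentration step (a matrix Chernoff bound) is exactly the part a sketch like this does not show:

```python
import random

def sample_edges(edges, p):
    """Keep edge e with probability p[e], rescaling its weight by 1/p[e] so the
    sampled graph's Laplacian equals the original's in expectation."""
    kept = []
    for e, (u, v, w) in enumerate(edges):
        if random.random() < p[e]:
            kept.append((u, v, w / p[e]))
    return kept                     # expected size: sum of p[e]
```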

EFFECTIVE RESISTANCE
View the graph as an electrical circuit. R(u,v): pass 1 unit of current from u to v and measure the resistance of the circuit.

EE101
Effective resistance in general: solve Gx = e_uv, where e_uv is the indicator vector (+1 at u, −1 at v); then R(u,v) = x_u − x_v.
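
A direct sketch using a dense pseudoinverse, fine for toy graphs; avoiding exactly this kind of expensive solve on large graphs is the chicken-and-egg problem a few slides down:

```python
import numpy as np

def effective_resistance(L, u, v):
    """Solve G x = e_uv (e_uv: +1 at u, -1 at v) and read off R(u,v) = x_u - x_v.
    Dense pseudoinverse: illustration only, O(n^3) time."""
    e = np.zeros(L.shape[0])
    e[u], e[v] = 1.0, -1.0
    x = np.linalg.pinv(L) @ e      # pseudoinverse, since L is singular
    return x[u] - x[v]
```

On a path of two unit-weight edges this returns 2, matching the series rule on the next slide.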

(REMEDIAL?) EE101
Single edge: R(e) = 1/w(e). Series: R(u,v) = R(e_1) + … + R(e_l). Example: a single edge of weight w_1 gives R(u,v) = 1/w_1; two edges of weights w_1, w_2 in series give R(u,v) = 1/w_1 + 1/w_2.

SPECTRAL SPARSIFICATION BY EFFECTIVE RESISTANCE
[Spielman-Srivastava `08]: setting P(e) = W(e)·R(e)·O(log n) gives G ≺ H ≺ 2G (ignoring probabilistic issues).
[Foster `49]: Σ_e W(e)R(e) = n−1, so this yields a spectral sparsifier with O(n log n) edges. Ultrasparsifier? Solver???

THE CHICKEN AND EGG PROBLEM
How to find effective resistances? [Spielman-Srivastava `08]: use the solver. [Spielman-Teng `04]: the solver needs a sparsifier.

OUR WORKAROUND
Use upper bounds on effective resistance, R'(u,v), and modify the problem.

RAYLEIGH'S MONOTONICITY LAW
R(u,v) can only increase when edges are removed. So calculate effective resistances w.r.t. a spanning tree T: tree-path resistances upper bound the true ones.

SAMPLING PROBABILITIES ACCORDING TO TREE
Sampling probability: edge weight times the effective resistance of the tree path between its endpoints; this product is the edge's stretch. Goal: small total stretch.
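
A sketch computing total stretch, using networkx as an assumed dependency (the helper name is hypothetical); since tree paths are unique, an unweighted shortest path in T recovers them:

```python
import networkx as nx

def total_stretch(G, T):
    """Total stretch of G w.r.t. spanning tree T: for each edge (u,v) of weight w,
    stretch = w * (sum of resistances 1/w(e) along the unique u-v path in T)."""
    total = 0.0
    for u, v, w in G.edges(data="weight", default=1.0):
        path = nx.shortest_path(T, u, v)              # tree paths are unique
        r = sum(1.0 / T[a][b].get("weight", 1.0) for a, b in zip(path, path[1:]))
        total += w * r
    return total
```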

GOOD TREES EXIST
Every graph has a spanning tree with total stretch Σ_e W(e)R'(e) = O(m log n) (hiding log log n factors). But sampling with these probabilities keeps O(m log² n) edges: too many!

`GOOD' TREE???
Unit weight case: stretch ≥ 1 for every edge. Example: an off-tree edge whose tree path has two edges has stretch 1 + 1 = 2.

WHAT ARE WE MISSING?
Need: G ≺ H ≺ kG with n−1+O(m log^p n / k) edges.
Generated: G ≺ H ≺ 2G with n−1+O(m log² n) edges.
Haven't used k!

USE K, SOMEHOW
The tree is good! Increase the weights of tree edges by a factor of k: G ≺ G' ≺ kG.

RESULT
The tree is heavier by a factor of k, so tree-path effective resistances decrease by a factor of k. The stretch-2 edge above now has stretch 1/k + 1/k = 2/k.

NOW SAMPLE?
Expected edges in H: tree edges: n−1; off-tree edges: O(m log² n / k). Total: n−1+O(m log² n / k).

BUT WE CHANGED G!
G ≺ G' ≺ kG and G' ≺ H ≺ 2G' combine to give G ≺ H ≺ 2kG.

WHAT WE NEED: ULTRASPARSIFIERS
[Spielman-Teng `04]: ultrasparsifiers with n−1+O(m log^p n / k) edges imply solvers with O(m log^p n) running time. Goal: G ≺ H ≺ kG.
We have: G ≺ H ≺ 2kG with n−1+O(m log² n / k) edges.

PSEUDOCODE OF O(m log n) SOLVER
Input: graph Laplacian G
1. Compute a low stretch spanning tree T of G
2. Scale the tree up: T ← O(log² n)·T
3. H ← G + T
4. H ← Sample_T(H)
5. Solve G by iterating with H as a preconditioner, solving H recursively but reusing T
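
A two-level sketch of the iteration's shape, assuming a hypothetical callable H_solve that approximately solves the preconditioner H; the real algorithm recurses on H and uses Chebyshev acceleration rather than plain Richardson:

```python
import numpy as np

def preconditioned_richardson(G, H_solve, b, iters=100):
    """Structural sketch of step 5: Richardson iteration x <- x + H^{-1}(b - G x).
    Converges when G ≺ H ≺ kG, at a rate depending on k; not the talk's exact solver."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + H_solve(b - G @ x)        # push the residual through the preconditioner
    return x
```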

EXTENSIONS / GENERALIZATIONS
[Koutis-Levin-P `12]: sparsify mildly dense graphs in O(m) time
[Miller-P `12]: general matrices: find a `simpler' matrix that is similar, in O(m + n^{2.38+a}) time

SUMMARY OF SOLVERS
Spectral graph theory allows one to find similar, easier-to-solve graphs. Backbone: good trees.

SOLVERS USING GRAPH THEORY Fast solvers for graph Laplacians use combinatorial graph theory

OUTLINE
Regression: why and how
Spectra: linear system solvers
Graphs: tree embeddings

LOW STRETCH SPANNING TREE
Sampling probability: edge weight times the effective resistance of the tree path; in the unit weight case, the length of the tree path. Low stretch spanning tree: small total stretch.

DIFFERENT THAN USUAL TREES
On the n^{1/2}-by-n^{1/2} unit weighted mesh, the `haircomb' is both a shortest path tree and a maximum weight spanning tree, yet while most edges have stretch O(1), some have stretch Ω(n^{1/2}), for total stretch Ω(n^{3/2}).

A BETTER TREE FOR THE GRID Recursive ‘C’

LOW STRETCH SPANNING TREES
[Elkin-Emek-Spielman-Teng `05], [Abraham-Bartal-Neiman `08]: any graph has a spanning tree with total stretch O(m log n) (hiding log log n factors).

ISSUE: RUNNING TIME
The algorithms of [Elkin-Emek-Spielman-Teng `05] and [Abraham-Bartal-Neiman `08] take O(n log² n + m log n) time. Reason: O(log n) rounds of shortest paths.

SPEED UP
[Orlin-Madduri-Subramani-Williamson `10]: shortest path on graphs with k distinct weights runs in O(m log_{m/n} k) time.
[Koutis-Miller-P `11]: round edge weights to powers of 2, so k = log n and the total work is O(m log n) (hiding log log n; we actually improve these bounds).

PARALLEL ALGORITHM?
[Blelloch-Gupta-Koutis-Miller-P-Tangwongsan `11]: the current framework parallelizes to O(m^{1/3+a}) depth. Combine with the Laplacian paradigm → fast parallel graph algorithms.

PARALLEL GRAPH ALGORITHMS?
Before this work: parallel time > state of the art sequential time. Our result: parallel work close to sequential, and O(m^{2/3}) time.

FUNDAMENTAL PROBLEM
Long standing open problem: theoretical parallel speedups for BFS / shortest path in directed graphs. The sequential algorithms are too fast!

PARALLEL ALGORITHM?
The first step of the framework by [Elkin-Emek-Spielman-Teng `05] is a shortest path computation, which does not parallelize well.

PARALLEL TREE EMBEDDING
Workaround: use the earlier algorithm by [Alon-Karp-Peleg-West `95]. Idea: repeated clustering, based on ideas from [Cohen `93, `00] for approximating shortest paths.

THE BIG PICTURE
Need fast linear system solvers for graph regression; need combinatorial graph algorithms for fast solvers.

ONGOING / FUTURE WORK Better regression? Faster/parallel solver? Sparse approximate (pseudo) inverse? Other types of systems?

THANK YOU! Questions?