1 Preconditioning in Expectation Richard Peng Joint with Michael Cohen (MIT), Rasmus Kyng (Yale), Jakub Pachocki (CMU), and Anup Rao (Yale) MIT CMU theory seminar, April 5, 2014

2 RANDOM SAMPLING Collection of many objects Pick a small subset of them

3 GOALS OF SAMPLING Estimate quantities Approximate higher dimensional objects Use in algorithms

4 SAMPLE TO APPROXIMATE ε-nets / cuttings Sketches Graphs Gradients This talk: matrices

5 NUMERICAL LINEAR ALGEBRA Linear system in an n × n matrix Inverse is dense [Concus-Golub-O'Leary `76]: incomplete Cholesky, drop entries

6 HOW TO ANALYZE? Show sample is good Concentration bounds Scalar: [Bernstein `24][Chernoff`52] Matrices: [AW`02][RV`07][Tropp `12]

7 THIS TALK Directly show algorithm using samples runs well Better bounds Simpler analysis

8 OUTLINE Random matrices Iterative methods Randomized preconditioning Expected inverse moments

9 HOW TO DROP ENTRIES? Entry-based representation is hard Group entries together Symmetric with positive entries → adjacency matrix of a graph

10 SAMPLE WITH GUARANTEES Sample edges in graphs Goal: preserve size of all cuts [BK`96] graph sparsification Generalization of expanders

11 DROPPING ENTRIES/EDGES L: graph Laplacian For a 0-1 vector x: |x|_L^2 = size of the cut between the 0s and the 1s Unit weight case: |x|_L^2 = Σ_{uv} (x_u − x_v)^2 Matrix norm: |x|_P^2 = x^T P x

12 DECOMPOSING A MATRIX Sample based on positive representations: P = Σ_i P_i, with each P_i PSD Graphs: one P_i per edge, where P_uv has +1 at (u,u) and (v,v) and −1 at (u,v) and (v,u), so x^T P_uv x = (x_u − x_v)^2 PSD: multi-variate version of positive
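A minimal NumPy sketch of this decomposition (the graph, edges, and test vector below are made-up examples, not from the talk): each edge uv contributes a PSD matrix with +1 at (u,u) and (v,v) and −1 at (u,v) and (v,u), the pieces sum to the Laplacian, and x^T L x recovers the cut size.

```python
import numpy as np

def edge_matrix(n, u, v):
    """PSD piece P_uv for edge (u, v): x^T P_uv x = (x_u - x_v)^2."""
    P_uv = np.zeros((n, n))
    P_uv[u, u] = P_uv[v, v] = 1.0
    P_uv[u, v] = P_uv[v, u] = -1.0
    return P_uv

n = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]          # toy unit-weight graph (a 4-cycle)
L = sum(edge_matrix(n, u, v) for u, v in edges)   # L = sum_uv P_uv

x = np.array([0.0, 1.0, 1.0, 0.0])                # 0-1 vector = a cut
print(x @ L @ x)                                  # cut size: edges (0,1) and (2,3) are cut -> 2.0
print(sum((x[u] - x[v]) ** 2 for u, v in edges))  # same value, 2.0
```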

13 MATRIX CHERNOFF BOUNDS P = Σ_i P_i, with each P_i PSD Can sample Q with O(n log n ε^{-2}) rescaled P_i's s.t. P ≼ Q ≼ (1 + ε) P ≼: Loewner's partial ordering, A ≼ B ⟺ B − A is positive semidefinite

14 CAN WE DO BETTER? Yes, [BSS `12]: O(nε^{-2}) is possible Iterative, cubic-time construction [BDM `11]: extends to general matrices

15 DIRECT APPLICATION For ε accuracy, need P ≼ Q ≼ (1 + ε) P Size of Q depends inversely on ε ε^{-1} is the best we can hope for Find Q very close to P Solve problem on Q Return answer

16 USE INSIDE ITERATIVE METHODS [AB `11]: crude samples give good answers [LMP `12]: extensions to row sampling Find Q somewhat similar to P Solve problem on P using Q as a guide

17 ALGORITHMIC VIEW Crude approximations are ok But need to be efficient Can we use [BSS `12]?

18 SPEED UP [BSS `12] Expander graphs, and more ‘i.i.d. sampling’ variant related to the Kadison-Singer problem

19 MOTIVATION One dimensional sampling: moment estimation, pseudorandom generators Rarely need w.h.p. Dimensions should be disjoint

20 MOTIVATION Randomized coordinate descent for electrical flows [KOSZ`13,LS`13] ACDM from [LS `13] improves various numerical routines

21 RANDOMIZED COORDINATE DESCENT Related to stochastic optimization Known analyses when Q = P_j [KOSZ`13][LS`13] can be viewed as ways of changing bases

22 OUR RESULT For numerical routines, random Q gives the same performance as [BSS`12], in expectation

23 IMPLICATIONS Similar bounds to ACDM from [LS `13] Recursive Chebyshev iteration ([KMP`11]) runs faster Laplacian solvers in ≈ m log^{1/2} n time

24 OUTLINE Random matrices Iterative methods Randomized preconditioning Expected inverse moments

25 ITERATIVE METHODS [Gauss, 1823] Gauss-Seidel iteration [Jacobi, 1845] Jacobi iteration [Hestenes-Stiefel `52] conjugate gradient Find Q s.t. P ≼ Q ≼ 10 P Use Q as a guide to solve the problem on P

26 [RICHARDSON 1910] x^{(t+1)} = x^{(t)} + (b − P x^{(t)}) Fixed point: b − P x = 0 Each step: one matrix-vector multiplication

27 ITERATIVE METHODS Multiplication is easier than division, especially for matrices Use verifier to solve problem

28 1D CASE Know: 1/2 ≤ p ≤ 1 ⟹ 1 ≤ 1/p ≤ 2, so 1 is a ‘good’ estimate of 1/p Bad when p is far from 1 Estimate of error: 1 − p

29 ITERATIVE METHODS 1 + (1 – p) = 2 – p is more accurate Two terms of Taylor expansion Can take more terms

30 ITERATIVE METHODS Generalizes to the matrix setting: 1/p = 1 + (1 − p) + (1 − p)^2 + (1 − p)^3 + … P^{-1} = I + (I − P) + (I − P)^2 + …
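A quick scalar illustration of the series (the value p = 0.6 is an arbitrary choice): the partial sums 1, 1 + (1 − p), 1 + (1 − p) + (1 − p)^2, … converge to 1/p.

```python
p = 0.6                       # any p with 1/2 <= p <= 1
partial, term = 0.0, 1.0
for k in range(8):
    partial += term           # 1 + (1-p) + (1-p)^2 + ...
    term *= (1 - p)
    print(k, round(partial, 6))
print("exact 1/p =", 1 / p)   # 1.666...
```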

31 [RICHARDSON 1910] x^{(0)} = b x^{(1)} = (I + (I − P)) b x^{(2)} = (I + (I − P)(I + (I − P))) b … x^{(t+1)} = b + (I − P) x^{(t)} Error of x^{(t)}: (I − P)^t applied to the initial error Geometric decrease if P is close to I
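A minimal NumPy sketch of this iteration (the test matrix is synthetic and normalized so that (1/2) I ≼ P ≼ I): the error shrinks geometrically because ‖I − P‖_2 ≤ 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
P = A @ A.T                                              # random PSD matrix
P = 0.5 * (P / np.linalg.norm(P, 2)) + 0.5 * np.eye(50)  # rescale so 1/2 I <= P <= I
b = rng.standard_normal(50)

x = b.copy()                                             # x^(0) = b
x_star = np.linalg.solve(P, b)
for t in range(30):
    x = b + (np.eye(50) - P) @ x                         # x^(t+1) = b + (I - P) x^(t)
print(np.linalg.norm(x - x_star))                        # tiny: error shrank like (1/2)^30
```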

32 OPTIMIZATION VIEW Quadratic potential function Goal: walk down to the bottom Direction given by the gradient Residue: r^{(t)} = x^{(t)} − P^{-1} b Error: |r^{(t)}|_2^2

33 DESCENT STEPS Step may overshoot Need a smooth function

34 MEASURE OF SMOOTHNESS x^{(t+1)} = b + (I − P) x^{(t)} Note: b = P P^{-1} b, so r^{(t+1)} = (I − P) r^{(t)} and |r^{(t+1)}|_2 ≤ ‖I − P‖_2 |r^{(t)}|_2

35 MEASURE OF SMOOTHNESS (1/2) I ≼ P ≼ I ⟹ ‖I − P‖_2 ≤ 1/2 ‖I − P‖_2: smoothness of |r^{(t)}|_2^2, distance between P and I, related to the eigenvalues of P

36 MORE GENERAL Convex functions Smoothness / strong convexity This talk: only quadratics

37 OUTLINE Random matrices Iterative methods Randomized preconditioning Expected inverse moments

38 ILL-POSED PROBLEMS Smoothness of the directions differs Progress limited by the steeper parts (example on slide: a diagonal matrix with entries 0.8 and 0.1)

39 PRECONDITIONING Solve a similar problem Q Transfer the steps across to P

40 PRECONDITIONED RICHARDSON Optimal step down the energy function of Q given by Q^{-1} Equivalent to solving Q^{-1} P x = Q^{-1} b

41 PRECONDITIONED RICHARDSON x^{(t+1)} = x^{(t)} + Q^{-1}(b − P x^{(t)}) = Q^{-1} b + (I − Q^{-1} P) x^{(t)} Residue: r^{(t+1)} = (I − Q^{-1} P) r^{(t)} |r^{(t+1)}|_P = |(I − Q^{-1} P) r^{(t)}|_P
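A hedged NumPy sketch of one such iteration (the test matrix, the crude preconditioner Q, and the dense solve with Q are illustrative choices, not the talk's construction): the residual is measured in the P-norm and contracts geometrically when Q spectrally approximates P.

```python
import numpy as np

def p_norm(P, r):
    """|r|_P = sqrt(r^T P r)."""
    return np.sqrt(r @ P @ r)

rng = np.random.default_rng(1)
n = 40
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)                  # well-conditioned PSD test matrix
Q = P + 0.1 * np.diag(np.diag(P))            # crude preconditioner with P <= Q <= O(1) P
b = rng.standard_normal(n)
x_star = np.linalg.solve(P, b)

x = np.zeros(n)
for t in range(20):
    x = x + np.linalg.solve(Q, b - P @ x)    # x^(t+1) = x^(t) + Q^{-1}(b - P x^(t))
    print(t, p_norm(P, x - x_star))          # P-norm error decreases geometrically
```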

42 CONVERGENCE If P ≼ Q ≼ 10 P, the error halves in O(1) iterations How to find a good Q? Improvements depend on ‖I − P^{1/2} Q^{-1} P^{1/2}‖_2

43 MATRIX CHERNOFF P = Σ_i P_i Take O(n log n) rescaled P_i's, each with probability ~ trace(P_i P^{-1}) Q = Σ_i s_i P_i, where s has small support Matrix Chernoff ([AW`02],[RV`07]): w.h.p. P ≼ Q ≼ 2 P Note: Σ_i trace(P_i P^{-1}) = n
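A small NumPy sketch of this sampling scheme (the rank-one decomposition P = Σ_i v_i v_i^T and the constant in the sample count are synthetic choices for illustration): sample each P_i with probability proportional to trace(P_i P^{-1}), rescale by 1/(N p_i) so the expectation is P, and check that Q stays spectrally close to P.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 20, 400
V = rng.standard_normal((m, n))                 # P_i = v_i v_i^T, so P = sum_i P_i = V^T V
P = V.T @ V
P_inv = np.linalg.inv(P)

# sampling probabilities ~ trace(P_i P^{-1}) = v_i^T P^{-1} v_i; these sum to n
tau = np.einsum('ij,jk,ik->i', V, P_inv, V)
prob = tau / tau.sum()

eps = 0.5
N = int(4 * n * np.log(n) / eps ** 2)           # O(n log n / eps^2) samples
idx = rng.choice(m, size=N, p=prob)
Q = sum(np.outer(V[i], V[i]) / (N * prob[i]) for i in idx)

# relative spectrum: eigenvalues of P^{-1/2} Q P^{-1/2} should land near [1 - eps, 1 + eps]
w, U = np.linalg.eigh(P)
P_inv_half = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
print(np.linalg.eigvalsh(P_inv_half @ Q @ P_inv_half))
```

For rank-one pieces these probabilities are the statistical leverage scores of the rows v_i; in the graph case trace(P_i P^{-1}) is the weighted effective resistance of the edge.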

44 WHY THESE PROBABILITIES? trace(P_i P^{-1}): matrix ‘dot product’ If P is diagonal: trace(P_i P^{-1}) = 1 for all i, and all entries are needed Overhead of concentration: union bound over dimensions

45 IS CHERNOFF NECESSARY? P: diagonal matrix Missing one entry: unbounded approximation factor (e.g. keeping [[1, 0], [0, 0]] as an approximation of the identity [[1, 0], [0, 1]])

46 BETTER CONVERGENCE? [Kaczmarz `37]: random projections onto small subspaces can work Better (expected) behavior than what matrix concentration gives!

47 HOW? Will still make progress in the good directions Can have (finite) badness if it is orthogonal to the goal

48 QUANTIFY DEGENERACIES Have some D ≼ P ‘for free’ D = λ_min(P) I (minimum eigenvalue) D = a tree when P is a graph D = crude approximation / rank certificate

49 REMOVING DEGENERACIES ‘Padding’ to remove degeneracy If D ≼ P and 0.5 P ≼ Q ≼ P, then 0.5 P ≼ D + Q ≼ 2 P (since D + Q ≽ Q and D + Q ≼ P + Q ≼ 2 P)

50 ROLE OF D Implicit in proofs of matrix Chernoff, as well as in [BSS`12] Splitting of P in numerical analysis D and P can be very different

51 MATRIX CHERNOFF Let D ≼ 0.1 P, t = trace(P D^{-1}) Take O(t log n) samples, each with probability ~ trace(P_i D^{-1}) Q ← D + (rescaled) samples W.h.p. P ≼ Q ≼ 2 P

52 WEAKER REQUIREMENT Q only needs to do well in some directions, on average

53 EXPECTED CONVERGENCE There exists a constant c s.t. for any r, E[|(I − c Q^{-1} P) r|_P] ≤ 0.99 |r|_P Let t = trace(P D^{-1}) Take rand[t, 2t] samples, each w.p. ~ trace(P_i D^{-1}) Add the (rescaled) results to D to form Q
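A hedged Monte-Carlo sketch of this procedure (the decomposition, the choice D = 0.05 λ_min(P) I, the step size c = 1, and the 1/(N p_i) rescaling are all assumptions of this sketch, not the paper's tuned parameters): build Q = D + rand[t, 2t] rescaled samples and estimate the expected one-step contraction of the P-norm error.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10, 500
V = rng.standard_normal((m, n))
P = V.T @ V                                      # P = sum_i v_i v_i^T
D = 0.05 * np.linalg.eigvalsh(P)[0] * np.eye(n)  # crude 'free' lower bound D <= P

D_inv = np.linalg.inv(D)
scores = np.einsum('ij,jk,ik->i', V, D_inv, V)   # trace(P_i D^{-1})
prob = scores / scores.sum()
t = int(np.trace(P @ D_inv))                     # t = trace(P D^{-1})

w, U = np.linalg.eigh(P)
P_half = U @ np.diag(np.sqrt(w)) @ U.T           # for the P-norm |r|_P = ||P^{1/2} r||_2

def random_Q():
    """Q = D + rand[t, 2t] samples, rescaled by 1/(N p_i) (this sketch's normalization)."""
    N = rng.integers(t, 2 * t + 1)
    idx = rng.choice(m, size=N, p=prob)
    return D + sum(np.outer(V[i], V[i]) / (N * prob[i]) for i in idx)

# Monte-Carlo estimate of E[ |(I - c Q^{-1} P) r|_P / |r|_P ] for a fixed direction r
c = 1.0
r = rng.standard_normal(n)
ratios = []
for _ in range(200):
    Q = random_Q()
    r_new = r - c * np.linalg.solve(Q, P @ r)
    ratios.append(np.linalg.norm(P_half @ r_new) / np.linalg.norm(P_half @ r))
print(np.mean(ratios))                           # expected contraction factor, should be < 1
```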

54 OUTLINE Random matrices Iterative methods Randomized preconditioning Expected inverse moments

55 ASIDE Goal: combine these analyses Matrix Chernoff: f(Q) = exp(P^{-1/2}(P − Q) P^{-1/2}), show decrease in the relative eigenvalues Iterative methods: f(x) = |x − P^{-1} b|_P, show decrease in the distance to the solution

56 SIMPLIFYING ASSUMPTIONS P = I (by normalization) tr(P_i D^{-1}) = 0.1, ‘unit weight’ Expected value of picking a P_i at random: (1/t) I

57 DECREASE Step: r' = (I − Q^{-1} P) r = (I − Q^{-1}) r New error: |r'|_P = |(I − Q^{-1}) r|_2 Expand: |(I − Q^{-1}) r|_2^2 = r^T (I − 2 Q^{-1} + Q^{-2}) r

58 DECREASE I ≼ Q ≼ 1.1 I would imply 0.9 I ≼ Q^{-1} and Q^{-2} ≼ I But it also gives Q^{-3} ≼ I, etc.; we don’t need the 3rd moment

59 RELAXATIONS Only need Q^{-1} and Q^{-2} By linearity, it suffices to: lower bound E_Q[Q^{-1}] and upper bound E_Q[Q^{-2}]

60 TECHNICAL RESULT Assumptions: Σ_i P_i = I and trace(P_i D^{-1}) = 0.1 Let t = trace(D^{-1}) Take rand[t, 2t] uniform samples Add the (rescaled) results to D to form Q Then 0.9 I ≼ E[Q^{-1}] and E[Q^{-2}] ≼ O(1) I

61 Q^{-1} 0.5 I ≼ E[Q^{-1}] follows from the matrix arithmetic-harmonic mean inequality ([ST`94]) Need: an upper bound on E[Q^{-2}]

62 E[Q^{-2}] ≼ O(1)? Q^{-2} is the gradient of Q^{-1} More careful tracking of Q^{-1} gives information on Q^{-2} as well! (figure: Q^{-1} and Q^{-2} plotted for j from 0 to 2t)

63 TRACKING Q^{-1} Q: start from D, add [t, 2t] random (rescaled) P_i's Track the inverse of Q under rank-1 perturbations Sherman-Morrison formula: (A + u v^T)^{-1} = A^{-1} − A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)
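The rank-1 update rule referenced here, with a small NumPy check (the starting matrix D and the added vectors are arbitrary test data): maintaining Q^{-1} via Sherman-Morrison costs O(n^2) per added piece instead of a fresh inversion.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Inverse of (A + u v^T): A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(4)
n = 10
D = np.diag(rng.uniform(1, 2, size=n))           # start from D
Q = D.copy()
Q_inv = np.linalg.inv(D)
for _ in range(5):                               # add a few rank-1 pieces v v^T
    v = rng.standard_normal(n)
    Q += np.outer(v, v)
    Q_inv = sherman_morrison_update(Q_inv, v, v)
print(np.max(np.abs(Q_inv - np.linalg.inv(Q))))  # ~1e-14: maintained inverse matches direct inverse
```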

64 BOUNDING Q^{-1}: DENOMINATOR Current matrix: Q_j, sample: R D ≼ Q_j ⟹ Q_j^{-1} ≼ D^{-1}, so tr(Q_j^{-1} R) ≤ tr(D^{-1} R) ≤ 0.1 for any R This gives E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} − 0.9 Q_j^{-1} E[R] Q_j^{-1}

65 BOUNDING Q^{-1}: NUMERATOR R: a random rescaled P_i Assumption: E[R] = (1/t) P = (1/t) I Plugging this into E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} − 0.9 Q_j^{-1} E[R] Q_j^{-1} gives E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} − (0.9/t) Q_j^{-2}

66 AGGREGATION Q_j is also random Need to aggregate the choices of R into a bound on E[Q_j^{-1}] E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} − (0.9/t) Q_j^{-2} (chain: D = Q_0, Q_1, Q_2, …)

67 HARMONIC SUMS Use the harmonic sum of matrices: HrmSum(X, a) = 1/(1/X + 1/a) Matrix functionals, similar to the Stieltjes transform in [BSS`12] A proxy for the −2nd power Well behaved under expectation: E_X[HrmSum(X, a)] ≼ HrmSum(E[X], a)
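A quick numeric check of this expectation bound, using one concrete matrix form of the harmonic sum, HrmSum(X, a) = (X^{-1} + (1/a) I)^{-1} for PSD X and scalar a > 0 (this specific form is an assumption of the sketch): averaging the harmonic sums is dominated, in the Loewner order, by the harmonic sum of the average.

```python
import numpy as np

def hrm_sum(X, a):
    """HrmSum(X, a) = (X^{-1} + (1/a) I)^{-1}  (matrix form assumed for this check)."""
    n = X.shape[0]
    return np.linalg.inv(np.linalg.inv(X) + np.eye(n) / a)

rng = np.random.default_rng(5)
n, k, a = 8, 500, 2.0
samples = []
for _ in range(k):
    A = rng.standard_normal((n, n))
    samples.append(A @ A.T + 0.5 * np.eye(n))    # random PSD matrices X

lhs = sum(hrm_sum(X, a) for X in samples) / k    # average of HrmSum(X, a)
rhs = hrm_sum(sum(samples) / k, a)               # HrmSum(average of X, a)
print(np.linalg.eigvalsh(rhs - lhs).min())       # >= 0 up to rounding: lhs <= rhs in Loewner order
```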

68 HARMONIC SUM E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} − (0.9/t) Q_j^{-2} Initial condition + telescoping sum gives E[Q_t^{-1}] ≼ O(1) I

69 E[Q^{-2}] ≼ O(1) I Q^{-2} is the gradient of Q^{-1}: (0.9/t) Q_j^{-2} ≼ Q_j^{-1} − E_R[Q_{j+1}^{-1}] Summing: (0.9/t) Σ_{j=t}^{2t−1} Q_j^{-2} ≼ E[Q_t^{-1}] − E[Q_{2t}^{-1}] A random j from [t, 2t] is good!

70 SUMMARY Un-normalize: 0.5 P ≼ E[P Q^{-1} P] and E[P Q^{-1} P Q^{-1} P] ≼ 5 P One step of preconditioned Richardson: x^{(t+1)} = x^{(t)} + c Q^{-1}(b − P x^{(t)})

71 MORE GENERAL Works for some convex functions Sherman-Morrison replaced by inequality, primal/dual

72 FUTURE WORK Expected convergence of Chebyshev iteration? Conjugate gradient? Same bound without D (using pseudo-inverse)? Small error settings Stochastic optimization? More moments?

73 THANK YOU! Questions?

