Presentation transcript: "Projection-free Online Learning", Dan Garber and Elad Hazan.

1 Projection-free Online Learning. Dan Garber, Elad Hazan.

2 Matrix completion: at this scale, super-linear operations are infeasible!

3 Online convex optimization: linear (convex), bounded cost functions. Protocol: the player plays x_1, the cost f_1 is revealed and the loss f_1(x_1) is incurred; then x_2 and f_2, and so on up to x_T and f_T. Total loss = Σ_t f_t(x_t). Regret = Σ_t f_t(x_t) − min_{x*} Σ_t f_t(x*). Matrix completion instance: decision set = low-rank matrices, X_ij = prediction for user i on movie j, cost functions f(X) = |X • E_ij − (±1)|^2.
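A minimal sketch of this protocol and the regret computation; the player strategy, the loss functions, and the finite comparator set are left as arguments and are illustrative assumptions, not part of the slides:

```python
def play_online(player, losses):
    """Run the OCO protocol: at each round the player commits to x_t,
    then the cost f_t is revealed and f_t(x_t) is incurred."""
    plays, total_loss = [], 0.0
    for f_t in losses:
        x_t = player(plays)            # decision may depend only on the past
        plays.append(x_t)
        total_loss += f_t(x_t)         # incurred loss at this round
    return total_loss, plays

def regret(losses, plays, candidates):
    """Regret = incurred loss minus the loss of the best fixed decision in
    hindsight (minimized here over a finite candidate set for simplicity)."""
    incurred = sum(f(x) for f, x in zip(losses, plays))
    best_fixed = min(sum(f(x) for f in losses) for x in candidates)
    return incurred - best_fixed
```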

4 Online gradient descent. The algorithm: move in the direction of −c_t (the gradient of the current cost function), y_{t+1} = x_t − η c_t, and project back onto the convex set to obtain x_{t+1}. Thm [Zinkevich]: if η = 1/√t, then this algorithm attains worst-case regret Σ_t f_t(x_t) − Σ_t f_t(x*) = O(√T).
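A hedged sketch of the algorithm, taking the decision set to be the Euclidean unit ball so that the projection is a simple rescaling; the gradient oracle and dimensions are illustrative assumptions:

```python
import numpy as np

def project_ball(y, radius=1.0):
    # Projection onto the Euclidean ball of the given radius:
    # rescale y only if it lies outside the ball.
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def online_gradient_descent(gradients, dim, T, radius=1.0):
    """Run OGD for T rounds; `gradients(t, x)` returns c_t at the point played."""
    x = np.zeros(dim)
    plays = []
    for t in range(1, T + 1):
        plays.append(x.copy())
        c_t = gradients(t, x)            # gradient of the current cost f_t at x_t
        eta = 1.0 / np.sqrt(t)           # Zinkevich's step size, gives O(sqrt(T)) regret
        y = x - eta * c_t                # gradient step
        x = project_ball(y, radius)      # project back onto the convex set
    return plays
```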

5 Computational efficiency? Gradient step: linear time. Projection step: a quadratic program (online mirror descent: a general convex program). Cost of projecting onto the convex decision set K: in general O(m^{1/2} n^3); simplex / Euclidean ball / cube – linear time; flow polytope – conic optimization, O(m^{1/2} n^3); PSD cone (matrix completion) – Cholesky decomposition, O(n^3).
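To make the contrast concrete, here is an illustrative sketch (my own example, not from the slides) of the "cheap" linear-time projections versus the O(n^3) projection onto the PSD cone via a dense eigendecomposition:

```python
import numpy as np

def project_cube(y, lo=-1.0, hi=1.0):
    # Projection onto the cube [lo, hi]^n: clip each coordinate -- linear time.
    return np.clip(y, lo, hi)

def project_ball(y, radius=1.0):
    # Projection onto the Euclidean ball: rescale if outside -- linear time.
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def project_psd(Y):
    # Projection onto the PSD cone: symmetrize, eigendecompose, and zero out
    # negative eigenvalues -- an O(n^3) dense factorization.
    S = (Y + Y.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(S)
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T
```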

6 Matrix completion: projections are out of the question!

7 Computationally difficult learning problems:
1. Matrix completion – K = SDP cone; projection via Cholesky decomposition.
2. Online routing – K = flow polytope; projection via conic optimization over the flow polytope.
3. Rotations – K = rotation matrices.
4. Matroids – K = matroid polytope.

8 Results part 1 (Hazan + Kale, ICML'12). Projection-less stochastic/online algorithms: linear optimization calls instead of projections, parameter free (no learning rate), sparse predictions. Regret bounds:
             Stochastic   Adversarial
Smooth       √T           T^{3/4}
Non-smooth   T^{3/4}      T^{3/4}

9 Linear optimization vs. projections:
1. Matrix completion – K = SDP cone; projection: Cholesky decomposition; linear opt.: largest singular vector.
2. Online routing – K = flow polytope; projection: conic optimization over the flow polytope; linear opt.: shortest path computation (see the sketch below).
3. Rotations – K = rotation matrices; projection: convex optimization; linear opt.: Wahba's algorithm.
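For example, the linear optimization oracle over the s-t path (flow) polytope with nonnegative costs is a shortest-path computation; a minimal sketch with a hand-rolled Dijkstra, where the edge list and cost vector are illustrative assumptions:

```python
import heapq

def shortest_path_vertex(n, edges, costs, source, target):
    """Linear optimization over the s-t flow polytope: with nonnegative edge
    costs, the minimizing vertex is the indicator vector of a shortest path."""
    adj = [[] for _ in range(n)]                 # node -> list of (neighbor, edge index)
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
    dist = {source: 0.0}
    parent = {}                                  # node -> edge index used to reach it
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, idx in adj[u]:
            nd = d + costs[idx]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = idx
                heapq.heappush(heap, (nd, v))
    # Recover the polytope vertex: 1 on the path edges, 0 elsewhere.
    x = [0.0] * len(edges)
    node = target
    while node != source:
        idx = parent[node]
        x[idx] = 1.0
        node = edges[idx][0]
    return x
```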

10 The Frank-Wolfe algorithm. [Figure: from x_t, move toward the linear-optimization vertex v_{t+1} to obtain x_{t+1}.]

11 The Frank-Wolfe algorithm (conditional gradient). Thm [Frank-Wolfe '56]: rate of convergence = C/t (C = smoothness). [Clarkson '06] – refined analysis. [Hazan '07] – SDP. [Jaggi '11] – generalization.

12 The Frank-Wolfe algorithm. 1. At iteration t, the iterate is a convex combination of <= t vertices, i.e., (t,K)-sparse. 2. No learning rate: the convex combination uses weight 1/t, independent of the diameter, the gradients, etc.
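A minimal sketch of the Frank-Wolfe step over the probability simplex, where linear optimization reduces to picking the coordinate with the smallest gradient entry; the quadratic objective and the standard 2/(t+2) step size are illustrative choices, not taken from the slides:

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, T):
    """Frank-Wolfe over the probability simplex.
    `grad(x)` returns the gradient of a smooth convex objective at x.
    Each step needs only linear optimization over the set (no projection),
    and the iterate after t steps is a convex combination of <= t vertices."""
    x = x0.copy()
    for t in range(T):
        g = grad(x)
        # Linear optimization oracle over the simplex: the minimizing vertex
        # is the standard basis vector e_i with i = argmin_i g_i.
        i = int(np.argmin(g))
        v = np.zeros_like(x)
        v[i] = 1.0
        gamma = 2.0 / (t + 2)               # standard step size; the slides use 1/t
        x = (1 - gamma) * x + gamma * v     # convex combination, stays feasible
    return x

# Example: minimize ||Ax - b||^2 over the simplex (A, b are made-up data).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
x = frank_wolfe_simplex(lambda x: 2 * A.T @ (A @ x - b), np.ones(5) / 5, T=200)
```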

13 Online Frank-Wolfe – the wrong approach. It fails already on K = the interval [-1,1] (example due to S. Bubeck).

14 Online Conditional Gradient (OCG). [Figure: one Frank-Wolfe step per round, from x_t toward v_{t+1} to obtain x_{t+1}.]
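A hedged sketch of the single-step-per-round idea behind OCG: accumulate the observed gradients into a regularized surrogate and apply one Frank-Wolfe step to it each round. The simplex domain, the regularization weight eta, and the step-size schedule are illustrative assumptions; the exact parameters are in Hazan and Kale (ICML'12).

```python
import numpy as np

def ocg_simplex(gradients, dim, T, eta=0.1):
    """One linear-optimization call per round, no projections.
    `gradients(t, x)` returns the gradient of f_t at the point played.
    Surrogate at round t: F_t(x) = eta * <G_t, x> + ||x - x1||^2,
    where G_t is the sum of the gradients observed so far."""
    x1 = np.ones(dim) / dim                      # start at the simplex center
    x = x1.copy()
    grad_sum = np.zeros(dim)
    plays = []
    for t in range(1, T + 1):
        plays.append(x.copy())
        grad_sum += gradients(t, x)              # observe f_t, accumulate its gradient
        surrogate_grad = eta * grad_sum + 2.0 * (x - x1)
        i = int(np.argmin(surrogate_grad))       # linear opt. oracle over the simplex
        v = np.zeros(dim)
        v[i] = 1.0
        sigma = min(1.0, 1.0 / np.sqrt(t))       # illustrative step-size schedule
        x = (1.0 - sigma) * x + sigma * v        # single Frank-Wolfe step
    return plays
```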

15 Regret bounds so far (linear optimization calls instead of projections, parameter free, sparse predictions):
             Stochastic   Adversarial
Smooth       √T           T^{3/4}
Non-smooth   T^{2/3}      T^{3/4}
But can we get the optimal √T rate? Barrier: existing projection-free algorithms were not linearly converging (in poly time).

16 New poly-time projection-free algorithm [Garber, Hazan 2013]. A new algorithm with convergence rate ~ e^{-t/n} (in CS terms "poly time", in Nemirovski's terms a "linear rate"). It uses only linear optimization calls on the original polytope, and only constantly many per iteration!

17 Linearly converging Frank-Wolfe. [Figure: Frank-Wolfe step from x_t toward v_{t+1}, restricted to a ball around x_t.] Assume the optimum is within Euclidean distance r. Thm [easy]: the rate of convergence becomes ~ e^{-t}. But this is useless: with the ball-intersection constraint, the per-step problem becomes a quadratic optimization, equivalent to a projection.

18 Inherent problem with balls. [Figure: a ball around x_t inside the polytope, with the optimum x*.] No intersection with the boundary -> the radius is bounded by the distance to the boundary. With intersection -> a hard problem, equivalent to projection.

19 Polytopes are OK! One can find a significantly smaller polytope of the same shape (radius proportional to the Euclidean distance to OPT) that contains x* and does not cross outside the original polytope.

20 Implications for online optimization: linear optimization calls instead of projections, parameter free (no learning rate), sparse predictions, and the optimal rate. Regret: √T for convex losses and log(T) for strongly convex losses, in both the stochastic and adversarial settings.

21

22 More research / open questions:
- Projection-free algorithms – for many problems, linear time per step vs. cubic or more.
- For the main ML problems today, projection-free is the only feasible optimization method.
- Completely poly-time algorithms (log dependence on smoothness / strong convexity / diameter)?
- Can we attain poly-time optimization using only gradient information?

