Published by Moses Colford. Modified about 1 year ago.

1
Projection-free Online Learning. Dan Garber, Elad Hazan

2
Matrix completion: super-linear operations are infeasible!

3
Online convex optimization
Cost functions: linear (convex), bounded.
Total loss = $\sum_t f_t(x_t)$
Regret = $\sum_t f_t(x_t) - \min_{x^*} \sum_t f_t(x^*)$
Matrix completion: decision set = low-rank matrices, $X_{ij}$ = prediction for user $i$ and movie $j$; cost functions $f(X) = |X \bullet E_{ij} - (\pm 1)|^2$.
[Figure: in round $t$ the learner plays $x_t$, the cost function $f_t$ is revealed, and the loss $f_t(x_t)$ is incurred, for $t = 1, \dots, T$.]
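To make the regret definition concrete, here is a minimal Python sketch. The helper name `regret`, the toy cost sequence, and the choice of K = [-1, 1] are illustrative assumptions and do not come from the slides.

```python
def regret(costs, plays, candidates):
    """Total loss of `plays` minus the loss of the best fixed candidate.

    costs[t] is the cost function f_t and plays[t] is the point x_t.
    `candidates` is a finite set of comparison points x* (an assumption
    made here so that the minimum can be computed by enumeration).
    """
    total_loss = sum(f(x) for f, x in zip(costs, plays))
    best_fixed = min(sum(f(x) for f in costs) for x in candidates)
    return total_loss - best_fixed


# Linear costs f_t(x) = c_t * x on K = [-1, 1]. For linear costs the best
# fixed point is an endpoint, so comparing against {-1, 1} is enough.
cs = [1.0, -1.0, 1.0, 1.0]
costs = [lambda x, c=c: c * x for c in cs]
plays = [0.0, -1.0, 1.0, -1.0]
value = regret(costs, plays, candidates=[-1.0, 1.0])
```

Here the learner's total loss is 1 and the best fixed point x* = -1 has loss -2, so the regret is 3.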

4
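Online gradient descent
The algorithm: move in the direction of the vector $-c_t$ (the gradient of the current cost function), $y_{t+1} = x_t - \eta c_t$, and project $y_{t+1}$ back to the convex set to obtain $x_{t+1}$.
Thm [Zinkevich]: if $\eta = 1/\sqrt{t}$, then this algorithm attains worst-case regret $\sum_t f_t(x_t) - \sum_t f_t(x^*) = O(\sqrt{T})$.
[Figure: $x_t$, the gradient step along $-c_t$ to $y_{t+1}$, and the projection $x_{t+1}$.]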
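The gradient-then-project loop can be sketched as follows. This is a minimal sketch that assumes the decision set is a Euclidean ball, where projection is a simple rescaling; the function names and the toy cost sequence are hypothetical.

```python
import math


def project_to_ball(y, radius=1.0):
    """Euclidean projection onto the ball of the given radius (linear time)."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return list(y)
    return [v * radius / norm for v in y]


def online_gradient_descent(cost_vectors, dim, radius=1.0):
    """Play x_t, then update y_{t+1} = x_t - eta_t * c_t and project.

    cost_vectors[t-1] is c_t, the gradient of the linear cost f_t(x) = <c_t, x>.
    Returns the list of points played.
    """
    x = [0.0] * dim
    plays = []
    for t, c in enumerate(cost_vectors, start=1):
        plays.append(x)
        eta = 1.0 / math.sqrt(t)  # step size from the theorem statement
        y = [xi - eta * ci for xi, ci in zip(x, c)]
        x = project_to_ball(y, radius)
    return plays


plays = online_gradient_descent([[1.0, 0.0]] * 4, dim=2)
```

For a general convex set the call to `project_to_ball` becomes a quadratic program, which is the bottleneck the rest of the talk is about.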

5
Computational efficiency?
Gradient step: linear time.
Projection step: quadratic program!!
Online mirror descent: general convex program.
The convex decision set K:
In general: $O(m^{1/2} n^3)$
Simplex / Euclidean ball / cube: linear time
Flow polytope: conic opt., $O(m^{1/2} n^3)$
PSD cone (matrix completion): Cholesky decomposition, $O(n^3)$

6
Matrix completion: projections are out of the question!

7
Computationally difficult learning problems
1. Matrix completion: K = SDP cone; Cholesky decomposition
2. Online routing: K = flow polytope; conic optimization over flow polytope
3. Rotations: K = rotation matrices
4. Matroids: K = matroid polytope

8
Results part 1 (Hazan + Kale, ICML'12)
Projection-less stochastic/online algorithms with regret bounds:

              Stochastic    Adversarial
Smooth        $\sqrt{T}$    $T^{3/4}$
Non-smooth    $T^{2/3}$     $T^{3/4}$

Projections -> linear optimization
Parameter free (no learning rate)
Sparse predictions

9
Linear opt. vs. projections
1. Matrix completion: K = SDP cone; projection = Cholesky decomposition -> linear opt. = largest singular vector
2. Online routing: K = flow polytope; projection = conic optimization over flow polytope -> linear opt. = shortest path computation
3. Rotations: K = rotation matrices; projection = convex opt. -> linear opt. = Wahba's alg.
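As an illustration of the online-routing row, linear optimization over the flow polytope amounts to one shortest-path computation. The sketch below uses Dijkstra's algorithm and therefore assumes non-negative edge costs; the graph, the edge costs, and the function name are made up for the example.

```python
import heapq


def shortest_path(edge_costs, source, target):
    """Dijkstra on a directed graph with non-negative edge costs.

    edge_costs maps (u, v) -> cost. Returns (path_edges, total_cost).
    The path is a vertex of the flow polytope minimizing the linear cost.
    """
    adjacency = {}
    for (u, v), w in edge_costs.items():
        adjacency.setdefault(u, []).append((v, w))
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        raise ValueError("target is not reachable from source")
    # Walk the predecessor links back from the target.
    path = []
    node = target
    while node != source:
        path.append((prev[node], node))
        node = prev[node]
    path.reverse()
    return path, dist[target]


edge_costs = {
    ("s", "a"): 1.0, ("s", "b"): 4.0, ("a", "b"): 1.0,
    ("a", "t"): 5.0, ("b", "t"): 1.0,
}
path, total = shortest_path(edge_costs, "s", "t")
```

On this toy graph the cheapest route is s -> a -> b -> t with total cost 3.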

10
The Frank-Wolfe algorithm
[Figure: current iterate $x_t$, vertex $v_{t+1}$ returned by linear optimization, and next iterate $x_{t+1}$.]
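The update formulas on this slide did not survive extraction. For orientation, the textbook form of the Frank-Wolfe step is given below; the step-size symbol $\eta_t$ is notation assumed here and is not taken from the slides.

```latex
\begin{align*}
v_{t+1} &= \arg\min_{x \in K} \; \nabla f(x_t)^\top x
  && \text{(linear optimization over } K\text{)} \\
x_{t+1} &= (1 - \eta_t)\, x_t + \eta_t\, v_{t+1}, \quad \eta_t \in [0, 1]
  && \text{(convex combination, so } x_{t+1} \in K\text{)}
\end{align*}
```

Because $x_{t+1}$ is a convex combination of points in $K$, no projection is needed.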

11
The Frank-Wolfe algorithm (conditional grad.)
Thm [FW '56]: rate of convergence = $C/t$ (C = smoothness)
[Clarkson '06]: refined analysis
[Hazan '07]: SDP
[Jaggi '11]: generalization
[Figure: $x_t$, $v_{t+1}$, $x_{t+1}$.]
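A minimal Frank-Wolfe sketch on the probability simplex, where linear optimization reduces to picking the coordinate with the smallest gradient entry. The objective, the target vector `b`, and the step size 2/(t+2) (a common choice in the literature, not stated on the slides) are assumptions made for illustration.

```python
def frank_wolfe_simplex(grad, dim, steps):
    """Minimize a smooth convex function over the probability simplex.

    grad(x) returns the gradient at x as a list of length dim.
    """
    x = [1.0] + [0.0] * (dim - 1)  # start at a vertex of the simplex
    for t in range(1, steps + 1):
        g = grad(x)
        # Linear optimization over the simplex: the best vertex e_i is the
        # coordinate with the smallest gradient entry.
        i = min(range(dim), key=lambda j: g[j])
        gamma = 2.0 / (t + 2.0)  # assumed step size
        x = [(1.0 - gamma) * xj for xj in x]
        x[i] += gamma  # x <- (1 - gamma) * x + gamma * e_i
    return x


# Toy objective f(x) = ||x - b||^2 with b inside the simplex.
b = [0.2, 0.5, 0.3]


def grad(x):
    return [2.0 * (xj - bj) for xj, bj in zip(x, b)]


x = frank_wolfe_simplex(grad, dim=3, steps=2000)
```

Every iterate stays feasible by construction, and after t steps it is a convex combination of at most t + 1 vertices.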

12
The Frank-Wolfe algorithm
1. At iteration t: convex comb. of <= t vertices = (t,K)-sparse
2. No learning rate. Convex combination with weight 1/t (indep. of diameter, gradients, etc.)
[Figure: $x_t$, $v_{t+1}$, $x_{t+1}$.]

13
Online Frank-Wolfe: wrong approach
Example: K = interval [-1,1] [S. Bubeck]

14
Online Conditional Gradient (OCG)
[Figure: current iterate $x_t$, vertex $v_{t+1}$, and next iterate $x_{t+1}$.]
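The slide's formulas were lost in extraction. The sketch below is written in the spirit of online conditional gradient: take one linear-optimization step per round on a regularized sum of past gradients. The regularizer, the parameter `eta`, the step size `sigma`, and the box-shaped decision set are all assumptions made for illustration, not the tuned choices behind the regret bounds.

```python
import math


def linear_opt_box(g):
    """argmin of <g, x> over the box [-1, 1]^d: opposite sign per coordinate."""
    return [-1.0 if gj > 0 else 1.0 for gj in g]


def online_conditional_gradient(cost_gradient, dim, rounds, eta):
    """cost_gradient(t, x) returns the gradient of f_t at the played point x."""
    x1 = [0.0] * dim
    x = list(x1)
    grad_sum = [0.0] * dim
    plays = []
    for t in range(1, rounds + 1):
        plays.append(list(x))
        g = cost_gradient(t, x)
        grad_sum = [s + gj for s, gj in zip(grad_sum, g)]
        # Gradient at x of F_t(z) = eta * <grad_sum, z> + ||z - x1||^2.
        grad_F = [eta * s + 2.0 * (xj - aj)
                  for s, xj, aj in zip(grad_sum, x, x1)]
        v = linear_opt_box(grad_F)  # the only access to the set K
        sigma = min(1.0, 2.0 / math.sqrt(t))  # assumed step-size schedule
        x = [(1.0 - sigma) * xj + sigma * vj for xj, vj in zip(x, v)]
    return plays


# Constant linear cost f_t(x) = x on K = [-1, 1]; the best fixed point is -1.
plays = online_conditional_gradient(lambda t, x: [1.0], dim=1, rounds=200, eta=0.1)
```

With these settings the iterates oscillate in early rounds and then settle near -1, the best fixed point for this cost sequence.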

15
              Stochastic    Adversarial
Smooth        $\sqrt{T}$    $T^{3/4}$
Non-smooth    $T^{2/3}$     $T^{3/4}$

Projections -> linear optimization
Parameter free (no learning rate)
Sparse predictions
But can we get the optimal $\sqrt{T}$ rate?
Barrier: existing projection-free algs were not linearly converging (poly-time).

16
New poly-time projection-free alg. [Garber, Hazan 2013]
New algorithm with convergence rate ~ $e^{-t/n}$
CS: "poly time"; Nemirovski: "linear rate"
Only linear optimization calls on the original polytope! (constantly many per iteration)

17
Linearly converging Frank-Wolfe
Assume the optimum is within Euclidean distance $r$.
Thm [easy]: rate of convergence = $e^{-t}$
But useless: under a ball-intersection constraint, quadratic optimization is equivalent to projection.
[Figure: $x_t$, $v_{t+1}$, $x_{t+1}$.]

18
Inherent problem with balls
No intersection -> radius is bounded by distance to boundary
With intersection -> hard problem, equivalent to projection
[Figure: $x_t$ and $x^*$.]

19
Polytopes are OK!
Can find a significantly smaller polytope (radius proportional to Euclidean distance to OPT) that:
Contains $x^*$
Does not intersect original polytope
Same shape

20
Implications for online optimization
Projections -> linear optimization
Parameter free (no learning rate)
Sparse predictions
Optimal rate (table columns on the slide: Stochastic, Adversarial):
Convex: $\sqrt{T}$
Strongly convex: $\log(T)$

21

22
More research / open questions
Projection-free algs: for many problems, linear step time vs. cubic or more.
For main ML problems today, projection-free is the only feasible optimization method.
Completely poly-time (log dependence on smoothness / strong convexity / diameter).
Can we attain poly-time optimization using only gradient information?
