1
**Projection-free Online Learning**

Dan Garber Elad Hazan

2
**Super-linear operations are infeasible!**

Motivating example: matrix completion.

3
**Online convex optimization**

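In each round t = 1, ..., T the learner picks a point $x_t$ in the convex decision set, a linear (convex) bounded cost function $f_t$ is revealed, and the learner incurs loss $f_t(x_t)$.

Total loss = $\sum_t f_t(x_t)$

Regret = $\sum_t f_t(x_t) - \min_{x^*} \sum_t f_t(x^*)$

Matrix completion: decision set = low-rank matrices, $X_{ij}$ = prediction for user i and movie j; cost functions: $f(X) = (X \bullet E_{ij} - (\pm 1))^2$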
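The regret definition can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper name `regret`, the toy cost vectors, and the choice of the two-point simplex as decision set are assumptions, not taken from the slides. For linear costs the best fixed point in hindsight is attained at a vertex, so it is enough to compare against the vertices.

```python
def regret(costs, plays, candidates):
    """Regret of `plays` against the best fixed point among `candidates`.

    costs[t] is a cost vector c_t defining the linear loss f_t(x) = <c_t, x>.
    """
    def loss(c, x):
        return sum(ci * xi for ci, xi in zip(c, x))

    # Total loss actually incurred by the learner.
    total = sum(loss(c, x) for c, x in zip(costs, plays))
    # Loss of the best single point in hindsight.
    best_fixed = min(sum(loss(c, x) for c in costs) for x in candidates)
    return total - best_fixed


# Hypothetical toy data: two "experts" (vertices of the simplex), four rounds.
costs = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 0.0)]
vertices = [(1.0, 0.0), (0.0, 1.0)]
plays = [(0.5, 0.5)] * 4  # always play the uniform mixture
toy_regret = regret(costs, plays, vertices)
```

Here the learner pays 0.5 per round while the second vertex pays 1 in total, so the regret of the uniform strategy on this toy sequence is 1.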

4
**Online gradient descent**

The algorithm: move in the direction of the vector $-c_t$, where $c_t$ is the gradient of the current cost function,

$y_{t+1} = x_t - \eta c_t$

and project $y_{t+1}$ back onto the convex set to obtain $x_{t+1}$.

Thm [Zinkevich]: if $\eta = 1/\sqrt{t}$ then this algorithm attains worst-case regret of

$\sum_t f_t(x_t) - \sum_t f_t(x^*) = O(\sqrt{T})$
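A minimal sketch of the update above, assuming linear costs and the Euclidean ball as decision set, which is one of the cases where projection is cheap. The function names `project_to_ball` and `online_gradient_descent` are illustrative, not from the slides.

```python
import math


def project_to_ball(y, radius=1.0):
    """Euclidean projection of y onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return list(y)
    return [v * radius / norm for v in y]


def online_gradient_descent(cost_vectors, dim, radius=1.0):
    """Play x_t, observe the linear cost c_t, step with eta = 1/sqrt(t), project."""
    x = [0.0] * dim
    plays = []
    for t, c in enumerate(cost_vectors, start=1):
        plays.append(x)
        eta = 1.0 / math.sqrt(t)
        y = [xi - eta * ci for xi, ci in zip(x, c)]  # gradient step
        x = project_to_ball(y, radius)               # projection step
    return plays
```

For other decision sets only `project_to_ball` changes, and that single call is exactly the step that becomes expensive for sets such as the flow polytope or the PSD cone.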

5
**Computational efficiency?**

- Gradient step: linear time
- Projection step: quadratic program!
- Online mirror descent: general convex program

The convex decision set K:

- In general: $O(m^{1/2} n^3)$
- Simplex / Euclidean ball / cube: linear time
- Flow polytope: conic optimization, $O(m^{1/2} n^3)$
- PSD cone (matrix completion): Cholesky decomposition, $O(n^3)$

6
**Projections out of the question!**

Motivating example: matrix completion.

7
**Computationally difficult learning problems**

- Matrix completion: K = SDP cone (projection: Cholesky decomposition)
- Online routing: K = flow polytope (projection: conic optimization over the flow polytope)
- Rotations: K = rotation matrices
- Matroids: K = matroid polytope

8
**Results part 1 (Hazan + Kale, ICML’12)**

Projection-less stochastic/online algorithms with regret bounds:

- Projections <-> linear optimization
- Parameter free (no learning rate)
- Sparse predictions

| | Stochastic | Adversarial |
|---|---|---|
| Smooth | $\sqrt{T}$ | $T^{3/4}$ |
| Non-smooth | $T^{2/3}$ | $T^{3/4}$ |

9
**Linear opt. vs. Projections**

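| Problem | Decision set K | Projection | Linear optimization |
|---|---|---|---|
| Matrix completion | SDP cone | Cholesky decomposition | largest singular vector |
| Online routing | flow polytope | conic optimization over the flow polytope | shortest path computation |
| Rotations | rotation matrices | convex optimization | Wahba's algorithm |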
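To illustrate the online-routing row, here is a sketch of a linear optimization oracle over the set of source-to-target paths, implemented as a shortest path computation with Dijkstra's algorithm. It assumes a directed graph with nonnegative edge costs; the function name `shortest_path_oracle` and the edge-list representation are illustrative choices, not from the slides.

```python
import heapq


def shortest_path_oracle(num_nodes, edges, costs, source, target):
    """Return the 0/1 edge-indicator vector of a minimum-cost source-target path.

    edges: list of directed edges (u, v); costs: nonnegative cost per edge.
    """
    adj = [[] for _ in range(num_nodes)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))

    dist = [float("inf")] * num_nodes
    prev_edge = [None] * num_nodes
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, idx in adj[u]:
            nd = d + costs[idx]
            if nd < dist[v]:
                dist[v] = nd
                prev_edge[v] = idx
                heapq.heappush(heap, (nd, v))

    if dist[target] == float("inf"):
        raise ValueError("target is unreachable from source")

    # Walk back from the target to mark the edges on the path.
    indicator = [0] * len(edges)
    node = target
    while node != source:
        idx = prev_edge[node]
        indicator[idx] = 1
        node = edges[idx][0]
    return indicator
```

The returned vector is a vertex of the path polytope that minimizes the inner product with `costs`, which is the only operation a projection-free method needs from the decision set.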

10
**The Frank-Wolfe algorithm**

[Figure: one Frank-Wolfe step, showing the points $x_t$, $v_{t+1}$ and $x_{t+1}$]

11
**The Frank-Wolfe algorithm (conditional grad.)**

Thm [FW '56]: rate of convergence = C/t (C = smoothness)

- [Clarkson '06]: refined analysis
- [Hazan '07]: SDP
- [Jaggi '11]: generalization

12
**The Frank-Wolfe algorithm**

- At iteration t: a convex combination of <= t vertices, i.e. (t, K)-sparse
- No learning rate: convex combination with weight 1/t (independent of diameter, gradients, etc.)
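A minimal sketch of the Frank-Wolfe iteration on the probability simplex, where linear optimization amounts to picking the coordinate with the smallest gradient entry. The step size 2/(t+2) is a commonly used parameter-free schedule of order 1/t and is an assumption here, since the slides only say "1/t". The names `frank_wolfe_simplex`, `quadratic_grad` and `target` are hypothetical.

```python
def frank_wolfe_simplex(grad, dim, iterations):
    """Minimize a smooth convex function over the simplex using only linear optimization."""
    x = [0.0] * dim
    x[0] = 1.0  # start at a vertex
    for t in range(iterations):
        g = grad(x)
        # Linear optimization over the simplex: the best vertex is a coordinate vector.
        i = min(range(dim), key=lambda j: g[j])
        gamma = 2.0 / (t + 2.0)  # assumed step schedule, no learning rate to tune
        # Convex combination of the current point and the chosen vertex.
        x = [(1.0 - gamma) * xj for xj in x]
        x[i] += gamma
    return x


# Hypothetical objective: squared distance to a point inside the simplex.
target = [0.2, 0.5, 0.3]


def quadratic_grad(x):
    return [2.0 * (xj - bj) for xj, bj in zip(x, target)]


x_fw = frank_wolfe_simplex(quadratic_grad, dim=3, iterations=200)
```

Each iteration adds at most one new vertex to the convex combination, which is where the sparsity of the iterates comes from.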

13
**Online Frank-Wolfe – wrong approach**

Example [S. Bubeck]: K = interval [-1, 1]

14
**Online Conditional Gradient (OCG)**

[Figure: one OCG step, showing the points $x_t$, $x_{t+1}$ and $v_{t+1}$]
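The slide gives no formulas for OCG, so the following is only a sketch in the spirit of an online conditional-gradient update: each round takes a single Frank-Wolfe step on a regularized sum of the linearized costs seen so far. The regularizer $\|x - x_1\|^2$, the parameter `eta`, the step size `min(1, 2/sqrt(t))`, the inclusion of the current round's cost in the sum, and the simplex as decision set are all assumptions made for illustration.

```python
import math


def online_conditional_gradient(cost_vectors, dim, eta):
    """One linear-optimization call per round over the simplex, no projections."""
    x1 = [1.0 / dim] * dim
    x = list(x1)
    grad_sum = [0.0] * dim
    plays = []
    for t, c in enumerate(cost_vectors, start=1):
        plays.append(x)  # play x_t, then observe the linear cost c_t
        grad_sum = [s + ci for s, ci in zip(grad_sum, c)]
        # Gradient at x of F_t(x) = eta * <grad_sum, x> + ||x - x1||^2.
        g = [eta * s + 2.0 * (xi - x0) for s, xi, x0 in zip(grad_sum, x, x1)]
        # Linear optimization over the simplex: pick the best vertex.
        i = min(range(dim), key=lambda j: g[j])
        sigma = min(1.0, 2.0 / math.sqrt(t))  # assumed step size
        x = [(1.0 - sigma) * xi for xi in x]
        x[i] += sigma
    return plays
```

The design point is that the iterate never leaves the decision set, because every update is a convex combination of the current point and a vertex.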

15
**Projections <-> Linear optimization**

- Parameter free (no learning rate)
- Sparse predictions

| | Stochastic | Adversarial |
|---|---|---|
| Smooth | $\sqrt{T}$ | $T^{3/4}$ |
| Non-smooth | $T^{2/3}$ | $T^{3/4}$ |

But can we get the optimal $\sqrt{T}$ rate?

Barrier: existing projection-free algorithms were not linearly converging (poly-time).

16
**New poly-time projection free alg [Garber, Hazan 2013]**

New algorithm with convergence rate ~ $e^{-t/n}$

- CS: "poly time"
- Nemirovski: "linear rate"

Only linear optimization calls on the original polytope (constantly many per iteration)!

17
**Linearly converging Frank-Wolfe**

Assume the optimum is within Euclidean distance r:

Thm [easy]: rate of convergence = $e^{-t}$

But useless: under a ball-intersection constraint, the step becomes a quadratic optimization problem, equivalent to projection.

18
**Inherent problem with balls**

[Figure: the points $x^*$ and $x_t$]

- No intersection -> radius is bounded by the distance to the boundary
- With intersection -> hard problem, equivalent to projection

19
**Polytopes are OK!**

Can find a significantly smaller polytope (radius proportional to the Euclidean distance to OPT) that:

- Contains $x^*$
- Does not intersect the original polytope
- Has the same shape

20
**Implications for online optimization**

- Projections <-> linear optimization
- Parameter free (no learning rate)
- Sparse predictions
- Optimal rate

Regret, stochastic and adversarial settings:

- Convex: $\sqrt{T}$
- Strongly convex: $\log(T)$

22
**More research / open questions**

- Projection-free algorithms: for many problems, linear step time vs. cubic or more
- For the main ML problems today, projection-free is the only feasible optimization method
- Completely poly-time (log dependence on smoothness / strong convexity / diameter)
- Can we attain poly-time optimization using only gradient information?

Thank you!
