
1
**Projection-free Online Learning**

Dan Garber Elad Hazan

2
**Super-linear operations are infeasible!**

Motivating example: matrix completion, where the decision set is so large that super-linear-time operations per step are infeasible.

3
**Online convex optimization**

At each round t = 1, …, T: the learner picks xt from a convex decision set; a linear (or convex) bounded cost function ft is revealed; the learner incurs loss ft(xt).

Total loss = Σt ft(xt). Regret = Σt ft(xt) − minx* Σt ft(x*).

Matrix completion example: decision set = low-rank matrices, Xij = prediction for user i on movie j; cost functions f(X) = |X • Eij − (±1)|².

4
**Online gradient descent**

The algorithm: move in the direction of the negative gradient of the current cost function, yt+1 = xt − η ∇ft(xt), and project back onto the convex set to obtain xt+1.

Thm [Zinkevich]: if η = 1/√t, this algorithm attains worst-case regret Σt ft(xt) − Σt ft(x*) = O(√T).
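A minimal sketch of online gradient descent, using the Euclidean ball as the decision set (one of the sets for which projection is cheap); the function and parameter names are illustrative, not from the talk:

```python
import numpy as np

def ogd(grad_fns, x0, radius=1.0):
    """Online gradient descent with projection onto the Euclidean
    ball of the given radius (a set whose projection is linear time).

    grad_fns : sequence of callables, grad_fns[t](x) = gradient of f_t at x.
    Returns the list of iterates x_1, ..., x_T.
    """
    x = np.asarray(x0, dtype=float)
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(x.copy())
        eta = 1.0 / np.sqrt(t)            # Zinkevich's step size
        y = x - eta * grad(x)             # gradient step
        norm = np.linalg.norm(y)
        x = y if norm <= radius else y * (radius / norm)  # projection
    return iterates
```

With a fixed linear loss the iterates slide to the boundary point opposing the gradient, which is exactly the behavior the regret bound controls.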

5
**Computational efficiency?**

Gradient step: linear time. Projection step: a quadratic program! (Online mirror descent: a general convex program.)

Cost of projecting onto the convex decision set K:
- In general: O(m½ n³)
- Simplex / Euclidean ball / cube: linear time
- Flow polytope: conic optimization, O(m½ n³)
- PSD cone (matrix completion): Cholesky decomposition, O(n³)
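To make the projection-vs-linear-optimization gap concrete, here is a sketch over the probability simplex (an illustrative set not singled out in this slide): even on this easy set, Euclidean projection needs a sort, while linear optimization is a single argmin:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the probability simplex
    (standard sorting-based algorithm, O(n log n))."""
    u = np.sort(y)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, len(y) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)             # shift that enforces sum = 1
    return np.maximum(y - theta, 0.0)

def linear_opt_simplex(c):
    """Linear optimization over the simplex: argmin_x <c, x> is the
    vertex at the smallest coordinate of c -- O(n), no sort needed."""
    x = np.zeros_like(c)
    x[np.argmin(c)] = 1.0
    return x
```

For richer sets the gap is far larger: projection onto the PSD cone needs an O(n³) decomposition, while linear optimization needs only a largest singular vector.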

6
**Projections out of the question!**

For matrix completion at this scale, even a single projection is out of the question.

7
**Computationally difficult learning problems**

- Matrix completion: K = SDP cone; projection requires a Cholesky decomposition
- Online routing: K = flow polytope; projection requires conic optimization over the flow polytope
- Rotations: K = rotation matrices
- Matroids: K = matroid polytope

8
**Results part 1 (Hazan + Kale, ICML’12)**

Projection-less stochastic/online algorithms with regret bounds:
- Projections ↔ linear optimization
- parameter-free (no learning rate)
- sparse predictions

Regret bounds: smooth losses: √T (stochastic), T¾ (adversarial); non-smooth losses: T¾.

9
**Linear opt. vs. Projections**

- Matrix completion (K = SDP cone): projection = Cholesky decomposition; linear opt. = largest singular vector
- Online routing (K = flow polytope): projection = conic optimization over the flow polytope; linear opt. = shortest-path computation
- Rotations (K = rotation matrices): projection = convex opt.; linear opt. = Wahba's algorithm

10
**The Frank-Wolfe algorithm**

(Figure: one Frank-Wolfe step, moving from the iterate xt toward the vertex vt+1 to obtain xt+1.)

11
**The Frank-Wolfe algorithm (conditional grad.)**

(Figure: one Frank-Wolfe step from xt toward the vertex vt+1.)

Thm [FW '56]: rate of convergence = O(C/t) (C = smoothness). [Clarkson '06]: refined analysis. [Hazan '07]: SDPs. [Jaggi '11]: generalization.
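A minimal sketch of the Frank-Wolfe / conditional gradient step: each iteration calls a linear-optimization oracle over K instead of a projection. The step size 2/(t+2) is one common parameter-free choice; function names are illustrative:

```python
import numpy as np

def frank_wolfe(grad, linear_opt, x0, iters=100):
    """Conditional gradient method for a smooth objective over a
    convex set K, given only a linear-optimization oracle for K.

    grad       : gradient of the smooth objective.
    linear_opt : c -> argmin over v in K of <c, v> (returns a vertex of K).
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, iters + 1):
        v = linear_opt(grad(x))            # linear optimization, no projection
        gamma = 2.0 / (t + 2.0)            # parameter-free step size
        x = (1 - gamma) * x + gamma * v    # convex combination stays in K
    return x
```

Example usage: minimizing ½‖x − p‖² over the simplex, where the oracle just returns the vertex at the smallest gradient coordinate; the iterate approaches p at the O(C/t) rate above.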

12
**The Frank-Wolfe algorithm**

(Figure: one Frank-Wolfe step from xt toward the vertex vt+1.)

At iteration t, the iterate is a convex combination of at most t vertices, i.e. it is (t, K)-sparse. No learning rate: the convex combination uses weight 1/t, independent of the diameter, gradients, etc.

13
**Online Frank-Wolfe – wrong approach**

Counterexample [S. Bubeck]: K = the interval [-1, 1].

14
**Online Conditional Gradient (OCG)**

(Figure: one OCG step, moving from the iterate xt toward the vertex vt+1 to obtain xt+1.)
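A hedged sketch of the online conditional gradient idea: one linear-optimization call per round, applied to the gradient of a regularized sum of past linearized losses. The regularization weight `eta` and the mixing schedule t^(-3/4) are illustrative assumptions, not the talk's exact constants:

```python
import numpy as np

def ocg(grad_fns, linear_opt, x0, eta=0.1):
    """Online conditional gradient sketch: each round applies one
    Frank-Wolfe step to F_t(x) = eta * <sum of past gradients, x>
    + ||x - x0||^2, using a single linear-optimization call.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    g_sum = np.zeros_like(x0)
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(x.copy())
        g_sum += grad(x)                      # aggregate observed gradients
        nabla_F = eta * g_sum + 2.0 * (x - x0)
        v = linear_opt(nabla_F)               # the only oracle call per round
        sigma = min(1.0, t ** -0.75)          # vanishing mixing weight
        x = (1 - sigma) * x + sigma * v
    return iterates
```

Because every update is a convex combination with a set vertex, the iterates stay feasible and sparse without any projection.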

15
**Projections <-> Linear optimization **

- Projections ↔ linear optimization
- parameter-free (no learning rate)
- sparse predictions

Regret bounds: smooth losses: √T (stochastic), T¾ (adversarial); non-smooth losses: T⅔.

But can we get the optimal √T rate? Barrier: existing projection-free algorithms were not linearly converging (in poly time).

16
**New poly-time projection free alg [Garber, Hazan 2013]**

A new algorithm with convergence rate ~ e^(−t/n): "poly time" in CS terminology, a "linear rate" in Nemirovski's terminology. It uses only linear-optimization calls on the original polytope, a constant number per iteration.

17
**Linearly converging Frank-Wolfe**

(Figure: one Frank-Wolfe step from xt toward the vertex vt+1.)

Assume the optimum lies within Euclidean distance r. Thm [easy]: rate of convergence = e^(−t). But this is useless: optimizing under a ball-intersection constraint is a quadratic optimization, equivalent to a projection.

18
**Inherent problem with balls**

(Figure: iterate xt and optimum x* inside the polytope, with a ball around xt.) If the ball does not intersect the boundary, its radius is bounded by the distance to the boundary. If it does intersect, we get a hard problem, equivalent to projection.

19
**Polytopes are OK!**

One can find a significantly smaller polytope of the same shape (radius proportional to the Euclidean distance to the optimum) that:
- contains x*
- does not intersect the boundary of the original polytope

20
**Implications for online optimization**

- Projections ↔ linear optimization
- parameter-free (no learning rate)
- sparse predictions
- optimal rates

Regret bounds (stochastic and adversarial): convex losses: √T; strongly convex losses: log(T).

22
**More research / open questions**

Projection-free algorithms: for many problems, linear step time vs. cubic or more. For the main ML problems today, projection-free methods are the only feasible optimization approach. Open: completely poly-time algorithms (log dependence on smoothness / strong convexity / diameter). Can we attain poly-time optimization using only gradient information?

Thank you!
