Presentation on theme: "Projection-free Online Learning"— Presentation transcript:
1 Projection-free Online Learning
Dan Garber, Elad Hazan
2 Matrix completion: super-linear operations are infeasible!
3 Online convex optimization
For t = 1, 2, …, T: play x_t in a convex decision set, observe a linear (convex), bounded cost function f_t, and incur loss f_t(x_t).
Total loss = Σ_t f_t(x_t)
Regret = Σ_t f_t(x_t) − min_{x*} Σ_t f_t(x*)
Matrix completion: decision set = low-rank matrices, X_ij = prediction for user i, movie j; cost functions f(X) = |X • E_ij − (±1)|^2
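The regret quantity above can be made concrete with a small numerical sketch (the 1-D linear-loss setup and all names are illustrative, not from the talk): an adversary picks gradients g_t ∈ {−1, +1}, a trivial learner plays x_t = 0 on K = [−1, 1], and regret is measured against the best fixed point in hindsight.

```python
import numpy as np

# Illustrative 1-D example: linear losses f_t(x) = g_t * x on K = [-1, 1].
rng = np.random.default_rng(0)
T = 1000
gs = rng.choice([-1.0, 1.0], size=T)    # adversary's loss slopes

xs = np.zeros(T)                         # a learner that always plays x_t = 0
total_loss = float(np.sum(gs * xs))      # sum_t f_t(x_t)

# Best fixed point in hindsight: minimize sum_t g_t * x over x in [-1, 1],
# attained at an endpoint of the interval.
G = float(gs.sum())
best_loss = min(G, -G)                   # = -|G|
regret = total_loss - best_loss          # grows like sqrt(T) for this learner
```

For random ±1 slopes, |G| concentrates around √T, so even this do-nothing learner already illustrates the √T scale of the regret benchmark.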
4 Online gradient descent
The algorithm: move in the direction of the negative gradient ∇_t = ∇f_t(x_t) of the current cost function,
y_{t+1} = x_t − η_t ∇_t
and project back to the convex set:
x_{t+1} = Π_K(y_{t+1})
Thm [Zinkevich]: if η_t = 1/√t, then this algorithm attains worst-case regret
Σ_t f_t(x_t) − Σ_t f_t(x*) = O(√T)
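A minimal sketch of the update above in Python, taking K to be the Euclidean unit ball so the projection has a closed form (function names and the gradient-oracle interface are illustrative):

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius} (closed form)."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def ogd(grad_oracle, x1, T):
    """Online gradient descent with step size eta_t = 1/sqrt(t):
    y_{t+1} = x_t - eta_t * grad_t, then project y_{t+1} back onto K."""
    x = x1
    plays = []
    for t in range(1, T + 1):
        plays.append(x.copy())
        g = grad_oracle(t, x)                 # gradient of f_t at x_t
        x = project_ball(x - g / np.sqrt(t))  # gradient step + projection
    return plays
```

With η_t = 1/√t this matches the Zinkevich guarantee above; the point of the rest of the talk is that the `project_ball` step is cheap here but expensive for sets like the PSD cone.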
5 Computational efficiency?
Gradient step: linear time.
Projection step: a quadratic program! (online mirror descent: a general convex program)
Cost of projecting onto the convex decision set K:
In general: O(m^(1/2) n^3)
Simplex / Euclidean ball / cube: linear time
Flow polytope: conic optimization, O(m^(1/2) n^3)
PSD cone (matrix completion): Cholesky decomposition, O(n^3)
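As an example of one of the "easy" sets, Euclidean projection onto the probability simplex has a well-known sort-based O(n log n) algorithm; a sketch (illustrative code, not from the talk):

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the standard sort-based algorithm."""
    u = np.sort(y)[::-1]                  # coordinates in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(y) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]   # largest feasible support size
    theta = css[rho] / (rho + 1.0)              # shift that normalizes the sum
    return np.maximum(y - theta, 0.0)
```

Points already on the simplex are fixed points, and arbitrary inputs are clipped and renormalized in O(n log n), versus the O(n^3) Cholesky-based projection needed for the PSD cone.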
6 Matrix completion: projections are out of the question!
7 Computationally difficult learning problems
Matrix completion: K = SDP cone; projection via Cholesky decomposition
Online routing: K = flow polytope; projection via conic optimization over the flow polytope
Rotations: K = rotation matrices
Matroids: K = matroid polytope
8 Results part 1 (Hazan + Kale, ICML'12)
Projection-free stochastic/online algorithms with regret bounds:
projections <-> linear optimization
parameter-free (no learning rate)
sparse predictions

Regret bounds:
            Stochastic   Adversarial
Smooth      √T           T^(3/4)
Non-smooth  T^(2/3)      T^(3/4)
9 Linear opt. vs. projections
Matrix completion: K = SDP cone; projection: Cholesky decomposition; linear opt.: largest singular vector
Online routing: K = flow polytope; projection: conic optimization over the flow polytope; linear opt.: shortest-path computation
Rotations: K = rotation matrices; projection: convex optimization; linear opt.: Wahba's algorithm
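The matrix-completion row can be made concrete. Over the spectrahedron K = {X PSD, Tr(X) = 1} (used here as a bounded stand-in for the slide's SDP cone; the code and names are illustrative), linear optimization needs only a single extreme eigenvector, whereas projection requires a full eigendecomposition:

```python
import numpy as np

def lmo_spectrahedron(C):
    """Linear optimization over K = {X PSD, trace(X) = 1}:
    argmin_{X in K} <C, X> = v v^T, where v is the eigenvector of the
    smallest eigenvalue of the symmetric part of C.  One eigenvector
    computation -- no projection (no full decomposition) needed."""
    S = (C + C.T) / 2.0
    _, V = np.linalg.eigh(S)   # eigenvalues in ascending order
    v = V[:, 0]                # eigenvector of the smallest eigenvalue
    return np.outer(v, v)      # rank-1 optimizer (also a sparse prediction)
```

In practice the extreme eigenvector would be found by Lanczos/power iteration in roughly linear time in the number of nonzeros, which is the source of the speedup claimed on the slide.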
15 Projections <-> linear optimization
parameter-free (no learning rate)
sparse predictions
But can we get the optimal √T rate?
Barrier: existing projection-free algorithms were not linearly converging (in polynomial time).

            Stochastic   Adversarial
Smooth      √T           T^(3/4)
Non-smooth  T^(2/3)      T^(3/4)
16 New poly-time projection-free algorithm [Garber, Hazan 2013]
New algorithm with convergence rate ~ e^(−t/n)
(CS: "poly time"; Nemirovski: "linear rate")
Uses only linear-optimization calls on the original polytope (a constant number per iteration).
17 Linearly converging Frank-Wolfe
Frank-Wolfe step: v_{t+1} = argmin_{v in K} ∇f(x_t)·v, then x_{t+1} = x_t + η_t (v_{t+1} − x_t).
Assume the optimum is within Euclidean distance r, and restrict the linear optimization to a ball of radius r.
Thm [easy]: rate of convergence = e^(−t).
But useless: under a ball-intersection constraint, the step becomes quadratic optimization, equivalent to projection.
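For reference, the standard (non-accelerated) Frank-Wolfe iteration the slide builds on, sketched over the probability simplex, where the linear-minimization oracle simply returns a vertex (names and the target function are illustrative):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, T):
    """Standard Frank-Wolfe over the probability simplex.
    Each step uses only linear minimization: argmin_{v in simplex} <g, v>
    is the vertex e_i with i = argmin_i g_i -- no projection."""
    x = x0.copy()
    for t in range(1, T + 1):
        g = grad(x)
        i = int(np.argmin(g))
        v = np.zeros_like(x)
        v[i] = 1.0                       # LMO output: a simplex vertex
        eta = 2.0 / (t + 2)              # standard step-size schedule
        x = (1 - eta) * x + eta * v      # convex combination stays feasible
    return x
```

With η_t = 2/(t+2), plain Frank-Wolfe converges only at an O(1/t) rate; obtaining a linear (e^(−t)-type) rate without ball constraints is exactly the difficulty the slide is pointing at.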
18 Inherent problem with balls
No intersection with the polytope's boundary: the ball's radius is bounded by the distance from x_t to the boundary.
With intersection: linear optimization over the intersection is a hard problem, equivalent to projection.
19 Polytopes are OK!
We can find a significantly smaller polytope of the same shape (radius proportional to the Euclidean distance to OPT) that:
contains x*
does not intersect the original polytope's boundary
20 Implications for online optimization
Projections <-> linear optimization
parameter-free (no learning rate)
sparse predictions
Optimal rates:

                 Stochastic   Adversarial
Convex           √T           √T
Strongly convex  log(T)       log(T)
22 More research / open questions
Projection-free algorithms: for many problems, linear time per step vs. cubic or more.
For the main ML problems today, projection-free methods are the only feasible optimization approach.
Completely poly-time algorithms (log dependence on smoothness / strong convexity / diameter)?
Can we attain poly-time optimization using only gradient information?
Thank you!