Randomized Dual Coordinate Ascent with Arbitrary Sampling. Zheng Qu, University of Edinburgh. Joint work with Peter Richtárik (Edinburgh) & Tong Zhang (Rutgers & Baidu). Optimization & Big Data Workshop, Edinburgh, 6th to 8th May 2015.

Supervised Statistical Learning. Training set of data: inputs (e.g., image, text, clinical measurements, …) \(A_i \in \mathbb{R}^d\) with labels (e.g., spam/no spam, stock price) \(y_i \in \mathbb{R}\). Goal: find a predictor \(w \in \mathbb{R}^d\) such that the predicted label is close to the true label. Pipeline: Data → Algorithm → Predictor.

Empirical Risk Minimization. Given the training set of n input–label pairs (n = # samples, big!), find the predictor \(w \in \mathbb{R}^d\) by minimizing the empirical risk plus a regularization term.

Empirical Risk Minimization. Data: \[(A_1,y_1), (A_2,y_2), \dots, (A_n,y_n)\sim \text{Distribution}, \qquad n = \text{\# samples (big!)}\] ERM problem: \[\min_{w\in \mathbb{R}^d}\; \underbrace{\frac{1}{n}\sum_{i=1}^n \mathrm{loss}(A_i^\top w, y_i)}_{\text{empirical loss}} \;+\; \text{regularization}.\]
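To make the objective concrete, here is a minimal sketch (my own illustration, not from the slides) that evaluates a regularized empirical risk on toy data, assuming squared loss and an L2 regularizer with weight lam:

```python
import numpy as np

def regularized_empirical_risk(A, y, w, lam):
    """Average squared loss over the n samples plus an L2 regularizer.

    A : (n, d) data matrix, row i is A_i
    y : (n,) labels
    w : (d,) predictor
    lam : regularization weight
    """
    residuals = A @ w - y                          # predicted minus true labels
    empirical_loss = 0.5 * np.mean(residuals ** 2)
    regularization = 0.5 * lam * np.dot(w, w)
    return empirical_loss + regularization

# Toy example: n = 4 samples in d = 3 dimensions.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 2.0])
print(regularized_empirical_risk(A, y, np.zeros(3), lam=0.1))
```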

Algorithm: QUARTZ. Z. Qu, P. Richtárik (UoE) and T. Zhang (Rutgers & Baidu Big Data Lab, Beijing), Randomized dual coordinate ascent with arbitrary sampling, arXiv, 2014.

Primal-Dual Formulation. ERM (primal) problem: \[\min_{w \in \mathbb{R}^d}\;\; \left[ P(w) \equiv \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w)\right] \] The dual problem is expressed through the Fenchel conjugates \(\phi_i^*\) and \(g^*\).
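A standard form of the dual consistent with this primal (the SDCA-type dual, written here as a sketch rather than a quotation of the slide) is
\[ \max_{\alpha\in\mathbb{R}^n} \left[ D(\alpha) \equiv -\lambda\, g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n A_i\alpha_i\right) - \frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i)\right]. \]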

Intuition behind QUARTZ: Fenchel's inequality gives weak duality, and the case of equality yields the optimality conditions that drive the method.
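Spelled out (a sketch assuming differentiability; the slide showed these relations graphically): Fenchel's inequality gives \(P(w) \ge D(\alpha)\) for every \(w\) and \(\alpha\) (weak duality), with equality characterized by the optimality conditions
\[ w = \nabla g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n A_i \alpha_i\right), \qquad \alpha_i = -\nabla \phi_i(A_i^\top w), \quad i = 1,\dots,n. \]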

The Primal-Dual Update. STEP 1: primal update. STEP 2: dual update. Both steps are derived from the optimality conditions.
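Schematically (a sketch of the QUARTZ-style update reconstructed from the structure above, not a verbatim quotation of the slide), with \(\bar{\alpha}^t = \frac{1}{\lambda n}\sum_i A_i\alpha_i^t\), sampling probabilities \(p_i\) and a step parameter \(\theta\):
\[ w^{t+1} = (1-\theta)\,w^t + \theta\,\nabla g^*(\bar{\alpha}^t), \qquad \alpha_i^{t+1} = \Big(1-\tfrac{\theta}{p_i}\Big)\alpha_i^t - \tfrac{\theta}{p_i}\,\nabla\phi_i\big(A_i^\top w^{t+1}\big) \;\;\text{for } i\in\hat{S}_t, \]
and \(\alpha_i^{t+1} = \alpha_i^t\) for the coordinates not sampled.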

STEP 1: Primal update. STEP 2: Dual update. The only extra bookkeeping is maintaining the aggregate \(\frac{1}{\lambda n}\sum_i A_i \alpha_i\), which can be updated incrementally.
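A minimal Python sketch of this structure (assumed details: squared loss, L2 regularizer so that \(\nabla g^*\) is the identity, uniform serial sampling, and an illustrative step size theta; this shows the bookkeeping pattern, not the tuned algorithm of the paper):

```python
import numpy as np

def quartz_like(A, y, lam, theta, iters, seed=0):
    """Schematic primal-dual coordinate ascent loop with incremental bookkeeping.

    Assumes phi_i(z) = 0.5 * (z - y_i)**2 (so grad phi_i(z) = z - y_i)
    and g(w) = 0.5 * ||w||^2 (so grad g* is the identity map).
    A : (n, d) data matrix, row i is A_i.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    alpha = np.zeros(n)          # dual variables
    abar = np.zeros(d)           # abar = (1 / (lam * n)) * sum_i A_i * alpha_i
    w = np.zeros(d)
    p = 1.0 / n                  # uniform serial sampling: p_i = 1/n
    for _ in range(iters):
        # Step 1: primal update (grad g*(abar) = abar for the L2 regularizer).
        w = (1 - theta) * w + theta * abar
        # Step 2: dual update on one randomly sampled coordinate i.
        i = rng.integers(n)
        grad_phi = A[i] @ w - y[i]
        alpha_new = (1 - theta / p) * alpha[i] - (theta / p) * grad_phi
        # Maintain abar incrementally instead of recomputing the full sum.
        abar += (alpha_new - alpha[i]) * A[i] / (lam * n)
        alpha[i] = alpha_new
    return w

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(quartz_like(A, y, lam=0.1, theta=0.05, iters=2000))
```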

Randomized Primal-Dual Methods:
- SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
- mSDCA: M. Takáč, A. Bijral, P. Richtárik & N. Srebro, 03/2013
- ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
- AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
- DisDCA: T. Yang, 2013
- Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
- APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
- SPDC: Y. Zhang & L. Xiao, 09/2014
- QUARTZ: Z. Qu, P. Richtárik & T. Zhang, 11/2014

Convergence Theorem. Key assumption: Expected Separable Overapproximation (ESO). The resulting rate involves a convex combination constant.
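Written out (it reappears on the ESO computation slide near the end of the talk), the ESO assumption asks for sampling probabilities \(p_i = \mathbf{P}(i\in\hat{S})\) and parameters \(v_1,\dots,v_n\) such that
\[ \mathbf{E}\left\|\sum_{i\in\hat{S}} A_i\alpha_i\right\|^2 \;\leq\; \sum_{i=1}^n p_i\, v_i\,\|\alpha_i\|^2 \qquad \text{for all } \alpha. \]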

Iteration Complexity Result (*)
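For orientation, the main complexity result of the arXiv paper has the following flavour (reproduced here from memory, so treat constants and log factors as indicative rather than exact): if the losses \(\phi_i\) are \(1/\gamma\)-smooth, \(g\) is 1-strongly convex and
\[ T \;\geq\; \max_i\left(\frac{1}{p_i} + \frac{v_i}{p_i\,\lambda\gamma n}\right)\log\!\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right), \]
then \(\mathbf{E}\big[P(w^T)-D(\alpha^T)\big] \le \epsilon\).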

Complexity Results for Serial Sampling
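As an indicative special case (a sketch, not a quotation of the slide's table), plugging the uniform serial sampling \(p_i = 1/n\) with \(v_i = \|A_i\|^2\) into the bound above gives
\[ \left(n + \frac{\max_i \|A_i\|^2}{\lambda\gamma}\right)\log\frac{1}{\epsilon}, \]
the familiar SDCA-type rate, while the optimal (importance) serial sampling improves the maximum over \(i\) to an average.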

Experiment: Quartz vs SDCA, uniform vs optimal sampling

QUARTZ with Standard Mini-Batching
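For reference, "standard mini-batching" here is the tau-nice sampling mentioned on the conclusion slide: \(\hat{S}\) is a uniformly random subset of \(\{1,\dots,n\}\) of size \(\tau\), so that
\[ p_i = \mathbf{P}(i\in\hat{S}) = \frac{\tau}{n} \qquad \text{for every } i. \]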

Data Sparsity. A normalized measure of average sparsity of the data, interpolating between the two extremes of "fully sparse data" and "fully dense data".

Iteration Complexity Results

Theoretical Speedup Factor. Linear speedup up to a certain data-independent mini-batch size, with further data-dependent speedup beyond it.

Plots of the Theoretical Speedup Factor, showing the linear speedup regime up to a data-independent mini-batch size and the further data-dependent speedup.

Theoretical vs Practical Speedup. Datasets: astro_ph (sparsity 0.08%, n = 29,882) and cov1 (sparsity 22.22%, n = 522,911).

Comparison with Accelerated Mini-Batch Primal-Dual Methods.

Distribution of Data. The data matrix (n = # dual variables) is distributed across the nodes.

Distributed Sampling Random set of dual variables
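Consistent with the complexity formula shown a few slides below (this is a reading of that formula, not text from this slide), the sampling used is the \((c,\tau)\)-distributed sampling: the \(n\) dual variables are split into \(c\) groups of size \(n/c\), and each of the \(c\) nodes independently picks \(\tau\) variables from its own group uniformly at random, so that
\[ |\hat{S}| = c\tau, \qquad p_i = \mathbf{P}(i\in\hat{S}) = \frac{c\tau}{n}. \]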

Distributed Sampling & Distributed Coordinate Descent. Previously studied (not in the primal-dual setup):
- Peter Richtárik and Martin Takáč, Distributed coordinate descent for learning with big data, arXiv, 2013
- Olivier Fercoq, Z. Qu, Peter Richtárik and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, 2014 IEEE Int. Workshop on Machine Learning for Signal Processing, 2014
- Jakub Marecek, Peter Richtárik and Martin Takáč, Fast distributed coordinate descent for minimizing partially separable functions, arXiv
Settings covered: strongly convex & smooth; convex & smooth.

Complexity of Distributed QUARTZ \[\frac{n}{c\tau} + \max_i\frac{\lambda_{\max}\left( \sum_{j=1}^d \left(1+\frac{(\tau-1)(\omega_j-1)}{\max\{n/c-1,1\}}+ \left(\frac{\tau c}{n} - \frac{\tau-1}{\max\{n/c-1,1\}}\right) \frac{\omega_j'- 1}{\omega_j'}\omega_j\right) A_{ji}^\top A_{ji}\right)}{\lambda\gamma c\tau} \]

Reallocating Load: Theoretical Speedup

Theoretical vs Practical Speedup

More on ESO. With ESO, the joint second-order/curvature information of the data is lost, but local second-order/curvature information is what we get to keep.

Computation of ESO Parameters. Lemma (QR'14b): for the data matrix \(A = [A_1,A_2,\dots,A_n]\) and a sampling \(\hat{S}\) with probabilities \(p\), \[ \mathbf{E} \left\| \sum_{i\in \hat{S}} A_i \alpha_i\right\|^2 \;\leq\; \sum_{i=1}^n p_i v_i\|\alpha_i\|^2 \;\;\text{(for all } \alpha\text{)} \quad\Longleftrightarrow\quad P \circ A^\top A \preceq \mathrm{Diag}(p\circ v), \] where \(P\) is the matrix of pairwise inclusion probabilities of the sampling and \(\circ\) denotes the elementwise product.
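A small numerical companion to the lemma (a sketch under assumptions: tau-nice sampling, whose pairwise inclusion probabilities have a simple closed form, and a candidate \(v_i = \sum_j (1 + (\omega_j-1)(\tau-1)/(n-1)) A_{ji}^2\) taken from the related coordinate descent literature; neither the function name nor the check routine is from the slides):

```python
import numpy as np

def tau_nice_eso_check(A, tau):
    """Check the lemma's matrix inequality P o (A^T A) <= Diag(p * v) numerically.

    A : (d, n) data matrix, column i is A_i.
    Returns (v, ok): a candidate ESO vector and whether the inequality holds.
    """
    d, n = A.shape
    p = np.full(n, tau / n)                         # p_i = P(i in S-hat)
    # Pairwise inclusion probabilities of tau-nice sampling:
    # P_ii = tau/n, and P_ij = tau*(tau-1)/(n*(n-1)) for i != j.
    P = np.full((n, n), tau * (tau - 1) / (n * (n - 1)))
    np.fill_diagonal(P, tau / n)
    # Candidate v_i = sum_j (1 + (omega_j - 1)*(tau - 1)/(n - 1)) * A_ji^2.
    omega = (A != 0).sum(axis=1).astype(float)      # nonzeros per row (feature)
    weights = 1 + (omega - 1) * (tau - 1) / max(n - 1, 1)
    v = (weights[:, None] * A ** 2).sum(axis=0)
    M = np.diag(p * v) - P * (A.T @ A)              # "o" = elementwise product
    ok = np.linalg.eigvalsh(M).min() >= -1e-10
    return v, ok

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
print(tau_nice_eso_check(A, tau=2))
```

On this toy example the smallest eigenvalue comes out as (numerically) zero, so the candidate \(v\) is tight here.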

Conclusion
- QUARTZ (randomized coordinate ascent method with arbitrary sampling)
  - Direct primal-dual analysis (for arbitrary sampling): optimal serial sampling, tau-nice sampling (mini-batch), distributed sampling
  - Theoretical speedup factor, which is a very good predictor of the practical speedup factor; depends on both the sparsity and the condition number; shows a weak dependence on how the data is distributed
- Accelerated QUARTZ?
- Randomized fixed point algorithm with relaxation?
- …?