Peter Richtárik Randomized Dual Coordinate Ascent with Arbitrary Sampling 1 st UCL Workshop on the Theory of Big Data – London– January 2015.

Slides:

Advertisements

Similar presentations

Peter Richtárik Parallel coordinate Simons Institute for the Theory of Computing, Berkeley Parallel and Distributed Algorithms for Inference and Optimization,

Advertisements

Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh Lehigh University December 3, 2014.

Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh ETH Zurich November 3, 2014.

Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014.

Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv: )

Peter Richtarik Why parallelizing like crazy and being lazy can be good.

Accelerated, Parallel and PROXimal coordinate descent IPAM February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv: )

Linear Regression.

Multi-Label Prediction via Compressed Sensing By Daniel Hsu, Sham M. Kakade, John Langford, Tong Zhang (NIPS 2009) Presented by: Lingbo Li ECE, Duke University.

Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh.

Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh BASP Frontiers Workshop January 28, 2014.

Peter Richtárik Distributed Coordinate Descent For Big Data Optimization Numerical Algorithms and Intelligent Software - Edinburgh – December 5,

Raef Bassily Adam Smith Abhradeep Thakurta Penn State Yahoo! Labs Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds Penn.

Optimization Tutorial

Peter Richtárik Randomized Dual Coordinate Ascent with Arbitrary Sampling Foundations of Computational Mathematics – Montevideo, Uruguay – December 2014.

Peter Richtarik School of Mathematics Optimization with Big Data * in a billion dimensional space on a foggy day Extreme* Mountain Climbing =

CMPUT 466/551 Principal Source: CMU

Graph Laplacian Regularization for Large-Scale Semidefinite Programming Kilian Weinberger et al. NIPS 2006 presented by Aggeliki Tsoli.

Chapter 2: Lasso for linear models

Peter Richtárik Parallel coordinate NIPS 2013, Lake Tahoe descent methods.

Distributed Optimization with Arbitrary Local Solvers

Classification and Prediction: Regression Via Gradient Descent Optimization Bamshad Mobasher DePaul University.

Zheng Qu University of Edinburgh Optimization & Big Data Workshop Edinburgh, 6 th to 8 th May, 2015 Randomized dual coordinate ascent with arbitrary sampling.

1 PEGASOS Primal Efficient sub-GrAdient SOlver for SVM Shai Shalev-Shwartz Yoram Singer Nati Srebro The Hebrew University Jerusalem, Israel YASSO = Yet.

Support Vector Machines Based on Burges (1998), Scholkopf (1998), Cristianini and Shawe-Taylor (2000), and Hastie et al. (2001) David Madigan.

Efficient and Numerically Stable Sparse Learning Sihong Xie 1, Wei Fan 2, Olivier Verscheure 2, and Jiangtao Ren 3 1 University of Illinois at Chicago,

EE 685 presentation Optimization Flow Control, I: Basic Algorithm and Convergence By Steven Low and David Lapsley Asynchronous Distributed Algorithm Proof.

Kernel Methods and SVM’s. Predictive Modeling Goal: learn a mapping: y = f(x;  ) Need: 1. A model structure 2. A score function 3. An optimization strategy.

Optimality Conditions for Nonlinear Optimization Ashish Goel Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A.

Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh SIAM Annual Meeting, Chicago July 7, 2014.

Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization Tyler B. Johnson and Carlos Guestrin University of Washington.

The Role of Specialization in LDPC Codes Jeremy Thorpe Pizza Meeting Talk 2/12/03.

Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013.

Peter Richtárik Coordinate Descent Methods with Arbitrary Sampling Optimization and Statistical Learning – Les Houches – France – January 11-16, 2015.

Fast Max–Margin Matrix Factorization with Data Augmentation Minjie Xu, Jun Zhu & Bo Zhang Tsinghua University.

1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.

Linear Programming Boosting by Column and Row Generation Kohei Hatano and Eiji Takimoto Kyushu University, Japan DS 2009.

Learning With Structured Sparsity

Efficient and Numerically Stable Sparse Learning Sihong Xie 1, Wei Fan 2, Olivier Verscheure 2, and Jiangtao Ren 3 1 University of Illinois at Chicago,

CS Statistical Machine learning Lecture 18 Yuan (Alan) Qi Purdue CS Oct

The Group Lasso for Logistic Regression Lukas Meier, Sara van de Geer and Peter Bühlmann Presenter: Lu Ren ECE Dept., Duke University Sept. 19, 2008.

EE 685 presentation Optimization Flow Control, I: Basic Algorithm and Convergence By Steven Low and David Lapsley.

Sparse Kernel Methods 1 Sparse Kernel Methods for Classification and Regression October 17, 2007 Kyungchul Park SKKU.

Linear Models for Classification

Recitation4 for BigData Jay Gu Feb LASSO and Coordinate Descent.

Ultra-high dimensional feature selection Yun Li

Deep Learning and Deep Reinforcement Learning. Topics 1.Deep learning with convolutional neural networks 2.Learning to play Atari video games with Deep.

Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois.

Why does it work? We have not addressed the question of why does this classifier performs well, given that the assumptions are unlikely to be satisfied.

Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.

The Duality Theorem Primal P: Maximize

Zhu Han University of Houston Thanks for Dr. Mingyi Hong’s slides

Multiplicative updates for L1-regularized regression

Zhu Han University of Houston Thanks for Dr. Mingyi Hong’s slides

Dan Roth Department of Computer and Information Science

Boosting and Additive Trees (2)

Generalization and adaptivity in stochastic convex optimization

Machine Learning Basics

Optimization in Machine Learning

Probabilistic Models for Linear Regression

CSCI B609: “Foundations of Data Science”

Estimating Networks With Jumps

PEGASOS Primal Estimated sub-GrAdient Solver for SVM

Convolutional networks

Biointelligence Laboratory, Seoul National University

The loss function, the normal equation,

Mathematical Foundations of BME Reza Shadmehr

CS639: Data Management for Data Science

Patterson: Chap 1 A Review of Machine Learning

Presentation transcript:

Peter Richtárik Randomized Dual Coordinate Ascent with Arbitrary Sampling 1 st UCL Workshop on the Theory of Big Data – London– January 2015

Coauthors Zheng Qu Edinburgh, Mathematics Zheng Qu, P.R. and Tong Zhang Randomized dual coordinate ascent with arbitrary sampling arXiv: , 2014 Tong Zhang Rutgers, Statistics Baidu, Big Data Lab

Part 1 MACHINE LEARNING

Statistical nature of data \[ \phi_i(a)=\tfrac{1}{2\gamma}(a-b_i)^2 \] \[\Downarrow\] \[ \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) =\frac{1}{2\gamma}\|Aw-b\|_2^2 \] Data (e.g., image, text, measurements, …) Label \[A_i \in \mathbb{R}^{d\times m}, \qquad y_i \in \mathbb{R}^m \]

Prediction of labels from data \[(A_i,y_i)\sim \emph{Distribution}\] \[A_i^\top w \approx y_i\] Find Such that when (data, label) pair is drawn from the distribution Then Predicted labelTrue label Linear predictor

Measure of Success \[\mathbf{E} \left[ loss(A_i^\top w, y_i)\right] \] We want the expected loss (=risk) to be small: datalabel

Finding a Linear Predictor via Empirical Risk Minimization \[\min_{w\in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n loss(A_i^\top w, y_i)\] \[(A_1,y_1), (A_2,y_2), \dots, (A_n,y_n)\sim \emph{Distribution}\] Draw i.i.d. data (samples) from the distribution Output predictor which minimizes the empirical risk:

Part 2 OPTIMIZATION

Primal Problem d = # features (parameters) n = # samples 1 - strongly convex function (regularizer) \[\min_{w \in \mathbb{R }^d}\;\; \left[ P(w) \equiv \frac{1}{n} \sum_{i=1} ^n \phi_i(A_i^ \top w) + \lambda g(w)\right] \] - smooth & convex regularization parameter

Assumption 1 Loss functions have Lipschitz gradient \[ \|\nabla \phi_i(a)-\nabla \phi_i(a')\|\;\; \leq\;\; \frac{1}{\gamma}\;\; \|a-a'\|, \quad a,a'\in \mathbb{R}^m\] Lipschitz constant

Assumption 2 \[g(w)\geq g(w') + \left + \frac{1}{2}\|w-w'\|^2, \quad w,w'\in \mathbb{R}^d\] Regularizer is 1-strongly convex subgradient

Dual Problem - strongly convex \[D(\alpha) \equiv - \lambda g^*\left(\frac{1}{\la mbda n}\sum_{i=1}^n A_i\alpha_i\right) - \frac{1}{n}\sum_{i= 1}^n \phi_i^*(- \alpha_i) 1 – smooth & convex \[\max_{\alpha=(\alpha_1,\dots,\alpha_n) \in \mathbb{R}^{N}=\mathbb{R}^{nm}} D(\alpha)\] \[g^*(w') = \max_{w\in \mathbb{R}^d} \left\{(w')^\top w - g(w)\right\}\] \[\phi_i^*(a') = \max_{a\in \mathbb{R}^m} \left\{(a')^\top a - \phi_i(a)\right\}\]

Part 3 ALGORITHM

Quartz

Fenchel Duality \[ P(w) - D(\alpha) \;\; = \;\; \lambda \left( g(w) + g^*\left(\bar{\alpha}\right)\right) + \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \phi_i^*(-\alpha_i) = \] \[\lambda (g(w) + g^* \left(\bar{\alpha}\right)- \left\langle w, \bar{\alpha} \right\rangle ) + \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \phi_i^*(-\alpha_i) + \left\langle A_i^\top w, \alpha_i \right\rangle\] \[\bar{\alpha} = \frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\] \[w = \nabla g^*(\bar{\alpha})\] \[\alpha_i = -\nabla \phi_i(A_i^\top w)\] Weak duality Optimality conditions

The Algorithm

Quartz: Bird’s Eye View \[(\alpha^t,w^t) \qquad \Rightarrow \qquad (\alpha^{t+1},w^{t+1})\] \[w^{t+1} \leftarrow (1-\theta)w^t + \theta {\color{red}\nabla g^*(\bar{\alpha}^t)}\] \[\alpha_i^{t+1} \leftarrow \left(1-\frac{\theta}{{\color{blue}p_i}}\right) \alpha_i^{t} + \frac{\theta}{{\color{blue}p_i}}{\color{red} \left(-\nabla \phi_i(A_i^\top w^{t+1})\right)} \] STEP 1: PRIMAL UPDATE STEP 2: DUAL UPDATE Choose a random set ${\color{blue}S_t}$ of dual variables ${\color{blue}p_i = \mathbf{P}(i\in S_t)}$

The Algorithm STEP 1 STEP 2 Convex combination constant

Randomized Primal-Dual Methods SDCA: SS Shwartz & T Zhang, 09/2012 mSDCA M Takac, A Bijral, P R & N Srebro, 03/2013 ASDCA: SS Shwartz & T Zhang, 05/2013 AccProx-SDCA: SS Shwartz & T Zhang, 10/2013 DisDCA: T Yang, 2013 Iprox-SDCA: P Zhao & T Zhang, 01/2014 APCG: Q Lin, Z Lu & L Xiao, 07/2014 SPDC: Y Zhang & L Xiao, 09/2014 Quartz: Z Qu, P R & T Zhang, 11/2014 {\footnotesize \begin{tabular}{|c|c|c|c|c|c|c|c|c|c} \hline Algorithm & 1-nice & 1-optimal & $\tau$-nice & arbitrary &{\begin{tabular}{c}additional\\speedup\end{tabular} } & { \begin{tabular}{c}direct\\p-d\\analysis\end{tabular}} & acceleration\\ \hline SDCA & $\bullet$ & & & & & & \\ \hline mSDCA & $\bullet$ & & $\bullet$ & & $\bullet$ & & \\ \hline ASDCA & $\bullet$ & & $\bullet$ & & & & $\bullet$ \\ \hline AccProx-SDCA &$\bullet$ & & & & & &$\bullet$\\ \hline DisDCA &$\bullet$ & &$\bullet$ & & & & \\ \hline Iprox-SDCA & $\bullet$ & $\bullet$ & & & & & \\ \hline APCG &$\bullet$ & & & & & &$\bullet$ \\ \hline SPDC &$\bullet$ & $\bullet$ &$\bullet$ & & & $\bullet$ &$\bullet$ \\ \hline \bf{Quartz} &{\color{red}$\bullet$} &{\color{red}$\bullet$} &{\color{red}$\bullet$} &{\color{red}$\bullet$} &{\color{red}$\bullet$} & {\color{red}$\bullet$} & \\ \hline \end{tabular} }

Part 4 MAIN RESULT

Assumption 3 (Expected Separable Overapproximation) \[ \mathbf{E} \left\| \sum_{i\in \hat{S}} A_i \alpha_i\right\|^2 \;\;\leq \;\; \sum_{i=1}^n {\color{blue} p_i} {\color{red} v_i}\|\alpha_i\|^2 \] \[ {\color{blue} p_i} = \mathbb{P}(i\in \hat{S}) \] Parameters ${\color{red}v_1,\dots,\color{red}v_n}$ satisfy: inequality must hold for all \[\alpha_1,\dots,\alpha_n \in \mathbb{R}^m\]

Complexity Theorem (QRZ’14) \[k\;\;\geq \;\; \max_i \left( \frac{1}{{\color{blue}p_i}} + \frac{{\color{red}v_i}}{{\color{blue}p_i} \lambda \gamma n}\right) \log \left( \frac{P(w^0)-D(\alpha^0)}{\epsilon} \right)\] \[\mathbf{E}\left[P(w^t)-D(\alpha^t)\right] \leq \epsilon\]

Part 5a UPDATING ONE DUAL VARIABLE AT A TIME \[A = [A_1,A_2,A_3,A_4,A_5] = \left(\begin{matrix} 0 & 0 & 6 & 4 & 9\\ 0 & 3 & 0 & 0 & 0\\ 0 & 0 & 3 & 0 & 1\\ 1 & 8 & 0 & 0 & 0\\ \end{matrix}\right)\]

\begin{table} \begin{tabular}{|c|c|} \hline & \\ Optimal sampling & $ \displaystyle n +\frac{\tfrac{1}{n}\sum_{i=1}^n L_i}{\lambda \gamma}$ \\ & \\ \hline & \\ Uniform sampling & $\displaystyle n +\frac{\max_i L_i}{\lambda \gamma}$ \\ & \\ \hline \end{tabular} \end{table} Complexity of Quartz specialized to serial sampling

Data \begin{table} \begin{tabular}{|c|c|c|c|} \hline {\bf Dataset} & \begin{tabular}{c}{\bf \# Samples}\\ $n$\end{tabular} & \begin{tabular}{c}{\bf \# features}\\ $d$ \end{tabular}& \begin{tabular}{c}{\bf density} \\ $nnz(A)/(nd)$\end{tabular}\\ \hline astro-ph & 29,882 & 99,757 & 0.08\% \\ \hline CCAT & 781,265 & 47,236 & 0.16\% \\ \hline cov1 & 522,911 & 54 & 22.22\%\\ \hline w8a & 49,749 & 300 & 3.91\% \\ \hline ijcnn1 & 49,990 & 22 & 59.09\% \\ \hline webspam & 350,000 & 254 & 33.52\% \\\hline \end{tabular} \end{table}

Standard primal update Experiment: Quartz vs SDCA, uniform vs optimal sampling “Aggressive” primal update Data = \texttt{cov1}, \quad $n=522,911$, \quad$\lambda = 10^{-6}$

Part 5b TAU-NICE SAMPLING (STANDARD MINIBATCHING)

Data sparsity A normalized measure of average sparsity of the data “Fully sparse data”“Fully dense data”

Complexity of Quartz \begin{table} \begin{tabular}{|c|c|} \hline &\\ \begin{tabular}{c}Fully sparse data\\ (${\color{blue}\tilde{\omega}=1}$) \end{tabular} & $\displaystyle \frac{n}{{\color{red}\tau}}+\frac{\max_i L_i}{\lambda \gamma {\color{red}\tau}}$\\ & \\ \hline & \\ \begin{tabular}{c}Fully dense data \\(${\color{blue}\tilde{\omega}=n}$) \end{tabular} & $\displaystyle \frac{n}{{\color{red}\tau}}+\frac{\max_i L_i}{\lambda \gamma}$ \\ & \\ \hline & \\ \begin{tabular}{c}Any data\\ (${\color{blue}1\leq \tilde{\omega}\leq n}$) \end{tabular} & $\displaystyle \frac{n}{{\color{red}\tau}}+\frac{\left(1+\frac{({\color{blue}\tilde{\omega}}-1)({\color{red}{\color{red}\tau}}-1)}{n-1}\right)\max_i L_i}{\lambda \gamma {\color{red}\tau}} $ \\ & \\ \hline \end{tabular} \end{table}

Speedup \[ \frac{T(1)}{T({\color{red}\tau})} \geq \frac{{\color{red}\tau}}{1+\frac{\tilde{\omega}-1}{n-1}}\geq \frac{{\color{red}\tau}}{2} \] \[1\leq {\color{red}\tau} \leq 2+\lambda \gamma n\] Assume the data is normalized: Then: Linear speedup up to a certain data-independent minibatch size: \[ T({\color{red}\tau}) \;\; = \;\; \frac{ \left(1+\frac{({\color{blue}\tilde{\omega}}-1)({\color{red}\tau}-1)}{(n-1)(1+\lambda \gamma n)}\right) }{{\color{red}\tau}} \times T({\color{red}1}) \] Further data-dependent speedup, up to the extreme case: \[ T({\color{red}\tau}) \leq \frac{2}{{\color{red}\tau}} \times T({\color{red}1}) \] \[ {\color{blue}\tilde{\omega}} = O(1) \] \[ T({\color{red}\tau}) = {\cal O} \left(\frac{T({\color{red}1})}{\color{red}\tau}\right)\]

Speedup: sparse data

Speedup: denser data

Speedup: fully dense data

astro_ph: n = 29,882 density = 0.08%

CCAT: n = 781,265 density = 0.16%

\begin{tabular}{c|c|c} \hline Algorithm & Iteration complexity & $g$\\ \hline && \\ SDCA & $\displaystyle n + \frac{1}{\lambda \gamma}$ & $\tfrac{1}{2}\|\cdot\|^2$ \\ &&\\ \hline &&\\ ASDCA & $\displaystyle 4 \times \max\left\{\frac{n}{{\color{red}\tau}},\sqrt{\frac{n}{\lambda\gamma {\color{red}{\color{red}\tau}}}},\frac{1}{\lambda \gamma {\color{red}\tau}},\frac{n^{\frac{1}{3}}}{(\lambda \gamma {\color{red}\tau})^{\frac{2}{3}}}\right\}$ & $\tfrac{1}{2}\|\cdot\|^2$\\ &&\\ \hline &&\\ SPDC & $\displaystyle \frac{n}{{\color{red}\tau}}+\sqrt{\frac{n}{\lambda \gamma {\color{red}\tau}}}$ & general \\ &&\\ \hline &&\\ \bf{Quartz } & $\displaystyle \frac{n}{{\color{red}\tau}}+\left(1+\frac{(\tilde \omega -1)({\color{red}\tau}-1)}{n-1}\right)\frac{1}{\lambda \gamma {\color{red}\tau}}$ & general\\ &&\\ \hline \end{tabular} Primal-dual methods with tau-nice sampling SS-Shwartz & T Zhang ‘13 Y Zhang & L Xiao ‘14

\begin{tabular}{c|c|c|c|c} \hline Algorithm & $\gamma\lambda n = \Theta(\tfrac{1}{\tau})$ & $\gamma\lambda n = \Theta(1)$ & $\gamma\lambda n = \Theta(\tau)$ & $\gamma\lambda n = \Theta(\sqrt{n})$\\ \hline & $\kappa = n\tau$ & $\kappa = n$ & $\kappa = n/\tau$ & $\kappa = \sqrt{n}$\\ \hline &&&&\\ SDCA & $n \tau$ & $n$ & $n$ & $n$\\ &&&&\\ \hline &&&&\\ ASDCA & $n$ & $\displaystyle \frac{n}{\sqrt{\tau}}$ & $\displaystyle \frac{n}{\tau}$ & $\displaystyle \frac{n}{\tau} + \frac{n^{3/4}}{\sqrt{\tau}}$\\ &&&&\\ \hline &&&&\\ SPDC & $n$ & $\displaystyle \frac{n}{\sqrt{\tau}}$ & $\displaystyle \frac{n}{\tau}$ & $\displaystyle \frac{n}{\tau} + \frac{n^{3/4}}{\sqrt{\tau}}$\\ &&&&\\ \hline &&&&\\ \bf{Quartz } & $\displaystyle n + \tilde{\omega}\tau$ & $\displaystyle \frac{n}{\tau} +\tilde{\omega}$ & $\displaystyle \frac{n}{\tau}$ & $\displaystyle \frac{n}{\tau} + \frac{\tilde{\omega}}{\sqrt{n}}$\\ &&&&\\ \hline \end{tabular} For sufficiently sparse data, Quartz wins even when compared against accelerated methods Accelerated

Part 6 ESO Zheng Qu and P.R. Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation arXiv: , 2014

Computation of ESO parameters \[ \mathbf{E} \left\| \sum_{i\in \hat{S}} A_i \alpha_i\right\|^2 \;\;\leq \;\; \sum_{i=1}^n {\color{blue} p_i} {\color{red} v_i}\|\alpha_i\|^2 \] \[\Updownarrow\] \[ P \circ A^\top A \preceq Diag({\color{blue}p}\circ {\color{red}v})\] Lemma (QR’14b) {For simplicity, assume that m = 1} \[A = [A_1,A_2,\dots,A_n]\]

ESO \[\lambda'(M) \equiv \max_{\alpha\in \mathbb{R}^n} \frac{\alpha^\top M \alpha}{\alpha^\top Diag(M) \alpha}\] \[{\color{red} v_i} = \underbrace{\min\{\lambda'(P(\hat{S})),\lambda'(A^\top A)\}}_{\equiv\beta(\hat{S},A)} \|A_i\|^2\] For any sampling, ESO holds with Theorem (QR’14b) where \[\mathbf{P}(|\hat{S}|=\tau)=1 \quad \Rightarrow \quad\lambda'(P(\hat{S}))=\tau\] \[1 \leq \lambda'(A^\top A) \leq \underbrace{\max_j \|A_{j:}\|_0}_{\omega} \leq n\]

ESO \[\lambda'(M) \equiv \max_{\alpha\in \mathbb{R}^n} \frac{\alpha^\top M \alpha}{\alpha^\top Diag(M) \alpha}\] \[{\color{red} v_i} = \underbrace{\min\{\lambda'(P(\hat{S})),\lambda'(A^\top A)\}}_{\equiv\beta(\hat{S},A)} \|A_i\|^2\] \[\mathbf{P}(|\hat{S}|=\tau)=1 \quad \Rightarrow \quad\lambda'(P(\hat{S}))=\tau\] \[1 \leq \lambda'(A^\top A) \leq \underbrace{\max_j \|A_{j:}\|_0}_{\omega} \leq n\]

Part 7 DISTRIBUTED QUARTZ

Distributed Quartz: Perform the Dual Updates in a Distributed Manner Quartz STEP 2: DUAL UPDATE Data required to compute the update

Distribution of Data n = # dual variables Data matrix

Distributed sampling

Random set of dual variables

Distributed sampling & distributed coordinate descent P.R. and Martin Takáč Distributed coordinate descent for learning with big data arXiv: , 2013 Previously studied (not in the primal-dual setup): Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč Fast distributed coordinate descent for minimizing non strongly convex losses 2014 IEEE Int Workshop on Machine Learning for Signal Processing, May 2014 Jakub Marecek, P.R. and Martin Takáč Fast distributed coordinate descent for minimizing partially separable functions arXiv: , June strongly convex & smooth convex & smooth

Complexity of distributed Quartz \[\frac{n}{c\tau} + \max_i\frac{\lambda_{\max}\left( \sum_{j=1}^d \left(1+\frac{(\tau-1)(\omega_j-1)}{\max\{n/c-1,1\}}+ \left(\frac{\tau c}{n} - \frac{\tau-1}{\max\{n/c-1,1\}}\right) \frac{\omega_j'- 1}{\omega_j'}\omega_j\right) A_{ji}^\top A_{ji}\right)}{\lambda\gamma c\tau} \]

Reallocating load: theoretical speedup n = 1,000,000 density = 100% n = 1,000,000 density = 0.01%

Reallocating load: experiment Data: webspam n = 350,000 density = 33.51%

Experiment Machine: 128 nodes of Hector Supercomputer (4096 cores) Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB Algorithm: with c = 512 P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv: , 2013

LASSO: 3TB data nodes

Experiment Machine: 128 nodes of Archer Supercomputer Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A) Algorithm: Hydra 2 with c = 256 Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int Workshop on Machine Learning for Signal Processing, 2014

LASSO: 5TB data (d = 50b) nodes

Related Work Zheng Qu and P.R., Coordinate ascent with arbitrary sampling II: expected separable overapproximation, arXiv: , 2014 Zheng Qu and P.R., Coordinate ascent with arbitrary sampling I: algorithms and complexity, arXiv: , 2014 P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, arXiv: , 2013

END

Part 5.4 PRODUCT SAMPLING

Product sampling \[X_1 := \{1,2\}, \quad X_2 := \{3,4,5\}\] Product sampling: Non-serial importance sampling ! In general: Each row (feature) has nonzeros in a single group of columns (examples) only

Complexity Meaning of the above notation:

Part X LOSS FUNCTIONS & REGULARIZERS

Example 1: Quadratic Loss \[ \phi_i(a)=\tfrac{1}{2\gamma}(a-b_i)^2 \] \[\Downarrow\] \[ \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) =\frac{1}{2\gamma}\|Aw-b\|_2^2 \]

Example 2: Logistic Loss \[ \phi_i(a)=\tfrac{4}{\gamma}\log\left(1+e^{-b_i a}\right) \] \[\Downarrow\] \[ \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) =\frac{4}{\gamma}\sum_{i=1}^n \log\left(1+ \exp^{-b_i A_i^\top w}\right) \]

0-1 loss 01 1 \[ \phi_i(a)=\left\{\begin{array} {ll} 1 & \quad a\leq 1 \\ 0 & \quad a\geq 1 \end{array}\right. \]

Hinge loss 01 1 \[ \phi_i(a)=\left\{\begin{array} {ll} 1-a & \quad a\leq 1 \\ 0 & \quad a\geq 1 \end{array}\right. \]

Example 3: Smoothed hinge loss 01 1 \[ \phi_i(a)=\left\{\begin{array} {ll} 1-a-\gamma/2 & \quad a\leq 1-\gamma \\ \displaystyle\frac{(1-a)^2}{2\gamma} & \quad 1-\gamma \leq a \leq 1 \\ 0 & \quad a\geq 1 \end{array}\right. \]

Example 4: Squared hinge loss 01 1 \[ \phi_i(a)=\left\{\begin{array} {ll} \displaystyle \frac{(1-a)^2}{2\gamma} & \quad a\leq 1 \\ 0 & \quad a\geq 1 \end{array}\right. \]