Parallel Coordinate Descent for L1-Regularized Loss Minimization


Parallel Coordinate Descent for L1-Regularized Loss Minimization
Joseph K. Bradley, Aapo Kyrola, Danny Bickson, Carlos Guestrin

L1-Regularized Regression
Example: predicting stock volatility (label) from bigrams in financial reports (features): 5×10^6 features, 3×10^4 samples (Kogan et al., 2009).
Lasso (Tibshirani, 1996) and sparse logistic regression (Ng, 2004) produce sparse solutions, which is useful in high-dimensional settings (# features >> # samples).

From Sequential Optimization...
Many algorithms: gradient descent, stochastic gradient, interior-point methods, hard/soft thresholding, ...
Coordinate descent (a.k.a. Shooting; Fu, 1998) is one of the fastest (Friedman et al., 2010; Yuan et al., 2010).
But what about big problems, e.g., 5×10^6 features and 3×10^4 samples?

...to Parallel Optimization
We use the multicore setting: shared memory, low latency.
We could parallelize:
- Matrix-vector ops (e.g., interior-point methods): not great empirically.
- W.r.t. samples (e.g., stochastic gradient; Zinkevich et al., 2010): best for many samples, not many features, and the analysis does not cover L1.
- W.r.t. features (e.g., Shooting): coordinate descent seems inherently sequential... Surprisingly, no!

Our Work
Shotgun: parallel coordinate descent for L1-regularized regression.
- Parallel convergence analysis: linear speedups up to a problem-dependent limit.
- Large-scale empirical study: 37 datasets, 9 algorithms.
Surprisingly, the simple solution works.

Lasso
Goal: regress y ∈ R^n on A ∈ R^{n×d}, given n samples (Tibshirani, 1996).
Objective: squared error with L1 regularization,
    \min_x F(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1.
I'll speak in terms of the Lasso; the analysis for sparse logistic regression is similar.
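For later reference, the "closed-form minimization" invoked on the next slides is the standard one-dimensional soft-thresholding step; a sketch, assuming (as the analysis slides do) that the columns of A are normalized so that (A^\top A)_{jj} = 1:

```latex
% One-dimensional minimization of F over coordinate j, all other coordinates fixed.
% Assumes the j-th column of A has unit norm, i.e. (A^\top A)_{jj} = 1.
\delta x_j
  = \operatorname*{argmin}_{\delta}\, F(x + \delta e_j)
  = S\!\bigl(x_j - \bigl(A^\top (Ax - y)\bigr)_j,\ \lambda\bigr) - x_j,
\qquad
S(z,\lambda) = \operatorname{sign}(z)\,\max\bigl(|z| - \lambda,\ 0\bigr).
```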

Shooting: Sequential SCD
[Figure: contour plot of the Lasso objective F(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1.]
Stochastic Coordinate Descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009):
While not converged: choose a random coordinate j, then update x_j (closed-form minimization).
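A minimal NumPy sketch of this sequential loop, under the same unit-column-norm assumption; the names (shooting_scd, soft_threshold) are illustrative, not taken from the authors' released code:

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def shooting_scd(A, y, lam, iters=100000, seed=0):
    """Sequential stochastic coordinate descent (Shooting / SCD) for the Lasso.

    Minimizes 0.5 * ||A x - y||^2 + lam * ||x||_1, assuming the columns of A
    have been scaled to unit norm, i.e. diag(A^T A) = 1.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    residual = A @ x - y                                 # equals -y at start; kept up to date
    for _ in range(iters):
        j = rng.integers(d)                              # choose a random coordinate j
        grad_j = A[:, j] @ residual                      # (A^T (A x - y))_j
        new_xj = soft_threshold(x[j] - grad_j, lam)      # closed-form 1-D minimizer
        delta = new_xj - x[j]
        if delta != 0.0:
            residual += delta * A[:, j]                  # maintain residual = A x - y
            x[j] = new_xj
    return x
```

Maintaining the residual A x - y incrementally keeps each update at O(n), or O(nnz of column j) for sparse A.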

Shotgun: Parallel SCD
Is SCD inherently sequential?
Shotgun (parallel SCD): while not converged, on each of P processors: choose a random coordinate j, then update x_j (same update as for Shooting).
Nice case: uncorrelated features. Bad case: correlated features.
Depending on certain properties of the optimization problem, these parallel updates may or may not work; we now take a closer look at what those properties are.
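A sketch of one Shotgun step in the same NumPy setting. True multicore Shotgun applies the P updates asynchronously in shared memory; here, as in the paper's simulated-parallelism experiments, the P updates are computed from a single iterate and applied collectively, which is exactly the situation the next slides analyze. Names are again illustrative:

```python
import numpy as np

def shotgun_step(A, x, residual, lam, P, rng):
    """One simulated Shotgun step: P coordinate updates are computed from the
    same iterate x and then applied collectively (columns of A unit-norm)."""
    d = A.shape[1]
    chosen = rng.choice(d, size=P, replace=False)        # one coordinate per "processor"
    grads = A[:, chosen].T @ residual                    # (A^T (A x - y))_j for each chosen j
    z = x[chosen] - grads
    new_vals = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft-threshold update
    deltas = new_vals - x[chosen]
    x[chosen] = new_vals                                 # collective update of all P coordinates
    residual += A[:, chosen] @ deltas                    # refresh residual = A x - y
    return x, residual
```

With P = 1 this reduces to one iteration of the sequential Shooting loop above.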

Is SCD inherently sequential?
Lasso: \min_x F(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1.
Coordinate update (closed-form minimization): \delta x_j = \operatorname{argmin}_{\delta} F(x + \delta e_j).
Collective update over the set P_t of coordinates chosen in one step: \Delta x = \sum_{j \in P_t} \delta x_j \, e_j.

Is SCD inherently sequential?
Theorem: if A is normalized so that diag(A^\top A) = 1, then for the collective update over the set P_t of updated coordinates,
    F(x + \Delta x) - F(x) \le \underbrace{-\tfrac{1}{2}\sum_{j \in P_t} (\delta x_j)^2}_{\text{sequential progress}} + \underbrace{\tfrac{1}{2}\sum_{j \ne k \in P_t} (A^\top A)_{jk}\, \delta x_j\, \delta x_k}_{\text{interference}}.
This inequality states that the decrease in objective (lower is better) is bounded by the sum of two terms. Sequential progress is always non-positive, which is good; interference can be negative or positive. If only one coordinate is updated (Shooting), the interference term disappears and the bound holds with equality. From here on, A is assumed normalized so that diag(A^\top A) = 1.

Is SCD inherently sequential?
Theorem (as above, with A normalized so that diag(A^\top A) = 1): the bound splits into sequential progress plus interference.
Nice case: uncorrelated features, (A^\top A)_{jk} = 0 for j ≠ k (if the data are centered), so the interference term vanishes.
Bad case: correlated features, (A^\top A)_{jk} ≠ 0, so interference can cancel out the sequential progress.
There is a range of potential for parallelism, and we will now see how to capture this range in terms of the Gram matrix A^\top A.
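A small numeric illustration of the two terms in this bound (a sketch under the same unit-column-norm assumption; the helper bound_terms and the synthetic designs are mine, not from the paper):

```python
import numpy as np

def bound_terms(A, deltas, chosen):
    """Split the bound into 'sequential progress' and 'interference'
    (columns of A assumed unit-norm, so the Gram sub-matrix has unit diagonal)."""
    G = A[:, chosen].T @ A[:, chosen]                    # Gram sub-matrix of updated coordinates
    sequential = -0.5 * np.sum(deltas ** 2)              # always <= 0
    interference = 0.5 * (deltas @ G @ deltas - np.sum(deltas ** 2))   # off-diagonal terms only
    return sequential, interference

rng = np.random.default_rng(0)
n, d, P = 200, 50, 8
deltas = np.ones(P)                                      # P equal-size coordinate steps
chosen = np.arange(P)

A_uncorr = rng.normal(size=(n, d))
A_uncorr /= np.linalg.norm(A_uncorr, axis=0)             # nearly orthogonal columns
A_corr = rng.normal(size=(n, 1)) + 0.1 * rng.normal(size=(n, d))
A_corr /= np.linalg.norm(A_corr, axis=0)                 # highly correlated columns

print(bound_terms(A_uncorr, deltas, chosen))   # interference is small vs. sequential progress
print(bound_terms(A_corr, deltas, chosen))     # interference outweighs sequential progress
```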

Convergence Analysis
Main Theorem (Shotgun convergence): assume the number of parallel updates P is at most about d/\rho, where \rho is the spectral radius of A^\top A. Then after T iterations,
    E[F(x^{(T)})] - F(x^\star) \le c \, \frac{d}{T P},
where F(x^{(T)}) is the final objective, F(x^\star) is the optimal objective, and c is a problem-dependent constant.
This generalizes the bound for Shooting (Shalev-Shwartz & Tewari, 2009), which is the special case P = 1.

Convergence Analysis
The theorem gives a problem-dependent parallelism limit P_max = d/\rho, where \rho is the spectral radius of A^\top A.
Nice case: uncorrelated features, \rho = 1, so up to P_max = d parallel updates are useful.
Bad case: correlated features, \rho can be as large as d (at worst), so P_max = 1 and parallelism does not help.
A sketch for estimating P_max follows.
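A short sketch of how one might estimate this problem-dependent limit numerically. A dense eigenvalue solve is fine for moderate d; the paper's largest problems would need a sparse or iterative estimate of \rho. The function name is illustrative:

```python
import numpy as np

def max_useful_parallelism(A):
    """Estimate P_max ~ d / rho, where rho is the spectral radius of A^T A
    (columns of A assumed scaled to unit norm, so diag(A^T A) = 1)."""
    d = A.shape[1]
    rho = np.linalg.eigvalsh(A.T @ A).max()              # largest eigenvalue of the Gram matrix
    return max(1, int(d / rho))

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 200))
A /= np.linalg.norm(A, axis=0)
print(max_useful_parallelism(A))                         # near-uncorrelated features -> large P_max
```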

Convergence Analysis
Experiments match the theory: up to the threshold P_max, the predicted linear speedups appear.
[Figure: iterations to convergence vs. P (# simulated parallel updates) for two single-pixel-camera datasets: Mug32_singlepixcam (P_max = 158; the last plotted point is 456 parallel updates at 37 iterations) and Ball64_singlepixcam (P_max = 3).]
Note: this is simulated parallelism.

Thus far: Shotgun
The naive parallelization of coordinate descent works!
Theorem: linear speedups up to a problem-dependent limit.
Now for some experiments...

Experiments: Lasso
7 algorithms:
- Shotgun, P = 8 (multicore)
- Shooting (Fu, 1998)
- Interior point (Parallel L1_LS) (Kim et al., 2007)
- Shrinkage (FPC_AS, SpaRSA) (Wen et al., 2010; Wright et al., 2009)
- Projected gradient (GPSR_BB) (Figueiredo et al., 2008)
- Iterative hard thresholding (Hard_l0) (Blumensath & Davies, 2009)
Also ran: GLMNET, LARS, SMIDAS.
35 datasets: # samples n in [128, 209432]; # features d in [128, 5845762].
Two regularization settings, λ = 0.5 and λ = 10, giving less sparse and sparser solutions, respectively.
Optimization details: pathwise optimization (continuation; see the sketch below) and asynchronous Shotgun with atomic operations. These optimizations gave significant speedups overall but reduced the measured speedup from parallelism.
Hardware: 8-core AMD Opteron 8384 (2.69 GHz).
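A minimal sketch of the pathwise optimization (continuation) idea: solve a sequence of Lasso problems with decreasing λ, warm-starting each from the previous solution. The solver signature and the geometric λ schedule are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def lasso_path(A, y, lam_target, solver, n_steps=10):
    """Continuation: sweep lambda from lam_max (all-zero solution) down to
    lam_target, warm-starting each solve from the previous one.

    `solver(A, y, lam, x0)` can be any Lasso solver that accepts a warm start,
    e.g. a warm-startable variant of the Shotgun sketch above.
    """
    lam_max = np.max(np.abs(A.T @ y))                    # smallest lambda with an all-zero solution
    lams = np.geomspace(lam_max, lam_target, n_steps)    # geometric schedule, decreasing
    x = np.zeros(A.shape[1])
    for lam in lams:
        x = solver(A, y, lam, x)                         # warm start keeps each subproblem cheap
    return x
```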

Experiments: Lasso
[Figure: log-log plots of other algorithms' runtimes (sec) vs. Shotgun's runtime (sec) on the Sparse Compressed Imaging and Sparco (van den Berg et al., 2009) dataset groups; points above the diagonal mean Shotgun is faster, below mean Shotgun is slower. Compared algorithms: Shooting, FPC_AS, GPSR_BB, Hard_l0, L1_LS, SpaRSA. Example: on one (dataset, λ) pair, Shotgun took 1.212 s vs. 3.406 s for Shooting.]
(Parallel) Shotgun and Parallel L1_LS used 8 cores.

Experiments: Lasso
[Figure: the same log-log runtime comparison for the Single-Pixel Camera (Duarte et al., 2008) and the large, sparse dataset groups.]
(Parallel) Shotgun and Parallel L1_LS used 8 cores.

Experiments: Lasso
Takeaway: Shooting is already one of the fastest algorithms, and Shotgun provides additional speedups on top of it.
[Figure: same plots as the previous slide: Single-Pixel Camera (Duarte et al., 2008) and large, sparse datasets.]
(Parallel) Shotgun and Parallel L1_LS used 8 cores.

Experiments: Logistic Regression
Coordinate Descent Newton (CDN) (Yuan et al., 2010) uses a line search; extensive tests show CDN is very fast.
Algorithms:
- Shooting (CDN)
- Shotgun CDN
- Stochastic Gradient Descent (SGD), with lazy shrinkage updates (Langford et al., 2009) and the best of 14 learning rates
- Parallel SGD (Zinkevich et al., 2010), which averages the results of 8 SGD instances run in parallel (a sketch follows)
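A rough sketch of the Zinkevich-style Parallel SGD baseline: run independent SGD instances and average their solutions. The per-step shrinkage here is a naive proximal step standing in for Langford et al.'s lazy truncated-gradient updates, and all names, learning rates, and schedules are illustrative assumptions:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def sgd_logreg_l1(args):
    """One SGD instance for L1-regularized logistic regression (labels in {-1, +1})."""
    A, y, lam, lr, epochs, seed = args
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (A[i] @ x)
            grad = -y[i] * A[i] / (1.0 + np.exp(margin))            # logistic loss gradient
            x -= lr * grad
            x = np.sign(x) * np.maximum(np.abs(x) - lr * lam, 0.0)  # shrink toward zero
    return x

def parallel_sgd(A, y, lam, lr=0.1, epochs=5, workers=8):
    """Zinkevich et al. (2010) style: run independent instances in parallel, average them."""
    jobs = [(A, y, lam, lr, epochs, seed) for seed in range(workers)]
    # On spawn-based platforms, call this from under `if __name__ == "__main__":`.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        models = list(pool.map(sgd_logreg_l1, jobs))
    return np.mean(models, axis=0)
```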

Experiments: Logistic Regression
Zeta* dataset: low-dimensional setting.
[Figure: objective value vs. time (sec), linear time scale, for Shooting CDN, Shotgun CDN, SGD, and Parallel SGD; lower is better. In this low-dimensional setting the SGD methods are competitive.]
Note that this plot has time on a linear scale; for the high-dimensional setting we switch to a log scale for time, because the difference between Shotgun and SGD becomes that much bigger.
*From the Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/
Shotgun and Parallel SGD used 8 cores.

Experiments: Logistic Regression
rcv1 dataset (Lewis et al., 2004): high-dimensional setting.
[Figure: objective value vs. time (sec), log time scale, for Shooting CDN, Shotgun CDN, SGD, and Parallel SGD; lower is better. In this high-dimensional setting, the CDN methods, and Shotgun CDN in particular, reach low objective values much faster than SGD and Parallel SGD.]
Shotgun and Parallel SGD used 8 cores.

Shotgun: Self-speedup
[Figure: aggregated results from all tests; speedup vs. # cores, showing the optimal (linear) line, Lasso iteration speedup, Lasso time speedup, and logistic regression time speedup.]
Lasso time speedup is not so great, even though we are doing fewer iterations; each iteration takes longer.
Explanation: the memory wall (Wulf & McKee, 1995): the memory bus gets flooded.
Logistic regression uses more FLOPS per datum, so the extra computation hides memory latency, giving better speedups on average.

Conclusions
Shotgun: parallel coordinate descent
- Linear speedups up to a problem-dependent limit
- Significant speedups in practice
Large-scale empirical study
- Lasso and sparse logistic regression
- 37 datasets, 9 algorithms
- Compared with parallel methods: parallel matrix/vector ops, parallel stochastic gradient descent
Future work
- Hybrid Shotgun + parallel SGD
- More FLOPS per datum, e.g., Group Lasso (Yuan and Lin, 2006)
- Alternate hardware, e.g., graphics processors
Code & data: http://www.select.cs.cmu.edu/projects
Thanks!

References
Blumensath, T. and Davies, M.E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
Duarte, M.F., Davenport, M.A., Takhar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.
Figueiredo, M.A.T., Nowak, R.D., and Wright, S.J. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. of Sel. Top. in Signal Processing, 1(4):586–597, 2008.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
Fu, W.J. Penalized regressions: The bridge versus the lasso. J. of Comp. and Graphical Statistics, 7(3):397–416, 1998.
Kim, S.J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale l1-regularized least squares. IEEE J. of Sel. Top. in Signal Processing, 1(4):606–617, 2007.
Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. In NIPS, 2009.
Lewis, D.D., Yang, Y., Rose, T.G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
Ng, A.Y. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.
Shalev-Shwartz, S. and Tewari, A. Stochastic methods for l1 regularized loss minimization. In ICML, 2009.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statistical Society, 58(1):267–288, 1996.
van den Berg, E., Friedlander, M.P., Hennenfent, G., Herrmann, F., Saab, R., and Yılmaz, O. Sparco: A testing framework for sparse reconstruction. ACM Transactions on Mathematical Software, 35(4):1–16, 2009.
Wen, Z., Yin, W., Goldfarb, D., and Zhang, Y. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation. SIAM Journal on Scientific Computing, 32(4):1832–1857, 2010.
Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. IEEE Trans. on Signal Processing, 57(7):2479–2493, 2009.
Wulf, W.A. and McKee, S.A. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.
Yuan, G.X., Chang, K.W., Hsieh, C.J., and Lin, C.J. A comparison of optimization methods and software for large-scale l1-regularized linear classification. JMLR, 11:3183–3234, 2010.
Zinkevich, M., Weimer, M., Smola, A.J., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.

Backup Slides

Experiments: Logistic Regression (saved slide with test error plot)
Zeta* dataset: low-dimensional setting.
[Figure: test error vs. time (sec) for Shooting CDN, Shotgun CDN, SGD, and Parallel SGD; lower is better.]
*Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/

Experiments: Logistic Regression (saved slide with test error plot)
rcv1 dataset (Lewis et al., 2004): high-dimensional setting.
[Figure: test error vs. time (sec) for Shooting CDN, Shotgun CDN, SGD, and Parallel SGD; lower is better.]

Shotgun: Improving Self-speedup (backup: plots with max and min, not just mean)
[Figure: speedup vs. # cores, with max/mean/min curves for Lasso time speedup, logistic regression time speedup, and Lasso iteration speedup.]
Logistic regression uses more FLOPS per datum, so it gets better speedups on average.

Shotgun: Self-speedup (saved slide, just in case)
[Figure: Lasso time speedup and Lasso iteration speedup vs. # cores, with max/mean/min curves.]
Time speedup is not so great, but we are doing fewer iterations.
Explanation: the memory wall (Wulf & McKee, 1995): the memory bus gets flooded.