StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent

Presentation transcript:

StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent. Tyler B. Johnson and Carlos Guestrin, University of Washington.

Coordinate descent: a simple and effective optimization algorithm. It is fast in practice, well understood in theory, and has no learning rate or other parameters to tune. 😃

Lasso objective. The solution is sparse—the majority of weights equal 0.
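The objective itself appeared as an image on this slide and is not in the transcript. As a reconstruction (assuming the standard formulation with design matrix A, labels b, and regularization parameter λ, up to the exact scaling), the Lasso objective is:

    \min_{w \in \mathbb{R}^n} \; \tfrac{1}{2}\,\|Aw - b\|_2^2 \;+\; \lambda\,\|w\|_1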

Nonnegative Lasso objective. StingyCD can also solve the standard Lasso, and it is straightforward to extend to the linear SVM.
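As above, the equation was an image on the slide; a reconstruction in the same notation, with the weights constrained to be nonnegative:

    \min_{w \ge 0} \; \tfrac{1}{2}\,\|Aw - b\|_2^2 \;+\; \lambda\,\langle \mathbf{1}, w \rangle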

Inside an iteration of CD: maintain the residuals vector r = b − Aw; for the chosen coordinate, compute the closed-form coordinate step, which requires a dot product between that column of A and r (see the sketch below).
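A minimal sketch of one such iteration for the nonnegative Lasso, assuming the standard closed-form coordinate step; the function and variable names are illustrative, not the authors' implementation:

    import numpy as np

    def cd_update(A, lam, w, r, j):
        # One coordinate descent step for the nonnegative Lasso.
        # A: (m, n) design matrix, lam: regularization strength,
        # w: (n,) current iterate, r: (m,) residuals b - A @ w, j: coordinate index.
        col = A[:, j]
        col_sq = col @ col                   # ||A_j||^2 (can be precomputed)
        if col_sq == 0.0:
            return 0.0
        # Closed-form minimizer along coordinate j, clipped so that w_j stays >= 0.
        delta = max(-w[j], (col @ r - lam) / col_sq)
        w[j] += delta
        r -= delta * col                     # keep residuals in sync
        return delta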

Major drawback of CD: "zero updates." Zero updates are wasteful, and due to sparsity they are very common. Even when the update is zero, computing the coordinate's partial derivative still takes time proportional to the number of nonzeros in that column of A.

StingyCD: skip updates that are guaranteed to be zero. The skip condition requires just constant time.

Geometry of a zero update. For a coordinate j with w_j = 0, the update is zero exactly when the residuals vector lies in a halfspace determined by column A_j and λ.
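Written out (a reconstruction consistent with the nonnegative Lasso setup above): for a coordinate j with w_j = 0, the update is zero exactly when

    r \in S_j := \{\, r' : \langle A_j, r' \rangle \le \lambda \,\},

so such an update can be skipped whenever membership in this halfspace can be certified without computing \langle A_j, r \rangle.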

StingyCD makes 3 simple changes to CD.

Change 1: Reference residuals vector. Keep a copy of the residuals vector as a reference; the reference is updated infrequently (once every several epochs).

Change 2: Track reference distance. Maintain the squared distance between the current residuals and the reference residuals, updating it in constant time after each coordinate update.
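Concretely (this is just expanding the square), after an update of size δ to coordinate j, the tracked quantity q = ‖r − r_ref‖² can be maintained as

    q \;\leftarrow\; q \;-\; 2\delta\,\big(\langle A_j, r\rangle - \langle A_j, r_{\mathrm{ref}}\rangle\big) \;+\; \delta^2\,\|A_j\|_2^2,

which is constant time because \langle A_j, r\rangle is already computed as part of the update itself, while \langle A_j, r_{\mathrm{ref}}\rangle and \|A_j\|_2^2 can be cached (the former at each reference update, the latter once).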

Change 3: Threshold reference distance. Before each update, compare the tracked distance to a per-coordinate threshold computed at the last reference update; if the distance is below the threshold, the update is guaranteed to be zero and is skipped.
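One way to write the threshold, consistent with the halfspace geometry above (a reconstruction, not quoted from the slides): at each reference update, set for every coordinate j

    \tau_j \;=\; \operatorname{sign}\!\big(\lambda - \langle A_j, r_{\mathrm{ref}}\rangle\big)\,
    \frac{\big(\lambda - \langle A_j, r_{\mathrm{ref}}\rangle\big)^2}{\|A_j\|_2^2},

the signed squared distance from r_ref to the boundary of S_j. If q < \tau_j and w_j = 0, the ball of radius \sqrt{q} around r_ref lies entirely inside S_j, so the update is zero and can be skipped.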

Summary of StingyCD changes. Before each iteration, check the skip condition; both the skip check and the distance bookkeeping take only constant time (a sketch in Python follows).
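Putting the three changes together, a self-contained sketch for the nonnegative Lasso. This is illustrative only: the dense-matrix assumption, the fixed reference-update schedule, and the names epochs/ref_every are simplifications, not the authors' implementation.

    import numpy as np

    def stingy_cd(A, b, lam, epochs=20, ref_every=5):
        # Sketch of StingyCD for the nonnegative Lasso with a dense matrix A.
        n = A.shape[1]
        w = np.zeros(n)
        r = b.astype(float)                   # residuals b - A @ w (w starts at 0)
        col_sq = (A ** 2).sum(axis=0)         # ||A_j||^2 for every column

        def refresh_reference(r):
            # Change 1: reference residuals, refreshed only occasionally.
            ref_dots = A.T @ r                # cache <A_j, r_ref> for all j
            gap = lam - ref_dots
            # Signed squared distance from r_ref to each zero-update boundary.
            tau = np.sign(gap) * gap ** 2 / np.maximum(col_sq, 1e-12)
            return ref_dots, tau

        ref_dots, tau = refresh_reference(r)
        q = 0.0                               # Change 2: q = ||r - r_ref||^2

        for epoch in range(epochs):
            if epoch > 0 and epoch % ref_every == 0:
                ref_dots, tau = refresh_reference(r)
                q = 0.0
            for j in range(n):
                # Change 3: constant-time test for updates guaranteed to be zero.
                if w[j] == 0.0 and q < tau[j]:
                    continue
                if col_sq[j] == 0.0:
                    continue
                col = A[:, j]
                a_dot_r = col @ r
                delta = max(-w[j], (a_dot_r - lam) / col_sq[j])
                if delta == 0.0:
                    continue
                w[j] += delta
                r -= delta * col
                # Constant-time update of q using the cached <A_j, r_ref>.
                q += delta ** 2 * col_sq[j] - 2.0 * delta * (a_dot_r - ref_dots[j])
        return w

The reference-update schedule is a trade-off discussed next in the talk; the fixed ref_every above is only a placeholder.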

Reference update trade-off: refreshing the reference more often keeps the skip thresholds accurate, but each refresh costs a full pass over the data.

Scheduling reference updates. [Plot: relative time to converge as a function of the reference-update schedule.]

StingyCD empirical performance. [Plot: relative suboptimality vs. time (s) for CD, CD + safe screening, and StingyCD.]

Skipping more updates with StingyCD+

Probability of a useful update. StingyCD+ models the probability that each update is useful (i.e., nonzero) and computes this probability efficiently with a lookup table.
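For intuition only, a Monte Carlo sketch of such a probability under one plausible model: the current residuals are treated as uniformly distributed on the sphere of radius sqrt(q) around the reference residuals. This modeling assumption and the sampling approach are illustrative; the talk's lookup table evaluates a closed form instead of sampling.

    import numpy as np

    def prob_useful_update(q, tau_j, dim, n_samples=100_000, seed=0):
        # Estimate P(update to coordinate j is nonzero) for a coordinate with w_j = 0,
        # modeling r as uniform on the sphere of radius sqrt(q) centered at r_ref.
        #   q:     tracked squared distance ||r - r_ref||^2
        #   tau_j: signed squared distance from r_ref to the zero-update boundary
        #   dim:   dimension of the residuals vector
        if q <= 0.0:
            return 0.0 if tau_j > 0.0 else 1.0
        rng = np.random.default_rng(seed)
        # Signed distance from r_ref to the boundary hyperplane along its normal.
        d = np.sign(tau_j) * np.sqrt(abs(tau_j))
        # By rotational symmetry, only the sample's component along that normal matters.
        x = rng.standard_normal((n_samples, dim))
        along_normal = x[:, 0] / np.linalg.norm(x, axis=1)
        # The update is nonzero when the sampled residuals land outside the halfspace.
        return float(np.mean(np.sqrt(q) * along_normal > d))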

StingyCD+ empirical performance. [Plots: relative suboptimality vs. time (s) for CD, CD + safe screening, StingyCD, and StingyCD+.]

Combining StingyCD+ with other methods. Popular sparse logistic regression algorithms include approximate proximal Newton and working set algorithms; both rely on Lasso subproblem solvers. We compare CD and StingyCD+ as subproblem solvers.

Sparse logistic regression results. [Plots: relative suboptimality vs. time (s and min) for CD ProxNewt, CD ProxNewt with working sets, StingyCD+ ProxNewt, and StingyCD+ ProxNewt with working sets.]

Takeaways. StingyCD makes simple changes to CD that avoid wasteful computation; further gains are possible with relaxations (StingyCD+); and it can be combined with other methods. Future directions: extend to more problem settings and apply "stingy updates" to other algorithms. Thank you!