Distributed Optimization with Arbitrary Local Solvers

Presentation transcript:

Distributed Optimization with Arbitrary Local Solvers. Optimization and Big Data 2015, Edinburgh, May 6, 2015. Jakub Konečný, joint work with Chenxin Ma and Martin Takáč (Lehigh University), Peter Richtárik (University of Edinburgh), and Martin Jaggi (ETH Zurich).

Introduction Why we need distributed algorithms

The Objective: optimization problem formulation, regularized empirical risk minimization (ERM).
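For concreteness, a standard way to write the regularized ERM objective; the symbols n, d, x_i, ℓ_i and λ are notational assumptions introduced here, not taken from the slide:

```latex
% Regularized empirical risk minimization (primal), in assumed notation:
% x_1,\dots,x_n \in \mathbb{R}^d are data points, \ell_i convex losses,
% \lambda > 0 the regularization parameter.
\min_{w \in \mathbb{R}^d} \; P(w) \;:=\; \frac{1}{n} \sum_{i=1}^{n} \ell_i\!\left(x_i^{\top} w\right) \;+\; \frac{\lambda}{2} \|w\|^2
```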

Traditional efficiency analysis. Given an algorithm A, the time needed is the total number of iterations needed to reach the target accuracy ε, times the time needed to run one iteration of A. The main trend, stochastic methods, keeps the per-iteration time small at the price of a large iteration count.
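In symbols (notation assumed here), the traditional analysis reads:

```latex
% Total running time of algorithm \mathcal{A} to reach accuracy \epsilon:
% iteration count times cost per iteration.
\mathrm{TIME}(\mathcal{A}) \;=\;
  \underbrace{\mathcal{I}_{\mathcal{A}}(\epsilon)}_{\text{iterations needed}}
  \;\times\;
  \underbrace{\mathcal{T}_{\mathcal{A}}}_{\text{time per iteration}}
```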

Motivation to distribute data. "Typical" datasets: CIFAR-10/100 ~200 MB [1]; Yahoo Flickr Creative Commons 100M ~12 GB [2]; ImageNet ~125 GB [3]; Internet Archive ~80 TB [4]; 1000 Genomes ~464 TB (March 2013, still growing) [5]; Google ad prediction, Amazon recommendations ~?? PB. A typical computer: 8–64 GB RAM, 0.5–3 TB disk.

Motivation to distribute data. Where does the problem size come from? Both the number of data points and the number of features can be on the order of billions, and often both are big at the same time.

Computational bottlenecks. Processor–RAM communication: super fast. Processor–disk communication: not as fast. Computer–computer communication: quite slow. Designing an optimization scheme with communication efficiency in mind is key to speeding up distributed optimization.

Distributed efficiency analysis. There is a lot of potential for improvement if the time for one round of communication dominates the time of local computation, because most of the time is then spent on communication.

Distributed algorithms, examples: Hydra [6], distributed coordinate descent (Richtárik, Takáč); one-round-communication SGD [7] (Zinkevich et al.); DANE [8], Distributed Approximate Newton (Shamir et al.), which seems good in practice although its theory is not satisfactory, and which shows that the one-round method above is weak; and CoCoA [9] (Jaggi et al.), upon which this work builds.

Our goal: split the main problem into meaningful subproblems, run an arbitrary local solver on each local objective, and reach a prescribed accuracy on the subproblem. The subproblems are solved locally, which results in improved flexibility of this paradigm.

Efficiency analysis revisited. Such a framework yields the following paradigm.
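A hedged rendering of that paradigm, with c, Θ, I and T_A as assumed notation:

```latex
% c: time for one round of communication; \Theta: target local accuracy;
% \mathcal{T}_{\mathcal{A}}(\Theta): local computation time to reach it;
% \mathcal{I}(\epsilon,\Theta): number of communication rounds to reach \epsilon.
\mathrm{TIME} \;\approx\; \mathcal{I}(\epsilon,\Theta) \;\times\; \bigl( c \;+\; \mathcal{T}_{\mathcal{A}}(\Theta) \bigr)
```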

Efficiency analysis revisited. The target local accuracy Θ controls a trade-off: with decreasing Θ (more accurate local solves), the local computation time T_A(Θ) increases while the number of communication rounds I(ε, Θ) decreases; with increasing Θ, the opposite holds.

An example of a local solver: take gradient descent (GD). Naïve distributed GD, with a single gradient step per round, just picks one particular value of Θ. But for GD a different value of Θ may be optimal, corresponding to, say, 100 local steps. For various algorithms, different values of Θ are optimal; that explains why more local iterations can be helpful for greater efficiency.

Experiments (demo). Local solver: coordinate descent.

Problem specification

Problem specification (primal): the regularized ERM problem stated in the introduction.

Problem specification (dual). This is the problem we will be solving.
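A sketch of the dual in the notation commonly used for this framework; the sign and scaling conventions here are assumptions, ℓ_i* denotes the convex conjugate of ℓ_i, and A = [x_1, …, x_n] ∈ R^{d×n}:

```latex
\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) \;:=\;
  -\frac{1}{n} \sum_{i=1}^{n} \ell_i^{*}(-\alpha_i)
  \;-\; \frac{\lambda}{2} \left\| \frac{1}{\lambda n} A \alpha \right\|^2
```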

Assumptions: smoothness of the losses ℓ_i, which implies strong convexity of the conjugates ℓ_i*; and strong convexity of the losses, which implies smoothness of the conjugates.
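The standard conjugate-duality facts behind this (for closed convex ℓ_i; μ > 0 is a generic constant):

```latex
\ell_i \text{ is } (1/\mu)\text{-smooth} \;\Longleftrightarrow\; \ell_i^{*} \text{ is } \mu\text{-strongly convex},
\qquad
\ell_i \text{ is } \mu\text{-strongly convex} \;\Longleftrightarrow\; \ell_i^{*} \text{ is } (1/\mu)\text{-smooth}
```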

The Algorithm

Necessary notation. A partition of the data points: complete and disjoint. Masking of a vector by the partition.
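A sketch of this notation, assuming the data indices {1, …, n} are split into parts P_1, …, P_K:

```latex
% Complete and disjoint partition, and the masking of a vector \alpha \in \mathbb{R}^n:
\bigcup_{k=1}^{K} \mathcal{P}_k = \{1,\dots,n\}, \qquad
\mathcal{P}_k \cap \mathcal{P}_l = \emptyset \;\; (k \neq l),
\qquad
\bigl(\alpha_{[k]}\bigr)_i :=
\begin{cases}
  \alpha_i, & i \in \mathcal{P}_k,\\[2pt]
  0, & \text{otherwise.}
\end{cases}
```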

Data distribution: computer k owns the data points and the dual variables indexed by its part of the partition. There is no clear way to distribute the objective function itself.

The Algorithm, "analysis friendly" version.
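A hedged outline of one outer iteration in this form; γ is the aggregation parameter, Θ the target local accuracy, and G_k^{σ'} the local subproblem spelled out a few slides below:

```latex
% One outer round t: each machine finds a \Theta-approximate maximizer of its
% local subproblem over updates supported on its own coordinates, then the
% updates are aggregated with parameter \gamma.
\Delta\alpha_{[k]} \;\approx\; \arg\max_{\Delta \in \mathbb{R}^n:\ \Delta = \Delta_{[k]}}
  \mathcal{G}_k^{\sigma'}\!\bigl(\Delta;\, w(\alpha^{(t)}),\, \alpha^{(t)}_{[k]}\bigr)
  \quad (k = 1,\dots,K \text{ in parallel}),
\qquad
\alpha^{(t+1)} \;=\; \alpha^{(t)} + \gamma \sum_{k=1}^{K} \Delta\alpha_{[k]}
```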

Necessary properties for efficiency. Locality: the subproblem can be formed solely from information available locally to computer k. Independence: the local solver can run independently, without any communication with other computers. Local changes: the output is only a change in the coordinates stored locally. Efficient maintenance: to form the new subproblem with the new dual variable, we need to send and receive only a single vector in R^d.

More notation. Denote by w(α) the primal point induced by the dual variable α; the dual objective can then be written in terms of w(α).
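A sketch of the natural choice, under the primal–dual scaling assumed above:

```latex
w(\alpha) \;:=\; \frac{1}{\lambda n} A\alpha \;=\; \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i,
\qquad
D(\alpha) \;=\; -\frac{1}{n} \sum_{i=1}^{n} \ell_i^{*}(-\alpha_i) \;-\; \frac{\lambda}{2} \|w(\alpha)\|^2
```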

The Subproblem. There are multiple ways to choose the subproblem parameter, and the value of the aggregation parameter depends on it. For now, let us focus on one particular choice.

Subproblem intuition: consistency with the global objective, and a (shifted) local under-approximation of it.

The Subproblem, a closer look. It consists of: a separable term that depends only on variables stored locally; a term built from a linear combination of columns stored locally; a constant added for convenience in the analysis; and the problematic term, which will be the focus of the following slides.
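A hedged sketch of the subproblem in the CoCoA-style form [9]; here A_{[k]} keeps only the columns of A indexed by P_k, σ' is the subproblem parameter (a safe choice being σ' = γK), and the exact constants are assumptions:

```latex
\mathcal{G}_k^{\sigma'}\!\bigl(\Delta\alpha_{[k]};\, w, \alpha_{[k]}\bigr) \;:=\;
  \underbrace{-\frac{1}{n} \sum_{i \in \mathcal{P}_k} \ell_i^{*}\!\bigl(-\alpha_i - \Delta\alpha_i\bigr)}_{\text{separable, local variables only}}
  \;\underbrace{-\; \frac{1}{K}\,\frac{\lambda}{2} \|w\|^2}_{\text{constant}}
  \;\underbrace{-\; \frac{1}{n}\, w^{\top} A_{[k]} \Delta\alpha_{[k]}}_{\text{problematic term}}
  \;\underbrace{-\; \frac{\lambda \sigma'}{2} \left\| \frac{1}{\lambda n} A_{[k]} \Delta\alpha_{[k]} \right\|^2}_{\text{local columns only}}
```

The two data-dependent terms on the right use only columns stored on machine k; the coupling term is problematic only because it involves w, which depends on all of α.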

Dealing with the problematic term: three steps are needed. (A) Form the primal point: impossible locally. (B) Apply the gradient: an easy operation. (C) Multiply by the data matrix: impossible locally, since it is distributed.

Dealing with the problematic term: note that we need only a single shared vector, not the distributed data itself. Suppose we have this vector available and can run the local solver to obtain a local update; we then form a vector to send to the master node, receive another vector back from the master node, form the shared vector again, and are ready to run the local solver once more. The local coordinates are picked out by the corresponding columns of the partition identity matrix.

Dealing with the problematic term, local workflow for a single iteration: run the local solver in iteration t, obtain the local update, compute the corresponding vector, and send it to a master node. The master node forms the aggregate, computes the new shared vector, and sends it back; each worker receives it and continues. The master node has to remember one extra vector.

The Algorithm, "implementation friendly" version.
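A minimal single-process sketch of this version, illustrating exactly what is communicated; it assumes quadratic losses ℓ_i(z) = ½(z − y_i)² so the local coordinate-ascent step has a closed form, and the choices γ = 1, σ' = K, the local solver, and all constants are illustrative assumptions rather than the slides' exact setup:

```python
import numpy as np

# Simulation of the "implementation friendly" workflow on one machine.
# Assumed setup: quadratic losses, gamma = 1, sigma' = K ("adding" aggregation).
np.random.seed(0)
n, d, K = 200, 50, 4                       # data points, features, machines
lam = 0.1                                  # regularization parameter lambda
X = np.random.randn(d, n)                  # columns are data points x_i
y = np.random.randn(n)
parts = np.array_split(np.arange(n), K)    # complete, disjoint partition P_1..P_K

alpha = [np.zeros(len(P)) for P in parts]  # dual variables owned by machine k
v = np.zeros(d)                            # shared vector v = (1/(lam*n)) * A @ alpha
gamma, sigma = 1.0, float(K)               # aggregation and subproblem parameters

def local_solver(k, v, n_local_passes=10):
    """Approximately maximize the local subproblem G_k^{sigma'} by a few
    passes of coordinate ascent; returns the local update delta_alpha_k."""
    Xk, yk, ak = X[:, parts[k]], y[parts[k]], alpha[k]
    delta = np.zeros_like(ak)
    u = np.zeros(d)                        # u = Xk @ delta, maintained incrementally
    for _ in range(n_local_passes):
        for j in range(len(ak)):
            xj = Xk[:, j]
            resid = yk[j] - ak[j] - delta[j] - xj @ v - (sigma / (lam * n)) * (xj @ u)
            step = resid / (1.0 + sigma * (xj @ xj) / (lam * n))
            delta[j] += step
            u += step * xj
    return delta

for t in range(30):                        # outer (communication) rounds
    messages = []
    for k in range(K):                     # in practice: in parallel on K machines
        delta_k = local_solver(k, v)
        alpha[k] += gamma * delta_k
        # the only thing machine k sends: one vector in R^d
        messages.append(X[:, parts[k]] @ delta_k / (lam * n))
    v += gamma * sum(messages)             # master aggregates and broadcasts v

w = v                                      # primal iterate w(alpha)
primal = 0.5 * np.mean((X.T @ w - y) ** 2) + 0.5 * lam * (w @ w)
print("primal objective after 30 rounds:", primal)
```

Each machine communicates only one vector in R^d per round (its entry of `messages`), and the master only has to remember and broadcast the single shared vector v, matching the workflow described above.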

Results (theory)

Local decrease assumption.
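A hedged statement of the assumption in the form used by CoCoA-type analyses: the (possibly randomized) local solver returns a Θ-approximate maximizer of its subproblem,

```latex
\mathbb{E}\Bigl[\mathcal{G}_k^{\sigma'}\!\bigl(\Delta\alpha^{\star}_{[k]}\bigr) - \mathcal{G}_k^{\sigma'}\!\bigl(\Delta\alpha_{[k]}\bigr)\Bigr]
\;\le\;
\Theta \Bigl( \mathcal{G}_k^{\sigma'}\!\bigl(\Delta\alpha^{\star}_{[k]}\bigr) - \mathcal{G}_k^{\sigma'}(0) \Bigr),
\qquad \Theta \in [0,1)
```

where Δα*_[k] is an exact maximizer; Θ = 0 means the subproblem is solved exactly.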

Reminder: the new distributed efficiency analysis.

Theorem (strongly convex case). If we run the algorithm with compatible choices of the aggregation parameter γ and the subproblem parameter σ', then the following guarantee on the number of communication rounds holds.

Theorem (general convex case). If we run the algorithm with compatible choices of γ and σ', then the following guarantee on the number of communication rounds holds.
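A hedged summary of the shape of both guarantees, following the CoCoA-style analysis of [9] with problem-dependent constants omitted; these orders are stated from memory of that line of work, not read off the slides:

```latex
\text{smooth } \ell_i \text{ (strongly convex dual):}\quad
  T \;=\; \mathcal{O}\!\left( \frac{1}{\gamma\,(1-\Theta)} \log\frac{1}{\epsilon} \right),
\qquad
\text{general convex (Lipschitz } \ell_i\text{):}\quad
  T \;=\; \mathcal{O}\!\left( \frac{1}{\gamma\,(1-\Theta)} \cdot \frac{1}{\epsilon} \right)
```

Here T is the number of communication rounds to reach expected suboptimality ε; the 1/(1 − Θ) factor is the price of solving the local subproblems only approximately.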

Results (Experiments)

Experimental results: coordinate descent as the local solver, with various numbers of local iterations (three plot slides).

Different subproblems: big vs. small regularization parameter.

Extras: it is possible to formulate different subproblems.

Extras: it is possible to formulate different subproblems; one such variant is useful for the SVM dual.

Extras: it is also possible to formulate primal-only subproblems, used with the approach of [6], with similar theoretical results.

Mentioned datasets:
[1] http://www.cs.toronto.edu/~kriz/cifar.html
[2] http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images
[3] http://www.image-net.org/
[4] http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/
[5] http://www.1000genomes.org

References:
[6] Richtárik, Peter, and Martin Takáč. "Distributed coordinate descent method for learning with big data." arXiv preprint arXiv:1310.2059 (2013).
[7] Zinkevich, Martin, et al. "Parallelized stochastic gradient descent." Advances in Neural Information Processing Systems, 2010.
[8] Shamir, Ohad, Nathan Srebro, and Tong Zhang. "Communication-efficient distributed optimization using an approximate Newton-type method." arXiv preprint arXiv:1312.7853 (2013).
[9] Jaggi, Martin, et al. "Communication-efficient distributed dual coordinate ascent." Advances in Neural Information Processing Systems, 2014.