Peter Richtárik: Why parallelizing like crazy and being lazy can be good
I. Optimization
Optimization with Big Data* = Extreme* Mountain Climbing (* in a billion dimensional space, on a foggy day)
Western General Hospital (Creutzfeldt-Jakob Disease) · Arup (Truss Topology Design) · Ministry of Defence dstl lab (Algorithms for Data Simplicity) · Royal Observatory (Optimal Planet Growth)
Big Data: digital images & videos, transaction records, government records, health records, defence, internet activity (social media, Wikipedia, ...), scientific measurements (physics, climate models, ...). BIG Volume, BIG Velocity, BIG Variety.
God’s Algorithm = Teleportation
If You Are Not a God... x0 → x1 → x2 → x3
II. Randomized Coordinate Descent Methods [the cardinal directions of big data optimization]
P. R. and M. Takáč. Iteration complexity of randomized block coordinate descent methods for minimizing a composite function. Mathematical Programming A, 2012. / Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optimization, 2012.
2D Optimization. Goal: find the minimizer of the function (contours of the function shown).
Randomized Coordinate Descent in 2D
[Animation over several slides: from the starting point, each iteration picks one of the cardinal directions N, S, E, W at random and updates the iterate along that single coordinate; after a handful of iterations the minimizer is reached. SOLVED!]
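The 2D animation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the quadratic objective, step size, and iteration count are illustrative assumptions; each iteration picks one coordinate (a cardinal direction) at random and steps along it only.

```python
import random

def coordinate_descent(grad, x, step=0.1, iters=200):
    """Randomized coordinate descent: at each iteration pick one
    coordinate uniformly at random and take a gradient step
    along that coordinate only."""
    for _ in range(iters):
        i = random.randrange(len(x))   # N/S = coordinate 1, E/W = coordinate 0
        x[i] -= step * grad(x)[i]      # move along the chosen axis only
    return x

# Illustrative objective: f(x) = (x0 - 1)^2 + 2*(x1 + 2)^2, minimizer (1, -2)
grad = lambda x: [2 * (x[0] - 1), 4 * (x[1] + 2)]

random.seed(0)
x = coordinate_descent(grad, [5.0, 5.0])
print(x)  # close to [1, -2]
```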
1 Billion Rows & 100 Million Variables
Bridges are Indeed Optimal!
P. R. and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv, 2012. / M. Takáč, A. Bijral, P. R. and N. Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
Failure of Naive Parallelization
[Animation over several slides: from iterate 0, two coordinates are updated in parallel (steps 1a, 1b), each computed as if the other coordinate stayed fixed; applying both updates at once overshoots the minimizer, and repeating the procedure (steps 2a, 2b) fails to converge.]
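The failure mode above can be reproduced on a tiny example. This is a hypothetical illustration (the objective f(x1, x2) = (x1 + x2)^2 is my choice, not from the talk): each coordinate's exact minimizer is computed as if the other coordinate stayed fixed, and applying both at once makes the iterates oscillate forever, while the same updates applied serially solve the problem in one sweep.

```python
def parallel_step(x):
    # Each coordinate exactly minimizes f(x1, x2) = (x1 + x2)^2
    # over itself, using the OLD value of the other coordinate.
    return [-x[1], -x[0]]

def serial_step(x):
    x = x[:]
    x[0] = -x[1]          # minimize over x1 first ...
    x[1] = -x[0]          # ... then over x2, using the fresh x1
    return x

f = lambda x: (x[0] + x[1]) ** 2

x_par = [1.0, 1.0]
for _ in range(10):
    x_par = parallel_step(x_par)   # oscillates between (1,1) and (-1,-1)

x_ser = serial_step([1.0, 1.0])    # reaches (-1, 1), where f = 0

print(f(x_par), f(x_ser))  # 4.0 0.0
```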
Parallel Coordinate Descent
Theory
Reality
A Problem with Billion Variables
P. R. and M. Takáč. Distributed coordinate descent methods for big data optimization. Manuscript, 2013.
Distributed Coordinate Descent 1.2 TB LASSO problem solved on the HECToR supercomputer with 2048 cores
III. Randomized Lock-Free Methods [optimization as lock breaking]
A Lock with 4 Dials. Setup: a combination x = (x1, x2, x3, x4); a function F(x) = F(x1, x2, x3, x4) representing the "quality" of a combination; the combination maximizing F opens the lock. Optimization Problem: find the combination maximizing F.
Optimization Algorithm
P. R. and M. Takáč. Randomized lock-free gradient methods. Manuscript, 2013. / F. Niu, B. Recht, C. Ré, and S. Wright. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.
A System of a Billion Locks with Shared Dials. Dials x1, x2, x3, x4, ..., xn; # dials = n = # locks. 1) Nodes in the graph correspond to dials. 2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge.
How Do We Measure the Quality of a Combination? F : R^n → R. Each lock j has its own quality function F_j depending on the dials it owns. However, lock j does NOT open when F_j is maximized: the system of locks opens when F = F_1 + F_2 + ... + F_n is maximized.
An Algorithm with (Too Much?) Randomization. 1) Randomly select a lock. 2) Randomly select a dial belonging to the lock. 3) Adjust the value of the selected dial based only on the information corresponding to the selected lock.
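The three-step algorithm can be sketched on a hypothetical toy instance. The cycle graph, the target values, and the step size below are my assumptions, not from the talk: each lock j owns two dials, its quality F_j is maximized when those dials sum to a target, and F = F_1 + ... + F_n is maximized by repeatedly picking a random lock, a random owned dial, and taking an ascent step using only F_j's information.

```python
import random

random.seed(0)
n = 6
# Hypothetical instance: dials on a cycle; lock j owns dials j and (j+1) % n.
owns = [(j, (j + 1) % n) for j in range(n)]
target = [1.0] * n  # lock j's quality peaks when its two dials sum to target[j]

def F_j(j, x):
    a, b = owns[j]
    return -(x[a] + x[b] - target[j]) ** 2

def F(x):
    return sum(F_j(j, x) for j in range(n))

def grad_ji(j, i, x):
    # dF_j/dx_i is the same for both dials owned by lock j
    a, b = owns[j]
    return -2.0 * (x[a] + x[b] - target[j])

x = [random.uniform(-2, 2) for _ in range(n)]
start = F(x)
step = 0.1
for _ in range(5000):
    j = random.randrange(n)          # 1) random lock
    i = random.choice(owns[j])       # 2) random dial owned by that lock
    x[i] += step * grad_ji(j, i, x)  # 3) ascent step using only F_j's info

print(start, F(x))  # F increases toward its maximum, 0
```

Each update looks only at one lock's quality function, yet in expectation it follows the gradient of the full objective F, which is why all this randomization still works.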
Synchronous Parallelization
[Diagram: jobs J1–J9 assigned to Processors 1–3 in synchronized rounds; faster processors sit IDLE until the slowest job of each round finishes. WASTEFUL]
Crazy (Lock-Free) Parallelization
[Diagram: Processors 1–3 grab jobs J1–J9 as soon as they finish their previous job, with no synchronization barriers and no idle time. NO WASTE]
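The gap between the two diagrams can be quantified with a small scheduling simulation. The job durations below are hypothetical, not from the talk: a synchronous scheme pays for the slowest job of every round, while the asynchronous scheme lets each processor grab the next job the moment it is free.

```python
import heapq

# Hypothetical job durations: 9 jobs, 3 processors.
jobs = [4, 1, 2, 3, 5, 1, 2, 6, 1]
P = 3

# Synchronous: jobs run in rounds of P; a barrier makes every
# round last as long as its slowest job.
rounds = [jobs[i:i + P] for i in range(0, len(jobs), P)]
sync_time = sum(max(r) for r in rounds)

# Lock-free / asynchronous: each processor takes the next job as
# soon as it is free (greedy list scheduling); total time is the makespan.
free = [0.0] * P               # time at which each processor becomes free
heapq.heapify(free)
for d in jobs:
    t = heapq.heappop(free)    # earliest-free processor takes the job
    heapq.heappush(free, t + d)
async_time = max(free)

print(sync_time, async_time)  # 15 11.0
```

On this instance the barrier-free schedule finishes in 11 time units versus 15 with barriers; no processor ever waits while work remains.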
Crazy Parallelization
Theoretical Result
[The complexity/speedup bound is stated in terms of four quantities: the average # of dials in a lock, the average # of dials common to two locks, the # of locks, and the # of processors.]
Computational Insights
IV. Final Two Slides
Why parallelizing like crazy and being lazy can be good? Randomization: effectiveness, tractability, efficiency, scalability (big data). Parallelization: parallelism, distribution, asynchronicity.
Tools: Probability, Machine Learning, Matrix Theory, HPC