A Framework for Machine Learning and Data Mining in the Cloud
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joe Hellerstein (Carnegie Mellon University)

Big Data is Everywhere: 72 hours of YouTube video per minute, 28 million Wikipedia pages, 900 million Facebook users, 6 billion Flickr photos. "…data [is] a new class of economic asset, like currency or gold." "…growing at 50 percent a year…"

Big Learning: how will we design and implement big learning systems?

Shift Towards Use of Parallelism in ML: GPUs, multicore, clusters, clouds, supercomputers. ML experts (in practice, graduate students) repeatedly solve the same parallel design challenges: race conditions, distributed state, communication… The resulting code is very specialized: difficult to maintain, extend, debug… Avoid these problems by using high-level abstractions.

MapReduce – Map Phase: the input is split across CPUs (CPU 1 … CPU 4), and each CPU processes its portion independently. Embarrassingly parallel: independent computation, no communication needed.

MapReduce – Reduce Phase: the mapped results are aggregated, e.g. each reducer (CPU 1, CPU 2) collects the image features for one label and computes attractive-face and ugly-face statistics.
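To make the two phases concrete, here is a minimal, self-contained Python sketch (not Hadoop or GraphLab code; the toy images, labels, and the averaging "feature extractor" are invented for illustration) of an embarrassingly parallel map phase followed by a reduce phase that aggregates per-label statistics.

    from collections import defaultdict
    from concurrent.futures import ProcessPoolExecutor

    def map_phase(image):
        # Each image is processed independently: no communication between workers.
        label, pixels = image
        feature = sum(pixels) / len(pixels)   # stand-in for real feature extraction
        return label, feature

    def reduce_phase(mapped):
        # Aggregate sufficient statistics (count and mean feature) per label.
        stats = defaultdict(lambda: [0, 0.0])
        for label, feature in mapped:
            stats[label][0] += 1
            stats[label][1] += feature
        return {label: (n, total / n) for label, (n, total) in stats.items()}

    if __name__ == "__main__":
        images = [("attractive", [0.9, 0.8]), ("ugly", [0.1, 0.2]), ("attractive", [0.7, 0.6])]
        with ProcessPoolExecutor() as pool:        # map phase: embarrassingly parallel
            mapped = list(pool.map(map_phase, images))
        print(reduce_phase(mapped))                # reduce phase: aggregate statistics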

MapReduce for Data-Parallel ML: excellent for large data-parallel tasks! Data-parallel tasks: cross validation, feature extraction, computing sufficient statistics. Graph-parallel tasks: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization). Is there more to machine learning?

Exploit Dependencies

[Example: a person's interests (hockey, scuba diving, underwater hockey) can be inferred from the interests of their friends in a social network.]

Graphs are Everywhere: collaborative filtering (Netflix: users and movies), text analysis (Wikipedia: docs and words), probabilistic analysis of social networks.

Properties of Computation on Graphs: a dependency graph, iterative computation (my interests depend on my friends' interests), and local updates.

ML Tasks Beyond Data-Parallelism: Data-parallel tasks: cross validation, feature extraction, computing sufficient statistics. Graph-parallel tasks: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization). Can Map-Reduce handle the graph-parallel tasks?

Properties of Graph-Parallel Algorithms: a dependency graph, iterative computation (my interests depend on my friends' interests), and local updates.

Map-Reduce for Data-Parallel ML: Map-Reduce is excellent for the data-parallel tasks, but can it also handle the graph-parallel ones (graphical models, label propagation, CoEM, PageRank, triangle counting, collaborative filtering)?

What is GraphLab?

Shared-Memory GraphLab: [speedup vs. number of processors, each plotted against ideal speedup, for BP + parameter learning, Gibbs sampling, CoEM, Lasso, and compressed sensing]. GraphLab: A New Parallel Framework for Machine Learning. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein (UAI 2010).

Algorithms implemented on shared-memory GraphLab: Bayesian tensor factorization, Gibbs sampling, matrix factorization, Lasso, SVM, belief propagation, PageRank, CoEM, SVD, LDA, linear solvers, the Splash sampler, alternating least squares, …and many others.

But a single machine has limited CPU power, limited memory, and limited scalability.

The Distributed Cloud: an effectively unlimited amount of computational resources (up to funding limitations), but new challenges: distributing state, data consistency, and fault tolerance.

The GraphLab Framework: graph-based data representation, update functions (user computation), and a consistency model.

Data Graph: data is associated with vertices and edges. Example (social network graph): vertex data holds the user profile and current interest estimates; edge data holds the relationship (friend, classmate, relative).
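A minimal sketch of such a data graph in Python (illustrative only; this is not GraphLab's API, and the profiles and interest values are made up):

    # Vertex data: user profile and current interest estimates.
    vertex_data = {
        "alice": {"profile": {"age": 30}, "interests": {"hockey": 0.7, "scuba": 0.3}},
        "bob":   {"profile": {"age": 28}, "interests": {"hockey": 0.5, "scuba": 0.5}},
    }

    # Edge data: the relationship carried by each edge of the social network.
    edge_data = {
        ("alice", "bob"): {"relationship": "friend"},
    }

    # Graph structure: adjacency lists.
    neighbors = {"alice": ["bob"], "bob": ["alice"]}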

Distributed Graph: partition the graph across multiple machines.

Distributed Graph: "ghost" vertices maintain the adjacency structure and replicate remote data.

Distributed Graph: cut the graph efficiently using HPC graph-partitioning tools (ParMetis, Scotch, …).
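A rough sketch of how a partition induces ghost vertices (hash partitioning stands in here for ParMetis/Scotch; the function and variable names are invented for illustration): each machine stores the edges it needs, owns the vertices assigned to it, and keeps a ghost copy of any remote endpoint.

    def partition_with_ghosts(edges, num_machines):
        """Hash-partition vertices; each machine keeps ghosts of remote endpoints."""
        owner = lambda v: hash(v) % num_machines
        owned = {m: set() for m in range(num_machines)}
        ghosts = {m: set() for m in range(num_machines)}
        local_edges = {m: [] for m in range(num_machines)}
        for u, v in edges:
            for m in {owner(u), owner(v)}:       # edge is stored on both endpoints' machines
                local_edges[m].append((u, v))
                for x in (u, v):
                    if owner(x) == m:
                        owned[m].add(x)
                    else:
                        ghosts[m].add(x)         # remote vertex replicated as a ghost
        return owned, ghosts, local_edges

    owned, ghosts, local_edges = partition_with_ghosts(
        [("a", "b"), ("b", "c"), ("c", "d")], num_machines=2)
    print(ghosts)   # the ghost vertices each machine must keep synchronized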

The GraphLab Framework: graph-based data representation, update functions (user computation), and a consistency model.

Update Functions: a user-defined program applied to a vertex that transforms the data in the scope of the vertex. PageRank example (pseudocode):

Pagerank(scope) {
  // Update the current vertex data
  // Reschedule neighbors if needed
  if vertex.PageRank changes then reschedule_all_neighbors;
}

Dynamic computation: the update function is applied (asynchronously) in parallel until convergence; many schedulers are available to prioritize computation.
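Below is a runnable Python rendering of this idea (a toy sequential simulation, not GraphLab's engine; the body of the update, which the slide leaves as a comment, is filled in with the standard damped PageRank recurrence): an update function recomputes a vertex's rank from its in-neighbors and reschedules its out-neighbors only when the rank changes appreciably, and the scheduler loop runs until it is empty.

    from collections import deque

    ALPHA, TOL = 0.15, 1e-4   # damping term and convergence tolerance (illustrative values)

    def pagerank_update(v, rank, in_nbrs, out_nbrs, out_degree, scheduler, scheduled):
        """Recompute one vertex's rank; reschedule neighbors if it changed."""
        old = rank[v]
        rank[v] = ALPHA + (1 - ALPHA) * sum(rank[u] / out_degree[u] for u in in_nbrs[v])
        if abs(rank[v] - old) > TOL:            # dynamic computation: only reschedule
            for w in out_nbrs[v]:               # neighbors that are actually affected
                if w not in scheduled:
                    scheduler.append(w)
                    scheduled.add(w)

    def run(out_nbrs):
        vertices = list(out_nbrs)
        in_nbrs = {v: [u for u in vertices if v in out_nbrs[u]] for v in vertices}
        out_degree = {v: max(len(out_nbrs[v]), 1) for v in vertices}
        rank = {v: 0.0 for v in vertices}
        scheduler, scheduled = deque(vertices), set(vertices)
        while scheduler:                        # process repeats until the scheduler is empty
            v = scheduler.popleft()
            scheduled.discard(v)
            pagerank_update(v, rank, in_nbrs, out_nbrs, out_degree, scheduler, scheduled)
        return rank

    print(run({"a": ["b"], "b": ["c"], "c": ["a"]}))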

BSP ML Problem: synchronous algorithms can be inefficient. [Chart: runtime of bulk-synchronous vs. asynchronous Splash BP.]

Shared-Memory Dynamic Schedule: a scheduler holds the set of vertices to be updated (a, b, …, k); each CPU repeatedly pulls a vertex from the scheduler and runs the update function on it, which may add further vertices. The process repeats until the scheduler is empty.

Distributed Scheduling: each machine maintains a schedule over the vertices it owns; distributed consensus is used to identify completion.

Ensuring Race-Free Code: how much can computation overlap?

Common Problem: Write-Write Race. Processors running adjacent update functions simultaneously modify shared data; CPU 1's write and CPU 2's write collide, and one overwrites the other in the final value.

The GraphLab Framework: graph-based data representation, update functions (user computation), and a consistency model.

Racing Collaborative Filtering [figure: collaborative filtering with races allowed].

Serializability: for every parallel execution, there exists a sequential execution of update functions that produces the same result.

Serializability Example: with edge consistency, update functions one vertex apart can be run in parallel, because the overlapping regions of their scopes are only read. Stronger and weaker consistency levels are also available; user-tunable consistency levels trade off parallelism against consistency.

Stronger / Weaker Consistency: Full consistency: parallel updates must be two vertices apart. Vertex consistency: the update does not read adjacent vertices, so parallel updates can be adjacent.

Distributed Consistency. Solution 1: Graph Coloring. Solution 2: Distributed Locking.

Edge Consistency via Graph Coloring: vertices of the same color are all at least one vertex apart; therefore, all vertices of the same color can be run in parallel!

Chromatic Distributed Engine: execute tasks on all vertices of color 0, synchronize ghosts, and wait for completion at a barrier; then execute tasks on all vertices of color 1, synchronize ghosts, barrier; and so on through the colors.
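A sketch of the chromatic approach in Python (a greedy coloring plus a color-by-color execution loop; this is a toy single-process stand-in for the distributed engine, with ghost synchronization and the barrier reduced to a comment):

    def greedy_color(neighbors):
        """Assign each vertex the smallest color not used by its neighbors."""
        color = {}
        for v in neighbors:
            used = {color[u] for u in neighbors[v] if u in color}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        return color

    def chromatic_engine(neighbors, update, num_sweeps=3):
        color = greedy_color(neighbors)
        for _ in range(num_sweeps):
            for c in sorted(set(color.values())):
                # Vertices of one color are at least one vertex apart, so their
                # updates could run in parallel under edge consistency.
                for v in (v for v in neighbors if color[v] == c):
                    update(v)
                # ...ghost synchronization + completion barrier would go here
                # in the distributed setting...

    neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    chromatic_engine(neighbors, update=lambda v: print("update", v))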

Matrix Factorization: Netflix collaborative filtering via alternating least squares. The matrix factorization model is a bipartite graph of users and movies, each carrying a latent factor of dimension d: 0.5 million nodes, 99 million edges.
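As a rough illustration of how ALS fits the vertex-update pattern (not GraphLab's implementation; the names and the regularization value are invented), one half-step re-solves a user's latent factor from the factors of the movies that user rated:

    import numpy as np

    def als_user_update(user, movie_factors, ratings, d, reg=0.1):
        """One ALS half-step: re-solve a user's latent factor from its neighbors."""
        movies = [m for (u, m) in ratings if u == user]
        M = np.array([movie_factors[m] for m in movies])      # |movies| x d factor matrix
        r = np.array([ratings[(user, m)] for m in movies])    # observed ratings
        A = M.T @ M + reg * np.eye(d)                          # regularized normal equations
        return np.linalg.solve(A, M.T @ r)

    d = 5
    movie_factors = {"m1": np.random.rand(d), "m2": np.random.rand(d)}
    ratings = {("alice", "m1"): 4.0, ("alice", "m2"): 2.0}
    print(als_user_update("alice", movie_factors, ratings, d))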

Netflix Collaborative Filtering: [plots of speedup vs. number of machines (ideal, d=100, d=20) and a runtime comparison of Hadoop, MPI, and GraphLab as the number of machines grows].

The Cost of Hadoop

Problems with the chromatic engine: it requires a graph coloring to be available, and frequent barriers make it extremely inefficient for highly dynamic computations where only a small number of vertices are active in each round.

Distributed Consistency, Solution 2: Distributed Locking.

Edge consistency can be guaranteed through locking: associate a reader-writer (RW) lock with each vertex.

Consistency Through Locking: acquire a write lock on the center vertex and read locks on the adjacent vertices.
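A sketch of this protocol with Python threads (illustrative only: plain mutexes stand in for reader-writer locks, the canonical sorted acquisition order is one simple way to avoid deadlock, and the function names are invented):

    import threading

    def make_locks(vertices):
        # One lock per vertex; plain mutexes stand in for RW locks here.
        return {v: threading.Lock() for v in vertices}

    def locked_update(vertex, neighbors, locks, update):
        """Edge-consistent update: lock the center vertex (write) and its
        neighbors (read), always in one canonical order to avoid deadlock."""
        scope = sorted([vertex] + list(neighbors))
        for v in scope:
            locks[v].acquire()
        try:
            update(vertex)
        finally:
            for v in reversed(scope):
                locks[v].release()

    locks = make_locks(["a", "b", "c"])
    threads = [
        threading.Thread(target=locked_update, args=("a", ["b"], locks, print)),
        threading.Thread(target=locked_update, args=("b", ["a", "c"], locks, print)),
    ]
    for t in threads: t.start()
    for t in threads: t.join()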

Consistency Through Locking: in the multicore setting, PThread RW locks; in the distributed setting, distributed locks. The challenge is lock-acquisition latency across machines; the solution is pipelining.

No Pipelining: lock scope 1 → process request 1 → scope 1 acquired → update_function 1 → release scope 1 → process release 1. Each step waits for the previous one to complete.

Pipelining / Latency Hiding: hide latency using pipelining. While scope 1 is being processed, lock requests for scope 2 and scope 3 are already in flight; each update_function runs as soon as its scope is acquired, so lock latency overlaps with useful work.
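A toy simulation of the latency-hiding effect (pure illustration, not GraphLab code; the sleep stands in for the network round-trip of acquiring a remote lock scope): with a pipeline depth of 10, fifty scope acquisitions take roughly a tenth of the serial time.

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    LATENCY = 0.05   # simulated round-trip to acquire a remote lock scope

    def acquire_scope(v):
        time.sleep(LATENCY)          # pretend to wait for remote read/write locks
        return v

    def run(vertices, pipeline_depth):
        start = time.time()
        with ThreadPoolExecutor(max_workers=pipeline_depth) as pool:
            # Issue many lock requests in advance; run each update as its scope arrives.
            futures = [pool.submit(acquire_scope, v) for v in vertices]
            for f in as_completed(futures):
                _ = f.result()       # update_function(v) would run here
        return time.time() - start

    vertices = list(range(50))
    print("pipelined  :", round(run(vertices, pipeline_depth=10), 2), "s")
    print("no pipeline:", round(run(vertices, pipeline_depth=1), 2), "s")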

The GraphLab Framework: graph-based data representation, update functions (user computation), and a consistency model.

What if machines fail? How do we provide fault tolerance?

Checkpoint: (1) stop the world; (2) write state to disk.

Synchronous Snapshot Performance: [plot of progress over time with no snapshot, with a snapshot, and with a snapshot plus one slow machine]. Because we have to stop the world, one slow machine slows everything down!

How can we do better? Take advantage of consistency

Checkpointing: in 1985, Chandy and Lamport introduced an asynchronous snapshotting algorithm for distributed systems. [Figure: the snapshotted and not-yet-snapshotted parts of the system.]

Checkpointing: a fine-grained Chandy-Lamport snapshot is easily implemented within GraphLab as an update function!
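A sketch of that idea in Python (a much-simplified, sequential rendering of the fine-grained snapshot, not GraphLab's implementation): when the snapshot update reaches a vertex it saves the vertex's own data, saves the data on edges to neighbors that have not yet been snapshotted, and schedules those neighbors.

    from collections import deque

    def snapshot(vertex_data, edge_data, neighbors, start):
        """Fine-grained snapshot expressed as an update function over the graph."""
        saved_vertices, saved_edges, done = {}, {}, set()
        scheduler = deque([start])
        while scheduler:
            v = scheduler.popleft()
            if v in done:
                continue
            saved_vertices[v] = dict(vertex_data[v])           # save own state first
            for u in neighbors[v]:
                edge = tuple(sorted((v, u)))
                if u not in done:
                    saved_edges[edge] = dict(edge_data[edge])  # save edge to an
                    scheduler.append(u)                        # un-snapshotted neighbor
            done.add(v)
        return saved_vertices, saved_edges

    vertex_data = {"a": {"rank": 1.0}, "b": {"rank": 2.0}}
    edge_data = {("a", "b"): {"weight": 0.5}}
    neighbors = {"a": ["b"], "b": ["a"]}
    print(snapshot(vertex_data, edge_data, neighbors, start="a"))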

Asynchronous Snapshot Performance: [plot of progress over time with no snapshot, with a snapshot, and with a snapshot plus one slow machine]. No penalty is incurred by the slow machine!

Summary: extended the GraphLab abstraction to distributed systems; two different methods of achieving consistency (graph coloring, and distributed locking with pipelining); efficient implementations; asynchronous fault tolerance with fine-grained Chandy-Lamport snapshots; performance, usability, efficiency, scalability.