A Distributed Framework for Machine Learning and Data Mining in the Cloud, by Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin


Why do we need GraphLab? Machine Learning and Data Mining (MLDM) problems increasingly need systems that can execute MLDM algorithms in parallel on large clusters. Implementing MLDM algorithms in parallel on current systems such as Hadoop and MPI can be both prohibitively complex and costly. The MLDM community needs a high-level abstraction that handles the complexities of graph and network algorithms.

Introduction: A Brief Review of Key Concepts

Sequential vs Parallel Processing In sequential processing, threads are run on a single node in the order in which they are requested. In parallel processing, independent threads are divided among the available nodes.

Data Parallelism Data parallelism focuses on distributing the data across different parallel computing nodes. It is achieved when each processor performs the same task on a different piece of the distributed data.

Data Parallelism (example) Data set: numbers 1-9. Task set: ‘Is Prime’, run on every node.
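To make the slide's example concrete, here is a minimal data-parallel sketch in Python (not part of the original deck): three worker processes stand in for the three nodes, and each applies the same ‘Is Prime’ task to a different chunk of the numbers 1-9.

```python
from multiprocessing import Pool

def is_prime(n):
    """Return True if n is prime."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

if __name__ == "__main__":
    data = list(range(1, 10))          # the slide's data set: numbers 1-9
    with Pool(processes=3) as pool:    # three workers stand in for Nodes 1-3
        # Every worker runs the same task (is_prime) on a different chunk of data.
        results = pool.map(is_prime, data, chunksize=3)
    print(dict(zip(data, results)))
```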

Task Parallelism Task parallelism focuses on distributing execution processes across different parallel computing nodes. It is achieved when each processor executes a different process on the same or different data.

Task Parallelism (example) Data set: numbers 1-9. Task set: ‘Is Prime’, ‘Is Even’, ‘Is Odd’, with each task assigned to a different node.
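The same example can be sketched in a task-parallel form (again, not from the deck): each worker process receives a different task but applies it to the same data set.

```python
from multiprocessing import Pool

def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def is_even(n):
    return n % 2 == 0

def is_odd(n):
    return n % 2 == 1

def run_task(task):
    # Each worker receives a different task but the same data set (numbers 1-9).
    return task.__name__, [task(n) for n in range(1, 10)]

if __name__ == "__main__":
    with Pool(processes=3) as pool:   # one worker per task
        for name, results in pool.map(run_task, [is_prime, is_even, is_odd]):
            print(name, results)
```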

Graph Parallelism Graph parallelism focuses on distributing the vertices of a sparse graph G = {V, E} across different parallel computing nodes. It is achieved when each processor executes a vertex program Q(v) which can interact with the neighboring instances Q(u) for (u, v) in E.

Graph Parallelism (example) A graph G = {V, E} with vertices v1, v2, v3 distributed across the nodes. Data set: numbers 1-9; task set: ‘Is Prime’, ‘Is Even’, ‘Is Odd’.
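As a minimal illustration (my own sketch, not part of the deck), the code below runs a toy vertex program Q(v) over a three-vertex adjacency list; the graph, the per-vertex values, and the averaging rule are all invented for the example.

```python
# Toy undirected graph G = (V, E) as an adjacency list.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
value = {v: float(v) for v in graph}   # arbitrary data attached to each vertex

def vertex_program(v):
    """Q(v): combine the vertex's own value with its neighbors' values Q(u), (u, v) in E."""
    neighbor_avg = sum(value[u] for u in graph[v]) / len(graph[v])
    return 0.5 * value[v] + 0.5 * neighbor_avg

# In a real graph-parallel system each Q(v) runs on whichever node owns v;
# here the sweep is sequential for clarity.
value = {v: vertex_program(v) for v in graph}
print(value)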

MLDM Algorithm Properties

Graph Structured Computation (dependency graph) Many of the recent advances in MLDM have focused on modeling the dependencies between data. By modeling dependencies, we are able to extract more signal from noisy data.

Asynchronous Iterative Computation Synchronous systems update all parameters simultaneously (in parallel) using the parameter values from the previous time step as input. Asynchronous systems update parameters using the most recent parameter values as input. Many MLDM algorithms benefit from asynchronous execution.

Dynamic Computation Static computation requires the algorithm to update all vertices equally often, which wastes time recomputing vertices that have effectively converged. Dynamic computation allows the algorithm to save time by recomputing only those vertices whose neighbors have recently updated.

Serializability For every parallel execution, there exists a sequential execution of update functions which produces the same result.

Serializability Serializability ensures that all parallel executions have an equivalent sequential execution, which eliminates race conditions. Race conditions are programming faults that can produce undetermined program states and behaviors. Many MLDM algorithms converge faster if serializability is ensured, and some, such as dynamic ALS (alternating least squares), require serializability for correctness and/or stability.

Distributed GraphLab Abstraction

PageRank Algorithm
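The body of this slide (the PageRank equation itself) did not survive the transcript. For reference, the standard form of the update that matches the R(v) and w_{u,v} notation used on the following slides is given below; the reset probability α and the number of vertices n are the usual parameters and are supplied here, not copied from the deck.

```latex
R(v) = \frac{\alpha}{n} + (1 - \alpha) \sum_{u:\,(u,v) \in E} w_{u,v}\, R(u)
```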

Data Graph The GraphLab abstraction stores the program state as a directed graph called the data graph, G = (V, E, D), where D is the user-defined data. Data is broadly defined to include model parameters, algorithm state, and statistical data.

Data Graph – PageRank Example A graph G = {V, E} with arbitrary data associated with each vertex and edge. Vertex data: stores R(v), the current PageRank estimate. Edge data: stores w_{u,v}, the directed weight of the link. Graph: the web graph.

Update Functions An update function is a stateless procedure that modifies the data within the scope of a vertex and schedules the future execution of update functions on other vertices. A GraphLab update takes a vertex v and its scope S_v and returns the new version of the data in the scope as well as a set of vertices T: Update: f(v, S_v) -> (S_v, T)

Update Function: PageRank The update function for PageRank computes a weighted sum of the current ranks of neighboring vertices and assigns it as the rank of the current vertex. The algorithm is adaptive: neighbors are scheduled for update only if the value of the current vertex changes by more than a predefined threshold.
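A minimal Python sketch of such an adaptive PageRank update (my own, not GraphLab code); the damping factor and threshold values are assumptions, as is the way the scope data is passed in.

```python
DAMPING = 0.85       # assumed damping factor; the deck does not give a value
THRESHOLD = 1e-3     # assumed convergence threshold for adaptive scheduling

def pagerank_update(v, rank, weight, in_nbrs, out_nbrs, schedule):
    """Recompute R(v) from the weighted ranks of in-neighbors, then schedule
    out-neighbors only if R(v) changed by more than THRESHOLD (adaptive)."""
    old = rank[v]
    acc = sum(weight[(u, v)] * rank[u] for u in in_nbrs[v])
    rank[v] = (1 - DAMPING) / len(rank) + DAMPING * acc
    if abs(rank[v] - old) > THRESHOLD:
        for u in out_nbrs[v]:
            schedule(u)
```

In a full sketch this function would be plugged into an execution loop like the one shown after the next slide.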

The GraphLab Execution Model The GraphLab execution model enables efficient distribution by relaxing the execution-ordering requirements of shared memory and allowing the GraphLab runtime engine to determine the best order in which to run vertices. It eliminates messages and isolates the user-defined algorithm from the movement of data, allowing the system to choose when and how to move program state.

GraphLab Execution Model
  Input: Data Graph G = (V, E, D)
  Input: Initial vertex set T = {v1, v2, ...}
  while T is not empty do
    v <- RemoveNext(T)
    (T', S_v) <- f(v, S_v)
    T <- T ∪ T'
  Output: Modified Data Graph G = (V, E, D')
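The following runnable Python sketch mirrors the loop above (it is my own rendering, not GraphLab code); the max-propagation update used as the driver at the bottom is a made-up toy chosen so the loop visibly terminates.

```python
from collections import deque

def run_engine(graph, data, update, initial_vertices):
    """Sequential sketch of the GraphLab execution model: pull a vertex off the
    scheduler set T, apply the update function, and merge any newly scheduled
    vertices back into T."""
    T = deque(initial_vertices)
    scheduled = set(initial_vertices)
    while T:                                    # while T is not empty
        v = T.popleft()                         # v <- RemoveNext(T)
        scheduled.discard(v)
        new_vertices = update(v, graph, data)   # (T', S_v) <- f(v, S_v)
        for u in new_vertices:                  # T <- T ∪ T'
            if u not in scheduled:
                scheduled.add(u)
                T.append(u)
    return data                                 # modified data graph

# Toy usage: propagate the maximum value through an undirected graph.
graph = {1: [2], 2: [1, 3], 3: [2]}
data = {1: 5.0, 2: 1.0, 3: 0.0}

def max_propagate(v, graph, data):
    new_val = max([data[v]] + [data[u] for u in graph[v]])
    changed = new_val != data[v]
    data[v] = new_val
    return graph[v] if changed else []

print(run_engine(graph, data, max_propagate, list(graph)))
```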

Ensuring Serializability GraphLab ensures a serializable execution by stipulating that for every parallel execution, there exists a sequential execution of update functions which produces the same result. GraphLab provides several consistency models (full, edge, and vertex consistency) which allow the runtime to optimize the parallel execution while maintaining serializability. The greater the consistency, the lower the available parallelism.

Full Consistency The full consistency model ensures that the scopes of concurrently executing update functions do not overlap: the update function has complete read-write access to its entire scope. This limits the potential parallelism, since concurrently executing update functions must be at least two vertices apart.

Edge Consistency The edge consistency model ensures that each update function has exclusive read-write access to its vertex and adjacent edges, but read-only access to adjacent vertices. This increases parallelism by allowing update functions with slightly overlapping scopes to run safely in parallel.

Vertex Consistency The vertex consistency model only provides write access to the central vertex data. This allows all update functions to run in parallel, providing maximum parallelism; however, it is the least consistent model available.

Global Values Many MLDM algorithms require the maintenance of global statistics describing data stored in the data graph. GraphLab defines global values as values which are read by update functions and written with sync operations.

Sync Operation The sync operation is an associative commutative sum which is defined over all parts of the graph. This supports tasks like normalization that are common in MLDM algorithms. The sync operation runs continuously in the background to maintain updated estimates of the global value. Ensuring serializability of the sync operation is costly and requires synchronization and halting all computation.
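A minimal sketch of what a sync operation computes: an associative, commutative reduction over all vertex data. The map/combine/finalize decomposition, the function names, and the rank values below are illustrative assumptions, not the GraphLab API.

```python
from functools import reduce

def sync(vertex_data, map_fn, combine_fn, zero, finalize=lambda x: x):
    """Sketch of a sync operation: an associative, commutative reduction over
    all vertex data, e.g. for maintaining global statistics."""
    partials = (map_fn(d) for d in vertex_data.values())
    return finalize(reduce(combine_fn, partials, zero))

# Example: total rank, usable to normalize PageRank values.
ranks = {"a": 0.2, "b": 0.5, "c": 0.3}
total = sync(ranks, map_fn=lambda r: r, combine_fn=lambda x, y: x + y, zero=0.0)
print(total)   # 1.0
```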

Distributed GraphLab Design

Distributed Data Graph The graph is partitioned into k parts, where k is much greater than the number of machines. Each part, called an atom, is stored as a separate file on a distributed file system (DFS). A meta-graph stores the connectivity structure and file locations of the k atoms.

Distributed Data Graph - Ghost Vertices Each atom also stores information about its ghosts: the set of vertices and edges adjacent to the partition boundary. Ghost vertices maintain the adjacency structure and replicate remote data. They are used as caches for their true counterparts across the network, and cache coherence is managed with version control.

Distributed GraphLab Engines The distributed GraphLab engine emulates the GraphLab execution model of the shared-memory abstraction. It is responsible for executing update functions, executing sync operations, maintaining the set of scheduled vertices T, and ensuring serializability with respect to the appropriate consistency model. There are two engine types: the Chromatic Engine and the Distributed Locking Engine.

Chromatic Engine The Chromatic Engine uses vertex coloring to satisfy the edge consistency model by synchronously executing all vertices of the same color in the vertex set T before proceeding to the next color. Vertex consistency is satisfied by assigning all vertices the same color. Full consistency is satisfied by ensuring that no vertex shares the same color as any of its distance-two neighbors. The Chromatic Engine has low overhead, but does not support vertex prioritization.
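To make the coloring idea concrete, here is a small Python sketch (not the actual engine): a greedy coloring followed by color-by-color execution. Same-colored vertices are never adjacent, so running a color as a batch satisfies edge consistency; the graph and the squaring update are invented for the example.

```python
def greedy_color(graph):
    """Greedy vertex coloring: adjacent vertices never share a color."""
    color = {}
    for v in graph:
        taken = {color[u] for u in graph[v] if u in color}
        color[v] = next(c for c in range(len(graph)) if c not in taken)
    return color

def chromatic_engine(graph, data, update):
    """One sweep of a chromatic-engine-style schedule: each color is a batch
    that could safely run in parallel under the edge consistency model."""
    color = greedy_color(graph)
    for c in sorted(set(color.values())):
        batch = [v for v in graph if color[v] == c]
        # In the real engine this batch runs in parallel across machines;
        # here it is a simple loop.
        for v in batch:
            update(v, graph, data)

graph = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: []}
data = {v: 0 for v in graph}
chromatic_engine(graph, data, lambda v, g, d: d.__setitem__(v, v * v))
print(data)
```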

Distributed Locking Engine The Distributed Locking Engine uses mutual exclusion by associating a readers-writer lock with each vertex. Vertex consistency is achieved by acquiring a write lock on the central vertex of each requested scope. Edge consistency is achieved by acquiring a write lock on the central vertex and read locks on adjacent vertices. Full consistency is achieved by acquiring write locks on the central vertex and all adjacent vertices.
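A hedged single-machine sketch of scope locking: the Python standard library has no readers-writer lock, so plain mutexes stand in for the read/write locks, and locks are acquired in a canonical (sorted) order to avoid deadlock. This is an illustration of the idea, not the engine's implementation.

```python
import threading

class EdgeConsistencyLocks:
    """One lock per vertex, always acquired in sorted order to avoid deadlock.
    A real engine would take a write lock on the center and read locks on the
    neighbors; plain mutexes are used here as a stand-in."""

    def __init__(self, vertices):
        self.locks = {v: threading.Lock() for v in vertices}

    def acquire_scope(self, v, neighbors):
        scope = sorted(set([v]) | set(neighbors))
        for u in scope:
            self.locks[u].acquire()
        return scope

    def release_scope(self, scope):
        for u in reversed(scope):
            self.locks[u].release()

# Usage: lock the scope of vertex 2 in the graph {1: [2], 2: [1, 3], 3: [2]}.
locks = EdgeConsistencyLocks([1, 2, 3])
scope = locks.acquire_scope(2, [1, 3])
# ... run the update function on vertex 2 here ...
locks.release_scope(scope)
```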

Pipelined Locking and Prefetching Each machine maintains a pipeline of locks that have been requested but not yet fulfilled. The pipelining system uses callbacks instead of readers-writer locks, since the latter would halt the pipeline. Lock acquisition requests provide a pointer to a callback, which is invoked once the request is fulfilled. Pipelining reduces latency by synchronizing locked data immediately after each local lock.

Pipelined Locking Engine Thread Loop
  while not done do
    if pipeline has ready vertex v then
      Execute (T', S_v) = f(v, S_v)
      // update the scheduler on each machine
      for each machine p, send {s ∈ T' : owner(s) = p}
      Release locks and push changes to S_v in the background
    else
      Wait on the pipeline

Fault Tolerance GraphLab uses a distributed checkpoint mechanism called Snapshot Update to provide fault tolerance. Snapshot Update can be deployed synchronously or asynchronously. Asynchronous snapshots are more efficient and can guarantee a consistent snapshot under the following conditions: edge consistency is used on all update functions; the schedule completes before the scope is unlocked; and Snapshot Update is prioritized over other updates.

Snapshot Update on vertex v
  if v was already snapshotted then quit
  Save D_v                        // save the current vertex
  foreach u ∈ N[v] do             // loop over neighbors
    if u was not snapshotted then
      Save data on edge D_{u→v}
      Schedule u for a Snapshot Update
  Mark v as snapshotted
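A Python rendering of the Snapshot Update pseudocode above, written as a sketch; the driver graph, vertex data, and edge data at the bottom are made up, and the scheduler is simplified to a plain work list.

```python
def snapshot_update(v, graph, data, snapshot, snapshotted, schedule):
    """Save the vertex data, save data on edges to not-yet-snapshotted
    neighbors, and schedule those neighbors so the snapshot spreads."""
    if v in snapshotted:                       # already snapshotted: quit
        return
    snapshot["vertex"][v] = data["vertex"][v]  # save D_v
    for u in graph[v]:                         # loop over neighbors
        if u not in snapshotted:
            # save the edge data D_{u->v}, if such an edge exists
            snapshot["edge"][(u, v)] = data["edge"].get((u, v))
            schedule(u)                        # schedule u for a Snapshot Update
    snapshotted.add(v)                         # mark v as snapshotted

# Minimal driver: crawl the snapshot outward from a seed vertex.
graph = {1: [2], 2: [1, 3], 3: [2]}
data = {"vertex": {1: "a", 2: "b", 3: "c"},
        "edge": {(1, 2): 1.0, (2, 1): 1.0, (2, 3): 2.0, (3, 2): 2.0}}
snapshot = {"vertex": {}, "edge": {}}
snapshotted, pending = set(), [1]
while pending:
    snapshot_update(pending.pop(), graph, data, snapshot, snapshotted, pending.append)
print(snapshot)
```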

System Design Overview (diagram). Initialization phase: a MapReduce graph builder parses and partitions the raw graph data on the distributed file system, constructs the atom collection, and builds the atom index, all of which are written back to the distributed file system. GraphLab execution phase: the cluster loads the atom index and atom files from the distributed file system, monitoring and atom placement are coordinated over TCP RPC communication, and each machine runs a GraphLab engine.

Locking Engine Design Overview (diagram). Each machine's GraphLab engine runs update-function execution threads driven by a scheduler and a scope-prefetching pipeline (locks + data); the distributed graph layer combines local graph storage with a remote graph cache and distributed locks, and machines communicate via asynchronous RPC over TCP.

Applications

Netflix Movie Recommendation The Netflix movie recommendation task uses collaborative filtering to predict the movie ratings for each user based on the ratings of similar users. The alternating least squares (ALS) algorithm is often used and can be represented in the GraphLab abstraction: the sparse ratings matrix R defines a bipartite graph connecting each user with the movies they rated. Vertices are users and movies, and edges contain the ratings for each user-movie pair.
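A minimal sketch (not from the deck) of how a sparse ratings matrix induces this bipartite data graph; the user and movie names, the ratings, and the latent-factor dimensionality are made up, and the ALS update itself is omitted.

```python
import numpy as np

# Hypothetical toy ratings matrix R: rows are users, columns are movies a-d,
# and 0 marks an unobserved rating.
R = np.array([[5, 0, 3, 0],
              [0, 4, 0, 2]])
users = ["user1", "user2"]
movies = ["a", "b", "c", "d"]
d = 5   # assumed dimensionality of the latent factors

# Bipartite data graph: a latent-factor vector per user/movie vertex, and an
# edge holding the rating for every observed (user, movie) pair.
vertices = {v: np.random.rand(d) for v in users + movies}
edges = {(users[i], movies[j]): int(R[i, j]) for i, j in zip(*np.nonzero(R))}
print(edges)   # {('user1', 'a'): 5, ('user1', 'c'): 3, ('user2', 'b'): 4, ('user2', 'd'): 2}
```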

Netflix Comparisons The GraphLab implementation was compared against Hadoop and MPI using between 4 and 64 machines. GraphLab performs between 40 and 60 times faster than Hadoop, and it also slightly outperforms the optimized MPI implementation.

Netflix Scaling with Intensity Plotted is the speedup achieved for varying values of the dimensionality d and the corresponding number of cycles required per update. Extrapolating to obtain the theoretically optimal runtime, the estimated overhead of Distributed GraphLab at 64 machines is 12x for dimensionality d = 5 and 4.9x for d = 100. This overhead includes graph loading and communication, and it gives a measurable objective for future optimizations.

Video Co-segmentation (CoSeg) Video co-segmentation automatically identifies and clusters spatio-temporal segments of video that share similar texture and color characteristics. Frames of high-resolution video are processed by coarsening each frame to a regular grid of rectangular super-pixels. The CoSeg algorithm predicts the best label (e.g., sky, building, grass, pavement, trees) for each super-pixel.

CoSeg Algorithm Implementation CoSeg uses a Gaussian Mixture Model in conjunction with Loopy Belief Propagation. Updates that are expected to change vertex values significantly are prioritized. Distributed GraphLab is the only distributed graph abstraction that allows the use of prioritized scheduling. CoSeg scales excellently because its graph is very sparse and its computation is highly intensive.

Named Entity Recognition (NER) Named Entity Recognition is the task of determining the type (e.g., Person, Place, or Thing) of a noun-phrase (e.g., Obama, Chicago, or Car) from its context (e.g., “President..”, “Lives near..”, or “bought a..”). The data graph is bipartite, with one set of vertices corresponding to noun-phrases and the other corresponding to contexts. There is an edge between a noun-phrase and a context if the noun-phrase occurs in that context. Example groupings: Food: onion, garlic, noodles, blueberries. Religion: Catholic, Freemasonry, Marxism, Catholic Chr.

NER Comparisons The GraphLab implementation of NER achieved a 20-30x speedup over Hadoop and was comparable to the optimized MPI implementation. However, GraphLab scaled poorly, achieving only a 3x improvement using 16x more machines. This poor scaling can be attributed to the large vertex data size, dense connectivity, and poor partitioning.

EC2 Cost Evaluation The price-runtime curves for GraphLab and Hadoop illustrate the monetary cost of deploying either system. The price-runtime curve demonstrates diminishing returns: the cost of attaining reduced runtimes increases faster than linearly. For the Netflix application, GraphLab is about two orders of magnitude more cost-effective than Hadoop.

Conclusion Distributed GraphLab extends the shared-memory GraphLab abstraction to the distributed setting by refining the execution model, relaxing the scheduling requirements, introducing a new distributed data graph, introducing new execution engines, and introducing fault tolerance. Distributed GraphLab outperforms Hadoop by 20-60x and is competitive with tailored MPI implementations.

Asynchronous Iterative Update (illustration): a dependency graph of shoppers' preferences, with a local update applied to a single shopper.