
1 GraphLab A New Framework for Parallel Machine Learning
Yucheng Low Aapo Kyrola Carlos Guestrin Joseph Gonzalez Danny Bickson Joe Hellerstein

2 Exponential Parallelism
13 million Wikipedia pages. 3.6 billion photos on Flickr. [Chart: processor speed (GHz) vs. release date, showing exponentially increasing sequential performance in the past, constant sequential performance now, and exponentially increasing parallel performance.] A decade ago, processors delivered exponentially increasing sequential performance. Over the last five years or so, sequential performance has stagnated; instead we now observe exponentially increasing parallel performance. We need to take advantage of parallelism to make use of the ever-increasing dataset sizes that are now available.

3 Parallel Programming is Hard
Designing efficient parallel algorithms is hard: race conditions and deadlocks, parallel memory bottlenecks, architecture-specific concurrency, and code that is difficult to debug. ML experts repeatedly address the same parallel design challenges. Scaling any non-trivial algorithm requires the designer to reason about race conditions, deadlocks, and a variety of other systems issues, and ML experts (often graduate students) have to address these same challenges again and again. We therefore avoid these problems by using high-level abstractions that manage much of the complexity for us. One abstraction that has gained significant popularity lately is MapReduce, which we quickly review here.

4 MapReduce – Map Phase
[Figure: four CPUs, each computing an independent score.] MapReduce has two parts: a Map stage and a Reduce stage. The Map stage represents embarrassingly parallel computation: each computation is independent and can be performed on a different machine without any communication. Embarrassingly parallel independent computation; no communication needed.

5 MapReduce – Map Phase
[Figure: the four CPUs continue scoring images independently.] For instance, we could use MapReduce to perform feature extraction on a large number of pictures, say to compute an attractiveness score for each one. Embarrassingly parallel independent computation; no communication needed.

6 MapReduce – Map Phase
[Figure: the map phase completes across all four CPUs.] Embarrassingly parallel independent computation; no communication needed.

7 MapReduce – Reduce Phase
[Figure: the per-image results are folded into aggregate values on two CPUs.] The Reduce stage is essentially a "fold" or aggregation operation over the results. It can be used, for instance, to compile summary statistics. Fold/Aggregation. A minimal sketch of the whole pattern follows.
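To make the map/reduce split concrete, here is a small, self-contained C++ sketch of the pattern just described: an independent per-image map step followed by a fold. The Image struct and the attractiveness score are made-up placeholders for illustration; this is not Hadoop or GraphLab code.

```cpp
// Minimal sketch of the map/reduce pattern described above.
// "Image" and attractiveness() are placeholders for the example.
#include <iostream>
#include <numeric>
#include <vector>

struct Image {
    std::vector<double> pixels;
};

// Map: embarrassingly parallel, one independent score per image.
double attractiveness(const Image& img) {
    // Stand-in feature extraction: just average the pixel values.
    double sum = std::accumulate(img.pixels.begin(), img.pixels.end(), 0.0);
    return img.pixels.empty() ? 0.0 : sum / img.pixels.size();
}

int main() {
    std::vector<Image> images = {{{0.9, 0.3}}, {{0.4, 0.2}}, {{0.8, 0.6}}};

    // Map phase: could be farmed out to separate cores or machines,
    // since no score depends on any other image.
    std::vector<double> scores;
    for (const Image& img : images) scores.push_back(attractiveness(img));

    // Reduce phase: fold the per-image results into a summary statistic.
    double total = std::accumulate(scores.begin(), scores.end(), 0.0);
    std::cout << "mean score = " << total / scores.size() << "\n";
    return 0;
}
```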

8 Interdependent Computation: Not MapReduceable
Related data: but what if we have a million images, all related, and we want to reason about them together? Say, for instance, that attractive people tend to make friends with other attractive people. The computation becomes interdependent, and this is no longer easily MapReduce-able.

9 Parallel Computing and ML
Not all algorithms are efficiently data parallel. Data-parallel: feature extraction, cross validation. Complex parallel structure: kernel methods, belief propagation, SVM, tensor factorization, sampling, deep belief networks, neural networks, lasso. Algorithms such as cross validation and feature extraction are data parallel and easily MapReduce-able, but there is a much larger world of algorithms with complex parallel structure that do not fit well in the MapReduce framework. However, there are some useful properties common to a number of ML algorithms that we can exploit.

10 Common Properties
1) Sparse data dependencies: sparse primal SVM, tensor/matrix factorization. 2) Local computations: sampling, belief propagation. 3) Iterative updates: expectation maximization, optimization (operation A, operation B, repeated). Many ML algorithms have relatively sparse data dependencies; that is, there are not too many relationships between the data elements. Computation is local: it can be broken into pieces, each of which does not depend on a large amount of program state. Finally, many ML algorithms are iterative in nature: the same set of operations is applied repeatedly.

11 Gibbs Sampling
[Figure: a 3x3 grid of variables X1–X9.] 1) Sparse data dependencies. 2) Local computations. 3) Iterative updates. Gibbs sampling is a simple example that satisfies all three properties. Its data dependencies can be sparse, depending on the underlying graphical model. The amount of state needed to sample a single variable is small, essentially governed by the size of its Markov blanket: to sample X6 you only need information about X3 and X9. Finally, Gibbs sampling is iterative, as this procedure is repeated identically on all the vertices.
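The three properties are easy to see in code. Below is a small illustrative C++ sketch (not GraphLab code) of a single-site Gibbs update for a binary pairwise MRF on a 3x3 grid over X1–X9: the update touches only the Markov blanket of one variable, and the same update is swept over all vertices repeatedly. The Ising-style potential and the coupling strength are assumptions made for the example.

```cpp
// Sketch of one Gibbs update for a binary pairwise MRF (Ising-style).
// The coupling value and the 4-connected grid are illustrative only.
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Unnormalized pairwise potential phi(x_v, x_u).
double edge_potential(int a, int b, double coupling = 0.5) {
    return std::exp(a == b ? coupling : -coupling);
}

// Resample x[v] conditioned only on its Markov blanket (its neighbors):
// this is the "local computation" property from the slide.
void gibbs_update(std::vector<int>& x,
                  const std::vector<std::vector<int>>& neighbors,
                  int v, std::mt19937& rng) {
    double w0 = 1.0, w1 = 1.0;
    for (int u : neighbors[v]) {
        w0 *= edge_potential(0, x[u]);
        w1 *= edge_potential(1, x[u]);
    }
    std::bernoulli_distribution coin(w1 / (w0 + w1));
    x[v] = coin(rng) ? 1 : 0;
}

int main() {
    // A 4-connected 3x3 grid over x1..x9 (one possible graphical model).
    std::vector<std::vector<int>> neighbors = {
        {1, 3},    {0, 2, 4},    {1, 5},
        {0, 4, 6}, {1, 3, 5, 7}, {2, 4, 8},
        {3, 7},    {4, 6, 8},    {5, 7}};
    std::vector<int> x(9, 0);
    std::mt19937 rng(0);

    // Iterative updates: sweep over all vertices repeatedly.
    for (int sweep = 0; sweep < 100; ++sweep)
        for (int v = 0; v < 9; ++v) gibbs_update(x, neighbors, v, rng);

    for (int v : x) std::cout << v << ' ';
    std::cout << '\n';
    return 0;
}
```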

12 GraphLab is the Solution
Designed specifically for ML needs: expresses data dependencies and iterative computation. Simplifies the design of parallel programs: abstracts away hardware issues, provides automatic data synchronization, and addresses multiple hardware architectures. The GraphLab abstraction targets these properties directly, making it easy to express data dependencies and iterative procedures. The implementation described here is multi-core, but the abstraction generalizes to the distributed setting; a distributed implementation is work in progress.

13 A New Framework for Parallel Machine Learning
GraphLab A New Framework for Parallel Machine Learning

14 Update Functions and Scopes
[Diagram: the GraphLab model, made up of the Data Graph, the Shared Data Table, Update Functions and Scopes, and Scheduling.] The GraphLab model is defined by four parts: the data graph, which expresses the sparse data dependencies in your computation; the shared data table, which holds global data and supports global computation; the scheduler, which determines the order of computation; and the scope system, which provides thread safety and consistency.

15 Data Graph A Graph with data associated with every vertex and edge.
x3: sample value. C(X3): sample counts. Φ(X6,X9): binary potential. The data graph is a graph with data associated with every vertex and edge. The data can be of any kind: parameters, vectors, matrices, images, and so on. In the case of Gibbs sampling, the data graph is exactly the Markov random field: on each vertex we might store the current sample as well as a histogram of all the samples seen so far, while on each edge we store the binary potential.

16 Update Functions Update functions are operations that are applied to a vertex and transform the data in the scope of the vertex, where the scope consists of the data on the vertex itself, its adjacent vertices, and its adjacent edges. Gibbs update: read the samples on adjacent vertices, read the edge potentials, and compute a new sample for the current vertex. The data graph is modified via update functions; in the case of Gibbs sampling, the sampling step fits naturally within this framework. A sketch of how the pieces fit together follows.
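As a rough sketch of how a data graph and an update function fit together, the C++ below stores Gibbs-style data on vertices and edges and defines an update that reads and writes only the scope of one vertex. The type names and the gibbs_update signature are hypothetical illustrations, not GraphLab's actual API.

```cpp
// Hypothetical sketch of the data-graph / update-function idea; the
// names here are illustrative, not GraphLab's real interface.
#include <algorithm>
#include <map>
#include <random>
#include <utility>
#include <vector>

struct VertexData {            // e.g. for Gibbs: current sample + counts
    int sample = 0;
    std::vector<int> counts = std::vector<int>(2, 0);
};

struct EdgeData {              // e.g. binary potential phi(x_u, x_v)
    double same = 1.6, diff = 0.6;
};

struct DataGraph {
    std::vector<VertexData> vertices;
    std::vector<std::vector<int>> adj;                 // adjacency lists
    std::map<std::pair<int, int>, EdgeData> edges;     // keyed by (min,max)
};

// An "update function": reads and writes only the scope of vertex v
// (v itself, its adjacent edges, and its adjacent vertices).
void gibbs_update(DataGraph& g, int v, std::mt19937& rng) {
    double w0 = 1.0, w1 = 1.0;
    for (int u : g.adj[v]) {
        const EdgeData& e = g.edges.at({std::min(u, v), std::max(u, v)});
        int xu = g.vertices[u].sample;
        w0 *= (xu == 0 ? e.same : e.diff);
        w1 *= (xu == 1 ? e.same : e.diff);
    }
    std::bernoulli_distribution coin(w1 / (w0 + w1));
    g.vertices[v].sample = coin(rng) ? 1 : 0;
    g.vertices[v].counts[g.vertices[v].sample] += 1;   // sample histogram
}

int main() {
    DataGraph g;
    g.vertices.resize(2);
    g.adj = {{1}, {0}};
    g.edges[{0, 1}] = EdgeData{};
    std::mt19937 rng(0);
    gibbs_update(g, 0, rng);
    return 0;
}
```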

17 Update Function Schedule
[Figure: two CPUs pulling vertices a, h, i, b, d from a shared schedule over a data graph.] Update functions are evaluated in parallel based on a schedule, which abstractly represents a sequence of tasks to be executed. Here we have two CPUs and a data graph; each processor reads a vertex from the schedule and executes its update function.

18 Update Function Schedule
[Figure: the two CPUs continue pulling vertices from the schedule.] This repeats in parallel until the scheduler runs out of tasks or some termination condition is reached.

19 Static Schedule
The scheduler determines the order of update function evaluations. We could define simple static schedules. Synchronous schedule: every vertex is updated simultaneously (the Jacobi style). Round-robin schedule: every vertex is updated sequentially (the Gauss-Seidel style). A small sketch contrasting the two follows.
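A minimal C++ sketch of the two static schedules, using a toy neighbor-averaging update: the synchronous (Jacobi-style) sweep reads only old values, while the round-robin (Gauss-Seidel-style) sweep updates in place. The chain graph and the averaging update are invented for illustration and are not GraphLab code.

```cpp
// Synchronous (Jacobi) sweep vs. round-robin (Gauss-Seidel) sweep on a
// toy chain graph; the "update" just averages a vertex's neighbors.
#include <iostream>
#include <vector>

int main() {
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
    std::vector<double> x = {0.0, 0.0, 0.0, 12.0};

    auto neighbor_mean = [&](const std::vector<double>& v, int i) {
        double s = 0.0;
        for (int u : adj[i]) s += v[u];
        return s / adj[i].size();
    };

    // Synchronous schedule: all vertices "updated simultaneously",
    // so every new value is computed from the old values.
    std::vector<double> x_sync(x.size());
    for (size_t i = 0; i < x.size(); ++i) x_sync[i] = neighbor_mean(x, i);

    // Round-robin schedule: vertices updated sequentially, in place,
    // so later updates already see the earlier ones.
    for (size_t i = 0; i < x.size(); ++i) x[i] = neighbor_mean(x, i);

    for (size_t i = 0; i < x.size(); ++i)
        std::cout << "vertex " << i << ": sync=" << x_sync[i]
                  << " round-robin=" << x[i] << '\n';
    return 0;
}
```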

20 Need for Dynamic Scheduling
Converged. Slowly converging. Focus effort. However, static schedules can be insufficient. For instance, one part of the problem could be easy and converge in far fewer iterations than another part. Dynamic scheduling allows us to focus computation on the difficult part of the problem.

21 Dynamic Schedule
[Figure: two CPUs pulling vertices from the schedule while update functions push new tasks back onto it.] To see how a dynamic schedule might work, we can repeat the earlier scheduling example. As before, each CPU reads an element off the schedule and runs the update function, but now the update function has the opportunity to insert new tasks back into the schedule.

22 Obtain different algorithms simply by changing a flag!
Dynamic schedule: update functions can insert new tasks into the schedule. FIFO queue: Wildfire BP [Selvatici et al.]. Priority queue: Residual BP [Elidan et al.]. Splash schedule: Splash BP [Gonzalez et al.]. Many types of dynamic schedules are possible; by changing the underlying data structure we obtain different kinds of schedules. For belief propagation, we obtain different BP-style algorithms just by changing a single flag: --scheduler=fifo, --scheduler=priority, --scheduler=splash. A sketch of the idea follows.
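To illustrate how a dynamic schedule behaves, here is a small self-contained C++ sketch (not the GraphLab scheduler) in which each update can push new, prioritized tasks for its neighbors back onto the queue, in the spirit of the residual-style prioritization mentioned above; swapping the priority queue for a FIFO queue gives a FIFO schedule. The toy relaxation update and the 1e-3 threshold are assumptions for the example.

```cpp
// Sketch of dynamic scheduling: updates reschedule their neighbors with
// a priority proportional to how much the vertex value changed.
#include <cmath>
#include <iostream>
#include <queue>
#include <vector>

struct Task {
    int vertex;
    double priority;
    bool operator<(const Task& o) const { return priority < o.priority; }
};

int main() {
    // Toy problem: relax each value toward the mean of its neighbors.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
    std::vector<double> value = {0.0, 0.0, 0.0, 10.0};

    std::priority_queue<Task> schedule;   // swap for std::queue -> FIFO
    for (int v = 0; v < 4; ++v) schedule.push({v, 1.0});

    while (!schedule.empty()) {
        int v = schedule.top().vertex;
        schedule.pop();

        double mean = 0.0;
        for (int u : adj[v]) mean += value[u];
        mean /= adj[v].size();

        double new_value = 0.5 * value[v] + 0.5 * mean;
        double change = std::fabs(new_value - value[v]);
        value[v] = new_value;

        // Dynamic part: if this vertex changed a lot, reschedule its
        // neighbors with priority proportional to the change.
        if (change > 1e-3)
            for (int u : adj[v]) schedule.push({u, change});
    }
    for (double x : value) std::cout << x << ' ';
    std::cout << '\n';
    return 0;
}
```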

23 Global Information What if we need global information?
However, all of this is restricted to local computation. What if we need global information: algorithm parameters, sufficient statistics, the sum over all the vertices? For instance, what if you want to pass algorithm parameters to update functions?

24 Shared Data Table (SDT)
Global constant parameters. Constant: temperature. Constant: total # samples. This brings us to the shared data table: a table in which global constant parameters can be stored. The table is accessible to all update functions, but read-only. For Gibbs sampling, for instance, it could store parameters such as the total number of samples to draw or a temperature parameter. It can also perform computation.

25 Sync Operation Sync is a fold/reduce operation over the graph
Accumulate performs an aggregation over vertices; Apply makes a final modification to the accumulated data. Example: compute the average of all the vertices with an accumulate function that adds and an apply function that divides by |V|. The Sync operation is similar to MapReduce's "reduce" in that it performs a reduction over all the graph data. It can be associated with an entry in the shared data table and is defined by two user functions: accumulate and apply. To compute the average of all the vertices, the accumulate function adds, and the apply function divides by the number of vertices. When sync is called on the associated shared data table entry, the add function is invoked with an initial zero value and the first vertex in the graph, then repeats for all vertices, accumulating the result as it goes; when it is done, the apply function is called on the final value and the result is stored back into the shared data table. This allows global information to be aggregated. More importantly, GraphLab allows a sync to run at the same time as other update functions, so it can be used for background tasks such as estimating a termination condition. A minimal sketch of the fold/apply pattern follows.
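A minimal C++ sketch of the sync fold/apply pattern: accumulate folds over every vertex, and apply performs the final transformation (dividing by |V| to get the average). The vertex values are arbitrary illustrative numbers, and the loop stands in for whatever traversal the framework performs.

```cpp
// Sketch of the sync fold/apply pattern from the slide.
#include <iostream>
#include <vector>

int main() {
    std::vector<double> vertex_data = {1, 2, 2, 3, 2, 1, 9, 1, 3, 2, 1, 5};

    // Accumulate: add each vertex's value into the running total.
    double accum = 0.0;
    for (double v : vertex_data) accum += v;       // "add" accumulator

    // Apply: one final transformation of the accumulated value.
    double average = accum / vertex_data.size();   // "divide by |V|"

    // In GraphLab this result would be written into the shared data
    // table, readable by all update functions.
    std::cout << "average = " << average << '\n';
    return 0;
}
```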

26 Shared Data Table (SDT)
Global constant parameters and global computation (sync operations). Constant: temperature. Constant: total # samples. Sync: log-likelihood. Sync: sample statistics. These are the shared data table entries in the context of the Gibbs example.

27 Safety and Consistency

28 Write-Write Race
If adjacent update functions write simultaneously, we can get a clash on the shared center edge: the left update writes one value, the right update writes another, and the final value on the edge can become inconsistent. For instance, the left vertex could write one histogram to the edge and the right vertex another; due to the collision, the final value can end up very wrong.

29 Race Conditions + Deadlocks
This is just one of the many possible races that could occur in your code. Race-free code is extremely difficult to write. The GraphLab design, however, ensures race-free operation, and it does this through the use of scope rules.

30 Scope Rules Full Consistency
GraphLab allows you to pick from a few consistency models, of which the strongest is the full consistency model. It works by ensuring that no update function scopes overlap, guaranteeing safety for all update functions.

31 Full Consistency
Full consistency, however, has reduced opportunities for parallelism: it only allows update functions two vertices apart to run in parallel.

32 Obtaining More Parallelism
Not all update functions will modify the entire scope! Full consistency vs. edge consistency. We can relax the model to obtain more parallelism by taking advantage of this fact: the edge consistency model only guarantees safe access to the data on the current vertex and its adjacent edges. Belief propagation only uses edge data; Gibbs sampling only needs to read adjacent vertices.

33 Edge Consistency
The edge consistency model is weaker, with a corresponding increase in parallelism: update functions just one vertex apart can now run in parallel.
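One way (among others) such a guarantee can be enforced is with per-vertex reader/writer locks, as in the C++ sketch below: the center vertex is write-locked, its neighbors are read-locked, and locks are always taken in increasing vertex ID order so concurrent updates cannot deadlock. This is an illustration of the idea, not GraphLab's internal implementation; it assumes C++17 for std::shared_mutex.

```cpp
// Sketch of enforcing an edge-consistency-style guarantee with
// per-vertex reader/writer locks, acquired in a global order.
#include <algorithm>
#include <functional>
#include <iostream>
#include <shared_mutex>
#include <vector>

void edge_consistent_update(int v,
                            const std::vector<std::vector<int>>& adj,
                            std::vector<std::shared_mutex>& locks,
                            const std::function<void(int)>& update) {
    std::vector<int> scope = adj[v];
    scope.push_back(v);
    std::sort(scope.begin(), scope.end());   // global order => no deadlock

    for (int u : scope) {
        if (u == v) locks[u].lock();          // center vertex: exclusive
        else        locks[u].lock_shared();   // neighbors: shared (read)
    }

    update(v);                                // safe within the scope

    for (auto it = scope.rbegin(); it != scope.rend(); ++it) {
        if (*it == v) locks[*it].unlock();
        else          locks[*it].unlock_shared();
    }
}

int main() {
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1}};
    std::vector<std::shared_mutex> locks(3);
    std::vector<int> data = {1, 0, 2};
    edge_consistent_update(1, adj, locks,
                           [&](int v) { data[v] = data[0] + data[2]; });
    std::cout << data[1] << '\n';
    return 0;
}
```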

34 Obtaining More Parallelism
Full consistency vs. edge consistency vs. vertex consistency. Finally, if the update function does not even need to access edge data, as in "map"-style operations such as feature extraction on vertex data, vertex consistency is sufficient. The vertex consistency model is the weakest: it only guarantees safe access to the vertex data itself.

35 Vertex Consistency
Vertex consistency offers the largest amount of parallelism, since adjacent vertices can all be run in parallel.

36 Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result. [Figure: a parallel execution on two CPUs over a simple three-vertex graph, and an equivalent sequential execution on one CPU.] With the right choice of consistency model, if GraphLab uses two CPUs to perform a particular sequence of update functions in parallel, sequential consistency means there exists a sequentialization of those update functions onto a single processor such that the final result is exactly the same as in the parallel version.

37 Update Functions and Scopes
[Diagram: the GraphLab model, made up of the Data Graph, the Shared Data Table, Update Functions and Scopes, and Scheduling.] To summarize, we have seen the data graph, which expresses the sparse data dependencies in your computation; the shared data table, which provides global information and global computation; and the scheduler and update functions, which together with the scope rules provide large-scale parallelism and consistency guarantees. Together, these make up GraphLab.

38 Experiments

39 Experiments Shared-memory implementation in C++ using Pthreads
Tested on a 16-processor machine (4x quad-core AMD Opteron 8384, 64 GB RAM). Algorithms implemented: belief propagation + parameter learning, Gibbs sampling, CoEM, lasso, compressed sensing, SVM, PageRank, tensor factorization. To demonstrate the GraphLab abstraction, we implemented a shared-memory version in C++ using pthreads and tested it on a 16-processor machine with 64 GB of RAM. We implemented a large number of algorithms, but only present results on the first four.

40 Graphical Model Learning
3D retinal image denoising. Data graph: 256x64x64 vertices. Update function: belief propagation. Sync: edge potentials; accumulate computes inference statistics, apply takes a gradient step. The first experiment demonstrates the complete GraphLab pipeline. 3D retina images are acquired by measuring the density at each voxel with a laser beam; the resulting images are quite noisy, and the aim is to use belief propagation to denoise them, with parameter learning used to estimate the edge potentials. The data graph is a pairwise Markov random field arranged as a 3D cube, over a million variables. The shared edge potentials are stored in the shared data table, and a sync operation performs the parameter learning: the accumulate phase gathers inference statistics over the graph, while the apply phase takes a gradient step.

41 Graphical Model Learning
[Plot: speedup vs. number of processors for the splash schedule and the approximate priority schedule, against the optimal line.] Speedup results from running the problem on 1 to 16 processors. We tested two schedulers on this problem, a priority-queue scheduler and a splash scheduler, and obtained nearly linear speedup of up to 15.5x on 16 CPUs. However, we can do better with GraphLab.

42 Graphical Model Learning
Standard parameter learning takes a gradient step only after inference is complete; with GraphLab we can take gradient steps while inference is running. [Chart: iterated (inference, then gradient step) takes 2100 sec; simultaneous (parallel inference + gradient step) takes 700 sec, 3x faster.] Because GraphLab allows sync operations to run in the background while update functions are running, we can easily experiment with simultaneous parameter learning and inference, taking gradient steps while the inference algorithm is still running. Doing so gives an additional 3x speedup over the typical algorithm.

43 Gibbs Sampling Two methods for sequential consistency: scopes and scheduling
Edge scope: graphlab(gibbs, edge, sweep); Graph coloring: graphlab(gibbs, vertex, colored); Our next problem is Gibbs sampling, where we demonstrate two different methods. Parallel Gibbs sampling requires the constraint that neighboring vertices are not sampled at the same time. This is naturally expressed using the edge consistency model, which means any scheduler can be used and the result is still a correct Gibbs sampler. A different way of achieving the same constraint is to color the graph so that adjacent vertices have different colors and use a scheduler that only schedules the colors in sequence; this allows the weakest vertex consistency model, since the scheduler provides the guarantee. A sketch of the colored schedule follows.
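The colored schedule can be sketched in a few lines of C++: greedily color the graph so adjacent vertices differ, then sweep color by color; within one color phase no two vertices share an edge, so they could all be sampled in parallel under vertex consistency alone. The greedy coloring and the chain graph are illustrative assumptions, not the scheduler's actual implementation.

```cpp
// Sketch of a graph-coloring ("colored") schedule for Gibbs sampling.
#include <algorithm>
#include <iostream>
#include <vector>

std::vector<int> greedy_coloring(const std::vector<std::vector<int>>& adj) {
    std::vector<int> color(adj.size(), -1);
    for (size_t v = 0; v < adj.size(); ++v) {
        std::vector<bool> used(adj.size(), false);
        for (int u : adj[v])
            if (color[u] >= 0) used[color[u]] = true;
        int c = 0;
        while (used[c]) ++c;       // smallest color unused by neighbors
        color[v] = c;
    }
    return color;
}

int main() {
    // Small chain graph: 0-1-2-3.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
    std::vector<int> color = greedy_coloring(adj);
    int num_colors = 1 + *std::max_element(color.begin(), color.end());

    // Colored schedule: one phase per color; within a phase no two
    // vertices are adjacent, so they could be sampled in parallel.
    for (int c = 0; c < num_colors; ++c) {
        std::cout << "phase " << c << ':';
        for (size_t v = 0; v < adj.size(); ++v)
            if (color[v] == c) std::cout << ' ' << v;   // sample vertex v here
        std::cout << '\n';
    }
    return 0;
}
```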

44 Gibbs Sampling Protein-protein interaction networks [Elidan et al. 2006]: pairwise MRF, 14K vertices, 100K edges. 10x speedup; scheduling reduces locking overhead. [Plot: speedup vs. processors for the colored schedule and the round-robin schedule, against the optimal line.] We demonstrate this on a protein-protein interaction network, a pairwise MRF. Using the round-robin scheduler with the edge consistency model provides a good 8x speedup on 16 processors, while the colored scheduler attains nearly a 10x speedup due to the decreased locking overhead.

45 CoEM (Rosie Jones, 2005) Hadoop 95 Cores 7.5 hrs
Named entity recognition task: is "dog" an animal? Is "Catalina" a place? Problem sizes: small (0.2M vertices, 20M edges), large (2M vertices, 200M edges). Hadoop baseline: 95 cores, 7.5 hrs. [Figure: bipartite graph with noun phrases ("the dog", "Australia", "Catalina Island") on the left and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant") on the right.] Our third experiment is CoEM, a named entity recognition task, which we use to test the scalability of our GraphLab implementation. The aim of CoEM is to classify noun phrases. The problem can be represented as a bipartite graph with noun phrases on the left and contexts on the right; an edge between a noun phrase and a context means the noun phrase was observed with that context in the corpus. The top edge, for instance, means the phrase "the dog ran quickly" was observed.

46 CoEM (Rosie Jones, 2005) Hadoop 95 Cores 7.5 hrs GraphLab 16 Cores
[Plot: speedup vs. processors on the small and large problems, against the optimal line.] Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. 15x faster with 6x fewer CPUs! On the small problem we achieve a respectable 12x speedup on 16 processors; on the large problem we achieve nearly perfect speedup, due to the large amount of work available. So we used 6x fewer CPUs to get 15x faster performance.

47 Lasso L1 regularized Linear Regression
Shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed. The lasso task is L1-regularized linear regression. The algorithm we use is the shooting algorithm, which is essentially coordinate descent. The problem can be represented as a bipartite graph in GraphLab by treating X as an adjacency matrix. Due to the properties of the shooting algorithm, the full consistency scope is necessary; the following example shows why.

48 Lasso L1 regularized Linear Regression
Shooting algorithm (coordinate descent); full consistency is needed. The shooting algorithm allows us to descend on the purple vertices simultaneously, since they do not interact through any of the Y's.

49 Lasso L1 regularized Linear Regression
Shooting algorithm (coordinate descent); full consistency is needed. But we cannot perform independent coordinate descent on β2 and β3 at the same time, since they interact through Y2 and Y3. The task is to predict the 12-month volatility of a stock from the company's annual reports (finance dataset from Kogan et al. [2009]); we derived a sparse dataset and a dense dataset from this data. With full consistency, as expected, performance is poor on the denser model. A sketch of the coordinate update follows.
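For reference, one sweep of the shooting algorithm looks like the self-contained C++ sketch below: each coordinate update soft-thresholds the correlation of its column with the partial residual. The tiny 3x2 design matrix and the regularization weight are made up for illustration; this is not the paper's GraphLab implementation.

```cpp
// Sketch of the shooting algorithm (coordinate descent for the lasso,
// min_b 0.5*||y - X b||^2 + lambda*||b||_1).
#include <iostream>
#include <vector>

double soft_threshold(double rho, double lambda) {
    if (rho >  lambda) return rho - lambda;
    if (rho < -lambda) return rho + lambda;
    return 0.0;
}

int main() {
    // Tiny dense example: 3 observations, 2 features (illustrative data).
    std::vector<std::vector<double>> X = {{1, 0.5}, {0.5, 1}, {1, 1}};
    std::vector<double> y = {1.0, 2.0, 3.0};
    std::vector<double> beta = {0.0, 0.0};
    double lambda = 0.1;

    for (int sweep = 0; sweep < 100; ++sweep) {
        for (size_t j = 0; j < beta.size(); ++j) {
            // rho_j = sum_i x_ij * (y_i - yhat_i + x_ij * beta_j)
            double rho = 0.0, norm = 0.0;
            for (size_t i = 0; i < y.size(); ++i) {
                double yhat = 0.0;
                for (size_t k = 0; k < beta.size(); ++k) yhat += X[i][k] * beta[k];
                rho  += X[i][j] * (y[i] - yhat + X[i][j] * beta[j]);
                norm += X[i][j] * X[i][j];
            }
            beta[j] = soft_threshold(rho, lambda) / norm;
        }
    }
    std::cout << "beta = " << beta[0] << ", " << beta[1] << '\n';
    return 0;
}
```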

50 Full Consistency [Plot: speedup vs. processors on the sparse and dense datasets, against the optimal line.] Now for some results.
With the full consistency scope we obtain a poor 2.1x speedup on 16 processors on the dense dataset, but do better on the sparser dataset, attaining a 4x speedup. This is expected, as the full consistency model has far fewer opportunities for parallelism on the denser model.

51 Why does this work? (Open Question)
Relaxing consistency. [Plot: speedup vs. processors on the dense and sparse datasets, against the optimal line.] Here is an interesting observation we made: what if we relax the consistency to the weakest vertex consistency model? To our surprise, it still converges to the same loss, and with a significant increase in performance, attaining about a 9x speedup on 16 processors on the dense dataset and a 5x speedup on the sparse dataset (compared with the curves for the full consistency model). Why the system still converges in this setting remains an open question.

52 GraphLab An abstraction tailored to Machine Learning
GraphLab provides a parallel framework that compactly expresses data and computational dependencies as well as iterative computation, and it achieves state-of-the-art parallel performance on a variety of problems. It is an abstraction tailored specifically to the needs of a good number of ML algorithms, and most importantly, it is easy to use.

53 Future Work Distributed GraphLab GPU GraphLab
Distributed GraphLab: load balancing, minimizing communication, latency hiding, distributed data consistency, distributed scalability. GPU GraphLab: memory bus bottleneck, warp alignment. State-of-the-art performance for <Your Algorithm Here>. Future work includes distributed GraphLab as well as GPU GraphLab; there are a large number of complex problems to solve along the way. The hope is that if you use GraphLab, we will be able to scale your algorithm from the shared-memory setting to the distributed setting with minimal additional work.

54 Parallel GraphLab 1.0 Available Today
Documentation… Code… Tutorials… We are open sourcing our reference implementation of Parallel GraphLab today.

