
Distributed Cluster Computing Platforms


1 Distributed Cluster Computing Platforms

2 Outline
What is the purpose of Data Intensive Super Computing?
MapReduce
Pregel
Dryad
Spark/Shark
Distributed Graph Computing

3 Why DISC
DISC stands for Data Intensive Super Computing. There are a lot of applications: scientific data, web search engines, social networks, economics, GIS. New data are continuously generated, and people want to understand those data. Big-data analysis is now considered a very important method for scientific research.

4 What are the required features for the platform to handle DISC?
Application specific: it is very difficult or even impossible to construct one system that fits them all (one example is the POSIX-compatible file system). Each system should be re-configured or even re-designed for a specific application; think about the motivation for building the Google File System for the Google search engine.
Programmer-friendly interfaces: the application programmer should not have to consider how to handle the infrastructure, such as machines and networks.
Fault tolerant: the platform should handle faulty components automatically, without any special treatment from the application.
Scalability: the platform should run on top of at least thousands of machines, harnessing the power of all the components. Load balance should be achieved by the platform instead of the application itself.
Try to keep these four features in mind during the introduction of the concrete platforms below.

5 Google MapReduce
Programming Model, Implementation, Refinements, Evaluation, Conclusion

6 Motivation: large scale data processing
Process lots of data to produce other derived data. Input: crawled documents, web request logs, etc. Output: inverted indices, web page graph structure, top queries in a day, etc. We want to use hundreds or thousands of CPUs but to focus only on the functionality. MapReduce hides the messy details in a library: parallelization, data distribution, fault tolerance, load balancing.

7 Motivation: Large Scale Data Processing
Want to process lots of data (> 1 TB). Want to parallelize across hundreds or thousands of CPUs. Want to make this easy. "Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."

8 MapReduce
Automatic parallelization & distribution. Fault-tolerant. Provides status and monitoring tools. Clean abstraction for programmers.

9 Programming Model
Borrows from functional programming. Users implement an interface of two functions:
map (in_key, in_value) -> (out_key, intermediate_value) list
reduce (out_key, intermediate_value list) -> out_value list

10 map Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line). map() produces one or more intermediate values along with an output key from the input.

11 reduce After the map phase is over, all the intermediate values for a given output key are combined together into a list. reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

12 Architecture

13 Parallelism map() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently Bottleneck: reduce phase can’t start until map phase is completely finished.

14 Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
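The pseudocode above can be mirrored by a tiny, self-contained simulation. The sketch below is plain Python rather than Google's C++ library; the shuffle that the framework performs between the two phases is modeled with an in-memory dictionary, and all names are invented for illustration.

from collections import defaultdict

def map_wordcount(doc_name, doc_contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in doc_contents.split():
        yield word, 1

def reduce_wordcount(word, counts):
    # Sum all partial counts for one word.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

pages = [("Page 1", "the weather is good"),
         ("Page 2", "today is good"),
         ("Page 3", "good weather is good")]
print(run_mapreduce(pages, map_wordcount, reduce_wordcount))
# e.g. [('the', 1), ('weather', 2), ('is', 3), ('good', 4), ('today', 1)]

The three pages are the ones used in the example slides that follow, so the printed counts match the reduce output shown there.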

15 Example vs. Actual Source Code
The example is written in pseudo-code; the actual implementation is in C++, using a MapReduce library. Bindings for Python and Java exist via interfaces. True code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.).

16 Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good

17 Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1)
Worker 2: (today 1), (is 1), (good 1)
Worker 3: (good 1), (weather 1), (is 1), (good 1)

18 Reduce Input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)

19 Reduce Output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)

20 Some Other Real Examples
Term frequencies through the whole Web repository Count of URL access frequency Reverse web-link graph
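As a hedged illustration of the last example, a reverse web-link graph can be computed with a map that inverts each (source, target) edge and a reduce that collects the sources of each target; the helper names below are made up for the sketch.

from collections import defaultdict

def map_links(source_url, outgoing_links):
    # For each link source -> target, emit (target, source).
    for target in outgoing_links:
        yield target, source_url

def reduce_links(target_url, sources):
    # All pages linking to target_url end up in one reduce call.
    return target_url, sorted(set(sources))

def reverse_link_graph(pages):
    groups = defaultdict(list)              # stands in for the shuffle phase
    for url, links in pages.items():
        for k, v in map_links(url, links):
            groups[k].append(v)
    return dict(reduce_links(k, vs) for k, vs in groups.items())

pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
print(reverse_link_graph(pages))            # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}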

21 Implementation Overview
Typical cluster: 100s/1000s of 2-CPU x86 machines with 2-4 GB of memory; limited bisection bandwidth; storage on local IDE disks. GFS: a distributed file system manages the data (SOSP'03). Job scheduling system: jobs are made up of tasks, and a scheduler assigns tasks to machines. The implementation is a C++ library linked into user programs.

22 Architecture

23 Execution

24 Parallel Execution

25 Task Granularity And Pipelining
Fine-granularity tasks: many more map tasks than machines. This minimizes the time for fault recovery, lets shuffling be pipelined with map execution, and gives better dynamic load balancing. A job often uses 200,000 map tasks and 5,000 reduce tasks over a few thousand machines.

26 Locality The master program divvies up tasks based on the location of the data: it asks GFS for the locations of the replicas of the input file blocks and tries to place map() tasks on the same machine as the physical file data, or at least on the same rack. map() task inputs are divided into 64 MB blocks, the same size as Google File System chunks. Effect: thousands of machines read input at local disk speed; without this, rack switches would limit the read rate.

27 Fault Tolerance
The master detects worker failures and re-executes completed & in-progress map() tasks and in-progress reduce() tasks. The master also notices when particular input key/values cause crashes in map(), and skips those values on re-execution. Effect: can work around bugs in third-party libraries!

28 Fault Tolerance
On worker failure: detect the failure via periodic heartbeats; re-execute completed and in-progress map tasks; re-execute in-progress reduce tasks; task completion is committed through the master. Master failure: could be handled, but isn't yet (master failure is unlikely). Robust: once lost 1,600 of 1,800 machines, but finished fine.

29 Optimizations
No reduce can start until the map phase is complete, so a single slow disk controller can rate-limit the whole process. The master redundantly executes "slow-moving" map tasks and uses the results of whichever copy finishes first. Slow workers significantly lengthen completion time: other jobs consuming resources on the machine, bad disks with soft errors that transfer data very slowly, and weird things such as processor caches disabled (!!). Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?

30 Optimizations “Combiner” functions can run on same machine as a mapper
Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth. Under what conditions is it sound to use a combiner?
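To make the bandwidth saving concrete, here is a hedged Python sketch of a combiner for the word-count job: it pre-sums the counts produced by one map task before anything crosses the network. A common rule of thumb is that a combiner is sound when the reduce operation is commutative and associative (as addition is), so partial results can be merged in any order.

from collections import Counter

def map_task(document):
    # Raw map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in document.split()]

def combine(map_output):
    # Mini-reduce on the mapper's machine: sum the counts per word locally.
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())

doc = "good weather is good"
print(len(map_task(doc)), "pairs before combining")          # 4 pairs
print(len(combine(map_task(doc))), "pairs after combining")  # 3 pairs cross the network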

31 Refinements
Sorting guarantees within each reduce partition, compression of intermediate data, a combiner (useful for saving network bandwidth), local execution for debugging/testing, and user-defined counters.

32 Performance
Tests were run on a cluster of 1,800 machines: 4 GB of memory, dual-processor 2 GHz Xeons with Hyperthreading, dual 160 GB IDE disks, gigabit Ethernet per machine, and bisection bandwidth of approximately 100 Gbps. Two benchmarks: MR_Grep scans ~1 TB of 100-byte records to extract records matching a rare pattern (92K matching records); MR_Sort sorts ~1 TB of 100-byte records (modeled after the TeraSort benchmark).

33 MR_Grep Locality optimization helps:
1800 machines read 1 TB of data at peak of ~31 GB/s Without this, rack switches would limit to 10 GB/s Startup overhead is significant for short jobs

34 MR_Sort
Backup tasks reduce job completion time significantly, and the system deals well with failures. (Figure: completion times for three configurations: normal, no backup tasks, and 200 processes killed.)

35 More and more MapReduce
MapReduce programs in the Google source tree. Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation.

36 Real MapReduce: Rewrite of Production Indexing System
Rewrote Google's production indexing system using MapReduce, as a set of MapReduce operations (10, 14, 17, 21, and 24 operations across successive versions). The new code is simpler and easier to understand; MapReduce takes care of failures and slow machines, and it is easy to make indexing faster by adding more machines.

37 MapReduce Conclusions
MapReduce has proven to be a useful abstraction Greatly simplifies large-scale computations at Google Functional programming paradigm can be applied to large-scale applications Fun to use: focus on problem, let library deal w/ messy details

38 MapReduce Programs Sorting Searching Indexing Classification TF-IDF
Breadth-First Search / SSSP PageRank Clustering

39 MapReduce for PageRank

40 PageRank: Random Walks Over The Web
If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page? The PageRank of a page captures this notion More “popular” or “worthwhile” pages get a higher rank

41 PageRank: Visually

42 PageRank: Formula
Given page A, and pages T1 through Tn linking to A, PageRank is defined as:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
C(P) is the cardinality (out-degree) of page P; d is the damping ("random URL") factor.

43 PageRank: Intuition
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
The calculation is iterative: PR_{i+1} is based on PR_i. Each page distributes its PR_i to all pages it links to; linkees add up their awarded rank fragments to find their PR_{i+1}. d is a tunable parameter (usually 0.85) encapsulating the "random jump factor".

44 PageRank: First Implementation
Create two tables, 'current' and 'next', holding the PageRank for each page. Seed 'current' with initial PR values. Iterate over all pages in the graph, distributing PR from 'current' into 'next' of the linkees. Then current := next; next := fresh_table(). Go back to the iteration step, or end if converged.
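A minimal single-machine Python sketch of this two-table loop, assuming the whole graph fits in memory and using a fixed number of iterations instead of a convergence test; the names (pagerank, out_links) are invented for illustration.

def pagerank(out_links, d=0.85, iterations=20):
    # out_links: {page: [pages it links to]}
    pages = list(out_links)
    current = {p: 1.0 for p in pages}           # seed 'current' with initial PR values
    for _ in range(iterations):
        next_pr = {p: 1 - d for p in pages}     # fresh 'next' table, random-jump term included
        for page, targets in out_links.items():
            if targets:                         # distribute d * PR(page)/C(page) to each linkee
                share = d * current[page] / len(targets)
                for t in targets:
                    next_pr[t] += share
        current = next_pr                       # current := next
    return current

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))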

45 Distribution of the Algorithm
Key insights allowing parallelization: the 'next' table depends on 'current', but not on any other rows of 'next'; individual rows of the adjacency matrix can be processed in parallel; sparse matrix rows are relatively small.

46 Distribution of the Algorithm
Consequences of these insights: we can map each row of 'current' to a list of PageRank "fragments" to assign to linkees. These fragments can be reduced into a single PageRank value for a page by summing. The graph representation can be even more compact: since each element is simply 0 or 1, only transmit the column numbers where it is 1.

47

48 Phase 1: Parse HTML
The map task takes (URL, page content) pairs and maps them to (URL, (PRinit, list-of-urls)), where PRinit is the "seed" PageRank for URL and list-of-urls contains all pages pointed to by URL. The reduce task is just the identity function.

49 Phase 2: PageRank Distribution
The map task takes (URL, (cur_rank, url_list)). For each u in url_list, it emits (u, cur_rank/|url_list|); it also emits (URL, url_list) to carry the points-to list along through iterations.
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

50 Phase 2: PageRank Distribution
The reduce task gets (URL, url_list) and many (URL, val) values. It sums the vals and fixes up the total with d, then emits (URL, (new_rank, url_list)).
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
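A hedged Python sketch of one Phase 2 iteration, following the map and reduce descriptions above; the shuffle is again simulated with a dictionary, d = 0.85, and the helper names are invented for the sketch.

from collections import defaultdict

D = 0.85                                      # damping factor

def phase2_map(url, record):
    cur_rank, url_list = record
    for u in url_list:                        # emit a rank fragment to each linkee
        yield u, ("rank", cur_rank / len(url_list))
    yield url, ("links", url_list)            # carry the points-to list along

def phase2_reduce(url, values):
    url_list, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            url_list = payload
        else:
            total += payload
    new_rank = (1 - D) + D * total            # PR(A) = (1-d) + d * (sum of fragments)
    return url, (new_rank, url_list)

def one_iteration(table):
    groups = defaultdict(list)                # stands in for the shuffle phase
    for url, record in table.items():
        for k, v in phase2_map(url, record):
            groups[k].append(v)
    return dict(phase2_reduce(k, vs) for k, vs in groups.items())

table = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
print(one_iteration(table))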

51 Finishing up... A non-parallelizable component determines whether convergence has been achieved (Fixed number of iterations? Comparison of key values?) If so, write out the PageRank lists - done! Otherwise, feed output of Phase 2 into another Phase 2 iteration

52 PageRank Conclusions MapReduce isn't the greatest at iterated computation, but still helps run the “heavy lifting” Key element in parallelization is independent PageRank computations in a given step Parallelization requires thinking about minimum data partitions to transmit (e.g., compact representations of graph rows) Even the implementation shown today doesn't actually scale to the whole Internet; but it works for intermediate-sized graphs So, do you think that MapReduce is suitable for PageRank? (homework, give concrete reason for why and why not.)

53 Dryad: Design, Implementation, Policies as Plug-ins, Building on Dryad

54 Design Space
(Figure: a design space with latency vs. throughput on one axis and Internet vs. private data center on the other; regions include grid computing, search, transaction processing, HPC, shared memory, and data-parallel systems.) Dryad is optimized for throughput-oriented, data-parallel computation in a private data center.

55 Data Partitioning
A common scenario: too much data to process (or to fit in one machine's RAM). Instead of trying to be clever, just partition the data across more machines and use a brute-force algorithm.

56 2-D Piping
Unix pipes are 1-D: grep | sed | sort | awk | perl. Dryad is 2-D: it generalizes the Unix piping mechanism from uni-dimensional (chain) pipelines to two-dimensional pipelines. The unit is still a process connected by point-to-point channels, but the processes are replicated.

57 Dryad = Execution Layer
Analogy: job is to Dryad is to cluster as pipeline is to shell is to machine. In the same way that the Unix shell does not understand the pipeline running on top of it but manages its execution (e.g., killing processes when one exits), Dryad does not understand the job running on top of it.

58 Outline Dryad Design Implementation Policies as Plug-ins
Building on Dryad

59 Virtualized 2-D Pipelines
This is a possible schedule of a Dryad job using 2 machines.

60 Virtualized 2-D Pipelines

61 Virtualized 2-D Pipelines

62 Virtualized 2-D Pipelines

63 Virtualized 2-D Pipelines
The Unix pipeline is generalized in three ways: the pipeline is a 2-D DAG instead of a 1-D chain; it spans multiple machines; and resources are virtualized, so you can run the same large job on many or few machines.

64 Dryad Job Structure
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
This is the basic Dryad terminology: the job graph consists of vertices (processes) grouped into stages (e.g., grep, sed, sort, awk, perl), connected by channels, with input files feeding the first stage and output files produced by the last.

65 Channels
Channels are finite streams of items. Possible transports: distributed filesystem files (persistent), SMB/NTFS files (temporary), TCP pipes (inter-machine), and memory FIFOs (intra-machine). Channels are very abstract, enabling a variety of transport mechanisms; the performance and fault-tolerance of these mechanisms vary widely.

66 Architecture
(Figure: the job manager (JM) connects to the cluster through NS and PD services; the data plane carries files, TCP streams, and FIFOs between vertices (V), while the control plane carries the job schedule.) The brain of a Dryad job is a centralized Job Manager, which maintains the complete state of the job. The JM controls the processes running on the cluster, but never exchanges data with them: the data plane is completely separated from the control plane.

67 Staging
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution

68 Fault Tolerance Vertex failures and channel failures are handled differently.

69 Outline Dryad Design Implementation Policies and Resource Management
Building on Dryad

70 Policy Managers
Each stage has a "stage manager", and each inter-stage set of edges has a "connection manager" (e.g., an R manager for stage R and an R-X manager for the connection from stage R to stage X). The managers get upcalls for all important events in the corresponding vertices and can make policy decisions. The user can replace the managers, and the managers can even rewrite the graph at run time.

71 Duplicate Execution Manager
The handling of apparently very slow computations is done by a stage manager, which duplicates a slow vertex once other vertices have completed and uses whichever copy finishes first. Duplication policy = f(running times, data volumes).

72 Aggregation Manager
Aggregating data with associative operators can be done in a bandwidth-preserving fashion if the intermediate aggregations are placed close to the source data, for example with one aggregator per rack; the aggregation tree can be chosen statically or built dynamically.

73 Data Distribution (Group By)
m source vertices are connected to n destination vertices (m x n edges). Redistributing data is an important step for load balancing or when changing keys.

74 Range-Distribution Manager
Using a connection manager, one can load-balance the data distribution at run time based on data statistics obtained by sampling the data stream (e.g., a histogram splitting the key range [0-100) into [0-30) and [30-100)). In this case the number of destination vertices and the key range for each vertex are decided dynamically.

75 Goal: Declarative Programming
We evolve towards a programming model in which resources are always allocated dynamically, based on demand.

76 Outline Dryad Design Implementation Policies as Plug-ins
Building on Dryad

77 Software Stack
(Figure: the Dryad software stack, including Windows Server machines and cluster services, a distributed filesystem (CIFS/NTFS), Dryad itself, and higher layers such as DryadLINQ, a distributed shell, PSQL, SSIS, SQL Server, job queueing and monitoring, and applications in C#, C++, Perl, sed/awk/grep, machine learning, and legacy code.) There is a rich software ecosystem built around Dryad. In this talk I will focus on a few of the layers developed at Microsoft Research SVC.

78 SkyServer Query 18
select distinct P.ObjID into results
from photoPrimary U, neighbors N, photoPrimary L
where U.ObjID = N.ObjID
  and L.ObjID = N.NeighborObjID
  and P.ObjID < L.ObjID
  and abs((U.u-U.g)-(L.u-L.g))<0.05
  and abs((U.g-U.r)-(L.g-L.r))<0.05
  and abs((U.r-U.i)-(L.r-L.i))<0.05
  and abs((U.i-U.z)-(L.i-L.z))<0.05
This is a SQL query, manually translated to Dryad C++ code.

79 SkyServer Q18 Performance
(Figure: speed-up relative to SQL Server 2005 on the y-axis versus number of computers (2-10) on the x-axis, for Dryad in-memory and Dryad two-pass configurations.) This is the performance of the query running on a small cluster; the y-axis shows how many times the computation is faster than SQL Server 2005. Once the whole dataset fits in the collective memory of the cluster (6+ machines), the computation can be sped up dramatically by changing the channel transport mechanism (a one-line code change).

80 DryadLINQ
DryadLINQ adds a wealth of features on top of plain Dryad: declarative programming, integration with Visual Studio and .NET, type safety, automatic serialization, job graph optimizations (both static and dynamic), and conciseness.

81 LINQ
Collection<T> collection;
bool IsLegal(Key);
string Hash(Key);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };

Language Integrated Query (LINQ) is an extension of C# which allows one to write declarative computations on collections (the query expression above).

82 DryadLINQ = LINQ + Dryad
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };

DryadLINQ translates LINQ programs into Dryad computations: C# and LINQ data objects become distributed partitioned files, LINQ queries become distributed Dryad jobs (the query plan), and C# methods become code running on the vertices of a Dryad job.

83 Sort & Map-Reduce in DryadLINQ
(Figure: query plans for distributed sorting and for map-reduce; the sort plan samples the data to choose key ranges such as [0-30) and [30-100), range-partitions the input, and sorts each partition.) This is how one can implement distributed sorting and map-reduce in DryadLINQ.

84 PLINQ
At the bottom, DryadLINQ uses PLINQ to run the computation in parallel on multiple cores.
public static IEnumerable<TSource>
DryadSort<TSource, TKey>(IEnumerable<TSource> source,
                         Func<TSource, TKey> keySelector,
                         IComparer<TKey> comparer,
                         bool isDescending)
{
    return source.AsParallel().OrderBy(keySelector, comparer);
}

85 Machine Learning in DryadLINQ
(Figure: a stack with Dryad at the bottom, DryadLINQ above it, then a large-vector library, and machine learning / data analysis applications on top.) I will now focus on a library for machine-learning algorithms we have built on top of DryadLINQ.

86 Very Large Vector Library
The main data structure is PartitionedVector<T>, a vector whose elements of type T are partitioned among many machines; there is also a Scalar<T> type. The partitioning is hidden from the user. The vector API has only four operations.

87 Operations on Large Vectors: Map 1
One can apply an arbitrary side-effect-free C# function f to all objects in a vector, producing a new vector (T -> U); f preserves the partitioning.

88 Map 2 (Pairwise)
Or one can apply f element-wise to a pair of vectors ((T, U) -> V).

89 Map 3 (Vector-Scalar)
Or one can use a vector and a scalar, replicating the scalar for each element of the vector.

90 Reduce (Fold)
Finally, one can fold a vector down to a scalar by repeatedly applying f.

91 Linear Algebra
Having vectors whose elements are vectors or matrices builds up to a nice linear algebra library.

92 Linear Regression
Data: pairs of vectors (x_t, y_t). Find a matrix A such that A x_t ≈ y_t. We will show how to compute the linear regression parameters.

93 Analytic Solution
A = (Σ_t y_t x_t^T) (Σ_t x_t x_t^T)^{-1}
This expression uses a query plan composed of two (pairwise) maps (computing y_t x_t^T and x_t x_t^T for each data point) and two reduces (the two sums), followed by a matrix inversion and a multiplication.
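To make the formula concrete, here is a small hedged check in Python with NumPy (NumPy, the synthetic data, and the variable names are assumptions made purely for illustration, not part of DryadLINQ): it builds A = (Σ_t y_t x_t^T)(Σ_t x_t x_t^T)^{-1} exactly as the map/reduce plan would, and compares it with data generated from a known matrix.

import numpy as np

# Hypothetical ground-truth matrix and synthetic data, purely for the check.
rng = np.random.default_rng(0)
A_true = np.array([[2.0, -1.0],
                   [0.5,  3.0]])
xs = [rng.normal(size=2) for _ in range(1000)]    # data points x_t
ys = [A_true @ x for x in xs]                     # y_t = A x_t (noise-free, for a clean check)

# "Map": per-point outer products; "Reduce": sum them up.
xx = sum(np.outer(x, x) for x in xs)              # sum_t x_t x_t^T
yx = sum(np.outer(y, x) for y, x in zip(ys, xs))  # sum_t y_t x_t^T

A_est = yx @ np.linalg.inv(xx)                    # A = (Y X^T)(X X^T)^{-1}
print(np.allclose(A_est, A_true))                 # True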

94 Linear Regression Code
Vectors x = input(0), y = input(1);
Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
The complete source code for linear regression has 6 lines of code.

95 Expectation Maximization (Gaussians)
160 lines of code; 3 iterations are shown in the figure. More complicated, even iterative, algorithms can be implemented.

96 Conclusions
Dryad = a distributed execution environment. It is application-independent (semantics oblivious) and supports a rich software ecosystem: relational algebra, map-reduce, LINQ, etc. DryadLINQ = a Dryad provider for LINQ. This is only the beginning! We believe that Dryad and DryadLINQ are a great foundation for cluster computing.

97 Some other systems you should know about for BigData processing
Hadoop: HDFS and MapReduce (the open-source counterparts of GFS and Google MapReduce). Hive/Pig/Sawzall: query-language processing. Spark/Shark: efficient use of cluster memory, supporting iterative MapReduce programs.

98 Thank you! Any Questions?

99 Pregel as backup slides

100 Pregel Introduction Computation Model Writing a Pregel Program
System Implementation Experiments Conclusion

101 Introduction (1/2) Source: SIGMETRICS ’09 Tutorial – MapReduce: The Programming Model and Practice, by Jerry Zhao

102 Introduction (2/2)
Many practical computing problems concern large graphs. MapReduce is ill-suited for graph processing: many iterations are needed for parallel graph processing, and materializations of intermediate results at every MapReduce iteration harm performance.
Large graph data: web graphs, transportation routes, citation relationships, social networks.
Graph algorithms: PageRank, shortest paths, connected components, clustering techniques.

103 Single Source Shortest Path (SSSP)
Problem: find the shortest path from a source node to all target nodes. Solution on a single-processor machine: Dijkstra's algorithm (a sketch follows).
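A minimal Python sketch of Dijkstra's algorithm with a binary heap; the graph encoding is an assumption made for illustration, and the sample graph matches the adjacency list given in the later MapReduce example.

import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, edge_weight), ...]}
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                             # stale queue entry
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

graph = {"A": [("B", 10), ("D", 5)],
         "B": [("C", 1), ("D", 2)],
         "C": [("E", 4)],
         "D": [("B", 3), ("C", 9), ("E", 2)],
         "E": [("A", 7), ("C", 6)]}
print(dijkstra(graph, "A"))   # distances from A: A=0, B=8, C=9, D=5, E=7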

104-109 Example: SSSP – Dijkstra's Algorithm
(Figures: an animation of Dijkstra's algorithm on the example graph, settling one node per step; the intermediate slides show the tentative distances shrinking until the final shortest-path distances are reached.)

110 Single Source Shortest Path (SSSP)
Problem: find the shortest path from a source node to all target nodes. Solution on a single-processor machine: Dijkstra's algorithm. On MapReduce/Pregel: parallel breadth-first search (BFS).

111 MapReduce Execution Overview

112 Example: SSSP – Parallel BFS in MapReduce
The example graph (adjacency matrix shown in the figure) has the adjacency list:
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)

113 Example: SSSP – Parallel BFS in MapReduce
Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Map output: <dest node ID, dist>
<B, 10> <D, 5> <C, inf> <D, inf> <E, inf> <B, inf> <C, inf> <E, inf> <A, inf> <C, inf>
plus the original <node ID, <dist, adj list>> records, which are passed along; the map output is flushed to local disk.

114 Example: SSSP – Parallel BFS in MapReduce
Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>> <A, inf>
<B, <inf, <(C, 1), (D, 2)>>> <B, 10> <B, inf>
<C, <inf, <(E, 4)>>> <C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, inf>


116 Example: SSSP – Parallel BFS in MapReduce
Reduce output: <node ID, <dist, adj list>> = map input for the next iteration (flushed to the DFS)
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Map output for the next iteration: <dest node ID, dist>
<B, 10> <D, 5> <C, 11> <D, 12> <E, inf> <B, 8> <C, 14> <E, 7> <A, inf> <C, inf>
plus the node records, flushed to local disk.
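A hedged Python sketch of one such iteration, using the <node ID, <dist, adj list>> record format shown above; the shuffle is simulated with a dictionary, INF stands for "not yet reached", and all helper names are invented for the sketch.

from collections import defaultdict

INF = float("inf")

def bfs_map(node, record):
    dist, adj = record
    yield node, ("node", record)                 # pass the node structure through
    for neighbor, weight in adj:
        yield neighbor, ("dist", dist + weight)  # candidate distance via this node

def bfs_reduce(node, values):
    best, adj = INF, []
    for kind, payload in values:
        if kind == "node":
            best = min(best, payload[0])         # keep the previously known distance
            adj = payload[1]
        else:
            best = min(best, payload)            # or a shorter candidate
    return node, (best, adj)

def bfs_iteration(table):
    groups = defaultdict(list)                   # stands in for the shuffle phase
    for node, record in table.items():
        for k, v in bfs_map(node, record):
            groups[k].append(v)
    return dict(bfs_reduce(k, vs) for k, vs in groups.items())

table = {"A": (0,   [("B", 10), ("D", 5)]),
         "B": (INF, [("C", 1), ("D", 2)]),
         "C": (INF, [("E", 4)]),
         "D": (INF, [("B", 3), ("C", 9), ("E", 2)]),
         "E": (INF, [("A", 7), ("C", 6)])}
table = bfs_iteration(table)   # after iteration 1: B=10, D=5; C and E still INF
table = bfs_iteration(table)   # after iteration 2: B=8, C=11, D=5, E=7
print(table)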

117 Example: SSSP – Parallel BFS in MapReduce
Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>> <A, inf>
<B, <10, <(C, 1), (D, 2)>>> <B, 10> <B, 8>
<C, <inf, <(E, 4)>>> <C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, 12>
<E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, 7>


119 Example: SSSP – Parallel BFS in MapReduce
Reduce output: <node ID, <dist, adj list>> = map input for the next iteration (flushed to the DFS)
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
… the rest omitted …

120 Computation Model (1/3)
Input -> supersteps (a sequence of iterations) -> output.

121 Computation Model (2/3) “Think like a vertex”
Inspired by Valiant's Bulk Synchronous Parallel (BSP) model (1990).

122 Computation Model (3/3) Superstep: the vertices compute in parallel
Each vertex receives the messages sent to it in the previous superstep, executes the same user-defined function, modifies its value or that of its outgoing edges, sends messages to other vertices (to be received in the next superstep), may mutate the topology of the graph, and votes to halt if it has no further work to do. Termination condition: all vertices are simultaneously inactive and there are no messages in transit.

123-131 Example: SSSP – Parallel BFS in Pregel
(Figures: an animation of parallel BFS on the example graph across successive supersteps. In each superstep, active vertices send candidate distances along their out-edges; vertices whose distance improves update their value and propagate again, and the computation stops when no messages remain, leaving the same final distances as Dijkstra's algorithm.)

132 Differences from MapReduce
Graph algorithms can be written as a series of chained MapReduce invocations, but the two systems differ. Pregel keeps vertices and edges on the machine that performs the computation and uses network transfers only for messages. MapReduce passes the entire state of the graph from one stage to the next and needs to coordinate the steps of a chained MapReduce.

133 C++ API Writing a Pregel program
A Pregel program is written by subclassing the predefined Vertex class and overriding its compute method, which receives the incoming messages and can send outgoing messages.

134 Example: Vertex Class for SSSP
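The original example is a short C++ subclass of Vertex; since that code figure is not reproduced here, below is a hedged, single-process Python simulation of the same idea (class and function names are invented, and the synchronous loop stands in for Pregel's distributed supersteps): each vertex keeps its current best distance, consumes the messages delivered from the previous superstep, sends new candidate distances along its out-edges when its value improves, and votes to halt otherwise.

from collections import defaultdict

INF = float("inf")

class ShortestPathVertex:
    def __init__(self, vid, out_edges, is_source):
        self.id = vid
        self.out_edges = out_edges              # [(target_id, edge_weight), ...]
        self.is_source = is_source
        self.value = INF                        # best-known distance so far
        self.active = True

    def compute(self, messages, send):
        # One superstep for this vertex; 'messages' were sent in the previous superstep.
        mindist = 0 if self.is_source else INF
        for m in messages:
            mindist = min(mindist, m)
        if mindist < self.value:
            self.value = mindist
            for target, weight in self.out_edges:
                send(target, mindist + weight)  # delivered in the next superstep
        self.active = False                     # vote to halt; a new message reactivates it

def run(vertices):
    inbox = {}
    while any(v.active for v in vertices.values()) or inbox:
        outbox = defaultdict(list)
        send = lambda target, msg: outbox[target].append(msg)
        for v in vertices.values():
            msgs = inbox.pop(v.id, [])
            if v.active or msgs:                # inactive vertices with no messages are skipped
                v.compute(msgs, send)
        inbox = dict(outbox)                    # superstep barrier: deliver all messages
    return {v.id: v.value for v in vertices.values()}

graph = {"A": ([("B", 10), ("D", 5)], True),
         "B": ([("C", 1), ("D", 2)], False),
         "C": ([("E", 4)], False),
         "D": ([("B", 3), ("C", 9), (("E"), 2)], False),
         "E": ([("A", 7), ("C", 6)], False)}
vertices = {vid: ShortestPathVertex(vid, edges, src) for vid, (edges, src) in graph.items()}
print(run(vertices))   # distances from A: A=0, B=8, C=9, D=5, E=7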

135 System Architecture Pregel system also uses the master/worker model
The master maintains the workers, recovers from worker faults, and provides a web-UI monitoring tool for job progress. Each worker processes its task and communicates with the other workers. Persistent data is stored as files on a distributed storage system (such as GFS or BigTable); temporary data is stored on local disk.

136 Execution of a Pregel Program
Many copies of the program begin executing on a cluster of machines. The master assigns a partition of the input to each worker; each worker loads its vertices and marks them as active. The master then instructs each worker to perform a superstep: each worker loops through its active vertices and computes for each vertex; messages are sent asynchronously but are delivered before the end of the superstep. This step is repeated as long as any vertices are active or any messages are in transit. After the computation halts, the master may instruct each worker to save its portion of the graph.

137 Fault Tolerance
Checkpointing: the master periodically instructs the workers to save the state of their partitions (e.g., vertex values, edge values, incoming messages) to persistent storage.
Failure detection: using regular "ping" messages.
Recovery: the master reassigns graph partitions to the currently available workers, and the workers all reload their partition state from the most recent available checkpoint.

138 Experiments
Environment: H/W, a cluster of 300 multicore commodity PCs; data, binary trees and log-normal random graphs (general graphs).
Naïve SSSP implementation: all edge weights are 1; no checkpointing.

139 Experiments SSSP – 1 billion vertex binary tree: varying # of worker tasks

140 Experiments SSSP – binary trees: varying graph sizes on 800 worker tasks

141 Experiments SSSP – Random graphs: varying graph sizes on 800 worker tasks

142 Conclusion & Future Work
Pregel is a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms. Future work: relaxing the synchronicity of the model, so as not to wait for slower workers at inter-superstep barriers; assigning vertices to machines to minimize inter-machine communication; handling dense graphs in which most vertices send messages to most other vertices.

