
1 Jimmy Lin and Michael Schatz, Design Patterns for Efficient Graph Algorithms in MapReduce. Presented by Michele Iovino, Facoltà di Ingegneria dell’Informazione, Informatica e Statistica, Dipartimento di Informatica, Algoritmi Avanzati.

2 Overview: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

3 Table of contents: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

4 Large Graphs
Large graphs are ubiquitous in today's information-based society:
- Ranking search results
- Analysis of social networks
- Module detection in protein-protein interaction networks
- Graph-based approaches for DNA
They are difficult to analyze because of their sheer size.

5 Purpose
MapReduce has inefficiencies when applied to graph processing. The paper presents a set of enhanced design patterns, applicable to a large class of graph algorithms, that address many of these deficiencies.

6 Table of contents: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

7 MapReduce
- Combiner: similar to a reducer, except that it operates directly on the output of the mappers
- Partitioner: responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers
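As a concrete reference point, a custom partitioner in Hadoop's Java API only has to override getPartition. The sketch below mimics the behavior of Hadoop's default hash partitioner (hash the key, mask the sign bit, take the modulus over the number of reducers); the key and value types are arbitrary choices for this illustration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a partitioner that behaves like Hadoop's default HashPartitioner:
// every intermediate key is mapped to one of numPartitions reducers.
public class HashLikePartitioner extends Partitioner<IntWritable, Text> {
  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // Mask the sign bit so the result is non-negative, then take the modulus.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```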

8 MapReduce

9 Graph algorithms
1. Computations occur at every vertex as a function of the vertex's internal state and its local graph structure
2. Partial results, in the form of arbitrary messages, are "passed" via directed edges to each vertex's neighbors
3. Computations occur at every vertex based on incoming partial results, potentially altering the vertex's internal state

10 Basic implementation: message passing
- The results of the computation are arbitrary messages to be passed to each vertex's neighbors
- Mappers emit intermediate key-value pairs where the key is the destination vertex id and the value is the message
- Mappers must also emit the vertex structure, with the vertex id as the key
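A minimal sketch of what such a message-passing mapper could look like for PageRank in Hadoop's Java API. The record encoding is an assumption made only for this sketch (vertex ids as IntWritable keys, values encoded as "rank<TAB>comma-separated adjacency list", structure messages tagged "S|" and rank messages tagged "M|"); the slides do not prescribe a format.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the basic PageRank mapper: emit the vertex structure keyed by the
// vertex id, plus one rank message per outgoing edge keyed by the destination.
public class BasicPageRankMapper
    extends Mapper<IntWritable, Text, IntWritable, Text> {

  private final IntWritable dest = new IntWritable();
  private final Text message = new Text();

  @Override
  protected void map(IntWritable vertexId, Text vertex, Context context)
      throws IOException, InterruptedException {
    String[] parts = vertex.toString().split("\t");
    double rank = Double.parseDouble(parts[0]);
    String[] neighbors = parts.length > 1 ? parts[1].split(",") : new String[0];

    // Pass the graph structure along so the reducer can rebuild the vertex.
    message.set("S|" + vertex.toString());
    context.write(vertexId, message);

    // Distribute this vertex's rank evenly over its outgoing edges.
    for (String n : neighbors) {
      dest.set(Integer.parseInt(n));
      message.set("M|" + (rank / neighbors.length));
      context.write(dest, message);
    }
  }
}
```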

11 Basic implementation: local aggregation
- Combiners reduce the amount of data that must be shuffled across the network
- They are only effective if there are multiple key-value pairs with the same key, computed on the same machine, that can be aggregated
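Continuing the encoding assumed in the mapper sketch above, a combiner for this job could sum the partial rank messages headed for the same destination vertex while passing the structure message through untouched. In Hadoop a combiner is simply a Reducer class registered on the job; this is an illustrative sketch, not the presenter's code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a combiner run on map output: partial rank messages ("M|...") for
// the same destination are summed locally; structure messages ("S|...") are
// forwarded unchanged. Tagging convention as in the mapper sketch above.
public class PageRankCombiner
    extends Reducer<IntWritable, Text, IntWritable, Text> {

  @Override
  protected void reduce(IntWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    double mass = 0.0;
    boolean sawMass = false;
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("M|")) {
        mass += Double.parseDouble(s.substring(2));
        sawMass = true;
      } else {
        context.write(key, v);   // structure message: pass through unchanged
      }
    }
    if (sawMass) {
      context.write(key, new Text("M|" + mass));
    }
  }
}
```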

12 Network bandwidth is a scarce resource. "A number of optimizations in our system are therefore targeted at reducing the amount of data sent across the network."

13 Table of contents: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

14 In-Mapper Combining: problems with combiners
- The semantics of combiners is underspecified in MapReduce
- Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all
- Combiners do not actually reduce the number of key-value pairs that are emitted by the mappers in the first place

15 In-Mapper Combining: number of key-value pairs
- Key-value pairs are still generated on a per-document basis
- This means unnecessary object creation and destruction
- And unnecessary object serialization and deserialization

16 In-Mapper Combining The basic idea is that mappers can preserve state across the processing of multiple input key-value pairs and defer emission of intermediate data until all input records have been processed.

17 Classic mapper
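As an illustration of the "classic mapper" this slide refers to (not the presenter's original figure), here is a minimal word-count mapper sketch in Hadoop's Java API: one (term, 1) pair is emitted per token, leaving all local aggregation to the combiner.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the "classic" word-count mapper: one (term, 1) pair per token,
// emitted on a per-document basis.
public class ClassicMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text term = new Text();

  @Override
  protected void map(LongWritable offset, Text doc, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(doc.toString());
    while (tokens.hasMoreTokens()) {
      term.set(tokens.nextToken());
      context.write(term, ONE);   // one pair per token, per document
    }
  }
}
```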

18 Improved mapper
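And a sketch of the "improved mapper" with in-mapper combining (again an illustration, not the original figure): counts accumulate in an in-memory associative array across all input records and are emitted only once, at the end. The hooks that the following slide calls Initialize and Close correspond, in Hadoop's Java Mapper API, to setup() and cleanup().

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the in-mapper-combining word-count mapper: state (the counts map)
// is preserved across all input records and emitted only in cleanup().
public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {          // "Initialize" in the slides
    counts = new HashMap<>();
  }

  @Override
  protected void map(LongWritable offset, Text doc, Context context) {
    StringTokenizer tokens = new StringTokenizer(doc.toString());
    while (tokens.hasMoreTokens()) {
      counts.merge(tokens.nextToken(), 1, Integer::sum);
    }
  }

  @Override
  protected void cleanup(Context context)          // "Close" in the slides
      throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
```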

19 In-Mapper Combining

20 In-Mapper Combining with Hadoop
- Before processing any data, the Initialize method is called to set up an associative array for holding counts
- Partial term counts are accumulated in the associative array across multiple documents
- Key-value pairs are emitted only when the mapper has processed all documents, via the Close method

21 In-Mapper Combining: benefits
- It provides control over when local aggregation occurs and exactly how it takes place
- The mappers generate only those key-value pairs that need to be shuffled across the network to the reducers

22 In-Mapper Combining: caveats
- It breaks the functional programming underpinnings of MapReduce, since state is preserved across input records
- Memory becomes a bottleneck, since the mapper must buffer intermediate results until all input has been processed

23 In-Mapper Combining: "block and flush"
- Instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs
- Alternatively, track the memory footprint and flush intermediate key-value pairs once memory usage has crossed a certain threshold
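A hedged sketch of the flush variant, building on the in-mapper-combining mapper above. The flush threshold is an arbitrary example value, and for simplicity it counts distinct buffered keys rather than tracking the exact memory footprint described on the slide.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of "block and flush": partial counts are emitted and the buffer is
// cleared whenever it holds more than FLUSH_THRESHOLD distinct keys, which
// bounds the mapper's memory footprint.
public class BlockAndFlushMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int FLUSH_THRESHOLD = 100_000;  // hypothetical bound
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable offset, Text doc, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(doc.toString());
    while (tokens.hasMoreTokens()) {
      counts.merge(tokens.nextToken(), 1, Integer::sum);
    }
    if (counts.size() > FLUSH_THRESHOLD) {
      flush(context);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);   // emit whatever is still buffered
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
    counts.clear();
  }
}
```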

24 Table of contents: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

25 Schimmy
- Network traffic dominates the execution time
- Shuffling the graph structure between the map and reduce phases is highly inefficient, especially for iterative MapReduce jobs
- Furthermore, in many algorithms the topology of the graph and its associated metadata do not change (only each vertex's state does)

26 Schimmy: intuition
The schimmy design pattern is based on the parallel merge join:
- S and T are two relations sorted by the join key
- Join them by scanning through both relations simultaneously
Example:
- S and T are both divided into ten files, partitioned in the same manner by the join key
- Merge join the first file of S with the first file of T, the second file of S with the second file of T, and so on
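A plain-Java sketch of the merge join idea, assuming both inputs are already sorted by an integer join key and each key appears at most once per relation:

```java
// Sketch of a merge join: scan S and T simultaneously, advance whichever side
// has the smaller current key, and emit a joined row when the keys match.
public class MergeJoin {
  public static void mergeJoin(int[] sKeys, String[] sVals,
                               int[] tKeys, String[] tVals) {
    int i = 0, j = 0;
    while (i < sKeys.length && j < tKeys.length) {
      if (sKeys[i] < tKeys[j]) {
        i++;                       // S is behind the current T key: advance S
      } else if (sKeys[i] > tKeys[j]) {
        j++;                       // T is behind the current S key: advance T
      } else {
        System.out.println(sKeys[i] + ": " + sVals[i] + " | " + tVals[j]);
        i++;
        j++;
      }
    }
  }

  public static void main(String[] args) {
    mergeJoin(new int[] {1, 3, 5}, new String[] {"a", "b", "c"},
              new int[] {2, 3, 5}, new String[] {"x", "y", "z"});
    // prints "3: b | y" and "5: c | z"
  }
}
```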

27 Schimmy applied to MapReduce
- Divide the graph G into n files: G = G1 ∪ G2 ∪ ... ∪ Gn
- The MapReduce execution framework guarantees that intermediate keys are processed in sorted order
- Set the number of reducers to n
- This guarantees that the intermediate keys processed by reducer R1 are exactly the vertex ids in G1, and so on up to Rn and Gn
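In Hadoop terms this amounts to configuring each iteration's job with n reducers and with a partitioner consistent with how the graph was split, so that reducer Ri lines up with partition Gi. A sketch of such a job setup; the mapper, reducer, and partitioner classes are passed in as parameters because their implementations are not specified here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the job setup the schimmy pattern relies on: n reducers, and the
// same partitioning scheme used to split the graph into G1 ... Gn, so that
// reducer Ri receives exactly the vertex ids stored in Gi.
public class SchimmyJobSetup {
  public static Job configure(Configuration conf, int n,
                              Class<? extends Mapper> mapper,
                              Class<? extends Reducer> reducer,
                              Class<? extends Partitioner> partitioner)
      throws Exception {
    Job job = Job.getInstance(conf, "pagerank-iteration");
    job.setMapperClass(mapper);
    job.setReducerClass(reducer);
    job.setPartitionerClass(partitioner);  // must match how the graph was split
    job.setNumReduceTasks(n);              // one reducer per graph partition Gi
    return job;
  }
}
```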

28 Schimmy mapper
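As an illustration of the schimmy mapper (not the original figure), under the same assumed encoding as the earlier basic-mapper sketch: it is identical except that the vertex structure is never re-emitted, so only rank messages (here plain doubles) are shuffled across the network.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a schimmy-style PageRank mapper: only rank messages are emitted;
// the graph structure stays on disk and is merged back in at the reducer.
public class SchimmyPageRankMapper
    extends Mapper<IntWritable, Text, IntWritable, Text> {

  private final IntWritable dest = new IntWritable();
  private final Text message = new Text();

  @Override
  protected void map(IntWritable vertexId, Text vertex, Context context)
      throws IOException, InterruptedException {
    String[] parts = vertex.toString().split("\t");
    double rank = Double.parseDouble(parts[0]);
    String[] neighbors = parts.length > 1 ? parts[1].split(",") : new String[0];

    for (String n : neighbors) {
      dest.set(Integer.parseInt(n));
      message.set(Double.toString(rank / neighbors.length));
      context.write(dest, message);   // rank message only; no structure emitted
    }
  }
}
```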

29 Classic reducer

30

31 Schimmy with Hadoop
- Before processing any data, the Initialize method opens the file containing the graph partition corresponding to the intermediate keys that are to be processed by this reducer
- For each intermediate key, the reducer advances the file stream through the graph structure until the corresponding vertex's structure is found
- Once the reduce computation is completed, the vertex's state is updated with the revised PageRank value and written back to disk
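A hedged sketch of such a schimmy reducer. The partition file naming, the config key holding the graph directory, and the "vertexId<TAB>rank<TAB>adjacency" text encoding are all assumptions made for this sketch; dangling mass and the damping factor are omitted for brevity.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a schimmy reducer: merge the sorted message stream with the local
// graph partition, update each vertex's rank, and write the vertex back out.
public class SchimmyPageRankReducer
    extends Reducer<IntWritable, Text, IntWritable, Text> {

  private BufferedReader graph;

  @Override
  protected void setup(Context context) throws IOException {
    // "Initialize": open the partition that corresponds to this reducer.
    int partition = context.getTaskAttemptID().getTaskID().getId();
    Path path = new Path(context.getConfiguration().get("pagerank.graph.path"),
                         String.format("part-%05d", partition));
    FileSystem fs = FileSystem.get(context.getConfiguration());
    graph = new BufferedReader(new InputStreamReader(fs.open(path)));
  }

  @Override
  protected void reduce(IntWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    double mass = 0.0;
    for (Text v : values) {
      mass += Double.parseDouble(v.toString());   // sum incoming rank messages
    }
    // Advance the local graph stream until this vertex's structure is found.
    String line;
    while ((line = graph.readLine()) != null) {
      String[] parts = line.split("\t", 3);
      int vertexId = Integer.parseInt(parts[0]);
      if (vertexId < key.get()) {
        // Vertex received no messages: copy its state through unchanged.
        context.write(new IntWritable(vertexId),
                      new Text(line.substring(line.indexOf('\t') + 1)));
        continue;
      }
      // Found the matching vertex: update its rank and write it back to disk.
      String adjacency = parts.length > 2 ? parts[2] : "";
      context.write(key, new Text(mass + "\t" + adjacency));
      break;
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    graph.close();
  }
}
```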

32 Schimmy: benefits. It eliminates the need to shuffle G across the network.

33 Schimmy: drawbacks
- The MapReduce execution framework arbitrarily assigns reducers to cluster nodes
- As a result, accessing the vertex data structures will almost always involve remote reads

34 Table of contents: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

35 Range Partitioning
- The graph is split into multiple blocks
- A hash function assigns each vertex to a block with uniform probability
- The hash function does not consider the topology of the graph

36 Range Partitioning
- For graph processing it is highly advantageous for adjacent vertices to be stored in the same block
- Intra-block links are maximized and inter-block links are minimized
- Web pages within a given domain are much more densely hyperlinked than pages across domains

37 Range Partitioning
- If web pages from the same domain are assigned to consecutive vertex ids, the graph can be partitioned into integer ranges
- For example, splitting a graph with |V| vertices into 100 blocks: block 1 contains vertex ids [1, |V|/100), block 2 contains vertex ids [|V|/100, 2|V|/100), and so on
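A sketch of a range partitioner implementing this scheme in Hadoop's Java API. The config key holding |V| is a hypothetical name, and vertex ids are assumed to start at 1 as on the slide.

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a range partitioner: consecutive vertex ids (e.g. pages from the
// same domain) land in the same block, instead of being hashed uniformly.
public class RangePartitioner extends Partitioner<IntWritable, Text>
    implements Configurable {

  private Configuration conf;
  private long numVertices;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    this.numVertices = conf.getLong("graph.num.vertices", 1);  // hypothetical key
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // Block k holds ids in [k * |V| / numPartitions, (k+1) * |V| / numPartitions).
    long blockSize = Math.max(1, numVertices / numPartitions);
    int block = (int) ((key.get() - 1) / blockSize);     // ids start at 1
    return Math.min(block, numPartitions - 1);           // clamp the remainder
  }
}
```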

38 Range Partitioning With sufficiently large block sizes, we can ensure that only a very small number of domains are split across more than one block.

39 Table of contents: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

40 The Graph
ClueWeb09 collection, a best-first web crawl by Carnegie Mellon University in early 2009:
- 50.2 million documents (1.53 TB)
- 1.4 billion links (stored as a 7.0 GB concise binary representation)
- Most pages have a small number of predecessors, but a few highly connected pages have several million

41 The Cluster
- 10 worker nodes, each with 2 hyperthreaded 3.2 GHz Intel Xeon CPUs and 4 GB of RAM
- 20 physical cores (40 virtual cores) in total
- Connected by gigabit Ethernet to a commodity switch

42 Results

43 Table of contents: Introduction; Basic Implementation of the MapReduce Algorithm; Optimizations (In-Mapper Combining, Schimmy, Range Partitioning); Results; Future Work and Conclusions

44 Future work
- Improve partitioning by clustering based on the actual graph topology (using MapReduce)
- Modify Hadoop's scheduling algorithm to improve Schimmy
- Improve in-mapper combining by storing more of the graph in memory between iterations

45 Conclusions
MapReduce is an emerging technology for large-scale data processing. However, its generality and flexibility come at a significant performance cost when analyzing large graphs, because standard best practices do not sufficiently address serializing, partitioning, and distributing the graph across a large cluster.

46 Bibliography
- J. Lin and M. Schatz, "Design Patterns for Efficient Graph Algorithms in MapReduce," in Proceedings of the Workshop on Mining and Learning with Graphs (MLG), 2010.
- J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 2008.
- J. Lin and C. Dyer, Data-Intensive Text Processing with MapReduce, Synthesis Lectures on Human Language Technologies, 2010.

47 Questions?

