Jimmy Lin and Michael Schatz Design Patterns for Efficient Graph Algorithms in MapReduce Michele Iovino Facoltà di Ingegneria dell’Informazione, Informatica e Statistica Dipartimento di Informatica Algoritmi Avanzati

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Overview

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Table of contents

Large graphs are ubiquitous in today’s information-based society.  Ranking search results  Analysis of social networks  Module detection of protein-protein interaction networks  Graph-based approaches for DNA They are difficult to analyze because of their large size Large Graphs

MapReduce has inefficiencies when processing large graphs A set of enhanced design patterns, applicable to a large class of graph algorithms, addresses many of those inefficiencies Purpose

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Table of contents

MapReduce Combiners  similar to reducers, except that they operate directly on the output of mappers Partitioners  responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers
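As an illustration of the partitioner’s role, here is a minimal Java sketch that mirrors the behaviour of Hadoop’s default hash partitioner; it is an example, not part of the presented work.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Minimal sketch: assign each intermediate key to a reducer by hashing it,
    // mirroring Hadoop's default HashPartitioner.
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }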

MapReduce

Graph algorithms 1. Computations occur at every vertex as a function of the vertex’s internal state and its local graph structure 2. Partial results in the form of arbitrary messages are “passed” via directed edges to each vertex’s neighbors 3. Computations occur at every vertex based on incoming partial results, potentially altering the vertex’s internal state

Basic implementation Message passing  The results of the computation are arbitrary messages to be passed to each vertex’s neighbors.  Mappers emit intermediate key-value pairs where the key is the destination vertex id and the value is the message  Mappers must also emit the vertex structure with the vertex id as the key
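A minimal Java sketch of such a mapper, with PageRank as the running example; PageRankNode is a hypothetical Writable (not from the paper) holding a vertex’s rank and adjacency list, with a message() factory for mass-only records.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Basic-implementation mapper sketch: emit the vertex structure plus one
    // message per outgoing edge carrying a share of the vertex's PageRank mass.
    public class BasicGraphMapper
        extends Mapper<LongWritable, PageRankNode, LongWritable, PageRankNode> {
      @Override
      protected void map(LongWritable vertexId, PageRankNode node, Context context)
          throws IOException, InterruptedException {
        // Pass the graph structure along so the reducer can reconstruct the vertex.
        context.write(vertexId, node);
        // Distribute the vertex's mass evenly over its outgoing edges.
        float mass = node.getPageRank() / node.getAdjacencyList().size();
        for (long neighborId : node.getAdjacencyList()) {
          context.write(new LongWritable(neighborId), PageRankNode.message(mass));
        }
      }
    }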

Basic implementation Local Aggregation  Combiners reduce the amount of data that must be shuffled across the network  only effective if there are multiple key-value pairs with the same key computed on the same machine that can be aggregated
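A hedged sketch of such a combiner for the PageRank example, reusing the hypothetical PageRankNode type: partial mass destined for the same vertex is summed locally, while structure records are passed through unchanged.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Combiner sketch: aggregate partial mass computed on the same machine so that
    // fewer key-value pairs are shuffled across the network.
    public class MassSumCombiner
        extends Reducer<LongWritable, PageRankNode, LongWritable, PageRankNode> {
      @Override
      protected void reduce(LongWritable vertexId, Iterable<PageRankNode> values, Context context)
          throws IOException, InterruptedException {
        float mass = 0.0f;
        for (PageRankNode value : values) {
          if (value.isMessage()) {
            mass += value.getPageRank();      // aggregate partial mass
          } else {
            context.write(vertexId, value);   // pass the structure record through unchanged
          }
        }
        context.write(vertexId, PageRankNode.message(mass));
      }
    }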

Network bandwidth is a scarce resource “A number of optimizations in our system are therefore targeted at reducing the amount of data sent across the network.”

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Table of contents

In-Mapper Combining Problems with combiners  Combiner semantics are underspecified in MapReduce  Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all  Combiners do not actually reduce the number of key-value pairs that are emitted by the mappers in the first place

In-Mapper Combining Number of key-value pairs  Key-value pairs are still generated on a per-document basis  unnecessary object creation and destruction  object serialization and deserialization

In-Mapper Combining The basic idea is that mappers can preserve state across the processing of multiple input key-value pairs and defer emission of intermediate data until all input records have been processed.

Classic mapper
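The original slide showed pseudo-code; as a stand-in, here is a minimal Java sketch of the classic word-count mapper, which emits one (term, 1) pair per token (whitespace tokenization assumed).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Classic mapper sketch: one (term, 1) pair is emitted for every token seen.
    public class ClassicMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable docId, Text doc, Context context)
          throws IOException, InterruptedException {
        for (String term : doc.toString().split("\\s+")) {
          if (!term.isEmpty()) {
            context.write(new Text(term), ONE);
          }
        }
      }
    }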

Improved mapper
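A corresponding sketch with in-mapper combining: counts are accumulated in an in-memory map across all input documents and emitted only once, in cleanup() (setup() and cleanup() are the newer Hadoop API names for the Initialize and Close steps described on a later slide).

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Improved mapper sketch: state is preserved across calls to map() and
    // intermediate pairs are emitted only after all input records have been processed.
    public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private Map<String, Integer> counts;

      @Override
      protected void setup(Context context) {
        counts = new HashMap<>();   // associative array holding partial term counts
      }

      @Override
      protected void map(LongWritable docId, Text doc, Context context) {
        for (String term : doc.toString().split("\\s+")) {
          if (!term.isEmpty()) {
            counts.merge(term, 1, Integer::sum);
          }
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    }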

In-Mapper Combining

In-Mapper Combining with Hadoop Before processing data, call the Initialize method to set up an associative array for holding counts Accumulate partial term counts in the associative array across multiple documents Emit key-value pairs only when the mapper has processed all documents, in the Close method

In-Mapper Combining qualities It provides control over when local aggregation occurs and exactly how it takes place The mappers generate only those key-value pairs that need to be shuffled across the network to the reducers

In-Mapper Combining awareness It breaks the functional programming underpinnings of MapReduce There is a scalability bottleneck associated with the in-mapper combining pattern: the state preserved across input records must fit in memory

In-Mapper Combining “Block and Flush” Instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs Track the memory footprint and flush intermediate key-value pairs once memory usage has crossed a certain threshold
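A sketch of the counter-based variant, written as a modification of the InMapperCombiningMapper sketch above; FLUSH_EVERY is an arbitrary illustrative threshold, and cleanup() would also call flush() to emit the final residue.

    // "Block and flush" sketch: emit and clear the partial counts every
    // FLUSH_EVERY input records instead of holding everything until cleanup().
    private static final int FLUSH_EVERY = 10000;   // illustrative threshold
    private int processed = 0;

    @Override
    protected void map(LongWritable docId, Text doc, Context context)
        throws IOException, InterruptedException {
      for (String term : doc.toString().split("\\s+")) {
        if (!term.isEmpty()) {
          counts.merge(term, 1, Integer::sum);
        }
      }
      if (++processed % FLUSH_EVERY == 0) {
        flush(context);                             // bound the memory footprint
      }
    }

    private void flush(Context context) throws IOException, InterruptedException {
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
      counts.clear();
    }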

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Table of contents

Schimmy Network traffic dominates the execution time Shuffling the graph structure between the map and reduce phases is highly inefficient, especially in iterative MapReduce jobs Furthermore, in many algorithms the topology of the graph and associated metadata do not change (only each vertex’s state does)

Schimmy intuition The schimmy design pattern is based on the parallel merge join  S and T are two relations sorted by the join key  Join by scanning through both relations simultaneously Example:  S and T are both divided into ten files, partitioned in the same manner by the join key  Merge join the first file of S with the first file of T, the second file of S with the second file of T, etc.
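A small, purely illustrative Java helper showing the merge-join scan itself: walk both sorted key sequences and advance whichever side is behind.

    // Illustrative merge join over two relations sorted by the join key
    // (assumes keys are unique within each relation).
    static void mergeJoin(long[] sKeys, long[] tKeys) {
      int i = 0, j = 0;
      while (i < sKeys.length && j < tKeys.length) {
        if (sKeys[i] == tKeys[j]) {
          System.out.println("join match on key " + sKeys[i]);
          i++; j++;
        } else if (sKeys[i] < tKeys[j]) {
          i++;                        // S is behind: advance S
        } else {
          j++;                        // T is behind: advance T
        }
      }
    }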

Schimmy applied to MapReduce Divide graph G into n files, partitioned in the same way the MapReduce partitioner assigns intermediate keys  G = G1 ∪ G2 ∪ ... ∪ Gn The MapReduce execution framework guarantees that intermediate keys are processed in sorted order  Set the number of reducers to n This guarantees that the intermediate keys processed by reducer R1 are exactly the vertex ids in G1, and so on up to Rn and Gn
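In Hadoop terms, the correspondence can be made explicit with a job configuration fragment along these lines (a sketch; the essential point is that the number of reducers and the partitioner must match the way G itself was split).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch: enforce the reducer-to-partition correspondence.
    int n = 100;   // number of graph partitions G1..Gn (illustrative)
    Job job = Job.getInstance(new Configuration(), "schimmy-pagerank");
    job.setNumReduceTasks(n);                              // one reducer per partition Gi
    job.setPartitionerClass(SimpleHashPartitioner.class);  // same function used to split G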

Schimmy mapper
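A sketch of the schimmy mapper, again with the hypothetical PageRankNode: unlike the basic mapper, it emits only the mass messages and never the graph structure.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Schimmy mapper sketch: only messages are shuffled; the graph structure
    // stays on disk and is merge-joined back in at the reducer.
    public class SchimmyMapper
        extends Mapper<LongWritable, PageRankNode, LongWritable, PageRankNode> {
      @Override
      protected void map(LongWritable vertexId, PageRankNode node, Context context)
          throws IOException, InterruptedException {
        float mass = node.getPageRank() / node.getAdjacencyList().size();
        for (long neighborId : node.getAdjacencyList()) {
          context.write(new LongWritable(neighborId), PageRankNode.message(mass));
        }
      }
    }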

Classic reducer
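For comparison, a sketch of the classic reducer, which must fish the vertex structure out of the shuffled values before it can sum the incoming mass (hypothetical PageRankNode as before; damping factor and dangling mass omitted).

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Classic reducer sketch: the vertex structure arrives through the shuffle,
    // mixed with the mass messages, and must be separated out.
    public class ClassicReducer
        extends Reducer<LongWritable, PageRankNode, LongWritable, PageRankNode> {
      @Override
      protected void reduce(LongWritable vertexId, Iterable<PageRankNode> values, Context context)
          throws IOException, InterruptedException {
        PageRankNode node = null;   // assumes exactly one structure record per vertex
        float mass = 0.0f;
        for (PageRankNode value : values) {
          if (value.isMessage()) {
            mass += value.getPageRank();   // incoming partial mass
          } else {
            node = value.copy();           // Hadoop reuses objects, so copy the structure
          }
        }
        node.setPageRank(mass);
        context.write(vertexId, node);
      }
    }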

Schimmy with Hadoop Before processing data, call the Initialize method to open the file containing the graph partition corresponding to the intermediate keys that are to be processed by the reducer Advance the file stream through the graph structure until the corresponding vertex’s structure is found Once the reduce computation is completed, the vertex’s state is updated with the revised PageRank value and written back to disk
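A hedged sketch of such a schimmy reducer, using a hypothetical PartitionReader helper for the on-disk graph partition (in practice this would wrap a sequence-file reader over the partition assigned to this reducer).

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Schimmy reducer sketch: merge-join the sorted intermediate keys with the
    // matching on-disk graph partition instead of receiving the structure via the shuffle.
    public class SchimmyReducer
        extends Reducer<LongWritable, PageRankNode, LongWritable, PageRankNode> {
      private PartitionReader graph;   // hypothetical reader over this reducer's partition

      @Override
      protected void setup(Context context) throws IOException {
        int partition = context.getTaskAttemptID().getTaskID().getId();
        graph = PartitionReader.open(context.getConfiguration(), partition);
      }

      @Override
      protected void reduce(LongWritable vertexId, Iterable<PageRankNode> values, Context context)
          throws IOException, InterruptedException {
        // Advance through the sorted partition file until the matching vertex is found,
        // writing skipped vertices back unchanged (they received no messages).
        PageRankNode node = graph.advanceTo(vertexId.get(), context);
        float mass = 0.0f;
        for (PageRankNode value : values) {
          mass += value.getPageRank();
        }
        node.setPageRank(mass);          // simplified: damping factor omitted
        context.write(vertexId, node);   // revised state written back to HDFS
      }
    }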

Schimmy qualities It eliminates the need to shuffle G across the network

Schimmy problems The MapReduce execution framework arbitrarily assigns reducers to cluster nodes Accessing vertex data structures will almost always involve remote reads

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Table of contents

Range Partitioning The graph is split into multiple blocks A hash function assigns each vertex to a block with uniform probability The hash function does not consider the topology of the graph

Range Partitioning For graph processing it is highly advantageous for adjacent vertices to be stored in the same block Intra-block links are maximized and the inter-block links are minimized Web pages within a given domain are much more densely hyperlinked than pages across domains

Range Partitioning If web pages from the same domain are assigned to consecutive vertex ids, the graph can be partitioned into integer ranges Splitting a graph with |V| vertices into 100 blocks, block 1 contains vertex ids [1, |V|/100), block 2 contains vertex ids [|V|/100, 2|V|/100), and so on
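A sketch of a matching range partitioner in Hadoop, assuming vertex ids are LongWritable keys and the total vertex count is passed in through an illustrative configuration property.

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Range partitioner sketch: consecutive vertex ids land in the same block, so the
    // densely hyperlinked pages of a single domain tend to be processed together.
    public class RangePartitioner<V> extends Partitioner<LongWritable, V> implements Configurable {
      private Configuration conf;
      private long totalVertices;

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // "graph.vertices" is an illustrative property name, not a standard Hadoop key.
        totalVertices = conf.getLong("graph.vertices", 1);
      }

      @Override
      public Configuration getConf() { return conf; }

      @Override
      public int getPartition(LongWritable vertexId, V value, int numReduceTasks) {
        long blockSize = (totalVertices + numReduceTasks - 1) / numReduceTasks;  // ceiling
        int block = (int) (vertexId.get() / blockSize);
        return Math.min(block, numReduceTasks - 1);   // clamp the last block
      }
    }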

Range Partitioning With sufficiently large block sizes, we can ensure that only a very small number of domains are split across more than one block.

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Table of contents

The Graph ClueWeb09 collection, a best-first web crawl by Carnegie Mellon University in early 2009  50.2 million documents (1.53 TB)  1.4 billion links (stored as a 7.0 GB concise binary representation)  Most pages have a small number of predecessors, but a few highly connected pages have several million

The Cluster 10 worker nodes, each with 2 hyperthreaded 3.2 GHz Intel Xeon CPUs and 4 GB of RAM  20 physical cores (40 virtual cores) in total Connected by gigabit Ethernet to a commodity switch

Results

Introduction Basic Implementation of MapReduce Algorithm Optimizations  In-Mapper Combining  Schimmy  Range Partitioning Results Future Work and Conclusions Table of contents

Future work Improve partitioning by clustering based on actual graph topology (using MapReduce) Modify Hadoop’s scheduling algorithm to improve Schimmy Improve in-mapper combining by storing more of the graph in memory between iterations

Conclusions MapReduce is an emerging technology that offers a general and flexible programming model However, this generality and flexibility come at a significant performance cost when analyzing large graphs, because standard best practices do not sufficiently address serializing, partitioning, and distributing the graph across a large cluster

Bibliography J. Lin, M. Schatz, Design Patterns for Efficient Graph Algorithms in MapReduce, Proceedings of the Workshop on Mining and Learning with Graphs (MLG), 2010. J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 2008. J. Lin, C. Dyer, Data-Intensive Text Processing with MapReduce, Synthesis Lectures on Human Language Technologies, 2010.

Questions?