
Assignment of Different-Sized Inputs in MapReduce Shantanu Sharma 2 joint work with Foto N. Afrati 1, Shlomi Dolev 2, Ephraim Korach 2, and Jeffrey D. Ullman 3 1 National Technical University of Athens, Greece 2 Ben-Gurion University of the Negev, Israel 3 Stanford University, USA 28th International Symposium on Distributed Computing (DISC 2014) Austin, Texas, USA (12-15 October 2014)

Introduction
Cluster Computing
– Terabytes or petabytes of data cannot be processed on a single computer
– A cluster of computers is used instead
– How to mask failures, e.g., hardware failures?
MapReduce is a programming model used for parallel processing over large-scale data

MapReduce job: Map Phase and Reduce Phase
Map Phase: applies a user-defined Map function
Reduce Phase: applies a user-defined Reduce function
[Figure: execution overview – the master process forks workers and assigns map and reduce tasks; map workers read the input chunks (Chunk 0, 1, 2) and write intermediate data to local disk; reduce workers remote-read and sort that data and write the output files (Output File 0, 1).]

Introduction
MapReduce working example – Word Count
Input: "I like apple. Apple is fruit. I like banana."
Each mapper emits a (word, 1) pair for every word occurrence; the pairs are grouped by word, and one reducer per word (I, like, apple, is, fruit, banana) sums the counts.
Output: (I, 2), (like, 2), (apple, 2), (is, 1), (fruit, 1), (banana, 1)
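The word-count flow above can be sketched in plain Python (a minimal illustration of the Map and Reduce steps, not Hadoop API code; the function names are hypothetical):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.lower().replace(".", "").split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle by key, then Reduce: sum the counts for each word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["I like apple.", "Apple is fruit.", "I like banana."]
print(reduce_phase(map_phase(docs)))
# {'i': 2, 'like': 2, 'apple': 2, 'is': 1, 'fruit': 1, 'banana': 1}
```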

Introduction
Inputs and outputs in our context – in the same Word Count example, the records consumed by the mappers are the inputs, and the (word, count) pairs produced by the reducers are the outputs.

Reducer Capacity
Values, provided by each mapper, have sizes (input sizes)
Reducer capacity: an upper bound on the sum of the sizes of the values that are assigned to the reducer
Example: the reducer capacity may be the size of the main memory of the processors on which the reducers run
We consider two special matching problems
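The capacity constraint can be stated as a one-line check (a sketch; the names and input sizes are hypothetical):

```python
def within_capacity(reducer_inputs, size, q):
    """Reducer capacity: the sum of the sizes of the values
    assigned to a reducer must not exceed the bound q."""
    return sum(size[v] for v in reducer_inputs) <= q

size = {"fl1": 3, "fl2": 2, "fl3": 4}  # hypothetical input sizes
print(within_capacity(["fl1", "fl2"], size, q=6))  # True: 3 + 2 <= 6
print(within_capacity(["fl1", "fl3"], size, q=6))  # False: 3 + 4 > 6
```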

State of the Art
F. Afrati, A. Das Sarma, S. Salihoglu, and J. D. Ullman, "Upper and Lower Bounds on the Cost of a Map-Reduce Computation," PVLDB.
– Assumes unit input sizes
– Reducer size: the maximum number of inputs that a given reducer can have

Problem Statement
The communication cost between the map and the reduce phases is a significant factor
How can we reduce the communication cost?
– Fewer reducers mean a smaller communication cost
– How to minimize the total number of reducers while respecting their limited capacity?
Not an easy task
– All-to-All mapping schema problem
– X-to-Y mapping schema problem
[Figure: three inputs assigned under two schemes – with small reducers, keys k1, k2, k3 pair the inputs as (1, 2), (1, 3), and (2, 3) across three reducers; with a large enough reducer, a single key k1 sends all three inputs to one reducer. Notation – k_i: key]

A2A Mapping Schema Problem
A set of inputs is given
Every pair of inputs corresponds to one output
Example – computing common friends
– Friend lists of m persons are given
– Find the common friends of the given m persons
– Every two friend lists must be assigned to at least one reducer in common

A2A Mapping Schema Problem
Case: the reducer capacity is enough to hold only some of the friend lists together
[Figure: four friend lists assigned via three keys – reducer k1 holds lists (1, 2, 3), reducer k2 holds (1, 2, 4), and reducer k3 holds (3, 4), so every pair (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4) meets at some reducer. Notation – k_i: key, fl_i: i-th friend list]

A2A Mapping Schema Problem
Case: the reducer capacity is enough to hold all the friend lists together
[Figure: all four friend lists are sent under a single key k1 to one reducer, which covers every pair (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4). Notation – k_i: key, fl_i: i-th friend list]

What to do?
– Assign the given m inputs to the given number of reducers, without exceeding the capacity q, such that every input is coupled with every other input in at least one reducer in common
Polynomial-time solvable for one or two reducers
NP-hard for z > 2 reducers
A2A Mapping Schema Problem
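The feasibility condition of the A2A problem – no reducer over capacity, and every pair of inputs meeting at some reducer – can be checked directly (a sketch; the example assignment and unit sizes are hypothetical):

```python
from itertools import combinations

def is_valid_a2a(assignment, sizes, q):
    """assignment: list of reducers, each a set of input indices.
    Valid iff no reducer exceeds capacity q and every pair of
    inputs is assigned to at least one reducer in common."""
    if any(sum(sizes[i] for i in r) > q for r in assignment):
        return False
    covered = {p for r in assignment for p in combinations(sorted(r), 2)}
    needed = set(combinations(range(len(sizes)), 2))
    return needed <= covered

sizes = [1, 1, 1, 1]  # four unit-size friend lists
print(is_valid_a2a([{0, 1, 2}, {0, 1, 3}, {2, 3}], sizes, q=3))  # True
print(is_valid_a2a([{0, 1}, {2, 3}], sizes, q=3))  # False: pair (0, 2) never meets
```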

Heuristics for A2A Mapping Schema Problem
Based on
– the First-Fit Decreasing (FFD) or Best-Fit Decreasing (BFD) bin-packing algorithms
– a pseudo-polynomial bin-packing algorithm *
– 2-step algorithms
– the selection of a prime number p
A fixed reducer capacity is given
* D. R. Karger and J. Scott, "Efficient algorithms for fixed-precision instances of bin packing and Euclidean TSP," APPROX-RANDOM, pages 104–117, 2008.
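For reference, the classic First-Fit Decreasing procedure that these heuristics build on can be sketched as follows (plain FFD bin packing only, not the full mapping-schema heuristic from the paper):

```python
def first_fit_decreasing(sizes, capacity):
    """Classic FFD: sort items by decreasing size and place each
    into the first bin with room, opening a new bin when none fits."""
    bins = []
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing bin fits: open a new one
    return bins

print(first_fit_decreasing([5, 4, 3, 3, 2, 1], capacity=7))
# [[5, 2], [4, 3], [3, 1]]
```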

X2Y Mapping Schema Problem
Two disjoint sets X and Y are given
Each pair of elements ⟨x_i, y_j⟩ (where x_i ∈ X, y_j ∈ Y, for all i, j) corresponds to one output
Example – skew join
– Two relations X(A, B) and Y(B, C) are given, where many tuples share a common "b" value
– Every pair of tuples with an identical "b" value is required to be assigned to at least one reducer in common

X2Y Mapping Schema Problem
What to do?
– Assign each input of the set X together with each input of the set Y to at least one reducer in common, without exceeding q
Polynomial for one reducer
– Simply check whether all the inputs of the sets X and Y fit in a single reducer
NP-hard for z > 1 reducers
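A naive grid-style construction illustrates the X2Y requirement for unit-size inputs (a hypothetical sketch, not the paper's heuristic): split X and Y into groups of q/2 each and create one reducer per pair of groups, so every ⟨x, y⟩ pair meets somewhere.

```python
from itertools import product

def x2y_grid(x_inputs, y_inputs, q):
    """Unit-size sketch: partition X and Y into groups of q//2 and
    build one reducer per (X-group, Y-group) pair; each reducer
    holds at most q inputs, and every <x, y> pair shares a reducer."""
    g = max(q // 2, 1)
    xg = [x_inputs[i:i + g] for i in range(0, len(x_inputs), g)]
    yg = [y_inputs[i:i + g] for i in range(0, len(y_inputs), g)]
    return [gx + gy for gx, gy in product(xg, yg)]

print(x2y_grid(["x1", "x2", "x3"], ["y1", "y2"], q=4))
# [['x1', 'x2', 'y1', 'y2'], ['x3', 'y1', 'y2']]
```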

Heuristics for X2Y Mapping Schema Problem Based on – First-Fit Decreasing (FFD) or Best-Fit Decreasing (BFD) bin-packing algorithm A fixed reducer capacity is given 16

Conclusion
Reducer capacity
– An important parameter to be considered in all MapReduce algorithms
– The capacity is defined in terms of the sizes of the inputs, which are not necessarily identical, e.g., the main memory available to the reducer
Two assignment schema problems of MapReduce are given
– All-to-All (A2A) mapping schema problem
– X-to-Y (X2Y) mapping schema problem
Several heuristics for the A2A and X2Y mapping schema problems are provided

Foto Afrati 1, Shlomi Dolev 2, Ephraim Korach 3, Shantanu Sharma 2, and Jeffrey D. Ullman 4 1 School of Electrical and Computing Engineering, National Technical University of Athens, Greece 2 Department of Computer Science, Ben-Gurion University of the Negev, Israel 3 Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel 4 Department of Computer Science, Stanford University, USA Presentation is available at