Assignment of Different-Sized Inputs in MapReduce


1 Assignment of Different-Sized Inputs in MapReduce. Shantanu Sharma (2), joint work with Foto N. Afrati (1), Shlomi Dolev (2), Ephraim Korach (2), and Jeffrey D. Ullman (3). (1) National Technical University of Athens, Greece; (2) Ben-Gurion University of the Negev, Israel; (3) Stanford University, USA. 28th International Symposium on Distributed Computing (DISC 2014), Austin, Texas, USA (12-15 October 2014).

2 Introduction. Cluster computing: – Terabytes or petabytes of data cannot be processed on a single computer – A cluster of computers is used instead – How to mask failures, e.g., hardware failures? MapReduce is a programming model used for parallel processing over large-scale data.

3 MapReduce job: Map Phase and Reduce Phase. Map Phase: applies a user-defined Map function. Reduce Phase: applies a user-defined Reduce function. [Figure: a master process forks workers and assigns map tasks and reduce tasks; map workers read the input data chunks (Chunk 0, Chunk 1, Chunk 2) and write locally; reduce workers perform a remote read and sort, then write Output File 0 and Output File 1.]

4 Introduction: MapReduce working example – Word Count. Input text: "I like apple. Apple is fruit. I like banana." [Figure: Mapper 1 and Mapper 2 emit a count of 1 for every word they read; there is one reducer per word (I, like, apple, is, banana, fruit), and each reducer sums the counts it receives.] Outputs: (I, 2), (like, 2), (apple, 2), (is, 1), (fruit, 1), (banana, 1).
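The Word Count example above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps, not a MapReduce framework. The function names, the split of the text across two chunks, and the lowercasing of words (so that "apple" and "Apple" reach the same reducer) are assumptions made for the sketch.

```python
from collections import defaultdict
import re


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z]+", chunk)]


def shuffle(pairs):
    """Group the emitted values by key; each group goes to one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts received for one word."""
    return key, sum(values)


def word_count(chunks):
    """Run the map, shuffle, and reduce steps over all chunks."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Assumed split of the slide's text across two mappers.
chunks = ["I like apple. Apple is fruit.", "I like banana."]
counts = word_count(chunks)
```

With these two chunks, `counts` maps each lowercased word to the totals shown on the slide, for instance `counts["apple"]` is 2 and `counts["banana"]` is 1.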

5 Introduction: Inputs and outputs in our context. [Figure: the Word Count example from the previous slide, annotated with the labels "Inputs" and "Outputs".]

6 Reducer Capacity. Values, provided by each mapper, have some sizes (input size). Reducer capacity: an upper bound on the sum of the sizes of the values that are assigned to the reducer. Example: the reducer capacity may be the size of the main memory of the processors on which the reducers run. We consider two special matching problems.
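As a small illustration of the capacity constraint, the check below sums the sizes of the inputs assigned to each reducer and compares the sum against a capacity `q`. The names `load`, `respects_capacity`, and the example sizes are hypothetical and chosen only for this sketch.

```python
def load(reducer_inputs, sizes):
    """Total size of the inputs assigned to one reducer."""
    return sum(sizes[name] for name in reducer_inputs)


def respects_capacity(assignment, sizes, q):
    """True if no reducer in the assignment holds more than q in total."""
    return all(load(reducer, sizes) <= q for reducer in assignment)


# Hypothetical input sizes and capacity.
sizes = {"a": 3, "b": 4, "c": 2}
q = 7

pairwise = [{"a", "b"}, {"a", "c"}, {"b", "c"}]  # loads 7, 5, 6
everything = [{"a", "b", "c"}]                   # load 9

pairwise_ok = respects_capacity(pairwise, sizes, q)
everything_ok = respects_capacity(everything, sizes, q)
```

Here the three-reducer assignment fits within `q = 7`, while placing all inputs on a single reducer does not.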

7 State-of-the-Art. F. Afrati, A.D. Sarma, S. Salihoglu, and J.D. Ullman, "Upper and Lower Bounds on the Cost of a Map-Reduce Computation," PVLDB, 2013. – Unit input size – Reducer size: the maximum number of inputs that a given reducer can have.

8 Problem Statement. Communication cost between the map and the reduce phases is a significant factor. How can we reduce the communication cost? – A smaller number of reducers, and hence, a smaller communication cost – How to minimize the total number of reducers while respecting their limited capacity? Not an easy task: – All-to-All mapping schema problem – X-to-Y mapping schema problem. Notation: k_i: key. [Figure: two assignments of three inputs. In the first, the mappers send input 1 with keys k_1 and k_2, input 2 with keys k_1 and k_3, and input 3 with keys k_2 and k_3, so the reducers for k_1, k_2, and k_3 receive inputs (1, 2), (1, 3), and (2, 3). In the second, the mappers send all three inputs with key k_1, so a single reducer for k_1 receives inputs (1, 2, 3).]
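The two assignments in the figure can be compared numerically. The sketch below assumes that communication cost means the total size of all input copies sent from mappers to reducers, and that the three inputs have unit size; neither assumption is stated on the slide, and the names are hypothetical.

```python
def communication_cost(assignment, sizes):
    """Total size of all input copies sent to reducers (assumed cost measure)."""
    return sum(sizes[name] for reducer in assignment for name in reducer)


# Three inputs of unit size (an assumption for illustration).
sizes = {1: 1, 2: 1, 3: 1}

three_reducers = [{1, 2}, {1, 3}, {2, 3}]  # every input is sent twice
one_reducer = [{1, 2, 3}]                  # every input is sent once

cost_three = communication_cost(three_reducers, sizes)
cost_one = communication_cost(one_reducer, sizes)
```

Under these assumptions the three-reducer assignment costs 6 and the single-reducer assignment costs 3, which matches the slide's point that fewer reducers mean less communication, provided the capacity allows it.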

9 A2A Mapping Schema Problem. A set of inputs is given. Each pair of inputs corresponds to one output. Example – Computing common friends: – Lists of friends of m persons are given – Find the common friends of every two of the given m persons – Every two friend lists must be assigned to a common reducer.

10 A2A Mapping Schema Problem. Reducer capacity is enough to hold some of the friend lists together. Notations: k_i: key; fl_i: i-th friend list. [Figure: the mappers for the 1st to 4th friends send fl_1 with keys k_1 and k_2, fl_2 with keys k_1 and k_2, fl_3 with keys k_1 and k_3, and fl_4 with keys k_2 and k_3. The reducer for k_1 receives lists (1, 2, 3), the reducer for k_2 receives lists (1, 2, 4), and the reducer for k_3 receives lists (3, 4). Pairs: (1, 2), (1, 3), (2, 3), (2, 4), (1, 4), (3, 4).]

11 A2A Mapping Schema Problem. Reducer capacity is enough to hold all the friend lists together. Notations: k_i: key; fl_i: i-th friend list. [Figure: the mappers for the 1st to 4th friends send fl_1, fl_2, fl_3, and fl_4 with the single key k_1, so one reducer for k_1 receives lists (1, 2, 3, 4). Pairs: (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4).]

12 A2A Mapping Schema Problem. What to do? – Assign the given m inputs to the given number of reducers, without exceeding the reducer capacity q, in a manner that every given input is coupled with every other given input in at least one reducer in common. Polynomial-time solution for one and two reducers. NP-hard for z > 2 reducers.
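The two requirements of an A2A mapping schema (no reducer exceeds q, and every pair of inputs meets in some reducer) can be checked mechanically. The sketch below is a hypothetical validator, not code from the authors; it is exercised on the four friend lists from the earlier slide, assuming unit sizes and q = 3.

```python
from itertools import combinations


def is_valid_a2a(assignment, sizes, q):
    """Check capacity and pairwise coverage for an A2A mapping schema."""
    within_capacity = all(
        sum(sizes[name] for name in reducer) <= q for reducer in assignment
    )
    # Collect every pair of inputs that shares at least one reducer.
    covered = set()
    for reducer in assignment:
        covered.update(combinations(sorted(reducer), 2))
    required = set(combinations(sorted(sizes), 2))
    return within_capacity and required <= covered


# Friend lists 1..4 with unit sizes (an assumption) and capacity q = 3.
sizes = {1: 1, 2: 1, 3: 1, 4: 1}
schema = [{1, 2, 3}, {1, 2, 4}, {3, 4}]

valid = is_valid_a2a(schema, sizes, q=3)
missing_pair = is_valid_a2a([{1, 2, 3}, {1, 2, 4}], sizes, q=3)
too_small = is_valid_a2a(schema, sizes, q=2)
```

The three-reducer schema passes; dropping the reducer that holds lists (3, 4) leaves that pair uncovered, and lowering q to 2 violates the capacity.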

13 Heuristics for A2A Mapping Schema Problem. A fixed reducer capacity is given. Based on: – First-Fit Decreasing (FFD) or Best-Fit Decreasing (BFD) bin-packing algorithm – Pseudo-polynomial bin-packing algorithm * – 2-step algorithms – The selection of a prime number p. * D. R. Karger and J. Scott. Efficient algorithms for fixed-precision instances of bin packing and Euclidean TSP. In APPROX-RANDOM, pages 104–117, 2008.
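The slide does not spell out how bin packing is used, so the following is only one plausible sketch of an FFD-based heuristic. It assumes that every input has size at most q/2, packs the inputs into bins of capacity q/2 with First-Fit Decreasing, and then assigns every pair of bins to one reducer, so that any two inputs meet in some reducer of load at most q. The function names and example sizes are hypothetical.

```python
from itertools import combinations


def first_fit_decreasing(sizes, bin_capacity):
    """Pack items (name -> size) into bins using First-Fit Decreasing."""
    bins = []  # each entry is [remaining_capacity, [item names]]
    for name in sorted(sizes, key=sizes.get, reverse=True):
        size = sizes[name]
        if size > bin_capacity:
            raise ValueError(f"input {name!r} does not fit in a bin")
        for entry in bins:
            if entry[0] >= size:  # first bin with enough room
                entry[0] -= size
                entry[1].append(name)
                break
        else:
            bins.append([bin_capacity - size, [name]])
    return [names for _, names in bins]


def a2a_schema(sizes, q):
    """Assumed heuristic: bins of capacity q/2, one reducer per pair of bins."""
    bins = first_fit_decreasing(sizes, q / 2)
    if len(bins) == 1:
        return [set(bins[0])]
    return [set(a) | set(b) for a, b in combinations(bins, 2)]


# Hypothetical input sizes and reducer capacity.
sizes = {"a": 4, "b": 3, "c": 3, "d": 2, "e": 2, "f": 1}
q = 10
schema = a2a_schema(sizes, q)
```

With these sizes FFD produces three bins, so the sketch uses three reducers. The number of reducers grows quadratically with the number of bins, which is why a tight packing matters for this kind of construction.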

14 X2Y Mapping Schema Problem. Two disjoint sets X and Y are given. Each pair of elements ⟨x_i, y_j⟩ (where x_i ∈ X, y_j ∈ Y, ∀ i, j) of the sets X and Y corresponds to one output. Example – Skew join: – Two relations X(A, B) and Y(B, C) are given, where many tuples have a common "b" value – Every two tuples, one from X and one from Y, with an identical "b" value must be assigned to at least one reducer in common.

15 X2Y Mapping Schema Problem. What to do? – Assign each input of the set X with each input of the set Y to at least one reducer in common, without exceeding the reducer capacity q. Polynomial for one reducer: – Can we assign all the inputs of the sets X and Y to a single reducer? NP-hard for z > 1 reducers.
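Both questions on this slide can be phrased as short checks. The sketch below is a hypothetical illustration: the single-reducer case reduces to comparing the total size of X and Y against q, and a general schema is valid if every reducer respects q and every (x, y) pair shares a reducer. Input names in X and Y are assumed to be distinct.

```python
from itertools import product


def fits_single_reducer(x_sizes, y_sizes, q):
    """One reducer suffices exactly when all inputs together fit within q."""
    return sum(x_sizes.values()) + sum(y_sizes.values()) <= q


def is_valid_x2y(assignment, x_sizes, y_sizes, q):
    """Check capacity and coverage of every (x, y) pair."""
    sizes = {**x_sizes, **y_sizes}
    within_capacity = all(
        sum(sizes[name] for name in reducer) <= q for reducer in assignment
    )
    covered = set()
    for reducer in assignment:
        xs = [name for name in reducer if name in x_sizes]
        ys = [name for name in reducer if name in y_sizes]
        covered.update(product(xs, ys))
    required = set(product(x_sizes, y_sizes))
    return within_capacity and required <= covered


# Hypothetical sizes for the two sets.
x_sizes = {"x1": 2, "x2": 2}
y_sizes = {"y1": 1, "y2": 3}

single_at_6 = fits_single_reducer(x_sizes, y_sizes, q=6)
single_at_8 = fits_single_reducer(x_sizes, y_sizes, q=8)
schema = [{"x1", "x2", "y1"}, {"x1", "y2"}, {"x2", "y2"}]
valid_at_6 = is_valid_x2y(schema, x_sizes, y_sizes, q=6)
```

With q = 6 the inputs (total size 8) do not fit on one reducer, but the three-reducer schema covers all four (x, y) pairs within capacity.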

16 Heuristics for X2Y Mapping Schema Problem. A fixed reducer capacity is given. Based on: – First-Fit Decreasing (FFD) or Best-Fit Decreasing (BFD) bin-packing algorithm.
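As with the A2A case, the slide names the bin-packing basis without details, so this is a hedged sketch of one way such a heuristic could look. It assumes the capacity is split evenly: inputs of X are packed into bins of capacity q/2, inputs of Y into bins of capacity q/2, and each (X-bin, Y-bin) pair becomes one reducer. The even split, the function names, and the example sizes are all assumptions.

```python
from itertools import product


def first_fit_decreasing(sizes, bin_capacity):
    """Pack items (name -> size) into bins using First-Fit Decreasing."""
    bins = []  # each entry is [remaining_capacity, [item names]]
    for name in sorted(sizes, key=sizes.get, reverse=True):
        size = sizes[name]
        if size > bin_capacity:
            raise ValueError(f"input {name!r} does not fit in a bin")
        for entry in bins:
            if entry[0] >= size:  # first bin with enough room
                entry[0] -= size
                entry[1].append(name)
                break
        else:
            bins.append([bin_capacity - size, [name]])
    return [names for _, names in bins]


def x2y_schema(x_sizes, y_sizes, q):
    """Assumed heuristic: one reducer per (X-bin, Y-bin) pair."""
    x_capacity = q / 2            # share of q reserved for X (assumption)
    y_capacity = q - x_capacity   # remainder reserved for Y
    x_bins = first_fit_decreasing(x_sizes, x_capacity)
    y_bins = first_fit_decreasing(y_sizes, y_capacity)
    return [set(xb) | set(yb) for xb, yb in product(x_bins, y_bins)]


# Hypothetical sizes and capacity.
x_sizes = {"x1": 3, "x2": 2, "x3": 2}
y_sizes = {"y1": 4, "y2": 1, "y3": 1}
q = 8
schema = x2y_schema(x_sizes, y_sizes, q)
```

For these sizes X and Y are each packed into two bins, giving four reducers. An uneven split of q between X and Y may need fewer reducers when one set has much larger inputs than the other; the even split here is only the simplest choice.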

17 Conclusion. Reducer capacity: – An important parameter to be considered in all MapReduce algorithms – The capacity is in terms of, not necessarily identical, memory auxiliary size, augmented and added to the index of the data item(s). Two assignment schemas of MapReduce are given: – All-to-All (A2A) mapping schema problem – X-to-Y (X2Y) mapping schema problem. Several heuristics for the A2A and X2Y mapping schema problems are provided.

18 Foto Afrati (1), Shlomi Dolev (2), Ephraim Korach (3), Shantanu Sharma (2), and Jeffrey D. Ullman (4). (1) School of Electrical and Computing Engineering, National Technical University of Athens, Greece, afrati@softlab.ece.ntua.gr. (2) Department of Computer Science, Ben-Gurion University of the Negev, Israel, {dolev,sharmas}@cs.bgu.ac.il. (3) Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel, korach@bgu.ac.il. (4) Department of Computer Science, Stanford University, USA, ullman@cs.stanford.edu. Presentation is available at http://www.cs.bgu.ac.il/~sharmas/publication.html

