
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. H. Yang, A. Dasdan (Yahoo!), R. Hsiao, D. S. Parker (UCLA). Presented by Shimin Chen, Big Data Reading Group.


1 Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
H. Yang, A. Dasdan (Yahoo!), R. Hsiao, D. S. Parker (UCLA)
Presented by Shimin Chen, Big Data Reading Group

2 Motivation
Map-Reduce framework: compared to a relational DBMS, it is "simplified" for data processing in search engines.
Problem: joining multiple heterogeneous datasets does not quite fit into map-reduce.
Ad-hoc solutions: run map-reduce on one dataset while reading data from the other dataset on the fly.

3 Contribution
Goal: support relational algebra primitives without sacrificing existing generality and simplicity.
Proposal: map-reduce-merge.

4 Outline Introduction Map-Reduce Map-Reduce-Merge Applications to Relational Data Processing Optimizations and Enhancements Case Studies Conclusion

5 Let’s Refresh Our Memory Functional programming model

6 Comments
Low-cost unreliable commodity hardware: failures often occur during map/reduce tasks; the coordinator re-runs the failed mapper or reducer.
Homogenization (for equi-join): transform each dataset into (join key, payload) pairs, then apply map-reduce to merge entries from different datasets.
Problems: supports only equi-joins, may take lots of extra disk space, and incurs excessive communication.
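The homogenization workaround can be sketched in a few lines. This is a hypothetical in-memory driver (the names `map_emp`, `map_dept`, and `reduce_join` are illustrative, not from the paper), assuming a simple Employee/Department equi-join on dept-id:

```python
from collections import defaultdict

# Homogenization sketch: both datasets are mapped to (join-key, payload)
# pairs tagged with their source, then a single reduce joins entries that
# share a key. Works only for equi-joins, as the slide notes.
def map_emp(record):
    emp_id, dept_id, name = record
    return (dept_id, ("emp", name))          # join key: dept_id

def map_dept(record):
    dept_id, dept_name = record
    return (dept_id, ("dept", dept_name))

def reduce_join(key, tagged_values):
    emps = [v for t, v in tagged_values if t == "emp"]
    depts = [v for t, v in tagged_values if t == "dept"]
    return [(key, e, d) for e in emps for d in depts]

def run(employees, departments):
    # Stand-in for the shuffle: group all tagged pairs by join key.
    groups = defaultdict(list)
    for k, v in [map_emp(r) for r in employees] + [map_dept(r) for r in departments]:
        groups[k].append(v)
    out = []
    for k, vs in groups.items():
        out.extend(reduce_join(k, vs))
    return out
```

Note how both datasets' payloads must flow through the same shuffle, which is why this trick can cost extra disk space and communication.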

7 Outline Introduction Map-Reduce Map-Reduce-Merge Applications to Relational Data Processing Optimizations and Enhancements Case Studies Conclusion

8 Map-Reduce-Merge Primitives
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → (k2, [v3])
merge: ((k2, [v3]), (k3, [v4])) → [(k4, v5)]
Unlike plain map-reduce, reduce keeps its key, so merge can join two reduced outputs on their keys k2 and k3.

9 Focusing on Merge
Two sets of inputs, generated by multiple reducers:
Which α reducers and β reducers match?
How to get the next key-value pair?
Customized preprocessing for inputs?
Merging algorithm?
All of these are customizable.

10 Focusing on Merge
Two sets of inputs, generated by multiple reducers:
Partition selector: which α reducers and β reducers match?
Iterator: how to get the next key-value pair?
Processor: customized preprocessing for inputs?
Merger: merging algorithm?
All of these are customizable.
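The four components can be sketched as plain functions. This is a hypothetical sketch, not the paper's API: the names are illustrative, plain Python lists stand in for reducer output partitions, and the iterator logic is folded into the merger for brevity:

```python
def partition_selector(merger_id, n_left, n_right):
    """Equi-join case: merger j reads left partition j and right partition j."""
    return [merger_id], [merger_id]

def processor(partition):
    """Per-input preprocessing hook (identity here; could build a hash table)."""
    return partition

def merger(left_pairs, right_pairs):
    """User-defined merge logic: emit combinations of pairs with matching keys."""
    out = []
    for lk, lv in left_pairs:
        for rk, rv in right_pairs:
            if lk == rk:
                out.append((lk, lv, rv))
    return out

def run_merge_phase(left_parts, right_parts):
    """Driver: each merger selects its inputs, preprocesses them, and merges."""
    results = []
    for j in range(len(left_parts)):
        ls, rs = partition_selector(j, len(left_parts), len(right_parts))
        left = processor([p for i in ls for p in left_parts[i]])
        right = processor([p for i in rs for p in right_parts[i]])
        results.extend(merger(left, right))
    return results
```

Swapping out any one of the four functions changes the join strategy without touching the driver, which is the point of making them all customizable.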

11 Example: Emp & Dept
The running example joins an Employee table with a Department table on dept-id.

12 Partition Selector
LHS: reduce key: (dept-id, emp-id); partition key: dept-id.
RHS: reduce key: dept-id; partition key: dept-id.
Assuming the number of reducers is the same, LHS reducer K matches RHS reducer K.

13 Processor
Pre-processing for each input, e.g., building a hash table for hash join.
This example is sort-merge, so the processor is empty.

14 Iterator for sort-merge

15 Merger
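The iterator and merger of slides 14-15 can be combined into a minimal sort-merge sketch, assuming both inputs arrive sorted by key (which the reduce phase guarantees) and, for brevity, that keys are unique within each input:

```python
def sort_merge(left, right):
    """Sort-merge join of two key-sorted (key, value) lists.

    Each step advances the input whose current key is smaller; when the
    keys match, the merger combines the two pairs and both inputs advance.
    """
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1                      # left is behind: advance left
        elif lk > rk:
            j += 1                      # right is behind: advance right
        else:
            out.append((lk, lv, rv))    # match: merge and advance both
            i += 1
            j += 1
    return out
```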

16 Other Iterators
Nested-loop: for each (k, v) of the first input, iterate over all of the second input; then rewind the second input and process the next (k, v) of the first input.
Hash join: read all of one input (to build a hash table), then read all of the other input (to probe it).
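Both alternative iteration orders fit in a few lines each; these are hypothetical sketches over in-memory (key, value) lists, not the paper's code:

```python
def nested_loop_join(outer, inner):
    """Nested-loop iterator: for each pair of the outer input, rescan
    ("rewind") the whole inner input and emit pairs with matching keys."""
    return [(k, ov, iv) for k, ov in outer for ik, iv in inner if ik == k]

def hash_join(build_input, probe_input):
    """Hash-join iterator order: read all of one input to build a hash
    table, then stream the other input and probe the table."""
    table = {}
    for k, v in build_input:
        table.setdefault(k, []).append(v)
    return [(k, bv, pv) for k, pv in probe_input for bv in table.get(k, [])]
```

Neither order requires sorted inputs, which is why the sort-merge processor could be empty while a hash join would use the processor to build its table.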

17 Outline Introduction Map-Reduce Map-Reduce-Merge Applications to Relational Data Processing Optimizations and Enhancements Case Studies Conclusion

18 Relation
Relation R with an attribute set A.
A is broken down into a key part K and a value part V.

19 Relational Operators
Generalized selection: choosing a subset of records; filtering can be done in the mapper, reducer, or merger.
Projection: choosing a subset of attributes; a user-defined mapper maps (k, v) → (k', v').
Aggregation: group-by is performed before reduce, so aggregation is easy to implement in the reducer.
Joins (also set union, intersection, difference, Cartesian product): sort-merge, hash join, nested-loop.
Rename.
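Selection, projection, and aggregation can be sketched as map/reduce functions. This is a hypothetical sketch (function names, the dict-based records, and the dept/salary attributes are all illustrative, not from the paper):

```python
from collections import defaultdict

def select_map(k, v, predicate):
    """Generalized selection: emit the record only if it passes the filter."""
    return [(k, v)] if predicate(k, v) else []

def project_map(k, v):
    """Projection: user-defined mapper (k, v) -> (k', v') keeping a subset
    of attributes; here only dept and salary survive."""
    return [(v["dept"], v["salary"])]

def agg_reduce(key, values):
    """Aggregation: the shuffle groups by key before reduce, so a SUM per
    group is just a sum over the reducer's value list."""
    return (key, sum(values))

def run(records, predicate):
    """Tiny driver: select, project, group by the projected key, aggregate."""
    groups = defaultdict(list)
    for k, v in records:
        for k1, v1 in select_map(k, v, predicate):
            for k2, v2 in project_map(k1, v1):
                groups[k2].append(v2)
    return [agg_reduce(k, vs) for k, vs in groups.items()]
```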

20 Outline Introduction Map-Reduce Map-Reduce-Merge Applications to Relational Data Processing Optimizations and Enhancements Case Studies Conclusion

21 Partition Selector
In general: LHS has R1 reducers, RHS has R2 reducers, performing a Cartesian-product-like operator.
Suppose R1 ≥ R2; use R1 mergers, where merger j selects input from LHS reducer j and input from all RHS reducers.
Remote reads: R1*(1+R2) = R1 + R1*R2.
Natural equi-join case: let R1 == R2 == R; use R mergers, where merger j selects LHS reducer j and RHS reducer j.
Remote reads: 2*R.
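The two read counts follow directly from the selector choices; the helper names below are illustrative:

```python
def remote_reads_general(r1, r2):
    """Cartesian-product-like selector (R1 >= R2): each of the R1 mergers
    reads its own LHS partition plus every one of the R2 RHS partitions."""
    return r1 * (1 + r2)

def remote_reads_equijoin(r):
    """Matching-partition selector: merger j reads exactly LHS j and RHS j."""
    return 2 * r
```

For example, with R1 = 4 and R2 = 2 the general selector costs 4*(1+2) = 12 remote reads, while an equi-join with R = 4 costs only 8.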

22 Combining Phases
An entire workflow consists of multiple map-reduce-merge passes. To avoid remote copying:
ReduceMap, MergeMap: co-locate the next mapper with the previous reducer or merger.
ReduceMerge: co-locate the merger with one of the reducers.
ReduceMergeMap: combine all three.

23 Map-Reduce-Merge Library
Put common merge implementations into a library: joins, common iterators, etc.

24 Configuration API for Building a Customized Workflow
Supported workflows: plain map/reduce; a single map/reduce/merge pass; multiple map/reduce/merge passes.

25 Outline Introduction Map-Reduce Map-Reduce-Merge Applications to Relational Data Processing Optimizations and Enhancements Case Studies Conclusion

26 Webgraphs
Each row: (URL, in-links, out-links).
A row can contain a potentially large number of links, but only a few are needed for many operations.
Store each column of the table in a separate file, and reconstruct the table by a join.
E.g., compute the intersection of in-links and out-links.
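The column-store idea can be sketched with dicts keyed by URL standing in for the per-column files (a hypothetical simplification; the real system would join the files with a merge pass):

```python
def intersect_links(inlinks, outlinks):
    """Reconstruct rows by joining the two 'column files' on URL, then
    compute the intersection of each page's in-links and out-links."""
    return {url: sorted(set(inlinks[url]) & set(outlinks[url]))
            for url in inlinks.keys() & outlinks.keys()}
```

Because the join touches only the two columns it needs, the other (potentially huge) columns never leave disk.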

27 TPC-H Query 2

28 After Combining Phases

29 Conclusion
Map-Reduce-Merge extends map-reduce to support relational operators.
However, the merge step seems quite complicated.

