Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)


1 Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang (Yahoo!), Ali Dasdan (Yahoo!), Ruey-Lung Hsiao (UCLA), D. Stott Parker (UCLA)
SIGMOD 2007 (Industrial)
Presented by Kisung Kim, 2010. 7. 14

2 Contents
- Introduction
- Map-Reduce
- Map-Reduce-Merge
- Applications to Relational Data Processing
- Case Study
- Conclusion

3 Introduction
- New challenges of data processing
  - A vast amount of data is collected from the entire WWW
- Solutions of search engine companies
  - Customized parallel data processing systems
  - Use large clusters of shared-nothing commodity nodes
  - Ex) Google’s GFS, BigTable, and MapReduce; Ask.com’s Neptune; Microsoft’s Dryad; Yahoo!’s Hadoop

4 Introduction
- Properties of data-intensive systems
  - Simple
    - Adopt only a selected subset of database principles
  - Sufficiently generic and effective
  - Parallel data processing system deployed on large clusters of shared-nothing commodity nodes
  - Refactoring of data processing into two primitives:
    - Map function
    - Reduce function
- Map-Reduce allows users not to worry about the nuisance details of:
  - Coordinating parallel sub-tasks
  - Maintaining distributed file storage
- This abstraction can greatly increase user productivity

5 Introduction
- The Map-Reduce framework is best at handling homogeneous datasets
  - Ex) Joining multiple heterogeneous datasets does not quite fit into the Map-Reduce framework
- Extending Map-Reduce to process heterogeneous datasets simultaneously
  - Processing data relationships is ubiquitous
  - A join-enabled Map-Reduce system can provide a highly parallel yet cost-effective alternative
  - Include relational algebra in the subset of the database principles
- Relational operators can be modeled using various combinations of the three primitives: Map, Reduce, and Merge

6 Map-Reduce
- Input dataset is stored in GFS
- Mapper
  - Reads splits of the input dataset
  - Applies the map function to the input records
  - Produces intermediate key/value sets
  - Partitions the intermediate sets into as many sets as there are reducers
- Reducer
  - Reads its part of the intermediate sets from the mappers
  - Applies the reduce function to the values of the same key
  - Outputs final results
- Signatures of the Map and Reduce functions
  map: (k1, v1) → [(k2, v2)]
  reduce: (k2, [v2]) → [v3]
[Figure: input splits, mappers, and reducers; stages labeled Input, Intermediate Sets, Final Results]
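The mapper and reducer steps above can be sketched as a single-process simulation. This is only an illustration, not the GFS-backed system the slide describes: the names `run_map_reduce`, `word_count_map`, and `word_count_reduce`, the word-count task, and the use of Python's `hash` as the partitioner are all assumptions made for the example.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn, num_reducers=2):
    """Simulate one Map-Reduce pass in a single process (illustrative only)."""
    # Map phase: every input record (k1, v1) yields intermediate (k2, v2)
    # pairs, which are partitioned into one bucket per reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            partitions[hash(k2) % num_reducers][k2].append(v2)
    # Reduce phase: each reducer applies reduce_fn to all values of one key.
    results = {}
    for partition in partitions:
        for k2, values in partition.items():
            results[k2] = reduce_fn(k2, values)
    return results

def word_count_map(doc_id, text):
    # map: (k1, v1) -> [(k2, v2)]
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    # reduce: (k2, [v2]) -> [v3]
    return [sum(counts)]

docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
counts = run_map_reduce(docs, word_count_map, word_count_reduce)
```

Which reducer a key lands on depends on the partitioner, but the final result does not, because all values of one key always reach the same reducer.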

7 Join using Map-Reduce
- Use a homogenization procedure
  - Apply one map/reduce task on each dataset
  - Insert a data-source tag into every value
  - Extract a key attribute common to all heterogeneous datasets
  - Transformed datasets now have two common attributes
    - Key and data-source
- Problems
  - Takes lots of extra disk space and incurs excessive map-reduce communications
  - Limited to queries that can be rendered as equi-joins

8 Join using Map-Reduce: Homogenization
Dataset 1
  Key   Value
  10    1, “Value1”
  85    1, “Value2”
  320   1, “Value3”
Dataset 2
  Key   Value
  10    2, “Value4”
  54    2, “Value5”
  320   2, “Value6”
[Figure: map/reduce stages over Dataset 1 and Dataset 2; label: Collect records with same key]
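The homogenization procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `homogenize` and `reduce_side_join` are hypothetical, the sample records follow our reading of the slide's example (keys 10, 85, 320 and 10, 54, 320, with source tags 1 and 2), and everything runs in one process.

```python
from collections import defaultdict

def homogenize(dataset, source_tag):
    """Map step: emit (join_key, (source_tag, value)) for every record."""
    return [(key, (source_tag, value)) for key, value in dataset]

def reduce_side_join(tagged_records):
    """Reduce step: collect records with the same key, then pair values
    that carry different data-source tags (an equi-join)."""
    groups = defaultdict(lambda: defaultdict(list))
    for key, (tag, value) in tagged_records:
        groups[key][tag].append(value)
    joined = []
    for key in sorted(groups):
        for left in groups[key][1]:
            for right in groups[key][2]:
                joined.append((key, left, right))
    return joined

dataset1 = [(10, "Value1"), (85, "Value2"), (320, "Value3")]
dataset2 = [(10, "Value4"), (54, "Value5"), (320, "Value6")]

# Both datasets now share two attributes: the key and the data-source tag.
tagged = homogenize(dataset1, 1) + homogenize(dataset2, 2)
result = reduce_side_join(tagged)
```

Every record is copied with an extra tag before the join runs, which is the extra disk space and communication the previous slide lists as a problem.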

9 Map-Reduce-Merge
- Signatures
  - α, β, γ represent dataset lineages
  - The reduce function produces a key/value list instead of just values
  - The merge function reads data from both lineages
- These three primitives can be used to implement parallel versions of several join algorithms
- For comparison, the Map-Reduce signatures:
  map: (k1, v1) → [(k2, v2)]
  reduce: (k2, [v2]) → [v3]
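As a rough sketch of how the three primitives fit together, the code below assumes the signatures as we understand them from the paper: map: (k1, v1)α → [(k2, v2)]α, reduce: (k2, [v2])α → (k2, [v3])α, and merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ. The driver `map_reduce_merge`, the sales/department data, and all helper names are hypothetical. The merge step here naively pairs every reducer output from one lineage with every output from the other, which a real system would avoid.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """One lineage: map, group by key, reduce. reduce returns (k2, [v3])."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    return [reduce_fn(k2, values) for k2, values in sorted(grouped.items())]

def map_reduce_merge(alpha, beta, map_a, reduce_a, map_b, reduce_b, merge_fn):
    """Run map/reduce on lineages alpha and beta, then merge into gamma."""
    reduced_a = map_reduce(alpha, map_a, reduce_a)
    reduced_b = map_reduce(beta, map_b, reduce_b)
    gamma = []
    for rec_a in reduced_a:          # naive pairing, for illustration only
        for rec_b in reduced_b:
            gamma.extend(merge_fn(rec_a, rec_b))
    return gamma

def map_sales(sale_id, record):
    dept, amount = record
    return [(dept, amount)]

def reduce_sales(dept, amounts):
    return (dept, [sum(amounts)])    # key is kept alongside the value list

def map_depts(dept, name):
    return [(dept, name)]

def reduce_depts(dept, names):
    return (dept, names)

def merge_on_key(rec_a, rec_b):
    (key_a, totals), (key_b, names) = rec_a, rec_b
    if key_a != key_b:
        return []
    return [(key_a, (name, total)) for name in names for total in totals]

sales = [("s1", ("d1", 10)), ("s2", ("d1", 5)), ("s3", ("d2", 7))]
depts = [("d1", "Search"), ("d2", "Ads")]
report = map_reduce_merge(sales, depts, map_sales, reduce_sales,
                          map_depts, reduce_depts, merge_on_key)
```

Because reduce keeps the key with its value list, the merge function can still compare keys across the two lineages.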

10 Merge Modules
- Merge function
  - Processes two pairs of key/values
- Processor function
  - Processes data from one source only
  - Users can define two processor functions
- Partition selector
  - Determines from which reducers this merger retrieves its input data, based on the merger number
- Configurable iterator
  - A merger has two logical iterators
  - Controls their movement relative to each other

11 Merge Modules
[Figure: merger components. Labels: Partition Selector, Processor, Iterator, Merge. Inputs: Reducer Output from the reducers for the 1st dataset and Reducer Output from the reducers for the 2nd dataset]
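One way the four merger components could cooperate is sketched below as a sort-merge style merger. All names (`select_partitions`, `advance`, `run_merger`, `join_pair`, `drop`, `keep_left`) are hypothetical. The sketch also assumes that keys are unique within each side and that the processor functions handle records with no partner on the other side. The slides only say that a processor handles data from one source, so that last part is our interpretation.

```python
def select_partitions(merger_id, num_left, num_right):
    """Partition selector: pick reducer outputs based on the merger number."""
    return [merger_id % num_left], [merger_id % num_right]

def advance(left_key, right_key):
    """Configurable iterator logic: decide which logical iterator moves."""
    if left_key < right_key:
        return "left"
    if left_key > right_key:
        return "right"
    return "both"

def run_merger(merger_id, left_parts, right_parts,
               merge_fn, process_left, process_right):
    left_ids, right_ids = select_partitions(
        merger_id, len(left_parts), len(right_parts))
    left = sorted(rec for p in left_ids for rec in left_parts[p])
    right = sorted(rec for p in right_ids for rec in right_parts[p])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        step = advance(left[i][0], right[j][0])
        if step == "both":               # keys match: merge the two pairs
            out.extend(merge_fn(left[i], right[j]))
            i += 1
            j += 1
        elif step == "left":             # left record has no partner
            out.extend(process_left(left[i]))
            i += 1
        else:                            # right record has no partner
            out.extend(process_right(right[j]))
            j += 1
    for rec in left[i:]:                 # leftovers on either side
        out.extend(process_left(rec))
    for rec in right[j:]:
        out.extend(process_right(rec))
    return out

def join_pair(left_rec, right_rec):
    return [(left_rec[0], (left_rec[1], right_rec[1]))]

def drop(rec):
    return []

def keep_left(rec):
    return [(rec[0], (rec[1], None))]

left_parts = [[(10, "Value1"), (85, "Value2"), (320, "Value3")]]
right_parts = [[(10, "Value4"), (54, "Value5"), (320, "Value6")]]
inner = run_merger(0, left_parts, right_parts, join_pair, drop, drop)
left_outer = run_merger(0, left_parts, right_parts, join_pair, keep_left, drop)
```

Swapping only the processor functions turns the inner join into a left outer join, which suggests why the components are kept separately configurable.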

12 Applications to Relational Data Processing
- Map-Reduce-Merge can be used to implement primitive and some derived relational operators
  - Projection
  - Aggregation
  - Generalized selection
  - Joins
  - Set union
  - Set intersection
  - Set difference
  - Cartesian product
  - Rename
- Map-Reduce-Merge is relationally complete, while being load-balanced, scalable, and parallel
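A few of the single-dataset operators can be illustrated with plain map and reduce functions. The sketch below is an assumption-laden example, not the paper's implementation: the employee rows, the salary threshold, and the names `select_project_map`, `agg_map`, `agg_reduce`, and `group_by_key` are invented for illustration. Selection and projection happen in the mapper, and aggregation (SUM with GROUP BY) uses a map followed by a reduce.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle between mappers and reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

rows = [
    {"dept": "d1", "name": "ann", "salary": 100},
    {"dept": "d1", "name": "bob", "salary": 80},
    {"dept": "d2", "name": "cat", "salary": 120},
]

def select_project_map(row):
    # Selection: keep rows with salary >= 100.
    # Projection: emit only the name and dept attributes.
    if row["salary"] >= 100:
        return [(row["name"], row["dept"])]
    return []

def agg_map(row):
    # Emit the grouping attribute as the key.
    return [(row["dept"], row["salary"])]

def agg_reduce(dept, salaries):
    # SUM(salary) for one group.
    return (dept, sum(salaries))

selected = [pair for row in rows for pair in select_project_map(row)]

agg_pairs = [pair for row in rows for pair in agg_map(row)]
totals = [agg_reduce(dept, salaries)
          for dept, salaries in sorted(group_by_key(agg_pairs).items())]
```

Operators over two datasets, such as joins or set difference, additionally need the merge primitive.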

13 Example: Hash Join
[Figure: two pipelines of split, mapper, and reducer feeding one merger]
- Mapper: uses a hash partitioner
- Reducer: reads from every mapper for one designated partition
- Merger: reads from two sets of reducer outputs that share the same hashing buckets; one is used as the build set and the other as the probe set
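The hash join flow can be sketched in a single process as follows. The names `hash_partition`, `hash_join_merger`, and `parallel_hash_join`, the sample data, and the choice of the left input as the build set are assumptions for illustration. The loop over buckets stands in for mergers that would run in parallel.

```python
from collections import defaultdict

def hash_partition(records, num_partitions):
    """Mapper side: route each record to a bucket with a hash partitioner."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def hash_join_merger(build_part, probe_part):
    """Merger: build a hash table from one input, probe it with the other."""
    table = defaultdict(list)
    for key, value in build_part:            # build phase
        table[key].append(value)
    out = []
    for key, value in probe_part:            # probe phase
        for match in table.get(key, []):
            out.append((key, match, value))
    return out

def parallel_hash_join(left, right, num_partitions=4):
    # Both inputs use the same partitioner, so matching keys share a bucket
    # and each merger only needs the two partitions with its bucket number.
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    out = []
    for build_part, probe_part in zip(left_parts, right_parts):
        out.extend(hash_join_merger(build_part, probe_part))
    return sorted(out)

customers = [(1, "ann"), (2, "bob"), (3, "cat")]
orders = [(1, "order-a"), (3, "order-b"), (3, "order-c"), (4, "order-d")]
joined = parallel_hash_join(customers, orders)
```

Using the smaller input as the build set keeps the in-memory hash table small.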

14 Case Study: TPC-H Query 2
- Involves 5 tables, 1 nested query, 1 aggregate and group-by clause, and 1 order-by clause

15 Case Study: TPC-H Query 2
- Map-Reduce-Merge workflow
  - 13 passes of Map-Reduce-Merge: 10 mappers, 10 reducers, and 4 mergers
  - After combining phases, 6 passes of Map-Reduce-Merge: 5 mappers, 4 reduce-merge-mappers, 1 reduce-mapper, and 1 reducer

16 Conclusion
- Map-Reduce-Merge programming model
  - Retains Map-Reduce’s many great features
  - Adds relational algebra to the list of database principles it upholds
  - Contains several configurable components that enable many data-processing patterns
- Next step
  - Develop an SQL-like interface and an optimizer to simplify the process of developing a Map-Reduce-Merge workflow
  - This work can readily reuse well-studied RDBMS techniques

