Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang 1, Ali Dasdan 1, Ruey-Lung Hsiao 2, D. Stott Parker 2
1 Yahoo!
2 Computer Science Department, UCLA
SIGMOD 2007, Beijing, China
Presented by Jongheum Yeon

Outline
- Introduction
- Map-Reduce
- Map-Reduce-Merge
- Conclusions

Introduction
- New data-processing systems should consider alternatives to big, traditional databases
- Map-Reduce does a good job, in a limited context, with extraordinary simplicity
- Map-Reduce-Merge tries to extend that applicability without giving up too much simplicity

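Introduction (cont’d)
[Figure: large-scale data-processing stacks, listed by layer in the order shown on the slide]
- Application: SQL; Sawzall; ≈SQL; LINQ, SQL
- Language: Parallel Databases; Sawzall; Pig, Hive; DryadLINQ, Scope
- Execution: Map-Reduce; Hadoop; Dryad
- Storage: GFS, BigTable; HDFS, S3; Cosmos, Azure, SQL Server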

Map-Reduce : Motivation
- Many special-purpose tasks operate on and produce large amounts of data
  - Operate on: crawled documents, web requests, etc.
  - Produce: inverted indices, summaries, other kinds of derived data
- The work needs to be distributed across a large number of machines to finish in a reasonable time
  - Parallelize the computation
  - Distribute the data
- These extra concerns obscure the original computation

Map-Reduce : Benefits
- Automatic parallelization and distribution
- Reduced complexity and size of user code
- Transparent fault tolerance
- I/O scheduling
- Fine-grained partitioning of tasks
- Tasks dynamically scheduled on available workers
- Status and monitoring

Map-Reduce : Programming Model
- Input & Output: each a set of key/value pairs
- The programmer specifies two functions:
  - map (in_key, in_value) -> list (out_key, intermediate_value)
    - Processes an input key/value pair
    - Produces a set of intermediate pairs
  - reduce (out_key, list(intermediate_value)) -> list (out_value)
    - Produces a set of merged output values (usually just one)
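
The two signatures above can be sketched as a tiny in-memory driver. This is only an illustration, not the Google or Hadoop implementation: `run_map_reduce` and the word-count functions are hypothetical names, and word count is an assumed example that does not appear on the slides.

```python
from collections import defaultdict


def run_map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input pair, group intermediate values by
    key, then apply reduce_fn once per key (single process, no cluster)."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def word_count_map(doc_name, text):
    # map (in_key, in_value) -> list (out_key, intermediate_value)
    return [(word, 1) for word in text.split()]


def word_count_reduce(word, counts):
    # reduce (out_key, list(intermediate_value)) -> list (out_value)
    return [sum(counts)]


counts = run_map_reduce(
    [("d1", "a b a"), ("d2", "b c")], word_count_map, word_count_reduce
)
print(counts)
```

The grouping step inside the driver stands in for the shuffle that a real cluster performs between the map and reduce phases.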

Map-Reduce : Data Flow
[Figure: data flows from the input Data through Map to Reduce]

Map-Reduce : Data Flow
- Map: generates a new key and its value
- Reduce: integrates the values that share the same key
[Figure: Map turns (Key1, Value1) into (KeyA, ValueX), (KeyB, ValueY), and (KeyB, ValueZ); Reduce outputs A=X and B=Y,Z]

Map-Reduce : Architecture
[Figure: a Master coordinates Workers that run Map and Reduce tasks; data is read from and written to GFS]

Map-Reduce : Architecture
- Master
  - Assigns each map/reduce task and maintains its state
  - Propagates the locations of intermediate files to reduce tasks
- Worker
  - Executes Map or Reduce tasks at the request of the Master

Map-Reduce : Distributed Processing
[Figure: the input file is split into Input 1 … Input M, one per Map task; each Map task writes an intermediate file divided into regions 1 … R; the shuffle delivers each region to its Reduce task; the Reduce tasks write Output 1 … Output R of the output file]
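
How intermediate pairs from M map tasks end up at R reduce tasks can be sketched as below. The slide does not say how keys are assigned to regions; hashing the key modulo R is an assumption here (it is a common default), and `partition` and `shuffle` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Assumed partitioning rule: stable hash of the key modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """map_outputs holds one list of (key, value) pairs per map task.
    Returns one {key: [values]} dict per reduce task."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for task_output in map_outputs:
        for key, value in task_output:
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    return reducer_inputs


# Two map tasks (M = 2) feeding three reduce tasks (R = 3).
map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = shuffle(map_outputs, 3)
```

Because the rule depends only on the key, every value for a given key reaches the same reduce task regardless of which map task produced it.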

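Map-Reduce : Example Inverted Index

Documents:
  DocID=1: IDS lab's page (original text: IDS 연구실의 페이지)
  DocID=2: IDB lab's page (original text: IDB 연구실의 페이지)

Word dictionary:
  Word          wordID
  lab (연구실)   101
  's (의)        201
  page (페이지)  203
  IDS           301
  IDB           302

Inverted index:
  wordID  docID  Location
  101     1      1
  101     2      1
  201     1      2
  201     2      2
  203     1      3
  203     2      3
  301     1      0
  302     2      0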

Map-Reduce : Example (cont’d)

Input data to Map:
  Key (docID)  Value (Text)
  1            IDS lab's page (IDS 연구실의 페이지)
  2            IDB lab's page (IDB 연구실의 페이지)

Output of Map for docID 1:
  Key (wordID)  Value (docID:Location)
  301           1:0
  101           1:1
  201           1:2
  203           1:3

Output of Map for docID 2:
  Key (wordID)  Value (docID:Location)
  302           2:0
  101           2:1
  201           2:2
  203           2:3

Shuffle
- Collects the values that share a key and conveys them to Reduce
- Reduce writes the final result

Input to Reduce after the shuffle:
  Key (wordID)  Value (docID:Location)
  101           1:1 2:1
  201           1:2 2:2
  203           1:3 2:3
  301           1:0
  302           2:0

Final output of Reduce:
  101=1:1, 2:1
  201=1:2, 2:2
  203=1:3, 2:3
  301=1:0
  302=2:0
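
The inverted-index walk-through can be reproduced with a short sketch. It assumes the documents are already tokenized and uses English glosses of the sample words ("lab", "'s", "page"); the function names are hypothetical and the single-process grouping stands in for the shuffle.

```python
from collections import defaultdict

# Word dictionary and documents from the example (English glosses, pre-tokenized).
WORD_IDS = {"lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}
DOCS = {
    1: ["IDS", "lab", "'s", "page"],
    2: ["IDB", "lab", "'s", "page"],
}


def index_map(doc_id, tokens):
    """Emit (wordID, "docID:Location") for every token position."""
    return [(WORD_IDS[token], f"{doc_id}:{loc}") for loc, token in enumerate(tokens)]


def index_reduce(word_id, postings):
    """Collect all postings for one wordID into a sorted list."""
    return sorted(postings)


def build_index(docs):
    grouped = defaultdict(list)  # shuffle: gather postings by wordID
    for doc_id, tokens in docs.items():
        for word_id, posting in index_map(doc_id, tokens):
            grouped[word_id].append(posting)
    return {w: index_reduce(w, p) for w, p in sorted(grouped.items())}


index = build_index(DOCS)
for word_id, postings in index.items():
    print(f"{word_id}={', '.join(postings)}")
```

Each printed line has the same `wordID=docID:Location, ...` shape as the final Reduce output in the example.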

Other Examples
- Distributed Grep
- Count of URL Access Frequency
  - Map emits <URL, 1>
  - Reduce emits <URL, total count>
- Reverse Web-Link Graph
  - Map emits <target, source>
  - Reduce emits <target, list(source)>
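
Two of these examples are small enough to sketch directly. The access log and link data below are made up for illustration, and `group_by_key`, `url_access_count`, and `reverse_web_links` are hypothetical names.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Stand-in for the shuffle: gather values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def url_access_count(access_log):
    # Map: <URL, 1> per request. Reduce: <URL, total count>.
    pairs = [(url, 1) for url in access_log]
    return {url: sum(ones) for url, ones in group_by_key(pairs).items()}


def reverse_web_links(outgoing_links):
    # Map: <target, source> per link. Reduce: <target, list(source)>.
    pairs = [
        (target, source)
        for source, targets in outgoing_links.items()
        for target in targets
    ]
    return {target: sorted(sources) for target, sources in group_by_key(pairs).items()}


hits = url_access_count(["/a", "/b", "/a", "/a"])
backlinks = reverse_web_links({"p1": ["p2", "p3"], "p2": ["p3"]})
```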

Map-Reduce-Merge
- Map-Reduce is an extremely simple model, but it applies in a limited context
- Map-Reduce mainly handles homogeneous datasets
- Relational operators, especially joins, are hard to implement with Map-Reduce
- Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete

Map-Reduce-Merge
- Adds a merge phase to the Map-Reduce algorithm
- Allows processing of multiple heterogeneous datasets
- Like Map and Reduce, the Merge phase is implemented by the developer
- Example:
  - Two datasets: department and employee
  - Goal: compute each employee's bonus based on individual rewards and the department bonus adjustment

Map-Reduce-Merge Example
[Figure: the employee and department tables are matched on dept_id]

Map-Reduce-Merge: Extending Map-Reduce
- Changes the reduce phase and adds a merge phase
- Phases in Map-Reduce:
  1. Map: (k1, v1) → [(k2, v2)]
  2. Reduce: (k2, [v2]) → [v3]
- Phases in Map-Reduce-Merge:
  1. Map: (k1, v1) → [(k2, v2)]
  2. Reduce: (k2, [v2]) → (k2, [v3])
  3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
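
The three phases can be illustrated with the employee and department bonus example. This is a single-process sketch under stated assumptions: the records, the field layout, and the rule "bonus = total rewards × department adjustment" are hypothetical, and the merger uses a simple dictionary lookup on dept_id rather than any particular cluster strategy.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Map: (k1, v1) -> [(k2, v2)]; Reduce keeps its key: (k2, [v2]) -> (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [(k2, reduce_fn(k2, v2s)) for k2, v2s in sorted(groups.items())]


# Employee lineage: key becomes (dept_id, emp_id), rewards are summed.
def emp_map(emp_id, row):
    dept_id, reward = row
    return [((dept_id, emp_id), reward)]


def emp_reduce(key, rewards):
    return [sum(rewards)]


# Department lineage: dept_id -> bonus adjustment, passed through unchanged.
def dept_map(dept_id, adjustment):
    return [(dept_id, adjustment)]


def dept_reduce(dept_id, adjustments):
    return adjustments


def merge(emp_lineage, dept_lineage):
    """Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5]), matching on dept_id."""
    adjustment_of = {dept_id: values[0] for dept_id, values in dept_lineage}
    merged = []
    for (dept_id, emp_id), totals in emp_lineage:
        if dept_id in adjustment_of:
            merged.append((emp_id, [totals[0] * adjustment_of[dept_id]]))
    return merged


employees = [("e1", ("d1", 100)), ("e1", ("d1", 50)), ("e2", ("d2", 200))]
departments = [("d1", 1.5), ("d2", 0.5)]

bonuses = merge(
    map_reduce(employees, emp_map, emp_reduce),
    map_reduce(departments, dept_map, dept_reduce),
)
```

Keeping the key in the reduce output is what lets the merger match records from the two lineages: here k2 is (dept_id, emp_id), k3 is dept_id, and k4 is emp_id.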

Map-Reduce-Merge: Additional user-definable operations
- Merger: analogous to the map and reduce definitions; defines the logic of the merge operation
- Processor: processes the data from an individual source
- Partition selector: decides which data should go to which merger
- Configurable iterator: defines how to step through each of the lists as the merge proceeds

Map-Reduce-Merge

Map-Reduce-Merge : Relational Data Processing
Relational operators can be implemented using the Map-Reduce-Merge model, including:
- Projection
- Aggregation
- Generalized selection
- Joins
- Set union
- Set intersection
- Set difference
- Etc.

Map-Reduce-Merge : Example, Set Union
- Each of the two Map-Reduces emits a sorted list of unique elements
- The Merge combines the two lists by iterating as follows:
  - If the current values differ, store the smaller one and advance its iterator by one
  - If they are equal, store one of them and advance both iterators
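
The iteration rule for set union can be sketched as a two-pointer merge over the sorted, duplicate-free lists. `merge_union` is a hypothetical name, and copying the leftover tail once one list is exhausted is an assumption the slide leaves implicit.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # a holds the smaller value: store it, advance a
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])  # b holds the smaller value: store it, advance b
            j += 1
        else:
            out.append(a[i])  # equal: store one copy, advance both
            i += 1
            j += 1
    out.extend(a[i:])  # one list is exhausted; the rest of the other is all new
    out.extend(b[j:])
    return out


union = merge_union([1, 3, 5], [2, 3, 6, 7])
```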

Map-Reduce-Merge : Example, Set Difference
- We have two sets, A and B, and want to compute A-B
- Each of the two Map-Reduces emits a sorted list of unique elements
- The Merge iterates over the two lists simultaneously:
  - If A's value is less than B's, store A's value and advance A's iterator
  - If B's value is smaller, advance B's iterator
  - If the two are equal, advance both iterators
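
The same two-pointer pattern gives A-B. In this sketch `merge_difference` is a hypothetical name, and keeping whatever remains of A after B runs out is an assumption the slide does not spell out.

```python
def merge_difference(a, b):
    """Elements of sorted list a that do not occur in sorted list b."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])  # a's value cannot appear later in b: keep it
            i += 1
        elif b[j] < a[i]:
            j += 1  # b's value is irrelevant to the result: skip it
        else:
            i += 1  # value is in both sets: drop it
            j += 1
    out.extend(a[i:])  # b is exhausted, so the rest of a survives
    return out


difference = merge_difference([1, 3, 5, 8], [2, 3, 6])
```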

Map-Reduce-Merge : Example, Sort-Merge Join
- Map: partitions records into mutually exclusive buckets by key range; each key range is assigned to a reducer
- Reduce: sorts the data in its bucket into a sorted set
- Merge: joins the sorted data from the two sources for each key range
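
A single-process sketch of the three steps follows. The split points, the sample records, and all function names are hypothetical; a real deployment would run each bucket on a separate reducer and merger.

```python
from bisect import bisect_right


def range_partition(records, boundaries):
    """Map: place each (key, value) in the bucket for its key range.
    Bucket i holds keys below boundaries[i]; the last bucket holds the rest."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect_right(boundaries, key)].append((key, value))
    return buckets


def sort_bucket(bucket):
    """Reduce: sort one bucket by key."""
    return sorted(bucket, key=lambda kv: kv[0])


def merge_join(left, right):
    """Merge: join two key-sorted lists, pairing every match on equal keys."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif right[j][0] < left[i][0]:
            j += 1
        else:
            key = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    out.append((key, left_value, right_value))
            i, j = i_end, j_end
    return out


def sort_merge_join(left, right, boundaries):
    joined = []
    left_buckets = range_partition(left, boundaries)
    right_buckets = range_partition(right, boundaries)
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        joined.extend(merge_join(sort_bucket(left_bucket), sort_bucket(right_bucket)))
    return joined


left = [(12, "x"), (3, "a"), (7, "b"), (3, "c")]
right = [(7, "B"), (3, "A"), (12, "X"), (20, "Z")]
joined = sort_merge_join(left, right, boundaries=[10])
```

Using the same split points for both datasets is what guarantees that matching keys land in buckets with the same index, so each merger only needs one bucket from each side.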

Map-Reduce-Merge : Optimizations
- Map-Reduce already optimizes using locality and backup tasks
- Optimize the number of connections between the outputs of the reduce phase and the inputs of the merge phase (example: set intersection)
- Combine two phases into one (example: ReduceMerge)

Conclusions
- Map-Reduce-Merge allows us to work on heterogeneous datasets
- Map-Reduce-Merge supports joins, which Map-Reduce did not directly support
- Next step: develop an SQL-like interface and an optimizer that simplify the development of a Map-Reduce-Merge workflow
