
1 MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat

2 Outline ◦Introduction ◦Programming Model ◦Implementation ◦Refinements ◦Performance ◦Related Work ◦Conclusions

3 Introduction ◦What is the purpose? ◦The abstraction: Input Data → Map → Intermediate Key/Value Pairs → Reduce → Output Files
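
A minimal single-process sketch of that abstraction (illustrative Python, not the paper's C++ library): the user supplies a map and a reduce function, and the framework groups intermediate values by key between the two phases.

    from collections import defaultdict

    def map_reduce(inputs, map_fn, reduce_fn):
        # Map phase: each input (key, value) record may emit any number
        # of intermediate (key, value) pairs.
        intermediate = defaultdict(list)
        for key, value in inputs:
            for out_key, out_value in map_fn(key, value):
                intermediate[out_key].append(out_value)
        # Reduce phase: all intermediate values sharing a key are merged
        # by the user's reduce function into the final output.
        return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}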

4 Programming Model ◦Map ◦Reduce ◦Example
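
For reference, the paper gives the two functions these conceptual types (input keys and values are drawn from a different domain than the intermediate ones):

    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)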

5 Programming Model ◦Real example: building an index
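
The paper's own worked example is word counting. In the toy map_reduce model sketched above it might look like this (function names are mine):

    def index_map(doc_name, contents):
        # Emit (word, 1) for every word occurrence in the document.
        for word in contents.split():
            yield word, 1

    def index_reduce(word, counts):
        # Sum all the counts emitted for this word.
        return sum(counts)

    docs = [("doc1", "to be or not to be"), ("doc2", "to do")]
    print(map_reduce(docs, index_map, index_reduce))
    # -> {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}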

6 Programming Model ◦More examples  Distributed grep  Count of URL access frequency  Reverse web-link graph  Term vector per host  Inverted index  Distributed sort
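
As one concrete case from this list, the paper describes the inverted index: map parses each document and emits (word, document ID) pairs, and reduce produces the sorted list of document IDs for each word. A sketch reusing the toy map_reduce above:

    def inverted_index_map(doc_id, contents):
        # Emit (word, document ID); set() dedupes within one document.
        for word in set(contents.split()):
            yield word, doc_id

    def inverted_index_reduce(word, doc_ids):
        # The posting list: sorted IDs of documents containing the word.
        return sorted(doc_ids)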

7 Implementation ◦Execution overview
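
A sequential sketch of that overview: the real system runs the M map tasks and R reduce tasks in parallel on worker machines under a single master, so the Python below is illustrative only.

    from collections import defaultdict

    def run_job(input_splits, map_fn, reduce_fn, R):
        # Map phase: each of the M input splits becomes one map task; its
        # output is partitioned into R buckets via hash(key) mod R.
        partitions = [defaultdict(list) for _ in range(R)]
        for split in input_splits:                  # M map tasks
            for key, value in split:
                for k, v in map_fn(key, value):
                    partitions[hash(k) % R][k].append(v)
        # Reduce phase: one reduce task per partition; keys are processed
        # in sorted order, so each of the R output files is ordered.
        return [{k: reduce_fn(k, bucket[k]) for k in sorted(bucket)}
                for bucket in partitions]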

8 Implementation ◦Master data structure ◦Fault tolerance  Worker failure  Master failure  Semantics in the presence of failures ◦Locality ◦Task Granularity ◦Backup Tasks
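
The worker-failure rule is concrete enough to sketch (field and function names are mine, not the paper's): map tasks on a failed worker are rescheduled even if completed, because their output sits on that machine's local disk, while completed reduce tasks are safe in the global file system.

    from dataclasses import dataclass

    @dataclass
    class Task:
        kind: str            # "map" or "reduce"
        state: str           # "idle", "in_progress", or "completed"
        worker: str | None = None

    def handle_worker_failure(tasks, failed_worker):
        for t in tasks:
            if t.worker == failed_worker:
                # Re-execute map tasks unconditionally; reduce tasks only
                # if they had not yet written their final output.
                if t.kind == "map" or t.state == "in_progress":
                    t.state, t.worker = "idle", None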

9 Refinements ◦Partitioning Function ◦Ordering Guarantees ◦Combiner Function ◦Input and Output Types ◦Side-effects ◦Skipping Bad Records ◦Local Execution ◦Status Information ◦Counters
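
Two of these refinements are easy to make concrete. The paper gives the default partitioning function, hash(key) mod R, and a user-supplied example, hash(Hostname(urlkey)) mod R; the combiner does partial merging on the map worker and, for word count, is the same code as the reducer. The Python is illustrative:

    from urllib.parse import urlparse

    def default_partition(key, R):
        # The paper's default partitioning function: hash(key) mod R.
        return hash(key) % R

    def url_host_partition(url_key, R):
        # The paper's example of a user-supplied function: all URLs from
        # the same host end up in the same output file.
        return hash(urlparse(url_key).hostname) % R

    def word_count_combiner(word, counts):
        # Partial merge on the map worker; cuts intermediate network traffic.
        return sum(counts)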

10 Performance ◦Cluster Configuration  1,800 machines  Two 2 GHz Intel Xeon processors each  4 GB of memory  Two 160 GB IDE disks  1 Gbps Ethernet  Arranged in a two-level tree-shaped network

11 Performance ◦Grep  Scans through 10^10 100-byte records (~1 TB)  Searches for a relatively rare three-character pattern (occurs in 92,337 records)  Data transfer rate over time peaks at over 30 GB/s with 1,764 workers assigned  The entire computation takes approximately 150 seconds

12 Performance ◦Sort  Sorts 10^10 100-byte records (~1 TB)  Modeled after the TeraSort benchmark  Extracts a 10-byte sorting key from each record

13 Performance ◦Sort  Input rate is lower than for grep: map tasks spend about half their time writing intermediate output to local disk  Shuffling starts as soon as the first map task completes  Rates: input > shuffle > output (input reads benefit from locality; the output writes two replicas)  Effect of backup tasks  Effect of machine failures

14 Related Work ◦Restricted programming models ◦Parallel processing, compared to  Bulk Synchronous Programming and MPI primitives ◦Backup task mechanism, compared to  the Charlotte system ◦Sorting facility, compared to  NOW-Sort

15 Related Work ◦Sending data over distributed queues, compared to  River ◦Programming model, compared to  BAD-FS

16 Conclusion ◦What is the reason for the success of MapReduce?  Easy to use  Problems are easily expressible  Scales to large clusters ◦Lessons learned from this work  Restricting the programming model makes parallelization and fault tolerance easy  Network bandwidth is a scarce resource  Redundant execution reduces the impact of slow machines and handles failures

