
1 MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat

2 Outline ◦Introduction ◦Programming Model ◦Implementation ◦Refinements ◦Performance ◦Related Work ◦Conclusions

3 Introduction ◦What is the purpose? ◦The abstraction: Input Data → Map → Intermediate Key/Value Pairs → Reduce → Output Files
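
A minimal single-process sketch of that abstraction (illustrative Python, not the paper's C++ library): the user supplies a map and a reduce function, and the framework groups intermediate values by key between the two phases.

    from collections import defaultdict

    def map_reduce(inputs, map_fn, reduce_fn):
        # Map phase: each input (key, value) record may emit any number
        # of intermediate (key, value) pairs.
        intermediate = defaultdict(list)
        for key, value in inputs:
            for out_key, out_value in map_fn(key, value):
                intermediate[out_key].append(out_value)
        # Reduce phase: all intermediate values sharing a key are merged
        # by the user's reduce function into the final output.
        return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}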

4 Programming Model ◦Map ◦Reduce ◦Example
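
For reference, the paper gives the two functions these conceptual types (input keys and values are drawn from a different domain than the intermediate ones):

    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)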

5 Programming Model ◦Real example: building an index
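
The paper's own worked example is word counting. In the toy map_reduce model sketched above it might look like this (function names are mine):

    def index_map(doc_name, contents):
        # Emit (word, 1) for every word occurrence in the document.
        for word in contents.split():
            yield word, 1

    def index_reduce(word, counts):
        # Sum all the counts emitted for this word.
        return sum(counts)

    docs = [("doc1", "to be or not to be"), ("doc2", "to do")]
    print(map_reduce(docs, index_map, index_reduce))
    # -> {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}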

6 Programming Model ◦More examples  Distributed grep  Count of URL access frequency  Reverse web-link graph  Term vector per host  Inverted index  Distributed sort
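
As one concrete case from this list, the paper describes the inverted index: map parses each document and emits (word, document ID) pairs, and reduce produces the sorted list of document IDs for each word. A sketch reusing the toy map_reduce above:

    def inverted_index_map(doc_id, contents):
        # Emit (word, document ID); set() dedupes within one document.
        for word in set(contents.split()):
            yield word, doc_id

    def inverted_index_reduce(word, doc_ids):
        # The posting list: sorted IDs of documents containing the word.
        return sorted(doc_ids)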

7 Implementation ◦Execution overview
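
A sequential sketch of that overview: the real system runs the M map tasks and R reduce tasks in parallel on worker machines under a single master, so the Python below is illustrative only.

    from collections import defaultdict

    def run_job(input_splits, map_fn, reduce_fn, R):
        # Map phase: each of the M input splits becomes one map task; its
        # output is partitioned into R buckets via hash(key) mod R.
        partitions = [defaultdict(list) for _ in range(R)]
        for split in input_splits:                  # M map tasks
            for key, value in split:
                for k, v in map_fn(key, value):
                    partitions[hash(k) % R][k].append(v)
        # Reduce phase: one reduce task per partition; keys are processed
        # in sorted order, so each of the R output files is ordered.
        return [{k: reduce_fn(k, bucket[k]) for k in sorted(bucket)}
                for bucket in partitions]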

8 Implementation ◦Master data structure ◦Fault tolerance  Worker failure  Master failure  Semantics in the presence of failures ◦Locality ◦Task Granularity ◦Backup Tasks
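
The worker-failure rule is concrete enough to sketch (field and function names are mine, not the paper's): map tasks on a failed worker are rescheduled even if completed, because their output sits on that machine's local disk, while completed reduce tasks are safe in the global file system.

    from dataclasses import dataclass

    @dataclass
    class Task:
        kind: str            # "map" or "reduce"
        state: str           # "idle", "in_progress", or "completed"
        worker: str | None = None

    def handle_worker_failure(tasks, failed_worker):
        for t in tasks:
            if t.worker == failed_worker:
                # Re-execute map tasks unconditionally; reduce tasks only
                # if they had not yet written their final output.
                if t.kind == "map" or t.state == "in_progress":
                    t.state, t.worker = "idle", None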

9 Refinements ◦Partitioning Function ◦Ordering Guarantees ◦Combiner Function ◦Input and Output Types ◦Side-effects ◦Skipping Bad Records ◦Local Execution ◦Status Information ◦Counters
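
Two of these refinements are easy to make concrete. The paper gives the default partitioning function, hash(key) mod R, and a user-supplied example, hash(Hostname(urlkey)) mod R; the combiner does partial merging on the map worker and, for word count, is the same code as the reducer. The Python is illustrative:

    from urllib.parse import urlparse

    def default_partition(key, R):
        # The paper's default partitioning function: hash(key) mod R.
        return hash(key) % R

    def url_host_partition(url_key, R):
        # The paper's example of a user-supplied function: all URLs from
        # the same host end up in the same output file.
        return hash(urlparse(url_key).hostname) % R

    def word_count_combiner(word, counts):
        # Partial merge on the map worker; cuts intermediate network traffic.
        return sum(counts)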

10 Performance ◦Cluster Configuration  1,800 machines  Two 2 GHz Intel Xeon processors each  4 GB of memory  Two 160 GB IDE disks  1 Gbps Ethernet  Arranged in a two-level tree-shaped network

11 Performance ◦Grep  Scans through 10^10 100-byte records (~1 TB)  Searches for a relatively rare three-character pattern (occurs in 92,337 records)  Data transfer rate over time peaks at over 30 GB/s with 1,764 workers assigned  The entire computation takes approximately 150 seconds

12 Performance ◦Sort  Sorts 10^10 100-byte records (~1 TB)  Modeled after the TeraSort benchmark  Extracts a 10-byte sorting key from each record

13 Performance ◦Sort  Input rate is lower than for grep: map tasks spend about half their time writing intermediate output to local disk  Shuffling starts as soon as the first map task completes  Rates: input > shuffle > output (input reads benefit from locality; the output writes two replicas)  Effect of backup tasks  Effect of machine failures

14 Related Work ◦Restricted programming models ◦Parallel processing, compared to  Bulk Synchronous Programming and MPI primitives ◦Backup task mechanism, compared to  the Charlotte system ◦Sorting facility, compared to  NOW-Sort

15 Related Work ◦Sending data over distributed queues, compared to  River ◦Programming model, compared to  BAD-FS

16 Conclusion ◦What is the reason for the success of MapReduce?  Easy to use  Problems are easily expressible  Scales to large clusters ◦Lessons learned from this work  Restricting the programming model makes parallelization and fault tolerance easy  Network bandwidth is a scarce resource  Redundant execution reduces the impact of slow machines and handles failures

