MapReduce: Simplified Data Processing on Large Clusters J. Dean and S. Ghemawat (Google) OSDI 2004 Shimin Chen DISC Reading Group.

Motivation: Large Scale Data Processing
  Process lots of data to produce other derived data
    Input: crawled documents, web request logs, etc.
    Output: inverted indices, web page graph structure, top queries in a day, etc.
  Want to use hundreds or thousands of CPUs, but want to focus only on the functionality
  MapReduce hides messy details in a library:
    Parallelization
    Data distribution
    Fault-tolerance
    Load balancing

Outline
  Programming Model
  Implementation
  Refinements
  Evaluation
  Conclusion

Programming model
  Input & Output: each a set of key/value pairs
  Programmer specifies two functions:
    map (in_key, in_value) -> list(out_key, intermediate_value)
      Processes input key/value pair to generate intermediate pairs
    reduce (out_key, list(intermediate_value)) -> list(out_value)
      Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
  Inspired by similar primitives in LISP and other functional languages

Example: Count word occurrences
  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");
  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

Looking at Actual Code (Appendix A)
  #include "mapreduce/mapreduce.h"
  // User's map function
  class WordCounter : public Mapper {
   public:
    virtual void Map(const MapInput& input) {
      const string& text = input.value();
      const int n = text.size();
      for (int i = 0; i < n; ) {
        // Skip past leading whitespace
        while ((i < n) && isspace(text[i]))
          i++;
        // Find word end
        int start = i;
        while ((i < n) && !isspace(text[i]))
          i++;
        if (start < i)
          Emit(text.substr(start,i-start),"1");
      }
    }
  };
  REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }
  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");
  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  // Done: 'result' structure contains info about counters, time
  // taken, number of machines used, etc.
  return 0;
}
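The Appendix A program depends on Google's internal library, so it cannot be compiled outside that environment. The following is a minimal single-process sketch of the same word-count semantics (map, group by key, reduce), using only the C++ standard library. The names KeyValue, MapWordCount, ReduceSum, and RunWordCount are hypothetical and are not part of the MapReduce library; the sketch ignores distribution, partitioning, and fault tolerance entirely.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an intermediate key/value pair.
using KeyValue = std::pair<std::string, std::string>;

// Map: emit (word, "1") for every whitespace-separated word in the text.
std::vector<KeyValue> MapWordCount(const std::string& text) {
  std::vector<KeyValue> out;
  const std::size_t n = text.size();
  std::size_t i = 0;
  while (i < n) {
    // Skip past leading whitespace.
    while (i < n && std::isspace(static_cast<unsigned char>(text[i]))) i++;
    // Find the end of the word.
    const std::size_t start = i;
    while (i < n && !std::isspace(static_cast<unsigned char>(text[i]))) i++;
    if (start < i) out.emplace_back(text.substr(start, i - start), "1");
  }
  return out;
}

// Reduce: sum the decimal counts collected for one key.
std::string ReduceSum(const std::vector<std::string>& values) {
  assert(!values.empty());  // reduce is only invoked for keys that were emitted
  long long total = 0;
  for (const std::string& v : values) total += std::stoll(v);
  return std::to_string(total);
}

// Driver: run map over every document, group values by key, reduce each group.
std::map<std::string, std::string> RunWordCount(
    const std::vector<std::string>& documents) {
  std::map<std::string, std::vector<std::string>> groups;
  for (const std::string& doc : documents) {
    for (const KeyValue& kv : MapWordCount(doc)) {
      groups[kv.first].push_back(kv.second);
    }
  }
  std::map<std::string, std::string> result;
  for (const auto& g : groups) result[g.first] = ReduceSum(g.second);
  return result;
}
```

Calling RunWordCount on a list of document strings returns a map from each word to its total count as a decimal string, mirroring the string-typed values of the original example.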

More Examples
  Inverted index: word -> documents
    Map: parses each document, emits <word, document ID> pairs
    Reduce: emits <word, list(document ID)>
  Distributed grep
    Map: emits a line if it matches a given pattern
    Reduce: identity function
  Distributed sort
    Map: extracts key from each record, emits a <key, record> pair
    Reduce: emits all pairs unchanged
    Relies on partitioning function and ordering guarantees (later in the talk)

MapReduce Jobs Run in August 2004 (Table 1)

Implementation Overview
  Execution environment: Google cluster
    100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
    100 Mbps or 1 Gbps Ethernet, but limited (average) bisection bandwidth
    Storage is on local IDE disks
    GFS: distributed file system manages data
    Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines

Parallelization
  Map
    Divide the input into M equal-sized splits
    Each split is 16-64 MB
  Reduce
    Partition the intermediate key space into R pieces
    hash(intermediate_key) mod R
  Typical setting: 2,000 machines, M = 200,000, R = 5,000
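The partitioning rule hash(intermediate_key) mod R can be sketched in a few lines. This is an illustration only: the function name Partition is hypothetical, and std::hash stands in for whatever hash the library actually uses. Note that std::hash is only guaranteed to be consistent within a single program run, so a real multi-machine job would need a hash function that every worker computes identically.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Returns the index (0 .. R-1) of the reduce partition that receives
// `intermediate_key`. All pairs with the same key land in the same partition.
std::size_t Partition(const std::string& intermediate_key, std::size_t R) {
  assert(R > 0);
  // Assumption: std::hash stands in for the library's real hash function.
  return std::hash<std::string>{}(intermediate_key) % R;
}
```

With R = 5,000 as in the typical setting, every map worker that emits a given key computes the same partition index, so a single reduce task sees all values for that key.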

Execution Overview
  [Figure: (0) user program calls mapreduce(spec, &result); M input splits of 16-64 MB each; partitioning function hash(intermediate_key) mod R divides map output into R regions; each reduce worker reads all intermediate data for its region and sorts it by intermediate key]

Timeline

More Details
  Master:
    Map task: state (idle/in-progress/completed), R file locations, worker machine
    Reduce task: state (idle/in-progress/completed), worker machine
    O(M+R) scheduling decisions, O(MR) space
  Locality preserving scheduling
    Schedule a map task close to the input location
  Prefer fine-grain tasks
    Dynamic load balancing
    Speeds up recovery

Fault Tolerance via Re-Execution
  On worker failure:
    Detect failure via periodic heartbeats
    Re-execute completed and in-progress map tasks
      Fine-grain: the completed tasks can be re-executed on multiple machines quickly
    Re-execute in-progress reduce tasks
    Task completion committed through master
  Master failure:
    Could handle, but don't yet (master failure unlikely)
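The re-execution rule on worker failure can be sketched as a pass over the master's task table. As the paper explains, completed map tasks must be redone because their output sits on the failed machine's local disk, while completed reduce output is already in the global file system. The types and the function HandleWorkerFailure below are hypothetical names for illustration, not the library's data structures.

```cpp
#include <cassert>
#include <vector>

enum class TaskKind { kMap, kReduce };
enum class TaskState { kIdle, kInProgress, kCompleted };

// Hypothetical per-task record kept by the master.
struct Task {
  TaskKind kind;
  TaskState state;
  int worker;  // id of the assigned worker, or -1 if unassigned
};

// Marks every task that must be re-run after `failed_worker` dies as idle
// again and returns how many tasks were rescheduled.
int HandleWorkerFailure(std::vector<Task>& tasks, int failed_worker) {
  assert(failed_worker >= 0);
  int rescheduled = 0;
  for (Task& t : tasks) {
    if (t.worker != failed_worker) continue;
    // Map tasks: both in-progress and completed work is lost with the machine.
    const bool lost_map =
        t.kind == TaskKind::kMap && t.state != TaskState::kIdle;
    // Reduce tasks: only in-progress work is lost.
    const bool lost_reduce =
        t.kind == TaskKind::kReduce && t.state == TaskState::kInProgress;
    if (lost_map || lost_reduce) {
      t.state = TaskState::kIdle;
      t.worker = -1;
      ++rescheduled;
    }
  }
  return rescheduled;
}
```

Tasks reset to idle become eligible for scheduling on any other worker, which is why fine-grained tasks spread the recovery work across many machines.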

Backup Tasks
  Problem: "straggler"
    A machine takes an unusually long time to complete one of the last few map or reduce tasks
    E.g. bad disk with frequent correctable errors, other jobs running, machine configuration problems, etc.
  Near end of phase, master schedules backup executions of the remaining in-progress tasks
    Whichever one finishes first "wins"

Combiner Function
  Purpose: reduce data sent over network
  Combiner function: performs partial merging of intermediate data at the map worker
  Typically, combiner function == reducer function
  Requires the function to be commutative and associative
  E.g. word count
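For word count, a combiner collapses the many (word, "1") pairs produced by one map task into one partial sum per word before anything crosses the network. The sketch below shows that local merge; Combine and KeyValue are hypothetical names, and the in-memory map is an assumption made for brevity rather than a description of how the library buffers data.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an intermediate key/value pair.
using KeyValue = std::pair<std::string, std::string>;

// Combiner: merge one map task's output locally into a single partial sum
// per key. Addition is commutative and associative, so summing partial sums
// in the reducer gives the same total as summing the raw "1" values.
std::vector<KeyValue> Combine(const std::vector<KeyValue>& map_output) {
  std::map<std::string, long long> partial;
  for (const KeyValue& kv : map_output) {
    partial[kv.first] += std::stoll(kv.second);
  }
  std::vector<KeyValue> combined;
  for (const auto& p : partial) {
    combined.emplace_back(p.first, std::to_string(p.second));
  }
  return combined;
}
```

The reducer needs no changes: it still parses each value as an integer and adds it, whether the value is a raw "1" or a partial sum.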

Skipping Bad Records
  Map/Reduce functions sometimes fail for particular inputs
    Best solution is to debug & fix, but not always possible
  On seg fault:
    Send UDP packet to master from signal handler
    Include sequence number of record being processed
  If master sees two failures for same record:
    Next worker is told to skip the record
  Effect: Can work around bugs in third-party libraries
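The master's side of this protocol amounts to counting failure reports per record sequence number. A minimal sketch follows, assuming the reports have already been received and decoded; BadRecordTracker and its methods are hypothetical names, and the signal handler and UDP transport are omitted.

```cpp
#include <cstdint>
#include <map>

// Hypothetical master-side bookkeeping for failure reports.
class BadRecordTracker {
 public:
  // Called when a worker's signal handler reports a crash while processing
  // the record with sequence number `record_seq`.
  void ReportFailure(std::int64_t record_seq) { ++failures_[record_seq]; }

  // True once the same record has caused at least two failures, at which
  // point the next worker is told to skip it.
  bool ShouldSkip(std::int64_t record_seq) const {
    const auto it = failures_.find(record_seq);
    return it != failures_.end() && it->second >= kSkipThreshold;
  }

 private:
  static constexpr int kSkipThreshold = 2;
  std::map<std::int64_t, int> failures_;
};
```

Requiring two failures before skipping means a single crash caused by a transient machine problem does not drop a record.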

Other Refinements
  Extensible input and output types
  Local execution for debugging
  Status web page
  User-defined counters
    Counter values returned to user code
    Displayed on status web page

Setup
  Tests run on cluster of 1800 machines:
    4 GB of memory
    Dual-processor 2 GHz Xeons with Hyperthreading
    Dual 160 GB IDE disks
    Gigabit Ethernet per machine
    Bisection bandwidth approximately 100 Gbps

Benchmarks
  Two benchmarks:
    Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
      M = 15,000 (input split size about 64 MB), R = 1
    Sort: sort 10^10 100-byte records (modeled after TeraSort benchmark)
      M = 15,000 (input split size about 64 MB), R = 4,000
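The "about 64 MB" split size follows from the other numbers on this slide: 10^10 records of 100 bytes each is 10^12 bytes (1 TB), divided among 15,000 map tasks. The constants below restate that arithmetic with compile-time checks; the constant names are illustrative only.

```cpp
#include <cstdint>

constexpr std::int64_t kRecords = 10000000000LL;  // 10^10 records
constexpr std::int64_t kRecordBytes = 100;        // bytes per record
constexpr std::int64_t kInputBytes = kRecords * kRecordBytes;
constexpr std::int64_t kMapTasks = 15000;         // M
constexpr std::int64_t kSplitBytes = kInputBytes / kMapTasks;

static_assert(kInputBytes == 1000000000000LL,
              "10^10 records x 100 bytes is 10^12 bytes (1 TB)");
static_assert(kSplitBytes == 66666666LL,
              "each split is roughly 66.7 million bytes");

// Split size expressed in units of 2^20 bytes, for comparison with 64 MB.
constexpr double SplitMiB() {
  return static_cast<double>(kSplitBytes) / (1024.0 * 1024.0);
}
```

The result is a little under 64 when measured in units of 2^20 bytes, which matches the slide's rounded figure.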

Grep
  Locality optimization helps:
    1800 machines read 1 TB of data at peak of ~31 GB/s
    Without this, rack switches would limit to 10 GB/s
  Startup overhead is significant for short jobs
    Total time about 150 seconds; 1 minute startup time

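Sort
  [Figure: data transfer rates over time for three runs of the sort program: normal execution, no backup tasks (44% longer), and 200 processes killed (5% longer)]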

Conclusion
  MapReduce has proven to be a useful abstraction
  Greatly simplifies large-scale computations at Google
  Fun to use: focus on problem, let library deal with messy details
