MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.

1 MapReduce Kristof Bamps Wouter Deroey

2 Outline
Problem overview
MapReduce
  o overview
  o implementation
  o refinements
  o conclusion

3 Problem overview
Conceptually straightforward computations
  o e.g. find the most frequent search queries
Large amount of data
  o billions of webpages/search queries
Too much data for 1 computer to handle

4 Problem overview
Typical solution: distribute the work over 100s of machines
Downsides:
  o communication
  o recovering from machine failure
  o optimization
  o locality
All of this has to be rewritten for each program

5 MapReduce
Software framework, patented by Google, that supports distributed computing on large datasets across computer clusters
Features
  o parallelization
  o load balancing
  o recovery from machine failure
  o locality

6 Programming model
Input: set of key/value pairs
Output: set of key/value pairs
Programmer specifies 2 functions: map and reduce
Map:
  o takes an input pair
  o produces a set of intermediate key/value pairs
MapReduce library groups together all intermediate pairs with the same key
Reduce:
  o takes an intermediate key and the values for that key as input
  o merges the values to produce a possibly smaller set of values

7 MapReduce: Example (word counter)
    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");
    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
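
The word counter above can be tried out on a single machine. What follows is a minimal single-process sketch in Python, not Google's implementation: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the grouping loop only stands in for the work the MapReduce library does between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word, values: all counts emitted for that word."""
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, group by intermediate key, then reduce, all in one process."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}


docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Here `counts` maps each word to its total, so "the" ends up with 3. A real run differs mainly in that the map calls, the grouping and the reduce calls are spread over many machines.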

8 MapReduce: Example
Example uses at Google:
  o distributed grep
  o distributed sort
  o web access log stats
  o large scale graph computations
  o language model processing
  o many more

9 Implementation: execution

10 Master data structures
For each map or reduce task, the master stores its state and the identity of the worker machine
Stores the locations and file regions of the output generated by map tasks
Pushes this information to in-progress reduce tasks

11 Fault tolerance
Handling worker failures
  o master pings every worker
  o failed tasks are rescheduled
  o completed map tasks on a failed worker are also rescheduled
  o reduce tasks are notified of the failure
Handling master failures
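
The rescheduling rule for worker failures can be sketched as below. This is an illustrative model only: `Task` and `handle_worker_failure` are hypothetical names, and the pinging itself is left out. The stated reason for redoing completed map tasks (their output sits on the failed machine's local disk, while reduce output is in the global file system) follows the original MapReduce paper.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None  # machine the task was assigned to


def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks of a worker that stopped answering pings; return them."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # In-progress tasks are lost. Completed map tasks are redone as well,
        # because their output lived on the failed machine's local disk.
        redo_map = task.kind == "map" and task.state == COMPLETED
        if task.state == IN_PROGRESS or redo_map:
            task.state, task.worker = IDLE, None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w2"),
    Task("reduce", COMPLETED, "w1"),
    Task("reduce", IN_PROGRESS, "w1"),
]
redo = handle_worker_failure(tasks, "w1")
```

After the call, the completed map task and the in-progress reduce task of `w1` are idle again, while the completed reduce task keeps its state.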

12 Semantics in the presence of failures
If the map and reduce functions are deterministic functions of their input values, MapReduce produces the same output as a non-faulting sequential execution
Relies on atomic commits of map and reduce task outputs
Each task writes its output to private temporary files
Map task completed: sends a message to the master, which stores the filenames
Reduce task completed: renames its temporary file to the final filename
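
The "write to a private temporary file, then rename" commit can be illustrated on a local file system. This is a sketch under stated assumptions: `commit_output` and the file name `part-00000` are hypothetical, and the atomicity comes from `os.replace` on a POSIX file system, not from a distributed file system as in the real setting.

```python
import os
import tempfile


def commit_output(final_path, data):
    """Write data to a private temporary file, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(data)
    # On POSIX the rename is atomic when both paths are on the same file system.
    os.replace(tmp_path, final_path)


workdir = tempfile.mkdtemp()
out = os.path.join(workdir, "part-00000")
commit_output(out, "the\t3\n")
commit_output(out, "the\t3\n")  # a duplicate execution of the same task
with open(out) as f:
    result = f.read()
```

Because the task is deterministic, the second execution overwrites the file with identical contents, so readers only ever see one complete copy of the output.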

13 Locality
Made possible by using a distributed file system (e.g. GFS, HDFS, ...)
Master uses location information to decide which task to schedule on which machine
Greatly reduces network traffic

14 Backup tasks
Possibility of stragglers
  o machines that take an unusually long time to complete a task
  o can be caused by a bad hard disk, competition for bandwidth, ...
Solution: schedule backup tasks
  o backup execution of in-progress tasks
  o task is completed when either the primary or the backup execution finishes
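
The "first finisher wins" idea behind backup tasks can be imitated with two threads. This is only a toy model: `run_with_backup` is a hypothetical helper, the delays simulate a straggler, and unlike a real scheduler the slower copy is not cancelled but simply ignored.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_backup(task, primary_delay, backup_delay):
    """Run the same task twice; return the result of the copy that finishes first."""
    def attempt(name, delay):
        time.sleep(delay)  # stands in for a slow disk or a congested network
        return name, task()

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(attempt, "primary", primary_delay),
            pool.submit(attempt, "backup", backup_delay),
        ]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Leaving the with-block still waits for the slower copy to end.
        return next(iter(done)).result()


winner, value = run_with_backup(
    lambda: sum(range(10)), primary_delay=0.5, backup_delay=0.01
)
```

With these delays the backup copy finishes first, and its result is used even though the primary is still running.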

15 Refinements: partitioning function
Users specify the number of reduce jobs (R)
Data gets partitioned between the jobs using the intermediate key (e.g. hash(key) modulo R)
Possible to specify a custom partitioning function
Example: input keys are URLs, and we want all entries for 1 host in a single file
  o e.g. partitioning function: hash(Hostname(urlkey)) mod R
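
Both partitioning functions can be written in a few lines of Python. The sketch below makes two assumptions that are not in the slides: CRC32 is used as the hash (Python's built-in `hash()` of a string differs between processes, so workers would disagree), and the function names and the value of R are made up for illustration.

```python
import zlib
from urllib.parse import urlparse

R = 4  # number of reduce jobs, chosen by the user


def default_partition(key, r=R):
    """hash(key) mod R, using CRC32 as a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % r


def host_partition(url_key, r=R):
    """hash(Hostname(urlkey)) mod R: all URLs of one host go to one reduce job."""
    host = urlparse(url_key).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % r


a = host_partition("http://example.com/index.html")
b = host_partition("http://example.com/about/team.html")
```

Both URLs share the host `example.com`, so `a` and `b` are the same partition number and the two entries end up in the same output file.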

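16 Refinements: combiner function
Significant repetition in the intermediate keys is possible
  o e.g. WordCount: many ("the", 1) pairs
Optional "combine" function:
  o partial merging
  o typically the same code as the reduce function
  o executed on the machine that runs the map task
  o difference with reduce: where the output goes (combiner output is written to an intermediate file, reduce output to the final output file)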
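
The effect of a combiner on the word count example can be shown in a small sketch. The names `map_words` and `combine` are hypothetical, and everything runs in one process, so this only illustrates how much intermediate data a combiner saves.

```python
from collections import Counter


def map_words(text):
    """Emit one (word, 1) pair per word, as in the word count example."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Partially merge pairs on the map machine: one (word, n) pair per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


raw = map_words("the cat and the dog and the bird")
combined = combine(raw)
```

The map step emits 8 pairs, but only 5 pairs leave the machine after combining, with ("the", 3) replacing three separate ("the", 1) pairs.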

17 Refinements: input and output types
MapReduce supports multiple formats
  o e.g. "text" mode: treats each line as a key/value pair
  o each format knows how to split itself
Reader interface:
  o allows specification of a custom input type
  o does not have to be text: users can implement a reader that takes its records from a database or another source
Output types: similar to input types
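
A reader can be thought of as anything that produces key/value records. The sketch below models that idea with Python generators; it is not the library's actual reader interface, and `text_reader`, `table_reader` and the `pages` table are made up for illustration. The text reader follows the original paper's description of text mode, where the key is the offset of the line and the value is its contents.

```python
import sqlite3


def text_reader(lines):
    """Text mode: key is the line's character offset, value is its contents."""
    offset = 0
    for line in lines:
        yield offset, line.rstrip("\n")
        offset += len(line)


def table_reader(connection, table):
    """Custom reader: produce (rowid, row) records from a database table."""
    # The table name is interpolated directly, so it must be a trusted name.
    for row in connection.execute(f"SELECT rowid, * FROM {table}"):
        yield row[0], row[1:]


text_records = list(text_reader(["hello world\n", "foo\n"]))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, hits INTEGER)")
conn.execute("INSERT INTO pages VALUES ('http://example.com/', 7)")
db_records = list(table_reader(conn, "pages"))
```

Either record stream could be fed to the same map function, which is the point of hiding the input source behind a reader.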

18 Refinements: skipping bad records
Sometimes bugs cause a map or reduce task to crash on certain records
  o usually fixed by debugging, though that is not always feasible
On crash:
  o worker sends a message to the master
  o message includes the sequence number of the argument (record)
When the master sees more than 1 failure on the same record:
  o it indicates that this record can be skipped when issuing the next re-execution
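
The skipping rule can be modelled in a single process. In this sketch `SkipTracker` plays the role of the master's bookkeeping and `run_task` the role of a worker; both names are hypothetical, and a caught exception stands in for a crash report sent over the network.

```python
from collections import Counter


class SkipTracker:
    """Master-side count of crashes per record (hypothetical helper)."""

    def __init__(self):
        self.failures = Counter()

    def report_crash(self, seq_no):
        # Called with the sequence number of the record being processed.
        self.failures[seq_no] += 1

    def should_skip(self, seq_no):
        # Skip once the same record has caused more than one failure.
        return self.failures[seq_no] > 1


def run_task(records, fn, tracker):
    """Apply fn to each record; return the results, or None if the task crashed."""
    results = []
    for seq_no, record in enumerate(records):
        if tracker.should_skip(seq_no):
            continue
        try:
            results.append(fn(record))
        except Exception:
            tracker.report_crash(seq_no)
            return None
    return results


tracker = SkipTracker()
records = ["1", "2", "oops", "4"]
attempts = [run_task(records, int, tracker) for _ in range(3)]
```

The first two executions crash on the record "oops"; the third one skips it and returns the results for the remaining records.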

19 Other refinements
Local execution:
  o regular (distributed) MapReduce applications are difficult to debug
  o alternative: sequential execution on 1 machine
User-defined counters

20 Conclusion
MapReduce simplifies distributed large-scale computations
Allows programmers to focus on the problem without worrying about the details

