Introduction to MapReduce

1 Introduction to MapReduce
Most of the content about MapReduce in these slides is borrowed from a presentation by Michael Kleber of Google, dated January 14, 2008.

2 Part 2 Reference Texts
Tom White, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition), O'Reilly Media, April 11, 2015, ISBN:
Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool Publishers, April 30, 2010, ISBN:
Jason Swartz, Learning Scala, O'Reilly Media, December 8, 2014, ISBN:
Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly Media, February 27, 2015, ISBN:
Bill Chambers and Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple, O'Reilly Media, March 8, 2018, ISBN:

3 Why do we want to use MapReduce?
MapReduce is a distributed computing paradigm
Distributed computing is hard; do we really need it?
Yes, otherwise some problems are too big for a single computer. Example:
  20+ billion web pages × 20 KB = 400+ terabytes
  One computer can read ~30-35 MB/sec from disk, so it takes ~4 months just to read the web
  ~400 hard disk / SSD drives just to store the web
  Even more time to do something with the data
(A quick check of these figures follows below.)
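As a sanity check, a back-of-the-envelope calculation of these figures (a sketch only; the ~30-35 MB/sec read rate is the slide's own rough estimate, not a measured value):

```python
# Back-of-the-envelope check of the slide's numbers (all figures are rough estimates).
pages = 20e9              # 20+ billion web pages
bytes_per_page = 20e3     # ~20 KB per page
total_bytes = pages * bytes_per_page            # 4e14 bytes, i.e. ~400 TB

read_rate = 35e6          # ~30-35 MB/sec sequential read from one disk (assumed)
months = total_bytes / read_rate / (30 * 24 * 3600)
print(f"~{total_bytes / 1e12:.0f} TB total, ~{months:.1f} months to read on one machine")
```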

4 Distributed computing is hard
Bad news I: programming work
  Communication and coordination
  Recovering from machine failures (they happen all the time!)
  Status reporting
  Debugging
  Optimization
  Data locality
Bad news II: repeat all of the above for every problem you want to solve
How can we make this easier?

5 MapReduce
A simple programming model that can be applied to many large-scale computing problems
Hides the messy details in the MapReduce runtime library:
  Automatic parallelization
  Load balancing
  Network and disk transfer optimization
  Automatic handling of machine failures
  Robustness
Improvements to the core library benefit all users of the library!

6 Typical flow of problem solving by MapReduce
Read a lot of data
Map: extract something you care about from each record
(Hidden) shuffle and sort
Reduce: aggregate, summarize, filter, or transform
...
Write the results
The outline above stays the same for different problems; only Map and Reduce change to fit the particular problem

7 MapReduce paradigm
Basic data type: the key-value pair (k, v)
  For example, key = URL, value = HTML of the web page
The programmer specifies two primary methods:
  Map(k, v) → <(k1, v1), (k2, v2), (k3, v3), ..., (kn, vn)>
  Reduce(k', <v'1, v'2, ..., v'n>) → <(k', v''1), (k', v''2), ..., (k', v''m)>
All v' with the same k' are reduced together (remember the invisible "shuffle and sort" step)
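To make these signatures concrete, here is a minimal single-machine sketch of the paradigm in plain Python (an illustration only, not the Hadoop or Google API): the caller supplies map and reduce functions, and a toy driver performs the hidden shuffle/sort by grouping intermediate pairs by key.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-machine MapReduce driver (illustration only)."""
    # Map: each input (k, v) record emits a list of intermediate (k', v') pairs.
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))

    # Shuffle/sort: group all intermediate values by key -- the "hidden" step.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce: turn each key and its list of values into output (k, v'') pairs.
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output
```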

8 Example: word frequencies (or word count) in web pages
Considered the "Hello World!" example of cloud computing
Input: files with one document per record
Specify a map function that takes a key/value pair:
  key = document URL
  value = document contents
The output of the map function is (potentially many) key/value pairs; in this case, output (word, "1") once per word occurrence in the document
  Input record: key = "document 1", value = "to be or not to be"
  Map output: ("to", "1"), ("be", "1"), ("or", "1"), ("not", "1"), ("to", "1"), ("be", "1")
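A minimal sketch of such a map function for the word-count example (plain Python, written to plug into the toy driver above; the name wc_map is illustrative):

```python
def wc_map(doc_url, doc_text):
    """Map: emit (word, "1") once per word occurrence in the document."""
    return [(word, "1") for word in doc_text.split()]

# wc_map("document 1", "to be or not to be")
# -> [('to', '1'), ('be', '1'), ('or', '1'), ('not', '1'), ('to', '1'), ('be', '1')]
```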

9 Example: word frequencies (or word count) in web pages
The MapReduce library gathers together all pairs with the same key in the shuffle/sort step
Specify a reduce function that combines the values for a key; in this case, compute the sum
  key = "be",  values = <"1", "1">  →  "2"
  key = "not", values = <"1">  →  "1"
  key = "or",  values = <"1">  →  "1"
  key = "to",  values = <"1", "1">  →  "2"
Output of the reduce step: ("be", "2"), ("not", "1"), ("or", "1"), ("to", "2")
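The matching reduce function simply sums the emitted "1"s for each key; run through the toy driver and mapper sketched above, it reproduces the output shown on this slide (again an illustration, not the Hadoop API):

```python
def wc_reduce(word, counts):
    """Reduce: sum the "1" strings collected for one word."""
    return [(word, str(sum(int(c) for c in counts)))]

docs = [("document 1", "to be or not to be")]
print(map_reduce(docs, wc_map, wc_reduce))
# -> [('be', '2'), ('not', '1'), ('or', '1'), ('to', '2')]
```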

10 Example: the overall process

11 Under the hood: scheduling
One master, many workers
Input data is split into M map tasks (typically 128 MB per split)
  The input data determines how many map tasks are created
The reduce phase is partitioned into R reduce tasks (= the number of output files)
  Each reduce task generates one output file
  The programmer decides the number of reduce tasks
Tasks are assigned to workers dynamically
The master assigns each map task to a free worker
  Data locality is considered when assigning a task to a worker
  The worker reads the task input (often from local disk!)
  The worker produces R local files containing intermediate (k, v) pairs
The master assigns each reduce task to a free worker
  The worker reads the intermediate (k, v) pairs generated by the map workers
  The worker applies the Reduce function to produce the output
The user may specify a Partition function: which intermediate keys go to which reducers (a sketch follows below)
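The conventional default for deciding which reduce task receives an intermediate key is a hash of the key modulo R, so every pair with the same key lands on the same reducer; a minimal sketch of that idea plus a user-specified alternative (function names are illustrative, not a specific Hadoop or Google API):

```python
def default_partition(key, num_reduce_tasks):
    """Default partitioner sketch: hash(key) mod R, so all occurrences of a
    key are routed to the same reduce task (and hence the same output file)."""
    return hash(key) % num_reduce_tasks

def partition_by_first_char(key, num_reduce_tasks):
    """Example of a user-specified partitioner: route keys by first character,
    e.g. to keep the R output files roughly range-partitioned by key."""
    return ord(key[0]) % num_reduce_tasks
```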

12 MapReduce: the flow in a diagram

13 MapReduce: fault tolerance via re-execution
Worker failure:
  Detect failure via periodic heartbeats
  Re-execute completed and in-progress map tasks (their intermediate output is stored on the failed worker's local disk)
  Re-execute in-progress reduce tasks
  Task completion is committed through the master
Master failure:
  State is checkpointed to a replicated file system
  A new master recovers from the checkpoint and continues
Very robust: lost thousands of machines once, but finished successfully
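As a rough sketch of the worker-failure path described above (names, fields, and the timeout value are illustrative assumptions, not taken from the actual MapReduce implementation): the master tracks the last heartbeat from each worker and, when one goes silent, re-queues that worker's in-progress tasks together with its completed map tasks, whose intermediate output lived on the failed worker's local disk.

```python
import time
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 60.0  # seconds of silence before a worker is presumed dead (illustrative value)

@dataclass
class Task:
    kind: str    # "map" or "reduce"
    state: str   # "pending", "in_progress", or "completed"

def find_failed_workers(last_heartbeat, now=None):
    """Return workers whose last heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

def reschedule(failed_worker, tasks_by_worker, pending_queue):
    """Re-execute the failed worker's in-progress tasks, plus its *completed*
    map tasks, whose intermediate output lived on that worker's local disk."""
    for task in tasks_by_worker[failed_worker]:
        if task.state == "in_progress" or (task.kind == "map" and task.state == "completed"):
            task.state = "pending"
            pending_queue.append(task)
```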

