The Map-Reduce Framework


1 The Map-Reduce Framework
Compiled by Mark Silberstein, using slides from Dan Weld's class at U. Washington, Yaniv Carmeli, and other sources.

2 Big Data (From Wikipedia)
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate […] Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.

3 Motivation: Large-Scale Data Processing
We want to process a huge, mostly distributed, database and deduce "new" information from it.
We want to use thousands of CPUs, but we don't want the hassle of managing the distribution ourselves.

4 Distributed Computing Frameworks (1)
Hadoop: basic MapReduce, with intermediate results on disk; now includes YARN and supports DAGs via Tez.
Spark: runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk; its DAG execution engine supports cyclic data flow and in-memory computing.
Hive: SQL-like queries (HiveQL), implicitly converted into MapReduce, Tez, or Spark jobs.
Mahout: machine-learning algorithms, mostly on top of Hadoop.
Tez: a generalized data-flow programming framework, built on Hadoop YARN, that provides a powerful and flexible engine for executing an arbitrary DAG of tasks, for both batch and interactive use cases; adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, as well as by commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
Pregel: A System for Large-Scale Graph Processing (Google)
Dryad: Distributed Data-parallel Programs from Sequential Building Blocks (Microsoft)
Dremel: Interactive Analysis of Web-Scale Datasets (Google)
PowerDrill: Processing a Trillion Cells per Mouse Click (Google)
Peregrine: a MapReduce framework for iterative and pipelined jobs

5 Distributed Computing Frameworks (2)
A distributed computing framework provides:
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
A MapReduce framework provides all of the above for any task that matches its specific, yet generic enough, pattern.

6 The Map-Reduce Paradigm
You are given a phonebook:
How many entries have the last name "Smith"?
Create a list of all last names, with the number of entries in which each appears.
How many entries have the first name "John"?
Create a list of all first names, with the number of entries in which each appears.
How can these tasks be distributed to many computers?

7 Map-Reduce Programming Model
The programming model comes from Lisp (and other functional languages).
Many problems can be phrased this way.
It is easy to distribute across nodes, with nice retry/failure semantics.

8 Map and Reduce in Lisp (Scheme)
(map fu list [list2 list3 …]), where fu is a unary operator
(reduce fb list [list2 list3 …]), where fb is a binary operator
(map square '(1 2 3 4)) ⟹ (1 4 9 16)
(reduce + '(1 4 9 16)) ⟹ (+ 16 (+ 9 (+ 4 1))) ⟹ 30
(reduce + (map square (map – L1 L2)))

9 Map-Reduce à la Google
map(key, val) is run on each item in the input set, and emits (new-key, new-val) pairs.
reduce(new-key, list of new-vals) is run for each unique new-key emitted by map(), and emits the final output.
Reference: MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat (Google, Inc.)

10 Semantics
Assumptions on the Map and Reduce operations:
Map has no side effects.
Reduce is:
Associative: (a # b) # c = a # (b # c)
Commutative: a # b = b # a
Idempotent: the same call with the same parameters produces the same results.
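These properties are what let the framework regroup and reorder the reduction freely (and, later, apply combiners). A tiny Python check, purely illustrative, showing that an associative and commutative operator gives the same result however the values are partitioned and ordered:

```python
from functools import reduce
import random

values = [3, 1, 4, 1, 5, 9, 2, 6]

# Reduce everything in one pass, left to right.
total = reduce(lambda a, b: a + b, values)

# Reduce two arbitrary partitions first (as a combiner would),
# then merge the partial results, after shuffling the order.
shuffled = random.sample(values, len(values))
left, right = shuffled[:3], shuffled[3:]
partial = reduce(lambda a, b: a + b, left) + reduce(lambda a, b: a + b, right)

# Because + is associative and commutative, both strategies agree.
assert total == partial == 31
```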

11 Count Words in Docs
Inputs: a (url, content) pair for each URL
Outputs: a (word, count) pair for each word that appears in the documents
map(key=url, val=content):
  for each word in content, emit a (word, "1") pair
reduce(key=word, values=unique-counts):
  sum all the "1"s in unique-counts and emit a (word, sum) pair
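A minimal, single-machine sketch of this word-count job in Python; the run_mapreduce driver below is purely illustrative (it is not Hadoop's API) and stands in for the framework's shuffle:

```python
from collections import defaultdict

def map_fn(url, content):
    # Emit (word, 1) for every word occurrence in the document body.
    for word in content.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum the per-occurrence counts for this word.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, val in inputs:
        for k, v in map_fn(key, val):
            groups[k].append(v)
    # Reduce each key's list of values.
    for k in sorted(groups):
        yield from reduce_fn(k, groups[k])

docs = [("doc1", "see bob throw"), ("doc2", "see david run")]
print(list(run_mapreduce(docs, map_fn, reduce_fn)))
# [('bob', 1), ('david', 1), ('run', 1), ('see', 2), ('throw', 1)]
```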

12 Count Illustrated
Doc 1: "see bob throw"    Doc 2: "see david run"
map emits: (see, 1), (bob, 1), (throw, 1) from Doc 1 and (see, 1), (david, 1), (run, 1) from Doc 2.
After the shuffle, reduce emits: (bob, 1), (david, 1), (run, 1), (see, 2), (throw, 1).

13 Grep
Inputs: a (file, regexp) pair for each file and regular expression
Outputs: for each regexp, all the lines that match it, as (regexp, {file-name, #line}) records
map(key=file, val=regexp):
  for each line of the file that matches regexp, emit (regexp, {file-name, #line})
reduce(key=regexp, values=list of {file-name, #line}):
  emit (regexp, list of {file-name, #line})

14 Reverse Web-Link Graph
Inputs: (source-URL, page-content) for each URL
Outputs: for each URL, the list of URLs that link to it
map(key=source-URL, val=page-content):
  for each link in page-content, emit a (target-URL, source-URL) pair
reduce(key=target-URL, values=source-URLs):
  emit a (target-URL, list of source-URLs) pair
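A Python sketch of this job, under the assumption that the page content has already been reduced to a list of outgoing links; map_links, reduce_links, and the small grouping loop are illustrative names, not part of any framework:

```python
from collections import defaultdict

def map_links(source_url, links):
    # 'links' is assumed to be the list of target URLs already
    # extracted from the page content.
    for target_url in links:
        yield (target_url, source_url)

def reduce_links(target_url, source_urls):
    yield (target_url, sorted(source_urls))

# Tiny local run (same shuffle-by-key idea as the word-count driver):
pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
groups = defaultdict(list)
for src, links in pages.items():
    for k, v in map_links(src, links):
        groups[k].append(v)
print([next(reduce_links(k, vs)) for k, vs in sorted(groups.items())])
# [('b.html', ['a.html']), ('c.html', ['a.html', 'b.html'])]
```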

15 Inverted Index
Inputs: (source-URL, page-content) for each URL
Outputs: for each word that appears in the documents, the list of pages where it appears
map(key=source-URL, val=page-content):
  for each word in page-content, emit a (word, source-URL) pair
reduce(key=word, values=source-URLs):
  emit a (word, list of source-URLs) pair
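The map/reduce pair for the inverted index, as a Python sketch that plugs into the same kind of toy driver shown for word count; deduplicating with a set is an illustrative choice (the slide only asks for a list of source URLs):

```python
def map_index(source_url, page_text):
    # Emit one (word, url) pair per word occurrence.
    for word in page_text.split():
        yield (word, source_url)

def reduce_index(word, source_urls):
    # Deduplicate so each page is listed once per word.
    yield (word, sorted(set(source_urls)))

# e.g. map_index("a.html", "see bob") yields ('see', 'a.html'), ('bob', 'a.html')
```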

16 Map-Reduce Framework
Input: a collection of keys and their values.
A user-written Map function maps each input (k, v) to a set of intermediate key-value pairs.
The framework sorts all key-value pairs by key and compiles one list of intermediate values for each key: (k, [v1, …, vn]).
A user-written Reduce function reduces each list of values to a single value (more or less) for that key.

17 Parallelization
How is this distributed?
Map: partition the input key/value pairs into chunks and run map() tasks in parallel.
Shuffle: after all map()s are complete, consolidate all emitted values for each unique emitted key.
Reduce: partition the space of output map keys and run reduce() in parallel.
If a map() or reduce() task fails, re-execute it!
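A toy version of this three-phase split using Python's multiprocessing, running the word-count job from before on a pool of worker processes; unlike the real framework it has no fault tolerance, and all names here are illustrative:

```python
# A toy parallel driver; map_split and reduce_group take a single
# argument so they can be fed directly to Pool.map.
import multiprocessing as mp
from collections import defaultdict

def map_split(pair):
    # Map phase for one input pair: (url, content) -> [(word, 1), ...]
    url, content = pair
    return [(word, 1) for word in content.split()]

def reduce_group(item):
    # Reduce phase for one key: (word, [counts]) -> (word, total)
    word, counts = item
    return (word, sum(counts))

def run_parallel(inputs, processes=2):
    with mp.Pool(processes) as pool:
        # Map: each input chunk is handled by some worker process.
        mapped = pool.map(map_split, inputs)
        # Shuffle: consolidate all values emitted for each unique key.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)
        # Reduce: each key's value list is handled independently.
        return sorted(pool.map(reduce_group, groups.items()))

if __name__ == "__main__":
    docs = [("doc1", "see bob throw"), ("doc2", "see david run")]
    print(run_parallel(docs))
    # [('bob', 1), ('david', 1), ('run', 1), ('see', 2), ('throw', 1)]
```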

18 Shuffle
[diagram: each map task's output is partitioned by key and routed to the corresponding reduce task]

19 Another Example
In a credit card company:
Inputs: (card-number, transaction) for each transaction, where a transaction records its cost, date, etc.
Outputs: the average transaction cost per month per card, as (card, month, avg-cost)
map(key=card, val=transaction):
  emit a ({card, transaction.month}, transaction.cost) pair
reduce(key={card, month}, values=costs):
  compute the average cost and emit (card, month, average)
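A Python sketch of this map/reduce pair, assuming each transaction is a dict with at least 'month' and 'cost' fields (an assumption of this sketch, not part of the slide):

```python
def map_transaction(card, transaction):
    # Key by (card, month) so all of a card's monthly costs meet in reduce.
    yield ((card, transaction["month"]), transaction["cost"])

def reduce_costs(card_month, costs):
    card, month = card_month
    yield (card, month, sum(costs) / len(costs))

# e.g. map_transaction("1234", {"month": "2015-03", "cost": 100.0})
#      yields (("1234", "2015-03"), 100.0)
```

Note that averaging, unlike summation, is not associative, so this reduce could not be reused directly as a combiner; one would carry (sum, count) pairs instead.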

20 Implementation
A program forks a master process and many worker processes. The input is partitioned into some number of splits. Worker processes are assigned either to perform map on a split or to perform reduce for some set of intermediate keys.

21 Distributed Execution Overview
[diagram: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read the input splits (Split 0, 1, 2) and write intermediate data to local disk; reduce workers remote-read and sort that data, then write Output File 0 and Output File 1]

22 Distributed Execution (2)
[diagram: Map Tasks 1, 2, and 3 each feed intermediate data to Reduce Task 1 and Reduce Task 2]

23 Responsibility of the Master
Assign map and reduce tasks to workers.
Check that no worker has died (e.g. because its processor failed).
Communicate the results of map to the reduce tasks.

24 Locality
The master divides up tasks based on the location of the data:
It tries to schedule each map task on the same machine as the physical file data, or at least on the same rack.
Map task inputs are divided into 64 MB blocks, the same size as Google File System chunks.

25 Communication from Map to Reduce
Select a number R of reduce tasks.
Divide the intermediate keys into R groups, e.g. by hashing.
Each map task creates, at its own processor, R files of intermediate key/value pairs, one for each reduce task.
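A sketch of such a hash partitioner in Python; the value of R and the md5-based hash are illustrative choices (md5 simply gives a hash that is stable across processes):

```python
import hashlib

R = 4  # number of reduce tasks

def partition(key, num_reduces=R):
    # A stable hash of the key decides which reduce task receives it,
    # so every map task routes the same key to the same reduce file.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reduces

# Each map task would append (key, value) to its local file partition(key);
# reduce task r later reads file r from every map task.
print(partition("see"), partition("bob"))
```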

26 Fault Tolerance
The master detects worker failures:
It re-executes failed map tasks.
It re-executes failed reduce tasks.
The master also notices when particular input key/values cause crashes in map, and skips those values on re-execution.
Effect: the framework can work around bugs in third-party libraries.

27 Stragglers
No reduce can start until the map phase is complete:
A single slow disk controller can rate-limit the whole process.
The master therefore redundantly executes "slow" map tasks and uses the results of the first correctly-terminating replica.

28 Task Granularity & Pipelining
Fine-grained tasks: #map tasks >> #machines.
This minimizes the time for fault recovery, lets shuffling be pipelined with map execution (hiding the latency of I/O wait time), and gives better dynamic load balancing.
Jobs often use 200,000 map tasks and 5,000 reduce tasks, running on 2,000 machines.

29 Distributed Sort (1)
The framework guarantees that within each reduce partition, the keys are processed in sorted order by the reduce task.
Note: no order is guaranteed over the values of each key.
How do we implement a distributed sort of all the data?

30 Distributed Sort (2)
Inputs: (document-name, document-content) for each document
Outputs: a list of document names ordered by the first word in the document
map(key=document-name, val=document-content):
  for the first word in the document, emit a (word, document-name) pair
reduce(key=word, values=document-names):
  for each name in document-names, emit name
Something is missing: the shuffle phase must partition the keys in a way that allows a distributed sort!
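One common way to make the shuffle sort-friendly is to replace hash partitioning with range partitioning, so that every key in reduce partition i is smaller than every key in partition i+1. A Python sketch, with illustrative split points that would normally be sampled from the data beforehand:

```python
import bisect

# Illustrative split points, e.g. sampled from the data in advance.
SPLIT_POINTS = ["g", "n", "t"]   # 4 reduce partitions: < g, < n, < t, rest

def range_partition(key, splits=SPLIT_POINTS):
    # Keys are routed by range instead of by hash, so partition i only
    # holds keys smaller than every key in partition i+1.
    return bisect.bisect_right(splits, key)

# Each reduce task sees its keys in sorted order (framework guarantee),
# so concatenating reduce outputs 0..R-1 yields a globally sorted result.
words = ["see", "bob", "throw", "david", "run"]
for w in sorted(words):
    print(range_partition(w), w)
```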

31 Other Refinements
Compression of intermediate data: useful for saving network bandwidth.
Combiner: perform a local reduce on each map task's results before the shuffle.
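A sketch of a combiner for the word-count job: the map task sums its own counts locally before anything is shuffled, which is legal because the reduce (summation) is associative and commutative. The function name is illustrative:

```python
from collections import defaultdict

def map_with_combiner(url, content):
    # Local reduce (combine) inside the map task: sum the counts for
    # each word before anything is written out for the shuffle.
    local = defaultdict(int)
    for word in content.split():
        local[word] += 1
    for word, count in local.items():
        yield (word, count)

# "to be or not to be" now ships 4 pairs instead of 6 over the network;
# the reduce side still just sums whatever counts it receives per word.
print(sorted(map_with_combiner("doc", "to be or not to be")))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```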

32 Yet Another Example: Distributed Single-Source Shortest Path
Input: (u, list-of-neighbours) for each node u in the graph, and s, the source node.
Output: (v, minimal-distance) for each node v.
map(key=u, val={u-distance, neighbours}):
  for each v in neighbours, emit a (v, u-distance + 1) pair
reduce(key=u, values=distances):
  emit a (u, min(distances)) pair
See "Graph Algorithms with MapReduce".
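A single relaxation round of this algorithm as a Python sketch; the grouping loop stands in for the shuffle, and in practice the round is repeated (carrying the adjacency lists along) until no distance changes. All names are illustrative:

```python
from collections import defaultdict

INF = float("inf")

def map_sssp(u, state):
    dist, neighbours = state
    # Re-emit u's own current distance so it is not lost (a detail the
    # slide leaves implicit), then relax every outgoing edge.
    yield (u, dist)
    if dist < INF:
        for v in neighbours:
            yield (v, dist + 1)

def reduce_sssp(u, distances):
    yield (u, min(distances))

# One relaxation round on a toy graph with source "s".
graph = {"s": (0, ["a", "b"]), "a": (INF, ["c"]), "b": (INF, ["c"]), "c": (INF, [])}
groups = defaultdict(list)
for u, state in graph.items():
    for k, v in map_sssp(u, state):
        groups[k].append(v)
print(sorted(next(reduce_sssp(k, vs)) for k, vs in groups.items()))
# [('a', 1), ('b', 1), ('c', inf), ('s', 0)]
```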

33 Hadoop
An open-source implementation of Map-Reduce in Java.
Download it from the Apache Hadoop website. More about Hadoop in the tutorials...

