Published by Tiana Roseberry. Modified over 9 years ago.
1
MapReduce
Based on:
"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat.
"Mining of Massive Datasets", Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman.
"Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer.
2
Contents
Why do we need distributed computing for big data? What is MapReduce? Functional programming review. MapReduce concept. First example – word counting. Fault tolerance. Optimizations. More examples. Complexity. Real world example.
3
Why do we need distributed computing for big data?
A single computer does not have enough RAM, disk capacity and IOPS, network bandwidth, or CPU.
4
What is MapReduce
MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. There are many other MapReduce frameworks made to work in different environments (Hadoop is the leading open source implementation). Why not another framework (like MPI)?
5
Functional programming review
Functional operations do not modify data structures: they always create new ones, and the original data still exists in unmodified form.
No side effects (reading input from the user, networking, etc.).
Data flows are explicit in the program design.
Order of operations does not matter: fun foo (l : int list) = sum(l) + mul(l) + length(l)
Functions can be passed as arguments.
Note: explicit data flow means there are no references or pointers that were touched by an unknown party, at an unknown time, with unknown data.
6
Map fun map f [] = [] | map f (x::xs) = f(x) :: map f xs;
Creates a new list by applying f to each element of the input list; returns the output in order. fun map f [] = [] | map f (x::xs) = f(x) :: map f xs; Example: upper(x) : char -> char. Input: lst = [a, b, c]. Operation: map upper lst. Output: [A, B, C]. Google's video slides - Cluster Computing and MapReduce
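The map operation described on this slide can be sketched in Python. This is only an illustration: the name `map_list` is an assumption introduced here (Python already ships a built-in `map`), and `str.upper` stands in for the slide's `upper`.

```python
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def map_list(f: Callable[[A], B], items: Iterable[A]) -> List[B]:
    """Return a NEW list with f applied to each element, in input order."""
    return [f(x) for x in items]


# The slide's example: apply upper to [a, b, c].
letters = ["a", "b", "c"]
upper_letters = map_list(str.upper, letters)
# The input list is left untouched; only a new list is created.
```

Because `map_list` never mutates its input, the calls to `f` are independent of one another, which is the property that lets a framework run them on different machines.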
7
Fold fun foldl f z [] = z | foldl f z (x::xs) = foldl f (f(z,x)) xs;
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list. fun foldl f z [] = z | foldl f z (x::xs) = foldl f (f(z,x)) xs; Note: the output can itself be a list. Example: we wish to write a sum function that receives an int list and returns its sum: fun sum(lst) = foldl (fn (a,x) => a+x) 0 lst Google's video slides - Cluster Computing and MapReduce
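A rough Python rendering of the fold above, assuming the same argument order as the slide's definition (the accumulator is the first argument of `f`). The helper names `total` and `reverse` are illustrative only; `reverse` is included to show that the accumulator can be a list.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")
Z = TypeVar("Z")


def foldl(f: Callable[[Z, A], Z], z: Z, items: Iterable[A]) -> Z:
    """Left fold: thread an accumulator through the list, left to right."""
    acc = z
    for x in items:
        acc = f(acc, x)  # f returns the next accumulator value
    return acc


def total(numbers):
    """Sum of a list of ints, written as a fold starting from 0."""
    return foldl(lambda acc, x: acc + x, 0, numbers)


def reverse(items):
    """The accumulator may itself be a list: build the reversed list."""
    return foldl(lambda acc, x: [x] + acc, [], items)
```

For instance, `total([1, 2, 3, 4])` folds as `(((0 + 1) + 2) + 3) + 4`.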
8
"Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
9
Google’s map
Map (in_key, in_value) -> (out_key, intermediate_value) list
Example: Map(“play.txt”, “to be or not to be”) will emit: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1)
"Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
10
Google’s reduce
Reduce (out_key, intermediate_value list) -> (key, out_value) list
Example: Reduce(“to”, [1,1,1]) will emit: [(“to”,3)]
"Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
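The map and reduce signatures above can be tried out with a small single-process simulation. This is a sketch under stated assumptions, not Google's or Hadoop's API: `run_job`, `map_word_count` and `reduce_word_count` are hypothetical names, and the "shuffle" is just an in-memory dictionary that groups values by key.

```python
from collections import defaultdict


def map_word_count(in_key, in_value):
    """(file name, text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in in_value.split()]


def reduce_word_count(out_key, values):
    """(word, [1, 1, ...]) -> [(word, count)]."""
    return [(out_key, sum(values))]


def run_job(inputs, mapper, reducer):
    """Run mapper on every input, group by key, then run reducer per key."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in mapper(in_key, in_value):
            groups[out_key].append(value)  # stand-in for the shuffle phase
    output = []
    for out_key in sorted(groups):  # reducers see keys in sorted order
        output.extend(reducer(out_key, groups[out_key]))
    return output


counts = run_job(
    [("play.txt", "to be or not to be")], map_word_count, reduce_word_count
)
```

In a real cluster the grouping step is distributed across machines; here it only serves to show which values each reduce call receives.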
11
Partition and combine functions
Partition: a simple hash function, hash(key) mod R. The key may be derived differently, for example hash(hostname(url)) mod R.
Example with hash(key) mod 2: the pairs (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) are split into one region holding (“to”,1), (“be”,1), (“to”,1), (“be”,1) and another holding (“or”,1), (“not”,1).
Combine: similar to the reduce function, applied locally on the worker (more details will follow).
Example: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) is combined into (“to”,2), (“be”,2), (“or”,1), (“not”,1).
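A minimal sketch of the two functions on this slide. Two assumptions are made: `zlib.crc32` stands in for the unspecified `hash` (Python's built-in `hash` on strings varies between runs), and the function names are illustrative. Which region a given word lands in depends on the hash, so it will not necessarily match the split drawn on the slide.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """hash(key) mod R, using CRC32 as a stable stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Local pre-aggregation on one worker: sum the values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())  # keeps first-seen key order


def split_by_partition(pairs, num_reducers):
    """Write pairs into R regions, one per reducer."""
    regions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    return regions


map_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
combined = combine(map_output)
regions = split_by_partition(combined, 2)
```

Combining before partitioning means four pairs cross the network instead of six, and every occurrence of a key still reaches the same reducer.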
12
The MapReduce concept MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
13
The MapReduce concept: split the work into pieces.
Start running code on workers. Note: one of the workers is the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
14
The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
15
The MapReduce concept: assign mappers, assign reducers.
MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
16
The MapReduce concept Mappers read input
MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
17
The MapReduce concept: when a map worker finishes, it
writes the output of map into R regions according to the partitioning function, and registers the results with the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
18
The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
19
The MapReduce concept: reducers read the input, sort it and start reducing. Note: copying can start before all mappers have finished, but the reduce itself can start only once every mapper has finished. Each reducer takes its own partition. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
20
The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
21
The MapReduce concept: each reducer stores its output on GFS and
informs the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
22
The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
23
First example – word counting
"Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
30
Example 2 – reverse a list of links
mapper input: ( url, web page content) Mapper function: reduce function :
31
Example 2 – reverse a list of links
mapper input: (url, web page content)
(themarker.com, <HEAD> … href="ynet.com" …)
(calcalist.com, <HEAD> … href="ynet.com" …)
mapper function: (url, web page content) -> (target, source) list
[(ynet.com, themarker.com)]
[(ynet.com, calcalist.com)]
reduce function: (target, source) list -> (target, source list)
(ynet.com, [themarker.com, calcalist.com])
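The link-reversal job can be sketched as follows. The regular expression is a deliberately naive assumption for pulling `href` targets out of a page (a real crawler would use an HTML parser), and all function names are illustrative.

```python
import re
from collections import defaultdict

# Naive link extraction: anything of the form href="...".
HREF = re.compile(r'href="([^"]+)"')


def map_links(url, content):
    """(url, page content) -> list of (target, source) pairs."""
    return [(target, url) for target in HREF.findall(content)]


def reduce_links(target, sources):
    """(target, [source, ...]) -> [(target, sorted source list)]."""
    return [(target, sorted(sources))]


def invert_links(pages):
    """Single-process stand-in for map, shuffle and reduce."""
    groups = defaultdict(list)
    for url, content in pages:
        for target, source in map_links(url, content):
            groups[target].append(source)
    result = []
    for target in sorted(groups):
        result.extend(reduce_links(target, groups[target]))
    return result


pages = [
    ("themarker.com", '<HEAD> ... href="ynet.com" ...'),
    ("calcalist.com", '<HEAD> ... href="ynet.com" ...'),
]
inverted = invert_links(pages)
```

The source list is sorted only to make the output deterministic; the slide does not prescribe an order.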
32
Example 3 – distributed grep
Given a word and a list of text files, return the files and lines in which the word appears. Mapper input: (docId, docContent) Mapper function: Reduce function:
33
Example 3 – distributed grep
mapper input: (docId, docContent)
mapper function: (docId, docContent) -> (docId, line that matches the pattern)
reduce function: identity function
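A small sketch of distributed grep in the same single-process style. The pattern is passed to the mapper as an extra argument, which is an assumption about how the job would be configured; the function names are illustrative.

```python
import re


def map_grep(doc_id, doc_content, pattern):
    """(docId, docContent) -> (docId, line) for each matching line."""
    regex = re.compile(pattern)
    return [
        (doc_id, line)
        for line in doc_content.splitlines()
        if regex.search(line)
    ]


def reduce_identity(doc_id, lines):
    """Identity reducer: pass every (docId, line) pair through unchanged."""
    return [(doc_id, line) for line in lines]


def distributed_grep(docs, pattern):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    groups = {}
    for doc_id, content in docs:
        for key, line in map_grep(doc_id, content, pattern):
            groups.setdefault(key, []).append(line)
    out = []
    for key in sorted(groups):
        out.extend(reduce_identity(key, groups[key]))
    return out


docs = [("a.txt", "to be\nor not\nto be"), ("b.txt", "nothing here")]
matches = distributed_grep(docs, r"\bbe\b")
```

All the work happens in the mapper; the reducer exists only to collect the matches into the output files.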
34
Example 4 – BFS
Given a source node, return all nodes in the graph, each annotated with its distance from the source. Mapper input: (nodeId, N), where N.distance is the distance from the source node and N.AdjacencyList is the node's adjacency list. Mapper function: Reduce function:
35
Example 4 – BFS. Note: the graph is finite, and the job is run repeatedly until no node has infinite distance. After each iteration an external program has to check this condition and launch the next iteration if needed; Hadoop provides an interface for this. "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
36
Example 4 – BFS "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
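One possible shape for the BFS job, sketched for unit edge weights. Two things are assumptions rather than slide content: the node record is a `(distance, adjacency_list)` tuple, and the driver stops when an iteration changes nothing, which also terminates when some nodes are unreachable (the speaker note's condition, "no node has infinite distance", assumes every node can be reached).

```python
import math
from collections import defaultdict

INF = math.inf


def map_bfs(node_id, node):
    """Emit the node structure plus a candidate distance for each neighbor."""
    distance, adjacency = node
    emitted = [(node_id, ("node", adjacency)), (node_id, ("dist", distance))]
    if distance != INF:
        for neighbor in adjacency:
            emitted.append((neighbor, ("dist", distance + 1)))
    return emitted


def reduce_bfs(node_id, values):
    """Keep the adjacency list and the smallest distance seen for the node."""
    adjacency, best = [], INF
    for kind, payload in values:
        if kind == "node":
            adjacency = payload
        else:
            best = min(best, payload)
    return node_id, (best, adjacency)


def bfs_iteration(graph):
    """One MapReduce pass over the whole graph."""
    groups = defaultdict(list)
    for node_id, node in graph.items():
        for key, value in map_bfs(node_id, node):
            groups[key].append(value)
    return dict(reduce_bfs(key, values) for key, values in groups.items())


def bfs(adjacency_lists, source):
    """External driver: repeat the job until the distances stop changing."""
    graph = {
        n: (0 if n == source else INF, adj) for n, adj in adjacency_lists.items()
    }
    while True:
        updated = bfs_iteration(graph)
        if updated == graph:
            return {n: d for n, (d, _) in updated.items()}
        graph = updated
```

Each pass pushes the search frontier one hop further, so the number of iterations is roughly the largest finite distance from the source plus one final pass that detects no change.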
37
Example 5 – Matrix Multiplication
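$MN = P$.
Iteration 1:
Mapper input: the elements $m_{ij}$ of $M$ and $n_{jk}$ of $N$.
Mapper function: $m_{ij} \Rightarrow (j, (M, i, m_{ij}))$ and $n_{jk} \Rightarrow (j, (N, k, n_{jk}))$
Reduce function: $(j, [(M, i, m_{ij}), (N, k, n_{jk}), \ldots]) \Rightarrow [((i,k), m_{ij} n_{jk}), \ldots]$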
38
Example 5 – Matrix Multiplication
Iteration 2:
Mapper input: $((i,k), m_{ij} n_{jk})$
Mapper function: identity function
Reduce function: $((i,k), [m_{ij} n_{jk}, \ldots]) \Rightarrow ((i,k), \sum_j m_{ij} n_{jk})$
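The two iterations can be traced on a small example. In this sketch matrices are assumed to be sparse dictionaries mapping `(row, column)` to a value, and the function names are illustrative; the first pass keys every element by the shared index j, the second pass sums the partial products per output cell `(i, k)`.

```python
from collections import defaultdict


def map_pass1(matrix_name, row, col, value):
    """Key each element by the shared index j."""
    if matrix_name == "M":
        return [(col, ("M", row, value))]  # m_ij -> (j, (M, i, m_ij))
    return [(row, ("N", col, value))]      # n_jk -> (j, (N, k, n_jk))


def reduce_pass1(j, values):
    """For one j, emit ((i, k), m_ij * n_jk) for every M/N pairing."""
    m_entries = [(i, m) for name, i, m in values if name == "M"]
    n_entries = [(k, n) for name, k, n in values if name == "N"]
    return [((i, k), m * n) for i, m in m_entries for k, n in n_entries]


def reduce_pass2(key, products):
    """Sum the partial products of one output cell (i, k)."""
    return [(key, sum(products))]


def group(pairs):
    """Stand-in for the shuffle: collect values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def multiply(M, N):
    records = [("M", i, j, v) for (i, j), v in M.items()]
    records += [("N", j, k, v) for (j, k), v in N.items()]
    by_j = group(pair for rec in records for pair in map_pass1(*rec))
    partials = [p for j, vals in by_j.items() for p in reduce_pass1(j, vals)]
    by_cell = group(partials)  # the second mapper is the identity function
    return dict(p for key, vals in by_cell.items() for p in reduce_pass2(key, vals))
```

For example, multiplying `[[1, 2], [3, 4]]` by `[[5, 6], [7, 8]]` produces the cell `(0, 0)` as `1*5 + 2*7`.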
39
Fault tolerance: during the MapReduce task, the master pings all workers periodically.
MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
40
Fault tolerance
1) An in-progress map or reduce task on a failed worker is restarted on another machine.
2) A completed map task on a failed worker is also restarted on another machine, because its output is stored on the failed machine's local disk.
3) A completed reduce task is not restarted, since its result is stored on GFS.
4) If a few mappers fail on the same input, that input is marked as invalid.
Note: this is the power of functional programming: a different worker can be run on the same task and the result is still correct.
MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
41
Fault tolerance: the master is a single point of failure.
MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
42
Optimizations
The master tries to allocate a mapper on the machine closest to the one that stores the input file.
Combine is used to reduce network bandwidth consumption, e.g. it is better to transmit ‘(“pig”,3)’ than ‘(“pig”,1), (“pig”,1), (“pig”,1)’.
Some mappers may lag behind, so the master allocates backup workers near the end of the map phase.
43
Optimizations: some mappers may lag behind, so the master allocates backup workers near the end of the map phase. Note: with no side effects, there is no problem running the same functions in parallel. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
44
Bandwidth over time. Note: the middle graph shows a long tail; in the right graph, 200 out of 1746 workers were killed. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
45
Google MapReduce usage
MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
46
Complexity Theory for MapReduce
We wish to: shrink the wall-clock time, and execute each reducer in main memory.
We will look at two parameters of the algorithm:
Reducer size (q): the upper bound on the number of values that are allowed to appear in the list associated with a single key.
Replication rate (r): the number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs.
Note: the smaller the reducer size, the more certain it is that a reducer's input fits in RAM and the shorter the processing time. A smaller reducer size also allows more parallelism.
47
Complexity - Example: similarity joins
Given a large set of elements X and a similarity measure s(x, y) that tells how similar two elements x and y of X are, compute the similarity of every pair. Example data: 1M images, 1MB each.
48
Complexity - Example
mapper input: (i, Pi)
mapper function: (i, Pi) -> [({i,j}, Pi) for every j ≠ i]
reduce function: ({i,j}, [Pi, Pj]) -> ((i,j), s(Pi,Pj))
reducer size (q): 2
replication rate (r): for each input the mapper creates about 1M outputs, so r = 1M * 1M / 1M = 1M.
Total network: 1M * 1M * 1MB = 1 exabyte. On a one-gigabit network this would take about 300 years.
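The cost figures on this slide can be re-derived with a few lines of arithmetic. The sketch assumes decimal units (1M = 10^6, 1MB = 10^6 bytes), a link that sustains its full 10^9 bits per second, and the slide's rounding of the replication rate to the number of inputs; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def naive_join_cost(num_inputs, bytes_per_input, link_bits_per_second):
    """Return (replication rate, total bytes shipped, transfer time in years)."""
    # Every input is sent to (about) one reducer per other input.
    replication_rate = num_inputs
    total_bytes = replication_rate * num_inputs * bytes_per_input
    seconds = total_bytes * 8 / link_bits_per_second
    return replication_rate, total_bytes, seconds / SECONDS_PER_YEAR


# 1M images of 1MB each over a one-gigabit link.
r, total_bytes, years = naive_join_cost(10**6, 10**6, 10**9)
```

At the raw line rate the transfer comes to roughly 250 years, the same order of magnitude as the slide's figure of about 300 years, which presumably allows for a lower effective throughput.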
49
Complexity - Example (fixed)
mapper input: (i, Pi)
We partition the input into g groups (G).
mapper function: $(i, P_i) \Rightarrow \{ (\{u,v\}, P_i) \mid u, v \in G,\ i \in u \}$
reduce function: $(\{u,v\}, [P_i, P_j, \ldots]) \Rightarrow ((i,j), s(P_i, P_j))$
reducer size (q): 2M/g
replication rate (r): g
Total network: g * 1M * 1MB = g terabytes. Note: on the same one-gigabit network this takes roughly g times 2 to 3 hours.
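The grouped scheme can be checked on a toy data set. Several details below are assumptions that the slide leaves open: elements are assigned to groups by `i % g`, reducer keys are unordered group pairs, g is at least 2, and pairs that fall inside a single group u are compared only by the reducer for {u, u+1 mod g} so that no pair is scored twice. With these choices the measured replication rate is g - 1, which the slide rounds to g.

```python
from collections import defaultdict
from itertools import combinations


def group_of(i, g):
    """Hypothetical grouping rule: element i belongs to group i mod g."""
    return i % g


def map_grouped(i, p, g):
    """Send element i to every reducer {u, v} that contains its own group u."""
    u = group_of(i, g)
    return [(frozenset((u, v)), (i, p)) for v in range(g) if v != u]


def reduce_grouped(key, members, g, similarity):
    """Score each pair once: same-group pairs only at reducer {u, u+1 mod g}."""
    out = []
    for (i, pi), (j, pj) in combinations(sorted(members), 2):
        gi, gj = group_of(i, g), group_of(j, g)
        if gi != gj or (gi + 1) % g in key:
            out.append(((i, j), similarity(pi, pj)))
    return out


def similarity_join(items, g, similarity):
    """Return (scored pairs, replication rate, largest reducer input)."""
    groups = defaultdict(list)
    emitted = 0
    for i, p in items:
        for key, value in map_grouped(i, p, g):
            groups[key].append(value)
            emitted += 1
    results = []
    for key, members in groups.items():
        results.extend(reduce_grouped(key, members, g, similarity))
    largest = max(len(members) for members in groups.values())
    return results, emitted / len(items), largest
```

With 12 elements and g = 3, each reducer receives two groups of 4 elements, matching the slide's reducer size of 2M/g when M is read as the number of inputs.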
50
Real world example
A graphical model is a probabilistic model for which a graph denotes the conditional dependence structure between random variables. $P(x_1, x_2, \ldots, x_n)$ Note: $\psi_i$ is the prior on variable $i$.
51
Real world example Distributed Message Passing for Large Scale Graphical Models, Alexander Schwing
52
Real world example Iteration 1:
The input entry for a single map task will be as follows:
53
Real world example Iteration 1:
54
Real world example Iteration 1: mapper output
55
Real world example
56
Real world example Iteration 2:
57
Real world example
58
Hands on
59
Conclusions
The good: 1) Simple. 2) Proven.
3) Many implementations for different platforms and languages.
The bad: 1) The performance improvements offered by common databases are not available. 2) MapReduce algorithms are not always easy to design. 3) Not all algorithms can be converted to work efficiently on MapReduce.
60
References
"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat.
"Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer.
"Pro Hadoop", Jason Venner.
Series of Google's videos - Cluster Computing and MapReduce, minilecture/listing.html
61
Partial Implementations list
The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
The Hadoop project is a free open source Java MapReduce implementation.
Twister is an open source Java MapReduce implementation that supports iterative MapReduce computations efficiently.
Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
Aster Data Systems nCluster In-Database MapReduce supports Java, C, C++, Perl, and Python algorithms integrated into ANSI SQL.
GridGain is a free open source Java MapReduce implementation.
Phoenix is a shared-memory implementation of MapReduce implemented in C.
FileMap is an open version of the framework that operates on files using existing file-processing tools rather than tuples.
MapReduce has also been implemented for the Cell Broadband Engine, also in C.
Mars: MapReduce has been implemented on NVIDIA GPUs (Graphics Processors) using CUDA.
Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
CouchDB uses a MapReduce framework for defining views over distributed documents and is implemented in Erlang.
Skynet is an open source Ruby implementation of Google’s MapReduce framework.
Disco is an open source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
Misco is an open source MapReduce designed for mobile devices and is implemented in Python.
Qizmt is an open source MapReduce framework from MySpace written in C#.
Hive is an open-source framework from Facebook that provides an SQL-like language over files, layered on the open-source Hadoop MapReduce engine.
The Holumbus Framework: distributed computing with MapReduce in Haskell (Holumbus-MapReduce).
BashReduce: MapReduce written as a Bash script by Erik Frey of Last.fm.
MapReduce for Go.
Meguro: a JavaScript MapReduce framework.
MongoDB is a scalable, high-performance, open source, schema-free, document-oriented database, written in C++, that features MapReduce.
Parallel::MapReduce is a CPAN module providing experimental MapReduce functionality for Perl.
MapReduce on volunteer computing.
Secure MapReduce.
MapReduce with MPI implementation.