MapReduce Based on: MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat. Mining of Massive Datasets. Jure Leskovec,


1 MapReduce Based on: MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat. Mining of Massive Datasets - Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman. "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

2 Contents Why do we need distributed computing for big data? What is MapReduce? Functional programming review. MapReduce concept. First example – word counting. Fault tolerance. Optimizations. More examples. Complexity. Real world example.

3 Why do we need distributed computing for big data? A single computer does not have enough: ◦ RAM ◦ HD capacity, IOPS ◦ network bandwidth ◦ CPU

4 What is MapReduce MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. There are many other MapReduce frameworks made to work in different environments (Hadoop is the leading open source implementation). Why not another framework (like MPI)?

5 Functional programming review Functional operations do not modify data structures: they always create new ones, so the original data still exists in unmodified form. There are no side effects (reading input from the user, networking, etc.). Data flows are explicit in program design. Order of operations does not matter: fun foo (l : int list) = sum(l) + mul(l) + length(l) Functions can be passed as arguments.

6 Map Creates a new list by applying f to each element of the input list; returns the output in order. fun map f [] = [] | map f (x::xs) = (f x) :: (map f xs); Example: upper(x) : char -> char Input: lst = [a, b, c] Operation: map upper lst Output: [A, B, C] Google's video slides - Cluster Computing and MapReduce
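A minimal Python sketch of the same idea. The name `my_map` is hypothetical and used only for illustration; it is written recursively to mirror the slide, while Python's built-in `map` does the same job.

```python
def my_map(f, xs):
    """Recursive map: apply f to every element and return a new list."""
    if not xs:                       # empty list: nothing to apply f to
        return []
    # Apply f to the head, then recurse on the tail; order is preserved.
    return [f(xs[0])] + my_map(f, xs[1:])

lst = ["a", "b", "c"]
result = my_map(str.upper, lst)
print(result)  # ['A', 'B', 'C']
print(lst)     # ['a', 'b', 'c'], the input list is not modified
```

Because each element is processed independently, the calls to `f` could run on different machines, which is the property MapReduce relies on.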

7 Fold Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list. fun foldl f z [] = z | foldl f z (x::xs) = foldl f (f(z,x)) xs; Example: we wish to write a sum function that receives an int list and returns its sum: fun sum(lst) = foldl (fn (a,x) => a+x) 0 lst Google's video slides - Cluster Computing and MapReduce
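The fold on this slide can be sketched in Python as follows. The helper names `foldl` and `total` are chosen here for illustration; `functools.reduce` is the standard-library equivalent.

```python
from functools import reduce

def foldl(f, z, xs):
    """Left fold: thread the accumulator z through xs from left to right."""
    acc = z
    for x in xs:
        acc = f(acc, x)   # f returns the next accumulator value
    return acc

def total(lst):
    """Sum of a list of ints, written as a fold starting from 0."""
    return foldl(lambda a, x: a + x, 0, lst)

print(total([1, 2, 3, 4]))                          # 10
print(reduce(lambda a, x: a + x, [1, 2, 3, 4], 0))  # 10, same fold via the standard library
```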

8 "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

9 Google’s map Map (in_key, in_value) -> (out_key, intermediate_value) list Example: Map(“play.txt”, “to be or not to be”) will emit: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
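A small Python sketch of a word-count mapper with this signature. The function name `map_words` and the whitespace tokenization are assumptions made for the example, not part of the original framework.

```python
def map_words(in_key, in_value):
    """Word-count mapper: emit (word, 1) for every word in the document text."""
    # in_key is the document name; this particular mapper does not need it.
    return [(word, 1) for word in in_value.split()]

pairs = map_words("play.txt", "to be or not to be")
print(pairs)
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```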

10 Google’s reduce Reduce (out_key, intermediate_value list) -> (key, out_value) list Example: reduce(“to”,[1,1,1]) will emit: [(“to”,3)] "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
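The matching reducer can be sketched in Python like this; `reduce_counts` is a hypothetical name used for illustration.

```python
def reduce_counts(out_key, intermediate_values):
    """Word-count reducer: sum all counts collected for one key."""
    return [(out_key, sum(intermediate_values))]

print(reduce_counts("to", [1, 1, 1]))  # [('to', 3)]
```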

11 Partition and combine functions Partition: a simple hash function, hash(key) mod R. The key may be derived differently, e.g. hash(hostname(url)) mod R. Example with hash(key) mod 2: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) is split into (“to”,1), (“be”,1), (“to”,1), (“be”,1) and (“or”,1), (“not”,1). Combine: similar to the reduce function, but applied locally on a worker (more details will follow). Example: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) becomes (“to”,2), (“be”,2), (“or”,1), (“not”,1).
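A hedged Python sketch of both functions. `zlib.crc32` stands in for the unspecified hash function because it gives the same value on every run (Python's built-in `hash` of a string does not), so the concrete split it produces may differ from the one drawn on the slide. The names `partition` and `combine` are illustrative.

```python
import zlib
from collections import Counter

def partition(key, R):
    """Pick one of R reduce partitions for a key: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % R

def combine(pairs):
    """Pre-aggregate one mapper's local output before it crosses the network."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())   # keys stay in first-seen order

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
combined = combine(pairs)
print(combined)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]

# Write the combined output into R regions, one per reducer.
R = 2
regions = {r: [] for r in range(R)}
for key, value in combined:
    regions[partition(key, R)].append((key, value))
print(regions)
```

All pairs with the same key land in the same region, which is what lets a single reducer see every value for that key.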

12 The MapReduce concept MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

13 The MapReduce concept Split the work into pieces. Start running code on workers. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

14 The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

15 The MapReduce concept Assign mappers Assign reducers MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

16 The MapReduce concept Mappers read input MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

17 The MapReduce concept When a worker finishes, it writes the map output into R regions using the partitioning function and registers the results with the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

18 The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

19 The MapReduce concept Reducers read the input, sort it and start reducing. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

20 The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

21 The MapReduce concept Each reducer stores its output on GFS and informs the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
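The flow on the preceding slides (map, partition into R regions, sort, reduce) can be simulated in a single process. This is only a sketch: there is no master, no workers and no GFS here, and `run_mapreduce`, `wc_map` and `wc_reduce` are hypothetical names.

```python
import zlib
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer, R=2):
    """Run map, partition, sort/group and reduce sequentially in one process."""
    # Map phase: every (key, value) input record is handed to the mapper,
    # and each output pair is written into one of R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for in_key, in_value in inputs:
        for out_key, value in mapper(in_key, in_value):
            r = zlib.crc32(str(out_key).encode("utf-8")) % R
            regions[r][out_key].append(value)
    # Reduce phase: each "reducer" reads its region, sorts by key and reduces.
    output = []
    for region in regions:
        for out_key in sorted(region):
            output.extend(reducer(out_key, region[out_key]))
    return output

def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

docs = [("d1", "to be or not to be"), ("d2", "to do")]
result = dict(run_mapreduce(docs, wc_map, wc_reduce))
print(result)
```

In the real system the two loops run on many machines at once and the regions are files on the mappers' local disks; the data flow is the same.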

22 The MapReduce concept "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

23 First example – word counting "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

24 First example – word counting "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

25 First example – word counting "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer (w1,1)

26 First example – word counting "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

27 First example – word counting "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

28 First example – word counting "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

29 First example – word counting "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

30 Example 2 – reverse a list of links mapper input: ( url, web page content) Mapper function: reduce function :

31 Example 2 – reverse a list of links Mapper input: (url, web page content) (themarker.com, … href="ynet.com" …) (calcalist.com, … href="ynet.com" …) Mapper function: (url, web page content) -> (target, source) list [(ynet.com, themarker.com)] [(ynet.com, calcalist.com)] Reduce function: (target, source) list -> (target, source list) (ynet.com, [themarker.com, calcalist.com])
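A Python sketch of this example. The regular expression is a deliberately simple assumption about how links appear in the page text, and `map_links` and `reduce_links` are illustrative names; a real crawler would use an HTML parser.

```python
import re
from collections import defaultdict

# Assumed link format: href="target" with optional spaces inside the quotes.
HREF = re.compile(r'href="\s*([^"\s]+)\s*"')

def map_links(url, content):
    """Emit (target, source) for every link found on the page."""
    return [(target, url) for target in HREF.findall(content)]

def reduce_links(target, sources):
    """Collect all pages that link to the target."""
    return [(target, sorted(sources))]

pages = [
    ("themarker.com", '<a href="ynet.com">news</a>'),
    ("calcalist.com", '<a href="ynet.com">news</a>'),
]

# Group the mapper output by key, as the framework would do.
grouped = defaultdict(list)
for url, content in pages:
    for target, source in map_links(url, content):
        grouped[target].append(source)

result = [pair for t in sorted(grouped) for pair in reduce_links(t, grouped[t])]
print(result)  # [('ynet.com', ['calcalist.com', 'themarker.com'])]
```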

32 Example 3 – distributed grep Given a word and a list of text files, return the files and lines in which the word appears. Mapper input: (docId, docContent) Mapper function: Reduce function:

33 Example 3 – distributed grep Mapper input: (docId, docContent) Mapper function: (docId, docContent) -> (docId, line that matches the pattern) Reduce function: identity function
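Distributed grep can be sketched in Python as below. `make_grep_mapper` and `identity_reduce` are hypothetical names, and the pattern is treated as a regular expression, which is an assumption of this example.

```python
import re

def make_grep_mapper(pattern):
    """Build a mapper that emits (docId, line) for every matching line."""
    regex = re.compile(pattern)

    def grep_map(doc_id, doc_content):
        return [(doc_id, line)
                for line in doc_content.splitlines()
                if regex.search(line)]

    return grep_map

def identity_reduce(key, values):
    """Identity reducer: pass every (key, value) pair through unchanged."""
    return [(key, value) for value in values]

grep_map = make_grep_mapper(r"\bbe\b")
matches = grep_map("play.txt", "to be or not to be\nthat is the question")
print(matches)  # [('play.txt', 'to be or not to be')]
```

All the work happens in the mapper; the reducer only copies the matches to the output.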

34 Example 4 – BFS Given a source node, return the nodes of the graph, each annotated with its distance from the source. Mapper input: (nodeId, N), where N.distance is the distance from the source node and N.AdjacencyList is the node's adjacency list. Mapper function: Reduce function:

35 Example 4 – BFS "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

36 Example 4 – BFS "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
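The figures for these slides did not survive in the transcript. The sketch below follows the general shape of parallel BFS as usually presented for MapReduce (each iteration is one job; the mapper passes the node structure along and offers distance + 1 to every neighbour; the reducer keeps the minimum). The dictionary layout, the function names and the stopping rule are assumptions of this example, and every neighbour is assumed to have its own entry in the graph.

```python
import math
from collections import defaultdict

INF = math.inf

def bfs_map(node_id, node):
    """Pass the node structure along and offer distance + 1 to each neighbour."""
    out = [(node_id, ("node", node))]
    if node["distance"] < INF:
        for m in node["adj"]:
            out.append((m, ("dist", node["distance"] + 1)))
    return out

def bfs_reduce(node_id, values):
    """Recover the node structure and keep the smallest distance seen."""
    node, d_min = None, INF
    for kind, payload in values:
        if kind == "node":
            node = payload
        else:
            d_min = min(d_min, payload)
    best = min(node["distance"], d_min)
    return [(node_id, {"distance": best, "adj": node["adj"]})]

def bfs_iteration(graph):
    """One MapReduce job: map every node, group by node id, reduce."""
    grouped = defaultdict(list)
    for node_id, node in graph.items():
        for key, value in bfs_map(node_id, node):
            grouped[key].append(value)
    return dict(pair for key in grouped for pair in bfs_reduce(key, grouped[key]))

graph = {
    "a": {"distance": 0, "adj": ["b", "c"]},   # source node
    "b": {"distance": INF, "adj": ["d"]},
    "c": {"distance": INF, "adj": ["d"]},
    "d": {"distance": INF, "adj": []},
}

# Repeat jobs until no distance changes.
while True:
    new_graph = bfs_iteration(graph)
    if new_graph == graph:
        break
    graph = new_graph

distances = {n: graph[n]["distance"] for n in graph}
print(distances)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

Each job extends the search frontier by one hop, so the number of jobs grows with the graph's diameter.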

37 Example 5 – Matrix Multiplication

38 Example 5 – Matrix-Vector Multiplication
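The body of this slide is not preserved in the transcript. As a sketch of one common formulation, assume the matrix is stored as (i, j, m_ij) triples and the vector v fits in each mapper's memory: the mapper emits (i, m_ij * v_j) and the reducer sums the products for row i. All names below are illustrative.

```python
from collections import defaultdict

def mv_map(entry, v):
    """For matrix entry (i, j, m_ij), emit the partial product for row i."""
    i, j, m_ij = entry
    return [(i, m_ij * v[j])]

def mv_reduce(i, products):
    """Sum the partial products of row i to get x_i."""
    return [(i, sum(products))]

# Matrix [[1, 2], [3, 4]] as (row, column, value) triples, and vector v.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]

grouped = defaultdict(list)
for entry in M:
    for i, product in mv_map(entry, v):
        grouped[i].append(product)

x = dict(pair for i in sorted(grouped) for pair in mv_reduce(i, grouped[i]))
print(x)  # {0: 50.0, 1: 110.0}
```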

39 Fault tolerance During a MapReduce job, the master pings all workers periodically. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

40 Fault tolerance 1) An in-progress map or reduce task on a failed worker is restarted on another machine. 2) A completed map task is also restarted on another machine, because its output is stored on the failed machine's local disk. 3) A completed reduce task is not restarted, since its result is stored on GFS. 4) If a few mappers fail on the same input, the input is marked as invalid. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat
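Rules 1 to 3 can be expressed as a small decision function. This is a sketch of the logic only, not the master's actual code; the task dictionaries and the name `tasks_to_reschedule` are assumptions of this example.

```python
def tasks_to_reschedule(tasks, failed_worker):
    """Return the ids of tasks the master must run again after a worker fails."""
    rerun = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["state"] == "in_progress":
            # Rule 1: any in-progress task is restarted elsewhere.
            rerun.append(task["id"])
        elif task["state"] == "completed" and task["kind"] == "map":
            # Rule 2: completed map output lived on the failed machine.
            rerun.append(task["id"])
        # Rule 3: completed reduce output is already on GFS, nothing to do.
    return rerun

tasks = [
    {"id": "m1", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "m2", "kind": "map", "state": "in_progress", "worker": "w2"},
    {"id": "r1", "kind": "reduce", "state": "completed", "worker": "w1"},
    {"id": "r2", "kind": "reduce", "state": "in_progress", "worker": "w1"},
]
print(tasks_to_reschedule(tasks, "w1"))  # ['m1', 'r2']
```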

41 Fault tolerance Master – a single point of failure. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

42 Optimizations The master tries to schedule each map task on, or as close as possible to, the machine that stores its input. Combine is used to reduce network bandwidth consumption, e.g. it is better to transmit (“pig”,3) than (“pig”,1), (“pig”,1), (“pig”,1). Some mappers may lag behind, so the master allocates backup workers near the end of the map phase.

43 Optimizations Some mappers may lag behind, so the master allocates backup workers near the end of the map phase. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

44 BW over time MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

45 Google mapReduce usage MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

46 Complexity Theory for MapReduce We wish to: shrink the wall-clock time; execute each reducer in main memory. We look at two parameters of an algorithm: Reducer size (q): the upper bound on the number of values allowed to appear in the list associated with a single key. Replication rate (r): the number of key-value pairs produced by all the map tasks on all the inputs, divided by the number of inputs.
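To make the two parameters concrete, the following Python sketch measures them for a given mapper on a small input. Note that it reports the largest value list actually observed, which is a lower bound on any valid reducer size q. The function names are hypothetical.

```python
from collections import defaultdict

def replication_rate_and_reducer_size(inputs, mapper):
    """Return (r, largest observed value list) for a mapper on these inputs."""
    grouped = defaultdict(list)
    pair_count = 0
    for in_key, in_value in inputs:
        for key, value in mapper(in_key, in_value):
            grouped[key].append(value)
            pair_count += 1
    r = pair_count / len(inputs)                       # pairs per input
    largest = max(len(values) for values in grouped.values())
    return r, largest

def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

docs = [("d1", "to be or not to be"), ("d2", "to do")]
r, largest = replication_rate_and_reducer_size(docs, wc_map)
print(r, largest)  # 4.0 3
```

Here the two documents produce 8 pairs, so r = 8 / 2 = 4, and the key "to" collects 3 values.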

47 Complexity - Example Similarity Joins: given a large set of elements X and a similarity measure s(x, y) that tells how similar two elements x and y of set X are. 1M images, 1MB each.

48 Complexity - Example

49 Complexity- Example(fixed)
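The bodies of the two example slides are not preserved. As a hedged reconstruction in the spirit of the cited Mining of Massive Datasets treatment, assume two schemes for comparing all pairs of n images: one reducer per pair of images, and one reducer per pair of groups after splitting the images into g groups. The function names and the choice g = 1000 are assumptions of this sketch.

```python
def all_pairs_scheme(n, item_bytes):
    """One key per pair of items: each item is sent to n - 1 reducers."""
    r = n - 1                 # replication rate
    q = 2                     # each reducer receives exactly two items
    return r, q, n * r * item_bytes   # last value: total bytes communicated

def grouped_scheme(n, item_bytes, g):
    """One key per pair of groups: each item is sent to g - 1 reducers."""
    r = g - 1                 # replication rate
    q = 2 * n // g            # each reducer receives two groups of n / g items
    return r, q, n * r * item_bytes

n, item_bytes = 10**6, 10**6  # 1M images, 1MB each

print(all_pairs_scheme(n, item_bytes))      # (999999, 2, 999999000000000000)
print(grouped_scheme(n, item_bytes, 1000))  # (999, 2000, 999000000000000)
```

Under these assumptions the first scheme moves about 10^18 bytes, while grouping with g = 1000 cuts that to about 10^15 bytes at the cost of reducers that each hold 2000 images (about 2 GB).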

50 Real world example A graphical model is a probabilistic model for which a graph denotes the conditional dependence structure between random variables.

51 Real world example Distributed Message Passing for Large Scale Graphical Models, Alexander Schwing

52 Real world example Iteration 1: The input entry for a single map task will be as follows:

53 Real world example Iteration 1:

54 Real world example Iteration 1: mapper output

55 Real world example

56 Iteration 2:

57 Real world example

58 Hands on

59 Conclusions The good: 1) simple. 2) proven. 3) many implementations for different platforms and languages. The bad: 1) it rules out performance improvements that common databases enable. 2) MapReduce algorithms are not always easy to design. 3) not all algorithms can be converted to work efficiently on MapReduce.

60 References MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat. "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer. "Pro Hadoop", Jason Venner. Series of Google's videos - Cluster Computing and MapReduce (minilecture/listing.html).

61 Partial implementations list
◦ The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
◦ The Hadoop project is a free open source Java MapReduce implementation.
◦ Twister is an open source Java MapReduce implementation that supports iterative MapReduce computations efficiently.
◦ Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
◦ Aster Data Systems nCluster In-Database MapReduce supports Java, C, C++, Perl, and Python algorithms integrated into ANSI SQL.
◦ GridGain is a free open source Java MapReduce implementation.
◦ Phoenix is a shared-memory implementation of MapReduce implemented in C.
◦ FileMap is an open version of the framework that operates on files using existing file-processing tools rather than tuples.
◦ MapReduce has also been implemented for the Cell Broadband Engine, also in C.
◦ Mars: MapReduce has been implemented on NVIDIA GPUs (Graphics Processors) using CUDA.
◦ Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
◦ CouchDB uses a MapReduce framework for defining views over distributed documents and is implemented in Erlang.
◦ Skynet is an open source Ruby implementation of Google’s MapReduce framework.
◦ Disco is an open source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
◦ Misco is an open source MapReduce designed for mobile devices and is implemented in Python.
◦ Qizmt is an open source MapReduce framework from MySpace written in C#.
◦ The open-source Hive framework from Facebook, which provides an SQL-like language over files, layered on the open-source Hadoop MapReduce engine.
◦ The Holumbus Framework: distributed computing with MapReduce in Haskell.
◦ BashReduce: MapReduce written as a Bash script by Erik Frey of Last.fm.
◦ MapReduce for Go.
◦ Meguro: a JavaScript MapReduce framework.
◦ MongoDB is a scalable, high-performance, open source, schema-free, document-oriented database written in C++ that features MapReduce.
◦ Parallel::MapReduce is a CPAN module providing experimental MapReduce functionality for Perl.
◦ MapReduce on volunteer computing.
◦ Secure MapReduce.
◦ MapReduce with MPI implementation.

