Auburn University http://www.eng.auburn.edu/~xqin COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.

1 Auburn University http://www.eng.auburn.edu/~xqin
COMP7330/7336 Advanced Parallel and Distributed Computing
MapReduce - Introduction
Dr. Xiao Qin, Auburn University
Slides are adapted from Dr. Weikuan Yu and Google

2 Review: Map-Reduce Framework

3 MapReduce Usage
Large-Scale Data Processing
  Can make use of 1000s of CPUs
  Avoid the hassle of managing parallelization
Provide a complete run-time system
  Automatic parallelization & distribution
  Fault tolerance
  I/O scheduling
  Monitoring & status updates
Figure: User Growth at Google (2004)

4 MapReduce Basic Ingredients
Programmers specify two functions:
  map (k, v) → <k’, v’>*
  reduce (k’, v’) → <k’, v’>*
All values with the same key are sent to the same reducer
The execution framework handles everything else…
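A minimal single-process sketch of the two-function contract above, assuming Python. The names `run_mapreduce`, `parse_map`, and `max_reduce` and the sample records are hypothetical, chosen only for illustration; they do not come from the slides or from any MapReduce library. The driver stands in for "the execution framework": it groups intermediate values by key and calls the reducer once per key.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-process stand-in for the execution framework."""
    intermediate = defaultdict(list)
    # Map phase: each (k, v) record may emit zero or more (k', v') pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: all values that share a key reach the same reduce call.
    output = []
    for out_key in sorted(intermediate):
        output.extend(reduce_fn(out_key, intermediate[out_key]))
    return output

def parse_map(line_no, line):
    # Hypothetical mapper: parse "name number" and emit (name, number).
    name, number = line.split()
    yield name, int(number)

def max_reduce(name, numbers):
    # Hypothetical reducer: keep the largest number seen for each name.
    yield name, max(numbers)

records = [(0, "a 1"), (1, "b 2"), (2, "a 5")]
result = run_mapreduce(records, parse_map, max_reduce)
print(result)  # [('a', 5), ('b', 2)]
```

Only `parse_map` and `max_reduce` are problem-specific; the driver is unchanged for any other job, which is the point of the two-function interface.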

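5 Shuffle and Sort: aggregate values by keys
Diagram: four map tasks feed a shuffle-and-sort stage, which feeds three reduce tasks.
  Map output: (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8)
  Shuffle and Sort: a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]
  Reduce output: (r1, s1), (r2, s2), (r3, s3)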
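The grouping step in this diagram can be sketched as follows, assuming the diagram is read as four map tasks that emit (a, 1), (b, 2); (c, 3), (c, 6); (a, 5), (c, 2); and (b, 7), (c, 8). `shuffle_and_sort` is a hypothetical helper name. Sorting the values within each key is an extra assumption made here so the result matches the diagram; the slides only state that values are aggregated by key.

```python
from collections import defaultdict

def shuffle_and_sort(map_outputs):
    """Collect the values emitted by all map tasks under their keys."""
    grouped = defaultdict(list)
    for task_output in map_outputs:
        for key, value in task_output:
            grouped[key].append(value)
    # Keys are visited in sorted order; values are sorted only for display.
    return {key: sorted(grouped[key]) for key in sorted(grouped)}

map_outputs = [
    [("a", 1), ("b", 2)],  # map task 1
    [("c", 3), ("c", 6)],  # map task 2
    [("a", 5), ("c", 2)],  # map task 3
    [("b", 7), ("c", 8)],  # map task 4
]
grouped = shuffle_and_sort(map_outputs)
print(grouped)  # {'a': [1, 5], 'b': [2, 7], 'c': [2, 3, 6, 8]}
```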

6 MapReduce – Two Phases
Programmers specify two functions:
  map (k, v) → <k’, v’>*
  reduce (k’, v’) → <k’, v’>*
All values with the same key are reduced together
The execution framework handles everything else…
Not quite…usually, programmers also specify:
  combine (k’, v’) → <k’, v’>*
    Mini-reducers that run in memory after the map phase
    Used as an optimization to reduce network traffic
  partition (k’, number of partitions) → partition for k’
    Often a simple hash of the key, e.g., hash(k’) mod n
    Divides up key space for parallel reduce operations
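A minimal sketch of the two optional functions, assuming Python and a summing job. `combine` and `partition` are hypothetical helper names. `zlib.crc32` is used in place of Python's built-in `hash` because string hashes are randomized per process by default, and a partitioner has to send a given key to the same partition on every machine; that choice is an assumption of this sketch, not something the slides prescribe.

```python
import zlib
from collections import defaultdict

def combine(pairs):
    """Mini-reducer: sum values per key inside one map task's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def partition(key, num_partitions):
    """hash(k') mod n, using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# One map task emitted two pairs for key "c"; the combiner merges them
# before anything is sent over the network.
combined = combine([("c", 3), ("c", 6)])
print(combined)  # [('c', 9)]

# Every key is routed to exactly one of the n reduce partitions.
for key in ["a", "b", "c"]:
    print(key, "-> partition", partition(key, 3))
```

A combiner like this is only safe when the reduce operation is associative and commutative, as summation is.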

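7 Shuffle and Sort: aggregate values by keys
Diagram: four map tasks, each followed by a combine step and a partition step, feed a shuffle-and-sort stage and three reduce tasks.
  Map output: (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8)
  Combine output: (a, 1), (b, 2) | (c, 9) | (a, 5), (c, 2) | (b, 7), (c, 8)
  Shuffle and Sort: a → [1, 5], b → [2, 7], c → [2, 9, 8] (without combine: c → [2, 3, 6, 8])
  Reduce output: (r1, s1), (r2, s2), (r3, s3)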

8 Map Abstraction
Inputs a key/value pair
  Key is a reference to the input value
  Value is the data set on which to operate
Evaluation
  Function defined by user
  Applied to every value in the input
  Might need to parse input
Produces a new list of key/value pairs
  Can be different type from input pair

9 Reduce Abstraction
Typically a function that:
  Starts with a large number of key/value pairs
    One key/value for each word in all files being grepped (including multiple entries for the same word)
  Ends with very few key/value pairs
    One key/value for each unique word across all the files with the number of instances summed into this entry
Broken up so a given worker works with input of the same key.

10 count words in docs
Input consists of (url, contents) pairs
map(key=url, val=contents):
  For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
  Sum all “1”s in values list
  Emit result “(word, sum)”

11 Word Count: Illustrated
map(key=url, val=contents):
  For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
  Sum all “1”s in values list
  Emit result “(word, sum)”
Input documents: “see bob throw”, “see spot run”
Map output: see 1, bob 1, run 1, see 1, spot 1, throw 1
Reduce output: bob 1, run 1, see 2, spot 1, throw 1
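The word-count pseudocode can be made runnable as a small sketch, assuming Python. It uses the two documents from the illustration; the URLs `doc1` and `doc2` and the helper `word_count` are hypothetical. Counts are emitted as integers rather than the string “1” so the reducer can sum them directly.

```python
from collections import defaultdict

def map_word_count(url, contents):
    # For each word w in contents, emit (w, 1).
    for word in contents.split():
        yield word, 1

def reduce_word_count(word, counts):
    # Sum all the 1s in the values list and emit (word, sum).
    yield word, sum(counts)

def word_count(documents):
    """Toy driver: map every document, group by word, then reduce."""
    grouped = defaultdict(list)
    for url, contents in documents:
        for word, one in map_word_count(url, contents):
            grouped[word].append(one)
    result = []
    for word in sorted(grouped):
        result.extend(reduce_word_count(word, grouped[word]))
    return result

documents = [("doc1", "see bob throw"), ("doc2", "see spot run")]
counts = word_count(documents)
print(counts)
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]
```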

12 Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
  If line matches regexp, emit (line, “1”)
reduce(key=line, values=uniq_counts):
  Don’t do anything; just emit line
Grep is a command-line utility for searching plain-text data sets for lines matching a regular expression. Grep was originally developed for the Unix operating system, but is available today for all Unix-like systems. Its name comes from the ed command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines.
$ grep apple fruitlist.txt
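A runnable sketch of the grep job above, assuming Python. The file contents and byte offsets are made up for illustration (the slide names `fruitlist.txt` but does not show what it contains), and `mapreduce_grep` is a hypothetical driver name.

```python
import re
from collections import defaultdict

def map_grep(key, line, pattern):
    # key is (url, offset); emit (line, 1) when the line matches.
    if re.search(pattern, line):
        yield line, 1

def reduce_grep(line, counts):
    # Identity reduce: ignore the counts and just emit the line.
    yield line

def mapreduce_grep(records, pattern):
    grouped = defaultdict(list)
    for key, line in records:
        for out_line, one in map_grep(key, line, pattern):
            grouped[out_line].append(one)
    matches = []
    for out_line in sorted(grouped):
        matches.extend(reduce_grep(out_line, grouped[out_line]))
    return matches

# Hypothetical contents of fruitlist.txt, keyed by (url, byte offset).
records = [
    (("fruitlist.txt", 0), "apple"),
    (("fruitlist.txt", 6), "banana"),
    (("fruitlist.txt", 13), "pineapple"),
    (("fruitlist.txt", 23), "apple"),
]
matches = mapreduce_grep(records, r"apple")
print(matches)  # ['apple', 'pineapple']
```

Because the matching line itself is the key, identical matching lines collapse into a single output record in this sketch, whereas command-line grep prints every occurrence.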

13 MapReduce at Google
A C++ library linked into user programs
Status of Implementation (OSDI '04)
  100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  Limited bisection bandwidth
  Storage is on local IDE disks
  GFS: distributed file system manages data (SOSP '03)
  Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines

14 Execution Overview*
How is this distributed?
  Partition input key/value pairs into chunks, run map() tasks in parallel
  After all map()s are complete, consolidate all emitted values for each unique emitted key
  Now partition space of output map keys, and run reduce() in parallel
If map() or reduce() fails, re-execute!
* Adapted from Google slides
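The execution steps above can be sketched on one machine, assuming Python with threads from `concurrent.futures` standing in for cluster workers. This illustrates the control flow only: Python threads do not give real CPU parallelism for this kind of work. All function names are hypothetical, word count is used as the job, and "re-execute" is modeled as a simple bounded retry.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_with_retry(task, arg, max_attempts=3):
    """Re-execute a failed task, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except Exception:
            if attempt == max_attempts:
                raise

def map_chunk(chunk):
    return [(word, 1) for line in chunk for word in line.split()]

def reduce_partition(items):
    return [(key, sum(values)) for key, values in sorted(items)]

def execute(lines, num_chunks=2, num_reducers=2, workers=4):
    # 1. Partition the input into chunks and run map() tasks in parallel.
    chunks = [lines[i::num_chunks] for i in range(num_chunks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_results = list(
            pool.map(lambda c: run_with_retry(map_chunk, c), chunks))
        # 2. After all maps finish, consolidate values per emitted key.
        grouped = defaultdict(list)
        for pairs in map_results:
            for key, value in pairs:
                grouped[key].append(value)
        # 3. Partition the key space and run reduce() tasks in parallel.
        partitions = [[] for _ in range(num_reducers)]
        for key, values in grouped.items():
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[index].append((key, values))
        reduce_results = list(
            pool.map(lambda p: run_with_retry(reduce_partition, p), partitions))
    return sorted(pair for part in reduce_results for pair in part)

totals = execute(["see bob throw", "see spot run"])
print(totals)
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]
```

Re-execution is safe here because `map_chunk` and `reduce_partition` are deterministic and have no side effects, so running a task twice gives the same result as running it once.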

15 Job Processing
Diagram: a JobTracker and six TaskTrackers (TaskTracker 0 through TaskTracker 5).
Client submits “grep” job, indicating code and input files
JobTracker breaks input file into k chunks (in this case 6) and assigns work to TaskTrackers.
After map(), TaskTrackers exchange map output to build the reduce() keyspace
JobTracker breaks reduce() keyspace into m chunks (in this case 6) and assigns work.
reduce() output may go to NDFS
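The chunking and assignment step can be sketched as follows, assuming Python. `split_into_chunks` and `assign_round_robin` are hypothetical helpers, the 100-byte input size is made up, and round-robin assignment is a deliberately simple stand-in: the slides do not say how the JobTracker chooses a TaskTracker for each chunk.

```python
def split_into_chunks(total_size, k):
    """Split the byte range [0, total_size) into k near-equal chunks."""
    base, extra = divmod(total_size, k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

def assign_round_robin(chunks, trackers):
    """Hand chunks to trackers in turn (a placeholder policy)."""
    assignment = {tracker: [] for tracker in trackers}
    for i, chunk in enumerate(chunks):
        assignment[trackers[i % len(trackers)]].append(chunk)
    return assignment

trackers = [f"TaskTracker {i}" for i in range(6)]
chunks = split_into_chunks(100, 6)  # k = 6, as in the diagram
assignment = assign_round_robin(chunks, trackers)
for tracker, work in assignment.items():
    print(tracker, work)
```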

16 Execution

17 Parallel Execution

18 Task Granularity and Pipelining
Fine granularity tasks: map tasks >> machines
  Minimizes time for fault recovery
  Can pipeline shuffling with map execution
  Better dynamic load balancing
Often use 200,000 map & reduce tasks running on 2000 machines
Why do map 1 and map 3 have different execution times?
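With 200,000 tasks on 2000 machines, each machine handles about 100 tasks on average, so a slow or failed task is a small fraction of any one machine's work. The load-balancing effect can be illustrated with a small simulation, assuming Python and a greedy scheduler that always gives the next task to the machine that becomes free first. The task durations are made-up numbers and `makespan` is a hypothetical helper; this is a sketch of the idea, not a model of Google's scheduler.

```python
import heapq

def makespan(durations, num_machines):
    """Finish time when each task goes to the earliest-free machine."""
    free_at = [0] * num_machines  # a list of equal values is a valid heap
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)

# Coarse granularity: one uneven task per machine (32 units of work).
coarse = [16, 8, 4, 4]
# Fine granularity: the same 32 units, each coarse task cut into 4 pieces.
fine = [4] * 4 + [2] * 4 + [1] * 8

print(makespan(coarse, 4))  # 16
print(makespan(fine, 4))    # 8
```

With coarse tasks, the machine holding the 16-unit task finishes last while the others sit idle. With fine tasks, idle machines keep picking up remaining pieces, and the job finishes at the ideal 32 / 4 = 8 units.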
