Published by Theodora Hodge. Modified over 9 years ago.
1
Problem-solving on large-scale clusters: theory and applications
Lecture 3: Bringing it all together
2
Today’s Outline
Course directions, projects, and feedback
Quiz 2
Context / Where we are
  – Why do we care about fold() and map()?
  – Why do we care about parallelization and data dependencies?
MapReduce architecture from 10,000 feet
3
Context and Review
Data dependencies determine whether a problem can be formulated in MapReduce
The properties of fold() and map() determine how to formulate a problem in MapReduce
How do you parallelize fold()? map()?
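One way to approach the closing question is sketched below, assuming the input fits in a Python list and that each chunk stands in for the work one machine would do. The helper names (`chunks`, `chunked_map`, `chunked_fold`) are illustrative and do not come from the lecture. The sketch suggests that map() splits freely because no element depends on another, while fold() splits safely only when the operator is associative and has an identity element.

```python
from functools import reduce
from operator import add


def chunks(items, size):
    """Split a list into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunked_map(f, items, size):
    """Apply f chunk by chunk. No element depends on another,
    so each chunk could run on a different machine."""
    mapped = [[f(x) for x in chunk] for chunk in chunks(items, size)]
    return [y for chunk in mapped for y in chunk]


def chunked_fold(op, identity, items, size):
    """Fold each chunk on its own, then fold the partial results.
    This matches a sequential left fold only when `op` is associative
    and `identity` is its identity element."""
    partials = [reduce(op, chunk, identity) for chunk in chunks(items, size)]
    return reduce(op, partials, identity)


data = list(range(1, 11))
squares = chunked_map(lambda x: x * x, data, 3)
total = chunked_fold(add, 0, data, 3)

# Subtraction is not associative, so the chunked fold disagrees
# with the sequential left fold over the same data.
sequential_sub = reduce(lambda a, b: a - b, data, 0)
chunked_sub = chunked_fold(lambda a, b: a - b, 0, data, 3)
```

The subtraction case is the data dependency in miniature: each step needs the exact running result of the previous one, so the work cannot be regrouped.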
4
MapReduce Introduction
MapReduce is both a programming model and a clustered computing system
  – A specific way of formulating a problem, which yields good parallelizability
  – A system which takes a MapReduce-formulated problem and executes it on a large cluster
Hides implementation details, such as hardware failures, grouping and sorting, scheduling …
Previous lectures have focused on MapReduce-the-problem-formulation
Today will mostly focus on MapReduce-the-system
5
MR Problem Formulation: Formal Definition
MapReduce:
    mapreduce f_m f_r l = map (reducePerKey f_r) (group (map f_m l))
    reducePerKey f_r (k, v_list) = (k, foldl (f_r k) [] v_list)
  – Assume map here is actually concatMap.
  – Argument l is a list of documents.
  – The result of the first map is a list of key-value pairs.
  – The function f_r takes 3 arguments: key, context, current. With currying, this allows for locking the value of “key” for each list during the fold.
MapReduce maps a fold over the sorted result of a map!
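The definition above can be transcribed almost line for line. The following is a minimal single-process sketch, assuming `group` sorts by key and collects values (the slide does not define it), and using a word count as a hypothetical instance of f_m and f_r.

```python
from functools import reduce
from itertools import groupby


def concat_map(f, items):
    """Apply f to every item and flatten the resulting lists."""
    return [pair for item in items for pair in f(item)]


def group(pairs):
    """Sort (key, value) pairs by key and collect each key's values."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [(k, [v for _, v in kvs])
            for k, kvs in groupby(ordered, key=lambda kv: kv[0])]


def reduce_per_key(f_r, keyed):
    """Left-fold one key's value list, starting from the empty list.
    f_r receives (key, context, current), as on the slide."""
    k, v_list = keyed
    return (k, reduce(lambda ctx, cur: f_r(k, ctx, cur), v_list, []))


def mapreduce(f_m, f_r, l):
    """map (reducePerKey f_r) (group (concatMap f_m l))"""
    return [reduce_per_key(f_r, kv) for kv in group(concat_map(f_m, l))]


# Hypothetical word-count instance, not taken from the slides.
def f_m(doc):
    return [(word, 1) for word in doc.split()]


def f_r(key, context, current):
    # The fold starts from [], so an empty context means a count of zero.
    return [(context[0] if context else 0) + current]


result = mapreduce(f_m, f_r, ["a b a", "b c"])
```

Binding the key inside `reduce_per_key` plays the role of the currying mentioned on the slide: the fold itself only ever sees a two-argument function.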
6
MR System Overview (1 of 2)
Map:
  – Preprocesses a set of files to generate intermediate key-value pairs
  – As parallelized as you want
Group:
  – Partitions intermediate key-value pairs by unique key, generating a list of all associated values
Reduce:
  – For each key, iterates over value list
  – Performs computation that requires context between iterations
  – Parallelizable amongst different keys, but not within one key
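The last point, that reduce parallelizes across keys but not within one, can be sketched as follows. This assumes a hypothetical reducer, `largest_rise`, that needs the previous value as context, and uses a thread pool only to show that per-key work is independent (threads in CPython will not speed up CPU-bound work).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_key(pairs):
    """Group step: collect all values that share a key, in arrival order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def largest_rise(values):
    """Reduce for one key: largest increase between consecutive values.
    Each step needs the previous value, so this loop is sequential."""
    best = 0
    previous = None
    for v in values:
        if previous is not None:
            best = max(best, v - previous)
        previous = v
    return best


def reduce_all(groups, workers=4):
    """Different keys share no state, so each key can go to its own worker."""
    keys = sorted(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda k: largest_rise(groups[k]), keys)
        return dict(zip(keys, results))


readings = [("a", 1), ("b", 10), ("a", 4), ("b", 7), ("a", 5)]
rises = reduce_all(group_by_key(readings))
```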
7
MR System Overview (2 of 2) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
8
Example: MapReduce DocInfo (1 of 2)
MapReduce:
    mapreduce f_m f_r l = map (reducePerKey f_r) (group (map f_m l))
    reducePerKey f_r (k, v_list) = (k, foldl (f_r k) [] v_list)
Pseudocode for f_m:
    -- one "spaces" pair per document, plus one pair per raw and per scrubbed word
    f_m contents = concat [
        [("spaces", count_spaces contents)],
        map (emit "raw") (split contents),
        map (emit "scrub") (scrub (split contents))]
    -- tag each value with its label and an initial count of 1
    emit label value = (label, (value, 1))
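A runnable reading of f_m might look like the sketch below. The slide leaves `count_spaces`, `split`, and `scrub` undefined, so the versions here are assumptions: spaces are counted literally, splitting is on whitespace, and scrubbing lowercases a word and strips surrounding punctuation.

```python
import string


def count_spaces(contents):
    """Assumed meaning: number of literal space characters."""
    return contents.count(" ")


def split(contents):
    """Assumed meaning: whitespace tokenization."""
    return contents.split()


def scrub(words):
    """Hypothetical scrubber: lowercase, strip surrounding punctuation,
    and drop words that become empty."""
    cleaned = (w.strip(string.punctuation).lower() for w in words)
    return [w for w in cleaned if w]


def emit(label, value):
    """Tag a value with its label and an initial count of 1."""
    return (label, (value, 1))


def f_m(contents):
    """Map function: one 'spaces' pair, then a pair per raw and scrubbed word."""
    words = split(contents)
    return ([("spaces", count_spaces(contents))]
            + [emit("raw", w) for w in words]
            + [emit("scrub", w) for w in scrub(words)])


pairs = f_m("Hi, hi!")
```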
9
Example: MapReduce DocInfo (2 of 2)
MapReduce:
    mapreduce f_m f_r l = map (reducePerKey f_r) (group (map f_m l))
    reducePerKey f_r (k, v_list) = (k, foldl (f_r k) [] v_list)
Pseudocode for f_r (arguments in foldl order: key, context, current):
    -- the fold starts from [], so "spaces" needs a case for the empty context
    f_r "spaces" []         count = [count]
    f_r "spaces" (total:xs) count = (total + count : xs)
    f_r "raw"    result (word, count) = update_result (word, count) result
    f_r "scrub"  result (word, count) = update_result (word, count) result
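The reducer can be sketched in the same style. The slide does not define `update_result`, so the version below is an assumption: it keeps a sorted list of (word, total) pairs and adds each incoming count to the matching word. Arguments follow the (key, context, current) order from the formal definition.

```python
from functools import reduce


def update_result(pair, result):
    """Hypothetical helper: add `count` to the running total for `word`.
    `result` is a list of (word, total) pairs; a sorted list is returned."""
    word, count = pair
    totals = dict(result)
    totals[word] = totals.get(word, 0) + count
    return sorted(totals.items())


def f_r(key, context, current):
    """Reduce step for one value; context starts as the empty list."""
    if key == "spaces":
        total = context[0] if context else 0
        return [total + current]
    if key in ("raw", "scrub"):
        return update_result(current, context)
    raise ValueError("unexpected key: " + repr(key))


def reduce_per_key(f_r, keyed):
    """reducePerKey from the slide: fold f_r over one key's values."""
    k, v_list = keyed
    return (k, reduce(lambda ctx, cur: f_r(k, ctx, cur), v_list, []))


spaces = reduce_per_key(f_r, ("spaces", [1, 2, 3]))
raw = reduce_per_key(f_r, ("raw", [("b", 1), ("a", 1), ("b", 1)]))
```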
10
Group Exercise
Formulate the following as MapReduces:
1. Find the set of unique words in a document
   a) Input: a bunch of words
   b) Output: all the unique words (no repeats)
2. Calculate per-employee taxes
   a) Input: a list of (employee, salary, month) tuples
   b) Output: a list of (employee, taxes due) pairs
3. Randomly reorder sentences
   a) Input: a bunch of documents
   b) Output: all sentences in random order (may include duplicates)
4. Compute the minesweeper grid/map
   a) Input: coordinates for the location of mines
   b) Output: coordinate/value pairs for all non-zero cells
Can you think of generalized techniques for decomposing problems?
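As one possible answer to exercise 4, and not necessarily the one intended in class, the map step can let every mine vote for its neighbours and the reduce step can sum the votes per cell. The sketch assumes a bounded grid of known `width` and `height`, and assumes that cells which themselves hold a mine are left out of the output; the `MINE` marker is a hypothetical device for telling the reducer which cells those are.

```python
from collections import defaultdict

MINE = "mine"  # hypothetical marker value: "this cell holds a mine"


def f_m(mine, width, height):
    """Map: mark the mine's own cell, then emit a 1 for each in-bounds neighbour."""
    x, y = mine
    pairs = [((x, y), MINE)]
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if (dx, dy) != (0, 0) and 0 <= nx < width and 0 <= ny < height:
                pairs.append(((nx, ny), 1))
    return pairs


def group(pairs):
    """Group: collect all values emitted for the same cell."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def f_r(cell, values):
    """Reduce: drop cells that hold a mine, otherwise sum the neighbour votes."""
    if MINE in values:
        return None
    return (cell, sum(values))


def minesweeper(mines, width, height):
    pairs = [p for m in mines for p in f_m(m, width, height)]
    reduced = (f_r(cell, values) for cell, values in group(pairs).items())
    return dict(r for r in reduced if r is not None)


grid = minesweeper([(0, 0), (1, 1)], 3, 3)
```

Cells with no neighbouring mine never receive a pair, so the "non-zero cells only" requirement falls out of the formulation without an explicit filter.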
11
MapReduce Parallelization: Execution Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
12
MapReduce Parallelization: Pipelining
Finely granular tasks: many more map tasks than machines
  – Better dynamic load balancing
  – Minimizes time for fault recovery
  – Can pipeline the shuffling/grouping while maps are still running
Example: 2000 machines -> 200,000 map + 5000 reduce tasks
Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
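The load-balancing claim can be illustrated with a toy scheduler. This is a simulation under stated assumptions, not a model of the real system: a free worker always pulls the next task, the machine speeds are invented, and one machine is a straggler running at half speed. The only figure taken from the slide is the ratio of 200,000 map tasks to 2000 machines.

```python
import heapq


def makespan(task_sizes, speeds):
    """Dynamic scheduling: whenever a worker becomes free it pulls the
    next task. Returns the time at which the last task finishes."""
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for size in task_sizes:
        t, i = heapq.heappop(free_at)      # earliest-free worker
        t += size / speeds[i]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


speeds = [1.0, 1.0, 1.0, 0.5]   # hypothetical cluster with one straggler
total_work = 400.0

# One big task per machine: the straggler's share decides the finish time.
coarse = makespan([total_work / 4] * 4, speeds)

# 100 small tasks per machine: fast machines absorb most of the work.
fine = makespan([total_work / 400] * 400, speeds)

# The slide's example works out to the same ratio of map tasks per machine.
tasks_per_machine = 200_000 // 2_000
```

The same granularity helps fault recovery: when a machine fails, only its small in-flight tasks need to be rerun, and they can be spread over many other machines.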
13
Example: MR DocInfo, revisited
Do MapReduce DocInfo in 2 passes (instead of 1), performing all the work in the “group” step
Map1:
1. Tokenize document
2. For each token, output:
   a) (“raw:<token>”, 1)
   b) (“scrubbed:<scrubbed token>”, 1)
Reduce1:
1. For each key, ignore the value list and output (key, 1)
Map2:
1. Tokenize document
2. For each token “type:value”, output (type, 1)
Reduce2:
1. For each key, output (key, (sum values))
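The two passes can be chained in a few lines. The sketch below assumes that pass 1 keys have the form "type:value" (as Map2 implies), that pass 2 reads the records written by Reduce1, and that `scrub` is a hypothetical lowercase-and-strip-punctuation function. The result is the number of distinct raw and scrubbed words.

```python
import string
from collections import defaultdict


def group(pairs):
    """Group step: collect the values for each key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def scrub(token):
    """Hypothetical scrubber: lowercase and strip surrounding punctuation."""
    return token.strip(string.punctuation).lower()


def map1(document):
    pairs = []
    for token in document.split():
        pairs.append(("raw:" + token, 1))
        cleaned = scrub(token)
        if cleaned:
            pairs.append(("scrubbed:" + cleaned, 1))
    return pairs


def reduce1(key, values):
    # Grouping already deduplicated the keys; the values are ignored.
    return (key, 1)


def map2(record):
    key, _ = record
    kind, _, _value = key.partition(":")   # "type:value" -> type
    return [(kind, 1)]


def reduce2(key, values):
    return (key, sum(values))


def doc_info(documents):
    pass1 = [reduce1(k, vs)
             for k, vs in group(p for d in documents for p in map1(d))]
    pass2 = [reduce2(k, vs)
             for k, vs in group(p for r in pass1 for p in map2(r))]
    return dict(pass2)


info = doc_info(["Hi, hi! Bye", "bye hi"])
```

Neither reducer keeps any state between values, which is the sense in which the group step does all the work here.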
14
Example: MR DocInfo, revisited
Of the 2 DocInfo MapReduce implementations, which is better?
Define “better”. What resources are you considering? Dev time? CPU? Network? Disk? Complexity? Reusability?
Figure key: Mapper, Reducer, GFS. Connections are network links. GFS is a cluster of storage machines.
15
Hadoop-as-MapReduce
    mapreduce f_m f_r l = map (reducePerKey f_r) (group (map f_m l))
    reducePerKey f_r (k, v_list) = (k, foldl (f_r k) [] v_list)
Hadoop:
1. The f_m and f_r are function objects (classes)
2. Class for f_m implements the Mapper interface
       map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
3. Class for f_r implements the Reducer interface
       reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
Hadoop takes the generated class files and manages running them
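The "function objects" idea can be shown with a Python analogy. This is not Hadoop's API: the classes below are hypothetical stand-ins that only mirror the shape of the signatures on the slide, the `Reporter` argument is left out, and `run_job` is a single-process substitute for the work Hadoop does across a cluster.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ListCollector:
    """Stand-in for an output collector: stores emitted pairs in a list."""

    def __init__(self):
        self.pairs = []

    def collect(self, key, value):
        self.pairs.append((key, value))


class Mapper(ABC):
    @abstractmethod
    def map(self, key, value, output):
        """Emit intermediate pairs for one input record."""


class Reducer(ABC):
    @abstractmethod
    def reduce(self, key, values, output):
        """Emit final pairs for one key and an iterator over its values."""


class WordCountMapper(Mapper):
    def map(self, key, value, output):
        for word in value.split():
            output.collect(word, 1)


class SumReducer(Reducer):
    def reduce(self, key, values, output):
        output.collect(key, sum(values))


def run_job(mapper, reducer, records):
    """Toy driver: run the mapper, group by key, then run the reducer."""
    intermediate = ListCollector()
    for key, value in records:
        mapper.map(key, value, intermediate)
    groups = defaultdict(list)
    for k, v in intermediate.pairs:
        groups[k].append(v)
    final = ListCollector()
    for k in sorted(groups):
        reducer.reduce(k, iter(groups[k]), final)
    return final.pairs


counts = run_job(WordCountMapper(), SumReducer(), [(0, "a b a"), (1, "b")])
```

Passing objects rather than bare functions lets the framework instantiate the classes itself on whichever machine runs the task.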
16
Bonus Materials: MR Runtime
The following slides illustrate an example run of MapReduce on a Google cluster
A sample job from the indexing pipeline processes ~900 GB of crawled pages
17
MR Runtime (1 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
18
MR Runtime (2 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
19
MR Runtime (3 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
20
MR Runtime (4 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
21
MR Runtime (5 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
22
MR Runtime (6 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
23
MR Runtime (7 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
24
MR Runtime (8 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
25
MR Runtime (9 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html