1 MapReduce Algorithm Design

2 Contents
Combiner and in-mapper combining
Complex keys and values
Relative frequency
Secondary Sorting

4 Combiner and in-mapper combining
Purpose
  Carry out local aggregation before the shuffle and sort phase
  Reduce the communication volume between the map and reduce stages

5 Combiner example (word count)
Use the reducer as the combiner
This is valid because integer addition is both associative and commutative
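The idea can be sketched in plain Java without any Hadoop dependency. The class and method names below (WordCountCombiner, reduce, combine) are made up for illustration; the point is only that the same summing function serves as both combiner and reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountCombiner {
    // The word-count reducer: sums all counts seen for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Groups one map task's (word, 1) output by word and applies reduce()
    // locally, i.e. uses the reducer as the combiner.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            partial.put(e.getKey(), reduce(e.getValue()));
        }
        return partial;
    }

    public static void main(String[] args) {
        // Output of two hypothetical map tasks.
        Map<String, Integer> m1 = combine(List.of("a", "b", "a"));
        Map<String, Integer> m2 = combine(List.of("b", "a"));
        // The reducer now receives partial sums instead of individual 1s.
        System.out.println("a -> " + reduce(List.of(m1.get("a"), m2.get("a"))));
        System.out.println("b -> " + reduce(List.of(m1.get("b"), m2.get("b"))));
    }
}
```

Because addition is associative and commutative, summing partial sums gives the same totals as summing the original 1s in any order or grouping.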

6 MapReduce with Combiner

7 MapReduce with combiner
map: (k1, v1) -> [(k2, v2)]
combine: (k2, [v2]) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]
The combiner's input and output key-value types must match the mapper's output key-value types

8 Combiner is an optimization, not a requirement
The combiner is optional
A particular implementation of the MapReduce framework may execute the combine method many times or not at all
Calling the combine method zero, one, or many times should produce the same output from the reducer
The correctness of a MapReduce program should not rely on the assumption that the combiner is always run
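One way to check this property is to simulate the job and vary how often the combiner runs. The following is a minimal plain-Java sketch, not a Hadoop program; CombinerRuns, sumByKey, and wordCount are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerRuns {
    // Sums values per key; serves as both combiner and reducer here.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Runs word count over the given input splits, applying the combiner
    // `combinerRuns` times to each map task's output before the reduce step.
    static Map<String, Integer> wordCount(List<List<String>> splits, int combinerRuns) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : split) out.add(Map.entry(w, 1));        // map
            for (int i = 0; i < combinerRuns; i++) {
                out = new ArrayList<>(sumByKey(out).entrySet());     // combine
            }
            shuffled.addAll(out);
        }
        return sumByKey(shuffled);                                   // reduce
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(List.of("a", "b", "a"), List.of("b", "a"));
        for (int runs : new int[] {0, 1, 3}) {
            System.out.println(runs + " combiner run(s): " + wordCount(splits, runs));
        }
    }
}
```

All three runs print the same counts, which is the behavior a correct combiner must have.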

9 In-method combining

10 Use Java Map to implement AssociativeArray
A Map is an object that maps keys to values
  A map cannot contain duplicate keys
  Each key can map to at most one value
The Java platform contains three general-purpose Map implementations: HashMap, TreeMap, and LinkedHashMap
Basic operations
  boolean containsKey(Object key)
  get(Object key)
  put(K key, V value)
  keySet(): returns a Set view of the keys contained in this map
  entrySet(): returns a Set view of the mappings contained in this map
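The operations listed above are enough to build a word counter. This is a small illustrative sketch; the class name AssociativeArrayDemo and the sample words are made up.

```java
import java.util.HashMap;
import java.util.Map;

class AssociativeArrayDemo {
    // Counts occurrences of each word using only the basic Map operations.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            if (counts.containsKey(w)) {
                counts.put(w, counts.get(w) + 1);   // key exists: replace its value
            } else {
                counts.put(w, 1);                   // first occurrence
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(new String[] {"to", "be", "or", "not", "to", "be"});
        // keySet() gives the distinct words.
        for (String key : counts.keySet()) {
            System.out.println("key: " + key);
        }
        // entrySet() gives each word together with its count.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

HashMap does not guarantee iteration order; TreeMap would iterate in sorted key order and LinkedHashMap in insertion order.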

11 In-mapper combining
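The figure for this slide is not reproduced in the transcript. As a rough sketch of the pattern, the class below keeps an associative array across calls to map() and emits only once at the end. It is a plain-Java stand-in, not a Hadoop Mapper subclass; the method names only mirror the setup/map/cleanup lifecycle of a Hadoop mapper, and the emitted list stands in for the framework's output collector.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InMapperCombiningMapper {
    private Map<String, Integer> counts;   // state kept across map() calls
    final List<Map.Entry<String, Integer>> emitted = new ArrayList<>();

    // Called once before the first input record.
    void setup() {
        counts = new HashMap<>();
    }

    // Called once per input line: update the local counts, emit nothing yet.
    void map(String line) {
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
    }

    // Called once after the last input record: emit one pair per distinct word.
    void cleanup() {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            emitted.add(Map.entry(e.getKey(), e.getValue()));
        }
    }

    public static void main(String[] args) {
        InMapperCombiningMapper m = new InMapperCombiningMapper();
        m.setup();
        m.map("a b a");
        m.map("b a");
        m.cleanup();
        System.out.println(m.emitted);   // two pairs instead of five (word, 1) pairs
    }
}
```

The trade-off is memory: the map task must hold one entry per distinct key until cleanup, so a real implementation may need to flush the array when it grows too large.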

12 Demo and some online resources
Demo on in-method and in-mapper combining
Online resources
  design-pattern-for-mapreduce-programming
  codingjunkie.net: blog by Bill Bejeck, who provides a lot of source code on MapReduce in Hadoop

14 Complex keys and values
Both keys and values can be complex data structures
  Pairs
  Stripes
Serialization and deserialization for complex structures
  After the map stage, structures need to be serialized to be written to storage
  Complex keys and values need to be deserialized on the reducer side

15 Motivation
An example: compute the mean of the values for each key

16 With combiner

17 Revised mapper

18 In-mapper combining
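The figures for slides 15 to 18 are not reproduced in the transcript. The usual difficulty with the mean is that a mean of partial means is not the overall mean, so an averaging reducer cannot simply be reused as the combiner. The sketch below assumes the revised mapper emits a complex value, a (sum, count) pair, which the combiner can add up safely; MeanWithCombiner and its methods are hypothetical names.

```java
import java.util.List;

class MeanWithCombiner {
    // Combiner: adds up partial {sum, count} pairs for one key.
    static long[] combine(List<long[]> partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new long[] {sum, count};
    }

    // Reducer: adds up the partial pairs, then divides once at the end.
    static double reduce(List<long[]> partials) {
        long[] total = combine(partials);
        return (double) total[0] / total[1];
    }

    public static void main(String[] args) {
        // Values for one key, split across two map tasks: {1, 2, 3} and {10}.
        // Each value v is emitted as the pair {v, 1}.
        long[] c1 = combine(List.of(new long[] {1, 1}, new long[] {2, 1}, new long[] {3, 1}));
        long[] c2 = combine(List.of(new long[] {10, 1}, new long[] {0, 0}));
        System.out.println("mean via (sum, count): " + reduce(List.of(c1, c2)));   // 16 / 4
        System.out.println("mean of partial means: " + (2.0 + 10.0) / 2);          // differs
    }
}
```

With in-mapper combining, the same (sum, count) pair would be accumulated per key inside the map task instead of in a separate combiner.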

19 A running example
Build word co-occurrence matrices for large corpora
  Build the word co-occurrence matrix for all the works by Shakespeare
Co-occurrence within a specific context
  A sentence
  A paragraph
  A document
  A certain window of m words

20 Pairs
For each map task
  For each unique pair (a, b), emit [pair (a, b), 1]
  Use a combiner or in-mapper combining to reduce the volume of intermediate pairs
Reducers sum up the counts associated with these pairs
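A minimal plain-Java sketch of the pairs approach follows. It assumes the context is a window of a fixed number of words on either side; the class name PairsCooccurrence and the use of a two-element List as the pair key are choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PairsCooccurrence {
    // Map step: emit the pair (a, b), with an implicit count of 1, for every
    // neighbor b that lies within `window` positions of a.
    static List<List<String>> map(String[] words, int window) {
        List<List<String>> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(words.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) pairs.add(List.of(words[i], words[j]));
            }
        }
        return pairs;
    }

    // Reduce step: sum the counts associated with each pair.
    static Map<List<String>, Integer> reduce(List<List<String>> pairs) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> p : pairs) counts.merge(p, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "to be or not to be".split(" ");
        System.out.println(reduce(map(words, 2)));
    }
}
```

Each pair is a small, fixed-size key, so memory use per record is low, but the number of intermediate records is large.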

21 Pairs approach

22 Stripes
Idea: group together pairs into an associative array
  (a, b) → 1
  (a, c) → 2
  (a, d) → 5
  (a, e) → 3
  (a, f) → 2
  become
  a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
Mappers emit [word, associative array]
Reducers perform an element-wise sum of associative arrays
    a → { b: 1, d: 5, e: 3 }
  + a → { b: 1, c: 2, d: 2, f: 2 }
  = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
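The element-wise sum performed by the reducer can be sketched with Java maps, using the two stripes for the word a from the slide. StripesSum is a hypothetical name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StripesSum {
    // Element-wise sum of all stripes (associative arrays) received for one word.
    static Map<String, Integer> sum(List<Map<String, Integer>> stripes) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> e : stripe.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> s1 = Map.of("b", 1, "d", 5, "e", 3);
        Map<String, Integer> s2 = Map.of("b", 1, "c", 2, "d", 2, "f", 2);
        System.out.println("a -> " + sum(List.of(s1, s2)));
        // prints: a -> {b=2, c=2, d=7, e=3, f=2}
    }
}
```

The same function can serve as a combiner, since element-wise addition is associative and commutative.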

23 Stripes approach

25 Relative frequency
What proportion of the time does B appear in the context of A?
  Whenever there is a co-occurrence (A, *), in what percentage of cases is * equal to B?
The total count of co-occurrences (A, *) is called the marginal
  (A, *) → 32
  The reducer holds this value in memory
Co-occurrence counts and the resulting relative frequencies
  (a, b1) → 3     (a, b1) → 3 / 32
  (a, b2) → 12    (a, b2) → 12 / 32
  (a, b3) → 7     (a, b3) → 7 / 32
  (a, b4) → 1     (a, b4) → 1 / 32
  (a, b5) → 4     (a, b5) → 4 / 32
  (a, b6) → 5     (a, b6) → 5 / 32

26 Relative frequency with stripes
It is easy
  One pass to compute (A, *)
  Another pass to directly compute f(B|A)
May have scalability issues for very large data
  The final associative array holds all the neighbors of the word A and their co-occurrence counts
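The two passes over a stripe can be sketched as follows, using the counts for a from the relative frequency slide (marginal 32). StripeRelativeFrequency is a hypothetical name, and the input is assumed to be the already summed stripe for one word.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class StripeRelativeFrequency {
    // Turns the summed stripe for word A into relative frequencies f(B|A).
    static Map<String, Double> relativeFrequencies(Map<String, Integer> stripe) {
        // First pass: compute the marginal (A, *).
        int marginal = 0;
        for (int count : stripe.values()) marginal += count;

        // Second pass: divide each count by the marginal.
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet()) {
            result.put(e.getKey(), (double) e.getValue() / marginal);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> stripe = new LinkedHashMap<>();
        stripe.put("b1", 3);
        stripe.put("b2", 12);
        stripe.put("b3", 7);
        stripe.put("b4", 1);
        stripe.put("b5", 4);
        stripe.put("b6", 5);
        System.out.println("a -> " + relativeFrequencies(stripe));
    }
}
```

Both passes need the whole stripe in memory, which is where the scalability concern comes from.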

27 Relative frequency with pairs
Must emit an extra (A, *) for every B (B can be any word) in the mapper
Must make sure all pairs (A, *) and (A, B) get sent to the same reducer (use a partitioner)
Must make sure (A, *) comes first (define the sort order)
Must hold state in the reducer across different key-value pairs
This pattern is called “order inversion”
For the mapper, the example is as follows
  Alice go to the wonderland to have fun
For the relative frequency, use the following example
  ((a, *), [12,4,3])
  ((a, b), [1,9])
  ((a, c), [2,3])
  ((a, d), [4])
  ((b, *), [7,12,6])
  ((b, b), [3,6])
  ((b, c), [2])
  ((b, d), [13,1])
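The reducer side of order inversion can be sketched in plain Java with the example data above. Everything here is a stand-in for what a framework would provide: KEY_ORDER plays the role of the custom sort order, partition the role of the custom partitioner, and the TreeMap simulates the sorted, grouped input to one reducer. All names are hypothetical.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OrderInversionReducer {
    // Sort order: by left word, then "*" before any other right word.
    static final Comparator<String[]> KEY_ORDER = (x, y) -> {
        int c = x[0].compareTo(y[0]);
        if (c != 0) return c;
        if (x[1].equals(y[1])) return 0;
        if (x[1].equals("*")) return -1;
        if (y[1].equals("*")) return 1;
        return x[1].compareTo(y[1]);
    };

    // Partition on the left word only, so (A, *) and (A, B) meet at one reducer.
    static int partition(String[] key, int numReducers) {
        return Math.floorMod(key[0].hashCode(), numReducers);
    }

    private int marginal;   // state held across reduce() calls
    final Map<String, Double> output = new LinkedHashMap<>();

    // Called once per key in sorted order, so (a, *) arrives before any (a, b).
    void reduce(String left, String right, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        if (right.equals("*")) {
            marginal = sum;   // remember the marginal for `left`
        } else {
            output.put("(" + left + ", " + right + ")", (double) sum / marginal);
        }
    }

    public static void main(String[] args) {
        // Keys inserted out of order on purpose; the TreeMap sorts them.
        Map<String[], List<Integer>> grouped = new TreeMap<>(KEY_ORDER);
        grouped.put(new String[] {"b", "d"}, List.of(13, 1));
        grouped.put(new String[] {"a", "c"}, List.of(2, 3));
        grouped.put(new String[] {"a", "*"}, List.of(12, 4, 3));
        grouped.put(new String[] {"b", "*"}, List.of(7, 12, 6));
        grouped.put(new String[] {"a", "b"}, List.of(1, 9));
        grouped.put(new String[] {"a", "d"}, List.of(4));
        grouped.put(new String[] {"b", "b"}, List.of(3, 6));
        grouped.put(new String[] {"b", "c"}, List.of(2));

        OrderInversionReducer reducer = new OrderInversionReducer();
        for (Map.Entry<String[], List<Integer>> e : grouped.entrySet()) {
            reducer.reduce(e.getKey()[0], e.getKey()[1], e.getValue());
        }
        System.out.println(reducer.output);
    }
}
```

For a, the marginal is 12 + 4 + 3 = 19, so (a, b) becomes 10 / 19; for b, the marginal is 25, so (b, b) becomes 9 / 25. The reducer never needs to buffer the pairs, only the current marginal.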

29 Secondary Sorting
A motivating example
The readings of m sensors are recorded over time
Grouped by sensor
  (m1, t1, r80521)
  (m1, t2, r21823)
  ……
  (m2, t1, r14209)
  (m2, t2, r66508)
  (m3, t1, r76042)
  (m3, t2, r98347)
Grouped by timestamp
  (t1, m1, r80521)
  (t1, m2, r14209)
  (t1, m3, r76042)
  ……
  (t2, m1, r21823)
  (t2, m2, r66508)
  (t2, m3, r98347)

30 First approach
Map each record to an intermediate pair keyed by sensor id
  (t1, m1, r80521) → (m1, (t1, r80521))
  (t1, m2, r14209) → (m2, (t1, r14209))
  (t1, m3, r76042) → (m3, (t1, r76042))
  ……
  (t2, m1, r21823) → (m1, (t2, r21823))
  (t2, m2, r66508) → (m2, (t2, r66508))
  (t2, m3, r98347) → (m3, (t2, r98347))
However, Hadoop MapReduce sorts intermediate pairs by key only
  Values can be arbitrarily ordered
  E.g., (m1, [(t100, r23456), (t2, r21823),…,(t234, r34870)])
Buffer the values in memory, then sort
  Is there an issue with this approach?

31 Second approach
Value-to-key conversion
  Move part of the value into the intermediate key to form a composite key
  Composite key: (m, t)
  Intermediate pair: ((m, t), r)
Let the execution framework do the sorting
  First sort by the sensor id, i.e., m (the left element in the key)
  Then sort by the timestamp, i.e., t (the right element in the key)
Implement a custom partitioner
  All pairs with the same sensor are shuffled to the same reducer
  ((m1, t1), r80521)
  ((m1, t2), r21823)
  ((m1, t3), r149625)
  ……
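Value-to-key conversion can be sketched in plain Java as a composite key, a comparator for the sort order, and a partition function that looks only at the sensor id. This simulates what the framework's sort and partitioner would do; SecondarySortDemo, Reading, KEY_ORDER, partition, and shuffleSort are hypothetical names, and timestamps are assumed to be integers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SecondarySortDemo {
    // Intermediate pair ((m, t), r): composite key (sensor, time) plus the reading.
    static final class Reading {
        final String sensor;
        final int time;
        final String value;

        Reading(String sensor, int time, String value) {
            this.sensor = sensor;
            this.time = time;
            this.value = value;
        }

        @Override
        public String toString() {
            return "((" + sensor + ", t" + time + "), " + value + ")";
        }
    }

    // Sort order: first by sensor id (string order), then by timestamp.
    static final Comparator<Reading> KEY_ORDER =
            Comparator.comparing((Reading r) -> r.sensor).thenComparingInt(r -> r.time);

    // Custom partitioner: uses the sensor id only, so every reading of one
    // sensor reaches the same reducer regardless of its timestamp.
    static int partition(Reading r, int numReducers) {
        return Math.floorMod(r.sensor.hashCode(), numReducers);
    }

    // Stand-in for the framework's sort on the composite key.
    static List<Reading> shuffleSort(List<Reading> mapOutput) {
        List<Reading> sorted = new ArrayList<>(mapOutput);
        sorted.sort(KEY_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Reading> mapOutput = List.of(
                new Reading("m1", 2, "r21823"),
                new Reading("m2", 1, "r14209"),
                new Reading("m1", 1, "r80521"),
                new Reading("m2", 2, "r66508"));
        for (Reading r : shuffleSort(mapOutput)) {
            System.out.println(r + " -> reducer " + partition(r, 2));
        }
    }
}
```

The reducer then sees each sensor's readings already in timestamp order and does not have to buffer and sort them in memory.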

