Published by Zachariah Poel. Modified over 2 years ago.

1
Map Reduce
Based on A. Rajaraman and J. D. Ullman. Mining Massive Data Sets. Cambridge U. Press, 2009; Ch. 2. Some figures “stolen” from their slides.

2
Big Data and Cluster Computing 1/2
What’s “big data”?
– Very large: think several terabytes.
– Often beyond the storage capacity of a single compute node.
– While there is no unique definition of big data, the kind we focus on here has two properties:
  – Enormous in size (common to all kinds of big data!)
  – Updates mostly take the form of appends; in-place updates are rare.

3
BD and CC 2/2
Some examples of big data:
– Web graph
– Social networks
Computations on big data are expensive:
– Computing PageRank: iterated matrix-vector products over tens of billions of web pages
– Finding your friends on Facebook (or other social networks): search over a graph with > 100M nodes and > 10B edges
– Similarity “search” in recommender systems
Some non-examples:
– A bank accounts database, no matter how large (why?)
– Any update-intensive (i.e., modify-intensive) database
– Online retail stores

4
Compute Node
[Figure: a compute node consisting of memory, disk, and CPU.]
“Big Data” typically far exceeds the capabilities of a compute node.

5
Cluster Computing – Distributed File System
[Figure: two racks of compute nodes, each node with memory, disk, and CPU; each rack has its own switch.]
– Each rack contains 16-64 nodes.
– 1 Gbps between any pair of nodes in a rack.
– 2-10 Gbps backbone between racks.
Examples: Google DFS; Hadoop DFS (Apache); CloudStore (open source; Kosmix, now part of Walmart Labs).

6
DFS
Divide up each file into chunks (e.g., 64 MB) and replicate the chunks in different racks (e.g., 3 times).
– Redundancy and resiliency against failures.
Divide up computation into independent tasks; if task T fails, it can be restarted without affecting tasks T′ ≠ T (the Map Reduce paradigm).
– Tolerant to hardware failure.
– Master file: where are the chunks of this file?
– The master file (MF) is itself replicated; the directory of the DFS keeps track of the MF copies.

7
Map Reduce Schematic
[Figure: input chunks feed the Map tasks, which emit (key, value) pairs (ki, vi); the master controller groups them by key into (k, [v1, ..., vm]); the Reduce tasks produce the combined output.]
Chunk = {elements}. E.g.: tuple, doc, tweet, ...
A map task may get > 1 chunk as input.

8
Example 1 – Count #words in a collection of documents
– Element = doc; key = word; value = frequency. Chunk = {docs}.
– Map task → (k, v) pairs, initially just (w, 1).
– Master controller: group by word (across output from the various Map tasks) and merge all values.
– Reduce task: each key is hashed to some Reduce task, which aggregates (in this case, sums) all values associated with that key. Output from all Reduce tasks is merged again.
– Typically, the Reduce function is associative and commutative, e.g., addition. Can “push” some of the Reduce functionality inside the Map tasks. (Reduce still needs to do its job.)
– #Map tasks and #Reduce tasks are decided by the user.
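The word-count flow above can be sketched in a few lines of single-process Python. This is only an illustration, not Hadoop code: the function names (`map_words`, `map_words_combined`, `group_by_key`, `reduce_sum`, `word_count`) are made up for this sketch, and tokenizing by lowercasing and splitting on whitespace is an assumption. The second mapper shows what “pushing” part of Reduce into Map might look like.

```python
from collections import Counter, defaultdict

def map_words(doc):
    """Map: emit (word, 1) for every word occurrence in one document."""
    for word in doc.lower().split():
        yield word, 1

def map_words_combined(doc):
    """Map with part of Reduce pushed in: pre-sum counts within one document."""
    yield from Counter(doc.lower().split()).items()

def group_by_key(pairs):
    """Stand-in for the master controller: gather all values for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sum(word, counts):
    """Reduce: sum all counts that arrived for one word."""
    return word, sum(counts)

def word_count(docs, mapper=map_words):
    """Run map, group-by-key, and reduce over a collection of documents."""
    pairs = (pair for doc in docs for pair in mapper(doc))
    return dict(reduce_sum(word, counts)
                for word, counts in group_by_key(pairs).items())

docs = ["the cat sat", "the dog sat down the hall"]
print(word_count(docs))
print(word_count(docs, mapper=map_words_combined))
```

Both calls should give the same counts; with the combining mapper, Reduce simply receives fewer, larger partial sums, which works because addition is associative and commutative.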

9
Example 2 – Matrix Vector Multiplication
– At the core of the PageRank computation.
– x_{n×1} = M_{n×n} × v_{n×1}, i.e., x_i = Σ_j m_ij × v_j; n ~ 10^10.
– M is extremely sparse (web link graph): ~10-15 non-zeros per row.
– Assume v will fit in memory. Then:
  – Map: (chunk of M, v) → pairs (i, m_ij × v_j); what is the key for terms in the sum expression for x_i?
  – Reduce: add up all values for a given key i → x_i.
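A minimal sketch of this Map and Reduce pair, assuming M is stored sparsely as (i, j, m_ij) triples with 0-based indices and v is an in-memory list. The names `map_entry` and `mat_vec` are hypothetical, and the grouping dictionary stands in for the master controller.

```python
from collections import defaultdict

def map_entry(entry, v):
    """Map: one nonzero (i, j, m_ij) of M becomes the pair (i, m_ij * v_j)."""
    i, j, m_ij = entry
    return i, m_ij * v[j]

def mat_vec(entries, v):
    """Multiply a sparse matrix, given as (i, j, m_ij) triples, by vector v."""
    groups = defaultdict(list)              # group by key i (the row index)
    for entry in entries:
        i, term = map_entry(entry, v)
        groups[i].append(term)
    # Reduce: add up all terms for row i to get x_i.
    return {i: sum(terms) for i, terms in groups.items()}

# M = [[1, 2], [0, 3]] stored sparsely, v = [10, 1]
entries = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
print(mat_vec(entries, [10, 1]))
```

Rows of M with no nonzero entries produce no key at all in this sketch, so their x_i is implicitly zero.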

10
Matrix Vector Mult – what if v won’t fit in memory?
[Figure: the matrix and the vector are divided into stripes, one color per stripe.]
Color = stripe. Each stripe of the matrix is divided up into chunks.
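One way to read the striping picture, following the scheme described in the Rajaraman and Ullman text, is that M is cut into vertical stripes and v into matching horizontal stripes, so a Map task working on one stripe of M only needs the corresponding piece of v. The sketch below assumes that reading; `striped_mat_vec` and the `width` parameter are hypothetical names, and indices are 0-based.

```python
from collections import defaultdict

def striped_mat_vec(entries, v, width):
    """Sparse matrix-vector product where each map task sees one stripe of v.

    entries: (i, j, m_ij) triples; width: number of columns per stripe.
    """
    # Partition the matrix entries by vertical stripe (column index // width).
    stripes = defaultdict(list)
    for i, j, m_ij in entries:
        stripes[j // width].append((i, j, m_ij))

    partial = defaultdict(list)
    for s, chunk in stripes.items():
        # A map task for stripe s loads only this slice of v into memory.
        v_stripe = v[s * width:(s + 1) * width]
        for i, j, m_ij in chunk:
            partial[i].append(m_ij * v_stripe[j - s * width])

    # Reduce is unchanged: sum all terms that share the row key i.
    return {i: sum(terms) for i, terms in partial.items()}

# M = [[1, 2, 0], [0, 3, 4]], v = [1, 10, 100], two columns per stripe
entries = [(0, 0, 1), (0, 1, 2), (1, 1, 3), (1, 2, 4)]
print(striped_mat_vec(entries, [1, 10, 100], width=2))
```

The key is still the row index i, so terms for the same x_i coming from different stripes meet at the same Reduce task.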

11
Relational Algebra
– Review RA (from any database text).
– We discuss the MR implementation of RA not because we want to implement a DBMS over MR, but because operations/computations over large networks can be captured using RA.
– Efficient MR implementation of RA ⇒ efficient implementation of a whole family of such computations over large networks.
– E.g., (node pairs connected by) paths of length two: PROJECT_{L1.From, L2.To}(RENAME(Link L1) JOIN_{L1.To=L2.From} RENAME(Link L2)).
– E.g., #friends of each user: GROUP-BY_{User: COUNT(Friend)}(Friends).

12
MR implementations of SELECT/PROJECT
SELECT_C(R):
– Map: for each tuple t, if t satisfies C, output (t, t).
– Reduce: identity, i.e., simply pass on each incoming (key, value) pair.
– Can extract the relation by taking just the value (or just the key!).
PROJECT_X(R):
– Map: for each tuple t, project it on X; let t′ = t[X], then output (t′, t′).
– Reduce: transform (t′, [t′, ..., t′]) into (t′, t′), i.e., duplicate elimination.
– Optimization: can throw out already-encountered duplicates early in Map; still need duplicate elimination in Reduce.
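As a rough illustration of the two operators, the sketch below runs them through a tiny in-memory driver. Everything here is an assumption made for the example: relations are lists of Python tuples, the attribute set X is given as a list of column positions, and `run_mapreduce`, `select`, and `project` are hypothetical names.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

def select(relation, condition):
    """SELECT_C(R): Map keeps tuples satisfying C; Reduce is the identity."""
    def mapper(t):
        if condition(t):
            yield t, t
    def reducer(t, values):
        for value in values:
            yield t, value
    return [value for _, value in run_mapreduce(relation, mapper, reducer)]

def project(relation, columns):
    """PROJECT_X(R): Map emits (t', t'); Reduce collapses duplicates of t'."""
    def mapper(t):
        t_proj = tuple(t[c] for c in columns)
        yield t_proj, t_proj
    def reducer(t_proj, values):
        yield t_proj, t_proj        # one output per key: duplicate elimination
    return [key for key, _ in run_mapreduce(relation, mapper, reducer)]

R = [(1, "a"), (2, "b"), (3, "a")]
print(select(R, lambda t: t[0] > 1))
print(project(R, [1]))
```

Note that `project` reads the result off the keys while `select` reads it off the values, which mirrors the remark that either half of the pair carries the relation.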

13
MR implementations of Set ops
Union/Intersection/Minus:
– Map: turn each tuple t in R into (t, R) and each tuple t in S into (t, S). Merging could create (t, [R]), (t, [S]), (t, [R, S]), or (t, [S, R]).
– Reduce: the action depends on the operation.
  – Union: turn any of those into (t, t).
  – Minus (R − S): turn (t, [R]) into (t, t) and everything else into (t, NULL).
  – Intersect: turn (t, [R, S]) or (t, [S, R]) into (t, t) and everything else into (t, NULL).
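The three set operations share one Map step and differ only in the Reduce rule, which a short sketch makes concrete. It assumes R and S are duplicate-free collections of hashable tuples, and it drops the (t, NULL) outputs instead of emitting them; `set_op` is a hypothetical name.

```python
from collections import defaultdict

def set_op(r, s, op):
    """Union, intersection, or difference (R minus S) in map/reduce style."""
    # Map + group: each tuple is tagged with the relation it came from.
    groups = defaultdict(list)
    for t in r:
        groups[t].append("R")
    for t in s:
        groups[t].append("S")

    # Reduce: decide per tuple, based only on the list of tags it collected.
    result = []
    for t, tags in groups.items():
        in_r, in_s = "R" in tags, "S" in tags
        if op == "union":
            keep = True
        elif op == "intersect":
            keep = in_r and in_s
        elif op == "minus":
            keep = in_r and not in_s
        else:
            raise ValueError(f"unknown operation: {op}")
        if keep:                    # tuples that would map to (t, NULL) are skipped
            result.append(t)
    return result

R = [(1,), (2,)]
S = [(2,), (3,)]
for op in ("union", "intersect", "minus"):
    print(op, set_op(R, S, op))
```

Testing membership of the tags, instead of comparing against [R, S] literally, handles both arrival orders of the merged list.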

14
MR implementations of Join
Natural join (the idea also works for equi-join). Consider, e.g., R(A,B) and S(B,C).
– Map: map each tuple (a, b) in R to (b, (R, a)) and each tuple (b, c) in S to (b, (S, c)). [Hadoop passes the output to Reduce tasks sorted on key.]
– Reduce: from each pair (b, [a set of pairs of the form (R, a) or (S, c)]) produce (b, [(a1, b, c1), (a1, b, c2), ..., (am, b, cn)]). The “value” of this key-value pair = the subset of the join with B = b.
– Boldface indicates a tuple of attributes/values.
– Typically, join selectivity is high, so the cost is close to linear in the total size of the two relations.
– What if a(nother) implementation of MR did not pass Map output sorted on key?
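A small sketch of this join, assuming R and S are lists of pairs and that each of A, B, and C is a single attribute. The name `natural_join` is hypothetical. The grouping here uses a hash table, so this particular sketch does not rely on Map output arriving sorted on key.

```python
from collections import defaultdict

def natural_join(r, s):
    """Join R(A, B) with S(B, C) on B using tagged map output."""
    # Map + group: key every tuple by its B value and tag its origin.
    groups = defaultdict(list)
    for a, b in r:
        groups[b].append(("R", a))
    for b, c in s:
        groups[b].append(("S", c))

    # Reduce: for each b, pair every R-side value with every S-side value.
    joined = []
    for b, tagged in groups.items():
        a_values = [x for tag, x in tagged if tag == "R"]
        c_values = [x for tag, x in tagged if tag == "S"]
        joined.extend((a, b, c) for a in a_values for c in c_values)
    return joined

R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("z", 20)]
print(natural_join(R, S))
```

Keys that appear on only one side (here "y" and "z") yield an empty cross product and contribute nothing to the result.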

15
MR Implementation of Groupby
Example: R(A, B, C). Want GB_{A: agg1(B1), ..., aggk(Bk)}(R).
– Map: map each tuple (a, b, c) to (a, b), where b = (b[1], ..., b[k]) is the tuple of values of the aggregated attributes B1, ..., Bk.
– Reduce: turn each (a, [b1, ..., bm]) into (a, agg1{b1[1], ..., bm[1]}, ..., aggk{b1[k], ..., bm[k]}).
– Optimization: if an agg is associative and commutative, and if the b’s associated with the same a are encountered together, can push some computation to Map.
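To make the Reduce step concrete, here is a minimal sketch in which the grouping attribute is the first field of each tuple and every remaining field has its own aggregate. That layout, the name `group_by`, and the use of Python's built-in `sum` and `max` as aggregates are assumptions for the example.

```python
from collections import defaultdict

def group_by(relation, aggregates):
    """GB_{A: agg1(B1), ..., aggk(Bk)}(R) for tuples shaped (a, b1, ..., bk)."""
    # Map + group: key by a, keep the tuple of aggregated attribute values.
    groups = defaultdict(list)
    for a, *rest in relation:
        groups[a].append(tuple(rest))

    # Reduce: apply the i-th aggregate to the i-th component of all b's.
    result = []
    for a, rows in groups.items():
        columns = zip(*rows)        # one sequence of values per aggregated attribute
        result.append((a, *(agg(col) for agg, col in zip(aggregates, columns))))
    return result

R = [("u", 1, 5), ("u", 2, 3), ("v", 4, 9)]
print(group_by(R, [sum, max]))      # SUM of the 2nd field, MAX of the 3rd
```

Since both `sum` and `max` are associative and commutative, partial results per Map task could be computed first and merged in Reduce, which is the optimization mentioned above.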

16
Matrix Mult via Join! 1/2
First, we will do this by composing two MR steps. View matrix M as M(I, J, V) triples and N as N(J, K, W) triples.
– Map: map M(i, j, m_ij) to (j, (M, i, m_ij)) and N(j, k, n_jk) to (j, (N, k, n_jk)).
– Reduce: from each (key, value) pair (j, [triples from M and from N]), and for each (M, i, m_ij) and (N, k, n_jk) in that set of triples, output (j, (i, k, m_ij × n_jk)).

17
Matrix Mult via Join 2/2
Second MR:
– Map: from each (key, value) pair (j, [(i1, k1, v1), ..., (ip, kp, vp)]), produce the (key, value) pairs ((i1, k1), v1), ..., ((ip, kp), vp).
– Reduce: for each (key, value) pair ((i, k), [v1, ..., vm]), produce the output ((i, k), v1 + ... + vm). This is the value in row i and column k of M × N.
Not the most efficient, but interesting: uses joins; composes MR like an algebraic operator!
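The two composed steps can be sketched as two small functions chained together. This assumes both matrices are given as sparse triples with 0-based indices, (i, j, m_ij) for M and (j, k, n_jk) for N; `join_step`, `sum_step`, and `mat_mul` are hypothetical names.

```python
from collections import defaultdict

def join_step(m, n):
    """First MR: join M and N on j and emit (j, (i, k, m_ij * n_jk))."""
    groups = defaultdict(lambda: ([], []))   # j -> (entries from M, entries from N)
    for i, j, m_ij in m:
        groups[j][0].append((i, m_ij))
    for j, k, n_jk in n:
        groups[j][1].append((k, n_jk))
    for j, (m_side, n_side) in groups.items():
        for i, m_ij in m_side:
            for k, n_jk in n_side:
                yield j, (i, k, m_ij * n_jk)

def sum_step(products):
    """Second MR: re-key each product by (i, k) and sum per key."""
    sums = defaultdict(int)
    for _j, (i, k, value) in products:
        sums[(i, k)] += value
    return dict(sums)

def mat_mul(m, n):
    """Compose the two steps, feeding the output of one into the other."""
    return sum_step(join_step(m, n))

# M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]]
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
N = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
print(mat_mul(M, N))
```

The output of `join_step` is exactly the input of `sum_step`, which is the sense in which MR steps compose like algebraic operators.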

18
Matrix Mult in one MR step
[Figure: M (p × q) × N (q × r) = product; row i of M, with entries m_ij, is combined with column k of N, with entries n_jk.]

19
Matrix Mult in one MR step
– Map: m_ij → ((i, 1), (M, j, m_ij)), ..., ((i, r), (M, j, m_ij)).
– Map: n_jk → ((1, k), (N, j, n_jk)), ..., ((p, k), (N, j, n_jk)).
– Reduce: ((i, k), (M, 1, m_i1)), ..., ((i, k), (M, q, m_iq)) and ((i, k), (N, 1, n_1k)), ..., ((i, k), (N, q, n_qk)) → ((i, k), Σ_j m_ij × n_jk).
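A sketch of the single-step version, assuming M is p × q and N is q × r, both given as sparse triples. Indices are 0-based here, whereas the slide counts from 1, and `mat_mul_one_step` is a hypothetical name. Each matrix entry is replicated to every output cell that needs it, and Reduce matches entries on j.

```python
from collections import defaultdict

def mat_mul_one_step(m, n, p, r):
    """Multiply M (p x q) by N (q x r) with a single map/reduce pass."""
    groups = defaultdict(list)
    # Map: m_ij is sent to every cell (i, k) in row i of the product.
    for i, j, m_ij in m:
        for k in range(r):
            groups[(i, k)].append(("M", j, m_ij))
    # Map: n_jk is sent to every cell (i, k) in column k of the product.
    for j, k, n_jk in n:
        for i in range(p):
            groups[(i, k)].append(("N", j, n_jk))

    # Reduce: for each cell, pair up the M and N entries that share j and sum.
    product = {}
    for (i, k), values in groups.items():
        m_row = {j: x for tag, j, x in values if tag == "M"}
        n_col = {j: x for tag, j, x in values if tag == "N"}
        product[(i, k)] = sum(m_row[j] * n_col[j] for j in m_row if j in n_col)
    return product

# M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]], so p = r = 2
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
N = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
print(mat_mul_one_step(M, N, p=2, r=2))
```

Compared with composing two steps, this version trades a second pass for more communication: each m_ij is copied r times and each n_jk is copied p times.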
