Download presentation

Presentation is loading. Please wait.

Published byReanna Steward Modified about 1 year ago

1
© 2013 A. Haeberlen, Z. Ives NETS 212: Scalable and Cloud Computing 1 University of Pennsylvania Graph algorithms in MapReduce October 15, 2013

2
© 2013 A. Haeberlen, Z. Ives Announcements No class on October 22nd or 24th Andreas at IMC in Barcelona Please work on HW2 and HW3 (will be released this week) Special 'catch-up' class on October 30th 4:30-6:00pm, Location TBA Any questions about HW2? If you haven't started yet: Please start early!!! 2 University of Pennsylvania

3
© 2013 A. Haeberlen, Z. Ives What we have seen so far In the first half of the semester, we saw how the map/reduce model could be used to filter, collect, and aggregate data values This is useful for data with limited structure We could extract pieces of input data items and collect them to run various reduce operations We could “join” two different data sets on a common key But that’s not enough… 3

4
© 2013 A. Haeberlen, Z. Ives Beyond average/sum/count Much of the world is a network of relationships and shared features Members of a social network can be friends, and may have shared interests / memberships / etc. Customers might view similar movies, and might even be clustered by interest groups The Web consists of documents with links Documents are also related by topics, words, authors, etc. 4

5
© 2013 A. Haeberlen, Z. Ives Goal: Develop a toolbox We need a toolbox of algorithms useful for analyzing data that has both relationships and properties For the next ~2 lectures we’ll start to build this toolbox Some of the problems are studied in courses you may not have taken yet: CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems) So we’ll see both the traditional solution and the MapReduce one 5

6
© 2013 A. Haeberlen, Z. Ives Plan for today Representing data in graphs Graph algorithms in MapReduce Computation model Iterative MapReduce A toolbox of algorithms Single-source shortest path (SSSP) k-means clustering Classification with Naïve Bayes 6 University of Pennsylvania NEXT

7
© 2013 A. Haeberlen, Z. Ives Thinking about related objects We can represent related objects as a labeled, directed graph Entities are typically represented as nodes; relationships are typically edges Nodes all have IDs, and possibly other properties Edges typically have values, possibly IDs and other properties 7 fan-of friend-of fan-of Alice Sunita Jose Mikhail Magna Carta Facebook Images by Jojo Mendoza, Creative Commons licensed

8
© 2013 A. Haeberlen, Z. Ives Encoding the data in a graph Recall basic definition of a graph: G = (V, E) where V is vertices, E is edges of the form (v 1,v 2 ) where v 1,v 2 V Assume we only care about connected vertices Then we can capture a graph simply as the edges... or as an adjacency list: v i goes to [v j, v j+1, … ] 8 Alice Sunita Jose Mikhail Magna Carta Facebook

9
© 2013 A. Haeberlen, Z. Ives Graph encodings: Set of edges (Alice, Facebook) (Alice, Sunita) (Jose, Magna Carta) (Jose, Sunita) (Mikhail, Facebook) (Mikhail, Magna Carta) (Sunita, Facebook) (Sunita, Alice) (Sunita, Jose) 9 Alice Sunita Jose Mikhail Magna Carta Facebook

10
© 2013 A. Haeberlen, Z. Ives Graph encodings: Adding edge types (Alice, fan-of, Facebook) (Alice, friend-of, Sunita) (Jose, fan-of, Magna Carta) (Jose, friend-of, Sunita) (Mikhail, fan-of, Facebook) (Mikhail, fan-of, Magna Carta) (Sunita, fan-of, Facebook) (Sunita, friend-of, Alice) (Sunita, friend-of, Jose) 10 Alice Sunita Jose Mikhail Magna Carta Facebook fan-of friend-of fan-of

11
© 2013 A. Haeberlen, Z. Ives Graph encodings: Adding weights (Alice, fan-of, 0.5, Facebook) (Alice, friend-of, 0.9, Sunita) (Jose, fan-of, 0.5, Magna Carta) (Jose, friend-of, 0.3, Sunita) (Mikhail, fan-of, 0.8, Facebook) (Mikhail, fan-of, 0.7, Magna Carta) (Sunita, fan-of, 0.7, Facebook) (Sunita, friend-of, 0.9, Alice) (Sunita, friend-of, 0.3, Jose) 11 Alice Sunita Jose Mikhail Magna Carta Facebook fan-of friend-of fan-of 0.5 0.9 0.7 0.3 0.8 0.7 0.5

12
© 2013 A. Haeberlen, Z. Ives Recap: Related objects We can represent the relationships between related objects as a directed, labeled graph Vertices represent the objects Edges represent relationships We can annotate this graph in various ways Add labels to edges to distinguish different types Add weights to edges... We can encode the graph in various ways Examples: Edge set, adjacency list 12

13
© 2013 A. Haeberlen, Z. Ives Plan for today Representing data in graphs Graph algorithms in MapReduce Computation model Iterative MapReduce A toolbox of algorithms Single-source shortest path (SSSP) k-means clustering Classification with Naïve Bayes 13 University of Pennsylvania NEXT

14
© 2013 A. Haeberlen, Z. Ives A computation model for graphs Once the data is encoded in this way, we can perform various computations on it Simple example: Which users are their friends' best friend? More complicated examples (later): Page rank, adsorption,... This is often done by annotating the vertices with additional information, and propagating the information along the edges "Think like a vertex"! 14 University of Pennsylvania Alice Sunita Jose Mikhail Magna Carta Facebook fan-of friend-of fan-of 0.5 0.9 0.7 0.3 0.8 0.7 0.5

15
© 2013 A. Haeberlen, Z. Ives A computation model for graphs Example: Am I my friends' best friend? Step #1: Discard irrelevant vertices and edges 15 University of Pennsylvania Alice Sunita Jose Mikhail Magna Carta Facebook fan-of friend-of fan-of 0.5 0.9 0.7 0.3 0.8 0.7 0.5 Slightly more technical: How many of my friends have me as their best friend?

16
© 2013 A. Haeberlen, Z. Ives A computation model for graphs Example: Am I my friends' best friend? Step #1: Discard irrelevant vertices and edges Step #2: Annotate each vertex with list of friends Step #3: Push annotations along each edge 16 University of Pennsylvania Alice Sunita Jose Mikhail friend-of 0.9 0.3 sunita alice: 0.9 sunita jose: 0.3 jose sunita: 0.3 alice sunita: 0.9

17
© 2013 A. Haeberlen, Z. Ives A computation model for graphs Example: Am I my friends' best friend? Step #1: Discard irrelevant vertices and edges Step #2: Annotate each vertex with list of friends Step #3: Push annotations along each edge 17 University of Pennsylvania Alice Sunita Jose Mikhail friend-of 0.9 0.3 sunita alice: 0.9 sunita jose: 0.3 jose sunita: 0.3 alice sunita: 0.9 sunita alice: 0.9 sunita jose: 0.3 jose sunita: 0.3 sunita alice: 0.9 sunita jose: 0.3 alice sunita: 0.9

18
© 2013 A. Haeberlen, Z. Ives A computation model for graphs Example: Am I my friends' best friend? Step #1: Discard irrelevant vertices and edges Step #2: Annotate each vertex with list of friends Step #3: Push annotations along each edge Step #4: Determine result at each vertex 18 University of Pennsylvania Alice Sunita Jose Mikhail friend-of 0.9 0.3 sunita alice: 0.9 sunita jose: 0.3 jose sunita: 0.3 alice sunita: 0.9 sunita alice: 0.9 sunita jose: 0.3 jose sunita: 0.3 sunita alice: 0.9 sunita jose: 0.3 alice sunita: 0.9

19
© 2013 A. Haeberlen, Z. Ives Can we do this in MapReduce? Using adjacency list representation? 19 map(key: node, value: [ ]) { } reduce(key: ________, values: list of _________) { }

20
© 2013 A. Haeberlen, Z. Ives Can we do this in MapReduce? Using single-edge data representation? 20 map(key: node, value: ) { } reduce(key: ________, values: list of _________) { }

21
© 2013 A. Haeberlen, Z. Ives A real-world use case A variant that is actually used in social networks today: "Who are the friends of multiple of my friends?" Where have you seen this before? Friend recommendation! Maybe these people should be my friends too! 21

22
© 2013 A. Haeberlen, Z. Ives Generalizing… Now suppose we want to go beyond direct friend relationships Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's best friend? What do we need to do? How about distance k>2? To compute the answer, we need to run multiple iterations of MapReduce! 22

23
© 2013 A. Haeberlen, Z. Ives Iterative MapReduce The basic model: Note that reduce output must be compatible with the map input! What can happen if we filter out some information in the mapper or in the reducer? 23 copy files from input dir staging dir 1 (optional: do some preprocessing) while (!terminating condition) { map from staging dir 1 reduce into staging dir 2 move files from staging dir 2 staging dir1 } (optional: postprocessing) move files from staging dir 2 output dir

24
© 2013 A. Haeberlen, Z. Ives Graph algorithms and MapReduce A centralized algorithm typically traverses a tree or a graph one item at a time (there’s only one “cursor”) You’ve learned breadth-first and depth-first traversals Most algorithms that are based on graphs make use of multiple map/reduce stages processing one “wave” at a time Sometimes iterative MapReduce, other times chains of map/reduce 24

25
© 2013 A. Haeberlen, Z. Ives Recap: MapReduce on graphs Suppose we want to: compute a function for each vertex in a graph...... using data from vertices at most k hops away We can do this as follows: "Push" information along the edges "Think like a vertex" Finally, perform the computation at each vertex May need more than one MapReduce phase Iterative MapReduce: Outputs of stage i inputs of stage i+1 25 University of Pennsylvania

26
© 2013 A. Haeberlen, Z. Ives Plan for today Representing data in graphs Graph algorithms in MapReduce Computation model Iterative MapReduce A toolbox of algorithms Single-source shortest path (SSSP) k-means clustering Classification with Naïve Bayes 26 University of Pennsylvania NEXT

27
© 2013 A. Haeberlen, Z. Ives Path-based algorithms Sometimes our goal is to compute information about the paths (sets of paths) between nodes Edges may be annotated with cost, distance, or similarity Examples of such problems (see CIS 121+320): Shortest path from one node to another Minimum spanning tree (minimal-cost tree connecting all vertices in a graph) Steiner tree (minimal-cost tree connecting certain nodes) Topological sort (node in a DAG comes before all nodes it points to) 27

28
© 2013 A. Haeberlen, Z. Ives Single-Source Shortest Path (SSSP) 28 0 s ? ? ? ? a b c d 10 5 2 3 1 9 7 2 4 6 Given a directed graph G = (V, E) in which each edge e has a cost c(e): Compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')

29
© 2013 A. Haeberlen, Z. Ives SSSP: Intuition We can formulate the problem using induction The shortest path follows the principle of optimality: the last step (u,v) makes use of the shortest path to u We can express this as follows: 29 bestDistanceAndPath(v) { if (v == source) then { return } else { find argmin_u (bestDistanceAndPath[u] + dist[u,v]) return } }

30
© 2013 A. Haeberlen, Z. Ives SSSP: CIS 320-style solution Traditional approach: Dijkstra's algorithm 30 V: vertices, E: edges, S: start node foreach v in V dist_S_to[v] := infinity predecessor[v] = nil spSet = {} Q := V while (Q not empty) do u := Q.removeNodeClosestTo(S) spSet := spSet + {u} foreach v in V where (u,v) in E if (dist_S_To[v] > dist_S_To[u]+cost(u,v)) then dist_S_To[v] = dist_S_To[u] + cost(u,v) predecessor[v] = u Initialize length and last step of path to default values Update length and path based on edges radiating from u

31
© 2013 A. Haeberlen, Z. Ives SSSP: Dijkstra in Action 31 0 s ∞ ∞ ∞ ∞ a b c d 10 5 2 3 1 9 7 2 4 6 Q = {s,a,b,c,d} spSet = {} dist_S_To: {(a,∞), (b,∞), (c,∞), (d,∞)} predecessor: {(a,nil), (b,nil), (c,nil), (d,nil)} Example from CLR 2nd ed. p. 528

32
© 2013 A. Haeberlen, Z. Ives SSSP: Dijkstra in Action 32 0 s 10 5 ∞ ∞ a b c d 5 2 3 1 9 7 2 4 6 Q = {a,b,c,d} spSet = {s} dist_S_To: {(a,10), (b,∞), (c,5), (d,∞)} predecessor: {(a,s), (b,nil), (c,s), (d,nil)} Example from CLR 2nd ed. p. 528

33
© 2013 A. Haeberlen, Z. Ives SSSP: Dijkstra in Action 33 0 s 8 5 14 7 a b c d 10 5 2 3 1 9 7 2 4 6 Q = {a,b,d} spSet = {c,s} dist_S_To: {(a,8), (b,14), (c,5), (d,7)} predecessor: {(a,c), (b,c), (c,s), (d,c)} Example from CLR 2nd ed. p. 528

34
© 2013 A. Haeberlen, Z. Ives SSSP: Dijkstra in Action 34 0 s 8 5 13 7 a b c d 10 5 2 3 1 9 7 2 4 6 Q = {a,b} spSet = {c,d,s} dist_S_To: {(a,8), (b,13), (c,5), (d,7)} predecessor: {(a,c), (b,d), (c,s), (d,c)} Example from CLR 2nd ed. p. 528

35
© 2013 A. Haeberlen, Z. Ives SSSP: Dijkstra in Action 35 0 s 8 5 9 7 a b c d 10 5 2 3 1 9 7 2 4 6 Q = {b} spSet = {a,c,d,s} dist_S_To: {(a,8), (b,9), (c,5), (d,7)} predecessor: {(a,c), (b,a), (c,s), (d,c)} Example from CLR 2nd ed. p. 528

36
© 2013 A. Haeberlen, Z. Ives SSSP: Dijkstra in Action 36 0 s 8 5 9 7 a b c d 10 5 2 3 1 9 7 2 4 6 Q = {} spSet = {a,b,c,d,s} dist_S_To: {(a,8), (b,9), (c,5), (d,7)} predecessor: {(a,c), (b,a), (c,s), (d,c)} Example from CLR 2nd ed. p. 528

37
© 2013 A. Haeberlen, Z. Ives SSSP: How to parallelize? Dijkstra traverses the graph along a single route at a time, prioritizing its traversal to the next step based on total path length (and avoiding cycles) No real parallelism to be had here! Intuitively, we want something that “radiates” from the origin, one “edge hop distance” at a time Each step outwards can be done in parallel, before another iteration occurs - or we are done Recall our earlier discussion: Scalability depends on the algorithm, not (just) on the problem! 37 s 0 ? ? ? ? 0 ? ? 0 ? ? ? ? 0 ? ? ? ?

38
© 2013 A. Haeberlen, Z. Ives SSSP: Revisiting the inductive definition Dijkstra’s algorithm carefully considered each u in a way that allowed us to prune certain points Instead we can look at all potential u’s for each v Compute iteratively, by keeping a “frontier set” of u nodes i edge-hops from the source 38 bestDistanceAndPath(v) { if (v == source) then { return } else { find argmin_u (bestDistanceAndPath[u] + dist[u,v]) return } }

39
© 2013 A. Haeberlen, Z. Ives SSSP: MapReduce formulation init: For each node, node ID }> map: take node ID }> For each succ-node-ID: emit succ-node ID { } emit node ID distance,{ } reduce: distance := min cost from a predecessor; next := that predec. emit node ID }> Repeat until no changes Postprocessing: Remove adjacency lists 39 Why is this necessary? The shortest path we have found so far from the source to nodeID has length ...... and here is the adjacency list for nodeID This is a new path from the source to succ-node-ID that we just discovered (not necessarily shortest)... this is the next hop on that path...

40
© 2013 A. Haeberlen, Z. Ives Iteration 0: Base case 40 mapper: (a, ) (c, ) edges reducer: (a, ) (c, ) "Wave" 0 s ∞ ∞ ∞ ∞ a b c d 10 5 2 3 1 9 7 2 4 6

41
© 2013 A. Haeberlen, Z. Ives Iteration 1 41 mapper: (a, ) (c, ) (a, ) (c, ) (b, ) (b, ) (d, ) edges reducer: (a, ) (c, ) (b, ) (d, ) 0 10 5 ∞ ∞ 5 2 3 9 7 4 6 s a b c d 1 2 "Wave"

42
© 2013 A. Haeberlen, Z. Ives Iteration 2 42 mapper: (a, ) (c, ) (a, ) (c, ) (b, ) (b, ) (d, ) (b, ) (d, ) edges reducer: (a, ) (c, ) (b, ) (d, ) 0 8 5 11 7 10 5 2 3 9 7 4 6 s a b c d 1 2 "Wave"

43
© 2013 A. Haeberlen, Z. Ives Iteration 3 43 mapper: (a, ) (c, ) (a, ) (c, ) (b, ) (b, ) (d, ) (b, ) (d, ) edges reducer: (a, ) (c, ) (b, ) (d, ) No change! Convergence! Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged? 0 8 5 11 7 10 5 2 3 9 7 4 6 s a b c d 1 2

44
© 2013 A. Haeberlen, Z. Ives Summary: SSSP Path-based algorithms typically involve iterative map/reduce They are typically formulated in a way that traverses in “waves” or “stages”, like breadth- first search This allows for parallelism They need a way to test for convergence Example: Single-source shortest path (SSSP) Original Dijkstra formulation is hard to parallelize But we can make it work with the "wave" approach 44

45
© 2013 A. Haeberlen, Z. Ives Plan for today Representing data in graphs Graph algorithms in MapReduce Computation model Iterative MapReduce A toolbox of algorithms Single-source shortest path (SSSP) k-means clustering Classification with Naïve Bayes 45 University of Pennsylvania NEXT

46
© 2013 A. Haeberlen, Z. Ives Learning (clustering / classification) Sometimes our goal is to take a set of entities, possibly related, and group them If the groups are based on similarity, we call this clustering If the groups are based on putting them into a semantically meaningful class, we call this classification Both are instances of machine learning 46

47
© 2013 A. Haeberlen, Z. Ives The k-clustering Problem Given: A set of items in a n-dimensional feature space Example: data points from survey, people in a social network Goal: Group the items into k “clusters” What would be a 'good' set of clusters? 47 Age Expenses Items Clusters

48
© 2013 A. Haeberlen, Z. Ives Approach: k-Means Let m 1, m 2, …, m k be representative points for each of our k clusters Specifically: the centroid of the cluster Initialize m 1, m 2, …, m k to random values in the data For t = 1, 2, …: Map each observation to the closest mean Assign the m i to be a new centroid for each set 48

49
© 2013 A. Haeberlen, Z. Ives A simple example (1/4) 49 Age Expenses (10,10) (15,12) (11,16) (18,20) (30,21) (20,21)

50
© 2013 A. Haeberlen, Z. Ives A simple example (2/4) 50 Age Expenses (10,10) (15,12) (11,16) (18,20) (30,21) (20,21) Randomly chosen initial centers

51
© 2013 A. Haeberlen, Z. Ives A simple example (3/4) 51 Age Expenses (10,10) (15,12) (11,16) (18,20) (30,21) (20,21) (19.75,19.5) (12.5,11)

52
© 2013 A. Haeberlen, Z. Ives A simple example (4/4) 52 Age Expenses (10,10) (15,12) (11,16) (18,20) (30,21) (20,21) (22.67,20.67) (12,12.67) Stable!

53
© 2013 A. Haeberlen, Z. Ives k-Means in MapReduce Map #1: Input: node ID Compute nearest centroid; emit centroid ID Reduce #1: Recompute centroid position from positions of nodes in it Emit centroidID and for all other centroid IDs, emit otherCentroidID centroid(centroidID,X,Y) Each centroid will need to know where all the other centroids are Map #2: Pass through values to Reducer #2 Reduce #2: For each node in the current centroid, emit node ID Input for the next map iteration Also, emit > This will be the 'result' (remember that we wanted the centroids!) Repeat until no change 53

54
© 2013 A. Haeberlen, Z. Ives Plan for today Representing data in graphs Graph algorithms in MapReduce Computation model Iterative MapReduce A toolbox of algorithms Single-source shortest path (SSSP) k-means clustering Classification with Naïve Bayes 54 University of Pennsylvania NEXT

55
© 2013 A. Haeberlen, Z. Ives Classification Suppose we want to learn what is spam (or interesting, or …) Predefine a set of classes with semantic meaning Train an algorithm to look at data and assign a class Based on giving it some examples of data in each class … and the sets of features they have Many probabilistic techniques exist Each class has probabilistic relationships with others e.g., p (spam | isSentLocally), p (isSentLocally | fromBob), … Typically represented as a graph(ical model)! See CIS 520 But we’ll focus on a simple, “flat” model: Naïve Bayes 55

56
© 2013 A. Haeberlen, Z. Ives A simple example Suppose we just look at the keywords in the email's title: Message(1, “Won contract”) Message(2, “Won award”) Message(3, "Won the lottery") Message(4, “Unsubscribe”) Message(5, "Millions of customers") Message(6, "Millions of dollars") What is probability message "Won Millions" is ? p(spam|containsWon,containsMillions) = p(spam) p(containsWon,containsMillions |spam) p(containsWon,containsMillions) 56 Bayes’ Theorem

57
© 2013 A. Haeberlen, Z. Ives Classification using Naïve Bayes Basic assumption: Probabilities of events are independent This is why it is called 'naïve' Under this assumption, p(spam) p(containsWon,containsMillions | spam) p(containsWon,containsMillions) = p(spam) p(containsWon | spam) p(containsMillions | spam) p(containsWon) p(containsMillions) = 0.5 * 0.67 * 0.33 / (0.5 * 0.33) = 0.67 So how do we “train” a learner (compute the above probabilities) using MapReduce? 57

58
© 2013 A. Haeberlen, Z. Ives What do we need to train the learner? p(spam) Count how many spam emails there are Count total number of emails p(containsXYZ | spam) Count how many spam emails contain XYZ Count how many emails contain XYZ overall p(containsXYZ) Count how many emails contain XYZ overall Count total number of emails 58 University of Pennsylvania Easy 1 2 2

59
© 2013 A. Haeberlen, Z. Ives Training a Naïve Bayes Learner map 1: takes messageId emits 1 reduce 1: emits map 2: takes messageId -> emits word 1 reduce 2: emits word 59 Count how many emails in the class contain the word (modified WordCount) Count how many emails contain the word overall (WordCount)

60
© 2013 A. Haeberlen, Z. Ives Summary: Learning and MapReduce Clustering algorithms typically have multiple aggregation stages or iterations k-means clustering repeatedly computes centroids, maps items to them Fixpoint computation Classification algorithms can be quite complex In general: need to capture conditional probabilities Naïve Bayes assumes everything is independent Training is a matter of computing probability distribution Can be accomplished using two Map/Reduce passes 60

61
© 2013 A. Haeberlen, Z. Ives Stay tuned Next time you will learn about: PageRank and Adsorption 61 University of Pennsylvania

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google