1 Graph Algorithms with MapReduce. Chapter 5. Thanks to Jimmy Lin for the slides.
2 Topics. Introduction to graph algorithms and graph representations. Single Source Shortest Path (SSSP) problem. Refresher: Dijkstra’s algorithm. Breadth-First Search with MapReduce. PageRank.
3 What’s a graph? G = (V, E), where V represents the set of vertices (nodes) and E represents the set of edges (links). Both vertices and edges may contain additional information. Different types of graphs: directed vs. undirected edges; presence or absence of cycles. Graphs are everywhere: hyperlink structure of the Web; physical structure of computers on the Internet; interstate highway system; social networks.
4 Some Graph Problems. Finding shortest paths: routing Internet traffic and UPS trucks. Finding minimum spanning trees: telco laying down fiber. Finding max flow: airline scheduling. Identifying “special” nodes and communities: breaking up terrorist cells, spread of avian flu. Bipartite matching: Monster.com, Match.com. And of course... PageRank.
5 Graphs and MapReduce. Graph algorithms typically involve: performing computation at each node; processing node-specific data, edge-specific data, and link structure; traversing the graph in some manner. Key questions: How do you represent graph data in MapReduce? How do you traverse a graph in MapReduce?
6 Representing Graphs. G = (V, E) is a poor representation for computational purposes. Two common representations: adjacency matrix; adjacency list.
7 Adjacency Matrices. Represent a graph as an n x n square matrix M, where n = |V|. Mij = 1 means a link from node i to node j. [Figure: example four-node graph and its adjacency matrix]
8 Adjacency Matrices: Critique. Advantages: naturally encapsulates iteration over nodes; rows and columns correspond to outlinks and inlinks, respectively. Disadvantages: lots of zeros for sparse matrices; lots of wasted space.
9 Adjacency Lists. Take adjacency matrices… and throw away all the zeros. Example: 1: 2, 4; 2: 1, 3, 4; 3: 1; 4: 1, 3.
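As a minimal sketch of "throwing away the zeros", the Python below converts an adjacency matrix into an adjacency list. The matrix is reconstructed from the adjacency list on this slide (an assumption, since the original figure is lost), and the function name `matrix_to_adjacency_list` is hypothetical.

```python
def matrix_to_adjacency_list(matrix):
    """Keep only the nonzero entries of each row (the outlinks of that node)."""
    adjacency = {}
    for i, row in enumerate(matrix, start=1):        # nodes are numbered from 1
        adjacency[i] = [j for j, bit in enumerate(row, start=1) if bit]
    return adjacency

# Assumed matrix for the four-node example: M[i][j] = 1 means a link i -> j.
M = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]

adjacency_list = matrix_to_adjacency_list(M)
for node, outlinks in adjacency_list.items():
    print(f"{node}: {', '.join(map(str, outlinks))}")
```

Each row becomes one record keyed by node, which is the shape a MapReduce job can split across machines.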
10 Adjacency Lists: Critique. Advantages: much more compact representation; easy to compute over outlinks; graph structure can be broken up and distributed. Disadvantages: much more difficult to compute over inlinks.
11 Single Source Shortest Path. Problem: find the shortest path from a source node to one or more target nodes. “Graph search algorithm that solves the single-source shortest path problem for a graph with nonnegative edge path costs, producing a shortest path tree” (Wikipedia). First, a refresher: Dijkstra’s algorithm, on a single machine.
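12 Dijkstra’s Algorithm Example. [Figure: weighted directed graph with edge weights 1, 10, 2, 3, 9, 4, 6, 5, 7, 2. Example from CLR.]
13 Dijkstra’s Algorithm Example. [Figure: the same graph with nodes labeled n0, n1, n2, n3, n4. Example from CLR.]
14 Dijkstra’s Algorithm Example. [Figure: distance labels n1 = 10, n2 = 5. Example from CLR.]
15 Dijkstra’s Algorithm Example. [Figure: distance labels n1 = 8, n3 = 14, n2 = 5, n4 = 7. Example from CLR.]
16 Dijkstra’s Algorithm Example. [Figure: distance labels n1 = 8, n3 = 13, n2 = 5, n4 = 7. Example from CLR.]
17 Dijkstra’s Algorithm Example. [Figure: distance labels n1 = 8, n3 = 9, n2 = 5, n4 = 7. Example from CLR.]
18 Dijkstra’s Algorithm Example. [Figure: final distance labels n1 = 8, n3 = 9, n2 = 5, n4 = 7. Example from CLR.]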
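The walk-through above can be reproduced with a short single-machine sketch. The edge directions below are an assumption: the transcript preserves only the weights and the distance labels, so the graph is filled in from the CLR textbook example. The function name `dijkstra` and the choice of n0 as the source are likewise illustrative.

```python
import heapq

def dijkstra(graph, source):
    """Return the shortest distance from source to every node in graph."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]                      # (tentative distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # stale entry: u was already improved
        for v, w in graph[u]:
            if d + w < dist[v]:               # relax edge u -> v
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# Assumed edges (direction taken from the CLR example, not from the transcript).
graph = {
    "n0": [("n1", 10), ("n2", 5)],
    "n1": [("n2", 2), ("n3", 1)],
    "n2": [("n1", 3), ("n3", 9), ("n4", 2)],
    "n3": [("n4", 4)],
    "n4": [("n0", 7), ("n3", 6)],
}

distances = dijkstra(graph, "n0")
print(distances)
```

Under these assumptions the result agrees with the final labels on the slides: n1 = 8, n2 = 5, n3 = 9, n4 = 7.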
19 Single Source Shortest Path. Problem: find the shortest path from a source node to one or more target nodes. Single-processor machine: Dijkstra’s algorithm. MapReduce: parallel Breadth-First Search (BFS). How to do it? First simplify the problem!
20 Finding the Shortest Path. First, consider equal edge weights. The solution to the problem can be defined inductively. Here’s the intuition: DistanceTo(startNode) = 0. For all nodes n directly reachable from startNode, DistanceTo(n) = 1. For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S).
21 Finding the Shortest Path. This strategy advances the “known frontier” by one hop. Subsequent iterations include more reachable nodes as the frontier advances. Multiple iterations are needed to explore the entire graph.
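The inductive definition can be sketched on a single machine as a frontier that advances one hop per loop iteration. This is only an illustration of the idea with unit edge weights, not the MapReduce version; the function name `bfs_distances` and the sample graph are assumptions.

```python
def bfs_distances(adjacency, start):
    """Advance the known frontier one hop per iteration (unit edge weights)."""
    distance = {start: 0}
    frontier = [start]
    while frontier:
        next_frontier = []
        for m in frontier:
            for n in adjacency.get(m, []):
                if n not in distance:             # first visit is the shortest
                    distance[n] = distance[m] + 1
                    next_frontier.append(n)
        frontier = next_frontier                  # the frontier moves one hop out
    return distance

# Hypothetical four-node graph, stored as an adjacency list.
adjacency = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
result = bfs_distances(adjacency, 1)
print(result)
```

The loop ends when an iteration discovers no new nodes, which mirrors the termination condition used later.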
23 Termination. Does the algorithm ever terminate? Eventually, all nodes will be discovered and all edges will be considered (in a connected graph). When do we stop? When the distances at every node no longer change at the next frontier.
24 Next Step. No longer assume that the distance along each edge is 1.
25 Weighted Edges. Now add positive weights to the edges. Simple change: the points-to list in the map task includes a weight w for each pointed-to node. Emit (p, D + w_p) instead of (p, D + 1) for each node p.
26 Dijkstra’s Algorithm Example. [Figure: the weighted example graph with nodes n0 through n4. Example from CLR.]
27 Multiple Iterations Needed. This MapReduce task advances the “known frontier” by one hop. Subsequent iterations include more reachable nodes as the frontier advances. Multiple iterations are needed to explore the entire graph. Each iteration is a MapReduce task. The final output of one task is the input to the next iteration: feed the output back into the same MapReduce task.
29 From Intuition to Algorithm. What info does the map task require? A map task receives (k, v). Key: node n. Value: D (distance from start) and points-to (adjacency list of nodes reachable from n). What does the map task do? Computes distances: emits (p, D + w_p) for all p ∈ points-to. Makes sure the current distance is carried into the reducer: emits the graph structure of node n as (n, struct), which contains the current shortest distance to node n.
30 From Intuition to Algorithm. What info does the reduce task require? The reduce task gathers the possible distances to a given p. What does the reduce task do? Selects the minimum one.
31 Algorithm. Assume the adjacency list has information about edges and distances!
32 class Mapper
     method MAP(nid n, node N)
       D ← N.Distance
       Emit(nid n, N)                          // Pass along graph structure
       for all nodeid m ∈ N.AdjacencyList do
         Emit(nid m, D + w)                    // Emit distances to reachable nodes (w = weight of edge n → m)
   class Reducer
     method REDUCE(nid m, [d1, d2, ...])
       dmin ← ∞
       M ← ∅
       for all d ∈ [d1, d2, ...] do
         if IsNode(d) then
           M ← d                               // Recover graph structure
         else if d < dmin then                 // Look for shorter distance
           dmin ← d
       if M.Distance > dmin then               // Update shortest distance
         M.Distance ← dmin
         Increment counter for driver
       Emit(nid m, node M)
33 Map Algorithm. Line 2. N is an adjacency list plus the current (shortest) distance. Line 4. Emits a (k, v) pair whose value is the current node’s info; there is only one of these per node, because each node is assumed to be assigned to one mapper. Line 6. Emits a different type of (k, v), which has only the distance to a neighbor, not an adjacency list. The shuffle sends (k, v) pairs with the same k to the same reducer.
34 Reduce Algorithm. Line 2. Will have different types of (k, v) as input. Line 5. Determine what type of (k, v) it is: is the value an adjacency list (node structure)? Line 6. If v is not a node structure then it is a distance; find the shortest. Only one value per key is a node structure (IsNode), as far as I can tell. Line 9. Determine whether there is a new shortest distance. Line 10. Update the current shortest distance and increment a counter used to decide whether to stop.
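The mapper and reducer can be simulated in plain Python to see one iteration at work. This is a sketch under stated assumptions, not Hadoop code: the dictionary-based node layout, the function names `map_node`, `reduce_node`, and `run_iteration`, and the in-memory shuffle are all hypothetical stand-ins, and the example graph assumes the CLR edge directions.

```python
from collections import defaultdict

INF = float("inf")

def map_node(nid, node):
    """Mapper: pass along the node structure and emit candidate distances."""
    yield nid, node                                   # graph structure
    for m, w in node["adjacency"]:
        yield m, node["distance"] + w                 # distance to m via nid

def reduce_node(nid, values):
    """Reducer: recover the node structure and keep the minimum distance."""
    d_min, node, changed = INF, None, 0
    for v in values:
        if isinstance(v, dict):                       # plays the role of IsNode
            node = v
        elif v < d_min:
            d_min = v
    node = dict(node)                                 # copy, leave the input intact
    if d_min < node["distance"]:
        node["distance"] = d_min
        changed = 1                                   # counter read by the driver
    return nid, node, changed

def run_iteration(graph):
    """Simulate one MapReduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for nid, node in graph.items():
        for key, value in map_node(nid, node):
            shuffled[key].append(value)
    new_graph, updates = {}, 0
    for nid, values in shuffled.items():
        nid, node, changed = reduce_node(nid, values)
        new_graph[nid] = node
        updates += changed
    return new_graph, updates

# Assumed example graph; only the source n0 has a known distance at the start.
edges = {
    "n0": [("n1", 10), ("n2", 5)],
    "n1": [("n2", 2), ("n3", 1)],
    "n2": [("n1", 3), ("n3", 9), ("n4", 2)],
    "n3": [("n4", 4)],
    "n4": [("n0", 7), ("n3", 6)],
}
graph = {n: {"distance": 0 if n == "n0" else INF, "adjacency": adj}
         for n, adj in edges.items()}

graph, updates = run_iteration(graph)
print(updates, {n: node["distance"] for n, node in graph.items()})
```

After this single iteration only the direct neighbors of n0 have finite distances; the output has the same form as the input, so it can be fed straight into the next iteration.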
35 Shortest path – one more thing. This only finds shortest distances, not the shortest path. Is this true? Do we have to use backpointers to retrace the shortest path? No: emit paths along with distances, so each node has its shortest path accessible at all times. Most paths are relatively short, so this uses little space.
36 Weighted edges: Finds Minimum? Suppose we discover node r: we have found the shortest distance D to p, and the shortest D to r found so far goes through p. There may be a path through q to r that is shorter, but that path lies outside the current search frontier. This cannot happen with unit weights (D = 1), since a path lying outside the search frontier would be longer. We have found the shortest path within the frontier. We will discover the shortest path as the frontier expands. With sufficient iterations, we eventually discover the shortest distance.
37 Dijkstra’s Algorithm Example. [Figure: the weighted example graph with nodes n0 through n4. Example from CLR.]
38 Termination. Does this ever terminate? Yes! Eventually, no better distances will be found. When the distances stay the same, we stop. Checking for termination must occur outside of MapReduce. A driver program submits an MR job to iterate the algorithm and checks whether the termination condition is met. Hadoop provides Counters, which the driver can read outside MapReduce. The driver determines, after the reducers finish, whether we are done. In shortest path, reducers count each change to a min distance and pass the count to the driver.
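A driver loop of this kind can be sketched as follows. The function `run_job` is a hypothetical single-process stand-in for submitting one MapReduce job and reading back a counter; it is not the Hadoop API. The chain-shaped example graph is also an assumption, chosen because it is the worst case for the number of iterations.

```python
INF = float("inf")

def run_job(distance, adjacency):
    """Stand-in for one MapReduce job; returns new distances and a counter."""
    best = dict(distance)
    for n, d in distance.items():                     # "map" over every node
        for m, w in adjacency[n]:
            best[m] = min(best[m], d + w)             # "reduce": keep the minimum
    updates = sum(1 for n in distance if best[n] < distance[n])
    return best, updates

def driver(adjacency, source):
    """Resubmit the job until the counter reports that no distance changed."""
    distance = {n: INF for n in adjacency}            # every node must be a key
    distance[source] = 0
    iterations = 0
    while True:
        distance, updates = run_job(distance, adjacency)
        iterations += 1
        if updates == 0:                              # termination check, outside the job
            break
    return distance, iterations

# Hypothetical chain a -> b -> c -> d -> e with unit weights.
chain = {"a": [("b", 1)], "b": [("c", 1)], "c": [("d", 1)], "d": [("e", 1)], "e": []}
distances, iterations = driver(chain, "a")
print(distances, iterations)
```

On this five-node chain the frontier needs four productive iterations (number of nodes minus one), plus one final iteration whose zero counter tells the driver to stop.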
39 Iterations. How many iterations are needed to compute the shortest distance to all nodes? The diameter of the graph, i.e. the greatest distance between any pair of nodes. Small for many real-world problems: six degrees of separation. For a global social network: 6 MapReduce iterations.
40 Fig. 5.6 needs how many iterations for n1-n6? Worst case? Need (#nodes – 1).
41 Comparison to Dijkstra. Dijkstra’s algorithm is more efficient: at any step it only pursues edges from the minimum-cost path inside the frontier. MapReduce explores all paths in parallel. Brute force: wastes time. Divide and conquer. Except at the search frontier, it repeats the same computations within the frontier. Throw more hardware at the problem.
42 General Approach. MapReduce is adept at manipulating graphs. Store graphs as adjacency lists. Graph algorithms with MapReduce: each map task receives a node and its outlinks; the map task computes some function of the link structure and emits a value with the target as the key; the reduce task collects keys (target nodes) and aggregates. Iterate multiple MapReduce cycles until some termination condition. Remember to “pass” the graph structure from one iteration to the next.
43 Another example – Random Walks Over the Web. Model: a user starts at a random Web page; the user randomly clicks on links, surfing from page to page (and may also teleport to a completely different page). How frequently will a page be encountered during this surfing? This is PageRank: a probability distribution over nodes in a graph representing the likelihood that a random walk over the graph will arrive at a particular node.
44 PageRank: Defined. Given page n with in-bound links L(n), where C(m) is the out-degree of m, P(m) is the PageRank of m, α is the probability of a random jump, and |G| is the total number of nodes in the graph. [Figure: pages m1 … mn linking to page n]
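The equation itself did not survive in the transcript. A reconstruction from the definitions on this slide, assuming the standard formulation and writing α for the random jump probability, would be:

```latex
P(n) = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)}
```

The first term is the chance of arriving at n by a random jump; the second sums the share of PageRank that each in-linking page m passes along each of its C(m) outgoing links.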
45 Computing PageRank. Properties of PageRank: can be computed iteratively; effects at each iteration are local. Sketch of algorithm: start with seed (Pi) values; each page distributes (Pi) “credit” to all pages it links to; each target page adds up “credit” from multiple in-bound links to compute (Pi+1); iterate until values converge.
46 Computing PageRank. What does map do? What does reduce do?
47 PageRank MapReduce, Fig. 5.7. Begins with 5 nodes splitting 1.0, so 0.2 each. Each node must split its 0.2 among its outgoing links (map). Then add up all incoming values (reduce). Each iteration is one MapReduce job.
50 PageRank in MapReduce. Map: distribute PageRank “credit” to link targets. Reduce: gather up PageRank “credit” from multiple sources to compute the new PageRank value. Iterate until convergence...
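One such iteration can be sketched in plain Python. This simplified version ignores random jumps and dangling nodes, and everything named here is an assumption for illustration: the five-node link structure, the tagged-tuple value format, and the functions `pagerank_map`, `pagerank_reduce`, and `pagerank_iteration`.

```python
from collections import defaultdict

def pagerank_map(nid, node):
    """Mapper: pass along the link structure and split the rank over outlinks."""
    yield nid, ("node", node["links"])
    share = node["rank"] / len(node["links"])         # assumes at least one outlink
    for target in node["links"]:
        yield target, ("credit", share)

def pagerank_reduce(nid, values):
    """Reducer: recover the link structure and add up incoming credit."""
    links, rank = [], 0.0
    for kind, payload in values:
        if kind == "node":
            links = payload
        else:
            rank += payload
    return nid, {"rank": rank, "links": links}

def pagerank_iteration(graph):
    """Simulate one MapReduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for nid, node in graph.items():
        for key, value in pagerank_map(nid, node):
            shuffled[key].append(value)
    return dict(pagerank_reduce(nid, values) for nid, values in shuffled.items())

# Hypothetical five-node graph; each node starts with 1.0 / 5 = 0.2.
links = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
         "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
graph = {n: {"rank": 0.2, "links": out} for n, out in links.items()}

graph = pagerank_iteration(graph)
for nid in sorted(graph):
    print(nid, round(graph[nid]["rank"], 3))
```

Because every node in this example has outgoing links, the ranks still sum to 1.0 after the iteration; no mass is lost.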
51 Convergence to end PageRank. Stop when there are few changes (with some tolerance for precision errors) or when a fixed number of iterations is reached. The driver checks for convergence. How many iterations are needed for PageRank to converge, e.g. with 322 M edges? Fewer than expected: 52 iterations.
52 Dangling nodes and random jumps. Must redistribute the mass lost at dangling nodes (no outgoing edges, so mass is lost). 3 approaches to determine the missing mass: count dangling nodes and multiply by a constant; emit a special key and handle it with special logic; write as side data and sum across all map tasks. Next, redistribute the missing mass m across all nodes. Compute the final PageRank p’, where α is the random jump probability. Need 2 MapReduce jobs for one iteration: one to distribute mass across edges, the other to take care of lost mass.
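The formula for p’ is missing from the transcript. A common formulation, assumed here, is p’ = α(1/|G|) + (1 − α)(m/|G| + p), where p is the rank from the first job and m is the missing mass. The sketch below applies it as the second pass; the function name `redistribute`, the three-node ranks, and α = 0.15 are illustrative assumptions.

```python
def redistribute(ranks, missing_mass, alpha):
    """Second pass: spread the lost mass evenly, then apply the random jump."""
    n = len(ranks)
    return {
        node: alpha * (1.0 / n) + (1.0 - alpha) * (missing_mass / n + p)
        for node, p in ranks.items()
    }

# Hypothetical ranks after the first job: they sum to 0.8 because a dangling
# node had nowhere to send its 0.2 of mass.
ranks = {"a": 0.2, "b": 0.4, "c": 0.2}
missing_mass = 1.0 - sum(ranks.values())

final_ranks = redistribute(ranks, missing_mass, alpha=0.15)
print(final_ranks, sum(final_ranks.values()))
```

After this pass the ranks sum to 1.0 again (up to floating-point error), so they remain a probability distribution.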
53 PageRank. Assumes honest users. No spider traps (an infinite chain of pages that all link to a single page to inflate its PageRank). PageRank is only one of thousands of features used in ranking web pages.
54 Issues with Graph processing. No global data structures can be used. Local computation on each node, with results passed to neighbors. With multiple iterations, convergence on the global graph. Amount of intermediate data is on the order of the number of edges. Worst case? O(n^2) for a dense graph.
59 Issues with Graph processing. Combiners are only useful if they can do partial aggregation: only if multiple nodes are processed by an individual mapper and point to the same nodes. Otherwise the combiner is not useful. Assume a mapper processes more than one node. How do we assign nodes (partition the graph) so that combiners are useful?
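For the shortest-path job, partial aggregation means keeping only the smallest candidate distance per target before anything leaves the mapper. The sketch below shows that idea for one mapper that holds two nodes; the function name `map_partition_with_combining` and the sample partition are hypothetical, and passing along the graph structure is omitted for brevity.

```python
def map_partition_with_combining(partition):
    """Map several nodes in one mapper, keeping only the minimum per target."""
    best = {}
    for nid, node in partition.items():
        for m, w in node["adjacency"]:
            candidate = node["distance"] + w
            if candidate < best.get(m, float("inf")):
                best[m] = candidate                   # partial aggregation (min)
    return sorted(best.items())

# Hypothetical partition: n1 and n2 both point to n3.
partition = {
    "n1": {"distance": 8, "adjacency": [("n3", 1)]},
    "n2": {"distance": 5, "adjacency": [("n3", 9), ("n4", 2)]},
}

emitted = map_partition_with_combining(partition)
print(emitted)
```

Without combining, this mapper would emit three pairs, two of them for n3; with combining it emits two. The saving exists only because n1 and n2 sit in the same partition and share a target, which is why the partitioning question matters.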
60 Issues with Graph processing. It is desirable to partition the graph so that there are many intra-component links and few inter-component links. Consider a social network. Partitioning heuristics: order nodes by last name? Zip code? Language spoken? School? The goal is that people in the same partition are connected.
61 Summary. Graph structure is represented with adjacency lists. Map over nodes and pass partial results to nodes on the adjacency list; partial results are aggregated for each node in the reducer. Graph structure is passed from mapper to reducer, with output in the same form as input. Algorithms are iterative, under the control of a non-MapReduce driver that checks for termination at the end of each iteration.