Big Data Infrastructure Week 5: Analyzing Graphs (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United.

Slides:



Advertisements
Similar presentations
Map-Reduce Graph Processing Adapted from UMD Jimmy Lin’s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United.
Advertisements

IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture10.
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, February 21, 2013 Session 5: Graph Processing This work is licensed.
大规模数据处理 / 云计算 Lecture 6 – Graph Algorithm 彭波 北京大学信息科学技术学院 4/26/2011 This work is licensed under a Creative Commons.
Thanks to Jimmy Lin slides
Graph A graph, G = (V, E), is a data structure where: V is a set of vertices (aka nodes) E is a set of edges We use graphs to represent relationships among.
Graphs Chapter 12. Chapter Objectives  To become familiar with graph terminology and the different types of graphs  To study a Graph ADT and different.
Graphs Graphs are the most general data structures we will study in this course. A graph is a more general version of connected nodes than the tree. Both.
CS171 Introduction to Computer Science II Graphs Strike Back.
CS 206 Introduction to Computer Science II 11 / 11 / Veterans Day Instructor: Michael Eckmann.
C++ Programming: Program Design Including Data Structures, Third Edition Chapter 21: Graphs.
ITEC200 – Week 12 Graphs. 2 Chapter Objectives To become familiar with graph terminology and the different types of graphs To study.
Cloud Computing Lecture #3 More MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, September 10, 2008 This work is licensed under a Creative.
Graph & BFS.
CS 311 Graph Algorithms. Definitions A Graph G = (V, E) where V is a set of vertices and E is a set of edges, An edge is a pair (u,v) where u,v  V. If.
Chapter 9 Graph algorithms. Sample Graph Problems Path problems. Connectedness problems. Spanning tree problems.
CS Lecture 9 Storeing and Querying Large Web Graphs.
Jimmy Lin The iSchool University of Maryland Wednesday, April 15, 2009
Cloud Computing Lecture #5 Graph Algorithms with MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, October 1, 2008 This work is licensed.
Graph COMP171 Fall Graph / Slide 2 Graphs * Extremely useful tool in modeling problems * Consist of: n Vertices n Edges D E A C F B Vertex Edge.
CS728 Lecture 16 Web indexes II. Last Time Indexes for answering text queries –given term produce all URLs containing –Compact representations for postings.
Graph & BFS Lecture 22 COMP171 Fall Graph & BFS / Slide 2 Graphs * Extremely useful tool in modeling problems * Consist of: n Vertices n Edges D.
MapReduce Algorithms CSE 490H. Algorithms for MapReduce Sorting Searching TF-IDF BFS PageRank More advanced algorithms.
Spring 2010CS 2251 Graphs Chapter 10. Spring 2010CS 2252 Chapter Objectives To become familiar with graph terminology and the different types of graphs.
Graphs & Graph Algorithms 2 Nelson Padua-Perez Bill Pugh Department of Computer Science University of Maryland, College Park.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
Chapter 9 Graph algorithms Lec 21 Dec 1, Sample Graph Problems Path problems. Connectedness problems. Spanning tree problems.
CSE 780 Algorithms Advanced Algorithms Graph Algorithms Representations BFS.
Fall 2007CS 2251 Graphs Chapter 12. Fall 2007CS 2252 Chapter Objectives To become familiar with graph terminology and the different types of graphs To.
Graphs & Graph Algorithms 2 Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Cloud Computing Lecture #4 Graph Algorithms with MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, February 6, 2008 This work is licensed.
Graph Essentials Social Media Mining. 2 Measures and Metrics 2 Social Media Mining Graph Essentials Networks A network is a graph. – Elements of the network.
Social Media Mining Graph Essentials.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
Graph Algorithms Ch. 5 Lin and Dyer. Graphs Are everywhere Manifest in the flow of s Connections on social network Bus or flight routes Social graphs:
Chapter 9 – Graphs A graph G=(V,E) – vertices and edges
Computer Science 112 Fundamentals of Programming II Introduction to Graphs.
MapReduce and Graph Data Chapter 5 Based on slides from Jimmy Lin’s lecture slides ( (licensed.
Representing and Using Graphs
Graph Algorithms. Definitions and Representation An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
Graphs. Made up of vertices and arcs Digraph (directed graph) –All arcs have arrows that give direction –You can only traverse the graph in the direction.
Graph Algorithms. Graph Algorithms: Topics  Introduction to graph algorithms and graph represent ations  Single Source Shortest Path (SSSP) problem.
ITEC 2620A Introduction to Data Structures Instructor: Prof. Z. Yang Course Website: 2620a.htm Office: TEL 3049.
Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.
Data Structures CSCI 132, Spring 2014 Lecture 38 Graphs
Distributed Computing Seminar Lecture 5: Graph Algorithms & PageRank Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet Summer 2007 Except.
Data Structures and Algorithms in Parallel Computing Lecture 3.
1 Directed Graphs Chapter 8. 2 Objectives You will be able to: Say what a directed graph is. Describe two ways to represent a directed graph: Adjacency.
1 Algorithms CSCI 235, Fall 2015 Lecture 32 Graphs I.
Big Data Infrastructure Week 5: Analyzing Graphs (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United.
Big Data Infrastructure Week 2: MapReduce Algorithm Design (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
Source: CSE 214 – Computer Science II Graphs.
Chapter 20: Graphs. Objectives In this chapter, you will: – Learn about graphs – Become familiar with the basic terminology of graph theory – Discover.
Data Structures and Algorithm Analysis Graph Algorithms Lecturer: Jing Liu Homepage:
大规模数据处理 / 云计算 05 – Graph Algorithm 闫宏飞 北京大学信息科学技术学院 7/22/2014 Jimmy Lin University of Maryland SEWMGroup This work.
Graphs David Kauchak cs302 Spring Admin HW 12 and 13 (and likely 14) You can submit revised solutions to any problem you missed Also submit your.
Big Data Infrastructure Week 9: Data Mining (4/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
Big Data Infrastructure
Big Data Infrastructure
Graphs Lecture 19 CS2110 – Spring 2013.
Graphs Representation, BFS, DFS
Lecture 11 Graph Algorithms
MapReduce and Data Management
Cloud Computing Lecture #4 Graph Algorithms with MapReduce
Graph Algorithms Ch. 5 Lin and Dyer.
Graph Algorithms Adapted from UMD Jimmy Lin’s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
Lecture 10 Graph Algorithms
Algorithms CSCI 235, Spring 2019 Lecture 32 Graphs I
Graph Algorithms Ch. 5 Lin and Dyer.
Presentation transcript:

Big Data Infrastructure Week 5: Analyzing Graphs (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See for details CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo February 2, 2016 These slides are available at

Structure of the Course “Core” framework features and algorithm design Analyzing Text Analyzing Graphs Analyzing Relational Data Data Mining

What’s a graph? G = (V,E), where V represents the set of vertices (nodes) E represents the set of edges (links) Both vertices and edges may contain additional information Different types of graphs: Directed vs. undirected edges Presence or absence of cycles Graphs are everywhere: Hyperlink structure of the web Physical structure of computers on the Internet Interstate highway system Social networks

Source: Wikipedia (Königsberg)

Source: Wikipedia (Kaliningrad)

Some Graph Problems Finding shortest paths Routing Internet traffic and UPS trucks Finding minimum spanning trees Telco laying down fiber Finding Max Flow Airline scheduling Identify “special” nodes and communities Breaking up terrorist cells, spread of avian flu Bipartite matching Monster.com, Match.com And of course... PageRank

What makes graphs hard? Irregular structure Irregular data access patterns Iterations

Graphs and MapReduce (and Spark) A large class of graph algorithms involve: Performing computations at each node: based on node features, edge features, and local link structure Propagating computations: “traversing” the graph Key questions: How do you represent graph data in MapReduce (and Spark)? How do you traverse a graph in MapReduce (and Spark)?

Representing Graphs G = (V, E) Three common representations Adjacency matrix Adjacency list Edge lists

Adjacency Matrices Represent a graph as an n x n square matrix M n = |V| M ij = 1 means a link from node i to j

Adjacency Matrices: Critique Advantages: Amenable to mathematical manipulation Iteration over rows and columns corresponds to computations on outlinks and inlinks Disadvantages: Lots of zeros for sparse matrices Lots of wasted space

Adjacency Lists Take adjacency matrices… and throw away all the zeros 1: 2, 4 2: 1, 3, 4 3: 1 4: 1, Wait, where have we seen this before?

Adjacency Lists: Critique Advantages: Much more compact representation Easy to compute over outlinks Disadvantages: Much more difficult to compute over inlinks

Edge Lists Explicitly enumerate all edges (1, 2) (1, 4) (2, 1) (2, 3) (2, 4) (3, 1) (4, 1) (4, 3)

Edge Lists: Critique Why? Edges arrive in no particular order Sometimes, we want to store inverted edges Advantages: Supports the ability to perform edge partitioning Disadvantages: Takes a lot of space

Co-occurrence of characters in Les Misérables Source:

Co-occurrence of characters in Les Misérables Source:

Co-occurrence of characters in Les Misérables Source: How are visualizations like this generated? Limitations?

What does the web look like? Meusel et al. Graph Structure in the Web — Revisited. WWW Analysis of a large webgraph from the common crawl: 3.5 billion pages, 129 billion links

Broder’s Bowtie (2000) – revisited

What does the web look like? Very roughly, a scale-free network Fraction of k nodes having k connections: (i.e., distribution follows a power law)

Figure from: Newman, M. E. J. (2005) “Power laws, Pareto distributions and Zipf's law.” Contemporary Physics 46:323–351. Power Laws are everywhere!

What does the web look like? Very roughly, a scale-free network Internet domain routers Co-author network Citation network Movie-Actor network Other Examples: Why?

(In this installment of “learn fancy terms for simple ideas”) Preferential Attachment Matthew Effect Also: For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken even that which he hath. — Matthew 25:29, King James Version.

Figure from: Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information Network or Social Network? The Structure of the Twitter Follow Graph. WWW What about Facebook?

BTW, how do we compute these graphs? (Assume graph stored as adjacency lists)

Count. Source:

BTW, how do we extract the webgraph? The webgraph… is big?! Integerize vertices (montone minimal perfect hashing) Sort URLs Integer compression A few tricks: Meusel et al. Graph Structure in the Web — Revisited. WWW webgraph from the common crawl: 3.5 billion pages, 129 billion links 58 GB!

Graphs and MapReduce (and Spark) A large class of graph algorithms involve: Performing computations at each node: based on node features, edge features, and local link structure Propagating computations: “traversing” the graph Key questions: How do you represent graph data in MapReduce (and Spark)? How do you traverse a graph in MapReduce (and Spark)?

Single-Source Shortest Path Problem: find shortest path from a source node to one or more target nodes Shortest might also mean lowest weight or cost First, a refresher: Dijkstra’s Algorithm

Dijkstra’s Algorithm Example 0 0     Example from CLR

Dijkstra’s Algorithm Example   Example from CLR

Dijkstra’s Algorithm Example Example from CLR

Dijkstra’s Algorithm Example Example from CLR

Dijkstra’s Algorithm Example Example from CLR

Dijkstra’s Algorithm Example Example from CLR

Single-Source Shortest Path Problem: find shortest path from a source node to one or more target nodes Shortest might also mean lowest weight or cost Single processor machine: Dijkstra’s Algorithm MapReduce: parallel breadth-first search (BFS)

Finding the Shortest Path Consider simple case of equal edge weights Solution to the problem can be defined inductively Here’s the intuition: Define: b is reachable from a if b is on adjacency list of a D ISTANCE T O (s) = 0 For all nodes p reachable from s, D ISTANCE T O (p) = 1 For all nodes n reachable from some other set of nodes M, D ISTANCE T O (n) = 1 + min(D ISTANCE T O (m), m  M) s s m3m3 m3m3 m2m2 m2m2 m1m1 m1m1 n n … … … d1d1 d2d2 d3d3

Source: Wikipedia (Wave)

Visualizing Parallel BFS n0n0 n0n0 n3n3 n3n3 n2n2 n2n2 n1n1 n1n1 n7n7 n7n7 n6n6 n6n6 n5n5 n5n5 n4n4 n4n4 n9n9 n9n9 n8n8 n8n8

From Intuition to Algorithm Data representation: Key: node n Value: d (distance from start), adjacency list (nodes reachable from n) Initialization: for all nodes except for start node, d =  Mapper:  m  adjacency list: emit (m, d + 1) Remember to also emit distance to yourself Sort/Shuffle Groups distances by reachable nodes Reducer: Selects minimum distance path for each reachable node Additional bookkeeping needed to keep track of actual path

Multiple Iterations Needed Each MapReduce iteration advances the “frontier” by one hop Subsequent iterations include more and more reachable nodes as frontier expands Multiple iterations are needed to explore entire graph Preserving graph structure: Problem: Where did the adjacency list go? Solution: mapper emits (n, adjacency list) as well Ugh! This is ugly!

BFS Pseudo-Code

Stopping Criterion How many iterations are needed in parallel BFS (equal edge weight case)? Convince yourself: when a node is first “discovered”, we’ve found the shortest path Now answer the question... Six degrees of separation? Practicalities of implementation in MapReduce

Comparison to Dijkstra Dijkstra’s algorithm is more efficient At each step, only pursues edges from minimum-cost path inside frontier MapReduce explores all paths in parallel Lots of “waste” Useful work is only done at the “frontier” Why can’t we do better using MapReduce?

Single Source: Weighted Edges Now add positive weights to the edges Why can’t edge weights be negative? Simple change: add weight w for each edge in adjacency list In mapper, emit (m, d + w p ) instead of (m, d + 1) for each node m That’s it?

Stopping Criterion How many iterations are needed in parallel BFS (positive edge weight case)? Convince yourself: when a node is first “discovered”, we’ve found the shortest path Not true!

Additional Complexities s p q r search frontier 10 n1n1 n2n2 n3n3 n4n4 n5n5 n6n6 n7n7 n8n8 n9n

Stopping Criterion How many iterations are needed in parallel BFS (positive edge weight case)? Practicalities of implementation in MapReduce

Application: Social Search Source: Wikipedia (Crowd)

Social Search When searching, how to rank friends named “John”? Assume undirected graphs Rank matches by distance to user Naïve implementations: Precompute all-pairs distances Compute distances at query time Can we do better?

All-Pairs? Floyd-Warshall Algorithm: difficult to MapReduce-ify… Multiple-source shortest paths in MapReduce: run multiple parallel BFS simultaneously Assume source nodes {s 0, s 1, … s n } Instead of emitting a single distance, emit an array of distances, with respect to each source Reducer selects minimum for each element in array Does this scale?

Landmark Approach (aka sketches) Select n seeds {s 0, s 1, … s n } Compute distances from seeds to every node: What can we conclude about distances? Insight: landmarks bound the maximum path length Lots of details: How to more tightly bound distances How to select landmarks (random isn’t the best…) Use multi-source parallel BFS implementation in MapReduce! A=[2, 1, 1] B =[1, 1, 2] C=[4, 3, 1] D=[1, 2, 4] Nodes Distances to seeds

Source: Wikipedia (Japanese rock garden) Questions?