Graphs (Part II) Shannon Quinn (with thanks to William Cohen and Aapo Kyrola of CMU, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University)

Parallel Graph Computation
Distributed computation and/or multicore parallelism – sometimes confusing. We will talk mostly about distributed computation.
Are classic graph algorithms parallelizable? What about distributed?
– Depth-first search?
– Breadth-first search?
– Priority-queue-based traversals (Dijkstra's, Prim's algorithms)?
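
To make the contrast concrete, here is a minimal level-synchronous BFS sketch in Python (my own illustration, not from the slides): every node in a level's frontier can be expanded in parallel, with only a barrier between levels, whereas DFS's single stack and Dijkstra's priority queue serialize the work.

```python
def bfs_levels(graph, source):
    """graph: dict mapping node -> list of neighbors. Returns dict node -> BFS level."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        # Every node in the current frontier can be expanded independently; the only
        # synchronization point is the barrier between levels (each loop iteration).
        next_frontier = []
        for u in frontier:
            for v in graph[u]:
                if v not in level:
                    level[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return level
```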

MapReduce for Graphs
Graph computation is almost always iterative.
MapReduce ends up shipping the whole graph over the network on each iteration (map -> reduce -> map -> reduce -> ...)
– Mappers and reducers are stateless

Iterative Computation is Difficult
The system is not optimized for iteration: every pass over the data pays a disk penalty and a startup penalty.

MapReduce and Partitioning
Map-Reduce splits the keys randomly between mappers/reducers, but on natural graphs, high-degree vertices (keys) may have millions of times more edges than the average.
– Extremely uneven distribution
– Time of an iteration = time of the slowest job.

Curse of the Slow Job
Every iteration ends at a barrier, so all workers wait for the slowest job before the next iteration can start.

Map-Reduce is Bulk-Synchronous Parallel
Bulk-Synchronous Parallel = BSP (Valiant, 80s)
– Each iteration sees only the values of the previous iteration.
– In the linear-systems literature: Jacobi iterations
Pros:
– Simple to program
– Maximum parallelism
– Simple fault tolerance
Cons:
– Slower convergence
– Iteration time = time taken by the slowest node

Triangle Counting in the Twitter Graph
40M users, 1.2B edges; total: 34.8 billion triangles.
– Hadoop [Suri & Vassilvitskii '11]: 1536 machines, 423 minutes
– 64 machines (1024 cores): 1.5 minutes

PageRank
40M webpages, 1.4 billion links (100 iterations)
– Hadoop [Kang et al. '11]: 5.5 hrs
– Twister (in-memory MapReduce) [Ekanayake et al. '10]: 1 hr
– 8 min

Graph algorithms
PageRank implementations:
– in memory
– streaming, node list in memory
– streaming, no memory
– map-reduce
A little like Naïve Bayes variants:
– data in memory
– word counts in memory
– stream-and-sort
– map-reduce

Google's PageRank
(figure: a web of sites linking to one another)
Inlinks are "good" (recommendations).
Inlinks from a "good" site are better than inlinks from a "bad" site, but inlinks from sites with many outlinks are not as "good"...
"Good" and "bad" are relative.

Google's PageRank
Imagine a "pagehopper" that always either follows a random link, or jumps to a random page.

Google's PageRank (Brin & Page)
Imagine a "pagehopper" that always either follows a random link, or jumps to a random page.
PageRank ranks pages by the amount of time the pagehopper spends on a page: or, if there were many pagehoppers, PageRank is the expected "crowd size".

Random Walks avoid messy "dead ends"….

Random Walks: PageRank

Graph = Matrix; Vector = Node → Weight
(figure: the adjacency matrix M for nodes A–J alongside a vector v assigning a weight to each node)

PageRank in Memory
Let u = (1/N, …, 1/N) – dimension = #nodes N
Let A = adjacency matrix: [a_ij = 1 iff i links to j]
Let W = [w_ij = a_ij / outdegree(i)]
– w_ij is the probability of a jump from i to j
Let v_0 = (1, 1, …, 1) – or anything else you want
Repeat until converged:
– Let v_{t+1} = c·u + (1-c)·W·v_t
c is the probability of jumping "anywhere randomly"
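
A minimal sketch of the in-memory version in Python/NumPy (names and defaults are mine, not from the slides). The update multiplies by W transposed so that rank mass flows along outgoing edges, which matches the per-edge update v'[j] += (1-c)·v[i]/d on the streaming slides.

```python
import numpy as np

def pagerank_in_memory(A, c=0.15, tol=1e-10, max_iter=100):
    """A: N x N adjacency matrix with A[i, j] = 1 if i links to j."""
    N = A.shape[0]
    u = np.full(N, 1.0 / N)                    # uniform "jump anywhere" distribution
    outdegree = A.sum(axis=1).astype(float)
    outdegree[outdegree == 0] = 1.0            # guard against dangling nodes (no outlinks)
    W = A / outdegree[:, None]                 # w_ij = a_ij / outdegree(i)
    v = np.full(N, 1.0 / N)                    # any start vector converges to the same fixed point
    for _ in range(max_iter):
        v_next = c * u + (1 - c) * (W.T @ v)   # rank mass flows along outgoing edges
        if np.abs(v_next - v).sum() < tol:
            return v_next
        v = v_next
    return v

# Example: pagerank_in_memory(np.array([[0, 1, 1], [1, 0, 0], [0, 1, 0]], dtype=float))
```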

Streaming PageRank
Assume we can store v but not W in memory.
Repeat until converged:
– Let v_{t+1} = c·u + (1-c)·W·v_t
Store A as a row matrix: each line is
– i  j_{i,1}, …, j_{i,d}  [the neighbors of i]
Store v' and v in memory: v' starts out as c·u
For each line "i  j_{i,1}, …, j_{i,d}":
– For each j in j_{i,1}, …, j_{i,d}: v'[j] += (1-c)·v[i]/d
Everything needed for the update is right there in the row….
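
A minimal sketch of one streaming pass in Python (the file name, line format, and the dict representation of v are assumptions; node ids are the strings appearing in the file):

```python
def streaming_pagerank_pass(adjacency_path, v, c=0.15):
    """One pass over the adjacency file, where each line is 'i j_1 j_2 ... j_d'
    (node i followed by its neighbors). v is the current PageRank vector held in
    memory as a dict node -> score; the returned dict is v' for the next iteration."""
    v_next = {node: c / len(v) for node in v}      # v' starts out as c*u
    with open(adjacency_path) as f:
        for line in f:
            parts = line.split()
            i, neighbors = parts[0], parts[1:]
            if not neighbors:
                continue                           # dangling node: nothing to distribute
            share = (1 - c) * v[i] / len(neighbors)
            for j in neighbors:
                v_next[j] += share                 # v'[j] += (1-c) * v[i] / d
    return v_next
```

Repeating this pass until the vector stops changing gives the "repeat until converged" loop on the slide.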

Streaming PageRank: with some long rows
Repeat until converged:
– Let v_{t+1} = c·u + (1-c)·W·v_t
Store A as a list of edges: each line is "i d(i) j"
Store v' and v in memory: v' starts out as c·u
For each line "i d j": v'[j] += (1-c)·v[i]/d
We need to get the degree of i and store it locally.
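
Under the same assumptions as the previous sketch, the edge-list variant only changes the inner loop: one update per line, using the degree stored on that line.

```python
def streaming_pagerank_pass_edges(edge_path, v, c=0.15):
    """Same idea, but each line is 'i d(i) j': one edge per line with the source's
    degree stored alongside it, so no adjacency row has to fit in memory."""
    v_next = {node: c / len(v) for node in v}      # v' starts out as c*u
    with open(edge_path) as f:
        for line in f:
            i, d, j = line.split()
            v_next[j] += (1 - c) * v[i] / int(d)   # v'[j] += (1-c) * v[i] / d
    return v_next
```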

Streaming PageRank: preprocessing
The original encoding is edges (i, j).
– The mapper replaces (i, j) with (i, 1).
– The reducer is a SumReducer.
– The result is pairs (i, d(i)).
Then: join this back with the edges (i, j). For each (i, j) pair:
– send j as a message to node i in the degree table
(messages always sort after non-messages)
– the reducer for the degree table sees (i, d(i)) first, then j_1, j_2, …,
so it can output key–value pairs with key = i, value = d(i), j
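
A single-machine, stream-and-sort style sketch of these two preprocessing jobs in Python (not real Hadoop code; record formats and function names are mine):

```python
from itertools import groupby

def degree_map(edge_lines):
    """Mapper: replace each edge (i, j) with (i, 1)."""
    for line in edge_lines:
        i, _j = line.split()
        yield i, 1

def sum_reduce(pairs):
    """SumReducer: group by key and sum, producing (i, d(i))."""
    for i, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield i, sum(count for _, count in group)

def join_degrees_with_edges(degree_pairs, edge_lines):
    """Join the degree table back with the edges. Edge endpoints are tagged with '~'
    so that, for each key i, the degree record sorts before the messages."""
    tagged = [(i, ("", d)) for i, d in degree_pairs]
    for line in edge_lines:
        i, j = line.split()
        tagged.append((i, ("~", j)))
    for i, group in groupby(sorted(tagged), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        d = values[0][1]                 # the degree record comes first
        for _tag, j in values[1:]:
            yield i, d, j                # key = i, value = (d(i), j)

# Example:
# edges = ["a b", "a c", "b c"]
# degrees = list(sum_reduce(degree_map(edges)))            # [('a', 2), ('b', 1)]
# list(join_degrees_with_edges(degrees, edges))            # [('a', 2, 'b'), ('a', 2, 'c'), ('b', 1, 'c')]
```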

Preprocessing Control Flow: 1
MAP: each edge (i, j) is rewritten as (i, 1).
SORT: the (i, 1) pairs are grouped by key.
REDUCE: summing the values yields the degree table (i, d(i)).

Preprocessing Control Flow: 2
MAP: copy the degree table (i, d(i)), and convert each edge (i, j) into a message (i, ~j).
SORT: each node's degree record and its messages come together, degree first.
REDUCE: join the degree with the edges, emitting (i, d(i), j) for every outlink j of i.

Control Flow: Streaming PR (1)
MAP: copy the (i, d(i), v(i)) table, and convert each edge (i, j) into a message.
SORT: group each node's record and messages together.
REDUCE: send "PageRank updates" to outlinks – for each source i, emit (i, c) plus (j, (1-c)·v(i)/d(i)) for every outlink j.
MAP/SORT: the resulting deltas are regrouped by destination node.

Control Flow: Streaming PR (2)
REDUCE: summing the deltas for each node i produces v'(i).
MAP/SORT/REDUCE: join v' back with the degree table, replacing v with v' in the (i, d(i), v(i)) records.

Control Flow: Streaming PR (3)
The updated (i, d(i), v'(i)) table and the edge table feed the next MAP ("copy or convert to messages"), and back around for the next iteration….

PageRank in MapReduce
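
For reference, here is one iteration of PageRank written as a map and a reduce in Python pseudocode (the signatures and the "structure"/"mass" tagging are my own illustration; actual Hadoop code would use Mapper/Reducer classes in Java):

```python
def pagerank_map(node_id, rank, neighbors, c=0.15):
    """Emit the graph structure (so it survives the shuffle) plus one rank share per outlink."""
    yield node_id, ("structure", neighbors)
    if neighbors:
        share = (1 - c) * rank / len(neighbors)
        for j in neighbors:
            yield j, ("mass", share)

def pagerank_reduce(node_id, values, c=0.15, n_nodes=1):
    """Recover the adjacency list and sum the incoming rank shares for this node."""
    neighbors, total = [], 0.0
    for tag, payload in values:
        if tag == "structure":
            neighbors = payload
        else:
            total += payload
    yield node_id, (c / n_nodes + total, neighbors)   # new rank = c*u[i] + sum of shares
```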

More on graph algorithms
PageRank is one simple example of a graph algorithm – but an important one.
– Personalized PageRank (aka "random walk with restart") is an important operation in machine learning / data analysis settings.
PageRank is typical in some ways:
– Trivial when the graph fits in memory
– Easy when the node weights fit in memory
– More complex to do with constant memory
– A major expense is scanning through the graph many times
… same as with SGD / logistic regression: disk-based streaming is much more expensive than memory-based approaches.
Locality of access is very important!
– gains if you can pre-cluster the graph, even approximately
– avoid sending messages across the network – keep them local

Machine Learning in Graphs

Some ideas
Combiners are helpful:
– Store outgoing incrementVBy messages and aggregate them.
– This is great for high-indegree pages.
Hadoop's combiners are suboptimal:
– Messages get emitted before being combined.
– Hadoop makes weak guarantees about combiner usage.
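
A sketch of the usual workaround, in-mapper combining, in Python pseudocode (the class and method names are mine; in Hadoop this buffer would live inside a Mapper and be flushed in cleanup()):

```python
class CombiningPageRankMapper:
    """Aggregate outgoing 'incrementVBy' messages in an in-memory buffer and emit them
    only when the map task finishes, instead of relying on Hadoop's combiners."""

    def __init__(self, c=0.15):
        self.c = c
        self.buffer = {}                              # destination node -> accumulated rank mass

    def map(self, node_id, rank, neighbors):
        if neighbors:
            share = (1 - self.c) * rank / len(neighbors)
            for j in neighbors:
                self.buffer[j] = self.buffer.get(j, 0.0) + share
        yield node_id, ("structure", neighbors)       # still pass the graph structure through

    def close(self):
        """Analogous to a map task's cleanup(): flush one combined message per destination."""
        for j, mass in self.buffer.items():
            yield j, ("mass", mass)
```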

This slide again!

Some ideas
Most hyperlinks are within a domain.
– If we keep each domain on the same machine, more messages will be local.
– To do this, build a custom partitioner that knows the domain of each node id and keeps nodes from the same domain together.
– Assign node ids so that nodes in the same domain are contiguous, partition node ids by range, and change Hadoop's Partitioner accordingly.
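
A minimal sketch of the idea in Python (the id-assignment scheme and function names are illustrative, not Hadoop's actual Partitioner API):

```python
from bisect import bisect_left
from urllib.parse import urlparse

def assign_node_ids(urls):
    """Sort URLs by domain so that each domain occupies a contiguous block of node ids."""
    ordered = sorted(urls, key=lambda u: urlparse(u).netloc)
    return {url: node_id for node_id, url in enumerate(ordered)}

def range_partition(node_id, upper_bounds):
    """Return the index of the reducer whose id range contains node_id.
    upper_bounds is a sorted list of inclusive upper bounds, one per reducer."""
    return bisect_left(upper_bounds, node_id)
```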

Some ideas
Repeatedly shuffling the graph is expensive.
– We should separate the messages about the graph structure (fixed over time) from the messages about the PageRank weights (variable):
– compute and distribute the edges once
– read them in incrementally in the reducer (not easy to do in Hadoop!)
– call this the "Schimmy" pattern
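
A minimal sketch of a Schimmy-style reducer in Python, assuming both the shuffled messages and the on-disk graph partition are sorted by node id and partitioned identically (the file format and names are mine):

```python
from itertools import groupby

def schimmy_reduce(message_stream, graph_partition_path, c=0.15, n_nodes=1):
    """message_stream: (node_id, mass) pairs sorted by node_id, as delivered after the shuffle.
    The local graph partition file has lines 'node_id neighbor1 neighbor2 ...' in the same
    sort order, so the graph structure never crosses the network."""
    with open(graph_partition_path) as graph:
        graph_line = graph.readline()
        for node_id, pairs in groupby(message_stream, key=lambda kv: kv[0]):
            # Advance through the local (pre-sorted) graph file until this node's record appears.
            while graph_line and graph_line.split()[0] != node_id:
                graph_line = graph.readline()
            neighbors = graph_line.split()[1:] if graph_line else []
            new_rank = c / n_nodes + (1 - c) * sum(mass for _, mass in pairs)
            yield node_id, (new_rank, neighbors)      # re-emit the structure with the new rank
```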

Schimmy relies on the fact that the keys are sorted, and sorts the graph input the same way…..

Schimmy

Results

More details at…
– Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach, by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM).
– Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks, by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM).
– Community Detection in Networks with Node Attributes, by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference on Data Mining (ICDM).
– J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets.

Resources
– Stanford SNAP datasets:
– ClueWeb (CMU):
– Univ. of Milan's repository: