Streaming Graph Partitioning KDD 8/15 Streaming Graph Partitioning for Large Distributed Graphs Isabelle Stanton, UC Berkeley Gabriel Kliot, Microsoft Research XCG

Motivation
Modern graph datasets are huge
– The web graph had over a trillion links in […]. Now?
– Facebook has “more than 901 million users with average degree 130”
– Protein networks

Motivation
We still need to perform computations, so we have to deal with large data
– PageRank (and other matrix-multiply problems)
– Broadcasting status updates
– Database queries
– And on and on and on…
Graph has to be distributed across a cluster of machines!

Motivation
Edges cut correspond (approximately) to communication volume required
Too expensive to move data on the network
– Interprocessor communication: nanoseconds
– Network communication: microseconds
The data has to be loaded onto the cluster at some point…
Can we partition while we load the data?

High Level Background
Graph partitioning is NP-hard on a good day
But then we made it harder:
– Graphs like social networks are notoriously difficult to partition (expander-like)
– Large data sets drastically reduce the amount of computation that is feasible: O(n) or less
– The partitioning algorithms need to be parallel and distributed

The Streaming Model
Graph Stream → Partitioner
Graph is ordered:
– Random
– Breadth-First Search
– Depth-First Search
Goal: Generate an approximately balanced k-partitioning
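The three stream orderings can be illustrated with a small sketch. This is not the presenters' code: the function names, the adjacency-list dictionary `adj`, and the toy graph are all assumptions made for illustration, and the BFS and DFS versions only cover the connected component of the start vertex.

```python
import random
from collections import deque

def random_order(adj, seed=0):
    """Vertices in a uniformly shuffled order (seeded for repeatability)."""
    order = list(adj)
    random.Random(seed).shuffle(order)
    return order

def bfs_order(adj, start):
    """Vertices in breadth-first order from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return order

def dfs_order(adj, start):
    """Vertices in depth-first order from `start` (iterative)."""
    seen, order, stack = set(), [], [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        # Push neighbors reversed so the first-listed neighbor is visited first.
        stack.extend(u for u in reversed(adj[v]) if u not in seen)
    return order

# Hypothetical toy graph: a 4-cycle 0-1-3-2 with a pendant vertex 4 on 3.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_order(adj, 0))  # [0, 1, 2, 3, 4]
print(dfs_order(adj, 0))  # [0, 1, 3, 2, 4]
```

Each function returns the vertex sequence that a streaming partitioner would consume one vertex at a time.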

Lower Bounds On Orderings
DFS Ordering
– Stream is connected
– Greedy will do optimally
Theory says these types of algorithms can’t do well

Current Approach in Real Systems

Our Approach
Evaluate 16 natural heuristics on 21 datasets with each of the three orderings with varying numbers of partitions
Find out which heuristics work on each graph
Compare these with the results of
– Random Hashing to get worst case
– METIS to get ‘best’ offline performance
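To make the hashing baseline concrete, here is a minimal sketch of a hash partitioner and of the fraction of edges cut. The names `hash_partition` and `cut_fraction` and the toy graph are assumptions, not taken from the talk. Under uniformly random assignment to k parts, each edge is cut with probability 1 - 1/k, which is why hashing serves as the worst-case reference.

```python
def hash_partition(vertices, k):
    """Assign each vertex to partition hash(v) % k.

    For Python ints the hash is deterministic, so this is repeatable.
    """
    return {v: hash(v) % k for v in vertices}

def cut_fraction(edges, part):
    """Fraction of edges whose endpoints land in different partitions."""
    cut = sum(1 for u, v in edges if part[u] != part[v])
    return cut / len(edges)

# Hypothetical toy graph: two 4-cycles joined by the single edge (3, 4).
edges = [(0, 1), (1, 2), (2, 3), (3, 0),
         (4, 5), (5, 6), (6, 7), (7, 4),
         (3, 4)]

hashed = hash_partition(range(8), k=2)          # splits by parity
by_cycle = {v: 0 if v < 4 else 1 for v in range(8)}

print(cut_fraction(edges, hashed))    # 1.0: every edge joins an odd and an even vertex
print(cut_fraction(edges, by_cycle))  # 1/9: only the bridge (3, 4) is cut
```

On this toy graph the parity split is unusually bad; on large graphs the hashed cut fraction tends toward 1 - 1/k.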

Caveats
METIS is a heuristic, not true lower bound
– Does fine in practice
– Available online for reproducing results
Used publicly available datasets
– Public graph datasets tend to be much smaller than what companies have
Using meta-data for partitioning can be good
– Partitioning the web graph by URL
– Using geographic location for social network users

Datasets
Includes finite element meshes, citation networks, social networks, web graphs, protein networks and synthetically generated graphs
Sizes: 297 vertices to 41.7 million vertices
Synthetic graph models
– Barabasi-Albert (Preferential Attachment)
– RMAT (Kronecker)
– Watts-Strogatz
– Power law-Clustered
Biggest graphs: LiveJournal and Twitter

Experimental Method
For each graph, heuristic, and ordering, partition into 2, 4, 8, 16 pieces
Compare with a random cut (upper bound)
Compare with METIS (lower bound)
Performance was measured by:

Heuristic Results
Best heuristic, LDG, gets an average improvement of 76% over all datasets!
[Figure: results for synthetic, social network, and finite element mesh graphs; legend: Hash, METIS, BFS, DFS, Random]

Scaling in the Size of Graphs: Exploiting Synthetic Graphs
[Figure: legend: LDG, Hash, METIS]

More Observations
BFS is a superior ordering for all algorithms
Avoid Big does 46% WORSE on average than Random Cut
Further experiments showed Linear Deterministic Greedy (LDG) has identical performance to Deterministic Greedy with load-based tie breaking.
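The transcript names Linear Deterministic Greedy but does not define it. The following is a minimal sketch under stated assumptions: each arriving vertex goes to the partition that already holds the most of its neighbors, with that count scaled by a linear penalty 1 - size/capacity. The penalty form, the load-based tie-break, and the names `ldg_partition` and `capacity` are assumptions for illustration, not confirmed by the slides.

```python
def ldg_partition(stream, adj, k, capacity):
    """One-pass greedy assignment with a linear load penalty (assumed rule).

    Requires k * capacity >= number of streamed vertices.
    """
    part = {}
    sizes = [0] * k
    for v in stream:
        # Count neighbors of v that have already been placed, per partition.
        counts = [0] * k
        for u in adj[v]:
            if u in part:
                counts[part[u]] += 1
        # Scale each count by the remaining fraction of capacity.
        scores = [counts[i] * (1 - sizes[i] / capacity) for i in range(k)]
        # Only partitions with free space are eligible.
        candidates = [i for i in range(k) if sizes[i] < capacity]
        # Highest score wins; ties go to the lighter partition, then lower index.
        best = max(candidates, key=lambda i: (scores[i], -sizes[i], -i))
        part[v] = best
        sizes[best] += 1
    return part

# Hypothetical toy graph: two 4-cycles joined by the single edge (3, 4).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0, 4],
       4: [5, 7, 3], 5: [4, 6], 6: [5, 7], 7: [6, 4]}
stream = [0, 1, 3, 2, 4, 5, 7, 6]  # BFS order from vertex 0, written out by hand

part = ldg_partition(stream, adj, k=2, capacity=4)
print([part[v] for v in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

On this toy input the sketch recovers the two cycles and cuts only the bridge edge. The penalty keeps a partition from absorbing neighbors indefinitely as it fills up.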

Results on a Real System
Compared the streamed partitioning with random hashing on SPARK, a distributed cluster computation system (http://www.spark-project.org/)
Used 2 datasets
– LiveJournal: 4.6 million users, 77 million edges
– Twitter: 41.7 million users, […] billion edges
Computed the PageRank of each graph

Results on SPARK
Metric | LJ Hash | LJ Stream | Twitter Hash | Twitter Stream
Naïve PR Mean | 296.2 s | 181.5 s | […] s | 969.3 s
Naïve PR STD | 5.5 s | 2.2 s | 81.2 s | 16.9 s
Combiner PR Mean | […] s | 110.4 s | 599.4 s | 486.8 s
Combiner PR STD | 1.5 s | 0.8 s | 14.4 s | 5.9 s
LJ Improvement: Naïve 38.7%, Combiner 28.8%
Twitter Improvement: Naïve 19.1%, Combiner 18.8%
LiveJournal: 4.6 million users, 77 million edges
Twitter: 41.7 million users, […] billion edges
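The improvement percentages appear to be relative reductions in mean running time. As a quick check, assuming improvement = (hash time - stream time) / hash time (the slide does not state the formula, and `improvement` is a hypothetical helper), the two rows whose means both survive in the transcript reproduce the reported figures:

```python
def improvement(hash_seconds, stream_seconds):
    """Relative reduction in running time, as a percentage (assumed formula)."""
    return 100.0 * (hash_seconds - stream_seconds) / hash_seconds

# LiveJournal, naive PageRank: 296.2 s hashed vs 181.5 s streamed.
print(round(improvement(296.2, 181.5), 1))  # 38.7
# Twitter, PageRank with combiner: 599.4 s hashed vs 486.8 s streamed.
print(round(improvement(599.4, 486.8), 1))  # 18.8
```

The other two improvements cannot be rechecked this way because one of the two means in each pair is missing from the transcript.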

Streaming graph partitioning is a really nice, simple, effective preprocessing step.

Where to now?
Can we explain theoretically why the greedy algorithm performs so well?*
What heuristics work better?
What heuristics are optimal for different classes of graphs?
Use multiple parallel streams!
Implement in real systems!
*Work under submission: I. Stanton, Streaming Balanced Graph Partitioning Algorithms for Random Graphs

Acknowledgements
David B. Wecker, Burton Smith, Reid Andersen, Nikhil Devanur, Sameh Elnikety, Sreenivas Gollapudi, Yuxiong He, Rina Panigrahy, Yuval Peres (all at MSR)
Satish Rao, Virginia Vassilevska Williams, Alexandre Stauffer, Ngoc Mai Tran, Miklos Racz, Matei Zaharia (all at Berkeley, CS and Statistics)
Supported by NSF and NDSEG fellowships, NSF grant CCF […], and an internship at Microsoft Research’s eXtreme Computing Group.