
Differentiated Graph Computation and Partitioning on Skewed Graphs
Rong Chen, Jiaxin Shi, Yanzhe Chen, Haibo Chen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China

Graphs are Everywhere¹
Traffic network graphs □ Monitor/react to road incidents □ Analyze Internet security
Biological networks □ Predict disease outbreaks □ Model drug-protein interactions
Social networks □ Recommend people/products □ Track information flow
¹ From the PPoPP'15 Keynote: Computing in a Persistent (Memory) World, P. Faraboschi

Graph-parallel Processing
Coding graph algorithms as vertex-centric programs that process vertices in parallel and communicate along edges, the "Think as a Vertex" philosophy.
Example: PageRank, a centrality analysis algorithm that iteratively ranks each vertex as a weighted sum of its neighbors' ranks.
COMPUTE(v)
  foreach n in v.in_nbrs
    sum += n.rank / n.nedges
  v.rank = 0.15 + 0.85 * sum
  if !converged(v)
    foreach n in v.out_nbrs
      activate(n)
The three phases of COMPUTE: gather data from neighbors via in-edges, update the vertex with the new data, and scatter data to neighbors via out-edges.
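The vertex-centric loop above can be sketched in ordinary code. Below is a minimal, single-machine Python sketch of the "think as a vertex" PageRank; the toy graph (reused from the later hybrid-cut example), the 0.15/0.85 damping constants, and the 1e-4 convergence threshold are illustrative assumptions, not part of any system API.

```python
# Minimal single-machine sketch of the vertex-centric PageRank above.
# The toy graph and the 1e-4 convergence threshold are assumptions.

edges = [(1, 6), (1, 4), (2, 1), (2, 5), (2, 6),
         (3, 1), (3, 4), (4, 1), (5, 1)]          # (source, target)

vertices = {v for e in edges for v in e}
out_nbrs = {v: [t for s, t in edges if s == v] for v in vertices}
in_nbrs  = {v: [s for s, t in edges if t == v] for v in vertices}
rank     = {v: 1.0 for v in vertices}

def compute(v):
    """One vertex-program invocation: gather, apply, scatter."""
    # Gather: weighted sum of neighbors' ranks over in-edges.
    total = sum(rank[n] / len(out_nbrs[n]) for n in in_nbrs[v])
    # Apply: PageRank update with damping factor 0.85.
    new_rank = 0.15 + 0.85 * total
    changed = abs(new_rank - rank[v]) > 1e-4
    rank[v] = new_rank
    # Scatter: activate out-neighbors if this vertex has not converged.
    return out_nbrs[v] if changed else []

active = set(vertices)
while active:
    # Process all active vertices "in parallel" (sequentially here).
    active = {n for v in active for n in compute(v)}

print(sorted(rank.items()))
```

A distributed engine runs the same per-vertex logic; the design questions the talk addresses are where the edges and vertex replicas live and who pays the communication.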

Natural Graphs are Skewed
Graphs in the real world have a power-law degree distribution (aka Zipf's law or Pareto): "most vertices have relatively few neighbors while a few have many neighbors" -- PowerGraph [OSDI'12]. Source: M. E. J. Newman, "Power laws, Pareto distributions and Zipf's law", 2005.
Case: SINA Weibo (微博) □ 167 million active users □ #Followers: #1 77,971,093; #2 76,946,782; #100 14,723,… (truncated)
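To make the skew concrete, the short Python sketch below samples a synthetic power-law (Zipf-like) degree sequence and reports what fraction of all edge endpoints belongs to the top 1% of vertices. The exponent alpha = 2.0, the vertex count, and the degree cap are illustrative assumptions, not Weibo's actual data.

```python
# Sketch: how skewed is a power-law (Zipf-like) degree distribution?
# alpha = 2.0, n = 100_000 and max_deg are illustrative assumptions.
import random

random.seed(42)
alpha, n, max_deg = 2.0, 100_000, 100_000

# Weighted sampling of degrees with P(d) proportional to d^(-alpha), d >= 1.
weights = [d ** -alpha for d in range(1, max_deg + 1)]
degrees = random.choices(range(1, max_deg + 1), weights=weights, k=n)

degrees.sort(reverse=True)
top1 = max(1, n // 100)
share = sum(degrees[:top1]) / sum(degrees)
print(f"top 1% of vertices hold {share:.0%} of all edge endpoints")
```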

Challenge: LOCALITY vs. PARALLELISM
LOCALITY: make resources locally accessible to hide network latency.
High-degree vertex (e.g., 77,971,093 followers) vs. low-degree vertex.
Applying locality to high-degree vertices causes: 1. Load imbalance 2. High contention 3. Heavy network traffic

Challenge: LOCALITY vs. PARALLELISM
LOCALITY: make resources locally accessible to hide network latency. PARALLELISM: evenly parallelize the workload to avoid load imbalance.
Applying parallelism to low-degree vertices causes redundant communication, computation & synchronization (not worthwhile).

Challenge: LOCALITY vs. PARALLELISM
LOCALITY: make resources locally accessible to hide network latency. PARALLELISM: evenly parallelize the workload to avoid load imbalance.
The two goals conflict: millions of users with 100 followers vs. 100 users with 100-million followers.

Existing Efforts: One Size Fits All
Pregel [SIGMOD'10] and GraphLab [VLDB'12]
□ Focus on exploiting LOCALITY
□ Partitioning: use edge-cut to evenly assign vertices along with all of their edges
□ Computation: aggregate all resources (i.e., msgs or replicas) of a vertex on the local machine
(Figure: edge-cut example with master and mirror (replica) vertices and messages.)

Existing Efforts: One Size Fits All
PowerGraph [OSDI'12] and GraphX [OSDI'14]
□ Focus on exploiting PARALLELISM
□ Partitioning: use vertex-cut to evenly assign edges, with replicated vertices
□ Computation: decompose the workload of a vertex across multiple machines
(Figure: vertex-cut example with one master and several mirrors per replicated vertex.)
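A hedged sketch of the two partitioning families just described: edge-cut keeps every edge with one of its endpoints and replicates the other endpoint, while vertex-cut hashes each edge and replicates any vertex whose edges span machines. The 3-machine setup, the modulo hash functions, and the toy edge list are illustrative assumptions, not the heuristics used by the real systems.

```python
# Sketch: edge-cut vs. vertex-cut on the same toy edge list.
# 3 machines and plain modulo hashing are illustrative assumptions.
from collections import defaultdict

edges = [(1, 6), (1, 4), (2, 1), (2, 5), (2, 6),
         (3, 1), (3, 4), (4, 1), (5, 1)]
P = 3  # number of machines

def edge_cut(edges):
    """Edge-cut flavor: each edge stays on its source's machine;
    the other endpoint appears there as a replica (ghost)."""
    placement = defaultdict(set)             # machine -> vertices present
    for src, dst in edges:
        m = src % P
        placement[m].update((src, dst))
    return placement

def vertex_cut(edges):
    """Vertex-cut flavor: each edge is hashed to a machine;
    both endpoints are replicated wherever their edges land."""
    placement = defaultdict(set)
    for src, dst in edges:
        m = (src + dst) % P                  # edge-level hash
        placement[m].update((src, dst))
    return placement

def replication_factor(placement):
    copies = sum(len(vs) for vs in placement.values())
    uniques = len({v for vs in placement.values() for v in vs})
    return copies / uniques

for name, part in [("edge-cut", edge_cut(edges)),
                   ("vertex-cut", vertex_cut(edges))]:
    print(name, "replication factor:", round(replication_factor(part), 2))
```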

Contributions
PowerLyra: differentiated graph computation and partitioning on skewed graphs, instead of "one size fits all"
□ Hybrid computation and partitioning strategies → embrace locality (low-degree vertices) and parallelism (high-degree vertices)
□ An optimization to improve locality and reduce interference in communication
□ Outperforms PowerGraph (by up to 5.53X) and other systems in both ingress and runtime

Agenda: Graph Partitioning | Graph Computation | Optimization | Evaluation

Graph Partitioning: Hybrid-cut
Hybrid-cut: differentiate graph partitioning for low-degree and high-degree vertices, split by a user-defined threshold θ (θ=3 in the sample graph)
□ Low-cut (inspired by edge-cut): reduce mirrors and exploit locality for low-degree vertices
□ High-cut (inspired by vertex-cut): provide balance and restrict the impact of high-degree vertices
□ Efficiently combine low-cut and high-cut
(Figure: sample graph with high/low masters and mirrors.)

Hybrid-cut (Low)
Low-cut: evenly assign low-degree vertices, along with their edges in only one direction (e.g., in-edges), to machines by hashing the (e.g., target) vertex. Example: hash(X) = X % 3, θ=3.
1. Unidirectional access locality 2. No duplicated edges 3. Efficiency at ingress 4. Load balance (edges & vertices) 5. Ready-made heuristics
NOTE: a new heuristic, called Ginger, is provided to further optimize Random hybrid-cut (see paper).

Hybrid-cut (High)
High-cut: distribute the one-direction edges (e.g., in-edges) of high-degree vertices to machines by hashing the opposite (e.g., source) vertex. Example: hash(X) = X % 3, θ=3.
1. Avoid adding any mirrors for low-degree vertices (i.e., an upper bound on the mirrors added when assigning a new high-degree vertex) 2. No duplicated edges 3. Efficiency at ingress 4. Load balance (edges & vertices)
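The low-cut and high-cut rules combine into a single placement function: an edge follows its target if the target is low-degree, and its source otherwise. The Python sketch below illustrates that rule under stated assumptions (in-edge direction, modulo hashing, θ = 3, and the sample edge list); it is an illustration, not PowerLyra's partitioner code.

```python
# Sketch of the hybrid-cut placement rule (in-edge direction, theta = 3).
# Modulo hashing and the toy edge list are illustrative assumptions.
from collections import Counter, defaultdict

edges = [(1, 6), (1, 4), (2, 1), (2, 5), (2, 6),
         (3, 1), (3, 4), (4, 1), (5, 1)]
P, THETA = 3, 3

in_degree = Counter(dst for _, dst in edges)

def owner(src, dst):
    """Low-cut: hash the target of a low-degree vertex's in-edge.
    High-cut: hash the source of a high-degree vertex's in-edge."""
    if in_degree[dst] <= THETA:
        return dst % P      # low-degree: all in-edges land on one machine
    return src % P          # high-degree: in-edges spread by source

partition = defaultdict(list)
for src, dst in edges:
    partition[owner(src, dst)].append((src, dst))

for machine in sorted(partition):
    print(f"machine {machine}: {partition[machine]}")
```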

Constructing Hybrid-cut
Sample graph, θ=3; input format: edge-list, e.g., (1,6) (1,4) (2,1) (2,5) (2,6) (3,1) (3,4) (4,1) (5,1)
1. Load files in parallel
2. Dispatch all edges (Low-cut) and count the in-degree of vertices
3. Re-assign the in-edges of high-degree vertices (High-cut, θ=3)
4. Create mirrors to construct a local graph
NOTE: some formats (e.g., adjacency list) can skip counting and re-assignment.
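The four ingress steps can be summarized as a small, single-process Python sketch; the real system performs them in parallel across machines, and the helper names, P = 3, θ = 3, and the reuse of the toy edge list are assumptions for illustration only.

```python
# Sketch of the hybrid-cut ingress pipeline: load, dispatch, re-assign,
# construct.  Single-process stand-in for the parallel implementation;
# the toy edge list, P = 3 and theta = 3 are illustrative assumptions.
from collections import Counter, defaultdict

P, THETA = 3, 3

def load_edge_list():
    # Step 1: load files in parallel (here: an in-memory sample graph).
    return [(1, 6), (1, 4), (2, 1), (2, 5), (2, 6),
            (3, 1), (3, 4), (4, 1), (5, 1)]

def ingress():
    edges = load_edge_list()

    # Step 2: dispatch all edges by low-cut (hash the target vertex)
    # and count in-degrees on the way.
    machines = defaultdict(list)
    in_degree = Counter()
    for src, dst in edges:
        machines[dst % P].append((src, dst))
        in_degree[dst] += 1

    # Step 3: re-assign the in-edges of high-degree vertices by
    # hashing the source vertex instead (high-cut).
    high = {v for v, d in in_degree.items() if d > THETA}
    reassigned = defaultdict(list)
    for m, local_edges in machines.items():
        for src, dst in local_edges:
            target = (src % P) if dst in high else m
            reassigned[target].append((src, dst))
    machines = reassigned

    # Step 4: create mirrors so every machine can build a local graph.
    masters = {v: v % P for v in {x for e in edges for x in e}}
    mirrors = defaultdict(set)
    for m, local_edges in machines.items():
        for src, dst in local_edges:
            for v in (src, dst):
                if masters[v] != m:
                    mirrors[m].add(v)
    return machines, mirrors

machines, mirrors = ingress()
for m in sorted(machines):
    print(f"machine {m}: edges={machines[m]} mirrors={sorted(mirrors[m])}")
```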

Agenda: Graph Partitioning | Graph Computation | Optimization | Evaluation

Graph Computation: Hybrid-model
Hybrid-model: differentiate the graph computation model for high-degree and low-degree vertices
□ High-degree model: decompose the workload of high-degree vertices for load balance
□ Low-degree model: minimize the communication cost for low-degree vertices
□ Seamlessly run all existing graph algorithms
(Figure: GAS phases on master and mirrors for the high-degree and low-degree models.)

Hybrid-model (High)
Example: PageRank. High-degree vertex □ Exploit PARALLELISM □ Follow the GAS model (inspired by POWERGRAPH) □ Comm. cost: ≤ 4 x #mirrors
GATHER(v, n): return n.rank / n.nedges      GATHER_EDGES = IN
SUM(a, b): return a + b
APPLY(v, sum): v.rank = 0.15 + 0.85 * sum
SCATTER(v, n): if !converged(v) activate(n)  SCATTER_EDGES = OUT
(Gather and Scatter run distributed across master and mirrors; Apply runs locally on the master.)

Hybrid-model (Low)
Low-degree vertex □ Exploit LOCALITY □ Preserve the GAS interface □ Unidirectional access locality □ Comm. cost: ≤ 1 x #mirrors
OBSERVATION: most algorithms only gather or scatter in one direction (e.g., PageRank: G/IN and S/OUT).
(Same GAS code as above; Gather runs locally on the master, and only one message per mirror is needed for Scatter.)

Generalization
Seamlessly run all graph algorithms: one-direction locality limits expressiveness for some other algorithms (i.e., those that gather or scatter in both the IN and OUT directions)
□ Adaptively downgrade the model (+N msgs, 1 ≤ N ≤ 3)
□ Detect at runtime without any changes to applications
e.g., Connected Components (CC) → G/NONE and S/ALL → +1 msg in SCATTER (mirror → master) → comm. cost: ≤ 2 x #mirrors
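A sketch of the runtime decision the slide describes: given an algorithm's declared gather/scatter directions, estimate the messages needed per mirror of a low-degree vertex. The direction constants and the cost table below are a simplified reading of the slide's bounds for illustration, not PowerLyra's internal engine interface.

```python
# Sketch: adaptive downgrade of the low-degree model based on the
# declared gather/scatter edge directions of an algorithm.
# The cost table is a simplification; the slide's bound is +1..+3
# extra messages depending on the exact direction combination.

def msgs_per_mirror(gather_edges, scatter_edges):
    """Rough upper bound on messages per mirror of a low-degree vertex.

    Hybrid-cut places all in-edges of a low-degree vertex at its
    master, so algorithms that gather over IN and scatter over OUT
    (e.g., PageRank) need only 1 message per mirror; other direction
    combinations trigger the adaptive downgrade, as in the CC example.
    """
    extra = 0
    if gather_edges not in ("IN", "NONE"):    # mirrors must gather too
        extra += 1
    if scatter_edges not in ("OUT", "NONE"):  # mirrors must report back
        extra += 1
    # One message always ships the new vertex data master -> mirror.
    return 1 + extra

print("PageRank (G/IN,   S/OUT):", msgs_per_mirror("IN", "OUT"))    # 1
print("CC       (G/NONE, S/ALL):", msgs_per_mirror("NONE", "ALL"))  # 2
```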

Agenda: Graph Partitioning | Graph Computation | Optimization | Evaluation

Challenge: Locality & Interference
Poor locality and high interference in communication:
□ Messages are batched and sent periodically
□ Messages from different nodes are applied to vertices (i.e., masters and mirrors) in parallel by worker and RPC threads
OPT: Locality-conscious graph layout
□ Focus on exploiting locality in communication
□ Extend hybrid-cut at ingress (without runtime overhead)

Example: Initial State
Both the masters and mirrors of all vertices (high- and low-degree) are mixed and stored in a random order on each machine, e.g., Hash(X) = (X-1) % 3.
(Figure legend: H = high-master, h = high-mirror, L = low-master, l = low-mirror.)

Example: Zoning
1. ZONE: divide the vertex space into 4 zones that store high-masters (Z0), low-masters (Z1), high-mirrors (Z2), and low-mirrors (Z3).

Example: Grouping
2. GROUP: the mirrors (i.e., Z2 and Z3) are further grouped within each zone according to the location of their masters.

Example: Sorting
3. SORT: both the masters and the mirrors within a group are sorted according to the global vertex ID.

Example: Rolling
4. ROLL: place the mirror groups in a rolling order: on machine n with p partitions, the mirror groups start from (n + 1) % p; e.g., with p = 3, machine M1's mirror groups start from h2, then h0.
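The four layout steps (zone, group, sort, roll) amount to computing a new local ordering of vertex IDs on each machine. The Python sketch below reorders one machine's vertices under stated assumptions (3 machines, masters placed by modulo hash, a made-up set of local vertices); it illustrates the idea only and is not the actual PowerLyra data structure.

```python
# Sketch: locality-conscious layout of one machine's vertex array.
# Steps: 1. zone  2. group  3. sort  4. roll.  Assumes P = 3 machines
# and that the master machine of every local vertex is known.

P = 3

def layout(machine, high_masters, low_masters, high_mirrors, low_mirrors,
           master_of):
    """Return the local vertex order for `machine`.

    high_mirrors/low_mirrors are vertex IDs whose masters live on other
    machines; master_of maps a vertex ID to its master machine.
    """
    # Step 1 (zone) + step 3 (sort): masters first, sorted by global ID.
    order = sorted(high_masters) + sorted(low_masters)

    # Step 2 (group): bucket mirrors by the machine of their master.
    def grouped(mirrors):
        groups = {m: [] for m in range(P) if m != machine}
        for v in mirrors:
            groups[master_of[v]].append(v)
        return groups

    # Step 4 (roll): visit mirror groups starting from (machine + 1) % P.
    def rolled(groups):
        out = []
        for i in range(1, P):
            m = (machine + i) % P
            out.extend(sorted(groups[m]))   # step 3 again: sort in group
        return out

    return order + rolled(grouped(high_mirrors)) + rolled(grouped(low_mirrors))

# Toy example on machine 1: masters hashed as v % P.
master_of = {v: v % P for v in range(1, 10)}
print(layout(1, high_masters=[1], low_masters=[4, 7],
             high_mirrors=[3], low_mirrors=[5, 6, 8, 9],
             master_of=master_of))
```

With this ordering, messages arriving from a given remote machine touch one contiguous group of mirrors, which is the locality the optimization is after.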

Tradeoff: Ingress vs. Runtime
OPT: Locality-conscious graph layout
□ The increase in graph ingress time is modest → executed independently on each machine → less than 5% (power-law graphs) & around 10% (real-world graphs)
□ Usually more than 10% speedup at runtime → 21% for the Twitter follower graph
(Figure: ingress and runtime impact on power-law (α) and real-world graphs.)

Agenda: Graph Partitioning | Graph Computation | Optimization | Evaluation

Implementation
POWERLYRA
□ Based on the latest PowerGraph (v2.2, Mar. 2014)
□ A separate engine and partitioning algorithms
□ Supports both sync and async execution engines
□ Seamlessly runs all graph toolkits in GraphLab and respects its fault-tolerance model
□ Ports the Random hybrid-cut to GraphX¹
The source code and a brief instruction are available at the IPADS project page (projects/powerlyra.html).
¹ We only port Random hybrid-cut, to preserve the graph-partitioning interface of GraphX.

Evaluation
Baseline: latest POWERGRAPH (v2.2, Mar. 2014) with various vertex-cuts (Grid/Oblivious/Coordinated)
Platforms
1. A 48-node VM-based cluster¹ (each: 4 cores, 12GB RAM)
2. A 6-node physical cluster¹ (each: 24 cores, 64GB RAM)
Algorithms and graphs
□ 3 graph-analytics algorithms (PR², CC, DIA) and 2 MLDM algorithms (ALS, SGD)
□ 5 real-world graphs, 5 synthetic graphs³, etc.
¹ Clusters are connected via 1GigE. ² We run 10 iterations for PageRank. ³ Synthetic graphs have a fixed 10 million vertices.

Performance (48-node)
Speedups over default PowerGraph range from 2.0X to 5.5X across the power-law (α) and real-world graphs.
(Figure: per-graph speedups; lower execution time is better.)

Scalability
(Figure: execution time vs. #machines on the 48-node cluster and vs. #vertices (millions) on the 6-node cluster, compared against default PowerGraph; lower is better.)

Breakdown (48-node)
(Figure: breakdown of the benefit on power-law (α) graphs, compared against default PowerGraph; lower is better.)

Graph Partitioning Comparison
(Figure: partitioning quality on power-law (α) and real-world graphs.)

Threshold (48-node)
Impact of different thresholds
□ Using high-cut or low-cut for all vertices is poor
□ Execution time is stable for a large range of θ → the difference is lower than 1 sec for 100 ≤ θ ≤ 500
(Figure: execution time vs. threshold θ, from high-cut for all vertices to low-cut for all vertices (θ = +∞).)

vs. Other Systems¹ (6-node)
PageRank with 10 iterations on the Twitter follower graph and a power-law (α=2.0) graph, against both in-memory and out-of-core systems.
(Figure: execution time in seconds; lower is better.)
¹ We use Giraph 1.1, CombBLAS 1.4, GraphX 1.1, Galois 2.2.1, X-Stream 1.0, as well as the latest versions of GPS, PowerGraph, Polymer, and GraphChi.

Conclusion
Existing systems use a "ONE SIZE FITS ALL" design for skewed graphs, resulting in suboptimal performance.
POWERLYRA: a hybrid and adaptive design that differentiates computation and partitioning for low- and high-degree vertices, embracing both LOCALITY and PARALLELISM
□ Outperforms PowerGraph by up to 5.53X
□ Also much faster than other systems such as GraphX, Giraph, and CombBLAS in both ingress and runtime
□ Random hybrid-cut improves GraphX by 1.33X

Thanks. Questions?
Project page: projects/powerlyra.html (Institute of Parallel and Distributed Systems)

Related Work
Other graph processing systems
□ GENERAL: Pregel [SIGMOD'10], GraphLab [VLDB'12], PowerGraph [OSDI'12], GraphX [OSDI'14]
□ OPTIMIZATION: Mizan [EuroSys'13], Giraph++ [VLDB'13], PowerSwitch [PPoPP'15]
□ SINGLE MACHINE: GraphChi [OSDI'12], X-Stream [SOSP'13], Galois [SOSP'13], Polymer [PPoPP'15]
□ OTHER PARADIGMS: CombBLAS [IJHPCA'11], SociaLite [VLDB'13]
Graph replication and partitioning
□ EDGE-CUT: Stanton et al. [SIGKDD'12], Fennel [WSDM'14]
□ VERTEX-CUT: Greedy [OSDI'12], GraphBuilder [GRADES'13]
□ HEURISTICS: 2D [SC'05], Degree-Based Hashing [NIPS'15]

Backup

Downgrade
OBSERVATION: most algorithms only gather or scatter in one direction; the one-direction (low-degree) placement is the sweet spot.
(Figure: #msgs required by the different GATHER/SCATTER placements of master and mirrors for a low-degree vertex.)

Other Algorithms and Graphs (48-node)
MLDM algorithms on Netflix (|V|=0.5M, |E|=99M) and a non-skewed graph, RoadUS (|V|=23.9M, |E|=58.3M); algorithms with the G/OUT & S/NONE and G/NONE & S/ALL patterns.
(Figure: normalized speedup over default PowerGraph for ingress and runtime, vs. power-law exponent (α) and latent dimension; higher is better.)

Ingress Time vs. Execution Time

Fast CPU & Networking
Testbed: 6-node physical clusters
1. SLOW: 24 cores (AMD, L3: 12MB), 64GB RAM, 1GbE
2. FAST: 20 cores (Intel, L3: 24MB), 64GB RAM, 10GbE
(Figure: execution time in seconds on the Twitter follower graph and a power-law (α=2.0) graph.)