# Differentiated Graph Computation and Partitioning on Skewed Graphs


Differentiated Graph Computation and Partitioning on Skewed Graphs
PowerLyra, 2014. Hi everyone, today I'll introduce our work on a distributed graph analytics framework, which is joint work with Jiaxin, Yanzhe, and Haibo. Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University.

How do we understand and use Big Data?
Big Data Everywhere: 100 hours of video uploaded every minute, 1.11 billion users, 6 billion photos, 400 million tweets per day. How do we understand and use Big Data? Since Big Data has been described everywhere, I will keep this short. Many data mining and machine learning algorithms have been used to understand big data.

Big Learning: machine learning and data mining on Big Data
Big Data  Big Learning 100 Hrs of Video every minute 1.11 Billion Users 6 Billion Photos 400 Million Tweets/day Big Learning: machine learning and data mining on Big Data From Social Network, Recommendation system, Natural Language Processing and Web Search. It is named big learning. NLP

It’s all about the graphs …
All these algorithms can be abstracted as computation on graphs. Thus, analysis of big social networks and other huge graphs has become a hot topic. It's all about the graphs …

Example Algorithms PageRank (Centrality Measures)
α is the random reset probability and L[j] is the number of links on page j; iterate until convergence:

R[i] = α + (1 − α) · Σ_{j ∈ Nbrs(i)} R[j] / L[j]

PageRank is often used as a typical example to show the key properties of graph algorithms. It can be expressed as an iterative computation on vertices: each vertex updates its rank as a weighted sum of its neighbors' ranks, and the new rank triggers re-execution on the neighbors until all vertices have converged. Example (vertex 1 gathers from vertices 3, 4, and 5): R[1] is computed from R[3], R[4], and R[5].

Background: Graph Algorithms
Dependent data, local accesses, iterative computation. Coding graph algorithms as vertex-centric programs to process vertices in parallel and communicate along edges. Compared to data-parallel algorithms, the properties of graph algorithms can be summarized as dependent data, local accesses, and iterative computation. Given these properties, most graph-parallel models follow the "think as a vertex" philosophy: the programmer only needs to provide a vertex-centric program that abstracts the algorithm, is executed on vertices in parallel, and communicates along edges.

Think as a Vertex: Algorithm Implementation, compute() for a vertex
The vertex-centric program consists of three steps: 1. aggregate the values of neighbors; 2. update its own value; 3. activate neighbors. I still use PageRank as the example; it can be coded as follows, summing up the neighbors' values, updating the vertex's own rank, and activating the neighbors:

    compute(v):
      double sum = 0
      double value, last = v.get()
      foreach (n in v.in_nbrs)
        sum += n.value / n.nedges;
      value = α + (1 − α) * sum;
      v.set(value);
      activate(v.out_nbrs);
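The pseudocode above can be made concrete as a small synchronous simulation. The following is an illustrative sketch (not the PowerLyra implementation), assuming the five-vertex example graph from the earlier slide and a reset probability α = 0.15:

```python
ALPHA = 0.15  # random reset probability (assumed value)

# Example graph, stored as in-neighbor lists:
# vertex 1 gathers from vertices 3, 4, and 5.
in_nbrs = {1: [3, 4, 5], 2: [1], 3: [1, 2], 4: [1], 5: [4]}

# out-degree of each vertex (n.nedges in the pseudocode)
out_degree = {v: sum(v in nbrs for nbrs in in_nbrs.values()) for v in in_nbrs}

def compute(v, rank):
    # 1. aggregate values of in-neighbors
    total = sum(rank[n] / out_degree[n] for n in in_nbrs[v])
    # 2. update own value
    return ALPHA + (1 - ALPHA) * total

rank = {v: 1.0 for v in in_nbrs}
for _ in range(100):  # 3. in a synchronous engine every vertex stays active
    rank = {v: compute(v, rank) for v in rank}
```

After enough iterations the ranks stop changing, which is the convergence condition the slide refers to.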

Graph in the Real World. Hallmark property: skewed power-law degree distributions, meaning "most vertices have relatively few neighbors while a few have many neighbors". An important challenge for graph processing comes from natural graphs, which have such skewed power-law degree distributions; social networks are a typical example. Thus, the vertices in a natural graph can be distinguished by degree: low-degree vertices and high-degree vertices (star-like motifs). Twitter following graph: 1% of the vertices are adjacent to nearly half of the edges.

Existing Graph Models
There are three representative distributed graph-parallel models. The first is Pregel from Google, which adopts edge-cut to evenly partition the graph and uses messages to ensure that all resources have been accumulated locally before computation on a vertex; the message also implies the activation. Pregel does not support dynamic computation, since a vertex cannot actively pull data from neighbors. To remedy this problem, GraphLab introduces read-only replication and uses implicit messages to synchronize vertex and edge data; one additional message transfers the activation from mirror to master. Both Pregel and GraphLab provide local semantics for vertex computation to avoid network latency, but this results in load imbalance and heavy network traffic when processing natural graphs, since a high-degree vertex accumulates too many resources on a single machine. PowerGraph splits the workload of a single vertex across multiple machines to address the imbalance issue, so that mirrors take part in the computation to gather values and activate neighbors. However, distributing the computation also causes heavy communication between master and mirrors: one mirror incurs up to 5 messages in each iteration, a heavy cost for low-degree vertices. In short, no existing graph model achieves optimal performance on natural graphs.

| Computation Model | Graph Placement | Comp. Pattern | Comm. Cost | Dynamic Comp. | Load Balance |
|---|---|---|---|---|---|
| Pregel | edge-cuts | local | ≤ #edge-cuts | no | no |
| GraphLab | edge-cuts | local | ≤ 2 × #mirrors | yes | no |
| PowerGraph | vertex-cuts | distributed | ≤ 5 × #mirrors | yes | yes |

Existing Graph Cuts (edge-cut vs. vertex-cut; random vs. greedy). Graph partitioning plays a vital role in distributed graph processing. Traditional edge-cut, adopted by GraphLab, evenly assigns vertices to machines and creates replicas for edges spanning machines to construct a local graph. An edge may itself be replicated, since both endpoint vertices require it; for a high-degree vertex, this incurs load imbalance. PowerGraph proposes vertex-cut to avoid the accumulation and replication of edges: all edges are evenly assigned to machines, mirrors of vertices are created, and one of them is selected as the master. To easily support external querying, the master of a vertex is forced to be located on its hash-based machine even if no edge is placed there; this is called a flying master. Randomized vertex-cut incurs a large number of mirrors and poor performance during graph ingress and at runtime, so a greedy heuristic was proposed to minimize the number of mirrors according to previous assignments. But this results in very poor ingress performance due to periodically exchanging mapping information between machines. Further, the heuristic is unfair to low-degree vertices, since they have less attraction compared to high-degree vertices.

Issues of Graph Partitioning
Edge-cut: imbalance and replicated edges. Vertex-cut: does not exploit locality. Random: high replication factor*. Greedy: long ingress time, unfair to low-degree vertices. Constrained: imbalance, poor placement of low-degree vertices.

| Partition | λ | Ingress | Runtime |
|---|---|---|---|
| Random | 16.0 | 263 | 94.2 |
| Coordinated | 5.5 | 391 | 33.7 |
| Oblivious | 12.8 | 289 | 75.3 |
| Grid | 8.9 | 138 | 43.6 |

In short, edge-cut easily incurs load imbalance and has replicated edges, while vertex-cut does not exploit locality, especially for low-degree vertices. The experiment on the Twitter follower graph (48 machines, |V| = 42M, |E| = 1.47B) using PageRank illustrates that random vertex-cut has a high replication factor, and greedy vertex-cut has a long ingress time if coordinated between machines; if not coordinated, the effect of the heuristic is small. Constrained vertex-cut is a newer solution from Intel that provides a compromise between ingress time and runtime. It can ensure an upper bound on the replication factor by restricting the assignment of edges to a subset of machines, and it is efficient during graph ingress, since the constraint is predetermined. However, it may result in imbalance and likely provides poor placement for low-degree vertices, since the constraint is too relaxed for them. Thus its replication factor and runtime are slightly worse than greedy.

*Replication factor: λ = #replications / #vertices.
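The replication factor λ defined above can be computed directly from an edge assignment. A minimal sketch, where the `(src, dst, machine)` triple representation is an illustrative assumption:

```python
def replication_factor(edge_assignment):
    """lambda = #replications / #vertices, where a vertex is replicated on
    every machine that holds at least one of its edges."""
    replicas = {}  # vertex -> set of machines holding a replica
    for src, dst, machine in edge_assignment:
        replicas.setdefault(src, set()).add(machine)
        replicas.setdefault(dst, set()).add(machine)
    return sum(len(ms) for ms in replicas.values()) / len(replicas)

# Three edges of vertex 1 spread over three machines:
# vertex 1 gets 3 replicas, vertices 2-4 get 1 each, so lambda = 6/4 = 1.5
edges = [(1, 2, "M1"), (1, 3, "M2"), (1, 4, "M3")]
```

This is exactly why random vertex-cut hurts low-degree vertices: every extra machine an edge lands on adds one replica to each endpoint.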

Principle of PowerLyra
Two vitally important challenges are associated with the performance of a distributed computation system: 1. How to make resources locally accessible? 2. How to evenly parallelize workloads? These goals conflict: the first, followed by Pregel and GraphLab, avoids network latency; the second, chosen by PowerGraph, fully uses distributed resources. One size does not fit all. The skewed distribution of natural graphs calls for differentiated processing and partitioning. Thus we introduce PowerLyra, which applies different computation and partitioning strategies to different vertices: exploit locality for low-degree vertices and exploit parallelism for high-degree vertices.

“Gather Apply Scatter”
Computation model for high-degree vertices. Goal: exploit parallelism. We follow the GAS model [PowerGraph OSDI'12], which splits the computation on a vertex into three parts: gather, apply, and scatter. Gather and scatter are executed on edges, so they can be distributed to mirrors. The monolithic compute() from the previous slide decomposes as:

    gather(n):
      return n.value / n.nedges;

    apply(v, acc):
      value = α + (1 − α) * acc;
      v.set(value);

    scatter(v):
      activate(v.out_nbrs);
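The same decomposition can be written out in Python. A hedged sketch of PageRank in GAS form, assuming a sum combiner for partial gather results and α = 0.15:

```python
ALPHA = 0.15  # random reset probability (assumed value)

def gather(n_value, n_nedges):
    # run per in-edge (possibly on a mirror); results are sum-combined
    return n_value / n_nedges

def apply(acc):
    # run once on the master with the combined accumulator
    return ALPHA + (1 - ALPHA) * acc

def scatter(out_nbrs, activate):
    # run per out-edge (possibly on a mirror): re-activate neighbors
    for n in out_nbrs:
        activate(n)

# A master with two in-neighbors, each with rank 1.0 and 2 out-edges:
partial = gather(1.0, 2) + gather(1.0, 2)   # combined accumulator = 1.0
new_rank = apply(partial)                    # 0.15 + 0.85 * 1.0
```

Because gather and scatter touch only one edge at a time, partial sums can be computed on each mirror and only the accumulators cross the network.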

Computation Model: High-degree vertex. Goal: exploit parallelism
Follow the GAS model [PowerGraph OSDI'12]. In the gather phase, the high-degree master asks its mirrors to call gather() and collects the partial results (messages 1 and 2: master → mirrors, mirrors → master). In the apply phase, the master updates the vertex data with apply() and synchronizes the vertex data with its mirrors (message 3: master → mirrors). In the scatter phase, the master asks its mirrors to call scatter() and activate neighbors, and a mirror notifies the master if it has been activated by a neighbor (messages 4 and 5: master → mirrors, mirrors → master). This is where the bound of 5 messages per mirror per iteration comes from.

Computation Model: Low-degree vertex. Goal: exploit locality
Observation: most algorithms only gather or scatter in one direction (e.g., PageRank: Gather/IN and Scatter/OUT). So we exploit one-direction locality (avoiding replicated edges): local gather + distributed scatter, with communication cost ≤ 1 × #mirrors. For low-degree vertices we do not adopt GraphLab's design, since it replicates edges and doubles messages to provide full locality. According to our observation, it is enough to assume that all edges of one direction are local. Since the update message from master to mirrors in the apply phase cannot be omitted, we choose local gather and distributed scatter. For PageRank we exploit in-edge locality, so all in-edges are in place: the master completes gather locally and sends one message to each mirror to update the vertex data and do scatter. Since only the master can be activated, no message from mirror back to master is needed. So the communication cost for a low-degree vertex is at most 1 message per mirror per iteration.

Computation Model: Generality
If an algorithm gathers or scatters in both directions, we adaptively degrade gathering or scattering. This is easy to check at runtime without overhead, since the user has explicitly defined the access direction in code. For example, if an algorithm still gathers from in-edges but scatters in both directions (Gather/IN and Scatter/ALL), one extra message from mirror to master is added in the scatter phase to note that the mirror has been activated.

| Type | Gather | Scatter | Example |
|---|---|---|---|
| In | IN / NONE | OUT / NONE | PageRank |
| Out | OUT / NONE | IN / NONE | Approximate Diameter |
| Other | ANY | ANY | Loopy Belief Propagation |

Graph Partitioning Low-degree vertex
Place the one-direction edges (e.g., in-edges) of a vertex on its hash-based machine. Simple, but best! Benefits: lower replication factor, one-direction locality, efficiency (ingress and runtime), balance (#edges), and fewer flying masters. To support our new hybrid model, we design a hybrid vertex-cut (hybrid-cut is not limited to the hybrid model, though); again we use different strategies for different vertices. For low-degree vertices, we simply place all in-edges on the vertex's hash-based machine. It looks simple and naive, but it may be the best choice. We used tools provided by GraphLab* to construct a fully regular synthetic graph (48 machines, |V| = 10M, |E| = 93M) and tried different vertex-cuts on it. Surprisingly, low-cut is always the best in replication factor, ingress time, and runtime. To summarize the benefits of our simple low-cut: 1. it has a lower replication factor, since it does not create mirrors for one direction of edges; 2. it is efficient during ingress and at runtime, since it is a hash-based partition with a much lower replication factor; 3. it provides one-direction locality; 4. it gives natural edge balance, since for low-degree vertices, vertex balance is almost equal to edge balance; 5. it has fewer flying masters, since vertices are directly assigned to their hash-based machines.

| Partition | λ | Ingress | Runtime |
|---|---|---|---|
| Random | 11.7 | 35.5 | 14.71 |
| Coordinated | 4.8 | 32.0 | 6.85 |
| Oblivious | 8.3 | 36.4 | 11.70 |
| Grid | 8.4 | 21.4 | 7.30 |
| Low-cut | 3.9 | 15.0 | 2.26 |

*https://github.com/graphlab-code/graphlab/blob/master/src/graphlab/graph/distributed_graph.hpp

Graph Partitioning High-degree vertex
Distribute edges (e.g., in-edges) according to the other endpoint vertex (e.g., the source). The upper bound on the replicas introduced by placing all edges of a high-degree vertex is #machines. For a high-degree vertex, we distribute its edges according to the other endpoint. In contrast, existing vertex-cuts randomly or greedily assign edges, so they are likely to introduce mirrors for low-degree vertices.

Graph Partitioning High-degree vertex
Distribute edges (e.g., in-edges) according to the other endpoint vertex (e.g., the source). The upper bound on the replicas introduced by placing all edges of a high-degree vertex is #machines. Our high-cut only introduces mirrors for high-degree vertices.

Graph Partitioning: Hybrid vertex-cut
User-defined threshold (θ) and direction of locality. Group edges on the hash-based machine of the vertex; low-cut: done! High-cut: re-assignment. E.g., θ = 3, direction IN. Putting it together: first, the user defines a threshold to distinguish vertices and picks the locality direction. Then hybrid-cut loads edges from disk and groups them on the hash-based machines in parallel. Next, hybrid-cut counts vertex degrees and re-assigns the edges of high-degree vertices. For the sample graph, with the threshold set to 3 and the local direction set to IN, hybrid-cut first assigns edges by hashing the destination vertex. After counting, only vertex 1 is high-degree, so all its in-edges are re-assigned by hashing the source vertex. Finally, the local graphs are constructed.
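The steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not PowerLyra's implementation; the `(src, dst)` tuple representation, integer-modulo hashing, and the strict `>` threshold comparison are assumptions:

```python
def hybrid_cut(edges, num_machines, theta):
    """edges: list of (src, dst) pairs; locality direction is IN.
    Returns {machine: [(src, dst), ...]}."""
    # Pass 1 (grouping): count in-degrees while edges sit on hash(dst).
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1

    # Pass 2 (re-assignment): in-edges of high-degree vertices move to
    # hash(src); low-degree vertices keep all in-edges on hash(dst).
    assignment = {m: [] for m in range(num_machines)}
    for src, dst in edges:
        if in_degree[dst] > theta:      # high-cut: hash the source
            assignment[src % num_machines].append((src, dst))
        else:                           # low-cut: hash the destination
            assignment[dst % num_machines].append((src, dst))
    return assignment
```

With θ = 3 and a star of four in-edges around vertex 1, only vertex 1's edges are spread by source; every other edge stays on its destination's machine.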

Heuristic for Hybrid-cut
Inspired by the heuristic for edge-cut: choose the best master location for a vertex according to where its neighbors have already been located. Considering one direction of neighbors is enough, and the heuristic is applied only to low-degree vertices. Parallel ingress: periodically synchronize the private mapping table (global vertex-id → machine). To further reduce the replication factor, we design a greedy-like heuristic for hybrid-cut, inspired by the heuristic for edge-cut; these are the main differences.
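The placement rule might look roughly like this. An illustrative reconstruction, not PowerLyra's exact heuristic: for a low-degree vertex, prefer the machine already holding most of its in-neighbors, falling back to the hash-based machine when no neighbor has been placed yet:

```python
def best_master(v, in_nbrs_of_v, mapping, num_machines):
    """mapping: the (periodically synchronized) private table of
    global vertex-id -> machine for vertices placed so far."""
    counts = [0] * num_machines
    for n in in_nbrs_of_v:
        if n in mapping:              # only already-placed neighbors count
            counts[mapping[n]] += 1
    best = max(range(num_machines), key=counts.__getitem__)
    # fall back to the hash-based machine if no neighbor is placed yet
    return best if counts[best] > 0 else v % num_machines

mapping = {10: 2, 11: 2, 12: 0}
machine = best_master(7, [10, 11, 12], mapping, 3)  # two neighbors on machine 2
```

Considering only one direction of neighbors keeps the table small, and restricting the heuristic to low-degree vertices avoids the unfairness that plagues greedy vertex-cut.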

Optimization How about (cache) locality in communication?
Challenge: graph computation usually exhibits poor data-access (cache) locality* due to irregular traversal of neighboring vertices along edges. But how about (cache) locality in communication? Problem: a mismatch of orders between sender and receiver. In the current implementation, the data and metadata of both masters and mirrors are stored in multiple separate arrays, and a unified local ID is sequentially assigned for indexing. In each phase, a worker thread sequentially traverses vertices and executes user-defined functions. Messages across machines are batched and sent periodically. After a barrier, all messages received from different machines are applied to vertices in parallel, and the order of accessing vertices is determined only by the order at the sender. This has poor locality and causes interference between message threads. *Lumsdaine et al. Challenges in parallel graph processing. 2007.

Locality-conscious Layout
General idea: match orders by hybrid vertex-cut. Tradeoff: ingress time vs. runtime. Decentralized matching via global vertex-ids. Step 1: Zoning. The general idea of this optimization is to match sender and receiver orders during ingress; the global vertex-id provides the opportunity to match orders without communication. In the figures, a square denotes a high-degree vertex, a circle a low-degree vertex, white a master, and black a mirror. Before reordering, the masters and mirrors of high- and low-degree vertices are mixed and stored in arbitrary order; when messages are sent from a master on machine M1 to mirrors on M2, the accesses have no locality. First, we relocate masters and mirrors into four zones (high-masters, low-masters, high-mirrors, low-mirrors), which limits the accesses for a message to within one zone.

Locality-conscious Layout
General idea: match orders by hybrid vertex-cut. Step 2: Grouping. Next, we group the mirrors within a zone according to the location of their masters. This avoids interference between two message threads.

Locality-conscious Layout
General idea: match orders by hybrid vertex-cut. Step 3: Sorting. Third, we sort masters and mirrors within each group according to the global vertex-id. Now the orders of sender and receiver match.

Locality-conscious Layout
General idea: match orders by hybrid vertex-cut. Step 4: Rolling. Now the locality issue for messages sent from masters to mirrors is solved. But when sending messages from mirrors to a master, since message sending is batched and there is a barrier in each phase, the send times are close together, and the messages from mirrors on M2 and M3 to the master on M1 may contend on the same vertex. So finally we add a rolling step to avoid this problem. Though we describe the four steps separately, they are actually implemented in one step inside hybrid-cut, after the re-assignment of high-degree vertices.
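The four steps might be combined into a single sort key, roughly as follows. An illustrative sketch only; the vertex-record tuple and the exact rolling rule are assumptions:

```python
def locality_layout(vertices, my_machine, num_machines):
    """vertices: list of (global_id, is_high, is_master, master_machine)."""
    def key(v):
        gid, is_high, is_master, owner = v
        # Zoning: high-masters, low-masters, high-mirrors, low-mirrors.
        zone = (0 if is_high else 1) if is_master else (2 if is_high else 3)
        # Grouping + rolling: mirrors are grouped by their master's machine,
        # with the starting group rotated per machine to avoid contention.
        group = (owner - my_machine) % num_machines if not is_master else 0
        # Sorting: by global vertex-id within each group.
        return (zone, group, gid)
    return sorted(vertices, key=key)
```

Because the key depends only on per-vertex facts and the machine's own index, each machine can compute the layout independently, which is what "decentralized matching via global vertex-id" refers to.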

Evaluation Experiment Setup
48-node EC2-like cluster (each node: 4 cores, 12 GB RAM, 1 GigE NIC). Graph algorithms: PageRank, Approximate Diameter, Connected Components. Data set: 5 real-world graphs and 5 synthetic power-law graphs*. *Varying α with 10 million vertices fixed (smaller α produces denser graphs).

PageRank Gather: IN / Scatter: OUT
Runtime speedup (higher is better). Power-law graphs: Hybrid 2.02X ~ 2.96X, Ginger 2.17X ~ 3.26X. Real-world graphs: Hybrid 1.40X ~ 2.05X, Ginger 1.97X ~ 5.53X. Five systems are compared: the first three are PowerGraph configurations and the next two are PowerLyra (Hybrid and Ginger). 48 machines; baseline: PowerGraph + Grid (the default setting).

Runtime speedup (higher is better). Approximate Diameter (Gather: OUT / Scatter: NONE): Hybrid 1.93X ~ 2.48X, Ginger 1.97X ~ 3.15X. Connected Components (Gather: NONE / Scatter: ALL): Hybrid 1.44X ~ 1.88X, Ginger 1.50X ~ 2.07X. 48 machines; baseline: PowerGraph + Grid (default).

Communication Cost
(Figure: communication cost on power-law graphs and real-world graphs; lower is better. Values shown include 188 MB, 394 MB, 170 MB, and a 79.4% reduction.)

Effectiveness of Hybrid
Hybrid graph partitioning: ingress time on power-law graphs (48 machines), on real-world graphs (48 machines), and scalability on Twitter (lower is better). There are two causes of the reduced communication cost; the first is good partitioning, which reduces the replication factor.

Effectiveness of Hybrid
Hybrid graph computation (lower is better). The other cause comes from the graph computation engine.

Scalability: increasing the number of machines and increasing the data size (lower is better).

Conclusion. PowerLyra: a new hybrid graph analytics engine that embraces the best of both worlds of existing frameworks, and an efficient hybrid graph partitioning algorithm that adopts different heuristics for different vertices. It outperforms PowerGraph with the default partition by up to 5.53X and 3.26X for real-world and synthetic graphs, respectively.

Questions PowerLyra Thanks
Institute of Parallel And Distributed Systems

Example Algorithms
Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization. Graph Analytics: PageRank, SSSP, Triangle-Counting, Graph Coloring, K-core Decomposition. Classification: Neural Networks, Lasso. Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling. Semi-supervised ML: Graph SSL, CoEM.

From GraphLab users group