Slide 1: Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics
Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Vishwesh Jatala, V. Krishna Nandivada, Marc Snir, Keshav Pingali
Slide 2: Graph Analytics
Applications: search engines, machine learning, social network analysis
- How to rank webpages? (PageRank)
- How to recommend movies to users? (Stochastic Gradient Descent)
- Who are the most influential persons in a social network? (Betweenness Centrality)
Datasets: unstructured graphs
- Billions of nodes and trillions of edges
- Need TBs of memory; do not fit in the memory of a single machine
Credits: Wikipedia, SFL Scientific, MakeUseOf, Sentinel Visualizer
Slide 3: Motivation
Most distributed graph analytics systems are bulk-synchronous parallel (BSP): Gluon [PLDI'18], Lux [VLDB'18], Gemini [OSDI'16], PowerGraph [OSDI'12], ...
- Dynamic workloads => stragglers limit scaling
Asynchronous distributed graph analytics systems are restricted and not competitive: GRAPE+ [SIGMOD'18], PowerSwitch [PPoPP'15], ASPIRE [OOPSLA'14], GraphLab [VLDB'12], ...
- Small messages => high synchronization overheads => slower than BSP systems
- No way to reuse infrastructure to leverage GPUs
- Static load balancing is difficult
- Hardware is not suited to small messages
Slide 4: Bulk-Asynchronous Parallel (BASP)
- Novel asynchronous programming and execution model that exploits the resilience of graph applications to stale reads
- Retains the advantages of bulk computation and bulk communication
- Allows progress by eliding blocking/waiting during communication
- Novel non-blocking reconciliation of updates to the graph
- Novel non-blocking termination detection algorithm
- Easy to adapt BSP programs and systems to BASP
Slide 5: Gluon-Async: A BASP System
- Adapted Gluon [PLDI'18], the state-of-the-art distributed graph analytics system, to build Gluon-Async
- First asynchronous distributed GPU graph analytics system
- Supports arbitrary partitioning policies in CuSP [IPDPS'19] with Gluon's communication optimizations
[Architecture diagram: the CuSP partitioner feeds Galois [SOSP'13] on CPUs and IrGL [OOPSLA'16]/CUDA on GPUs; each compute engine attaches through a Gluon plugin to the Gluon-Async communication runtime, which runs over the network via LCI [IPDPS'18] or MPI]
- Gluon-Async is ~1.5x faster than Gluon(-Sync) at scale
Slide 6: Outline
- Gluon Synchronization Approach
- Bulk-Asynchronous Parallel (BASP) Execution Model
- Adapting BSP Systems to BASP
  - Bulk-Asynchronous Communication
  - Non-Blocking Termination Detection
- Adapting BSP Programs to BASP
- Experimental Results
Slide 7: Gluon Synchronization Approach
Slide 8: Vertex Programming Model
- Every node has a label, e.g., distance in single-source shortest path (SSSP)
- Apply an operator on an active node in the graph, e.g., the relaxation operator in SSSP
  - Push-style: reads its own label and writes to its neighbors' labels
  - Pull-style: reads its neighbors' labels and writes to its own label
- Termination: no more active nodes (or work)
- Applications: breadth-first search, connected components, k-core, PageRank, single-source shortest path, etc.
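To make the push-style operator concrete, here is a minimal shared-memory sketch of an SSSP relaxation operator in C++. The CSR layout and function names are illustrative assumptions for this transcript, not Galois' actual API.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Illustrative CSR graph (assumed types): dist[v] is the label of vertex v.
struct Graph {
  std::vector<uint32_t> row_start;          // CSR row offsets (size |V|+1)
  std::vector<uint32_t> edge_dst;           // edge destinations
  std::vector<uint32_t> edge_weight;        // edge weights
  std::vector<std::atomic<uint32_t>> dist;  // node labels (SSSP distances)
};

// Push-style relaxation: read the active node's label, write neighbors'.
// Returns true if any neighbor improved, i.e., new work was created.
bool relax(Graph& g, uint32_t src) {
  bool pushed = false;
  const uint32_t d = g.dist[src].load(std::memory_order_relaxed);
  for (uint32_t e = g.row_start[src]; e < g.row_start[src + 1]; ++e) {
    const uint32_t dst = g.edge_dst[e];
    const uint32_t cand = d + g.edge_weight[e];
    uint32_t old = g.dist[dst].load(std::memory_order_relaxed);
    // Atomic min: retry while our candidate still improves dst's label.
    while (cand < old && !g.dist[dst].compare_exchange_weak(old, cand)) {
    }
    pushed |= (cand < old);
  }
  return pushed;
}
```

A pull-style operator inverts the reads and writes: it min-reduces over its incoming neighbors' labels into its own, which needs no atomics on the write side.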
Slides 9-12: Partitioning
- Each edge is assigned to a unique host
- All edges connect proxy nodes on the same host
- A node can have multiple proxies: one is the master proxy; the rest are mirror proxies
[Figure, built up over four slides: an original graph with nodes A-J is partitioned across hosts h1 and h2; boundary nodes get a master proxy on one host and mirror proxies on the others]
Slides 13-14: How does Gluon synchronize the proxies?
- Exploit domain knowledge: cached copies can be stale as long as they are eventually synchronized
- Use all-reduce:
  - Reduce from the mirror proxies to the master proxy
  - Broadcast from the master proxy to the mirror proxies
[Figure: distances (labels) from source A on the partitioned graph, before and after the master and mirror proxies are synchronized]
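The two phases can be sketched as follows. This is a minimal illustration assuming simple (vertex, value) update messages and per-host label maps, not Gluon's actual wire format, with min as the reduction operator as in SSSP.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Assumed message format: one (global vertex id, label value) pair.
struct Update {
  uint64_t vertex;
  uint32_t value;
};

// Reduce phase: the master folds incoming mirror values into its own copy
// using the application's reduction operator (min for SSSP distances).
void reduce_on_master(std::unordered_map<uint64_t, uint32_t>& master_label,
                      const std::vector<Update>& from_mirrors) {
  for (const Update& u : from_mirrors) {
    auto it = master_label.find(u.vertex);
    if (it != master_label.end())
      it->second = std::min(it->second, u.value);
  }
}

// Broadcast phase: the master's (possibly improved) value overwrites each
// mirror, so all proxies agree before the next round begins.
void broadcast_to_mirrors(std::unordered_map<uint64_t, uint32_t>& mirror_label,
                          const std::vector<Update>& from_master) {
  for (const Update& u : from_master)
    mirror_label[u.vertex] = u.value;
}
```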
Slide 15: Bulk-Asynchronous Parallel (BASP) Execution Model
Slide 16: Bulk-Synchronous Parallel (BSP)
- Execution occurs in rounds
- In each round:
  - Each host computes independently
  - Each host sends a message to every other host
  - Each host ingests a message from every other host
- Virtual barrier at the end of each round
[Figure: BSP timeline on hosts h1 and h2, alternating compute, idle, and communicate phases]
Slide 17: Bulk-Asynchronous Parallel (BASP)
- Execution occurs in rounds
- In each local round:
  - Each host computes independently
  - Each host can send messages to other hosts
  - Each host can ingest messages from other hosts
- No waiting or blocking
[Figure: BSP vs. BASP timelines on hosts h1 and h2; BASP eliminates the idle phase]
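The difference between the two models comes down to the shape of the per-host main loop. Below is a compilable skeleton; all of the phase functions are stand-in stubs (assumptions, not Gluon's real entry points), so only the control flow matters.

```cpp
#include <atomic>

// Stand-in stubs so the skeleton compiles and terminates after a few
// simulated rounds; in a real system these are the actual compute and
// communication phases and the actual termination checks.
std::atomic<int> simulated_rounds{3};
void compute_round() {}            // local bulk computation
void sync_proxies_blocking() {}    // bulk all-reduce; implicit barrier
void send_updated_proxies() {}     // non-blocking bulk sends
void ingest_arrived_messages() {}  // drain whatever has arrived; no wait
bool work_left_globally() { return simulated_rounds-- > 0; }
bool terminated() { return simulated_rounds-- <= 0; }

// BSP: each global round ends in blocking communication, so faster hosts
// sit idle until the slowest host (the straggler) arrives.
void run_bsp() {
  while (work_left_globally()) {
    compute_round();
    sync_proxies_blocking();
  }
}

// BASP: rounds are purely local. A host sends whatever it updated,
// ingests whatever has already arrived, and starts its next round
// immediately; it never waits for other hosts.
void run_basp() {
  while (!terminated()) {
    compute_round();
    send_updated_proxies();
    ingest_arrived_messages();
  }
}
```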
Slide 18: Discussion: BSP vs. BASP
- BASP exploits domain knowledge: cached copies can be stale as long as they are eventually synchronized
- Advantages of BASP:
  - Faster hosts may progress and send updated values
  - Straggler hosts may receive updated values => less computation
  - Communication may overlap with computation
- Disadvantages of BASP:
  - Faster hosts may use stale values => redundant computation
[Figure: SSSP as chaotic relaxation on hosts h1 and h2, where the total amount of work varies]
Slide 19: Challenges in Realizing BASP Systems
Removing the barrier changes the execution semantics:
- How to synchronize proxies asynchronously?
- How to detect termination without blocking?
Slide 20: Bulk-Asynchronous Communication
Slide 21: Communication: BSP vs. BASP
Problem: synchronizing the different proxies of the same vertex
- Straightforward in BSP:
  - Requirement: same value as a sequential execution at the end of each round
  - Achieved by an all-reduce at the end of the round
- More complicated in BASP:
  - Requirement: same value as a sequential execution, eventually
  - Must be achieved without blocking
  - Values sent in one round can be received in another
Slide 22: Non-Blocking Synchronization of Proxies
Consider the master proxy of vertex v on h1 and its mirror proxy on h2:
- Reduction:
  - When the mirror is updated, its value is sent to the master and then reset
  - When the master receives messages, it reduces them into its own value
- Broadcast:
  - When the master is updated, its value is sent to the mirror
  - When the mirror receives messages, it reduces them into its own value
[Figure: timeline of min-reductions between the master on h1 and the mirror on h2, shown for one vertex]
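A minimal sketch of this reconciliation for one vertex label, assuming a hypothetical non-blocking send_to_master helper. The key ideas are the atomic read-and-reset on the mirror and the fact that a min-reduction is safe to apply in any order.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

// Stub for a hypothetical non-blocking send to the host owning the master.
void send_to_master(uint64_t /*gid*/, uint32_t /*value*/) {}

// Identity of the min reduction: a mirror holding this value has no update.
constexpr uint32_t kIdentity = std::numeric_limits<uint32_t>::max();

// Mirror side of the reduction: atomically read-and-reset the mirror, so
// the same update is never shipped twice, then send without blocking.
void flush_mirror(std::atomic<uint32_t>& mirror, uint64_t gid) {
  const uint32_t v = mirror.exchange(kIdentity);
  if (v != kIdentity) send_to_master(gid, v);
}

// Receiving side (master for reduction, mirror for broadcast): messages
// may arrive from any local round. Min is idempotent, commutative, and
// associative, so applying updates in any order still converges.
void receive_update(std::atomic<uint32_t>& proxy, uint32_t incoming) {
  uint32_t old = proxy.load(std::memory_order_relaxed);
  while (incoming < old && !proxy.compare_exchange_weak(old, incoming)) {
  }
}
```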
Slide 23: Bulk-Asynchronous Communication
Consider two hosts, where h2 has mirror proxies for master proxies on h1:
- Reduction on h2: aggregates all updated mirrors into one message; sends the message only if it is non-empty
- Broadcast on h1: aggregates all updated masters into one message; sends the message only if it is non-empty
- Reduction on h1: may receive zero or more messages
- Broadcast on h2: may receive zero or more messages
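A sketch of the aggregation step, with assumed Update and buffer types: all dirty mirrors bound for one owner are packed into a single message, and nothing at all is sent when there are no updates, which is what keeps BASP's messages bulky rather than small.

```cpp
#include <cstdint>
#include <vector>

// Assumed message entry: one (global vertex id, label value) pair.
struct Update {
  uint64_t vertex;
  uint32_t value;
};

// Pack every dirty mirror owned by the same remote host into one buffer.
// The caller sends the buffer only if it is non-empty, so quiescent hosts
// generate no traffic.
std::vector<Update> build_reduction_message(
    const std::vector<uint64_t>& dirty_mirrors,
    const std::vector<uint32_t>& label) {
  std::vector<Update> msg;
  msg.reserve(dirty_mirrors.size());
  for (uint64_t v : dirty_mirrors)
    msg.push_back({v, label[static_cast<size_t>(v)]});
  return msg;
}
```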
Slide 24: Non-Blocking Termination Detection
Slide 25: Termination Detection: BSP vs. BASP
Semantics: no host should terminate if there is work left
- Trivial in BSP:
  - Condition: all hosts are inactive in a round
  - Implementation: a distributed accumulator (a blocking collective)
- More complicated in BASP:
  - Cannot use blocking collectives
  - The termination conditions are not clear-cut
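For contrast, the BSP-side check can be sketched with a real blocking MPI collective; the surrounding round structure is assumed. This is exactly the kind of call BASP must avoid, since every host blocks in it until all hosts arrive.

```cpp
#include <mpi.h>

// BSP-style termination check: a blocking distributed accumulator. Every
// host contributes its local work count; all hosts block here until the
// global sum is known, then terminate together when it reaches zero.
bool any_work_left(int local_work_items) {
  int global_work_items = 0;
  MPI_Allreduce(&local_work_items, &global_work_items, 1, MPI_INT,
                MPI_SUM, MPI_COMM_WORLD);
  return global_work_items > 0;
}
```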
Slide 26: Termination Detection Algorithm
- Invoked at the end of each local round on each host
- Implements a distributed consensus protocol:
  - Does not rely on message delivery order
  - Hosts can send/receive messages directly to/from each other (clique network)
- Uses a state machine on each host
- Uses non-blocking collectives, called snapshots, to coordinate among hosts:
  - Snapshots are numbered
  - A snapshot broadcasts the host's current state
- Why our own algorithm? It plays well with MPI
Slide 27: States and Goal
States:
- A: Active
- I: Idle
- RT1: Ready-to-Terminate1
- RT2: Ready-to-Terminate2
- T: Terminate
Goal: a host must move to T if and only if every other host will move to T
Intuition: hosts move to T only if every host knows that "every host knows that every host wants to move to T"; this requires the two RT states
Slide 28: State Transitions
Conditions (and actions) for transitions:
- A -> I: the host is inactive
- I -> A: the host becomes active again
- I -> RT1: the prior snapshot ended (action: take a snapshot)
- RT1 -> RT2: the prior snapshot, taken from RT1, ended (action: take a snapshot)
- RT2 -> T: the prior snapshot, taken from RT2, ended (action: terminate)
If any host is still in RT1, it is unsafe to terminate
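A simplified sketch of the per-host state machine described above. The numbered-snapshot mechanics are abstracted behind a stub helper and boolean flags (assumptions for this sketch), and details the paper handles, such as message reordering across snapshots, are elided.

```cpp
// States of the per-host termination detector.
enum class State { Active, Idle, RT1, RT2, Terminate };

struct TerminationDetector {
  State state = State::Active;

  void take_snapshot() {}  // stub: non-blocking broadcast of current state

  // Invoked at the end of each local round. The snapshot flags summarize
  // what the most recently completed (numbered) snapshot observed.
  void step(bool has_local_work, bool prior_snapshot_ended,
            bool all_hosts_were_rt1, bool all_hosts_were_rt2) {
    if (has_local_work) {  // any new work (local or received) reactivates
      state = State::Active;
      return;
    }
    switch (state) {
      case State::Active:  // no work left: become idle
        state = State::Idle;
        break;
      case State::Idle:    // announce readiness once the prior snapshot ends
        if (prior_snapshot_ended) { state = State::RT1; take_snapshot(); }
        break;
      case State::RT1:     // every host was ready: start the second round
        if (prior_snapshot_ended && all_hosts_were_rt1) {
          state = State::RT2;
          take_snapshot();
        }
        break;
      case State::RT2:     // every host confirmed twice: safe to terminate
        if (prior_snapshot_ended && all_hosts_were_rt2)
          state = State::Terminate;
        break;
      case State::Terminate:
        break;
    }
  }
};
```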
Slide 29: Adapting BSP Programs to BASP
Slide 30: Programs: BSP (Gluon-Sync) vs. BASP (Gluon-Async)
Asynchronous shared-memory programs can run in BASP:
- Resilient to stale reads
- Agnostic of the BSP round number
[Figure: side-by-side program stacks. Both programs use the CuSP [IPDPS'19] partitioner and compute with Galois [SOSP'13] on multicore CPUs or IrGL [OOPSLA'16] on GPUs; the Gluon-Sync program uses the Gluon-Sync communication runtime and termination detection, while the Gluon-Async program uses the Gluon-Async communication runtime and termination detection, breaking out of its loop when termination is detected]
Slide 31: Experimental Results
Slide 32: Experimental Setup
Benchmarks: breadth-first search (bfs), connected components (cc), k-core (kcore), PageRank (pr), single-source shortest path (sssp)

Clusters     | Stampede (CPU)       | Bridges (GPU)
Max. hosts   | 128                  | 64
Machine      | Intel Xeon Skylake   | 2 NVIDIA Tesla P100s
Each host    | 48 cores of Skylake  | 1 Tesla P100
Memory       | 192GB DDR4           | 16GB CoWoS HBM2
Slide 33: Evaluated Systems

System                                      | Distributed CPUs | Distributed GPUs | Asynchronous
Gluon-Sync (Gluon + Galois/IrGL) [PLDI'18]  | yes              | yes              | no
Lux [VLDB'18]                               | yes              | yes              | no
GRAPE+ [SIGMOD'18]                          | yes              | no               | yes
PowerSwitch [PPoPP'15]                      | yes              | no               | yes
Gluon-Async (Gluon-Async + Galois/IrGL)     | yes              | yes              | yes
Slide 34: Small Input Graphs

Inputs      | twitter50 | rmat27 | friendster | uk07
|V|         | 51M       | 134M   | 0.07B      | 0.1B
|E|         | 2B        | 2B     | 2B         | 4B
|E|/|V|     | 38        | 16     | 28         | 35
Diameter    | 12        | 3      | 21         | 115
Size (CSR)  | 16GB      | 18GB   | 28GB       | 29GB

- Only used for comparison with Lux, GRAPE+, and PowerSwitch
- Execute <100 BSP rounds in Gluon-Sync for all benchmarks
- Not expected to gain much from asynchronous execution (even in shared memory)
Slide 35: Gluon-Async is ~12x faster than Lux
Strong scaling on Bridges for the small graphs (2 GPUs share a physical node)
Slide 36: Friendster on 12 CPUs, each with 16 cores
Gluon-Async is ~2.5x faster than GRAPE+ and ~9.3x faster than PowerSwitch
Slide 37: Large Input Graphs

Inputs      | clueweb12 | uk14  | wdc12
|V|         | 1B        | 0.8B  | 3.6B
|E|         | 43B       | 48B   | 129B
|E|/|V|     | 44        | 60    | 36
Diameter    | 498       | 2498  | 5274
Size (CSR)  | 325GB     | 361GB | 986GB

- Lux, GRAPE+, and PowerSwitch could not run these graphs
- Execute >100 BSP rounds in Gluon-Sync for almost all benchmarks
- Potential for asynchronous execution to perform better
Slide 38: Speedup of Gluon-Async over Gluon-Sync on 64 GPUs of Bridges
Across benchmarks, Gluon-Async is ~1.4x faster than Gluon-Sync
Slide 39: Speedup of Gluon-Async over Gluon-Sync on 128 hosts of Stampede
Across benchmarks, Gluon-Async is ~1.6x faster than Gluon-Sync
Slide 40: Breakdown of Execution Time: wdc12 on 128 hosts of Stampede

Benchmark | Gluon-Async (min. rounds) | Gluon-Sync (rounds)
bfs       | 1222                      | 2759
cc        | 206                       | 401
kcore     | 250                       | 277
pr        | 156                       | 183
sssp      | 1862                      | 4040

(wdc12 diameter = 5274)
Gluon-Async reduces idle time compared to Gluon-Sync: stragglers execute fewer rounds
Slide 41: Conclusions
- Designed a bulk-asynchronous model for distributed and heterogeneous graph analytics
- Gluon-Async is ~1.5x faster than Gluon-Sync at scale
- Use Gluon-Async to scale out your shared-memory graph system
- Gluon-Async is publicly available in Galois v5.0
[Architecture diagram: the CuSP partitioner feeds Galois on CPUs and IrGL/CUDA on GPUs; each attaches through a Gluon plugin to the Gluon-Async communication runtime over the network (LCI/MPI)]
Slide 42: Backup Slides
Questions:
- Is the termination detection algorithm novel? Why not use existing ones?
Slide 43: Best Execution Times: Gluon-Async and Gluon-Sync
Slide 44: Partitioning Time for CuSP Policies [CuSP IPDPS'19]
Additional CuSP policies can be implemented in a few lines of code
Slide 45: Partitioning Quality at 128 Hosts [CuSP IPDPS'19]
No single policy is fastest: the best choice depends on the input and the benchmark
Slide 46: Best Partitioning Policy (clueweb12) [VLDB'18]
Execution time (sec) for policies XEC, EC, HVC, and CVC at 32 and 256 hosts:

      | 32 hosts                   | 256 hosts
      | XEC    EC     HVC    CVC   | XEC    EC     HVC     CVC
bc    | 136.2  439.1  627.9  539.1 | 413.7  420    1012.1  266.9
bfs   | 10.4   8.9    17.4   19.2  | 36.4   27.1   46.5    14.6
cc    | OOM    16.9   7.5    19.6  | 43.0   84.7   8.4     7.3
pr    | 272.6  219.6  193.5  217.9 | 286.7  82.3   97.5    60.8
sssp  | 16.5   13.1   26.6   31.7  | 32.5   54.7   21.8    —
Slide 47: Decision Tree [VLDB'18]
Based on our study, we present a decision tree for selecting the best partitioning policy
[Chart: % difference in execution time between the policy chosen by the decision tree and the optimal policy]
Slide 48: State Transitions: An Example with 2 Hosts
[Figure: numbered snapshots on hosts H1 and H2 as each host moves through I, RT1, RT2, and T; the example illustrates that it is incorrect to terminate while any host is still in RT1]