Parallel Graph Algorithms Kamesh Madduri


1. Parallel Graph Algorithms
Kamesh Madduri, KMadduri@lbl.gov

2. Talk Outline
Applications
Parallel algorithm building blocks
– Kernels
– Data structures
Parallel algorithm case studies
– Connected components
– BFS/Shortest paths
– Betweenness centrality
Performance on current systems
– Software
– Architectures
– Performance trends

3. Routing in transportation networks
Road networks, point-to-point shortest paths: 15 seconds (naïve) → 10 microseconds
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes", Science, 27 April 2007.

4. Internet and the WWW
The world-wide web can be represented as a directed graph
– Web search and crawl: traversal
– Link analysis, ranking: PageRank and HITS
– Document classification and clustering
Internet topologies (router networks) are naturally modeled as graphs

5. Scientific Computing
Reorderings for sparse solvers
– Fill-reducing orderings: partitioning, traversals, eigenvectors
– Heavy diagonal to reduce pivoting (matching)
Data structures for efficient exploitation of sparsity
Derivative computations for optimization
– Matroids, graph colorings, spanning trees
Preconditioning
– Incomplete factorizations
– Partitioning for domain decomposition
– Graph techniques in algebraic multigrid: independent sets, matchings, etc.
– Support theory: spanning trees & graph embedding techniques
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures", http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

6. Large-scale data analysis
Graph abstractions are very useful for analyzing complex data sets.
Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
Challenges: data size, heterogeneity, uncertainty, data quality
– Astrophysics: massive datasets, temporal variations
– Bioinformatics: data quality, heterogeneity
– Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2, 3) www.visualComplexity.com

7. Data Analysis and Graph Algorithms in Systems Biology
Study of the interactions between various components in a biological system
Graph-theoretic formulations are pervasive:
– Predicting new interactions: modeling
– Functional annotation of novel proteins: matching, clustering
– Identifying metabolic pathways: paths, clustering
– Identifying new protein complexes: clustering, centrality
Image source: Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.

8. Graph-theoretic problems in social networks
– Community identification: clustering
– Targeted advertising: centrality
– Information spreading: modeling
Image source: Nexus (Facebook application)

9. Network Analysis for Intelligence and Surveillance
[Krebs '04] Post-9/11 terrorist network analysis from public-domain information
– Plot masterminds correctly identified from interaction patterns: centrality
– A global view of entities is often more insightful
– Detect anomalous activities by exact/approximate graph matching
Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis", CACM 47(3), March 2004, pp. 45-47.

10. Characterizing graph-theoretic computations
Input: graph abstraction. Problem: find paths, clusters, partitions, matchings, patterns, orderings, ...
Factors that influence the choice of algorithm:
– graph sparsity (m/n ratio)
– static/dynamic nature
– weighted/unweighted, weight distribution
– vertex degree distribution
– directed/undirected
– simple/multi/hyper graph
– problem size
– granularity of computation at nodes/edges
– domain-specific characteristics
Graph kernels: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, ...
Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.

11. Talk Outline (outline repeated; see slide 2)

12. Parallel Computing Models: A Quick PRAM Review
Objectives
– Bridge between software and hardware: general-purpose, scalable HW; transportable SW
– Abstract architecture for algorithm development
Why is it so important?
– Uniprocessor: von Neumann model of computation
– Parallel processors → multicore
Requirements (inherent tension):
– Simple enough to make analysis of interesting problems tractable
– Detailed enough to reveal the important bottlenecks
Models, e.g.:
– PRAM: rich collection of parallel graph algorithms
– BSP: some CGM algorithms (cgmGraph)
– LogP

13. PRAM
An idealized model of a parallel computer for analyzing the efficiency of parallel algorithms.
A PRAM is composed of
– P unmodifiable programs, each composed of optionally labeled instructions
– a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
– P accumulators, one associated with each program
– a read-only input tape
– a write-only output tape
No local memory in each RAM.
Synchronization, communication, and parallel overhead are zero.

14. PRAM Data Access Forms
EREW (Exclusive Read, Exclusive Write)
– A memory cell can be read or written by at most one processor per cycle.
– Ensures no read or write conflicts.
CREW (Concurrent Read, Exclusive Write)
– Ensures there are no write conflicts.
CRCW (Concurrent Read, Concurrent Write)
– Requires use of some conflict-resolution scheme for concurrent writes.

15. PRAM Pros and Cons
Pros
– Simple and clean semantics.
– The majority of theoretical parallel algorithms are specified in the PRAM model.
– Independent of the communication network topology.
Cons
– Not realistic: the communication model is too powerful.
– The algorithm designer is misled into using inter-processor communication without hesitation.
– Fully synchronized processors.
– No local memory.
– Big-O notation is often misleading.

16. Analyzing Parallel Graph Algorithms
Problem parameters: n (vertices), m (edges), D (graph diameter)
Worst-case running time: T
Total number of operations (work): W
Nick's Class (NC): complexity class of problems that can be solved in polylogarithmic time using a polynomial number of processors
P-complete: inherently sequential

17. The Helman-JaJa model
An extension of the PRAM model for shared-memory algorithm design and analysis.
T(n, p) is measured by the triplet T_M(n, p), T_C(n, p), B(n, p):
– T_M(n, p): maximum number of non-contiguous main memory accesses required by any processor
– T_C(n, p): upper bound on the maximum local computational complexity of any of the processors
– B(n, p): number of barrier synchronizations

18. Building blocks of classical PRAM graph algorithms
– Prefix sums
– List ranking (Euler tours, pointer jumping, symmetry breaking)
– Sorting
– Tree contraction

19. Prefix Sums
Input: A, an array of n elements; an associative binary operation
Output: the prefix sums of A
[Figure: balanced binary tree of partial sums B(i, j) and prefix values C(i, j)]
O(n) work, O(log n) time, n processors
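The balanced-tree construction sketched above gives the stated O(n)-work, O(log n)-time bounds. On shared-memory machines a simpler block-based scheme with the same O(n) work is common; the following is a minimal C/OpenMP sketch of that scheme (an assumption for illustration, not the PRAM construction from the slide): each thread scans a contiguous block, the per-block totals are scanned serially, and each block is then offset.

#include <stdio.h>
#include <omp.h>

/* Block-based inclusive prefix sum; supports up to 256 threads in this sketch. */
void prefix_sum(const int *A, long *B, int n) {
    long block_sum[257] = {0};
    #pragma omp parallel
    {
        int p = omp_get_num_threads(), t = omp_get_thread_num();
        int lo = (int)((long)n * t / p), hi = (int)((long)n * (t + 1) / p);
        long s = 0;
        for (int i = lo; i < hi; i++) { s += A[i]; B[i] = s; }   /* local scan */
        block_sum[t + 1] = s;
        #pragma omp barrier
        #pragma omp single               /* serial scan over the p block totals */
        for (int i = 1; i <= p; i++) block_sum[i] += block_sum[i - 1];
        for (int i = lo; i < hi; i++) B[i] += block_sum[t];      /* add block offset */
    }
}

int main(void) {
    int A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    long B[8];
    prefix_sum(A, B, 8);
    for (int i = 0; i < 8; i++) printf("%ld ", B[i]);            /* 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}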

20. Parallel Prefix
X: array of n elements stored in arbitrary order. For each element i, let X(i).value be its value and X(i).next be the index of its successor.
For a binary associative operator Θ, compute X(i).prefix such that
– X(head).prefix = X(head).value, and
– X(i).prefix = X(i).value Θ X(predecessor).prefix, where head is the first element, i is not equal to head, and predecessor is the node preceding i.
List ranking: the special case of parallel prefix where the values are initially set to 1 and addition is the associative operator.

21. List ranking illustration
[Figure: an ordered list and a random list, shown via their X.next values]

22. List ranking: key idea
1. Chop X randomly into s pieces.
2. Traverse each piece using a serial algorithm.
3. Compute the global rank of each element using the results from the second step.
Locality (list ordering) determines performance.
In the Helman-JaJa model, T_M(n, p) = O(n/p).
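A minimal serial C sketch of this three-step idea, with every element's value set to 1 and + as the operator, so the result is each element's rank from the head. The array names, the splitter choice, and the example list are assumptions; in the parallel (Helman-JaJa style) algorithm, steps 2 and 4 run concurrently, one sublist (respectively one block of elements) per processor.

#include <stdio.h>
#include <stdlib.h>

/* next[] encodes the list (tail has next = -1); rank[i] receives i's distance
 * from the head (the head gets rank 1). s = number of pieces. */
void list_rank(int n, int head, const int *next, int *rank, int s) {
    int *sub = malloc(n * sizeof(int));       /* piece id of each element      */
    int *start = malloc(s * sizeof(int));     /* first element of each piece   */
    int *sub_next = malloc(s * sizeof(int));  /* piece that follows, or -1     */
    int *sub_size = malloc(s * sizeof(int));
    int *offset = malloc(s * sizeof(int));
    for (int i = 0; i < n; i++) sub[i] = -1;

    /* Step 1: choose up to s splitters; the head must be one of them. */
    int ns = 0;
    start[ns] = head; sub[head] = ns; ns++;
    for (int k = 1; k < s; k++) {
        int v = (int)((long)k * n / s);
        if (sub[v] == -1) { start[ns] = v; sub[v] = ns; ns++; }
    }

    /* Step 2: traverse each piece serially (independent -> parallelizable). */
    for (int j = 0; j < ns; j++) {
        int count = 1, v = start[j];
        while (next[v] != -1 && sub[next[v]] == -1) {
            v = next[v]; sub[v] = j; rank[v] = ++count;   /* local rank in piece j */
        }
        rank[start[j]] = 1;
        sub_size[j] = count;
        sub_next[j] = (next[v] == -1) ? -1 : sub[next[v]];
    }

    /* Step 3: serial prefix sums over the few pieces, in list order. */
    int j = sub[head], acc = 0;
    while (j != -1) { offset[j] = acc; acc += sub_size[j]; j = sub_next[j]; }

    /* Step 4: global rank = piece offset + local rank (parallelizable over i). */
    for (int i = 0; i < n; i++) rank[i] = offset[sub[i]] + rank[i];

    free(sub); free(start); free(sub_next); free(sub_size); free(offset);
}

int main(void) {
    int next[5] = {4, 2, -1, 0, 1}, rank[5];   /* list order: 3 -> 0 -> 4 -> 1 -> 2 */
    list_rank(5, 3, next, rank, 2);
    for (int i = 0; i < 5; i++) printf("rank[%d] = %d\n", i, rank[i]);
    return 0;
}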

23. An example higher-level algorithm
Tarjan-Vishkin biconnected components algorithm: O(log n) time, O(m+n) work.
1. Compute a spanning tree T of the input graph G.
2. Compute an Eulerian circuit for T.
3. Root the tree at an arbitrary vertex.
4. Preorder numbering of all the vertices.
5. Label edges using the vertex numbering.
6. Connected components using the Shiloach-Vishkin algorithm.

24. Data structures: graph representation
Dense graphs (m = O(n^2)): adjacency matrix commonly used.
Sparse graphs: adjacency lists, similar to the CSR sparse matrix format.
Dynamic sparse graphs: we need to support edge and vertex membership queries, insertions, and deletions.
– should be space-efficient, with low synchronization overhead
Several different representations are possible:
– Resizable adjacency arrays
– Adjacency arrays, sorted by vertex identifiers
– Adjacency arrays for low-degree vertices, heap-based structures for high-degree vertices (for sparse graphs with skewed degree distributions)
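A minimal C sketch of the CSR-like adjacency-array representation mentioned above for static sparse graphs, built here from an undirected edge list. The struct and field names are assumptions for illustration, not the layout of any particular library.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int n;       /* vertices                       */
    long m;      /* directed (stored) edges        */
    long *xoff;  /* size n+1: neighbor offsets     */
    int *xadj;   /* size m: concatenated neighbors */
} csr_graph;

csr_graph build_csr(int n, long m, const int *src, const int *dst) {
    csr_graph g = { n, 2 * m, calloc(n + 1, sizeof(long)), malloc(2 * m * sizeof(int)) };
    for (long e = 0; e < m; e++) { g.xoff[src[e] + 1]++; g.xoff[dst[e] + 1]++; }
    for (int v = 0; v < n; v++) g.xoff[v + 1] += g.xoff[v];   /* prefix sum of degrees */
    long *pos = malloc((n + 1) * sizeof(long));
    for (int v = 0; v <= n; v++) pos[v] = g.xoff[v];
    for (long e = 0; e < m; e++) {
        g.xadj[pos[src[e]]++] = dst[e];   /* store both directions: undirected */
        g.xadj[pos[dst[e]]++] = src[e];
    }
    free(pos);
    return g;
}

int main(void) {
    int src[] = {0, 0, 1, 2}, dst[] = {1, 2, 2, 3};
    csr_graph g = build_csr(4, 4, src, dst);
    for (int v = 0; v < g.n; v++) {       /* neighbors of v: xadj[xoff[v] .. xoff[v+1]) */
        printf("%d:", v);
        for (long i = g.xoff[v]; i < g.xoff[v + 1]; i++) printf(" %d", g.xadj[i]);
        printf("\n");
    }
    return 0;
}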

25. Data structures in (parallel) graph algorithms
A wide range of ADTs are used in graph algorithms: array, list, queue, stack, set, multiset, tree.
ADT implementations are typically array-based for performance reasons.
Key data-structure considerations in parallel graph algorithm design:
– Practical parallel priority queues
– Space efficiency
– Parallel set/multiset operations, e.g., union, intersection, etc.

26. Talk Outline (outline repeated; see slide 2)

27. Connected Components
Building blocks for many graph algorithms
– Minimum spanning tree, spanning tree, planarity testing, etc.
Representative of the "graft-and-shortcut" approach
CRCW PRAM algorithms
– [Shiloach & Vishkin '82]: O(log n) time, O((m+n) log n) work
– [Gazit '91]: randomized, optimal, O(log n) time
CREW algorithms
– [Han & Wagner '90]: O(log^2 n) time, O((m + n log n) log n) work

28. Shiloach-Vishkin algorithm
Input: n isolated vertices and m PRAM processors.
Each processor P_i grafts a tree rooted at vertex v_i onto the tree containing one of its neighbors u, under the constraint u < v_i.
Grafting creates k ≥ 1 connected subgraphs; each subgraph is then shortcut so that the depth of its tree is at least halved.
Repeat grafting and shortcutting until no more grafting is possible.
Runs on an arbitrary CRCW PRAM in O(log n) time with O(m) processors.
Helman-JaJa model: T_M = (3m/p + 2) log n, B = 4 log n.

29. SV pseudo-code
Input: (1) a set of m edges (i, j) given in arbitrary order; (2) array D[1..n] with D[i] = i.
Output: array D[1..n] with D[i] being the component to which vertex i belongs.
begin
  while true do
    1. for (i, j) ∈ E in parallel do
         if D[i] = D[D[i]] and D[j] < D[i] then D[D[i]] = D[j];
    2. for (i, j) ∈ E in parallel do
         if i belongs to a star and D[j] ≠ D[i] then D[D[i]] = D[j];
    3. if all vertices are in rooted stars then exit;
       for all i in parallel do D[i] = D[D[i]];
end
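For concreteness, here is a simplified serial C sketch of the graft-and-shortcut loop. It is not a line-by-line transcription of the pseudo-code above: it omits the star test and instead repeats conditional grafting followed by full pointer jumping until the labels stop changing, which yields the same component labels. In the parallel algorithm both inner loops run over the edges and vertices concurrently.

#include <stdio.h>

void connected_components(int n, int m, int edge[][2], int *D) {
    for (int i = 0; i < n; i++) D[i] = i;          /* each vertex is its own root */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int e = 0; e < m; e++) {              /* graft roots onto smaller labels */
            int u = edge[e][0], v = edge[e][1];
            if (D[u] < D[v] && D[v] == D[D[v]]) { D[D[v]] = D[u]; changed = 1; }
            if (D[v] < D[u] && D[u] == D[D[u]]) { D[D[u]] = D[v]; changed = 1; }
        }
        for (int i = 0; i < n; i++)                /* shortcut: pointer jumping */
            while (D[i] != D[D[i]]) D[i] = D[D[i]];
    }
}

int main(void) {
    int edge[3][2] = { {0, 1}, {2, 3}, {1, 3} };   /* {0,1,2,3} form one component; 4 is isolated */
    int D[5];
    connected_components(5, 3, edge, D);
    for (int i = 0; i < 5; i++) printf("D[%d] = %d\n", i, D[i]);
    return 0;
}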

30. SV Illustration
[Figure: a 4-vertex input graph; grafting steps (1,4) and (2,3) followed by shortcutting, over two iterations]

31. Talk Outline (outline repeated; see slide 2)

32. Parallel Single-Source Shortest Paths (SSSP) algorithms
No known PRAM algorithm runs in sub-linear time and O(m + n log n) work.
Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
Ullman-Yannakakis randomized approach [UY90]
Meyer et al. ∆-stepping algorithm [MS03]
Distributed-memory implementations based on graph partitioning
Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances", Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

33. ∆-stepping algorithm [MS03]
Label-correcting algorithm: can also relax edges from unsettled vertices
∆-stepping: an "approximate bucket implementation of Dijkstra's algorithm"
∆: the bucket width
Vertices are ordered using buckets, each representing a priority range of size ∆
Each bucket may be processed in parallel

34. [Figure-only slide]

35. Classify edges as "heavy" and "light"

36. Relax light edges (phase); repeat until B[i] is empty

37. Relax heavy edges. No reinsertions in this step.

38-46. ∆-stepping algorithm: illustration (∆ = 0.1)
[Figure: a 7-vertex example graph with edge weights between 0.01 and 0.56, the d array, and the buckets, updated step by step]
One parallel phase, while the current bucket is non-empty:
i) Inspect light edges
ii) Construct a set of "requests" (R)
iii) Clear the current bucket
iv) Remember the deleted vertices (S)
v) Relax the request pairs in R
Then relax the heavy request pairs (from S) and go on to the next bucket.
Initialization: insert s into bucket 0, d(s) = 0.
The slides step through the example: the source's light edges generate requests, relaxations fill in the d array, vertices move into buckets by distance range, and settled vertices accumulate in S until all buckets are processed (final d array: 0, .03, .01, .06, .16, .29, .62).
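A minimal serial C sketch of the bucketed relaxation loop illustrated above. Buckets hold vertices by tentative-distance range [k∆, (k+1)∆); light edges (weight ≤ ∆) are relaxed in phases while the current bucket keeps refilling, and heavy edges are relaxed once per bucket from the remembered set S. The fixed bucket capacities, the lazy deletion of stale entries, and the 5-vertex example graph are assumptions made for illustration; in the parallel algorithm the requests generated in each phase are relaxed concurrently.

#include <stdio.h>

#define NV 5
#define MAXB 64      /* assumes all distances stay below MAXB * delta */
#define CAP 128

typedef struct { int to; double w; } edge_t;

static double dist[NV];
static int bcnt[MAXB], bucket[MAXB][CAP], in_S[NV];

static void relax(int v, double x, double delta) {
    if (x < dist[v]) {
        dist[v] = x;
        int b = (int)(x / delta);
        bucket[b][bcnt[b]++] = v;                 /* lazy insertion; stale copies are skipped */
    }
}

void delta_stepping(edge_t adj[NV][NV + 1], int src, double delta) {
    for (int v = 0; v < NV; v++) dist[v] = 1e18;
    relax(src, 0.0, delta);
    for (int b = 0; b < MAXB; b++) {
        int S[NV], scnt = 0;
        for (int v = 0; v < NV; v++) in_S[v] = 0;
        while (bcnt[b] > 0) {                     /* one "phase" per pass */
            int cur[CAP], ccnt = bcnt[b];
            for (int i = 0; i < ccnt; i++) cur[i] = bucket[b][i];
            bcnt[b] = 0;                          /* clear the current bucket */
            for (int i = 0; i < ccnt; i++) {
                int u = cur[i];
                if ((int)(dist[u] / delta) != b) continue;      /* stale entry */
                if (!in_S[u]) { in_S[u] = 1; S[scnt++] = u; }   /* remember deleted vertices */
                for (int e = 0; adj[u][e].to >= 0; e++)
                    if (adj[u][e].w <= delta)     /* light edges may reinsert into bucket b */
                        relax(adj[u][e].to, dist[u] + adj[u][e].w, delta);
            }
        }
        for (int i = 0; i < scnt; i++) {          /* heavy edges: no reinsertion into b */
            int u = S[i];
            for (int e = 0; adj[u][e].to >= 0; e++)
                if (adj[u][e].w > delta)
                    relax(adj[u][e].to, dist[u] + adj[u][e].w, delta);
        }
    }
}

int main(void) {
    /* a made-up 5-vertex directed example; {-1, 0} terminates each adjacency list */
    edge_t adj[NV][NV + 1] = {
        { {1, 0.05}, {2, 0.30}, {-1, 0} },
        { {2, 0.07}, {3, 0.45}, {-1, 0} },
        { {3, 0.02}, {4, 0.12}, {-1, 0} },
        { {4, 0.09}, {-1, 0} },
        { {-1, 0} },
    };
    delta_stepping(adj, 0, 0.1);
    for (int v = 0; v < NV; v++) printf("dist[%d] = %.2f\n", v, dist[v]);
    return 0;
}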

47. Number of phases (machine-independent performance count)
[Chart: phase counts for low-diameter vs. high-diameter graph families]

48. Average shortest path weight for various graph families
~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0, 1]

49. Last non-empty bucket (machine-independent performance count)
Fewer buckets, more parallelism

50. Number of bucket insertions (machine-independent performance count)

51. Talk Outline (outline repeated; see slide 2)

52. Betweenness Centrality
Centrality: a quantitative measure of the importance of a vertex/edge in a graph
– degree, closeness, eigenvalue, betweenness
Betweenness centrality: BC(v) = Σ_{s≠v≠t} σ_st(v) / σ_st, where σ_st is the number of shortest paths between s and t, and σ_st(v) is the number of those paths passing through v.
Applied to several real-world networks:
– Social interactions
– WWW
– Epidemiology
– Systems biology

53. Algorithms for Computing Betweenness
All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n^3) time), then sum up the fractional dependency values (O(n^2) space).
Brandes' algorithm (2003): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
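A minimal serial C sketch of the Brandes-style approach for an unweighted, undirected graph, showing the two phases that the parallel algorithm on the following slides augments: a BFS per source that counts shortest paths (sigma), followed by a backward dependency accumulation (delta). To stay short it does not store predecessor lists; during accumulation it re-checks d[v] = d[w] - 1, which identifies the same predecessors in the unweighted case. The small example graph is an assumption.

#include <stdio.h>

#define NV 5
#define DEG 8

static const int adj[NV][DEG] = {   /* -1 terminates each adjacency list */
    {1, 2, -1}, {0, 3, -1}, {0, 3, -1}, {1, 2, 4, -1}, {3, -1}
};

void betweenness(double BC[NV]) {
    for (int v = 0; v < NV; v++) BC[v] = 0.0;
    for (int s = 0; s < NV; s++) {
        int d[NV], queue[NV], stack[NV], head = 0, tail = 0, top = 0;
        double sigma[NV], delta[NV];
        for (int v = 0; v < NV; v++) { d[v] = -1; sigma[v] = 0.0; delta[v] = 0.0; }
        d[s] = 0; sigma[s] = 1.0; queue[tail++] = s;
        while (head < tail) {                      /* step 1: BFS + path counting */
            int u = queue[head++];
            stack[top++] = u;                      /* vertices in nondecreasing distance order */
            for (int e = 0; adj[u][e] >= 0; e++) {
                int v = adj[u][e];
                if (d[v] < 0) { d[v] = d[u] + 1; queue[tail++] = v; }
                if (d[v] == d[u] + 1) sigma[v] += sigma[u];
            }
        }
        while (top > 0) {                          /* step 2: dependency accumulation */
            int w = stack[--top];
            for (int e = 0; adj[w][e] >= 0; e++) {
                int v = adj[w][e];
                if (d[v] == d[w] - 1)              /* v precedes w on some shortest path */
                    delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w]);
            }
            if (w != s) BC[w] += delta[w];
        }
    }
}

int main(void) {
    double BC[NV];
    betweenness(BC);
    for (int v = 0; v < NV; v++) printf("BC[%d] = %.2f\n", v, BC[v]);
    return 0;
}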

54. Our New Parallel Algorithms
Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality
– for low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
– Exact algorithm: O(mn) work, O(m+n) space, O(nD) time (PRAM model) or O(nD + nm/p) time
Madduri et al. (2009): new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
– In practice, 2-3x faster than the previous algorithm
– Lock-free => better scalability on large parallel systems

55. Parallel BC Algorithm
Consider an undirected, unweighted graph.
High-level idea: level-synchronous parallel breadth-first search, augmented to compute centrality scores.
Two steps:
– traversal and path counting
– dependency accumulation

56. Parallel BC Algorithm Illustration: data structures
G (size m+n): read-only, adjacency-array representation
BC (size n): centrality score of each vertex
S (size n): stack of visited vertices
Visited (size n): mark to check if a vertex has been visited
D (size n): distance of each vertex from the source s
Sigma (size n): number of shortest paths through a vertex
Delta (size n): partial dependency score of each vertex
P (size m+n): multiset of predecessors of a vertex along shortest paths
Space requirement: 8(m+6n) bytes
[Figure: 10-vertex example graph]
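As a reading aid, the per-source working arrays listed above can be gathered into a single workspace struct. This is only a sketch: the field types, the Pcount array of predecessor-list lengths, and the allocation scheme are assumptions, not taken from the slide or from any particular implementation.

#include <stdlib.h>

typedef struct {
    int n;          /* vertices */
    long m;         /* edges    */
    double *BC;     /* size n:   centrality score of each vertex               */
    int *S;         /* size n:   stack of visited vertices                     */
    int *Visited;   /* size n:   mark to check if a vertex has been visited    */
    int *D;         /* size n:   distance of each vertex from the source       */
    long *Sigma;    /* size n:   number of shortest paths through a vertex     */
    double *Delta;  /* size n:   partial dependency score of each vertex       */
    int *P;         /* size m+n: multiset of predecessors along shortest paths */
    int *Pcount;    /* size n:   number of predecessors recorded per vertex    */
} bc_workspace;

bc_workspace bc_alloc(int n, long m) {
    bc_workspace w = { n, m };
    w.BC = calloc(n, sizeof(double));
    w.S = malloc(n * sizeof(int));
    w.Visited = calloc(n, sizeof(int));
    w.D = malloc(n * sizeof(int));
    w.Sigma = calloc(n, sizeof(long));
    w.Delta = calloc(n, sizeof(double));
    w.P = malloc((m + n) * sizeof(int));
    w.Pcount = calloc(n, sizeof(int));
    return w;
}

int main(void) {
    bc_workspace w = bc_alloc(1000, 10000);   /* placeholder sizes */
    w.D[0] = 0;                               /* e.g., the source vertex's distance */
    return 0;
}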

57-60. Parallel BC Algorithm Illustration: traversal step
1. Traversal step: visit adjacent vertices, update distances and path counts.
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
At the end of the traversal we have all reachable vertices, their corresponding predecessor multisets, and the D values.
[Figure: the example graph expanded level by level from the source vertex, with the S, D, and P arrays filled in]

61. Step 1 (traversal) pseudo-code
for all vertices u at level d in parallel do
  for all adjacencies v of u in parallel do
    dv = D[v];
    if (dv < 0)                      // v is visited for the first time
      vis = fetch_and_add(&Visited[v], 1);
      if (vis == 0)                  // v is added to a stack only once
        D[v] = d+1;
        pS[count++] = v;             // add v to the local thread stack
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Pcount[v], 1);  // add u to the predecessor list of v
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Pcount[v], 1);  // add u to the predecessor list of v
[Figure: vertices u, u1, u2 with edges e1, e2 to v, v1, v2]

62. Graph traversal step analysis
Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n).
Upper bound on the size of each predecessor multiset: in-degree.
Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts.
New algorithm: based on the observation that we don't need to store "predecessor" vertices; instead, we store successor edges along shortest paths.
– simplifies the accumulation step
– avoids an atomic operation in the traversal step
– cache-friendly!

63. Modified Step 1 pseudo-code
for all vertices u at level d in parallel do
  for all adjacencies v of u in parallel do
    dv = D[v];
    if (dv < 0)                      // v is visited for the first time
      vis = fetch_and_add(&Visited[v], 1);
      if (vis == 0)                  // v is added to a stack only once
        D[v] = d+1;
        pS[count++] = v;             // add v to the local thread stack
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);  // add v to the successor list of u
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);  // add v to the successor list of u

64. Graph traversal step: locality analysis
for all vertices u at level d in parallel do            // all the vertices are in a contiguous block (stack)
  for all adjacencies v of u in parallel do             // all the adjacencies of a vertex are stored compactly (graph rep.)
    dv = D[v];                                          // non-contiguous memory access
    if (dv < 0)
      vis = fetch_and_add(&Visited[v], 1);              // non-contiguous memory access
      if (vis == 0)
        D[v] = d+1;
        pS[count++] = v;
      fetch_and_add(&sigma[v], sigma[u]);               // non-contiguous memory access
      fetch_and_add(&Scount[u], 1);                     // store to S[u]
    if (dv == d + 1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);
Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.

65-66. Parallel BC Algorithm Illustration: accumulation step
2. Accumulation step: pop vertices from the stack and update dependency scores.
This step can also be done in a level-synchronous manner.
[Figure: the example graph with the S, P, and Delta arrays]

67. Step 2 (accumulation) pseudo-code
for level d = GraphDiameter to 2 do
  for all vertices w at level d in parallel do
    for all v in P[w] do
      acquire_lock(v);
      delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
      release_lock(v);
  BC[v] = delta[v]

68. Modified Step 2 pseudo-code (with successor lists)
for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do
    for all w in S[v] in parallel do    // reduction(delta)
      delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
    BC[v] = delta[v]

69. Accumulation step: locality analysis
for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do      // all the vertices are in a contiguous block (stack)
    for all w in S[v] in parallel do                // each S[v] is a contiguous block; reduction(delta)
      delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];   // the only floating-point operation in the code
    BC[v] = delta[v] = delta_sum_v;

70. Digression: centrality analysis applied to protein interaction networks
43 interactions; protein Ensembl ID ENSG00000145332.2, Kelch-like protein 8

71. Talk Outline (outline repeated; see slide 2)

72. Graph topology matters
Information networks are very different from the graph topologies and computations that arise in scientific computing.
Scientific computing: static networks, Euclidean topologies; classical graph algorithms typically assume a uniform random graph topology.
Informatics: dynamic, high-dimensional data.
Image sources: visualcomplexity.com (1, 2), MapQuest (3)

73. "Small-world" complex networks
Information networks are typically dynamic graph abstractions built from diverse data sources
– High-dimensional data
– Skewed ("power-law") degree distribution of the number of neighbors
– Low graph diameter
– Massive networks (billions of entities)
Kevin Bacon and six degrees of separation
Image source: Seokhee Hong

74. Implementation Challenges
Execution time is dominated by latency to main memory
– Large memory footprint
– Large number of irregular memory accesses
Essentially no computation to hide memory costs
Poor performance on current cache-based architectures (< 5-10% of peak achieved)
Memory access pattern depends on the graph topology
Variable degrees of concurrency in parallel algorithms

75. Desirable HPC Architectural Features
A global shared-memory abstraction
– no need to partition the graph
– supports dynamic updates
A high-bandwidth, low-latency network
Ability to exploit fine-grained parallelism
Support for light-weight synchronization
HPC systems with these characteristics:
– Massively multithreaded architectures
– Symmetric multiprocessors

76. Performance Results: Test Platforms
Cray XMT
– Latency tolerance by massive multithreading
– Hardware support for 128 threads on each processor
– Globally hashed address space
– No data cache
– Single-cycle context switch
– Multiple outstanding memory requests
– Support for fine-grained, word-level synchronization
– 16 x 500 MHz processors, 128 GB RAM
Sun Fire T5120 (Sun Niagara2): cache-based multicore server with chip multithreading
– 1 socket x 8 cores x 8 threads per core
– 8 KB private L1 cache per core, 4 MB shared L2 cache
– 1167 MHz processor, 32 GB RAM

77. Betweenness Centrality Performance
Approximate betweenness computation on a synthetic small-world network of 256 million vertices and 2 billion edges
TEPS: traversed edges per second, the performance rate for the centrality computation
BC-new is 2.2x faster than the previous approach
BC-new is 2.4x faster than BC-old

78. SNAP: Small-world Network Analysis and Partitioning
New parallel framework for complex graph analysis
– 10-100x faster than existing approaches
– Can process graphs with billions of vertices and edges
– Open source: snap-graph.sourceforge.net
Image source: visualcomplexity.com

79. SNAP: compact graph representations for dynamic network analysis
New graph representations for dynamically evolving small-world networks in SNAP.
We support fast, parallel structural updates to low-diameter scale-free and small-world graphs.
Graph: 25M vertices and 200M edges; system: Sun Fire T2000

80. SNAP: induced subgraphs performance
Key kernel for dynamic graph computations.
We reduce the execution time of linear-work kernels from minutes to seconds for massive small-world networks (billions of vertices and edges).
Graph: 500M vertices and 2B edges; system: IBM p5 570 SMP

81. Large-scale Graph Traversal
Problem | Graph | Result | Comments
Multithreaded BFS [BM06] | Random graph, 256M vertices, 1B edges | 2.3 sec (40p), 73.9 sec (1p), MTA-2 | Processes all low-diameter graph families
External Memory BFS [ADMO06] | Random graph, 256M vertices, 1B edges | 8.9 hrs (3.2 GHz Xeon) | State-of-the-art external-memory BFS
Multithreaded SSSP [MBBC06] | Random graph, 256M vertices, 1B edges | 11.96 sec (40p), MTA-2 | Works well for all low-diameter graph families
Parallel Dijkstra [EBGL06] | Random graph, 240M vertices, 1.2B edges | 180 sec, 96p 2.0 GHz cluster | Best known distributed-memory SSSP implementation for large-scale graphs

82. Optimizations for real-world graphs
Preprocessing kernels (connected components, biconnected components, sparsification) significantly reduce computation time
– e.g., a high number of isolated and degree-1 vertices
Store BFS/shortest-path trees from high-degree vertices and reuse them
Typically 3-5x performance improvement
Exploit small-world network properties (low graph diameter)
– Load balancing in the level-synchronous parallel BFS algorithm
– SNAP data structures are optimized for unbalanced degree distributions

83. Faster community identification algorithms in SNAP: performance improvement over the Girvan-Newman approach
Graphs: real-world networks (on the order of millions of vertices); system: Sun Fire T2000
Speedups from algorithm engineering (approximate BC) and from parallelization (Sun Fire T2000) are multiplicative!
100-300x overall performance improvement over the Girvan-Newman approach

84. Large-scale graph analysis: current research
A synergistic combination of novel graph algorithms, high-performance computing, and algorithm engineering.
Themes: dynamic graph algorithms, complex network analysis & empirical studies, spectral techniques, data stream algorithms, classical graph algorithms, many-core, graph problems on dynamic network abstractions, stream computing, novel approaches, realistic modeling, enabling technologies, affordable exascale data storage.

85. Review of lecture
Applications: Internet and WWW, scientific computing, data analysis, surveillance
Parallel algorithm building blocks
– Kernels: PRAM algorithms, prefix sums, list ranking
– Data structures: graph representation, priority queues
Parallel algorithm case studies
– Connected components: graft and shortcut
– BFS: level-synchronous approach
– Shortest paths: parallel priority queue
– Betweenness centrality: parallel algorithm with pseudo-code
Performance on current systems
– Software: SNAP, Boost GL, igraph, NetworkX, Network Workbench
– Architectures: Cray XMT, cache-based multicore, SMPs
– Performance trends

86. Questions? Thank you!

