Tools and Primitives for High Performance Graph Computation


1 Tools and Primitives for High Performance Graph Computation
John R. Gilbert, University of California, Santa Barbara
Aydin Buluç (LBNL), Adam Lugowski (UCSB)
SIAM Minisymposium on Analyzing Massive Real-World Graphs, July 12, 2010
Support: NSF, DARPA, DOE, Intel

2 Large graphs are everywhere…
Internet structure, social interactions, scientific datasets: biological, chemical, cosmological, ecological, …
Typical computations: PageRank on the web graph, betweenness centrality on biological networks. Both operate on huge, sparse graphs.
WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong.

3 Types of graph computations

4 Continuous physical modeling
An analogy? As the “middleware” of scientific computing, linear algebra has supplied or enabled:
Mathematical tools
“Impedance match” to computer operations
High-level primitives
High-quality software libraries
Ways to extract performance from computer architecture
Interactive environments
[Diagram: continuous physical modeling, supported by linear algebra, running on computers]

5 Continuous physical modeling Discrete structure analysis
An analogy?
[Diagram: continuous physical modeling, supported by linear algebra, running on computers; in parallel, discrete structure analysis, supported by graph theory, running on computers]

6 An analogy? Well, we’re not there yet ….
✓ Mathematical tools
? “Impedance match” to computer operations
? High-level primitives
? High-quality software libraries
? Ways to extract performance from computer architecture
? Interactive environments
[Diagram: discrete structure analysis, supported by graph theory, running on computers]

7 The Primitives Challenge
By analogy to numerical scientific computing… what should the combinatorial BLAS look like?
Basic Linear Algebra Subroutines (BLAS): C = A·B, y = A·x, μ = xᵀy
[Plot: speed (Mflop/s) vs. matrix dimension n]

8 Primitives should …
Supply a common notation to express computations
Have broad scope but fit into a concise framework
Allow programming at the appropriate level of abstraction and granularity
Scale seamlessly from desktop to supercomputer
Hide architecture-specific details from users

9 The Case for Sparse Matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.
Traditional graph computations vs. graphs in the language of linear algebra:
Traditional: data-driven, unpredictable communication. Linear algebra: fixed communication patterns.
Traditional: irregular and unstructured, with poor locality of reference. Linear algebra: operations on matrix blocks exploit the memory hierarchy.
Traditional: fine-grained data accesses, dominated by latency. Linear algebra: coarse-grained parallelism, limited by bandwidth.

10 Identification of Primitives
Sparse array-based primitives:
Sparse matrix-matrix multiplication (SpGEMM)
Element-wise operations (.*)
Sparse matrix-dense vector multiplication (SpMV)
Sparse matrix indexing
Matrices on various semirings: (×, +), (and, or), (+, min), …
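As a small illustration of what these four primitives look like in practice, here is a sketch using scipy.sparse. This is not the Combinatorial BLAS API; the matrices, sizes, and densities are made up, and scipy only supports the ordinary (×, +) semiring.

```python
# Sketch of the four array primitives using scipy.sparse (illustration only,
# not the Combinatorial BLAS API; matrices and sizes are made up).
import numpy as np
import scipy.sparse as sp

n = 8
A = sp.random(n, n, density=0.3, format='csr', random_state=0)
B = sp.random(n, n, density=0.3, format='csr', random_state=1)
x = np.random.rand(n)

C = A @ B            # SpGEMM: sparse matrix times sparse matrix
E = A.multiply(B)    # element-wise (Hadamard) product
y = A @ x            # sparse matrix times dense vector (SpMV)
S = A[2:5, 2:5]      # sparse matrix indexing: extract a submatrix
```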

11 Multiple-source breadth-first search
[Figure: example directed graph on vertices 1-7, its adjacency matrix Aᵀ, and a sparse frontier matrix X with one column per simultaneous search]

12 Multiple-source breadth-first search
[Figure: the same graph; multiplying Aᵀ by the frontier matrix X yields AᵀX, the next frontier of every search at once]
The parallelism here is at the edge level: a 2-D partitioning of the matrix is in effect an edge partitioning of the graph.

13 Multiple-source breadth-first search
[Figure: AᵀX computes the next frontiers of all searches simultaneously]
Sparse array representation ⇒ space efficient
Sparse matrix-matrix multiplication ⇒ work efficient
Three possible levels of parallelism: searches, vertices, edges
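To make the AᵀX formulation concrete, here is a minimal Python/scipy sketch of one frontier-expansion step of multiple-source BFS. The graph, the two starting vertices, and the use of integer counts followed by a Boolean threshold (standing in for a Boolean semiring) are all illustrative choices, not the Combinatorial BLAS implementation.

```python
# Sketch: one step of multiple-source BFS as a sparse matrix times sparse
# matrix product (illustration; the small graph and sources are made up).
import numpy as np
import scipy.sparse as sp

# Adjacency matrix of a small directed graph: A[i, j] = 1 if there is an edge i -> j.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 4), (4, 5), (5, 6)]
rows, cols = zip(*edges)
n = 7
A = sp.csr_matrix((np.ones(len(edges), dtype=np.int8), (rows, cols)), shape=(n, n))

# Frontier matrix X: one column per simultaneous search; searches start at vertices 0 and 4.
X = sp.csr_matrix((np.ones(2, dtype=np.int8), ([0, 4], [0, 1])), shape=(n, 2))

# A^T X gives, for each search, the set of vertices reachable in one more step.
next_frontier = (A.T @ X) > 0          # threshold plays the role of the Boolean semiring
print(next_frontier.toarray())
```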

14 A Few Examples

15 Typical software stack
Combinatorial BLAS [Buluç, G]: a parallel graph library based on distributed-memory sparse arrays and algebraic graph primitives.
Example application: betweenness centrality (BC), the fraction of shortest paths that pass through a given node; BC(v) = Σ_{s≠v≠t} σ_st(v)/σ_st, where σ_st counts shortest paths from s to t and σ_st(v) counts those that pass through v.
Why betweenness centrality? It has many applications in identifying important vertices, and it is a standard graph benchmark for comparing implementations. Computed with Brandes’ algorithm.

16 BC performance in distributed memory
RMAT power-law graph, 2^scale vertices, average degree 8. TEPS = traversed edges per second. The benchmark is about one page of code using the Combinatorial BLAS.

17 Layer 1: Graph Theoretic Tools
KDT: a toolbox for graph analysis and pattern discovery [G, Reinhardt, Shah]
Graph operations
Global structure of graphs
Graph partitioning and clustering
Graph generators
Visualization and graphics
Scan and combining operations
Utilities

18 Star-P architecture
[Diagram: a MATLAB® client connects through a client manager to a server manager running on processors #0 … #n-1; each server processor hosts matrix and package managers with dense/sparse operations, sort, ScaLAPACK, FFTW, MPI and UPC user code, and an FPGA interface. Ordinary Matlab variables live on the client; distributed matrices live on the servers.]

19 Landscape connectivity modeling
Habitat quality, gene flow, corridor identification, conservation planning.
Pumas in southern California: 12 million nodes, under 1 hour.
Targeting larger problems: the Yellowstone-to-Yukon corridor.
Figures courtesy of Brad McRae.

20 Circuitscape [McRae, Shah]
Predicting gene flow with resistive networks. Matlab, Python, and Star-P (parallel) implementations.
Combinatorics:
Initial discrete grid: ideally 100 m resolution (for pumas)
Partition the landscape into connected components
Graph contraction: habitats become nodes in the resistive network
Numerics:
Resistance computations for pairs of habitats in the landscape (a small sketch follows below)
Iterative linear solvers invoked via Star-P: Hypre (PCG + AMG)
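The numerical core of such a resistance computation can be sketched as a graph Laplacian solve: inject one unit of current at one habitat node, extract it at another, and read the effective resistance off the potential difference. The tiny graph, the conductance values, and the use of scipy's CG (rather than Star-P/Hypre with AMG preconditioning) are all assumptions made for illustration.

```python
# Sketch of an effective-resistance computation between two habitat nodes
# (illustration with scipy CG; not the Star-P/Hypre setup used in Circuitscape).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Weighted undirected graph: conductances on edges (made-up example).
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (0, 3, 0.5), (1, 3, 1.0)]
n = 4
rows = [u for u, v, w in edges] + [v for u, v, w in edges]
cols = [v for u, v, w in edges] + [u for u, v, w in edges]
vals = [w for u, v, w in edges] * 2
G = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

# Graph Laplacian L = D - G.
L = sp.diags(np.asarray(G.sum(axis=1)).ravel()) - G

# Inject +1 unit of current at the source, -1 at the target; solve L v = b.
source, target = 0, 2
b = np.zeros(n)
b[source], b[target] = 1.0, -1.0
# L is singular, but b is orthogonal to its null space (the constant vector),
# so CG converges to a valid potential.
v, info = cg(L, b, atol=1e-10)
resistance = v[source] - v[target]
print(resistance)
```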

21 A Few Nuts & Bolts

22 Why focus on SpGEMM? SpGEMM: sparse matrix × sparse matrix
Graph clustering (Markov, peer pressure); a small Markov-clustering sketch follows below
Subgraph / submatrix indexing
Shortest path calculations
Betweenness centrality
Graph contraction
Cycle detection
Multigrid interpolation & restriction
Colored intersection searching
Applying constraints in finite element computations
Context-free parsing
…
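As one example of how an application from this list maps onto the primitives, here is a sketch of a single Markov clustering (MCL) iteration: the expansion step is an SpGEMM, and the inflation step is an element-wise power followed by column normalization. The small graph, the inflation exponent, and doing only one iteration are illustrative assumptions.

```python
# Sketch: one iteration of Markov clustering (MCL) expressed with the sparse
# primitives (illustration only; the graph and parameters are made up).
import numpy as np
import scipy.sparse as sp

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
r, c = zip(*edges)
A = sp.csr_matrix((np.ones(len(edges)), (r, c)), shape=(n, n))
A = A + A.T + sp.eye(n)                       # symmetrize and add self-loops

def col_normalize(M):
    # Scale each column to sum to 1 (right-multiply by a diagonal matrix).
    return M @ sp.diags(1.0 / np.asarray(M.sum(axis=0)).ravel())

M = col_normalize(A)
M = M @ M                                     # expansion: SpGEMM
M = col_normalize(M.power(2))                 # inflation: element-wise square, then renormalize
```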

23 Two Versions of Sparse GEMM
1-D block-column distribution: processor i owns column blocks Bi and Ci and computes Ci = Ci + A·Bi.
2-D block distribution: processor (i, j) owns blocks Aij, Bij, Cij and computes Cij += Aik·Bkj over k.
In graph terms, 1-D corresponds to the standard vertex-based distribution, in which each processor owns n/p vertices together with all their outgoing (or incoming) edges. 2-D is an edge-based distribution, like the one employed by Yoo et al. for graph algorithms on BlueGene (a Gordon Bell prize finalist).
Communication in the 2-D case involves only √p processors at a time instead of p in the 1-D case. The 1-D formulation has other algorithmic problems as well [ICPP’08].
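The two block formulations can be made concrete with a small serial sketch. The real algorithms run these block updates on p distributed processors over sparse blocks; here the "processors" are just array slices over small dense matrices, and the sizes are made up, but the update structure (Ci = Ci + A·Bi vs. Cij += Aik·Bkj) is the same.

```python
# Serial sketch of the 1-D block-column and 2-D block formulations
# (illustration with dense numpy arrays; p, n, and data are made up).
import numpy as np

p = 4                                   # number of "processors"
n = 8                                   # matrix dimension, divisible by p and by sqrt(p)
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# 1-D block-column: processor i owns column blocks B_i and C_i.
#   C_i = C_i + A * B_i    (every processor eventually needs all of A)
C1 = np.zeros((n, n))
w = n // p
for i in range(p):
    C1[:, i*w:(i+1)*w] += A @ B[:, i*w:(i+1)*w]

# 2-D block: processor (i, j) owns blocks A_ij, B_ij, C_ij on a sqrt(p) x sqrt(p) grid.
#   C_ij += A_ik * B_kj    (communication only along grid rows and columns)
q = 2                                   # sqrt(p)
s = n // q
C2 = np.zeros((n, n))
for i in range(q):
    for j in range(q):
        for k in range(q):
            C2[i*s:(i+1)*s, j*s:(j+1)*s] += (
                A[i*s:(i+1)*s, k*s:(k+1)*s] @ B[k*s:(k+1)*s, j*s:(j+1)*s])

assert np.allclose(C1, A @ B) and np.allclose(C2, A @ B)
```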

24 Parallelism in multiple-source BFS
[Figure: the AᵀX formulation of multiple-source BFS]
Three levels of parallelism from the 2-D data decomposition:
columns of X: over multiple simultaneous searches
rows of X and columns of Aᵀ: over multiple frontier nodes
rows of Aᵀ: over edges incident on high-degree frontier nodes

25 Modeled limits on speedup, sparse 1-D & 2-D
[Plots: modeled speedup for the 1-D (Star-P) algorithm and the new 2-D algorithm; the x axis is the number of processors P, the y axis the matrix dimension N, and the z axis the speedup. Note the difference between the vertical scales.]
1-D algorithms do not scale beyond roughly 40×.
The break-even point is around 50 processors.

26 Submatrices are hypersparse (nnz << n)
Average nonzeros per column = c
Average nonzeros per column within a block = c/√p
Total memory using compressed sparse columns, summed over all blocks = O(n√p + nnz)
CSC (and CSR) are vertex-based data structures: they are isomorphic to an adjacency-list representation, with a dense array of vertices and one chain of incident edges per array element. A 2-D distribution, however, is edge-based, so using CSC on 2-D distributed data forces a vertex-based representation onto edge-distributed data. The result is unnecessary replication of vertex heads.
Any algorithm whose complexity depends on the matrix dimension n is asymptotically too wasteful for hypersparse submatrices.
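A quick back-of-the-envelope check makes the point. The vertex count, degree, and processor count below are made-up example numbers, and the O(nnz) cost attributed to a dimension-free format (such as DCSC) is a rough assumption; the comparison only illustrates why the n√p column-pointer term dominates once blocks become hypersparse.

```python
# Back-of-the-envelope memory comparison (made-up sizes, rough constants).
n = 2**25           # vertices
c = 8               # average nonzeros (edges) per column
p = 4096            # processors on a sqrt(p) x sqrt(p) grid
nnz = c * n

csc_total  = n * int(p ** 0.5) + nnz   # O(n*sqrt(p) + nnz) summed over all blocks
dcsc_total = 2 * nnz                   # rough O(nnz) for a dimension-free format
print(csc_total / nnz)                 # total CSC memory is 9x the number of nonzeros here
```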

27 Distributed-memory sparse matrix-matrix multiplication
Cij += Aik · Bkj on a 2-D block layout
Outer product formulation
Sequential “hypersparse” kernel
Scales well to hundreds of processors
Betweenness centrality benchmark: over 200 MTEPS
Experiments: TACC Lonestar cluster
[Plot: time vs. number of cores, 1M-vertex RMAT graph]

28 CSB: Compressed sparse block storage [Buluc, Fineman, Frigo, G, Leiserson]

29 CSB for parallel Ax and ATx [Buluc, Fineman, Frigo, G, Leiserson]
Efficient multiplication of a sparse matrix and its transpose by a vector
Compressed sparse block storage
Critical path never more than ~ √n · log n
Multicore / multisocket architectures
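A minimal sketch of the compressed-sparse-blocks idea follows: nonzeros are grouped by β×β block and stored with block-local coordinates, so y = Ax and y = Aᵀx traverse exactly the same structure and can both be parallelized over blocks. The real CSB format, its block ordering, and its work-stealing schedule are considerably more involved; the dictionary-of-blocks representation and the tiny example matrix here are illustrative assumptions.

```python
# Minimal sketch of compressed sparse blocks: group nonzeros by beta x beta
# block and keep block-local (row, col) offsets. (Illustration only.)
import numpy as np
from collections import defaultdict

def to_csb(rows, cols, vals, beta):
    blocks = defaultdict(list)
    for r, c, v in zip(rows, cols, vals):
        blocks[(r // beta, c // beta)].append((r % beta, c % beta, v))
    return blocks

def spmv(blocks, x, n, beta, transpose=False):
    y = np.zeros(n)
    for (bi, bj), nzs in blocks.items():       # each block is an independent task
        if transpose:
            bi, bj = bj, bi
        for lr, lc, v in nzs:
            if transpose:
                lr, lc = lc, lr
            y[bi * beta + lr] += v * x[bj * beta + lc]
    return y

# Example use on a tiny made-up matrix: both A x and A^T x reuse the same blocks.
rows, cols, vals = [0, 2, 5], [1, 4, 3], [1.0, 2.0, 3.0]
blocks = to_csb(rows, cols, vals, beta=4)
y  = spmv(blocks, np.arange(8, dtype=float), n=8, beta=4)
yt = spmv(blocks, np.arange(8, dtype=float), n=8, beta=4, transpose=True)
```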

30 From Semirings to Computational Patterns

31 Matrices over semirings
Matrix multiplication C = AB (or matrix-vector): Ci,j = Ai,1B1,j + Ai,2B2,j + · · · + Ai,nBn,j
Replace the scalar operations × and + by ⊗ and ⊕, where
⊗ is associative and distributes over ⊕, with identity 1
⊕ is associative and commutative, with identity 0, and 0 annihilates under ⊗
Then Ci,j = Ai,1⊗B1,j ⊕ Ai,2⊗B2,j ⊕ · · · ⊕ Ai,n⊗Bn,j
Examples: (×, +); (and, or); (+, min); …
No change to the data reference pattern or control flow.
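The point that only the scalar operations change can be shown with a small sketch of matrix multiplication parameterized by ⊕ and ⊗. With (⊕, ⊗) = (min, +) and 0 = ∞ it performs one step of all-pairs shortest paths. The function name, the dense loop structure, and the example distance matrix are illustrative assumptions, not a library API.

```python
# Sketch of matrix multiplication over a user-chosen semiring (dense and
# unoptimized; illustration only). The loop structure never changes; only
# oplus, otimes, and the additive identity do.
import numpy as np

def semiring_matmul(A, B, oplus, otimes, zero):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.full((n, m), zero)
    for i in range(n):
        for j in range(m):
            acc = zero
            for l in range(k):
                acc = oplus(acc, otimes(A[i, l], B[l, j]))
            C[i, j] = acc
    return C

INF = np.inf
D = np.array([[0.0, 3.0, INF],
              [INF, 0.0, 1.0],
              [2.0, INF, 0.0]])
# (min, +) semiring with identity INF: shortest paths using at most two edges.
D2 = semiring_matmul(D, D, min, lambda a, b: a + b, INF)
```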

32 From semirings to computational patterns
Sparse matrix times vector as a semiring operation:
Given vertex data xi and edge data ai,j
For each vertex j of interest, compute yj = ai1,j⊗xi1 ⊕ ai2,j⊗xi2 ⊕ · · · ⊕ aik,j⊗xik
The user specifies the definitions of the operations ⊕ and ⊗.

33 From semirings to computational patterns
Sparse matrix times vector as a computational pattern:
Given vertex data and edge data
For each vertex of interest, combine data from neighboring vertices and edges
The user specifies the desired computation on data from neighbors.
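Here is a sketch of that pattern view: the library supplies the traversal over a vertex's incident edges, and the user supplies only the combining step applied to edge and neighbor data. The function and variable names, the CSC-like index arrays, and the example combine function are made up for illustration; they are not the KDT or Graph BLAS interface.

```python
# Sketch of the pattern view of SpMV: traversal is fixed, the combine step is
# user-supplied and need not come from a semiring. (Names are made up.)
import numpy as np

def combine_from_neighbors(indptr, indices, edge_data, vertex_data, combine, init):
    n = len(indptr) - 1
    out = []
    for j in range(n):                               # for each vertex of interest
        acc = init
        for e in range(indptr[j], indptr[j + 1]):    # its incident edges
            acc = combine(acc, edge_data[e], vertex_data[indices[e]])
        out.append(acc)
    return out

# Tiny example: combine step chosen freely by the user
# (here, the largest edge-weighted neighbor value).
indptr  = np.array([0, 2, 3, 4])
indices = np.array([1, 2, 0, 0])
edata   = np.array([0.5, 2.0, 1.0, 3.0])
vdata   = np.array([10.0, 20.0, 30.0])
res = combine_from_neighbors(indptr, indices, edata, vdata,
                             lambda acc, e, v: max(acc, e * v), 0.0)
```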

34 SpGEMM as a computational pattern
Explore length-two paths that use specified vertices
Possibly do some filtering, accumulation, or other computation with vertex and edge attributes, e.g. “friends of friends” (per Lars Backstrom)
May or may not want to form the product graph explicitly
Formulation as semiring matrix multiplication is often possible but sometimes clumsy
Same data flow and communication patterns as in SpGEMM
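A small sketch of the “friends of friends” case: square the Boolean friendship matrix to find length-two paths, then filter out existing friends and self-pairs. The tiny friendship data is made up, and for clarity the product is densified here, whereas a real system might never form the product graph explicitly.

```python
# Sketch of "friends of friends" via SpGEMM plus filtering (made-up data).
import numpy as np
import scipy.sparse as sp

pairs = [(0, 1), (1, 2), (2, 3), (0, 4)]            # undirected friendships
n = 5
r, c = zip(*pairs)
F = sp.csr_matrix((np.ones(len(pairs)), (r, c)), shape=(n, n))
F = ((F + F.T) > 0).astype(np.int8)                 # symmetric Boolean friendship matrix

fof = ((F @ F) > 0).toarray()                       # length-two paths (the SpGEMM step)
fof &= (F.toarray() == 0)                           # drop pairs that are already friends
np.fill_diagonal(fof, False)                        # drop self-pairs
candidates = np.argwhere(fof)                       # suggested friend-of-friend pairs
```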

35 Graph BLAS: A pattern-based library
User-specified operations and attributes give the performance benefits of algebraic primitives with a more intuitive and flexible interface. Common framework integrates algebraic (edge-based), visitor (traversal-based), and map-reduce patterns. 2D compressed sparse block structure supports user-defined edge/vertex/attribute types and operations. “Hypersparse” kernels tuned to reduce data movement. Initial target: manycore and multisocket shared memory.

36 Challenge: Complete the analogy . . .
✓ Mathematical tools
? “Impedance match” to computer operations
? High-level primitives
? High-quality software libraries
? Ways to extract performance from computer architecture
? Interactive environments
[Diagram: discrete structure analysis, supported by graph theory, running on computers]

