High-Performance Computation for Path Problems in Graphs


High-Performance Computation for Path Problems in Graphs
Aydin Buluç and John R. Gilbert, University of California, Santa Barbara
SIAM Conference on Applications of Dynamical Systems, May 20, 2009
Support: DOE Office of Science, MIT Lincoln Labs, NSF, DARPA, SGI

Horizontal-vertical decomposition [Mezic et al.]
Explore "polarity" (input-output structure) in dynamical systems defined on graphs to simplify function and uncertainty propagation analysis.
Slide courtesy of the Igor Mezic group, UCSB

Combinatorial Scientific Computing
"I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were 'sparse' in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care."
- Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize in Economics

A few directions in CSC
- Hybrid discrete & continuous computations
- Multiscale combinatorial computation
- Analysis, management, and propagation of uncertainty
- Economic & game-theoretic considerations
- Computational biology & bioinformatics
- Computational ecology
- Knowledge discovery & machine learning
- Relationship analysis
- Web search and information retrieval
- Sparse matrix methods
- Geometric modeling
- . . .

The Parallel Computing Challenge
Intel 80-core chip: > 1 TFLOPS
Two Nvidia 8800 GPUs: > 1 TFLOPS
LANL / IBM Roadrunner: > 1 PFLOPS
Parallelism is no longer optional… in every part of a computation.

The Parallel Computing Challenge
Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies.
Several parallelization strategies exist, but there is no silver bullet:
- Partitioning (e.g. for preconditioning PDE solvers)
- Pointer-jumping (e.g. for connected components)
- Sometimes it just depends on what the input looks like
A few simple examples . . .

Sample kernel: sort a logically triangular matrix
[Figure: original matrix, and the same matrix permuted to unit upper triangular form]
Used in sparse linear solvers (e.g. Matlab's)
A simple kernel that abstracts many other graph operations (see the following slides)
Sequential: linear time, by a simple greedy topological sort
Parallel: no known method is efficient in both work and span; the obvious approach takes one parallel step per level, and dependent chains can be arbitrarily long
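To make the greedy kernel concrete, here is a minimal dense-matrix sketch in Python (an illustration only, with names of my choosing, not the solver's actual code): repeatedly pick a row with exactly one surviving nonzero, place that row and its column last, and recurse on what remains. The real kernel runs in linear time on the sparse structure; this dense version pays a quadratic scan per step for clarity.

import numpy as np

def sort_logically_triangular(A):
    # Greedy topological sort of a "logically triangular" matrix: find row
    # and column orders rows, cols so that A[rows][:, cols] is upper
    # triangular with a nonzero diagonal.  Assumes such orders exist.
    B = (np.asarray(A) != 0)
    n = B.shape[0]
    alive_r = list(range(n))
    alive_c = list(range(n))
    rows, cols = [], []
    for _ in range(n):
        sub = B[np.ix_(alive_r, alive_c)]
        counts = sub.sum(axis=1)
        i = int(np.flatnonzero(counts == 1)[0])  # a row with one remaining nonzero
        j = int(np.flatnonzero(sub[i])[0])       # its unique surviving column
        rows.append(alive_r[i])                  # this row/column pair goes last
        cols.append(alive_c[j])
        del alive_r[i], alive_c[j]
    rows.reverse()
    cols.reverse()
    return rows, cols

For example, sort_logically_triangular([[0, 1], [1, 1]]) returns rows [1, 0] and cols [0, 1], and the permuted matrix is upper triangular with a nonzero diagonal.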

Bipartite matching
[Figure: matrix A and its row permutation PA with nonzeros on the diagonal]
Perfect matching: a set of edges that hits each vertex exactly once
Matrix permutation to place nonzeros (or heavy elements) on the diagonal
Efficient sequential algorithms based on augmenting paths
No known work/span-efficient parallel algorithms
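For reference, a standard augmenting-path matching routine (Kuhn's algorithm) fits in a few lines; this is a generic sketch with illustrative names, not the tuned code used inside sparse solvers. Here adj[u] lists the columns (right vertices) adjacent to row (left vertex) u.

def max_bipartite_matching(adj, n_left, n_right):
    # Augmenting-path bipartite matching (Kuhn's algorithm).
    # Returns match_right: for each right vertex, its matched left vertex or -1.
    match_right = [-1] * n_right

    def try_augment(u, seen):
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                # v is free, or its current partner can be re-matched elsewhere
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    for u in range(n_left):
        try_augment(u, [False] * n_right)
    return match_right

For a square sparse matrix with a perfect matching, match_right[v] names the row to place in position v, which puts a nonzero on every diagonal entry.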

Strongly connected components
[Figure: graph G(A) and the symmetrically permuted matrix PAPᵀ]
Symmetric permutation to block triangular form
Diagonal blocks are strong Hall (irreducible / strongly connected)
Sequential: linear time by depth-first search [Tarjan]
Parallel: divide & conquer; work and span depend on the input [Fleischer, Hendrickson, Pinar]

Horizontal-vertical decomposition
[Figure: example graph and its strong components arranged by levels 1-4]
Defined and studied by Mezic et al. in a dynamical systems context
Strongly connected components, ordered by levels of the DAG
Efficient linear-time sequential algorithms
No work/span-efficient parallel algorithms known

Strong components of a 1M-vertex RMAT graph [figure]

Dulmage-Mendelsohn decomposition
[Figure: bipartite graph and matrix permuted to coarse D-M block form, with row blocks HR, SR, VR and column blocks HC, SC, VC]

Applications of D-M decomposition
- Strongly connected components of directed graphs
- Connected components of undirected graphs
- Permutation to block triangular form for Ax=b
- Minimum-size vertex cover of bipartite graphs
- Extracting vertex separators from edge cuts for arbitrary graphs
- Nonzero structure prediction for sparse matrix factorizations

Strong Hall components are independent of the choice of matching
[Figure: two different matchings of the same matrix giving the same strong Hall blocks]

The Primitives Challenge
By analogy to numerical linear algebra: what should the "combinatorial BLAS" look like?
Basic Linear Algebra Subroutines (BLAS): C = A*B, y = A*x, μ = xᵀy
[Figure: BLAS speed (MFlops) vs. matrix size (n)]

Primitives for HPC graph programming
Visitor-based multithreaded [Berry, Gregor, Hendrickson, Lumsdaine]
+ search templates natural for many algorithms
+ relatively simple load balancing
– complex thread interactions, race conditions
– unclear how applicable to standard architectures
Array-based data parallel [Gilbert, Kepner, Reinhardt, Robinson, Shah]
+ relatively simple control structure
+ user-friendly interface
– some algorithms hard to express naturally
– load balancing not so easy
Scan-based vectorized [Blelloch]
We don't know the right set of primitives yet!

Array-based graph algorithms study [Kepner, Fineman, Kahn, Robinson]

Multiple-source breadth-first search
[Figure: adjacency matrix Aᵀ and a sparse block of source vectors X]

Multiple-source breadth-first search
[Figure: the sparse product AᵀX advances every search by one level]
Multiplying Aᵀ by a block of sparse vectors X gives parallelism at the edge level; a 2D partitioning of the matrix is effectively an edge partitioning.
Sparse array representation => space efficient
Sparse matrix-matrix multiplication => work efficient
Span & load balance depend on the matrix-multiplication implementation
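The idea can be sketched with scipy.sparse (an illustration only; this is not the Combinatorial BLAS interface, and the function name is mine). Each column of X is the current frontier of one search, and one sparse matrix-matrix product per level advances all searches at once.

import numpy as np
import scipy.sparse as sp

def multi_source_bfs_levels(A, sources):
    # BFS from several sources at once, one SpGEMM per level.
    # A is an n-by-n scipy.sparse adjacency matrix (A[i, j] != 0 means an
    # edge i -> j); sources is a list of start vertices.  Returns an
    # n-by-k array of BFS levels, with -1 where a vertex is unreachable.
    n = A.shape[0]
    k = len(sources)
    AT = sp.csr_matrix(A).T.tocsr()
    levels = np.full((n, k), -1, dtype=int)
    levels[sources, np.arange(k)] = 0
    # One column of X per source; X holds the current frontier of each search.
    X = sp.csr_matrix((np.ones(k), (sources, np.arange(k))), shape=(n, k))
    for lvl in range(1, n):
        Y = AT @ X                       # expand every frontier in one product
        rows, cols = Y.nonzero()
        new = levels[rows, cols] == -1   # keep only vertices not seen before
        rows, cols = rows[new], cols[new]
        if rows.size == 0:
            break
        levels[rows, cols] = lvl
        X = sp.csr_matrix((np.ones(rows.size), (rows, cols)), shape=(n, k))
    return levels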

Matrices over semirings
Matrix multiplication C = AB (or matrix-vector): Ci,j = Ai,1 B1,j + Ai,2 B2,j + ··· + Ai,n Bn,j
Replace the scalar operations × and + by ⊗ and ⊕, where
⊗ is associative and distributes over ⊕, with identity 1
⊕ is associative and commutative, with identity 0, which annihilates under ⊗
Then Ci,j = Ai,1 ⊗ B1,j ⊕ Ai,2 ⊗ B2,j ⊕ ··· ⊕ Ai,n ⊗ Bn,j
Examples: (×, +); (and, or); (+, min); . . .
Same data reference pattern and control flow
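A direct, if slow, way to see that only the scalar operations change is a generic triple-loop multiply parameterized by ⊕, ⊗, and the ⊕-identity; the function name and defaults below are illustrative, not from the talk. With the (min, +) defaults, one product of the edge-cost matrix with itself relaxes every two-edge path.

def semiring_matmul(A, B, add=min, mul=lambda x, y: x + y, zero=float("inf")):
    # Matrix multiply over an arbitrary semiring:
    #   C[i][j] = add-reduction over k of mul(A[i][k], B[k][j]).
    # Defaults give the (min, +) "tropical" semiring used for shortest paths;
    # pass add=lambda x, y: x or y, mul=lambda x, y: x and y, zero=False for
    # boolean reachability, or the usual +, *, 0 for ordinary multiplication.
    n, inner, m = len(A), len(B), len(B[0])
    C = [[zero] * m for _ in range(n)]
    for i in range(n):
        for k in range(inner):           # same triple loop as ordinary matmul
            for j in range(m):
                C[i][j] = add(C[i][j], mul(A[i][k], B[k][j]))
    return C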

SpGEMM: Sparse Matrix × Sparse Matrix [Buluc, Gilbert]
- Shortest path calculations (APSP)
- Betweenness centrality
- BFS from multiple source vertices
- Subgraph / submatrix indexing
- Graph contraction
- Cycle detection
- Multigrid interpolation & restriction
- Colored intersection searching
- Applying constraints in finite element modeling
- Context-free parsing

Distributed-memory parallel sparse matrix-matrix multiplication
[Figure: 2D block layout; block Cij accumulates Cij += Aik * Bkj]
- Outer product formulation
- Sequential "hypersparse" kernel
- Asynchronous MPI-2 implementation
- Experiments: TACC Lonestar cluster
- Good scaling to 256 processors
[Figure: time vs. number of cores for a 1M-vertex RMAT graph]
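The blocking can be mimicked serially with scipy.sparse. The sketch below only illustrates the Cij += Aik * Bkj pattern over a p-by-p grid; it is not the actual outer-product, hypersparse, MPI-2 implementation, and it assumes square matrices of equal size.

import numpy as np
import scipy.sparse as sp

def spgemm_2d(A, B, p=2):
    # Serial stand-in for 2D-block SpGEMM: split A and B into a p-by-p grid
    # of sparse blocks and accumulate C[i][j] += A[i][k] * B[k][j] for all k.
    # In the distributed code each block lives on its own processor and the
    # k loop becomes communication rounds; here everything is one process.
    A, B = sp.csr_matrix(A), sp.csr_matrix(B)
    n = A.shape[0]
    cut = [n * t // p for t in range(p + 1)]
    blk = lambda M, i, j: M[cut[i]:cut[i + 1], cut[j]:cut[j + 1]]
    C = [[sp.csr_matrix((cut[i + 1] - cut[i], cut[j + 1] - cut[j]))
          for j in range(p)] for i in range(p)]
    for i in range(p):
        for j in range(p):
            for k in range(p):
                C[i][j] = C[i][j] + blk(A, i, k) @ blk(B, k, j)
    return sp.bmat(C, format="csr")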

All-Pairs Shortest Paths
Directed graph with "costs" on edges
Find least-cost paths between all reachable vertex pairs
Several classical algorithms with work ~ matrix multiplication and span ~ log² n
Case study of an implementation on a multicore architecture: the graphics processing unit (GPU)

GPU characteristics
Powerful: two Nvidia 8800s > 1 TFLOPS
Inexpensive: $500 each
But: a difficult programming model
- One instruction stream drives 8 arithmetic units
- Performance is counterintuitive and fragile: the memory access pattern has subtle effects on cost
- Extremely easy to underutilize the device: doing it wrong easily costs 100x in time

Recursive All-Pairs Shortest Paths
Based on the R-Kleene algorithm; partition the matrix into blocks A, B, C, D
Well suited to the GPU architecture:
- Fast matrix-multiply kernel
- In-place computation => low memory bandwidth
- Few, large MatMul calls => low GPU dispatch overhead
- Recursion stack on the host CPU, not on the GPU
- Careful tuning of the GPU code
With + as "min" and × as "add":
A = A*;  % recursive call
B = AB;  C = CA;
D = D + CB;
D = D*;  % recursive call
B = BD;  C = DC;
A = A + BC;
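A dense NumPy sketch of this recursion (my reading of the slide, over the (min, +) semiring, with a Floyd-Warshall base case; not the GPU code) is below. It assumes a cost matrix with np.inf for missing edges and zeros on the diagonal, and the naive min-plus product trades memory for clarity.

import numpy as np

def min_plus(A, B):
    # Dense (min, +) product: C[i, j] = min over k of A[i, k] + B[k, j].
    # Memory-hungry (builds an n*k*m tensor); fine for an illustration.
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def rkleene_apsp(A, base=64):
    # Recursive all-pairs shortest paths, following the block recursion above.
    A = A.copy()
    n = A.shape[0]
    if n <= base:
        # Floyd-Warshall as the base case ("A = A*")
        for k in range(n):
            A = np.minimum(A, A[:, k:k + 1] + A[k:k + 1, :])
        return A
    h = n // 2
    A11, A12 = A[:h, :h], A[:h, h:]
    A21, A22 = A[h:, :h], A[h:, h:]
    A11 = rkleene_apsp(A11, base)                 # A = A*   (recursive call)
    A12 = min_plus(A11, A12)                      # B = AB
    A21 = min_plus(A21, A11)                      # C = CA
    A22 = np.minimum(A22, min_plus(A21, A12))     # D = D + CB
    A22 = rkleene_apsp(A22, base)                 # D = D*   (recursive call)
    A12 = min_plus(A12, A22)                      # B = BD
    A21 = min_plus(A22, A21)                      # C = DC
    A11 = np.minimum(A11, min_plus(A12, A21))     # A = A + BC
    return np.block([[A11, A12], [A21, A22]])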

Execution of Recursive APSP

APSP: experiments and observations
128-core Nvidia 8800; speedup relative to:
- 1-core CPU: 120x – 480x
- 16-core CPU: 17x – 45x
- Iterative algorithm, 128-core GPU: 40x – 680x
- MSSSP, 128-core GPU: ~3x
[Figure: time vs. matrix dimension]
Conclusions:
- High performance is achievable but not simple
- Carefully chosen and optimized primitives will be key

H-V decomposition
[Figure: example graph and its strong components arranged by levels 1-4]
A span-efficient, but not work-efficient, method for H-V decomposition uses APSP to determine reachability…

Reachability: transitive closure
[Figure: transitive closure of the example graph, by level]
APSP => transitive closure of the adjacency matrix
Strong components are identified by symmetric nonzeros
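A small dense sketch of this view, assuming a boolean adjacency matrix (illustrative code, not from the talk): repeated boolean squaring, which is APSP over the (or, and) semiring, yields the transitive closure R, and the symmetric part R & Rᵀ marks exactly the vertex pairs in the same strong component.

import numpy as np

def strong_components_via_closure(A):
    # Compute the reachability (transitive closure) matrix R by repeated
    # boolean squaring, then read off strong components: i and j are in
    # the same component exactly when R[i, j] and R[j, i] both hold.
    n = A.shape[0]
    R = (np.asarray(A) != 0) | np.eye(n, dtype=bool)
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        Ri = R.astype(np.int64)
        R = R | ((Ri @ Ri) > 0)          # one boolean "matrix squaring" step
    same = R & R.T                       # mutual reachability
    labels = np.full(n, -1)
    for v in range(n):
        if labels[v] == -1:
            labels[same[v]] = v          # all vertices mutually reachable from v
    return labels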

H-V structure: acyclic condensation
[Figure: condensed DAG with one vertex per strong component, arranged by levels]
The acyclic condensation is a sparse matrix-matrix product
Levels are identified by "APSP" for longest paths
Practically speaking, a parallel method would compromise between work and span efficiency
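Given component labels, the contraction itself is one sparse triple product. The sketch below (illustrative names, scipy.sparse, assuming labels are already known, e.g. from the previous step) builds the n-by-c vertex-to-component indicator matrix S and forms SᵀAS, dropping the diagonal; the entry values count edges between two components.

import numpy as np
import scipy.sparse as sp

def acyclic_condensation(A, labels):
    # Contract each strong component of the graph of A to a single vertex.
    # A is an n-by-n sparse adjacency matrix; labels[v] identifies the
    # strong component containing vertex v.
    labels = np.unique(np.asarray(labels), return_inverse=True)[1]  # relabel 0..c-1
    n = len(labels)
    c = int(labels.max()) + 1
    S = sp.csr_matrix((np.ones(n), (np.arange(n), labels)), shape=(n, c))
    C = (S.T @ sp.csr_matrix(A) @ S).tolil()
    C.setdiag(0)                         # drop edges inside a component
    C = C.tocsr()
    C.eliminate_zeros()
    return C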

Remarks
Combinatorial algorithms are pervasive in scientific computing and will become more so.
Path computations on graphs are powerful tools, but efficiency is a challenge on parallel architectures.
Carefully chosen and implemented primitive operations are key.
Lots of exciting opportunities for research!
Combinatorialists can profit from:
- New problems (often special cases of existing ones)
- Eager customers for advances
- New algorithmic insights
Scientific computing folks can profit from:
- New tools & techniques
- Novel points of view
For individual researchers:
- Lots of interesting stuff at the boundaries between fields
- Opportunities for enormous impact
Dialogue is necessary! What problem variants are useful? Which application formulations are tractable?
The computer scientist as tool-maker: Sandia's success in CS&E is partially due to its early tolerance of discrete algorithms. Look at my department.