
1 Challenges in Combinatorial Scientific Computing. John R. Gilbert, University of California, Santa Barbara. Grand Challenges in Data-Intensive Discovery, October 28, 2010. Support: NSF, DOE, Intel, Microsoft.

2 Combinatorial Scientific Computing. "I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were 'sparse' in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care." - Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics.

3 Graphs and Sparse Matrices: Cholesky factorization. Symmetric Gaussian elimination: for j = 1 to n, add edges between j's higher-numbered neighbors. Fill: new nonzeros in the factor. [Figure: example graph G(A) and its fill graph G+(A), which is chordal.]
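To make the elimination rule above concrete, here is a minimal Python sketch (mine, not from the talk) of symbolic fill computation: eliminate the vertices in order, connecting each vertex's higher-numbered neighbors; the edges added along the way are exactly the fill.

def symbolic_cholesky_fill(n, edges):
    """Compute fill edges for symmetric Gaussian elimination.

    edges: set of undirected edges (i, j) with 0 <= i < j < n,
    vertices numbered in elimination order.
    Returns the set of fill edges (new nonzeros in the factor).
    """
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    fill = set()
    for j in range(n):
        # j's higher-numbered neighbors form a clique after eliminating j.
        higher = sorted(v for v in adj[j] if v > j)
        for idx, a in enumerate(higher):
            for b in higher[idx + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.add((a, b))
    return fill

# Example: a 4-cycle 0-1-2-3-0 fills in the chord (1, 3) when 0 is eliminated first.
print(symbolic_cholesky_fill(4, {(0, 1), (1, 2), (2, 3), (0, 3)}))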

4 Large graphs are everywhere… WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong. Internet structure, social interactions, scientific datasets: biological, chemical, cosmological, ecological, …

5 The Challenge of the Middle

6 An analogy? As the "middleware" of scientific computing, linear algebra has supplied or enabled: mathematical tools; an "impedance match" to computer operations; high-level primitives; high-quality software libraries; ways to extract performance from computer architecture; interactive environments. [Diagram: continuous physical modeling → linear algebra → computers.]

7 An analogy? [Diagram: continuous physical modeling → linear algebra → computers, alongside discrete structure analysis → graph theory → computers.]

8 An analogy? Well, we're not there yet…. For discrete structure analysis → graph theory → computers, each piece is still a question mark: mathematical tools? "impedance match" to computer operations? high-level primitives? high-quality software libraries? ways to extract performance from computer architecture? interactive environments?

9 The Case for Primitives

10 All-Pairs Shortest Paths on a GPU [Buluc et al.] Based on the R-Kleene algorithm; partition the matrix into blocks A, B, C, D. A = A*; % recursive call. B = AB; C = CA; D = D + CB; D = D*; % recursive call. B = BD; C = DC; A = A + BC. Here + is "min" and × is "add". Well suited for GPU architecture: in-place computation => low memory bandwidth; few, large MatMul calls => low GPU dispatch overhead; recursion stack on host CPU, not on multicore GPU; careful tuning of GPU code; fast matrix-multiply kernel.
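As a plain-NumPy sketch (assumed, not the GPU code from the talk) of the R-Kleene recursion above, with block names matching the slide and (⊕, ⊗) = (min, +):

import numpy as np

def minplus_mm(X, Y):
    """(min, +) semiring matrix 'multiply': Z[i, j] = min_k X[i, k] + Y[k, j]."""
    return np.min(X[:, :, None] + Y[None, :, :], axis=1)

def rkleene_apsp(D):
    """All-pairs shortest paths by the recursive R-Kleene scheme.

    D: square matrix of edge weights, np.inf where there is no edge, 0 on the diagonal.
    Returns the shortest-path distance matrix (the closure D*).
    """
    n = D.shape[0]
    if n == 1:
        return D.copy()
    m = n // 2
    A, B = D[:m, :m], D[:m, m:]
    C, Dd = D[m:, :m], D[m:, m:]

    A = rkleene_apsp(A)                    # A = A*  (recursive call)
    B = minplus_mm(A, B)                   # B = AB
    C = minplus_mm(C, A)                   # C = CA
    Dd = np.minimum(Dd, minplus_mm(C, B))  # D = D + CB   (+ is "min")
    Dd = rkleene_apsp(Dd)                  # D = D*  (recursive call)
    B = minplus_mm(B, Dd)                  # B = BD
    C = minplus_mm(Dd, C)                  # C = DC
    A = np.minimum(A, minplus_mm(B, C))    # A = A + BC

    return np.block([[A, B], [C, Dd]])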

11 APSP: experiments and observations. [Plot: runtime vs. matrix dimension, log-log, comparing a direct lifting of Floyd-Warshall to the GPU with the unorthodox R-Kleene algorithm, the right primitive; the gap is roughly 480x.] High performance is achievable but not simple. Carefully chosen and optimized primitives are key. Matching the architecture and the algorithm is key.

12 The Case for Sparse Matrices. Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.
Traditional graph computations: data driven, unpredictable communication; irregular and unstructured, poor locality of reference; fine-grained data accesses, dominated by latency.
Graphs in the language of linear algebra: fixed communication patterns; operations on matrix blocks exploit the memory hierarchy; coarse-grained parallelism, bandwidth limited.

13 Identification of Primitives. Sparse array-based primitives: sparse matrix-matrix multiplication (SpGEMM); sparse matrix-dense vector multiplication; element-wise operations (.*); sparse matrix indexing; matrices on various semirings: (×, +), (and, or), (+, min), …

14 Multiple-source breadth-first search. [Figure: sparse matrix A^T and a sparse matrix X of starting vertices, one column per search, on a 7-vertex example graph.]

15 Multiple-source breadth-first search. [Figure: multiplying A^T by the frontier matrix X gives A^T X, the next frontiers, on the 7-vertex example graph.]

16 Multiple-source breadth-first search. Sparse array representation => space efficient. Sparse matrix-matrix multiplication => work efficient. Three levels of available parallelism: searches, vertices, edges. [Figure: A^T, X, and the product A^T X on the 7-vertex example graph.]
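Here is a small scipy.sparse sketch (assumed, not Combinatorial BLAS code) of the idea on slides 14-16: each column of a sparse frontier matrix X carries one search, and a single sparse matrix product A^T X advances all searches one level.

import numpy as np
import scipy.sparse as sp

def multi_source_bfs(A, sources):
    """BFS levels from several sources at once via sparse matrix products.

    A: n-by-n scipy.sparse adjacency matrix (A[i, j] != 0 means an edge i -> j).
    sources: list of starting vertices, one search per source.
    Returns an n-by-len(sources) array of BFS levels (-1 if unreachable).
    """
    n = A.shape[0]
    k = len(sources)
    levels = -np.ones((n, k), dtype=np.int64)

    # Frontier matrix X: column s holds the current frontier of search s.
    X = sp.csr_matrix((np.ones(k), (np.array(sources), np.arange(k))),
                      shape=(n, k))
    AT = A.T.tocsr().astype(np.float64)
    level = 0
    while X.nnz > 0:
        rows, cols = X.nonzero()
        levels[rows, cols] = level
        # One sparse matrix-matrix product advances every search one hop.
        X = AT @ X
        # Keep only vertices not yet visited in each search.
        X = X.multiply(sp.csr_matrix(levels == -1)).tocsr()
        level += 1
    return levels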

17 A Few Examples

18 Betweenness Centrality (BC). What fraction of shortest paths pass through this node? Computed with Brandes' algorithm. Combinatorial BLAS [Buluc, G]: a parallel graph library based on distributed-memory sparse arrays and algebraic graph primitives. [Figure: typical software stack.]
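For reference, a compact serial Python sketch (assumed) of Brandes' algorithm on an unweighted graph; the parallel library computes the same quantity using sparse matrix primitives rather than explicit queues.

from collections import deque

def betweenness_centrality(adj):
    """Brandes' algorithm on an unweighted graph.

    adj: dict mapping each vertex to an iterable of its neighbors.
    Returns a dict of (unnormalized) betweenness scores.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc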

19 BC performance in distributed memory. TEPS = Traversed Edges Per Second. One page of code using CombBLAS. [Plot: performance on RMAT power-law graphs with 2^scale vertices, average degree 8.]

20 KDT: A toolbox for graph analysis and pattern discovery [G, Reinhardt, Shah]. Layer 1: graph-theoretic tools: graph operations; global structure of graphs; graph partitioning and clustering; graph generators; visualization and graphics; scan and combining operations; utilities.

21 MATLAB® Star-P architecture. [Diagram: ordinary Matlab variables live in the Star-P client; a server manager, package manager, and matrix manager coordinate processors #0 through #n-1; the server side holds distributed dense/sparse matrices and sort, plus ScaLAPACK, FFTW, an FPGA interface, and MPI/UPC user code.]

22 Landscape connectivity modeling. Habitat quality, gene flow, corridor identification, conservation planning. Pumas in southern California: 12 million nodes, < 1 hour. Targeting larger problems: the Yellowstone-to-Yukon corridor. Figures courtesy of Brad McRae.

23 From semirings to computational patterns. Sparse matrix times vector as a semiring operation: given vertex data x_i and edge data a_{i,j}, for each vertex j of interest compute y_j = (a_{i1,j} ⊗ x_{i1}) ⊕ (a_{i2,j} ⊗ x_{i2}) ⊕ ··· ⊕ (a_{ik,j} ⊗ x_{ik}). The user specifies the definitions of the operations ⊕ and ⊗.
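A tiny Python sketch (assumed) of this semiring sparse matrix-vector product, with ⊕ and ⊗ supplied by the user; the defaults give the ordinary product, while (min, +) gives one step of shortest-path relaxation.

import operator

def semiring_spmv(A, x, add=operator.add, mul=operator.mul, zero=0):
    """y[j] = ⊕-reduce over i of (a_{i,j} ⊗ x_i), skipping missing edges.

    A: sparse matrix as a dict keyed by (i, j) with edge data a_{i,j}.
    x: dict of vertex data x_i (only vertices present take part).
    add, mul, zero: the user-specified semiring (⊕, ⊗, identity of ⊕).
    """
    y = {}
    for (i, j), a_ij in A.items():
        if i in x:
            y[j] = add(y.get(j, zero), mul(a_ij, x[i]))
    return y

# Ordinary matrix-vector product: (⊕, ⊗) = (+, ×).
A = {(0, 1): 2.0, (1, 2): 3.0, (0, 2): 5.0}
print(semiring_spmv(A, {0: 1.0, 1: 1.0}))

# One step of shortest-path relaxation: (⊕, ⊗) = (min, +).
print(semiring_spmv(A, {0: 0.0}, add=min, mul=operator.add, zero=float("inf")))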

24 From semirings to computational patterns. Sparse matrix times vector as a computational pattern: given vertex data and edge data, for each vertex of interest, combine data from neighboring vertices and edges. The user specifies the desired computation on data from neighbors.

25 SpGEMM as a computational pattern. Explore length-two paths that use specified vertices, e.g. "friends of friends" (think Facebook). Possibly do some filtering, accumulation, or other computation with vertex and edge attributes. May or may not want to form the product graph explicitly. Formulation as semiring matrix multiplication is often possible but sometimes clumsy. Same data flow and communication patterns as in SpGEMM.
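An illustrative sketch (assumed) of the friends-of-friends pattern as one Boolean SpGEMM plus a filter: squaring the friendship matrix enumerates length-two paths, then self-pairs and existing friendships are removed. A real implementation could fuse the filter into the multiply instead of forming the product explicitly.

import numpy as np
import scipy.sparse as sp

def friends_of_friends(A):
    """Pairs linked by a length-two path but not directly connected.

    A: symmetric n-by-n scipy.sparse adjacency ("friendship") matrix,
       nonzero meaning "i and j are friends".
    Returns a dense Boolean matrix F with F[i, j] True iff some common
    friend links i and j, i != j, and i, j are not already friends.
    """
    A = (A != 0).astype(np.int64).tocsr()
    paths2 = (A @ A).toarray() > 0          # length-two paths (one SpGEMM)
    direct = A.toarray() > 0
    n = A.shape[0]
    # Filter step: drop self-pairs and pairs that are already direct friends.
    return paths2 & ~direct & ~np.eye(n, dtype=bool)

# Tiny example: 0-1 and 1-2 are friends, so 0 and 2 are friends of friends.
A = sp.csr_matrix(np.array([[0, 1, 0],
                            [1, 0, 1],
                            [0, 1, 0]]))
print(friends_of_friends(A))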

26 Graph BLAS: A pattern-based library. User-specified operations and attributes give the performance benefits of algebraic primitives with a more intuitive and flexible interface. Common framework integrates algebraic (edge-based), visitor (traversal-based), and map-reduce patterns. 2D compressed sparse block structure supports user-defined edge/vertex/attribute types and operations. "Hypersparse" kernels tuned to reduce data movement. Initial target: manycore and multisocket shared memory.

27 The Challenge of Architecture and Algorithms

28 The Architecture & Algorithms Challenge. Oak Ridge / Cray Jaguar: > 1.75 PFLOPS. Two Nvidia 8800 GPUs: > 1 TFLOPS. Intel 80-core chip: > 1 TFLOPS. Parallelism is no longer optional… … in every part of a computation.

29 High-performance architecture. Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices. Originally, because linear algebra is the middleware of scientific computing. Nowadays, largely for bragging rights. [Figure: dense LU factorization with pivoting, P A = L U.]

30 Strongly connected components. Symmetric permutation to block triangular form. Diagonal blocks are strong Hall (irreducible / strongly connected). Sequential: linear time by depth-first search [Tarjan]. Parallel: divide & conquer; work and span depend on input [Fleischer, Hendrickson, Pinar]. [Figure: graph G(A) on vertices 1-7 and the permuted matrix P A P^T in block triangular form.]
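A serial Python sketch (assumed) of the divide-and-conquer idea behind the parallel algorithm cited above: from a pivot vertex, the intersection of its forward and backward reachable sets is one strongly connected component, and the three leftover vertex sets can be processed independently.

def reachable(adj, start, vertices):
    """Vertices in `vertices` reachable from `start` along edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def scc_divide_and_conquer(adj, radj, vertices=None):
    """Strongly connected components by forward-backward reachability.

    adj: dict vertex -> list of out-neighbors; radj: the reversed graph.
    Returns a list of SCCs, each a set of vertices.
    """
    if vertices is None:
        vertices = set(adj) | {w for vs in adj.values() for w in vs}
    if not vertices:
        return []
    pivot = next(iter(vertices))
    fwd = reachable(adj, pivot, vertices)
    bwd = reachable(radj, pivot, vertices)
    scc = fwd & bwd
    # The three remaining subproblems are independent (parallelizable).
    return ([scc]
            + scc_divide_and_conquer(adj, radj, fwd - scc)
            + scc_divide_and_conquer(adj, radj, bwd - scc)
            + scc_divide_and_conquer(adj, radj, vertices - fwd - bwd))

adj = {1: [2], 2: [3], 3: [1, 4], 4: []}
radj = {2: [1], 3: [2], 1: [3], 4: [3]}
print(scc_divide_and_conquer(adj, radj))   # [{1, 2, 3}, {4}] in some order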

31 The memory wall blues. Most of memory is hundreds or thousands of cycles away from the processor that wants it. You can buy more bandwidth, but you can't buy less latency. (Speed of light, for one thing.)

32 The memory wall blues. Most of memory is hundreds or thousands of cycles away from the processor that wants it. You can buy more bandwidth, but you can't buy less latency. (Speed of light, for one thing.) You can hide latency with either locality or parallelism.

33 The memory wall blues. Most of memory is hundreds or thousands of cycles away from the processor that wants it. You can buy more bandwidth, but you can't buy less latency. (Speed of light, for one thing.) You can hide latency with either locality or parallelism. Most interesting graph problems have lousy locality. Thus the algorithms need even more parallelism!

34 Architectural impact on algorithms. Full matrix multiplication, C = A * B, naive 3-loop version:
C = 0;
for i = 1 : n
  for j = 1 : n
    for k = 1 : n
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
O(n^3) operations.

35 Architectural impact on algorithms. Naive 3-loop matrix multiply [Alpern et al., 1992]: measured time grows as T = N^4.7, i.e. the naive algorithm is O(N^5) time under the UMH model, while BLAS-3 DGEMM and recursive blocked algorithms are O(N^3). Size 2000 took 5 days; size 12000 would take 1095 years. [Plot: runtime vs. matrix size; diagram from Larry Carter.]
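A minimal NumPy sketch (assumed, not from the talk) of the cache-blocking idea behind BLAS-3 and the recursive blocked algorithms: each entry of a loaded b-by-b block is reused about b times before it is evicted, which is what keeps the running time near the O(N^3) operation count.

import numpy as np

def blocked_matmul(A, B, b=64):
    """C = A @ B computed block by block to improve locality.

    A, B: square NumPy arrays whose size is a multiple of the block size b
    (an assumption made only to keep the sketch short).
    """
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # Each b-by-b block participates in b multiply-adds per loaded word.
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)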

36 The architecture & algorithms challenge. A big opportunity exists for computer architecture to influence combinatorial algorithms. (Maybe even vice versa.)

37 A novel architectural approach: Cray MTA / XMT. Hide latency by massive multithreading. Per-tick context switching. Uniform (sort of) memory access time. But the economic case is still not completely clear.

38 A Few Other Challenges

39 The Productivity Challenge. Raw performance isn't always the only criterion. Other factors include: seamless scaling from desktop to HPC; interactive response for data exploration and viz; rapid prototyping; just plain programmability.

40 The Education Challenge. How do you teach this stuff? Where do you go to take courses in graph algorithms … on massive data sets … in the presence of uncertainty … analyzed on parallel computers … applied to a domain science?

41 Final thoughts. Combinatorial algorithms are pervasive in scientific computing and will become more so. Linear algebra and combinatorics can support each other in computation as well as in theory. A big opportunity exists for computer architecture to influence combinatorial algorithms. This is a great time to be doing research in combinatorial scientific computing!

