1  Challenges in Combinatorial Scientific Computing
John R. Gilbert, University of California, Santa Barbara
Grand Challenges in Data-Intensive Discovery, October 28, 2010
Support: NSF, DOE, Intel, Microsoft

2  Combinatorial Scientific Computing
"I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were 'sparse' in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care."
– Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize in Economics

3  Graphs and Sparse Matrices: Cholesky factorization
[Figure: elimination graph G(A) and filled graph G+(A), which is chordal]
Symmetric Gaussian elimination:
    for j = 1 to n
        add edges between j's higher-numbered neighbors
Fill: new nonzeros in the factor
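
Not from the slides: a minimal Python sketch of the symbolic elimination loop above, assuming the graph is given as a dict of adjacency sets and vertices are eliminated in increasing numeric order; names like symbolic_fill are illustrative.

    # Sketch of symbolic Cholesky fill: for each vertex j (in elimination order),
    # connect all of j's higher-numbered neighbors to each other.
    from itertools import combinations

    def symbolic_fill(adj):
        """adj: dict mapping vertex -> set of neighbors (undirected graph).
        Returns the set of fill edges created by eliminating vertices in order."""
        g = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
        fill = set()
        for j in sorted(g):
            higher = [v for v in g[j] if v > j]
            for u, w in combinations(higher, 2):
                if w not in g[u]:                        # new edge => fill
                    g[u].add(w)
                    g[w].add(u)
                    fill.add((min(u, w), max(u, w)))
        return fill

    # Example: a 4-cycle 1-2-3-4-1; eliminating vertex 1 adds fill edge (2, 4).
    print(symbolic_fill({1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}))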

4  Large graphs are everywhere…
[Figures: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong]
– Internet structure
– Social interactions
– Scientific datasets: biological, chemical, cosmological, ecological, …

5 The Challenge of the Middle

6  An analogy?
As the "middleware" of scientific computing, linear algebra has supplied or enabled:
– Mathematical tools
– "Impedance match" to computer operations
– High-level primitives
– High-quality software libraries
– Ways to extract performance from computer architecture
– Interactive environments
[Figure: continuous physical modeling → linear algebra → computers]

7  An analogy?
[Figure: continuous physical modeling → linear algebra → computers, alongside
 discrete structure analysis → graph theory → computers]

8  An analogy?  Well, we're not there yet….
Discrete structure analysis → graph theory → computers, but with question marks over:
? Mathematical tools
? "Impedance match" to computer operations
? High-level primitives
? High-quality software libraries
? Ways to extract performance from computer architecture
? Interactive environments

9 The Case for Primitives

10  All-Pairs Shortest Paths on a GPU  [Buluc et al.]
Based on the recursive R-Kleene algorithm, with the matrix split into blocks [A B; C D]
and matrix products taken over the semiring where + is "min" and × is "add":

    A = A*;          % recursive call
    B = AB;  C = CA;
    D = D + CB;
    D = D*;          % recursive call
    B = BD;  C = DC;
    A = A + BC;

Well suited for the GPU architecture:
– In-place computation => low memory bandwidth
– Few, large MatMul calls => low GPU dispatch overhead
– Recursion stack on host CPU, not on the GPU
– Careful tuning of GPU code; fast matrix-multiply kernel
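
Not from the talk: a small NumPy sketch of the same recursive blocked idea over the (min, +) semiring, on a dense distance matrix; the helper names and the even split into 2x2 blocks are illustrative assumptions, not the GPU implementation described above.

    import numpy as np

    def minplus(A, B):
        """(min, +) 'matrix multiply': C[i, j] = min_k A[i, k] + B[k, j]."""
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    def apsp(D):
        """Recursive R-Kleene-style all-pairs shortest paths on a dense
        distance matrix D (np.inf where there is no edge, 0 on the diagonal)."""
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        A, B = D[:m, :m].copy(), D[:m, m:].copy()
        C, Dd = D[m:, :m].copy(), D[m:, m:].copy()
        A = apsp(A)                       # recursive call
        B = minplus(A, B); C = minplus(C, A)
        Dd = np.minimum(Dd, minplus(C, B))
        Dd = apsp(Dd)                     # recursive call
        B = minplus(B, Dd); C = minplus(Dd, C)
        A = np.minimum(A, minplus(B, C))
        return np.block([[A, B], [C, Dd]])

    # Tiny example: 3-node path graph 0 -> 1 -> 2.
    inf = np.inf
    D = np.array([[0, 1, inf], [inf, 0, 1], [inf, inf, 0]], float)
    print(apsp(D))   # the (0, 2) entry should come out as 2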

11  APSP: experiments and observations
[Plot: runtime vs. matrix dimension on a log-log scale, comparing "lifting Floyd-Warshall to GPU"
 with the unorthodox R-Kleene algorithm ("the right primitive!"); the gap is labeled 480x]
– High performance is achievable but not simple
– Carefully chosen and optimized primitives are key
– Matching the architecture and the algorithm is key

12  The Case for Sparse Matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.
Traditional graph computations vs. graphs in the language of linear algebra:
– Data-driven, unpredictable communication  vs.  fixed communication patterns
– Irregular and unstructured, with poor locality of reference  vs.  operations on matrix blocks that exploit the memory hierarchy
– Fine-grained data accesses, dominated by latency  vs.  coarse-grained parallelism, bandwidth limited

13  Identification of Primitives
Sparse array-based primitives:
– Sparse matrix-matrix multiplication (SpGEMM), on various semirings: (×, +), (and, or), (+, min), …
– Sparse matrix-dense vector multiplication
– Element-wise operations (.*)
– Sparse matrix indexing

14–16  Multiple-source breadth-first search
[Figure: sparse matrix A^T times a sparse block of start vectors X, giving A^T X — one search frontier per column]
– Sparse array representation => space efficient
– Sparse matrix-matrix multiplication => work efficient
– Three levels of available parallelism: searches, vertices, edges
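
A toy illustration (not the Combinatorial BLAS code) of one frontier-expansion step of multiple-source BFS as a sparse matrix-matrix product in SciPy; the example graph and starting vertices are made up.

    import numpy as np
    import scipy.sparse as sp

    # Directed graph on 4 vertices: 0->1, 0->2, 1->3, 2->3.
    rows, cols = [0, 0, 1, 2], [1, 2, 3, 3]
    A = sp.csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

    # Two simultaneous searches, one per column of X:
    # search 0 starts at vertex 0, search 1 starts at vertex 1.
    X = sp.csr_matrix(([1.0, 1.0], ([0, 1], [0, 1])), shape=(4, 2))

    # One BFS expansion for all searches at once: the next frontiers are the
    # columns of A^T X (the nonzero pattern is what matters, not the values).
    frontier = (A.T @ X).astype(bool)
    print(frontier.toarray())
    # column 0 -> vertices {1, 2} (one step from vertex 0)
    # column 1 -> vertex {3}      (one step from vertex 1)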

17 A Few Examples

18  Betweenness Centrality (BC)
What fraction of shortest paths pass through this node?  Computed with Brandes' algorithm.
Combinatorial BLAS [Buluc, G]: a parallel graph library based on distributed-memory sparse arrays and algebraic graph primitives.
[Figure: typical software stack]

19  BC performance in distributed memory
[Performance figure: betweenness centrality in TEPS (traversed edges per second) on RMAT power-law graphs with 2^scale vertices and average degree 8]
One page of code using C-BLAS (the Combinatorial BLAS).

20  KDT: A toolbox for graph analysis and pattern discovery  [G, Reinhardt, Shah]
Layer 1: graph-theoretic tools
– Graph operations
– Global structure of graphs
– Graph partitioning and clustering
– Graph generators
– Visualization and graphics
– Scan and combining operations
– Utilities

21  MATLAB® Star-P architecture
[Architecture diagram: ordinary Matlab variables on the Star-P client; server, package, and matrix managers coordinating processors #0 through #n-1; distributed dense/sparse matrices; library packages such as ScaLAPACK, FFTW, sort, an FPGA interface, and UPC/MPI user code]

22  Landscape connectivity modeling
– Habitat quality, gene flow, corridor identification, conservation planning
– Pumas in southern California: 12 million nodes, < 1 hour
– Targeting larger problems: Yellowstone-to-Yukon corridor
Figures courtesy of Brad McRae

23  From semirings to computational patterns
Sparse matrix times vector as a semiring operation:
– Given vertex data x_i and edge data a_{i,j}
– For each vertex j of interest, compute
      y_j = a_{i1,j} ⊗ x_{i1}  ⊕  a_{i2,j} ⊗ x_{i2}  ⊕ ··· ⊕  a_{ik,j} ⊗ x_{ik}
– User specifies: the definitions of the operations ⊕ and ⊗

24  From semirings to computational patterns
Sparse matrix times vector as a computational pattern:
– Given vertex data and edge data
– For each vertex of interest, combine data from neighboring vertices and edges
– User specifies: the desired computation on data from neighbors
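
To make the pattern concrete, a minimal sketch (not from the talk) of a sparse matrix-times-vector routine where the caller supplies the semiring operations; the edges-by-target layout and the function name semiring_spmv are assumptions for illustration.

    def semiring_spmv(n, indptr, indices, edge_data, x, add, mul, zero):
        """y[j] = add-reduction over incoming neighbors i of mul(edge_data, x[i]).
        Edges are stored grouped by target vertex j:
        indptr[j]:indptr[j+1] indexes the incoming edges of j."""
        y = [zero] * n
        for j in range(n):
            acc = zero
            for e in range(indptr[j], indptr[j + 1]):
                i = indices[e]
                acc = add(acc, mul(edge_data[e], x[i]))
            y[j] = acc
        return y

    # Edges: 0->1 (weight 2), 0->2 (weight 3), 1->2 (weight 5).
    indptr, indices, vals = [0, 0, 1, 3], [0, 0, 1], [2.0, 3.0, 5.0]
    x = [1.0, 10.0, 100.0]

    # Ordinary (+, *) semiring: y = A^T x.
    print(semiring_spmv(3, indptr, indices, vals, x,
                        add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0))

    # (min, +) semiring: one step of shortest-path relaxation over the same edges.
    inf = float("inf")
    print(semiring_spmv(3, indptr, indices, vals, x,
                        add=min, mul=lambda a, b: a + b, zero=inf))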

25  SpGEMM as a computational pattern
– Explore length-two paths that use specified vertices
– Possibly do some filtering, accumulation, or other computation with vertex and edge attributes
– E.g. "friends of friends" (think Facebook)
– May or may not want to form the product graph explicitly
– Formulation as semiring matrix multiplication is often possible but sometimes clumsy
– Same data flow and communication patterns as in SpGEMM
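
Not from the talk: a tiny SciPy sketch of the "friends of friends" idea as a sparse matrix square followed by filtering; the friendship matrix is made up.

    import numpy as np
    import scipy.sparse as sp

    # Symmetric friendship matrix F for 4 people: 0-1, 1-2, 2-3 are friends.
    pairs = [(0, 1), (1, 2), (2, 3)]
    r = [i for i, j in pairs] + [j for i, j in pairs]
    c = [j for i, j in pairs] + [i for i, j in pairs]
    F = sp.csr_matrix((np.ones(len(r)), (r, c)), shape=(4, 4))

    # Length-two paths: (F @ F)[i, j] counts the common friends of i and j.
    F2 = F @ F
    # Filter: keep only pairs that are not already friends and not the same person.
    fof = F2.toarray() * (F.toarray() == 0)
    np.fill_diagonal(fof, 0)
    print(fof)   # e.g. 0 and 2 share one common friend (person 1)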

26  Graph BLAS: A pattern-based library
– User-specified operations and attributes give the performance benefits of algebraic primitives with a more intuitive and flexible interface.
– A common framework integrates algebraic (edge-based), visitor (traversal-based), and map-reduce patterns.
– 2D compressed sparse block structure supports user-defined edge/vertex/attribute types and operations.
– "Hypersparse" kernels tuned to reduce data movement.
– Initial target: manycore and multisocket shared memory.

27 The Challenge of Architecture and Algorithms

28  The Architecture & Algorithms Challenge
[Figures: Oak Ridge / Cray Jaguar, > 1.75 PFLOPS; two Nvidia 8800 GPUs, > 1 TFLOPS; Intel 80-core chip, > 1 TFLOPS]
Parallelism is no longer optional… in every part of a computation.

29  High-performance architecture
– Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices (PA = LU).
– Originally, because linear algebra is the middleware of scientific computing.
– Nowadays, largely for bragging rights.

30  Strongly connected components
– Symmetric permutation to block triangular form: P A P^T
– Diagonal blocks are strong Hall (irreducible / strongly connected)
– Sequential: linear time by depth-first search [Tarjan]
– Parallel: divide & conquer; work and span depend on the input [Fleischer, Hendrickson, Pinar]
[Figure: directed graph G(A) and the block triangular structure of P A P^T]
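
As an off-the-shelf illustration of the sequential case (not in the slides), SciPy's csgraph module exposes a strongly-connected-components routine; the example graph here is made up.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import connected_components

    # Directed graph: a 2-cycle {0, 1}, an edge 1->2, and an edge 3->2.
    edges = [(0, 1), (1, 0), (1, 2), (3, 2)]
    r, c = zip(*edges)
    A = sp.csr_matrix((np.ones(len(edges)), (r, c)), shape=(4, 4))

    # Tarjan-style strongly connected components (linear time).
    ncomp, labels = connected_components(A, directed=True, connection='strong')
    print(ncomp, labels)          # 3 components; vertices 0 and 1 share a label

    # Grouping each component's vertices together is the first step toward the
    # block triangular form P A P^T (ordering the components topologically then
    # makes the off-diagonal blocks one-sided).
    perm = np.argsort(labels)
    print(A[perm][:, perm].toarray())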

31–33  The memory wall blues
– Most of memory is hundreds or thousands of cycles away from the processor that wants it.
– You can buy more bandwidth, but you can't buy less latency. (Speed of light, for one thing.)
– You can hide latency with either locality or parallelism.
– Most interesting graph problems have lousy locality.
– Thus the algorithms need even more parallelism!

34  Architectural impact on algorithms
Full matrix multiplication, C = A * B, with the naïve three-loop algorithm (O(n^3) operations):

    C = zeros(n);
    for i = 1:n
      for j = 1:n
        for k = 1:n
          C(i,j) = C(i,j) + A(i,k) * B(k,j);
        end
      end
    end

35  Architectural impact on algorithms
Naïve 3-loop matrix multiply [Alpern et al., 1992]:
[Plot: runtime vs. matrix size on a log-log scale, diagram from Larry Carter; measured time grows roughly as T = N^4.7, with annotations "Size 2000 took 5 days" and "would take 1095 years"]
– The naïve algorithm is O(N^5) time under the UMH (uniform memory hierarchy) model.
– BLAS-3 DGEMM and recursive blocked algorithms are O(N^3).
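
To illustrate the blocking idea behind the O(N^3) algorithms (not part of the slides), a minimal Python/NumPy sketch of a cache-blocked multiply; the block size bs=64 is an arbitrary assumption.

    import numpy as np

    def blocked_matmul(A, B, bs=64):
        """Blocked C = A @ B: operate on bs-by-bs tiles so each tile of A, B, C
        can stay in a fast level of the memory hierarchy while it is reused."""
        n = A.shape[0]
        C = np.zeros((n, B.shape[1]))
        for i in range(0, n, bs):
            for k in range(0, A.shape[1], bs):
                Aik = A[i:i + bs, k:k + bs]
                for j in range(0, B.shape[1], bs):
                    # Each small update reuses the Aik tile across all j-tiles.
                    C[i:i + bs, j:j + bs] += Aik @ B[k:k + bs, j:j + bs]
        return C

    A, B = np.random.rand(200, 300), np.random.rand(300, 100)
    assert np.allclose(blocked_matmul(A, B), A @ B)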

36  The architecture & algorithms challenge
– A big opportunity exists for computer architecture to influence combinatorial algorithms.
– (Maybe even vice versa.)

37  A novel architectural approach: Cray MTA / XMT
– Hide latency by massive multithreading
– Per-tick context switching
– Uniform (sort of) memory access time
– But the economic case is still not completely clear.

38 A Few Other Challenges

39  The Productivity Challenge
Raw performance isn't always the only criterion. Other factors include:
– Seamless scaling from desktop to HPC
– Interactive response for data exploration and visualization
– Rapid prototyping
– Just plain programmability

40  The Education Challenge
How do you teach this stuff? Where do you go to take courses in
– graph algorithms …
– … on massive data sets …
– … in the presence of uncertainty …
– … analyzed on parallel computers …
– … applied to a domain science?

41  Final thoughts
– Combinatorial algorithms are pervasive in scientific computing and will become more so.
– Linear algebra and combinatorics can support each other in computation as well as in theory.
– A big opportunity exists for computer architecture to influence combinatorial algorithms.
– This is a great time to be doing research in combinatorial scientific computing!