1 Challenges in Combinatorial Scientific Computing John R. Gilbert University of California, Santa Barbara SIAM Annual Meeting July 10, 2009 Support: DOE.

Slides:

Advertisements

Similar presentations

Dense Linear Algebra (Data Distributions) Sathish Vadhiyar.

Advertisements

1 Computational models of the physical world Cortical bone Trabecular bone.

SE263 Video Analytics Course Project Initial Report Presented by M. Aravind Krishnan, SERC, IISc X. Mei and H. Ling, ICCV’09.

Seunghwa Kang David A. Bader Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System.

A Parallel GPU Version of the Traveling Salesman Problem Molly A. O’Neil, Dan Tamir, and Martin Burtscher* Department of Computer Science.

1 Optimization Algorithms on a Quantum Computer A New Paradigm for Technical Computing Richard H. Warren, PhD Optimization.

MATH 685/ CSI 700/ OR 682 Lecture Notes

1 Parallel Algorithms II Topics: matrix and graph algorithms.

PARALLEL PROCESSING COMPARATIVE STUDY 1. CONTEXT How to finish a work in short time???? Solution To use quicker worker. Inconvenient: The speed of worker.

1cs542g-term Notes  Assignment 1 will be out later today (look on the web)

1cs542g-term Notes  Assignment 1 is out (questions?)

Revisiting a slide from the syllabus: CS 525 will cover Parallel and distributed computing architectures – Shared memory processors – Distributed memory.

CS 240A Applied Parallel Computing John R. Gilbert Thanks to Kathy Yelick and Jim Demmel at UCB for.

Graph Analysis with High Performance Computing by Bruce Hendrickson and Jonathan W. Berry Sandria National Laboratories Published in the March/April 2008.

High-Performance Computation for Path Problems in Graphs

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

High Performance Computing 1 Parallelization Strategies and Load Balancing Some material borrowed from lectures of J. Demmel, UC Berkeley.

1 Combinatorial Scientific Computing: Experiences, Directions, and Challenges John R. Gilbert University of California, Santa Barbara DOE CSCAPES Workshop.

Tools and Primitives for High Performance Graph Computation

Graph Algorithms. Overview Graphs are very general data structures – data structures such as dense and sparse matrices, sets, multi-sets, etc. can be.

Sublinear time algorithms Ronitt Rubinfeld Computer Science and Artificial Intelligence Laboratory (CSAIL) Electrical Engineering and Computer Science.

Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,

1 Down Place Hammersmith London UK 530 Lytton Ave. Palo Alto CA USA.

1 High-Performance Graph Computation via Sparse Matrices John R. Gilbert University of California, Santa Barbara with Aydin Buluc, LBNL; Armando Fox, UCB;

An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.

1 Challenges in Combinatorial Scientific Computing John R. Gilbert University of California, Santa Barbara Grand Challenges in Data-Intensive Discovery.

11 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray) Abdullah Gharaibeh, Lauro Costa, Elizeu.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.

Graph Algorithms for Irregular, Unstructured Data John Feo Center for Adaptive Supercomputing Software Pacific Northwest National Laboratory July, 2010.

Parallel Applications Parallel Hardware Parallel Software IT industry (Silicon Valley) Users Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University.

RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.

NVIDIA Tesla GPU Zhuting Xue EE126. GPU Graphics Processing Unit The "brain" of graphics, which determines the quality of performance of the graphics.

SJSU SPRING 2011 PARALLEL COMPUTING Parallel Computing CS 147: Computer Architecture Instructor: Professor Sin-Min Lee Spring 2011 By: Alice Cotti.

MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.

Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.

CS 240A Applied Parallel Computing John R. Gilbert Thanks to Kathy Yelick and Jim Demmel at UCB for.

CS 240A Applied Parallel Computing John R. Gilbert Thanks to Kathy Yelick and Jim Demmel at UCB for.

Interactive Supercomputing Update IDC HPC User’s Forum, September 2008.

GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.

CS 240A Applied Parallel Computing John R. Gilbert Thanks to Kathy Yelick and Jim Demmel at UCB for.

Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.

Implementing Data Cube Construction Using a Cluster Middleware: Algorithms, Implementation Experience, and Performance Ge Yang Ruoming Jin Gagan Agrawal.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

Data Structures and Algorithms in Parallel Computing Lecture 7.

Euro-Par, 2006 ICS 2009 A Translation System for Enabling Data Mining Applications on GPUs Wenjing Ma Gagan Agrawal The Ohio State University ICS 2009.

Practical Message-passing Framework for Large-scale Combinatorial Optimization Inho Cho, Soya Park, Sejun Park, Dongsu Han, and Jinwoo Shin KAIST 2015.

Paper_topic: Parallel Matrix Multiplication using Vertical Data.

CES 592 Theory of Software Systems B. Ravikumar (Ravi) Office: 124 Darwin Hall.

AUTO-GC: Automatic Translation of Data Mining Applications to GPU Clusters Wenjing Ma Gagan Agrawal The Ohio State University.

Presented by Jeremy S. Meredith Sadaf R. Alam Jeffrey S. Vetter Future Technologies Group Computer Science and Mathematics Division Research supported.

Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From

Mining of Massive Datasets Edited based on Leskovec’s from

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.

Background Computer System Architectures Computer System Software.

Parallel Computers Today Oak Ridge / Cray Jaguar > 1.75 PFLOPS Two Nvidia 8800 GPUs > 1 TFLOPS Intel 80- core chip > 1 TFLOPS  TFLOPS = floating.

Introductory Lecture. What is Discrete Mathematics? Discrete mathematics is the part of mathematics devoted to the study of discrete (as opposed to continuous)

Hybrid Parallel Implementation of The DG Method Advanced Computing Department/ CAAM 03/03/2016 N. Chaabane, B. Riviere, H. Calandra, M. Sekachev, S. Hamlaoui.

Fermi National Accelerator Laboratory & Thomas Jefferson National Accelerator Facility SciDAC LQCD Software The Department of Energy (DOE) Office of Science.

Computational Challenges in BIG DATA 28/Apr/2012 China-Korea-Japan Workshop Takeaki Uno National Institute of Informatics & Graduated School for Advanced.

Tools and Libraries for Manycore Computing Kathy Yelick U.C. Berkeley and LBNL.

LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Parallel Computers Today LANL / IBM Roadrunner > 1 PFLOPS Two Nvidia 8800 GPUs > 1 TFLOPS Intel 80- core chip > 1 TFLOPS  TFLOPS = floating point.

SketchVisor: Robust Network Measurement for Software Packet Processing

These slides are based on the book:

TensorFlow– A system for large-scale machine learning

Advanced Computer Systems

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs

Department of Computer Science University of California, Santa Barbara

All-Pairs Shortest Paths

Department of Computer Science University of California, Santa Barbara

Presentation transcript:

1 Challenges in Combinatorial Scientific Computing John R. Gilbert University of California, Santa Barbara SIAM Annual Meeting July 10, 2009 Support: DOE Office of Science, NSF, DARPA, SGI

2 Combinatorial Scientific Computing “I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were ‘sparse’ in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care” - Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics

3 IC10: Mathematical Analysis of Hyperlinks in the World-Wide Web IP3: Discrete Mathematics and Theoretical Computer Science VNL: The Human Genome and Beyond IC18: Finding Provably Near-Optimal Solutions to Discrete Optimization Problems CP6: Computer Science: Discrete Algorithms MS54: New Approaches for Scalable Sparse Linear System Solution MS71: Special Session on IR Algorithms and Software MS73: Reverse Engineering Gene Networks MS86/110:Combinatorial Algorithms in Scientific Computing 2002 SIAM Annual Meeting

4 IC 7: Combinatorics Inspired by Biology IC 12: Parallel Network Analysis IC 14: On the Complexity of Game, Market, and Network Equilibria MS1/8: Adaptive Algebraic Multigrid Methods MS 4/11/26 : Mathematical Challenges in Cyber Security MS 36/43/55 : High Performance Computing on Massive Real-World Graphs MS 47: Optimization and Graph Algorithms MS 79: Enumerative and Geometric Combinatorics MS 71: Modeling of Large-Scale Metabolic Networks MS 75: Combinatorial Scientific Computing 2007 SIAM Annual Meeting

5 An analogy? As the “middleware” of scientific computing, linear algebra has supplied or enabled: Mathematical tools “Impedance match” to computer operations High-level primitives High-quality software libraries Ways to extract performance from computer architecture Interactive environments Computers Continuous physical modeling Linear algebra

6 An analogy? Computers Continuous physical modeling Linear algebra Discrete structure analysis Graph theory Computers

7 An analogy? Well, we’re not there yet …. Discrete structure analysis Graph theory Computers Mathematical tools “Impedance match” to computer operations High-level primitives High-quality software libs Ways to extract performance from computer architecture Interactive environments

8 Five Challenges in Combinatorial Scientific Computing

9 How do we know we’re solving the right problem? –Harder in graph analytics than in numerical linear algebra –I’ll pretend this is a math modeling question, not a CSC question How will applications keep us honest in the future? –Tony Chan, 1991: T(A, B) = elapsed time before field A would notice the disappearance of field B – min A ( T(A, CSC) ) ? Challenges I’m not going to talk about

10 #1: The Architecture & Algorithms Challenge LANL / IBM Roadrunner > 1 PFLOPS Two Nvidia 8800 GPUs > 1 TFLOPS Intel 80- core chip > 1 TFLOPS  Parallelism is no longer optional…  … in every part of a computation.

11 High-Performance Architecture  Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.  Originally, because linear algebra is the middleware of scientific computing.  Nowadays, largely for bragging rights. = x P A L U

12 The Memory Wall Blues  Most of memory is hundreds or thousands of cycles away from the processor that wants it.  You can buy more bandwidth, but you can’t buy less latency. (Speed of light, for one thing.)

13 The Memory Wall Blues  Most of memory is hundreds or thousands of cycles away from the processor that wants it.  You can buy more bandwidth, but you can’t buy less latency. (Speed of light, for one thing.)  You can hide latency with either locality or (parallelism + bandwidth).  You can hide lack of bandwidth with locality, but not with parallelism.

14 The Memory Wall Blues  Most of memory is hundreds or thousands of cycles away from the processor that wants it.  You can buy more bandwidth, but you can’t buy less latency. (Speed of light, for one thing.)  You can hide latency with either locality or (parallelism + bandwidth).  You can hide lack of bandwidth with locality, but not with parallelism.  Most emerging graph problems have lousy locality.  Thus the algorithms need even more parallelism!

15 Architectural impact on algorithms T = N 4.7 Naïve algorithm is O(N 5 ) time under UMH model. BLAS-3 DGEMM and recursive blocked algorithms are O(N 3 ). Size 2000 took 5 days would take 1095 years Slide from Larry Carter Naïve 3-loop matrix multiply [Alpern et al., 1992]:

16  Efficient sequential algorithms for graph-theoretic problems often follow long chains of dependencies  Computation per edge traversal is often small  Little locality or opportunity for reuse  A big opportunity exists for architecture to influence combinatorial algorithms.  Maybe even vice versa. The parallel computing challenge

17 Strongly connected components Symmetric permutation to block triangular form Diagonal blocks are strong Hall (irreducible / strongly connected) Sequential: linear time by depth-first search [Tarjan] Parallel: divide & conquer, work and span depend on input [Fleischer, Hendrickson, Pinar] PAP T G(A)

18 Strong components of 1M-vertex RMAT graph

19 One architectural approach: Cray MTA / XMT Hide latency by massive multithreading Per-tick context switching Uniform (sort of) memory access time But the economic case for the machine is less than completely clear.

20 #2: The Data Size Challenge “Can we understand anything interesting about our data when we do not even have time to read all of it?” - Ronitt Rubinfeld, 2002

21 Approximating min spanning tree weight in sublinear time [Chazelle, Rubinfeld, Trevisan] Key subroutine: estimate number of connected components of a graph, in time depending on relative error but not on size of graph Idea: for each vertex v define f(v) = 1 / (size of component containing v) Then Σ v f(v) = number of connected components Estimate Σ v f(v) by breadth-first search from a few vertices

22 Features of (many) large graph applications “Feasible” means O(n), or even less. You can’t scan all the data. –you want to poke at it in various ways –maybe interactively. Multiple simultaneous queries to the same graph –maybe to differently filtered subgraphs –throughput and response time are both important. Benchmark data sets are a big challenge! Think of data not as a finite object but as a statistical sample of an infinite process... Multiple simultaneous queries to same graph –Graph may be fixed, or slowly changing –Throughput and response time both important

23 #3: The Uncertainty Challenge “You could not step twice into the same river; for other waters are ever flowing on to you.” - Heraclitus

24 Statistical perspectives; robustness to fluctuations; streaming, noisy data; dynamic structure; analysis of sensitivity and propagation of uncertainty K-betweenness centrality for robustness [Bader] Stochastic centrality for infection vulnerability analysis [Vullikanti] Dynamic community updating [Bhowmick] Detection theory [Kepner]: “You should never worry about implementation; you should worry about whether you can correctly model the background and foreground.” Change and uncertainty

25 Horizontal - vertical decomposition [Mezic et al.] Strongly connected components, ordered by levels of DAG Linear-time sequentially; no work/span efficient parallel algorithms known … but we don’t want an exact algorithm anyway; edges are probabilities, and the application is a kind of statistical model reduction … level 1 level 2 level 3 level 4

26 Approach: 1.Decompose networks 2.Propagate uncertainty through components 3.Iteratively aggregate component uncertainty Spectral graph decomposition technique combined with dynamical systems analysis leads to deconstruction of a possibly unknown network into inputs, outputs, forward and feedback loops and allows identification of a minimal functional unit (MFU) of a system. (node 4 and several connections pruned, with no loss of performance) H-V decomposition Output, execution Feedback loops Trim the network, preserve dynamics! Input, initiator Forward, production unit Additional functional requirements Minimal functional units: sensitive edges (leading to lack of production) easily identifiable Allows identification of roles of different feedback loops Level of output For MFU Level of output with feedback loops Mezic group, UCSB Model reduction and graph decomposition

27 By analogy to numerical scientific computing... What should the combinatorial BLAS look like? #4: The Primitives Challenge C = A*B y = A*x μ = x T y Basic Linear Algebra Subroutines (BLAS): Speed (MFlops) vs. Matrix Size (n)

28 Primitives should… Supply a common notation to express computations Have broad scope but fit into a concise framework Allow programming at the appropriate level of abstraction and granularity Scale seamlessly from desktop to supercomputer Hide architecture-specific details from users

29 Frameworks for graph primitives Many possibilities; none completely satisfactory; relatively little work on common frameworks or interoperability. Visitor-based, distributed-memory: PBGL Visitor-based, multithreaded: MTGL Heterogeneous, tuned kernels: SNAP Sparse array-based: Matlab dialects, CBLAS Scan-based vectorized: NESL Map-reduce: lots of visibility, utility unclear

30 Distributed-memory parallel sparse matrix-matrix multiplication j * = i k k C ij C ij += A ik * B kj  2D block layout  Outer product formulation  Sequential “hypersparse” kernel Asynchronous MPI-2 implementation Experiments: TACC Lonestar cluster Good scaling to 256 processors Time vs Number of cores -- 1M-vertex RMAT

31 Recursive All-Pairs Shortest Paths AB CD A B D C A = A*; % recursive call B = AB; C = CA; D = D + CB; D = D*; % recursive call B = BD; C = DC; A = A + BC; + is “min”, × is “add” Based on R-Kleene algorithm Well suited for GPU architecture: Fast matrix-multiply kernel In-place computation => low memory bandwidth Few, large MatMul calls => low GPU dispatch overhead Recursion stack on host CPU, not on multicore GPU Careful tuning of GPU code

32 APSP: Experiments and observations 128-core Nvidia 8800 GPU Speedup relative to... 1-core CPU: 120x – 480x 16-core CPU: 17x – 45x Iterative, 128-core GPU: 40x – 680x MSSSP, 128-core GPU: ~3x Conclusions: High performance is achievable but not simple Carefully chosen and optimized primitives will be key Time vs. Matrix Dimension

33 #5: The Education Challenge  Or, the interdisciplinary challenge.  Where do you go to take courses in  Graph algorithms …  … on massive data sets …  … in the presence of uncertainty …  … analyzed on parallel computers …  … applied to a domain science? CSC needs to be both coherent and interdisciplinary!

34 Five Challenges In CSC 1.Architecture & algorithms 2. Data size & management 3. Uncertainty 4. Primitives & software 5. Education

35 Morals Things are clearer if you look at them from multiple perspectives Combinatorial algorithms are pervasive in scientific computing and will become more so This is a great time to be doing combinatorial scientific computing!

36 Thanks … David Bader, Jon Berry, Nadya Bliss, Aydin Buluc, Larry Carter, Alan Edelman, John Feo, Bruce Hendrickson, Jeremy Kepner, Jure Leskovic, Kamesh Madduri, Michael Mahoney, Igor Mezic, Cleve Moler, Steve Reinhardt, Eric Robinson, Ronitt Rubinfeld, Rob Schreiber, Viral Shah, Sivan Toledo, Anil Vullikanti