1 High-Performance Graph Computation via Sparse Matrices
John R. Gilbert, University of California, Santa Barbara
with Aydin Buluc, LBNL; Armando Fox, UCB; Shoaib Kamil, MIT; Adam Lugowski, UCSB; Lenny Oliker, LBNL; Sam Williams, LBNL
Dynamic Graphs Institute, April 17, 2013
Support: Intel, Microsoft, DOE Office of Science, NSF

2 Outline
– Motivation
– Sparse matrices for graph algorithms
– CombBLAS: sparse arrays and graphs on parallel machines
– KDT: attributed semantic graphs in a high-level language
– Specialization: getting the best of both worlds

3 Top 500 list (November 2012). Top500 Benchmark: solve a large system of linear equations by Gaussian elimination (PA = LU).
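As a toy illustration of the benchmark's kernel (SciPy here, not the actual HPL/Linpack benchmark code; note SciPy's convention factors A = PLU, i.e., PᵀA = LU):

import numpy as np
from scipy.linalg import lu

A = np.random.rand(4, 4)
P, L, U = lu(A)                        # SciPy factors A = P @ L @ U
assert np.allclose(P.T @ A, L @ U)     # equivalently, P^T A = LU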

4 Graph 500 list (November 2012) Graph500 Benchmark: Breadth-first search in a large power-law graph

5 Floating-point vs. graphs, November 2012. Top500 (PA = LU): 17.6 Petaflops. Graph500 (BFS): 15.3 Terateps. 17.6 Peta / 15.3 Tera is about 1,100.

6 Floating-point vs. graphs, November 2012. 17.6 Petaflops vs. 15.3 Terateps.
Nov 2012: 17.6 Peta / 15.3 Tera ≈ 1,100.
Nov 2010: 2.5 Peta / 6.6 Giga ≈ 380,000.

7 By analogy to numerical scientific computing... What should the combinatorial BLAS look like? The challenge of the software stack. Basic Linear Algebra Subroutines (BLAS): μ = xᵀy (level 1), y = A*x (level 2), C = A*B (level 3). (Figure: Ops/Sec vs. matrix size.)

8 Outline
– Motivation
– Sparse matrices for graph algorithms
– CombBLAS: sparse arrays and graphs on parallel machines
– KDT: attributed semantic graphs in a high-level language
– Specialization: getting the best of both worlds

9 Identification of Primitives
Sparse array-based primitives:
– Sparse matrix-matrix multiplication (SpGEMM)
– Sparse matrix-dense vector multiplication
– Element-wise operations (.*)
– Sparse matrix indexing
Matrices over various semirings: (+, ×), (min, +), (or, and), … A sketch of a semiring-parameterized multiply follows.
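To make the semiring point concrete, here is a minimal pure-Python sketch (illustrative only, not the CombBLAS API): an SpMV parameterized by the semiring's add, multiply, and additive identity, instantiated with (min, +) to perform one Bellman-Ford edge-relaxation step.

# Toy sparse layout: A[i] holds (j, w) pairs for edges j -> i with weight w.
def spmv_semiring(A, x, add, mul, zero):
    """y = A (*) x over the semiring (add, mul) with additive identity zero."""
    y = {}
    for i, row in A.items():
        acc = zero
        for j, w in row:
            acc = add(acc, mul(w, x[j]))
        if acc != zero:
            y[i] = acc
    return y

# (min, +) semiring on a toy graph: edges 1->0 (3), 2->1 (1), 0->2 (2), 1->2 (5).
INF = float('inf')
A = {0: [(1, 3.0)], 1: [(2, 1.0)], 2: [(0, 2.0), (1, 5.0)]}
x = {0: 0.0, 1: INF, 2: INF}                     # distances from source vertex 0
y = spmv_semiring(A, x, min, lambda a, b: a + b, INF)
print(y)  # {2: 2.0} -- vertex 2 reached at distance 2 after one relaxation
# Taking the elementwise min of y with x completes one Bellman-Ford step.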

10 Multiple-source breadth-first search. (Figure: sparse matrix Aᵀ alongside a sparse matrix X whose columns are the search frontiers.)

11 Multiple-source breadth-first search
– Sparse array representation => space efficient
– Sparse matrix-matrix multiplication => work efficient
– Three possible levels of parallelism: searches, vertices, edges
(Figure: Aᵀ × X → AᵀX.) A SciPy sketch of the idea follows.
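A minimal serial SciPy sketch of the idea (illustrative, not the CombBLAS implementation): each column of X is one search's frontier, and a single SpGEMM with Aᵀ advances all searches one level at a time.

import numpy as np
import scipy.sparse as sp

def multi_source_bfs(A, sources):
    """Level-synchronous BFS from several sources at once.
    A: n-by-n adjacency (A[i, j] != 0 means edge i -> j). Returns
    levels[v, s] = BFS level of vertex v in search s, or -1 if unreached."""
    n, k = A.shape[0], len(sources)
    AT = sp.csr_matrix(A).T.tocsr()
    cols = np.arange(k)
    X = sp.csr_matrix((np.ones(k), (sources, cols)), shape=(n, k))  # frontiers
    levels = np.full((n, k), -1)
    levels[sources, cols] = 0
    depth = 0
    while X.nnz > 0:
        depth += 1
        X = sp.csr_matrix(AT @ X)                   # one SpGEMM advances all searches
        X.data[:] = 1                               # collapse (+,*) to boolean (or,and)
        X = sp.csr_matrix(X.multiply(levels < 0))   # mask out already-visited vertices
        X.eliminate_zeros()
        rows, c = X.nonzero()
        levels[rows, c] = depth
    return levels

# Example on the 4-cycle 0->1->2->3->0, searching from vertices 0 and 2 at once:
A = sp.csr_matrix(([1, 1, 1, 1], ([0, 1, 2, 3], [1, 2, 3, 0])), shape=(4, 4))
print(multi_source_bfs(A, [0, 2]))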

Graph contraction via sparse triple product. (Figure: clusters A1, A2, A3 of the graph are contracted to single vertices; the contracted adjacency matrix is the triple product S × A × Sᵀ, where S is the cluster-indicator matrix.)
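A minimal SciPy sketch of contraction by triple product (the indicator-matrix formulation is assumed here; it is not spelled out on the slide):

import numpy as np
import scipy.sparse as sp

def contract(A, clusters):
    """A: n-by-n adjacency; clusters[v] = cluster id of vertex v.
    Returns the c-by-c adjacency of the contracted graph."""
    n = A.shape[0]
    c = clusters.max() + 1
    # S[c, v] = 1 iff vertex v belongs to cluster c
    S = sp.csr_matrix((np.ones(n), (clusters, np.arange(n))), shape=(c, n))
    return S @ A @ S.T   # entry (i, j) sums edge weights between clusters i and j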

Subgraph extraction via sparse triple product. (Figure: the induced subgraph on a chosen vertex set is extracted as R × A × Rᵀ, where R is a selection matrix with one row per chosen vertex.)
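Subgraph extraction works the same way with a selection matrix; again a SciPy sketch under the same assumptions:

import numpy as np
import scipy.sparse as sp

def extract(A, I):
    """Return the |I|-by-|I| adjacency of the subgraph induced by vertex list I."""
    n, k = A.shape[0], len(I)
    # R[t, I[t]] = 1 selects the chosen vertices, in order
    R = sp.csr_matrix((np.ones(k), (np.arange(k), I)), shape=(k, n))
    return R @ A @ R.T   # equals A[I, :][:, I]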

14 The Case for Sparse Matrices
Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations | Graphs in the language of linear algebra
Data driven, unpredictable communication | Fixed communication patterns
Irregular and unstructured, poor locality of reference | Operations on matrix blocks exploit memory hierarchy
Fine-grained data accesses, dominated by latency | Coarse-grained parallelism, bandwidth limited

15 Outline
– Motivation
– Sparse matrices for graph algorithms
– CombBLAS: sparse arrays and graphs on parallel machines
– KDT: attributed semantic graphs in a high-level language
– Specialization: getting the best of both worlds

Combinatorial BLAS class hierarchy. (Figure: class diagram.)
– Matrix classes: SpMat, SpDistMat, DenseDistMat, DistMat (abstract; enforce the interface only)
– Local storage formats behind SpMat: DCSC, CSC, CSB, Triples
– Vector classes: DenseDistVec, SpDistVec, FullyDistVec, ...
– A distributed matrix HAS-A CommGrid
– Combinatorial BLAS functions and operators work across these types via polymorphism
An illustrative sketch of the pattern follows.
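An illustrative Python rendering of the hierarchy's design idea (CombBLAS itself is C++, and these class bodies are invented for illustration): an abstract base class enforces the interface only, concrete storage formats plug in polymorphically, and the distributed matrix HAS-A communication grid plus a local piece.

from abc import ABC, abstractmethod

class SpMat(ABC):                      # enforces the interface only
    @abstractmethod
    def spmv(self, x): ...

class Triples(SpMat):                  # one concrete storage format
    def __init__(self, triples, nrows):
        self.triples, self.nrows = triples, nrows   # [(i, j, v), ...]
    def spmv(self, x):
        y = [0.0] * self.nrows
        for i, j, v in self.triples:
            y[i] += v * x[j]
        return y

class SpDistMat:                       # HAS-A communication grid + local SpMat
    def __init__(self, commgrid, local):
        self.commgrid, self.local = commgrid, local
    def spmv(self, x):                 # real code would communicate via commgrid
        return self.local.spmv(x)

A = Triples([(0, 1, 2.0), (1, 0, 3.0)], nrows=2)
print(A.spmv([1.0, 1.0]))              # [2.0, 3.0]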

17 Some Combinatorial BLAS functions

Parallel sparse matrix-matrix multiplication: SpGEMM. (Figure: A (100K × 25K) × B (25K × 5K) = C (100K × 5K).)
– 2D algorithm: Sparse SUMMA (based on dense SUMMA)
– General implementation that handles rectangular matrices
– Local update per stage: C_ij += HyperSparseGEMM(A_recv, B_recv)
A serial sketch of the 2D schedule follows.
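For intuition, here is a serial Python sketch of the Sparse SUMMA schedule (illustrative only: real Sparse SUMMA runs on a processor grid with MPI broadcasts; here the grid and broadcasts are simulated by loops, block sizes are assumed to divide evenly, and a plain SciPy product stands in for the hypersparse local GEMM):

import scipy.sparse as sp

def sparse_summa(A, B, pr, pc):
    """C = A @ B on a simulated pr-by-pc grid; A, B are scipy.sparse matrices."""
    m, k = A.shape
    n = B.shape[1]
    rb, cb, kb = m // pr, n // pc, k // pc       # block sizes (assume exact division)
    A, B = sp.csr_matrix(A), sp.csr_matrix(B)
    C = [[sp.csr_matrix((rb, cb)) for _ in range(pc)] for _ in range(pr)]
    for s in range(pc):                          # one stage per block of the k dimension
        for i in range(pr):
            Arecv = A[i*rb:(i+1)*rb, s*kb:(s+1)*kb]      # "broadcast" along grid row i
            for j in range(pc):
                Brecv = B[s*kb:(s+1)*kb, j*cb:(j+1)*cb]  # "broadcast" down grid column j
                C[i][j] = C[i][j] + Arecv @ Brecv        # local hypersparse GEMM goes here
    return sp.bmat(C)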

1D vs. 2D scaling for sparse matrix-matrix multiplication. (Figure: SpSUMMA = 2D data layout, Combinatorial BLAS; EpetraExt = 1D data layout, Trilinos.) 2D algorithms have the potential to scale, but not linearly.

Scaling to more processors… Almost linear scaling until bandwidth costs start to dominate; scaling proportional to √p afterwards.

21 Outline
– Motivation
– Sparse matrices for graph algorithms
– CombBLAS: sparse arrays and graphs on parallel machines
– KDT: attributed semantic graphs in a high-level language
– Specialization: getting the best of both worlds

Parallel graph analysis software. (Figure: discrete structure analysis combines graph theory and computers.)

Parallel graph analysis software: the software stack. (Figure, top to bottom:)
– Knowledge Discovery Toolbox (KDT): higher level (graph abstractions), used by domain scientists
– Distributed Combinatorial BLAS and shared-address-space Combinatorial BLAS: for performance, used by graph algorithm developers
– Communication support (MPI, GASNet, etc.) and threading support (OpenMP, Cilk, etc.): the domain of HPC scientists and engineers

Domain expert vs. graph expert
KDT provides:
– (Semantic) directed graphs: constructors, I/O, basic graph metrics (e.g., degree()), vectors
– Clustering / components
– Centrality / authority: betweenness centrality, PageRank
– Hypergraphs and sparse matrices
– Graph primitives (e.g., bfsTree())
– SpMV / SpGEMM on semirings
(Figure: Markov clustering pipeline: input graph, largest component, graph of clusters.)

Domain expert vs. graph expert (continued). Same feature list and Markov clustering figure as above, now with the domain expert's code:

comp = bigG.connComp()
giantComp = comp.hist().argmax()
G = bigG.subgraph(comp==giantComp)
clusters = G.cluster('Markov')
clusNedge = G.nedge(clusters)
smallG = G.contract(clusters)
# visualize

Domain expert vs. graph expert (continued). The graph expert drops down to sparse matrices, here building a Laplacian-based SpMV iteration, alongside the domain expert's code:

[…]
L = G.toSpParMat()
d = L.sum(kdt.SpParMat.Column)
L = -L
L.setDiag(d)
M = kdt.SpParMat.eye(G.nvert()) - mu*L
pos = kdt.ParVec.rand(G.nvert())
for i in range(nsteps):
    pos = M.SpMV(pos)

comp = bigG.connComp()
giantComp = comp.hist().argmax()
G = bigG.subgraph(comp==giantComp)
clusters = G.cluster('Markov')
clusNedge = G.nedge(clusters)
smallG = G.contract(clusters)
# visualize

A general graph library with operations based on linear algebraic primitives:
– Aimed at domain experts who know their problem well but don't know how to program a supercomputer
– Easy-to-use Python interface
– Runs on a laptop as well as a cluster with 10,000 processors
– Open source software (New BSD license); v0.3 released April 2013

28 A few KDT applications

Filtering a graph by attribute value. (Figure: a graph of text messages and phone calls; betweenness centrality computed on the text-message edges alone, and on the phone-call edges alone.)

Attributed semantic graphs and filters
Example:
– Vertex types: Person, Phone, Camera, Gene, Pathway
– Edge types: PhoneCall, TextMessage, CoLocation, SequenceSimilarity
– Edge attributes: Time, Duration
Calculate centrality just for text messages among engineers, sent between given start and end times:

def onlyEngineers(self):
    return self.position == Engineer

def timed(self, sTime, eTime):
    return ((self.type == TextMessage) and
            (self.Time > sTime) and (self.Time < eTime))

G.addVFilter(onlyEngineers)
G.addEFilter(timed(start, end))
# rank via centrality based on recent transactions among engineers
bc = G.rank('approxBC')

31 Outline
– Motivation
– Sparse matrices for graph algorithms
– CombBLAS: sparse arrays and graphs on parallel machines
– KDT: attributed semantic graphs in a high-level language
– Specialization: getting the best of both worlds

32 On-the-fly filter performance issues
– Filters are written in Python and called back from CombBLAS: flexible and easy, but slow
– The issue is local computation, not parallelism or data movement
– All user-written semirings face the same issue

33 Solution: Just-in-time specialization (SEJITS)
– On first call, translate the Python filter or semiring op to C++
– Compile with GCC
– Call the compiled code thereafter
(Lots of details omitted….) A toy illustration of the pattern follows.
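A minimal sketch of the specialization pattern (not SEJITS itself, which targets C++ semiring ops inside CombBLAS): on first call, translate a restricted predicate to C, compile it with the system compiler, and dispatch to the native code thereafter. It assumes a Unix-like system with a cc compiler on the PATH; all names here are illustrative.

import ctypes, os, subprocess, tempfile

_cache = {}

def specialize(c_expr):
    """Compile `int pred(double Time) { return (c_expr); }` once; return it."""
    if c_expr not in _cache:
        workdir = tempfile.mkdtemp()
        c_path = os.path.join(workdir, "filt.c")
        so_path = os.path.join(workdir, "filt.so")
        with open(c_path, "w") as f:
            f.write("int pred(double Time) { return (%s); }\n" % c_expr)
        subprocess.check_call(["cc", "-O2", "-shared", "-fPIC", c_path, "-o", so_path])
        lib = ctypes.CDLL(so_path)
        lib.pred.argtypes = [ctypes.c_double]
        lib.pred.restype = ctypes.c_int
        _cache[c_expr] = lib.pred
    return _cache[c_expr]

recent = specialize("Time > 100.0 && Time < 200.0")  # compiles on first call
print(recent(150.0), recent(300.0))                  # native code: prints 1 0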

Filtered BFS with SEJITS. (Figure: time in seconds for a single BFS iteration on a scale-25 RMAT graph (33M vertices, 500M edges) with 10% of elements passing the filter. Machine: NERSC's Hopper.)

35 Conclusion
– Sparse arrays and matrices give useful primitives and algorithms for high-performance graph computation.
– Just-in-time specialization enables flexible graph analytics.
– It helps to look at things from two directions.
kdt.sourceforge.net/
gauss.cs.ucsb.edu/~aydin/CombBLAS/