Computation on meshes, sparse matrices, and graphs

Some slides are from David Culler, Jim Demmel, Bob Lucas, Horst Simon, Kathy Yelick, et al., UCB CS267.

Parallelizing Stencil Computations
Parallelism is simple: the grid is a regular data structure, so an even decomposition across processors gives load balance.
Spatial locality limits communication cost: only boundary values are communicated between neighboring patches.
Communication volume v = total number of boundary cells between patches.
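As a concrete illustration, here is a minimal sequential numpy sketch of one 5-point stencil sweep on a regular grid; the grid size and boundary values are assumptions for illustration, not taken from the slides.

import numpy as np

n = 512                           # assumed grid size for illustration
u = np.zeros((n, n))
u[0, :] = 100.0                   # an arbitrary boundary condition

# One Jacobi-style 5-point stencil sweep: each interior cell becomes the
# average of its four neighbors.  In the parallel version, each processor
# applies this same update only to its own patch of the grid.
u_new = u.copy()
u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])

Only the cells adjacent to a patch boundary depend on values owned by a neighboring processor, which is why the communication volume is proportional to the number of boundary cells.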

Two-dimensional block decomposition
n mesh cells, p processors; each processor has a patch of n/p cells.
Block row (or block column) layout: v = 2 * p * sqrt(n)
2-dimensional block layout: v = 4 * sqrt(p) * sqrt(n)
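A quick numerical check of the two volume formulas (a sketch; the values of n and p are arbitrary examples):

from math import sqrt

n, p = 1_000_000, 64                  # mesh cells and processors (arbitrary)
v_1d = 2 * p * sqrt(n)                # block row (or block column) layout
v_2d = 4 * sqrt(p) * sqrt(n)          # 2-dimensional block layout
print(v_1d, v_2d)                     # 128000.0 vs 32000.0

For these values the 2-dimensional block layout communicates a quarter as much as the block row layout.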

Ghost Nodes in Stencil Computations
Comm cost = α * (#messages) + β * (total size of messages)
Keep a ghost copy of neighbors' boundary nodes.
Communicate every second iteration, not every iteration: this reduces #messages, not the total size of messages, and costs extra memory and computation.
Can also use more than one layer of ghost nodes.
[Figure: green = my interior nodes; blue = my boundary nodes; yellow = neighbors' boundary nodes = my "ghost nodes".]
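A minimal mpi4py sketch of a ghost-row exchange for a 1-D (block row) decomposition; the local sizes, array names, and single ghost layer are assumptions for illustration.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nloc = 100                          # interior rows owned by this rank (assumed)
u = np.zeros((nloc + 2, 64))        # rows 1..nloc are mine; rows 0 and nloc+1 are ghosts

up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send my first interior row up and receive the upper neighbor's last row
# into my top ghost row; then do the symmetric exchange downward.
comm.Sendrecv(sendbuf=u[1, :],    dest=up,   recvbuf=u[0, :],        source=up)
comm.Sendrecv(sendbuf=u[nloc, :], dest=down, recvbuf=u[nloc + 1, :], source=down)

With k layers of ghost nodes the same exchange would send k rows at a time and could be done every k-th iteration.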

Parallelism in Regular Meshes
Computing a stencil on a regular mesh requires communicating mesh points near the boundary to neighboring processors, usually via "ghost" regions.
The surface-to-volume ratio keeps communication down, but it can still be problematic in practice.
Ghost regions add memory overhead.

Irregular mesh: NASA Airfoil in 2D

Composite Mesh from a Mechanical Structure

Adaptive Mesh Refinement (AMR)
Adaptive mesh around an explosion.
Refinement is done by calculating errors.

Adaptive Mesh
Shock waves in a gas dynamics simulation using AMR (Adaptive Mesh Refinement); the figure shows fluid density.
See: http://www.llnl.gov/CASC/SAMRAI/

Irregular mesh: Tapered Tube (Multigrid)

Challenges of Irregular Meshes for PDEs
How to generate them in the first place: for example, Triangle, a 2D mesh generator (Shewchuk); 3D mesh generation is harder! For example, QMG (Vavasis).
How to partition them into patches: for example, ParMetis, a parallel graph partitioner (Karypis).
How to design iterative solvers: for example, PETSc, Aztec, Hypre (all from national labs), and Prometheus, a multigrid solver for finite elements on irregular meshes.
How to design direct solvers: for example, SuperLU, parallel sparse Gaussian elimination.
These are challenges to do sequentially, and more so in parallel.

Converting the Mesh to a Matrix

Sparse matrix data structure (stored by rows)
[Figure: a small matrix whose nonzeros 31, 53, 59, 41, 26 are stored as a value array plus a column-index array, row by row.]
Full: 2-dimensional array of real or complex numbers, (nrows*ncols) memory.
Sparse: compressed row storage, about (2*nzs + nrows) memory.
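A small Python/numpy sketch of compressed row storage and the corresponding matrix-vector product; the example matrix is made up, not the one in the figure.

import numpy as np

# CSR stores one value and one column index per nonzero (2*nzs numbers)
# plus one row pointer per row (about nrows more): ~(2*nzs + nrows) memory.
data    = np.array([31., 41., 59., 26., 53.])   # nonzero values, row by row
indices = np.array([1,   3,   0,   2,   4])     # column index of each nonzero
indptr  = np.array([0, 2, 3, 5])                # row i owns data[indptr[i]:indptr[i+1]]

def csr_matvec(indptr, indices, data, x):
    # y = A*x for a matrix A stored by rows in CSR form
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = np.dot(data[lo:hi], x[indices[lo:hi]])
    return y

y = csr_matvec(indptr, indices, data, np.ones(5))   # a 3-by-5 example matrix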

Distributed row sparse matrix data structure
Each processor stores:
its number of local nonzeros,
the range of local rows it owns,
and its nonzeros in CSR form.

Conjugate gradient iteration to solve A*x = b
x_0 = 0, r_0 = b, d_0 = r_0
for k = 1, 2, 3, ...
    α_k = (r_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    (step length)
    x_k = x_{k-1} + α_k d_{k-1}                           (approximate solution)
    r_k = r_{k-1} - α_k A d_{k-1}                         (residual)
    β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})               (improvement)
    d_k = r_k + β_k d_{k-1}                               (search direction)
One matrix-vector multiplication per iteration.
Two vector dot products per iteration.
Four n-vectors of working storage.
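The iteration transcribes almost line for line into a sequential Python/numpy sketch (in the parallel setting, the matvec and the two dot products are the communication points):

import numpy as np

def conjugate_gradient(A, b, iters=50):
    x = np.zeros(len(b))                  # x0 = 0
    r = np.array(b, dtype=float)          # r0 = b
    d = r.copy()                          # d0 = r0
    rr = r @ r
    for _ in range(iters):
        Ad = A @ d                        # the one matvec per iteration
        alpha = rr / (d @ Ad)             # step length
        x += alpha * d                    # approximate solution
        r -= alpha * Ad                   # residual
        rr_new = r @ r
        beta = rr_new / rr                # improvement
        d = r + beta * d                  # search direction
        rr = rr_new
    return x

Here x, r, d, and Ad are the four n-vectors of working storage named on the slide.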

Parallel Vector Operations
Suppose x, y, and z are column vectors whose data is partitioned among all processors.
Vector add: z = x + y. Embarrassingly parallel.
Scaled vector add (DAXPY): z = α*x + y. Broadcast the scalar α, then independent * and +.
Vector dot product (DDOT): β = x^T y = Σ_j x[j] * y[j]. Independent *, then a + reduction.
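A sketch of DAXPY and DDOT with mpi4py, assuming each rank owns one block of x and y; the block length and the scalar value are arbitrary choices.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nloc = 1000                               # local block length (assumed)
x = np.random.rand(nloc)                  # my block of x
y = np.random.rand(nloc)                  # my block of y

# DAXPY: broadcast the scalar alpha, then the * and + are purely local.
alpha = comm.bcast(2.5 if rank == 0 else None, root=0)
z = alpha * x + y

# DDOT: local multiply-and-add, then a global + reduction.
beta = comm.allreduce(np.dot(x, y), op=MPI.SUM)

Run under mpiexec, every rank ends up with the same value of beta, which is exactly what an iterative solver such as CG needs.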

Broadcast and reduction
Broadcast of 1 value to p processors in log p time.
Reduction of p values to 1 in log p time.
Takes advantage of associativity in +, *, min, max, etc.
[Figure: broadcast of a scalar α to all processors; an add-reduction of the values 1, 3, 1, 0, 4, -6, 3, 2 produces 8.]

Parallel Dense Matrix-Vector Product (Review)
y = A*x, where A is a dense matrix.
Layout: 1D by rows; A(i) is the n/p-by-n block row that processor Pi owns.
Algorithm: for each processor i, broadcast x(i), then compute y(i) = A(i)*x.
The algorithm uses the formula y(i) = A(i)*x = Σ_j A(i,j)*x(j).
[Figure: block rows of A and the corresponding pieces of x and y on processors P0..P3.]
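A sketch of the 1-D row layout with mpi4py, where gathering all the blocks x(j) onto every processor (an allgather) plays the role of the broadcasts; the matrix size is an assumption and is taken to be divisible by p for simplicity.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, i = comm.Get_size(), comm.Get_rank()

n = 1024                                  # assumed, and divisible by p
A_i = np.random.rand(n // p, n)           # my n/p-by-n block row of A
x_i = np.random.rand(n // p)              # my block of x

# Every processor contributes its x(i); afterwards each rank holds all of x.
x = np.concatenate(comm.allgather(x_i))

# Purely local computation: y(i) = A(i)*x
y_i = A_i @ x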

Parallel sparse matrix-vector product
Lay out the matrix and vectors by rows; y(i) = sum over j of A(i,j)*x(j), computing only the terms with A(i,j) ≠ 0.
Algorithm: each processor i broadcasts x(i), then computes y(i) = A(i,:)*x.
Optimizations:
Only send each processor the parts of x it needs, to reduce communication.
Reorder the matrix for better locality by graph partitioning.
Worry about balancing the number of nonzeros per processor, if rows have very different nonzero counts.
[Figure: block rows of A and the corresponding pieces of x and y on processors P0..P3.]
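To illustrate the first optimization, this small hypothetical helper works out, from a rank's local CSR column indices, exactly which entries of x it needs and which rank owns each one, assuming x is block-distributed in pieces of n/p:

import numpy as np

def x_entries_to_request(local_indices, n, p):
    # every column touched by my nonzeros is an x entry I must receive
    needed = np.unique(local_indices)
    # owning rank of each needed entry under a block layout of x
    owners = needed // (n // p)
    return needed, owners

# Made-up example: a block row touching columns 7, 2, 7, 9 of a length-12
# vector x distributed over 4 processors.
needed, owners = x_entries_to_request(np.array([7, 2, 7, 9]), n=12, p=4)
# needed -> [2 7 9], owners -> [0 2 3]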

Other memory layouts for matrix-vector product
A column layout of the matrix eliminates the broadcast, but adds a reduction to update the destination: the same total communication.
A blocked layout uses a broadcast and a reduction, each on only sqrt(p) processors: less total communication.
The blocked layout has advantages in multicore / Cilk++ settings too.
[Figure: a 4-by-4 blocked layout of the matrix over 16 processors P0..P15.]

Graphs and Sparse Matrices
A sparse matrix is a representation of a (sparse) graph.
[Figure: a 6-by-6 0/1 matrix and the 6-vertex graph it represents.]
Matrix entries are edge weights.
The number of nonzeros per row is the vertex degree.
Edges represent data dependencies in matrix-vector multiplication.
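A scipy.sparse sketch of the graph/matrix correspondence: an edge list becomes a sparse matrix, and the nonzero count per row is the vertex degree (the edge list is made up, not the graph in the figure).

import numpy as np
from scipy.sparse import coo_matrix

# A made-up undirected graph on 6 vertices, given as an edge list.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
rows = [i for i, j in edges] + [j for i, j in edges]    # store both directions
cols = [j for i, j in edges] + [i for i, j in edges]
vals = np.ones(len(rows))                               # unit edge weights

A = coo_matrix((vals, (rows, cols)), shape=(6, 6)).tocsr()

degree = np.diff(A.indptr)      # nonzeros per row = vertex degree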

Graph partitioning
Assigns subgraphs to processors; determines parallelism and locality.
Tries to make subgraphs all the same size (load balance).
Tries to minimize edge crossings (communication).
Nodes and edges in the graph may be weighted.
Optimal graph partitioning is NP-complete, but good partitioning techniques exist.
More on this in a later lecture.
[Figure: two partitions of the same graph, one with edge crossings = 6 and one with edge crossings = 10.]
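A sketch that counts edge crossings for a given assignment of vertices to processors; the adjacency matrix is assumed symmetric, as built in the previous sketch, and the partition is an arbitrary example.

import numpy as np

def edge_crossings(A_sparse, part):
    # part[v] = processor assigned to vertex v
    A = A_sparse.tocoo()
    cut = int(np.sum(part[A.row] != part[A.col]))
    return cut // 2             # each undirected edge appears twice in A

# Example, using the 6-vertex graph from the previous sketch:
# edge_crossings(A, np.array([0, 0, 0, 1, 1, 1]))  ->  2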

Link analysis of the web
Web page = vertex; link = directed edge.
Link matrix: A(i,j) = 1 if page i links to page j.
[Figure: a 7-page web graph and its link matrix.]

Web graph: PageRank (Google) [Brin, Page]
An important page is one that many important pages point to.
Markov process: follow a random link most of the time; otherwise, go to any page at random.
Importance = stationary distribution of the Markov process.
The transition matrix is p*A + (1-p)*ones(size(A)), scaled so each column sums to 1.
The importance of page i is the i-th entry in the principal eigenvector of the transition matrix.
But the matrix is 1,000,000,000,000 by 1,000,000,000,000.
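A dense power-method sketch of exactly the construction described above; it is workable only for tiny examples, since the real matrix is far too large to form explicitly, and the damping value p = 0.85 is a conventional choice assumed here.

import numpy as np

def pagerank_dense(A, p=0.85, iters=100):
    # Transition matrix: p*A + (1-p)*ones, scaled so each column sums to 1.
    n = A.shape[0]
    B = p * A + (1 - p) * np.ones((n, n))
    M = B / B.sum(axis=0)
    # Power method: repeated matvec converges to the principal eigenvector.
    x = np.ones(n) / n
    for _ in range(iters):
        x = M @ x
        x /= x.sum()
    return x                     # x[i] = importance of page i

# Tiny made-up link matrix: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(pagerank_dense(A))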

A PageRank Matrix
Importance ranking of web pages.
Stationary distribution of a Markov chain.
Power method: matvec and vector arithmetic.
Matlab*P page ranking demo (from SC'03) on a web crawl of mit.edu (170,000 pages).

Social Network Analysis in Matlab: 1993
Co-author graph from the 1993 Householder symposium.

Social Network Analysis in Matlab: 1993
Sparse adjacency matrix A. Which author has the most collaborators?
>> [count, author] = max(sum(A))
count = 32
author = 1
>> name(author,:)
ans = Golub

Social Network Analysis in Matlab: 1993
Have Gene Golub and Cleve Moler ever been coauthors?
>> A(Golub, Moler)
ans = 0
No. But how many coauthors do they have in common?
>> AA = A^2;
>> AA(Golub, Moler)
ans = 2
And who are those common coauthors?
>> name( find( A(:,Golub) .* A(:,Moler) ), : )
ans =
Wilkinson
VanLoan