Sources of Parallelism and Locality in Simulation


1 Sources of Parallelism and Locality in Simulation

2 Partial Differential Equations (PDEs)

3 Continuous Variables, Continuous Parameters
Examples of such systems include:
- Parabolic (time-dependent) problems:
  - Heat flow: Temperature(position, time)
  - Diffusion: Concentration(position, time)
- Elliptic (steady state) problems:
  - Electrostatic or Gravitational Potential: Potential(position)
- Hyperbolic problems (waves):
  - Quantum mechanics: Wave-function(position, time)
Many problems combine features of the above:
- Fluid flow: Velocity, Pressure, Density(position, time)
- Elasticity: Stress, Strain(position, time)

4 Terminology
The terms hyperbolic, parabolic, and elliptic come from special cases of the general second-order linear PDE
    a*d²u/dx² + b*d²u/dxdy + c*d²u/dy² + d*du/dx + e*du/dy + f = 0
where y is time.
This is analogous to the classification of the general quadratic (conic) equation
    a*x² + b*xy + c*y² + d*x + e*y + f = 0
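As a reminder (standard textbook material, not spelled out on the slide), the classification is by the sign of the discriminant of the second-order terms, exactly as for conic sections; a minimal LaTeX statement (using amsmath's cases):

```latex
% Classification of a u_{xx} + b u_{xy} + c u_{yy} + (lower-order terms) = 0
% by the discriminant of its principal part, the same test that classifies
% the conic a x^2 + b x y + c y^2 + ... = 0 as ellipse/parabola/hyperbola.
\[
b^2 - 4ac \;
\begin{cases}
< 0 & \text{elliptic (e.g.\ Poisson/Laplace equation)} \\
= 0 & \text{parabolic (e.g.\ heat equation)} \\
> 0 & \text{hyperbolic (e.g.\ wave equation)}
\end{cases}
\]
```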

5 Example: Deriving the Heat Equation
Consider a simple problem: a bar of uniform material on [0,1], insulated except at the ends. Let u(x,t) be the temperature at position x at time t.
Heat travels from x-h to x+h at a rate proportional to the temperature differences:
    du(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h
As h → 0, we get the heat equation:
    du(x,t)/dt = C * d²u(x,t)/dx²

6 Details of the Explicit Method for Heat
From experimentation (physical observation) we have:
    ∂u(x,t)/∂t = ∂²u(x,t)/∂x²   (assume C = 1 for simplicity)
Discretize time and space and use the explicit approach (as described for ODEs) to approximate the derivative:
    (u(x,t+1) - u(x,t))/dt = (u(x-h,t) - 2*u(x,t) + u(x+h,t))/h²
    u(x,t+1) - u(x,t) = dt/h² * (u(x-h,t) - 2*u(x,t) + u(x+h,t))
    u(x,t+1) = u(x,t) + dt/h² * (u(x-h,t) - 2*u(x,t) + u(x+h,t))
Let z = dt/h²:
    u(x,t+1) = z*u(x-h,t) + (1-2z)*u(x,t) + z*u(x+h,t)
By changing variables (x to j and t to i):
    u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]

7 Explicit Solution of the Heat Equation
Use finite differences with u[j,i] as the heat at
- time t = i*dt (i = 0,1,2,...) and position x = j*h (j = 0,1,...,N = 1/h)
- initial conditions on u[j,0]
- boundary conditions on u[0,i] and u[N,i]
At each timestep i = 0,1,2,...
    for j = 1 to N-1   (interior points; u[0,i+1] and u[N,i+1] come from the boundary conditions)
        u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
    where z = dt/h²
This corresponds to a matrix-vector multiply using nearest neighbors on the grid (figure: space-time grid with rows t = 0,...,5 and initial values u[0,0],...,u[5,0]).
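A minimal serial sketch of this explicit timestepping in C; the grid size, time-step choice, spike initial condition, and the function name explicit_heat_step are illustrative assumptions, not taken from the slides.

```c
#include <stdio.h>

#define N 100          /* grid has N+1 points, j = 0..N, h = 1.0/N */

/* One explicit timestep: u_new[j] = z*u[j-1] + (1-2z)*u[j] + z*u[j+1].
   Boundary points j = 0 and j = N are held fixed (Dirichlet conditions). */
void explicit_heat_step(const double *u, double *u_new, double z) {
    u_new[0] = u[0];
    u_new[N] = u[N];
    for (int j = 1; j < N; j++)
        u_new[j] = z * u[j-1] + (1.0 - 2.0*z) * u[j] + z * u[j+1];
}

int main(void) {
    double h  = 1.0 / N;
    double dt = 0.4 * h * h;          /* keep z = dt/h^2 <= 0.5 for stability */
    double z  = dt / (h * h);
    double u[N+1], u_new[N+1];

    for (int j = 0; j <= N; j++)      /* initial condition: a spike in the middle */
        u[j] = (j == N/2) ? 1.0 : 0.0;

    for (int step = 0; step < 1000; step++) {
        explicit_heat_step(u, u_new, z);
        for (int j = 0; j <= N; j++) u[j] = u_new[j];
    }
    printf("u at midpoint after 1000 steps: %g\n", u[N/2]);
    return 0;
}
```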

8 Matrix View of the Explicit Method for Heat
Each step multiplies the solution vector by a tridiagonal matrix T:

    T = [ 1-2z    z                      ]
        [   z    1-2z    z               ]
        [          z    1-2z    z        ]
        [                  ...           ]
        [                   z    1-2z    ]

(Graph and "3 point stencil": each grid point is coupled to itself with weight 1-2z and to its left and right neighbors with weight z.)
For a 2D mesh (5 point stencil) the matrix is pentadiagonal.
More on the matrix/grid views later.

9 Parallelism in the Explicit Method for PDEs
Partition the space (x) into p large chunks:
- good load balance (assuming a large number of points relative to p)
- minimal communication (only at the boundaries of the p chunks)
Generalizes to:
- multiple dimensions
- arbitrary graphs (= arbitrary sparse matrices)
The explicit approach is often used for hyperbolic equations.
Problem with the explicit approach for heat (parabolic): numerical instability.
- the solution blows up eventually if z = dt/h² > .5
- so the time steps must be very small when h is small: dt < .5*h²

10 Instability in Solving the Heat Equation Explicitly

11 Implicit Solution of the Heat Equation
From experimentation (physical observation) we have:
    ∂u(x,t)/∂t = ∂²u(x,t)/∂x²   (assume C = 1 for simplicity)
Discretize time and space and use the implicit approach (backward Euler) to approximate the derivative:
    (u(x,t+1) - u(x,t))/dt = (u(x-h,t+1) - 2*u(x,t+1) + u(x+h,t+1))/h²
    u(x,t) = u(x,t+1) - dt/h² * (u(x-h,t+1) - 2*u(x,t+1) + u(x+h,t+1))
Let z = dt/h² and change variables (t to i and x to j):
    u(:,i) = (I + z*L) * u(:,i+1)
where I is the identity and L is the 1D Laplacian matrix (2 on the diagonal, -1 on the off-diagonals):

    L = [  2   -1                 ]
        [ -1    2   -1            ]
        [       -1    2   -1      ]
        [              ...        ]
        [             -1    2     ]

12 Implicit Solution of the Heat Equation
The previous slide used backward Euler, but using the trapezoidal rule (Crank-Nicolson) gives better numerical properties.
This turns into solving the following equation:
    (I + (z/2)*L) * u[:,i+1] = (I - (z/2)*L) * u[:,i]
Again I is the identity matrix and L is the same 1D Laplacian (graph and "stencil": 2 at each node, -1 on each edge):

    L = [  2   -1                 ]
        [ -1    2   -1            ]
        [              ...        ]
        [             -1    2     ]

This is essentially solving Poisson's equation in 1D.
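Each implicit step requires solving a tridiagonal system such as (I + (z/2)*L)*u[:,i+1] = rhs. A minimal sketch of the standard Thomas algorithm (tridiagonal Gaussian elimination) for a constant-coefficient system; the function name solve_tridiag and the in-place convention are illustrative assumptions, not from the slides.

```c
/* Solve the constant-coefficient tridiagonal system
 *     diag*x[i] + off*(x[i-1] + x[i+1]) = rhs[i],   i = 0..n-1
 * (with x[-1] = x[n] = 0) by the Thomas algorithm.
 * Overwrites rhs with the solution x; c is a caller-provided work array of length n.
 */
void solve_tridiag(int n, double diag, double off, double *rhs, double *c) {
    c[0]   = off / diag;
    rhs[0] = rhs[0] / diag;
    for (int i = 1; i < n; i++) {
        double m = diag - off * c[i-1];   /* pivot after eliminating the subdiagonal */
        c[i]   = off / m;
        rhs[i] = (rhs[i] - off * rhs[i-1]) / m;
    }
    for (int i = n - 2; i >= 0; i--)      /* back substitution */
        rhs[i] -= c[i] * rhs[i+1];
}
```

For the Crank-Nicolson step one would first form rhs = (I - (z/2)*L)*u[:,i] with a 3-point stencil sweep, then call solve_tridiag with diag = 1 + z and off = -z/2.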

13 2D Implicit Method
Similar to the 1D case, but the matrix L is now the 2D Laplacian: 4 on the diagonal and -1 for each of the (up to) four nearest neighbors on the n-by-n grid (graph and "5 point stencil").
Multiplying by this matrix (as in the explicit case) is simply a nearest-neighbor computation on the 2D grid.
To solve this system, there are several techniques.
The 3D case is analogous (7 point stencil).
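Applying L never requires forming the matrix explicitly; a sketch of the matrix-free 5-point stencil apply on an n-by-n grid. Row-major storage, zero Dirichlet boundaries, and the function name apply_laplacian_2d are assumptions made for illustration.

```c
/* y = L*x for the 2D 5-point Laplacian on an n-by-n grid,
 * stored row-major: x[i*n + j] holds grid point (i,j).
 * Points outside the grid are treated as zero (Dirichlet boundary).
 */
void apply_laplacian_2d(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double v = 4.0 * x[i*n + j];
            if (i > 0)     v -= x[(i-1)*n + j];   /* north neighbor */
            if (i < n - 1) v -= x[(i+1)*n + j];   /* south neighbor */
            if (j > 0)     v -= x[i*n + j - 1];   /* west neighbor  */
            if (j < n - 1) v -= x[i*n + j + 1];   /* east neighbor  */
            y[i*n + j] = v;
        }
    }
}
```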

14 Relation of Poisson's Equation to Gravity, Electrostatics
The Poisson equation arises in many problems.
E.g., the force on a particle at (x,y,z) due to a particle at the origin is
    -(x,y,z)/r³, where r = sqrt(x² + y² + z²)
The force is also the negative gradient of the potential V = -1/r:
    force = -(dV/dx, dV/dy, dV/dz) = -grad V
V satisfies Poisson's equation (try working this out!).

15 Algorithms for the 2D Poisson Equation (N variables)

    Algorithm       Serial      PRAM            Memory     #Procs
    Dense LU        N^3         N               N^2        N^2
    Band LU         N^2         N               N^(3/2)    N
    Jacobi          N^2         N               N          N
    Explicit Inv.   N^2         log N           N^2        N^2
    Conj. Grad.     N^(3/2)     N^(1/2)*log N   N          N
    RB SOR          N^(3/2)     N^(1/2)         N          N
    Sparse LU       N^(3/2)     N^(1/2)         N*log N    N
    FFT             N*log N     log N           N          N
    Multigrid       N           (log N)^2       N          N
    Lower bound     N           log N           N

PRAM is an idealized parallel model with zero-cost communication.
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.

16 Overview of Algorithms
Sorted in two orders (roughly):
- from slowest to fastest on sequential machines
- from most general (works on any matrix) to most specialized (works on matrices "like" T)
Dense LU: Gaussian elimination; works on any N-by-N matrix.
Band LU: exploits the fact that T is nonzero only on the sqrt(N) diagonals nearest the main diagonal.
Jacobi: essentially does a matrix-vector multiply by T in the inner loop of an iterative algorithm.
Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it (but each multiply is still expensive).
Conjugate Gradient: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not.
Red-Black SOR (successive overrelaxation): a variation of Jacobi that exploits yet different mathematical properties of T; used in multigrid schemes.
Sparse LU: Gaussian elimination exploiting the particular zero structure of T.
FFT (fast Fourier transform): works only on matrices very like T.
Multigrid: also works on matrices like T, which come from elliptic PDEs.
Lower Bound: serial (time to print the answer); parallel (time to combine N inputs).
Details in class notes.

17 Mflop/s Versus Run Time in Practice
Problem: iterative solver for a convection-diffusion problem, run on a 1024-CPU NCUBE-2.
Reference: Shadid and Tuminaro, SIAM Parallel Processing Conference, March.

    Solver           Flops      CPU Time    Mflop/s
    Jacobi           3.82x...   ...         ...
    Gauss-Seidel     1.21x...   ...         ...
    Least Squares    2.59x...   ...         ...
    Multigrid        2.13x...   ...         ...

Which solver would you select?

18 Summary of Approaches to Solving PDEs
As with ODEs, either explicit or implicit approaches are possible:
- Explicit: a sparse matrix-vector multiplication at each step
- Implicit: a sparse matrix solve at each step
  - direct solvers are hard (more on this later)
  - iterative solvers turn into sparse matrix-vector multiplication
Grid and sparse matrix correspondence:
- sparse matrix-vector multiplication is nearest-neighbor "averaging" on the underlying mesh
Not all nearest-neighbor computations have the same efficiency:
- the factors are the mesh structure (nonzero structure) and the number of flops per point

19 Comments on Practical Meshes
Regular 1D, 2D, and 3D meshes:
- important as building blocks for more complicated meshes
Practical meshes are often irregular:
- composite meshes, consisting of multiple "bent" regular meshes joined at edges
- unstructured meshes, with arbitrary mesh points and connectivities
- adaptive meshes, which change resolution during the solution process to put computational effort where it is needed

20 Parallelism in Regular Meshes
Computing a stencil on a regular mesh requires communicating mesh points near the boundary to neighboring processors, usually by keeping copies of them in "ghost" regions.
The surface-to-volume ratio keeps communication down, but it may still be problematic in practice.
Ghost regions add memory overhead.
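A sketch of how the ghost-region exchange might look for a 1D strip decomposition of the grid with MPI; the row layout, message tags, and the function name exchange_ghosts are illustrative assumptions, not taken from the slides.

```c
#include <mpi.h>

/* Exchange ghost rows for a 1D strip decomposition: each rank owns
 * `local_rows` rows of width `n`, stored in rows 1..local_rows of u,
 * with ghost rows 0 and local_rows+1 (row-major, row stride n).
 */
void exchange_ghosts(double *u, int local_rows, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  /* neighbor holding rows above mine */
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;  /* neighbor holding rows below mine */

    /* send my first owned row up; receive the first row of the rank below into my bottom ghost row */
    MPI_Sendrecv(&u[1 * n],              n, MPI_DOUBLE, up,   0,
                 &u[(local_rows + 1)*n], n, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my last owned row down; receive the last row of the rank above into my top ghost row */
    MPI_Sendrecv(&u[local_rows * n],     n, MPI_DOUBLE, down, 1,
                 &u[0],                  n, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}
```

With MPI_PROC_NULL at the ends of the processor row, the boundary ranks' sends and receives become no-ops, so no special-casing is needed.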

21 Adaptive Mesh Refinement (AMR)
Adaptive mesh around an explosion; refinement is done by calculating error estimates.
Parallelism:
- mostly between "patches," dealt out to processors for load balance
- may exploit some parallelism within a patch (SMP)
Projects: Titanium (UC Berkeley), Chombo (P. Colella, LBL), KeLP (S. Baden, UCSD), and the AMR work of J. Bell (LBL).

22 Adaptive Mesh
Shock waves in gas dynamics, computed using AMR (Adaptive Mesh Refinement); the figure shows fluid density.

23 Composite Mesh from a Mechanical Structure

24 Converting the Mesh to a Matrix

25 Effects of Reordering on Gaussian Elimination

26 Irregular mesh: NASA Airfoil in 2D

27 Irregular mesh: Tapered Tube (Multigrid)

28 Challenges of Irregular Meshes
How to generate them in the first place
- Triangle, a 2D mesh generator by Jonathan Shewchuk
- 3D is harder!
How to partition them
- ParMetis, a parallel graph partitioner
How to design iterative solvers
- PETSc, a Portable Extensible Toolkit for Scientific Computing
- Prometheus, a multigrid solver for finite element problems on irregular meshes
How to design direct solvers
- SuperLU, parallel sparse Gaussian elimination
These are challenges even sequentially, and more so in parallel.

29 Jacobi's Method
To derive Jacobi's method, write Poisson's equation as:
    u(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j)) / 4
Let u(i,j,m) be the approximation for u(i,j) after m steps:
    u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + u(i,j-1,m) + u(i,j+1,m) + b(i,j)) / 4
I.e., u(i,j,m+1) is a weighted average of its neighbors.
Motivation: u(i,j,m+1) is chosen to exactly satisfy the equation at (i,j).
The number of steps to converge is proportional to the problem size, N = n².
Therefore, the serial complexity is O(N²).
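A serial sketch of one Jacobi sweep exactly as written above; the (n+2)-by-(n+2) layout with a one-cell boundary ring, the two-array update, and the name jacobi_sweep are illustrative assumptions.

```c
/* One Jacobi sweep on an (n+2)x(n+2) array (indices 0 and n+1 are boundary):
 *   unew(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j)) / 4
 */
void jacobi_sweep(int n, const double *u, const double *b, double *unew) {
    int stride = n + 2;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            unew[i*stride + j] = 0.25 * (u[(i-1)*stride + j] + u[(i+1)*stride + j]
                                       + u[i*stride + j - 1] + u[i*stride + j + 1]
                                       + b[i*stride + j]);
}
```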

30 Parallelizing Jacobi's Method
Reduces to a sparse-matrix-vector multiply by (nearly) T:
    U(m+1) = (I - T/4) * U(m) + B/4
Each value of U(m+1) may be updated independently:
- keep 2 copies, for steps m and m+1
Requires that boundary values be communicated:
- if each processor owns n²/p elements to update, the amount of data communicated per neighbor (roughly n/√p values for a 2D block partition) is relatively small if n >> p

31 Gauss-Seidel
Updating the unknowns in left-to-right, row-wise order, we get the Gauss-Seidel algorithm:
    for i = 1 to n
        for j = 1 to n
            u(i,j,m+1) = (u(i-1,j,m+1) + u(i+1,j,m) + u(i,j-1,m+1) + u(i,j+1,m) + b(i,j)) / 4
This loop cannot be parallelized as written, so use a "red-black" order:
    forall black points u(i,j)
        u(i,j,m+1) = (u(i-1,j,m) + ...
    forall red points u(i,j)
        u(i,j,m+1) = (u(i-1,j,m+1) + ...
For a general graph, use graph coloring:
- Graph(T) is bipartite => 2-colorable (red and black)
- nodes of each color can be updated simultaneously
- still a sparse-matrix-vector multiply, using submatrices

32 Successive Overrelaxation (SOR)
Red-black Gauss-Seidel converges twice as fast as Jacobi, but there are twice as many parallel steps, so it is about the same in practice.
To motivate the next improvement, write the basic step of the algorithm as:
    u(i,j,m+1) = u(i,j,m) + correction(i,j,m)
If "correction" is a good direction to move in, then one should move even further in that direction, by some factor w > 1:
    u(i,j,m+1) = u(i,j,m) + w * correction(i,j,m)
This is called successive overrelaxation (SOR).
It parallelizes like Jacobi (still a sparse-matrix-vector multiply...).
One can prove that w = 2/(1 + sin(π/(n+1))) gives the best convergence.
Number of steps to converge = parallel complexity = O(n), instead of O(n²) for Jacobi.
Serial complexity O(n³) = O(N^(3/2)), instead of O(n⁴) = O(N²) for Jacobi.
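A sketch of one red-black sweep covering both of the last two slides: with w = 1 it is red-black Gauss-Seidel, with w > 1 it is SOR. The in-place (n+2)x(n+2) layout and the name rb_sor_sweep are illustrative assumptions.

```c
/* One red-black SOR sweep, updating u in place on an (n+2)x(n+2) array.
 * Points are colored by the parity of i+j; all points of one color are
 * updated before the other, so each half-sweep is fully parallel.
 * w = 1.0 reduces this to red-black Gauss-Seidel.
 */
void rb_sor_sweep(int n, double *u, const double *b, double w) {
    int stride = n + 2;
    for (int color = 0; color < 2; color++) {        /* 0 = red, 1 = black */
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                if ((i + j) % 2 != color) continue;
                double gs = 0.25 * (u[(i-1)*stride + j] + u[(i+1)*stride + j]
                                  + u[i*stride + j - 1] + u[i*stride + j + 1]
                                  + b[i*stride + j]);
                /* overrelaxed step: move by w times the Gauss-Seidel correction */
                u[i*stride + j] += w * (gs - u[i*stride + j]);
            }
        }
    }
}
```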

33 Conjugate Gradient (CG) for Solving A*x = b
This method can be used when the matrix A is:
- symmetric, i.e., A = A^T
- positive definite, defined equivalently as:
  - all eigenvalues are positive
  - x^T * A * x > 0 for all nonzero vectors x
  - a Cholesky factorization A = L*L^T exists
The algorithm maintains 3 vectors:
- x = the approximate solution, improved after each iteration
- r = the residual, r = A*x - b
- p = the search direction, also called the conjugate gradient
One iteration costs:
- a sparse-matrix-vector multiply by A (the major cost)
- 3 dot products, 3 saxpys (scale*vector + vector)
Converges in O(n) = O(N^(1/2)) steps, like SOR:
- serial complexity = O(N^(3/2))
- parallel complexity = O(N^(1/2) log N), the log N factor coming from the dot products
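A sketch of the CG iteration described above, with the matrix applied through a user-supplied matvec callback (for Poisson this would wrap a stencil routine like the 5-point apply sketched earlier). The function names and the convention r = b - A*x are assumptions; the slide writes the residual as A*x - b, which only flips its sign.

```c
#include <math.h>

/* Dot product and axpy helpers for length-N vectors. */
static double dot(int N, const double *x, const double *y) {
    double s = 0.0;
    for (int k = 0; k < N; k++) s += x[k] * y[k];
    return s;
}
static void axpy(int N, double a, const double *x, double *y) {  /* y += a*x */
    for (int k = 0; k < N; k++) y[k] += a * x[k];
}

/* Conjugate gradient for A*x = b, with A applied through matvec(N, in, out).
 * x holds an initial guess on entry and the approximate solution on exit;
 * r, p, Ap are caller-provided work vectors of length N.
 */
void conjugate_gradient(int N,
                        void (*matvec)(int, const double *, double *),
                        const double *b, double *x,
                        double *r, double *p, double *Ap,
                        int maxit, double tol) {
    matvec(N, x, Ap);
    for (int k = 0; k < N; k++) { r[k] = b[k] - Ap[k]; p[k] = r[k]; }
    double rr = dot(N, r, r);
    for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
        matvec(N, p, Ap);                      /* the one matvec per iteration */
        double alpha = rr / dot(N, p, Ap);
        axpy(N,  alpha, p,  x);                /* x += alpha * p  */
        axpy(N, -alpha, Ap, r);                /* r -= alpha * Ap */
        double rr_new = dot(N, r, r);
        double beta = rr_new / rr;             /* Fletcher-Reeves style update of p */
        for (int k = 0; k < N; k++) p[k] = r[k] + beta * p[k];
        rr = rr_new;
    }
}
```

In a distributed-memory setting the matvec uses the ghost exchange sketched earlier, and each dot product becomes a global reduction, which is where the log N term in the parallel complexity comes from.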

34 Summary of Jacobi, SOR and CG
Jacobi, SOR, and CG all perform sparse-matrix-vector multiplies.
For Poisson, this means nearest-neighbor communication on an n-by-n grid.
It takes n = N^(1/2) steps for information to travel across an n-by-n grid.
Since the solution on one side of the grid depends on data on the other side, faster methods require faster ways to move information:
- FFT
- Multigrid
- Domain decomposition (preconditioned CG so that # iterations ~ constant)