CS 267: Applications of Parallel Computers
Lecture: Structured Grids
Jim Demmel and Horst D. Simon

The Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.)
Motifs form key computational patterns
Topic of this lecture: structured grids
04/13/09 CS267 Lecture 20

Outline
Review PDEs
Overview of Methods for Poisson Equation
Jacobi’s method
Red-Black SOR method
Conjugate Gradient
Sparse LU (Lecture 17, 3/16/10, Sherry Li)
FFT (Lecture 22, 4/1/10)
Multigrid

Solving PDEs
Hyperbolic problems (waves):
  Sound wave (position, time)
  Use explicit time-stepping
  Solution at each point depends on neighbors at previous time
Elliptic (steady state) problems:
  Electrostatic potential (position)
  Everything depends on everything else
  This means locality is harder to find than in hyperbolic problems
Parabolic (time-dependent) problems:
  Temperature (position, time)
  Involves an elliptic solve at each time-step
Focus on elliptic problems
  Canonical example is the Poisson equation
  $\partial^2 u/\partial x^2 + \partial^2 u/\partial y^2 + \partial^2 u/\partial z^2 = f(x,y,z)$

Explicit Solution of PDEs
Explicit methods are often used for hyperbolic PDEs
Stability limits the size of the time-step
Computation corresponds to
  Matrix-vector multiply
  Combine nearest neighbors on grid
Use finite differences with u[j,i] as the solution at
  time t = i*δ (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
  initial conditions on u[j,0]
  boundary conditions on u[0,i] and u[N,i]
At each timestep i = 0,1,2,...
  For j = 0 to N
    u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
  where z = C*δ/h^2
[Figure: space-time grid with position j on the horizontal axis (u[0,0] … u[5,0]) and timestep i = 0 … 5 on the vertical axis]
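As a minimal sketch of the update above (not code from the lecture), the loop below advances one explicit time step in plain Python. The function name `explicit_heat_step`, the sample values, and the choice to hold the two boundary entries fixed instead of updating them are assumptions made for illustration.

```python
def explicit_heat_step(u, z):
    """Advance the 1D explicit scheme by one time step.

    u holds the values at positions j = 0..N for the current time step.
    Assumption: u[0] and u[N] are boundary values and are left unchanged.
    """
    new = list(u)
    for j in range(1, len(u) - 1):
        new[j] = z * u[j - 1] + (1 - 2 * z) * u[j] + z * u[j + 1]
    return new


# Hypothetical data: N = 4 intervals, a unit spike in the middle, z = 0.25.
u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
u1 = explicit_heat_step(u0, 0.25)
print(u1)
```

Each new value depends only on three values from the previous step, which is why every point can be updated independently.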

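Matrix View of Explicit Method for Heat Equation
u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
u[:, i+1] = T * u[:, i], where T is tridiagonal:
$$T = \begin{pmatrix} 1-2z & z & & & \\ z & 1-2z & z & & \\ & z & 1-2z & z & \\ & & z & 1-2z & z \\ & & & z & 1-2z \end{pmatrix} = I - zL, \qquad L = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & -1 & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$
[Figure: graph and “3 point stencil” with weights z, 1-2z, z]
L is called the Laplacian (in 1D)
For a 2D mesh (5 point stencil) the Laplacian is pentadiagonal (5 nonzeros per row, along diagonals)
For a 3D mesh there are 7 nonzeros per row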
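To make the matrix view concrete, here is a small sketch that builds T = I - z*L as a dense list of lists and applies it to a vector of interior values. The helper names `heat_matrix` and `matvec` are hypothetical, dense storage is used only for readability, and the boundary values are assumed to be zero.

```python
def heat_matrix(n, z):
    """Dense n-by-n matrix T = I - z*L, where L = tridiag(-1, 2, -1)."""
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):
        T[j][j] = 1 - 2 * z
        if j > 0:
            T[j][j - 1] = z
        if j < n - 1:
            T[j][j + 1] = z
    return T


def matvec(T, u):
    """Plain dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, u)) for row in T]


# Three interior points, zero boundary values, z = 0.25 (illustrative values).
T = heat_matrix(3, 0.25)
u_next = matvec(T, [0.0, 1.0, 0.0])
print(u_next)
```

A real code would never store T; the three-point stencil carries the same information in O(1) memory per row.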

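Poisson’s equation in 1D: $\partial^2 u/\partial x^2 = f(x)$
$$T = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & -1 & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$
[Figure: graph and “stencil” with weights -1, 2, -1]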

2D Poisson’s equation
Similar to the 1D case, but the matrix T now has 4 on the diagonal and -1 in the column of each grid neighbor (up to four per row)
[Figure: graph and “stencil”]
       -1
  -1    4   -1
       -1
3D is analogous
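The structure of the 2D matrix can be sketched directly from the stencil. The function below is illustrative only: the name `poisson2d`, the row-major numbering of grid points, and the dense storage are assumptions, not something the slides prescribe.

```python
def poisson2d(n):
    """Dense matrix of the 5-point stencil on an n-by-n grid of unknowns.

    Grid point (i, j) is numbered i*n + j (row-major order, an assumption).
    """
    N = n * n
    T = [[0] * N for _ in range(N)]
    for i in range(n):
        for j in range(n):
            row = i * n + j
            T[row][row] = 4
            if i > 0:
                T[row][row - n] = -1  # neighbor in the previous grid row
            if i < n - 1:
                T[row][row + n] = -1  # neighbor in the next grid row
            if j > 0:
                T[row][row - 1] = -1  # left neighbor
            if j < n - 1:
                T[row][row + 1] = -1  # right neighbor
    return T


T = poisson2d(3)
for row in T:
    print(row)
```

For the 3-by-3 grid only the center point has all four neighbors, so only its row has five nonzeros.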

Algorithms for 2D (3D) Poisson Equation (N = n^2 (n^3) vars)

Algorithm        Serial             PRAM                        Memory               #Procs
Dense LU         N^3                N                           N^2                  N^2
Band LU          N^2 (N^(7/3))      N                           N^(3/2) (N^(5/3))    N (N^(4/3))
Jacobi           N^2 (N^(5/3))      N (N^(2/3))                 N                    N
Explicit Inv.    N^2                log N                       N^2                  N^2
Conj. Gradients  N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3)) * log N   N                    N
Red/Black SOR    N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3))           N                    N
Sparse LU        N^(3/2) (N^2)      N^(1/2)                     N*log N (N^(4/3))    N
FFT              N*log N            log N                       N                    N
Multigrid        N                  log^2 N                     N                    N
Lower bound      N                  log N                       N

PRAM is an idealized parallel model with zero cost communication
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
Notes:
  Fix table in earlier lecture for 3D case!
  Band LU serial complexity: dimension * bandwidth^2 = n^2 * n^2 in 2D and n^3 * (n^2)^2 in 3D
  Band LU storage: dimension * bandwidth
  Band LU #procs = bandwidth^2
  Jacobi complexity = complexity(SpMV) * #its, #its = cond# = n^2 in 2D and 3D
  CG complexity = [complexity(SpMV) + complexity(dotprod)] * #its, #its = sqrt(cond#) = n in 2D and 3D
  RB SOR complexity = complexity(SpMV) * #its, #its = sqrt(cond#) = n in 2D and 3D

Overview of Algorithms on n x n grid (n x n x n grid) (1)
Sorted in two orders (roughly):
  from slowest to fastest on sequential machines.
  from most general (works on any matrix) to most specialized (works on matrices “like” T).
Dense LU: Gaussian elimination; works on any N-by-N matrix.
Band LU: Exploits the fact that T is nonzero only on N^(1/2) (N^(2/3)) diagonals nearest main diagonal.
Jacobi: Essentially does matrix-vector multiply by T in inner loop of iterative algorithm.
Explicit Inverse: Assume we want to solve many systems with T, so we can precompute and store inv(T) “for free”, and just multiply by it (but still expensive!).
Conjugate Gradient: Uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not.
Red-Black SOR (successive over-relaxation): Variation of Jacobi that exploits yet different mathematical properties of T. Used in multigrid schemes.
Sparse LU: Gaussian elimination exploiting particular zero structure of T.
FFT (fast Fourier transform): Works only on matrices very like T.
Multigrid: Also works on matrices like T, that come from elliptic PDEs.
Lower Bound: Serial (time to print answer); parallel (time to combine N inputs).
Details in class notes.
Convergence conditions:
  Jacobi: converges if matrix diagonally dominant (our problem is nearly so)
  CG: converges if matrix symmetric positive definite (SPD)
  Red-Black SOR: converges if matrix is SPD; analysis of convergence rate depends on more structure (RB ordering)

Overview of Algorithms on n x n grid (n x n x n grid) (2)
Dimension = N = n^2 in 2D (= n^3 in 3D)
Bandwidth = BW = n in 2D (= n^2 in 3D)
Condition number = k = n^2 in 2D or 3D
#its = number of iterations
SpMV = sparse-matrix-vector multiply
Dense LU: cost = N^3 (serial) or N (parallel, with N^2 procs)
Band LU: cost = N * BW^2 (serial) or N (parallel, with BW^2 procs)
Jacobi: cost = cost(SpMV) * #its, #its = k
Conjugate Gradient: cost = (cost(SpMV) + cost(dot prod)) * #its, #its = sqrt(k)
Red-Black SOR: cost = cost(SpMV) * #its, #its = sqrt(k)
SpMV cost is cost of nearest neighbor relaxation, which is O(N)
Iterative algorithms need different assumptions for convergence analysis
  Jacobi: converges if matrix diagonally dominant (our problem is nearly so)
  CG: converges if matrix symmetric positive definite (SPD)
  Red-Black SOR: converges if matrix is SPD; analysis of convergence rate depends on more structure (RB ordering)

Jacobi’s Method
To derive Jacobi’s method, write Poisson as:
  u(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j)) / 4
Let u(i,j,m) be approximation for u(i,j) after m steps
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + u(i,j-1,m) + u(i,j+1,m) + b(i,j)) / 4
I.e., u(i,j,m+1) is a weighted average of neighbors
Motivation: u(i,j,m+1) chosen to exactly satisfy equation at (i,j)
Steps to converge proportional to problem size, N = n^2 (see Lecture 24)
Therefore, serial complexity is O(N^2)
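The update rule above translates almost line for line into code. The sketch below assumes the n-by-n unknowns are stored inside an (n+2)-by-(n+2) array whose outer ring holds zero boundary values; the function name `jacobi_sweep` and the test problem (right-hand side 0 except a 1 at the center) are illustrative choices.

```python
def jacobi_sweep(u, b):
    """One Jacobi step: every interior point is recomputed from the OLD values.

    u and b are (n+2)-by-(n+2) lists of lists; the outer ring of u is the
    fixed boundary (an assumed storage layout).
    """
    n = len(u) - 2
    new = [row[:] for row in u]  # second copy, so old values stay intact
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            new[i][j] = (u[i - 1][j] + u[i + 1][j]
                         + u[i][j - 1] + u[i][j + 1] + b[i][j]) / 4
    return new


n = 3
u = [[0.0] * (n + 2) for _ in range(n + 2)]
b = [[0.0] * (n + 2) for _ in range(n + 2)]
b[2][2] = 1.0  # point source at the center

u = jacobi_sweep(u, b)
after_one = [row[:] for row in u]
u = jacobi_sweep(u, b)
print(after_one[2][2], u[1][2])
```

After one sweep only the center is nonzero; its four neighbors become nonzero after the second sweep, so information moves one grid cell per iteration.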

Convergence of Nearest Neighbor Methods
Jacobi’s method involves nearest neighbor computation on an n x n grid (N = n^2)
  So it takes O(n) = O(sqrt(N)) iterations for information to propagate
E.g., consider a rhs (b) that is 0, except the center is 1
The exact solution looks like:
[Figure: exact solution]
Even in the best case, any nearest neighbor computation will take n/2 steps to propagate on an n x n grid

Convergence of Nearest Neighbor Methods
Best 5 step solution is not a particular method, but uses the true solution 5 steps out from center, and zero elsewhere.

Parallelizing Jacobi’s Method
Reduces to sparse-matrix-vector multiply by (nearly) T
  U(m+1) = (I - T/4) * U(m) + B/4
Each value of U(m+1) may be updated independently
  keep 2 copies for iterations m and m+1
Requires that boundary values be communicated
  if each processor owns n^2/p elements to update
  amount of data communicated, n/sqrt(p) per neighbor for square subgrids, is relatively small if n >> sqrt(p)

Locality Optimizations in Jacobi
Nearest neighbor update in Jacobi has:
  Good spatial locality (for regular mesh): traverse rows/columns
  Poor temporal locality: few flops per mesh point
    E.g., on 2D mesh: 4 adds and 1 multiply for 5 loads and 1 store
For both parallel and single processors, may trade off extra computation for reduced memory operations
  Idea: for each subgrid, do multiple Jacobi iterations
  Size of subgrid shrinks on each iteration, so start with larger one
  Used for uniprocessors:
    Rescheduling for Locality, Michelle Mills Strout: www-unix.mcs.anl.gov/~mstrout
    Cache-efficient multigrid algorithms, Sid Chatterjee
  And parallel machines, including “tuning” for grids
    Ian Foster et al, SC2001

Redundant Ghost Nodes in Jacobi
Overview of Memory Hierarchy Optimization
Can be used on unstructured meshes
Size of ghost region (and redundant computation) depends on network/memory speed vs. computation
[Figure legend: to compute green, copy yellow and compute blue]

Improvements on Jacobi
Similar to Jacobi: u(i,j,m+1) is computed as a linear combination of neighbors
  Numeric coefficients and update order are different
Based on 2 improvements over Jacobi
  Use “most recent values” of u that are available, since these are probably more accurate
  Update value of u(m+1) “more aggressively” at each step
First, note that while evaluating sequentially
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) …
some of the values for m+1 are already available
  u(i,j,m+1) = (u(i-1,j,latest) + u(i+1,j,latest) …
where latest is either m or m+1

Gauss-Seidel
Updating in left-to-right, row-wise order, we get the Gauss-Seidel algorithm:
  for i = 1 to n
    for j = 1 to n
      u(i,j,m+1) = (u(i-1,j,m+1) + u(i+1,j,m) + u(i,j-1,m+1) + u(i,j+1,m) + b(i,j)) / 4
This cannot be parallelized because of dependencies, so instead we use a “red-black” order:
  forall black points u(i,j)
    u(i,j,m+1) = (u(i-1,j,m) + …
  forall red points u(i,j)
    u(i,j,m+1) = (u(i-1,j,m+1) + …
For general graph, use graph coloring
  Can use repeated Maximal Independent Sets to color
  Graph(T) is bipartite => 2 colorable (red and black)
  Nodes for each color can be updated simultaneously
  Still sparse-matrix-vector multiply, using submatrices
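A red-black sweep can be sketched as two passes over a checkerboard. In the code below, which of the two colors is called “black” (here the points with i + j even) is an arbitrary assumption, as are the function name `red_black_sweep` and the storage layout with a ring of zero boundary values.

```python
def red_black_sweep(u, b):
    """One in-place red-black Gauss-Seidel sweep on an (n+2)-by-(n+2) array.

    All points of one color depend only on points of the other color, so each
    of the two passes could be executed fully in parallel.
    """
    n = len(u) - 2
    for color in (0, 1):  # 0: points with i + j even, 1: points with i + j odd
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == color:
                    u[i][j] = (u[i - 1][j] + u[i + 1][j]
                               + u[i][j - 1] + u[i][j + 1] + b[i][j]) / 4
    return u


n = 3
u = [[0.0] * (n + 2) for _ in range(n + 2)]
b = [[0.0] * (n + 2) for _ in range(n + 2)]
b[2][2] = 1.0
red_black_sweep(u, b)
print(u[2][2], u[1][2])
```

Unlike Jacobi, a single sweep already moves the center value to its neighbors, because the second pass uses the values written by the first.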

Successive Overrelaxation (SOR)
Red-black Gauss-Seidel converges twice as fast as Jacobi, but there are twice as many parallel steps, so the two are about the same in practice
To motivate next improvement, write basic step in algorithm as:
  u(i,j,m+1) = u(i,j,m) + correction(i,j,m)
If “correction” is a good direction to move, then one should move even further in that direction by some factor w > 1
  u(i,j,m+1) = u(i,j,m) + w * correction(i,j,m)
Called successive overrelaxation (SOR)
Parallelizes like Jacobi (still sparse-matrix-vector multiply…)
Can prove w = 2/(1 + sin(π/(n+1))) for best convergence
  Number of steps to converge = parallel complexity = O(n), instead of O(n^2) for Jacobi
  Serial complexity O(n^3) = O(N^(3/2)), instead of O(n^4) = O(N^2) for Jacobi
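The following sketch combines the red-black ordering with the overrelaxed update u + w * correction and the formula for w given above. The names `optimal_omega`, `sor_sweep`, and `max_residual`, the 7-by-7 test grid, the right-hand side of all ones, and the 100 sweeps are illustrative assumptions.

```python
import math


def optimal_omega(n):
    """Relaxation factor w = 2 / (1 + sin(pi / (n + 1))) for an n-by-n grid."""
    return 2.0 / (1.0 + math.sin(math.pi / (n + 1)))


def sor_sweep(u, b, w):
    """One in-place red-black SOR sweep; u has a ring of fixed boundary values."""
    n = len(u) - 2
    for color in (0, 1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == color:
                    gs = (u[i - 1][j] + u[i + 1][j]
                          + u[i][j - 1] + u[i][j + 1] + b[i][j]) / 4
                    u[i][j] += w * (gs - u[i][j])  # u + w * correction
    return u


def max_residual(u, b):
    """Largest |4u(i,j) - (sum of neighbors) - b(i,j)| over interior points."""
    n = len(u) - 2
    return max(abs(4 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                   - u[i][j - 1] - u[i][j + 1] - b[i][j])
               for i in range(1, n + 1) for j in range(1, n + 1))


n = 7
u = [[0.0] * (n + 2) for _ in range(n + 2)]
b = [[1.0] * (n + 2) for _ in range(n + 2)]
w = optimal_omega(n)
for _ in range(100):
    sor_sweep(u, b, w)
print(w, max_residual(u, b))
```

Setting w = 1 recovers red-black Gauss-Seidel, which makes it easy to compare iteration counts on the same problem.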

Conjugate Gradient (CG) for solving A*x = b
This method can be used when the matrix A is
  symmetric, i.e., A = A^T
  positive definite, defined equivalently as:
    all eigenvalues are positive
    x^T * A * x > 0 for all nonzero vectors x
    a Cholesky factorization, A = L*L^T, exists
Algorithm maintains 3 vectors
  x = the approximate solution, improved after each iteration
  r = the residual, r = b - A*x
  p = search direction, also called the conjugate gradient
One iteration costs
  Sparse-matrix-vector multiply by A (major cost)
  3 dot products, 3 saxpys (scale*vector + vector)
Converges in O(n) = O(N^(1/2)) steps, like SOR
  Serial complexity = O(N^(3/2))
  Parallel complexity = O(N^(1/2) log N), log N factor from dot-products

Conjugate Gradient Algorithm for Solving Ax=b
Initial guess x
r = b - A*x, j = 1
Repeat
  rho = r^T * r                                                      … dot product
  If j = 1, p = r, else beta = rho/old_rho, p = r + beta*p, endif    … saxpy
  q = A*p                                                            … sparse matrix vector multiply
  alpha = rho / (p^T * q)                                            … dot product
  x = x + alpha * p                                                  … saxpy
  r = r - alpha * q                                                  … saxpy
  old_rho = rho; j = j+1
Until rho small enough
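The pseudocode above maps onto a short Python function. This is a sketch, not a tuned implementation: the matrix is passed as a hypothetical `matvec` callback, the zero initial guess and the stopping rule on rho are assumptions, and the 3-unknown 1D Poisson system is chosen only because its solution, (1, 1, 1), is easy to check by hand.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradient(matvec, b, tol=1e-12, max_iter=1000):
    """Solve A*x = b for symmetric positive definite A, given x -> A*x."""
    x = [0.0] * len(b)
    r = list(b)  # r = b - A*x with the initial guess x = 0
    p = list(r)
    rho = dot(r, r)
    for _ in range(max_iter):
        if rho <= tol * tol:  # "until rho small enough"
            break
        q = matvec(p)  # sparse matrix-vector multiply
        alpha = rho / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]  # saxpy
        r = [ri - alpha * qi for ri, qi in zip(r, q)]  # saxpy
        old_rho, rho = rho, dot(r, r)
        beta = rho / old_rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]  # saxpy
    return x


def poisson1d_matvec(v):
    """Multiply by tridiag(-1, 2, -1) without storing the matrix."""
    n = len(v)
    return [2 * v[j]
            - (v[j - 1] if j > 0 else 0.0)
            - (v[j + 1] if j < n - 1 else 0.0)
            for j in range(n)]


x = conjugate_gradient(poisson1d_matvec, [1.0, 0.0, 1.0])
print(x)
```

Passing the operator as a function mirrors the point made on the slides: CG only needs the ability to multiply by A, never the entries of A.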

Multigrid Motivation
Recall that Jacobi, SOR, CG, or any other sparse-matrix-vector-multiply-based algorithm can only move information one grid cell at a time
  Takes at least n steps to move information across an n x n grid
Can show that decreasing error by a fixed factor c < 1 takes Ω(log n) steps
  Convergence to fixed error < 1 takes Ω(log n) steps
Therefore, converging in O(1) steps requires moving information across the grid faster than to one neighboring grid cell per step

Multigrid Methods
We studied several iterative methods
  Jacobi, SOR, Gauss-Seidel, Red-Black variations, Conjugate Gradients (CG)
  All use sparse matrix-vector multiply (nearest neighbor communication on grid)
Key problem with iterative methods is that:
  detail (short wavelength) is correct
  convergence controlled by coarse (long wavelength) structure
In simple methods one needs of order N^2 iterations to get good results
  Ironically, one goes to large N (fine mesh) to get detail
  If all you wanted was coarse structure, a smaller mesh would be fine
Basic idea in multigrid is key in many areas of science
  Solve a problem at multiple scales
We get coarse structure from small N and fine detail from large N
  Good qualitative idea but how do we implement?
Slide source: Geoffrey Fox and (indirectly) Ulrich Ruede

Gauss Seidel is Slow I
Take Laplace’s equation in the unit square with an initial guess of 0 and boundary conditions that are zero except on one side
For a 31 x 31 grid (N = 31) it takes around 1000 (N^2) iterations to get a reasonable answer
[Figure panels: Boundary Conditions; Exact Solution]
Slide source: Geoffrey Fox and (indirectly) Ulrich Ruede

Gauss Seidel is Slow II
[Figure panels: 1 Iteration; 10 Iterations; 1000 Iterations]
Slide source: Geoffrey Fox and (indirectly) Ulrich Ruede

Multigrid Overview
Basic Algorithm:
  Replace problem on fine grid by an approximation on a coarser grid
  Solve the coarse grid problem approximately, and use the solution as a starting guess for the fine-grid problem, which is then iteratively updated
  Solve the coarse grid problem recursively, i.e. by using a still coarser grid approximation, etc.
Success depends on coarse grid solution being a good approximation to the fine grid

Same Big Idea used elsewhere
Replace fine problem by coarse approximation, recursively
Multilevel Graph Partitioning (METIS):
  Replace graph to be partitioned by a coarser graph, obtained via Maximal Independent Set
  Given partitioning of coarse grid, refine using Kernighan-Lin
Barnes-Hut (and Fast Multipole Method) for computing gravitational forces on n particles in O(n log n) time:
  Approximate particles in box by total mass, center of gravity
  Good enough for distant particles; for close ones, divide box recursively
All examples depend on coarse approximation being accurate enough (at least if we are far enough away)

Multigrid uses Divide-and-Conquer in 2 Ways
First way:
  Solve problem on a given grid by calling Multigrid on a coarse approximation to get a good guess to refine
Second way:
  Think of error as a sum of sine curves of different frequencies
  Each call to Multigrid responsible for suppressing coefficients of sine curves of the lower half of the frequencies in the error (pictures later)

Multigrid Sketch in 1D
Consider a 2^m+1 grid in 1D for simplicity
Let P(i) be the problem of solving the discrete Poisson equation on a 2^i+1 grid in 1D
  Write linear system as T(i) * x(i) = b(i)
P(m), P(m-1), …, P(1) is sequence of problems from finest to coarsest

Multigrid Sketch in 2D
Consider a 2^m+1 by 2^m+1 grid in 2D
Let P(i) be the problem of solving the discrete Poisson equation on a 2^i+1 by 2^i+1 grid in 2D
  Write linear system as T(i) * x(i) = b(i)
P(m), P(m-1), …, P(1) is sequence of problems from finest to coarsest

Multigrid Hierarchy
[Figure: hierarchy of grids with Relax, Interpolate, and Restrict steps]
Slide source: Geoffrey Fox

Basic Multigrid Ideas
In picture, relax is application of standard iteration scheme
  “solve” short wavelength solution at a given level
  i.e. use Jacobi, Gauss-Seidel, Conjugate Gradient
Interpolation is taking a solution at a coarser grid and interpolating to find a solution at half the grid spacing
Restriction is taking solution at given grid and averaging to find solution at coarser grid
[Figure: levels k and k+1 connected by Restrict and Interpolate]
Slide source: Geoffrey Fox

Multigrid Operators
For problem P(i) at varying coarsening levels (i, grid size grows with i):
  b(i) is the Right Hand Side (RHS) and
  x(i) is the current estimated solution
  (both live on grids of size 2^i - 1)
All the following operators just average values on neighboring grid points (so information moves fast on coarse grids)
The restriction operator R(i) maps P(i) to P(i-1)
  Restricts problem on fine grid P(i) to coarse grid P(i-1)
  Uses sampling or averaging
  b(i-1) = R(i) (b(i))
The interpolation operator In(i-1) maps approx. solution x(i-1) to x(i)
  Interpolates solution on coarse grid P(i-1) to fine grid P(i)
  x(i) = In(i-1)(x(i-1))
The solution operator S(i) takes P(i) and improves solution x(i)
  Uses “weighted” Jacobi or SOR on single level of grid
  x improved (i) = S(i) (b(i), x(i))
Overall algorithm, then details of operators

Multigrid V-Cycle Algorithm
Function MGV ( b(i), x(i) )
  … Solve T(i)*x(i) = b(i) given b(i) and an initial guess for x(i)
  … return an improved x(i)
  if (i = 1)
    compute exact solution x(1) of P(1)           … only 1 unknown
    return x(1)
  else
    x(i) = S(i) (b(i), x(i))                      … improve solution by damping high frequency error
    r(i) = T(i)*x(i) - b(i)                       … compute residual
    d(i) = In(i-1) ( MGV( R(i) ( r(i) ), 0 ) )    … solve T(i)*d(i) = r(i) recursively
    x(i) = x(i) - d(i)                            … correct fine grid solution
    x(i) = S(i) ( b(i), x(i) )                    … improve solution again
    return x(i)
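A runnable 1D version of the V-cycle can be sketched as follows. Several choices here are assumptions rather than something the slides fix: the operator is scaled as (-x[j-1] + 2x[j] - x[j+1]) / h^2 so that the restricted residual can be used on the coarse grid without an extra factor, boundary values are zero and stored in the lists, S is one weighted-Jacobi step, and all function names are hypothetical.

```python
def smooth(x, b, h):
    """One weighted-Jacobi step (weight 2/3) for (-x[j-1] + 2x[j] - x[j+1]) / h^2 = b[j]."""
    n = len(x)
    inner = [(x[j - 1] + x[j] + x[j + 1] + h * h * b[j]) / 3 for j in range(1, n - 1)]
    return [x[0]] + inner + [x[-1]]


def residual(x, b, h):
    """r = T*x - b at interior points, zero at the two boundary points."""
    n = len(x)
    inner = [(2 * x[j] - x[j - 1] - x[j + 1]) / (h * h) - b[j] for j in range(1, n - 1)]
    return [0.0] + inner + [0.0]


def restrict(r):
    """Full weighting (1/4, 1/2, 1/4); coarse point k sits at fine point 2k."""
    nc = (len(r) - 1) // 2 + 1
    inner = [0.25 * r[2 * k - 1] + 0.5 * r[2 * k] + 0.25 * r[2 * k + 1]
             for k in range(1, nc - 1)]
    return [0.0] + inner + [0.0]


def interpolate(xc):
    """Linear interpolation from the coarse grid to the next finer grid."""
    xf = [0.0] * (2 * (len(xc) - 1) + 1)
    for k in range(len(xc)):
        xf[2 * k] = xc[k]
    for k in range(len(xc) - 1):
        xf[2 * k + 1] = 0.5 * (xc[k] + xc[k + 1])
    return xf


def mgv(b, x, h):
    """One multigrid V-cycle for T*x = b with zero boundary values."""
    if len(x) == 3:  # P(1): a single unknown, solved exactly
        return [0.0, h * h * b[1] / 2, 0.0]
    x = smooth(x, b, h)
    r = residual(x, b, h)
    coarse_guess = [0.0] * ((len(x) - 1) // 2 + 1)
    d = interpolate(mgv(restrict(r), coarse_guess, 2 * h))
    x = [xi - di for xi, di in zip(x, d)]
    return smooth(x, b, h)


# Illustrative problem: -u'' = 1 on (0, 1), whose solution is u = t(1 - t)/2.
m = 5
n = 2 ** m + 1
h = 1.0 / (n - 1)
b = [1.0] * n
x = [0.0] * n
for _ in range(30):
    x = mgv(b, x, h)
print(max(abs(x[j] - (j * h) * (1 - j * h) / 2) for j in range(n)))
```

The printed number is the largest deviation from the known solution after 30 cycles; running fewer cycles shows how quickly the error drops per V-cycle.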

This is called a V-Cycle
Just a picture of the call graph
In time a V-cycle looks like the following
[Figure: V-shaped call graph, descending from the finest to the coarsest grid and back]

Complexity of a V-Cycle
On a serial machine
  Work at each point in a V-cycle is O(the number of unknowns)
  Cost of level i is (2^i - 1)^2 = O(4^i) for a 2D grid
  If finest grid level is m, total time is:
    $\sum_{i=1}^{m} O(4^i) = O(4^m)$ for a 2D grid
    = O(# unknowns) in general
On a parallel machine (PRAM)
  with one processor per grid point and free communication, each step in the V-cycle takes constant time
  Total V-cycle time is O(m) = O(log #unknowns)
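The serial bound follows from a geometric series; written out as a worked step (the constant 4/3 is the only thing added here):

```latex
\sum_{i=1}^{m} 4^{i} \;=\; \frac{4^{m+1} - 4}{3} \;<\; \frac{4}{3}\, 4^{m}
```

So all the coarser levels together cost less than one third of the work done on the finest grid alone.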

Full Multigrid (FMG)
Intuition:
  improve solution by doing multiple V-cycles
  avoid expensive fine-grid (high frequency) cycles
  analysis of why this works is beyond the scope of this class
Function FMG (b(m), x(m))
  … return improved x(m) given initial guess
  compute the exact solution x(1) of P(1)
  for i = 2 to m
    x(i) = MGV ( b(i), In(i-1) (x(i-1)) )
In other words:
  Solve the problem with 1 unknown
  Given a solution to the coarser problem, P(i-1), map it to starting guess for P(i)
  Solve the finer problem using the Multigrid V-cycle

Full Multigrid Cost Analysis
One V for each call to FMG
  people also use Ws and other compositions
Serial time: $\sum_{i=1}^{m} O(4^i) = O(4^m)$ = O(# unknowns)
PRAM time: $\sum_{i=1}^{m} O(i) = O(m^2)$ = O(log^2 # unknowns)

Complexity of Solving Poisson’s Equation
Theorem: error after one FMG call:
  error_after <= (.5 * error_before)
  independent of # unknowns
  At least 1 bit each time
Corollary: We can make the error < any fixed tolerance in a fixed number of steps, independent of size of finest grid
This is the most important convergence property of MG, distinguishing it from all other methods, which converge more slowly for large grids
Total complexity just proportional to cost of one FMG call

The Solution Operator S(i) - Details
The solution operator, S(i), is a weighted Jacobi
Consider the 1D problem
At level i, pure Jacobi replaces:
  x(j) := 1/2 (x(j-1) + x(j+1) + b(j))
Weighted Jacobi uses:
  x(j) := 1/3 (x(j-1) + x(j) + x(j+1) + b(j))
In 2D, similar average of nearest neighbors
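To see why this weighting is useful as a smoother, the sketch below applies one weighted-Jacobi step to two pure sine-wave errors with b = 0, so the step acts directly on the error. The grid size of 15 unknowns, the zero boundary values, and the names `weighted_jacobi` and `norm` are illustrative assumptions.

```python
import math


def weighted_jacobi(x, b):
    """x(j) := (x(j-1) + x(j) + x(j+1) + b(j)) / 3, with zero boundary values."""
    n = len(x)

    def get(j):
        return x[j] if 0 <= j < n else 0.0

    return [(get(j - 1) + x[j] + get(j + 1) + b[j]) / 3 for j in range(n)]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


n = 15
zero_rhs = [0.0] * n
ratios = {}
for k in (1, n - 1):  # lowest frequency and a high frequency
    e = [math.sin(k * math.pi * (j + 1) / (n + 1)) for j in range(n)]
    ratios[k] = norm(weighted_jacobi(e, zero_rhs)) / norm(e)
print(ratios)
```

The low-frequency error is barely reduced while the high-frequency error shrinks by more than a factor of three in one step, which is the behavior the coarse-grid correction relies on.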

Weighted Jacobi chosen to damp high frequency error
[Figure: three error plots]
Initial error
  “Rough”
  Lots of high frequency components
  Norm = 1.65
Error after 1 weighted Jacobi step
  “Smoother”
  Less high frequency component
  Norm = 1.06
Error after 2 weighted Jacobi steps
  “Smooth”
  Little high frequency component
  Norm = .92, won’t decrease much more
Norm is probably the root sum of squares of the errors (2-norm)

Multigrid as Divide and Conquer Algorithm
Each level in a V-Cycle reduces the error in one part of the frequency domain

The Restriction Operator R(i) - Details
The restriction operator, R(i), takes
  a problem P(i) with RHS b(i) and
  maps it to a coarser problem P(i-1) with RHS b(i-1)
In 1D, average values of neighbors
  xcoarse(i) = 1/4 * xfine(i-1) + 1/2 * xfine(i) + 1/4 * xfine(i+1)
In 2D, average with all 8 neighbors (N,S,E,W,NE,NW,SE,SW)

Interpolation Operator In(i-1): details
The interpolation operator In(i-1) takes a function on a coarse grid P(i-1) and produces a function on a fine grid P(i)
In 1D, linearly interpolate nearest coarse neighbors
  xfine(i) = xcoarse(i) if the fine grid point i is also a coarse one, else
  xfine(i) = 1/2 * xcoarse(left of i) + 1/2 * xcoarse(right of i)
In 2D, interpolation requires averaging with 4 nearest neighbors (NW,SW,NE,SE)

Convergence Picture of Multigrid in 1D

Convergence Picture of Multigrid in 2D

Parallel 2D Multigrid
Multigrid on 2D requires nearest neighbor (up to 8) computation at each level of the grid
Start with n = 2^m+1 by 2^m+1 grid (here m = 5)
Use an s by s processor grid (here s = 4)

Performance Model of parallel 2D Multigrid (V-cycle)
Assume 2^m+1 by 2^m+1 grid of points, n = 2^m - 1, N = n^2
Assume p = 4^k processors, arranged in 2^k by 2^k grid
  Processors start with 2^(m-k) by 2^(m-k) subgrid of unknowns
Consider V-cycle starting at level m
  At levels m through k of V-cycle, each processor does some work
  At levels k-1 through 1, some processors are idle, because a 2^(k-1) by 2^(k-1) grid of unknowns cannot occupy each processor
Cost of one level in V-cycle
  If level j >= k, then cost =
      O(4^(j-k))       … Flops, proportional to the number of grid points/processor
    + O(1) α           … Send a constant # messages to neighbors
    + O(2^(j-k)) β     … Number of words sent
  If level j < k, then cost =
      O(1)             … Flops, proportional to the number of grid points/processor
    + O(1) α           … Send a constant # messages to neighbors
    + O(1) β           … Number of words sent
Sum over all levels in all V-cycles in FMG to get complexity
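The per-level costs can be added up mechanically. The sketch below sums them over one V-cycle; setting every constant hidden by O(.) to 1 is a simplifying assumption, and the function name `vcycle_cost` and the `flop` parameter (time per flop) are hypothetical.

```python
def vcycle_cost(m, k, alpha, beta, flop=1.0):
    """Modeled time of one V-cycle on p = 4**k processors, finest level m.

    alpha = time per message, beta = time per word, flop = time per flop.
    Every O(.) constant is taken to be 1 (an assumption).
    """
    total = 0.0
    for j in range(1, m + 1):
        if j >= k:
            # each processor owns a 2**(j-k) by 2**(j-k) block at this level
            total += flop * 4 ** (j - k) + alpha + beta * 2 ** (j - k)
        else:
            # coarse levels: at most a constant amount of work per processor
            total += flop + alpha + beta
    return total


# Illustrative numbers only: m = 10, k = 3, messages 1000x the cost of a flop.
print(vcycle_cost(10, 3, alpha=1000.0, beta=10.0))
```

With a large alpha the per-level message cost dominates unless the finest subgrids are big, which is the effect discussed under practicalities.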

Comparison of Methods (in O(.) sense)
       # Flops                # Messages    # Words sent
MG     N/p + log p * log N    (log N)^2     (N/p)^(1/2) + log p * log N
FFT    N log N / p            p^(1/2)       N/p
SOR    N^(3/2) / p            N^(1/2)       N/p

SOR is slower than others on all counts
Flops for MG and FFT depends on accuracy of MG
MG communicates less total data (bandwidth)
Total messages (latency) depends …
  This coarse analysis can’t say whether MG or FFT is better when α >> β

Practicalities
In practice, we don’t go all the way to P(1)
In sequential code, the coarsest grids are negligibly cheap, but on a parallel machine they are not.
  Consider 1000 points per processor
  In 2D, the surface to communicate is 4 x sqrt(1000) ~= 128, or 13%
  In 3D, the surface is ~= 500, or 50%
See Tuminaro and Womble, SIAM J. Sci. Comp., v14, n5, 1993 for analysis of MG on 1024 nCUBE2
  on 64x64 grid of unknowns, only 4 per processor
    efficiency of 1 V-cycle was .02, and on FMG .008
  on 1024x1024 grid
    efficiencies were .7 (MG V-cycle) and .42 (FMG)
    although worse parallel efficiency, FMG is 2.6 times faster than V-cycles alone
  nCUBE had fast communication, slow processors

Multigrid on an Adaptive Mesh
For problems with very large dynamic range, another level of refinement is needed
Build adaptive mesh and solve multigrid (typically) at each level
Can’t afford to use finest mesh everywhere

Multiblock Applications
Solve system of equations on a union of rectangles
  subproblem of AMR
E.g., [Figure: example multiblock domain]

Adaptive Mesh Refinement
Data structures in AMR
Usual parallelism is to assign grids on each level to processors
Load balancing is a problem

Support for AMR
Domains in Titanium designed for this problem
Kelp, Boxlib, and AMR++ are libraries for this
Primitives to help with boundary value updates, etc.

Multigrid on an Unstructured Mesh
Another approach to variable activity is to use an unstructured mesh that is more refined in areas of interest
Adaptive rectangular or unstructured?
  Numerics easier on rectangular
  Supposedly easier to implement (arrays without indirection) but boundary cases tend to dominate code
Up to 39M unknowns on 960 processors, with 50% efficiency (Source: M. Adams)

Multigrid on an Unstructured Mesh
Need to partition graph for parallelism
What does it mean to do Multigrid anyway?
Need to be able to coarsen grid (hard problem)
  Can’t just pick “every other grid point” anymore
  Use “maximal independent sets” again
  How to make coarse graph approximate fine one
Need to define R() and In()
  How do we convert from coarse to fine mesh and back?
Need to define S()
  How do we define coarse matrix (no longer formula, like Poisson)
Dealing with coarse meshes efficiently
  Should we switch to using fewer processors on coarse meshes?
  Should we switch to another solver on coarse meshes?

Source of Unstructured Finite Element Mesh: Vertebra
Source: M. Adams, H. Bayraktar, T. Keaveny, P. Papadopoulos, A. Gupta

Multigrid for nonlinear elastic analysis of bone
Source: M. Adams et al
Mechanical testing for material properties
3D image
µFE mesh: 2.5 mm^3, 44 µm elements
Micro computed tomography at 22 µm resolution
Up to 537M unknowns
4088 processors (ASCI White)
70% parallel efficiency

Extra Slides

Preconditioning
One can change A and b by preconditioning
  M1^(-1) A (M2^(-1) M2) x = M1^(-1) b
is the same equation as before for any choice of matrices M1 and M2
All these choices are designed to accelerate convergence of iterative methods
  Anew = M1^(-1) A M2^(-1)
  xnew = M2 x
  bnew = M1^(-1) b
We choose M1 and M2 to make our standard methods perform better
There are specialized preconditioning ideas and perhaps better general approaches such as multigrid and Incomplete LU (ILU) decomposition
  Anew xnew = bnew
has same form as above and we can apply any of the methods that we used on A x = b
Slide source: Geoffrey Fox
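A tiny numerical check makes the “same equation” claim concrete. The example below uses the simplest possible choice, M1 = diag(A) and M2 = I (diagonal scaling, chosen here purely for illustration), on a made-up 2-by-2 system; the helper `solve2` is hypothetical.

```python
def solve2(A, b):
    """Solve a 2-by-2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]


A = [[4.0, 1.0], [1.0, 2.0]]
b = [6.0, 5.0]

# M1 = diag(A), M2 = I, so Anew = M1^(-1) A and bnew = M1^(-1) b.
A_new = [[A[i][j] / A[i][i] for j in range(2)] for i in range(2)]
b_new = [b[i] / A[i][i] for i in range(2)]

x = solve2(A, b)
x_new = solve2(A_new, b_new)  # equals M2 x = x because M2 = I
print(x, x_new)
```

Both systems have the solution (1, 2); only the matrix seen by the iterative method changes, and with it the convergence rate.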

Multigrid Philosophically
Suppose we have a finest level M(1) with N by N points (in 2D)
Then the k’th coarsest approximation M(k) to this has N/2^k by N/2^k points
One way to think about Multigrid is that M(k+1) can form a preconditioner to M(k) so that one can replace natural matrix A(k) by A^(-1)(k+1) A(k)
  A^(-1)(k+1) A(k) is a nicer matrix than A(k) and iterative solvers converge faster as long wavelength terms have been removed
Basic issue is that A(k) and A(k+1) are of different size so we need prolongation and restriction to map solutions at level k and k+1
  We apply this idea recursively
Slide source: Geoffrey Fox and (indirectly) Ulrich Ruede

Multigrid Algorithm: procedure MG(level, A, u, f)
if level = coarsest then
  solve coarsest grid equation by another method “exactly”
else
  smooth A_level u = f (m1 times)
  Compute residual r = f - A_level u
  Restrict F = R r (R is Restriction Operator)
  Call MG(level + 1, A(level+1), V, F) (mc times)
  Interpolate v = P V (Interpolate new solution at this level)
  correct u_new = u + v
  smooth A u_new = f (m2 times) and set u = u_new
endif
endprocedure
Note: A_level u_exact = f and u_exact = u + v, so A_level v = r = f - A_level u
Slide source: Geoffrey Fox

Multigrid Cycles
The parameter mc determines the exact strategy for how one iterates through different meshes
One ends up with a full cycle as shown
[Figure: V cycle, W cycle, and full multigrid cycle, each drawn between the finest grid and the coarsest grid]
Slide source: Geoffrey Fox

Templates for choosing an iterative algorithm for Ax=b
Use only matrix-vector-multiplication, dot, saxpy, …
Symmetric?
  N: A^T available?
    Y: Try QMR
    N: Storage expensive?
      Y: Try CGS or Bi-CGSTAB
      N: Try GMRES
  Y: Definite?
    N: Try Minres or CG
    Y: Eigenvalues known?
      Y: Try CG or Chebyshev
      N: Try CG

Summary of Jacobi, SOR and CG
Jacobi, SOR, and CG all perform sparse-matrix-vector multiply
For Poisson, this means nearest neighbor communication on an n-by-n grid
It takes n = N^(1/2) steps for information to travel across an n-by-n grid
Since solution on one side of grid depends on data on other side of grid, faster methods require faster ways to move information
  FFT
  Multigrid
