
Slide 1: CS 267: Applications of Parallel Computers, Structured Grids. Kathy Yelick, www.cs.berkeley.edu/~yelick/cs267_sp07

Slide 2: Outline
°Review PDEs
°Overview of Methods for the Poisson Equation
°Jacobi's method
°Red-Black SOR method
°Conjugate Gradient
°Optimizing Stencils for Memory Hierarchies
°Parallel Stencil Computations
°FFT (future lecture)
°Multigrid (future lecture)

Slide 3: Solving PDEs
°Hyperbolic problems (waves): Sound wave(position, time)
  Use explicit time-stepping
  Solution at each point depends on neighbors at the previous time
°Elliptic (steady state) problems: Electrostatic Potential(position)
  Everything depends on everything else
  This means locality is harder to find than in hyperbolic problems
°Parabolic (time-dependent) problems: Temperature(position, time)
  Involves an elliptic solve at each time-step
°Focus on elliptic problems
  Canonical example is the Poisson equation:
  ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)

Slide 4: Explicit Solution of the Poisson Equation
°Explicit methods are often used for hyperbolic PDEs
°Computation corresponds to
  Matrix-vector multiply
  Combine nearest neighbors on the grid
°Use finite differences with u[j,i] as the temperature at
  time t = i*dt (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
  initial conditions on u[j,0]
  boundary conditions on u[0,i] and u[N,i]
°At each timestep i = 0,1,2,...
  For j = 1 to N-1
    u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
  where z = C*dt/h²
(figure: space-time grid with position axis j and time axis i, timesteps i = 0..5, and initial values u[0,0]..u[5,0])
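
A minimal C sketch of the update above (the function name and parameters are illustrative): one explicit timestep for the 1D heat equation, holding the boundary values fixed. Stability of this explicit scheme requires z = C*dt/h² < 1/2.

void heat_step(const double *u, double *unew, int N, double z) {
    unew[0] = u[0];          /* boundary condition held fixed */
    unew[N] = u[N];          /* boundary condition held fixed */
    for (int j = 1; j < N; j++)
        unew[j] = z * u[j - 1] + (1 - 2 * z) * u[j] + z * u[j + 1];
}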

Slide 5: Matrix View of Explicit Method for Heat
u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
u[:, i+1] = T * u[:, i], where T is tridiagonal:

      [1-2z   z            ]
  T = [  z  1-2z   z       ]  = I - z*L
      [       z  1-2z   z  ]
      [            z  1-2z ]

with the 1D Laplacian

      [ 2 -1       ]
  L = [-1  2 -1    ]
      [   -1  2 -1 ]
      [      -1  2 ]

Graph and "3 point stencil": each point couples to itself with weight 1-2z and to its left and right neighbors with weight z.
For a 2D mesh (5-point stencil) the Laplacian is pentadiagonal (5 nonzeros per row, along diagonals); for a 3D mesh there are 7 nonzeros per row.

Slide 6: Algorithms for 2D (3D) Poisson Equation (N = n² (n³) vars)

  Algorithm       | Serial         | PRAM                  | Memory          | #Procs
  Dense LU        | N³             | N                     | N²              | N²
  Band LU         | N² (N^7/3)     | N                     | N^3/2 (N^5/3)   | N (N^4/3)
  Jacobi          | N² (N^5/3)     | N (N^2/3)             | N               | N
  Explicit Inv.   | N²             | log N                 | N²              | N²
  Conj. Gradients | N^3/2 (N^4/3)  | N^1/2 (N^1/3) * log N | N               | N
  Red/Black SOR   | N^3/2 (N^4/3)  | N^1/2 (N^1/3)         | N               | N
  Sparse LU       | N^3/2 (N²)     | N^1/2                 | N*log N (N^4/3) | N
  FFT             | N*log N        | log N                 | N               | N
  Multigrid       | N              | log² N                | N               | N
  Lower bound     | N              | log N                 | N               |

PRAM is an idealized parallel model with zero-cost communication.
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.

Slide 7: Overview of Algorithms on n x n grid (n x n x n grid) (1)
°Sorted in two orders (roughly):
  from slowest to fastest on sequential machines
  from most general (works on any matrix) to most specialized (works on matrices "like" T)
°Dense LU: Gaussian elimination; works on any N-by-N matrix.
°Band LU: Exploits the fact that T is nonzero only on the N^1/2 (N^2/3) diagonals nearest the main diagonal.
°Jacobi: Essentially does matrix-vector multiply by T in the inner loop of an iterative algorithm.
°Explicit Inverse: Assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it (but still expensive!).
°Conjugate Gradient: Uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not.
°Red-Black SOR (successive over-relaxation): Variation of Jacobi that exploits yet different mathematical properties of T. Used in multigrid schemes.
°Sparse LU: Gaussian elimination exploiting the particular zero structure of T.
°FFT (fast Fourier transform): Works only on matrices very like T.
°Multigrid: Also works on matrices like T, that come from elliptic PDEs.
°Lower Bound: Serial (time to print the answer); parallel (time to combine N inputs).
°Details in class notes and www.cs.berkeley.edu/~demmel/ma221.

Slide 8: Overview of Algorithms on n x n grid (n x n x n grid) (2)
°Dimension = N = n² in 2D (= n³ in 3D)
°Bandwidth = BW = n in 2D (= n² in 3D)
°Condition number = k = n² in 2D or 3D
°#its = number of iterations
°SpMV = sparse-matrix-vector multiply
°Dense LU: cost = N³ (serial) or N (parallel, with N² procs)
°Band LU: cost = N * BW² (serial) or N (parallel, with BW² procs)
°Jacobi: cost = cost(SpMV) * #its, #its = k
°Conjugate Gradient: cost = (cost(SpMV) + cost(dot prod)) * #its, #its = sqrt(k)
°Red-Black SOR: cost = cost(SpMV) * #its, #its = sqrt(k)
°Iterative algorithms need different assumptions for convergence analysis
  SpMV cost is the cost of a nearest-neighbor relaxation, which is O(N)

Slide 9: Jacobi's Method
°To derive Jacobi's method, write Poisson as:
  u(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j))/4
°Let u(i,j,m) be the approximation for u(i,j) after m steps:
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + u(i,j-1,m) + u(i,j+1,m) + b(i,j))/4
°I.e., u(i,j,m+1) is a weighted average of neighbors
°Motivation: u(i,j,m+1) is chosen to exactly satisfy the equation at (i,j)
°Steps to converge are proportional to problem size, N = n²
  See Lecture 24 of www.cs.berkeley.edu/~demmel/cs267_Spr99
°Therefore, serial complexity is O(N²)
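
As a concrete illustration, here is one Jacobi sweep in C for the 2D equation above (a sketch; the flattened row-major layout with a one-cell boundary ring, and the IDX helper macro, are assumptions):

#include <stddef.h>
#define IDX(i, j, n) ((size_t)(i) * ((n) + 2) + (j))

/* One Jacobi sweep on an n x n interior with a boundary ring.
 * unew and u must be distinct arrays (the two copies for steps m and m+1). */
void jacobi_sweep(const double *u, double *unew, const double *b, int n) {
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            unew[IDX(i, j, n)] = 0.25 * (u[IDX(i - 1, j, n)] + u[IDX(i + 1, j, n)]
                                       + u[IDX(i, j - 1, n)] + u[IDX(i, j + 1, n)]
                                       + b[IDX(i, j, n)]);
}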

Slide 10: Convergence of Nearest Neighbor Methods
°Jacobi's method involves a nearest neighbor computation on an n x n grid (N = n²)
  So it takes O(n) = O(sqrt(N)) iterations for information to propagate
°E.g., consider a right-hand side b that is 0, except the center is 1
°The exact solution looks like: (figure: solution peaked at the center, decaying toward the boundary)
Even in the best case, any nearest neighbor computation will take n/2 steps to propagate on an n x n grid.

Slide 11: Convergence of Nearest Neighbor Methods

Slide 12: Improvements on Jacobi
°Similar to Jacobi: u(i,j,m+1) is computed as a linear combination of neighbors
°Numeric coefficients and update order are different
°Based on 2 improvements over Jacobi:
  Use the "most recent values" of u that are available, since these are probably more accurate
  Update the value of u(m+1) "more aggressively" at each step
°First, note that while evaluating sequentially
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + …
some of the values for step m+1 are already available:
  u(i,j,m+1) = (u(i-1,j,latest) + u(i+1,j,latest) + …
where latest is either m or m+1

Slide 13: Gauss-Seidel
°Updating in left-to-right, row-wise order, we get the Gauss-Seidel algorithm:
  for i = 1 to n
    for j = 1 to n
      u(i,j,m+1) = (u(i-1,j,m+1) + u(i+1,j,m) + u(i,j-1,m+1) + u(i,j+1,m) + b(i,j)) / 4
°This cannot be parallelized because of the dependencies, so instead we use a "red-black" order (see the sketch below):
  forall black points u(i,j)
    u(i,j,m+1) = (u(i-1,j,m) + …
  forall red points u(i,j)
    u(i,j,m+1) = (u(i-1,j,m+1) + …
°For a general graph, use graph coloring
°Can use repeated Maximal Independent Sets to color
°Graph(T) is bipartite => 2-colorable (red and black)
°Nodes of each color can be updated simultaneously
°Still sparse-matrix-vector multiply, using submatrices
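
A sketch of the red-black sweep in C (same layout assumptions and IDX macro as the Jacobi sketch above); within each color pass there are no dependencies, so both inner loops can run in parallel:

#include <stddef.h>
#define IDX(i, j, n) ((size_t)(i) * ((n) + 2) + (j))

/* In-place red-black Gauss-Seidel sweep: one color at a time;
 * points of one color read only points of the other color. */
void red_black_sweep(double *u, const double *b, int n) {
    for (int color = 0; color <= 1; color++)        /* 0 then 1 */
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                if ((i + j) % 2 == color)
                    u[IDX(i, j, n)] = 0.25 * (u[IDX(i - 1, j, n)] + u[IDX(i + 1, j, n)]
                                            + u[IDX(i, j - 1, n)] + u[IDX(i, j + 1, n)]
                                            + b[IDX(i, j, n)]);
}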

Slide 14: Successive Overrelaxation (SOR)
°Red-black Gauss-Seidel converges twice as fast as Jacobi, but there are twice as many parallel steps, so it is the same in practice
°To motivate the next improvement, write the basic step of the algorithm as:
  u(i,j,m+1) = u(i,j,m) + correction(i,j,m)
°If "correction" is a good direction to move, then one should move even further in that direction by some factor w > 1:
  u(i,j,m+1) = u(i,j,m) + w * correction(i,j,m)
°Called successive overrelaxation (SOR); see the sketch below
°Parallelizes like Jacobi (still sparse-matrix-vector multiply…)
°Can prove w = 2/(1 + sin(π/(n+1))) gives the best convergence
  Number of steps to converge = parallel complexity = O(n), instead of O(n²) for Jacobi
  Serial complexity O(n³) = O(N^3/2), instead of O(n⁴) = O(N²) for Jacobi
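
The over-relaxed update drops into the red-black sweep above by replacing the plain assignment with a weighted correction; a sketch, reusing the IDX macro from the sketches above (w is passed in, e.g. the optimal value quoted on this slide):

/* SOR update for one point: compute the Gauss-Seidel value, then move
 * w times the correction (w = 1 recovers Gauss-Seidel). */
void sor_update(double *u, const double *b, int i, int j, int n, double w) {
    double gs = 0.25 * (u[IDX(i - 1, j, n)] + u[IDX(i + 1, j, n)]
                      + u[IDX(i, j - 1, n)] + u[IDX(i, j + 1, n)]
                      + b[IDX(i, j, n)]);
    u[IDX(i, j, n)] += w * (gs - u[IDX(i, j, n)]);   /* u + w*correction */
}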

Slide 15: Conjugate Gradient (CG) for solving A*x = b
°This method can be used when the matrix A is
  symmetric, i.e., A = A^T
  positive definite, defined equivalently as:
  - all eigenvalues are positive
  - x^T * A * x > 0 for all nonzero vectors x
  - a Cholesky factorization A = L*L^T exists
°The algorithm maintains 3 vectors:
  x = the approximate solution, improved after each iteration
  r = the residual, r = b - A*x
  p = the search direction, also called the conjugate gradient
°One iteration costs:
  Sparse-matrix-vector multiply by A (major cost)
  3 dot products, 3 saxpys (scale*vector + vector)
°Converges in O(n) = O(N^1/2) steps, like SOR
  Serial complexity = O(N^3/2)
  Parallel complexity = O(N^1/2 log N); the log N factor comes from the dot products

Slide 16: Conjugate Gradient Algorithm for Solving Ax = b
°Initial guess x
°r = b - A*x, j = 1
°Repeat
  rho = r^T * r                                                      … dot product
  If j = 1, p = r, else beta = rho/old_rho, p = r + beta*p, endif    … saxpy
  q = A*p                                                            … sparse matrix-vector multiply
  alpha = rho / (p^T * q)                                            … dot product
  x = x + alpha * p                                                  … saxpy
  r = r - alpha * q                                                  … saxpy
  old_rho = rho; j = j+1
°Until rho small enough
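
A direct C transcription of this pseudocode (a sketch: the matvec is supplied as a callback and the workspace vectors r, p, q are caller-allocated, which is an assumed interface, not part of the slide):

#include <math.h>
#include <string.h>

typedef void (*matvec_fn)(const double *x, double *y, int n);

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Solve A*x = b by conjugate gradients; x holds the initial guess. */
void cg(matvec_fn A, const double *b, double *x,
        double *r, double *p, double *q, int n, double tol, int maxit) {
    A(x, q, n);
    for (int i = 0; i < n; i++) r[i] = b[i] - q[i];    /* r = b - A*x */
    double rho = dot(r, r, n), old_rho = 0.0;
    for (int j = 1; j <= maxit && sqrt(rho) > tol; j++) {
        if (j == 1)
            memcpy(p, r, (size_t)n * sizeof(double));
        else {
            double beta = rho / old_rho;
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];  /* saxpy */
        }
        A(p, q, n);                                    /* q = A*p: the SpMV, major cost */
        double alpha = rho / dot(p, q, n);
        for (int i = 0; i < n; i++) x[i] += alpha * p[i];   /* saxpy */
        for (int i = 0; i < n; i++) r[i] -= alpha * q[i];   /* saxpy */
        old_rho = rho;
        rho = dot(r, r, n);
    }
}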

Slide 17: Optimizations of Structured Grid Operations
Sparse matrix-vector multiply (aka nearest neighbor operations on a grid) is a critical component.
For arbitrary grids and matrices, we will talk about optimizations in a future lecture.
For structured grids we can look at:
  Memory system performance
  Blocking/tiling as in matrix multiplication
  Parallel implementations and optimizations

Slide 18: Basic Stencil Code for a 3D 7-Point Stencil
The slide's pseudocode, filled out into compilable C (S0 and S1 are the stencil coefficients; k is the unit-stride dimension):

#include <stddef.h>

void stencil3d(const double A[], double B[], int nx, int ny, int nz,
               double S0, double S1) {
  #define IX(i, j, k) (((size_t)(i) * ny + (j)) * nz + (k))
  for (int i = 1; i < nx - 1; i++)        /* x-dim */
    for (int j = 1; j < ny - 1; j++)      /* y-dim */
      for (int k = 1; k < nz - 1; k++)    /* z-dim (unit stride) */
        B[IX(i, j, k)] = S0 * A[IX(i, j, k)]
                       + S1 * (A[IX(i + 1, j, k)] + A[IX(i - 1, j, k)]
                             + A[IX(i, j + 1, k)] + A[IX(i, j - 1, k)]
                             + A[IX(i, j, k + 1)] + A[IX(i, j, k - 1)]);
  #undef IX
}

Slide 19: 3D (7-point) Stencils Achieve a Low Fraction of Peak
Even on a single processor, performance (of the stencil) is a low fraction of peak …

Slide 20: Tiling Stencil Computations
°There are several papers on tiling stencil computations
  E.g., Rivera and Tseng, SC2002, … (see the sketch below)
°Old conventional wisdom: cache misses are the most important factor
(figure: ratio of cache misses blocked/unblocked vs. ratio of time unblocked/blocked)
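
A sketch of a Rivera/Tseng-style tiled version of the stencil3d code above: tile the two non-unit-stride dimensions by TI x TJ and always sweep the unit-stride k dimension in full (the tile parameters and MIN helper are illustrative):

#include <stddef.h>
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void stencil3d_blocked(const double *A, double *B, int nx, int ny, int nz,
                       double S0, double S1, int TI, int TJ) {
  #define IX(i, j, k) (((size_t)(i) * ny + (j)) * nz + (k))
  for (int ii = 1; ii < nx - 1; ii += TI)           /* tile loops */
    for (int jj = 1; jj < ny - 1; jj += TJ)
      for (int i = ii; i < MIN(ii + TI, nx - 1); i++)
        for (int j = jj; j < MIN(jj + TJ, ny - 1); j++)
          for (int k = 1; k < nz - 1; k++)          /* full unit-stride sweep */
            B[IX(i, j, k)] = S0 * A[IX(i, j, k)]
                           + S1 * (A[IX(i + 1, j, k)] + A[IX(i - 1, j, k)]
                                 + A[IX(i, j + 1, k)] + A[IX(i, j - 1, k)]
                                 + A[IX(i, j, k + 1)] + A[IX(i, j, k - 1)]);
  #undef IX
}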

Slide 21: Stanza Triad Results
°With perfect prefetching, performance would be independent of L, the stanza length
  we would expect a flat line at STREAM peak
  our results show performance depends on L

Slide 22: Cost Model for Memory Cost
°Goal: model the cost of memory operations for stencils
  understand the effect of prefetch on performance
°Start with a model for Stanza Triad
°Simplified (2-point) model of a prefetch engine:
  any non-consecutive cache line is not prefetched
  any consecutive cache line is prefetched
  in reality, more than one cache miss is needed to enable prefetch
  may have more than 2 points (a step function based on size)
°The first cache line in every L-element stanza is not prefetched
  assign cost C_non-prefetched
  get its value from Stanza Triad with L = cache line size
°The rest of the cache lines are prefetched
  assign cost C_prefetched
  get its value from Stanza Triad with large L
°Total cost: Cost = #non-prefetched * C_non-prefetched + #prefetched * C_prefetched
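
The total-cost formula is straightforward to evaluate; a sketch (all names are illustrative) for an access pattern of n elements made of L-element stanzas, with cache line size CL in elements:

/* Two-point prefetch cost model: the first cache line of each stanza
 * costs c_np (not prefetched); the remaining lines cost c_p each. */
double stanza_cost(long n, long L, long CL, double c_np, double c_p) {
    long stanzas = n / L;
    long lines_per_stanza = L / CL;
    return (double)stanzas * (c_np + (double)(lines_per_stanza - 1) * c_p);
}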

Slide 23: Stencil Cost Model
°Lower bound: load the entire grid once and store it once
  Memory traffic: 2 * N³
°Upper bound: only the k plane stays in cache; must reload the other 2 planes
  Memory traffic: 4 * N³
°Model built using parameters measured in STriad
(figure: the k-1, k, and k+1 planes of the 3D grid)

Slide 24: Stencil Probe Cost Model

Slide 25: Stencil Probe Cost Model

Slide 26: Stencil Probe Cost Model

Slide 27: Stencil Probe Cache Blocking Speedups
New conventional wisdom: prefetching is as important as caching
°Little's Law (Bailey '97): data in flight needed = latency * bandwidth
°Cache blocking is useful for:
  1. large grid sizes: 3 planes do not fit in cache for a 3D problem
  2. blockings that do not cut the unit-stride dimension

Slide 28: We Are Still Far from FP Peak
But we are much closer (even without blocking) to memory peak.
This data counts only 1 load of each mesh point (infinitely large cache).

Slide 29: Blocking Over Time / Iterations
°Can we do better than this? The code is still severely limited by memory bandwidth
°For some computations, you can merge k sweeps over the grid
  Re-use data k times (as well as re-use within a plane)
°Dependencies produce pyramid patterns
  Frigo & Strumpen, ICS05
(figure: space-time trapezoid with time axis t0..t1, space axis x0..x1, and edge slopes dx0, dx1)

Slide 30: The Algorithm - Base Case
If the height is 1 (i.e., t1 - t0 = 1) then we simply have a line of points (t0, x) where x0 <= x <= x1. Do the kernel on this set of points. Order does not matter (no interdependencies).
(figure: single row of the space-time trapezoid at time t0)

Slide 31: The Algorithm - Space Cut
°If the width >= 2*height, then cut with a slope = -1 line through the center.
°Do T1, then T2. No point in T1 depends on values from T2.
(figure: trapezoid split into T1 and T2 by the diagonal cut)

Slide 32: The Algorithm - Time Cut
°Otherwise, cut the trapezoid in half in the time dimension.
°Do T1, then T2. No point in T1 depends on values of T2.
(figure: trapezoid split into lower half T1 and upper half T2)
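
Putting the three cases together gives the Frigo-Strumpen recursion; a 1D sketch (the cut condition and midpoint follow their ICS05 paper, and kernel is an assumed per-point update function, not part of the slides):

/* Recursively walk the space-time trapezoid [t0,t1) x [x0,x1), whose
 * left/right edges move with slopes dx0/dx1 (each -1, 0, or +1). */
void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1,
           void (*kernel)(int t, int x)) {
    int dt = t1 - t0;
    if (dt == 1) {                        /* base case: one row of points */
        for (int x = x0; x < x1; x++) kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* wide trapezoid: space cut with a slope -1 line through the center */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            walk1(t0, t1, x0, dx0, xm, -1);    /* T1 */
            walk1(t0, t1, xm, -1, x1, dx1);    /* T2 */
        } else {                           /* tall trapezoid: time cut in half */
            int s = dt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}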

Slide 33: Performance Issues
°Stencils have simple inner loops
  Usually about 1 flop/load
  Reuse is limited to the number of points in the stencil
°Performance is usually limited by memory
  For large meshes (e.g., 512³), cache blocking helps
  For smaller meshes, running time is roughly the time to read the mesh once from main memory
  Tradeoff of blocking: reduces cache misses (bandwidth), but increases prefetch misses (latency)
°Power5 has code generation problems
  The platform has aliasing issues
  Causes a low percentage of both machine and bandwidth peak

Slide 34: Cache Blocking - Multiple Iterations
°Now we allow reuse across iterations
°Cache blocking now becomes slightly trickier
  Need to shift each block after each iteration to respect dependencies
  Requires the cache block dimension c as a parameter
  We call this "Cache Blocking with Time Skewing"
°(figure: a simple 3-point 1D stencil with 4 cache blocks)
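
A 1D sketch of this idea for a 3-point stencil (the stencil weights are illustrative, and the handling of the final time level is simplified): each block slides left one point per time step, so every value it reads was produced either by its own previous step or by the block to its left, which has already finished.

/* u holds time level 0 on entry; v is scratch of the same length n.
 * c = cache block width, T = number of merged time steps. */
void time_skewed(double *u, double *v, int n, int c, int T) {
    for (int b = 1; b < n - 1 + T; b += c) {     /* trailing blocks clamped */
        double *src = u, *dst = v;               /* per-block view of time levels */
        for (int t = 0; t < T; t++) {
            int lo = b - t, hi = b + c - t;      /* block slides left each step */
            if (lo < 1) lo = 1;
            if (hi > n - 1) hi = n - 1;
            for (int x = lo; x < hi; x++)
                dst[x] = 0.25 * src[x - 1] + 0.5 * src[x] + 0.25 * src[x + 1];
            double *tmp = src; src = dst; dst = tmp;
        }
    }
    /* note: for odd T the final values end up in v rather than u */
}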

Slide 35: Cache Blocking with Time Skewing Animation
(figure: animation of skewed blocks sweeping the grid; z is the unit-stride dimension)

Slide 36: Cache Blocking with Time Skewing Performance
°Itanium 2 and Opteron both improve by at least 20%
°Power5 has aliasing issues

Slide 37: Time Skewing - Optimal Block Size Search
(figure: performance across block sizes)

Slide 38: Time Skewing - Optimal Block Size Search
°Reduced memory traffic does correlate well with higher GFlop rates

Slide 39: Performance Model
GOAL: Find the optimal cache block size without an exhaustive search
°Most important factors: memory traffic and prefetching
°First, count the number of cache misses
  Inputs: cache size, cache line size, and grid size
  The model then classifies block sizes into 5 cases
  Misses are classified as either "fast" or "slow"
°Then predict performance by factoring in prefetching
  The STriad microbenchmark determines the cost of "fast" and "slow" misses
  Combine with the cache miss model to compute running time
°This will eliminate poor block sizes, but does not choose the best one
  The best predicted block size may not actually be best
  Still need to search over the pruned parameter space

Slide 40: Memory Read Traffic Model
(figure: measured vs. modeled read traffic, with regions of under- and over-prediction)

Slide 41: Performance Model
(figure: measured vs. modeled performance, with regions of under- and over-prediction)

Slide 42: Parallelizing Jacobi's Method
°Reduces to sparse-matrix-vector multiply by (nearly) T
  U(m+1) = (I - T/4) * U(m) + B/4
°Each value of U(m+1) may be updated independently
  keep 2 copies, for iterations m and m+1
°Requires that boundary values be communicated (see the sketch below)
  if each processor owns n²/p elements to update,
  the amount of data communicated, n/p^1/2 per neighbor, is relatively small if n >> p
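
A sketch of the boundary exchange in MPI for a 1D (row-strip) partition, which is one common way to realize this: u is padded with one ghost row above and below, width is the padded row length, and up/down are the neighbor ranks (MPI_PROC_NULL at the edges). The layout and names are assumptions for illustration.

#include <mpi.h>

/* Exchange ghost rows before each Jacobi sweep; owned rows are 1..rows. */
void exchange_ghosts(double *u, int rows, int width,
                     int up, int down, MPI_Comm comm) {
    /* send my top owned row up; receive my bottom ghost row from below */
    MPI_Sendrecv(&u[1 * width], width, MPI_DOUBLE, up, 0,
                 &u[(rows + 1) * width], width, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my bottom owned row down; receive my top ghost row from above */
    MPI_Sendrecv(&u[rows * width], width, MPI_DOUBLE, down, 1,
                 &u[0], width, MPI_DOUBLE, up, 1,
                 comm, MPI_STATUS_IGNORE);
}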

Slide 43: Locality Optimizations in Jacobi
°Nearest neighbor update in Jacobi has:
  Good spatial locality (for a regular mesh): traverse rows/columns
  Poor temporal locality: few flops per mesh point
  - e.g., on a 2D mesh: 4 adds and 1 multiply for 5 loads and 1 store
°For both parallel and single processors, we may trade off extra computation for reduced memory operations
  Idea: for each subgrid, do multiple Jacobi iterations
  The size of the subgrid shrinks on each iteration, so start with a larger one
  Used for uniprocessors:
  - Rescheduling for Locality, Michelle Mills Strout: www-unix.mcs.anl.gov/~mstrout
  - Cache-efficient multigrid algorithms, Sid Chatterjee
  And parallel machines, including "tuning" for grids:
  - Ian Foster et al, SC2001

Slide 44: Redundant Ghost Nodes in Jacobi
°Overview of memory hierarchy optimization
°Can be used on unstructured meshes
°Size of the ghost region (and of the redundant computation) depends on network/memory speed vs. computation speed
(figure: to compute the green region, copy the yellow ghost cells, then compute the blue cells)

Slide 45: Extra Slides

Slide 46: Templates for choosing an iterative algorithm for Ax = b
°Use only matrix-vector multiplication, dot, saxpy, …
°www.netlib.org/templates
Decision tree:
  Symmetric?
    Yes → Definite?
      Yes → Eigenvalues known?
        Yes → Try CG or Chebyshev
        No  → Try CG
      No → Try Minres or CG
    No → A^T available?
      Yes → Try QMR
      No  → Storage expensive?
        Yes → Try CGS or Bi-CGSTAB
        No  → Try GMRES

Slide 47: Summary of Jacobi, SOR and CG
°Jacobi, SOR, and CG all perform sparse-matrix-vector multiply
°For Poisson, this means nearest neighbor communication on an n-by-n grid
°It takes n = N^1/2 steps for information to travel across an n-by-n grid
°Since the solution on one side of the grid depends on data on the other side, faster methods require faster ways to move information:
  FFT
  Multigrid

Slide 48: Solving the Poisson equation with the FFT
°Motivation: express the continuous solution as a Fourier series
  u(x,y) = Σ_i Σ_k u_ik sin(π i x) sin(π k y)
  u_ik is called the Fourier coefficient of u(x,y)
°Poisson's equation ∂²u/∂x² + ∂²u/∂y² = b becomes
  Σ_i Σ_k (-π²i² - π²k²) u_ik sin(π i x) sin(π k y) = Σ_i Σ_k b_ik sin(π i x) sin(π k y)
  where the b_ik are the Fourier coefficients of b(x,y)
°By uniqueness of Fourier series, u_ik = b_ik / (-π²i² - π²k²)
°Continuous algorithm (discrete algorithm):
  Compute the Fourier coefficients b_ik of the right-hand side
    (apply a 2D FFT to the values of b(i,k) on the grid)
  Compute the Fourier coefficients u_ik of the solution
    (divide each transformed b(i,k) by function(i,k))
  Compute the solution u(x,y) from the Fourier coefficients
    (apply a 2D inverse FFT to the values of b(i,k))
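
A sketch of the discrete algorithm using FFTW's type-I discrete sine transform, which matches the Dirichlet boundary conditions here. The eigenvalue formula and normalization follow the standard DST-I conventions for the positive-definite T above and should be double-checked against your sign conventions; n is the number of interior points per side and h the grid spacing.

#include <fftw3.h>
#include <math.h>

/* Solve T*u = b (T = discrete 2D Laplacian) in place: f holds b on
 * entry and u on exit, stored row-major over the n x n interior. */
void fft_poisson(double *f, int n, double h) {
    fftw_plan dst = fftw_plan_r2r_2d(n, n, f, f,
                                     FFTW_RODFT00, FFTW_RODFT00,
                                     FFTW_ESTIMATE);
    fftw_execute(dst);                        /* forward: f now holds b_ik */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double li = 2.0 - 2.0 * cos(M_PI * (i + 1) / (n + 1));
            double lk = 2.0 - 2.0 * cos(M_PI * (k + 1) / (n + 1));
            f[i * n + k] *= h * h / (li + lk);   /* divide by the eigenvalue */
        }
    fftw_execute(dst);                        /* DST-I is its own inverse ... */
    double s = 1.0 / (2.0 * (n + 1)) / (2.0 * (n + 1));
    for (long i = 0; i < (long)n * n; i++)    /* ... up to 2(n+1) per dimension */
        f[i] *= s;
    fftw_destroy_plan(dst);
}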

Slide 49: Serial FFT
°Let i = sqrt(-1) and index matrices and vectors from 0.
°The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m-by-m matrix defined as
  F[j,k] = ω^(j*k)
and ω is
  ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
°ω is a complex number whose m-th power is ω^m = 1; it is therefore called an m-th root of unity.
°E.g., for m = 4: ω = i, ω² = -1, ω³ = -i, ω⁴ = 1.

Slide 50: Using the 1D FFT for filtering
°Signal = sin(7t) + .5 sin(5t) at 128 points
°Noise = random number bounded by .75
°Filter by zeroing out FFT components < .25

Slide 51: Using the 2D FFT for image compression
°Image = 200x320 matrix of values
°Compress by keeping the largest 2.5% of FFT components
°A similar idea is used by JPEG

Slide 52: Related Transforms
°Most applications require multiplication by both F and inverse(F).
°Multiplying by F and inverse(F) are essentially the same (inverse(F) is the complex conjugate of F divided by m).
°For solving the Poisson equation and various other applications, we use variations on the FFT:
  The sin transform: imaginary part of F
  The cos transform: real part of F
°The algorithms are similar, so we will focus on the forward FFT.

Slide 53: Serial Algorithm for the FFT
°Compute the FFT of an m-element vector v, F*v:
  (F*v)[j] = Σ_{k=0}^{m-1} F(j,k) * v(k)
           = Σ_{k=0}^{m-1} ω^(j*k) * v(k)
           = Σ_{k=0}^{m-1} (ω^j)^k * v(k)
           = V(ω^j)
°where V is defined as the polynomial V(x) = Σ_{k=0}^{m-1} x^k * v(k)

Slide 54: Divide and Conquer FFT
°V can be evaluated using divide-and-conquer:
  V(x) = Σ_{k=0}^{m-1} x^k * v(k)
       = v[0] + x²*v[2] + x⁴*v[4] + … + x*(v[1] + x²*v[3] + x⁴*v[5] + …)
       = V_even(x²) + x*V_odd(x²)
°V has degree m-1, so V_even and V_odd are polynomials of degree m/2 - 1
°We evaluate these at the points (ω^j)² for 0 <= j <= m-1
°But this is really just m/2 different points, since
  (ω^(j+m/2))² = (ω^j * ω^(m/2))² = ω^(2j) * ω^m = (ω^j)²
°So an FFT on m points is reduced to 2 FFTs on m/2 points: divide and conquer!

Slide 55: Divide-and-Conquer FFT
  FFT(v, ω, m)
    if m = 1 return v[0]
    else
      v_even = FFT(v[0:2:m-2], ω², m/2)
      v_odd  = FFT(v[1:2:m-1], ω², m/2)
      ω-vec = [ω⁰, ω¹, …, ω^(m/2-1)]        (precomputed)
      return [v_even + (ω-vec .* v_odd), v_even - (ω-vec .* v_odd)]
°The .* above is component-wise multiply.
°The […,…] constructs an m-element vector from two m/2-element vectors.
°This results in an O(m log m) algorithm.
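
The same recursion in runnable C99, using strided access in place of the [0:2:m-2] slices (out-of-place for clarity, not performance; a sketch following the document's convention ω = e^(2πi/m)):

#include <complex.h>
#include <math.h>

/* FFT of v[0], v[stride], ..., v[(m-1)*stride] into out[0..m-1];
 * m must be a power of 2. Call as fft(v, out, m, 1). */
void fft(const double complex *v, double complex *out, int m, int stride) {
    if (m == 1) { out[0] = v[0]; return; }
    fft(v, out, m / 2, 2 * stride);                    /* even part */
    fft(v + stride, out + m / 2, m / 2, 2 * stride);   /* odd part  */
    for (int j = 0; j < m / 2; j++) {
        double complex w = cexp(2.0 * M_PI * I * j / m);   /* omega^j */
        double complex e = out[j], o = w * out[j + m / 2];
        out[j]         = e + o;   /* v_even + (omega-vec .* v_odd) */
        out[j + m / 2] = e - o;   /* v_even - (omega-vec .* v_odd) */
    }
}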

Slide 56: An Iterative Algorithm
°The call tree of the divide-and-conquer FFT algorithm is a complete binary tree with log m levels
°Practical algorithms are iterative, going across each level of the tree starting at the bottom
°The algorithm overwrites v[i] by (F*v)[bitreverse(i)]
(figure: call tree for m = 16, from FFT(xxxx) = FFT(0,1,2,…,15) at the root, splitting into even FFT(xxx0) = FFT(0,2,…,14) and odd FFT(xxx1) = FFT(1,3,…,15), down to single-element leaves in bit-reversed order: FFT(0), FFT(8), FFT(4), FFT(12), FFT(2), FFT(10), FFT(6), FFT(14), FFT(1), FFT(9), FFT(5), FFT(13), FFT(3), FFT(11), FFT(7), FFT(15))

Slide 57: Parallel 1D FFT
°Data dependencies in the 1D FFT form a butterfly pattern
°A PRAM algorithm takes O(log m) time
  each step (to the right in the figure) is parallel
  there are log m steps
°What about communication cost?
°See the LogP paper for details

Slide 58: Block Layout of 1D FFT
°Using a block layout (m/p contiguous elements per processor):
  No communication in the last log(m/p) steps
  Each of the first log(p) steps requires fine-grained communication

Slide 59: Cyclic Layout of 1D FFT
°Cyclic layout (only 1 element per processor, wrapped):
  No communication in the first log(m/p) steps
  Communication in the last log(p) steps

Slide 60: Parallel Complexity
°m = vector size, p = number of processors
°f = time per flop = 1
°α = startup cost for a message (in units of f)
°β = time per word in a message (in units of f)
°Time(blockFFT) = Time(cyclicFFT) = 2*m*log(m)/p + log(p)*α + m*log(p)/p * β

Slide 61: FFT With "Transpose"
°If we start with a cyclic layout, there is no communication in the first log(m/p) steps
°Then transpose the vector for the last log(p) steps
°All communication is in the transpose

Slide 62: Why is the Communication Step Called a Transpose?
°Analogous to transposing an array
°View the vector as a 2D array of m/p by p
°Note: the same idea is useful for uniprocessor caches

Slide 63: Complexity of the FFT with Transpose
°If no communication is overlapped (an overestimate!):
  Time(transposeFFT) = 2*m*log(m)/p      (same as before)
                     + (p-1)*α           (was log(p)*α)
                     + m*(p-1)/p² * β    (was m*log(p)/p * β)
°If communication is overlapped, so that we do not pay for p-1 messages, the second term becomes simply α, rather than (p-1)*α.
°This is close to optimal. See the LogP paper for details.
°See also the following papers on the class resource page:
  A. Sahai, "Hiding Communication Costs in Bandwidth Limited FFT"
  R. Nishtala et al, "Optimizing bandwidth limited problems using one-sided communication"

Slide 64: Comment on the 1D Parallel FFT
°The above algorithm leaves data in bit-reversed order
  Some applications can use it this way, like Poisson
  Others require another transpose-like operation
°Other parallel algorithms also exist
  A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman)
  Based on the Fast Multipole algorithm
  Less communication for the non-bit-reversed algorithm

Slide 65: Higher Dimension FFTs
°FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions.
°E.g., a 2D FFT does 1D FFTs on all rows and then all columns.
°There are 3 obvious possibilities for the 2D FFT:
  (1) 2D blocked layout for the matrix, using 1D algorithms for each row and column
  (2) Block row layout for the matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
  (3) Block row layout for the matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
  Option 2 is best, if we can overlap communication and computation
°For a 3D FFT the options are similar
  2 phases are done with serial FFTs, followed by a transpose for the 3rd
  can overlap communication with the 2nd phase in practice

Slide 66: FFTW - Fastest Fourier Transform in the West
°www.fftw.org
°Produces an FFT implementation optimized for:
  Your version of the FFT (complex, real, …)
  Your value of n (arbitrary, possibly prime)
  Your architecture
  Close to optimal for serial; can be improved for parallel
°Similar in spirit to PHIPAC/ATLAS/Sparsity
°Won the 1999 Wilkinson Prize for Numerical Software
°Widely used

Slide 67: Extra Slides

Slide 68: Poisson's equation arises in many models
°Heat flow: Temperature(position, time)
°Diffusion: Concentration(position, time)
°Electrostatic or gravitational potential: Potential(position)
°Fluid flow: Velocity, Pressure, Density(position, time)
°Quantum mechanics: Wave-function(position, time)
°Elasticity: Stress, Strain(position, time)

Slide 69: Poisson's equation in 1D

      [ 2 -1       ]
  T = [-1  2 -1    ]
      [   -1  2 -1 ]
      [      -1  2 ]

Graph and "stencil": each point couples to itself with weight 2 and to each neighbor with weight -1.

Slide 70: Algorithms for 2D Poisson Equation with N unknowns

  Algorithm     | Serial  | PRAM          | Memory  | #Procs
  Dense LU      | N³      | N             | N²      | N²
  Band LU       | N²      | N             | N^3/2   | N
  Jacobi        | N²      | N             | N       | N
  Explicit Inv. | N²      | log N         | N²      | N²
  Conj. Grad.   | N^3/2   | N^1/2 * log N | N       | N
  RB SOR        | N^3/2   | N^1/2         | N       | N
  Sparse LU     | N^3/2   | N^1/2         | N*log N | N
  FFT           | N*log N | log N         | N       | N
  Multigrid     | N       | log² N        | N       | N
  Lower bound   | N       | log N         | N       |

PRAM is an idealized parallel model with zero-cost communication.

Slide 71: Short explanations of algorithms on previous slide
°Sorted in two orders (roughly):
  from slowest to fastest on sequential machines
  from most general (works on any matrix) to most specialized (works on matrices "like" Poisson)
°Dense LU: Gaussian elimination; works on any N-by-N matrix
°Band LU: exploits the fact that T is nonzero only on the sqrt(N) diagonals nearest the main diagonal, so faster
°Jacobi: essentially does matrix-vector multiply by T in the inner loop of an iterative algorithm
°Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it (it's still expensive!)
°Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not
°Red-Black SOR (Successive Overrelaxation): variation of Jacobi that exploits yet different mathematical properties of T; used in multigrid
°Sparse LU: Gaussian elimination exploiting the particular zero structure of T
°FFT (Fast Fourier Transform): works only on matrices very like T
°Multigrid: also works on matrices like T, that come from elliptic PDEs
°Lower Bound: serial (time to print the answer); parallel (time to combine N inputs)
°Details in class notes and www.cs.berkeley.edu/~demmel/ma221

Slide 72: Irregular mesh: Tapered Tube (multigrid)

Slide 73: Composite mesh from a mechanical structure

