
Slide 1: CS 267: Applications of Parallel Computers, Structured Grids. Kathy Yelick, www.cs.berkeley.edu/~yelick/cs267_sp07

Slide 2: Outline
°Review PDEs
°Overview of Methods for the Poisson Equation
°Jacobi's method
°Red-Black SOR method
°Conjugate Gradient
°Optimizing Stencils for Memory Hierarchies
°Parallel Stencil Computations
°FFT (future lecture)
°Multigrid (future lecture)

Slide 3: Solving PDEs
°Hyperbolic problems (waves): Sound wave(position, time)
  Use explicit time-stepping
  Solution at each point depends on neighbors at the previous time
°Elliptic (steady state) problems: Electrostatic Potential(position)
  Everything depends on everything else
  This means locality is harder to find than in hyperbolic problems
°Parabolic (time-dependent) problems: Temperature(position, time)
  Involves an elliptic solve at each time-step
°Focus on elliptic problems
  Canonical example is the Poisson equation:
  ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)

Slide 4: Explicit Solution of the Poisson Equation
°Explicit methods are often used for hyperbolic PDEs
°Computation corresponds to
  Matrix-vector multiply
  Combine nearest neighbors on the grid
°Use finite differences with u[j,i] as the temperature at
  time t = i*dt (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
  initial conditions on u[j,0]
  boundary conditions on u[0,i] and u[N,i]
°At each timestep i = 0,1,2,...
  For j = 1 to N-1
    u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
  where z = C*dt/h²
(figure: space-time grid with position axis j and time axis i, timesteps i = 0..5, and initial values u[0,0]..u[5,0])
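
A minimal C sketch of the update above (the function name and parameters are illustrative): one explicit timestep for the 1D heat equation, holding the boundary values fixed. Stability of this explicit scheme requires z = C*dt/h² < 1/2.

void heat_step(const double *u, double *unew, int N, double z) {
    unew[0] = u[0];          /* boundary condition held fixed */
    unew[N] = u[N];          /* boundary condition held fixed */
    for (int j = 1; j < N; j++)
        unew[j] = z * u[j - 1] + (1 - 2 * z) * u[j] + z * u[j + 1];
}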

Slide 5: Matrix View of Explicit Method for Heat
u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
u[:, i+1] = T * u[:, i], where T is tridiagonal:

      [1-2z   z            ]
  T = [  z  1-2z   z       ]  = I - z*L
      [       z  1-2z   z  ]
      [            z  1-2z ]

with the 1D Laplacian

      [ 2 -1       ]
  L = [-1  2 -1    ]
      [   -1  2 -1 ]
      [      -1  2 ]

Graph and "3 point stencil": each point couples to itself with weight 1-2z and to its left and right neighbors with weight z.
For a 2D mesh (5-point stencil) the Laplacian is pentadiagonal (5 nonzeros per row, along diagonals); for a 3D mesh there are 7 nonzeros per row.

Slide 6: Algorithms for 2D (3D) Poisson Equation (N = n² (n³) vars)

  Algorithm       | Serial         | PRAM                  | Memory          | #Procs
  Dense LU        | N³             | N                     | N²              | N²
  Band LU         | N² (N^7/3)     | N                     | N^3/2 (N^5/3)   | N (N^4/3)
  Jacobi          | N² (N^5/3)     | N (N^2/3)             | N               | N
  Explicit Inv.   | N²             | log N                 | N²              | N²
  Conj. Gradients | N^3/2 (N^4/3)  | N^1/2 (N^1/3) * log N | N               | N
  Red/Black SOR   | N^3/2 (N^4/3)  | N^1/2 (N^1/3)         | N               | N
  Sparse LU       | N^3/2 (N²)     | N^1/2                 | N*log N (N^4/3) | N
  FFT             | N*log N        | log N                 | N               | N
  Multigrid       | N              | log² N                | N               | N
  Lower bound     | N              | log N                 | N               |

PRAM is an idealized parallel model with zero-cost communication.
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.

Slide 7: Overview of Algorithms on n x n grid (n x n x n grid) (1)
°Sorted in two orders (roughly):
  from slowest to fastest on sequential machines
  from most general (works on any matrix) to most specialized (works on matrices "like" T)
°Dense LU: Gaussian elimination; works on any N-by-N matrix.
°Band LU: Exploits the fact that T is nonzero only on the N^1/2 (N^2/3) diagonals nearest the main diagonal.
°Jacobi: Essentially does matrix-vector multiply by T in the inner loop of an iterative algorithm.
°Explicit Inverse: Assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it (but still expensive!).
°Conjugate Gradient: Uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not.
°Red-Black SOR (successive over-relaxation): Variation of Jacobi that exploits yet different mathematical properties of T. Used in multigrid schemes.
°Sparse LU: Gaussian elimination exploiting the particular zero structure of T.
°FFT (fast Fourier transform): Works only on matrices very like T.
°Multigrid: Also works on matrices like T, that come from elliptic PDEs.
°Lower Bound: Serial (time to print the answer); parallel (time to combine N inputs).
°Details in class notes and www.cs.berkeley.edu/~demmel/ma221.

Slide 8: Overview of Algorithms on n x n grid (n x n x n grid) (2)
°Dimension = N = n² in 2D (= n³ in 3D)
°Bandwidth = BW = n in 2D (= n² in 3D)
°Condition number = k = n² in 2D or 3D
°#its = number of iterations
°SpMV = sparse-matrix-vector multiply
°Dense LU: cost = N³ (serial) or N (parallel, with N² procs)
°Band LU: cost = N * BW² (serial) or N (parallel, with BW² procs)
°Jacobi: cost = cost(SpMV) * #its, #its = k
°Conjugate Gradient: cost = (cost(SpMV) + cost(dot prod)) * #its, #its = sqrt(k)
°Red-Black SOR: cost = cost(SpMV) * #its, #its = sqrt(k)
°Iterative algorithms need different assumptions for convergence analysis
  SpMV cost is the cost of a nearest-neighbor relaxation, which is O(N)

Slide 9: Jacobi's Method
°To derive Jacobi's method, write Poisson as:
  u(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j))/4
°Let u(i,j,m) be the approximation for u(i,j) after m steps:
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + u(i,j-1,m) + u(i,j+1,m) + b(i,j))/4
°I.e., u(i,j,m+1) is a weighted average of neighbors
°Motivation: u(i,j,m+1) is chosen to exactly satisfy the equation at (i,j)
°Steps to converge are proportional to problem size, N = n²
  See Lecture 24 of www.cs.berkeley.edu/~demmel/cs267_Spr99
°Therefore, serial complexity is O(N²)
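
As a concrete illustration, here is one Jacobi sweep in C for the 2D equation above (a sketch; the flattened row-major layout with a one-cell boundary ring, and the IDX helper macro, are assumptions):

#include <stddef.h>
#define IDX(i, j, n) ((size_t)(i) * ((n) + 2) + (j))

/* One Jacobi sweep on an n x n interior with a boundary ring.
 * unew and u must be distinct arrays (the two copies for steps m and m+1). */
void jacobi_sweep(const double *u, double *unew, const double *b, int n) {
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            unew[IDX(i, j, n)] = 0.25 * (u[IDX(i - 1, j, n)] + u[IDX(i + 1, j, n)]
                                       + u[IDX(i, j - 1, n)] + u[IDX(i, j + 1, n)]
                                       + b[IDX(i, j, n)]);
}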

Slide 10: Convergence of Nearest Neighbor Methods
°Jacobi's method involves a nearest neighbor computation on an n x n grid (N = n²)
  So it takes O(n) = O(sqrt(N)) iterations for information to propagate
°E.g., consider a right-hand side b that is 0, except the center is 1
°The exact solution looks like: (figure: solution peaked at the center, decaying toward the boundary)
Even in the best case, any nearest neighbor computation will take n/2 steps to propagate on an n x n grid.

Slide 11: Convergence of Nearest Neighbor Methods

Slide 12: Improvements on Jacobi
°Similar to Jacobi: u(i,j,m+1) is computed as a linear combination of neighbors
°Numeric coefficients and update order are different
°Based on 2 improvements over Jacobi:
  Use the "most recent values" of u that are available, since these are probably more accurate
  Update the value of u(m+1) "more aggressively" at each step
°First, note that while evaluating sequentially
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + …
some of the values for step m+1 are already available:
  u(i,j,m+1) = (u(i-1,j,latest) + u(i+1,j,latest) + …
where latest is either m or m+1

Slide 13: Gauss-Seidel
°Updating in left-to-right, row-wise order, we get the Gauss-Seidel algorithm:
  for i = 1 to n
    for j = 1 to n
      u(i,j,m+1) = (u(i-1,j,m+1) + u(i+1,j,m) + u(i,j-1,m+1) + u(i,j+1,m) + b(i,j)) / 4
°This cannot be parallelized because of the dependencies, so instead we use a "red-black" order (see the sketch below):
  forall black points u(i,j)
    u(i,j,m+1) = (u(i-1,j,m) + …
  forall red points u(i,j)
    u(i,j,m+1) = (u(i-1,j,m+1) + …
°For a general graph, use graph coloring
°Can use repeated Maximal Independent Sets to color
°Graph(T) is bipartite => 2-colorable (red and black)
°Nodes of each color can be updated simultaneously
°Still sparse-matrix-vector multiply, using submatrices
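
A sketch of the red-black sweep in C (same layout assumptions and IDX macro as the Jacobi sketch above); within each color pass there are no dependencies, so both inner loops can run in parallel:

#include <stddef.h>
#define IDX(i, j, n) ((size_t)(i) * ((n) + 2) + (j))

/* In-place red-black Gauss-Seidel sweep: one color at a time;
 * points of one color read only points of the other color. */
void red_black_sweep(double *u, const double *b, int n) {
    for (int color = 0; color <= 1; color++)        /* 0 then 1 */
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                if ((i + j) % 2 == color)
                    u[IDX(i, j, n)] = 0.25 * (u[IDX(i - 1, j, n)] + u[IDX(i + 1, j, n)]
                                            + u[IDX(i, j - 1, n)] + u[IDX(i, j + 1, n)]
                                            + b[IDX(i, j, n)]);
}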

Slide 14: Successive Overrelaxation (SOR)
°Red-black Gauss-Seidel converges twice as fast as Jacobi, but there are twice as many parallel steps, so it is the same in practice
°To motivate the next improvement, write the basic step of the algorithm as:
  u(i,j,m+1) = u(i,j,m) + correction(i,j,m)
°If "correction" is a good direction to move, then one should move even further in that direction by some factor w > 1:
  u(i,j,m+1) = u(i,j,m) + w * correction(i,j,m)
°Called successive overrelaxation (SOR); see the sketch below
°Parallelizes like Jacobi (still sparse-matrix-vector multiply…)
°Can prove w = 2/(1 + sin(π/(n+1))) gives the best convergence
  Number of steps to converge = parallel complexity = O(n), instead of O(n²) for Jacobi
  Serial complexity O(n³) = O(N^3/2), instead of O(n⁴) = O(N²) for Jacobi
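
The over-relaxed update drops into the red-black sweep above by replacing the plain assignment with a weighted correction; a sketch, reusing the IDX macro from the sketches above (w is passed in, e.g. the optimal value quoted on this slide):

/* SOR update for one point: compute the Gauss-Seidel value, then move
 * w times the correction (w = 1 recovers Gauss-Seidel). */
void sor_update(double *u, const double *b, int i, int j, int n, double w) {
    double gs = 0.25 * (u[IDX(i - 1, j, n)] + u[IDX(i + 1, j, n)]
                      + u[IDX(i, j - 1, n)] + u[IDX(i, j + 1, n)]
                      + b[IDX(i, j, n)]);
    u[IDX(i, j, n)] += w * (gs - u[IDX(i, j, n)]);   /* u + w*correction */
}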

Slide 15: Conjugate Gradient (CG) for solving A*x = b
°This method can be used when the matrix A is
  symmetric, i.e., A = A^T
  positive definite, defined equivalently as:
  - all eigenvalues are positive
  - x^T * A * x > 0 for all nonzero vectors x
  - a Cholesky factorization A = L*L^T exists
°The algorithm maintains 3 vectors:
  x = the approximate solution, improved after each iteration
  r = the residual, r = b - A*x
  p = the search direction, also called the conjugate gradient
°One iteration costs:
  Sparse-matrix-vector multiply by A (major cost)
  3 dot products, 3 saxpys (scale*vector + vector)
°Converges in O(n) = O(N^1/2) steps, like SOR
  Serial complexity = O(N^3/2)
  Parallel complexity = O(N^1/2 log N); the log N factor comes from the dot products

Slide 16: Conjugate Gradient Algorithm for Solving Ax = b
°Initial guess x
°r = b - A*x, j = 1
°Repeat
  rho = r^T * r                                                      … dot product
  If j = 1, p = r, else beta = rho/old_rho, p = r + beta*p, endif    … saxpy
  q = A*p                                                            … sparse matrix-vector multiply
  alpha = rho / (p^T * q)                                            … dot product
  x = x + alpha * p                                                  … saxpy
  r = r - alpha * q                                                  … saxpy
  old_rho = rho; j = j+1
°Until rho small enough
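
A direct C transcription of this pseudocode (a sketch: the matvec is supplied as a callback and the workspace vectors r, p, q are caller-allocated, which is an assumed interface, not part of the slide):

#include <math.h>
#include <string.h>

typedef void (*matvec_fn)(const double *x, double *y, int n);

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Solve A*x = b by conjugate gradients; x holds the initial guess. */
void cg(matvec_fn A, const double *b, double *x,
        double *r, double *p, double *q, int n, double tol, int maxit) {
    A(x, q, n);
    for (int i = 0; i < n; i++) r[i] = b[i] - q[i];    /* r = b - A*x */
    double rho = dot(r, r, n), old_rho = 0.0;
    for (int j = 1; j <= maxit && sqrt(rho) > tol; j++) {
        if (j == 1)
            memcpy(p, r, (size_t)n * sizeof(double));
        else {
            double beta = rho / old_rho;
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];  /* saxpy */
        }
        A(p, q, n);                                    /* q = A*p: the SpMV, major cost */
        double alpha = rho / dot(p, q, n);
        for (int i = 0; i < n; i++) x[i] += alpha * p[i];   /* saxpy */
        for (int i = 0; i < n; i++) r[i] -= alpha * q[i];   /* saxpy */
        old_rho = rho;
        rho = dot(r, r, n);
    }
}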

Slide 17: Optimizations of Structured Grid Operations
Sparse matrix-vector multiply (aka nearest neighbor operations on a grid) is a critical component.
For arbitrary grids and matrices, we will talk about optimizations in a future lecture.
For structured grids we can look at:
  Memory system performance
  Blocking/tiling as in matrix multiplication
  Parallel implementations and optimizations

Slide 18: Basic Stencil Code for a 3D 7-Point Stencil
The slide's pseudocode, filled out into compilable C (S0 and S1 are the stencil coefficients; k is the unit-stride dimension):

#include <stddef.h>

void stencil3d(const double A[], double B[], int nx, int ny, int nz,
               double S0, double S1) {
  #define IX(i, j, k) (((size_t)(i) * ny + (j)) * nz + (k))
  for (int i = 1; i < nx - 1; i++)        /* x-dim */
    for (int j = 1; j < ny - 1; j++)      /* y-dim */
      for (int k = 1; k < nz - 1; k++)    /* z-dim (unit stride) */
        B[IX(i, j, k)] = S0 * A[IX(i, j, k)]
                       + S1 * (A[IX(i + 1, j, k)] + A[IX(i - 1, j, k)]
                             + A[IX(i, j + 1, k)] + A[IX(i, j - 1, k)]
                             + A[IX(i, j, k + 1)] + A[IX(i, j, k - 1)]);
  #undef IX
}

Slide 19: 3D (7-point) Stencils Achieve a Low Fraction of Peak
Even on a single processor, performance (of the stencil) is a low fraction of peak …

Slide 20: Tiling Stencil Computations
°There are several papers on tiling stencil computations
  E.g., Rivera and Tseng, SC2002, … (see the sketch below)
°Old conventional wisdom: cache misses are the most important factor
(figure: ratio of cache misses blocked/unblocked vs. ratio of time unblocked/blocked)
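
A sketch of a Rivera/Tseng-style tiled version of the stencil3d code above: tile the two non-unit-stride dimensions by TI x TJ and always sweep the unit-stride k dimension in full (the tile parameters and MIN helper are illustrative):

#include <stddef.h>
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void stencil3d_blocked(const double *A, double *B, int nx, int ny, int nz,
                       double S0, double S1, int TI, int TJ) {
  #define IX(i, j, k) (((size_t)(i) * ny + (j)) * nz + (k))
  for (int ii = 1; ii < nx - 1; ii += TI)           /* tile loops */
    for (int jj = 1; jj < ny - 1; jj += TJ)
      for (int i = ii; i < MIN(ii + TI, nx - 1); i++)
        for (int j = jj; j < MIN(jj + TJ, ny - 1); j++)
          for (int k = 1; k < nz - 1; k++)          /* full unit-stride sweep */
            B[IX(i, j, k)] = S0 * A[IX(i, j, k)]
                           + S1 * (A[IX(i + 1, j, k)] + A[IX(i - 1, j, k)]
                                 + A[IX(i, j + 1, k)] + A[IX(i, j - 1, k)]
                                 + A[IX(i, j, k + 1)] + A[IX(i, j, k - 1)]);
  #undef IX
}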

Slide 21: Stanza Triad Results
°With perfect prefetching, performance would be independent of L, the stanza length
  we would expect a flat line at STREAM peak
  our results show performance depends on L

Slide 22: Cost Model for Memory Cost
°Goal: model the cost of memory operations for stencils
  understand the effect of prefetch on performance
°Start with a model for Stanza Triad
°Simplified (2-point) model of a prefetch engine:
  any non-consecutive cache line is not prefetched
  any consecutive cache line is prefetched
  in reality, more than one cache miss is needed to enable prefetch
  may have more than 2 points (a step function based on size)
°The first cache line in every L-element stanza is not prefetched
  assign cost C_non-prefetched
  get its value from Stanza Triad with L = cache line size
°The rest of the cache lines are prefetched
  assign cost C_prefetched
  get its value from Stanza Triad with large L
°Total cost: Cost = #non-prefetched * C_non-prefetched + #prefetched * C_prefetched
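
The total-cost formula is straightforward to evaluate; a sketch (all names are illustrative) for an access pattern of n elements made of L-element stanzas, with cache line size CL in elements:

/* Two-point prefetch cost model: the first cache line of each stanza
 * costs c_np (not prefetched); the remaining lines cost c_p each. */
double stanza_cost(long n, long L, long CL, double c_np, double c_p) {
    long stanzas = n / L;
    long lines_per_stanza = L / CL;
    return (double)stanzas * (c_np + (double)(lines_per_stanza - 1) * c_p);
}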

Slide 23: Stencil Cost Model
°Lower bound: load the entire grid once and store it once
  Memory traffic: 2 * N³
°Upper bound: only the k plane stays in cache; must reload the other 2 planes
  Memory traffic: 4 * N³
°Model built using parameters measured in STriad
(figure: the k-1, k, and k+1 planes of the 3D grid)

Slide 24: Stencil Probe Cost Model

Slide 25: Stencil Probe Cost Model

Slide 26: Stencil Probe Cost Model

Slide 27: Stencil Probe Cache Blocking Speedups
New conventional wisdom: prefetching is as important as caching
°Little's Law (Bailey '97): data in flight needed = latency * bandwidth
°Cache blocking is useful for:
  1. large grid sizes: 3 planes do not fit in cache for a 3D problem
  2. blockings that do not cut the unit-stride dimension

Slide 28: We Are Still Far from FP Peak
But we are much closer (even without blocking) to memory peak.
This data counts only 1 load of each mesh point (infinitely large cache).

Slide 29: Blocking Over Time / Iterations
°Can we do better than this? The code is still severely limited by memory bandwidth
°For some computations, you can merge k sweeps over the grid
  Re-use data k times (as well as re-use within a plane)
°Dependencies produce pyramid patterns
  Frigo & Strumpen, ICS05
(figure: space-time trapezoid with time axis t0..t1, space axis x0..x1, and edge slopes dx0, dx1)

Slide 30: The Algorithm - Base Case
If the height is 1 (i.e., t1 - t0 = 1) then we simply have a line of points (t0, x) where x0 <= x <= x1. Do the kernel on this set of points. Order does not matter (no interdependencies).
(figure: single row of the space-time trapezoid at time t0)

Slide 31: The Algorithm - Space Cut
°If the width >= 2*height, then cut with a slope = -1 line through the center.
°Do T1, then T2. No point in T1 depends on values from T2.
(figure: trapezoid split into T1 and T2 by the diagonal cut)

Slide 32: The Algorithm - Time Cut
°Otherwise, cut the trapezoid in half in the time dimension.
°Do T1, then T2. No point in T1 depends on values of T2.
(figure: trapezoid split into lower half T1 and upper half T2)
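
Putting the three cases together gives the Frigo-Strumpen recursion; a 1D sketch (the cut condition and midpoint follow their ICS05 paper, and kernel is an assumed per-point update function, not part of the slides):

/* Recursively walk the space-time trapezoid [t0,t1) x [x0,x1), whose
 * left/right edges move with slopes dx0/dx1 (each -1, 0, or +1). */
void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1,
           void (*kernel)(int t, int x)) {
    int dt = t1 - t0;
    if (dt == 1) {                        /* base case: one row of points */
        for (int x = x0; x < x1; x++) kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* wide trapezoid: space cut with a slope -1 line through the center */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            walk1(t0, t1, x0, dx0, xm, -1);    /* T1 */
            walk1(t0, t1, xm, -1, x1, dx1);    /* T2 */
        } else {                           /* tall trapezoid: time cut in half */
            int s = dt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}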

Slide 33: Performance Issues
°Stencils have simple inner loops
  Usually about 1 flop/load
  Reuse is limited to the number of points in the stencil
°Performance is usually limited by memory
  For large meshes (e.g., 512³), cache blocking helps
  For smaller meshes, running time is roughly the time to read the mesh once from main memory
  Tradeoff of blocking: reduces cache misses (bandwidth), but increases prefetch misses (latency)
°Power5 has code generation problems
  The platform has aliasing issues
  Causes a low percentage of both machine and bandwidth peak

Slide 34: Cache Blocking - Multiple Iterations
°Now we allow reuse across iterations
°Cache blocking now becomes slightly trickier
  Need to shift each block after each iteration to respect dependencies
  Requires the cache block dimension c as a parameter
  We call this "Cache Blocking with Time Skewing"
°(figure: a simple 3-point 1D stencil with 4 cache blocks)
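
A 1D sketch of this idea for a 3-point stencil (the stencil weights are illustrative, and the handling of the final time level is simplified): each block slides left one point per time step, so every value it reads was produced either by its own previous step or by the block to its left, which has already finished.

/* u holds time level 0 on entry; v is scratch of the same length n.
 * c = cache block width, T = number of merged time steps. */
void time_skewed(double *u, double *v, int n, int c, int T) {
    for (int b = 1; b < n - 1 + T; b += c) {     /* trailing blocks clamped */
        double *src = u, *dst = v;               /* per-block view of time levels */
        for (int t = 0; t < T; t++) {
            int lo = b - t, hi = b + c - t;      /* block slides left each step */
            if (lo < 1) lo = 1;
            if (hi > n - 1) hi = n - 1;
            for (int x = lo; x < hi; x++)
                dst[x] = 0.25 * src[x - 1] + 0.5 * src[x] + 0.25 * src[x + 1];
            double *tmp = src; src = dst; dst = tmp;
        }
    }
    /* note: for odd T the final values end up in v rather than u */
}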

Slide 35: Cache Blocking with Time Skewing Animation
(figure: animation of skewed blocks sweeping the grid; z is the unit-stride dimension)

Slide 36: Cache Blocking with Time Skewing Performance
°Itanium 2 and Opteron both improve by at least 20%
°Power5 has aliasing issues

Slide 37: Time Skewing - Optimal Block Size Search
(figure: performance across block sizes)

Slide 38: Time Skewing - Optimal Block Size Search
°Reduced memory traffic does correlate well with higher GFlop rates

Slide 39: Performance Model
GOAL: Find the optimal cache block size without an exhaustive search
°Most important factors: memory traffic and prefetching
°First, count the number of cache misses
  Inputs: cache size, cache line size, and grid size
  The model then classifies block sizes into 5 cases
  Misses are classified as either "fast" or "slow"
°Then predict performance by factoring in prefetching
  The STriad microbenchmark determines the cost of "fast" and "slow" misses
  Combine with the cache miss model to compute running time
°This will eliminate poor block sizes, but does not choose the best one
  The best predicted block size may not actually be best
  Still need to search over the pruned parameter space

Slide 40: Memory Read Traffic Model
(figure: measured vs. modeled read traffic, with regions of under- and over-prediction)

Slide 41: Performance Model
(figure: measured vs. modeled performance, with regions of under- and over-prediction)

Slide 42: Parallelizing Jacobi's Method
°Reduces to sparse-matrix-vector multiply by (nearly) T
  U(m+1) = (I - T/4) * U(m) + B/4
°Each value of U(m+1) may be updated independently
  keep 2 copies, for iterations m and m+1
°Requires that boundary values be communicated (see the sketch below)
  if each processor owns n²/p elements to update,
  the amount of data communicated, n/p^1/2 per neighbor, is relatively small if n >> p
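
A sketch of the boundary exchange in MPI for a 1D (row-strip) partition, which is one common way to realize this: u is padded with one ghost row above and below, width is the padded row length, and up/down are the neighbor ranks (MPI_PROC_NULL at the edges). The layout and names are assumptions for illustration.

#include <mpi.h>

/* Exchange ghost rows before each Jacobi sweep; owned rows are 1..rows. */
void exchange_ghosts(double *u, int rows, int width,
                     int up, int down, MPI_Comm comm) {
    /* send my top owned row up; receive my bottom ghost row from below */
    MPI_Sendrecv(&u[1 * width], width, MPI_DOUBLE, up, 0,
                 &u[(rows + 1) * width], width, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my bottom owned row down; receive my top ghost row from above */
    MPI_Sendrecv(&u[rows * width], width, MPI_DOUBLE, down, 1,
                 &u[0], width, MPI_DOUBLE, up, 1,
                 comm, MPI_STATUS_IGNORE);
}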

Slide 43: Locality Optimizations in Jacobi
°Nearest neighbor update in Jacobi has:
  Good spatial locality (for a regular mesh): traverse rows/columns
  Poor temporal locality: few flops per mesh point
  - e.g., on a 2D mesh: 4 adds and 1 multiply for 5 loads and 1 store
°For both parallel and single processors, we may trade off extra computation for reduced memory operations
  Idea: for each subgrid, do multiple Jacobi iterations
  The size of the subgrid shrinks on each iteration, so start with a larger one
  Used for uniprocessors:
  - Rescheduling for Locality, Michelle Mills Strout: www-unix.mcs.anl.gov/~mstrout
  - Cache-efficient multigrid algorithms, Sid Chatterjee
  And parallel machines, including "tuning" for grids:
  - Ian Foster et al, SC2001

Slide 44: Redundant Ghost Nodes in Jacobi
°Overview of memory hierarchy optimization
°Can be used on unstructured meshes
°Size of the ghost region (and of the redundant computation) depends on network/memory speed vs. computation speed
(figure: to compute the green region, copy the yellow ghost cells, then compute the blue cells)

Slide 45: Extra Slides

Slide 46: Templates for choosing an iterative algorithm for Ax = b
°Use only matrix-vector multiplication, dot, saxpy, …
°www.netlib.org/templates
Decision tree:
  Symmetric?
    Yes → Definite?
      Yes → Eigenvalues known?
        Yes → Try CG or Chebyshev
        No  → Try CG
      No → Try Minres or CG
    No → A^T available?
      Yes → Try QMR
      No  → Storage expensive?
        Yes → Try CGS or Bi-CGSTAB
        No  → Try GMRES

Slide 47: Summary of Jacobi, SOR and CG
°Jacobi, SOR, and CG all perform sparse-matrix-vector multiply
°For Poisson, this means nearest neighbor communication on an n-by-n grid
°It takes n = N^1/2 steps for information to travel across an n-by-n grid
°Since the solution on one side of the grid depends on data on the other side, faster methods require faster ways to move information:
  FFT
  Multigrid

Slide 48: Solving the Poisson equation with the FFT
°Motivation: express the continuous solution as a Fourier series
  u(x,y) = Σ_i Σ_k u_ik sin(π i x) sin(π k y)
  u_ik is called the Fourier coefficient of u(x,y)
°Poisson's equation ∂²u/∂x² + ∂²u/∂y² = b becomes
  Σ_i Σ_k (-π²i² - π²k²) u_ik sin(π i x) sin(π k y) = Σ_i Σ_k b_ik sin(π i x) sin(π k y)
  where the b_ik are the Fourier coefficients of b(x,y)
°By uniqueness of Fourier series, u_ik = b_ik / (-π²i² - π²k²)
°Continuous algorithm (discrete algorithm):
  Compute the Fourier coefficients b_ik of the right-hand side
    (apply a 2D FFT to the values of b(i,k) on the grid)
  Compute the Fourier coefficients u_ik of the solution
    (divide each transformed b(i,k) by function(i,k))
  Compute the solution u(x,y) from the Fourier coefficients
    (apply a 2D inverse FFT to the values of b(i,k))
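
A sketch of the discrete algorithm using FFTW's type-I discrete sine transform, which matches the Dirichlet boundary conditions here. The eigenvalue formula and normalization follow the standard DST-I conventions for the positive-definite T above and should be double-checked against your sign conventions; n is the number of interior points per side and h the grid spacing.

#include <fftw3.h>
#include <math.h>

/* Solve T*u = b (T = discrete 2D Laplacian) in place: f holds b on
 * entry and u on exit, stored row-major over the n x n interior. */
void fft_poisson(double *f, int n, double h) {
    fftw_plan dst = fftw_plan_r2r_2d(n, n, f, f,
                                     FFTW_RODFT00, FFTW_RODFT00,
                                     FFTW_ESTIMATE);
    fftw_execute(dst);                        /* forward: f now holds b_ik */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double li = 2.0 - 2.0 * cos(M_PI * (i + 1) / (n + 1));
            double lk = 2.0 - 2.0 * cos(M_PI * (k + 1) / (n + 1));
            f[i * n + k] *= h * h / (li + lk);   /* divide by the eigenvalue */
        }
    fftw_execute(dst);                        /* DST-I is its own inverse ... */
    double s = 1.0 / (2.0 * (n + 1)) / (2.0 * (n + 1));
    for (long i = 0; i < (long)n * n; i++)    /* ... up to 2(n+1) per dimension */
        f[i] *= s;
    fftw_destroy_plan(dst);
}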

Slide 49: Serial FFT
°Let i = sqrt(-1) and index matrices and vectors from 0.
°The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m-by-m matrix defined as
  F[j,k] = ω^(j*k)
and ω is
  ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
°ω is a complex number whose m-th power is ω^m = 1; it is therefore called an m-th root of unity.
°E.g., for m = 4: ω = i, ω² = -1, ω³ = -i, ω⁴ = 1.

Slide 50: Using the 1D FFT for filtering
°Signal = sin(7t) + .5 sin(5t) at 128 points
°Noise = random number bounded by .75
°Filter by zeroing out FFT components < .25

Slide 51: Using the 2D FFT for image compression
°Image = 200x320 matrix of values
°Compress by keeping the largest 2.5% of FFT components
°A similar idea is used by JPEG

Slide 52: Related Transforms
°Most applications require multiplication by both F and inverse(F).
°Multiplying by F and inverse(F) are essentially the same (inverse(F) is the complex conjugate of F divided by m).
°For solving the Poisson equation and various other applications, we use variations on the FFT:
  The sin transform: imaginary part of F
  The cos transform: real part of F
°The algorithms are similar, so we will focus on the forward FFT.

Slide 53: Serial Algorithm for the FFT
°Compute the FFT of an m-element vector v, F*v:
  (F*v)[j] = Σ_{k=0}^{m-1} F(j,k) * v(k)
           = Σ_{k=0}^{m-1} ω^(j*k) * v(k)
           = Σ_{k=0}^{m-1} (ω^j)^k * v(k)
           = V(ω^j)
°where V is defined as the polynomial V(x) = Σ_{k=0}^{m-1} x^k * v(k)

Slide 54: Divide and Conquer FFT
°V can be evaluated using divide-and-conquer:
  V(x) = Σ_{k=0}^{m-1} x^k * v(k)
       = v[0] + x²*v[2] + x⁴*v[4] + … + x*(v[1] + x²*v[3] + x⁴*v[5] + …)
       = V_even(x²) + x*V_odd(x²)
°V has degree m-1, so V_even and V_odd are polynomials of degree m/2 - 1
°We evaluate these at the points (ω^j)² for 0 <= j <= m-1
°But this is really just m/2 different points, since
  (ω^(j+m/2))² = (ω^j * ω^(m/2))² = ω^(2j) * ω^m = (ω^j)²
°So an FFT on m points is reduced to 2 FFTs on m/2 points: divide and conquer!

Slide 55: Divide-and-Conquer FFT
  FFT(v, ω, m)
    if m = 1 return v[0]
    else
      v_even = FFT(v[0:2:m-2], ω², m/2)
      v_odd  = FFT(v[1:2:m-1], ω², m/2)
      ω-vec = [ω⁰, ω¹, …, ω^(m/2-1)]        (precomputed)
      return [v_even + (ω-vec .* v_odd), v_even - (ω-vec .* v_odd)]
°The .* above is component-wise multiply.
°The […,…] constructs an m-element vector from two m/2-element vectors.
°This results in an O(m log m) algorithm.
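
The same recursion in runnable C99, using strided access in place of the [0:2:m-2] slices (out-of-place for clarity, not performance; a sketch following the document's convention ω = e^(2πi/m)):

#include <complex.h>
#include <math.h>

/* FFT of v[0], v[stride], ..., v[(m-1)*stride] into out[0..m-1];
 * m must be a power of 2. Call as fft(v, out, m, 1). */
void fft(const double complex *v, double complex *out, int m, int stride) {
    if (m == 1) { out[0] = v[0]; return; }
    fft(v, out, m / 2, 2 * stride);                    /* even part */
    fft(v + stride, out + m / 2, m / 2, 2 * stride);   /* odd part  */
    for (int j = 0; j < m / 2; j++) {
        double complex w = cexp(2.0 * M_PI * I * j / m);   /* omega^j */
        double complex e = out[j], o = w * out[j + m / 2];
        out[j]         = e + o;   /* v_even + (omega-vec .* v_odd) */
        out[j + m / 2] = e - o;   /* v_even - (omega-vec .* v_odd) */
    }
}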

Slide 56: An Iterative Algorithm
°The call tree of the divide-and-conquer FFT algorithm is a complete binary tree with log m levels
°Practical algorithms are iterative, going across each level of the tree starting at the bottom
°The algorithm overwrites v[i] by (F*v)[bitreverse(i)]
(figure: call tree for m = 16, from FFT(xxxx) = FFT(0,1,2,…,15) at the root, splitting into even FFT(xxx0) = FFT(0,2,…,14) and odd FFT(xxx1) = FFT(1,3,…,15), down to single-element leaves in bit-reversed order: FFT(0), FFT(8), FFT(4), FFT(12), FFT(2), FFT(10), FFT(6), FFT(14), FFT(1), FFT(9), FFT(5), FFT(13), FFT(3), FFT(11), FFT(7), FFT(15))

Slide 57: Parallel 1D FFT
°Data dependencies in the 1D FFT form a butterfly pattern
°A PRAM algorithm takes O(log m) time
  each step (to the right in the figure) is parallel
  there are log m steps
°What about communication cost?
°See the LogP paper for details

Slide 58: Block Layout of 1D FFT
°Using a block layout (m/p contiguous elements per processor):
  No communication in the last log(m/p) steps
  Each of the first log(p) steps requires fine-grained communication

Slide 59: Cyclic Layout of 1D FFT
°Cyclic layout (only 1 element per processor, wrapped):
  No communication in the first log(m/p) steps
  Communication in the last log(p) steps

Slide 60: Parallel Complexity
°m = vector size, p = number of processors
°f = time per flop = 1
°α = startup cost for a message (in units of f)
°β = time per word in a message (in units of f)
°Time(blockFFT) = Time(cyclicFFT) = 2*m*log(m)/p + log(p)*α + m*log(p)/p * β

Slide 61: FFT With "Transpose"
°If we start with a cyclic layout, there is no communication in the first log(m/p) steps
°Then transpose the vector for the last log(p) steps
°All communication is in the transpose

Slide 62: Why is the Communication Step Called a Transpose?
°Analogous to transposing an array
°View the vector as a 2D array of m/p by p
°Note: the same idea is useful for uniprocessor caches

Slide 63: Complexity of the FFT with Transpose
°If no communication is overlapped (an overestimate!):
  Time(transposeFFT) = 2*m*log(m)/p      (same as before)
                     + (p-1)*α           (was log(p)*α)
                     + m*(p-1)/p² * β    (was m*log(p)/p * β)
°If communication is overlapped, so that we do not pay for p-1 messages, the second term becomes simply α, rather than (p-1)*α.
°This is close to optimal. See the LogP paper for details.
°See also the following papers on the class resource page:
  A. Sahai, "Hiding Communication Costs in Bandwidth Limited FFT"
  R. Nishtala et al, "Optimizing bandwidth limited problems using one-sided communication"

Slide 64: Comment on the 1D Parallel FFT
°The above algorithm leaves data in bit-reversed order
  Some applications can use it this way, like Poisson
  Others require another transpose-like operation
°Other parallel algorithms also exist
  A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman)
  Based on the Fast Multipole algorithm
  Less communication for the non-bit-reversed algorithm

Slide 65: Higher Dimension FFTs
°FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions.
°E.g., a 2D FFT does 1D FFTs on all rows and then all columns.
°There are 3 obvious possibilities for the 2D FFT:
  (1) 2D blocked layout for the matrix, using 1D algorithms for each row and column
  (2) Block row layout for the matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
  (3) Block row layout for the matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
  Option 2 is best, if we can overlap communication and computation
°For a 3D FFT the options are similar
  2 phases are done with serial FFTs, followed by a transpose for the 3rd
  can overlap communication with the 2nd phase in practice

Slide 66: FFTW - Fastest Fourier Transform in the West
°www.fftw.org
°Produces an FFT implementation optimized for:
  Your version of the FFT (complex, real, …)
  Your value of n (arbitrary, possibly prime)
  Your architecture
  Close to optimal for serial; can be improved for parallel
°Similar in spirit to PHIPAC/ATLAS/Sparsity
°Won the 1999 Wilkinson Prize for Numerical Software
°Widely used

Slide 67: Extra Slides

Slide 68: Poisson's equation arises in many models
°Heat flow: Temperature(position, time)
°Diffusion: Concentration(position, time)
°Electrostatic or gravitational potential: Potential(position)
°Fluid flow: Velocity, Pressure, Density(position, time)
°Quantum mechanics: Wave-function(position, time)
°Elasticity: Stress, Strain(position, time)

Slide 69: Poisson's equation in 1D

      [ 2 -1       ]
  T = [-1  2 -1    ]
      [   -1  2 -1 ]
      [      -1  2 ]

Graph and "stencil": each point couples to itself with weight 2 and to each neighbor with weight -1.

Slide 70: Algorithms for 2D Poisson Equation with N unknowns

  Algorithm     | Serial  | PRAM          | Memory  | #Procs
  Dense LU      | N³      | N             | N²      | N²
  Band LU       | N²      | N             | N^3/2   | N
  Jacobi        | N²      | N             | N       | N
  Explicit Inv. | N²      | log N         | N²      | N²
  Conj. Grad.   | N^3/2   | N^1/2 * log N | N       | N
  RB SOR        | N^3/2   | N^1/2         | N       | N
  Sparse LU     | N^3/2   | N^1/2         | N*log N | N
  FFT           | N*log N | log N         | N       | N
  Multigrid     | N       | log² N        | N       | N
  Lower bound   | N       | log N         | N       |

PRAM is an idealized parallel model with zero-cost communication.

Slide 71: Short explanations of algorithms on previous slide
°Sorted in two orders (roughly):
  from slowest to fastest on sequential machines
  from most general (works on any matrix) to most specialized (works on matrices "like" Poisson)
°Dense LU: Gaussian elimination; works on any N-by-N matrix
°Band LU: exploits the fact that T is nonzero only on the sqrt(N) diagonals nearest the main diagonal, so faster
°Jacobi: essentially does matrix-vector multiply by T in the inner loop of an iterative algorithm
°Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) "for free", and just multiply by it (it's still expensive!)
°Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not
°Red-Black SOR (Successive Overrelaxation): variation of Jacobi that exploits yet different mathematical properties of T; used in multigrid
°Sparse LU: Gaussian elimination exploiting the particular zero structure of T
°FFT (Fast Fourier Transform): works only on matrices very like T
°Multigrid: also works on matrices like T, that come from elliptic PDEs
°Lower Bound: serial (time to print the answer); parallel (time to combine N inputs)
°Details in class notes and www.cs.berkeley.edu/~demmel/ma221

Slide 72: Irregular mesh: Tapered Tube (multigrid)

Slide 73: Composite mesh from a mechanical structure

