CS267 Lecture 20 (04/15/2009): Parallel Spectral Methods: Solving Elliptic Problems with FFTs. Horst Simon


1. Parallel Spectral Methods: Solving Elliptic Problems with FFTs
Horst Simon
hdsimon@eecs.berkeley.edu
www.cs.berkeley.edu/~demmel/cs267_Spr09

2. Motifs
The Motifs (formerly "Dwarfs") from "The Berkeley View" (Asanovic et al.). Motifs form key computational patterns; spectral methods, the topic of this lecture, are one of them.

3. References
°Previous CS267 lectures
°Lecture by Geoffrey Fox: http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/cps615fft00.ppt
°FFTW project: http://www.fftw.org
°Spiral project: http://www.spiral.net

4. Poisson's equation arises in many models
°Electrostatic or gravitational potential: Potential(position)
°Heat flow: Temperature(position, time)
°Diffusion: Concentration(position, time)
°Fluid flow: Velocity, Pressure, Density(position, time)
°Elasticity: Stress, Strain(position, time)
°Variations of Poisson have variable coefficients
3D: ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)
2D: ∂²u/∂x² + ∂²u/∂y² = f(x,y)
1D: d²u/dx² = f(x)
f represents the sources; boundary conditions are also needed.

5. Algorithms for the 2D (3D) Poisson Equation (N = n^2 (n^3) variables)

Algorithm        Serial             PRAM                        Memory             #Procs
Dense LU         N^3                N                           N^2                N^2
Band LU          N^2 (N^(7/3))      N                           N^(3/2) (N^(5/3))  N (N^(4/3))
Jacobi           N^2 (N^(5/3))      N (N^(2/3))                 N                  N
Explicit Inv.    N^2                log N                       N^2                N^2
Conj. Gradients  N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3)) * log N   N                  N
Red/Black SOR    N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3))           N                  N
Sparse LU        N^(3/2) (N^2)      N^(1/2)                     N*log N (N^(4/3))  N
FFT              N*log N            log N                       N                  N
Multigrid        N                  log^2 N                     N                  N
Lower bound      N                  log N                       N

PRAM is an idealized parallel model with zero-cost communication.
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.

6. Solving Poisson's Equation with the FFT
°Express any 2D function defined on 0 ≤ x, y ≤ 1 as a series:
u(x,y) = Σ_j Σ_k u_jk sin(πjx) sin(πky)
Here the u_jk are called the Fourier coefficients of u(x,y).
°The inverse of this is:
u_jk = 4 ∫∫ u(x,y) sin(πjx) sin(πky) dx dy
°Poisson's equation ∂²u/∂x² + ∂²u/∂y² = f(x,y) becomes
Σ_j Σ_k (-π²j² - π²k²) u_jk sin(πjx) sin(πky) = Σ_j Σ_k f_jk sin(πjx) sin(πky)
°where the f_jk are the Fourier coefficients of f(x,y), i.e. f(x,y) = Σ_j Σ_k f_jk sin(πjx) sin(πky)
°This implies the PDE can be solved exactly algebraically:
u_jk = f_jk / (-π²j² - π²k²)

7. Solving Poisson's Equation with the FFT
°So the solution of Poisson's equation involves the following steps:
1) Find the Fourier coefficients f_jk of f(x,y) by performing the integral
2) Form the Fourier coefficients of u by u_jk = f_jk / (-π²j² - π²k²)
3) Construct the solution by performing the sum u(x,y)
°There is another version of this (the Discrete Fourier Transform) that works with functions defined at grid points rather than with the continuous integral. Also, the simplest (mathematically) transform uses exp(-2πijx), not sin(πjx). Let us first consider the 1D discrete version of this case. The PDE case normally deals with discretized functions, since these are needed for other parts of the problem.
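To make the three steps concrete, here is a minimal 1D sketch using FFTW's real sine transform (a sketch assuming the FFTW3 library; the grid size n, source array f, and zero-Dirichlet setup on (0,1) are illustrative):

```c
/* Sketch: solve d2u/dx2 = f on (0,1) with u(0)=u(1)=0 via a sine transform.
   Assumes FFTW3 (link with -lfftw3 -lm). FFTW_RODFT00 is the DST-I;
   applying it twice scales the data by 2(n+1). */
#include <fftw3.h>
#include <math.h>

void poisson1d(int n, double *f, double *u) {
    fftw_plan fwd = fftw_plan_r2r_1d(n, f, u, FFTW_RODFT00, FFTW_ESTIMATE);
    fftw_execute(fwd);                       /* step 1: u holds coefficients f_j */
    for (int j = 1; j <= n; j++)
        u[j-1] /= -(M_PI * j) * (M_PI * j);  /* step 2: u_j = f_j / (-pi^2 j^2) */
    fftw_plan bwd = fftw_plan_r2r_1d(n, u, u, FFTW_RODFT00, FFTW_ESTIMATE);
    fftw_execute(bwd);                       /* step 3: back to grid values */
    for (int j = 0; j < n; j++)
        u[j] /= 2.0 * (n + 1);               /* undo the DST-I round-trip scaling */
    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
}
```

The division by -π²j² mirrors step 2 above; a production solver would typically divide by the discrete eigenvalues of the grid Laplacian instead of the continuous ones.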

8. Serial FFT
°Let i = sqrt(-1) and index matrices and vectors from 0.
°The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m-by-m matrix defined as F[j,k] = ω^(j*k), and where
ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
ω is a complex number whose m-th power is ω^m = 1; it is therefore called an m-th root of unity.
°E.g., for m = 4: ω = i, ω^2 = -1, ω^3 = -i, ω^4 = 1.
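Directly from this definition, a naive O(m²) DFT is only a few lines of C; a reference sketch (C99 complex arithmetic) for comparison with the FFT that follows:

```c
/* Sketch: the DFT as the matrix-vector product F*v, straight from the
   definition above. O(m^2); for illustration only. */
#include <complex.h>
#include <math.h>

void dft(int m, const double complex *v, double complex *Fv) {
    for (int j = 0; j < m; j++) {
        double complex wj = cexp(2.0 * M_PI * I * j / m);  /* w^j */
        double complex wjk = 1.0, sum = 0.0;
        for (int k = 0; k < m; k++) {
            sum += wjk * v[k];   /* (w^j)^k * v(k) */
            wjk *= wj;
        }
        Fv[j] = sum;
    }
}
```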

9. Using the 1D FFT for filtering
°Signal = sin(7t) + 0.5*sin(5t) at 128 points
°Noise = random number bounded by 0.75
°Filter by zeroing out FFT components with magnitude < 0.25

10. Using the 2D FFT for image compression
°Image = 200-by-320 matrix of values
°Compress by keeping the largest 2.5% of FFT components
°A similar idea is used by JPEG

11. Related Transforms
°Most applications require multiplication by both F and inverse(F).
°Multiplying by F and by inverse(F) are essentially the same operation: inverse(F) is the complex conjugate of F divided by m.
°For solving the Poisson equation and various other applications, we use variations on the FFT:
The sine transform: the imaginary part of F
The cosine transform: the real part of F
°The algorithms are similar, so we will focus on the forward FFT.

12. Serial Algorithm for the FFT
°Compute the FFT of an m-element vector v, F*v:
(F*v)[j] = Σ_{k=0}^{m-1} F(j,k) * v(k)
         = Σ_{k=0}^{m-1} ω^(j*k) * v(k)
         = Σ_{k=0}^{m-1} (ω^j)^k * v(k)
         = V(ω^j)
°where V is defined as the polynomial V(x) = Σ_{k=0}^{m-1} x^k * v(k)

13. Divide and Conquer FFT
°V can be evaluated using divide and conquer:
V(x) = Σ_{k=0}^{m-1} x^k * v(k)
     = v[0] + x²*v[2] + x⁴*v[4] + … + x*(v[1] + x²*v[3] + x⁴*v[5] + …)
     = V_even(x²) + x*V_odd(x²)
°V has degree m-1, so V_even and V_odd are polynomials of degree m/2 - 1
°We evaluate these at the points (ω^j)² for 0 ≤ j ≤ m-1
°But these are really just m/2 different points, since
(ω^(j+m/2))² = (ω^j * ω^(m/2))² = ω^(2j) * ω^m = (ω^j)²
°So an FFT on m points is reduced to 2 FFTs on m/2 points: divide and conquer!

14. Divide-and-Conquer FFT

    FFT(v, ω, m):
        if m = 1 return v[0]
        else
            v_even = FFT(v[0:2:m-2], ω^2, m/2)
            v_odd  = FFT(v[1:2:m-1], ω^2, m/2)
            ω_vec  = [ω^0, ω^1, …, ω^(m/2-1)]    (precomputed)
            return [v_even + (ω_vec .* v_odd), v_even - (ω_vec .* v_odd)]

°The .* above is component-wise multiplication.
°The […, …] notation constructs an m-element vector from two m/2-element vectors.
°This results in an O(m log m) algorithm.
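A hedged C99 translation of this pseudocode (per-call allocations keep it close to the pseudocode; a real implementation would precompute twiddle factors and work in place):

```c
/* Sketch of the recursive algorithm above, for m a power of two. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

void fft(const double complex *v, double complex *out, int m, double complex w) {
    if (m == 1) { out[0] = v[0]; return; }
    double complex *even = malloc((m/2) * sizeof *even);
    double complex *odd  = malloc((m/2) * sizeof *odd);
    double complex *tmp  = malloc(m * sizeof *tmp);
    for (int k = 0; k < m/2; k++) {          /* split into even/odd elements */
        tmp[k]       = v[2*k];
        tmp[m/2 + k] = v[2*k + 1];
    }
    fft(tmp,       even, m/2, w * w);        /* v_even = FFT(v[0:2:m-2], w^2) */
    fft(tmp + m/2, odd,  m/2, w * w);        /* v_odd  = FFT(v[1:2:m-1], w^2) */
    double complex wj = 1.0;
    for (int j = 0; j < m/2; j++) {          /* combine with twiddle factors */
        out[j]       = even[j] + wj * odd[j];
        out[j + m/2] = even[j] - wj * odd[j];
        wj *= w;
    }
    free(even); free(odd); free(tmp);
}
/* call as: fft(v, out, m, cexp(2.0 * M_PI * I / m)); */
```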

15. An Iterative Algorithm
°The call tree of the divide-and-conquer FFT algorithm is a complete binary tree with log m levels.
°An iterative algorithm uses loops rather than recursion and goes through each level in the tree, starting at the bottom.
The algorithm overwrites v[i] by (F*v)[bitreverse(i)].
°Practical algorithms combine recursion (for the memory hierarchy) and iteration (to avoid function-call overhead).
[Figure: call tree for m = 16. The root FFT(0,1,2,…,15) = FFT(xxxx) splits on the low-order bit into the even half FFT(0,2,…,14) = FFT(xxx0) and the odd half FFT(1,3,…,15) = FFT(xxx1), then into FFT(xx00), FFT(xx10), FFT(xx01), FFT(xx11), and so on, down to single elements in bit-reversed order: FFT(0), FFT(8), FFT(4), FFT(12), FFT(2), FFT(10), FFT(6), FFT(14), FFT(1), FFT(9), FFT(5), FFT(13), FFT(3), FFT(11), FFT(7), FFT(15).]
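The bit-reversal permutation mentioned above is simple to state in code (a sketch; bits = log₂ m):

```c
/* Sketch: the index permutation the iterative algorithm produces.
   Reverses the low 'bits' bits of i. */
unsigned bitreverse(unsigned i, int bits) {
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1);   /* shift the low bit of i into r */
        i >>= 1;
    }
    return r;
}
/* e.g. for m = 16 (bits = 4): bitreverse(1, 4) = 8, matching FFT(8)
   landing in position 1's slot in the call tree above. */
```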

16. Parallel 1D FFT
°The data dependencies in the 1D FFT form a butterfly pattern.
°A PRAM algorithm takes O(log m) time: each step (to the right in the butterfly diagram) is fully parallel, and there are log m steps.
°What about communication cost?
°See the LogP paper for details.

17. Block Layout of 1D FFT
°Using a block layout (m/p contiguous elements per processor):
°Each of the first log(p) steps requires fine-grained communication.
°There is no communication in the last log(m/p) steps.

18. Cyclic Layout of 1D FFT
°Using a cyclic layout (only 1 element per processor, wrapped):
°No communication in the first log(m/p) steps.
°Communication in the last log(p) steps.
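A small sketch makes the two layouts and their locality concrete (names are illustrative; m divisible by p):

```c
/* Sketch: which processor owns element i of an m-vector on p processors. */
int owner_block(int i, int m, int p)  { return i / (m / p); }  /* contiguous chunks */
int owner_cyclic(int i, int m, int p) { (void)m; return i % p; }  /* wrapped */

/* A butterfly at stride s is local exactly when both endpoints share an
   owner, i.e. owner(i) == owner(i + s). For the block layout that holds
   when s < m/p (the last log(m/p) steps); for the cyclic layout it holds
   when s is a multiple of p (the first log(m/p) steps). */
```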

19. Parallel Complexity
°m = vector size, p = number of processors
°f = time per flop = 1
°α = startup for a message (in units of f)
°β = time per word in a message (in units of f)
°Time(blockFFT) = Time(cyclicFFT) = 2*m*log(m)/p + log(p)*α + m*log(p)/p*β

20. FFT With "Transpose"
°If we start with a cyclic layout, there is no communication in the first log(m/p) steps.
°Then transpose the vector, so the last log(p) steps are also communication-free.
°All communication is in the transpose.
°Note: this example has log(m/p) = log(p). If log(m/p) > log(p), more phases/layouts will be needed. We will work with this assumption for simplicity.

21. Why is the Communication Step Called a Transpose?
°It is analogous to transposing an array.
°View the vector as a 2D array of m/p by p.
°Note: the same idea is useful for uniprocessor caches.

22. Complexity of the FFT with Transpose
°If no communication is pipelined (an overestimate!):
Time(transposeFFT) = 2*m*log(m)/p    (same as before)
                   + (p-1)*α         (was log(p)*α)
                   + m*(p-1)/p²*β    (was m*log(p)/p*β)
°If communication is pipelined, so we do not pay for p-1 messages, the second term becomes simply α rather than (p-1)*α.
°This is close to optimal. See the LogP paper for details.
°See also the following papers on the class resource page:
A. Sahai, "Hiding Communication Costs in Bandwidth Limited FFT"
R. Nishtala et al., "Optimizing bandwidth limited problems using one-sided communication"
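Both models are easy to evaluate numerically; a sketch in C (times in flop units; the parameter values in the comment are illustrative):

```c
/* Sketch: the two performance models above. */
#include <math.h>

double t_cyclic(double m, double p, double alpha, double beta) {
    return 2*m*log2(m)/p + log2(p)*alpha + m*log2(p)/p*beta;
}
double t_transpose(double m, double p, double alpha, double beta) {
    return 2*m*log2(m)/p + (p-1)*alpha + m*(p-1)/(p*p)*beta;
}
/* e.g. m = 2^20, p = 64, alpha = 1000, beta = 10: the transpose version
   trades more messages (p-1 vs. log p) for roughly a log(p) factor less
   data volume per processor (m*(p-1)/p^2 ~ m/p vs. m*log(p)/p words). */
```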

23. Comment on the 1D Parallel FFT
°The above algorithm leaves data in bit-reversed order.
Some applications, like Poisson, can use it this way; others require another transpose-like operation.
°Other parallel algorithms also exist:
A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman), based on the Fast Multipole algorithm; it needs less communication for a non-bit-reversed result.

24. Higher-Dimensional FFTs
°FFTs on 2 or 3 dimensions are defined as 1D FFTs on the vectors in each dimension.
°E.g., a 2D FFT does 1D FFTs on all rows and then on all columns.
°There are 3 obvious possibilities for the 2D FFT (a serial sketch of option 2 follows below):
(1) 2D blocked layout for the matrix, using 1D algorithms for each row and column
(2) Block row layout for the matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
(3) Block row layout for the matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
Option 2 is best if we overlap communication and computation.
°For a 3D FFT the options are similar:
Two phases are done with serial FFTs, followed by a transpose for the third; in practice, communication can be overlapped with the second phase.
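As referenced above, here is a serial sketch of option 2's structure (row FFTs, transpose, row FFTs again, which are the original columns); fft1d stands for any in-place 1D FFT, such as the recursive one earlier:

```c
/* Sketch: a serial n-by-n 2D FFT in the rows/transpose/rows style.
   'a' is the row-major input, 'scratch' receives the transposed result. */
#include <complex.h>

void fft2d(double complex *a, double complex *scratch, int n,
           void (*fft1d)(double complex *row, int n)) {
    for (int r = 0; r < n; r++)              /* 1D FFTs on all rows */
        fft1d(a + r*n, n);
    for (int r = 0; r < n; r++)              /* transpose */
        for (int c = 0; c < n; c++)
            scratch[c*n + r] = a[r*n + c];
    for (int r = 0; r < n; r++)              /* rows again = original columns */
        fft1d(scratch + r*n, n);
}
```

In the parallel version, the transpose in the middle is exactly the all-to-all communication step discussed earlier.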

25. FFTW – Fastest Fourier Transform in the West
°www.fftw.org
°Produces an FFT implementation optimized for:
Your version of the FFT (complex, real, …)
Your value of n (arbitrary, possibly prime)
Your architecture
Close to optimal for serial; can be improved for parallel
°Similar in spirit to PHiPAC/ATLAS/Sparsity
°Won the 1999 Wilkinson Prize for Numerical Software
°Widely used for serial FFTs
Had parallel FFTs in version 2, but these are no longer supported
Layout constraints from users/apps plus network differences are hard to support

26. Bisection Bandwidth
°The FFT requires one (or more) transpose operations: every processor sends 1/p of its data to each other one.
°Bisection bandwidth limits this performance:
Bisection bandwidth is the bandwidth across the narrowest part of the network
It is important in global transpose operations, all-to-all, etc.
°"Full bisection bandwidth" is expensive:
The fraction of machine cost in the network is increasing
Fat-tree and full-crossbar topologies may be too expensive, especially on machines with 100K or more processors
SMP clusters often limit bandwidth at the node level

27. Modified LogGP Model
°LogGP: no overlap
[Diagram: a message timeline from P0 to P1 showing send overhead (o_send), latency (L), and receive overhead (o_recv); a second timeline shows the gap g between consecutive sends.]
EEL: end-to-end latency (1/2 round trip)
g: minimum time between small message sends
G: additional gap per byte for larger messages
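A compact statement of the model (hedged: the exact accounting varies across the LogGP literature) is that one n-byte message costs about

```latex
T_{\mathrm{msg}}(n) \;\approx\; o_{\mathrm{send}} + L + (n-1)\,G + o_{\mathrm{recv}}
```

with consecutive injections from one processor separated by at least g.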

28. Historical Perspective
[Chart: one-half round-trip latency across machine generations.]
°Potential performance advantage for fine-grained, one-sided programs
°Potential productivity advantage for irregular applications

29. General Observations
°The overlap potential is the difference between the gap and the overhead:
No potential if the CPU is tied up throughout the message send (e.g., no send-side DMA)
Grows with message size on machines with DMA, because the per-byte cost is handled by the NIC
Grows with the amount of network congestion, because the gap grows as the network becomes saturated
°Remote overhead is 0 for machines with RDMA.

30. GASNet Communications System
°GASNet offers put/get communication.
°One-sided: no remote CPU involvement is required in the API (the key difference from MPI):
The message contains the remote address
No need to match with a receive
No implicit ordering required
°Software stack: compiler-generated code, over a language-specific runtime, over GASNet, over the network hardware
°Used in language runtimes (UPC, etc.)
°Fine-grained and bulk transfers
°Split-phase communication
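As a hedged illustration of what split-phase put/get looks like, in the style of the classic GASNet-1 extended API (signatures may differ across GASNet versions; check the spec):

```c
/* Sketch: initiate a one-sided put, overlap local work, then sync. */
#include <gasnet.h>

void split_phase_put(gasnet_node_t peer, void *remote_addr,
                     void *local_buf, size_t nbytes) {
    gasnet_handle_t h = gasnet_put_nb(peer, remote_addr, local_buf, nbytes);
    /* ... overlap: do unrelated local computation while data moves ... */
    gasnet_wait_syncnb(h);   /* complete the operation (split phase) */
}
```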

31. Performance of One-Sided vs. Two-Sided Communication: GASNet vs. MPI
°Comparison on Opteron/InfiniBand: GASNet's vapi-conduit and OSU MPI 0.9.5.
°Up to the large-message regime (> 256 KB), GASNet provides up to a 2.2x improvement in streaming bandwidth.
°The half-power point (N/2) differs by an order of magnitude.

32. GASNet: Performance for Mid-Range Message Sizes
(up is good)
°GASNet usually reaches saturation bandwidth before MPI: fewer costs to amortize.
°It usually outperforms MPI at medium message sizes, often by a large margin.

33. NAS FT Case Study
°Performance of the Exchange (all-to-all) is critical:
The communication-to-computation ratio increases with faster, more optimized 1D FFTs
It is determined by the available bisection bandwidth
It accounts for 30-40% of the application's total runtime
°Two ways to reduce the Exchange cost:
1. Use a better network (higher bisection bandwidth)
2. Overlap the all-to-all with computation (where possible): "break up" the exchange
The default NAS FT Fortran/MPI implementation relies on #1; our approach uses UPC/GASNet and builds on #2.
°This started as a CS267 project:
The 1D partition of the 3D grid is a limitation: at most N processors for an N^3 grid
The HPC Challenge benchmark has a large 1D FFT (which can be viewed as 3D or more with the proper roots of unity)

34. 3D FFT Operation with Global Exchange
[Figure: cachelines flow through 1D FFTs on columns, a transpose plus 1D FFTs on rows, then the Exchange (all-to-all), in which each thread sends a chunk to Thread 0, Thread 1, Thread 2, …; the last 1D FFTs (Thread 0's view) follow a transpose and a division of rows among threads.]
°A single communication operation (the global exchange) sends THREADS large messages; a collective sketch follows below.
°Computation and communication occur in separate phases.
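In MPI terms (not the NAS FT source itself), the whole exchange is a single collective; counts and types here are illustrative:

```c
/* Sketch: the global exchange as one collective. Each of the p ranks
   sends 1/p of its local data to every other rank; the local transpose
   that regroups rows for the final 1D FFTs happens afterwards. */
#include <mpi.h>

void global_exchange(const double *sendbuf, double *recvbuf,
                     int elems_per_dest, MPI_Comm comm) {
    MPI_Alltoall(sendbuf, elems_per_dest, MPI_DOUBLE,
                 recvbuf, elems_per_dest, MPI_DOUBLE, comm);
}
```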

35. Communication Strategies for the 3D FFT
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
°Three approaches:
Chunk (all rows with the same destination): wait for the 2nd-dimension FFTs to finish; minimizes the number of messages
Slab (all rows in a single plane with the same destination): wait for the chunk of rows destined for one processor to finish; overlaps with computation
Pencil (one row): send each row as it completes; maximizes overlap and matches the natural layout

36. Decomposing the NAS FT Exchange into Smaller Messages
°Three approaches:
Chunk: wait for the 2nd-dimension FFTs to finish
Slab: wait for the chunk of rows destined for one processor to finish
Pencil: send each row as it completes
°Example message-size breakdown for Class D (2048 x 1024 x 1024) at 256 processors:
Exchange (default): 512 Kbytes
Slabs (set of contiguous rows): 65 Kbytes
Pencils (single row): 16 Kbytes

37. Overlapping Communication
°Goal: make use of "all the wires":
Distributed-memory machines allow for asynchronous communication
The Berkeley non-blocking extensions expose GASNet's non-blocking operations
°Approach: break up the all-to-all communication (an MPI-flavored sketch follows below):
Interleave row computations and row communications, since the 1D FFT is independent across rows
The decomposition can be into slabs (contiguous sets of rows) or pencils (individual rows)
Pencils allow an earlier start for the communication "phase" and improved local cache use, but mean more, smaller messages (same total volume)
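As mentioned above, here is an MPI-flavored sketch of the slab variant (the actual study uses UPC/GASNet one-sided puts; the names and bookkeeping here are illustrative):

```c
/* Sketch: as soon as the rows for one destination finish their 1D FFTs,
   start their send and keep computing the next slab. */
#include <mpi.h>

void fft_rows_of_slab(double *slab);   /* hypothetical: 1D FFTs for one slab */

void send_slabs_overlapped(double *slabs, int nslabs, int slab_elems,
                           const int *dest_rank, MPI_Comm comm,
                           MPI_Request *reqs) {
    /* matching MPI_Irecv calls are assumed posted elsewhere */
    for (int s = 0; s < nslabs; s++) {
        fft_rows_of_slab(slabs + (size_t)s * slab_elems);   /* compute slab s */
        MPI_Isend(slabs + (size_t)s * slab_elems, slab_elems, MPI_DOUBLE,
                  dest_rank[s], /*tag=*/s, comm, &reqs[s]); /* send immediately */
    }
    MPI_Waitall(nslabs, reqs, MPI_STATUSES_IGNORE);         /* drain at the end */
}
```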

38. NAS FT: UPC Non-blocking MFlops
°The Berkeley UPC compiler supports the non-blocking UPC extensions.
°They produce a 15-45% speedup over the best blocking UPC version.
°The non-blocking version requires about 30 extra lines of UPC code.

39. NAS FT Variants Performance Summary
°Shown are the largest classes/configurations possible on each test machine.
°MPI is not particularly tuned for many small/medium-size messages in flight (long message-matching queue depths).

40. Pencil/Slab Optimizations: UPC vs. MPI
°Same data, viewed in the context of what MPI is able to overlap:
"For the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation?"
°On InfiniBand, UPC overlaps almost all of the time that MPI spends in communication.
°On Elan3, UPC obtains more overlap than MPI as the problem scales up.

41. Summary of Overlap in FFTs
°One-sided communication has performance advantages:
It is a better match for most networking hardware: most cluster networks have RDMA support, and machines with global address space support (X1, Altix) are shown elsewhere
Smaller messages may make better use of the network: they spread communication over a longer period of time and postpone the bisection-bandwidth pain
Smaller messages can also prevent cache thrashing during packing: packing overheads are avoided if the natural message size is reasonable

42. FFTW
°A C library for real and complex FFTs (of arbitrary size/dimensionality)
°Computational kernels (80% of the code) are automatically generated
°Self-optimizes for your hardware (picks the best composition of steps) = portability + performance
°(Plus parallel versions for threads and MPI)
°Free software: http://www.fftw.org/
°The "Fastest Fourier Transform in the West"

43. FFTW Performance: power-of-two sizes, double precision
[Chart: performance on an 833 MHz Alpha EV6, 2 GHz PowerPC G5, 2 GHz AMD Opteron, and 500 MHz UltraSPARC IIe.]

44. FFTW Performance: non-power-of-two sizes, double precision
[Chart: performance on an 833 MHz Alpha EV6 and 2 GHz AMD Opteron.]
°Unusual: non-power-of-two sizes receive as much optimization as powers of two, because we let the code do the optimizing.

45. FFTW Performance: double precision, 2.8 GHz Pentium IV, 2-way SIMD (SSE2)
[Chart: powers of two and non-powers of two.]
°Exploiting CPU-specific SIMD instructions (rewriting the code) is easy, because we let the code write itself.

46. Why is FFTW Fast? Three unusual features:
1. FFTW implements many FFT algorithms: a planner picks the best composition by measuring the speed of different combinations.
2. The resulting plan is executed with explicit recursion: this enhances locality.
3. The base cases of the recursion are codelets: highly optimized dense code automatically generated by a special-purpose "compiler".

47. FFTW is Easy to Use

    {
        complex x[n];
        plan p;
        p = plan_dft_1d(n, x, x, FORWARD, MEASURE);
        ...
        execute(p); /* repeat as needed */
        ...
        destroy_plan(p);
    }

°Key fact: usually, many transforms of the same size are required.
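The slide's code is schematic; with the actual FFTW3 C API the same plan-once, execute-many pattern looks like this (link with -lfftw3):

```c
/* Sketch: repeated 1D complex transforms of the same size with FFTW3. */
#include <fftw3.h>

void transform_many(int n, int reps) {
    fftw_complex *x = fftw_malloc(sizeof(fftw_complex) * n);
    /* FFTW_MEASURE times candidate plans (and clobbers x), so plan first */
    fftw_plan p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_MEASURE);
    for (int r = 0; r < reps; r++) {
        /* ... fill x with this iteration's data ... */
        fftw_execute(p);   /* plan once, execute many times */
    }
    fftw_destroy_plan(p);
    fftw_free(x);
}
```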

48. Why is FFTW Fast? (repeat of slide 46)

49. FFTW Uses Natural Recursion
[Diagram: a size-8 DFT splits into size-4 DFTs, then size-2 DFTs (radix p = 2).]
°But the traditional implementation is non-recursive: a breadth-first traversal with log₂ n passes over the whole array.

50. Traditional Cache Solution: Blocking
[Diagram: size-8 DFT, size-4 DFTs, size-2 DFTs (radix p = 2).]
°Breadth-first, but with blocks of size = cache.
°…requires a program specialized for the cache size.

51. Recursive Divide & Conquer is Good [Singleton, 1967]
[Diagram: size-8 DFT splitting depth-first into size-4 and size-2 DFTs (radix p = 2).]
°The subproblems eventually become small enough to fit in cache.
°…no matter what size the cache is (depth-first traversal).

52. Cache Obliviousness
°A cache-oblivious algorithm does not know the cache size: it can be optimal for any machine and for all levels of cache simultaneously.
°Cache-oblivious algorithms exist for many other problems, too [Frigo et al. 1999], all via the recursive divide & conquer approach.

53. Why is FFTW Fast? (repeat of slide 46)

