1 synergy.cs.vt.edu The Hardware-Software Co-Design Process for the fast Fourier transform (FFT) Carlo C. del Mundo Advisor: Prof. Wu-chun Feng

2 The Multi- and Many-core Menace
“...when we start talking about parallelism and ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced.... I would be panicked if I were in industry.”
John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach
The Co-Design Process for the FFT

3 Berkeley’s View

4 Dwarfs of Symbolic Computation [1,2]
Dwarf (noun): An algorithmic method that captures a pattern of computation and communication
[1] Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf
[2] Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

5 Dwarfs of Symbolic Computation [1,2]
– Structured Grid
– Unstructured Grid
– Spectral Methods
– Dense Linear Algebra
– Sparse Linear Algebra
– N-Body Methods
– MapReduce
– Graphical Models
– Combinational Logic
– Graph Traversal
– Dynamic Programming
– Branch-and-Bound
– Finite State Machines

7 Dwarfs of Symbolic Computation [1,2]
Spectral Methods: Fast Fourier Transform (Software, Hardware)

9 OpenFFT: A heterogeneous FFT library
– C. del Mundo and W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” in IEEE IPDPS ’14. Phoenix, AZ, USA, May 2014. (Under review.)
– C. del Mundo and W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture,” SC|13, Denver, CO, Nov. 2013. (Poster publication.)
– C. del Mundo et al. “Accelerating FFT for Wideband Channelization,” in IEEE ICC ’13. Budapest, Hungary, June 2013.

10 Berkeley’s View

11 Computational Pattern: The Butterfly

12 Dwarf: Spectral Methods
– Computational Pattern: Butterfly Pattern
Figure 1: Simple Butterfly (inputs a, b; outputs a + b, a - b)
Figure 2: Butterfly with Twiddle, w_k (inputs a, b; outputs a + b * w_k, a - b * w_k)
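The two butterflies in Figures 1 and 2 can be sketched on the host in plain C. This is only an illustration, not code from the talk: the `float2` struct (standing in for the OpenCL vector type, with x as the real part and y as the imaginary part) and the helper names `butterfly` and `butterfly_twiddle` are assumptions made for this sketch.

```c
/* Hypothetical host-side stand-in for OpenCL's float2: x = real, y = imaginary. */
typedef struct { float x, y; } float2;

/* Complex multiply: (a.x + i*a.y) * (b.x + i*b.y). */
static float2 cmul(float2 a, float2 b)
{
    float2 r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* Figure 1: simple butterfly, (a, b) -> (a + b, a - b). */
static void butterfly(float2 *a, float2 *b)
{
    float2 sum  = { a->x + b->x, a->y + b->y };
    float2 diff = { a->x - b->x, a->y - b->y };
    *a = sum;
    *b = diff;
}

/* Figure 2: butterfly with twiddle w, (a, b) -> (a + b*w, a - b*w). */
static void butterfly_twiddle(float2 *a, float2 *b, float2 w)
{
    float2 t = cmul(*b, w);   /* scale b by the twiddle factor first */
    float2 sum  = { a->x + t.x, a->y + t.y };
    float2 diff = { a->x - t.x, a->y - t.y };
    *a = sum;
    *b = diff;
}
```

With w = (1, 0) the second routine reduces to the first, which is why the untwiddled butterfly is the special case shown in Figure 1.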

13 16-pt FFT: Stages of Computation
Input Array -> S1: “Columns” -> S2: “Twiddles” -> S3: “Transpose” -> S1: “Columns”
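One way to read the four stages is as the standard 4 x 4 decomposition of a 16-point FFT. The sketch below runs that decomposition sequentially on the host, under several assumptions that are not on the slide: the input is viewed as a row-major 4 x 4 matrix M[r][c] = x[4*r + c], the twiddle for element (r, c) is exp(-2*pi*i*r*c/16), and the names `float2`, `fft4`, `fft16`, and `W16` are hypothetical.

```c
/* Hypothetical stand-in for OpenCL's float2: x = real, y = imaginary. */
typedef struct { float x, y; } float2;

static float2 cadd(float2 a, float2 b) { float2 r = { a.x + b.x, a.y + b.y }; return r; }
static float2 csub(float2 a, float2 b) { float2 r = { a.x - b.x, a.y - b.y }; return r; }
static float2 cmul(float2 a, float2 b)
{
    float2 r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

#define COS_PI_8 0.92387953f
#define SIN_PI_8 0.38268343f
#define COS_PI_4 0.70710678f

/* W16[m] = exp(-2*pi*i*m/16) for m = 0..9; r*c never exceeds 9 for r, c in 0..3. */
static const float2 W16[10] = {
    {  1.0f,      0.0f     }, {  COS_PI_8, -SIN_PI_8 },
    {  COS_PI_4, -COS_PI_4 }, {  SIN_PI_8, -COS_PI_8 },
    {  0.0f,     -1.0f     }, { -SIN_PI_8, -COS_PI_8 },
    { -COS_PI_4, -COS_PI_4 }, { -COS_PI_8, -SIN_PI_8 },
    { -1.0f,      0.0f     }, { -COS_PI_8,  SIN_PI_8 }
};

/* In-place 4-point DFT of x[0], x[stride], x[2*stride], x[3*stride]. */
static void fft4(float2 *x, int stride)
{
    float2 a = x[0], b = x[stride], c = x[2 * stride], d = x[3 * stride];
    float2 s0 = cadd(a, c), s1 = csub(a, c), s2 = cadd(b, d), bd = csub(b, d);
    float2 s3 = { bd.y, -bd.x };              /* -i * (b - d) */
    x[0]          = cadd(s0, s2);
    x[stride]     = cadd(s1, s3);
    x[2 * stride] = csub(s0, s2);
    x[3 * stride] = csub(s1, s3);
}

/* 16-point FFT via the four stages; the result is left in natural order. */
static void fft16(float2 x[16])
{
    float2 t[16];
    for (int c = 0; c < 4; ++c)               /* "Columns": 4-pt FFT down each column */
        fft4(&x[c], 4);
    for (int r = 0; r < 4; ++r)               /* "Twiddles": scale M[r][c] by W16^(r*c) */
        for (int c = 0; c < 4; ++c)
            x[4 * r + c] = cmul(x[4 * r + c], W16[r * c]);
    for (int r = 0; r < 4; ++r)               /* "Transpose": swap rows and columns */
        for (int c = 0; c < 4; ++c)
            t[4 * c + r] = x[4 * r + c];
    for (int c = 0; c < 4; ++c)               /* "Columns" again on the transposed data */
        fft4(&t[c], 4);
    for (int i = 0; i < 16; ++i)
        x[i] = t[i];
}
```

On a GPU each column FFT would typically belong to one thread, so the transpose is the only stage in which threads exchange data; the later slides on local memory and shuffle concern that stage.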

14 16-pt FFT: Computation-to-Core: 1 thread

15 16-pt FFT: Computation-to-Core: 4 threads

16 16-pt FFT: Computation-to-Core: One Warp

17 16-pt FFT: Computation-to-Core: One Block

18 16-pt FFT: Computation-to-Core: One SM
37.5% Occupancy on NVIDIA Kepler K20c

20 GPU Memory Spaces

27 Background (GPUs)
GPU Memory Hierarchy
– Global Memory
– Image Memory
– Constant Memory
– Local Memory
– Registers

Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit    Read Bandwidth (TB/s)
Registers      16.2
Constant       5.4
Local          2.7
L1/L2 Cache    1.35 / 0.45
Global         0.17

28 Global Data Bandwidth (Bus Traffic)

29 Global Data Bandwidth: Bus Traffic and Performance
Bus Traffic: Bytes transferred from off-chip to on-chip memory and back.
Suppose we take the FFT of a 128 MB data set
– Minimum bus traffic is 2 x 128 = 256 MB
  1. Load 128 MB (global -> on-chip)
  2. Perform FFT
  3. Store 128 MB (on-chip -> global)
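The minimum-traffic figure also gives a lower bound on run time once a global-memory bandwidth is fixed. A rough sketch follows, assuming decimal units (1 GB = 1000 MB) and taking the 0.17 TB/s global read bandwidth quoted for the Radeon HD 6970 as 170 GB/s in both directions; the function names are hypothetical.

```c
/* Minimum bus traffic for an FFT whose data is loaded once and stored once. */
static double min_bus_traffic_mb(double data_mb)
{
    return 2.0 * data_mb;
}

/* Lower bound on transfer time in milliseconds.
 * MB divided by GB/s yields ms when 1 GB = 1000 MB (assumed here). */
static double min_transfer_ms(double data_mb, double bandwidth_gb_per_s)
{
    return min_bus_traffic_mb(data_mb) / bandwidth_gb_per_s;
}
```

For the 128 MB example this is 256 MB / 170 GB/s, roughly 1.5 ms; any kernel that moves data between global and on-chip memory more than once per element can only be slower than that bound.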

31 16-pt FFT: Computation-to-Core: One Warp
What’s wrong with this access pattern?
– Scattered Memory Accesses (power-of-2 strides)
– Uncoalesced Memory Access
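A simplified model shows why power-of-2 strides hurt: count how many aligned memory segments one warp touches when thread tid reads element tid * stride. The warp size of 32, the 128-byte segment size, and the function name are assumptions for this sketch; real hardware has further rules that it ignores.

```c
#define WARP_SIZE     32    /* assumed warp width */
#define SEGMENT_BYTES 128   /* assumed size of one coalesced memory transaction */

/* Number of distinct segments touched when thread tid reads the element at
 * index tid * stride_elems. Assumes stride_elems >= 1 and a segment-aligned base. */
static int segments_touched(int stride_elems, int elem_bytes)
{
    int count = 0;
    long last = -1;
    for (int tid = 0; tid < WARP_SIZE; ++tid) {
        long seg = ((long)tid * stride_elems * elem_bytes) / SEGMENT_BYTES;
        if (seg != last) {   /* addresses grow with tid, so a new segment is always a larger one */
            ++count;
            last = seg;
        }
    }
    return count;
}
```

With 8-byte float2 elements, a unit stride touches 2 segments, a stride of 4 touches 8, and a stride of 16 or more touches 32, one per thread, so most of each transaction is wasted.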

34 System-level optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing

35 S1: Register Preloading
Load to registers first

Without Register Preloading:
    __kernel void FFT16_vanilla(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // Operands are read straight from global memory.
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        …
    }

With Register Preloading:
    __kernel void FFT16_strawberry1(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        float2 registers[4]; // Explicit Loads
        for (int i = 0; i < 4; ++i)
            registers[i] = buffer[4*i];
        // The FFT now operates on register-resident copies.
        FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);
        …
    }

38 S2: Vector Types
Vector Access (float{2, 4, 8, 16})
– Scalar Math (VASM): float + float
– Vector Math (VAVM): floatN + floatN
Figure: a[0] a[1] a[2] a[3]

40 S3: Constant Memory
Fast cached lookup for frequently used data

    __constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), ... more sin/cos values };

Without Constant Memory:
    for (int j = 1; j < 4; ++j)
    {
        // Recompute the twiddle factor with cos/sin on every iteration.
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With Constant Memory:
    for (int j = 1; j < 4; ++j)
        // Look up the precomputed twiddle factor instead.
        result[j] = buffer[j*4] * twiddles[4*j+tid];

43 Algorithm-level optimizations (applicable only to FFT)
1. Transpose via LM
2. Compute/Transpose via LM
3. Compute/No Transpose via LM
4. Register-to-register transpose (shuffle)

46 A1: Transpose via local memory [1]
– Via Shared Memory
– Register to Register (shfl)
[1] CUDA shared memory == OpenCL local memory

47 Algorithm-level optimizations
Figure labels: Original; Transposed

51 Algorithm-level optimizations
1. Naïve Transpose (LM-CM)
Figure labels: Local Memory; Register File; threads t0, t1, t2, t3; Original; Transposed
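The naïve transpose through local memory can be simulated sequentially on the host. This is a sketch under assumptions: four threads with four scalar registers each, a plain array standing in for OpenCL local (CUDA shared) memory, and a hypothetical function name. On a GPU the two phases run in parallel and need a barrier between them.

```c
#define NUM_THREADS 4   /* threads cooperating on one 16-pt FFT (assumed) */
#define NUM_REGS    4   /* registers held by each thread (assumed) */

/* reg[t][k] is register k of thread t. After the call,
 * reg[t][k] holds the value that was in reg[k][t]. */
static void transpose_via_local_memory(float reg[NUM_THREADS][NUM_REGS])
{
    float lm[NUM_THREADS * NUM_REGS];   /* stands in for local/shared memory */

    /* Phase 1: every thread stores its registers row-wise. */
    for (int t = 0; t < NUM_THREADS; ++t)
        for (int k = 0; k < NUM_REGS; ++k)
            lm[NUM_REGS * t + k] = reg[t][k];

    /* A barrier would go here on a GPU so that all stores are visible. */

    /* Phase 2: every thread reads back column-wise. */
    for (int t = 0; t < NUM_THREADS; ++t)
        for (int k = 0; k < NUM_REGS; ++k)
            reg[t][k] = lm[NUM_THREADS * k + t];
}
```

Every element makes a round trip through local memory, and the buffer itself occupies on-chip storage, which is what a register-to-register exchange would avoid.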

59 Shuffle Mechanics
FFT Transpose can be implemented using shuffle
– Stage 1: Horizontal
– Stage 2: Vertical
– Stage 3: Horizontal
– Bottleneck: Intra-thread data movement (Stage 2: Vertical)
Figure labels: Register File; threads t0, t1, t2, t3

Code 1: (NAIVE)
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

63 Shuffle Mechanics
General strategies
– Shuffle instructions are cheap.
– CUDA local memory is slow.
  – Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Code 1 (NAIVE)
    // The index depends on tid, so it is not known at compile time.
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV): Divergence
    // Constant indices, but each tid takes a different branch.
    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP)
    // Constant indices and no branches: every thread evaluates every select.
    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
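The three variants are meant to perform the same per-thread rotation of four registers. A host-side sketch can check that, assuming scalar `float` registers and passing `tid` as an ordinary argument; the function names are hypothetical and nothing here models GPU register allocation or divergence.

```c
/* Code 1 (NAIVE): run-time index, out of place. */
static void rotate_naive(const float src[4], float dst[4], int tid)
{
    for (int k = 0; k < 4; ++k)
        dst[k] = src[(4 - tid + k) % 4];
}

/* Code 2 (DIV): constant indices, in place, one branch per tid. */
static void rotate_div(float r[4], int tid)
{
    float tmp = r[0];
    if (tid == 1) {
        r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp;
    } else if (tid == 2) {
        r[0] = r[2]; r[2] = tmp;
        tmp = r[1]; r[1] = r[3]; r[3] = tmp;
    } else if (tid == 3) {
        r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp;
    }
}

/* Code 3 (SELP OOP): constant indices, out of place, selects instead of branches.
 * dst must be initialized by the caller because each select may keep the old value. */
static void rotate_selp(const float src[4], float dst[4], int tid)
{
    dst[0] = (tid == 0) ? src[0] : dst[0];
    dst[1] = (tid == 0) ? src[1] : dst[1];
    dst[2] = (tid == 0) ? src[2] : dst[2];
    dst[3] = (tid == 0) ? src[3] : dst[3];

    dst[0] = (tid == 1) ? src[3] : dst[0];
    dst[3] = (tid == 1) ? src[2] : dst[3];
    dst[2] = (tid == 1) ? src[1] : dst[2];
    dst[1] = (tid == 1) ? src[0] : dst[1];

    dst[0] = (tid == 2) ? src[2] : dst[0];
    dst[2] = (tid == 2) ? src[0] : dst[2];
    dst[1] = (tid == 2) ? src[3] : dst[1];
    dst[3] = (tid == 2) ? src[1] : dst[3];

    dst[0] = (tid == 3) ? src[1] : dst[0];
    dst[1] = (tid == 3) ? src[2] : dst[1];
    dst[2] = (tid == 3) ? src[3] : dst[2];
    dst[3] = (tid == 3) ? src[0] : dst[3];
}
```

For tid = 1 all three turn {10, 11, 12, 13} into {13, 10, 11, 12}. The differences the slide cares about lie in how a GPU compiler treats the indexing and branching, which a host run cannot show.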

64 Results (Experimental Testbed)

Evaluation
Algorithm        1D FFT (batched), N = 16-, 64-, and 256-pts
FFTW Version     v3.3.2 (4 threads, OpenMP with AVX extensions)
FFTW Hardware    Intel i5-2400 (4 cores @ 3.1 GHz)

GPU Testbed
Device                Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)   Max TDP (Watts)
AMD Radeon HD 6970    1536    2703                        176                     250
AMD Radeon HD 7970    2048    3788                        264                     250
NVIDIA Tesla C2075    448     1288                        144                     225
NVIDIA Tesla K20c     2496    4106                        208                     225

65 Results (optimizations in isolation)
– Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
  – Critical in AMD GPUs, not so much for NVIDIA GPUs
– Use VASM2/VASM4 (do not consider VAVM types)
RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.

66 Results (optimizations in concert)
– Device data transfer (black) subsumes execution time
– One set of opts. for all GPUs
– in RP + LM-CM + VASM2 + CM/GAP

74 Shuffle Results
Fraction enhanced (0 < f < 1) = 16.5%
Max. speedup enhanced (s > 1) = 1.19x
– Speedup enhanced (s > 1) = 1.08x
... but, wait! There’s more!

Name          Reg.   Shm.   SLOC   LMEM
Shared        50     4K     11     0
SELP (OOP)    74     4K     382    0
SELP (IP)     49     4K     418    0
DIV           72     4K     462    0
Naive         52     4K     100    128
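The figures on this slide are consistent with Amdahl's law, assuming f is the fraction of run time spent in the enhanced part (presumably the transpose) and s is the local speedup of that part. Under the further assumption that 1.08x is the measured overall speedup, the relation can be inverted to estimate s. This reading is an interpretation, not something the slide states.

```latex
\begin{align*}
S_{\text{overall}} &= \frac{1}{(1 - f) + f/s} \\
S_{\max} &= \lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - f}
          = \frac{1}{1 - 0.165} \approx 1.198 \\
1.08 &= \frac{1}{0.835 + 0.165/s}
  \;\Longrightarrow\;
  s = \frac{0.165}{1/1.08 - 0.835} \approx 1.81
\end{align*}
```

The bound of about 1.198 matches the 1.19x on the slide, and on this reading an overall 1.08x would correspond to the enhanced part itself running about 1.8 times faster.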

79 Shuffle Results: At 50% occupancy
Fraction enhanced (0 < f < 1) = 16.5%
Max. speedup enhanced [1] (s > 1) = 1.19x
– Speedup enhanced [2] (s > 1) = 1.08x -> 1.17x
[1] Calculation at 37.5% occupancy
[2] Calculation at 50% occupancy

Name          Reg.   Shm.   SLOC   OCC
Shared        50     4K     11     37.5%
SELP (OOP)    74     4K     382    37.5%
SELP (IP)     49     4K     418    37.5%
SELP (IP)     49     0      418    50%

Higher performance at higher occupancy
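The step from 37.5% to 50% occupancy when shared memory drops from 4K to 0 can be reproduced with a simplified occupancy model. Every number fed into it below is an assumption rather than a value from the talk: 64 threads per block, and per-SM limits of 64 resident warps, 16 resident blocks, 48 KB of shared memory, and 65536 registers. The model also ignores register and shared-memory allocation granularity, and the type and function names are hypothetical.

```c
/* Per-SM limits; the values used with this struct are assumptions, not from the slides. */
typedef struct {
    int max_warps;      /* resident warps per SM */
    int max_blocks;     /* resident blocks per SM */
    int shared_bytes;   /* shared memory per SM */
    int registers;      /* 32-bit registers per SM */
    int warp_size;      /* threads per warp */
} sm_limits;

static int min_int(int a, int b) { return a < b ? a : b; }

/* Resident warps per SM: the tightest of the block, warp, register and
 * shared-memory limits decides how many blocks fit. */
static int resident_warps(sm_limits sm, int threads_per_block,
                          int regs_per_thread, int shared_per_block)
{
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int blocks = sm.max_blocks;
    blocks = min_int(blocks, sm.max_warps / warps_per_block);
    blocks = min_int(blocks, sm.registers / (regs_per_thread * threads_per_block));
    if (shared_per_block > 0)
        blocks = min_int(blocks, sm.shared_bytes / shared_per_block);
    return blocks * warps_per_block;
}
```

Under these assumptions a 4 KB shared-memory footprint allows 12 blocks (24 of 64 warps, 37.5%), while a footprint of 0 lets the 16-block limit bind instead (32 of 64 warps, 50%), which agrees with the OCC column.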

80 Results (1D FFT 256-pts)
Speedups
– as high as 9.1 and 5.8 over FFTW
– as high as 18.2 and 2.9 over unoptimized GPU

81 Summary
Approach
– Focus on identifying optimizations for hardware
Takeaways
– FFTs are memory-bound (focus should be on memory opts.)
– Homogeneous set of optimizations for all GPUs

82 Thank You!
Contributions:
– Optimization principles for FFT on GPUs
– An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
Contact:
– Carlo del Mundo

