synergy.cs.vt.edu The Hardware-Software Co-Design Process for the fast Fourier transform (FFT) Carlo C. del Mundo Advisor: Prof. Wu-chun Feng
The Multi- and Many-core Menace
“...when we start talking about parallelism and ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced.... I would be panicked if I were in industry.”
John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach
Berkeley’s View
Dwarfs of Symbolic Computation ¹,²
Dwarf (noun): An algorithmic method that captures a pattern of computation and communication.
– Structured Grid
– Unstructured Grid
– Spectral Methods
– Dense Linear Algebra
– Sparse Linear Algebra
– N-Body Methods
– MapReduce
– Graphical Models
– Combinational Logic
– Graph Traversal
– Dynamic Programming
– Branch-and-Bound
– Finite State Machines
¹ Colella, Phillip. Defining software requirements for scientific computing.
² Asanovic et al. A View of the Parallel Computing Landscape. CACM.
Dwarfs of Symbolic Computation: Spectral Methods
– Fast Fourier Transform
  – Software
  – Hardware
OpenFFT: A heterogeneous FFT library
– C. del Mundo and W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” in IEEE IPDPS ‘13. Phoenix, AZ, USA, May (Under review.)
– C. del Mundo and W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture”, SC|13, Denver, CO, Nov (Poster publication)
– C. del Mundo et al. “Accelerating FFT for Wideband Channelization.” in IEEE ICC ‘13. Budapest, Hungary, June 2013.
Computational Pattern: The Butterfly
Dwarf: Spectral Methods
– Computational pattern: Butterfly Pattern
Figure 1: Simple Butterfly. Inputs a and b; outputs a + b and a - b.
Figure 2: Butterfly with Twiddle, w_k. Inputs a and b; outputs a + b * w_k and a - b * w_k.
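The two butterfly figures can be written out as a few lines of host-side Python. This is only an illustrative model, not the deck's kernel code: the function names `butterfly` and `twiddle` are hypothetical, and the twiddle is assumed to follow the usual forward-DFT convention w_k = exp(-2πi·k/N).

```python
import cmath

def twiddle(k, n):
    """Assumed forward-DFT twiddle factor w_k = exp(-2*pi*i*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)

def butterfly(a, b, w=1 + 0j):
    """Butterfly with twiddle w: returns (a + b*w, a - b*w).

    With the default w = 1 this is the simple butterfly of Figure 1.
    """
    t = b * w
    return a + t, a - t

# Simple butterfly (Figure 1) on two real inputs.
print(butterfly(1, 2))
# Butterfly with a twiddle (Figure 2), here w_1 for a 4-point transform.
print(butterfly(1 + 0j, 1j, twiddle(1, 4)))
```

A 2-point DFT is exactly one simple butterfly, which is why larger FFTs can be built by chaining this pattern.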
16-pt FFT: Stages of Computation
– Input Array
– S1: “Columns”
– S2: “Twiddles”
– S3: “Transpose”
– S1: “Columns”
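The stage sequence on this slide (columns, twiddles, transpose, columns) matches the classic decomposition of a 16-point transform into a 4x4 problem. The sketch below is a host-side Python model under that assumption; `dft`, `column_dfts`, and `fft16_four_step` are hypothetical names, and a direct O(n²) DFT stands in for the 4-point kernel.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT; used as the 4-point kernel and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def column_dfts(m):
    """Apply a 4-point DFT down every column of a 4x4 matrix m[row][col]."""
    cols = [dft([m[r][c] for r in range(4)]) for c in range(4)]  # cols[c][k]
    return [[cols[c][k] for c in range(4)] for k in range(4)]    # back to [row][col]

def fft16_four_step(x):
    """16-point FFT as columns -> twiddles -> transpose -> columns."""
    assert len(x) == 16
    a = [[x[4 * r + c] for c in range(4)] for r in range(4)]     # 4x4 row-major view
    y = column_dfts(a)                                           # S1: "Columns"
    z = [[y[k1][c] * cmath.exp(-2j * cmath.pi * k1 * c / 16)     # S2: "Twiddles"
          for c in range(4)] for k1 in range(4)]
    t = [[z[k1][c] for k1 in range(4)] for c in range(4)]        # S3: "Transpose"
    r = column_dfts(t)                                           # S1 again: "Columns"
    return [r[k2][k1] for k2 in range(4) for k1 in range(4)]     # natural order

x = [complex(i, -i) for i in range(16)]
err = max(abs(p - q) for p, q in zip(fft16_four_step(x), dft(x)))
print(err < 1e-9)
```

The transpose exists so that the same column kernel can be reused for the final stage; on a GPU that transpose is the step that forces data to move between threads.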
16-pt FFT: Computation-to-Core: 1 thread
16-pt FFT: Computation-to-Core: 4 threads
16-pt FFT: Computation-to-Core: One Warp
16-pt FFT: Computation-to-Core: One Block
16-pt FFT: Computation-to-Core: One SM (37.5% Occupancy on NVIDIA Kepler K20c)
GPU Memory Spaces
Background (GPUs)
GPU Memory Hierarchy
– Global Memory
– Image Memory
– Constant Memory
– Local Memory
– Registers
Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit     Read Bandwidth (TB/s)
Registers       16.2
Constant        5.4
Local           2.7
L1/L2 Cache     1.35 / 0.45
Global          0.17
Global Data Bandwidth (Bus Traffic)
Bus Traffic and Performance
Global Data Bandwidth
– Bus traffic: bytes transferred from off-chip to on-chip memory and back.
– Suppose we take the FFT of a 128 MB data set
  – Minimum bus traffic is 2 x 128 MB = 256 MB
    1. Load 128 MB (global -> on-chip)
    2. Perform the FFT
    3. Store 128 MB (on-chip -> global)
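The arithmetic on this slide can be turned into a small lower-bound estimate. The helper names below are hypothetical, MB and GB are treated as decimal units, and the 170 GB/s figure is the 0.17 TB/s global read bandwidth quoted for the Radeon HD 6970 earlier in the deck, reused here for both the load and the store as a simplifying assumption.

```python
def min_bus_traffic_mb(data_mb):
    """One load plus one store of the whole data set."""
    return 2 * data_mb

def min_transfer_time_ms(data_mb, bandwidth_gb_per_s):
    """Lower bound on time spent on the bus, ignoring all computation."""
    traffic_bytes = min_bus_traffic_mb(data_mb) * 1e6        # MB -> bytes
    seconds = traffic_bytes / (bandwidth_gb_per_s * 1e9)     # GB/s -> bytes/s
    return seconds * 1e3                                     # s -> ms

print(min_bus_traffic_mb(128))            # 256
print(min_transfer_time_ms(128, 170.0))   # roughly 1.5 ms
```

Any implementation that spills intermediate results back to global memory moves more than this minimum, which is the motivation for keeping the working set on chip.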
16-pt FFT: Computation-to-Core: One Warp
What’s wrong with this access pattern?
– Scattered memory accesses (power-of-2 strides)
– Uncoalesced memory access
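A rough way to see why power-of-2 strides hurt is to count how many memory segments one warp touches. The model below is a simplification with assumed parameters, not a description of any specific GPU: a 32-thread warp, 8-byte `float2` elements, 128-byte aligned segments, and one transaction per distinct segment. The name `segments_touched` is hypothetical.

```python
def segments_touched(num_threads, stride_elems, elem_bytes=8, segment_bytes=128):
    """Count distinct aligned segments hit when thread t reads element t * stride."""
    addresses = [t * stride_elems * elem_bytes for t in range(num_threads)]
    return len({addr // segment_bytes for addr in addresses})

for stride in (1, 4, 16):
    print(stride, segments_touched(32, stride))
```

Under these assumptions a unit-stride warp needs 2 segments, while a stride of 16 elements needs 32, one per thread, so most of each fetched segment is wasted.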
System-level optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
S1: Register Preloading
Load to registers first.

Without Register Preloading:

    __kernel void FFT16_vanilla(__global float2 *buffer)
    {
        int index = …;
        buffer += index;
        // The 4-point FFT reads its operands straight from global memory.
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        // remainder of the kernel is not shown on the slide

With Register Preloading:

    __kernel void FFT16_strawberry1(__global float2 *buffer)
    {
        int index = …;
        buffer += index;
        float2 registers[4];  // explicit loads
        for (int i = 0; i < 4; ++i)
            registers[i] = buffer[4*i];
        // The 4-point FFT now operates on the preloaded registers.
        FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);
        // remainder of the kernel is not shown on the slide
S2: Vector Types
Vector Access (float{2, 4, 8, 16})
– Scalar Math (VASM): float + float
– Vector Math (VAVM): floatN + floatN
[Figure: a vector with elements a[0], a[1], a[2], a[3]]
S3: Constant Memory
Fast cached lookup for frequently used data.

    __constant float2 twiddles[16] = {
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        /* ... more sin/cos values */
    };

Without Constant Memory:

    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With Constant Memory:

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];
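The two loops on this slide compute the same factor in different ways: one evaluates cos/sin per element, the other indexes a precomputed table at `4*j + tid`. The Python sketch below models that equivalence on the host. It assumes the table is laid out so that entry `4*j + tid` holds exp(-2πi·tid·j/16), which is consistent with the first four entries on the slide being (1.0, 0.0); the names `twiddle_on_the_fly`, `TWIDDLES`, and `twiddle_lookup` are hypothetical.

```python
import cmath

N = 16

def twiddle_on_the_fly(tid, j):
    """Recompute the factor every time, as in the 'without constant memory' loop."""
    theta = -2.0 * cmath.pi * tid * j / N
    return complex(cmath.cos(theta).real, cmath.sin(theta).real)

# Precomputed once; stands in for the __constant table, indexed by 4*j + tid.
TWIDDLES = [twiddle_on_the_fly(tid, j) for j in range(4) for tid in range(4)]

def twiddle_lookup(tid, j):
    """Table lookup, as in the 'with constant memory' loop."""
    return TWIDDLES[4 * j + tid]

print(all(twiddle_lookup(t, j) == twiddle_on_the_fly(t, j)
          for t in range(4) for j in range(4)))
```

The table trades 16 stored values for the removal of every transcendental call from the inner loop.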
Algorithm-level optimizations (applicable only to FFT)
1. Transpose via LM
2. Compute/Transpose via LM
3. Compute/No Transpose via LM
4. Register-to-register transpose (shuffle)
A1: Transpose via local¹ memory
– Via Shared Memory
– Register to Register (shfl)
¹ CUDA shared memory == OpenCL local memory
Algorithm-level optimizations
1. Naïve Transpose (LM-CM)
[Figure: threads t0 to t3 move their register-file contents through local memory; panels: Original, Transposed]
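The naïve transpose can be modelled on the host as two phases separated by a barrier: every thread writes its registers into a shared tile, then reads back the column it needs. This is a sketch of that idea in Python, not the deck's kernel; `transpose_via_local_memory` is a hypothetical name and `regs[t][k]` stands for register k of thread t.

```python
def transpose_via_local_memory(regs):
    """Return out[t][k] == regs[k][t] by staging data in a flat 'local memory' tile."""
    n = len(regs)
    local_mem = [None] * (n * n)
    # Phase 1: thread t writes its n registers as row t of the tile.
    for t in range(n):
        for k in range(n):
            local_mem[t * n + k] = regs[t][k]
    # A work-group barrier would go here on a GPU.
    # Phase 2: thread t reads column t of the tile back into its registers.
    return [[local_mem[k * n + t] for k in range(n)] for t in range(n)]

regs = [[10 * t + k for k in range(4)] for t in range(4)]
print(transpose_via_local_memory(regs))
```

Every element makes a round trip through local memory, which is the traffic that the register-to-register variant tries to avoid.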
Shuffle Mechanics
FFT Transpose can be implemented using shuffle
– Stage 1: Horizontal
– Stage 2: Vertical
– Stage 3: Horizontal
– Bottleneck: Intra-thread data movement (Stage 2: Vertical)
[Figure: register file of threads t0 to t3]

Code 1 (NAIVE):

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];
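One way to read the three stages is shuffle, rotate, shuffle: the two "horizontal" stages exchange a register across threads, and the "vertical" stage is the intra-thread rotation of Code 1. The Python model below follows that reading. The lane indices used in stages 1 and 3 are an assumption chosen so that the result is a transpose; the slides do not show the authors' exact indices. `shfl` is a hypothetical stand-in for a warp shuffle, and `regs[t][k]` is register k of thread t.

```python
def shfl(values, src_lanes):
    """Model of a warp shuffle: lane t receives the value held by lane src_lanes[t]."""
    return [values[src] for src in src_lanes]

def transpose_via_shuffle(regs):
    """Return out[t][k] == regs[k][t] using two shuffle stages and one rotation."""
    n = len(regs)
    lanes = range(n)

    # Stage 1 (inter-thread): for register k, lane t reads from lane (t + k) % n.
    a = [[None] * n for _ in lanes]
    for k in range(n):
        moved = shfl([regs[t][k] for t in lanes], [(t + k) % n for t in lanes])
        for t in lanes:
            a[t][k] = moved[t]

    # Stage 2 (intra-thread): the rotation from Code 1, with 4 generalised to n.
    b = [[a[t][(n - t + k) % n] for k in range(n)] for t in lanes]

    # Stage 3 (inter-thread): for register k, lane t reads from lane (k - t) % n.
    out = [[None] * n for _ in lanes]
    for k in range(n):
        moved = shfl([b[t][k] for t in lanes], [(k - t) % n for t in lanes])
        for t in lanes:
            out[t][k] = moved[t]
    return out

regs = [[10 * t + k for k in range(4)] for t in range(4)]
print(transpose_via_shuffle(regs))
```

In this model no data passes through shared memory; the only per-thread cost is the stage 2 rotation, whose index depends on `tid`.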
Shuffle Mechanics
General strategies
– Shuffle instructions are cheap.
– CUDA local memory is slow.
  – Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Code 1 (NAIVE):

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV), Divergence:

    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP):

    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
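The three listings are meant to produce the same per-thread rotation; they differ only in how the register index is expressed. The Python check below mirrors Code 1 and Code 3 on the host to confirm that the select form, which only ever uses literal indices, matches the computed-index form for every `tid`. The function names are hypothetical, and Python of course has no notion of register allocation, so this verifies the data movement only.

```python
def rotate_naive(src, tid):
    """Code 1: the source index depends on tid at run time."""
    return [src[(4 - tid + k) % 4] for k in range(4)]

def rotate_select(src, tid):
    """Code 3 style: every element is named statically; tid only selects a value."""
    s0, s1, s2, s3 = src
    d0 = s0 if tid == 0 else s3 if tid == 1 else s2 if tid == 2 else s1
    d1 = s1 if tid == 0 else s0 if tid == 1 else s3 if tid == 2 else s2
    d2 = s2 if tid == 0 else s1 if tid == 1 else s0 if tid == 2 else s3
    d3 = s3 if tid == 0 else s2 if tid == 1 else s1 if tid == 2 else s0
    return [d0, d1, d2, d3]

src = ['r0', 'r1', 'r2', 'r3']
print(all(rotate_naive(src, tid) == rotate_select(src, tid) for tid in range(4)))
```

Because each operand in the select form is known at compile time, a compiler can keep all four values in registers, which is the point made under "General strategies".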
Results (Experimental Testbed)
Evaluation
– Algorithm: 1D FFT (batched), N = 16-, 64-, and 256-pts
– FFTW Version: v3.3.2 (4 threads, OpenMP with AVX extensions)
– FFTW Hardware: Intel i (4 3.1 GHz)
GPU Testbed
– Columns: Device, Cores, Peak Performance (GFLOPS), Peak Bandwidth (GB/s), Max TDP (Watts)
– Devices: AMD Radeon HD, AMD Radeon HD, NVIDIA Tesla C, NVIDIA Tesla K20c
Results (optimizations in isolation)
– Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
  – Critical in AMD GPUs, not so much for NVIDIA GPUs
– Use VASM2/VASM4 (do not consider VAVM types)
Legend: RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
Results (optimizations in concert)
– Device data transfer (black) subsumes execution time
– One set of opts. for all GPUs: RP + LM-CM + VASM2 + CM/CGAP
Shuffle Results
– Fraction enhanced (0 < f < 1) = 16.5%
– Max. speedup enhanced (s > 1) = 1.19x
  – Speedup enhanced (s > 1) = 1.08x
... but, wait! There’s more!

Name         Reg.   Shm.   SLOC, LMEM (digits run together in the source)
Shared       50     4K     110
SELP (OOP)   74     4K     3820
SELP (IP)    49     4K     4180
DIV          72     4K     4620
Naive        52     4K
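The "fraction enhanced" and "speedup enhanced" wording matches the usual statement of Amdahl's law, so the bound on this slide can be checked with a short calculation. The sketch assumes that reading: f is the fraction of run time spent in the part being improved (here 16.5%), and s is the speedup of that part alone. The function names are hypothetical.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the run time is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

def amdahl_limit(f):
    """Upper bound on overall speedup as s grows without limit."""
    return 1.0 / (1.0 - f)

f = 0.165
print(round(amdahl_limit(f), 4))   # 1.1976
print(amdahl_speedup(f, 2.0) < amdahl_limit(f))
```

With f = 0.165 the limit is about 1.198, in line with the 1.19x maximum quoted on the slide; any finite speedup of the transpose stage alone lands below that bound.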
Shuffle Results: At 50% occupancy
– Fraction enhanced (0 < f < 1) = 16.5%
– Max. speedup enhanced¹ (s > 1) = 1.19x
  – Speedup enhanced² (s > 1) = 1.08x -> 1.17x
– Higher performance at higher occupancy
¹ Calculation at 37.5% occupancy
² Calculation at 50% occupancy

Name         Reg.   Shm.   SLOC   OCC
Shared       50     4K     11     37.5%
SELP (OOP)   74     4K
SELP (IP)    49     4K
SELP (IP)
Results (1D FFT 256-pts)
Speedups
– as high as 9.1 and 5.8 over FFTW
– as high as 18.2 and 2.9 over unoptimized GPU
Summary
Approach
– Focus on identifying optimizations for hardware
Takeaways
– FFTs are memory-bound (focus should be on memory opts.)
– Homogeneous set of optimizations for all GPUs
Thank You!
Contributions:
– Optimization principles for FFT on GPUs
– An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
Contact:
– Carlo del Mundo