Published by Ralf Ryan. Modified over 8 years ago.
1
synergy.cs.vt.edu
The Hardware-Software Co-Design Process for the Fast Fourier Transform (FFT)
Carlo C. del Mundo
Advisor: Prof. Wu-chun Feng
2
The Multi- and Many-core Menace
“... when we start talking about parallelism and ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced. ... I would be panicked if I were in industry.”
John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach
The Co-Design Process for the FFT
3
Berkeley’s View
4
Dwarfs of Symbolic Computation [1,2]
Dwarf (noun): An algorithmic method that captures a pattern of computation and communication
[1] Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf.
[2] Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.
5
Dwarfs of Symbolic Computation [1,2]
–Structured Grid
–Unstructured Grid
–Spectral Methods
–Dense Linear Algebra
–Sparse Linear Algebra
–N-Body Methods
–MapReduce
–Graphical Models
–Combinational Logic
–Graph Traversal
–Dynamic Programming
–Branch-and-Bound
–Finite State Machines
7
Dwarfs of Symbolic Computation [1,2]
–Spectral Methods: Fast Fourier Transform
–Software / Hardware
9
OpenFFT: A heterogeneous FFT library
–C. del Mundo and W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” in IEEE IPDPS ’14, Phoenix, AZ, USA, May 2014. (Under review.)
–C. del Mundo and W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture,” SC|13, Denver, CO, Nov. 2013. (Poster publication.)
–C. del Mundo et al. “Accelerating FFT for Wideband Channelization,” in IEEE ICC ’13, Budapest, Hungary, June 2013.
10
Berkeley’s View
11
Computational Pattern: The Butterfly
12
Computational Pattern
Dwarf: Spectral Methods
–Butterfly Pattern
Figure 1: Simple Butterfly. Inputs a and b; outputs a + b and a - b.
Figure 2: Butterfly with Twiddle, w^k. Inputs a and b; outputs a + b * w^k and a - b * w^k.
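The two butterfly figures can be written down directly. The following is a minimal sketch in plain C, not code from the talk: the `float2` struct only mimics OpenCL's `float2` as a (real, imaginary) pair, and the name `butterfly` is chosen here for illustration.

```c
#include <assert.h>

typedef struct { float x, y; } float2;   /* (re, im), mirroring OpenCL's float2 */

typedef struct { float2 sum, diff; } butterfly_out;

/* Complex product a * b. */
static float2 cmul(float2 a, float2 b)
{
    return (float2){ a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
}

/* Butterfly with twiddle (Figure 2): (a, b, w) -> (a + b*w, a - b*w).
   Passing w = (1, 0) reduces it to the simple butterfly of Figure 1. */
static butterfly_out butterfly(float2 a, float2 b, float2 w)
{
    float2 bw = cmul(b, w);               /* one multiply shared by both outputs */
    butterfly_out out = {
        { a.x + bw.x, a.y + bw.y },
        { a.x - bw.x, a.y - bw.y },
    };
    return out;
}
```

The product `b * w` is computed once and reused, which is why a butterfly costs one complex multiply and two complex additions.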
13
16-pt FFT: Stages of Computation
Input Array -> S1: “Columns” -> S2: “Twiddles” -> S3: “Transpose” -> S1: “Columns”
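The four stages correspond to the classic decomposition of a 16-point FFT into a 4 x 4 problem. The sketch below is a sequential C rendering under stated assumptions, not the talk's kernel: the input is viewed as a row-major 4 x 4 array, each "column" is one 4-point DFT with stride 4, and the helper names (`w16`, `fft4`, `fft16`, `dft16_reference`) are hypothetical. Twiddles come from three hard-coded cosine constants so that no math library is needed.

```c
#include <assert.h>

typedef struct { float x, y; } float2;   /* (re, im), mirroring OpenCL's float2 */

static float2 cadd(float2 a, float2 b) { return (float2){ a.x + b.x, a.y + b.y }; }
static float2 csub(float2 a, float2 b) { return (float2){ a.x - b.x, a.y - b.y }; }
static float2 cmul(float2 a, float2 b)
{
    return (float2){ a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
}

/* W16^k = exp(-2*pi*i*k/16) for k >= 0. */
static float2 w16(int k)
{
    static const float c[5] = { 1.0f, 0.92387953f, 0.70710678f, 0.38268343f, 0.0f };
    assert(k >= 0);
    int q = (k % 16) / 4, r = k % 4;               /* k mod 16 = 4q + r */
    float2 w = { c[r], -c[4 - r] };                /* cos - i*sin, first quadrant */
    for (; q > 0; --q) w = (float2){ w.y, -w.x };  /* multiply by W16^4 = -i */
    return w;
}

/* In-place 4-point DFT over a[0], a[stride], a[2*stride], a[3*stride]. */
static void fft4(float2 *a, int stride)
{
    float2 s02 = cadd(a[0], a[2 * stride]), d02 = csub(a[0], a[2 * stride]);
    float2 s13 = cadd(a[stride], a[3 * stride]), d13 = csub(a[stride], a[3 * stride]);
    float2 jd13 = { d13.y, -d13.x };               /* -i * d13 */
    a[0]          = cadd(s02, s13);
    a[stride]     = cadd(d02, jd13);
    a[2 * stride] = csub(s02, s13);
    a[3 * stride] = csub(d02, jd13);
}

/* 16-point FFT in place; output is in natural order. */
static void fft16(float2 *x)
{
    for (int c = 0; c < 4; ++c) fft4(x + c, 4);            /* S1: columns   */
    for (int r = 0; r < 4; ++r)                            /* S2: twiddles  */
        for (int c = 0; c < 4; ++c)
            x[4 * r + c] = cmul(x[4 * r + c], w16(r * c));
    for (int r = 0; r < 4; ++r)                            /* S3: transpose */
        for (int c = r + 1; c < 4; ++c) {
            float2 tmp = x[4 * r + c];
            x[4 * r + c] = x[4 * c + r];
            x[4 * c + r] = tmp;
        }
    for (int c = 0; c < 4; ++c) fft4(x + c, 4);            /* S1: columns   */
}

/* O(N^2) reference DFT, used only to check fft16. */
static void dft16_reference(const float2 *x, float2 *out)
{
    for (int k = 0; k < 16; ++k) {
        float2 acc = { 0.0f, 0.0f };
        for (int n = 0; n < 16; ++n) acc = cadd(acc, cmul(x[n], w16(n * k)));
        out[k] = acc;
    }
}
```

On a GPU, each of the four column transforms in a stage would go to a different thread, which is what makes the transpose between the two column passes a communication step.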
14
16-pt FFT: Computation-to-Core: 1 thread
15
16-pt FFT: Computation-to-Core: 4 threads
16
16-pt FFT: Computation-to-Core: One Warp
17
16-pt FFT: Computation-to-Core: One Block
18
16-pt FFT: Computation-to-Core: One SM
37.5% Occupancy on NVIDIA Kepler K20c
20
GPU Memory Spaces
21
Background (GPUs)
GPU Memory Hierarchy
–Global Memory
–Image Memory
–Constant Memory
–Local Memory
–Registers
Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit     Read Bandwidth (TB/s)
Registers       16.2
Constant        5.4
Local           2.7
L1/L2 Cache     1.35 / 0.45
Global          0.17
28
Global Data Bandwidth (Bus Traffic)
29
Bus Traffic and Performance
Global Data Bandwidth
Bus Traffic: Bytes transferred from off-chip to on-chip memory and back.
Suppose we take the FFT of a 128 MB data set
–Minimum bus traffic is 2 x 128 = 256 MB
 Load 128 MB (global -> on-chip)
 Perform FFT
 Store 128 MB (on-chip -> global)
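The 2 x 128 MB bound also gives a floor on run time once a bandwidth is chosen. The sketch below is illustrative arithmetic only: the function names are made up here, MB is taken as 10^6 bytes, and the 0.17 TB/s figure is the global-memory read bandwidth quoted earlier for the Radeon HD 6970 (applied to stores as well, which is an assumption).

```c
#include <assert.h>

/* Every input byte is loaded once and every output byte is stored once. */
static double min_bus_traffic_bytes(double data_bytes)
{
    return 2.0 * data_bytes;
}

/* Lower bound on run time (seconds) if global-memory bandwidth were the only limit. */
static double min_transfer_seconds(double data_bytes, double bandwidth_bytes_per_s)
{
    assert(data_bytes >= 0.0 && bandwidth_bytes_per_s > 0.0);
    return min_bus_traffic_bytes(data_bytes) / bandwidth_bytes_per_s;
}
```

For example, `min_transfer_seconds(128e6, 0.17e12)` comes to roughly 1.5 ms. Any extra pass over global memory, such as spilling intermediate results between stages, adds another multiple of the data size to that bound.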
30
16-pt FFT: Computation-to-Core: One Warp
What’s wrong with this access pattern?
–Scattered Memory Accesses (power-of-2 strides)
–Uncoalesced Memory Access
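A rough way to see the cost of power-of-2 strides is to count how many aligned memory segments one warp touches. The sketch below is a simplified model, not a description of any specific GPU's memory controller: the 128-byte segment size, the 32-thread warp, and the function name `segments_touched` are assumptions made for illustration.

```c
#include <assert.h>

/* Number of distinct aligned segments touched when thread `tid` reads the
   element at index tid * stride. Addresses grow with tid, so counting the
   changes of segment index is enough. */
static int segments_touched(int threads, int stride, int elem_bytes, int segment_bytes)
{
    assert(threads > 0 && stride > 0 && elem_bytes > 0 && segment_bytes > 0);
    int count = 0;
    long last = -1;
    for (int tid = 0; tid < threads; ++tid) {
        long seg = ((long)tid * stride * elem_bytes) / segment_bytes;
        if (seg != last) {
            ++count;
            last = seg;
        }
    }
    return count;
}
```

With 8-byte `float2` elements, a 32-thread warp reading at stride 1 touches 2 segments, at stride 4 it touches 8, and at stride 16 every thread lands in its own segment (32 in total), so most of each fetched segment is wasted.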
33
System-level optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
35
S1: Register Preloading
Load to registers first
Without Register Preloading
    __kernel void FFT16_vanilla(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // Operands are read straight from global memory.
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
With Register Preloading
    __kernel void FFT16_strawberry1(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        float2 registers[4]; // Explicit Loads
        for (int i = 0; i < 4; ++i)
            registers[i] = buffer[4*i];
        FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);
36
S2: Vector Types
Vector Access (float{2, 4, 8, 16})
–Scalar Math (VASM): float + float
–Vector Math (VAVM): floatN + floatN
Figure: one vector access covering a[0], a[1], a[2], a[3]
39
S3: Constant Memory
Fast cached lookup for frequently used data
    __constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), ... more sin/cos values };
Without Constant Memory
    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid; // complex multiply intended
    }
With Constant Memory
    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid]; // complex multiply intended
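The slide elides most of the table, so the following is only one table consistent with the unoptimized loop, where entry `4*j + tid` holds (cos(theta), sin(theta)) for theta = -2*pi*tid*j/16. It is a plain C sketch: the `float2` struct, the macro names, and the accessor `twiddle` are illustrative, and the values are rounded to eight digits.

```c
#include <assert.h>

typedef struct { float x, y; } float2;   /* (re, im) */

#define C1 0.92387953f   /* cos(pi/8)                */
#define C2 0.70710678f   /* cos(pi/4)                */
#define C3 0.38268343f   /* cos(3*pi/8) = sin(pi/8)  */

/* twiddles[4*j + tid] = (cos(theta), sin(theta)), theta = -2*pi*tid*j/16 */
static const float2 twiddles[16] = {
    /* j = 0 */ { 1.0f, 0.0f }, { 1.0f, 0.0f }, { 1.0f,  0.0f }, { 1.0f, 0.0f },
    /* j = 1 */ { 1.0f, 0.0f }, {   C1,  -C3 }, {   C2,   -C2 }, {   C3,  -C1 },
    /* j = 2 */ { 1.0f, 0.0f }, {   C2,  -C2 }, { 0.0f, -1.0f }, {  -C2,  -C2 },
    /* j = 3 */ { 1.0f, 0.0f }, {   C3,  -C1 }, {  -C2,   -C2 }, {  -C1,   C3 },
};

/* Table lookup replacing the per-iteration cos/sin evaluation. */
static float2 twiddle(int j, int tid)
{
    assert(j >= 0 && j < 4 && tid >= 0 && tid < 4);
    return twiddles[4 * j + tid];
}
```

The first row matches the four `(1.0f, 0.0f)` entries shown on the slide, and the table is symmetric in `j` and `tid` because the exponent is their product.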
42
Algorithm-level optimizations (applicable only to FFT)
1. Transpose via LM
2. Compute/Transpose via LM
3. Compute/No Transpose via LM
4. Register-to-register transpose (shuffle)
44
A1: Transpose via local memory*
–Via Shared Memory
–Register to Register (shfl)
* CUDA shared memory == OpenCL local memory
47
Algorithm-level optimizations
Figure: Original and Transposed layouts
48
Algorithm-level optimizations
1. Naïve Transpose (LM-CM)
Figure: threads t0, t1, t2, t3 move data between the Register File and Local Memory (Original -> Transposed)
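The naive transpose can be emulated on the host to make the data movement explicit. This is a sequential sketch under assumptions, not the kernel from the talk: `regs[tid][k]` stands for register `k` of thread `tid`, the array `lm` stands in for OpenCL local (CUDA shared) memory, and the two loops correspond to the code before and after a work-group barrier.

```c
#include <assert.h>

#define LANES 4   /* four threads, four registers each */

/* Transpose a 4x4 tile held in registers by staging it through local memory. */
static void transpose_via_local_memory(float regs[LANES][LANES])
{
    float lm[LANES * LANES];   /* stand-in for __local / __shared__ storage */

    /* Phase 1: each thread stores its registers as one row. */
    for (int tid = 0; tid < LANES; ++tid)
        for (int k = 0; k < LANES; ++k)
            lm[LANES * tid + k] = regs[tid][k];

    /* A barrier is required here on a real GPU. */

    /* Phase 2: each thread loads one column back into its registers. */
    for (int tid = 0; tid < LANES; ++tid)
        for (int k = 0; k < LANES; ++k)
            regs[tid][k] = lm[LANES * k + tid];
}
```

Every element makes a round trip through local memory, which is the traffic that the register-to-register (shuffle) variant tries to avoid.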
52
Shuffle Mechanics
FFT Transpose can be implemented using shuffle
–Stage 1: Horizontal
–Stage 2: Vertical
–Stage 3: Horizontal
–Bottleneck: Intra-thread data movement (Stage 2: Vertical)
Figure: Register File for threads t0, t1, t2, t3
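One way the three stages can compose into a transpose is sketched below as a host-side emulation. Several things here are assumptions rather than facts from the slides: that "horizontal" means an inter-thread shuffle and "vertical" means a rotation inside one thread's registers, and the specific source lanes used in stages 1 and 3. Only the stage 2 rotation, `(k - tid) mod 4`, follows the indexing shown in the talk's naive code. Reading `a[src][k]` plays the role of a warp shuffle of register `k` from lane `src`.

```c
#include <assert.h>

#define LANES 4

/* Non-negative remainder modulo LANES. */
static int wrap(int v) { return ((v % LANES) + LANES) % LANES; }

/* Transpose a 4x4 register tile: regs[t][k] is register k of thread t. */
static void transpose_via_shuffle(float regs[LANES][LANES])
{
    float a[LANES][LANES], b[LANES][LANES];

    /* Stage 1 (inter-thread): lane t takes register k from lane (t + k) mod 4. */
    for (int t = 0; t < LANES; ++t)
        for (int k = 0; k < LANES; ++k)
            a[t][k] = regs[wrap(t + k)][k];

    /* Stage 2 (intra-thread): rotate each lane's registers by its own id. */
    for (int t = 0; t < LANES; ++t)
        for (int k = 0; k < LANES; ++k)
            b[t][k] = a[t][wrap(k - t)];

    /* Stage 3 (inter-thread): lane t takes register k from lane (k - t) mod 4. */
    for (int t = 0; t < LANES; ++t)
        for (int k = 0; k < LANES; ++k)
            regs[t][k] = b[wrap(k - t)][k];
}
```

Stages 1 and 3 use a register index that is the same for every lane, so they map onto shuffles. Stage 2 indexes registers by a value that depends on the thread id, which is where the compile-time indexing problem arises.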
59
Shuffle Mechanics
General strategies
–Shuffle instructions are cheap.
–CUDA local memory is slow.
 –Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.
Code 1 (NAIVE)
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];
Code 2 (DIV): Divergence
    int tmp = src_registers[0]; // tmp should match the element type of src_registers
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }
Code 3 (SELP OOP)
    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
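The naive and select-based variants are meant to compute the same rotation, which can be checked on the host. The sketch below is plain C with illustrative names (`rotate_naive`, `rotate_selp`, `SELP`); it folds the slide's sixteen select statements into four nested selects, and it cannot reproduce the register-allocation effect itself, only the equivalence of results.

```c
#include <assert.h>

/* Mirrors a select instruction: both operands are ordinary values. */
#define SELP(cond, a, b) ((cond) ? (a) : (b))

/* Naive form: the source index depends on tid, so it is not a compile-time constant. */
static void rotate_naive(int tid, const float src[4], float dst[4])
{
    assert(tid >= 0 && tid < 4);
    for (int k = 0; k < 4; ++k)
        dst[k] = src[(4 - tid + k) % 4];
}

/* Select form: every array index is a literal; tid only appears in comparisons. */
static void rotate_selp(int tid, const float src[4], float dst[4])
{
    assert(tid >= 0 && tid < 4);
    dst[0] = SELP(tid == 0, src[0], SELP(tid == 1, src[3], SELP(tid == 2, src[2], src[1])));
    dst[1] = SELP(tid == 0, src[1], SELP(tid == 1, src[0], SELP(tid == 2, src[3], src[2])));
    dst[2] = SELP(tid == 0, src[2], SELP(tid == 1, src[1], SELP(tid == 2, src[0], src[3])));
    dst[3] = SELP(tid == 0, src[3], SELP(tid == 1, src[2], SELP(tid == 2, src[1], src[0])));
}
```

Because every index in `rotate_selp` is a constant, a GPU compiler can keep both arrays in registers, and because there is no branch, the threads of a warp do not diverge. The price is more instructions per thread.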
64
Results (Experimental Testbed)
Evaluation
Algorithm        1D FFT (batched), N = 16-, 64-, and 256-pts
FFTW Version     v3.3.2 (4 threads, OpenMP with AVX extensions)
FFTW Hardware    Intel i5-2400 (4 cores @ 3.1 GHz)
GPU Testbed
Device                 Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)   Max TDP (Watts)
AMD Radeon HD 6970     1536    2703                        176                     250
AMD Radeon HD 7970     2048    3788                        264                     250
NVIDIA Tesla C2075     448     1288                        144                     225
NVIDIA Tesla K20c      2496    4106                        208                     225
65
Results (optimizations in isolation)
–Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
 –Critical in AMD GPUs, not so much for NVIDIA GPUs
–Use VASM2/VASM4 (do not consider VAVM types)
Legend: RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
66
Results (optimizations in concert)
–Device data transfer (black) subsumes execution time
–One set of opts. for all GPUs: RP + LM-CM + VASM2 + CM/GAP
67
Shuffle Results
Fraction enhanced (0 < f < 1) = 16.5%
Max. speedup enhanced (s > 1) = 1.19x
–Speedup enhanced (s > 1) = 1.08x
... but, wait! There’s more!
Name          Reg.   Shm.   SLOC   LMEM
Shared        50     4K     11     0
SELP (OOP)    74     4K     382    0
SELP (IP)     49     4K     418    0
DIV           72     4K     462    0
Naive         52     4K     100    128
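The "fraction enhanced" and "max. speedup" figures read like an application of Amdahl's law, so the sketch below assumes that is the model in use. The function names are illustrative, and the "implied" speedup of the enhanced portion is a derived quantity, not a number reported in the talk.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction f of run time is made s times faster. */
static double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on the overall speedup as s grows without limit. */
static double amdahl_limit(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}

/* Speedup of the enhanced fraction implied by a measured overall speedup. */
static double implied_local_speedup(double f, double overall)
{
    double remaining = 1.0 / overall - (1.0 - f);   /* time left for the enhanced part */
    assert(remaining > 0.0);
    return f / remaining;
}
```

With f = 0.165, `amdahl_limit` gives about 1.198, in line with the 1.19x bound on the slide. Under the same model, an overall speedup of 1.08x would correspond to the transpose portion itself running roughly 1.8 times faster.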
75
Shuffle Results: At 50% occupancy
Fraction enhanced (0 < f < 1) = 16.5%
Max. speedup enhanced* (s > 1) = 1.19x
–Speedup enhanced** (s > 1) = 1.08x -> 1.17x
* Calculation at 37.5% occupancy
** Calculation at 50% occupancy
Name          Reg.   Shm.   SLOC   OCC
Shared        50     4K     11     37.5%
SELP (OOP)    74     4K     382    37.5%
SELP (IP)     49     4K     418    37.5%
SELP (IP)     49     0      418    50%
Higher performance at higher occupancy
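The occupancy column can be reproduced with a simplified occupancy model. Everything outside the table is an assumption here: a block size of 64 threads (the slides do not state it), "4K" read as 4096 bytes of shared memory per block, per-SM limits of 64 warps, 16 blocks, 65536 registers and 48 KB of shared memory for the Kepler K20c, and no register or shared-memory allocation granularity. The names `sm_limits`, `occupancy` and `KEPLER_SM` are made up for this sketch.

```c
#include <assert.h>

typedef struct {
    int max_warps;      /* resident warps per SM          */
    int max_blocks;     /* resident blocks per SM         */
    int registers;      /* 32-bit registers per SM        */
    int shared_bytes;   /* shared memory per SM, in bytes */
    int warp_size;
} sm_limits;

static int min_int(int a, int b) { return a < b ? a : b; }

/* Fraction of the SM's warp slots that can be resident at once. */
static double occupancy(sm_limits sm, int threads_per_block,
                        int regs_per_thread, int shared_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0 && shared_per_block >= 0);
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int blocks = min_int(sm.max_blocks, sm.max_warps / warps_per_block);
    blocks = min_int(blocks, sm.registers / (regs_per_thread * threads_per_block));
    if (shared_per_block > 0)
        blocks = min_int(blocks, sm.shared_bytes / shared_per_block);
    return (double)(blocks * warps_per_block) / sm.max_warps;
}

/* Assumed per-SM limits for a Kepler-class device such as the K20c. */
static const sm_limits KEPLER_SM = { 64, 16, 65536, 48 * 1024, 32 };
```

Under these assumptions, 4 KB of shared memory per block caps the SM at 12 resident blocks (24 of 64 warps, 37.5%) regardless of whether a thread uses 49, 50 or 74 registers. Dropping shared memory entirely lets the 16-block limit take over (32 of 64 warps, 50%), which matches the last row of the table.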
80
Results (1D FFT 256-pts)
Speedups
–as high as 9.1 and 5.8 over FFTW
–as high as 18.2 and 2.9 over unoptimized GPU
81
Summary
Approach
–Focus on identifying optimizations for hardware
Takeaways
–FFTs are memory-bound (focus should be on memory opts.)
–Homogeneous set of optimizations for all GPUs: RP + LM-CM + VASM2 + CM/GAP
82
Thank You!
Contributions:
–Optimization principles for FFT on GPUs
–An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
Contact:
–Carlo del Mundo