Synergy.cs.vt.edu The Hardware-Software Co-Design Process for the fast Fourier transform (FFT) Carlo C. del Mundo Advisor: Prof. Wu-chun Feng.

The Multi- and Many-core Menace
“...when we start talking about parallelism and ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced....I would be panicked if I were in industry.”
John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach

Berkeley’s View

Dwarfs of Symbolic Computation [1,2]
Dwarf (noun): An algorithmic method that captures a pattern of computation and communication
– Structured Grid
– Unstructured Grid
– Spectral Methods
– Dense Linear Algebra
– Sparse Linear Algebra
– N-Body Methods
– MapReduce
– Graphical Models
– Combinational Logic
– Graph Traversal
– Dynamic Programming
– Branch-and-Bound
– Finite State Machines
Spectral Methods: Fast Fourier Transform (Software, Hardware)
[1] Colella, Phillip. Defining software requirements for scientific computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf.
[2] Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

OpenFFT: A heterogeneous FFT library
– C. del Mundo and W. Feng. “Towards a Performance-Portable FFT Library for Heterogeneous Computing,” in IEEE IPDPS ‘13. Phoenix, AZ, USA, May 2014. (Under review.)
– C. del Mundo and W. Feng. “Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture”, SC|13, Denver, CO, Nov. 2013. (Poster publication)
– C. del Mundo et al. “Accelerating FFT for Wideband Channelization.” in IEEE ICC ‘13. Budapest, Hungary, June 2013.

Computational Pattern: The Butterfly
Dwarf: Spectral Methods
– Computational pattern: the butterfly
Figure 1: Simple Butterfly. Inputs a and b; outputs a + b and a - b.
Figure 2: Butterfly with Twiddle, w^k. Inputs a and b; outputs a + b * w^k and a - b * w^k.
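As a rough illustration of the twiddled butterfly in Figure 2, the following C sketch computes a + b * w and a - b * w on complex values. The `cpx` type and the `cmul` and `butterfly` helpers are names assumed here for illustration; they do not appear in the slides.

```c
/* Hypothetical complex type for illustration (plays the role of float2). */
typedef struct { float re, im; } cpx;

/* Complex multiply: (a.re + i*a.im) * (b.re + i*b.im). */
cpx cmul(cpx a, cpx b) {
    cpx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Butterfly with twiddle w: *a becomes a + b*w, *b becomes a - b*w.
   With w = 1 this reduces to the simple butterfly (a + b, a - b). */
void butterfly(cpx *a, cpx *b, cpx w) {
    cpx t = cmul(*b, w);
    cpx sum  = { a->re + t.re, a->im + t.im };
    cpx diff = { a->re - t.re, a->im - t.im };
    *a = sum;
    *b = diff;
}
```

Calling `butterfly(&a, &b, w)` with `w = {1, 0}` gives the simple butterfly of Figure 1.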

16-pt FFT: Stages of Computation
Input Array
– S1: “Columns”
– S2: “Twiddles”
– S3: “Transpose”
– S1: “Columns”
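The four stages named on this slide (column FFTs, twiddle multiplication, transpose, and a second pass of column FFTs) can be sketched on the host in plain C. This is a minimal sketch, assuming the 16 inputs are viewed as a 4 x 4 row-major matrix; the `cpx` type, the hard-coded `W16` table, and all function names are assumptions made for illustration, and `dft16` is only a naive reference for checking the result.

```c
/* Hypothetical complex type standing in for float2. */
typedef struct { float re, im; } cpx;

static cpx cadd(cpx a, cpx b) { cpx r = { a.re + b.re, a.im + b.im }; return r; }
static cpx csub(cpx a, cpx b) { cpx r = { a.re - b.re, a.im - b.im }; return r; }
static cpx cmul(cpx a, cpx b) {
    cpx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

#define COS_PI_8 0.92387953f
#define SIN_PI_8 0.38268343f
#define COS_PI_4 0.70710678f

/* W16[m] = exp(-2*pi*i*m/16), precomputed as a lookup table. */
static const cpx W16[16] = {
    {  1.0f,  0.0f }, {  COS_PI_8, -SIN_PI_8 }, {  COS_PI_4, -COS_PI_4 }, {  SIN_PI_8, -COS_PI_8 },
    {  0.0f, -1.0f }, { -SIN_PI_8, -COS_PI_8 }, { -COS_PI_4, -COS_PI_4 }, { -COS_PI_8, -SIN_PI_8 },
    { -1.0f,  0.0f }, { -COS_PI_8,  SIN_PI_8 }, { -COS_PI_4,  COS_PI_4 }, { -SIN_PI_8,  COS_PI_8 },
    {  0.0f,  1.0f }, {  SIN_PI_8,  COS_PI_8 }, {  COS_PI_4,  COS_PI_4 }, {  COS_PI_8,  SIN_PI_8 }
};

/* In-place 4-point DFT over m[col], m[col+4], m[col+8], m[col+12]. */
static void fft4_column(cpx *m, int col) {
    cpx a = m[col], b = m[col + 4], c = m[col + 8], d = m[col + 12];
    cpx s0 = cadd(a, c), s1 = csub(a, c), s2 = cadd(b, d);
    cpx t  = csub(b, d);
    cpx s3 = { t.im, -t.re };              /* -i * (b - d) */
    m[col]      = cadd(s0, s2);
    m[col + 4]  = cadd(s1, s3);
    m[col + 8]  = csub(s0, s2);
    m[col + 12] = csub(s1, s3);
}

/* 16-point FFT in place; output is in natural order. */
void fft16(cpx x[16]) {
    for (int col = 0; col < 4; ++col)      /* "Columns" */
        fft4_column(x, col);
    for (int r = 0; r < 4; ++r)            /* "Twiddles": scale entry (r, c) by W16^(r*c) */
        for (int c = 0; c < 4; ++c)
            x[4 * r + c] = cmul(x[4 * r + c], W16[r * c]);
    for (int r = 0; r < 4; ++r)            /* "Transpose" of the 4 x 4 matrix */
        for (int c = r + 1; c < 4; ++c) {
            cpx tmp = x[4 * r + c];
            x[4 * r + c] = x[4 * c + r];
            x[4 * c + r] = tmp;
        }
    for (int col = 0; col < 4; ++col)      /* "Columns" again */
        fft4_column(x, col);
}

/* Naive O(N^2) DFT used only as a reference. */
void dft16(const cpx in[16], cpx out[16]) {
    for (int k = 0; k < 16; ++k) {
        cpx acc = { 0.0f, 0.0f };
        for (int n = 0; n < 16; ++n)
            acc = cadd(acc, cmul(in[n], W16[(n * k) % 16]));
        out[k] = acc;
    }
}
```

On a GPU, each column FFT would map to one thread, which is why the transpose between the two column passes requires communication between threads.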

16-pt FFT: Computation-to-Core
– 1 thread
– 4 threads
– One Warp
– One Block
– One SM (37.5% Occupancy on NVIDIA Kepler K20c)

GPU Memory Spaces
Background (GPUs): GPU Memory Hierarchy
– Global Memory
– Image Memory
– Constant Memory
– Local Memory
– Registers

Memory Unit     Read Bandwidth (TB/s)
Registers       16.2
Constant        5.4
Local           2.7
L1/L2 Cache     1.35 / 0.45
Global          0.17
Table: Memory Read Bandwidth for Radeon HD 6970

Global Data Bandwidth (Bus Traffic)
Bus Traffic and Performance
Bus traffic: bytes transferred from off-chip to on-chip memory and back.
Suppose we take the FFT of a 128 MB data set:
– Minimum bus traffic is 2 x 128 MB = 256 MB
   1. Load 128 MB (global -> on-chip)
   2. Perform the FFT
   3. Store 128 MB (on-chip -> global)
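The arithmetic above also gives a lower bound on run time once a bandwidth figure is fixed. The sketch below is illustrative only: the function names are hypothetical, decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) are assumed, and the bound assumes the device sustains its peak bandwidth.

```c
/* Minimum bus traffic in MB: every input byte is loaded once and stored once. */
double min_bus_traffic_mb(double data_mb) {
    return 2.0 * data_mb;
}

/* Lower bound on transfer time in milliseconds at a given peak bandwidth.
   MB divided by GB/s equals milliseconds when decimal units are assumed. */
double min_transfer_ms(double traffic_mb, double bandwidth_gb_per_s) {
    return traffic_mb / bandwidth_gb_per_s;
}
```

With the 176 GB/s peak bandwidth listed for the Radeon HD 6970 in the testbed table, 256 MB of traffic cannot complete in less than about 1.45 ms, no matter how fast the arithmetic is.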

16-pt FFT: Computation-to-Core: One Warp
What’s wrong with this access pattern?
– Scattered memory accesses (power-of-2 strides)
– Uncoalesced memory access
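To see why power-of-2 strides hurt, one can count how many memory segments a warp touches in a single load. The sketch below is a simplified host-side model, not a description of any specific GPU: the 32-thread warp, the 128-byte segment size, and the function name are assumptions made for illustration.

```c
#include <stddef.h>

#define WARP_SIZE 32
#define SEGMENT_BYTES 128   /* assumed size of one memory transaction */

/* Counts the distinct segments touched when thread tid reads one element of
   elem_bytes at element index tid * stride (base address assumed aligned). */
int segments_touched(size_t elem_bytes, size_t stride) {
    size_t seen[WARP_SIZE];
    int count = 0;
    for (int tid = 0; tid < WARP_SIZE; ++tid) {
        size_t seg = ((size_t)tid * stride * elem_bytes) / SEGMENT_BYTES;
        int duplicate = 0;
        for (int i = 0; i < count; ++i)
            if (seen[i] == seg)
                duplicate = 1;
        if (!duplicate)
            seen[count++] = seg;
    }
    return count;
}
```

Under this model, 8-byte elements read with stride 1 touch 2 segments, stride 4 touches 8, and stride 16 touches 32, so every thread triggers its own transaction.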

System-level optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing

S1: Register Preloading
Load to registers first.

Without Register Preloading (excerpt):
    __kernel void FFT16_vanilla(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);

With Register Preloading (excerpt):
    __kernel void FFT16_strawberry1(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        float2 registers[4];  // Explicit loads
        for (int i = 0; i < 4; ++i)
            registers[i] = buffer[4*i];
        FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);

S2: Vector Types
Vector Access (float{2, 4, 8, 16})
– Scalar Math (VASM): float + float
– Vector Math (VAVM): floatN + floatN
[Figure: a[0] a[1] a[2] a[3]]

S3: Constant Memory
Fast cached lookup for frequently used data.

    __constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), ... more sin/cos values };

Without Constant Memory:
    for (int j = 1; j < 4; ++j)
    {
        // Twiddle factor recomputed on every iteration
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        // '*' stands for a complex multiply here (OpenCL's float2 '*' is component-wise)
        result[j] = buffer[j*4] * twid;
    }

With Constant Memory:
    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];

Algorithm-level optimizations (applicable only to FFT)
1. Transpose via LM
2. Compute/Transpose via LM
3. Compute/No Transpose via LM
4. Register-to-register transpose (shuffle)

A1: Transpose via local memory [1]
– Via Shared Memory
– Register to Register (shfl)
[1] CUDA shared memory == OpenCL local memory

Algorithm-level optimizations
1. Naïve Transpose (LM-CM)
[Figure: register file of threads t0, t1, t2, t3 in its original and transposed layouts, exchanged through local memory]
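The naive transpose can be modelled on the host as a write to a scratch array followed by a strided read. This is a minimal sketch under stated assumptions: four threads with four registers each, a hypothetical function name, and a plain C array standing in for local memory.

```c
#define TILE 4

/* regs[t][k] models register k of thread t. */
void transpose_via_local_memory(float regs[TILE][TILE]) {
    float local_mem[TILE * TILE];   /* stands in for OpenCL local memory */

    /* Phase 1: each thread t writes its registers to row t. */
    for (int t = 0; t < TILE; ++t)
        for (int k = 0; k < TILE; ++k)
            local_mem[t * TILE + k] = regs[t][k];

    /* On a GPU, a barrier would be required here before any thread reads. */

    /* Phase 2: each thread t reads column t back into its registers. */
    for (int t = 0; t < TILE; ++t)
        for (int k = 0; k < TILE; ++k)
            regs[t][k] = local_mem[k * TILE + t];
}
```

Every element makes a round trip through local memory, which is the cost the register-to-register shuffle variant tries to avoid.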

Shuffle Mechanics
FFT transpose can be implemented using shuffle:
– Stage 1: Horizontal
– Stage 2: Vertical
– Stage 3: Horizontal
– Bottleneck: intra-thread data movement (Stage 2: Vertical)
[Figure: register file of threads t0, t1, t2, t3]

Code 1 (NAIVE):
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];
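The three stages can be emulated on the host to check that they compose into a transpose. In the sketch below, Stage 2 uses the rotation from Code 1 as given on the slide; the source-lane formulas for the two horizontal stages are assumptions chosen so that the result is a transpose, and the function name is hypothetical. On a GPU, the horizontal stages would be shuffle instructions with a fixed register index per call.

```c
#define LANES 4

/* regs[t][k] models register k of thread (lane) t. */
void transpose_via_shuffle(float regs[LANES][LANES]) {
    float tmp[LANES][LANES];

    /* Stage 1 (horizontal, between threads): lane t takes register k
       from lane (t + k) % LANES. Assumed lane formula. */
    for (int t = 0; t < LANES; ++t)
        for (int k = 0; k < LANES; ++k)
            tmp[t][k] = regs[(t + k) % LANES][k];

    /* Stage 2 (vertical, within a thread): rotation from Code 1. */
    for (int t = 0; t < LANES; ++t)
        for (int k = 0; k < LANES; ++k)
            regs[t][k] = tmp[t][(LANES - t + k) % LANES];

    /* Stage 3 (horizontal, between threads): lane t takes register k
       from lane (k - t) mod LANES. Assumed lane formula. */
    for (int t = 0; t < LANES; ++t)
        for (int k = 0; k < LANES; ++k)
            tmp[t][k] = regs[(LANES + k - t) % LANES][k];

    for (int t = 0; t < LANES; ++t)
        for (int k = 0; k < LANES; ++k)
            regs[t][k] = tmp[t][k];
}
```

Only Stage 2 indexes a thread's registers with a value that depends on `tid`, which is why it is the stage the slides single out as the bottleneck.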

General strategies
– Shuffle instructions are cheap.
– CUDA local memory is slow.
   – Compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Code 1 (NAIVE):
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV), branches on tid (divergence):
    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP):
    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

Results (Experimental Testbed)

Evaluation
Algorithm        1D FFT (batched), N = 16-, 64-, and 256-pts
FFTW Version     v3.3.2 (4 threads, OpenMP with AVX extensions)
FFTW Hardware    Intel i5-2400 (4 cores @ 3.1 GHz)

GPU Testbed
Device                 Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)   Max TDP (Watts)
AMD Radeon HD 6970     1536    2703                        176                     250
AMD Radeon HD 7970     2048    3788                        264                     250
NVIDIA Tesla C2075     448     1288                        144                     225
NVIDIA Tesla K20c      2496    4106                        208                     225
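The peak figures in the testbed table can be turned into a rough check of whether the FFT is limited by memory or by arithmetic. The sketch below rests on assumptions that are not in the slides: the conventional estimate of 5 N log2 N floating-point operations for an N-point complex FFT, single-precision complex elements of 8 bytes, and each element crossing the bus exactly once in each direction. All function names are hypothetical.

```c
/* Device balance: floating-point operations the device can perform per byte
   of global memory traffic at peak rates. */
double machine_balance(double peak_gflops, double peak_gb_per_s) {
    return peak_gflops / peak_gb_per_s;
}

/* Integer log2 for powers of two. */
static int ilog2(unsigned n) {
    int log = 0;
    while (n > 1) { n >>= 1; ++log; }
    return log;
}

/* Flops per byte for an N-point FFT: (5 * N * log2 N) / (2 * 8 * N). */
double fft_intensity(unsigned n) {
    return 5.0 * ilog2(n) / 16.0;
}
```

Under these assumptions a 256-point FFT performs about 2.5 flops per byte, while the listed devices need roughly 9 to 20 flops per byte to keep their arithmetic units busy, which points to memory as the limiting resource.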

Results (optimizations in isolation)
– Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
   – Critical in AMD GPUs, not so much for NVIDIA GPUs
– Use VASM2/VASM4 (do not consider VAVM types)
Legend:
– RP: Register Preloading
– LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}
– VASM{n}: Vectorized Access & Scalar Math{floatn}
– VAVM{n}: Vectorized Access & Vector Math{floatn}
– CM: Constant Memory Usage
– CGAP: Coalesced Access Pattern
– LU: Loop unrolling
– CSE: Common subexpression elimination
– IL: Function inlining
– Baseline: VASM2

Results (optimizations in concert)
– Device data transfer (black) subsumes execution time
– One set of optimizations for all GPUs: RP + LM-CM + VASM2 + CM/CGAP

Shuffle Results
– Fraction enhanced (0 < f < 1) = 16.5%
– Max. speedup enhanced (s > 1) = 1.19x
   – Speedup enhanced (s > 1) = 1.08x

Name         Reg.   Shm.   SLOC   LMEM
Shared       50     4K     11     0
SELP (OOP)   74     4K     382    0
SELP (IP)    49     4K     418    0
DIV          72     4K     462    0
Naive        52     4K     100    128

... but, wait! There’s more!
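The "fraction enhanced" and "speedup enhanced" terms are those of Amdahl's law, so the bound quoted above can be reproduced with a few lines of C. This is a minimal sketch; the function names are hypothetical.

```c
/* Overall speedup when a fraction f of the run time is accelerated by s. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on the overall speedup as s grows without limit. */
double amdahl_limit(double f) {
    return 1.0 / (1.0 - f);
}
```

With f = 0.165, `amdahl_limit` evaluates to just under 1.2, in line with the 1.19x maximum on the slide; any measured speedup, such as the 1.08x reported, has to fall below that bound.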

Shuffle Results: At 50% occupancy
– Fraction enhanced (0 < f < 1) = 16.5%
– Max. speedup enhanced (s > 1) = 1.19x [1]
   – Speedup enhanced (s > 1) = 1.08x -> 1.17x [2]

Name         Reg.   Shm.   SLOC   OCC
Shared       50     4K     11     37.5%
SELP (OOP)   74     4K     382    37.5%
SELP (IP)    49     4K     418    37.5%
SELP (IP)    49     0      418    50%

Higher performance at higher occupancy.
[1] Calculation at 37.5% occupancy
[2] Calculation at 50% occupancy

Results (1D FFT 256-pts)
Speedups
– as high as 9.1x and 5.8x over FFTW
– as high as 18.2x and 2.9x over unoptimized GPU

Summary
Approach
– Focus on identifying optimizations for hardware
Takeaways
– FFTs are memory-bound (focus should be on memory opts.)
– Homogeneous set of optimizations for all GPUs: RP + LM-CM + VASM2 + CM/CGAP

Thank You!
Contributions:
– Optimization principles for FFT on GPUs
– An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
Contact:
– Carlo del Mundo
