The Hardware-Software Co-Design Process for the Fast Fourier Transform (FFT)
Carlo C. del Mundo
Advisor: Prof. Wu-chun Feng
synergy.cs.vt.edu

The Multi- and Many-core Menace
"...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach

Berkeley's View

Dwarfs of Symbolic Computation [1,2]
Dwarf (noun): an algorithmic method that captures a pattern of computation and communication.
[1] Colella, Phillip. Defining Software Requirements for Scientific Computing. 2004.
[2] Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

Dwarfs of Symbolic Computation [1,2]
The thirteen dwarfs: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grid, Unstructured Grid, MapReduce, Combinational Logic, Graph Traversal, Dynamic Programming, Branch-and-Bound, Graphical Models, Finite State Machines.


Dwarfs of Symbolic Computation [1,2]
Spectral Methods → Fast Fourier Transform
Software | Hardware


OpenFFT: A Heterogeneous FFT Library
- C. del Mundo and W. Feng. "Towards a Performance-Portable FFT Library for Heterogeneous Computing," in IEEE IPDPS '13. Phoenix, AZ, USA, May 2013. (Under review.)
- C. del Mundo and W. Feng. "Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture," SC|13, Denver, CO, Nov. 2013. (Poster publication.)
- C. del Mundo et al. "Accelerating FFT for Wideband Channelization," in IEEE ICC '13. Budapest, Hungary, June 2013.

Berkeley's View

Computational Pattern: The Butterfly

Dwarf: Spectral Methods – The Butterfly Pattern
Figure 1: Simple butterfly. Inputs a and b; outputs a + b and a - b.
Figure 2: Butterfly with twiddle w^k. Inputs a and b; outputs a + b*w^k and a - b*w^k.
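To make the butterfly concrete, here is a minimal CUDA sketch (the talk's kernels are OpenCL; the helper name butterfly and the use of cuComplex are illustrative assumptions, not OpenFFT code). Figure 1's simple butterfly is the special case w = 1.

    #include <cuComplex.h>

    // Radix-2 butterfly with twiddle w = w^k (illustrative helper, not
    // from OpenFFT). Updates a and b in place:
    //   (a, b) <- (a + b*w, a - b*w)
    __device__ void butterfly(cuFloatComplex *a, cuFloatComplex *b,
                              cuFloatComplex w)
    {
        cuFloatComplex t = cuCmulf(*b, w);  // b * w^k
        *b = cuCsubf(*a, t);                // a - b * w^k
        *a = cuCaddf(*a, t);                // a + b * w^k
    }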

16-pt FFT: Stages of Computation
Input Array → S1: "Columns" → S2: "Twiddles" → S3: "Transpose" → S1: "Columns"

16-pt FFT: Computation-to-Core: 1 Thread

16-pt FFT: Computation-to-Core: 4 Threads

16-pt FFT: Computation-to-Core: One Warp

16-pt FFT: Computation-to-Core: One Block

16-pt FFT: Computation-to-Core: One SM (37.5% occupancy on an NVIDIA Kepler K20c)

16-pt FFT: Stages of Computation
Input Array → S1: "Columns" → S2: "Twiddles" → S3: "Transpose" → S1: "Columns"

GPU Memory Spaces

Background (GPUs): GPU Memory Hierarchy
- Global Memory
- Image Memory
- Constant Memory
- Local Memory
- Registers

Table: Memory read bandwidth on the AMD Radeon HD 6970
Memory Unit      Read Bandwidth (TB/s)
Registers        16.2
Constant          5.4
Local             2.7
L1 / L2 Cache     1.35 / 0.45
Global            0.17
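As a schematic of where these spaces surface in source (a CUDA sketch, using the talk's note that CUDA shared memory == OpenCL local memory; the kernel and array names are made up):

    __constant__ float2 twiddles[16];       // constant memory: cached, read-only
                                            // (shown for placement; unused here)

    __global__ void staged_copy(const float2 *in,   // global memory: off-chip
                                float2 *out)
    {
        __shared__ float2 tile[64];         // shared (OpenCL: local) memory: on-chip
        float2 v = in[threadIdx.x];         // v lives in a register
        tile[threadIdx.x] = v;
        __syncthreads();                    // order the write/read phases
        out[threadIdx.x] = tile[63 - threadIdx.x];  // assumes a 64-thread block
    }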

Global Data Bandwidth (Bus Traffic)

Bus Traffic and Performance
Bus traffic: bytes transferred from off-chip to on-chip memory and back.
Suppose we take the FFT of a 128 MB data set. The minimum bus traffic is 2 x 128 MB = 256 MB:
- Load 128 MB (global → on-chip)
- Perform the FFT
- Store 128 MB (on-chip → global)
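For scale, a back-of-the-envelope bound (an assumption: taking the HD 6970's 0.17 TB/s global read bandwidth from the table above as a proxy for sustained bus bandwidth in both directions): moving 256 MB at roughly 170 GB/s takes at least 256 MB / (170 GB/s) ≈ 1.5 ms, no matter how fast the butterflies themselves execute.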

16-pt FFT: Computation-to-Core: One Warp
What's wrong with this access pattern? Scattered memory accesses (power-of-2 strides), i.e., uncoalesced memory access.

16-pt FFT: Stages of Computation
Input Array → S1: "Columns" → S2: "Twiddles" → S3: "Transpose" → S1: "Columns"

System-level optimizations (applicable to any application)
1. Register Preloading
2. Vector Access / {Vector, Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing

S1: Register Preloading. Load to registers first.

Without register preloading:

    __kernel void FFT16_vanilla(__global float2 *buffer)
    {
        int index = ...;
        buffer += index;
        FFT4_in_order_output(&buffer[0], &buffer[4],
                             &buffer[8], &buffer[12]);
        // ...
    }

With register preloading:

    __kernel void FFT16_strawberry1(__global float2 *buffer)
    {
        int index = ...;
        buffer += index;
        float2 registers[4];              // explicit loads
        for (int i = 0; i < 4; ++i)
            registers[i] = buffer[4*i];
        FFT4_in_order_output(&registers[0], &registers[1],
                             &registers[2], &registers[3]);
        // ...
    }

S2: Vector Types
Figure: a four-element array a[0..3] laid out contiguously.
Vector access (float{2, 4, 8, 16}) with:
- Scalar math (VASM): float + float
- Vector math (VAVM): floatN + floatN
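The distinction in code, as a CUDA sketch (the talk's kernels are OpenCL, where floatN has built-in operators; CUDA's float2 does not, so the VAVM variant needs a small helper, and both kernel names are made up):

    // VAVM helper: CUDA float2 has no built-in operator+.
    __device__ float2 operator+(float2 a, float2 b)
    {
        return make_float2(a.x + b.x, a.y + b.y);
    }

    // VASM: vectorized access (one float2 load), scalar math per component.
    __global__ void vasm(const float2 *in, float2 *out, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float2 v = in[i];        // single 8-byte vector load
        v.x += s;                // scalar adds, one component at a time
        v.y += s;
        out[i] = v;
    }

    // VAVM: vectorized access and vector math on whole float2 values.
    __global__ void vavm(const float2 *in, float2 *out, float2 s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] + s;      // float2 + float2
    }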

S3: Constant Memory. Fast cached lookup for frequently used data.

Without constant memory (twiddles recomputed every iteration):

    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = make_float2(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With constant memory (twiddles precomputed into a cached table):

    __constant float2 twiddles[16] = {
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        /* ... more sin/cos values ... */
    };

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j + tid];

16-pt FFT: Stages of Computation
Input Array → S1: "Columns" → S2: "Twiddles" → S3: "Transpose" → S1: "Columns"

Algorithm-level optimizations (applicable only to the FFT)
1. Transpose via LM
2. Compute/Transpose via LM
3. Compute/No Transpose via LM
4. Register-to-register transpose (shuffle)

A1: Transpose via local memory [1]
- Via shared memory
- Register to register (shfl)
[1] CUDA shared memory == OpenCL local memory.

Algorithm-level optimizations: 1. Naïve Transpose (LM-CM)
Figure: threads t0–t3 move data from the original layout in the register file, through local memory, into the transposed layout (a sketch of this staging appears below).
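A minimal sketch of that staging for one 4x4 tile, assuming each of four threads holds its row in a register array r[4] (a hypothetical CUDA fragment; CUDA shared memory == OpenCL local memory):

    __shared__ float2 tile[4][4];
    int t = threadIdx.x & 3;            // lane within the 4-thread group
    for (int k = 0; k < 4; ++k)
        tile[t][k] = r[k];              // write my row into local memory
    __syncthreads();
    for (int k = 0; k < 4; ++k)
        r[k] = tile[k][t];              // read back my column: transposed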

Shuffle Mechanics
The FFT transpose can be implemented using shuffle. With threads t0–t3 operating on the register file, the transpose proceeds in three stages:
- Stage 1: Horizontal
- Stage 2: Vertical (the bottleneck: intra-thread data movement)
- Stage 3: Horizontal

Code 1 (NAIVE), the vertical stage:

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];
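One consistent way to realize the three stages is sketched below in CUDA (an assumption-laden reconstruction: the lane arithmetic the talk actually uses may differ, __shfl_sync is the modern form of the Kepler-era shuffle, and real FFT data would shuffle the .x and .y halves of each float2 separately). Stage 2 is exactly Code 1 (NAIVE).

    // 4x4 in-register transpose across a 4-lane group. Thread t enters
    // holding row t of the tile in r[0..3] and exits holding column t.
    __device__ void transpose4x4(float r[4])
    {
        const unsigned mask = 0xffffffffu;   // assumes a full, converged warp
        const int tid = threadIdx.x & 3;     // lane within the 4-thread group
        float tmp[4];

        for (int k = 0; k < 4; ++k)          // Stage 1: horizontal (shuffle)
            tmp[k] = __shfl_sync(mask, r[k], (tid + k) & 3, 4);

        for (int k = 0; k < 4; ++k)          // Stage 2: vertical, in-thread:
            r[k] = tmp[(4 - tid + k) & 3];   // Code 1's rotation; the dynamic
                                             // indexing here is the bottleneck

        for (int k = 0; k < 4; ++k)          // Stage 3: horizontal (shuffle)
            r[k] = __shfl_sync(mask, r[k], (4 + k - tid) & 3, 4);
    }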

Shuffle Mechanics: General Strategies
- Shuffle instructions are cheap; CUDA local memory is slow.
- The compiler is forced to place registers into CUDA local memory if array indices CANNOT be determined at compile time.

Code 1 (NAIVE): dynamic indexing, so the compiler spills to CUDA local memory:

    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV): compile-time indices, at the cost of branch divergence:

    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP, out-of-place): branch-free predicated selects:

    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

Results (Experimental Testbed)

Evaluation:
- Algorithm: 1D FFT (batched), N = 16, 64, and 256 points
- FFTW version: v3.3.2 (4 threads, OpenMP with AVX extensions)
- FFTW hardware: Intel i… (4 cores @ 3.1 GHz)

GPU testbed:
Device              Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)   Max TDP (W)
AMD Radeon HD …       …               …                         …                    …
AMD Radeon HD …       …               …                         …                    …
NVIDIA Tesla C…       …               …                         …                    …
NVIDIA Tesla K20c     …               …                         …                    …

Results (optimizations in isolation)
- Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT): critical on AMD GPUs, much less so on NVIDIA GPUs.
- Use VASM2/VASM4; do not consider VAVM types.

Key: RP: Register Preloading; LM-{CM, CT, CC}: Local Memory {Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math (floatn); VAVM{n}: Vectorized Access & Vector Math (floatn); CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop Unrolling; CSE: Common Subexpression Elimination; IL: Function Inlining; Baseline: VASM2.

Results (optimizations in concert)
- Device data transfer (black in the plots) subsumes execution time.
- One set of optimizations serves all GPUs: RP + LM-CM + VASM2 + CM/CGAP.

Shuffle Results
- Fraction enhanced (0 < f < 1) = 16.5%
- Max. speedup of the enhanced portion (s > 1) = 1.19x
- Achieved speedup of the enhanced portion (s > 1) = 1.08x

Name         Reg.   Shm.   SLOC   LMEM
Shared        50     4K     11      0
SELP (OOP)    74     4K    382      0
SELP (IP)     49     4K    418      0
DIV           72     4K    462      0
Naive         52     4K     …       …

... but, wait! There's more!
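The 1.19x ceiling appears to follow from Amdahl's Law applied to the shuffle-enhanced fraction (an inference from the slide's f and s notation, not stated explicitly): with f = 0.165, the bound is 1 / (1 - f) = 1 / 0.835 ≈ 1.19x. The measured 1.08x therefore leaves headroom, which motivates the occupancy experiment that follows.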

Shuffle Results: At 50% Occupancy
- Fraction enhanced (0 < f < 1) = 16.5%
- Max. speedup of the enhanced portion¹ (s > 1) = 1.19x
- Achieved speedup of the enhanced portion² (s > 1) = 1.08x → 1.17x
¹ Calculated at 37.5% occupancy. ² Calculated at 50% occupancy.

Name         Reg.   Shm.   SLOC   Occupancy
Shared        50     4K     11      37.5%
SELP (OOP)    74     4K     …        …
SELP (IP)     49     4K     …        …
SELP (IP)     …      …      …        …

Takeaway: higher performance at higher occupancy.

Results (1D FFT, 256 points)
Speedups:
- as high as 9.1x and 5.8x over FFTW
- as high as 18.2x and 2.9x over unoptimized GPU code

Summary
Approach: focus on identifying optimizations for the hardware.
Takeaways:
- FFTs are memory-bound; the focus should be on memory optimizations.
- A homogeneous set of optimizations serves all GPUs: RP + LM-CM + VASM2 + CM/CGAP.

Thank You!
Contributions:
- Optimization principles for the FFT on GPUs
- An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
Contact: Carlo del Mundo