
Presentation transcript:

1 Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
Student: Carlo C. del Mundo*, Virginia Tech (Undergrad)
Advisor: Dr. Wu-chun Feng*§, Virginia Tech
*Department of Electrical and Computer Engineering, §Department of Computer Science, Virginia Tech
(synergy.cs.vt.edu)

2 Forecast: Hardware-Software Co-Design
– Software: matrix transpose
– Hardware: NVIDIA Kepler K20c and its shuffle mechanism

3–8 Q: What is shuffle?
– Cheaper data movement: faster than shared memory
– Available only in NVIDIA Tesla Kepler GPUs
– Limited to the threads of a warp
>>> Idea: reduce data communication between threads <<<
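Shuffle lets each thread read a register value from another thread in the same warp without a round trip through shared memory. As a rough host-side illustration (plain C modeling a warp as an array of lanes; this models the semantics only and is not the CUDA API):

```c
#include <assert.h>

/* Host-side C sketch of warp-shuffle semantics: every lane publishes a
 * register value, and each lane reads the value published by a (possibly
 * different) source lane, with no trip through shared memory. The
 * array-of-lanes model is an illustrative assumption. */
#define WARP_SIZE 32

/* Lane i receives the value published by lane src_lane[i]. */
static void warp_shfl(const int published[WARP_SIZE],
                      const int src_lane[WARP_SIZE],
                      int received[WARP_SIZE]) {
    for (int i = 0; i < WARP_SIZE; ++i)
        received[i] = published[src_lane[i] % WARP_SIZE];
}
```

On Kepler the corresponding intrinsic is `__shfl(value, srcLane)`; each lane names its own source lane, which is what makes the transpose patterns later in the deck possible.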

9–13 Q: What are you solving?
Enable efficient data communication
– Shared Memory (the “old” way)
– Shuffle (the “new” way)

14–16 Approach
– Evaluate shuffle using matrix transpose: matrix transpose is a data-communication step in FFT
– Devised a shuffle transpose algorithm, consisting of horizontal (inter-thread shuffle) and vertical (intra-thread) stages
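The slides do not spell out the shuffle-transpose schedule, so the following is one illustrative reconstruction in the spirit of known in-register transposes: a vertical rotate, one round of horizontal shuffles, then a second vertical rotate. Plain C on the host, with `r[t][k]` standing in for register k of thread t; the exact rotation pattern is an assumption, not necessarily the deck's algorithm.

```c
#include <assert.h>

#define N 4  /* 4 threads, 4 registers each: one 4x4 tile */

/* Transpose a 4x4 tile held "in registers" using only per-thread
 * rotations (vertical) and one round of lane shuffles (horizontal).
 * r[t][k] models register k of thread t. */
static void tile_transpose(int r[N][N]) {
    int tmp[N][N], out[N][N];

    /* Stage 1 (vertical): thread t rotates its registers left by t. */
    for (int t = 0; t < N; ++t)
        for (int k = 0; k < N; ++k)
            tmp[t][k] = r[t][(k + t) % N];

    /* Stage 2 (horizontal): register k of thread t is fetched from
     * lane (t - k) mod N -- one shuffle per register index k. */
    for (int t = 0; t < N; ++t)
        for (int k = 0; k < N; ++k)
            out[t][k] = tmp[(t - k + N) % N][k];

    /* Stage 3 (vertical): thread t rotates its registers right by t. */
    for (int t = 0; t < N; ++t)
        for (int j = 0; j < N; ++j)
            r[t][j] = out[t][(t - j + N) % N];
}
```

After the call, thread t's register j holds element (j, t) of the original tile, i.e., thread t owns column t.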

17–21 Analysis
Bottleneck: intra-thread data movement (Stage 2: Vertical, within the register file of threads t0–t3)
Code 1 (NAIVE): 15x
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

22 Analysis: general strategies
Registers are fast; CUDA local memory is slow.
– The compiler is forced to place data into CUDA local memory if array indices CANNOT be determined at compile time.
– In NAIVE, the index (4 - tid + k) % 4 depends on the runtime value tid, so the arrays spill to local memory.
Code 1 (NAIVE): 15x
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

23–24 Analysis: Code 2 (DIV): 6%
Same rotation, but each tid takes an explicit branch so every register index is a compile-time constant, at the cost of divergence:
    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

25 Analysis: Code 3 (SELP OOP): 44%
Out-of-place and branch-free: every assignment is predicated on tid (compiled to select/SELP instructions), and all register indices are compile-time constants:
    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
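All three versions on the preceding slides implement the same per-thread rotation dst[k] = src[(4 - tid + k) % 4]. A quick host-side C check (a sketch; `tid` is passed as a plain parameter rather than being a thread index) confirms they agree:

```c
#include <assert.h>
#include <string.h>

/* Code 1 (NAIVE): index depends on the runtime tid. */
static void rotate_naive(int dst[4], const int src[4], int tid) {
    for (int k = 0; k < 4; ++k)
        dst[k] = src[(4 - tid + k) % 4];
}

/* Code 2 (DIV): in-place, one branch per tid, constant indices. */
static void rotate_div(int r[4], int tid) {
    int tmp = r[0];
    if (tid == 1) {
        r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp;
    } else if (tid == 2) {
        r[0] = r[2]; r[2] = tmp;
        tmp = r[1]; r[1] = r[3]; r[3] = tmp;
    } else if (tid == 3) {
        r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp;
    }
}

/* Code 3 (SELP OOP): out-of-place, fully predicated, constant indices. */
static void rotate_selp(int dst[4], const int src[4], int tid) {
    dst[0] = (tid == 0) ? src[0] : dst[0];
    dst[1] = (tid == 0) ? src[1] : dst[1];
    dst[2] = (tid == 0) ? src[2] : dst[2];
    dst[3] = (tid == 0) ? src[3] : dst[3];
    dst[0] = (tid == 1) ? src[3] : dst[0];
    dst[3] = (tid == 1) ? src[2] : dst[3];
    dst[2] = (tid == 1) ? src[1] : dst[2];
    dst[1] = (tid == 1) ? src[0] : dst[1];
    dst[0] = (tid == 2) ? src[2] : dst[0];
    dst[2] = (tid == 2) ? src[0] : dst[2];
    dst[1] = (tid == 2) ? src[3] : dst[1];
    dst[3] = (tid == 2) ? src[1] : dst[3];
    dst[0] = (tid == 3) ? src[1] : dst[0];
    dst[1] = (tid == 3) ? src[2] : dst[1];
    dst[2] = (tid == 3) ? src[3] : dst[2];
    dst[3] = (tid == 3) ? src[0] : dst[3];
}
```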

26–27 Results
[Performance charts not reproduced in the transcript.]

28–29 Conclusion
Overall Performance
– Max. speedup (Amdahl’s Law): 1.19-fold
– Achieved speedup: 1.17-fold
Surprise Result
– Goal: accelerate the communication (“gray bar”)
– Result: also accelerated the computation (“black bar”)
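The 1.19-fold bound follows from Amdahl's law: if the communication step takes a fraction p of total runtime and is made free, the best possible speedup is 1/(1 - p). Back-solving 1.19 gives p of roughly 16%; that fraction is derived here for illustration and is not stated on the slides.

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law for a component that is accelerated infinitely:
 * max speedup = 1 / (1 - p), where p is the component's runtime share. */
static double amdahl_max_speedup(double p) {
    return 1.0 / (1.0 - p);
}

/* Inverse: the runtime share implied by a given maximum speedup. */
static double amdahl_fraction(double max_speedup) {
    return 1.0 - 1.0 / max_speedup;
}
```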

30 Thank You!
Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
– Student: Carlo del Mundo, Virginia Tech (undergrad)
– Overall Performance: theoretical speedup 1.19-fold; achieved speedup 1.17-fold

31 Appendix

32 Motivation
Goal
– Accelerate an application based on hardware-specific mechanisms (i.e., the hardware-software co-design process)
Case Study
– Application: matrix transpose as part of a 256-pt FFT
– Architecture: NVIDIA Kepler K20c; use shuffle to accelerate communication
Results
– Max. theoretical speedup: 1.19-fold
– Achieved speedup: 1.17-fold

33 Background: The New and Old
Shuffle
– Idea: communicate data within a warp without shared memory
– Pros: faster (one instruction performs the load and store); eliminates the use of shared memory, giving higher thread occupancy
– Cons: poorly understood; available only in Kepler GPUs; limited to 32 threads
Shared Memory
– Idea: scratchpad memory used to communicate data
– Pros: easy to program; scales to a block (up to 1536 threads)
– Cons: prone to bank conflicts

