1 synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Carlo del Mundo*, Vignesh Adhinarayanan§, Wu-chun Feng*§
* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech

3 Forecast
Goal: Accelerate the Fast Fourier Transform (FFT) using graphics processing units (GPUs)
– Replace fixed hardware ASICs with programmable GPUs
Images:
http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg
Carlo del Mundo, cdel@vt.edu, carlodelmundo.com

10 Motivation
FFT is a critical building block across many disciplines
Images:
http://www.ajnr.org/content/27/6/1230/F1.large.jpg
http://www.elektrodaily.com/wp-content/uploads/2013/02/shazam-app.png
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

14 Introduction
Wideband Channelization
– Purpose: To isolate channels within a wideband signal
Figure: Stages in a PFB Channelizer
Image: http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

17 Introduction (Channelization)
Algorithm: Polyphase filter bank (PFB) channelizer
– Problem: The FFT stage grows fastest in channelization
Figure: Stages in a PFB Channelizer

24 Choosing the Right Processor
Criteria: Programmability & Performance
Images:
http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga
http://www.maximumpc.com/files/u154082/intel_cpu_socket3.jpg
http://fr.academic.ru/pictures/frwiki/70/Fpga_xilinx_spartan.jpg
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

25 Outline
Motivation
Introduction
Background
Approach
– System-level optimizations
– Algorithm-level optimizations
Results
– Optimizations in isolation
– Optimizations in concert
Conclusion

33 Background (GPUs)
GPU Memory Hierarchy
– Global Memory
– Image Memory
– Constant Memory
– Local Memory
– Registers
Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit     Read Bandwidth (TB/s)
Registers       16.2
Constant        5.4
Local           2.7
L1/L2 Cache     1.35 / 0.45
Global          0.17
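
The table can also be read in relative terms. The following is a minimal C sketch, not part of the deck, that divides each read bandwidth by the global-memory figure. The type `mem_bw`, the array `HD6970_BW`, and the function `bw_relative_to_global` are hypothetical names introduced for illustration; the numbers are copied from the table, with the L1/L2 row split into two entries.

```c
#include <stddef.h>
#include <string.h>

/* Read bandwidths in TB/s, copied from the Radeon HD 6970 table. */
typedef struct { const char *unit; float tbps; } mem_bw;

static const mem_bw HD6970_BW[] = {
    { "Registers", 16.2f  },
    { "Constant",   5.4f  },
    { "Local",      2.7f  },
    { "L1 Cache",   1.35f },
    { "L2 Cache",   0.45f },
    { "Global",     0.17f },   /* must stay last: used as the reference */
};

/* Bandwidth of `unit` relative to global memory, or -1 if `unit` is unknown. */
float bw_relative_to_global(const char *unit)
{
    const size_t n = sizeof HD6970_BW / sizeof HD6970_BW[0];
    const float global = HD6970_BW[n - 1].tbps;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (strcmp(HD6970_BW[i].unit, unit) == 0) {
            return HD6970_BW[i].tbps / global;
        }
    }
    return -1.0f;
}
```

With these figures, registers come out at roughly 95 times the global-memory read bandwidth and local memory at roughly 16 times.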

38 Approach
Act as the “human compiler”
1. Derive a candidate set of optimizations for FFT on GPUs
2. Apply optimizations in isolation
3. Apply optimizations in concert
Diagram labels: Candidate Optimizations, Optimizations in Isolation, Optimizations in Concert

42 Approach
System-level Optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
6. Image Memory
Algorithm-level Optimizations
1. Naïve Transpose (LM-CM)
2. Compute/Transpose via LM (LM-CC)
3. Compute/No Transpose via LM (LM-CT)
C. del Mundo et al., “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.

46 System-level Optimizations
1. Register Preloading (RP)
– Load operands from global memory into registers first, then compute on the registers
Without Register Preloading

    __kernel void unoptimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // The butterfly reads and writes global memory directly.
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        // … (rest of the kernel is not shown on the slide)
    }

With Register Preloading

    __kernel void optimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        __private float2 r0, r1, r2, r3; // Register Declaration
        // Explicit Loads
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];
        // The butterfly now operates on registers only.
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
        // … (rest of the kernel is not shown on the slide)
    }
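
The load, compute, store pattern behind register preloading can be tried on a CPU. The following is a minimal sketch in plain C, not the deck's kernel: `cfloat` stands in for OpenCL's `float2`, `fft4` is an assumed 4-point butterfly (the body of `FFT4_in_order_output` is not shown on the slides), and `fft4_preloaded` and its `stride` parameter are hypothetical. On a CPU the locals `r0` to `r3` play the role of GPU registers.

```c
#include <stddef.h>

/* Hypothetical complex type standing in for OpenCL's float2. */
typedef struct { float x, y; } cfloat;

static cfloat cadd(cfloat a, cfloat b) { cfloat r = { a.x + b.x, a.y + b.y }; return r; }
static cfloat csub(cfloat a, cfloat b) { cfloat r = { a.x - b.x, a.y - b.y }; return r; }

/* Multiply by -i: (x + iy) * (-i) = y - ix. */
static cfloat cmul_neg_i(cfloat a) { cfloat r = { a.y, -a.x }; return r; }

/* 4-point forward DFT on four values, results written back in natural order. */
static void fft4(cfloat *a0, cfloat *a1, cfloat *a2, cfloat *a3)
{
    cfloat s02 = cadd(*a0, *a2), d02 = csub(*a0, *a2);
    cfloat s13 = cadd(*a1, *a3), d13 = cmul_neg_i(csub(*a1, *a3));

    *a0 = cadd(s02, s13);   /* X[0] */
    *a1 = cadd(d02, d13);   /* X[1] */
    *a2 = csub(s02, s13);   /* X[2] */
    *a3 = csub(d02, d13);   /* X[3] */
}

/* Register preloading: read each operand once, compute on locals, write once. */
void fft4_preloaded(cfloat *buffer, size_t stride)
{
    /* Explicit loads into locals (the "registers"). */
    cfloat r0 = buffer[0 * stride];
    cfloat r1 = buffer[1 * stride];
    cfloat r2 = buffer[2 * stride];
    cfloat r3 = buffer[3 * stride];

    fft4(&r0, &r1, &r2, &r3);

    /* Explicit stores back to the buffer. */
    buffer[0 * stride] = r0;
    buffer[1 * stride] = r1;
    buffer[2 * stride] = r2;
    buffer[3 * stride] = r3;
}
```

Calling `fft4_preloaded(buf, 1)` transforms four adjacent elements, and a stride of 4 transforms every fourth element, which covers both access patterns that appear in the slide code.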

56 System-level Optimizations
2. Vector Access (float{2, 4, 8, 16})
– Scalar Math (VASM): float + float
– Vector Math (VAVM): float4 + float4
Figure: elements a[0], a[1], a[2], a[3] fetched as one vector, with the addition shown per element (VASM) and per vector (VAVM)
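
The difference between VASM and VAVM can be sketched in plain C. This is an illustration under stated assumptions, not the deck's code: `float4_t` is a hypothetical stand-in for OpenCL's built-in `float4`, `load4` models one vector access, and `add4` models what would be a single `float4 + float4` expression in a kernel. On a CPU both variants compute the same thing; the point is only where the arithmetic is expressed per lane and where it is expressed per vector.

```c
/* Hypothetical stand-in for OpenCL's built-in float4 type. */
typedef struct { float s0, s1, s2, s3; } float4_t;

/* Vector access: a[0..3] are fetched together as one 4-wide value. */
static float4_t load4(const float *p)
{
    float4_t v = { p[0], p[1], p[2], p[3] };
    return v;
}

/* Models a single float4 + float4 operation. */
static float4_t add4(float4_t a, float4_t b)
{
    float4_t r = { a.s0 + b.s0, a.s1 + b.s1, a.s2 + b.s2, a.s3 + b.s3 };
    return r;
}

/* VASM: vector access, scalar math. Four separate float + float adds. */
float4_t add_vasm(const float *a, const float *b)
{
    float4_t va = load4(a), vb = load4(b), r;
    r.s0 = va.s0 + vb.s0;
    r.s1 = va.s1 + vb.s1;
    r.s2 = va.s2 + vb.s2;
    r.s3 = va.s3 + vb.s3;
    return r;
}

/* VAVM: vector access, vector math. One float4 + float4 add. */
float4_t add_vavm(const float *a, const float *b)
{
    return add4(load4(a), load4(b));
}
```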

65 Algorithm-level optimizations
Transpose: elements across the diagonal are exchanged
Figure: a 4x4 matrix and its transposed matrix
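
As a concrete instance of the definition, here is a minimal C sketch of an in-place 4x4 transpose. The names `DIM` and `transpose4x4` are hypothetical, and the sketch runs sequentially on a CPU rather than across GPU threads.

```c
#define DIM 4

/* Swap every element above the diagonal with its mirror image below it. */
void transpose4x4(float m[DIM][DIM])
{
    int i, j;

    for (i = 0; i < DIM; ++i) {
        for (j = i + 1; j < DIM; ++j) {
            float tmp = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = tmp;
        }
    }
}
```

Elements on the diagonal are left untouched, so only six swaps are needed for a 4x4 matrix.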

70 Algorithm-level optimizations
1. Naïve Transpose (LM-CM)
Figure labels: Original, Transposed, Register File, Local Memory, threads t0, t1, t2, t3
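
One way to read the figure is that local memory is used purely for communication: each thread parks its row there and picks up a column. The following C sketch emulates that reading on a CPU and is an assumption about the scheme, not the deck's kernel. `regs[t]` models the four private registers of thread `t`, `lm` models the local-memory scratch area, and the names `NUM_THREADS` and `transpose_via_local_memory` are hypothetical.

```c
#define NUM_THREADS 4

/* regs[t] models the private registers of thread t, holding one matrix row. */
void transpose_via_local_memory(float regs[NUM_THREADS][NUM_THREADS])
{
    float lm[NUM_THREADS][NUM_THREADS];   /* models the local-memory scratch area */
    int t, j;

    /* Phase 1: every thread writes its row from registers to local memory. */
    for (t = 0; t < NUM_THREADS; ++t) {
        for (j = 0; j < NUM_THREADS; ++j) {
            lm[t][j] = regs[t][j];
        }
    }

    /* On a GPU a work-group barrier would be needed here, so that all
       writes are visible before any thread starts reading. */

    /* Phase 2: thread t reads column t, which is row t of the transpose. */
    for (t = 0; t < NUM_THREADS; ++t) {
        for (j = 0; j < NUM_THREADS; ++j) {
            regs[t][j] = lm[j][t];
        }
    }
}
```

No arithmetic happens while the data sits in local memory, which matches the "Communication Only" expansion of LM-CM.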

78 Algorithm-level optimizations
3. The pseudo transpose (LM-CT)
– Idea: Load data to local memory. Perform computation on columns, then rows.
– Advantage: Skips the transpose step
– Disadvantage: Local memory has lower throughput than registers.
Figure labels: Original, Transposed, Local Memory
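
The columns-then-rows idea can be sketched for a 16-point FFT viewed as a 4x4 matrix. This is a sequential C illustration under stated assumptions, not the authors' kernel: the array `lm` models local memory, the twiddle multiplication between the two passes and the output index mapping come from the standard Cooley-Tukey decomposition rather than from the slide, and all names (`cfloat`, `W16`, `dft4`, `fft16_columns_then_rows`, `fft16_output_index`) are hypothetical.

```c
typedef struct { float re, im; } cfloat;

/* W16[m] = exp(-2*pi*i*m/16), tabulated so the sketch needs no math library. */
static const cfloat W16[16] = {
    {  1.0f,        0.0f       }, {  0.9238795f, -0.3826834f },
    {  0.7071068f, -0.7071068f }, {  0.3826834f, -0.9238795f },
    {  0.0f,       -1.0f       }, { -0.3826834f, -0.9238795f },
    { -0.7071068f, -0.7071068f }, { -0.9238795f, -0.3826834f },
    { -1.0f,        0.0f       }, { -0.9238795f,  0.3826834f },
    { -0.7071068f,  0.7071068f }, { -0.3826834f,  0.9238795f },
    {  0.0f,        1.0f       }, {  0.3826834f,  0.9238795f },
    {  0.7071068f,  0.7071068f }, {  0.9238795f,  0.3826834f }
};

static cfloat cmul(cfloat a, cfloat b)
{
    cfloat r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* In-place 4-point forward DFT over p[0], p[s], p[2s], p[3s]. */
static void dft4(cfloat *p, int s)
{
    cfloat a = p[0], b = p[s], c = p[2 * s], d = p[3 * s];
    cfloat s02 = { a.re + c.re, a.im + c.im };
    cfloat d02 = { a.re - c.re, a.im - c.im };
    cfloat s13 = { b.re + d.re, b.im + d.im };
    cfloat d13 = { b.im - d.im, -(b.re - d.re) };   /* -i * (b - d) */

    p[0].re     = s02.re + s13.re;  p[0].im     = s02.im + s13.im;
    p[s].re     = d02.re + d13.re;  p[s].im     = d02.im + d13.im;
    p[2 * s].re = s02.re - s13.re;  p[2 * s].im = s02.im - s13.im;
    p[3 * s].re = d02.re - d13.re;  p[3 * s].im = d02.im - d13.im;
}

/* 16-point FFT on a row-major 4x4 view of lm, with no transpose step. */
void fft16_columns_then_rows(cfloat lm[16])
{
    int r, c;

    /* Column pass: elements of one column are 4 apart in memory. */
    for (c = 0; c < 4; ++c) {
        dft4(&lm[c], 4);
    }

    /* Twiddle step: the element in row r, column c is scaled by W16^(r*c). */
    for (r = 0; r < 4; ++r) {
        for (c = 0; c < 4; ++c) {
            lm[4 * r + c] = cmul(lm[4 * r + c], W16[r * c]);
        }
    }

    /* Row pass: elements of one row are adjacent in memory. */
    for (r = 0; r < 4; ++r) {
        dft4(&lm[4 * r], 1);
    }
}

/* Output bin X[k] ends up in row k % 4, column k / 4. */
int fft16_output_index(int k)
{
    return 4 * (k % 4) + k / 4;
}
```

The result is left in transposed order, so a consumer either reads bin `k` through `fft16_output_index(k)` or writes the data back out with that mapping. The work that an explicit transpose would have done is folded into the addressing.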

80 Results (Experimental Testbed)
GPU Testbed
Device (AMD Radeon)   Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)
HD 7970               2048    3788                        264
HD 6970 (VLIW)        1536    2703                        176
HD 5870 (VLIW)        1600    2720                        154
Algorithm:
– 1D FFT (batched), N = 16 pts
– Cooley-Tukey Decomposition

81 synergy.cs.vt.edu Results (in isolation) Carlo del Mundo, cdel@vt.edu, carlodelmundo.com Accelerating Fast Fourier Transform for Wideband Channelization IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2. *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW)

82 synergy.cs.vt.edu Results (in isolation) Carlo del Mundo, cdel@vt.edu, carlodelmundo.com Accelerating Fast Fourier Transform for Wideband Channelization IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2. *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. Improvements to Baseline (Max. % Increase) 1.160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW)

83 synergy.cs.vt.edu Results (in isolation) Carlo del Mundo, cdel@vt.edu, carlodelmundo.com Accelerating Fast Fourier Transform for Wideband Channelization IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2. *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 100% Improvements to Baseline (Max. % Increase) 1.160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW)

84 synergy.cs.vt.edu Results (in isolation) Carlo del Mundo, cdel@vt.edu, carlodelmundo.com Accelerating Fast Fourier Transform for Wideband Channelization IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2. *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 100% 160% Improvements to Baseline (Max. % Increase) 1.160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW)

99 synergy.cs.vt.edu Results (in isolation)
[Chart: AMD Radeon HD 7970 (Scalar, non-VLIW), AMD Radeon HD 5870/6970 (VLIW); callouts: 61%, 39%, 50% and 53%, 18%, 34%]
Improvements to Baseline (Max. % Increase)
1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
Neutral/Detrimental to Baseline (Min. % Decrease)
1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L)
2. 0% - Dynamic instruction reduction (LU, CSE, IL)
3. 18% - Avoid large vectors & vector math (VASM16, VAVM8/16)

112 synergy.cs.vt.edu Results (in concert)
Improvements (Max. % Increase)
–{RP + LM-CM} best on-chip optimization
–Use Constant Memory (CM) for twiddle calculations
–Use global memory (instead of image memory)
–Optimal set for AMD GPUs
  RP – Register Preloading
  LM-CM – Transpose via local memory
  CM – Constant memory usage
  CGAP – Coalesced Global Access Pattern
  VASM2 – Vector Access, Scalar Math (float2)
2 All implementations are coalesced (CGAP) and use VASM2.
3 The speedups listed in the graph apply only to the Radeon HD 7970 (for brevity).
[Chart callouts: 2.9x, 2.4x, 2.1x, 6.5x, 2.4x, 6.5x, 6.3x, 1.8x, 1.5x, 5.6x]

113 synergy.cs.vt.edu Results (1D FFT 16-pts, GPU versions)
Optimized GPU is faster than baseline GPU by a factor of 14.5

115 synergy.cs.vt.edu Conclusions
Contributions:
–A portable building block for FFT towards GPU-based radios
–Architecture-aware insights for mapping and optimizing FFT across three generations of AMD GPUs
Optimal set for AMD GPUs
–RP – Register Preloading
–LM-CM – Transpose via local memory
–CM – Constant memory usage
–CGAP – Coalesced Global Access Pattern
–VASM2 – Vector Access, Scalar Math (float2)
Contact:
–Carlo del Mundo
–cdel@vt.edu
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

116 synergy.cs.vt.edu Appendix Slides

117 synergy.cs.vt.edu Introduction (FFT)
Fast Fourier Transform (FFT)
–A spectral method
Key computational idiom for present and future applications (dwarf) §
List of Dwarfs
1. Finite State Machine
2. Circuits
3. Graph Algorithms
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral Methods
8. Dynamic Prog.
9. Particle Methods
10. Backtrack/B&B
11. Graphical Models
12. Unstructured Grids
13. Map Reduce
§ Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

118 synergy.cs.vt.edu Background (Optimizing on GPUs)
1. RP (Register Preloading) - All data elements are first preloaded into the register file of the GPU. Computation is performed solely in registers.
2. CGAP (Coalesced Global Access Pattern) - Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}) - Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only) - Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose) - Data elements are loaded into local memory for computation. The communication step is avoided by algorithm reorganization.
6. LM-CC (Local Memory, Computation and Communication) - All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-K (Constant Memory - Kernel Argument) - The twiddle factors for the FFT's twiddle multiplication stage are precomputed on the CPU and stored in GPU constant memory for fast lookup.
8. CSE (Common Subexpression Elimination) - A traditional optimization that collapses identical expressions in order to save computation. It may lengthen register live ranges and thereby increase register pressure.
9. IL (Function Inlining) - A function's code body is inserted in place of a function call. It is used primarily for functions that are frequently called.
10. IM (Image Memory) - A texture image is used in place of global memory.
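As a rough illustration of LM-CM (item 4), the plain-C sketch below emulates the data swap sequentially on the host: each work-item writes its register contents to a shared scratch array and then reads back the transposed elements. The names transpose_via_local, NUM_ITEMS, and PER_ITEM, and the 4x4 layout, are assumptions made for this sketch and do not come from the slides.

```c
#include <assert.h>

#define NUM_ITEMS 4  /* assumed work-group size */
#define PER_ITEM 4   /* assumed number of elements each work-item keeps in registers */

/* regs[t][e] models element e held in the private registers of work-item t. */
static void transpose_via_local(float regs[NUM_ITEMS][PER_ITEM]) {
    float local_mem[NUM_ITEMS * PER_ITEM];  /* models the on-chip local memory */

    assert(NUM_ITEMS == PER_ITEM);  /* this sketch only handles a square layout */

    /* Phase 1: every work-item stores its registers to local memory. */
    for (int t = 0; t < NUM_ITEMS; ++t)
        for (int e = 0; e < PER_ITEM; ++e)
            local_mem[t * PER_ITEM + e] = regs[t][e];

    /* A real OpenCL kernel needs a work-group barrier between the phases,
       e.g. barrier(CLK_LOCAL_MEM_FENCE); the sequential loops imply it here. */

    /* Phase 2: every work-item reloads the transposed elements into registers. */
    for (int t = 0; t < NUM_ITEMS; ++t)
        for (int e = 0; e < PER_ITEM; ++e)
            regs[t][e] = local_mem[e * PER_ITEM + t];
}
```

Local memory is used only as a staging area in this variant; all arithmetic would still operate on the values in regs before and after the swap.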

119 synergy.cs.vt.edu Motivation (GPU FFT vs. CPU FFT)
GPU FFT outperforms CPU FFT by factors as high as 6.5*
–1D batched FFT, N = 16 pts
* Device-Host Data Transfer Not Included

120 synergy.cs.vt.edu Introduction (Channelizer Architecture)
Channelizer Architecture
–FIR Filtering, FFT, and Channel Mapping.

121 synergy.cs.vt.edu S3: Constant Memory
Fast cached lookup for frequently used data

    __constant float2 twiddles[16] = {
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        /* ... more sin/cos values */
    };

Without Constant Memory

    for (int j = 1; j < 4; ++j)
    {
        /* The twiddle factor is recomputed on every iteration. */
        float theta = -2.0f * M_PI_F * tid * j / 16.0f;
        float2 twid = (float2)(cos(theta), sin(theta));
        result[j] = buffer[j*4] * twid;  /* `*` stands for a complex multiply */
    }

With Constant Memory

    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];  /* table lookup replaces cos/sin */
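The table layout implied by the index 4*j+tid can be sketched on the host in plain C. This is a minimal sketch under assumptions, not the authors' code: the cfloat type and the functions cmul, w16_pow, build_twiddles, and apply_twiddles are hypothetical names. The powers of W = exp(-2*pi*i/16) are built by repeated complex multiplication from hard-coded cos(pi/8) and sin(pi/8) instead of calling cos/sin. Copying buffer[0] into result[0] for j = 0 is also an assumption; the slide's loop starts at j = 1.

```c
#include <assert.h>

#define FFT_N 16  /* transform size used on the slide */
#define RADIX 4   /* assumed: 4 work-items, each handling 4 points */

typedef struct { float re, im; } cfloat;  /* host-side stand-in for float2 */

/* Complex product (a.re + i*a.im) * (b.re + i*b.im). */
static cfloat cmul(cfloat a, cfloat b) {
    cfloat r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}

/* W^k for W = exp(-2*pi*i/16), built by repeated multiplication. */
static cfloat w16_pow(int k) {
    const cfloat w1 = { 0.92387953f, -0.38268343f };  /* cos(pi/8), -sin(pi/8) */
    cfloat w = { 1.0f, 0.0f };
    for (int i = 0; i < k; ++i)
        w = cmul(w, w1);
    return w;
}

/* Fill the table so that table[4*j + tid] == W^(tid*j), matching the slide's index. */
static void build_twiddles(cfloat table[FFT_N]) {
    for (int j = 0; j < RADIX; ++j)
        for (int tid = 0; tid < RADIX; ++tid)
            table[RADIX * j + tid] = w16_pow(tid * j);
}

/* Kernel-side step emulated on the host: one lookup and one complex multiply per j. */
static void apply_twiddles(const cfloat table[FFT_N], const cfloat buffer[FFT_N],
                           int tid, cfloat result[RADIX]) {
    assert(tid >= 0 && tid < RADIX);
    result[0] = buffer[0];  /* assumed: the j = 0 twiddle is 1, so no multiply */
    for (int j = 1; j < RADIX; ++j)
        result[j] = cmul(buffer[j * RADIX], table[RADIX * j + tid]);
}
```

The first four entries of the table (j = 0) all equal (1, 0), which agrees with the four (1.0f, 0.0f) initializers shown on the slide.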

