Published by Carla Reede. Modified over 2 years ago.

1
Acceleration of the Cooley-Tukey algorithm using a Maxeler machine
Author: Nemanja Trifunović
Mentor: Professor dr. Veljko Milutinović

2
Introduction
- Cooley-Tukey algorithm: a Fast Fourier Transform based on divide and conquer.
- Uses: digital signal processing, telecommunications, the analysis of sound signals, …
- Maxeler platform: data flow (vs. control flow), FPGA.
Example of a Fourier transform. (Source: https://en.wikipedia.org/wiki/File:Rectangular_function.svg; https://en.wikipedia.org/wiki/File:Sinc_function_(normalized).svg. The illustration is published under a Creative Commons license.)

3
Problem statement
Design and implementation of:
- The fastest possible system for calculating the Fast Fourier Transform using a Maxeler machine.
- A system that outperforms existing solutions to this problem.

4
Problem statement
Benefits of calculating the Fast Fourier Transform with Maxeler machines:
- Higher speed of calculation.
- Lower power consumption.
- Lower space consumption.
Conditions:
- Huge amounts of data.

5
Conditions and assumptions
- Maxeler machine used: two Maxeler cards of type MAX3424A.
- In experiments with multiprocessor systems, only one processor core was used.

6
Overview of existing solutions
- FFT algorithms: prime-factor, Bruun's, Rader's, Winograd, Bluestein's, …
- Time complexity: O(N log N).
- Performance comparison of publicly available implementations: Matteo Frigo and Steven G. Johnson (MIT).

7
Illustration of Matteo Frigo's and Steven G. Johnson's experiments. (Source: http://www.fftw.org/speed/Pentium4-3.60GHz-icc)

8
The proposed solution
- Parallelized radix-2 algorithm.
- Pipeline of depth O(log N), where N is the length of the input sequence.
- Latency is proportional to the depth of the pipeline.
- After the initial delay (latency), one result is produced every cycle.
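The effect of pipelining on throughput can be illustrated with a small cycle-count model. This is only a sketch under idealized assumptions that are not stated in the slides: each of the log2(N) stages takes `cycles_per_stage` cycles (a hypothetical parameter), and data transfer is ignored. The function names are illustrative.

```python
def pipeline_depth(n: int) -> int:
    """Number of butterfly stages of a radix-2 FFT of length n (a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1  # equals log2(n) for powers of two


def total_cycles(n: int, transforms: int, cycles_per_stage: int = 1) -> int:
    """Cycles needed for `transforms` back-to-back FFTs in an idealized pipeline.

    Assumption: the first result appears after the latency, and every
    following result appears exactly one cycle after the previous one.
    """
    if transforms < 1:
        raise ValueError("transforms must be at least 1")
    latency = pipeline_depth(n) * cycles_per_stage
    return latency + (transforms - 1)


def cycles_per_transform(n: int, transforms: int, cycles_per_stage: int = 1) -> float:
    """Average cost of one transform; approaches 1 as `transforms` grows."""
    return total_cycles(n, transforms, cycles_per_stage) / transforms
```

In this model the latency is paid once, so the average cost per transform falls toward one cycle only when many transforms are streamed through the pipeline in a row.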

9
Formal analysis
The radix-2 Cooley-Tukey algorithm operates as follows:
1. The input sequence is divided into two subsequences of equal length: the even-indexed elements form the first, and the odd-indexed elements form the second.
2. The DFT of the whole sequence is then calculated from the DFTs of the two subsequences.
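The two steps above can be sketched as a short recursive function. This is a plain software illustration of the textbook radix-2 scheme, not the Maxeler kernel; the name `fft_recursive` is chosen here for illustration, and the input length is assumed to be a power of two.

```python
import cmath


def fft_recursive(x):
    """Radix-2 decimation-in-time FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Step 1: split into even-indexed and odd-indexed subsequences.
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])

    # Step 2: combine the two half-length DFTs into the full DFT.
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor for index k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

For example, `fft_recursive([1, 1, 1, 1])` yields a spectrum whose only nonzero component is the first one, up to floating-point rounding.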

10
Formal analysis
A detailed derivation of the following formula is given in the paper. The DFT of the even subsequence is denoted by $E_k$, the DFT of the odd subsequence by $O_k$, and $e^{-2\pi i k/N}$ by $W_N^k$.
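In this notation, the standard radix-2 decimation-in-time relations read as follows. This is the textbook form, given here for orientation; the paper's own derivation may arrange the terms differently.

```latex
X_k = E_k + W_N^k \, O_k, \qquad
X_{k+N/2} = E_k - W_N^k \, O_k, \qquad
W_N^k = e^{-2\pi i k/N}, \quad k = 0, 1, \dots, \tfrac{N}{2} - 1 .
```

The minus sign in the second relation follows from $W_N^{k+N/2} = -W_N^k$, which is why each pair of outputs shares one multiplication.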

11
Illustration of pipelined execution of the radix-2 algorithm.

12
Measurement and analysis of the performance of the proposed implementation
Types of experiments performed:
- Calculation of the Fourier transform of 100, 1,000, 10,000, 1,000,000 and 10,000,000 consecutive input sequences of length 8, 16, 32 and 64 points.
- Maxeler implementation vs. reference CPU implementation.
- Maxeler implementation vs. best publicly available implementations.

13
Generated graphs:
- Maxeler vs. the best publicly available implementations of the FFT algorithm.
- Run times, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).
- Acceleration obtained using the Maxeler machine, compared to CPU execution, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).

14
Average execution time, in seconds, of publicly available algorithms for calculating the FFT on different architectures, for an input sequence of 8 elements.

15
Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of elements in the input sequence.

16
Computation time, in seconds, of consecutive fast Fourier transforms, depending on the number of consecutive calculations.

17
Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of consecutive calculations.

18
Analysis of scalability and bottlenecks of the proposed solution
- Transfer of data to and from the Maxeler card.
- Limited hardware resources on a single Maxeler card.
- Limited number of Maxeler cards.

19
Analysis of implementation
The Maxeler implementation of the Cooley-Tukey algorithm consists of:
1. Rearrangement of the input sequence into bit-reversed order, and
2. The radix-2 algorithm.
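The first of these steps, the bit-reversal rearrangement, can be sketched in a few lines. This is a generic software illustration, not the code used on the Maxeler card; the function names are chosen here for illustration.

```python
def bit_reverse_index(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # move the lowest bit of i onto the end of r
        i >>= 1
    return r


def bit_reverse_permute(x):
    """Return a copy of x reordered so that position i holds x[bit-reversed i]."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1  # log2(n)
    return [x[bit_reverse_index(i, bits)] for i in range(n)]


# For N = 8 the indices are visited in the order 0, 4, 2, 6, 1, 5, 3, 7.
order = bit_reverse_permute(list(range(8)))
```

After this reordering, an in-place radix-2 pass can combine neighbouring elements first and progressively wider groups afterwards, which suits a fixed pipeline of stages.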

20
Illustration of the kernel.

21
Implementation details
- Two input and two output streams.
- These streams are of type arrayType:
    DFEType floatType = dfeFloat(8, 24);
    DFEArrayType arrayType = new DFEArrayType (floatType, n);
- The coefficients $W_N^k$ (twiddle factors) are not calculated on the Maxeler machine.
- Parameters: N, first_level, last_level.
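A host-side sketch can illustrate how precomputed coefficients and a range of levels might fit together. The slides name the parameters `first_level` and `last_level` without defining them; the code below assumes, purely for illustration, that they select an inclusive, 0-based range of butterfly stages applied to data already in bit-reversed order. The functions `twiddle_table` and `run_levels` are hypothetical names, not part of the Maxeler API.

```python
import cmath


def twiddle_table(n):
    """Precompute W_N^k = exp(-2*pi*i*k/N) for k = 0 .. N/2 - 1 on the host."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def run_levels(data, twiddles, first_level, last_level):
    """Apply butterfly levels first_level..last_level (assumed inclusive, 0-based).

    `data` is assumed to be in bit-reversed order and to have a power-of-two
    length; `twiddles` is the table returned by twiddle_table(len(data)).
    """
    n = len(data)
    out = [complex(v) for v in data]
    for level in range(first_level, last_level + 1):
        half = 1 << level         # distance between the two butterfly inputs
        stride = n // (2 * half)  # step through the shared twiddle table
        for start in range(0, n, 2 * half):
            for j in range(half):
                a = out[start + j]
                b = twiddles[j * stride] * out[start + j + half]
                out[start + j] = a + b
                out[start + j + half] = a - b
    return out


# Usage: a 4-point transform of [0, 1, 0, 0], supplied in bit-reversed order.
table = twiddle_table(4)
spectrum = run_levels([0, 0, 1, 0], table, 0, 1)
```

Under this reading, running levels 0 through log2(N) - 1 gives the full transform, and splitting the range would let different levels be handled in separate passes.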

22
Conclusion
- It is shown that the proposed solution has the expected performance and works correctly.
- The performance of the proposed solution is better than that of any publicly available implementation of the Fast Fourier Transform.
- Achieving these speedups requires consecutive calculations of the Fast Fourier Transform.

23
Q/A
Thank you for your attention.
