Published by Peyton Keys. Modified over 2 years ago.

1
Rader’s FFT algorithm acceleration using Maxeler
Author: Tadej Matek

2
Fourier Transform
●The Fourier transform decomposes a signal into its frequency components
●Used in telecommunications, data compression, digital signal processing, fast multiplication of polynomials...
Source: http://fweb.wallawalla.edu/class-wiki/index.php/DFT_example_using_MATLAB_-_HW11

3
Fourier Transform and computers
●Transformation: Discrete Fourier Transform (DFT). Time: O(n^2)
●Algorithm(s): Fast Fourier Transform (FFT) (Cooley-Tukey, Bruun’s FFT, Rader’s FFT, Bluestein’s FFT, …). Time: O(n log n)
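To make the O(n^2) cost concrete, here is a minimal Python sketch of a DFT evaluated directly from its definition. The function name `naive_dft` and the sign convention e^(-2πi·jk/n) are assumptions for illustration; the slides do not give an implementation.

```python
import cmath

def naive_dft(x):
    """Evaluate the DFT definition directly: n outputs, n terms each, so O(n^2)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

# A constant signal has all of its energy in the zero-frequency component.
spectrum = naive_dft([1, 1, 1, 1])
print([round(abs(v), 9) for v in spectrum])  # -> [4.0, 0.0, 0.0, 0.0]
```

The two nested loops (one hidden in `sum`) are exactly what the FFT algorithms listed above avoid.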

4
Why is FFT faster than DFT
●Divide & conquer + properties of primitive roots
●Primitive root of unity: [equation image not included]
●Conquer step (butterfly): [equation image not included]
Source: http://mathworld.wolfram.com/images/gifs/rootsu.gif
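The divide & conquer idea and the butterfly step can be sketched in Python as a recursive radix-2 FFT. This is an illustration only: it assumes the usual complex primitive root of unity w_n = e^(-2πi/n) and an input length that is a power of two, neither of which is taken from the slides.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT sketch; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # Divide: transform the even- and odd-indexed halves separately.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0] * n
    # Conquer: one butterfly combines a pair of half-size results.
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t  # uses w^(k + n/2) = -w^k
    return out
```

Each level does O(n) work and there are log2(n) levels, which gives the O(n log n) running time.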

5
Rader’s FFT algorithm overview
●Primitive root defined as: [equation image not included]
●Bit reversal rev_k(i): rev_4(3): 3 (base 10) = 0011 (base 2) → 1100 (base 2) = 12 (base 10)
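A small Python sketch of the bit reversal rev_k(i), reproducing the rev_4(3) example. The function name `rev` and its argument order are chosen here for illustration.

```python
def rev(k, i):
    """Reverse the k low-order bits of i."""
    result = 0
    for _ in range(k):
        result = (result << 1) | (i & 1)  # move the lowest bit of i into result
        i >>= 1
    return result

print(rev(4, 3))  # 0011 -> 1100, prints 12
```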

6
Example of calculation
n = 4, k = log2(n) = 2, z = 5, p = 13
i = 0: 8, 2, 2, 4
  s = rev_k(i) = 0: (8 + z^0 * 2) % 13 = 10, (2 + z^0 * 4) % 13 = 6
  s = rev_k(i) = 2: (8 + z^2 * 2) % 13 = 6, (2 + z^2 * 4) % 13 = 11
i = 1: 10, 6, 6, 11
Result: 3, 4, 9, 3 (for i = 0, 1, 2, 3 with s = 0, 2, 1, 3)
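The numbers in this example can be reproduced with a short Python sketch. The indexing rule used below (pair elements that differ in one bit, starting from the most significant bit, and take the exponent s as the bit reversal of the leading bits of the index) is one reading that matches the values on the slide; it is an assumption, not a transcription of the author's kernel. The names `rev` and `transform_levels` are hypothetical.

```python
def rev(k, i):
    """Reverse the k low-order bits of i."""
    result = 0
    for _ in range(k):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result

def transform_levels(a, z, p):
    """Return the data after every level, computed modulo p."""
    n = len(a)
    k = n.bit_length() - 1          # n is assumed to be a power of two
    current = list(a)
    levels = [current]
    for level in range(k):
        bit = k - 1 - level         # bit that distinguishes the paired elements
        nxt = [0] * n
        for i in range(n):
            lo = i & ~(1 << bit)    # partner index with the bit cleared
            hi = i | (1 << bit)     # partner index with the bit set
            s = rev(k, i >> bit)    # exponent of z for this output
            nxt[i] = (current[lo] + pow(z, s, p) * current[hi]) % p
        current = nxt
        levels.append(current)
    return levels

print(transform_levels([8, 2, 2, 4], 5, 13))
# -> [[8, 2, 2, 4], [10, 6, 6, 11], [3, 4, 9, 3]]
```

Under this reading, the last level is in bit-reversed order: reading element rev_k(i) for i = 0, 1, 2, 3 gives 3, 9, 4, 3, which equals the direct sums of a_j * z^(i*j) mod 13.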

7
Example: fast multiplication
●How to multiply two large polynomials?
●Basic approach: multiply each term of the 1st with each term of the 2nd -> O(n^2)
●Using FFT: compute the DFT of both polynomials, multiply pointwise in O(n) time, and apply the inverse FFT -> O(n log n)
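The three steps (transform, pointwise multiply, inverse transform) can be sketched in Python with a modular transform. The modulus 998244353 and primitive root 3 are a common choice assumed here for illustration, not values from the slides, and the result is exact only while every product coefficient stays below the modulus. The names `ntt` and `multiply` are hypothetical.

```python
P = 998244353  # prime of the form 119 * 2**23 + 1 (assumption, not from the slides)
G = 3          # a primitive root modulo P

def ntt(a, root):
    """Recursive transform of a (length a power of two) using the given root of unity mod P."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], root * root % P)
    odd = ntt(a[1::2], root * root % P)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * root % P
    return out

def multiply(f, g):
    """Multiply two coefficient lists (lowest degree first)."""
    size = len(f) + len(g) - 1
    n = 1
    while n < size:
        n *= 2
    root = pow(G, (P - 1) // n, P)             # primitive n-th root of unity mod P
    fa = ntt(f + [0] * (n - len(f)), root)     # transform both inputs
    ga = ntt(g + [0] * (n - len(g)), root)
    prod = [x * y % P for x, y in zip(fa, ga)]  # pointwise product, O(n)
    inv = ntt(prod, pow(root, P - 2, P))        # inverse transform uses root^-1
    n_inv = pow(n, P - 2, P)
    return [c * n_inv % P for c in inv[:size]]

# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(multiply([1, 2, 3], [4, 5]))  # -> [4, 13, 22, 15]
```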

8
Dataflow implementation (1)
Levels: 8, 2, 2, 4 → 10, 6, 6, 11 → 3, 4, 9, 3
Data dependency! The kernel needs updated data for each level!
Solution: LMem

9
Dataflow implementation (2)
Components: CPU, LMem, Manager, Kernel
(1) Input sequence
(2) Call kernel k times... The Manager streams data in and out of the Kernel
(3) Output sequence

10
Dataflow implementation (3)
●LMem works in bursts (example: 384 B, but depends on DFE)
●Good for consecutive calculations
●z^s are calculated on CPU and written to LMem
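As a rough illustration of why bursts favour consecutive calculations, the Python sketch below rounds a data size up to a whole number of bursts. The 384 B burst size is the example value from the slide; the 4-byte element size and the function name `padded_size` are assumptions, and the rounding rule is a simplification rather than a description of the Maxeler tooling.

```python
BURST_BYTES = 384  # example burst size from the slide; depends on the DFE

def padded_size(num_elements, element_bytes=4):
    """Bytes occupied when the data is rounded up to whole bursts."""
    raw = num_elements * element_bytes
    bursts = -(-raw // BURST_BYTES)  # ceiling division
    return bursts * BURST_BYTES

# One sequence of 32 elements uses a third of a burst,
# while three such sequences stored back to back fill it exactly.
print(padded_size(32), padded_size(3 * 32))  # -> 384 384
```

Under these assumptions a single short sequence wastes most of a burst, so batching many sequences keeps the memory transfers full.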

11
Performance & results (1)
●CPU used for testing: Intel Core2 Quad Processor Q9400, 2.86 GHz
●A Maxeler card of type MAX2336B was used for DFE testing

12
Performance & results (2)
●Conditions: BIG data, 95% run time in loops
●Type of experiments: consecutive calculations, from 10K up to 10M
●Consecutive calculations for input sequences of length 32, 64, 128 and 256

13
Performance & results (3)
Figure: Execution time, N = 32, for CPU and DFE

14
Performance & results (4)
Figure: Speedup according to the number of consecutive calculations for N = 32

15
Performance & results (5)
Figure: Speedup according to the number of consecutive calculations for N = 64

16
Performance & results (6)
Figure: Speedup according to the number of consecutive calculations for N = 256

17
Performance & results (7)
Figure: Speedup according to the size of input sequence (for 100K calculations)

18
Conclusion
●FFTs are among the most widely used algorithms today
●A massive speedup is possible, but it requires consecutive calculations
●Power usage: reduced due to lower clock frequency (200 MHz vs 2.86 GHz)
