Published by Samson Parker. Modified about 1 year ago.

1
FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
4th October, 2007

2
FPGA: Overview
□ Work done
□ Structure of a sample program
□ Ongoing work
□ Next step

3
FPGA: Work done
□ Register handling and console I/O
□ Modified simple.c
□ Implemented an adder
□ Used the VirtualBase member of ADMXRC2_SPACE_INFO
□ Registers can be indexed using bits (23 downto 2) of the LAD (local address/data) signal when it is used to address the FPGA
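The register-indexing rule above can be sketched in software terms. This is a minimal Python illustration, not the project's VHDL or C code. The function name and constants are hypothetical. The only fact taken from the slide is that bits 23 downto 2 of LAD select the register. Treating bits 1..0 as a byte offset within a 32-bit word is an assumption.

```python
# Hypothetical constants: the slide only states that bits (23 downto 2) are used.
REG_INDEX_LSB = 2
REG_INDEX_MSB = 23
REG_INDEX_MASK = (1 << (REG_INDEX_MSB - REG_INDEX_LSB + 1)) - 1  # 22-bit mask


def lad_to_register_index(lad):
    """Return the register index encoded in bits 23..2 of a local-bus address."""
    # Drop the two low bits (assumed byte offset), then keep only 22 bits.
    return (lad >> REG_INDEX_LSB) & REG_INDEX_MASK
```

Under these assumptions, consecutive 32-bit registers sit 4 bytes apart, so address `0x4` maps to register 1 and address `0x10` maps to register 4.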

4
Structure of simple.vhd
entity simple is
  port( All the local bus signals required);
end simple
architecture …

5
Ongoing work: ZBT
□ Structure of zbt_main seems to be similar to simple.c
□ zbt.vhd is a wrapper for zbt_main.vhd
□ The same port names are defined in the same way and port-mapped to each other
□ We do not yet understand the reason for this wrapper
□ C code is not available in the ADMXRC2 demos
□ Lalit's code also uses ZBT and block RAMs, so we are looking at his C and VHDL code

6
Next Step
□ Work with ZBT and block RAMs
□ FFT implementation on the FPGA

7
Multiprocessor FFT Overview
□ Some improvements to the existing code
□ Improve the theoretical model
□ Compare theoretical run time with actual run time
□ Statistics of each processor
□ Further refinement: using the BSP model
□ Pointers for cache analysis

8
Optimizations to the code
□ Removed other arrays (reducing memory references considerably)
  □ Twiddle factors
  □ Bit reversal addresses
□ Bit reversal made faster using bit operations: O(1) for each address calculation
□ All multiplications/divisions involving 2 implemented using shift operations: O(1)
□ Power (2^n) computed in constant time using bit operations: O(1)
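The bit-level optimizations listed above can be illustrated as follows. This is a minimal Python sketch, not the authors' code. It assumes indices fit in 32 bits and uses the well-known mask-and-swap technique, which takes a fixed number of operations per index. All function names are hypothetical.

```python
def reverse_bits32(x):
    """Reverse the 32 bits of x with a fixed number of mask-and-shift steps."""
    x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1)   # swap adjacent bits
    x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2)   # swap bit pairs
    x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4)   # swap nibbles
    x = ((x >> 8) & 0x00FF00FF) | ((x & 0x00FF00FF) << 8)   # swap bytes
    x = ((x >> 16) | (x << 16)) & 0xFFFFFFFF                # swap half-words
    return x


def bit_reversed_index(i, log2n):
    """Bit-reversal address of index i for an FFT of size 2**log2n (1 <= log2n <= 32)."""
    return reverse_bits32(i) >> (32 - log2n)


def pow2(n):
    """2**n computed with a single shift."""
    return 1 << n


def halve(x):
    """Integer division by 2 as a shift."""
    return x >> 1


def double(x):
    """Multiplication by 2 as a shift."""
    return x << 1
```

For an 8-point FFT (`log2n = 3`), index 1 (binary 001) maps to 4 (binary 100) and index 3 (011) maps to 6 (110). Because the address is computed on demand, no separate table of bit-reversed addresses is needed.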

9
Previously…

10
Now…

11
Improvement
□ For larger input sizes, our program (radix-2) is comparable to FFTW
□ Our program might surpass FFTW
  □ Using SIMD
  □ Higher radix (e.g. 4, 8, 16)
  □ Coding in C

12
Redefining the execution time
□ For p processors, the total execution time is: $T_N/p + (1 - 1/p)(2N/B + K_N)$
□ p is a power of 2
□ This assumes the "RAM model"
  □ Assumes a flat memory address space with unit-cost access to any memory location
□ We did not take into account the memory hierarchy
  □ E.g. matrix multiplication actually takes $O(n^5)$ instead of the expected $O(n^3)$ [Alpern et al. 1994]

13
Redefining the execution time
□ Some observations
  □ If the number of processors is p, then the FFT actually computed is FFT(N/p); the time taken is $T_{N/p}$ and NOT $T_N/p$
  □ Time taken to combine (O(n) in the RAM model) should be taken as: $\sum_{i=1}^{\log p} K_{N/2^i}$
□ Synchronization time is NOT included
□ Currently looking at execution time only from the perspective of the master processor
□ The overheads for establishing sends and receives have been neglected (on measuring this with a ping-pong approach, the time was negligible)

14
New Theoretical Formula
□ Time taken for parallel execution with p processors is $T_{N/p} + (1 - 1/p)(2N/B) + \sum_{i=1}^{\log p} K_{N/2^i}$
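The formula above can be evaluated numerically once cost functions for T and K are available. This is a minimal Python sketch, not the authors' code. `fft_time` and `combine_time` are hypothetical callables standing in for $T_m$ and $K_m$, which the slides leave without a closed form. `bandwidth` stands for B.

```python
def parallel_fft_time(n, p, bandwidth, fft_time, combine_time):
    """Evaluate T_{N/p} + (1 - 1/p)(2N/B) + sum_{i=1}^{log p} K_{N/2^i}.

    fft_time(m) and combine_time(m) are caller-supplied models (assumptions),
    returning the cost of a size-m FFT and of a size-m combine step.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    log_p = p.bit_length() - 1                      # log2(p) for a power of 2
    local = fft_time(n // p)                        # T_{N/p}
    comm = (1 - 1 / p) * (2 * n / bandwidth)        # (1 - 1/p)(2N/B)
    combine = sum(combine_time(n >> i) for i in range(1, log_p + 1))
    return local + comm + combine
```

As a usage example with purely illustrative linear cost models, `parallel_fft_time(16, 4, 2, lambda m: m, lambda m: m)` adds a local term of 4, a communication term of 12 and combine terms of 8 + 4. With `p = 1` the communication and combine terms vanish and only the FFT cost of size N remains.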

15
Execution Time: 16777216

16
Input: 16777216 (p=2)
[Figure: timeline for processors P1 and P2. Events: Send(2), Recv(1), FFT(N/2), Recv(2), Send(1), Combine. Time stamps as extracted: T=0, T=20.865, T=26.579, T=26.591, T=29.799, T=29.848, T=35.541, T=35.808, T=35.555]

17
Load Distribution: Processor 1

18
Load Distribution: Processor 2

19
Input: 16777216 (p=4)
[Figure: timeline for processors P1, P2, P3 and P4. Events: Send(2), Recv(1), Send(3), Send(4), Recv(2), Recv(1), FFT(N/4), Send(1), Send(2), Recv(3), Recv(4), Combine, Recv(1), Send(1), Combine. Time stamps as extracted: T=0, T=20.773, T=26.464, T=29.315, T=29.547, T=26.479, T=26.617, T=29.332, T=29.532, T=30.816, T=30.835, T=31.032, T=31.045, T=33.96, T=33.672, T=33.686, T=33.977, T=34.166, T=33.812, T=39.85, T=39.869, T=40.120]

20
Load Distribution: Processor 1

21
Load Distribution: Processor 2

22
Load Distribution: Processor 3

23
Load Distribution: Processor 4

24
Execution Time: 33554432

25
Input: 33554432 (p=2)
[Figure: timeline for processors P1 and P2. Events: Send(2), Recv(1), FFT(N/2), Recv(2), Send(1), Combine. Time stamps as extracted: T=0, T=103.558, T=114.954, T=114.965, T=121.558, T=121.921, T=133.322, T=133.851, T=133.335]

26
Load Distribution: Processor 1

27
Load Distribution: Processor 2

28
Input: 33554432 (p=4)
[Figure: timeline for processors P1, P2, P3 and P4. Events: Send(2), Recv(1), Send(3), Send(4), Recv(2), Recv(1), FFT(N/4), Send(1), Send(2), Recv(3), Recv(4), Combine, Recv(1), Send(1), Combine. Time stamps as extracted: T=0, T=70.881, T=91.281, T=96.982, T=97.909, T=91.294, T=91.579, T=97.001, T=97.896, T=100.128, T=100.164, T=101.052, T=101.043, T=106.939, T=105.854, T=105.864, T=106.951, T=107.351, T=106.116, T=118.748, T=118.757, T=119.261]

29
Load Distribution: Processor 1

30
Load Distribution: Processor 2

31
Load Distribution: Processor 3

32
Load Distribution: Processor 4

33
Execution Time: 67108864

34
Input: 67108864 (p=2)
[Figure: timeline for processors P1 and P2. Events: Send(2), Recv(1), FFT(N/2), Recv(2), Send(1), Combine. Time stamps as extracted: T=0, T=176.271, T=199.081, T=199.092, T=212.858, T=221.761, T=252.553, T=324.062, T=252.656]

35
Load Distribution: Processor 1

36
Load Distribution: Processor 2

37
Input: 67108864 (p=4)
[Figure: timeline for processors P1, P2, P3 and P4. Events: Send(2), Recv(1), Send(3), Send(4), Recv(2), Recv(1), FFT(N/4), Send(1), Send(2), Recv(3), Recv(4), Combine, Recv(1), Send(1), Combine. Time stamps as extracted: T=0, T=193.211, T=220.211, T=233.173, T=232.65, T=220.196, T=220.772, T=233.192, T=232.645, T=262.629, T=239.773, T=239.257, T=239.238, T=250.893, T=274.299, T=274.300, T=250.903, T=252.737, T=280.422, T=305.326, T=305.333, T=544.529]

38
Load Distribution: Processor 1

39
Load Distribution: Processor 2

40
Load Distribution: Processor 3

41
Load Distribution: Processor 4

42
Inference
□ The idle time is very small (for processor 1)
□ The theoretical model matches the actual results
□ But we need to find a closed-form solution for $T_N$ and $K_N$

43
Calculating $T_N$ and $K_N$
□ Depends upon
  □ N: size of the input
  □ A: cache associativity
  □ L: cost incurred for a miss
  □ M: size of the cache
  □ B: number of bytes it can transfer at a time

44
Contd…
□ Cache profilers give us the number of references that have been made to each level of the cache, along with the number of misses
□ We have this table (computed over the summer)
□ We can multiply the number of references and misses at each level by the number of cycles each one costs to get an actual cycle count
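The estimate described above amounts to a weighted sum over cache levels. This is a minimal Python sketch under stated assumptions. The tuple layout, the function name and all numbers in the usage note are hypothetical, not measurements from the project.

```python
def estimated_cycles(levels):
    """Estimate total cycles from per-level cache profiler counts.

    levels is a list of tuples (references, cycles_per_reference,
    misses, cycles_per_miss), one tuple per cache level (assumed layout).
    """
    total = 0
    for references, cycles_per_reference, misses, cycles_per_miss in levels:
        # Every reference pays the access cost; every miss pays an extra penalty.
        total += references * cycles_per_reference + misses * cycles_per_miss
    return total
```

With made-up counts such as `[(1000, 3, 50, 10), (50, 10, 5, 100)]`, the first level contributes 3000 + 500 cycles and the second 500 + 500, for 4500 cycles in total. Dividing by the clock frequency would turn this into a time.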

45
Theoretical Verification
□ S. Sen et al., "Towards a Theory of Cache-Efficient Algorithms"
□ It gives a formal method to analyze algorithms in the cache model (taking into account a multi-level memory hierarchy)
□ Still reading it

46
Modeling using BSP
□ The BSP (Bulk Synchronous Parallel) model considers
  □ The whole job as a series of supersteps
  □ At each superstep, all processors do local computations and send messages to other processors. These messages are not available until the next synchronization has finished

47
Modeling using BSP
□ The BSP model uses the following parameters
  □ p: the number of processors (a power of 2 for us)
  □ $w_t$: the maximum local work performed by any processor
  □ L: the time the machine needs for barrier synchronization (determined experimentally)
  □ g: the network bandwidth inefficiency (reciprocal of B, determined experimentally)
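These parameters combine into a per-superstep cost. This is a minimal Python sketch using the standard BSP cost form, in which superstep t costs $w_t + g \cdot h_t + L$. The quantity `h` (the largest number of data units any processor sends or receives in the superstep) is part of the standard model but is not named on the slide, so it is an assumption here. The function names are hypothetical.

```python
def superstep_cost(local_work, h, g, L):
    """Cost of one superstep: max local work + g * h + barrier time L.

    local_work: list of work amounts, one per processor (w_t is their max).
    h: largest amount of data sent or received by any processor (assumption).
    """
    w_t = max(local_work) if local_work else 0
    return w_t + g * h + L


def bsp_cost(supersteps, g, L):
    """Total cost of a program given as a list of (local_work, h) pairs."""
    return sum(superstep_cost(work, h, g, L) for work, h in supersteps)
```

For example, two supersteps `([3, 5], 10)` and `([2, 2], 0)` with `g = 0.5` and `L = 1` cost (5 + 5 + 1) + (2 + 0 + 1) = 14.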

48
Modeling using BSP
[Figure: BSP timeline for processors P1, P2, P3 and P4, with barriers separating step 0 to step 6. Events: Send(2), Recv(1), Send(3), Send(4), Recv(1), FFT(N/4), Send(1), Send(2), Recv(1), Recv(3), Combine, Recv(1), Send(1), Combine]

49
Execution time
□ Step 0: L
□ Step 1: L + max(time(Send(2)), time(Recv(1)))
□ Step 3: L + max(time(Send(3)), time(Send(4)), time(Recv(1)), time(Recv(2)))
□ Step 4: L + max(time(FFT_i(N/p))) (0 <= i <= p-1)
□ Step 5: L + max(time(Send(2)), time(Send(1)), time(Recv(3)), time(Recv(4)))
□ Step 6: L + max(time(combine_i(N/4))) (i = {1,2})
□ Step 7: L + max(time(Send(1)), time(Recv(2)))
□ Step 8: L + time(combine(N/2))

50
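Generalizing this for p processors
$$\text{event}(t) = \begin{cases} \text{communications} & 0 \le t < \log p \\ \text{compute FFT}(N/p) & t = \log p \\ \text{communications} & \log p < t \le 3\log p,\ (t - \log p) \text{ odd} \\ \text{combine FFTs} & \log p < t \le 3\log p,\ (t - \log p) \text{ even} \end{cases}$$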
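The step classification above translates directly into code. This is a minimal Python sketch. The function name and the returned strings are illustrative, and p is assumed to be a power of 2 as elsewhere in the slides.

```python
def event(t, p):
    """Classify superstep t (0 <= t <= 3*log2(p)) for p processors."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    log_p = p.bit_length() - 1
    if not 0 <= t <= 3 * log_p:
        raise ValueError("t out of range")
    if t < log_p:
        return "communications"          # distributing the input
    if t == log_p:
        return "compute FFT(N/p)"        # local FFT on every processor
    # After the local FFT, communication and combine steps alternate.
    return "communications" if (t - log_p) % 2 == 1 else "combine FFTs"
```

For p = 4 this gives communications at t = 0 and t = 1, the local FFT at t = 2, then alternating communication and combine steps up to t = 6, which matches the seven steps (0 to 6) of the four-processor BSP timeline.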

51
for t < log p
Total # of operations at step t = $2^t$ Sends and $2^t$ Recvs
Let time(send(N, i)) denote the time taken to send N data points to processor i
Let time(recv(N, j)) denote the time taken to receive N data points from processor j
Total time taken for this group = $\sum_{0 \le t < \log p} \max\{\text{time}(\text{send}(N/2^{t+1}, j-1)),\ \text{time}(\text{recv}(N/2^{t+1}, i-1))\} + L \log p$

52
t = log p
□ Let time(FFT_i(N/p)) denote the time taken to compute an FFT of size N/p on processor i
□ Thus, the time taken to calculate the FFT of size N/p is $\max_{0 \le i \le p-1}\{\text{time}(\text{FFT}_i(N/p))\} + L$

53
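for t > log p (t - log p is odd)
Time taken is only for communications
Total time taken is $\sum \max\{\text{time}(\text{send}(N/h, j-1)),\ \text{time}(\text{recv}(N/h, i-1))\} + L \log p$, summed over all t with $\log p < t \le 3\log p$ and $t - \log p$ odd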

54
for t > log p (t - log p is even)
Time taken is only for combining
Let time(combine_i(N)) denote the time to combine
Total time taken is $\sum_{t = \log p + 2}^{3\log p} \max\{\text{time}(\text{combine}_i(N/2h))\} + L \log p - L$, summed over the steps where $t - \log p$ is even
where $h = 2^{[\,|(t - 3\log p)/2|\,] + 1}$, | | refers to the absolute value and [ ] to the greatest integer function
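The divisor h can be tabulated to see how the data size changes from step to step. This is a minimal Python sketch that evaluates the stated expression with integer arithmetic. The function name is hypothetical and p is assumed to be a power of 2.

```python
def h_factor(t, p):
    """Evaluate h = 2 ** (floor(|(t - 3*log2(p)) / 2|) + 1) for step t."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    log_p = p.bit_length() - 1
    # abs(...) // 2 is the greatest integer not exceeding |(t - 3 log p) / 2|.
    return 1 << (abs(t - 3 * log_p) // 2 + 1)
```

For p = 4, steps t = 3 and t = 4 give h = 4, and steps t = 5 and t = 6 give h = 2, so h halves at each later round as the partial results being exchanged grow.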

55
Execution Time
□ The total time is the sum of all the above steps
□ In general, there would be 3(log p) steps
□ The actual time depends upon how well a particular part of the program is scheduled on a particular processor
□ i.e. the processing time can vary

56
Further Work
□ Formalize the BSP model for p divisions
□ Combine in place (using realloc)
□ Compare parallel FFT against parallel FFTW

57
References
□ S. Sen, S. Chatterjee, N. Dumir, 2000. Towards a Theory of Cache-Efficient Algorithms
□ Michael J. Quinn. Parallel Programming in C with MPI and OpenMP
□ L. G. Valiant, 1990. A bridging model for parallel computation

58
Thank You
