Published by Samson Parker. Modified over 2 years ago

1
FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
4th October, 2007

2
FPGA: Overview
□ Work done
□ Structure of a sample program
□ Ongoing work
□ Next step

3
FPGA: work done
□ Register handling and console IO
□ Modified simple.c
□ Implemented an adder
□ Used the VirtualBase member of ADMXRC2_SPACE_INFO
□ Registers can be indexed using bits (23 downto 2) of the LAD (local address/data) signal when it is used to address the FPGA

4
Structure of simple.vhd
entity simple is
  port( all the local bus signals required );
end simple;
architecture …

5
Ongoing work: ZBT
□ Structure of zbt_main seems to be similar to simple.c
□ zbt.vhd is a wrapper for zbt_main.vhd
□ Same port names, defined in the same way and port-mapped to each other
□ The reason for this wrapper is not yet understood
□ C code is not available in the ADMXRC2 demos
□ Lalit’s code also uses ZBT and block RAMs, so we are looking at his C and VHDL code

6
Next Step
□ To work with ZBT and block RAMs
□ FFT implementation on the FPGA

7
Multiprocessor FFT Overview
□ Some improvements to the existing code
□ Improve the theoretical model
□ Compare theoretical run time with actual run time
□ Statistics of each processor
□ Further refinement: using the BSP model
□ Pointers for cache analysis

8
Optimizations to the code
□ Removed other arrays (reducing memory references considerably)
  □ Twiddle factors
  □ Bit reversal addresses
□ Bit reversal made faster using bit operations: O(1) for each address calculation
□ All multiplications/divisions involving 2 implemented using shift operations: O(1)
□ Power (2^n) computed in constant time using bit operations: O(1)
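The slides do not show the project's code, so the following is only a minimal sketch of the bit-operation tricks listed above. The function names and the fixed 32-bit word width are assumptions made for illustration. A bit-reversed address is obtained with a constant number of mask-and-shift swaps, and powers of two, doubling and halving reduce to single shifts.

```python
def reverse_bits32(x: int) -> int:
    """Reverse the low 32 bits of x using five mask-and-shift swap stages."""
    x &= 0xFFFFFFFF
    x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1)   # swap adjacent bits
    x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2)   # swap bit pairs
    x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4)   # swap nibbles
    x = ((x >> 8) & 0x00FF00FF) | ((x & 0x00FF00FF) << 8)   # swap bytes
    x = ((x >> 16) | (x << 16)) & 0xFFFFFFFF                # swap 16-bit halves
    return x


def bit_reversed_index(i: int, log2_n: int) -> int:
    """Bit-reversed address of i for an FFT of size 2**log2_n (log2_n <= 32)."""
    return reverse_bits32(i) >> (32 - log2_n)


def pow2(k: int) -> int:
    """2**k as a single shift."""
    return 1 << k


def times2(x: int) -> int:
    """Multiply by 2 with a left shift."""
    return x << 1


def half(x: int) -> int:
    """Integer divide by 2 with a right shift."""
    return x >> 1


if __name__ == "__main__":
    # Bit-reversed ordering for an 8-point FFT.
    print([bit_reversed_index(i, 3) for i in range(8)])
```

Because the number of swap stages depends only on the word width and not on the input size, each address calculation costs the same fixed number of operations.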

9
Previously…

10
Now…

11
Improvement
□ For larger input sizes, our program (radix-2) is comparable to FFTW
□ Our program might surpass FFTW
  □ Using SIMD
  □ Higher radix (e.g. 4, 8, 16)
  □ Coding in C

12
Redefining the execution time
□ For p processors, the total execution time is $T_N/p + (1 - 1/p)(2N/B + K_N)$
□ p is a power of 2
□ This assumes the “RAM model”
  □ Assumes a flat memory address space with unit-cost access to any memory location
□ We did not take the memory hierarchy into account
  □ E.g. matrix multiplication actually takes $O(n^5)$ instead of the expected $O(n^3)$ [Alpern et al. 1994]

13
Redefining the execution time
□ Some observations
  □ With p processors, the FFT actually computed is FFT(N/p), so the time taken is $T_{N/p}$ and NOT $T_N/p$
  □ Time taken to combine (O(n) in the RAM model) should be taken as $\sum_{i=1}^{\log p} K_{N/2^i}$
  □ Synchronization time is NOT included
  □ Execution time is currently considered only from the perspective of the master processor
  □ The overheads for establishing sends and receives have been neglected (when measured with a ping-pong approach, the time was negligible)

14
New Theoretical Formula
□ Time taken for parallel execution with p processors is $T_{N/p} + (1 - 1/p)(2N/B) + \sum_{i=1}^{\log p} K_{N/2^i}$
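To make the formula concrete, here is a small sketch that evaluates it term by term. The slides give no closed form for $T_N$ or $K_N$, so both are passed in as functions; `toy_fft_time` and `toy_combine_time` are hypothetical placeholders, not measured costs.

```python
import math


def predicted_time(n, p, bandwidth, fft_time, combine_time):
    """Evaluate T_{N/p} + (1 - 1/p) * (2N/B) + sum_{i=1}^{log2 p} K_{N/2^i}.

    fft_time(m) plays the role of T_m and combine_time(m) the role of K_m.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    log_p = p.bit_length() - 1
    compute = fft_time(n // p)                        # T_{N/p}
    communicate = (1 - 1 / p) * (2 * n / bandwidth)   # (1 - 1/p)(2N/B)
    combine = sum(combine_time(n >> i) for i in range(1, log_p + 1))
    return compute + communicate + combine


def toy_fft_time(m):
    """Hypothetical T_m: m * log2(m) abstract time units."""
    return m * math.log2(m) if m > 1 else 0.0


def toy_combine_time(m):
    """Hypothetical K_m: linear in m."""
    return float(m)


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, predicted_time(16, p, 8, toy_fft_time, toy_combine_time))
```

Passing the cost functions as arguments keeps the sketch usable once measured or closed-form values of $T_N$ and $K_N$ are available.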

15
Execution Time: 16777216

16
Input: 16777216 (p=2)
[Timeline figure for P1 and P2. Events shown: Send(2), Recv(1), FFT(N/2), Recv(2), Send(1), Combine. Time stamps shown: T=0, T=20.865, T=26.579, T=26.591, T=29.799, T=29.848, T=35.541, T=35.808, T=35.555]

17
Load Distribution: Processor 1

18
Load Distribution: Processor 2

19
Input: 16777216 (p=4)
[Timeline figure for P1, P2, P3 and P4. Events shown: Send(2), Recv(1), Send(3), Send(4), Recv(2), Recv(1), FFT(N/4), Send(1), Send(2), Recv(3), Recv(4), Combine, Recv(1), Send(1), Combine. Time stamps shown: T=0, T=20.773, T=26.464, T=29.315, T=29.547, T=26.479, T=26.617, T=29.332, T=29.532, T=30.816, T=30.835, T=31.032, T=31.045, T=33.96, T=33.672, T=33.686, T=33.977, T=34.166, T=33.812, T=39.85, T=39.869, T=40.120]

20
Load Distribution: Processor 1

21
Load Distribution: Processor 2

22
Load Distribution: Processor 3

23
Load Distribution: Processor 4

24
Execution Time: 33554432

25
Input: 33554432 (p=2)
[Timeline figure for P1 and P2. Events shown: Send(2), Recv(1), FFT(N/2), Recv(2), Send(1), Combine. Time stamps shown: T=0, T=103.558, T=114.954, T=114.965, T=121.558, T=121.921, T=133.322, T=133.851, T=133.335]

26
Load Distribution: Processor 1

27
Load Distribution: Processor 2

28
Input: 33554432 (p=4)
[Timeline figure for P1, P2, P3 and P4. Events shown: Send(2), Recv(1), Send(3), Send(4), Recv(2), Recv(1), FFT(N/4), Send(1), Send(2), Recv(3), Recv(4), Combine, Recv(1), Send(1), Combine. Time stamps shown: T=0, T=70.881, T=91.281, T=96.982, T=97.909, T=91.294, T=91.579, T=97.001, T=97.896, T=100.128, T=100.164, T=101.052, T=101.043, T=106.939, T=105.854, T=105.864, T=106.951, T=107.351, T=106.116, T=118.748, T=118.757, T=119.261]

29
Load Distribution: Processor 1

30
Load Distribution: Processor 2

31
Load Distribution: Processor 3

32
Load Distribution: Processor 4

33
Execution Time: 67108864

34
Input: 67108864 (p=2)
[Timeline figure for P1 and P2. Events shown: Send(2), Recv(1), FFT(N/2), Recv(2), Send(1), Combine. Time stamps shown: T=0, T=176.271, T=199.081, T=199.092, T=212.858, T=221.761, T=252.553, T=324.062, T=252.656]

35
Load Distribution: Processor 1

36
Load Distribution: Processor 2

37
Input: 67108864 (p=4)
[Timeline figure for P1, P2, P3 and P4. Events shown: Send(2), Recv(1), Send(3), Send(4), Recv(2), Recv(1), FFT(N/4), Send(1), Send(2), Recv(3), Recv(4), Combine, Recv(1), Send(1), Combine. Time stamps shown: T=0, T=193.211, T=220.211, T=233.173, T=232.65, T=220.196, T=220.772, T=233.192, T=232.645, T=262.629, T=239.773, T=239.257, T=239.238, T=250.893, T=274.299, T=274.300, T=250.903, T=252.737, T=280.422, T=305.326, T=305.333, T=544.529]

38
Load Distribution: Processor 1

39
Load Distribution: Processor 2

40
Load Distribution: Processor 3

41
Load Distribution: Processor 4

42
Inference
□ The idle time is very small (for processor 1)
□ The theoretical model matches the actual results
□ But we still need to find closed-form expressions for $T_N$ and $K_N$

43
Calculating $T_N$ and $K_N$
□ Depends upon
  □ N: size of the input
  □ A: cache associativity
  □ L: cost incurred for a miss
  □ M: size of the cache
  □ B: number of bytes it can transfer at a time

44
Contd…
□ Cache profilers give the number of references made to each level of the cache, along with the number of misses
□ We have this table (computed over the summer)
□ Multiplying the reference and miss counts by the number of cycles each one costs gives an actual cycle count
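The multiplication described above can be sketched as follows. The profiler table itself is not reproduced in the slides, so the counts and per-event cycle costs below are made-up placeholders, and the `CacheLevel` record and its field names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CacheLevel:
    """Profiler counts for one cache level plus assumed per-event costs."""
    name: str
    references: int            # references that reached this level
    misses: int                # references that missed in this level
    cycles_per_reference: int  # assumed cost of one access to this level
    cycles_per_miss: int       # assumed extra cost of one miss at this level


def estimated_cycles(levels):
    """Sum references * access cost + misses * miss cost over all levels."""
    return sum(
        lvl.references * lvl.cycles_per_reference + lvl.misses * lvl.cycles_per_miss
        for lvl in levels
    )


if __name__ == "__main__":
    # Placeholder numbers, not measurements from the project.
    profile = [
        CacheLevel("L1", references=1000, misses=100,
                   cycles_per_reference=1, cycles_per_miss=10),
        CacheLevel("L2", references=100, misses=10,
                   cycles_per_reference=5, cycles_per_miss=100),
    ]
    print(estimated_cycles(profile))
```

Real per-level costs would have to come from the target machine's documentation or from measurement.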

45
Theoretical Verification
□ S. Sen et al., “Towards a Theory of Cache-Efficient Algorithms”
□ It gives a formal method for analyzing algorithms in a cache model (taking a multi-level memory hierarchy into account)
□ Still reading it

46
Modeling using BSP
□ The BSP (Bulk Synchronous Parallel) model considers
  □ The whole job as a series of supersteps
  □ At each superstep, all processors do local computations and send messages to other processors. These messages are not available until the next synchronization has finished

47
Modeling using BSP
□ The BSP model uses the following parameters:
  □ $p$: the number of processors ($p = 2^k$ for us)
  □ $w_t$: the maximum local work performed by any processor
  □ $L$: the time the machine needs for barrier synchronization (determined experimentally)
  □ $g$: the network bandwidth inefficiency (reciprocal of B, determined experimentally)
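As a sketch of how these parameters combine, the standard BSP cost of one superstep is $w + g \cdot h + L$, where $h$ is the largest number of words any processor sends or receives in that superstep. The function names and the example numbers below are illustrative assumptions; real values of $g$ and $L$ would be determined experimentally as noted above.

```python
def superstep_cost(work, words_sent, words_received, g, L):
    """Standard BSP cost of one superstep: w + g*h + L.

    work, words_sent and words_received are per-processor sequences of
    equal length; w is the maximum local work and h the largest number
    of words any single processor sends or receives.
    """
    w = max(work)
    h = max(max(s, r) for s, r in zip(words_sent, words_received))
    return w + g * h + L


def bsp_total(supersteps, g, L):
    """Total cost of a program given as (work, sent, received) tuples."""
    return sum(superstep_cost(wk, s, r, g, L) for wk, s, r in supersteps)


if __name__ == "__main__":
    # Two processors, placeholder numbers: one communication-only
    # superstep followed by one computation-only superstep.
    steps = [
        ([0, 0], [10, 0], [0, 10]),
        ([5, 3], [0, 0], [0, 0]),
    ]
    print(bsp_total(steps, g=0.5, L=2))
```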

48
Modeling using BSP
[Figure: timeline for P1, P2, P3 and P4 divided by barriers into step 0 to step 6. Events shown: Send(2), Recv(1), Send(3), Send(4), Recv(1), FFT(N/4), Send(1), Send(2), Recv(1), Recv(3), Combine, Recv(1), Send(1), Combine]

49
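Execution time
□ Step 0: $L$
□ Step 1: $L + \max\{\mathrm{time}(\mathrm{Send}(2)), \mathrm{time}(\mathrm{Recv}(1))\}$
□ Step 3: $L + \max\{\mathrm{time}(\mathrm{Send}(3)), \mathrm{time}(\mathrm{Send}(4)), \mathrm{time}(\mathrm{Recv}(1)), \mathrm{time}(\mathrm{Recv}(2))\}$
□ Step 4: $L + \max_{0 \le i \le p-1} \mathrm{time}(\mathrm{FFT}_i(N/p))$
□ Step 5: $L + \max\{\mathrm{time}(\mathrm{Send}(2)), \mathrm{time}(\mathrm{Send}(1)), \mathrm{time}(\mathrm{Recv}(3)), \mathrm{time}(\mathrm{Recv}(4))\}$
□ Step 6: $L + \max_{i \in \{1,2\}} \mathrm{time}(\mathrm{combine}_i(N/4))$
□ Step 7: $L + \max\{\mathrm{time}(\mathrm{Send}(1)), \mathrm{time}(\mathrm{Recv}(2))\}$
□ Step 8: $L + \mathrm{time}(\mathrm{combine}(N/2))$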

50
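Generalizing this for p processors

$$
\mathrm{event}(t) =
\begin{cases}
\text{communications} & 0 \le t < \log p \\
\text{compute FFT}(N/p) & t = \log p \\
\text{communications} & \log p < t \le 3\log p,\ (t - \log p)\ \text{odd} \\
\text{combine FFTs} & \log p < t \le 3\log p,\ (t - \log p)\ \text{even}
\end{cases}
$$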
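The case analysis above translates directly into a small classifier. This is a sketch only: the function name `event` follows the slide, while the returned strings and the error handling are assumptions.

```python
def event(t: int, p: int) -> str:
    """Classify step t of the schedule for p processors (p a power of 2)."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of 2")
    log_p = p.bit_length() - 1
    if 0 <= t < log_p:
        return "communications"          # distributing the input
    if t == log_p:
        return "compute FFT(N/p)"        # local FFT on every processor
    if log_p < t <= 3 * log_p:
        if (t - log_p) % 2 == 1:
            return "communications"      # sending partial results back
        return "combine FFTs"            # merging partial results
    raise ValueError("t must satisfy 0 <= t <= 3*log2(p)")


if __name__ == "__main__":
    for t in range(7):
        print(t, event(t, 4))
```

For p = 4 this yields seven steps, t = 0 to 6, alternating between communication and combining after the local FFT.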

51
for $t < \log p$
At step $t$ there are $2^t$ Sends and $2^t$ Recvs
Let time(send(N, i)) denote the time taken to send N data points to processor i
Let time(recv(N, j)) denote the time taken to receive N data points from processor j
Total time taken for this group = $\sum_{0 \le t < \log p} \max\{\mathrm{time}(\mathrm{send}(N/2^{t+1}, j-1)), \mathrm{time}(\mathrm{recv}(N/2^{t+1}, i-1))\} + L \log p$

52
$t = \log p$
□ Let $\mathrm{time}(\mathrm{FFT}_i(N/p))$ denote the time taken to compute the FFT of size N/p on processor i
□ Thus, the time taken to calculate the FFT of size N/p is $\max_{0 \le i \le p-1}\{\mathrm{time}(\mathrm{FFT}_i(N/p))\} + L$

53
for $t > \log p$ ($t - \log p$ is odd)
Time taken is only for communications
Total time taken is $\sum_{t} \max\{\mathrm{time}(\mathrm{send}(N/h, j-1)), \mathrm{time}(\mathrm{recv}(N/h, i-1))\} + L \log p$

54
for $t > \log p$ ($t - \log p$ is even)
Time taken is only for combining
Let $\mathrm{time}(\mathrm{combine}_i(N))$ denote the time to combine
Total time taken is $\sum_{t=\log p+2}^{3\log p} \max\{\mathrm{time}(\mathrm{combine}_i(N/2h))\} + L \log p - L$
where $h = 2^{[\,|(t - 3\log p)/2|\,] + 1}$, $|\cdot|$ denotes the absolute value and $[\cdot]$ the greatest integer function

55
Execution Time
□ The total time is the sum of all the above steps
□ In general, there would be $3\log p$ steps
□ The actual time depends upon how well a particular part of the program schedules on a particular processor, i.e. the processing time can vary

56
Further Work
□ Formalize the BSP model for p divisions
□ Combine in place (using realloc)
□ Compare parallel FFT against parallel FFTW

57
References
□ S. Sen, S. Chatterjee, N. Dumir, 2000. Towards a Theory of Cache-Efficient Algorithms
□ Michael J. Quinn. Parallel Programming in C with MPI and OpenMP
□ L. G. Valiant, 1990. A bridging model for parallel computation

58
Thank You
