Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Systolic FFT Architecture for Real Time FPGA Systems.

Similar presentations


Presentation on theme: "A Systolic FFT Architecture for Real Time FPGA Systems."— Presentation transcript:

1 A Systolic FFT Architecture for Real Time FPGA Systems

2

3

4

5 Radar Processing Application ADC 1.2 GSPS x y 32K Correlation ADC 1.2 GSPS I/QFFTFIFOConjugate I/QFFTFIFO × +×+ 8K FFT bottleneck Real-time Complex 0.6 GSPS input (16-bits) 1.2 GSPS output (12-bits)

6 Size168192  Pins??? Fly??? Mult??? Add??? Shift??? Evaluation Scorecard The design changes will be scored based on the following metrics: Length of FFT Butterflies Adder/subtractors IO pins Multipliers Shift registers

7 Outline Introduction Parallel architecture –Data flow graph –Effects of serial input Systolic architecture Performance summary Conclusions

8 Baseline Parallel Architecture 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Parallel FFT Butterfly structure Removes redundant calculation Size168192  Pins448229K Fly3253K Mult Add Shift00

9 Complex Butterfly u v x y × + - Butterfly contains –1 complex addition –1 complex subtraction –1 complex, constant multiply Size168192  Pins448229K Fly3253K Mult Add Shift00

10 Complex Addition Complex addition adds the real and imaginary parts separately: 2 adds a c b d + + real imag Size168192  Pins448229K Fly3253K Mult Add128213K Shift00

11 Complex Multiply The FOIL method of multiplying complex numbers: 4 multiplies and 2 adds a c - + b d × × × × real imag Size168192  Pins448229K Fly3253K Mult128213K Add192320K Shift00

12 Efficient Complex Multiply Another approach requires fewer multiplies: 3 multiplies and 5 adds a b - + c d - + - real imag × × × Size168192  Pins448229K Fly3253K Mult96159K75% Add288480K150% Shift00

13 Parallel-Pipelined Architecture 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 A pipelined version IO Bound 100% Efficient Size168192  Pins448229K Fly3253K Mult96159K Add288480K Shift00

14 Serial Input 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 A serial version IO-rate matches A/D 6.25% Efficient Size168192  Pins28.01% Fly3253K Mult96159K Add288480K Shift00

15 Outline Introduction Parallel architecture Systolic architecture –Serial implementation –Application specific optimizations Performance summary Conclusions

16 The parallel architecture can be collapsed –One butterfly per stage –Consumes 1 sample per cycle –Same latency and throughput –More efficient design Serial Architecture 50% Efficiency Stage 1Stage 2Stage 3Stage 4 Size168192  Pins28 Fly413.03% Mult1239.03% Add36117.03% Shift2212K

17 High Level View Stage 1Stage 2Stage 3Stage 4 4123 Replace complex structure with an abstract cell which contains: –FIFOs –Butterfly –Switch network Size168192  Pins28 Fly413 Mult1239 Add36117 Shift2212K

18 8192-Point Architecture Requires 13 stages Fixed point arithmetic Varies the dynamic range to increase accuracy Overflow replaced with saturated value 12345678910111213 4 int 14 frac 5 int 13 frac 6 int 12 frac 7 int 11 frac 8 int 10 frac 9 int 9 frac 10 int 8 frac Multipliers limit design to 18-bits and 150 MHz Achieves 70 dB of accuracy 4 int 4 frac 0110.0101 Size168192  Pins28 Fly413 Mult1239 Add36117 Shift2212K

19 Increase Parallelism Add more pipelines Design limited to 150 MHz by multipliers I/Q module generate 600 MSPS Meets real-time requirement through parallelism 12345678910111234567891011123456789101112345678910111213121312131213 Size168192  Pins112 400% Fly1652400% Mult48156400% Add144468400% Shift1612K100%

20 Simplification Target application allows a specific simplification Pads a 4096-point sequence with 4096 zeros Removes 1 st stage multipliers and adders Achieves 100% efficiency in steady state 12345678910111234567891011123456789101112345678910111213121312131213 Size168192  Pins160 143% Fly1652 Mult3614492% Add10843292% Shift48K67%

21 Outline Introduction Parallel architecture Systolic architecture Performance summary –Power, operations per second –FPGA resources, frequency –Latency, throughput Conclusions

22 Results The current implementation has been placed on a Virtex II 8000 and verified at 150 MHz Power: 22 Watts @ 65 C GOPS: 86 total @ 3.9 GOPS/Watt FPGA resources (XC2V8000) –Multipliers: 144 (85%) –LUTs and SRLs: 39,453 (42%) –BlockRAM: 56 (33%) –Filp flops: 35,861 (38%) Frequency: 150 MHz Latency: 1127 cycles Throughput: 1.2 GSPS

23 Outline Introduction Parallel architecture Systolic architecture Performance summary Conclusions –Applicability to other platforms –Future work

24 Conclusions Created a high performance, real-time FFT core –Low power (3.9 GOPS/Watt) –High throughput (1.2 GSPS), low latency (7.6 µsec/sample) –Fixed-point (18-bits), high accuracy (70 dB) General architecture –Extendable to a generic FPGA core –Retargetable to ASIC technology Future work –Develop a parameterizable IP core generator

25

26

27

28

29

30

31

32

33 Sources Fredrik Edman Preston Jackson, Cy Chan, Charles Rader, Jonathan Scalera, and Michael Vai HPEC 2004 29 September 2004


Download ppt "A Systolic FFT Architecture for Real Time FPGA Systems."

Similar presentations


Ads by Google