An FFT for Wireless Protocols Dr. J. Greg Nash Centar (www.centar.net) HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.

1 An FFT for Wireless Protocols Dr. J. Greg Nash Centar (www.centar.net) jgregnash@centar.net HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile Computing Hardware Architectures January 2-5, 2006

2 FFT in Wireless Applications Modulation schemes –Orthogonal Frequency Division Multiplexing (OFDM) –Orthogonal Frequency Division Multiple Access (OFDMA) Cell Phone and LAN Protocols (OFDM based) –802.11n (next generation wireless LAN-WiFi) –802.16/e (wireless fixed and mobile MAN-WiMax) –802.20 (mobile broadband wireless access) –802.22 (wireless regional area networks) –Flash-OFDM (Fast Low-latency Access with Seamless Handoff OFDM) –3GPP LTE (3rd Generation Partnership Project, Long Term Evolution) –HiperMAN/LAN (European broadband fixed wireless) After ~2010 modulation schemes will be based primarily on FFT High throughput: ~2.5 µs per 1024-point FFT (several data streams associated with multiple antennas and high bandwidths) High dynamic range: 60-100 dB S/QN (high peak-to-average power)

3 Required FFT For wireless –Transform size N not restricted to powers of two (e.g., 3GPP LTE requires 128, 256, 1024, 1536, 2048 points) –“Run-time” choice of FFT size –Scaling (choose size of hardware to match system throughput) –Pruning (reducing computational complexity when the number of DFT outputs or inputs is small compared to N) –High throughput: ~2.5 µs per 1024-point FFT (several data streams associated with multiple antennas and high bandwidths) –High dynamic range: 60-100 dB S/QN (high peak-to-average power) For added generality –1-D or 2-D transforms –Low computational latency –Simple, locally connected circuit architecture

4 Discrete Fourier Transform Mathematical form: Z = C X, with C the M x M coefficient matrix (shown on the slide for M=16) Multiplications = M^2

5 Inputs X and Outputs Z in Bit-reversed Form (N=16) “ ”= element by element multiply

6 Base-4 DFT Matrix Equation General Form: Coefficient matrices are where

7 Find Systolic Architecture Using SPADE † Mathematical Algorithm Automatic Search for Space-Time Transformations, T Input Code Simulator, Graphical Outputs
for j to M/4 do
  for k to M/4 do
    Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
  od;
  for k to 4 do
    Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..M/4);
  od
od;
† Symbolic Parallel Algorithm Development Environment –2-D mesh array –fine-grained PEs (registers, adder, mux) –linear arrays of multipliers, memory FPGA Architectural Constraints Objective Functions
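The loop nest on slide 7 can be checked numerically. The sketch below is a plain Python transcription of those two loops, not the author's implementation. The slide's definitions of CM1, CM2 and WM did not survive extraction, so the entries used here (fourth roots of unity for CM1 and CM2, twiddle factors W_M^(jk) for WM), the negative-exponent sign convention, and the input/output index layouts are all assumptions, chosen so that the result agrees with a direct DFT when M is a multiple of 16.

```python
import cmath

# Powers of W4 = exp(-2*pi*i/4); the sign convention is an assumption.
W4_POWERS = (1, -1j, -1, 1j)

def naive_dft(x):
    """Direct O(M^2) DFT, used only as a reference."""
    M = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / M) for n in range(M))
            for k in range(M)]

def base4_dft(x):
    """DFT of length M (multiple of 16) via Y = WM .* (CM1 X), Z = CM2 Y^T."""
    M = len(x)
    if M % 16:
        raise ValueError("this factorization needs M to be a multiple of 16")
    q = M // 4
    # Assumed input layout: X is 4 x M/4 with X[i][k] = x[(M/4)*i + k].
    X = [[x[q * i + k] for k in range(q)] for i in range(4)]
    # Assumed coefficient matrices: CM1 is M/4 x 4, CM2 is 4 x M/4,
    # and both contain only 1, -i, -1, i (so no real multiplications).
    CM1 = [[W4_POWERS[(j * i) % 4] for i in range(4)] for j in range(q)]
    CM2 = [[W4_POWERS[(k * i) % 4] for i in range(q)] for k in range(4)]
    # Assumed twiddle matrix: WM[j][k] = W_M^(j*k), size M/4 x M/4.
    WM = [[cmath.exp(-2j * cmath.pi * j * k / M) for k in range(q)]
          for j in range(q)]
    # First loop on the slide: element-by-element multiply of WM with CM1*X.
    Y = [[WM[j][k] * sum(CM1[j][i] * X[i][k] for i in range(4))
          for k in range(q)] for j in range(q)]
    # Second loop on the slide: Z = CM2 * transpose(Y).
    Z = [[sum(CM2[k][i] * Y[j][i] for i in range(q)) for j in range(q)]
         for k in range(4)]
    # Assumed output layout: Z[k][j] holds z[j + (M/4)*k].
    return [Z[k][j] for k in range(4) for j in range(q)]
```

Under these assumptions, `base4_dft(x)` matches `naive_dft(x)` to rounding error for lengths 16, 32 and 48, which is consistent with the M = 16, 32, 48,... restriction stated on slide 12.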

8 DFT Architecture Base-4 DFT Equations: Base-4 DFT Architecture:

9 Base-4 DFT Array (M=16)

10 Base-4 DFT Array (M= 32)

11 Processing flow for DFT of length N = N_r N_c 1. N_c column DFTs (X_ci) of length N_r – Array length is N_r/4 – N/4 clock cycles 2. Twiddle multiplication – Only multipliers used – 4N_c clock cycles – Without this step a 2-D FFT is done 3. N_r row DFTs (X_ri) of length N_c – (N_c)^2/4 clock cycles
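The three-step flow on slide 11 is the usual row/column factorization of a length-N DFT. The following is a minimal software sketch of it, assuming a row-major input layout, an output ordering z[k1 + N_r*k2], and a negative-exponent DFT; none of these are stated on the slide, and the small column and row DFTs are computed directly rather than on a systolic array. The helper `cycle_counts` only restates the per-step clock-cycle figures from the slide; whether the steps overlap in hardware is not something the slide text settles, so the figures are not summed.

```python
import cmath

def dft(v):
    """Direct DFT of a short sequence."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def row_column_dft(x, n_r, n_c):
    """Length N = n_r * n_c DFT via column DFTs, twiddles, then row DFTs."""
    N = n_r * n_c
    if len(x) != N:
        raise ValueError("len(x) must equal n_r * n_c")
    # Step 1: n_c column DFTs of length n_r (assumed layout A[r][c] = x[r*n_c + c]).
    cols = [dft([x[r * n_c + c] for r in range(n_r)]) for c in range(n_c)]
    # Step 2: twiddle multiplication by W_N^(c*k1).
    B = [[cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / N)
          for c in range(n_c)] for k1 in range(n_r)]
    # Step 3: n_r row DFTs of length n_c.
    D = [dft(B[k1]) for k1 in range(n_r)]
    # Assumed output ordering: D[k1][k2] is z[k1 + n_r*k2].
    z = [0j] * N
    for k1 in range(n_r):
        for k2 in range(n_c):
            z[k1 + n_r * k2] = D[k1][k2]
    return z

def cycle_counts(n_r, n_c):
    """Per-step clock-cycle figures as quoted on the slide."""
    n = n_r * n_c
    return {"column_dfts": n // 4, "twiddle": 4 * n_c, "row_dfts": n_c * n_c // 4}
```

For the 1024-point case used elsewhere in the deck (N_r = N_c = 32), `cycle_counts(32, 32)` gives 256, 128 and 256 cycles for the three steps.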

12 Possible Transform Sizes Base-4 –Matrix derivation requires M = 16, 32, 48,... –N = N_r N_c = (16p)(16q) = 256n Base-2: –Matrix derivation assumes M = 4, 8, 12,... –N = N_r N_c = (4p)(4q) = 16n Base-2 (No row/column factorization) –N = M = 4n (n,p,q = 1,2,3,..)
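The size rules above can be turned into a small search for valid (N_r, N_c) pairs. This is an illustrative sketch; the function name and the idea of enumerating every factor pair are not from the slides.

```python
def factorizations(n, base):
    """List (N_r, N_c) with N_r * N_c == n and both factors allowed for `base`.

    Base-4 needs each factor to be a multiple of 16, base-2 a multiple of 4.
    """
    step = {4: 16, 2: 4}[base]
    return [(n_r, n // n_r)
            for n_r in range(step, n + 1, step)
            if n % n_r == 0 and (n // n_r) % step == 0]

# 1536 = 256 * 6, so it has base-4 factorizations despite not being a power of two.
print(factorizations(1536, 4))  # [(16, 96), (32, 48), (48, 32), (96, 16)]
# 128 is not a multiple of 256, so base-4 offers nothing; base-2 does.
print(factorizations(128, 4))   # []
print(factorizations(128, 2))   # [(4, 32), (8, 16), (16, 8), (32, 4)]
```

Listing several pairs for one N is what makes the speed-area choice of N_r and N_c possible.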

13 FFT Performance Comparisons Based on “Streaming” FFT (continuous data in and out) Benchmark against radix-4 Altera FFT (Block Floating Point) –Base-4 16-bit circuit –Choose Altera circuit with comparable signal to (roundoff) noise ratio –Circuits mapped to same Altera Stratix II FPGA (90nm) Same compiler used (Altera Quartus)

14 Power Dissipation Low Power Architecture –Use of many small memories (one per PE), so that they are both low power and fast (memory accounts for only 14% of the dissipation in the 256-point FFT) –Reuse of data flowing through registers (systolic processing) so that unnecessary memory reads and writes are avoided –Localized interconnects to minimize wiring overhead (total interconnect dynamic power is only 46% of the total power for the 256-point circuit) Performance (256-point FFT) Expect ~15-20% improvement for optimized circuit

15 Block Floating Point Usage Each row has separate BFP support circuitry Row DFT inputs normalized to same exponent Row DFT outputs use FP One exponent for each output point Comparison of “single tone” data sets: N=1024
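"Normalized to same exponent" is the defining step of block floating point: every value in a block is stored as an integer mantissa, and the whole block shares one power-of-two exponent. The sketch below illustrates that idea for a block of real numbers. The 16-bit mantissa width echoes the 16-bit base-4 circuit mentioned in the comparisons, but the rounding and clamping choices and the function names are assumptions, not a description of the circuit's BFP logic. Complex samples would apply the same exponent to real and imaginary parts.

```python
import math

def to_block_floating_point(values, mantissa_bits=16):
    """Quantize a block to signed integer mantissas sharing one exponent."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 0
    limit = 2 ** (mantissa_bits - 1) - 1       # largest positive mantissa
    # frexp returns peak = m * 2**e with 0.5 <= m < 1, hence peak < 2**e.
    _, e = math.frexp(peak)
    exponent = e - (mantissa_bits - 1)          # shared block exponent
    scale = 2.0 ** exponent
    # Round to nearest and clamp so the peak value cannot overflow.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, exponent

def from_block_floating_point(mantissas, exponent):
    """Reconstruct approximate real values from mantissas and the exponent."""
    return [m * 2.0 ** exponent for m in mantissas]

mantissas, exponent = to_block_floating_point([0.5, -0.25, 0.001, 0.75])
print(mantissas, exponent)  # [16384, -8192, 33, 24576] -15
```

The small value 0.001 keeps only about six significant bits here because the exponent is set by the largest value in the block. That loss is what a per-output floating-point exponent avoids.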

16 Figure of Merit Estimates vs Transform Size FOM = Area (ALMs) x Throughput (Cycles/DFT) x Mem (Kbits)/Clock(Hz) “Streaming” circuits: Altera (20-bit) and base-4 (16-bit)
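The figure of merit on slide 16 is a simple product of costs divided by clock rate, so a smaller value presumably indicates a better design. As a sketch, the formula can be written directly; the numbers in the example are hypothetical placeholders, not measurements from the presentation.

```python
def figure_of_merit(area_alms, cycles_per_dft, memory_kbits, clock_hz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) x Mem (Kbits) / Clock (Hz)."""
    return area_alms * cycles_per_dft * memory_kbits / clock_hz

# Hypothetical designs, for illustration only (not data from the slides).
design_a = figure_of_merit(area_alms=2000, cycles_per_dft=256,
                           memory_kbits=40, clock_hz=350e6)
design_b = figure_of_merit(area_alms=3000, cycles_per_dft=256,
                           memory_kbits=60, clock_hz=300e6)
print(design_a < design_b)  # True: design_a uses less of everything at a higher clock
```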

17 Scaling Option (1) Trade-off between throughput and resources used FOM = Area (ALMs) x Throughput (Cycles/DFT)/ Clock (MHz)/1000 Nominal clock = 350MHz Estimates

18 Non-Power-of-Two Comparison FOM = Area (ALMs) x Throughput (cycles/DFT) x Memory(Kbit)/Clock (MHz) Nominal clock = 350MHz Non-power-of-two

19 Scaling (2) Use same circuit to do different transform sizes (e.g., run-time) Base-4 matrix equation: Process each C_B multiplication separately using blocks of 4 rows Example: 1024-point transform (N_r = N_c = 32)

20 Scaling (2) Cycle input twice Option 1 –All column DFTs –All twiddles –All row DFTs Option 2 –Normal ordering: 1 (half Z values), 2 (other half) N = 1024, N_r = N_c = 32

21 Pruning Goal –Compute subset of transform outputs –Compute complete transform output with subset of inputs Example –N=1024 (N_r = N_c = 32; nominal array N_r/4 x 4 = 8 x 4) –Calculate only elements z0, z1, z2, z3 of Z –Only 4 row DFTs required (nominally 32 are required) Less than half the computing resources and half the computation time required
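The pruning example can be reproduced in software with a row/column factorization in which row k1 of the intermediate array produces outputs z[k1], z[k1 + N_r], z[k1 + 2 N_r], and so on. That output ordering and the row-major input layout are assumptions made for this sketch, as is the function name. With them, requesting z0..z3 touches only rows 0 to 3, matching the "only 4 row DFTs" figure. The column DFTs are still computed in full here for simplicity.

```python
import cmath

def dft(v):
    """Direct DFT of a short sequence."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def pruned_outputs(x, n_r, n_c, wanted):
    """Return ({k: z[k]}, number of row DFTs run) for the output indices in `wanted`."""
    N = n_r * n_c
    if len(x) != N:
        raise ValueError("len(x) must equal n_r * n_c")
    # Output k = k1 + n_r*k2 lives in row k1 (assumed ordering), so only
    # the distinct values of k % n_r need a row DFT.
    rows_needed = sorted({k % n_r for k in wanted})
    # Column DFTs of length n_r (assumed layout x[r*n_c + c]); not pruned here.
    cols = [dft([x[r * n_c + c] for r in range(n_r)]) for c in range(n_c)]
    row_results = {}
    for k1 in rows_needed:
        # Twiddle multiplication followed by one row DFT of length n_c.
        row = [cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / N)
               for c in range(n_c)]
        row_results[k1] = dft(row)
    z = {k: row_results[k % n_r][k // n_r] for k in wanted}
    return z, len(rows_needed)
```

For example, `pruned_outputs(x, 32, 32, [0, 1, 2, 3])` on a 1024-point input runs 4 row DFTs instead of 32.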

22 Summary Transform size N can be any multiple of 256 (or 16 or 4 with different base) Scalable, partitionable circuit –Any DFT size can be computed on the same circuit with sufficient memory –Larger circuits constructed by replication of identical 4x4 PE array blocks –Choose N_r and N_c for speed-area tradeoff Fine-grained pruning options BFP/FP options reduce word length by ~4 bits High throughput (higher clock frequency, fewer clock cycles/DFT) Low computational latency –Pipeline depth small, vs for traditional pipelined FFTs 1-D and 2-D transforms possible on the same circuit Simple circuit (mesh array of identical adder cells)

23 Precision 1024-point transform Random real and complex inputs 18 data sets

