An FFT for Wireless Protocols Dr. J. Greg Nash Centar (www.centar.net) HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.

Presentation transcript:

An FFT for Wireless Protocols
Dr. J. Greg Nash, Centar (www.centar.net)
HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES, Mobile Computing Hardware Architectures, January 2-5, 2006

FFT in Wireless Applications
Modulation schemes
– Orthogonal Frequency Division Multiplexing (OFDM)
– Orthogonal Frequency Division Multiple Access (OFDMA)
Cell Phone and LAN Protocols (OFDM based)
– 802.11n (next generation wireless LAN-WiFi)
– 802.16/e (wireless fixed and mobile MAN-WiMax)
– (mobile broadband wireless access)
– (wireless regional area networks)
– Flash-OFDM (Fast Low-latency Access with Seamless Handoff OFDM)
– 3GPP LTE (3rd Generation Partnership Project, Long Term Evolution)
– HiperMAN/LAN (European broadband fixed wireless)
After ~2010 modulation schemes will be based primarily on FFT
High throughput: ~2.5 usec per 1024-point FFT (several data streams associated with multiple antennas and high bandwidths)
High dynamic range: dB S/QN (high peak-to-average power)

Required FFT
For wireless
– Transform size N not restricted to powers of two (e.g., 3GPP LTE requires 128, 256, 1024, 1536, 2048 points)
– “Run-time” choice of FFT size
– Scaling (choose size of hardware to match system throughput)
– Pruning (reducing computational complexity when the number of DFT outputs or inputs is small compared to N)
– High throughput: ~2.5 usec per 1024-point FFT (several data streams associated with multiple antennas and high bandwidths)
– High dynamic range: dB S/QN (high peak-to-average power)
For added generality
– 1-D or 2-D transforms
– Low computational latency
– Simple, locally connected circuit architecture

Discrete Fourier Transform
Mathematical form:
C (M=16):
Multiplications = M^2
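
The M^2 multiplication count can be checked with a direct matrix-vector DFT. This is a minimal sketch, not the author's formulation: the slide's equation did not survive extraction, so the coefficient C[k][n] = exp(-2*pi*i*k*n/M) and the name `dft_direct` are assumptions made for illustration.

```python
import cmath

def dft_direct(x):
    """Direct DFT z = C x, assuming C[k][n] = exp(-2j*pi*k*n/M) (sign convention assumed)."""
    M = len(x)
    mults = 0
    z = []
    for k in range(M):
        acc = 0j
        for n in range(M):
            acc += cmath.exp(-2j * cmath.pi * k * n / M) * x[n]
            mults += 1  # one complex multiplication per matrix entry
        z.append(acc)
    return z, mults

# Unit impulse of length M = 16: every output bin equals 1.
z, mults = dft_direct([1.0] + [0.0] * 15)
print(mults)  # 256, i.e. M^2 for M = 16
```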

Inputs X and Outputs Z in Bit-reversed Form (N=16) “ ”= element by element multiply

Base-4 DFT Matrix Equation General Form: Coefficient matrices are where

Find Systolic Architecture Using SPADE†
Mathematical Algorithm
Automatic Search for Space-Time Transformations, T
Input Code
Simulator, Graphical Outputs
    for j to M/4 do
      for k to M/4 do
        Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
      od;
      for k to 4 do
        Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..M/4);
      od
    od;
FPGA Architectural Constraints
– 2-D mesh array
– fine-grained PEs (registers, adder, mux)
– linear arrays of multipliers, memory
Objective Functions
† Symbolic Parallel Algorithm Development Environment

DFT Architecture Base-4 DFT Equations: Base-4 DFT Architecture:

Base-4 DFT Array (M=16)

Base-4 DFT Array (M=32)

Processing flow for DFT of length N = N_r N_c
1. N_c column DFTs (X_ci) of length N_r
– Array length is N_r/4
– N/4 clock cycles
2. Twiddle multiplication
– Only multipliers used
– 4 N_c clock cycles
– Without this step a 2-D FFT is done
3. N_r row DFTs (X_ri) of length N_c
– (N_c)^2/4 clock cycles
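
The three steps can be sketched in software with the standard row/column (Cooley-Tukey) factorization. This is an illustration under stated assumptions, not the systolic array itself: the index maps n = N_c*n1 + n2 and k = k1 + N_r*k2, the exp(-2*pi*i...) sign convention, natural (not bit-reversed) ordering, and the function names are all choices made here.

```python
import cmath

def dft(x):
    """Reference direct DFT (sign convention assumed)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def dft_row_column(x, n_r, n_c):
    """DFT of length N = n_r * n_c via column DFTs, twiddles, then row DFTs."""
    n = n_r * n_c
    assert len(x) == n
    # Step 1: n_c column DFTs of length n_r; column n2 holds x[n_c*n1 + n2].
    cols = [dft([x[n_c * n1 + n2] for n1 in range(n_r)]) for n2 in range(n_c)]
    # Step 2: twiddle multiplication by exp(-2j*pi*k1*n2/N).
    rows = [[cols[n2][k1] * cmath.exp(-2j * cmath.pi * k1 * n2 / n)
             for n2 in range(n_c)] for k1 in range(n_r)]
    # Step 3: n_r row DFTs of length n_c; output index is k = k1 + n_r*k2.
    z = [0j] * n
    for k1 in range(n_r):
        for k2, v in enumerate(dft(rows[k1])):
            z[k1 + n_r * k2] = v
    return z

# Non-power-of-two example: N = 48 = 4 x 12, checked against the direct DFT.
x = [complex(i % 7, (3 * i) % 5) for i in range(48)]
max_err = max(abs(p - q) for p, q in zip(dft_row_column(x, 4, 12), dft(x)))
print(max_err < 1e-9)
```

Skipping step 2 leaves only column and row transforms, which is the 2-D case the slide mentions.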

Possible Transform Sizes
Base-4
– Matrix derivation requires M = 16, 32, 48, ...
– N = N_r N_c = (16p)(16q) = 256n
Base-2
– Matrix derivation assumes M = 4, 8, 12, ...
– N = N_r N_c = (4p)(4q) = 16n
Base-2 (no row/column factorization)
– N = M = 4n
(n, p, q = 1, 2, 3, ...)
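
As a quick check of these size rules against the 3GPP LTE sizes quoted earlier, the divisibility tests can be written out directly. The helper name `supported_configs` and the label strings are hypothetical, introduced only for this sketch.

```python
def supported_configs(n):
    """List which of the three size rules above admit a transform of length n."""
    configs = []
    if n % 256 == 0:
        configs.append("base-4 (N = 256n)")
    if n % 16 == 0:
        configs.append("base-2 (N = 16n)")
    if n % 4 == 0:
        configs.append("base-2, no row/column factorization (N = 4n)")
    return configs

lte_sizes = [128, 256, 1024, 1536, 2048]
table = {n: supported_configs(n) for n in lte_sizes}
for n, configs in table.items():
    print(n, configs)
```

Under these rules 128 is not a multiple of 256 and so falls to the base-2 options, while 1536 = 6 x 256 fits all three.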

FFT Performance Comparisons
Based on “Streaming” FFT (continuous data in and out)
Benchmark against radix-4 Altera FFT (Block Floating Point)
– Base-4 16-bit circuit
– Choose Altera circuit with comparable signal to (roundoff) noise ratio
– Circuits mapped to same Altera Stratix II FPGA (90nm)
Same compiler used (Altera Quartus)

Power Dissipation
Low Power Architecture
– Use of many small memories (one per PE), so that they are both low power and fast (memory accounts for only 14% of the dissipation in the 256-point FFT)
– Reuse of data flowing through registers (systolic processing) so that unnecessary memory reads and writes are avoided
– Localized interconnects to minimize wiring overhead (total interconnect dynamic power is only 46% of the total power for the 256-point circuit)
Performance (256-point FFT)
Expect ~15-20% improvement for optimized circuit

Block Floating Point Usage
Each row has separate BFP support circuitry
Row DFT inputs normalized to same exponent
Row DFT outputs use FP
One exponent for each output point
Comparison of “single tone” data sets: N=1024
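
Normalizing a row to a shared exponent can be illustrated with a simplified software model. This is only a sketch of the general block floating point idea, not the circuit described here: it uses Python integers, handles real values only, scales down but never up, and the name `bfp_normalize` and the 16-bit default are assumptions.

```python
def bfp_normalize(values, mantissa_bits=16):
    """Scale a block of integers so all mantissas fit one signed word with a shared exponent."""
    limit = (1 << (mantissa_bits - 1)) - 1   # largest positive mantissa
    peak = max(abs(v) for v in values)
    exponent = 0
    while (peak >> exponent) > limit:        # shift until the largest value fits
        exponent += 1
    mantissas = [v >> exponent for v in values]  # arithmetic right shift (rounds toward -inf)
    return mantissas, exponent

mantissas, exponent = bfp_normalize([100000, -40000, 12, 3])
print(mantissas, exponent)  # [25000, -10000, 3, 0] 2
```

The example shows the usual cost of a shared exponent: the smallest value in the block loses all of its precision.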

Figure of Merit Estimates vs Transform Size
FOM = Area (ALMs) x Throughput (Cycles/DFT) x Mem (Kbits) / Clock (Hz)
“Streaming” circuits: Altera (20-bit) and base-4 (16-bit)
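
The figure of merit is simple arithmetic and can be written as a one-line function. The input values below are placeholders chosen only to exercise the formula; they are not measurements from the presentation, and the function name is hypothetical. Since area, cycles per DFT and memory are all costs, a smaller value is presumably better.

```python
def figure_of_merit(area_alms, cycles_per_dft, mem_kbits, clock_hz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) x Mem (Kbits) / Clock (Hz)."""
    return area_alms * cycles_per_dft * mem_kbits / clock_hz

# Placeholder inputs (NOT data from the slides), used only to show the scaling.
base = figure_of_merit(area_alms=1000, cycles_per_dft=256, mem_kbits=10, clock_hz=350e6)
faster = figure_of_merit(area_alms=1000, cycles_per_dft=256, mem_kbits=10, clock_hz=700e6)
print(base / faster)  # doubling the clock halves the FOM
```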

Scaling Option (1)
Trade-off between throughput and resources used
FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz) / 1000
Nominal clock = 350 MHz
Estimates

Non-Power-of-Two Comparison
FOM = Area (ALMs) x Throughput (cycles/DFT) x Memory (Kbit) / Clock (MHz)
Nominal clock = 350 MHz
Non-power-of-two

Scaling (2)
Use same circuit to do different transform sizes (e.g., run-time)
Base-4 matrix equation:
Process each C B multiplication separately using blocks of 4 rows
Example: 1024-point transform (N_r = N_c = 32)

Scaling (2)
Cycle input twice
Option 1
– All column DFTs
– All twiddles
– All row DFTs
Option 2
– Normal ordering
1 (half Z values)
2 (other half)
N = 1024, N_r = N_c = 32

Pruning
Goal
– Compute subset of transform outputs
– Compute complete transform output with subset of inputs
Example
– N=1024 (N_r = N_c = 32; nominal array N_r/4 x 4 = 8 x 4)
– Calculate only elements z0, z1, z2, z3 of Z
– Only 4 row DFTs required (nominally 32 are required)
Less than half the computing resources and half the computation time required
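
The row-DFT count in this example can be reproduced with a software sketch of output pruning on the standard row/column factorization. It assumes natural-order output indexing k = k1 + N_r*k2 (the presentation uses bit-reversed ordering, so the hardware mapping may differ), and the helper names are hypothetical.

```python
import cmath

def dft_bins(x, bins):
    """DFT of x evaluated only at the requested output indices."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in bins]

def pruned_dft(x, n_r, n_c, wanted):
    """Return {k: Z[k]} for k in wanted, plus the number of row DFTs used."""
    n = n_r * n_c
    rows_needed = sorted({k % n_r for k in wanted})  # output k sits in row k % n_r
    # Column DFTs (length n_r), keeping only the rows needed later.
    cols = [dft_bins([x[n_c * n1 + n2] for n1 in range(n_r)], rows_needed)
            for n2 in range(n_c)]
    out = {}
    for idx, k1 in enumerate(rows_needed):
        # Twiddle multiplication for row k1.
        row = [cols[n2][idx] * cmath.exp(-2j * cmath.pi * k1 * n2 / n)
               for n2 in range(n_c)]
        # Row DFT (length n_c), evaluated only at the requested positions.
        k2s = [k // n_r for k in wanted if k % n_r == k1]
        for k2, v in zip(k2s, dft_bins(row, k2s)):
            out[k1 + n_r * k2] = v
    return out, len(rows_needed)

x = [complex((7 * i) % 11, (3 * i) % 5) for i in range(1024)]
out, rows_used = pruned_dft(x, 32, 32, [0, 1, 2, 3])
ref = dft_bins(x, [0, 1, 2, 3])
max_err = max(abs(out[k] - ref[k]) for k in range(4))
print(rows_used, max_err < 1e-6)  # 4 True
```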

Summary
Transform size N can be any multiple of 256 (or 16 or 4 with different base)
Scalable, partitionable circuit
– Any DFT size can be computed on the same circuit with sufficient memory
– Larger circuits constructed by replication of identical 4x4 PE array blocks
– Choose N_r and N_c for speed-area tradeoff
Fine-grained pruning options
BFP/FP options reduce word length by ~4 bits
High throughput (higher clock frequency, fewer clock cycles/DFT)
Low computational latency
– Pipeline depth small, vs for traditional pipelined FFTs
1-D and 2-D transforms possible on the same circuit
Simple circuit (mesh array of identical adder cells)

Precision
1024-point transform
Random real and complex inputs
18 data sets