Efficient FPGA Implementation of QR Decomposition Using a Systolic Array Architecture
Xiaojun Wang (x-wang@airvana.com), Miriam Leeser (mel@ece.neu.edu)

Abstract
QR decomposition is used in many applications, including adaptive beamforming, phased-array radar and sonar, 3G wireless communication, channel equalization, smart antennas and WiMAX. We have implemented a systolic array QR decomposition on a Xilinx Virtex-5 FPGA using the Givens rotation algorithm. We support any general floating-point format, including IEEE single precision. Our design uses straightforward floating-point divide and square root implementations, which makes it more standard and portable across systems and thus easier to fit into a larger design. The latency of our implementation is small and scales well to large matrix sizes. The design is fully pipelined and runs at a clock rate of over 130 MHz for the IEEE single-precision floating-point format. A peak performance of 35 GFLOPS is achieved.

Design Architecture
Two Dimensional Systolic Array
[Figure: 2D systolic array architecture for QR decomposition]
Diagonal PE: calculates the rotation parameters c and s
Off-diagonal PE: updates the matrix elements using the rotation parameters

Scheduling
[Figure: Comparison of 2D and 1D Implementation for a 6x3 Matrix (numbers are cycles); panel: 2D Implementation]

State of the Art
Three types of previous implementations of Givens rotation for QR:
[1] Square-root-free, using the Squared Givens Rotation (SGR) algorithm
[2] Logarithmic Number System (LNS) algorithm
[3] CORDIC (Coordinate Rotation by Digital Computer) algorithm
All previous work avoids the divide and square root steps in the Givens rotation algorithm.
In our work, we implement the Givens rotation algorithm with the floating-point divide and square root from the Northeastern University Reconfigurable Computing Laboratory Variable Precision Floating-Point (VFloat) Library: http://www.ece.neu.edu/groups/rcl/projects/floatingpoint/index.html

[Figures: Off-diagonal Processing Element; Diagonal Processing Element; 1D Implementation]

Conclusions
A truly 2D systolic array architecture was implemented, while others have implemented a 1D array only.
The divide and square root operations in Givens rotations are not avoided in our QR decomposition. Instead, we implement the Givens rotation algorithm using the floating-point divide and square root from our VFloat library.
Input, output and all operations are in floating-point arithmetic; floating-point formats of any size, including the IEEE standard formats, are supported.
The maximum level of parallelism is explored for a Xilinx XC5VLX220 FPGA, with balanced usage of hardware resources such as slices and embedded DSPs.
The QR decomposition is fully pipelined, with high throughput and a high maximum clock frequency.
The latency of our systolic array implementation increases linearly with matrix size, so it scales well to larger matrices and is suitable for high-speed FPGA implementation.
The input matrix size can be configured at compile time to virtually any size. We support square, tall and short matrices.
Our design is easily scalable to future larger FPGAs or across multiple FPGAs. The largest matrix that can fit depends on the degree of parallelism, the data wordlength, and the targeted FPGA.
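As a software illustration of the Givens rotation with explicit divide and square root described above, the following is a minimal Python sketch of the two processing elements. It is not the poster's hardware design: it runs sequentially in double precision rather than as a pipelined array in a VFloat format, and the function names diagonal_pe, off_diagonal_pe and givens_qr_r are hypothetical. Only the roles of the cells (the diagonal PE computes c and s, the off-diagonal PE applies them) are taken from the poster.

```python
import math


def diagonal_pe(r, x):
    """Diagonal cell: compute rotation parameters (c, s) that zero x against r.

    Returns (c, s, updated r). Uses one square root and two divides,
    which is the step the poster chooses not to avoid.
    """
    norm = math.sqrt(r * r + x * x)
    if norm == 0.0:
        return 1.0, 0.0, 0.0          # nothing to rotate
    return r / norm, x / norm, norm


def off_diagonal_pe(c, s, r, x):
    """Off-diagonal cell: rotate the stored value r and the incoming value x."""
    return c * r + s * x, -s * r + c * x


def givens_qr_r(a):
    """Return the upper-triangular factor R of an m x n matrix (list of rows).

    Rows are fed in one at a time, as they would enter a triangular array;
    each row is rotated against the stored R until it is fully absorbed.
    """
    n = len(a[0])
    r = [[0.0] * n for _ in range(n)]
    for row in a:
        x = list(row)
        for i in range(n):
            c, s, r[i][i] = diagonal_pe(r[i][i], x[i])
            for j in range(i + 1, n):
                r[i][j], x[j] = off_diagonal_pe(c, s, r[i][j], x[j])
    return r


print(givens_qr_r([[3.0], [4.0]]))  # [[5.0]]
```

The sketch only produces R; accumulating Q would require applying the same rotations to an identity matrix, which is omitted here to keep the example short.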
Algorithm – Givens Rotations
[Figure: Givens Rotation; c: cosine; s: sine]
Divide and square root are required

Experimental Results
Targeting a XC5VLX220 FPGA: 138240 slices, 192 block RAMs, 128 embedded DSPs
An m x n tall matrix (m > n) needs about the same resources as an n x n square matrix
The XC5VLX220 FPGA can fit a matrix of:
Up to 7 columns for the 32(8,23) format (23-bit mantissa)
Up to 12 columns for the 20(8,11) format (11-bit mantissa)

Resources and Speed: XC5VLX220
Format    Matrix Size  Slices  Block RAMs  DSPs  Freq. (MHz)  2D Latency (cycles)  1D Latency (cycles)
32(8,23)  7x7          126585  56          102   132          954                  1512
20(8,11)  12x12        120094  30          106   139          1412                 4290

[Figure: Latency of QR using IEEE Single-precision Floating-point, Square Matrix]
The estimated latency of QR decomposition for a 7x7 matrix using a 1D array is 1512 clock cycles, longer than the 954 cycles of our 2D systolic array implementation.
The difference becomes more significant as matrix size grows.
Compared to the 1D implementation, the latency of our 2D implementation increases linearly with matrix size, thus scaling well for larger matrices.

[Figure: Example]
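To make the latency comparison concrete, the short Python sketch below works through the arithmetic using only the cycle counts and clock frequencies reported in the table. The helper names latency_ratio and latency_us are hypothetical. Note that the two rows use different floating-point formats, so the change in ratio between them reflects both matrix size and format, and the conversion to microseconds is shown only for the 2D design, since the poster does not report a clock rate for a 1D design.

```python
# (format, matrix dimension n, clock in MHz, 2D latency, 1D latency),
# copied from the "Resources and Speed" table.
RESULTS = [
    ("32(8,23)", 7, 132, 954, 1512),
    ("20(8,11)", 12, 139, 1412, 4290),
]


def latency_ratio(cycles_2d, cycles_1d):
    """How many times longer the estimated 1D latency is than the 2D latency."""
    return cycles_1d / cycles_2d


def latency_us(cycles, freq_mhz):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / freq_mhz


for fmt, n, freq, lat_2d, lat_1d in RESULTS:
    print(
        f"{fmt} {n}x{n}: 1D/2D = {latency_ratio(lat_2d, lat_1d):.2f}, "
        f"2D latency = {latency_us(lat_2d, freq):.2f} us"
    )
```

With these figures the 1D estimate is roughly 1.6 times the 2D latency at 7x7 and roughly 3 times at 12x12, which is consistent with the statement that the gap widens as the matrix grows.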