Floating-Point Divide and Square Root for Efficient FPGA Implementation of Image and Signal Processing Algorithms Xiaojun Wang, Miriam Leeser


This work was supported in part by Gordon-CenSSIS, the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the National Science Foundation (Award Number EEC ).

Abstract

Division and square root are important operations in many high-performance signal processing applications. We have implemented floating-point division and square root based on Taylor series for the variable-precision floating-point library developed at the Reconfigurable Computing Laboratory at Northeastern. Our results show that both operations are very well suited to FPGA implementation and give a good tradeoff between area and latency. We implemented a floating-point K-means clustering algorithm and applied it to multispectral satellite images. With the new fp_div module, the mean update moves from the host to the FPGA hardware, which reduces communication between the host and the FPGA board and further reduces the runtime. We are also working on QR factorization, which uses both floating-point divide and square root.

An Application: K-Means Clustering

Each cluster has a center (mean value)
- Initialized on host
- Initialization done once for complete image processing
Cluster assignment
- Distance (Manhattan norm) between each pixel and each cluster center
Accumulation of the pixel values of each cluster
Mean update by dividing the accumulator value by the number of pixels (done once per iteration)
- Previously done on host
- Moved to FPGA with fp_div

Conclusions

The library includes fully pipelined and parameterized hardware modules for floating-point arithmetic
The new modules fp_div and fp_sqrt have small area and low latency, and are easily pipelined
Applications using fp_div and fp_sqrt show a large speedup over software implementations

Research Level 1 Thrust R3A

This work is a part of CenSSIS Research Thrust R3A.

Due to inherent limitations of the fixed-point representation, it is desirable to perform arithmetic operations in floating-point format for many image and signal processing algorithms. Our goal is to develop a parameterized floating-point library for reconfigurable hardware to speed up image and signal processing algorithms such as remote sensing applications.

State of the Art

[1] P. Hung, H. Fahmy, O. Mencer, and M. J. Flynn, "Fast division algorithm with a small lookup table," Asilomar Conference, 1999
[2] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand, "Reciprocation, square root, inverse square root, and some elementary functions using small multipliers," IEEE Transactions on Computers, vol. 2, pp , 2000

Both algorithms are simple and elegant
- Based on Taylor series
- Use a small table lookup with small multipliers
Very well suited to FPGA implementations
- BlockRAM, distributed memory, embedded multipliers
Lead to a good tradeoff of area and latency
Can be fully pipelined
- Clock speed similar to all other components in the floating-point library

[Figure: CenSSIS research structure diagram (R1, R2, R3; Fundamental Science; Validating TestBEDs; L1, L2, L3; S1 to S5; Bio-Med; Enviro-Civil)]

This project is funded by Mercury Computer Systems, Inc.
Reconfigurable Computing Laboratory

[Figure: K-means hardware block diagram. Labels: Memory Acknowledge, Data Valid, Pixel Shift, Datapath (Subtraction, Addition, Comparison, Abs. Value, Validity), Accumulator, Cluster Assignment, Mean Update, Division]

Experimental Results

8 clusters, 8 channels, 8 bits per channel
37% of slices, 81% of BlockRAMs, and 44% of embedded multipliers of the Virtex-II XC2V6000
More than 2150x faster than the software implementation for the core computation only
11x faster than the software implementation, including the time to configure the FPGA and move data between the board and the host PC
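The K-means steps listed in the application panel (cluster assignment by Manhattan distance, per-cluster accumulation, and a mean update by division once per iteration) can be sketched in software as follows. This is only an illustrative Python model, not the FPGA design: the function names `manhattan` and `kmeans_iteration` are hypothetical, and keeping the old center for an empty cluster is an assumption the poster does not address.

```python
from typing import List, Sequence

Pixel = Sequence[float]


def manhattan(a: Pixel, b: Pixel) -> float:
    """Manhattan (L1) distance: sum of absolute per-channel differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans_iteration(pixels: List[Pixel], centers: List[List[float]]) -> List[List[float]]:
    """One iteration: assign pixels, accumulate, then update means by division."""
    k = len(centers)
    channels = len(centers[0])
    acc = [[0.0] * channels for _ in range(k)]  # per-cluster accumulators
    count = [0] * k                             # number of pixels per cluster

    for p in pixels:
        # Cluster assignment: nearest center under the Manhattan norm.
        best = min(range(k), key=lambda j: manhattan(p, centers[j]))
        count[best] += 1
        for c in range(channels):
            acc[best][c] += p[c]

    # Mean update: one division per cluster and channel, once per iteration.
    # Assumption: an empty cluster keeps its previous center.
    return [
        [acc[j][c] / count[j] for c in range(channels)] if count[j] else list(centers[j])
        for j in range(k)
    ]


pixels = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
centers = [[1.0, 1.0], [9.0, 9.0]]
new_centers = kmeans_iteration(pixels, centers)
```

The division in the last step is the only part of the loop that needs a divider, which is why a hardware fp_div lets the whole iteration stay on the FPGA instead of returning the accumulators to the host.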
Floating Point Divider and Square Root

FP Square Root on a XC2V6000 (the last two formats are the IEEE single- and double-precision floating-point formats)

Floating-point format        8 (2,5)   16 (4,11)   32 (8,23)   64 (11,52)
# of embedded multipliers    2%        4%          6%          16%
Clock period (ns)            6         7           8           10

Rows that survive only in part: # of slices (1%, 4%) and # of BlockRAMs (2%, 80%). The values for maximum frequency (MHz), latency (# of clock cycles), latency (ns), and throughput (million results/sec) are not preserved.

FP square root is small, has low latency and high throughput
The result for the FP divider is similar

Features of Mercury Atlanta Board:
- One Xilinx Virtex II XC2V FPGA (144 on-board BlockRAMs, 144 embedded multipliers)
- 12 MB DDR SRAM and 256 MB DDR SDRAM
- Dual-processor PCI module with two PowerPCs

QR Factorization

Givens rotation (c: cosine; s: sine)
Divide and square root are required
[Figure: Givens rotation example]

Technology Transfer

The floating-point library has been used by many users, including Los Alamos National Laboratory, Sandia National Laboratory, Kodak, Systron Donner, L3 Communications, and Magnetic Analysis Corp., since it was made available on the web.

[Figure: Clustered Output Image (K-means Clustering)]
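The QR Factorization panel states that a Givens rotation needs both divide and square root. A minimal software sketch of why, using the textbook definition of the rotation: the function name `givens` and the sign convention are assumptions for illustration, and the poster does not show how the hardware computes c and s.

```python
import math
from typing import Tuple


def givens(a: float, b: float) -> Tuple[float, float, float]:
    """Return (c, s, r) such that  c*a + s*b = r  and  -s*a + c*b = 0."""
    if b == 0.0:
        return 1.0, 0.0, a          # nothing to eliminate
    r = math.sqrt(a * a + b * b)    # the square root
    c = a / r                       # first division  (cosine)
    s = b / r                       # second division (sine)
    return c, s, r


# Rotate the vector (3, 4) so that its second component becomes zero.
c, s, r = givens(3.0, 4.0)
rotated = (c * 3.0 + s * 4.0, -s * 3.0 + c * 4.0)
```

Each rotation costs one square root and two divisions, so a QR factorization built from many rotations exercises both operations heavily.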