Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University.

Presentation transcript:

This project is supported in part by NSF awards ITR/NGS and SYS and a DARPA DESA program

Slide 2 The Paradox of Reusable IPs
Boon to productivity
- zero effort required
- zero knowledge required
- zero chance to introduce new bugs
Why repeat what has already been done?
Bane to optimality
- finding the right functionality with the right interface
- design tradeoff: performance, area, power, accuracy, ...
Are you getting what you really wanted?
Solution: parameterized automatic IP generators
- zero effort, knowledge or bugs
- allows application-specific customization
- facilitates design exploration

Slide 3 Our Work: Discrete Fourier Transform IPs
Discrete Fourier Transform (DFT)
- important building block in DSP applications
- numerous design "cores" available
Current IP libraries support:
- various sizes, number formats, data orderings
- only a small number of microarchitecture choices (Xilinx LogiCORE DFT gives 3 choices)
We generate IPs with custom design tradeoffs
- degree of parallelism in microarchitecture (min to max)
- resource preference (e.g. BRAM vs. slices in FPGAs)
Extensible to other common linear DSP transforms

Slide 4 Outline
- Introduction
- Formula-Driven Design Generation
- Microarchitecture Parameterization
- Generator User Interface
- Experimental Results
- Conclusions

Slide 5 Transforms as Formulas
Transform computation is represented as matrix-vector multiplication
- matrix-vector multiplication is O(n^2) operations
"Fast" algorithms factor the transform into a sequence of structured sparse matrices
- O(n log n) operations
[Formulas: the DFT matrix and its FFT factorization]
Datapath easily formed from factorized formulas
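
The contrast on this slide can be sketched in a few lines of Python. This is an illustration only, not the generator's code: it assumes the common sign convention w = exp(-2*pi*i/n), uses a textbook recursive radix-2 Cooley-Tukey split as the "fast" algorithm, and all function names are hypothetical.

```python
import cmath

def dft_direct(x):
    """DFT as a dense matrix-vector product: n*n complex multiplications."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)  # assumed sign convention
    return [sum(x[l] * w ** (k * l) for l in range(n)) for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t            # one butterfly produces two outputs
        out[k + n // 2] = even[k] - t
    return out

def direct_multiplies(n):
    """Multiplications in the dense product."""
    return n * n

def fft_butterflies(n):
    """Butterflies in a radix-2 FFT: n/2 per stage, log2(n) stages."""
    return (n // 2) * (n.bit_length() - 1)
```

For n = 1024 the dense product needs 1,048,576 multiplications, while the radix-2 factorization needs 5,120 butterflies.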

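Slide 6 Formula to Datapath
Given a formula, each construct in it maps to a piece of the datapath:
- product of two factors: apply one, then the other
- permutation: permute
- factor repeated in parallel: apply it that many times in parallel
- diagonal: scale
[Figure: datapath assembled from blocks A and B with constant multipliers]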
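
The mapping from formula constructs to datapath pieces can be sketched with four small combinators. The formula on the slide is not preserved in this transcript, so the example below substitutes the standard Cooley-Tukey factorization of DFT_4 as an assumed stand-in; the combinator names (`compose`, `permute`, `parallel`, `diagonal`) are hypothetical and not part of the generator.

```python
import cmath

def compose(*stages):
    """Product of factors: the rightmost factor is applied first."""
    def run(x):
        for stage in reversed(stages):
            x = stage(x)
        return x
    return run

def permute(source_index):
    """Permutation: output[i] = input[source_index[i]] (pure wiring)."""
    return lambda x: [x[j] for j in source_index]

def parallel(copies, block, width):
    """Apply `block` to `copies` consecutive chunks of `width` elements."""
    def run(x):
        out = []
        for c in range(copies):
            out.extend(block(x[c * width:(c + 1) * width]))
        return out
    return run

def diagonal(factors):
    """Diagonal matrix: scale each element by a constant."""
    return lambda x: [f * v for f, v in zip(factors, x)]

def dft2(x):
    """The 2-point butterfly."""
    return [x[0] + x[1], x[0] - x[1]]

# Assumed example formula (standard Cooley-Tukey, not taken from the slide):
# DFT_4 = L (I_2 (x) DFT_2) L . diag(1, 1, 1, w4) . (I_2 (x) DFT_2) . L
stride2 = permute([0, 2, 1, 3])          # stride permutation L on 4 points
w4 = cmath.exp(-2j * cmath.pi / 4)       # assumed sign convention
dft4 = compose(
    stride2, parallel(2, dft2, 2), stride2,   # DFT_2 (x) I_2, via wiring
    diagonal([1, 1, 1, w4]),                  # twiddle scaling
    parallel(2, dft2, 2),                     # two butterflies in parallel
    stride2,                                  # input permutation
)
```

Reading the argument list of `compose` from right to left gives the order in which data flows through the hardware from left to right.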

Slide 8 Example: Pease DFT
Simple regular structure embodied in formula
[Annotated formula: diagonal, permutation, butterfly, parallel; k stages]
[Figure: stage 1, stage 2, stage 3]

Slide 9 Pease DFT Example: DFT 8
[Figure: datapath with stage 1, stage 2, stage 3]
(formula is applied from right to left)
(datapath is built left to right)
Repeating column structure: hardware reuse without performance penalty
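
The repeating column structure can be illustrated with a constant-geometry radix-2 FFT in the style of Pease. The exact formula from Slide 8 is not preserved in this transcript, so this is one common formulation and an assumption, not necessarily the one the generator uses. Every stage reads inputs i and i + n/2 and writes outputs 2i and 2i + 1, so only the twiddle constants change from stage to stage. Function names are hypothetical.

```python
import cmath

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (index & 1)
        index >>= 1
    return out

def pease_fft(x):
    """Constant-geometry radix-2 FFT; output is in bit-reversed order.

    len(x) must be a power of two. Assumed convention: w = exp(-2*pi*i/n).
    """
    n = len(x)
    k = n.bit_length() - 1
    half = n // 2
    twiddle = [cmath.exp(-2j * cmath.pi * e / n) for e in range(half)]
    for s in range(k):                      # k identical-looking stages
        y = [0j] * n
        for i in range(half):               # n/2 butterflies per stage
            a, b = x[i], x[i + half]        # same read pattern every stage
            y[2 * i] = a + b                # same write pattern every stage
            y[2 * i + 1] = (a - b) * twiddle[(i >> s) << s]
        x = y
    return x
```

Because the wiring inside the loop over `s` never changes, a single hardware column can be reused for all k stages, which is the reuse the slide points at. The value at output position q is the DFT coefficient with index `bit_reverse(q, k)`.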

Slide 10 Horizontal folding
Our baseline design
Degree of freedom: vertical parallelism, parameter p
[Figure labels: input, bypass, register, p]

Slide 11 Vertical (V-)folding according to p
Fine-grained control over cost/latency tradeoff
[Plot: latency vs. cost]
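
A rough way to see the cost/latency tradeoff is to simulate a folded constant-geometry FFT that evaluates only p butterflies per modeled cycle. The sketch below is an idealized model under stated assumptions: it ignores pipeline depth, memory ports and permutation hardware, and its cycle counts are not measurements from the generated designs. The function name and the cycle model are hypothetical.

```python
import cmath

def folded_pease_fft(x, p):
    """Constant-geometry radix-2 FFT with p butterflies per modeled cycle.

    Returns (output in bit-reversed order, modeled cycle count).
    Assumes len(x) is a power of two >= 2 and that p divides n/2.
    """
    n = len(x)
    k = n.bit_length() - 1
    half = n // 2
    assert n >= 2 and (1 << k) == n, "n must be a power of two"
    assert 1 <= p <= half and half % p == 0, "p must divide n/2"
    twiddle = [cmath.exp(-2j * cmath.pi * e / n) for e in range(half)]
    cycles = 0
    for s in range(k):                        # one reused hardware column
        y = [0j] * n
        for start in range(0, half, p):       # one modeled cycle per group
            for i in range(start, start + p): # p butterflies side by side
                a, b = x[i], x[i + half]
                y[2 * i] = a + b
                y[2 * i + 1] = (a - b) * twiddle[(i >> s) << s]
            cycles += 1
        x = y
    return x, cycles
```

In this model the butterfly cost grows linearly with p while the cycle count is log2(n) * n / (2p). For n = 8 that gives 12 cycles at p = 1 and 3 cycles at p = 4.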

Slide 13 User Interface
[Screenshot annotations: common DFT options; customization options]

Slide 15 Evaluation
We compare Xilinx's fixed design against our variable generated designs
We compare against Xilinx LogiCORE DFT Ver. 3.1
- radix-4, burst I/O interface

                            Xilinx                           SPIRAL
datapath                    fixed, one radix-4 basic block   variable, p radix-2 basic blocks
cost-performance tradeoff   fixed                            user-controlled, varies with p

Comparison
- DFT n = {64, 1024, 2048}; width = 16; bit-reversed output
- Xilinx ISE ver. 6.1, Xilinx Virtex2-Pro XC2VP100-6

Slide 16 DFT 1024 relative to Xilinx
Performance and resources scale with p
[Charts: logic, storage, performance, each relative to Xilinx]
Normalization: logic 1.0 = 1955 slices; storage 1.0 = 7 BRAMs; performance 1.0 = 1 / 5.6 µsec

Slide 17 Resource usage preferences
[Charts: relative slices vs. p; relative BRAMs vs. p; speedup vs. p; Xilinx shown as reference]
Normalization: logic 1.0 = 1955 slices; storage 1.0 = 7 BRAMs; performance 1.0 = 1 / 5.6 µsec

Slide 18 Resource usage preferences
Can control tradeoff between slices and BRAMs
- exchange BRAM for slices: very little change in performance
[Charts: logic, storage, performance, each relative to Xilinx]
Normalization: logic 1.0 = 1955 slices; storage 1.0 = 7 BRAMs; performance 1.0 = 1 / 5.6 µsec

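Slide 19 DFT 64 and DFT 2048
Trends hold for sizes 64 and 2048
[Charts relative to Xilinx, one set per size]
Normalization, first chart set: 1.0 = 2140 slices; 1.0 = 7 BRAMs; 1.0 = 1 transform / µsec
Normalization, second chart set: 1.0 = 1743 slices; 1.0 = 8 BRAMs; 1.0 = 1 transform / µsec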

Slide 20 Related Work
Kumhom, Johnson, Nagvajara, ASIC/SOC 2000
- universal FFT processor microarchitecture based on processing elements interconnected by on-chip reconfigurable network
- microarchitecture is scalable in the number of elements
- supports both Cooley-Tukey and Pease
Choi, Scrofano, Prasanna, Jang, FPGA 2003
- mapped radix-4 Cooley-Tukey algorithm onto log2(n)/2 DFT_4 primitives
- scalable datapath between 1 element and 4 elements at a time
- show energy and performance improvements from scaling

Slide 21 Conclusions
Parameterized DFT IP generator
- matrix formula-driven synthesis
- performance/cost tradeoff: fine-grained control over resources vs. latency
- resource usage preference: can balance tradeoff between slices and BRAM
Key results
- efficient: the Xilinx design point can be matched
- customizable: design tradeoffs directly controllable
- easy to use: simple yet powerful web interface

Slide 22 Web Generator SPIRAL This work is part of the SPIRAL project, which aims to push the limits of automation in software and hardware development for DSP algorithms. For more information visit:

Slide 23 V-folding according to p (continued)
[Figure: folded datapaths for different values of p]
p_max = n/2
p_min = 1

Slide 24 V-Folding of Permutations [Takala et al., ICASSP 2001]