Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh.

Slides:



Advertisements
Similar presentations
WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s: Computer Hardware 1 WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s Computer Hardware Department of.
Advertisements

VSMC MIMO: A Spectral Efficient Scheme for Cooperative Relay in Cognitive Radio Networks 1.
Multi-dimensional Packet Classification on FPGA: 100Gbps and Beyond
Bio Michel Hanna M.S. in E.E., Cairo University, Egypt B.S. in E.E., Cairo University at Fayoum, Egypt Currently is a Ph.D. Student in Computer Engineering.
A Survey of Logic Block Architectures For Digital Signal Processing Applications.
Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.
Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.
A System Solution for High- Performance, Low Power SDR Yuan Lin 1, Hyunseok Lee 1, Yoav Harel 1, Mark Woh 1, Scott Mahlke 1, Trevor Mudge 1 and Krisztian.
1 Multi-Core Architecture on FPGA for Large Dictionary String Matching Department of Computer Science and Information Engineering National Cheng Kung University,
A Programmable Coprocessor Architecture for Wireless Applications Yuan Lin, Nadav Baron, Hyunseok Lee, Scott Mahlke, Trevor Mudge Advance Computer Architecture.
Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.
Storage Assignment during High-level Synthesis for Configurable Architectures Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.
Flow Algorithms for Two Pipelined Filtering Problems Anne Condon, University of British Columbia Amol Deshpande, University of Maryland Lisa Hellerstein,
1 Design and Implementation of Turbo Decoders for Software Defined Radio Yuan Lin 1, Scott Mahlke 1, Trevor Mudge 1, Chaitali.
Low power and cost effective VLSI design for an MP3 audio decoder using an optimized synthesis- subband approach T.-H. Tsai and Y.-C. Yang Department of.
Juanjo Noguera Xilinx Research Labs Dublin, Ireland Ahmed Al-Wattar Irwin O. Irwin O. Kennedy Alcatel-Lucent Dublin, Ireland.
An Energy-Efficient Reconfigurable Multiprocessor IC for DSP Applications Multiple programmable VLIW processors arranged in a ring topology –Balances its.
Matrix Multiplication on FPGA Final presentation One semester – winter 2014/15 By : Dana Abergel and Alex Fonariov Supervisor : Mony Orbach High Speed.
A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.
Multi-dimensional Packet Classification on FPGA 100 Gbps and Beyond Author: Yaxuan Qi, Jeffrey Fong, Weirong Jiang, Bo Xu, Jun Li, Viktor Prasanna Publisher:
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.
Fast Memory Addressing Scheme for Radix-4 FFT Implementation Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Xin Xiao, Erdal Oruklu and.
CHANNEL ESTIMATION FOR MIMO- OFDM COMMUNICATION SYSTEM PRESENTER: OYERINDE, OLUTAYO OYEYEMI SUPERVISOR: PROFESSOR S. H. MNENEY AFFILIATION:SCHOOL OF ELECTRICAL,
Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.
RICE UNIVERSITY “Joint” architecture & algorithm designs for baseband signal processing Sridhar Rajagopal and Joseph R. Cavallaro Rice Center for Multimedia.
Design of a High-Throughput Low-Power IS95 Viterbi Decoder Xun Liu Marios C. Papaefthymiou Advanced Computer Architecture Laboratory Electrical Engineering.
Communication and Computation on Arrays with Reconfigurable Optical Buses Yi Pan, Ph.D. IEEE Computer Society Distinguished Visitors Program Speaker Department.
A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS.
StrideBV: Single chip 400G+ packet classification Author: Thilan Ganegedara, Viktor K. Prasanna Publisher: HPSR 2012 Presenter: Chun-Sheng Hsueh Date:
Rinoy Pazhekattu. Introduction  Most IPs today are designed using component-based design  Each component is its own IP that can be switched out for.
Presenter:梁正勳 Number:
Author : Weirong Jiang, Yi-Hua E. Yang, and Viktor K. Prasanna Publisher : IPDPS 2010 Presenter : Jo-Ning Yu Date : 2012/04/11.
Memory-Efficient and Scalable Virtual Routers Using FPGA Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan,
A Flexible Interleaved Memory Design for Generalized Low Conflict Memory Access Laurence S.Kaplan BBN Advanced Computers Inc. Cambridge,MA Distributed.
Fast Lookup for Dynamic Packet Filtering in FPGA REPORTER: HSUAN-JU LI 2014/09/18 Design and Diagnostics of Electronic Circuits & Systems, 17th International.
PMLAB, IECS, FCU Designing Efficient Matrix Transposition on Various Interconnection Networks Using Tensor Product Formulation Presented by Chin-Yi Tsai.
Fast VLSI Implementation of Sorting Algorithm for Standard Median Filters Hyeong-Seok Yu SungKyunKwan Univ. Dept. of ECE, Vada Lab.
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
Re-configurable Bus Encoding Scheme for Reducing Power Consumption of the Cross Coupling Capacitance for Deep Sub-micron Instructions Bus Siu-Kei Wong.
An FFT for Wireless Protocols Dr. J. Greg Nash Centar ( HAWAI'I INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES Mobile.
1 Architecture of Datapath- oriented Coarse-grain Logic and Routing for FPGAs Andy Ye, Jonathan Rose, David Lewis Department of Electrical and Computer.
Hang Zhang1, Xuhao Chen1, Nong Xiao1,2, Fang Liu1
Fang Fang James C. Hoe Markus Püschel Smarahara Misra
Presenter: Darshika G. Perera Assistant Professor
Author: Yun R. Qu, Shijie Zhou, and Viktor K. Prasanna Publisher:
Lecture 4 Sorting Networks
Hiba Tariq School of Engineering
School of Engineering University of Guelph
Low-power Digital Signal Processing for Mobile Phone chipsets
High-throughput Online Hash Table on FPGA
A Quantitative Analysis of Stream Algorithms on Raw Fabrics
FPGAs in AWS and First Use Cases, Kees Vissers
Chapter 8 Arrays Objectives
DESIGN AND IMPLEMENTATION OF DIGITAL FILTER
Accelerating MapReduce on a Coupled CPU-GPU Architecture
Supporting Fault-Tolerance in Streaming Grid Applications
Verilog to Routing CAD Tool Optimization
Hyunchul Park, Kevin Fan, Manjunath Kudlur,Scott Mahlke
Jian Huang, Matthew Parris, Jooheung Lee, and Ronald F. DeMara
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
High Throughput LDPC Decoders Using a Multiple Split-Row Method
Scalable Memory-Less Architecture for String Matching With FPGAs
COMP60621 Fundamentals of Parallel and Distributed Systems
Final Project presentation
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Power-efficient range-match-based packet classification on FPGA
Design and Analysis of Algorithms
COMP60611 Fundamentals of Parallel and Distributed Systems
Presentation transcript:

Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering Presented by: Ajitesh Srivastava, Department of Computer Science University of Southern California Ganges.usc.edu/wiki/TAPAS

 Introduction  Background and Related Work  High Throughput and Energy Efficient Design  Experimental Results  Conclusion and Future Work Outline 2

3 Permutation Sorting networkFFT network

 Key Algorithms: FFT, sorting, Viterbi decoding, etc. 4 Related Applications Frequency domain in images Image filtering Audio analysis Bitonic sort Partial differential equations OFDM System

5 Data Permutation in Conventional Architectures  Permutation by wires  Parallel architecture  Permutation by memory or registers  Pipeline architecture  Shared memory architecture

6 Data Permutation in Streaming Architectures  Streaming architecture  High data parallelism  High design throughput  Simple control scheme  No requirement on data input/ output order

7 Problem Definition

 Introduction  Background and Related Work  High Throughput and Energy Efficient Design  Experimental Results  Conclusion and Future Work Outline 8

9 Related Work (JVSP ’07, T. JARVINEN)  For stride permutation on array processor  Flexible data parallelism  Mathematical formulation

10 Related Work (DAC ’12, M. Zuluaga and M. Püschel)  Domain-specific language based  Hardware generator for data permutations in sorting

11 Proposed Design Approach

 Introduction  Background and Related Work  High Throughput and Energy Efficient Design  Experimental Results  Conclusion and Future Work Outline 12

13 Benes Network

14 Architecture Overview

15 Generating the Datapath (1)

16 Generating the Datapath (2)

17 Optimization of Interconnection Complexity (1)

18 Optimization of Interconnection Complexity (2)  Key Idea  A null switch can be replaced by wires  A null switch: always in either pass or cross state

19 Optimization of Interconnection Complexity (3)

20 Optimization of Interconnection Complexity (4)

21 Optimization of Interconnection Complexity (5)  Heuristic Routing Algorithm  Call RT procedure to compute configuration bits  Search for the solution having maximum number of null switches

22 Resource Consumption Summary

Introduction Background and Related Work High Throughput and Energy Efficient Design Experimental Results Conclusion and Future Work Outline 23

 Throughput  Defined as the number of bits permuted per second (Gbits/s)  Product of number of data elements permuted per second and data width per element  Energy efficiency  Defined as the number of bits permuted per unit energy consumption (Gbits/Joule)  Calculated as the throughput divided by the average power consumption 24 Performance metrics

 Platform and tools  Xilinx Virtex-7 XC7VX980T, speed grade -2L  Xilinx Vivado and Vivado Power Analyzer  Input vectors for simulation randomly generated  Performance metrics  Resource consumption  Throughput  Energy efficiency 25 Experimental Setup

 BRAM consumption of the proposed design  [9] uses BRAM-LUT mostly for realizing the 2p memory blocks  Our design consumes the same amount of BRAMs (O(p)) compared with [8]  Note designs in [8] only realize bit index permutations 26 Experimental Results (1)

27 Experimental Results (2)

28 Experimental Results (3)

 Throughput performance of the proposed design  Our designs achieve  Up to 73.2% throughput improvement compared with [8]  Up to 129% throughput improvement compare with [9] 29 Experimental Results (5)

 Energy efficiency comparison  2.1x~3.5x energy efficiency improvement compared with the state-of-the-art in [9]  1.2x~1.5x energy efficiency improvement compared with the state-of-the-art in [8] 30 Experimental Results (6)

Conclusion and Future Work 31  Conclusion  Streaming data permutation architecture  Scalable with data parallelism and problem size  Efficient data permutation realization  Highly optimized with interconnection complexity  High throughput and resource efficient  Future work  Automatic generation of high throughput resource efficient signal and data processing kernels

32 Thanks! Questions? Ganges.usc.edu/wiki/TAPAS