Folding Technique: Compromising in Special Purpose Hardware Design

Presentation transcript:

Folding Technique: Compromising in Special Purpose Hardware Design
Ivan Milentijevic
Faculty of Electronic Engineering, University of Nis, Serbia and Montenegro
milentijevic@elfak.ni.ac.yu

Outline
- Special purpose hardware design and DSP
- DSP application demands and technologies
- Representations of DSP algorithms and architectures
- Compromising
- Folding technique
- Simple example – ad hoc folding
- Folding equation and mathematical background
- Preparation of source architecture for folding – problems
- Case study 1: Folded Bit-Serial Multiplier
- Case study 2: Configurable Folded FIR Filter Architecture

Special purpose hardware design and DSP
DSP systems can be realized using programmable processors or custom-designed hardware circuits fabricated using very-large-scale-integrated (VLSI) circuit technology. Two important features that distinguish DSP from other general purpose computations are the real-time throughput requirement and the data-driven property.

DSP application demands and technologies

Representations of DSP algorithms and architectures
DSP algorithms are initially described by mathematical formulas. System architecture can be described by:
- Behavioral languages
  - Applicative: a set of equations (not actions)
  - Prescriptive: describe assignments
  - Descriptive: VHDL, Verilog, ...
- Graphical representations
  - BD (block diagram)
  - SFG (signal flow graph)
  - DFG (data flow graph)
  - DG (dependence graph)

Representations of DSP algorithms and architectures Block diagram of a 3-tap FIR filter

Representations of DSP algorithms and architectures Signal Flow Graph of a 3-tap FIR filter

Representations of DSP algorithms and architectures Data Flow Graph of a 3-tap FIR filter

Representations of DSP algorithms and architectures Dependence Graph of a 3-tap FIR filter

Compromising
Area – Time – Power
Area – Time
Goal: to meet the time (throughput) requirements with optimal chip area or optimal use of chip resources.

Folding technique
The performance and cost of any digital circuit depend on the circuit design style. Therefore, when creating a given architecture, a careful choice of circuit design style is necessary to establish an optimal area-time-power tradeoff. In synthesizing DSP architectures, it is important to minimize the silicon area of the integrated circuits, which is achieved by reducing the number of functional units (such as multipliers and adders), registers, multiplexers, and interconnection wires.

Folding technique
How? By executing multiple algorithm operations on a single functional unit, the number of functional units in the implementation is reduced, resulting in an integrated circuit with a smaller silicon area.

Simple example – ad hoc folding
Two addition operations are folded onto a single adder: folding factor* N=2
*Folding factor: the number of algorithm operations folded onto a single functional unit

Clk   L.input      Up.input   Output
0     a(0)         b(0)       -
1     a(0)+b(0)    c(0)
2     a(1)         b(1)       a(0)+b(0)+c(0)
3     a(1)+b(1)    c(1)
4     a(2)         b(2)       a(1)+b(1)+c(1)
5     a(2)+b(2)    c(2)
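The schedule above can be replayed in software. The following is a minimal sketch, assuming the computation is y(n) = a(n) + b(n) + c(n) and that the single adder has one register at its output; the function name `folded_adder` and the trace layout are illustrative and do not come from the slides.

```python
def folded_adder(a, b, c):
    """Simulate y(n) = a(n) + b(n) + c(n) on ONE adder, folding factor N = 2.

    Assumption: the adder result is stored in one output register, so a
    result computed in clock t is visible in clock t + 1.
    """
    N = 2
    reg = None      # register at the adder output
    outputs = []    # completed sums a(n) + b(n) + c(n)
    trace = []      # rows of (clk, lower input, upper input, output)
    for clk in range(N * len(a)):
        n, phase = divmod(clk, N)
        # A finished sum is visible at the start of every even clock except clock 0.
        out = reg if (phase == 0 and clk > 0) else None
        if out is not None:
            outputs.append(out)
        if phase == 0:
            lower, upper = a[n], b[n]   # first addition: a(n) + b(n)
        else:
            lower, upper = reg, c[n]    # second addition: (a(n) + b(n)) + c(n)
        trace.append((clk, lower, upper, out))
        reg = lower + upper
    outputs.append(reg)                 # last sum appears one clock after the loop
    return outputs, trace
```

Calling `folded_adder([1, 2, 3], [10, 20, 30], [100, 200, 300])` produces one finished sum every two clocks, which is the price paid for using one adder instead of two.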

Folding equation and math. background
- K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, Inc., New York, 2000.
- T. C. Denk, K. K. Parhi, "Synthesis of Folded Pipelined Architectures for Multirate DSP Algorithms," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, Dec. 1998, pp. 595-607.
The folding transformation provides a systematic technique for designing control circuits in folded systems.

Folding equation and math. background
Figure: an edge U → V with w(e) delays, and the corresponding folded data path.
The data begin at the functional unit $H_U$, which has $P_U$ pipelining stages, pass through $D_F(U \to V)$ delays, and are switched into the functional unit $H_V$ at the time instances $Nl + v$, where N is the number of operations folded to a single functional unit (folding factor), while u and v are the folding orders of nodes U and V that satisfy $0 \le u, v \le N-1$.
Folding equation: $D_F(U \to V) = N\,w(e) - P_U + v - u$

Folding equation and math. background
A folding set, S, is defined as an ordered set of operations, which contains N entries, executed by the same functional unit. For a folded system to be realizable, $D_F(U \to V) \ge 0$ must hold for all of the edges in the DFG.
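As a rough illustration of the realizability check, the sketch below evaluates the folding equation in the form used by the cited Parhi text, D_F(U→V) = N·w(e) − P_U + v − u, for every edge of a small DFG. The two-node graph in the usage note, the node names, and the function names are hypothetical and serve only as an example.

```python
def folding_delay(N, w, P_U, u, v):
    """Delays on the folded edge: D_F(U->V) = N*w(e) - P_U + v - u."""
    return N * w - P_U + v - u


def check_folding(N, edges, pipeline, order):
    """Evaluate every edge of a DFG.

    edges    : list of (U, V, w) with w = number of delays on the edge
    pipeline : dict node -> pipelining stages of its functional unit (P_U)
    order    : dict node -> folding order (0 .. N-1)
    Returns (list of (U, V, D_F), realizable flag).
    """
    delays = [
        (U, V, folding_delay(N, w, pipeline[U], order[U], order[V]))
        for U, V, w in edges
    ]
    realizable = all(d >= 0 for _, _, d in delays)
    return delays, realizable
```

For a hypothetical loop A → B (no delay) and B → A (one delay) with N = 2, one pipeline stage per unit, and folding orders A = 0, B = 1, both folded edges need zero delays, so the folding is realizable. Swapping the two folding orders makes the A → B edge negative, which signals that retiming or a different folding set is needed.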

Preparation of source architecture for folding – problems Important question: How to prepare the source architecture / DFG for the successful application of folding technique?

Preparation of source architecture for folding – problems
Once valid folding sets have been assigned, retiming can be used either to satisfy the realizability condition (a non-negative number of delays on every folded edge) or to determine that the folding sets are not feasible. Retiming is a transformation technique used to change the locations of delay elements without affecting the I/O characteristics of the circuit. Retiming in synchronous circuit design can be directed towards:
- reducing the clock period,
- reducing the number of registers,
- reducing the power consumption, etc.

Preparation of source architecture for folding – problems
- Using the folding equations, a set of retiming inequalities can be obtained.
- A solution for retiming the architecture can be found by mapping the set of inequalities onto a constraint graph. Algorithms: Bellman-Ford or Floyd-Warshall.
- Assignment of folding sets to the functional units of the retimed graph.
- Rechecking of the folding condition.
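The steps above can be sketched in code. This is a minimal illustration, not the author's tool: it assumes the standard retiming relation w_r(e) = w(e) + r(V) − r(U), so that requiring a non-negative folded delay after retiming gives the inequality r(U) − r(V) ≤ floor(D_F(U→V) / N) for each edge. The inequalities are solved with Bellman-Ford on the constraint graph; all function and variable names are hypothetical.

```python
def retime_for_folding(N, nodes, edges, pipeline, order):
    """Solve r(U) - r(V) <= floor(D_F(U->V) / N) with Bellman-Ford.

    Returns a dict node -> retiming value, or None if the inequalities
    have no solution (negative cycle in the constraint graph).
    """
    constraints = []
    for U, V, w in edges:
        d = N * w - pipeline[U] + order[V] - order[U]   # folding equation
        # r(U) - r(V) <= c becomes a constraint-graph edge V -> U of weight c.
        constraints.append((V, U, d // N))              # // is floor division
    # Starting every distance at 0 is the same as adding a virtual source
    # node with a zero-weight edge to every node.
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for src, dst, wt in constraints:
            if dist[src] + wt < dist[dst]:
                dist[dst] = dist[src] + wt
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after len(nodes) passes: negative cycle


def retimed_folding_delays(N, edges, pipeline, order, r):
    """Recheck the folding condition on the retimed graph."""
    return [
        (U, V, N * (w + r[V] - r[U]) - pipeline[U] + order[V] - order[U])
        for U, V, w in edges
    ]
```

The second function corresponds to the "rechecking" step: after retiming, every returned delay should be non-negative. Floyd-Warshall could be used in place of Bellman-Ford; the choice here is only for brevity.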

Case study 1: Folded Bit-Serial Multiplier
In public-key cryptography, special features are required of multiplier units:
- For RSA encryption and decryption, large integers (typically 1024 bits or more) must be multiplied.
- For elliptic curve cryptosystems, multiplication in finite fields is required.

Case study 1: Folded Bit-Serial Multiplier
Source architecture: basic serial-parallel-serial multiplier

Case study 1: Folded Bit-Serial Multiplier a3b0 a2b0 a1b0 a0b0

Case study 1: Folded Bit-Serial Multiplier a3b0 a2b0 a1b0 a0b0 + + + a3b1 a2b1 a1b1 a0b1

Case study 1: Folded Bit-Serial Multiplier a3b0 a2b0 a1b0 + + + a3b1 a2b1 a1b1 a0b1 + + + a3b2 a2b2 a1b2 a0b2

Case study 1: Folded Bit-Serial Multiplier a3b0 a2b0 + + a3b1 a2b1 a1b1 + + + a3b2 a2b2 a1b2 a0b2 + + + a3b3 a2b3 a1b3 a0b3

Case study 1: Folded Bit-Serial Multiplier

Case study 1: Folding Set Assignment
$(S_{k-1}|N-1)$ $(S_{k-1}|N-2)$ ... $(S_{k-1}|0)$ $(S_{k-2}|N-1)$ ... $(S_0|1)$ $(S_0|0)$

Case study 1: Folding Equations
Two neighboring nodes that are folded onto one node:
$D_F(i \to i+1) = N \times 1 - 0 + 0 - 1 = N-1$, $i = N-1, 2N-1, \dots, L_a-N-1$
Neighboring nodes that are folded onto different nodes:
$D_F(i \to i+1) = N \times 1 - 0 + (N-1) - 0 = 2N-1$, $i \ne N-1, 2N-1, \dots, L_a-N-1$
Carry data paths:
$D_F(i \to i) = N \times 1 - 0 + 0 - 1 = N-1$, $i = 0, 1, \dots, L_a-1$
Realizability condition: $D_F(U \to V) \ge 0$
The folded architecture contains max(N-1, 2N-1) = 2N-1 latches between nodes j and j+1, which are used for data buffering.

Case study 1: Folded Architecture

Case study 1: Functional Description for case N=2, k=2

Case study 1: Functional Description a2b0 a0b0

Case study 1: Functional Description a3b0 a2b0 a1b0 a0b0

Case study 1: Functional Description a2b1 + a3b0 a3b0 a2b0 a0b1 + a1b0 a1b0 a0b0

Case study 1: Functional Description a3b1 a2b1 + a3b0 a3b0 a1b1 + a2b0 a0b1 + a1b0 a1b0

Case study 1: Functional Description a2b2 + a3b1 a3b1 a2b1 + a3b0 a0b2 + a2b0 a1b1 a2b0 + a1b1 a0b1 + a1b0

Case study 1: Implementation of Folded Architecture (Spartan II xc2s2000-5pq208)
Table: basic multiplier vs. folded multiplier. Columns: Op. length, Folding factor, No. of PEs, Slices used, Clock period [ns].
Values as extracted (row and column assignment not recoverable): 7.841 12 2 64 7.855 4 32 8.056 16 8 8.129 128 7.003 6 7.202 7.211 6.682 4.367 4.214 4.590 4.580 4.502 4.785 4.898 4.706 3.803 4.105 4.088 1 3.957

Case study 1: Conclusions
- Folding makes it possible to find an optimal area-time solution for the given requirements.
- The Spartan II "shift register" property was used to relax the constraints caused by the relatively large number of latches in the folded architecture.
- The generated architecture has kept almost all desirable features of the source bit-serial architecture.
- The reduction of active arithmetic elements by a factor of N is achieved at the cost of execution time.

Case study 2:
- Cellular-phone technology is changing rapidly.
- There is an increasing number of wireless-communications standards, including variants of the IEEE 802.11 wireless LAN specification, etc.
- Traditionally, devices need a separate chip to work with each standard.
- Providers differentiate themselves by offering new features, such as multimedia capabilities.
- Providing each feature typically requires a separate chip or, in essence, multiple circuitry systems physically joined on a piece of silicon.

Case study 2: The additional circuitry adds cost, takes up space, increases power usage in mobile devices, and increases product-design time.

Case study 2: Configurable Folded FIR Filter Architecture
The synthesis of a configurable folded bit-plane architecture for FIR filtering.
Why?
- Wider application area
- Finding suitable A-T (area-time) tradeoffs
- Increasing the versatility of folded systems

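Case study 2: FIR filtering
Output words $\{y_i\}$ of the FIR filter are computed as
$y_i = \sum_{j=0}^{k-1} c_j x_{i-j} = \sum_{j=0}^{k-1} \sum_{l=0}^{m-1} 2^l c_{jl} x_{i-j}$
where $c_j$ are coefficients while $\{x_i\}$ are input words.
- m: coefficient word length
- k: number of taps
- $c_{jl}$: bit l of coefficient $c_j$ (with weight $2^l$)
- n: input word length
The bit-plane architecture (BPA) is obtained by re-sorting the partial products of the different multipliers.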
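The bit-level decomposition can be checked numerically. The sketch below assumes unsigned coefficients, so that coefficient bit l carries weight 2^l, and treats inputs before x_0 as zero; it accumulates one partial product per (bit plane, tap) pair and compares the result with the direct sum. The function names are illustrative, and the loop order is a software convenience, not a model of the hardware schedule.

```python
def fir_direct(c, x):
    """y_i = sum_j c_j * x_(i-j), with x treated as zero before index 0."""
    k = len(c)
    return [
        sum(c[j] * x[i - j] for j in range(k) if i - j >= 0)
        for i in range(len(x))
    ]


def fir_bit_plane(c, x, m):
    """Same filter, built from single-bit partial products.

    Assumption: coefficients are unsigned m-bit integers, so bit l of
    coefficient c_j contributes 2**l * bit * x_(i-j).
    """
    k = len(c)
    y = []
    for i in range(len(x)):
        acc = 0
        for l in range(m):              # one bit plane per coefficient bit
            for j in range(k):          # one partial product per tap
                if i - j >= 0:
                    bit = (c[j] >> l) & 1
                    acc += (bit * x[i - j]) << l
        y.append(acc)
    return y
```

With k = 3 taps and m = 4 coefficient bits, each output word is the sum of up to 12 partial products, which is the set of operations that the later slides distribute over folding sets.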

Case study 2: Bit-plane FIR filter architecture
- highly regular architecture
- allows extensive pipelining
- regular layout
- high computational throughput
- truncation of LSBs of intermediate results without any loss of accuracy
- programmability of coefficients
[Noll 1986], [Reuver & Klar 1992]

Case study 2: DFG for the source BPA for case k=3 and m=4

Case study 2: Assignment of folding sets
$(S_s, r)$, where $s = p \bmod k$ and $r = p \bmod N$

Case study 2: Assignment of folding sets
Folding set assignment makes it possible to change the operations in the folding sets. Different operations can be mapped onto different hardware units in a fixed array structure. There are k folding sets, and each folding set contains N operations. For the number of coefficients, kc, and the coefficient length, mc, the total number of operations, L, is: L = kc mc = k N
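A small sketch can make the assignment rule concrete. It takes the formulas s = p mod k and r = p mod N literally for positions p = 0 .. kN-1, and lists the (kc, mc) pairs that satisfy L = kc mc = k N. The function names are hypothetical. One observation about this sketch, not a claim from the slides: taken literally, the two mod rules fill every (set, order) slot exactly once only when k and N have no common factor, as in the k = 3, N = 4 case used later.

```python
def assign_folding_sets(k, N):
    """Map each position p in 0 .. k*N-1 to (s, r) = (p mod k, p mod N)."""
    return {p: (p % k, p % N) for p in range(k * N)}


def is_valid_assignment(k, N):
    """True if every (folding set, folding order) slot is used exactly once."""
    slots = set(assign_folding_sets(k, N).values())
    return len(slots) == k * N


def candidate_configurations(k, N):
    """All (kc, mc) pairs with kc * mc = k * N.

    These are only arithmetic candidates; which of them a given
    array actually supports is not decided here.
    """
    L = k * N
    return [(kc, L // kc) for kc in range(1, L + 1) if L % kc == 0]
```

For k = 3 and N = 4 the candidates include (1, 12), (2, 6), and (3, 4), which match the configurations named on the retiming and block-diagram slides.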

Case study 2: Folding equations and retiming
General form of the folding equations:
$D_F(p \to p+1) = N\,w(e) - 0 + [(p+1) \bmod N] - [p \bmod N]$
The condition $D_F(U \to V) \ge 0$ is not satisfied for neighboring nodes U and V when, for the position p of node U, the following holds: $p \bmod (N-1) = 0$ or $p \bmod N = 0$.

Case study 2: Folding equations and retiming General form of retiming inequalities: The general form of solution for r(p) is: The existence of this solution provides the retiming of DFG and allows the application of folding technique.

Case study 2: Graphical representations of retiming a) kc=1, mc=L; b) kc=3, mc=L/3

Case study 2: Life cycle analysis

Case study 2: General allocation table

Case study 2: Module for input data entering

Case study 2: Folded FIR filter architecture

Case study 2: Functional block diagram k=3, N=4, kc=2 and mc=6

Case study 2: Data flow for folded architecture k=3, N=4, kc=2, mc=6
$y_0 = 2^0 c_{00} x_0 + 2^1 c_{01} x_0 + 2^2 c_{02} x_0 + 2^3 c_{03} x_0 + 2^4 c_{04} x_0 + 2^5 c_{05} x_0 = c_0 x_0$
$y_1 = 2^0 c_{10} x_0 + 2^1 c_{11} x_0 + 2^2 c_{12} x_0 + 2^3 c_{13} x_0 + 2^4 c_{14} x_0 + 2^5 c_{15} x_0 + 2^0 c_{00} x_1 + 2^1 c_{01} x_1 + 2^2 c_{02} x_1 + 2^3 c_{03} x_1 + 2^4 c_{04} x_1 + 2^5 c_{05} x_1 = c_1 x_0 + c_0 x_1$
$y_2 = 2^0 c_{10} x_1 + 2^1 c_{11} x_1 + 2^2 c_{12} x_1 + 2^3 c_{13} x_1 + 2^4 c_{14} x_1 + 2^5 c_{15} x_1 + 2^0 c_{00} x_2 + 2^1 c_{01} x_2 + 2^2 c_{02} x_2 + 2^3 c_{03} x_2 + \dots = c_1 x_1 + c_0 x_2$
$y_3 = \dots = c_1 x_2 + c_0 x_3$

Case study 2: Implementation - Chip occupation as a function of maximal folding factor Nmax (Spartan II xc2s2000-5pq208 )

Case study 2: Implementation - Throughput as a function of chosen folding factor

Case study 2: Conclusions
- Folding set assignment supports changing the operations in the folding sets.
- The prerequisites for applying the folding technique are satisfied.
- Using the proposed folding set assignment, different operations can be mapped onto different hardware units in the fixed-structure array.

Case study 2: Conclusions
- The derived folded processing array can be configured to perform FIR filtering with different numbers of taps and coefficient lengths.
- The synthesized architecture has kept desirable features of the source architecture, such as extensive pipelining, high regularity, and truncation of LSBs of intermediate results without any loss of accuracy.
- The number of basic cells is reduced to the number of basic cells in one plane of the source architecture.
- The obtained folded semi-systolic architecture is presented by a DFG, an allocation table, and a data flow diagram.