Folding Technique: Compromising in Special Purpose Hardware Design

Name: Folding Technique: Compromising in Special Purpose Hardware Design
Uploaded: 2017-08-13T22:11:36+00:00
Duration: PTM23S27
Channel: Madison Howard
Description: Folding Technique: Compromising in Special Purpose Hardware Design

Folding Technique: Compromising in Special Purpose Hardware Design
Ivan Milentijevic Faculty of Electronic Engineering University of Nis Serbia and Montenegro

Outline Special purpose hardware design and DSP
DSP application demands and technologies Representations of DSP algorithms and architectures Compromising Folding technique Simple example – ad hoc folding Folding equation and mathematical background Preparation of source architecture for folding – problems Case study 1: Folded Bit-Serial Multiplier Case study 2: Configurable Folded FIR Filter Architecture

Special purpose hardware design and DSP
DSP systems can be realized using programmable processors or custom designed hardware circuits fabricated using very-large-scale-integrated (VLSI) circuit technology Two import features that distinguish DSP from other general purpose computations are real-time throughput requirement and data driven property.

DSP application demands and technologies

Representations of DSP algorithms and architectures
DSP algorithms are initially described by mathematical formulas. System architecture can be described by Behavioral languages Graphical representations - BD - SFG - DFG - DG Applicative - set of equations (not actions) Prescriptive - describe assigments Descriptive -VHDL, Verilog,...

Block diagram of a 3-tap FIR filter

Signal Flow Graph of a 3-tap FIR filter

Data Flow Graph of a 3-tap FIR filter

Dependence Graph of a 3-tap FIR filter

Compromising Area – Time – Power Area – Time Goal:
to achieve time (throughput) requirements with optimal chip area or optimal using of chip resources

Folding technique Performances and cost of any digital circuit depend on circuit design style. Therefore, creating a given architecture, to establish optimal area-time-power tradeoff, a careful choice of circuit design style to use is necessary. In synthesizing DSP architectures, it is important to minimize the silicon area of the integrated circuits, which is achieved by reducing the number of functional units (such as multipliers and adders), registers, multiplexers, and interconnection wires.

Folding technique How? By executing multiple algorithm operations on a single functional unit, the number of functional units in the implementation is reduced, resulting in integrated circuit with low silicon area.

Simple example – ad hoc folding
Two addition operations are folded to a single adder: Folding factor* N=2 *Folding factor - the number of algorithm operations folded to a single functional unit

Clk L.input Up.input Output a(0) b(0) - 1 a(0)+b(0) c(0) 2 a(1) b(1) a(0)+b(0)+c(0) 3 a(1)+b(1) c(1) 4 a(2) b(2) a(1)+b(1)+c(1) 5 a(2)+b(2) c(2)

Folding equation and math. background
K. K. Parhi, VLSI Digital Signal Processing Systems (Design and Implementation), John Wiley & Sons, In., New York, 2000. T. C. Denk, K. K. Parhi, Synthesis of Folded Pipelined Architectures for Multirate DSP Algorithms, IEEE Transaction on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, Dec. 1998, pp The folding transformation provides a systematic technique for designing of control circuits in folded systems.

An edge with w(e) delays The corresponding folded data path The data begin at the functional unit , which has pipelining stages, pass through delays, and are switched into the functional unit at the time instances , where N is the number of operations folded to a single functional unit (folding factor), while u and v are the folding orders of nodes U and V that satisfy

A folding set, S, is defined as an ordered set of operations, which contains N entries, executed by the same functional unit. For a folded system to be realizable must hold for all of the edges in the DFG.

Preparation of source architecture for folding – problems
Important question: How to prepare the source architecture / DFG for the successful application of folding technique?

Once valid folding sets have been assigned, retiming can be used to satisfy this property or determine that the folding sets are not feasible. Retiming is a transformation technique used to change the locations of delay elements without affecting the I/O characteristics of the circuit. Retiming in synchronous circuit design can be directed towards: Reducing the clock period, Reducing the number of registers, Reducing the power consumption, etc.

Using folding equations, a set of retiming inequalities can be obtained Solution for architecture retiming can be found by mapping of set of inequalities onto constraint graph. Algorithms: Bellman-Ford or Floyd-Warshall Assigment of folding sets on functional units of retimed graph Rechecking of folding condition

Case study 1: Folded Bit-Serial Multiplier
Public-key cryptography special features are required for multiplier units. RSA encryption and decryption, large integers (typically 1024 bits or more) must be multiplied, Elliptic curve cryptosystems, a multiplication in finite fields is required.

Source architecture: basic serial-parallel-serial multiplier
Case study 1: Folded Bit-Serial Multiplier Source architecture: basic serial-parallel-serial multiplier

a3b0 a2b0 a1b0 a0b0

a3b0 a2b0 a1b0 a0b0 + + + a3b1 a2b1 a1b1 a0b1

a3b0 a2b0 a1b0 + + + a3b1 a2b1 a1b1 a0b1 + + + a3b2 a2b2 a1b2 a0b2

a3b0 a2b0 + + a3b1 a2b1 a1b1 + + + a3b2 a2b2 a1b2 a0b2 + + + a3b3 a2b3 a1b3 a0b3

Case study 1: Folding Set Assigment
(Sk-1|N-1) (Sk-1|N-2) (Sk-1|0) (Sk-2|N-1) (S0|1) (S0|0)

Case study 1: Folding Equations
Two neighboring nodes that are folded onto one node Df(i®i+1)=N× =N-1, i= N-1, 2N-1, … , La-N-1 Neighboring nodes that are folded onto different nodes Df(i®i+1)=N×1-0+(N-1)-0=2N-1, i¹ N-1, 2N-1, … , La-N-1 Carry data paths Df(i®i)=N× =N-1, i= 0, 1, … , La-1 Df(U®V)³0 Folded architecture contains max(N-1,2N-1)=2N-1 latches between nodes j and j+1 that will be used for data buffering

Case study 1: Folded Architecture

Case study 1: Functional Description for case N=2, k=2

Case study 1: Functional Description
a2b0 a0b0

a3b0 a2b0 a1b0 a0b0

a2b1 + a3b0 a3b0 a2b0 a0b1 + a1b0 a1b0 a0b0

a3b1 a2b1 + a3b0 a3b0 a1b1 + a2b0 a0b1 + a1b0 a1b0

a2b2 + a3b1 a3b1 a2b1 + a3b0 a0b2 + a2b0 a1b1 a2b0 + a1b1 a0b1 + a1b0

Case study 1: Implementation of Folded Architecture (Spartan II xc2s2000-5pq208 )
Basic multiplier Folded multiplier 7.841 12 2 64 7.855 4 32 8.056 16 8 8.129 128 7.003 6 7.202 7.211 6.682 4.367 4.214 4.590 4.580 4.502 4.785 4.898 4.706 3.803 4.105 4.088 1 3.957 Clock period [ns] Slices used No. of PEs Folding factor Op. length

Case study 1: Conclusions
It provides the finding of optimal area-time solution for the given requirements. Saprtan II “shift register” property was used to relax the constraints caused by relatively large number of lathes in folded architecture. Generated architecture has kept almost all desirable features of source Bit-Serial architecture. The hardware reduction of active arithmetic elements for the factor N is done at the cost of execution time.

Case study 2: Cellular-phone technology is changing rapidly. There is an increasing number of wireless-communications standards, including variants of the IEEE wireless LAN specification, etc… Traditionally, devices need a separate chip to work with each standard. Providers differentiate themselves by offering new features, such as multimedia capabilities. Providing each feature typically requires a separate chip, or essence, multiple circuitry systems physically joined on a peace of silicon

Case study 2: The additional circuitry adds cost, takes up space, increases power usage in mobile devices, and increase product-design time.

Case study 2: Configurable Folded FIR Filter Architecture
The synthesis of configurable folded bit-plane architecture for FIR filtering. Why? Wider application area Finding of suitable A-T tradeoffs Increasing of versatility of folded systems

Case study 2: FIR filtering
Output words {yi} of FIR filter are computed as where are coefficients while {xi} are input words. m – coefficient word length, k – number of taps, – bit of coefficient (with weight ) n – input word length. The BPA is obtained by resorting of partial products of different multipliers.

Case study 2: Bit-plane FIR filter architecture
highly regular architecture allows extensive pipelining regular layout high computational throughput truncation of LSBs of intermediate results without any loss of accuracy programmability of coefficients [Noll 1986], [Reuver & Klar 1992]

Case study 2: DFG for the source BPA for case k=3 and m=4
The DFG for the basic BPA with k=3 and m=4

Case study 2: Assignment of folding sets
(Ss , r) s= p mod k r= p mod N

Case study 2: Assignment of folding sets
Folding set assignment enables the changing of operations in folding sets. Different operations can be mapped onto the different hardware units in fixed array structure. There are k folding sets where each folding set contains N operations. For the coefficients, kc, and the coefficient length, mc, the total number of operations, L, is: L=kc mc=k N

Case study 2: Folding equations and retiming
General form of Folding Equations: Df (pp+1)=Nw(e)-0+[(p+1) mod N]-[p mod N] = = The condition Df (UV)  0 is not satisfied for neighboring nodes U and V when for the position p of node U the following is valid p mod (N-1) = 0 or p mod N = 0.

Case study 2: Folding equations and retiming
General form of retiming inequalities: The general form of solution for r(p) is: The existence of this solution provides the retiming of DFG and allows the application of folding technique.

Case study 2: Graphical representations of retiming
a) kc=1, mc=L; b) kc=3, mc=L/3

Case study 2: Life cycle analysis

Case study 2: General allocation table

Case study 2: Module for input data entering

Case study 2: Folded FIR filter architecture

Case study 2: Functional block diagram
k=3, N=4, kc=2 and mc=6

Case study 2: Data flow for folded architecture
k=3, N=4, kc =2, mc=6 y0= 20 c00x0 + 21 c01x0 +22 c02x0 + 23 c03x0 +24 c04x0 + 25 c05x0 = c0x0 y1= 20 c10x0 + 21 c11x0 +22 c12x0 + 23 c13x0 +24 c14x0 + 25 c15x0 +20 c00x1 + 21c01x1 +22c02x1 + 23 c03x1 +24 c04x1 + 25 c05x1 = c1x0 +c0x1 y2= 20 c10x1 + 21 c11x1 +22 c12x1 + 23 c13x1 +24 c14x1 + 25 c15x1 +24 c14x1 + 25 c15x1 + … + = c1x1+ c0x2 y3= + … + = c0x2+ c1x3 20 c00x2 + 21 c01x2 +22 c02x2 + 23 c03x2

Case study 2: Implementation - Chip occupation as a function of maximal folding factor Nmax (Spartan II xc2s2000-5pq208 )

Case study 2: Implementation - Throughput as a function of chosen folding factor

Folding set assignment supports the changing of operations in folding sets. The prerequisites for application of folding technique are satisfied. Using of proposed folding set assignment, different operations can be mapped onto the different hardware units in the fixed structure array.

The derived folded processing array can be configured to perform FIR filtering with different number of taps and length of coefficients. Synthesized architecture has kept desirable features of source architecture such as extensive pipelining, high regularity, truncation of LSBs of intermediate results without any loss of accuracy. The number of basic cells is reduced to the number of basic cells in one plane of source architecture. The obtained folded semi-systolic architecture is presented by DFG, allocation table, and data flow diagram.

Folding Technique: Compromising in Special Purpose Hardware Design

Similar presentations

Presentation on theme: "Folding Technique: Compromising in Special Purpose Hardware Design"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Folding Technique: Compromising in Special Purpose Hardware Design

Similar presentations

Presentation on theme: "Folding Technique: Compromising in Special Purpose Hardware Design"— Presentation transcript:

Similar presentations

About project

Feedback