L7: Pipelining and Parallel Processing VADA Lab.


Introduction (1)
- The pipelining transformation shortens the critical path, which can be exploited either to increase the clock speed (and hence the sample speed) or to reduce power consumption at the same speed.
- In parallel processing, multiple outputs are computed in parallel in one clock period, so the effective sampling speed is increased by the level of parallelism.

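Introduction (2)
- 3-tap FIR digital filter: y(n) = ax(n) + bx(n-1) + cx(n-2)
- Sample period: $T_{sample} \geq T_M + 2T_A$, where $T_M$ is the time taken by a multiplication and $T_A$ the time taken by an addition.
- Sampling frequency: $f_{sample} \leq \frac{1}{T_M + 2T_A}$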
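The filter equation and its timing limit can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names are made up here, the filter starts from zero state, and the critical path is assumed to run through one multiplier and two adders, matching the T_M + 2T_A figure quoted on the next slide.

```python
def fir3_direct(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    y = []
    x1 = x2 = 0  # hold x(n-1) and x(n-2)
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn  # shift the delay line
    return y


def direct_form_limits(t_m, t_a):
    """Return (minimum sample period, maximum sampling frequency).

    Assumes the critical path is one multiplier followed by two adders.
    """
    t_sample_min = t_m + 2 * t_a
    return t_sample_min, 1.0 / t_sample_min
```

Feeding a unit impulse through `fir3_direct` returns the coefficients a, b, c in order, which is a quick sanity check on the tap ordering.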

Pipelining of FIR digital filter
- A pipelined implementation of the 3-tap FIR filter is obtained by placing 2 additional latches.
- The critical path is reduced from $T_M + 2T_A$ to $T_M + T_A$.
- The two main drawbacks of pipelining are the increase in the number of latches and in system latency.
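A cycle-level sketch can make the latency cost concrete. The latch placement below is an assumption (the slide's figure is not in the transcript): one latch after the first adder and one extra latch on the delay line feeding the c multiplier, so each stage contains at most one multiplier and one adder.

```python
def fir3_direct(x, a, b, c):
    """Unpipelined reference: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y, x1, x2 = [], 0, 0
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return y


def fir3_pipelined(x, a, b, c):
    """Two-stage pipelined version (assumed latch placement, zero initial state)."""
    y = []
    x1 = x2 = x3 = 0   # x(n-1), x(n-2), and the extra latch holding x(n-3)
    partial = 0        # latch after the first adder
    for xn in x:
        # Stage 2 works on the values latched in the previous cycle.
        y.append(partial + c * x3)
        # Stage 1 computes the partial sum that is latched at the clock edge.
        partial = a * xn + b * x1
        x3, x2, x1 = x2, x1, xn
    return y
```

With zero initial state, the pipelined output is the reference output delayed by one cycle: the two added latches buy a shorter critical path at the price of one cycle of latency.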

Pipelining of FIR digital filter (2)
- The critical path (longest path) can be reduced by suitably placing pipelining latches in the architecture.
- Pipelining latches can only be placed across a feed-forward cutset of the graph.
- Two graph definitions are needed for pipelining:
  - Cutset: a set of edges of a graph such that, if these edges are removed, the graph becomes disjoint.
  - Feed-forward cutset: a cutset in which the data move in the forward direction on all of its edges.
- To obtain a correctly pipelined circuit, pipelining latches must be inserted on all the edges of the feed-forward cutset.
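The two definitions can be checked mechanically. The sketch below describes a cut by the set of nodes on its input side; the node names and the edge list for the 3-tap filter are hypothetical labels chosen for illustration, since the slide's graph is not in the transcript.

```python
def crossing_edges(edges, left):
    """Edges with exactly one endpoint in `left`; removing them separates the graph."""
    return [(u, v) for (u, v) in edges if (u in left) != (v in left)]


def is_feed_forward_cutset(edges, left):
    """True if every crossing edge runs from the `left` side to the other side."""
    cut = crossing_edges(edges, left)
    return bool(cut) and all(u in left and v not in left for (u, v) in cut)


# Hypothetical edge list for the direct-form 3-tap FIR filter.
FIR_EDGES = [
    ("x0", "x1"), ("x1", "x2"),                       # delay line
    ("x0", "mul_a"), ("x1", "mul_b"), ("x2", "mul_c"),
    ("mul_a", "add1"), ("mul_b", "add1"),
    ("add1", "add2"), ("mul_c", "add2"),
]
FIR_LEFT = {"x0", "x1", "mul_a", "mul_b", "add1"}

# A loop has no feed-forward cutset through it: one crossing edge runs backward.
LOOP_EDGES = [("in", "add"), ("add", "out"), ("out", "add")]
LOOP_LEFT = {"in", "add"}
```

For the assumed FIR graph, the cut crosses exactly two forward edges, which is where the two additional pipelining latches go. For the loop, one of the crossing edges points backward, so latches may not be placed there.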

Pipelining of FIR digital filter (3)
- Signal-flow graph example

Pipelining of FIR digital filter (4)
- Data-broadcast structures
  - The critical path of the original 3-tap FIR filter can be reduced without introducing any pipelining latches by transposing the structure.
  - Transposition theorem: "Reversing the direction of all the edges in a given SFG (signal-flow graph) and interchanging the input and output ports preserves the functionality of the system."

Pipelining of FIR digital filter (5)
<SFG representation of the FIR filter>
<Transposed SFG representation of the FIR filter>

Pipelining of FIR digital filter (6)
- The transposed SFG representation leads to the data-broadcast structure, where data are not stored but are broadcast to all the multipliers simultaneously.
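A short simulation shows that the data-broadcast form computes the same output as the direct form. This is a sketch under assumptions: the register names `r1` and `r2` are invented here, and both filters start from zero state.

```python
def fir3_reference(x, a, b, c):
    """Direct evaluation of y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    def sample(i):
        return x[i] if i >= 0 else 0
    return [a * sample(n) + b * sample(n - 1) + c * sample(n - 2)
            for n in range(len(x))]


def fir3_data_broadcast(x, a, b, c):
    """Transposed structure: x(n) is broadcast to all three multipliers.

    The registers hold partial sums instead of past inputs, so each cycle
    computes at most one multiplication followed by one addition.
    """
    y = []
    r1 = r2 = 0  # r1 = b*x(n-1) + c*x(n-2), r2 = c*x(n-1)
    for xn in x:
        y.append(a * xn + r1)
        r1 = b * xn + r2   # uses the old r2
        r2 = c * xn
    return y
```

Because every register-to-register path now contains one multiplier and one adder, the critical path drops to T_M + T_A without adding latches or latency.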

Pipelining of FIR digital filter (7)
- Fine-grain pipelining
  - Let $T_M = 10$ units and $T_A = 2$ units, and let the desired clock period be $(T_M + T_A)/2 = 6$ units.
  - In this case the multiplier is broken into 2 smaller units with processing times of 6 units and 4 units, respectively.
  - By placing latches on the horizontal cutset across the multiplier, the desired clock speed can be achieved.
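The arithmetic behind the 6/4 split can be sketched as follows. The code assumes T_A = 2 units (the value implied by the 6-unit target), assumes the second stage contains the rest of the multiplier plus one adder, and idealizes the multiplier as divisible at any integer point. The helper names are illustrative.

```python
def stage_delays(t_m1, t_m2, t_a):
    """Delays of the two stages after cutting the multiplier into t_m1 + t_m2.

    Stage 1 is the first part of the multiplier; stage 2 is the second part
    followed by one adder (assumed data-broadcast structure).
    """
    return t_m1, t_m2 + t_a


def best_split(t_m, t_a):
    """Search all integer splits; return (clock period, t_m1, t_m2)."""
    return min(
        (max(stage_delays(m1, t_m - m1, t_a)), m1, t_m - m1)
        for m1 in range(t_m + 1)
    )
```

With T_M = 10 and T_A = 2, the search settles on the 6 + 4 split, where both stages take 6 units.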

Parallel Processing (1)
- Designing a parallel FIR system
  - To obtain a parallel processing structure, the SISO (single-input single-output) system must be converted into a MIMO (multiple-input multiple-output) system:
      y(3k)   = ax(3k)   + bx(3k-1) + cx(3k-2)
      y(3k+1) = ax(3k+1) + bx(3k)   + cx(3k-1)
      y(3k+2) = ax(3k+2) + bx(3k+1) + cx(3k)
  - Parallel processing systems are also referred to as block processing systems.
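The three MIMO equations translate directly into a block-processing loop. The sketch below is illustrative only: it assumes zero initial state and an input length that is a multiple of the block size, and the function names are not from the lecture.

```python
def fir3_reference(x, a, b, c):
    """Serial (SISO) evaluation of y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    def sample(i):
        return x[i] if i >= 0 else 0
    return [a * sample(n) + b * sample(n - 1) + c * sample(n - 2)
            for n in range(len(x))]


def fir3_block3(x, a, b, c):
    """3-parallel (MIMO) version: each loop iteration models one clock cycle
    that consumes x(3k), x(3k+1), x(3k+2) and produces three outputs."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")

    def sample(i):
        return x[i] if i >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        n = 3 * k
        y0 = a * sample(n) + b * sample(n - 1) + c * sample(n - 2)
        y1 = a * sample(n + 1) + b * sample(n) + c * sample(n - 1)
        y2 = a * sample(n + 2) + b * sample(n + 1) + c * sample(n)
        y.extend([y0, y1, y2])
    return y
```

The block version yields the same samples as the serial filter; only the number of samples produced per clock cycle changes.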

Parallel Processing (2)
- Parallel processing architecture for a 3-tap FIR filter (with block size 3)

Parallel Processing (3)
- The critical path of the parallel processing system remains unchanged, so the clock period must satisfy $T_{clk} \geq T_M + 2T_A$.
- But since 3 samples are processed in 1 clock cycle instead of 3, the iteration period is given by $T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \geq \frac{1}{3}(T_M + 2T_A)$, with block size $L = 3$.
- In a pipelined system, by contrast, $T_{clk} = T_{sample}$.
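A small numeric check separates the clock period from the sample period. It assumes the unpipelined critical path of T_M + 2T_A and uses illustrative integer delays (T_M = 10, T_A = 2); the function name is made up for this sketch.

```python
from fractions import Fraction


def parallel_sample_period(t_m, t_a, block_size):
    """Return (minimum clock period, minimum sample period) for an L-parallel filter.

    The clock period is still set by the full critical path; the sample period
    shrinks because `block_size` samples are produced per clock cycle.
    Integer delays are expected so the result stays exact.
    """
    t_clk = t_m + 2 * t_a
    return t_clk, Fraction(t_clk, block_size)
```

For block size 3, the clock period stays at 14 units while the effective sample period becomes 14/3 units.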

Parallel Processing (4)
- Complete parallel processing system with block size 4

Parallel Processing (5)
- Why use parallel processing when pipelining is available?
  - Because I/O bottlenecks impose a fundamental limit on pipelining.
  - Pipelining can be combined with parallel processing to further increase the speed of the architecture.
  - By combining parallel processing (block size $L$) and pipelining ($M$ stages), the sample period is reduced to $T_{iter} = T_{sample} = \frac{1}{LM} T_{clk}$.
  - Parallel processing is also used to reduce power consumption while using slow clocks.

Parallel Processing (6)

Parallel Processing (7)
<Combined fine-grain pipelining and parallel processing for 3-tap FIR filter>

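Pipelining and Parallel processing for Low power
- There are two main advantages of using pipelining and parallel processing:
  - Higher speed
  - Lower power
- For a CMOS circuit, the propagation delay can be written as $T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$, where $C_{charge}$ is the capacitance to be charged or discharged in a single clock cycle, $V_0$ is the supply voltage, $V_t$ is the threshold voltage, and $k$ is a technology-dependent constant.
- The power consumption of a CMOS circuit can be estimated as $P = C_{total} V_0^2 f$, where $C_{total}$ is the total capacitance of the circuit and $f$ is the clock frequency.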

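Pipelining for Low power (1)
- $P_{seq} = C_{total} V_0^2 f$, with $f = 1/T_{seq}$, represents the power consumed in the original filter, where $T_{seq}$ is the clock period of the original sequential filter.
- In the M-level pipelined system, the critical path is reduced to 1/M of its original length, and the capacitance to be charged/discharged in a single clock cycle is reduced to $C_{charge}/M$.
- If the clock period is kept the same, the supply voltage can be reduced to $\beta V_0$ with $0 < \beta < 1$, and the power consumption of the pipelined filter becomes $P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$.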

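Pipelining for Low power (2)
- The power consumption factor, $\beta$, can be determined by examining the relationship between the propagation delay of the original filter and that of the pipelined filter, which runs at the reduced supply voltage $\beta V_0$:
  - Sequential: $T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
  - Pipelined: $T_{pip} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$
- Since both run at the same clock period ($T_{pip} = T_{seq}$), $\beta$ satisfies $M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$.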
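Equating the two propagation delays gives a quadratic in the voltage scaling factor, here written beta: M (beta V0 - Vt)^2 = beta (V0 - Vt)^2. The sketch below solves it in closed form. The delay model (delay proportional to C V / (V - Vt)^2) and the example values V0 = 5 V, Vt = 0.6 V, M = 2 are assumptions chosen for illustration.

```python
import math


def beta_pipelined(m, v0, v_t):
    """Solve M*(beta*V0 - Vt)^2 = beta*(V0 - Vt)^2 for the usable root.

    Expanded form: (M*V0^2)*beta^2 - (2*M*V0*Vt + (V0 - Vt)^2)*beta + M*Vt^2 = 0.
    The product of the two roots is (Vt/V0)^2, so only the larger root keeps
    the scaled supply beta*V0 above the threshold voltage.
    """
    qa = m * v0 ** 2
    qb = -(2 * m * v0 * v_t + (v0 - v_t) ** 2)
    qc = m * v_t ** 2
    disc = qb * qb - 4 * qa * qc
    return (-qb + math.sqrt(disc)) / (2 * qa)


def power_ratio(beta):
    """Pipelined power relative to the original: total capacitance and clock
    frequency are unchanged, so only the V^2 term scales."""
    return beta ** 2
```

Under these assumed values, `beta_pipelined(2, 5.0, 0.6)` is roughly 0.60, so the two-level pipelined filter would draw a little over a third of the original power at the same sample rate.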

Parallel processing for Low power (1)
- Parallel processing, like pipelining, can reduce the power consumption of a system by allowing the supply voltage to be reduced.
- In an L-parallel system, the charging capacitance does not change, while the total capacitance is increased L times.
- To maintain the same sample rate, the clock period of the L-parallel circuit must be increased to $L T_{seq}$, where $T_{seq}$ is the propagation delay of the sequential circuit.
- There is more time to charge the same capacitance, so the supply voltage can be reduced to $\beta V_0$.

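Parallel processing for Low power (2)
- The propagation delay of the L-parallel system, running at supply voltage $\beta V_0$, is given by $T_{par} = L T_{seq} = \frac{C_{charge} \beta V_0}{k (\beta V_0 - V_t)^2}$.
- Substituting $T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$ gives $L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$, from which $\beta$ can be solved.
- The power consumption is then $P_{par} = (L C_{total}) (\beta V_0)^2 \frac{f}{L} = \beta^2 P_{seq}$, where $P_{seq} = C_{total} V_0^2 f$ is the power of the sequential system.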
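The voltage scaling factor for the parallel case can also be found numerically, by lowering the supply until the circuit's delay stretches to L times the sequential delay. The sketch below uses bisection on an assumed delay model (delay proportional to C V / (V - Vt)^2); the function names and the example values L = 3, V0 = 5 V, Vt = 0.6 V are illustrative, and `beta` denotes the scaling factor.

```python
def delay(v, v_t, c_charge=1.0, k=1.0):
    """Assumed CMOS delay model: C*V / (k*(V - Vt)^2), valid for V > Vt."""
    return c_charge * v / (k * (v - v_t) ** 2)


def beta_parallel(l, v0, v_t, iterations=60):
    """Bisection for beta in (Vt/V0, 1) with delay(beta*V0) == L * delay(V0).

    The delay falls monotonically as the supply rises above Vt, so the
    interval always brackets exactly one solution for L > 1.
    """
    target = l * delay(v0, v_t)
    lo, hi = v_t / v0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if delay(mid * v0, v_t) > target:
            lo = mid   # too slow: the supply must be higher
        else:
            hi = mid   # fast enough: try a lower supply
    return (lo + hi) / 2


def parallel_power_ratio(l, beta, v0=1.0, c_total=1.0, f=1.0):
    """P_par / P_seq with L times the capacitance, supply beta*V0, clock f/L."""
    p_par = (l * c_total) * (beta * v0) ** 2 * (f / l)
    p_seq = c_total * v0 ** 2 * f
    return p_par / p_seq
```

The factor L cancels in the power ratio, which leaves beta squared: the saving comes entirely from the lower supply voltage, paid for with L times the hardware.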

Conclusions
- Pipelining
  - Pipelining latches are placed across feed-forward cutsets in the SFG, which reduces the computation time of the critical path.
  - The clock frequency can be increased, and hence the sampling rate is increased.
- Parallel processing
  - The hardware of the original serial system is duplicated, and the resulting system is a MIMO parallel system.
  - The clock frequency stays the same, and the sampling frequency is increased.
- Both schemes can be used for higher-speed or lower-power design (the latter by lowering the supply voltage).