Pipelining and Retiming 1 Pipelining  Adding registers along a path  split combinational logic into multiple cycles  increase clock rate  increase.

Slides:

Advertisements

Similar presentations

1 ECE734 VLSI Arrays for Digital Signal Processing Chapter 3 Parallel and Pipelined Processing.

Advertisements

Courtesy RK Brayton (UCB) and A Kuehlmann (Cadence) 1 Logic Synthesis Sequential Synthesis.

Chapter 4 Retiming.

Logic Synthesis – 3 Optimization Ahmed Hemani Sources: Synopsys Documentation.

Modern VLSI Design 4e: Chapter 5 Copyright  2008 Wayne Wolf Topics n Performance analysis of sequential machines.

Clock Skewing EECS 290A Sequential Logic Synthesis and Verification.

FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Register-transfer Design n Basics of register-transfer design: –data paths and controllers.

Sequential Timing Optimization. Long path timing constraints Data must not reach destination FF too late s i + d(i,j) + T setup  s j + P s i s j d(i,j)

1 Lecture 28 Timing Analysis. 2 Overview °Circuits do not respond instantaneously to input changes °Predictable delay in transferring inputs to outputs.

Assume array size is 256 (mult: 4ns, add: 2ns)

Introduction To Algorithms CS 445 Discussion Session 8 Instructor: Dr Alon Efrat TA : Pooja Vaswani 04/04/2005.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.

All Pairs Shortest Paths and Floyd-Warshall Algorithm CLRS 25.2

Circuit Retiming with Interconnect Delay CUHK CSE CAD Group Meeting One Evangeline Young Aug 19, 2003.

Chapter 9 Graph algorithms. Sample Graph Problems Path problems. Connectedness problems. Spanning tree problems.

Sequential Logic 1  Combinational logic:  Compute a function all at one time  Fast/expensive  e.g. combinational multiplier  Sequential logic:  Compute.

Spring 08, Feb 28 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2008 Retiming Vishwani D. Agrawal James J. Danaher.

Penn ESE Fall DeHon 1 ESE (ESE534): Computer Organization Day 19: March 26, 2007 Retime 1: Transformations.

Digital Design – Optimizations and Tradeoffs

Continuous Retiming EECS 290A Sequential Logic Synthesis and Verification.

Retiming. Consider the Following Circuit Suppose T XOR = 3 ns, T pcq = 1 ns, T setup = 1 ns, then this circuit can be clocked at 1 ns + (3 x 3 ns) + 1.

VHDL Coding Exercise 4: FIR Filter. Where to start? AlgorithmArchitecture RTL- Block diagram VHDL-Code Designspace Exploration Feedback Optimization.

Retiming with Interconnect and Gate Delay CUHK CSE CAD Group Dennis Tong 29 th Sept., 2003.

CS294-6 Reconfigurable Computing Day 16 October 15, 1998 Retiming.

Spring 07, Apr 5 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Retiming Vishwani D. Agrawal James J. Danaher Professor.

Tirgul 13. Unweighted Graphs Wishful Thinking – you decide to go to work on your sun-tan in ‘ Hatzuk ’ beach in Tel-Aviv. Therefore, you take your swimming.

EDA (CS286.5b) Day 19 Covering and Retiming. “Final” Like Assignment #1 –longer –more breadth –focus since assignment #2 –…but ideas are cummulative –open.

1 Retiming Outline: ProblemProblem FormulationFormulation Retiming algorithmRetiming algorithm.

VLSI DSP 2008Y.T. Hwang3-1 Chapter 3 Algorithm Representation & Iteration Bound.

ECE Synthesis & Verification 1 ECE 667 ECE 667 Synthesis and Verification of Digital Systems Retiming.

Algorithmic Transformations

EDA (CS286.5b) Day 18 Retiming. Today Retiming –cycle time (clock period) –C-slow –initial states –register minimization.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 13, 2008 Retiming.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 24: April 18, 2011 Covering and Retiming.

Modern VLSI Design 4e: Chapter 8 Copyright  2008 Wayne Wolf Topics Basics of register-transfer design: –data paths and controllers; –ASM charts. Pipelining.

CALTECH CS137 Winter DeHon CS137: Electronic Design Automation Day 7: February 3, 2002 Retiming.

ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing.

ELEC692 VLSI Signal Processing Architecture Lecture 3

Reconfigurable Computing - Pipelined Systems John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound, Western.

1 Retiming and Re-synthesis Outline: RetimingRetiming Retiming and Resynthesis (RnR)Retiming and Resynthesis (RnR) Resynthesis of PipelinesResynthesis.

Pipelining and Retiming

CALTECH CS137 Spring DeHon 1 CS137: Electronic Design Automation Day 5: April 12, 2004 Covering and Retiming.

Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005.

FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Topics n Low power design. n Pipelining.

1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.

EE3A1 Computer Hardware and Digital Design Lecture 9 Pipelining.

Graphs Definition: a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected.

Retiming EECS 290A Sequential Logic Synthesis and Verification.

Reconfigurable Computing - Options in Circuit Design John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound,

Digital Logic Design Alex Bronstein Lecture 2: Pipelines.

Lecture 27 Logistics Last lecture Today: HW8 due Friday

Pipelining and Retiming 1

CS Spring 2008 – Lec #17 – Retiming - 1

CS137: Electronic Design Automation

The minimum cost flow problem

James D. Z. Ma Department of Electrical and Computer Engineering

Lecture 27 Logistics Last lecture Today: HW8 due Friday

ELEC 7770 Advanced VLSI Design Spring 2012 Retiming

ESE535: Electronic Design Automation

Pipeline Principle A non-pipelined system of combination circuits (A, B, C) that computation requires total of 300 picoseconds. Comb. logic.

Resource Sharing and Binding

CS184a: Computer Architecture (Structures and Organization)

ESE535: Electronic Design Automation

ECE 352 Digital System Fundamentals

Algorithms (2IL15) – Lecture 7

EE5900 Advanced Embedded System For Smart Infrastructure

ELEC 7770 Advanced VLSI Design Spring 2016 Retiming

Fast Min-Register Retiming Through Binary Max-Flow

Lecture 3: Timing & Sequential Circuits

Presentation transcript:

Pipelining and Retiming 1 Pipelining  Adding registers along a path  split combinational logic into multiple cycles  increase clock rate  increase throughput  increase latency

Pipelining and Retiming 2 Pipelining  Delay, d, of slowest combinational stage determines performance  clock period = d  Throughput = 1/d : rate at which outputs are produced  Latency = nd : number of stages * clock period  Pipelining increases circuit utilization  Registers slow down data, synchronize data paths  Wave-pipelining  no pipeline registers - waves of data flow through circuit  relies on equal-delay circuit paths - no short paths

Pipelining and Retiming 3 When and How to Pipeline?  Where is the best place to add registers?  splitting combinational logic  overhead of registers (propagation delay and setup time requirements)  What about cycles in data path?  Example: 16-bit adder, add 8-bits in each of two cycles

Pipelining and Retiming 4 Retiming  Process of optimally distributing registers throughout a circuit  minimize the clock period  minimize the number of registers

Pipelining and Retiming 5 Retiming (cont’d)  Fast optimal algorithm (Leiserson & Saxe 1983)  Retiming rules:  remove one register from each input and add one to each output  remove one register from each output and add one to each input

Pipelining and Retiming 6 Optimal Pipelining  Add registers - use retiming to find optimal location

Pipelining and Retiming 7 Optimal Pipelining  Add registers - use retiming to find optimal location

Pipelining and Retiming 8 Example - Digital Correlator  y t =  (x t, a 0 ) +  (x t-1, a 1 ) +  (x t-2, a 2 ) +  (x t-3, a 3 )   (x t, a 0 ) = 0 if x  a, 1 otherwise (and passes x along to the right) ++   +   host ytyt xtxt a0a0 a1a1 a2a2 a3a3

Pipelining and Retiming 9 Example - Digital Correlator (cont’d)  Delays: adder, 7; comparator, 3; host, 0 ++   +   host cycle time = 24

Pipelining and Retiming 10 Example - Digital Correlator (cont’d)  Delays: adder, 7; comparator, 3; host, 0 ++   +   host ++   +   cycle time = 24 cycle time = 13

Pipelining and Retiming 11 Retiming: One Step at a Time

Pipelining and Retiming 12 Retiming: One Step at a Time (cont’d) and after a few more...

Pipelining and Retiming 13 Retiming Algorithm  Representation of circuit as directed graph  nodes: combinational logic  edges: connections between logic that may or may not include registers  weights: propagation delay for nodes, number of registers for edges  path delay (D): sum of propagation dealys along path nodes  path weight (W): sum of edge weights along path  always > 0, no asynchronous feedback  Problem statement  given: cycle time, T, and a circuit graph  adjust edge weights (number of registers) so that all path delays < T, unless their path weight  1, and the outputs to the host are the same (in both function and delay) as in the original graph

Pipelining and Retiming 14 Retiming Algorithm Approach  Compute path weights and delays between each pair of nodes  W and D matrices  Choose a cycle time T  Determine if it is possible to assign new weights so that all paths with delays greater than T have a weight that is 1 or greater (use linear programming)  Choose a smaller cycle time and repeat until the smallest T is found

Pipelining and Retiming 15 Computing W and D  W matrix: number of registers on path from u  v  D matrix: total delay along path from u  v Wh h Wh h D h h v1v1 v2v2 v3v3 v4v4 v5v5 v6v6 v7v7 vhvh 000

Pipelining and Retiming 16 Computing W and D  W[u,v] = number of registers on the minimum weight path from u  v  Any retiming changes the weight of all paths by the same constant  i.e. Retiming cannot change which is the minimum weight path  D[u,v] = maximum delay over all paths with W[u,v] registers  Retiming does not affect D[u,v]  These matrices contain all the required register and delay information  If retiming removes all registers from the path u  v, then D[u,v] is the largest delay path that results

Pipelining and Retiming 17 Retiming: Problem Formulation  r(v): number of registers pushed through a node in the forward direction  w new (u, v) = w old (u, v) + r(u) - r(v)  Problem statement  r(v h ) = 0 (host is not retimed)  w new (u, v) = w old (u, v) + r(u) - r(v)  0, for all u, v  r(u) - r(v)  - w old (u, v) (no negative registers!)  For all D[u,v] > Tclk, w new (u, v) = w old (u, v) + r(u) - r(v)  1  r(u) - r(v)  - w old (u, v) + 1 (every long path has at least 1 reg)  Difference constraints like this can be solved by generating a graph that represents the constraints and using a shortest path algorithm like Bellman-Ford to find a set of r(v) values that meets all the constraints  The value of r(v) returned by the algorithm can be used to generate the new positions of the registers in the retimed circuit

Pipelining and Retiming 18 Retimed Correlator r = 2 r = 1 r = 0

Pipelining and Retiming 19 Extensions to Retiming  Host interface  add latency  multiple hosts  Area considerations  limit number of registers  optimize logic across register boundaries  peripheral retiming  incremental retiming  pre-computation  Generality  different propagation delays for different signals  widths of interconnections

Pipelining and Retiming 20 a b c d x DQDQ a b d x DQDQ DQDQ a b x c DQDQ DQDQ DQDQ x c a b DQDQ DQDQ Retiming examples  Shortening critical paths  Create simplification opportunities

Pipelining and Retiming 21 Digital Correlator Revisited  Optimally retimed circuit (clock cycle 13)  How do we know this is optimal?  Max-Ratio Theorem: T c  D cycle /R cycle for all cycles in circuit  D cycle = total delay on cycle, including register t pd, t su  R cycle = number of registers on cycle  We know we can never do better than this  Can’t always do this well ++   +   host

Pipelining and Retiming 22 Going Faster: C-slow’ing a Circuit  Replace every register with C registers  Now retime: (clock cycle now 7) ++   +   host ++   +  

Pipelining and Retiming 23 C-slow’ing a Circuit  Note that we get one value every c clock cycles  But clock period decreases  Throughput remains the same at best  The trick: Interleave data sets  Example: Stereo audio  Interleave the data for the two channels  Doubles the throughput! ++   +   host

Pipelining and Retiming 24 Using C-Slowing For Time-Multiplexing  Clock period is for this circuit is 40 [ ]  Min clock period after pipelining/retiming is at best 25  Max ratio cycle: [ ]/1 mult: 10, add: 5, Tpd: 2, Tsu: 3, Th: 1

Pipelining and Retiming 25 Using C-Slowing For Time-Multiplexing  Pipelined/Retimed Circuit  Let’s reschedule for 2 clock cycles/iteration mult: 10, add: 5, Tpd: 2, Tsu: 3, Th: 1

Pipelining and Retiming 26 Using C-Slowing For Time-Multiplexing  Start by C-slowing mult: 10, add: 5, Tpd: 2, Tsu: 3, Th: 1

Pipelining and Retiming 27 Using C-Slowing For Time-Multiplexing  Now retime  Note: 3 multiplers are red, 3 are white: share 2 adders are red, 2 are white: share mult: 10, add: 5, Tpd: 2, Tsu: 3, Th: 1

Pipelining and Retiming 28 Using C-Slowing For Time-Multiplexing  Result  Cost: 1/2  clock period: 25 -> 15  Throughput: 1/25 -> 1/30 mult: 10, add: 5, Tpd: 2, Tsu: 3, Th: 1

Pipelining and Retiming 29 * + * + * + * + 0 C-slowing/Retiming for Resource Sharing  FIR Filter

Pipelining and Retiming 30 * + * + * + * +

* + * + * + * + C-slowed by 4

* + * + * + * + Insert Data every 4 cycles (one data set)

* + * + * + * +

* + * + * + * + Computation Active only every 4 Cycles

* + * + * + * +

* + * + * + * +

* + * + * + * +

* + * + * + * +

* + * + * + * + Retime and remove extra Pipelining

* + * + * + * +

* + * + * + * +

* + * + * + * +

* + * + * + * +

* + * + * + * +

* + * + * + * +

* + * + * + * +

* + * + * + * + Computation spread over time  Only need one multiplier and one adder  We can use this method to schedule for any number of resources