Lecture 14: Performance Optimization

Lecture 14: Performance Optimization
UCSD ECE 111 Prof. Farinaz Koushanfar Fall 2017 UCSD ECE 111, Prof. Koushanfar, Fall 2017 Some slides are courtesy of Prof. Patrick Schaumont

UCSD ECE 111, Prof. Koushanfar, Fall 2017
Area-Delay Product Area (eg. Slices) Suboptimal Points Optimal Points Delay = Performance-1 (e.g. cycles) UCSD ECE 111, Prof. Koushanfar, Fall 2017

Area-Delay Product Area (eg. Slices) Area Constraint Best Choice Delay = Performance-1 (e.g. cycles) UCSD ECE 111, Prof. Koushanfar, Fall 2017

Area-Delay Product Area (eg. Slices) Best Choice Delay Constraint Delay = Performance-1 (e.g. cycles) UCSD ECE 111, Prof. Koushanfar, Fall 2017

Optimizing Performance - Overview
Timing constraint and timing analysis Performance factors of a digital design Latency and Throughput Delay = Clock Period * Cycle Count What determines the minimum clock period? Performance Optimizations you can do in Verilog Parallel Computations Pipelining Retiming Summary Optimizing Area & Performance UCSD ECE 111, Prof. Koushanfar, Fall 2017

Timing Flip-flop samples D at clock edge D must be stable when sampled
Similar to a photograph, D must be stable around clock edge If not, metastability can occur

Input Timing Constraints
Setup time: tsetup = time before clock edge data must be stable (i.e. not changing) Hold time: thold = time after clock edge data must be stable Aperture time: ta = time around clock edge data must be stable (ta = tsetup + thold)

Output Timing Constraints
Propagation delay: tpcq = time after clock edge that the output Q is guaranteed to be stable (i.e., to stop changing) Contamination delay: tccq = time after clock edge that Q might be unstable (i.e., start changing)

Dynamic Discipline Synchronous sequential circuit inputs must be stable during aperture (setup and hold) time around clock edge Specifically, inputs must be stable at least tsetup before the clock edge at least until thold after the clock edge

Dynamic Discipline The delay between registers has a minimum and maximum delay, dependent on the delays of the circuit elements

Setup Time Constraint Tc ≥
Depends on the maximum delay from register R1 through combinational logic to R2 The input to register R2 must be stable at least tsetup before clock edge Tc ≥

Setup Time Constraint Tc ≥ tpcq + tpd + tsetup tpd ≤
Depends on the maximum delay from register R1 through combinational logic to R2 The input to register R2 must be stable at least tsetup before clock edge Tc ≥ tpcq + tpd + tsetup tpd ≤

Setup Time Constraint Tc ≥ tpcq + tpd + tsetup
Depends on the maximum delay from register R1 through combinational logic to R2 The input to register R2 must be stable at least tsetup before clock edge Tc ≥ tpcq + tpd + tsetup tpd ≤ Tc – (tpcq + tsetup) (tpcq + tsetup): sequencing overhead

Hold Time Constraint thold <
Depends on the minimum delay from register R1 through the combinational logic to R2 The input to register R2 must be stable for at least thold after the clock edge thold <

Hold Time Constraint thold < tccq + tcd tcd >
Depends on the minimum delay from register R1 through the combinational logic to R2 The input to register R2 must be stable for at least thold after the clock edge thold < tccq + tcd tcd >

Hold Time Constraint thold < tccq + tcd tcd > thold - tccq
Depends on the minimum delay from register R1 through the combinational logic to R2 The input to register R2 must be stable for at least thold after the clock edge thold < tccq + tcd tcd > thold - tccq

Timing Analysis Timing Characteristics tccq = 30 ps tpcq = 50 ps
tsetup = 60 ps thold = 70 ps tpd = 35 ps tcd = 25 ps tpd = tcd = Setup time constraint: Tc ≥ fc = Hold time constraint: tccq + tcd > thold ?

Timing Analysis Timing Characteristics tccq = 30 ps tpcq = 50 ps
tsetup = 60 ps thold = 70 ps tpd = 35 ps tcd = 25 ps tpd = 3 x 35 ps = 105 ps tcd = 25 ps Setup time constraint: Tc ≥ ( ) ps = 215 ps fc = 1/Tc = 4.65 GHz Hold time constraint: tccq + tcd > thold ? ( ) ps > 70 ps ? No!

Timing Analysis Timing Characteristics Add buffers to the short paths:
tccq = 30 ps tpcq = 50 ps tsetup = 60 ps thold = 70 ps tpd = 35 ps tcd = 25 ps tpd = tcd = Setup time constraint: Tc ≥ fc = Hold time constraint: tccq + tcd > thold ?

Timing Analysis Timing Characteristics Add buffers to the short paths:
tccq = 30 ps tpcq = 50 ps tsetup = 60 ps thold = 70 ps tpd = 35 ps tcd = 25 ps tpd = 3 x 35 ps = 105 ps tcd = 2 x 25 ps = 50 ps Setup time constraint: Tc ≥ ( ) ps = 215 ps fc = 1/Tc = 4.65 GHz Hold time constraint: tccq + tcd > thold ? ( ) ps > 70 ps ? Yes!

Clock Skew The clock doesn’t arrive at all registers at same time
Skew: difference between two clock edges Perform worst case analysis to guarantee dynamic discipline is not violated for any register – many registers in a system!

Setup Time Constraint with Skew
In the worst case, CLK2 is earlier than CLK1 Tc ≥

In the worst case, CLK2 is earlier than CLK1 Tc ≥ tpcq + tpd + tsetup + tskew tpd ≤

In the worst case, CLK2 is earlier than CLK1 Tc ≥ tpcq + tpd + tsetup + tskew tpd ≤ Tc – (tpcq + tsetup + tskew)

Hold Time Constraint with Skew
In the worst case, CLK2 is later than CLK1 tccq + tcd >

In the worst case, CLK2 is later than CLK1 tccq + tcd > thold + tskew tcd >

In the worst case, CLK2 is later than CLK1 tccq + tcd > thold + tskew tcd > thold + tskew – tccq

Violating the Dynamic Discipline
Asynchronous (for example, user) inputs might violate the dynamic discipline

Metastability Bistable devices: two stable states, and a metastable state between them Flip-flop: two stable states (1 and 0) and one metastable state If flip-flop lands in metastable state, could stay there for an undetermined amount of time

Flip-Flop Internals Flip-flop has feedback: if Q is somewhere between 1 and 0, cross-coupled gates drive output to either rail (1 or 0) Metastable signal: if it hasn’t resolved to 1 or 0 If flip-flop input changes at random time, probability that output Q is metastable after waiting some time, t: P(tres > t) = (T0/Tc ) e-t/τ tres : time to resolve to 1 or 0 T0, τ : properties of the circuit

Metastability Intuitively:
T0/Tc: probability input changes at a bad time (during aperture) P(tres > t) = (T0/Tc ) e-t/τ τ: time constant for how fast flip-flop moves away from metastability In short, if flip-flop samples metastable input, if you wait long enough (t), the output will have resolved to 1 or 0 with high probability.

Synchronizers Asynchronous inputs are inevitable (user interfaces, systems with different clocks interacting, etc.) Synchronizer goal: make the probability of failure (the output Q still being metastable) low Synchronizer cannot make the probability of failure 0

Synchronizer Internals
Synchronizer: built with two back-to-back flip-flops Suppose D is transitioning when sampled by F1 Internal signal D2 has (Tc - tsetup) time to resolve to 1 or 0

Synchronizer Probability of Failure
For each sample, probability of failure is: P(failure) = (T0/Tc ) e-(Tc - tsetup)/τ

Synchronizer Mean Time Between Failures
If asynchronous input changes once per second, probability of failure per second is P(failure). If input changes N times per second, probability of failure per second is: P(failure)/second = (NT0/Tc) e-(Tc - tsetup)/τ Synchronizer fails, on average, 1/[P(failure)/second] Called mean time between failures, MTBF: MTBF = 1/[P(failure)/second] = (Tc/NT0) e(Tc - tsetup)/τ

Example Synchronizer Suppose: Tc = 1/500 MHz = 2 ns τ = 200 ps
T0 = 150 ps tsetup = 100 ps N = 10 events per second What is the probability of failure? MTBF?

Example Synchronizer Suppose: Tc = 1/500 MHz = 2 ns τ = 200 ps
T0 = 150 ps tsetup = 100 ps N = 10 events per second What is the probability of failure? MTBF? P(failure) = (150 ps/2 ns) e-(1.9 ns)/200 ps = 5.6 × 10-6 P(failure)/second = 10 × (5.6 × 10-6 ) = 5.6 × 10-5 / second MTBF = 1/[P(failure)/second] ≈ 5 hours

Performance (Delay) of a design
Two common definitions for the performance of a design The time it takes to compute an output starting from a given input: Latency The rate at which new outputs are produced (or the rate at which new inputs are consumed): Throughput Depending on the application, you will need to optimize one or the other The unit of Throughput and Latency is time (seconds). If you find a number in cycles, you have to find the clock period T before you know the throughput or latency. Performance = 1 / Throughput or 1 / Latency We will use the generic term Delay Delay = 1 / Performance UCSD ECE 111, Prof. Koushanfar, Fall 2017

Digital Synchronous Design
Delay of a design Latency = (cycles from I to O) * (clock period) Example: System clock frequency = 20MHz ( = 50 ns period) Each output is available 10 clock cycles after a new input is provided Latency = 10 * 50 ns = 500 ns 10 cycles Digital Synchronous Design Input Output fCLK UCSD ECE 111, Prof. Koushanfar, Fall 2017

Digital Synchronous Design
Delay of a design Throughput = (cycles between I) * (clock period) Example: System clock frequency = 20MHz Each 5 clock cycles a new input is accepted Throughput = 5 * 50 ns = 250 ns Digital Synchronous Design Input Output 5 cycles fCLK UCSD ECE 111, Prof. Koushanfar, Fall 2017

Delay = Cycle Count * Clock Period
Delay (either latency or throughput) has two components Cycle Count. This quantity is controlled by the kind of Verilog that the designer writes Clock Period. This quantity is determined by the synthesis tools that map the Verilog into an implementation Cycle Count is fixed by Verilog Code Verilog Tools For a given design (Verilog), Clock Period is fixed by Technology Parameters, e.g. gate delay UCSD ECE 111, Prof. Koushanfar, Fall 2017

Delay = Cycle Count * Clock Period
Delay (either latency or throughput) has two components Cycle Count. This quantity is controlled by the kind of Verilog that the designer writes Clock Period. This quantity is determined by the synthesis tools that map the Verilog into an implementation Since a designer cannot choose the clock period directly, the designer will constrain it "Tools, I want you to implement this circuit with a clock period smaller then 25 ns" The tools will try to match this constraint The tools may or may not achieve the desired clock period If the tools cannot achieve the constraint, the tools report a timing violation UCSD ECE 111, Prof. Koushanfar, Fall 2017

Optimizing the Delay Delay = Cycle Count * Clock Period So to decrease the system delay, we can do two things: Decrease the cycle count Decrease the clock period Both options can be influenced by the Verilog Designer We will focus on how to influence the clock period (which seems the least obvious to do) Let's first consider the factors in a digital design that determine the minimum clock period UCSD ECE 111, Prof. Koushanfar, Fall 2017

Minimum Clock Period Combinational Logic CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup UCSD ECE 111, Prof. Koushanfar, Fall 2017

a clock edge has occured
Minimum Clock Period Combinational Logic A CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup A Tclk->Q This is the time need for the output of a flip-flop to switch to a new value after a clock edge has occured UCSD ECE 111, Prof. Koushanfar, Fall 2017

Minimum Clock Period A B Combinational Logic CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup A B TLogic + TRouting This is the time need for the logic to calculate a new output. The delay is caused by gates as well as wires. UCSD ECE 111, Prof. Koushanfar, Fall 2017

Minimum Clock Period A B Combinational Logic CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup A B TSetup This is the time need for the flipflop to capture stable input data at the next clock edge. The next clock edge cannot come earlier then the dashed line UCSD ECE 111, Prof. Koushanfar, Fall 2017

Minimum Clock Period A B Combinational Logic CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup Tclk CLK A Tclk,min B In this case, the timing of the system is OK, since the actual Tclk > Tclk,min UCSD ECE 111, Prof. Koushanfar, Fall 2017

Minimum Clock Period A B Combinational Logic CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup Tclk CLK A Tslack Tclk,min B The margin between the actual clock period and the minimal clock period is called slack. Tslack = Tclk - Tclk,min UCSD ECE 111, Prof. Koushanfar, Fall 2017

Minimum Clock Period Combinational Logic A B CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup Tclk CLK A Tslack Tclk,min B If the slack is negative, the system has a timing violation. This system will not perform as expected, since its clock frequency is too high. ECE 4514 Digital Design II Patrick Schaumont Lecture 19: Optimizing Performance UCSD ECE 111, Prof. Koushanfar, Fall 2017 Spring 2008

An example from the Spartan 3E datasheet is shown on the right.
Minimum Clock Period Combinational Logic CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup Once the technology is chosen, Tclk->Q and Tsetup are fixed. An example from the Spartan 3E datasheet is shown on the right. Patrick Schaumon t Spring 2008 UCSD ECE 111, Prof. Koushanfar, Fall 2017

Minimum Clock Period Combinational Logic CLK Tclk,min = Tclk->Q + TLogic + TRouting + TSetup However, even after the technology is chosen, the designer can still influence Tlogic and Trouting by making modifications to the Verilog code. Thus, if we want to decrease the minimum clock period, we need to consider these terms. UCSD ECE 111, Prof. Koushanfar, Fall 2017

Delay optimization We will consider three techniques that can be used to decrease the combinational delay of a system If used properly, these may decrease the delay of a digital design. Technique 1: Parallel Computations Technique 2: Pipelining Technique 3: Retiming UCSD ECE 111, Prof. Koushanfar, Fall 2017

Technique 1: Parallel Computations
Hardware is concurrent, it's easy to do several things in parallel However, different combinational networks can have a different delay while still implementing the same function a b c d + Parallel + + + + Not Parallel + q q UCSD ECE 111, Prof. Koushanfar, Fall 2017

Parallel Computations: Datapath Example
We need to write our Verilog such that as much gates as possible will work in parallel Example: Karatsuba Multiplication A 16-bit x 16-bit multiplication can be written using 8- bit multiplications as follows q = a * b // a, b are 16 bit a = a1 << 8 + a2 // a1, a2 are 8 bit b b1 << 8 + b2 b1, b2 a2) * (b1 << 8 + b2) q = (a1 * b1) << 16 + (a1 * b2 + a2 * b1) << 8 + (a2 * b2) q = (a1 << 8 + UCSD ECE 111, Prof. Koushanfar, Fall 2017

Parallel Computations: Datapath Example
Karatsuba Multiplication creates a structure that enables parallel addition of partial products Note that this can be applied recursively: 8 bit * 8 bit into 4 times 4 bit * 4 bit, etc .. 16 16 8 8 8 8 8 8 8 8 16 bit X 16 bit 8 bit X 8 bit 8 bit X 8 bit 8 bit X 8 bit 8 bit X 8 bit 32 16 16 16 16 + + + UCSD ECE 111, Prof. Koushanfar, Fall 2017

Parallel Computations: Control Example
Nested if-then-else statements in Verilog create a priority decoder module selection(q, a, b); output [3:0] q; reg input a, b; begin q <= 0; if (a[0]) q[0] <= b[0]; else if (a[1]) else if (a[2]) else if (a[3]) end q[1] q[2] q[3] <= b[1]; <= b[2]; <= b[3]; endmodule gates with more inputs are slower UCSD ECE 111, Prof. Koushanfar, Fall 2017

Parallel Computations: Control Example
If you know (from the design specs) that this priority encoding is not needed, you can skip the else-if module selection(q, a, b); output [3:0] q; q; a, b; reg input [3:0] q <= 0; begin if (a[0]) if (a[1]) if (a[2]) if (a[3]) end q[0] q[1] q[2] q[3] <= b[0]; <= b[1]; <= b[2]; <= b[3]; Four 2-input AND in parallel endmodule UCSD ECE 111, Prof. Koushanfar, Fall 2017

Technique 2: Pipelining
Cut a long combinational path in half by inserting a register Increases the latency cycle count of the design to get form the input to the output, you will need an extra clock cycle a b c d a b c d + + + + insert register here + + q q Tlogic, before Tlogic, after UCSD ECE 111, Prof. Koushanfar, Fall 2017

Technique 2: Pipelining in Verilog
It's easy to pipeline a Verilog program, but you have to pay attention to insert pipeline registers consistently ! Before Pipelining: After Pipelining: module add4a(q, a, b, c, d); module add4b(q, a, b, c, d, clk); output input [127:0] q; a, b, c, d; output input input reg [127:0] clk; [127:0] q; a, b, c, d; assign q = a + endmodule b + c + d; pipe1, pipe2; always @(posedge clk) begin pipe1 <= pipe2 <= end a + b; c + d; assign q = pipe1 + pipe2; endmodule UCSD ECE 111, Prof. Koushanfar, Fall 2017

Technique 2: Pipelining in Verilog
The following is example is WRONG: Before Pipelining: After Inconsistent Pipelining: module add4c(q, a, b, c, d, clk); module add4a(q, a, b, c, d); output [127:0] q; input clk; a, b, c, d; reg pipe1, pipe2; output input [127:0] q; a, b, c, d; assign q = a + endmodule b + c + d; clk) begin pipe1 <= a + b; end This will add inputs from cycle N, and a partial result from cycle N-1 assign q = pipe1 + c + d; endmodule Inconsistent pipelining means: the pipelined module will never generate the same outputs as the original module, even after accounting for the latency effects of pipeline registers UCSD ECE 111, Prof. Koushanfar, Fall 2017

Simple rules for consistent pipelining
Assume a network of modules as follows (the modules can be combinational or sequential). We will demonstrate how to move pipeline registers around while avoiding inconsistent pipelining UCSD ECE 111, Prof. Koushanfar, Fall 2017

You can always add a register in front. It increases the latency of the network with one cycle, but the network will have the same functionality UCSD ECE 111, Prof. Koushanfar, Fall 2017

You can absorb a register at a single input if you recreate it at ALL the outputs of the module. This transformation will not change the latency nor the functionality of the network. in out1 out2 Think of moving a registers over a fork: UCSD ECE 111, Prof. Koushanfar, Fall 2017

Move it over another module - absorb register at the module inputs, recreate it to the module outputs in out1 UCSD ECE 111, Prof. Koushanfar, Fall 2017

Move it over the last module - absorb register at the module inputs, recreate it at the module output in1 out in2 Think of joining two registers: UCSD ECE 111, Prof. Koushanfar, Fall 2017

All of these have the same behavior
UCSD ECE 111, Prof. Koushanfar, Fall 2017

We can add multiple registers at the front ... UCSD ECE 111, Prof. Koushanfar, Fall 2017

and redistribute them using consistent pipelining UCSD ECE 111, Prof. Koushanfar, Fall 2017

Tclk,min = 90ns Latency = 1 cycle Throughput = 1 cycle 30 ns 30 ns 30 ns Tclk,min = 30ns Latency = 3 cycles Throughput = 1 cycle 30 ns 30 ns 30 ns UCSD ECE 111, Prof. Koushanfar, Fall 2017

Following these rules, you'll find that you cannot pipeline loops (i.e. increase the number of registers in a feedback path) This is a feedback path There is a single register present in this path UCSD ECE 111, Prof. Koushanfar, Fall 2017

Following these rules, you'll notice that you cannot pipeline loops (i.e. increase the number of registers in a feedback path) in1 out in2 To pipeline, add a register at the front UCSD ECE 111, Prof. Koushanfar, Fall 2017

Following these rules, you'll notice that you cannot pipeline loops (i.e. increase the number of registers in a feedback path) in1 out in2 To move the pipeline register to the module output, ALL the inputs need to absorb a register UCSD ECE 111, Prof. Koushanfar, Fall 2017

Following these rules, you'll notice that you cannot pipeline loops (i.e. increase the number of registers in a feedback path) in1 out in2 In the resulting network, there is still only one register in the loop UCSD ECE 111, Prof. Koushanfar, Fall 2017

Following these rules, you'll notice that you cannot pipeline loops (i.e. increase the number of registers in a feedback path) in1 out in2 Before another register can be added, the red register will need to be moved around the loop UCSD ECE 111, Prof. Koushanfar, Fall 2017

Technique 3: Retiming (or Register Balancing)
A pipeline register can cut a piece of combinational logic in smaller pieces. This reduces the Tclk,min for the entire design. 100 ns 50 ns 50 ns UCSD ECE 111, Prof. Koushanfar, Fall 2017

Sometimes, you will find that this partitioning is not nicely 50/50. In that case the benefit of pipeline registers to reduce Tclk,min is small, since the design has to be operated at the speed of the slowest stage 100 ns 90 ns 10 ns UCSD ECE 111, Prof. Koushanfar, Fall 2017

To maximize the benefit of the (pipeline) registers in a module, they should be balanced so that each stage of combinational logic takes the same amount of logic delay. This is called retiming or register balancing 90 ns 10 ns Move some logic from first stage to second stage 50 ns 50 ns UCSD ECE 111, Prof. Koushanfar, Fall 2017

Retiming is supported by synthesis tools
You only need to make sure you have sufficient registers available qa qb module add4d(q, a, b, c, d, clk); output [15:0] q; reg [15:0] q, qa, qb, qc, qd; input [15:0] b, c, d; input clk; + qc qd clk) begin qa <= a; qb <= b; qc <= c; qd <= d; q <= qa end This is a 16-bit adder for four numbers, but registers have been added at the input and the output + qb + qc + qd; endmodule UCSD ECE 111, Prof. Koushanfar, Fall 2017

Retiming is supported by synthesis
Results without register balancing: qa 6.708 ns qb 46 LUTs 80 Flip-flops (= 5 * 16) + qc qd UCSD ECE 111, Prof. Koushanfar, Fall 2017

To enable register balancing, select synthesis- properties, enable register balancing: UCSD ECE 111, Prof. Koushanfar, Fall 2017

Results with register balancing: 6.708 ns 46 LUTs 80 Flip-flops (= 5 * 16) qa qb + qc qd 5.816 ns 46 LUTs 93 Flip-flops That Area-Delay trade-off again ... ! UCSD ECE 111, Prof. Koushanfar, Fall 2017

Summary: Optimizing Area and Performance
Optimization Trade-off Delay UCSD ECE 111, Prof. Koushanfar, Fall 2017

Trade-off (reduce area, increase delay) Resource Sharing Combinational Logic f1 out Area 1 in Optimization (reduce area, reduce delay) Constant-propagation 1 a a Delay UCSD ECE 111, Prof. Koushanfar, Fall 2017

Resource Sharing Constant-propagation Area Rewrite Verilog for parallelism Add Pipeline Registers Redistribute Pipeline Registers Optimization or Tradeoff Tradeoff Tradeoff Parallel Computation Pipelining Retiming Delay UCSD ECE 111, Prof. Koushanfar, Fall 2017

Designer needs to control this design space, by modifying Verilog, and adjusting tool options as needed Resource Sharing Area Constant-propagation Verilog Designer Tools Parallel Computation Pipelining Retiming Delay UCSD ECE 111, Prof. Koushanfar, Fall 2017

Lecture 14: Performance Optimization

Similar presentations

Presentation on theme: "Lecture 14: Performance Optimization"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 14: Performance Optimization

Similar presentations

Presentation on theme: "Lecture 14: Performance Optimization"— Presentation transcript:

Similar presentations

About project

Feedback