Reading Textbook II, Chapter 10 Textbook I, Chapters 12 and 13
Motivation Time is of the essence! –We do things in order, and so do processors –Procedural dependency –Resource reusability Synchronous architectures are preferred –Ease of implementation –Predictability –Compatibility with well-known arithmetic algorithms A reference clock plays a key role –We usually neglect the non-idealities of the clock in the design cycle
Plesiochronous: two interacting modules have independent clocks generated from separate crystal oscillators
Asynchronous Interconnect No clock is needed Speed is determined by job completion
Hand Shaking The four-phase handshake is level-sensitive while the two-phase handshake is edge-triggered (fewer transitions at the expense of edge-triggered circuitry). System A places data on the bus. It then raises Req to indicate that the data is valid. System B samples the data when it sees a high value on Req and raises Ack to indicate that the data has been captured. System A lowers Req, then system B lowers Ack. Req is not synchronized to clkB, so a synchronizer is needed.
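The four-phase sequence described on this slide can be traced in a few lines of Python. This is only a minimal sketch: the function name `four_phase_transfer` and the event-list representation are illustrative assumptions, and the synchronizer on Req is not modeled.

```python
# Minimal sketch of one four-phase (return-to-zero) handshake transfer.
# Names and the event-list format are illustrative assumptions.

def four_phase_transfer(value):
    """Return the value captured by system B and the ordered signal events."""
    events = []
    bus = value                  # System A places data on the bus
    events.append(("Req", 1))    # A raises Req: data is valid
    captured = bus               # B samples the data while Req is high
    events.append(("Ack", 1))    # B raises Ack: data has been captured
    events.append(("Req", 0))    # A lowers Req
    events.append(("Ack", 0))    # B lowers Ack; both signals are idle again
    return captured, events


captured, events = four_phase_transfer(0x5A)
print(captured, events)
```

Each transfer costs four signal transitions, which is the overhead the two-phase protocol avoids.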
Latch Parameters D Q Clk t c-q t hold PW m t su t d-q Delays can be different for rising and falling data transitions T Transparent Opaque
Register Parameters D Q Clk t c-q t hold T t su Delays can be different for rising and falling data transitions
Clock Uncertainties Sources of clock uncertainty
Clock Nonidealities Clock skew –Spatial variation in temporally equivalent clock edges; deterministic + random, t SK Clock jitter –Temporal variations in consecutive edges of the clock signal; modulation + random noise –Cycle-to-cycle (short-term) t JS –Long term t JL Variation of the pulse width –Important for level sensitive clocking
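To make the two definitions concrete, the sketch below computes skew and cycle-to-cycle jitter from lists of edge times. It assumes one common definition of cycle-to-cycle jitter (the largest change between consecutive periods); the function names and the sample timestamps are made up for illustration.

```python
# Sketch: skew and cycle-to-cycle jitter from clock edge timestamps (in ps).
# Function names and sample data are illustrative assumptions.

def periods(edges):
    """Periods between consecutive rising edges observed at one point."""
    return [b - a for a, b in zip(edges, edges[1:])]

def cycle_to_cycle_jitter(edges):
    """Largest change in period between two consecutive cycles."""
    p = periods(edges)
    return max(abs(b - a) for a, b in zip(p, p[1:]))

def skew(arrival_a, arrival_b):
    """Spatial offset of the same clock edge seen at two different points."""
    return arrival_b - arrival_a


edges_ps = [0, 1000, 2010, 2995, 4000]   # edges at a single clock pin
print(periods(edges_ps))
print(cycle_to_cycle_jitter(edges_ps))
print(skew(120, 150))                     # same edge at two registers
```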
Clock Skew and Jitter Both skew and jitter affect the effective cycle time Only skew affects the race margin Clk t SK t JS
Clock Skew and Jitter Do not touch the clock signal if not necessary! –Sometimes the simplest architecture is the safest –But not necessarily the lowest power! Clk t SK t JS
Clock skew and Jitter Data and state independent clock distribution is desired Enabled FF is a popular choice in the design Consider clock load on power!
Clock Skew [Figure: histogram of # of registers versus Clk delay, marking the insertion delay and the max Clk skew δ] Earliest occurrence of Clk edge: nominal – δ/2 Latest occurrence of Clk edge: nominal + δ/2
Positive Skew Launching edge arrives before the receiving edge
Negative Skew Receiving edge arrives before the launching edge
Timing Constraints (positive skew) Minimum cycle time: $T + \delta > t_{c-q} + t_{su} + t_{logic}$ With positive $\delta$ the receiving edge arrives late, so there is more time to process the data. The worst case for this constraint is when the receiving edge arrives early.
Timing Constraints (positive skew) Hold time constraint: $t_{(c-q,cd)} + t_{(logic,cd)} > t_{hold} + \delta$ Worst case is when the receiving edge arrives late (positive skew): a race between data and clock. Otherwise the receiving register cannot latch In1 before it changes after CLK1 edge 1. The hold constraint is independent of the clock period T.
Considerations δ > 0: this corresponds to a clock routed in the same direction as the flow of the data through the pipeline. The skew has to be strictly controlled so that the hold (race) constraint is satisfied. If this constraint is not met, the circuit malfunctions independent of the clock period.
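The two skew constraints can be checked numerically. The sketch below assumes that the symbol dropped from the slide formulas is the skew δ (positive when the receiving edge is late); the function names and the delay values in ps are illustrative, not from the slides.

```python
# Sketch of the setup and hold checks with skew delta (all times in ps).
# Assumes delta > 0 means the receiving clock edge arrives late.

def min_cycle_time(t_cq, t_logic, t_su, delta):
    """Smallest T satisfying T + delta >= t_cq + t_logic + t_su."""
    return t_cq + t_logic + t_su - delta

def hold_ok(t_cq_cd, t_logic_cd, t_hold, delta):
    """Race check: t_cq_cd + t_logic_cd > t_hold + delta."""
    return t_cq_cd + t_logic_cd > t_hold + delta


print(min_cycle_time(100, 800, 50, delta=50))    # positive skew relaxes T
print(min_cycle_time(100, 800, 50, delta=-50))   # negative skew lengthens T
print(hold_ok(60, 40, 30, delta=50))             # small positive skew
print(hold_ok(60, 40, 30, delta=80))             # too much positive skew
print(hold_ok(60, 40, 30, delta=-50))            # negative skew
```

Note that the clock period does not appear in `hold_ok`, which is why slowing the clock cannot repair a race.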
Question Would there be any race if the skew is negative? What would you do to avoid race?
Negative Skew δ < 0: when the clock is routed in the opposite direction of the data, the skew is negative and the condition to avoid a race is unconditionally met. The circuit operates correctly independent of the skew. The skew reduces the time available for actual computation, so the clock period T has to be increased by |δ|. If race (hold time) is a problem, route the clock in the opposite direction.
Impact of Jitter Both skew and jitter should be accounted for in feedback structures
Longest Logic Path in Edge-Triggered Systems [Figure: timing diagram over one clock period T showing T Clk-Q, T LM and T SU, from the latest point of launching (considering jitter) to the earliest arrival of the next cycle, with an offset labelled T JI + δ]
Clock Constraints in Edge-Triggered Systems If the launching edge is late and the receiving edge is early, the data will not be too late if: $T_{c-q} + T_{LM} + T_{SU} < T - T_{JI,1} - T_{JI,2} - \delta$ Minimum cycle time is determined by the maximum delays through the logic: $T_{c-q} + T_{LM} + T_{SU} + \delta + 2T_{JI} < T$ Skew can be either positive or negative
Shortest Path [Figure: timing diagram showing T Clk-Q and T Lm from the earliest point of launching; data must not arrive before T H after the nominal clock edge]
Clock Constraints in Edge-Triggered Systems Minimum logic delay: if the launching edge is early and the receiving edge is late: $T_{c-q} + T_{Lm} - T_{JI,1} > T_H + T_{JI,2} + \delta$, which gives $T_{c-q} + T_{Lm} > T_H + 2T_{JI} + \delta$ (here $T_{Lm}$ is the minimum logic delay)
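As a worked illustration of the long-path and short-path constraints with jitter, the sketch below evaluates both bounds. It assumes the symbol missing from the slide formulas is the skew δ and that both clock edges see the same jitter T_JI; the function names and numbers (in ps) are hypothetical.

```python
# Sketch: edge-triggered timing bounds including skew and jitter (ps).
# Assumes equal jitter t_ji on the launching and receiving edges.

def min_period(t_cq, t_logic_max, t_su, skew, t_ji):
    """Lower bound on T from T_cq + T_LM + T_SU + skew + 2*T_JI < T."""
    return t_cq + t_logic_max + t_su + skew + 2 * t_ji

def min_logic_delay(t_cq, t_hold, skew, t_ji):
    """Lower bound on T_Lm from T_cq + T_Lm > T_H + 2*T_JI + skew."""
    return t_hold + 2 * t_ji + skew - t_cq


print(min_period(t_cq=100, t_logic_max=800, t_su=50, skew=30, t_ji=20))
print(min_logic_delay(t_cq=100, t_hold=40, skew=30, t_ji=20))
```

Jitter enters both bounds with a factor of two because the launching and receiving edges can each move by T_JI in the unfavorable direction.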
False path Path 1 (5 tgate) never exercised. If A = 1, the critical path goes through OR1 and OR2. If A = 0 and B = 0, the critical path is through I1,OR1 and OR2 (corresponding to a delay of 3 tgate). For the case when A= 0 and B =1, the critical path is through I1,OR1, AND3 and OR2. Does not depend on C,D.
Pattern and ILD correlation Use of fillers is necessary
Temp. and Power Temp. –Time varying (millisecond scale) –Effect of clock gating –Has a gradient; the systematic part can be compensated for Power –Instantaneous IR drop (switching activity) –Jitter (short pulses, data dependent) –Cannot be compensated for (only decoupling caps)
Data dependent loading It is modeled as a form of jitter due to its random nature Capacitive coupling and crosstalk work the same way.
Clock Distribution Clock is distributed in a tree-like fashion H-tree
Example Clock H-Tree –Clock skew: time difference between the arrival time of the clock signal between two leaves –Identical branches and leaves
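The definition of skew between leaves can be illustrated with a toy tree. In this sketch each node is a `(segment_delay, children)` tuple; that representation, the function names, and the delay values in ps are assumptions made for the example, not data from the slides.

```python
# Sketch: clock skew as the spread of arrival times at the leaves of a tree.
# Each node is (segment_delay_ps, [children]); a leaf has no children.

def arrivals(node, t=0):
    """Arrival time of the clock edge at every leaf below this node."""
    delay, children = node
    t += delay
    if not children:
        return [t]
    out = []
    for child in children:
        out.extend(arrivals(child, t))
    return out

def clock_skew(tree):
    """Largest difference in arrival time between any two leaves."""
    a = arrivals(tree)
    return max(a) - min(a)


balanced = (50, [(30, [(20, []), (20, [])]), (30, [(20, []), (20, [])])])
mismatched = (50, [(30, [(20, []), (23, [])]), (34, [(20, []), (20, [])])])
print(clock_skew(balanced), clock_skew(mismatched))
```

With identical branches and leaves the skew is zero by construction; any mismatch in a segment shows up directly at the leaves below it.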
Example Considering three parameters: –Both FETs and wires; 64 samples + main buffer –All deterministic factors are nulled out only within chip variation is considered –Random ΔL of FET with distribution stat: N(0, 0.035um) –Random ΔW of wires with N(0,0.25um) –Spatial ΔL; ΔL = w 0 +w x.x+w y.y
Results –In case of Random ΔL 139ps vs. 171ps without considering spatial constraints –In case of Random ΔW 41ps vs. 49ps –Without considering spatial constraints; worst case is too pessimistic
More realistic H-tree [Restle98] 10 balanced segments Each segment contains 580 drivers All RC-matched If we leave the clock tree for the last minute we may end up with multiple timing constraint violations!
The Grid System No rc-matching Large power Absolute delay is minimized Allows late design changes
21164 Clocking 2-phase single-wire clock, distributed globally 2 distributed driver channels –Reduced RC delay/skew –Improved thermal distribution –3.75 nF clock load –58 cm final driver width Local inverters for latching Conditional clocks in caches to reduce power More complex race checking Device variation Clock waveform: t rise = 0.35 ns, t skew = 150 ps, t cycle = 3.3 ns Skew: 90 ps (65 ps effective) [Figure: location of clock driver on die, showing pre-driver and final drivers]
21164 Clocking Clock buffers carefully sized to minimize the skew The direction of the clock is considered One gate between the latches Dummy fillers (increase cap) –Dummies are shielded
Reducing Skew
1. Balance clock paths from a central distribution source to individual clocking elements using H-tree structures.
2. The use of local clock grids (instead of routed trees) can reduce skew at the cost of increased capacitive load and power dissipation.
3. If data-dependent clock load variations cause significant jitter, differential registers that have a data-independent clock load should be used.
–The use of gated clocks to save power also results in data-dependent clock load and increased jitter. In clock networks where the fixed load is large (e.g., using clock grids), the data-dependent variation might not be significant.
4. If data flows in one direction, route data and clock in opposite directions. This eliminates races at the cost of performance.
5. Shield clock wires from adjacent signal wires.
6. ILD variation: use dummy fills.
7. Temperature: delay-locked loops, as discussed later in this chapter, can easily compensate for temperature variations.
8. Power supply variation: use on-chip decoupling capacitors. Unfortunately, decoupling capacitors require a significant amount of area, and efficient packaging solutions must be leveraged to reduce chip area.
EV6 (Alpha 21264) Clocking 600 MHz – 0.35 micron CMOS 2 phase, with multiple conditional buffered clocks –2.8 nF clock load –40 cm final driver width Local clocks can be gated “off” to save power Reduced load/skew Reduced thermal issues Multiple clocks complicate race checking Global clock waveform: t rise = 0.35 ns, t skew = 50 ps, t cycle = 1.67 ns
21264 Clocking Hierarchical clocking Trade-off between power and skew Flexibility in types of clocks at each region Not shielded
EV7 Clock Hierarchy + widely dispersed drivers + DLLs compensate static and low- frequency variation + divides design and verification effort - DLL design and verification is added work + tailored clocks Active Skew Management and Multiple Clock Domains
Latch based timing We can have combinational circuits between the two latches of a FF –More flexibility in terms of timing
Flip-Flop – Based Timing Flip -flop Logic Flip-flop delay Skew Logic delay T SU T Clk-Q Representation after M. Horowitz, VLSI Circuits 1996.
Latch timing D Clk Q t D-Q t Clk-Q When data arrives to transparent latch When data arrives to closed latch Data has to be ‘re-launched’ Latch is a ‘soft’ barrier
Single-Phase Clock with Latches Latch Logic Clk P PW T skl T skt latch transparent
Preventing late arrivals Case 1: - The LM can start ahead of time - c2q limits Case 2: d2q limits Lgk can still operate
Latch-Based Design L1 Latch Logic L2 Latch L1 latch is transparent when = 0 L2 latch is transparent when = 1
Latch-Based Timing L1 Latch Logic Path1 Logic L2 Latch L1 latch L2 latch Skew Can tolerate skew! Long Path 1 Short Path 1 Static logic L2 trans. L1 trans. Hits L2 latch has to wait till L2 becomes transparent Hits L2 transparent goes through L2
Latch based timing Trans. when high Trans. when low
Slack-borrowing tpdA tpdB Trans. when high CLB_B starts before (3) kicks to latch its input. ie, since CLB_A finished earlier than (3), the extra time is passed to CLB_B again e is valid before (4) to latch the input of the next CLB
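The slack-passing idea can be sketched as a small arrival-time calculation. This is a simplified model, not the slide's exact circuit: latches are idealized (zero delay, zero setup time), latch i is assumed to open at i half-periods and close one half-period later, and the function name `propagate` and the delays are hypothetical.

```python
# Sketch: data moving through a chain of alternating-phase latches.
# Idealized latches (no D-Q delay, no setup time) are an assumption.

def propagate(logic_delays, half_period):
    """Return one (arrival, departure, borrowed) tuple per receiving latch.

    Data leaves latch 0 at t = 0. Latch i opens at i * half_period and
    closes one half-period later. 'borrowed' is how far into the latch's
    transparent phase the data arrives, i.e. time taken from the next stage.
    """
    depart = 0
    report = []
    for i, t_pd in enumerate(logic_delays, start=1):
        arrive = depart + t_pd
        opens = i * half_period
        closes = opens + half_period
        if arrive > closes:
            raise ValueError(f"latch {i}: arrival {arrive} after closing edge {closes}")
        borrowed = max(0, arrive - opens)
        depart = max(arrive, opens)   # early data waits for the latch to open
        report.append((arrive, depart, borrowed))
    return report


print(propagate([60, 40, 50], half_period=50))
print(propagate([30, 60], half_period=50))
```

In the first call the first block overruns its half-period and the following, shorter block absorbs the difference; in the second call the early result waits at the closed latch, then the long second block borrows from the stage after it.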
Example T=125 L4 Becomes transp. at edge no problem when exactly f arrives L4
Design consideration If the falling edge of clk2 comes with too much skew, THL might not be able to latch the previous data because of hold time violation (ie, D2 is overwritten too quickly after the edge) Data available for CLL Hold time violation
Self-timed and Asynchronous Design Functions of clock in synchronous design 1) Acts as completion signal 2) Ensures the correct ordering of events Truly asynchronous design 2) Ordering of events is implicit in logic 1) Completion is ensured by careful timing analysis Self-timed design 1) Completion ensured by completion signal 2) Ordering imposed by handshaking protocol
Synchronous Pipelined Datapath The clock does two things: 1- it ensures that physical timing constraints are met 2- clock events serve as a logical ordering mechanism for the global system events If we guarantee these two items by other means, we can remove the clock and save the power, area, and complexity of the clock tree…
Synch. design It assumes that all clock events or timing references happen simultaneously over the complete circuit. This is not the case in reality, because of effects such as clock skew and jitter. Significant current flows over a very short period of time. The linking of physical and logical constraints has some obvious effects (e.g. on throughput).
Self-Timed Pipelined Datapath Hand shaking blocks What each signal does? The logical ordering of the operations is ensured by the acknowledge-request scheme, often called a handshaking protocol.
Asynch. properties Timing signals are generated locally… no high precision clock distribution over the chip (skew, etc) Separating the physical and logical ordering Performance (data dependency and no worst case design) The automatic shut-down of blocks that are not in use can result in power savings.(power) Robust to variations in manufacturing and operating conditions such as temperature.
Completion Signal Using Current Sensing Minimum delay Data independent reference!
Hand-Shaking Protocol Two Phase Handshake Every transition means that the action is valid! The four events, data change, request, data acceptance, acknowledge proceed in a cyclic order.
Event Logic – The Muller-C Element Seq. element
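A behavioral model shows why the Muller C-element is a sequential element: the output follows the inputs only when they agree and otherwise holds its previous value. The class name and interface below are illustrative assumptions.

```python
# Behavioral sketch of a two-input Muller C-element.

class CElement:
    def __init__(self, initial=0):
        self.out = initial          # stored state (the element has memory)

    def update(self, a, b):
        """Output follows the inputs when they agree, otherwise holds."""
        if a == b:
            self.out = a
        return self.out


c = CElement()
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    print(a, b, c.update(a, b))
```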
2-Phase Handshake Protocol Advantage: FAST - minimal # of signaling events (important for global interconnect) Disadvantage: requires the detection of transitions that may occur in either direction; initialization is important Start from (DataReady, Ack) = (0, 0). When the inputs go to (1, 0), Req = 1. The C-element is then blocked (and locked), and no new data is sent to the data bus (Req stays high) as long as the transmitted data has not been processed by the receiver, no matter what DataReady does.
Problem: Self-timed FIFO All 1s or 0s -> pipeline empty Alternating 1s and 0s -> pipeline full
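The empty/full rule for the two-phase FIFO can be written as a short check over the stage outputs. The function name and the third label, "partial", are assumptions added for the cases the slide does not name.

```python
# Sketch: classify a two-phase self-timed FIFO from its stage outputs.

def fifo_state(bits):
    """'empty' if all stages agree, 'full' if neighbors always differ."""
    if all(b == bits[0] for b in bits):
        return "empty"
    if all(a != b for a, b in zip(bits, bits[1:])):
        return "full"
    return "partial"                 # label not from the slide (assumption)


print(fifo_state([0, 0, 0, 0]))
print(fifo_state([1, 0, 1, 0]))
print(fifo_state([1, 1, 0, 0]))
```

In a transition-signaling pipeline a token is held wherever two adjacent stages differ, so alternating values mean every stage holds data.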
Example From [Horowitz] Assume there is a register at the input which loads the data at the beginning of Eval phase
Example DataReady1 is asserted. Req to the second block is asserted, First C-element is locked. The second block loads data and starts the evaluation process.
Example DataReady2 is asserted. Req to the third block is asserted, Second C-element is locked. The third block loads data and starts the evaluation process. The first C-element is released. Can accept a DataReady from the previous stage. (If Req has already come, the first Req is unleashed and goes to eval phase.)
Synchronizers and Arbiters Arbiter: Circuit to decide which of 2 events occurred first Synchronizer: Arbiter with clock as one of the inputs Problem: Circuit HAS to make a decision in limited time - which decision is not important Caveat: It is impossible to ensure correct operation But, we can decrease the error probability at the expense of delay
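The trade-off between error probability and delay is often quantified with the standard first-order estimate MTBF = e^(t_r/τ) / (T_0 · f_clk · f_data). The sketch below uses that general formula, which is not taken from these slides; the parameter names and the sample values for τ and T_0 are assumptions.

```python
# Sketch: first-order mean time between synchronizer failures.
# tau and t0 are device-dependent parameters; the values below are made up.
import math

def synchronizer_mtbf(t_resolve, tau, t0, f_clk, f_data):
    """MTBF in seconds for a resolution time t_resolve (seconds)."""
    return math.exp(t_resolve / tau) / (t0 * f_clk * f_data)


base = synchronizer_mtbf(1.0e-9, tau=1e-10, t0=1e-10, f_clk=1e8, f_data=1e6)
more = synchronizer_mtbf(1.1e-9, tau=1e-10, t0=1e-10, f_clk=1e8, f_data=1e6)
print(base, more, more / base)
```

Every extra τ of waiting time multiplies the MTBF by e, so the failure rate can be made very small but never zero.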
A Simple Synchronizer Data sampled on rising edge of the clock Latch will eventually resolve the signal value, but... this might take infinite time!
Synchronizer: Output Trajectories Single-pole model for a flip-flop
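As a hedged sketch of what a single-pole model usually looks like (the symbols $V_{MS}$, $\tau$ and $v(0)$ are assumptions, since the slide's figure is not available): near the metastable point the cross-coupled pair behaves like a first-order system whose output moves away from the metastable voltage exponentially.

```latex
v(t) = V_{MS} + \bigl(v(0) - V_{MS}\bigr)\, e^{t/\tau}
```

Here $V_{MS}$ is the metastable voltage, $v(0)$ is the voltage at the sampling instant, and $\tau$ is the time constant of the regenerative pair. The closer $v(0)$ is to $V_{MS}$, the longer the output takes to reach a valid logic level, and for $v(0) = V_{MS}$ it never does.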