Interconnect Optimizations

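A scaling primer
Ideal process scaling:
– Device geometries shrink by $\sigma$ (= 0.7x); device delay shrinks by $\sigma$
– Wire geometries shrink by $\sigma$
$R/\ell$: $\rho/(w\sigma \cdot h\sigma) = r/\sigma^2$
$C_c/\ell$: $(h\sigma)\,\epsilon/(S\sigma) = C_c$
$C/\ell$: similar
$R/\ell$ doubles; $C/\ell$ and $C_c/\ell$ are unchanged
[Figure: transistor (S, G, D) and wire cross-section with height h, width w, length l, and spacing S, each scaled by $\sigma$]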

Interconnect role
Short interconnect
– Used to connect nearby cells
– Minimize wire C, i.e., use short min-width wires
Medium to long-distance ("global") interconnect
– Size wires to trade off area vs. delay
– Increasing width ⇒ capacitance increases, resistance decreases
– Need to find an acceptable tradeoff: the wire sizing problem
"Fat" wires
– Thicker cross-sections in higher metal layers
– Useful for reducing delays for global wires
– Inductance issues, sharing of a limited resource

Cross-Section of A Chip

Block scaling
Block area often stays the same
– # cells and # nets double
– Wiring histogram shape is invariant
Global interconnect lengths don't shrink
Local interconnect lengths shrink by $\sigma$

Interconnect delay scaling
Delay of a wire of length l: $\tau_{int} = (rl)(cl) = rcl^2$ (first order)
Local interconnects: $\tau_{int} = (r/\sigma^2)(c)(l\sigma)^2 = rcl^2$
– Local interconnect delay is unchanged (compare to faster devices)
Global interconnects: $\tau_{int} = (r/\sigma^2)(c)(l)^2 = (rcl^2)/\sigma^2$
– Global interconnect delay doubles: unsustainable!
Interconnect delay is increasingly dominant
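The scaling argument above can be checked numerically. This is a minimal sketch, assuming the first-order delay $rcl^2$ from the slide, a scaling factor of 0.7, and r, c, l normalized to 1 purely for illustration; the function names are hypothetical.

```python
def wire_delay(r, c, l):
    """First-order wire delay r * c * l^2."""
    return r * c * l ** 2


def scaled_delays(r=1.0, c=1.0, l=1.0, sigma=0.7):
    """Return (local, global) wire delay after one ideal scaling step."""
    r_scaled = r / sigma ** 2                      # resistance per unit length grows as 1/sigma^2
    local = wire_delay(r_scaled, c, l * sigma)     # local wires shrink with the block
    global_ = wire_delay(r_scaled, c, l)           # global wires keep their length
    return local, global_


local, global_ = scaled_delays()
print(f"local delay ratio:  {local:.2f}")
print(f"global delay ratio: {global_:.2f}")
```

With these normalized values the local ratio stays at 1 while the global ratio is $1/\sigma^2$, which is roughly 2 for a scaling factor of 0.7.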

Buffer Insertion For Delay Reduction

Analysis of Simple RC Circuit
[Figure: series RC circuit driven by the input waveform $v_T(t)$, with current $i(t)$ through R and the state variable $v(t)$ across C]

Analysis of Simple RC Circuit
Step-input response (matching the initial state):
– Input: $v_0 u(t)$
– Output response for step input: $v(t) = v_0 (1 - e^{-t/RC}) u(t)$

Delays of Simple RC Circuit
$v(t) = v_0 (1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$
$v(t) = 0.5 v_0 \Rightarrow t = 0.7RC$
– i.e., delay = 0.7RC (50% delay)
$v(t) = 0.1 v_0 \Rightarrow t = 0.1RC$
$v(t) = 0.9 v_0 \Rightarrow t = 2.3RC$
– i.e., rise time = 2.2RC (if defined as the time from 10% to 90% of Vdd)
Commonly used metric: $T_D = RC$ (= Elmore delay)
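The threshold-crossing times above follow from inverting the step response, $t = -RC \ln(1 - v/v_0)$. A small sketch, with RC normalized to 1 and a hypothetical helper name, reproduces the coefficients:

```python
import math


def time_to_fraction(fraction, rc=1.0):
    """Time for v(t) = v0 * (1 - exp(-t/RC)) to reach fraction * v0."""
    return -rc * math.log(1.0 - fraction)


t50 = time_to_fraction(0.5)    # 50% delay, in units of RC
t10 = time_to_fraction(0.1)
t90 = time_to_fraction(0.9)
rise_time = t90 - t10          # 10% to 90% rise time, in units of RC

print(f"t50 = {t50:.2f} RC, t10 = {t10:.2f} RC, t90 = {t90:.2f} RC, rise = {rise_time:.2f} RC")
```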

Elmore Delay

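Elmore Delay
Driver is modeled as R
Driver intrinsic gate delay t(B)
Delay to a node = $\sum_{R_i} \sum_{C_j \text{ downstream of } R_i} R_i C_j$, summed over the resistances $R_i$ on the path to that node
Elmore delay at n1: R(B)*(C1+C2)
Elmore delay at n2: R(B)*(C1+C2) + R(w)*C2
[Figure: buffer B with output resistance R(B) driving node n1 (capacitance C1), then wire resistance R(w) to node n2 (capacitance C2)]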
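The double sum can be evaluated on any RC tree by accumulating, along the path to each node, the resistance of each edge times the capacitance downstream of it. The sketch below is one way to do that; the tree encoding, the function name, and the numeric values are assumptions for illustration, and the intrinsic delay t(B) is left out.

```python
def elmore_delays(tree, root):
    """tree maps node -> (parent, resistance of the edge from parent, node capacitance)."""
    children = {}
    for node, (parent, _, _) in tree.items():
        children.setdefault(parent, []).append(node)

    def downstream_cap(node):
        # Capacitance at this node plus everything below it.
        return tree[node][2] + sum(downstream_cap(ch) for ch in children.get(node, []))

    delays = {}

    def visit(node, delay_at_parent):
        _, res, _ = tree[node]
        delays[node] = delay_at_parent + res * downstream_cap(node)
        for ch in children.get(node, []):
            visit(ch, delays[node])

    for top in children.get(root, []):
        visit(top, 0.0)
    return delays


# Two-node circuit from the slide, with made-up values.
R_B, R_w, C1, C2 = 2.0, 3.0, 1.0, 4.0
tree = {"n1": ("B", R_B, C1), "n2": ("n1", R_w, C2)}
delays = elmore_delays(tree, "B")
print(delays)
```

For this circuit the result matches the two expressions on the slide: R(B)*(C1+C2) at n1, plus R(w)*C2 at n2.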

Elmore Delay For Uniform Wire
For a wire of length x driving a load C, the Elmore delay is rx(cx/2 + C)
No matter how the wire is lumped, the Elmore delay is the same
c: unit wire capacitance
r: unit wire resistance

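Delay for Buffer
A buffer is characterized by its intrinsic buffer delay, driver resistance, and input capacitance
[Figure: buffer between nodes u and v driving a load C; seen from u, the buffer presents its input capacitance C(b)]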

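Buffers Reduce Wire Delay
$t_{unbuf} = R(cx + C) + rx(cx/2 + C)$
$t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$
$t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$
[Figure: wire of length x with driver resistance R and load C, unbuffered and with one buffer at the midpoint (two segments of length x/2); plot of ∆t against x]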
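The difference expression says a midpoint buffer pays off once $rcx^2/4$ exceeds $RC + t_b$. The following sketch evaluates both delay formulas and the break-even length; the parameter values are made up and the function names are hypothetical.

```python
import math


def t_unbuf(R, C, r, c, x):
    """Elmore delay of an unbuffered wire of length x."""
    return R * (c * x + C) + r * x * (c * x / 2 + C)


def t_buf(R, C, r, c, x, t_b):
    """Elmore delay with one buffer (same R and C as the driver and load) at the midpoint."""
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + t_b


def break_even_length(R, C, r, c, t_b):
    """Length at which t_buf == t_unbuf, from R*C + t_b - r*c*x^2/4 = 0."""
    return 2 * math.sqrt((R * C + t_b) / (r * c))


R = C = r = c = t_b = 1.0                     # illustrative values only
x = 4.0
gain = t_buf(R, C, r, c, x, t_b) - t_unbuf(R, C, r, c, x)
x_star = break_even_length(R, C, r, c, t_b)
print(f"t_buf - t_unbuf at x = {x}: {gain}")
print(f"break-even length: {x_star:.3f}")
```

Below the break-even length the buffer's own delay outweighs the saving, so the buffer makes the path slower.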

Combinational Logic Delay
Combinational logic delay <= clock period
[Figure: combinational logic between clocked registers, with primary input and primary output]

Example of Static Timing Analysis
Arrival time: input -> output, take max
Required arrival time: output -> input, take min
Slack = required arrival time – arrival time
[Figure: timing graph with gate delays, each node annotated as arrival time/required arrival time/slack, e.g. 7/4/-3, 8/8/0, 23/20/-3]
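The three rules above translate into one forward and one backward pass over a timing graph. The sketch below assumes a DAG given as (source, destination, delay) edges and uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the small graph and its numbers are invented and are not the one in the figure.

```python
from graphlib import TopologicalSorter


def static_timing(edges, input_arrival, output_required):
    """Return (arrival, required, slack) dictionaries for a timing DAG."""
    preds, succs, nodes = {}, {}, set()
    for src, dst, delay in edges:
        preds.setdefault(dst, []).append((src, delay))
        succs.setdefault(src, []).append((dst, delay))
        nodes.update((src, dst))

    graph = {n: [p for p, _ in preds.get(n, [])] for n in nodes}
    order = list(TopologicalSorter(graph).static_order())

    arrival = dict(input_arrival)
    for n in order:                      # input -> output, take max
        if n in preds:
            arrival[n] = max(arrival[p] + d for p, d in preds[n])

    required = dict(output_required)
    for n in reversed(order):            # output -> input, take min
        if n in succs:
            required[n] = min(required[s] - d for s, d in succs[n])

    slack = {n: required[n] - arrival[n] for n in nodes}
    return arrival, required, slack


edges = [("a", "c", 2), ("b", "c", 3), ("c", "d", 4)]
arrival, required, slack = static_timing(edges, {"a": 0, "b": 0}, {"d": 6})
print(arrival, required, slack)
```

In this toy graph the path through b is critical: it arrives at d at time 7 against a required time of 6, so every node on it carries a slack of -1.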

Buffers Improve Slack
Before buffering:
– One sink: RAT = 300, Delay = 350, Slack = -50
– Other sink: RAT = 700, Delay = 600, Slack = 100
– slack_min = -50
After buffering:
– One sink: RAT = 300, Delay = 250, Slack = 50
– Other sink: RAT = 700, Delay = 400, Slack = 300
– slack_min = 50
Decouple capacitive load from the critical path
RAT = Required Arrival Time
Slack = RAT - Delay

ITRS projections

Buffered global interconnects: Intuition
Interconnect delay = $r c l^2$
Now, interconnect delay = $\sum_i r c l_i^2 < r c l^2$ (where $l = \sum_i l_i$), since $\sum_i l_i^2 < (\sum_i l_i)^2$
(Of course, account for buffer delay also)
[Figure: wire of length l split by buffers into segments $l_1, l_2, l_3, \dots, l_n$]

Optimal inter-buffer length
First-order (lumped parasitic, Elmore delay) analysis
Assume N identical buffers with equal inter-buffer length l (L = Nl)
For minimum delay, l is set to the optimal inter-buffer length $l_{opt}$
$R_d$: on-resistance of inverter
$C_g$: gate input capacitance
r, c: resistance, capacitance per micron
[Figure: line of total length L with identical buffers spaced l apart]
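The closed form for the minimum-delay spacing is not preserved in this transcript. As a hedged reconstruction, assume each stage is modeled with the uniform-wire Elmore form used elsewhere in these slides, $R_d(cl + C_g) + rl(cl/2 + C_g)$, and ignore the buffer intrinsic delay. Under those assumptions the derivation runs as follows; a different lumping of the wire capacitance would change the constant factor.

```latex
\begin{align*}
T(l) &= N\left[R_d\,(c\,l + C_g) + r\,l\left(\tfrac{c\,l}{2} + C_g\right)\right],
        \qquad N = \frac{L}{l} \\
     &= L\left[\frac{R_d C_g}{l} + R_d\,c + \frac{r\,c\,l}{2} + r\,C_g\right] \\
\frac{dT}{dl} &= L\left[-\frac{R_d C_g}{l^2} + \frac{r\,c}{2}\right] = 0
        \;\Longrightarrow\; l_{opt} = \sqrt{\frac{2 R_d C_g}{r\,c}}
\end{align*}
```

Only the $R_d C_g / l$ and $rcl/2$ terms depend on l, so the optimum balances the cost of adding more buffers against the quadratic wire term.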

Optimal interconnect delay
Substituting $l_{opt}$ back into the interconnect delay expression shows that the delay grows linearly with L (instead of quadratically)
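The linear growth can be illustrated numerically. This sketch assumes a per-stage Elmore delay of $R_d(cl + C_g) + rl(cl/2 + C_g)$ with no intrinsic buffer delay, for which the optimum spacing is $\sqrt{2R_dC_g/(rc)}$ and the optimal delay is $L(R_dc + rC_g + \sqrt{2R_dC_grc})$. These closed forms belong to that assumed model and are not quoted from the slide; the stage count is treated as continuous and all values are made up.

```python
import math


def buffered_delay(L, l, R_d, C_g, r, c):
    """Total delay of L/l identical stages under the assumed per-stage model."""
    n = L / l
    return n * (R_d * (c * l + C_g) + r * l * (c * l / 2 + C_g))


def optimal_length(R_d, C_g, r, c):
    return math.sqrt(2 * R_d * C_g / (r * c))


def optimal_delay(L, R_d, C_g, r, c):
    return L * (R_d * c + r * C_g + math.sqrt(2 * R_d * C_g * r * c))


R_d = C_g = r = c = 1.0                          # illustrative values only
l_opt = optimal_length(R_d, C_g, r, c)
d10 = buffered_delay(10.0, l_opt, R_d, C_g, r, c)
d20 = buffered_delay(20.0, l_opt, R_d, C_g, r, c)
u10 = buffered_delay(10.0, 10.0, R_d, C_g, r, c)   # single stage, i.e. unbuffered
u20 = buffered_delay(20.0, 20.0, R_d, C_g, r, c)
print(f"buffered:   L=10 -> {d10:.2f}, L=20 -> {d20:.2f}")
print(f"unbuffered: L=10 -> {u10:.2f}, L=20 -> {u20:.2f}")
```

Doubling L doubles the optimally buffered delay, while the unbuffered delay more than triples for these values.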

Optimized interconnect delay scaling
The optimal interconnect delay expression is rewritten, and the buffers are optimally sized using dT/dh = 0

Optimized interconnect delay scaling
Even with optimal (re-)buffering, interconnects scale worse than devices
For global interconnects, L doesn't shrink

Buffered nets
[Figure: % buffered nets (axis 0 to 35) at the 90nm, 65nm, 45nm, and 32nm nodes, for M3 and M6]

Total buffer count
Ever-increasing fractions of total cell count will be buffers
– 70% in 32nm
[Figure: % cells used to buffer nets (axis 0 to 80) at the 90nm, 65nm, 45nm, and 32nm nodes, for clk-buf, buf, and tot-buf]

Buffer Insertion
Timing optimization
Slew optimization

Timing Driven Buffering Problem Formulation
Given
– A Steiner tree
– RAT at each sink
– A buffer type
– RC parameters
– Candidate buffer locations
Find a buffer insertion solution such that the slack at the driver is maximized

Candidate Buffering Solutions

Candidate Solution Characteristics
Each candidate solution is associated with
– $v_i$: a node
– $c_i$: downstream capacitance
– $q_i$: RAT
If $v_i$ is a sink, $c_i$ is the sink capacitance
[Figure: tree with an internal node v and a sink $v_i$]

Van Ginneken's Algorithm
Candidate solutions are propagated toward the source
Dynamic programming

Solution Propagation: Add Wire
$c_2 = c_1 + cx$
$q_2 = q_1 - rcx^2/2 - rxc_1$
r: wire resistance per unit length
c: wire capacitance per unit length
[Figure: wire of length x from $(v_1, c_1, q_1)$ upstream to $(v_2, c_2, q_2)$]

Solution Propagation: Insert Buffer
$c_{1b} = C_b$
$q_{1b} = q_1 - R_b c_1 - t_b$
$C_b$: buffer input capacitance
$R_b$: buffer output resistance
$t_b$: buffer intrinsic delay
[Figure: buffer inserted at $v_1$, turning $(v_1, c_1, q_1)$ into $(v_1, c_{1b}, q_{1b})$]

Solution Propagation: Merge
$c_{merge} = c_l + c_r$
$q_{merge} = \min(q_l, q_r)$
[Figure: left candidate $(v, c_l, q_l)$ and right candidate $(v, c_r, q_r)$ meeting at node v]

Solution Propagation: Add Driver
$q_{0d} = q_0 - R_d c_0 = slack_{min}$
$R_d$: driver resistance
Pick the solution with the maximum $slack_{min}$
[Figure: driver added at $v_0$, turning $(v_0, c_0, q_0)$ into $(v_0, c_{0d}, q_{0d})$]

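Example of Solution Propagation
Parameters: r = 1, c = 1; $R_b$ = 1, $C_b$ = 1, $t_b$ = 1; $R_d$ = 1; two wire segments of length 2
Sink: $(v_1, 1, 20)$
Add wire: $(v_2, 3, 16)$
Insert buffer: $(v_2, 1, 12)$
Add wire: $(v_3, 5, 8)$ without the buffer, $(v_3, 3, 8)$ with the buffer
Add driver: slack = 3 without the buffer, slack = 5 with the buffer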
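The four propagation rules (add wire, insert buffer, merge, add driver) are short enough to write out directly. In this sketch a candidate is reduced to a (c, q) pair, with the node left implicit; the function names are hypothetical. The values are the ones from the example above.

```python
def add_wire(cand, x, r, c):
    """Propagate (cap, q) across a wire of length x."""
    cap, q = cand
    return (cap + c * x, q - r * c * x * x / 2 - r * x * cap)


def insert_buffer(cand, R_b, C_b, t_b):
    """The buffer hides the downstream load and costs R_b * cap + t_b of RAT."""
    cap, q = cand
    return (C_b, q - R_b * cap - t_b)


def merge(left, right):
    """Join two branches at a node: loads add, the tighter RAT wins."""
    return (left[0] + right[0], min(left[1], right[1]))


def add_driver(cand, R_d):
    """Slack at the driver for this candidate."""
    cap, q = cand
    return q - R_d * cap


r = c = 1.0
R_b = C_b = t_b = 1.0
R_d = 1.0

sink = (1.0, 20.0)
at_v2 = add_wire(sink, 2.0, r, c)                  # (3, 16)
buffered_v2 = insert_buffer(at_v2, R_b, C_b, t_b)  # (1, 12)
unbuf_v3 = add_wire(at_v2, 2.0, r, c)              # (5, 8)
buf_v3 = add_wire(buffered_v2, 2.0, r, c)          # (3, 8)
slacks = (add_driver(unbuf_v3, R_d), add_driver(buf_v3, R_d))
print(at_v2, buffered_v2, unbuf_v3, buf_v3, slacks)
```

The buffered candidate wins at the driver because it reaches $v_3$ with the same q but a smaller load.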

Example of Merging
[Figure: left candidates and right candidates combined into merged candidates]

Solution Pruning
Two candidate solutions
– $(v, c_1, q_1)$
– $(v, c_2, q_2)$
Solution 1 is inferior if
– $c_1 > c_2$: larger load
– and $q_1 < q_2$: tighter timing

Pruning When Inserting a Buffer
All candidates created by inserting a buffer have the same load capacitance $C_b$, so only the one with the maximum q is kept

Generating Candidates
[Figure: candidate generation, steps (1) to (3). From Dr. Charles Alpert]

Pruning Candidates
Both (a) and (b) "look" the same to the source. Throw out the one with the worst slack
[Figure: candidates (a) and (b) from step (3), pruned to give step (4)]

Candidate Example Continued
[Figure: steps (4) and (5)]

Candidate Example Continued
[Figure: step (5) after pruning]
At the driver, compute which candidate maximizes slack. The result is optimal.

Merging Branches
[Figure: left candidates and right candidates at a branch point]

Pruning Merged Branches
[Figure: merged branches with the critical branch marked, shown with pruning]

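Van Ginneken Example
Candidates are written as (C, q), starting from (20, 400)
Wire (C = 10, d = 150): (20, 400) becomes (30, 250)
Buffer (C = 5, d = 30): (30, 250) becomes (5, 220)
Wire (C = 15): (30, 250) becomes (45, 50) with d = 200; (5, 220) becomes (20, 100) with d = 120
Buffer (C = 5): (45, 50) becomes (5, 0) with d = 50; (20, 100) becomes (5, 70) with d = 30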

Van Ginneken Example Cont'd
Candidates so far: (20, 400), (30, 250), (5, 220), (45, 50), (5, 0), (20, 100), (5, 70)
(5, 0) is inferior to (5, 70). (45, 50) is inferior to (20, 100)
Wire (C = 10): (20, 100) becomes (30, 10); (5, 70) becomes (15, -10)
Pick the solution with the largest slack, then follow the arrows back to recover the solution

Basic Data Structure
$(c_1, q_1)$ $(c_2, q_2)$ $(c_3, q_3)$
Sorted list such that $c_1 < c_2 < c_3$
If there are no inferior candidates, $q_1 < q_2 < q_3$
Moving along the list: worse load cap, better timing

Prune Solution List
[Figure: flowchart that walks the list $(c_1, q_1), (c_2, q_2), (c_3, q_3), (c_4, q_4)$ in order of increasing c, comparing q values (e.g. $q_1 < q_2$?, $q_2 < q_3$?, $q_3 < q_4$?) and pruning the later candidate whenever the answer is no]
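One pass over a list sorted by c is enough to drop inferior candidates: a candidate survives only if its q beats every q seen so far. The sketch below is an assumed implementation, not the flowchart from the slide; it also drops ties (equal c or equal q), which is slightly stricter than the strict definition of inferiority, and the initial sort makes it O(n log n) when the input is unsorted. The test data are the four candidates from the Van Ginneken example.

```python
def prune(candidates):
    """Return the non-inferior (c, q) pairs, sorted by increasing c and increasing q."""
    kept = []
    # Sort by c ascending; for equal c, visit the largest q first.
    for cap, q in sorted(candidates, key=lambda cq: (cq[0], -cq[1])):
        if not kept or q > kept[-1][1]:
            kept.append((cap, q))
    return kept


candidates = [(45, 50), (5, 0), (20, 100), (5, 70)]
pruned = prune(candidates)
print(pruned)
```

The survivors are (5, 70) and (20, 100), matching the example: (5, 0) loses to (5, 70) and (45, 50) loses to (20, 100).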

Pruning In Merging
Left candidates: $(c_{l1}, q_{l1})$, $(c_{l2}, q_{l2})$, $(c_{l3}, q_{l3})$
Right candidates: $(c_{r1}, q_{r1})$, $(c_{r2}, q_{r2})$
Assume $q_{l1} < q_{l2} < q_{r1} < q_{l3} < q_{r2}$
Merged candidates: $(c_{l1} + c_{r1}, q_{l1})$, $(c_{l2} + c_{r1}, q_{l2})$, $(c_{l3} + c_{r1}, q_{r1})$, $(c_{l3} + c_{r2}, q_{l3})$
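The merged list above can be produced with a two-pointer walk over the left and right lists, advancing whichever side currently limits q. This is a sketch under the assumption that both inputs are already pruned and sorted by increasing c and q; the function name and the numbers are hypothetical, chosen to respect the ordering of q values given on the slide.

```python
def merge_lists(left, right):
    """Merge two pruned candidate lists in time linear in their total length."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        (cl, ql), (cr, qr) = left[i], right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the side that limits q; pairing it with a larger-c partner
        # on the other side could only add load without improving q.
        if ql <= qr:
            i += 1
        if qr <= ql:
            j += 1
    return merged


# q_l1 < q_l2 < q_r1 < q_l3 < q_r2, as on the slide.
left = [(1, 10), (2, 20), (4, 40)]
right = [(1, 30), (3, 50)]
merged = merge_lists(left, right)
print(merged)
```

The output has four entries, following the pattern $(c_{l1} + c_{r1}, q_{l1})$, $(c_{l2} + c_{r1}, q_{l2})$, $(c_{l3} + c_{r1}, q_{r1})$, $(c_{l3} + c_{r2}, q_{l3})$, so the merged list grows additively rather than as a full cross product.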

Van Ginneken Complexity
Generate candidates from sinks to source
Quadratic runtime
– Adding a wire does not change the number of candidates
– Adding a buffer adds only one new candidate
– Merging branches is additive, not multiplicative
– Linear-time solution list pruning
Optimal for the Elmore delay model

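Multiple Buffer Types
Parameters: r = 1, c = 1; $R_b$ = 1, $C_b$ = 1, $t_b$ = 1; $R_{b2}$ = 0.5, $C_{b2}$ = 2, $t_{b2}$ = 0.5; $R_d$ = 1; wire segments of length 2
Sink: $(v_1, 1, 20)$
Add wire: $(v_2, 3, 16)$
Insert buffer of type 1: $(v_2, 1, 12)$
Insert buffer of type 2: $(v_2, 2, 14)$
http://vlsitechnology.org/html/cells/vsclib013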
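With several buffer types, the insert-buffer step yields one new candidate per type. A minimal sketch, using the two buffer types from the example and a hypothetical helper name:

```python
def insert_buffers(cand, buffer_types):
    """Return one (c, q) candidate per buffer type (R_b, C_b, t_b)."""
    cap, q = cand
    return [(C_b, q - R_b * cap - t_b) for (R_b, C_b, t_b) in buffer_types]


buffer_types = [
    (1.0, 1.0, 1.0),   # type 1: R_b, C_b, t_b
    (0.5, 2.0, 0.5),   # type 2: stronger drive, larger input capacitance
]
new_candidates = insert_buffers((3.0, 16.0), buffer_types)
print(new_candidates)
```

Neither new candidate is inferior to the other here: type 1 presents the smaller load, while type 2 leaves the better RAT, so both are kept alongside the unbuffered candidate.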

Handle Polarity
[Figure: candidates labeled with negative or positive polarity]

Consider Cost/Power
A solution is also characterized by cost w
A solution is inferior if it is poor on all of c, q and w
At the source, there is a set of solutions with a tradeoff between q and w
w can be
– total capacitance
– or the number of buffers

Cost-Slack Trade-off

Data Organization
One candidate list per number of buffers inserted:
– 0: $(c_1, q_1)$ $(c_2, q_2)$ $(c_3, q_3)$
– 1: $(c_4, q_4)$ $(c_5, q_5)$ $(c_6, q_6)$
– 2: $(c_7, q_7)$ $(c_8, q_8)$
– 3: $(c_9, q_9)$ $(c_{10}, q_{10})$
– 4: $(c_{11}, q_{11})$
Each list is sorted in ascending order of (c, q)

Pruning Considering Cost
$(c_i, q_i, w_i)$ is inferior to $(c_k, q_k, w_k)$ if $c_i > c_k$, $q_i < q_k$, and $w_i > w_k$
Pruning within a list is the same as before
How to prune a solution with cost $w_k$ against the set of solutions with $w < w_k$?
[Figure: candidate lists $(c_1, q_1)$ through $(c_9, q_9)$ indexed by cost w = 0, 1, 2, with an arrow marking the prune order along w]
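With cost added, inferiority becomes a three-way dominance test. The sketch below is a brute-force, quadratic-time illustration of that test, not the list-based procedure the slide asks about; it treats ties as dominated (non-strict comparisons), and the candidate values are invented.

```python
def dominates(a, b):
    """True if a = (c, q, w) is no worse than b on c, q and w, and is not identical to b."""
    return a != b and a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]


def prune_with_cost(candidates):
    """Drop every candidate dominated by another one (O(n^2) comparisons)."""
    unique = list(dict.fromkeys(candidates))   # remove exact duplicates, keep order
    return [s for s in unique if not any(dominates(o, s) for o in unique)]


candidates = [
    (5, 70, 1),     # one buffer
    (20, 100, 0),   # no buffer
    (45, 50, 0),    # dominated by (20, 100, 0)
    (5, 0, 1),      # dominated by (5, 70, 1)
    (5, 60, 0),     # worse q than (5, 70, 1), but cheaper, so it stays
]
kept = prune_with_cost(candidates)
print(kept)
```

The last candidate shows why cost enlarges the solution set: it would be pruned on (c, q) alone, but survives because no cheaper or equal-cost candidate beats it.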

Blockage Recognition
Delete insertion points that run over blockages

References
L.P.P.P. van Ginneken, "Buffer placement in distributed RC-tree networks for minimal Elmore delay", ISCAS 1990, pp. 865-868.
J. Lillis, C.-K. Cheng, and T. T. Lin, "Optimal wire sizing and buffer insertion for low power and generalized delay model", IEEE J. Solid-State Circuits, 31(3), pp. 437-447, 1996.
W. Shi and Z. Li, "An O(nlogn) time algorithm for optimal buffer insertion", Proc. DAC 2003, pp. 580-585.
