# A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion

Professor Shiyan Hu, Ph.D., Department of Electrical and Computer Engineering, Michigan Technological University

Moore’s law 2 Twice the number of transistors, approximately every two years

Interconnect Delay Dominates Gate Delay 3

Technology Scaling 4

[Figure: interconnect at 130nm and 65nm.]

- Global interconnect lengths do not shrink
- Local interconnect lengths shrink
- Delay ∝ RC
- Resistance $R = \rho L / S$, where the cross-section $S$ is reduced
- Capacitance $C$ changes only slightly

Interconnect Delay Scaling 5

- Scaling factor $s = 0.7$ per generation
- Elmore delay of a wire of length $l$: $\tau_{int} = (rl)(cl)/2 = rcl^2/2$ (first order)
- Local interconnects: $\tau_{int} = (r/s^2)(c)(ls)^2/2 = rcl^2/2$
  - Local interconnect delay is roughly unchanged
- Global interconnects: $\tau_{int} = (r/s^2)(c)(l)^2/2 = rcl^2/(2s^2) \approx rcl^2$, since $s^2 \approx 0.5$
  - Global interconnect delay doubles, which is unsustainable
- Interconnect delay is increasingly dominant
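The scaling argument on this slide can be checked numerically. The following is a minimal sketch, assuming the first-order model stated above (resistance per unit length grows as 1/s², capacitance per unit length is unchanged); the function name `scaled_delay_ratio` is hypothetical.

```python
def scaled_delay_ratio(s: float, length_scales: bool) -> float:
    """Ratio of wire delay after one generation to the delay before it.

    First-order Elmore model: delay = r * c * l**2 / 2.
    Assumes r -> r / s**2, c unchanged, and l -> l * s only for local wires.
    """
    r_factor = 1.0 / s**2                   # thinner wire, higher resistance
    l_factor = s if length_scales else 1.0  # only local wires shrink
    return r_factor * l_factor**2


s = 0.7
local_ratio = scaled_delay_ratio(s, length_scales=True)
global_ratio = scaled_delay_ratio(s, length_scales=False)
print(f"local: {local_ratio:.2f}x, global: {global_ratio:.2f}x")
```

With s = 0.7 the global ratio is 1/0.49, which is where the "delay doubles" claim comes from.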

Timing Driven Buffer Insertion 6

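Buffers Reduce RC Wire Delay 7

[Figure: RC models of a wire of length $x$ with driver resistance $R$ and load $C$, unbuffered and with a buffer at $x/2$; each half-wire has resistance $rx/2$ and capacitances $cx/4$.]

$\Delta t = t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$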
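The delay change on this slide is negative only for sufficiently long wires. As a rough sketch, setting the slide's expression to zero gives a break-even length of 2·sqrt((RC + t_b)/(rc)); the function names and the unit parameter values below are illustrative assumptions, not values from the presentation.

```python
import math


def buffer_delay_change(R, C, t_b, r, c, x):
    """Delta t = t_buf - t_unbuf = R*C + t_b - r*c*x**2/4 (formula from the slide)."""
    return R * C + t_b - r * c * x**2 / 4


def break_even_length(R, C, t_b, r, c):
    """Wire length at which Delta t = 0; longer wires benefit from the buffer."""
    return 2 * math.sqrt((R * C + t_b) / (r * c))


# Illustrative unit values only (not from the presentation).
R, C, t_b, r, c = 1.0, 1.0, 1.0, 1.0, 1.0
x0 = break_even_length(R, C, t_b, r, c)
print(x0, buffer_delay_change(R, C, t_b, r, c, 2 * x0))
```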

Intuitive Analysis 8

Interconnect Elmore delay $= rcL^2/2$.

[Figure: a wire of total length $L$ split into segments of length $l$.]

(Of course, we also need to consider buffer delay.)

Detailed Analysis 9

- The delay of a wire of length $L$ is $T = rcL^2/2$
- $r$, $c$: resistance and capacitance per unit length
- $R_d$: on-resistance of the inverter
- $C_g$: gate input capacitance
- Assume $N$ identical buffers with equal inter-buffer length $l$ ($L = Nl$), and choose $l$ to minimize delay

Quadratic Delay -> Linear Delay 10

- Substituting $l_{opt}$ back into the interconnect delay expression, delay grows linearly with $L$ instead of quadratically. This is why buffer insertion is highly effective and thus widely used for reducing circuit delay.
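The expressions for this step did not survive in the transcript. As a hedged reconstruction, assuming the standard Elmore model in which each of the $N = L/l$ stages is a buffer with on-resistance $R_d$ driving a wire segment of length $l$ and the next gate capacitance $C_g$ (intrinsic buffer delay ignored), the derivation would run roughly as follows; the presenter's exact expressions may differ.

```latex
\begin{align*}
T(l) &= \frac{L}{l}\left[ R_d\,(c\,l + C_g) + r\,l\left(\frac{c\,l}{2} + C_g\right) \right]
      = L\left[ R_d c + r C_g + \frac{R_d C_g}{l} + \frac{r c\, l}{2} \right], \\
\frac{dT}{dl} &= L\left[ -\frac{R_d C_g}{l^2} + \frac{r c}{2} \right] = 0
      \;\Longrightarrow\; l_{opt} = \sqrt{\frac{2 R_d C_g}{r c}}, \\
T(l_{opt}) &= L\left[ R_d c + r C_g + \sqrt{2\, R_d C_g\, r c} \right].
\end{align*}
```

Under these assumptions the bracketed factor does not depend on $L$, so the optimized delay is linear in $L$.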

25% Gates are Buffers 11 Saxena, et al. [TCAD 2004]

ITRS Projections 12

Problem Formulation 13

Given:

1. A Steiner tree
2. $n$ candidate buffer locations

Find a minimal-cost (area/power) buffering solution that satisfies the timing constraint.

Solution Characterization 14

To model its effect on the downstream subtree, a candidate solution is associated with:

- $v$: a node
- $C$: downstream capacitance
- $Q$: required arrival time
- $W$: cumulative buffer cost

Candidate Buffering Solutions 15

Dynamic Programming (DP) 16

Candidate solutions are propagated toward the source.

- Start from the sinks
- Candidate solutions are generated
- Three operations
  - Add Wire
  - Insert Buffer
  - Merge
- Solution pruning

Solution Propagation: Add Wire 17

A wire of length $x$ takes $(v_1, c_1, w_1, q_1)$ to $(v_2, c_2, w_2, q_2)$:

- $c_2 = c_1 + cx$
- $q_2 = q_1 - (rcx^2/2 + rxc_1)$
- $r$: wire resistance per unit length
- $c$: wire capacitance per unit length

Solution Propagation: Insert Buffer 18

Inserting buffer $b$ takes $(v_1, c_1, w_1, q_1)$ to $(v_1, c_{1b}, w_{1b}, q_{1b})$:

- $q_{1b} = q_1 - d(b)$
- $c_{1b} = C(b)$
- $w_{1b} = w_1 + w(b)$
- $d(b)$: buffer delay

Solution Propagation: Merge 19

Merging $(v, c_l, w_l, q_l)$ and $(v, c_r, w_r, q_r)$ at node $v$:

- $c_{merge} = c_l + c_r$
- $w_{merge} = w_l + w_r$
- $q_{merge} = \min(q_l, q_r)$

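Example of Solution Propagation 20

Candidates are written as $(v, C, Q, W)$, with $r = 1$, $c = 1$, $R_b = 1$, $C_b = 1$, $t_b = 1$, $R_d = 1$ and two wire segments of length 2.

- Sink: $(v_1, 1, 20, 0)$
- Add wire: $(v_2, 3, 16, 0)$
- Insert buffer: $(v_2, 1, 12, 1)$
- Add wire: $(v_3, 5, 8, 0)$ without the buffer, $(v_3, 3, 8, 1)$ with the buffer
- Add driver: slack = 3 without the buffer, slack = 5 with the buffer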
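The three propagation operations and this example can be sketched in a few lines. This is an illustrative sketch rather than the presenter's implementation: it assumes the buffer delay model d(b) = t_b + R_b · (downstream capacitance), a wire length of 2 and a buffer cost of 1 (all inferred from the numbers in the example), and the names `Candidate`, `add_wire`, `insert_buffer`, `merge` and `slack_at_driver` are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    node: str
    cap: float    # C: downstream capacitance
    rat: float    # Q: required arrival time
    cost: float   # W: cumulative buffer cost


def add_wire(s, node, x, r, c):
    """Propagate a candidate across a wire of length x (Elmore delay)."""
    delay = r * c * x**2 / 2 + r * x * s.cap
    return Candidate(node, s.cap + c * x, s.rat - delay, s.cost)


def insert_buffer(s, R_b, C_b, t_b, w_b):
    """Insert a buffer at the node; assumes d(b) = t_b + R_b * (load)."""
    return Candidate(s.node, C_b, s.rat - (t_b + R_b * s.cap), s.cost + w_b)


def merge(left, right):
    """Join the left and right branch candidates at a common node."""
    return Candidate(left.node, left.cap + right.cap,
                     min(left.rat, right.rat), left.cost + right.cost)


def slack_at_driver(s, R_d):
    """Slack after subtracting the driver delay R_d * C."""
    return s.rat - R_d * s.cap


r = c = 1
sink = Candidate("v1", 1, 20, 0)
unbuffered_v2 = add_wire(sink, "v2", 2, r, c)
buffered_v2 = insert_buffer(unbuffered_v2, R_b=1, C_b=1, t_b=1, w_b=1)
unbuffered_v3 = add_wire(unbuffered_v2, "v3", 2, r, c)
buffered_v3 = add_wire(buffered_v2, "v3", 2, r, c)
print(unbuffered_v3, slack_at_driver(unbuffered_v3, R_d=1))
print(buffered_v3, slack_at_driver(buffered_v3, R_d=1))
```

Under these assumptions the intermediate tuples match the ones listed on the slide, and the buffered candidate ends with the larger slack.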

Solution Propagation 21 (1) (2) (3)

Exponential Runtime 22

[Figure: the number of candidate solutions doubles at each candidate location: 2, 4, 8, 16 solutions.]

$n$ candidate buffer locations lead to $2^n$ solutions.

Too Many Solutions 23

- Solution pruning is needed for acceleration
- Two candidate solutions
  - $(v, c_1, q_1, w_1)$
  - $(v, c_2, q_2, w_2)$
- Solution 1 is inferior to Solution 2 if
  - $c_1 \ge c_2$: larger load
  - and $q_1 \le q_2$: tighter timing
  - and $w_1 \ge w_2$: larger cost

Car Race - Speed 24

[Figure: car race analogy, where car speed corresponds to RAT (required arrival time).]

Faster & Smaller Load 26

- Faster and smaller load (larger RAT, smaller capacitance): good
- Slower and larger load (smaller RAT, larger capacitance): inferior

Faster & Larger Load: Result 1 27

Faster & Larger Load: Result 2 28

Who will be the winner? We cannot tell at this moment, so keep both of them.

Pruning 29

- $(Q_1, C_1, W_1)$ is inferior to (dominated by) $(Q_2, C_2, W_2)$ if $C_1 \ge C_2$, $W_1 \ge W_2$ and $Q_1 \le Q_2$
- Non-dominated solutions are maintained: for the same $Q$ and $W$, pick the minimum $C$
- The number of solutions depends on the number of distinct $W$ and $Q$, but not on their values
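The dominance rule can be sketched directly. The following is a deliberately simple quadratic-time illustration, not the efficient bookkeeping a production tool would use; the function name `prune` and the sample triples are hypothetical, with candidates written as (Q, C, W).

```python
def prune(candidates):
    """Drop dominated (Q, C, W) triples.

    (Q1, C1, W1) is dominated by (Q2, C2, W2) if C1 >= C2, W1 >= W2 and
    Q1 <= Q2. Quadratic-time sketch for clarity only.
    """
    unique = list(dict.fromkeys(candidates))  # remove exact duplicates, keep order
    kept = []
    for i, (q1, c1, w1) in enumerate(unique):
        dominated = any(
            j != i and c1 >= c2 and w1 >= w2 and q1 <= q2
            for j, (q2, c2, w2) in enumerate(unique)
        )
        if not dominated:
            kept.append((q1, c1, w1))
    return kept


candidates = [(8, 5, 0), (8, 3, 1), (7, 5, 1), (8, 5, 0)]
print(prune(candidates))
```

Here (7, 5, 1) is dropped because (8, 5, 0) has the same load, lower cost and better timing, while (8, 5, 0) and (8, 3, 1) are both kept since neither dominates the other.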

Generating Candidates 30 (1) (2) (3)

Pruning Candidates 31 (3) (a) (b) Both (a) and (b) look the same to the source. Remove the one with the worse slack and cost (4)

Candidate Example Continued 32 (4) (5)

Candidate Example Continued 33 After pruning (5) At driver, compute the candidate solution satisfying the timing target with minimum cost. The result is optimal.

Branch Merge 34 Right Candidates Left Candidates

Pruning During Branch Merge 35

With pruning, there are $O(n_1 n_2)$ solutions after each branch merge. In the worst case, there are $O((n/m)^m)$ solutions.

Selected Milestone Works on Timing Buffering 36

[Timeline, 1990 to 2009: van Ginneken's algorithm, Lillis' algorithm, Shi and Li's algorithm, NP-hardness proof.]

Is it possible to design a provably good algorithm running in polynomial time with a theoretical guarantee on the error relative to the optimal solution? This has been a major open problem for a decade!

Bridging The Gap 37

We are bridging the gap with a Fully Polynomial Time Approximation Scheme (FPTAS):

- Provably good
- Computes a solution with cost at most $(1+\epsilon)$ times the optimal cost for any $\epsilon > 0$
- Runs in time polynomial in $n$ (nodes), $b$ (buffer types) and $1/\epsilon$
- In theory, the best type of solution available for an NP-hard problem
- Highly practical

The Rough Picture 38

$W^*$: the cost of the optimal solution.

1. Make a guess on $W^*$ (Key 2: smart guess)
2. Check it (Key 1: efficient checking)
3. If the guess is good (close to $W^*$), return the solution; if not, make another guess

Key 1: Efficient Checking 39

Benefit of the guess:

- Only maintain the solutions with cost no greater than the guessed cost
- This is the first reason for acceleration

The Oracle 40

Oracle($x$): the checker, able to decide whether $x > W^*$ or not

- Without knowing $W^*$
- Answers efficiently

Construction of Oracle(x) 41

- Only interested in whether there is a solution with cost up to $x$ satisfying the timing constraint
- Dynamic programming: perform DP on the scaled problem with cost upper bound $n/\epsilon$
- Time is polynomial in $n/\epsilon$

Scaling and Rounding 42

[Figure: buffer cost axis with grid points $0$, $x\epsilon/n$, $2x\epsilon/n$, $3x\epsilon/n$, $4x\epsilon/n$.]

Scaling and Rounding 43

[Figure: scaled buffer cost axis with values 0, 1, 2, 3, 4.]

- The number of distinct buffer costs is at most $O(n/\epsilon)$, since only solutions with $W$ bounded by $n/\epsilon$ are propagated
- Rounding error at each buffer $\le x\epsilon/n$; total rounding error $\le x\epsilon$
- Larger $x\epsilon/n$: larger error, fewer distinct costs and faster
- Smaller $x\epsilon/n$: smaller error, more distinct costs and slower
- Rounding is the second reason for acceleration

Oracle Construction 44

Run dynamic programming with cost $\le n/\epsilon$.

- Yes, there is a solution satisfying the timing constraint: with cost rounded and scaled back, the solution has cost at most $(n/\epsilon)(x\epsilon/n) + x\epsilon = (1+\epsilon)x$, so $(1+\epsilon)x > W^*$
- No, there is no such solution: with cost rounded and scaled back, any solution has cost at least $(n/\epsilon)(x\epsilon/n) = x$, so $x \le W^*$
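The scaling and rounding step behind the oracle can be illustrated as follows. This is a minimal sketch that assumes costs are rounded down to integer multiples of the unit xε/n (the slides do not state the rounding direction); the function name `scale_and_round` and the sample costs are hypothetical. Exact fractions are used to avoid floating-point surprises in the floor operation.

```python
from fractions import Fraction
from math import floor


def scale_and_round(costs, x, eps, n):
    """Map each buffer cost to an integer multiple of the unit x*eps/n.

    Rounding down is an assumption of this sketch; it loses less than one
    unit per buffer, hence less than x*eps over at most n buffers.
    """
    unit = Fraction(x) * Fraction(eps) / n
    return [floor(Fraction(w) / unit) for w in costs], unit


x, eps, n = 100, Fraction(1, 10), 4       # guessed cost, accuracy, # locations
costs = [7, 12, 30, 33]                   # hypothetical buffer costs
scaled, unit = scale_and_round(costs, x, eps, n)
budget = Fraction(n) / eps                # DP keeps scaled cost <= n/eps
error = sum(costs) - sum(scaled) * unit   # total rounding error
print(scaled, unit, budget, error)
```

In this toy instance the unit is 5/2, the DP budget is 40 scaled units, and the total rounding error stays below xε = 10.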

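Rounding on Q 45

- The number of solutions is bounded by the number of distinct $W$ and $Q$
- Number of distinct $W$ is $O(n/\epsilon_1)$, where $\epsilon_1$ is used for $W$
  - Rounding before DP
- Number of distinct $Q$
  - Round up $Q$ to the nearest value in $\{0, \epsilon_2 T/m, 2\epsilon_2 T/m, 3\epsilon_2 T/m, \ldots, T\}$ in branch merge ($m$ is the number of sinks)
  - Rounding during DP
  - Number of distinct $Q$ is $O(m/\epsilon_2)$, where $\epsilon_2$ is used for $Q$
  - Rounding error is bounded by $\epsilon_2 T/m$ per branch merge, and by $\epsilon_2 T$ for the whole tree
- The number of non-dominated solutions is $O(mn/(\epsilon_1 \epsilon_2))$

[Figure: $Q$ axis with grid points $0$, $\epsilon_2 T/m$, $2\epsilon_2 T/m$, $3\epsilon_2 T/m$, $4\epsilon_2 T/m$.]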
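Rounding Q onto the grid described above takes one line of arithmetic. The sketch below assumes 0 ≤ Q ≤ T and uses exact fractions; the function name `round_up_q` and the sample values are hypothetical.

```python
from fractions import Fraction
from math import ceil


def round_up_q(q, T, m, eps2):
    """Round q up to the nearest grid value k * eps2 * T / m, capped at T."""
    step = Fraction(eps2) * T / m
    return min(ceil(Fraction(q) / step) * step, Fraction(T))


T, m, eps2 = 100, 5, Fraction(1, 10)   # hypothetical target, sinks, accuracy
print(round_up_q(37, T, m, eps2))      # grid step is eps2*T/m = 2
```

Each rounding moves Q by less than one step, which is the per-merge error bound of ε2·T/m stated on the slide.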

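Q-W Rounding Before Branch Merge 46

[Figure: $W$-$Q$ grid; $W$ axis with values 0, 1, 2, 3, 4 up to $n/\epsilon_1$; $Q$ axis with grid points $\epsilon_2 T/m$, $2\epsilon_2 T/m$, $3\epsilon_2 T/m$, $4\epsilon_2 T/m$ up to $T$.]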

Buffer Insertion Runtime 47

Branch Merge Runtime - 1 48

Target $Q = 0$. When merging $W_l = 2$ with $W_r = 1$, we previously needed to try a quadratic number of combinations; now only a linear number of combinations is needed.

Branch Merge Runtime - 2 49

Target $Q = \epsilon_2 T/m$

Branch Merge Runtime - 3 50

Target $Q = 2\epsilon_2 T/m$

Branch Merge Runtime - 4 51

Timing-Cost Approximate DP 52

- Lemma: a buffering solution with cost at most $(1+\epsilon_1)W^*$ and with timing at most $(1+\epsilon_2)T$ can be computed in time

Key 2: Geometric Sequence Based Guess 53

- $U$ ($L$): upper (lower) bound on $W^*$
- Naive binary-search-style approach:
  - Set $U$ and $L$ on $W^*$
  - Query Oracle($x$) with $x = (U+L)/2$
  - If $W^* < (1+\epsilon)x$, set $U = (1+\epsilon)x$
  - If $W^* \ge x$, set $L = x$
- Runtime (number of iterations) depends on the initial bounds $U$ and $L$
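The naive binary-search-style loop can be sketched as follows. This is an illustration only: the real oracle is the scaled DP, whereas `toy_oracle` below is an idealized stand-in that knows a hidden optimum, and the names `narrow_bounds`, `toy_oracle` and `W_STAR` are hypothetical. The loop stops once U/L < 2, and with midpoint queries it only makes progress for small ε (below 1/3 in this simplified setting).

```python
def narrow_bounds(oracle, L, U, eps, max_iter=1000):
    """Shrink [L, U] around W* by midpoint queries until U/L < 2.

    oracle(x) returns True when a feasible solution was found, which
    implies W* < (1 + eps) * x, and False when W* >= x.
    """
    for _ in range(max_iter):
        if U / L < 2:
            break
        x = (U + L) / 2
        if oracle(x):
            U = (1 + eps) * x
        else:
            L = x
    return L, U


W_STAR = 37.0  # hidden optimum of a toy instance


def toy_oracle(x):
    """Idealized stand-in for the DP-based check."""
    return W_STAR <= x


L, U = narrow_bounds(toy_oracle, 1.0, 1000.0, eps=0.1)
print(L, U)
```

The number of queries depends on how far apart the initial bounds are, which is the weakness the geometric-sequence guess is meant to address.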

Adapt $\epsilon_1$ 54

- Rounding factor $x\epsilon_1/n$ for $W$
- Larger $\epsilon_1$: faster, with rough estimation
- Smaller $\epsilon_1$: slower, with accurate estimation
- Adapt $\epsilon_1$ according to $U$ and $L$

U/L Related Scale and Round 55

[Figure: buffer cost axis starting at 0, with rounding step $x\epsilon/n$ related to $U/L$.]

Conceptually 56

- Begin with a large $\epsilon_1$ and progressively reduce it (toward $\epsilon$) according to $U/L$ as $x$ approaches $W^*$
- Fix $\epsilon_2 = \epsilon$ in rounding $T$ to limit the timing violation
- Set $\epsilon_1$ as a geometric sequence: $\ldots, 8, 4, 2, 1, 1/2, \ldots, \epsilon$
- Suppose that one run of DP takes $O(n/\epsilon_1)$ time. The total runtime is then bounded by that of the last run: $O(\ldots + n/8 + n/4 + n/2 + \ldots + n/\epsilon) = O(n/\epsilon)$
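The claim that the last run dominates follows from a geometric series. As a worked check, assuming for simplicity that $\epsilon = 2^{-k}$ for an integer $k \ge 0$ and that the sequence starts at $\epsilon_1 = 8$, the run with $\epsilon_1 = 2^{-j}$ costs $n/\epsilon_1 = n\,2^{j}$, so:

```latex
\[
\sum_{j=-3}^{k} n\,2^{j} \;=\; n\left(2^{k+1} - 2^{-3}\right) \;<\; 2n\cdot 2^{k} \;=\; \frac{2n}{\epsilon}
\]
```

The total is therefore within a factor of two of the final run alone.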

Oracle Query Till U/L<2 57

Mathematically 58

When U/L<2 59

1. Scale and round each cost by $L\epsilon/n$
2. Run DP with cost bound $W = 2n/\epsilon$
3. Pick the minimum-cost solution satisfying timing at the driver

- Rounding error is $L\epsilon/n$ per buffer and $L\epsilon$ in a solution
- There is at least one feasible solution; otherwise there would be no solution with cost $(2n/\epsilon)(L\epsilon/n) = 2L \ge U$
- Runtime is that of a single DP

The Algorithmic Flow 60

1. Set $U$ and $L$ of $W^*$
2. While $U/L \ge 2$: adapt $\epsilon_1 = [U/L - 1]^{1/2}$, set $x = [UL/(1+\epsilon_1)]^{1/2}$, query Oracle($x$), and update $U$ or $L$
3. When $U/L < 2$: compute the final solution

Main Theorem 61

- Theorem: a $(1+\epsilon)$ approximation to the timing constrained minimum cost buffering problem can be computed in $O(m^2 n^2 b/\epsilon^3 + n^3 b^2/\epsilon)$ time for $0 < \epsilon < 1$ and in $O(m^2 n^2 b/\epsilon + mn^2 b + n^3 b)$ time for $\epsilon \ge 1$

Experiments 62

- Experimental setup
  - 1000 industrial nets
  - 48 industrial buffer types, including non-inverting and inverting buffers
- Compared to dynamic programming, which is the state-of-the-art technique and is widely used in industry

Cost Ratio Compared to DP 63 Buffer Cost Ratio

Speedup Compared to DP 64 Speedup

Observations 65

- FPTAS always achieves the theoretical guarantee
- Larger $\epsilon$ leads to more speedup
- On average about 5x faster than dynamic programming
- Can run 4.6x faster with 0.57% solution degradation
- Fewer than 5% of nets have timing violations, which can be fixed by a simple timing recovery procedure

Our Bridge 66 NP-Hardness Complexity Exponential Time Algorithm

Conclusion 67

- Propose a $(1+\epsilon)$ approximation for timing constrained minimum cost buffering for any $\epsilon > 0$ (DAC'09)
  - Runs in $O(m^2 n^2 b/\epsilon^3 + n^3 b^2/\epsilon)$ time
  - Timing-cost approximate dynamic programming
  - Double-$\epsilon$ geometric sequence based oracle search
  - 5x speedup in experiments
  - A few percent additional buffers, as guaranteed theoretically
- The first provably good approximation algorithm for this problem, which had been a major open problem in the field

Thanks
