Published by Melina Youman. Modified over 2 years ago.

1
A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion
Professor Shiyan Hu, Ph.D.
Department of Electrical and Computer Engineering
Michigan Technological University

2
Moore’s Law
The number of transistors doubles approximately every two years.

3
Interconnect Delay Dominates Gate Delay

4
Technology Scaling
[Figure: interconnect at the 130 nm and 65 nm nodes.]
Global interconnect lengths do not shrink.
Local interconnect lengths shrink.
Delay ∝ RC
Resistance R ∝ L/S, where the cross-sectional area S is reduced.
Capacitance C changes only slightly.

5
Interconnect Delay Scaling
Scaling factor $s = 0.7$ per generation.
Elmore delay of a wire of length $l$: $\tau_{int} = (rl)(cl)/2 = rcl^2/2$ (first order).
Local interconnects: $\tau_{int} = (r/s^2)(c)(ls)^2/2 = rcl^2/2$
– Local interconnect delay is roughly unchanged.
Global interconnects: $\tau_{int} = (r/s^2)(c)(l)^2/2 = rcl^2/(2s^2) \approx rcl^2$
– Global interconnect delay doubles, which is unsustainable.
Interconnect delay is increasingly dominant.
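The scaling argument can be checked numerically. The sketch below is only an illustration: the function names and the unit values for r, c and the wire length are assumptions, and capacitance per unit length is treated as exactly unchanged.

```python
def elmore_wire_delay(r, c, length):
    """First-order Elmore delay of a uniform wire: (r*length)*(c*length)/2."""
    return (r * length) * (c * length) / 2


def scaled_delays(r, c, length, s=0.7):
    """Delays after one generation: r grows by 1/s^2, c is assumed unchanged."""
    r_next = r / s**2
    local = elmore_wire_delay(r_next, c, length * s)  # local wires shrink by s
    global_ = elmore_wire_delay(r_next, c, length)    # global wires keep their length
    return local, global_


# Illustrative unit values (assumptions, not from the slides).
base = elmore_wire_delay(1.0, 1.0, 10.0)
local, global_ = scaled_delays(1.0, 1.0, 10.0)
print(f"local ratio  = {local / base:.3f}")    # stays close to 1
print(f"global ratio = {global_ / base:.3f}")  # close to 1/s^2, about 2
```

With s = 0.7 the global ratio is 1/0.49, which is the "delay doubles" statement on the slide.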

6
Timing Driven Buffer Insertion

7
Buffers Reduce RC Wire Delay
[Figure: a wire of length $x$ is split into two halves of length $x/2$ by a buffer with output resistance $R$ and input capacitance $C$.]
$\Delta t = t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$

8
Intuitive Analysis
Interconnect Elmore delay $= rcL^2/2$.
[Figure: a wire of length $L$ divided into segments of length $l$.]
(Of course, we need to consider buffer delay.)

9
Detailed Analysis
The delay of a wire of length $L$ is $T = rcL^2/2$.
$r$, $c$: resistance and capacitance per unit length
$R_d$: on-resistance of the inverter
$C_g$: gate input capacitance
Assume $N$ identical buffers with equal inter-buffer length $l$ ($L = Nl$). To minimize delay, choose the optimal inter-buffer length $l_{opt}$.

10
Quadratic Delay -> Linear Delay
Substituting $l_{opt}$ back into the interconnect delay expression, delay grows linearly with $L$ instead of quadratically. This is why buffer insertion is highly effective and thus widely used for reducing circuit delay.
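The slide equations are not preserved in this transcript. The derivation below is a standard textbook sketch, not necessarily the exact form shown on the slides. It assumes each stage is a buffer with on-resistance $R_d$ driving a wire segment of length $l$ that ends in a gate capacitance $C_g$, uses the Elmore model, and ignores the intrinsic buffer delay.

```latex
\begin{align*}
T(l) &= \frac{L}{l}\left[ R_d\,(c\,l + C_g) + r\,l\left(\frac{c\,l}{2} + C_g\right) \right]
      = L\left[ R_d c + r C_g + \frac{R_d C_g}{l} + \frac{r c\,l}{2} \right], \\
\frac{dT}{dl} &= L\left[ -\frac{R_d C_g}{l^2} + \frac{r c}{2} \right] = 0
      \;\Longrightarrow\; l_{opt} = \sqrt{\frac{2 R_d C_g}{r c}}, \\
T(l_{opt}) &= L\left[ R_d c + r C_g + \sqrt{2\,R_d C_g\,r c} \right].
\end{align*}
```

Under these assumptions the bracketed factor does not depend on $L$, so the buffered delay is linear in $L$, compared with $rcL^2/2$ for the unbuffered wire.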

11
25% of Gates Are Buffers
Saxena et al. [TCAD 2004]

12
ITRS Projections

13
Problem Formulation
Given:
1. A Steiner tree
2. $n$ candidate buffer locations
Find a minimal cost (area/power) buffering solution that meets the timing constraint $T$.

14
Solution Characterization
To model the effect on the downstream circuit, a candidate solution is associated with:
v: a node
C: downstream capacitance
Q: required arrival time
W: cumulative buffer cost

15
Candidate Buffering Solutions

16
Dynamic Programming (DP)
Start from the sinks.
Candidate solutions are generated.
Candidate solutions are propagated toward the source.
Three operations:
– Add wire
– Insert buffer
– Merge
Solution pruning

17
Solution Propagation: Add Wire
A wire of length $x$ takes solution $(v_1, c_1, w_1, q_1)$ to $(v_2, c_2, w_2, q_2)$:
$c_2 = c_1 + cx$
$q_2 = q_1 - (rcx^2/2 + rxc_1)$
$r$: wire resistance per unit length
$c$: wire capacitance per unit length

18
Solution Propagation: Insert Buffer
A buffer $b$ takes solution $(v_1, c_1, w_1, q_1)$ to $(v_1, c_{1b}, w_{1b}, q_{1b})$:
$q_{1b} = q_1 - d(b)$
$c_{1b} = C(b)$
$w_{1b} = w_1 + w(b)$
$d(b)$: buffer delay

19
Solution Propagation: Merge
Solutions $(v, c_l, w_l, q_l)$ and $(v, c_r, w_r, q_r)$ merge into:
$c_{merge} = c_l + c_r$
$w_{merge} = w_l + w_r$
$q_{merge} = \min(q_l, q_r)$

20
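Example of Solution Propagation
Solutions are written $(v, C, Q, W)$.
Parameters: $r = 1$, $c = 1$, $R_b = 1$, $C_b = 1$, $t_b = 1$, $R_d = 1$. Each wire segment has length 2.
Start at the sink: $(v_1, 1, 20, 0)$
Add wire: $(v_2, 3, 16, 0)$
Insert buffer: $(v_2, 1, 12, 1)$
Add wire: $(v_3, 5, 8, 0)$ and $(v_3, 3, 8, 1)$
Add driver: slack = 3 for $(v_3, 5, 8, 0)$ and slack = 5 for $(v_3, 3, 8, 1)$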
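The three propagation operations and the example above can be reproduced in a few lines. This is a minimal sketch, not the authors' implementation: the names `Sol`, `add_wire`, `insert_buffer`, `merge` and `slack_at_driver` are hypothetical, the buffer delay is assumed to be $d(b) = t_b + R_b \cdot c_1$, the buffer cost is assumed to be 1, and each wire segment is assumed to have length 2 (these choices match the numbers in the example).

```python
from typing import NamedTuple


class Sol(NamedTuple):
    c: float  # downstream capacitance C
    q: float  # required arrival time Q
    w: float  # cumulative buffer cost W


def add_wire(s, x, r=1.0, c=1.0):
    """Propagate a solution across a wire of length x (Elmore delay)."""
    return Sol(s.c + c * x, s.q - (r * c * x**2 / 2 + r * x * s.c), s.w)


def insert_buffer(s, r_b=1.0, c_b=1.0, t_b=1.0, cost=1.0):
    """Insert a buffer; assumes d(b) = t_b + r_b * (downstream capacitance)."""
    return Sol(c_b, s.q - (t_b + r_b * s.c), s.w + cost)


def merge(left, right):
    """Merge two branch solutions at the same node."""
    return Sol(left.c + right.c, min(left.q, right.q), left.w + right.w)


def slack_at_driver(s, r_d=1.0):
    """Slack after the driver with resistance r_d charges the load."""
    return s.q - r_d * s.c


s1 = Sol(1, 20, 0)        # sink
s2 = add_wire(s1, 2)      # unbuffered at v2
s2b = insert_buffer(s2)   # buffered at v2
s3 = add_wire(s2, 2)      # unbuffered at v3
s3b = add_wire(s2b, 2)    # buffered at v3
for s in (s2, s2b, s3, s3b):
    print(tuple(s), "slack at driver:", slack_at_driver(s))
```

Both solutions at $v_3$ have the same $Q$, but the buffered one presents a smaller load and therefore ends with the larger slack, at the price of one buffer.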

21
Solution Propagation
[Figure: propagation steps (1), (2), (3).]

22
Exponential Runtime
2 solutions, 4 solutions, 8 solutions, 16 solutions
$n$ candidate buffer locations lead to $2^n$ solutions.

23
Too Many Solutions
Solution pruning is needed for acceleration.
Two candidate solutions:
– $(v, c_1, q_1, w_1)$
– $(v, c_2, q_2, w_2)$
Solution 1 is inferior to Solution 2 if
– $c_1 \ge c_2$: larger load
– and $q_1 \le q_2$: tighter timing
– and $w_1 \ge w_2$: larger cost

24
Car Race - Speed
[Figure: car race analogy. Car speed corresponds to RAT.]

25
Car Race - Load
[Figure: car race analogy. The load carried corresponds to load capacitance.]

26
Faster & Smaller Load
Faster & smaller load (larger RAT, smaller capacitance): good
Slower & larger load (smaller RAT, larger capacitance): inferior

27
Faster & Larger Load: Result 1

28
Faster & Larger Load: Result 2
Who will be the winner? We cannot tell at this moment, so keep both of them.

29
Pruning
$(Q_1, C_1, W_1)$ is inferior to (dominated by) $(Q_2, C_2, W_2)$ if $C_1 \ge C_2$, $W_1 \ge W_2$ and $Q_1 \le Q_2$.
Non-dominated solutions are maintained: for the same $Q$ and $W$, pick the minimum $C$.
The number of solutions depends on the number of distinct $W$ and $Q$ values, not on the values themselves.
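The dominance rule translates directly into a filter. The sketch below is a simple quadratic-time illustration, not the pruning data structure used in practice; the tuple order `(c, q, w)` and the function names are assumptions made for the example.

```python
def dominates(a, b):
    """True if a = (c, q, w) is at least as good as b in every field."""
    return a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]


def prune(solutions):
    """Keep only non-dominated (c, q, w) tuples; O(k^2) sketch."""
    unique = list(dict.fromkeys(solutions))  # drop exact duplicates, keep order
    return [s for s in unique
            if not any(dominates(t, s) for t in unique if t != s)]


candidates = [(5, 8, 0), (3, 8, 1), (6, 7, 1), (3, 8, 1)]
kept = prune(candidates)
print(kept)
```

Here `(6, 7, 1)` is dropped because `(5, 8, 0)` has a smaller load, a larger required arrival time and a smaller cost. The other two survive because neither is better in all three fields.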

30
Generating Candidates
[Figure: candidate generation, steps (1), (2), (3).]

31
Pruning Candidates
[Figure: step (3) with candidates (a) and (b), followed by step (4).]
Both (a) and (b) look the same to the source. Remove the one with the worse slack and cost.

32
Candidate Example Continued
[Figure: steps (4) and (5).]

33
Candidate Example Continued
[Figure: step (5) after pruning.]
At the driver, compute the candidate solution that satisfies the timing target with minimum cost. The result is optimal.

34
Branch Merge
[Figure: left candidates and right candidates meeting at a branch point.]

35
Pruning During Branch Merge
With pruning, on the order of $n_1 n_2$ solutions remain after each branch merge.
In the worst case there are on the order of $(n/m)^m$ solutions.

36
Selected Milestone Works on Timing Buffering
Timeline: 1990, 1991, ..., 1996, ..., 2003, 2004, ..., 2008, 2009
Milestones, in order: van Ginneken’s algorithm, Lillis’ algorithm, Shi and Li’s algorithm, NP-hardness proof
Is it possible to design a provably good algorithm that runs in polynomial time with a theoretical guarantee on the error relative to the optimal solution? This has been a major open problem for a decade.

37
Bridging The Gap
We are bridging the gap with a Fully Polynomial Time Approximation Scheme (FPTAS):
Provably good
Computes a solution with cost at most $(1+\epsilon)$ times the optimal cost for any $\epsilon > 0$
Runs in time polynomial in $n$ (nodes), $b$ (buffer types) and $1/\epsilon$
Best solution for an NP-hard problem in theory
Highly practical

38
The Rough Picture
$W^*$: the cost of the optimal solution
Make a guess on $W^*$ (Key 2: smart guess).
Check it (Key 1: efficient checking).
If the guess is good (close to $W^*$), return the solution. If it is not good, guess again.

39
Key 1: Efficient Checking
Benefit of the guess: only maintain the solutions with cost no greater than the guessed cost.
This is the first reason for acceleration.

40
The Oracle
Oracle($x$): the checker, able to decide whether $x > W^*$ or not
– without knowing $W^*$
– answering efficiently

41
Construction of Oracle($x$)
Only interested in whether there is a solution with cost up to $x$ satisfying the timing constraint.
Dynamic programming: perform DP on the scaled problem with cost upper bound $n/\epsilon$.
Time is polynomial in $n/\epsilon$.

42
Scaling and Rounding
[Figure: buffer cost axis with marks at $0$, $x\epsilon/n$, $2x\epsilon/n$, $3x\epsilon/n$, $4x\epsilon/n$.]

43
Scaling and Rounding
[Figure: the same axis after scaling, with buffer costs $0, 1, 2, 3, 4$.]
The number of distinct buffer costs is at most $O(n/\epsilon)$, since only solutions with $W$ bounded by $n/\epsilon$ are propagated.
Rounding error at each buffer is at most $x\epsilon/n$; total rounding error is at most $x\epsilon$.
Larger $x\epsilon/n$: larger error, fewer distinct costs, faster.
Smaller $x\epsilon/n$: smaller error, more distinct costs, slower.
Rounding is the second reason for acceleration.
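The effect of scaling and rounding on the costs can be seen with a few numbers. This is a minimal sketch that assumes costs are rounded down to a multiple of $x\epsilon/n$ (the slides do not state the rounding direction here); the function name and the sample costs are made up for illustration.

```python
import math


def scale_and_round(costs, x, eps, n):
    """Round each cost down to a multiple of x*eps/n; return the integer multiples."""
    unit = x * eps / n
    return [math.floor(w / unit) for w in costs]


# Illustrative values (assumptions): guess x, accuracy eps, n buffer positions.
x, eps, n = 100.0, 0.5, 4
costs = [30.0, 45.0, 20.0, 10.0]

unit = x * eps / n                      # rounding step, 12.5 here
scaled = scale_and_round(costs, x, eps, n)
total_error = sum(costs) - sum(scaled) * unit
print(scaled, total_error)
```

Each cost loses less than one rounding step, so a solution with at most $n$ buffers loses at most $x\epsilon$ in total, and the scaled costs are small integers that the DP can index directly.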

44
Oracle Construction
Run dynamic programming with cost bound $n/\epsilon$.
Yes, there is a solution satisfying the timing constraint: with cost rounded and scaled back, the solution has cost at most $(n/\epsilon)(x\epsilon/n) + x\epsilon = (1+\epsilon)x$, so $(1+\epsilon)x > W^*$.
No, there is no such solution: with cost rounded and scaled back, any solution has cost at least $(n/\epsilon)(x\epsilon/n) = x$, so $x \le W^*$.

45
Rounding on Q
The number of solutions is bounded by the number of distinct $W$ and $Q$ values.
Number of distinct $W$ values: $O(n/\epsilon_1)$, where $\epsilon_1$ is used for $W$
– Rounding before DP
Number of distinct $Q$ values:
– Round up $Q$ to the nearest value in $\{0, \epsilon_2 T/m, 2\epsilon_2 T/m, 3\epsilon_2 T/m, \ldots, T\}$ in branch merge ($m$ is the number of sinks)
– Rounding during DP
– $O(m/\epsilon_2)$ distinct values, where $\epsilon_2$ is used for $Q$
– Rounding error is bounded by $\epsilon_2 T/m$ per branch merge, and by $\epsilon_2 T$ for the whole tree
The number of non-dominated solutions is $O(mn/(\epsilon_1 \epsilon_2))$.
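Rounding $Q$ up to the grid is a one-line operation. The sketch below is an illustration only: `round_up_q` is a hypothetical helper, and the values of $T$, $m$ and $\epsilon_2$ are chosen so that the grid step is an exact floating-point number.

```python
import math


def round_up_q(q, T, m, eps2):
    """Round q up to the nearest multiple of eps2*T/m, capped at T."""
    step = eps2 * T / m
    return min(T, math.ceil(q / step) * step)


# Illustrative values (assumptions): timing target T, m sinks, accuracy eps2.
T, m, eps2 = 100.0, 4, 0.5
step = eps2 * T / m                 # 12.5
grid_size = int(m / eps2) + 1       # number of grid values from 0 to T
rounded = round_up_q(30.0, T, m, eps2)
print(step, grid_size, rounded)
```

A value already on the grid is left unchanged, and any other value moves up by less than one step. That is why the error is at most $\epsilon_2 T/m$ per branch merge while only $O(m/\epsilon_2)$ distinct $Q$ values can occur.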

46
Q-W Rounding Before Branch Merge
[Figure: grid of candidate solutions with $W$ on one axis ($0, 1, 2, 3, 4, \ldots, n/\epsilon_1$) and $Q$ on the other ($\epsilon_2 T/m, 2\epsilon_2 T/m, 3\epsilon_2 T/m, 4\epsilon_2 T/m, \ldots, T$).]

47
Buffer Insertion Runtime

48
Branch Merge Runtime - 1
Target $Q = 0$
When merging $W_l = 2$ with $W_r = 1$, we previously needed to try a quadratic number of combinations; now only a linear number is needed.

49
Branch Merge Runtime - 2
Target $Q = \epsilon_2 T/m$

50
Branch Merge Runtime - 3
Target $Q = 2\epsilon_2 T/m$

51
Branch Merge Runtime - 4

52
Timing-Cost Approximate DP
Lemma: a buffering solution with cost at most $(1+\epsilon_1)W^*$ and with timing at most $(1+\epsilon_2)T$ can be computed in time

53
Key 2: Geometric Sequence Based Guess
$U$ ($L$): upper (lower) bound on $W^*$
Naive binary search style approach:
– Set $U$ and $L$ on $W^*$.
– Query Oracle($x$) with $x = (U+L)/2$.
– If $W^* < (1+\epsilon)x$, set $U = (1+\epsilon)x$.
– If $W^* \ge x$, set $L = x$.
Runtime (number of iterations) depends on the initial bounds $U$ and $L$.
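The dependence of the naive search on the initial bounds can be seen in a small simulation. This is a sketch under stated assumptions, not the authors' procedure: the oracle here is a stand-in that secretly knows $W^*$ (a real oracle would run the scaled DP), the loop stops once $U/L < 2$, and all names and numbers are made up for illustration.

```python
def make_oracle(w_star, eps):
    """Stand-in oracle that secretly knows W*; a real one would run the scaled DP."""
    def oracle(x):
        # True means "W* < (1+eps)x"; False then implies "W* >= x".
        return w_star < (1 + eps) * x
    return oracle


def naive_search(oracle, lower, upper, eps, max_iters=1000):
    """Binary-search style narrowing of [lower, upper] until upper/lower < 2."""
    iters = 0
    while upper / lower >= 2 and iters < max_iters:
        x = (upper + lower) / 2
        if oracle(x):
            upper = min(upper, (1 + eps) * x)
        else:
            lower = x
        iters += 1
    return lower, upper, iters


w_star, eps = 37.0, 0.1
oracle = make_oracle(w_star, eps)
lo1, hi1, it1 = naive_search(oracle, 1.0, 1e3, eps)
lo2, hi2, it2 = naive_search(oracle, 1.0, 1e6, eps)
print(it1, it2)
```

The looser initial upper bound needs more oracle calls, and each call costs a full DP run at accuracy $\epsilon$. The geometric sequence based guess is motivated by avoiding exactly this cost.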

54
Adapt $\epsilon_1$
Rounding factor for $W$: $x\epsilon_1/n$
Larger $\epsilon_1$: faster, with rough estimation
Smaller $\epsilon_1$: slower, with accurate estimation
Adapt $\epsilon_1$ according to $U$ and $L$.

55
U/L Related Scale and Round
[Figure: buffer cost axis starting at 0, with rounding step $x\epsilon/n$ tied to $U/L$.]

56
Conceptually
Begin with a large $\epsilon_1$ and progressively reduce it (toward $\epsilon$) according to $U/L$ as $x$ approaches $W^*$.
Fix $\epsilon_2 = \epsilon$ in rounding $T$ to limit the timing violation.
Set $\epsilon_1$ as a geometric sequence: $\ldots, 8, 4, 2, 1, 1/2, \ldots, \epsilon$.
Suppose that one run of DP takes $O(n/\epsilon_1)$ time. The total runtime is then dominated by the last run: $O(\ldots + n/8 + n/4 + n/2 + \ldots + n/\epsilon) = O(n/\epsilon)$.
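The runtime bound follows from a geometric series. As a sketch, assume the sequence halves exactly, so that the values of $\epsilon_1$ are $2^k \epsilon$ for $k = K, K-1, \ldots, 0$, and that one DP run costs exactly $n/\epsilon_1$ (the index $k$ and the bound $K$ are notation introduced here).

```latex
\begin{align*}
\sum_{k=0}^{K} \frac{n}{2^{k}\epsilon}
  = \frac{n}{\epsilon} \sum_{k=0}^{K} 2^{-k}
  < \frac{n}{\epsilon} \sum_{k=0}^{\infty} 2^{-k}
  = \frac{2n}{\epsilon}
  = O\!\left(\frac{n}{\epsilon}\right).
\end{align*}
```

All the earlier, coarser runs together cost less than the final run at accuracy $\epsilon$.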

57
Oracle Query Till U/L<2

58
Mathematically

59
When U/L<2
Scale and round each cost by $L\epsilon/n$.
Run DP with cost bound $W = 2n/\epsilon$.
At least one feasible solution is found; otherwise there would be no solution with cost $(2n/\epsilon)(L\epsilon/n) = 2L > U$.
Rounding error is $L\epsilon/n$ per buffer and at most $L\epsilon$ in a solution.
Only a single DP run is needed.
Pick the minimum cost solution satisfying timing at the driver.

60
The Algorithmic Flow
1. Set $U$ and $L$ of $W^*$.
2. While $U/L \ge 2$: adapt $\epsilon_1 = [U/L - 1]^{1/2}$, set $x = [UL/(1+\epsilon_1)]^{1/2}$, query Oracle($x$), and update $U$ or $L$.
3. Once $U/L < 2$, compute the final solution.

61
Main Theorem
Theorem: a $(1+\epsilon)$ approximation to the timing constrained minimum cost buffering problem can be computed in $O(m^2 n^2 b/\epsilon^3 + n^3 b^2/\epsilon)$ time for $0 < \epsilon < 1$ and in $O(m^2 n^2 b/\epsilon + mn^2 b + n^3 b)$ time for $\epsilon \ge 1$.

62
Experiments
Experimental setup:
– 1000 industrial nets
– 48 industrial buffer types, including non-inverting and inverting buffers
Compared to dynamic programming, which is the state-of-the-art technique and is widely used in industry.

63
Cost Ratio Compared to DP
[Figure: buffer cost ratio.]

64
Speedup Compared to DP
[Figure: speedup.]

65
Observations
FPTAS always achieves the theoretical guarantee.
Larger $\epsilon$ leads to more speedup.
On average about 5x faster than dynamic programming.
Can run 4.6x faster with 0.57% solution degradation.
Fewer than 5% of nets have timing violations, which can be fixed by a simple timing recovery procedure.

66
Our Bridge
[Figure: a bridge between NP-hardness (complexity) and the exponential time algorithm.]

67
Conclusion
We propose a $(1+\epsilon)$ approximation for timing constrained minimum cost buffering for any $\epsilon > 0$ (DAC’09):
– Runs in $O(m^2 n^2 b/\epsilon^3 + n^3 b^2/\epsilon)$ time
– Timing-cost approximate dynamic programming
– Double-$\epsilon$ geometric sequence based oracle search
– 5x speedup in experiments
– A few percent additional buffers, as guaranteed theoretically
This is the first provably good approximation algorithm for this problem, which has been a major open problem in the field.

68
Thanks


© 2017 SlidePlayer.com Inc.

All rights reserved.
