Improved Flop Tray-Based Design Implementation for Power Reduction


1 Improved Flop Tray-Based Design Implementation for Power Reduction
Andrew B. Kahng, Jiajia Li and Lutong Wang UC San Diego VLSI CAD Laboratory

2 Outline
- Background and Motivation
- Related Work
- Our Methodology
- Experimental Setup and Results
- Conclusion

3 Flop Tray Benefits (1)
Flop tray = multi-bit flip-flop (MBFF)
Application of flop trays significantly reduces #sinks
Motivating “thought experiments”
- Replacing all single-bit flops in a clock tree (N sinks) with 64-bit flop trays can reduce #clock buffers by (N - N/64)/(N - 1) ≈ 98.4%
- In a clock tree with N = 100K, F = 8, replacing all single-bit flops with 64-bit flop trays can reduce #levels from 6 to 4
→ Fewer clock buffers, smaller clock power
[Figure: two clock trees. Single-bit flops: N sinks, log_F(N) levels, each buffer has F fanouts, #Buffers ≈ (N - 1)/(F - 1). K-bit flop trays: N/K sinks, log_F(N/K) levels, each buffer has F fanouts, #Buffers ≈ (N/K - 1)/(F - 1)]
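The two thought experiments can be checked numerically. The sketch below assumes an idealized clock tree in which every buffer drives exactly F fanouts, as in the slide; the function names are illustrative and not taken from the presentation.

```python
import math

def levels(n_sinks, fanout):
    """Smallest number of levels L such that fanout**L >= n_sinks."""
    lv, reach = 0, 1
    while reach < n_sinks:
        reach *= fanout
        lv += 1
    return lv

def approx_buffers(n_sinks, fanout):
    """Slide approximation: #Buffers ~ (N - 1) / (F - 1)."""
    return (n_sinks - 1) / (fanout - 1)

def buffer_reduction(n_flops, tray_bits, fanout):
    """Fraction of clock buffers saved when K-bit trays replace single-bit flops."""
    before = approx_buffers(n_flops, fanout)
    after = approx_buffers(n_flops / tray_bits, fanout)
    return 1.0 - after / before

N, K, F = 100_000, 64, 8
print(levels(N, F), levels(math.ceil(N / K), F))    # levels before and after
print(round(100 * buffer_reduction(N, K, F), 1))    # percent of buffers saved
```

The fanout F cancels in the buffer ratio, which is why the slide's expression (N - N/64)/(N - 1) does not depend on F.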

4 Flop Tray Benefits (2)
Inverters for clock signals are shared within a flop tray
→ Power and area reductions
A recent work (Lin et al. TCAD 2015) achieves 22% flop power reduction by using 2-bit and 4-bit flop trays
[Figure: single-bit flop vs. 2-bit flop tray, each showing master latch, slave latch and clk]

5 Challenges of Flop Tray Generation
Flops occupy large portion of block area
- In VGA, 30% of instances are flops → 51% of block area
Flop trays can have high aspect ratio and distinct size
- 4-bit flop tray = 1 row x 63 sites
- 64-bit flop tray = 4 rows x 244 sites
Clustering of flops imposes additional placement constraints
- Small clusters do not fully exploit flop tray benefits
- Large clusters may sacrifice datapath wirelength / power
[Figure: power overhead on datapaths (flop tray w/ logical clustering vs. single-bit flop)]


7 Related Works
Early-stage flop tray generation
- [Chen10] enables flop tray generation during synthesis
- [Hou09] splits flop trays to mitigate routing congestion
- But these are not aware of physical layout
Flop tray generation during/after placement
- [Lin11] clusters flops by finding K-cliques in a merging graph
- [Jiang12] generates flop trays using interval graphs
- [Tsai13] guides placement of flops with bonding force
- Hard to define feasible displacement region
- But these ignore the shape (AR) of flop trays and timing paths
Our work: flop tray generation considering flop displacement, timing paths and flop tray shapes


9 Overall Optimization Flow
In blue are our optimizations
Initial placement w/ single-bit flops == “optimal” placement
Objectives
- Minimize displacement of flops
- Minimize timing impact
- Minimize #flop trays
Two-step optimization
- Capacitated K-means clustering (in dotted red boxes)
- ILP-based selection of flop trays

10 Example of Overall Flow
[Figure panels: 4-bit only solution; 16-bit only solution; 64-bit only solution; ILP solution]
Design: AES
Technology: 28FDSOI

11 Capacitated K-Means Clustering
Given N points (flops) and a capacity K (flop tray size), obtain N/K clusters.
Selection of starting points
  I. Randomly select one flop among single-bit flops
  II. For each flop h, calculate the total Manhattan distance d from h to all selected flops
  III. Randomly select one new flop with probability proportional to d
  IV. Repeat Steps II and III until M flops are selected
Min-cost flow-based clustering
Update of cluster centers
  Minimize $\sum_k d_k$
  such that $|x_i + x'_{ij} - x_k| + |y_i + y'_{ij} - y_k| = d_k$
  flop location: $(x_k, y_k)$; flop tray location: $(x_i, y_i)$; relative slot location: $(x'_{ij}, y'_{ij})$
[Flowchart: Initial center, Clustering, Cluster center update, Solution]
By considering distances between flops and slots, we are aware of flop tray ARs
Notation
  $h_k$: kth flop (point)
  $t_i$: ith flop tray (cluster)
  $f_{ij}$: jth slot on ith flop tray
  $d_{k,ij}$: Manhattan distance between $h_k$ and $f_{ij}$
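A minimal Python sketch of two of the steps above, the distance-weighted seeding and the cluster center update, follows. It rests on some assumptions: the min-cost flow assignment of flops to slots is omitted and taken as given, the seeding reads "probability d" as probability proportional to d, and the function names are illustrative. The center update uses a standard fact rather than anything stated in the slides: for a fixed flop-to-slot assignment the objective separates in x and y, and a median of (x_k - x'_ij) minimizes the sum of absolute deviations.

```python
import random
from statistics import median

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def select_seeds(flops, m, rng):
    """Steps I-IV: pick m starting flops, favouring flops far from those already picked."""
    chosen = [rng.randrange(len(flops))]                      # Step I
    while len(chosen) < m:                                    # Step IV
        rest = [i for i in range(len(flops)) if i not in chosen]
        # Step II: total Manhattan distance to all selected flops
        weights = [sum(manhattan(flops[i], flops[c]) for c in chosen) for i in rest]
        if sum(weights) == 0:                                 # all remaining flops coincide with seeds
            chosen.append(rng.choice(rest))
        else:                                                 # Step III: probability proportional to d
            chosen.append(rng.choices(rest, weights=weights, k=1)[0])
    return [flops[i] for i in chosen]

def update_tray_location(flops, slot_offsets):
    """Tray location (x_i, y_i) minimising sum_k d_k, where flops[k] sits in slot_offsets[k]."""
    xs = [fx - ox for (fx, _), (ox, _) in zip(flops, slot_offsets)]
    ys = [fy - oy for (_, fy), (_, oy) in zip(flops, slot_offsets)]
    return median(xs), median(ys)

flops = [(0, 0), (3, 0), (10, 1)]
slots = [(0, 0), (1, 0), (2, 0)]          # hypothetical 1 x 3 tray, slot offsets in the same units
print(select_seeds(flops, 2, random.Random(0)))
print(update_tray_location(flops, slots))
```

Because the slot offsets enter the distance, a wide tray pulls its location toward a position where the flops line up with its slots, which is how the tray aspect ratio influences the result.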

12 Example on AES
Circles: initial flop locations
Red dots: flop tray locations

13 Awareness of Flop Tray Shapes
Our clustering solution more closely matches the AR of flop trays
→ Smaller displacements
[Figure: layout without awareness of flop tray AR, avg. displacement = 15μm; layout with awareness of flop tray AR, avg. displacement = 5μm]
Design: AES
Technology: 28FDSOI

14 ILP-Based Selection of Flop Tray Solutions
Formulate an ILP to select flop tray solutions with various flop tray sizes to minimize displacement, timing impact and flop tray cost

Minimize $\alpha \cdot W + D + \beta \cdot Z$
Such that
// flop displacements
$\left|\sum_{ij} (x_i + x'_{ij} - x_k) \cdot b_{k,ij}\right| + \left|\sum_{ij} (y_i + y'_{ij} - y_k) \cdot b_{k,ij}\right| = d_k$
$\sum_k d_k = D$
// relative displacements between timing-critical flop pairs
$\left|\sum_{ij} (x_i + x'_{ij} - x_k) \cdot b_{k,ij} - \sum_{i'j'} (x_{i'} + x'_{i'j'} - x_{k'}) \cdot b_{k',i'j'}\right| + \left|\sum_{ij} (y_i + y'_{ij} - y_k) \cdot b_{k,ij} - \sum_{i'j'} (y_{i'} + y'_{i'j'} - y_{k'}) \cdot b_{k',i'j'}\right| = z_{kk'}$
$\sum_{kk'} z_{kk'} = Z$
// cost of flop trays
$b_{k,ij} \le e_i$; $e_i \le \sum_{kj} b_{k,ij}$
$\sum_i (w_i \cdot e_i) = W$
// each flop has exactly one slot to match & each slot can have at most one flop to match
$\sum_{ij} b_{k,ij} = 1$; $\sum_k b_{k,ij} \le 1$

Notations
$D$: total displacement
$Z$: total relative displacement of timing-critical flop pairs
$W$: total cost of flop trays
$\alpha, \beta$: weighting parameters
$(x_i, y_i)$: location of ith flop tray
$(x'_{ij}, y'_{ij})$: relative location of jth slot on ith flop tray
$(x_k, y_k)$: location of kth flop
$b_{k,ij}$: binary indicator whether kth flop is assigned to jth slot on ith flop tray
$e_i$: binary indicator whether ith flop tray is selected
$w_i$: cost of ith flop tray
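To make the objective concrete, the sketch below evaluates α ∙ W + D + β ∙ Z for one candidate assignment of flops to slots. It is only an evaluator under stated assumptions, not a solver and not the authors' implementation: the `Tray` class, the `assign` dictionary (flop index k mapped to a pair (i, j)) and all numbers in the example are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tray:
    x: float        # tray location x_i
    y: float        # tray location y_i
    slots: list     # relative slot locations (x'_ij, y'_ij)
    cost: float     # tray cost w_i

def shift(trays, flops, assign, k):
    """Displacement vector of flop k when moved into its assigned slot (i, j)."""
    i, j = assign[k]
    t = trays[i]
    return (t.x + t.slots[j][0] - flops[k][0], t.y + t.slots[j][1] - flops[k][1])

def objective(trays, flops, assign, critical_pairs, alpha, beta):
    # Every flop gets exactly one slot; every slot holds at most one flop.
    assert set(assign) == set(range(len(flops))), "each flop needs one slot"
    assert len(set(assign.values())) == len(assign), "slot used twice"
    moves = {k: shift(trays, flops, assign, k) for k in assign}
    D = sum(abs(dx) + abs(dy) for dx, dy in moves.values())            # sum of d_k
    Z = sum(abs(moves[a][0] - moves[b][0]) + abs(moves[a][1] - moves[b][1])
            for a, b in critical_pairs)                                # sum of z_kk'
    used = {i for i, _ in assign.values()}                             # trays with e_i = 1
    W = sum(trays[i].cost for i in used)
    return alpha * W + D + beta * Z

trays = [Tray(x=0, y=0, slots=[(0, 0), (1, 0)], cost=1.0)]
flops = [(0, 1), (2, 0)]
assign = {0: (0, 0), 1: (0, 1)}
print(objective(trays, flops, assign, critical_pairs=[(0, 1)], alpha=2.0, beta=0.5))
```

In the example both flops move by one unit (D = 2), they move in different directions so the pair's relative displacement is 2 (Z = 2), and one tray of cost 1 is used (W = 1).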

15 Impact of α Value
Choice of α determines a tradeoff between clock power reduction versus datapath power penalty
- Small value of α → Small-size flop trays, small displacement
- Large value of α → Large-size flop trays, large displacement

16 Minimization of Relative Placement
Relative displacement between timing-critical start-end flop pairs degrades timing
- Move apart → wire↑ → delay↑
- Move closer → routing/placement congestion
Minimization of relative displacement reduces power penalty by 5%
[Figure: logic cone between a start-end flop pair; move closer → placement/routing congestion; move apart → longer wire; 5% annotation]


18 Experimental Setup
Designs: AES, JPEG, MPEG, VGA (from OpenCores website)
Technology: 28nm FDSOI, dual-VT
Tools
- Synthesis: Synopsys Design Compiler vH SP3
- P&R: Cadence Innovus Implementation System v15.2
- Power/timing analysis: Cadence Innovus Implementation System v15.2
Candidate flop trays
Tray size               4-bit    8-bit    16-bit   32-bit    64-bit
AR (#rows x #columns)   1 x 4    2 x 4    4 x 4    4 x 8     4 x 16
AR (#rows x #sites)     1 x 63   2 x 62   4 x 62   4 x 122   4 x 244
Norm. area/power per bit: 0.875 0.854 0.844

19 Power Benefits
Reference flows
- ref_1b: conventional implementation flow with single-bit flops
- ref_mb: flop tray-based implementation with logical clustering (flop tray generation during synthesis with commercial tools)
Up to 98% sink number reduction and 90% clock power reduction compared to ref_1b
Up to 16% more total power reduction and 40% more clock power reduction compared to ref_mb

Design   Flow     Clock power (mW)   Total power (mW)   #Sinks
AES      ref_1b   1.53               14.02              530
         ref_mb   0.72               13.35              227
         opt_mb   0.46               12.56              128
JPEG     ref_1b   13.37              84.54              4512
         ref_mb   6.1                76.2               1665
         opt_mb   2.28               69.24              515
MPEG     ref_1b   10.72              45.53              3181
         ref_mb   5.19               38.7               1316
         opt_mb   0.98               31.76              181
VGA      ref_1b   42.19              164.84             17053
         ref_mb   20.73              138.99             7665
         opt_mb   2.04               111.32             308
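The per-design reductions behind these claims can be recomputed from the table. The sketch below copies the table values and assumes that the unlabeled rows for JPEG, MPEG and VGA follow the same ref_1b, ref_mb, opt_mb order as the AES rows; the helper names are illustrative.

```python
# (clock power mW, total power mW, #sinks) per flow, copied from the table
RESULTS = {
    "AES":  {"ref_1b": (1.53, 14.02, 530),     "ref_mb": (0.72, 13.35, 227),
             "opt_mb": (0.46, 12.56, 128)},
    "JPEG": {"ref_1b": (13.37, 84.54, 4512),   "ref_mb": (6.1, 76.2, 1665),
             "opt_mb": (2.28, 69.24, 515)},
    "MPEG": {"ref_1b": (10.72, 45.53, 3181),   "ref_mb": (5.19, 38.7, 1316),
             "opt_mb": (0.98, 31.76, 181)},
    "VGA":  {"ref_1b": (42.19, 164.84, 17053), "ref_mb": (20.73, 138.99, 7665),
             "opt_mb": (2.04, 111.32, 308)},
}
METRICS = ("clock_power", "total_power", "sinks")

def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before

def reductions_vs(design, baseline, flow="opt_mb"):
    """Reduction of each metric for `flow` relative to `baseline` on one design."""
    base, new = RESULTS[design][baseline], RESULTS[design][flow]
    return {m: reduction_pct(b, n) for m, b, n in zip(METRICS, base, new)}

for design in RESULTS:
    r = reductions_vs(design, "ref_1b")
    print(design, {m: round(v, 1) for m, v in r.items()})
```

Calling `reductions_vs(design, "ref_mb")` gives the same comparison against the logical-clustering flow.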

20 Layout Examples In red are flop trays and flops, in blue are combinational cells

21 Optimization with Various Flop Tray Sizes
Flop tray-based optimization with various combinations of flop tray size candidates
Optimization with large-size (i.e., > 16-bit) flop trays achieves 11% more clock power reduction on average, especially on large designs
Candidate sets
  I: 1 bit
  II: {1, 4} bit
  III: {1, 4, 8} bit
  IV: {1, 4, 8, 16} bit
  V: {1, 4, 8, 16, 32, 64} bit
[Figure: results for candidate sets I-V on AES, JPEG, MPEG, VGA]

22 Study of Useful Skew Optimization
Comparison of useful skew benefits (= datapath leakage power reductions) across various flows
- ref_1b: design with only single-bit flops
- opt_mb: flop tray-based design (w/o skew-aware clustering)
- opt_mb (skew aware): flop tray-based design (w/ skew-aware clustering)
Skew-aware clustering achieves similar useful skew benefits as ref_1b, but with 21% less sink number reduction

#sinks                AES   JPEG   MPEG   VGA
ref_1b                530   4512   3181   17053
opt_mb                128   515    181    308
opt_mb (skew aware)   392   1830   205    1245
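The sink counts can be turned into sink reductions with a few lines of Python. The sketch assumes the four values per flow are ordered AES, JPEG, MPEG, VGA (the ref_1b values match the sink counts on the Power Benefits slide), and it takes one possible reading of the 21% figure: the gap, in percentage points, between the average sink reductions of opt_mb and opt_mb (skew aware).

```python
# #sinks per design, in the order AES, JPEG, MPEG, VGA
SINKS = {
    "ref_1b":      [530, 4512, 3181, 17053],
    "opt_mb":      [128, 515, 181, 308],
    "opt_mb_skew": [392, 1830, 205, 1245],   # opt_mb (skew aware)
}

def avg_sink_reduction(flow):
    """Average percentage sink reduction of `flow` relative to ref_1b."""
    pcts = [100.0 * (base - s) / base
            for base, s in zip(SINKS["ref_1b"], SINKS[flow])]
    return sum(pcts) / len(pcts)

gap = avg_sink_reduction("opt_mb") - avg_sink_reduction("opt_mb_skew")
print(round(avg_sink_reduction("opt_mb"), 1),
      round(avg_sink_reduction("opt_mb_skew"), 1),
      round(gap, 1))
```

Under this reading the gap comes out close to the 21% quoted on the slide.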


24 Conclusion
- A novel flop tray-based optimization with capacitated K-means algorithm
- Up to 16% total block power reduction compared to logical clustering
- Useful skew optimization in the context of flop tray-based design
Ongoing / Future works
- Scalable optimization considering all flop tray sizes
- Floorplan blockage awareness

25 Thank you! UCSD ABKGroup is grateful to Qualcomm, Samsung, NXP, the IMPACT+/C-DEN centers, Mentor Graphics and the NSF for research support. We thank IMEC and Cadence for additional research enablements and collaborations.

