Page 1 Department of Electrical Engineering National Chung Cheng University, Chiayi, Taiwan Power Optimization for Clock Network with Clock Gate Cloning.

Presentation transcript:

page 1 Department of Electrical Engineering National Chung Cheng University, Chiayi, Taiwan Power Optimization for Clock Network with Clock Gate Cloning and Flip-Flop Merging Shih-Chuan Lo Chih-Cheng Hsu Mark Po-Hung Lin

page 2 Outline Introduction Preliminaries The Proposed Algorithms Experimental Results Conclusions

page 3 Outline Introduction  Low Power Design Methodologies  The Concept of Clock-Gating Cell  The Concept of Clock-Gate Cloning  The Concept of Flip-Flop Merging  Previous Work  Our Contributions Preliminaries The Proposed Algorithms Experimental Results Conclusions

page 4 Low Power Design Methodologies Clock gating cell (CG)  [Wu et al., TCAS'00], [Shen et al., TVLSI'10] Clock gate cloning  [Teng & Soin, ICSE'10], [Vishweshwara et al., ISQED'12] Multi-bit flip-flop (MBFF)  [Pokala et al., ASIC'92], [Kretchmer, EE Times Asia'01], [Chen et al., SNUG'10], [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12], [Shyu et al., TVLSI'13], [Tsai et al., ISPD'13] …

page 5 The Concept of Clock-Gating Cell A clock-gating cell can turn off the clocks at flip-flop inputs when they are not required.  In Fig.(a), the FFs will load new data at their input pins “D” only when the enable signal “EN” is active.  In Fig.(b), the CG can shut off “gclk” to the FFs when “Din” is not changed. Less clock network power and smaller chip area

page 6 The Concept of Clock-Gate Cloning A clock buffer chain may:  Lengthen delay  Degrade the circuit performance  Consume additional power After replicating sufficient CGs and connecting each CG to a smaller number of FFs:  The number of required clock buffers can be reduced.  Power consumption and path delay of the gated clock network can be minimized.

page 7 The Concept of Multi-Bit Flip-Flop Replacing 1-bit FFs with MBFFs can reduce total clock power by up to 30%.  [Jiang et al., TCAD'12] An MBFF contains several 1-bit FFs which share common inverters in the MBFF cell.  [Chen et al., SNUG'10] Replacing several 1-bit FFs with an MBFF will reduce  Inverters in FF cells  Clock sinks  Clock drivers

page 8 Previous Work of CG Cloning [Teng & Soin, ICSE'10]  Introduced a cutting-based algorithm to split a CG and redistribute the CG fanout according to the cut line.  The CG splitting algorithm is performed iteratively until the timing violation of each CG's enable signal is eliminated. [Vishweshwara et al., ISQED'12]  Proposed a clustering-based algorithm to recursively replicate a CG when the CG has a large fanout, or when the spreading area of its fanout is larger than a limit.

page 9 Previous Work of FF Merging [Kretchmer, EE Times Asia'01], [Chen et al., SNUG'10]  Demonstrated the feasibility of applying MBFFs during logic synthesis. [Pokala et al., ASIC'92]  Applied MBFFs before placement optimization. [Tsai et al., ISPD'13]  Applied MBFFs during placement optimization. [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12], [Shyu et al., TVLSI'13]  Performed power optimization with MBFFs at the post-placement stage for better timing budgeting.

page 10 Our Contributions We present the first problem formulation  For gated clock network optimization with simultaneous CG cloning and FF merging. We introduce a novel optimization flow consisting of  MBFF-aware CG cloning  CG-based FF merging  MBFF and CG placement optimization We formulate the MBFF-aware CG cloning optimization problem as a partitioning problem.  Our formulation maximizes the skew slack corresponding to different CGs subject to bounded slack constraints. Our experimental results show that the proposed approach leads to lower dynamic power and shorter clock wirelength.

page 11 Outline Introduction Preliminaries  Power Model of Gated Clock Network  Inter-CG Clock Skew due to CG Cloning  Control-Path Timing Constraint for Gated Clock Network  Data-Path Timing Constraint for FF Merging  Placement Density Constraint for CGs and MBFFs  Problem Formulation The Proposed Algorithms Experimental Results Conclusions

page 12 Power Model of Gated Clock Network The power dissipated in the gated clock network can be modelled as follows.  [Shen et al., TVLSI'10] (Equation not preserved. Its labelled terms are: dynamic power consumption, clock period, supply voltage, wirelength, input capacitance, unit wire capacitance, and switching activity, taken over three components: the clock net, the gated clock tree, and the enable signal net.)
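
Since the slide's equation is not preserved, the following is only a minimal sketch of how the listed terms could combine. It assumes the standard switched-capacitance form (activity times capacitance, times supply voltage squared, divided by clock period) summed over the three components; the class name `NetGroup`, the function name, and the units are hypothetical, not taken from [Shen et al., TVLSI'10].

```python
from dataclasses import dataclass


@dataclass
class NetGroup:
    """One component of the gated clock network (illustrative fields only)."""
    switching_activity: float  # average toggles per clock cycle
    wirelength: float          # total wirelength, in um
    input_capacitance: float   # total input capacitance of driven pins, in fF


def dynamic_power(groups, supply_voltage, clock_period, unit_wire_cap):
    """ASSUMED model: sum(activity * capacitance) * V^2 / T.

    With capacitance in fF, voltage in V and period in ns, the result is in uW.
    """
    switched_cap = sum(
        g.switching_activity * (unit_wire_cap * g.wirelength + g.input_capacitance)
        for g in groups
    )
    return switched_cap * supply_voltage ** 2 / clock_period
```

Under this assumed form, shortening the clock net or gated clock tree wirelength, or lowering the activity of the gated clock tree, reduces the estimate directly, which matches the optimization levers discussed in the rest of the deck.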

page 13 Inter-CG Clock Skew due to CG Cloning When a CG is replicated in the gated clock network, the inter-CG clock skew among gated FFs, T_skew, can be calculated from the following terms (the equation itself is not preserved): the interconnection delay from the clock root to the CG g_i, the CG delay, and the interconnection delay from g_i to the farthest gated FF. To minimize T_skew, we shall balance the wirelength and flip-flop fanout numbers among all different CGs.
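
The skew equation is missing from this slide, so the sketch below is an assumption: it takes the arrival time through each CG as the sum of the three listed delay terms and defines the inter-CG skew as the spread (maximum minus minimum) of those arrival times. The function and parameter names are hypothetical.

```python
def inter_cg_skew(root_to_cg, cg_to_farthest_ff, cg_delay):
    """ASSUMED definition: spread of clock arrival times across all CG clones.

    root_to_cg[i]        -- interconnection delay from the clock root to g_i
    cg_to_farthest_ff[i] -- interconnection delay from g_i to its farthest gated FF
    cg_delay             -- delay of the CG cell (same cell assumed for all clones)
    """
    arrivals = [
        d_root + cg_delay + d_ff
        for d_root, d_ff in zip(root_to_cg, cg_to_farthest_ff)
    ]
    return max(arrivals) - min(arrivals)
```

With two clones whose root delays are 10 and 14 and whose fanout delays are 5 and 4, the arrivals differ by 3 time units; balancing wirelength and fanout between the clones shrinks that gap.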

page 14 Control-Path Timing Constraint for Gated Clock Network The figure shows the control-path timing of the gated clock network. (Figure not preserved; its labels mark the interconnection delay from the clock root to g_i and the CG delay.)

page 15 Data-Path Timing Constraint for FF Merging Only the FFs which have a common intersection of their timing-feasible regions can be merged.  [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12] The timing-feasible region of a flip-flop can be obtained from the available timing slack on the corresponding data paths.
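
As an illustration of the merging test, the sketch below assumes each timing-feasible region is a single Manhattan diamond (a center plus a slack-derived radius), which is a simplification of the regions used in the cited papers. Rotating coordinates to (x + y, x - y) turns each diamond into an axis-aligned square, so the common intersection reduces to interval arithmetic. All names are hypothetical.

```python
def to_rotated_box(x, y, radius):
    """Map the diamond |dx| + |dy| <= radius to a square in (u, v) = (x + y, x - y)."""
    u, v = x + y, x - y
    return (u - radius, u + radius, v - radius, v + radius)


def common_region(diamonds):
    """Return the common intersection in rotated coordinates, or None if empty.

    diamonds -- iterable of (x, y, radius) tuples, one per flip-flop (ASSUMED shape).
    """
    boxes = [to_rotated_box(*d) for d in diamonds]
    u_lo = max(b[0] for b in boxes)
    u_hi = min(b[1] for b in boxes)
    v_lo = max(b[2] for b in boxes)
    v_hi = min(b[3] for b in boxes)
    if u_lo > u_hi or v_lo > v_hi:
        return None  # no common point, so these FFs cannot be merged
    return (u_lo, u_hi, v_lo, v_hi)
```

A non-`None` result means the flip-flops are merge candidates, and any point of the returned box (mapped back through x = (u + v) / 2, y = (u - v) / 2) is a legal location for the merged cell under this simplified model.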

page 16 Placement Density Constraint for CGs and MBFFs We divide the chip area into a number of bins of equal size.  [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12] A CG or an MBFF can only be placed in a bin whose density is less than the maximum placement density.  This distributes logic cells evenly throughout the chip area in order to avoid routing congestion.
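
A minimal sketch of the bin check, under assumptions: bins are square, bin density is occupied cell area divided by bin area, and the check is applied to the density after the new cell is added (a slightly stricter reading than the slide's wording). The dictionary layout and function names are hypothetical.

```python
def bin_index(x, y, bin_size):
    """Grid coordinates of the equal-size bin containing point (x, y)."""
    return (int(x // bin_size), int(y // bin_size))


def can_place(bin_area_used, x, y, cell_area, bin_size, max_density):
    """True if adding the cell keeps its bin at or below max_density.

    bin_area_used -- dict mapping bin index -> cell area already placed there.
    """
    used = bin_area_used.get(bin_index(x, y, bin_size), 0.0)
    return (used + cell_area) / (bin_size * bin_size) <= max_density
```

For example, with 10-by-10 bins and a maximum density of 0.6, a bin that already holds 50 area units accepts a 5-unit cell but rejects a 20-unit one.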

page 17 Problem Formulation Input  A clock gating domain containing a set of FFs which are controlled by gated clock signals whose switching activities are the same.  A cell library containing both CG and MBFF cells. Objectives  Minimize P_d and T_skew of the clock-gating domain (P_d is the primary objective, while T_skew is the secondary one because T_skew can be further minimized after clock tree routing). Constraints  Control-path timing constraint  Data-path timing constraint  Placement density constraint

page 18 Outline Introduction Preliminaries The Proposed Algorithms  The Proposed Algorithms Flow  MBFF-aware CG Cloning  CG-based FF Merging  MBFF and CG Placement Optimization Experimental Results Conclusions

page 19 The Proposed Algorithms Flow Input: initial placement, cell library, design constraints. Steps: (1) MBFF-aware CG Cloning (2) CG-based FF Merging (3) MBFF & CG Placement Opt. Output: optimized placement containing newly generated CGs and MBFFs.

page 20 MBFF-aware CG Cloning The CG must be replicated and the fanout FFs are bisected when:  The control path violates the timing constraint.  The CG drives too many FFs, leading to larger clock power consumption.

page 21 Hypergraph Construction According to the timing-feasible region of each FF, we construct the hypergraph H(V,E).  v_i: the timing-feasible region of the FF f_i.  e_i: the intersection among the timing-feasible regions of different f_i.  w(e_i): the number of vertices connected by e_i. Example from the figure (not preserved): w(e_1) = 4, w(e_2) = 2, w(e_3) = 3.
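
The construction can be sketched as follows, under assumptions not stated on the slide: timing-feasible regions are axis-aligned rectangles, only overlaps with positive area count, and only maximal groups of mutually overlapping regions become hyperedges. The function name and data layout are hypothetical.

```python
def build_hypergraph(regions):
    """Return {hyperedge: weight} for rectangles given as (x_lo, y_lo, x_hi, y_hi).

    Each hyperedge is a frozenset of region indices sharing a common area;
    its weight is the number of vertices it connects, as on the slide.
    """
    xs = sorted({c for r in regions for c in (r[0], r[2])})
    ys = sorted({c for r in regions for c in (r[1], r[3])})
    candidates = set()
    # Every elementary grid cell is covered by a fixed set of rectangles.
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            cover = frozenset(
                i for i, (xl, yl, xh, yh) in enumerate(regions)
                if xl <= x0 and x1 <= xh and yl <= y0 and y1 <= yh
            )
            if len(cover) >= 2:
                candidates.add(cover)
    # ASSUMPTION: keep only maximal groups (drop strict subsets of another group).
    maximal = [e for e in candidates if not any(e < other for other in candidates)]
    return {e: len(e) for e in maximal}
```

Heavier hyperedges mark larger groups of FFs that could be merged into one MBFF, so a partition that avoids cutting them preserves more merging opportunities.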

page 22 Cut-line Determination with Inter-CG Skew Budgeting The cut direction is determined by the physical dimension of the FF bounding box. [Teng & Soin, ICSE'10]  A vertical (horizontal) cut is applied if the dimension in x-direction is larger (smaller) than that in y-direction. To balance the delay passing through different CGs, we sweep the cut line to search for the maximum skew slack.
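
The cut-direction rule can be sketched directly from the slide. The tie case (equal width and height) is not specified there, so the sketch arbitrarily picks a horizontal cut; the function names and the `(x, y)` tuple layout are hypothetical.

```python
def cut_direction(ff_positions):
    """Vertical cut if the FF bounding box is wider than tall, else horizontal."""
    xs = [x for x, _ in ff_positions]
    ys = [y for _, y in ff_positions]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    # ASSUMPTION: ties fall back to a horizontal cut.
    return "vertical" if width > height else "horizontal"


def bisect_ffs(ff_positions, direction, cut):
    """Split FFs into the two sides of a cut line at coordinate `cut`."""
    axis = 0 if direction == "vertical" else 1
    low = [p for p in ff_positions if p[axis] <= cut]
    high = [p for p in ff_positions if p[axis] > cut]
    return low, high
```

Sweeping `cut` across the bounding box and evaluating each resulting bisection is then a loop over candidate coordinates, with the skew slack as the score to maximize.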

page 23 Skew Slack (1/2) In Fig.(c) (Fig.(d)), the CGs are placed at the position closest to (farthest from) the clock root within the respective FF bounding boxes, resulting in the shortest (longest) clock signal delay from the clock root to the FFs.

page 24 Skew Slack (2/2) The skew slack can be calculated as the difference between the minimum longest and the maximum shortest clock signal delay over the CGs. To balance the delay passing through different CGs more easily, we look for a physical cut line which maximizes the skew slack.
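
The definition above translates into a one-line computation. In the sketch below, each CG contributes a shortest and a longest achievable clock delay (from its closest and farthest placements); how those delays are obtained for a given cut is left to a caller-supplied function, `delay_bounds`, which is a hypothetical callback rather than anything defined in the slides.

```python
def skew_slack(longest_delays, shortest_delays):
    """Minimum of the per-CG longest delays minus maximum of the shortest ones."""
    return min(longest_delays) - max(shortest_delays)


def best_cut(candidate_cuts, delay_bounds):
    """Pick the cut coordinate with the largest skew slack.

    delay_bounds(cut) -> (longest_delays, shortest_delays), one entry per CG.
    """
    return max(candidate_cuts, key=lambda cut: skew_slack(*delay_bounds(cut)))
```

A positive slack means there is a delay window that every CG can reach by moving inside its FF bounding box, so a wider window leaves more room for later skew balancing.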

page 25 MBFF-aware FF Swapping We perform the FM (Fiduccia-Mattheyses) algorithm on H(V,E) to move FFs between different FF sets such that the cut size is minimized.  Cut size: the sum of the weights of the edges crossing the cut line. Moves are subject to a balance condition: the skew slack after moving an FF to the other FF set must not be less than a bound scaled by a balance factor (the bound's expression and the factor's symbol and range are not preserved).
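
The swapping step can be illustrated with a simplified FM-style pass. This is a sketch, not the authors' implementation: it recomputes the cut size for every trial move instead of maintaining gain buckets, and it takes the balance condition as a caller-supplied predicate, `balance_ok`, because the slide's skew-slack bound is not preserved. Vertices are assumed to be sortable (for example, integer FF indices).

```python
def cut_size(hyperedges, side):
    """Sum of weights of hyperedges that have vertices on both sides."""
    return sum(w for e, w in hyperedges.items() if len({side[v] for v in e}) == 2)


def fm_pass(hyperedges, side, balance_ok):
    """One pass: move each vertex at most once, return the best partition seen.

    hyperedges -- {frozenset of vertices: weight}
    side       -- {vertex: 0 or 1}, the initial bisection
    balance_ok -- predicate on a trial partition (hypothetical callback)
    """
    side = dict(side)
    best_side, best_cut = dict(side), cut_size(hyperedges, side)
    unlocked = set(side)
    while unlocked:
        moves = []
        for v in sorted(unlocked):
            trial = dict(side)
            trial[v] = 1 - trial[v]
            if balance_ok(trial):
                moves.append((cut_size(hyperedges, trial), v))
        if not moves:
            break
        # Take the move with the smallest resulting cut, even if it is worse
        # than the current cut; this lets the pass climb out of local minima.
        new_cut, v = min(moves, key=lambda m: m[0])
        side[v] = 1 - side[v]
        unlocked.remove(v)  # lock the moved vertex for the rest of the pass
        if new_cut < best_cut:
            best_cut, best_side = new_cut, dict(side)
    return best_side, best_cut
```

In the flow described here, `balance_ok` would evaluate the skew slack of the trial partition against the bound; in a quick experiment it can be any predicate, such as requiring both sides to stay non-empty.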

page 26 CG-based FF Merging We merge 1-bit FFs into MBFFs starting from the four boundaries of the FF bounding box to the center area, based on  INTEGRA [Jiang et al., TCAD'12]  Spiral clustering technique [Chang et al., ISPD'12]

page 27 MBFF and CG Placement Optimization We perform MBFF and CG placement optimization to  Minimize inter-CG clock skew  Minimize wirelength  Minimize required clock buffers  Satisfy control/data-path timing constraints  Satisfy placement density constraints

page 28 MBFF Placement When placing the MBFFs controlled by the same CG, we search for placement bins such that:  The placement density constraint is satisfied.  Each MBFF lies in its corresponding timing-feasible region.  The FF bounding box of the CG fanouts is minimized. A smaller FF bounding box results in shorter gated clock signal wirelength, and hence smaller P_d and T_skew.

page 29 CG Placement The CGs are initially placed at feasible positions which satisfy the control-path timing constraints.  The feasible region of a CG is roughly an ellipse whose two foci are at the positions of the enable logic and one of the CG fanout FFs. We perform an iterative optimization algorithm to:  Move CGs around their feasible regions until inter-CG clock skew cannot be further minimized.  Add clock buffers to either clock path from the clock root to a CG for delay balance.  Insert buffers to either enable signal path from the enable logic to a CG for a larger feasible region of the CG.
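
The elliptical feasible region can be tested with the defining property of an ellipse: the sum of the distances to the two foci is bounded. The sketch below assumes Euclidean distance and assumes the control-path timing constraint has already been converted into a maximum total path length, `max_path_length`; both the conversion and the names are hypothetical.

```python
import math


def in_cg_feasible_region(cg_pos, enable_logic_pos, ff_pos, max_path_length):
    """True if the CG position lies inside the ellipse with the given foci.

    Foci: the enable logic and one fanout FF, as described on the slide.
    max_path_length -- ASSUMED bound derived from the control-path timing constraint.
    """
    total = math.dist(cg_pos, enable_logic_pos) + math.dist(cg_pos, ff_pos)
    return total <= max_path_length
```

Inserting a buffer on the enable path effectively raises the length budget, which enlarges the ellipse and gives the iterative optimizer more candidate CG positions.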

page 30 Outline Introduction Preliminaries The Proposed Algorithms Experiments  Experimental Setups  Experimental Comparisons  Experimental Results Conclusions

page 31 Experimental Setups Programming language  C++ Platform  2.26GHz Intel Xeon machine under the Linux operating system We adopted the benchmark circuits in [Jiang et al., TCAD'12]  Added other logical, physical and timing information for CGs, clock root, and enable logic (EL).  Referred to the Nangate 45nm Open Cell Library to set the input capacitance.  Assumed that all FFs in each circuit are initially connected to the same CG.  Chose the circuits containing fewer than 1,000 FFs with reasonable FF bounding boxes.

page 32 Experimental Comparisons Reference Flow 1 & 2  CG cloning technique is based on the MBFF-aware CG cloning without applying MBFF-aware FF swapping.  FF merging technique is exactly the same as the CG-based FF merging.

page 33 Experimental Results (1/2) Comparison of the numbers of MBFFs with different bit numbers ("# of FFs") and the numbers of CGs ("# of CGs").  Compared with "Reference Flow 1", the proposed flow results in many more MBFFs with a similar number of clock gates.  Compared with "Reference Flow 2", the proposed flow results in slightly more CGs and slightly fewer MBFFs.

page 34 Experimental Results (2/2) Comparisons of the dynamic power consumption  15% less than that resulting from "Reference Flow 1".  10% less than that resulting from "Reference Flow 2". Comparisons of the clock net wirelength  22% less than that resulting from "Reference Flow 1".  18% less than that resulting from "Reference Flow 2". Comparisons of the signal net wirelength  2% less than that resulting from "Reference Flow 2".

page 35 Outline Introduction Preliminaries The Proposed Algorithms Experimental Results Conclusions

page 36 Conclusions We have presented a new problem formulation for clock network optimization with both CGs and MBFFs. We have also introduced novel techniques to optimize gated clock network with CG cloning and FF merging simultaneously. The experimental results have shown that the proposed approach results in better dynamic power and clock wirelength compared with those which optimize gated clock network with CGs and MBFFs separately.

page 37 Thanks for Your Attention