OCV-Aware Top-Level Clock Tree Optimization


OCV-Aware Top-Level Clock Tree Optimization
Tuck-Boon Chan, Kwangsoo Han, Andrew B. Kahng, Jae-Gon Lee and Siddhartha Nath
VLSI CAD Laboratory, UC San Diego

Outline
- Motivation and Previous Work
- Our Approach
- Experimental Setup
- Results and Conclusions

Clock Tree Synthesis Is Challenging!
- The clock tree consumes up to 40% of power → aggressive power reduction → complex clock tree with clock logic cells (CLCs) such as clock-gating cells, dividers and MUXes
- Complex timing constraints across process, voltage, temperature (PVT) and operating scenarios
- On-chip variation (OCV) → more design margin

Top-Level Clock Tree Problems
- The "top-level" clock tree comprises all transitive fanins to CLCs, starting from a clock root pin
- Trees below the CLCs are the bottom-level trees
- Industry tools do not always optimize the top-level clock trees
  - This results in large skews under multi-corner multi-mode (MCMM) scenarios
[Figure: CTS with long non-common paths. Labels: clock root, top-level tree, CLCs (CGC, DIV, MUX), bottom-level trees, Sinks 1, Sinks 2]

Top-Level Clock Tree Optimization
- Optimizing the "top-level" clock tree involves handling complex clock logic cells
- The optimization involves
  - CLC placement
  - Buffer insertion
  - Minimizing non-common paths
  - Balancing the tree based on timing information (WNS, TNS across setup and hold corners)
[Figure: CTS with long non-common paths vs. CTS with reduced non-common paths. Labels: CGC, DIV, MUX, Sinks 1, Sinks 2]
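To make the notion of a non-common path concrete, here is a minimal sketch that counts the tree edges two clock paths do not share. The tree, the cell names and the helper functions are hypothetical and chosen only for illustration; the slides do not give this code.

```python
def path_to_root(parent, node):
    """Return the nodes visited when walking from `node` up to the clock root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def non_common_edges(parent, launch, capture):
    """Count, for each clock path, the tree edges not shared with the other path."""
    up_launch = path_to_root(parent, launch)
    up_capture = path_to_root(parent, capture)
    shared = set(up_launch) & set(up_capture)
    # Every non-shared node contributes exactly one non-common edge (to its parent).
    launch_only = [n for n in up_launch if n not in shared]
    capture_only = [n for n in up_capture if n not in shared]
    return len(launch_only), len(capture_only)


# Hypothetical tree: the root drives a CGC branch and a DIV -> MUX branch.
parent = {"root": None, "CGC": "root", "DIV": "root", "MUX": "DIV",
          "sinks1": "CGC", "sinks2": "MUX"}
result = non_common_edges(parent, "sinks1", "sinks2")
print(result)
```

In this toy tree the two sink groups share only the root, so every edge below it is non-common; placing CLCs so that the paths split later would shrink both counts.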

Previous Works
- Rajaram and Pan (2011)
  - Reduce non-common path delay by reallocating clock pin locations of soft-IP blocks
  - Insert buffers to minimize the difference in clock latency among subtrees across PVT corners
  - Do not consider CLCs, timing between sink groups, or wirelength
- Tsai (2005), Velenis et al. (2003)
  - Minimize the effect of OCV during CTS, but do not handle CLCs or MCMM scenarios
- Lung et al. (2010)
  - Optimize clock skew using LP and account for delay variation across PVT corners
  - Ignore non-common paths and CLC placement


Our Work
- Current CTS tools
  - Balance bottom-level clock trees
  - Optimize CLC placement
  - Multi-corner multi-mode (MCMM) optimization
- Our method
  - Focus on the top-level clock tree
  - Simultaneously optimize CLC placement and balance the clock tree across multiple corners and modes
  - Extract timing constraints from bottom-level clock trees → capture accurate MCMM constraints

LP-Based Optimization
- Objective: a weighted sum of
  - worst negative slack (WNS)
  - total negative slack (TNS)
  - non-common paths
  - wirelength of the clock tree
- Variables: CLC locations and net delays
- Model the delay from pin i to pin j as a linear function of the Manhattan distance → captures the impact of CLC placement
  - With uniform buffer insertion, delay is a linear function of the Manhattan distance
- Extract insertion and timing constraints from bottom-level clock trees to estimate slacks of critical paths
- Delays across different PVT corners are normalized to a reference corner for MCMM optimization
[Figure: delay vs. Manhattan distance between pin i and pin j of a CLC]
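The linear delay model and the corner normalization can be sketched as follows. The coefficient `ps_per_um`, the scale factor `corner_scale` and the function names are assumptions made for illustration; the slides state only that delay is linear in Manhattan distance and that corner delays are normalized to a reference corner.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) points, in micrometers."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def net_delay_ps(p, q, ps_per_um):
    """Linear model: delay is proportional to Manhattan distance.

    `ps_per_um` is a hypothetical per-corner coefficient that would be
    characterized for a uniformly buffered wire.
    """
    return ps_per_um * manhattan(p, q)


def normalize_to_reference(delay_ps, corner_scale):
    """Express a corner's delay in reference-corner units.

    `corner_scale` is the assumed ratio (corner delay / reference delay).
    """
    return delay_ps / corner_scale


pin_i, pin_j = (0, 0), (300, 400)
slow_corner_delay = net_delay_ps(pin_i, pin_j, ps_per_um=0.5)
reference_delay = normalize_to_reference(slow_corner_delay, corner_scale=1.25)
print(slow_corner_delay, reference_delay)
```

Because the model is linear in the pin coordinates once the absolute values are linearized, moving a CLC changes the modeled delays in a way an LP can optimize directly.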

Example
- tp are the terminal pins; d(i,j) is the delay from pin i to pin j
- Example: making d(1,2) = 4ns improves timing
[Figure: a root and top-level tree with terminal pins t1 to t5 and a CLC, above the bottom-level sink groups 1, 2 and 3. Annotated delays: d(1,2) = 2ns, d(1,3) = 0.5ns, d(3,4) = 0.5ns, d(4,5) = 1ns; bottom-level values 1ns and 3ns; critical path delay = 3ns]

Our Heuristics
To implement our optimization in an industrial CTS flow, we use three heuristics:
- Algorithm 1: Extract top-level clock tree
- Algorithm 2: Create Steiner points
- Algorithm 3: Insert buffers

Extract Top-Level Clock Tree
- Inputs
  - Initial clock tree; cells in the tree are vertices and connections between them are edges
  - List of vertices that belong to CLCs
- Algorithm description
  - Obtain the transitive fanins of all CLCs
  - Remove clock routes to the fanin cells
  - Remove buffers and reconnect nets accordingly
- Output
  - List of top-level clock cells and the connections between them
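A minimal netlist-only sketch of this extraction step is shown below. The `fanins` dictionary, the cell names and the helper functions are hypothetical; route removal is a physical-design operation and is not modeled here.

```python
def transitive_fanin(fanins, seeds):
    """Collect the seed cells plus every cell that transitively drives them."""
    seen, stack = set(), list(seeds)
    while stack:
        cell = stack.pop()
        if cell in seen:
            continue
        seen.add(cell)
        stack.extend(fanins.get(cell, []))
    return seen


def skip_buffers(fanins, buffers, cell):
    """Return the drivers of `cell` after bypassing any chain of buffers."""
    drivers = []
    for driver in fanins.get(cell, []):
        while driver in buffers:
            driver = fanins[driver][0]  # a buffer has exactly one driver
        drivers.append(driver)
    return drivers


def extract_top_level(fanins, clcs, buffers):
    """Return (cells, edges) of the top-level tree with buffers removed."""
    cells = transitive_fanin(fanins, clcs) - set(buffers)
    edges = sorted((d, c) for c in cells
                   for d in skip_buffers(fanins, buffers, c))
    return sorted(cells), edges


# Hypothetical initial clock tree: cell -> list of driving cells.
fanins = {"b1": ["root"], "CGC": ["b1"], "DIV": ["root"], "b2": ["DIV"],
          "MUX": ["b2", "CGC"], "b3": ["MUX"], "ff1": ["b3"]}
cells, edges = extract_top_level(fanins, ["CGC", "DIV", "MUX"], {"b1", "b2", "b3"})
print(cells, edges)
```

Cells below the CLCs (here `b3` and `ff1`) are never reached by the backward traversal, so they stay in the bottom-level trees.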

Output of Algorithm 1
[Figure: clock tree before and after Algorithm 1. Labels: CLC (four instances), FF group 1]

Create Steiner Points
- Inputs
  - Top-level clock tree
  - List of vertices that belong to CLCs
- Algorithm description
  - Find the pin pair that minimizes the sum of the difference in sink latency and the delay due to Manhattan distance
  - Merge the pin pair with the minimum sum by inserting a new Steiner point
  - Repeat until all driving pins have a single connection
- Output
  - A binary top-level clock tree and its connections
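The greedy pairing can be sketched as below for the fanout pins of a single driving pin. Several details are assumptions not given in the slides: the Steiner point is placed at the midpoint of the merged pair, it inherits the larger of the two sink latencies, and distance is converted to delay with a hypothetical coefficient `ps_per_um`.

```python
from itertools import combinations


def merge_cost(a, b, ps_per_um):
    """Latency difference plus the delay implied by Manhattan distance."""
    dist = abs(a["x"] - b["x"]) + abs(a["y"] - b["y"])
    return abs(a["lat"] - b["lat"]) + ps_per_um * dist


def build_binary_tree(pins, ps_per_um):
    """Greedily merge the cheapest pin pair until one connection remains."""
    nodes = {name: dict(p) for name, p in pins.items()}
    merges = []
    while len(nodes) > 1:
        a, b = min(combinations(sorted(nodes), 2),
                   key=lambda pair: merge_cost(nodes[pair[0]], nodes[pair[1]],
                                               ps_per_um))
        pa, pb = nodes.pop(a), nodes.pop(b)
        name = f"s{len(merges) + 1}"  # hypothetical Steiner point name
        nodes[name] = {"x": (pa["x"] + pb["x"]) / 2,      # assumed: midpoint
                       "y": (pa["y"] + pb["y"]) / 2,
                       "lat": max(pa["lat"], pb["lat"])}  # assumed: worst latency
        merges.append((name, a, b))
    return merges


# Hypothetical pins: j1..j3 have equal sink latency, j4 is much larger.
pins = {"j1": {"x": 0, "y": 0, "lat": 100}, "j2": {"x": 10, "y": 0, "lat": 100},
        "j3": {"x": 100, "y": 0, "lat": 100}, "j4": {"x": 5, "y": 0, "lat": 500}}
merges = build_binary_tree(pins, ps_per_um=1.0)
print(merges)
```

In this example j4 sits close to j1 and j2 but is merged last, because its much larger sink latency dominates the cost.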

Output of Algorithm 2
[Figure: three stages of merging the fanout pins j1 to j4 of driving pin i, with the new point j2' inserted. Merging is based on Manhattan distance and sink latency (j1.L, j2.L, j3.L, j4.L), where j1.L = j2.L = j3.L << j4.L]

Insert Buffers
- Inputs
  - Two-pin nets of the top-level clock tree
  - Required delay of each net
- Algorithm
  - Calculate the number of buffers required to meet the delay target as a function of net and buffer delays
  - Calculate the minimum wirelength required to insert that number of buffers
  - Determine whether to insert the buffers in an L-shape or a U-shape
- Output
  - Two-pin nets of the top-level clock tree with buffers inserted
[Figure: Algorithm 3, showing L-shape and U-shape buffer insertion]
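One way to read these three steps is sketched below. The stage model (a fixed buffer delay plus wire delay over a fixed buffer spacing), the parameter names and the rule that a detour implies a U-shape are all assumptions for illustration; the slides do not specify the actual formulas.

```python
import math


def plan_buffers(target_ps, dist_um, buf_ps, wire_ps_per_um, spacing_um):
    """Return (buffer count, minimum wirelength, route shape) for a two-pin net."""
    # Assumed stage model: each buffer drives `spacing_um` of wire.
    stage_ps = buf_ps + wire_ps_per_um * spacing_um
    n_buffers = math.ceil(target_ps / stage_ps)
    min_wirelength_um = n_buffers * spacing_um
    # Assumed rule: if the buffers fit within the Manhattan distance, a direct
    # L-shaped route suffices; otherwise a U-shaped detour adds wirelength.
    shape = "L" if min_wirelength_um <= dist_um else "U"
    return n_buffers, min_wirelength_um, shape


short_delay = plan_buffers(target_ps=200, dist_um=500, buf_ps=25,
                           wire_ps_per_um=0.25, spacing_um=100)
long_delay = plan_buffers(target_ps=400, dist_um=500, buf_ps=25,
                          wire_ps_per_um=0.25, spacing_um=100)
print(short_delay, long_delay)
```

With the same pin-to-pin distance, doubling the required delay doubles the buffer count and forces the detour in this toy model.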


CTS Testcase Requirements
- Realistic, resembling clock trees typically seen in SoC blocks
- Include CLCs and top-level hierarchies
- Include combinational logic and critical paths across sink groups
- Include multiple clock roots and generated clocks

Our CTS Testcases
- We develop generators for high-speed CTS testcases typically found in CPU/GPU blocks in modern SoCs
- Implement clock roots that are outputs of PLLs as well as crystal oscillators
- Implement different types of CLCs
  - Glitch-free clock MUXes
  - Dividers
  - Clock-gating cells
- Multiple generated clocks for debug, tracing, IO and peripherals

Examples of CTS Testcases
[Figure: first testcase. Labels: clk, scan_clk, m_clk, DIV2, DIV4, DIV8, CGC, MUX, SINKS]
- Clocks to all sink groups are generated clocks
- Top level has up to two levels of hierarchy
[Figure: second testcase. Labels: clk, scan_clk, m_clk, DIV2, DIV4, DIV8, CGC, MUX, SINKS]
- Reconvergent paths
- Top level has up to two levels of hierarchy

Experimental Setup
- Six high-speed testcases
- P&R tool is an industry tool
- CTS uses MCMM scenarios
- Timing analysis tool is Synopsys PrimeTime
- LP solver is CPLEX
- Flow is implemented in Tcl

Operating Conditions
Parameter                                  Value
PVT corner for setup @ 1.25GHz             SS, 0.85V, 125C
PVT corner for hold @ 1.25GHz              FF, 1.05V, 125C
PVT corner for setup @ 1.67GHz             SS, 1.10V, 125C
PVT corner for hold @ 1.67GHz              FF, 1.30V, 125C
Max. transition for clock paths            55ps
Max. transition for data paths             12.5% of clock period
Timing derate on net delay (early/late)    0.90/1.19
Timing derate on cell delay (early/late)   0.90/1.05
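As a worked illustration of these settings, the sketch below converts the data-path transition limit to picoseconds and applies the early/late derates to a path segment. The 100 ps cell and net delays are made-up numbers, and the function names are hypothetical; only the frequencies, the 12.5% limit and the derate values come from the table.

```python
def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def max_data_transition_ps(freq_ghz, fraction=0.125):
    """Data-path transition limit: 12.5% of the clock period."""
    return fraction * clock_period_ps(freq_ghz)


def derated_delay_ps(cell_ps, net_ps, cell_derate, net_derate):
    """Apply separate timing derates to the cell and net delay of a segment."""
    return cell_ps * cell_derate + net_ps * net_derate


limit_125 = max_data_transition_ps(1.25)   # 100 ps at 1.25GHz
limit_167 = max_data_transition_ps(1.67)   # roughly 74.9 ps at 1.67GHz

# Made-up non-common segment: 100 ps of cell delay and 100 ps of net delay.
late = derated_delay_ps(100.0, 100.0, cell_derate=1.05, net_derate=1.19)
early = derated_delay_ps(100.0, 100.0, cell_derate=0.90, net_derate=0.90)
print(limit_125, limit_167, late - early)
```

In this made-up case the derates open a gap of about 44 ps between the late and early views of the same 200 ps segment, which illustrates why shortening non-common paths reduces the OCV margin.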

Our Optimization Flow
Both flows start from a placed design and run CTS to obtain an initial clock tree.
- Reference CTS flow
  - Post-CTS opt
  - Routing + optimization
  - DRC & timing fix
- Our optimization flow
  - Remove buffers from top-level tree
  - CLC placement & buffer insertion
  - Placement legalization
  - Route top-level clock
  - Post-CTS opt
  - Routing + optimization
  - DRC & timing fix
- Compare post-route metrics


Results: Improved Timing
- Our formulation focuses on minimizing setup WNS
- Setup WNS improves by up to 320ps
- Hold WNS worsens, but by less than 70ps

Results: Improved WL, Power
Metric             T1    T2    T3
Wirelength (WL)    46%   41%   51%
Switching power    23%   15%   28%

Conclusions
- Industry tools do not always optimize the top-level clock tree
- We develop an optimization formulation for the top-level tree and solve it using three heuristics
- We develop realistic high-speed CTS testcases typical of clock trees in CPU/GPU blocks
- Our optimization flow improves setup WNS by up to 320ps, wirelength by up to 51% and dynamic power by up to 28%
- Ongoing work includes
  - Handling obstacles
  - Accounting for optimal buffering solutions
  - Creating testcases for other important SoC elements
  - Joint optimization of the top- and bottom-level trees

Thank You