Low-power Clock Trees for CPUs. Dong-Jin Lee, Myung-Chul Kim and Igor L. Markov, Dept. of EECS, University of Michigan. ICCAD 2010, Dong-Jin Lee, University of Michigan.

Presentation transcript:

Low-power Clock Trees for CPUs
Dong-Jin Lee, Myung-Chul Kim and Igor L. Markov
Dept. of EECS, University of Michigan
ICCAD 2010, Dong-Jin Lee, University of Michigan

Outline
■ Motivation and challenges
■ Modeling and objectives
  − Local skew with variation
  − Local-skew slack
  − Modeling process variation
■ Proposed methodology and techniques
  − Initial tree construction and buffer insertion
  − Robustness improvements
  − Wire snaking and delay buffer insertion
■ Empirical validation
■ Summary

Motivation
■ Clock networks
  − Contribute a significant fraction of dynamic power
  − A limiting factor in high-performance CPUs and SoCs
■ Challenges
  − Interconnect is lagging in performance while transistors continue scaling
  − Multi-objective optimization
    – Traditional clock network synthesis constraints
    – The increasing impact of process variation
    – Power-performance-cost trade-offs

Tree vs Mesh
■ Objectives
  − Minimize skew of a high-performance clock tree
  − Minimize the impact of PVT variations
  − Clock trees vs meshes, subject to skew < 7.5 ps
[Figure: robustness vs power efficiency, locating trees, meshes and ideal clock networks]

Our Contributions
■ The notion of local-skew slack for clock trees
■ A tabular technique to estimate the impact of variations
■ A path-based technique to enhance the robustness
■ A time-budgeting algorithm for clock-tree tuning with minimal power resources
■ Fine tuning of clock trees: accurate, fast, power efficient
■ Implementation: Contango2.0
■ Strong empirical results: low skew, robustness, low power

Modeling and Objectives

Local Skew
■ Main objective (concept)
  − Minimize local skew in the presence of variation
■ Definition: Skew
  − Ψ : clock tree
  − λ(s_i) : the clock latency (insertion delay) at sink s_i ∈ Ψ
  − skew(s_i, s_j) = |λ(s_i) − λ(s_j)|
■ Definition: Global skew (ω^Ψ)
  − ω^Ψ = max over s_i, s_j ∈ Ψ of |λ(s_i) − λ(s_j)|

Local Skew
■ Definition: The worst nominal local skew (ω^Ψ_Δ)
  − Δ : local skew distance bound
  − dist(s_i, s_j) : Manhattan distance between s_i and s_j ∈ Ψ
  − ω^Ψ_Δ = max of |λ(s_i) − λ(s_j)| over s_i, s_j ∈ Ψ with dist(s_i, s_j) < Δ
■ Definition: The worst local skew with variation (ω^Ψ_{Δ,ν,y})
  − ν : variation model
  − y : yield (0 < y ≤ 1)
  − f(t) : the cumulative distribution function of ω^Ψ_{Δ,ν}
  − ω^Ψ_{Δ,ν,y} = f^{-1}(y), the skew value that the tree stays within with probability y
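The two nominal definitions can be made concrete with a short sketch. It is only an illustration: the sink names, latencies and coordinates are made up, and the strict comparison dist < Δ is an assumption, since the transcript does not preserve the original formula.

```python
from itertools import combinations

def global_skew(latency):
    """Largest latency difference between any two sinks."""
    values = list(latency.values())
    return max(values) - min(values)

def local_skew(latency, position, delta):
    """Largest latency difference among sink pairs closer than delta.

    Distance is Manhattan distance; using a strict '<' is an assumption.
    """
    worst = 0.0
    for a, b in combinations(latency, 2):
        (xa, ya), (xb, yb) = position[a], position[b]
        if abs(xa - xb) + abs(ya - yb) < delta:
            worst = max(worst, abs(latency[a] - latency[b]))
    return worst

# Hypothetical sinks: latency in ps, position in arbitrary layout units.
latency = {"s1": 100.0, "s2": 103.0, "s3": 110.0}
position = {"s1": (0, 0), "s2": (50, 0), "s3": (900, 900)}

g = global_skew(latency)               # considers every pair of sinks
l = local_skew(latency, position, 100)  # only s1 and s2 are within the bound
```

In this toy case the distant sink s3 dominates global skew but does not affect local skew, which is why the local measure can be much smaller on the same tree.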

Modeling and Objectives - Example
■ Worst local skew with variation (ω^Ψ_{Δ,ν,y})
  − Probability density function of ω^Ψ_{Δ,ν}
  − Ω_Δ = 7.5 ps, y = 95%, ω^Ψ_{Δ,ν,y} < Ω_Δ
  − ω^Ψ_{Δ,ν,y} = 6.05 ps
[Figure: PDF, CDF and inverse CDF of ω^Ψ_{Δ,ν} (in ps), marking Ω_Δ and ω^Ψ_{Δ,ν,y} = 6.05 ps at y = 0.95]
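Reading the yield-based skew off an inverse CDF can be sketched with an empirical quantile over Monte Carlo samples. This is a minimal illustration rather than the authors' flow: the samples below are synthetic (drawn from an arbitrary Gaussian), whereas in the slides they would come from SPICE simulations with variation.

```python
import math
import random

def worst_skew_at_yield(samples, y):
    """Empirical inverse CDF: smallest sample t with at least a fraction y of samples <= t."""
    if not 0.0 < y <= 1.0:
        raise ValueError("yield must satisfy 0 < y <= 1")
    ordered = sorted(samples)
    k = math.ceil(y * len(ordered))
    return ordered[k - 1]

# Synthetic local-skew samples in ps (hypothetical distribution, fixed seed).
random.seed(0)
samples = [abs(random.gauss(4.0, 1.0)) for _ in range(1000)]

skew_limit = 7.5                      # local skew limit in ps, as on the slide
w = worst_skew_at_yield(samples, 0.95)
meets_limit = w < skew_limit
```

A tree passes the yield check when the value returned for y = 0.95 is below the local skew limit.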

Optimization Objectives
■ Building variation-tolerant clock trees
  − such that ω^Ψ_{Δ,ν,y} < Ω_Δ (Ω_Δ : local skew limit)
  − subject to slew constraints
■ Minimizing clock-tree power
[Figure: distribution of ω^Ψ_{Δ,ν} (in ps) with ω^Ψ_{Δ,ν,y} below Ω_Δ]

Local-skew Slack σ(s) for sink s ∈ Ψ
■ Definition
  − σ(s) is the minimum amount of additional delay for s, so that the tree satisfies ω^Ψ_Δ < Ω_Δ
■ Example (Ω_Δ = 5 ps)
[Figure: local-skew slack example]
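One way to compute such slacks is to treat the pairwise skew limits as difference constraints and take their smallest non-negative solution by repeated relaxation. The sketch below is an assumption about how this could be done, not the algorithm from the paper; the sink names and latencies are hypothetical, and it uses "at most the limit" instead of a strict inequality for simplicity.

```python
def local_skew_slacks(latency, near_pairs, limit):
    """Smallest extra delay per sink so every listed pair differs by at most limit.

    latency: nominal latency per sink (ps); near_pairs: sink pairs within the
    local skew distance bound; limit: positive local skew limit (ps).
    """
    slack = {s: 0.0 for s in latency}
    for _ in range(len(latency)):          # Bellman-Ford style relaxation rounds
        changed = False
        for a, b in near_pairs:
            for fast, slow in ((a, b), (b, a)):
                # Delay 'fast' would need to stay within 'limit' of 'slow'.
                need = latency[slow] + slack[slow] - limit - latency[fast]
                if need > slack[fast]:
                    slack[fast] = need
                    changed = True
        if not changed:
            break
    return slack

# Hypothetical example with a 5 ps limit, as in the slide's example caption.
latency = {"A": 10.0, "B": 18.0, "C": 12.0}
pairs = [("A", "B"), ("B", "C"), ("A", "C")]
slacks = local_skew_slacks(latency, pairs, 5.0)
```

Sinks that are already the slowest among their neighbours get zero slack, so only the early-arriving sinks are slowed down.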

Modeling Process Variation
■ Impact of variation on skew(s_i, s_j) depends on the tree path length, the number of buffers and the buffer type between s_i and s_j
■ Notation
  − T : technology node
  − B : buffer and wire library
  − ν : variation model
■ Variation-estimation table Ξ_{T,B,ν,y}[w, b, t]
  − worst-case increase in skew (with probability y) between two sinks connected by a tree path of length w with b buffers of buffer type t
[Figure: tree path between two sinks, labelled with w (tree path length), b (number of buffers, 2 in the example) and t (buffer type)]

Modeling Process Variation
■ varEst(s_i, s_j)
  − the worst-case variational skew(s_i, s_j)
  − [equation not preserved]
■ Key constraint
  − [equation not preserved]
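A table-driven estimate of this kind could be sketched as follows. Everything concrete here is hypothetical: the length breakpoints, buffer type names and picosecond values are invented, the lookup rounds the path length up to the next tabulated breakpoint, and the check "nominal skew plus estimate below the limit" is an assumed form of the key constraint, whose equation is missing from the transcript.

```python
import bisect

# Hypothetical table: XI[t][b][i] is the worst-case skew increase (ps) for
# buffer type t, b buffers, and a path no longer than LENGTHS[i].
LENGTHS = [1000, 2000, 4000]
XI = {
    "small": {2: [1.5, 2.4, 3.9], 4: [2.0, 3.0, 4.6]},
    "large": {2: [0.8, 1.3, 2.1], 4: [1.1, 1.7, 2.6]},
}

def var_est(w, b, t):
    """Look up the estimate, rounding the path length w up to a breakpoint."""
    i = bisect.bisect_left(LENGTHS, w)
    if i == len(LENGTHS):
        raise ValueError("path length beyond table range")
    return XI[t][b][i]

def meets_constraint(nominal_skew, w, b, t, limit):
    """Assumed constraint: nominal skew plus variation estimate stays below limit."""
    return nominal_skew + var_est(w, b, t) < limit

ok = meets_constraint(4.0, 1500, 2, "small", 7.5)       # 4.0 + 2.4 < 7.5
too_tight = meets_constraint(6.0, 1500, 2, "small", 7.5)  # 6.0 + 2.4 >= 7.5
```

Rounding the length up keeps the estimate pessimistic, which matches the worst-case intent of the table.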

Initial Tree Construction
■ ZST-DME algorithm* based on Elmore delay
■ A simple and robust technique for obstacle avoidance**
■ Initial buffer insertion
  − t_0 : the initial buffer type for initial buffer insertion
  − Use variation-estimation table with path lengths from initial tree
  − Once t_0 is determined, we adapt the fast variant of van Ginneken's algorithm*** for initial buffer insertion
  − Minimize insertion delay, reliable slew rate
* J.-H. Huang et al., "On Bounded-Skew Routing Tree Problem," DAC '95
** D.-J. Lee et al., "Contango: Integrated Optimization of SoC Clock Networks," DATE '10
*** W. Shi et al., "A Fast Algorithm for Optimal Buffer Insertion," Trans. on CAD 24(6), 2005
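Since the initial tree is balanced under Elmore delay, a small sketch of that delay model may help. It uses a lumped RC tree in which each node carries a capacitance and each edge a resistance; the node names and values are hypothetical, and this shows only the delay evaluation, not the DME merging itself.

```python
def elmore_delays(parent, res, cap, root):
    """Elmore delay from root to every node of a lumped RC tree.

    parent[n]: parent of node n; res[n]: resistance of edge parent[n] -> n;
    cap[n]: capacitance lumped at node n.
    """
    children = {n: [] for n in cap}
    for n, p in parent.items():
        children[p].append(n)

    # Preorder traversal: every node appears after its parent.
    order, stack = [], [root]
    while stack:
        n = stack.pop()
        order.append(n)
        stack.extend(children[n])

    # Downstream capacitance: own cap plus everything below the node.
    down = dict(cap)
    for n in reversed(order):
        if n != root:
            down[parent[n]] += down[n]

    # Each edge adds its resistance times the capacitance it drives.
    delay = {root: 0.0}
    for n in order[1:]:
        delay[n] = delay[parent[n]] + res[n] * down[n]
    return delay

# Hypothetical tree: root r drives node a, which drives sinks s1 and s2.
parent = {"a": "r", "s1": "a", "s2": "a"}
res = {"a": 1.0, "s1": 2.0, "s2": 3.0}
cap = {"r": 0.0, "a": 1.0, "s1": 2.0, "s2": 1.0}
delay = elmore_delays(parent, res, cap, "r")
nominal_skew = abs(delay["s1"] - delay["s2"])
```

A zero-skew construction would adjust the branch lengths (and hence resistances and capacitances) until the two sink delays match.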

Robustness Improvement
■ Improve robustness after initial buffer insertion so that ω^Ψ_{Δ,ν,y} < Ω_Δ holds after skew optimization
■ The target buffer type t(s_i, s_j) for a tree path between sinks s_i and s_j is defined as the smallest t such that [equation not preserved]
  − choosing smaller buffers reduces capacitance
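Picking the smallest adequate buffer type can be sketched as a scan from small to large. The condition tested here (nominal skew plus a per-type variation estimate below the limit) is an assumption standing in for the equation missing from the transcript, and the type names and estimates are hypothetical.

```python
def target_buffer_type(types_small_to_large, estimate, nominal_skew, limit):
    """Return the smallest buffer type whose estimated variation fits the limit.

    estimate(t) gives the assumed worst-case skew increase (ps) for type t.
    Returns None when no type in the library is sufficient.
    """
    for t in types_small_to_large:
        if nominal_skew + estimate(t) < limit:
            return t
    return None

# Hypothetical per-type estimates for one tree path, in ps.
estimates = {"x1": 4.1, "x2": 2.9, "x4": 1.8}
types = ["x1", "x2", "x4"]

choice = target_buffer_type(types, estimates.get, 4.0, 7.5)
no_fit = target_buffer_type(types, estimates.get, 6.0, 7.5)
```

Stopping at the first type that fits reflects the slide's point that smaller buffers cost less capacitance.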

Local Skew Optimization: Wire Snaking
■ Local-skew optimization techniques
  − based on the optimal tuning amount from the slack computation algorithms with varEst(s_i, s_j)
■ Improved wire snaking algorithm
  − speed, accuracy and routing resources
[Figure: iterative wire snaking on edge e with T_target(e) = 11 ps. Round 1: T^1_target(e) = 11 ps, T^1_actual(e) = 7 ps. Round 2: T^2_target(e) = 4 ps, T^2_actual(e) = 3 ps. Round 3: T^3_target(e) = 1 ps, T^3_actual(e) = 1 ps. Cumulative T_actual(e): 7 ps, 10 ps, 11 ps. In every round T^i_target(e) ≥ T^i_actual(e).]

Delay Model for Wire Snaking
■ α : used to keep T^i_actual(e) ≤ T^i_target(e) efficiently
■ The delay model for wire snaking aims for T^i_actual(e) to satisfy the above inequality with the highest α possible
■ Look-up tables for length estimation
  − to enhance the quality of estimation by wire snaking
  − built from a set of SPICE simulations for each technology environment, which includes the technology model, types of buffers and wires, and the variation specification
■ We achieved α values between 60% and 70% for the ISPD 2010 CNS contest benchmarks
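The iterative tuning idea can be sketched as a loop that never overshoots the target. This is an illustration under assumptions: α is read here as the fraction of the requested delay that a round actually delivers, the callback `apply_snaking` is a hypothetical stand-in for snaking a wire and measuring the result in SPICE, and the tolerance and round cap are arbitrary.

```python
def tune_edge(target, apply_snaking, tolerance=0.5, max_rounds=10):
    """Add delay to one edge in rounds until within tolerance of target (ps).

    apply_snaking(requested) returns the delay actually gained, assumed to be
    at most the requested amount so the total never exceeds the target.
    """
    achieved = 0.0
    history = []
    for _ in range(max_rounds):
        remaining = target - achieved
        if remaining <= tolerance:
            break
        gained = apply_snaking(remaining)
        achieved += gained
        history.append((remaining, gained))
    return achieved, history

def conservative_model(requested, alpha=0.65):
    """Hypothetical stand-in for snake-and-measure: delivers alpha of the request."""
    return alpha * requested

achieved, history = tune_edge(11.0, conservative_model)
```

A higher α means fewer rounds and fewer simulator calls, while α below 1 keeps each round from overshooting its target.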

Optimal Node Selection for Wire Snaking
■ Wire snaking at buffer outputs is more accurate than at other nodes
■ Limiting wire snaking to buffer outputs reduces the number of SPICE calls
■ Example
[Figure: node selection example]

Delay Buffer Insertion
■ Highly unbalanced sink capacitances or layout obstacles may result in significant local skew
■ Delay buffer insertion
  − Skew can be reduced by the delay of the inserted buffer
  − Further precise wire snaking is possible because the inserted buffer isolates the target node
■ Example
[Figure: delay buffer insertion example]

ISPD'10 Clock Network Synthesis Contest
■ 45nm 2GHz CPU benchmarks from IBM and Intel
■ Evaluation
  − Monte-Carlo SPICE simulations with PVT variations
  − Skew and slew constraints (7.5 ps, 100 ps)
  − Objective: total capacitance, as a proxy for dynamic power
■ A rare opportunity to compare multiple strategies for clock-network synthesis

Example of Our Clock Tree
■ ispd10cns07
[Figure: clock tree for ispd10cns07]

Empirical Validation
■ ISPD 2010 benchmarks
  − 2.6 ps nominal local skew
  − Smaller capacitance than CNSrouter and NTUclock by 4.22× and 4.13×, respectively
  − Our clock trees achieve yield > 95%, while CNSrouter violates yield constraints on 3 benchmarks and NTUclock on 7

ICCAD 2010 Proceedings
■ Local skew constraints are all cleared
■ Smaller capacitance than NTU and CUHK by 2.09× and 4.24×, respectively
■ More robust with smaller capacitance
[Table: ω^Ψ_{Δ,ν,y} and capacitance per cns benchmark, with averages, for NTU, CUHK and Contango2; values not preserved]

Skew Profiles for Contango2 & CNSrouter
■ Probability density functions (PDF) for skew on ISPD'10 benchmarks
[Figure: skew PDFs for Contango2 and CNSrouter]

Trade-off - Power vs Robustness to Variations
■ Under tight local skew constraints, large buffers ensure robustness but increase capacitance
  − Much capacitance can be saved when local skew constraints are loose
■ Experiments on ispd10cns08
[Figure: capacitance vs local skew constraint on ispd10cns08]

Summary
■ A tree solution for CPU clock routing
  − Improves power consumption under tight skew constraints in the presence of variation
  − Clock trees can be tuned to have nominal skew below 5 ps and low total skew in the presence of variation
  − 4× capacitance improvement on average over mesh structures
■ Our clock trees have a higher yield than meshes
  − meshes are not as easy to tune for nominal skew

Questions and Answers
Thank you! Questions?
