-1- Sensitivity-Guided Metaheuristics for Accurate Discrete Gate Sizing Jin Hu, Andrew B. Kahng, Seokhyeong Kang, Myung-Chul Kim and Igor L. Markov*

Presentation transcript:

-1- Sensitivity-Guided Metaheuristics for Accurate Discrete Gate Sizing
Jin Hu*, Andrew B. Kahng, Seokhyeong Kang, Myung-Chul Kim* and Igor L. Markov*
UC San Diego, *University of Michigan
International Conference on Computer-Aided Design, November 5th, 2012

-2- Outline
Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work

-3- Gate Sizing in VLSI Design
Gate sizing
– Effective approach to power and delay optimization
– Sizing problem seen at all phases of the RTL-to-GDS flow
Energy vs. performance envelope in VLSI design
[Figure: all possible designs plotted as energy vs. delay; the Pareto frontier of the energy consumption vs. performance tradeoff runs between the lowest possible delay and the lowest possible energy]

-4- Gate Sizing in VLSI Design
Objective
– Size the library cell of each gate while minimizing total power subject to design constraints (e.g., slack, slew, capacitance)

-5- Gate Sizing in VLSI Design
Objective
– Size the library cell of each gate while minimizing total power subject to design constraints (e.g., slack, slew, capacitance)
Tunable parameters: gate width, gate length and Vth
– Gate width (drive strength): INVX2, INVX4, INVX8, INVX16
– Multi-Vth: HVT, NVT, LVT
– Lgate bias: L=60nm, L=65nm, L=55nm
[Figure: each parameter spans a range from lower power and lower speed to higher power and higher speed]

-6- Previous Approaches
Common heuristics/algorithms
[Figure: methods arranged from continuous to discrete (linear programming, convex optimization, Lagrangian relaxation, dynamic programming, sensitivity-based sizing) against optimality and scalability]
Limitations
– Continuous methods: industrial cell libraries offer discrete gate sizes, and rounding solutions is not easy
– Discrete methods: scalability to large circuits is an issue
– Do not account for realistic delay models and constraints (capacitance, slew)

-7- Stochastic Combinatorial Optimization
Hard combinatorial optimizations are often solved using Simulated Annealing (SA) or other metaheuristics
Our work uses two newer metaheuristic frameworks: Large-Step Markov Chains and Go-With-The-Winners
SA: analogy to physical annealing and thermodynamic ensembles [Kirkpatrick, Gelatt, Vecchi 1983]
State = solution; Energy = cost
Optimal only in the limit of infinitely slow cooling and runtime
– Annealing on fractal landscapes: Sorkin91
– Finite-time annealing: BoeseK93

-8- Stochastic Combinatorial Optimization: LSMC
Large-Step Markov Chains [Martin, Otto, Felten 1991]: iteratively perform two operations:
1. descend using a greedy search method
2. perturb the local-optimum result with a kick move
Takes advantage of an available local search heuristic; more efficient than conventional simulated annealing
LSMC is essentially greedy, but with a powerful neighborhood operator (= {kick + descent}); always steps directly from one local minimum to a better local minimum
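The kick-plus-descent loop described on this slide can be sketched in a few lines. This is only an illustration of the zero-temperature variant the slide describes (accept a new local minimum only if it is better); the toy landscape and the names `descend`, `lsmc`, `toy_cost`, `toy_neighbors` and `toy_kick` are hypothetical and do not come from the talk.

```python
import random

def descend(x, cost, neighbors):
    """Greedy descent: move to the best neighbor until none improves."""
    while True:
        best = min(neighbors(x), key=cost)
        if cost(best) >= cost(x):
            return x  # x is a local minimum
        x = best

def lsmc(x0, cost, neighbors, kick, iterations, rng):
    """Kick the current local minimum, descend, keep the result only if better."""
    current = descend(x0, cost, neighbors)
    for _ in range(iterations):
        candidate = descend(kick(current, rng), cost, neighbors)
        if cost(candidate) < cost(current):
            current = candidate
    return current

# Hypothetical toy landscape on the integers 0..99 with many local minima.
def toy_cost(x):
    return (x - 70) ** 2 // 10 + 15 * (x % 7)

def toy_neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 99]

def toy_kick(x, rng):
    # Large step: jump up to 10 positions away, clamped to the domain.
    return min(99, max(0, x + rng.randint(-10, 10)))

rng = random.Random(0)
plain_descent = descend(5, toy_cost, toy_neighbors)
with_kicks = lsmc(5, toy_cost, toy_neighbors, toy_kick, 200, rng)
print(toy_cost(plain_descent), toy_cost(with_kicks))
```

Because a candidate is accepted only when it is strictly better, the result is never worse than plain greedy descent from the same start.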

-9- Stochastic Combinatorial Optimization: GWTW
Go-With-The-Winners [Aldous, Vazirani 1994]: invoke greedy heuristics with randomized multi-starts, explore a large space by continuing the search from a small set of best-seen solutions
Finds the global optimum with high probability under certain assumptions [AV94]
Runtime of GWTW is bounded by a polynomial in the depth of the tree and the tree imbalance parameter
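A minimal sketch of the go-with-the-winners idea, assuming a fixed-size population that is advanced one stage at a time and then refilled from its best members. The function `gwtw`, the toy cost and the step rule are hypothetical illustrations, not the formulation analyzed in [AV94].

```python
import random

def gwtw(starts, step, cost, stages, keep):
    """Advance every search by one stage, then continue only from the winners."""
    population = list(starts)
    for _ in range(stages):
        population = [step(x) for x in population]
        population.sort(key=cost)
        winners = population[:keep]
        # Clone the winners round-robin so the population size stays constant.
        population = [winners[i % keep] for i in range(len(population))]
    return min(population, key=cost)

rng = random.Random(1)

def toy_cost(x):
    return abs(x - 42)

def toy_step(x):
    """Randomized greedy move: try one random offset, accept only if it improves."""
    trial = x + rng.choice([-5, -1, 1, 5])
    return trial if toy_cost(trial) < toy_cost(x) else x

starts = [rng.randint(0, 1000) for _ in range(8)]
best = gwtw(starts, toy_step, toy_cost, stages=300, keep=2)
print(best, toy_cost(best))
```

The `keep` argument plays the role of the "small set of best-seen solutions": losers are discarded and their compute budget is reassigned to copies of the winners.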

-10- Our Work: Sensitivity-Guided Metaheuristics
We apply sensitivity-guided metaheuristics based on the Go-With-The-Winners paradigm
– Define a parameterized space for gate sizing
– Explore a heuristic space using a multistart technique and efficient parallelization on a multi-core system
– Use total negative slack (TNS) as a sensitivity function (with a fast estimation technique)
Infrastructure: ISPD 2012 gate sizing contest
– Realistic benchmarks mapped into a modern discrete gate library

-11- Outline
Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work

-12- Trident: Sensitivity-Guided Metaheuristics
In our heuristic, multiple tines of a trident represent multiple solution trajectories
Trident: central to the symbols of both UCSD and Ukraine

-13- Trident: Sensitivity-Guided Metaheuristics
Our heuristic: explore a parameterized heuristic space with multistarts, then apply Go-With-The-Winners
[Figure: initial solution, multistarts, go-with-the-winners, final solution]
Global Timing Recovery (GTR): find violation-free solutions with multistarts (recover feasibility)
Power Reduction with Feasible Timing (PRFT): iteratively reduce total leakage with greedy downsizing (maintain feasibility)

-14- Trident: Entire Flow
Input design (netlist, SPEF, SDF, cell library)
Initial cell assignments
Primary optimization (multi-threaded)
– Global Timing Recovery (GTR): multistart with coarse search and fine search, producing a violation-free solution
– Power Reduction with Feasible Timing (PRFT): sensitivity-guided greedy sizing and perturbing (upsizing) bottleneck cells
Final cell assignments

-15- Global Timing Recovery: Flow on Each Thread
GTR seeks violation-free solutions with two parameters: α (leakage exponent) and γ (% of cells upsized)
Flow:
1. Run static timing analysis
2. Calculate sensitivity (α) for cells with negative slack
3. Upsize γ% of cells in descending order of sensitivity
4. Timing met? If NO, update timing and repeat from the sensitivity calculation
TNS: total negative slack
ΔTNS: TNS reduction after cell upsizing
Δleakage: cell leakage increase after cell upsizing
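The per-thread GTR loop can be sketched on a toy circuit. The sensitivity formula did not survive in this transcript, so the form used here, ΔTNS / Δleakage^α, is an assumption based on the slide calling α the "leakage exponent". The delay and leakage models, the four-size library and every function name are hypothetical, and ΔTNS is computed by exact trial upsizing rather than by the fast estimate the talk mentions.

```python
import math

SIZES = 4  # hypothetical library: size indices 0..3 for every cell

def delay(base, size):
    return base / (size + 1)   # toy model: larger cells are faster

def leakage(size):
    return 2.0 ** size         # toy model: larger cells leak more

def tns(sizes, bases, paths, required):
    """Total negative slack: sum of the negative path slacks (always <= 0)."""
    total = 0.0
    for path in paths:
        slack = required - sum(delay(bases[c], sizes[c]) for c in path)
        total += min(0.0, slack)
    return total

def gtr(sizes, bases, paths, required, alpha, gamma):
    """Upsize the gamma% most sensitive cells until timing is met."""
    sizes = list(sizes)
    while True:
        current = tns(sizes, bases, paths, required)
        if current >= 0.0:
            return sizes                      # timing met
        scored = []
        for c in range(len(sizes)):
            if sizes[c] + 1 >= SIZES:
                continue                      # already at the largest size
            trial = sizes[:c] + [sizes[c] + 1] + sizes[c + 1:]
            d_tns = tns(trial, bases, paths, required) - current
            if d_tns <= 0.0:
                continue                      # cell is not on a negative-slack path
            d_leak = leakage(sizes[c] + 1) - leakage(sizes[c])
            scored.append((d_tns / d_leak ** alpha, c))  # assumed sensitivity form
        if not scored:
            return None                       # no upsizing move can help
        scored.sort(reverse=True)
        count = max(1, math.ceil(len(scored) * gamma / 100.0))
        for _, c in scored[:count]:
            sizes[c] += 1

bases = [4.0, 6.0, 2.0]        # hypothetical per-cell base delays
paths = [[0, 1], [1, 2]]       # two timing paths sharing cell 1
result = gtr([0, 0, 0], bases, paths, required=6.0, alpha=1.0, gamma=50.0)
print(result, tns(result, bases, paths, 6.0))
```

A larger α penalizes leakage-expensive moves more strongly, and a larger γ upsizes more cells per iteration, which is why the two parameters are swept together in the multistart search.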

-16- Global Timing Recovery: TNS Estimation
Δdelay = delay change (old – new) from upsizing
Npaths = # of negative-slack paths through the cell

-17- Global Timing Recovery: Multistart
Multistart with different parameters and Go-With-The-Winners; GTR sweeps parameters α and γ, and chooses the best (minimum leakage) solution
[Figure: coarse search over the initial (α, γ) search space with a coarse step size, keeping the best solutions (α, γ) under a threshold; fine search with a finer step size over ranges of half a step on either side of the best-seen values, again keeping the best solutions under a threshold]
Focus on ranges around best-seen parameters
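A sketch of a coarse-then-fine parameter sweep in the spirit of this slide. The actual ranges, step sizes and thresholds are not recoverable from the transcript, so everything numeric here is hypothetical; `evaluate` stands in for one full GTR run that returns the leakage of the solution it finds. In Trident the starts run on multiple threads, while this sketch runs them sequentially.

```python
def frange(lo, hi, step):
    """Inclusive grid from lo to hi with the given step."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 10) for i in range(n + 1)]

def multistart(evaluate, a_range, g_range, a_step, g_step, keep, refine=4):
    """Coarse sweep over (alpha, gamma), then a finer sweep around the best points."""
    coarse = [(a, g)
              for a in frange(*a_range, a_step)
              for g in frange(*g_range, g_step)]
    results = {p: evaluate(*p) for p in coarse}
    best = sorted(results, key=results.get)[:keep]
    for a0, g0 in best:
        # Fine grid: half a coarse step on either side of each best-seen point.
        for a in frange(a0 - a_step / 2, a0 + a_step / 2, a_step / refine):
            for g in frange(g0 - g_step / 2, g0 + g_step / 2, g_step / refine):
                if (a, g) not in results:
                    results[(a, g)] = evaluate(a, g)
    winner = min(results, key=results.get)
    return winner, results[winner]

# Hypothetical stand-in for "leakage of the GTR solution at (alpha, gamma)".
def toy_leakage(alpha, gamma):
    return (alpha - 1.3) ** 2 + (gamma - 2.6) ** 2

winner, value = multistart(toy_leakage, (0.0, 3.0), (1.0, 5.0),
                           a_step=1.0, g_step=1.0, keep=2)
print(winner, value)
```

Only the `keep` best coarse points are refined, so the fine search spends its evaluations near parameter values that already produced low leakage.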

-18- Power Reduction with Feasible Timing
In GTR, some cells are oversized
PRFT iteratively reduces total leakage power using sensitivity-guided greedy sizing (SGGS)
SGGS procedure:
1. Run static timing analysis
2. Calculate sensitivity for all cells
3. Downsize cell C with maximum sensitivity
4. Run incremental STA
5. slack(C) < 0? If YES, revert the sizing; if NO, keep it and continue
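The SGGS procedure can be sketched as a downsize-check-revert loop. The toy timing and leakage models, the use of a leakage-saving per delay-penalty ratio as the sensitivity, and the rule that a reverted cell is not tried again are all assumptions made for this illustration; slack is recomputed directly rather than by an incremental timer.

```python
def delay(base, size):
    return base / (size + 1)   # toy model: larger cells are faster

def leakage(size):
    return 2.0 ** size         # toy model: larger cells leak more

def cell_slack(c, sizes, bases, paths, required):
    """Worst slack over the paths through cell c (c must lie on some path)."""
    return min(required - sum(delay(bases[k], sizes[k]) for k in p)
               for p in paths if c in p)

def sggs(sizes, bases, paths, required):
    """Greedily downsize the most sensitive cell; revert moves that break timing."""
    sizes = list(sizes)
    rejected = set()           # cells whose last downsizing was reverted
    while True:
        scored = []
        for c, s in enumerate(sizes):
            if s == 0 or c in rejected:
                continue
            d_leak = leakage(s) - leakage(s - 1)                      # leakage saved
            d_delay = delay(bases[c], s - 1) - delay(bases[c], s)     # delay added
            scored.append((d_leak / d_delay, c))
        if not scored:
            return sizes
        _, c = max(scored)
        sizes[c] -= 1
        if cell_slack(c, sizes, bases, paths, required) < 0:
            sizes[c] += 1      # revert the sizing
            rejected.add(c)

bases = [4.0, 6.0, 2.0]        # hypothetical per-cell base delays
paths = [[0, 1], [1, 2]]
start = [3, 3, 3]              # oversized but violation-free starting point
result = sggs(start, bases, paths, required=6.0)
print(result, sum(leakage(s) for s in result))
```

Since downsizing only ever slows cells in this toy model, a move that violated timing once cannot become legal later, which is why rejected cells are skipped for the rest of the run.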

-19- PRFT: Sensitivity Functions
PRFT runs multiple SGGS with different sensitivity functions (SF1 ~ SF5)
SF1: Δleakage / Δdelay
SF2: Δleakage * slack
SF3: Δleakage / (Δdelay * #paths)
SF4: Δleakage * slack / #paths
SF5: Δleakage * slack / (Δdelay * #paths)
Each SF provides a different solution, and we select the best solution among them
Each run automatically finds the best SF for a given testcase
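The five sensitivity functions listed on this slide translate directly into code. The argument names and the reading of "leakage" and "delay" as per-move changes are assumptions; `best_sensitivity_function` and the leakage numbers in the usage example are hypothetical and only illustrate the "run all five, keep the best" selection.

```python
# Each function scores one candidate downsizing move.
#   d_leak  : leakage change from the move (assumed meaning of "leakage")
#   d_delay : delay change from the move (assumed meaning of "delay")
#   slack   : timing slack of the cell
#   n_paths : number of paths through the cell
SENSITIVITY_FUNCTIONS = {
    "SF1": lambda d_leak, d_delay, slack, n_paths: d_leak / d_delay,
    "SF2": lambda d_leak, d_delay, slack, n_paths: d_leak * slack,
    "SF3": lambda d_leak, d_delay, slack, n_paths: d_leak / (d_delay * n_paths),
    "SF4": lambda d_leak, d_delay, slack, n_paths: d_leak * slack / n_paths,
    "SF5": lambda d_leak, d_delay, slack, n_paths: d_leak * slack / (d_delay * n_paths),
}

def best_sensitivity_function(leakage_by_sf):
    """Return the SF whose greedy run ended with the lowest total leakage."""
    return min(leakage_by_sf, key=leakage_by_sf.get)

# Score one hypothetical move with every function.
scores = {name: f(4.0, 0.5, 2.0, 4) for name, f in SENSITIVITY_FUNCTIONS.items()}
print(scores)

# Made-up final leakage values, one per SGGS run.
chosen = best_sensitivity_function(
    {"SF1": 0.91, "SF2": 0.88, "SF3": 0.93, "SF4": 0.87, "SF5": 0.89})
print(chosen)
```

Dividing by the number of paths (SF3 to SF5) lowers the priority of cells that sit on many paths, where a slowdown affects more of the design.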

-20- PRFT: Speeding up Bottleneck Cells
Monotonic downward sizing can end in a local optimum
Speed up bottleneck cells: recover timing slack with minimum power impact
Perturbation and greedy sizing recall the LSMC approach
[Flowchart: sensitivity-guided greedy sizing w/ SF i; best solution; speed up γ% of bottleneck cells; best seen? (yes / no); final solution]
[Figure: progression of GTR and PRFT in the (TNS, leakage) plane, with kick-moves]

-21- Incremental Static Timing Analysis
In PRFT, cell slack must be recalculated after each cell sizing; incremental STA is used to reduce runtime
To achieve further speedup, we propagate an updated timing value only when its change is larger than a propagation threshold (e.g., 0.1ps)
1. Update cell delay, transition time and AAT (actual arrival time)
2. Update RAT (required arrival time) and slack
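Threshold-pruned propagation can be illustrated on arrival times alone. This sketch assumes a tiny timing graph given as fan-in and fan-out lists and ignores transition times, required times and slack, which the real timer also updates; all names and numbers are hypothetical.

```python
from collections import deque

def incremental_update(arrival, delays, fanin, fanout, changed, threshold):
    """Re-propagate arrival times from `changed`, pruning small updates.

    Returns the number of nodes visited, a proxy for runtime.
    """
    queue = deque([changed])
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        new_at = delays[node] + max((arrival[p] for p in fanin[node]), default=0.0)
        # Stop here if the change is within the propagation threshold.
        if node != changed and abs(new_at - arrival[node]) <= threshold:
            continue
        arrival[node] = new_at
        queue.extend(fanout[node])
    return visited

# Hypothetical graph: a -> b -> c, with a slow side input d -> c.
fanin = {"a": [], "b": ["a"], "c": ["b", "d"], "d": []}
fanout = {"a": ["b"], "b": ["c"], "c": [], "d": ["c"]}

def fresh_state():
    delays = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 10.0}
    arrival = {"a": 1.0, "b": 3.0, "d": 10.0, "c": 13.0}
    return delays, arrival

# Resize cell a so that its delay grows by 0.05, then update with two thresholds.
delays, exact = fresh_state()
delays["a"] = 1.05
visited_exact = incremental_update(exact, delays, fanin, fanout, "a", 0.0)

delays, pruned = fresh_state()
delays["a"] = 1.05
visited_pruned = incremental_update(pruned, delays, fanin, fanout, "a", 0.1)
print(visited_exact, visited_pruned, exact["b"], pruned["b"])
```

With the larger threshold the update stops one node earlier and leaves a small error at `b`, which mirrors the runtime versus accuracy tradeoff that the threshold controls.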

-22- Handling Capacitance and Slew Violations
Each standard cell can drive a certain maximum capacitance, and transition time must be smaller than the maximum transition time
Trident removes max-capacitance and max-transition (slew) violations at every iteration of GTR
1. Backward traversal: visit cells in reverse order, and upsize driving cells
2. Forward traversal: downsize fanout cells
Requires one to two iterations
[Figure: max. cap. violation]

-23- Configurations for GWTW
Trident can configure the number of best-seen solutions used in GWTW
[Figure: stages GTR coarse search; GTR fine search 1; GTR fine search 2; PRFT greedy sizing (SF1, SF2, ...); PRFT kick-move + greedy sizing; each stage takes parameters from the N best-seen solutions]
Which configuration is optimal in terms of runtime vs. sizing quality?
– More start points give more chances to find a near-optimum
– Runtime increases with the number of start points

-24- Outline
Background and Motivation
Sensitivity-Guided Metaheuristics
– Global Timing Recovery
– Power Reduction with Feasible Timing
Experimental Results
Conclusions and Ongoing Work

-25- ISPD 2012 Gate Sizing Contest [Ozdal et al.]
Provides benchmarks to accurately model the discrete gate-sizing problem
Netlist (Verilog), parasitics (SPEF), timing constraints (SDC)
Library: 11 different logic functions, each with 30 different cell types (three Vt options and ten different sizes), for 330 cells in total
The contest compares leakage power of violation-free solutions

-26- Analysis of Our Implementation
Trident is written in C++ and has a built-in static timer
– Significantly faster than Tcl access to PT in the contest environment
– Runtime is improved with an incremental STA
Runtime for all (14) ISPD 2012 benchmarks: less than 83 hours w/ 4 threads (Intel Xeon E31230)
Runtime breakdown (for NETCARD_slow)
– Coarse search: 6.2%
– Fine-grain search: 25.1%
– Greedy sizing: %
– Perturbing iterations: 23.0%

-27- Characterization of ISTA
Runtime of full-scale STA (FSTA) and incremental STA (ISTA)
– Average runtime and maximum slack error have been measured after randomly sizing 1% of cells
Table columns: benchmark; FSTA (sec); ISTA runtime (msec) at propagation thresholds 0ps, 0.1ps, 1.0ps; ISTA max. error (ps) at 0ps, 0.1ps, 1.0ps
Benchmarks: DMA, PCI, DES, VGA, B19, LEON3MP, NETCARD (per-benchmark values not preserved)
Geomean: FSTA 924X; ISTA runtime 2.86X (0ps), 1.0X (0.1ps), 0.54X (1.0ps); ISTA max. error 0.0X (0ps), 1.0X (0.1ps), 9.1X (1.0ps)

-28- Tuning of ISTA
Runtime vs. quality of sizing solutions with different propagation thresholds
Testcase: DMA (w/ tight timing constraint)
Table columns: prop. threshold 0.0ps, 0.1ps, 1.0ps
Table rows: ISTA runtime (msec); ISTA max. error (ps); Trident runtime (min); final leakage (mW) (values not preserved)

-29- Configurations of GWTW
Five multi-threaded stages can provide n best-seen solutions for the GWTW heuristic
Solution quality vs. runtime
Configuration: [A][B][C][D][E]
Stage
– A: GTR coarse-grain search
– B: GTR fine-grain search I
– C: GTR fine-grain search II
– D: PRFT greedy sizing
– E: PRFT speed up bottleneck
Value
– 0: skip the stage
– 1: keep one best solution
– 2: keep two best solutions
Default configuration: 22211
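As a small illustration of the [A][B][C][D][E] encoding, the following sketch decodes a configuration string into the number of best-seen solutions kept per stage. The function `parse_config` and its validation rules are hypothetical; only the stage names and the default `22211` come from the slide.

```python
STAGES = (
    "GTR coarse-grain search",    # A
    "GTR fine-grain search I",    # B
    "GTR fine-grain search II",   # C
    "PRFT greedy sizing",         # D
    "PRFT speed up bottleneck",   # E
)

def parse_config(config):
    """Map a five-digit configuration string to solutions kept per stage (0 = skip)."""
    if len(config) != len(STAGES) or not config.isdigit():
        raise ValueError("expected one digit per stage, e.g. '22211'")
    return {stage: int(digit) for stage, digit in zip(STAGES, config)}

default = parse_config("22211")
for stage, kept in default.items():
    print(stage, "skip" if kept == 0 else "keep %d best" % kept)
```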

-30- Experimental Results on ISPD Benchmarks
Leakage power / runtime / parameters for GTR, PRFT
Benchmark set with tight timing constraint
Best parameter values found by our heuristic
Table columns: benchmark (# of cells); leakage power (GTR, PRFT); runtime (min); GTR param. (α, γ (%)); PRFT param. (SF, γ (%))
Only the PRFT parameters survive in the transcript:
DMA (25K): SF5, γ = 1
PCI (33K): SF4, γ = 4
DES (111K): SF5, γ = 1
VGA (165K): SF5, γ = 4
B19 (219K): SF2, γ = 4
LEON3 (649K): SF4, γ = 1
NETCARD (959K): SF3, γ = 1

-31- Experimental Results on ISPD Benchmarks
Leakage power / runtime / parameters for GTR, PRFT
Benchmark set with loose timing constraint
Table columns: benchmark (# of cells); leakage power (GTR, PRFT); runtime (min); GTR param. (α, γ (%)); PRFT param. (SF, γ (%))
Only the PRFT parameters survive in the transcript:
DMA (25K): SF5, γ = 5
PCI (33K): SF5, γ = 4
DES (111K): SF2, γ = 3
VGA (165K): SF4, γ = 3
B19 (219K): SF5, γ = 1
LEON3 (649K): SF4, γ = 2
NETCARD (959K): SF3, γ = 1

-32- Leakage Comparison on ISPD Benchmarks
Contest best: best of all entries in the competition (ISPD 2012 contest)
Intel Labs (contest organizer) released five (near-optimal) results
ISPD 2012 contest: on all benchmarks except one, Trident achieves the lowest leakage power, a 43% further reduction over the contest winner
We outperform the Intel results on four

-33- Conclusions
Within the research-oriented infrastructure used in the ISPD 2012 Gate-Sizing Contest, we have developed a metaheuristic approach to gate sizing
Our implementation, Trident, outperforms the best reported results on all but one of the ISPD 2012 benchmarks
Compared to the 2012 contest winner, we further reduce leakage power by an average of 43%

-34- Ongoing Work
Extension to support a real industry library
Consider adding interconnect delay in the next version of the sizer

-35- Thank you

-36- Experimental Results on ISPD Benchmarks
Leakage power / runtime / parameters for GTR, PRFT
Table columns: benchmark; # of cells; leakage power (GTR, PRFT); runtime (min); GTR param. (α, γ (%)); PRFT param. (SF, γ (%))
Only the PRFT parameters survive in the transcript:
DMA_fast: SF5, γ = 1
DMA_slow: SF5, γ = 5
PCI_fast: SF4, γ = 4
PCI_slow: SF5, γ = 4
DES_fast: SF5, γ = 1
DES_slow: SF2, γ = 3
VGA_fast: SF5, γ = 4
VGA_slow: SF4, γ = 3
B19_fast: SF2, γ = 4
B19_slow: SF5, γ = 1
LEON3_fast: SF4, γ = 1
LEON3_slow: SF4, γ = 2
NETCARD_fast: SF3, γ = 1
NETCARD_slow: SF3, γ = 1

-37- Experimental Results on ISPD Benchmarks
Leakage power / runtime / parameters for GTR, PRFT
Table columns: benchmark; # of cells; leakage power (GTR, PRFT); runtime (min); GTR param. (α, γ (%)); PRFT param. (SF, γ (%))
Only the PRFT parameters survive in the transcript:
DMA_slow: SF5, γ = 5
PCI_slow: SF5, γ = 4
DES_slow: SF2, γ = 3
VGA_slow: SF4, γ = 3
B19_slow: SF5, γ = 1
LEON3_slow: SF4, γ = 2
NETCARD_slow: SF3, γ = 1