High Performance, Multi-CPU Power Signoff for Mega Designs

Presentation transcript:

High Performance, Multi-CPU Power Signoff for Mega Designs
Patrick Sproule, Director of Engineering, VLSI Methodology

Nvidia Power Analysis Requirements
- Static and dynamic full-chip power analysis: the tool must handle both sub-chip and full-die analysis in a single session, and ideally provide full-domain analysis for full accuracy in a single run.
- Design size scalability: full flat analysis of both small and the largest production designs on existing/available compute resources.
- Runtime predictability: designs get larger, but the schedule time allotted to power analysis must stay constant or shrink; close-ended runtime estimates are required.
- Clear reporting: the large volume of analysis data must be condensed into clear reports.

Power Analysis Challenges
- Designs have seen device count grow by four orders of magnitude in less than 10 years.
- The increased number of metal layers and modelled devices causes the calculation to expand faster than tools and compute resources, forcing long runtimes and/or inefficient subdivision of designs.
- Designs have also become highly replicated at many levels of hierarchy, which complicates data handling and integration within the tools.
- Many engineers run analysis at different hierarchy levels; recreating the database and duplicating analysis costs schedule.
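As a rough sanity check on that growth rate (back-of-the-envelope arithmetic, not a figure from the slides): a 10,000x increase over 10 years works out to 10^(4/10) ≈ 2.5x per year, far steeper than the roughly 1.4x per year implied by transistor counts doubling every two years, which is why analysis capacity cannot keep pace through tool and hardware scaling alone.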

Current Rail Analysis Methodology
- A partition-based hierarchical methodology is planned and executed within a large design team at many levels.
- Unique design technologies, especially in low power: multiple power domains, power-gating switches, ...
- [Diagram: full-chip integration hierarchy - full chip, chiplets (chiplet owners), partitions (partition owners).]

Typical Extraction and Rail Analysis
- Power-grid view (PGV): physical modelling of the IP.
- Inputs to rail analysis: RC, current signatures, geometry.
- [Flow diagram: physical database (PGDB) + primitive PGVs -> RC extraction -> rail analysis with current signatures -> IR-drop results/plots.]

Hierarchical Rail Analysis Method (H-PGV)
- [Flow diagram: partitions 1..N are each abstracted, via RC extraction, into H-PGV 1..N; the top-level database (PGDB), primitive PGVs and H-PGVs then feed top-level RC extraction and rail analysis with current signatures, producing IR-drop results/plots.]
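To make the flow above concrete, here is a minimal sketch in Python with stub functions; the partition names, node counts and function names are invented for illustration and are not Voltus APIs. Each partition is abstracted into an H-PGV once, and a single top-level solve then consumes the compact models.

```python
# A minimal, runnable sketch of the two-level H-PGV flow described above.
# Every function is a stand-in stub, not a Voltus command or API; it only
# models the data flow: abstract each partition once, then run one
# top-level IR-drop solve against the compact models.

from concurrent.futures import ProcessPoolExecutor

PARTITIONS = ["part_a", "part_b", "part_c"]    # hypothetical partition names

def build_hpgv(partition: str) -> dict:
    """Stub for per-partition RC extraction + power-grid abstraction (H-PGV)."""
    flat_nodes = 1_000_000                     # pretend size of the flat partition grid
    return {"name": partition,
            "boundary_ports": 500,
            "reduced_nodes": flat_nodes // 100}

def top_level_rail_analysis(hpgvs: list) -> dict:
    """Stub for stitching H-PGVs into the top grid and running one IR-drop solve."""
    solved = 2_000_000 + sum(m["reduced_nodes"] for m in hpgvs)
    return {"solved_nodes": solved, "worst_ir_drop_mV": 42.0}

if __name__ == "__main__":
    # H-PGVs are independent of one another, so they can be built in parallel.
    with ProcessPoolExecutor() as pool:
        models = list(pool.map(build_hpgv, PARTITIONS))
    print(top_level_rail_analysis(models))
```

The same structure also shows why H-PGV generation parallelizes so well: each per-partition abstraction depends only on that partition's own data, so the only serial step is the final top-level solve.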

H-PGV Advantages
- H-PGV generation runtime is minimal compared to full-chip database setup for IR-drop analysis.
- H-PGVs can be generated in parallel.
- The hierarchical methodology supports both bottom-up and top-down rail analysis, capturing H-PGV boundary conditions for ECOs at the partition level (top-down push).
- Full-chip and sub-chip analysis times are greatly improved with the same accuracy.

Flat vs. Hierarchical Correlation Example
- Analysis at sub-chip level: 14.4M total primitive instances (modelled cells), of which 8.9M are regular logic and memory cells and 5.5M are filler, tap and decap cells.
- 18 total partitions in the chiplet: 7 unique partitions, 3 of them replicated 4 times each.
- H-PGV run metrics: runtime 18-32 minutes, memory 40-45 GB.

Rail Analysis at Full Chip Level

Design                | Metal Layers | Transistors (B) | RAM (GB)     | CPUs | Rail Analysis Runtime
GF100 (flat)          | 9            | 3.0             | 200          | 1    | 2.25 days
GK104                 | 11           | 3.5             | 600          | 8    | 10 days
GK110 (flat, est.)    | -            | 7               | 1000+ (est.) | -    | 26 days (est.)
GK110 (hierarchical)  | -            | 7               | 650          | -    | -

Nvidia Scale and Runtime Issues
- [Chart: design size growth outpacing tool and resource capability.]

Voltus on Kepler
- ~380M instances, flat analysis, TSMC 28 nm.
- Main resource: ~725 GB of memory on a 1 TB, 32-CPU machine.
- Static and dynamic signoff power analysis at VDD and VSS (done as parallel runs).
- 21-hour runtime per analysis domain.
- ~8x runtime improvement over the previous method with equivalent accuracy.
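Because the VDD and VSS analyses are independent, they can simply be dispatched as two concurrent jobs. A rough sketch of that dispatch pattern follows; run_domain is a hypothetical placeholder for launching the actual signoff run, not a real Voltus command.

```python
# Sketch of dispatching the per-domain analyses (VDD, VSS) as parallel runs.
# run_domain is a placeholder: a real flow would launch the signoff tool with
# a per-domain setup on a large-memory machine.

import subprocess
from concurrent.futures import ThreadPoolExecutor

DOMAINS = ["VDD", "VSS"]            # the two supply nets analysed in parallel

def run_domain(net: str) -> int:
    # Placeholder command; a real flow would invoke the analysis job here.
    return subprocess.run(["echo", f"analysing {net} domain"]).returncode

with ThreadPoolExecutor(max_workers=len(DOMAINS)) as pool:
    return_codes = list(pool.map(run_domain, DOMAINS))

print("all domains finished" if not any(return_codes) else "a domain run failed")
```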

Rail Analysis at Full Chip Level

Design                | Metal Layers | Transistors (B) | RAM (GB)     | CPUs | Rail Analysis Runtime
GF100 (flat)          | 9            | 3.0             | 200          | 1    | 2.25 days
GK104                 | 11           | 3.5             | 600          | 8    | 10 days
GK110 (flat, est.)    | -            | 7               | 1000+ (est.) | -    | 26 days (est.)
GK110 (hierarchical)  | -            | 7               | 650          | -    | -
GK110 (Voltus)        | -            | 7               | 700          | 32   | 21 hours

Nvidia Scale and Runtime Issues
- [Chart: memory requirement.]

Summary
- Voltus meets our rail-analysis needs for accuracy, with far lower runtimes than expected.
- Further testing showed it is possible to run the combined VDD-GND domain in a single pass with a 50-hour runtime using the multi-threaded and distributed capabilities.
- The ability to run both multi-threaded and distributed gives us the flexibility to manage schedule and resource requirements.
- Congratulations to the Voltus team on delivering a disruptive runtime improvement.

Q&A