Fabio Garzia / High Speed Logic, Circuits, Libraries and Layout

Closing the Gap between ASIC & Custom
Chapter 4: High Speed Logic, Circuits, Libraries and Layout
Fabio Garzia
15/11/2018

Overview
- Set of technology-independent metrics to compare ASIC and custom chips
- Four criteria:
  - Performance
  - Area efficiency
  - Energy efficiency
  - Design effort
- Performance differences due to logic style, circuit design, libraries and layout
- Area and density differences
- Impact of circuit styles on energy efficiency
- Future trends and impact of technology scaling

Performance: Fan-out of 4 Inverter Delay
- Technology-independent performance metric
- Useful to compare different designs and to grade the performance of successive generations of fabrication processes, from the same foundry and across different foundries
- Four basic characteristics of a process:
  - Effective channel length (L_eff), oxide thickness and nominal voltage: they affect gate delay and energy
  - Metal pitch rules: they affect the achievable density
- Transistor rules (FEOL) and metal pitch rules (BEOL) are continuously optimized, separately and concurrently
- The FO4 delay makes it possible to estimate the achievable performance
- The FO4 delay is equal to half the process L_eff, if nm are replaced by ps, under TTLH assumptions (typical NMOS, typical PMOS, 90% of Vdd and 100°C operating temperature)
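The rule of thumb on this slide can be turned into a small calculator. The following is a minimal sketch, assuming that the length symbol missing from the transcript is the effective channel length, so that FO4 delay in ps is about half of L_eff in nm. The function names and the sample lengths are hypothetical and only illustrate the arithmetic.

```python
# Rule of thumb from the slide: FO4 delay (ps) ~ 0.5 * L_eff (nm).
# Assumption: the symbol lost from the transcript is the effective
# channel length L_eff. All names below are hypothetical.

FO4_PS_PER_NM = 0.5  # "half the process length, if nm are replaced by ps"


def fo4_delay_ps(l_eff_nm: float) -> float:
    """Estimate the FO4 inverter delay in picoseconds."""
    if l_eff_nm <= 0:
        raise ValueError("effective channel length must be positive")
    return FO4_PS_PER_NM * l_eff_nm


def clock_period_ps(fo4_per_cycle: float, l_eff_nm: float) -> float:
    """Convert a cycle time expressed in FO4 delays into picoseconds."""
    return fo4_per_cycle * fo4_delay_ps(l_eff_nm)


def clock_frequency_ghz(fo4_per_cycle: float, l_eff_nm: float) -> float:
    """Clock frequency in GHz for a cycle of the given FO4 depth."""
    return 1000.0 / clock_period_ps(fo4_per_cycle, l_eff_nm)


if __name__ == "__main__":
    # Illustrative lengths only; they are not taken from the slides.
    for l_eff in (250.0, 130.0, 100.0):
        print(l_eff, fo4_delay_ps(l_eff), clock_frequency_ghz(20, l_eff))
```

Expressing the cycle time in FO4 delays first and converting to picoseconds last keeps the comparison between design styles independent of the process, which is the purpose of the metric.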

Area Efficiency: Cost
- Area efficiency is achieved by balancing transistor density and routing resource usage
- λ (one half the minimum device length) was the normalized size metric
- But since the 0.5 um technology, the drawn and effective transistor lengths have continued to diverge
- Nowadays the size metric is the M2 pitch
- The area for gates remains constant over a wide range of process generations
- The transistor cost is the technology-independent density metric
- Lower bound of 3, derived from SRAM bit-cells

Energy
- Energy to switch a gate
- Dynamic power, which depends on the activity factor and the frequency
- Three reference gates:
  - Energy required to write a bit in an SRAM bit-cell: reference for flip-flops, latches, registers and memory
  - Energy required to modify/transform the state of a NAND: reference for combinational logic
  - Energy required to switch a minimally sized inverter driving an FO4 load: useful to benchmark data transfer across different blocks
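The equations on this slide did not survive in the transcript. As a stand-in, the sketch below uses the common textbook forms E = C * Vdd^2 for the switching energy and P = a * C * Vdd^2 * f for the dynamic power, where a is the activity factor and f the frequency. These forms are an assumption, not the slide's own notation (some texts use C * Vdd^2 / 2 per transition), and the function names are hypothetical.

```python
# Textbook CMOS switching model (assumed, since the slide's equations
# are missing from the transcript).


def switching_energy_j(c_load_f: float, vdd_v: float) -> float:
    """Energy in joules drawn from the supply to charge c_load_f to vdd_v."""
    return c_load_f * vdd_v ** 2


def dynamic_power_w(activity: float, c_load_f: float,
                    vdd_v: float, freq_hz: float) -> float:
    """Dynamic power in watts: activity factor * switching energy * frequency."""
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity factor must lie in [0, 1]")
    return activity * switching_energy_j(c_load_f, vdd_v) * freq_hz


if __name__ == "__main__":
    # Illustrative values: 1 pF switched at 1 V and 1 GHz, 10% activity.
    print(dynamic_power_w(0.1, 1e-12, 1.0, 1e9))
```

Normalizing a block's energy to one of the three reference gates then amounts to dividing its energy by the reference energy computed for the same process.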

Measuring Design Effort
- Full chip design time
  - Not reliable, because it depends on the actual complexity of the chip
- Unique transistors per week
  - Includes design, schematic entry, timing and functional simulation, layout, backend verification, and integration overhead
  - An average designer produces 22 unique transistors per week, versus 33 for an excellent designer
  - Considering a 2-input NAND gate made of four unique transistors, this corresponds to a productivity of 5 and 8 gates per week respectively
  - Compared with an older study, very little progress has been made in the last 20 years in improving the productivity of custom design
  - In terms of time per gate, an excellent designer can produce a 2-input NAND gate in under five hours, and an 8 KB SRAM block in 10 weeks
  - An average designer would take seven hours and 17 weeks respectively
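The gate and hour figures on this slide follow from the transistor counts by simple division. A minimal sketch of that arithmetic is given here, assuming a 40-hour working week (the slide does not state the week length) and using hypothetical function names.

```python
# Converting unique transistors per week into gates per week and
# hours per gate. HOURS_PER_WEEK is an assumption, not from the slide.

TRANSISTORS_PER_NAND2 = 4   # from the slide: four unique transistors
HOURS_PER_WEEK = 40.0       # assumed working week


def gates_per_week(unique_transistors_per_week: float) -> float:
    """2-input NAND equivalents produced per week."""
    return unique_transistors_per_week / TRANSISTORS_PER_NAND2


def hours_per_gate(unique_transistors_per_week: float) -> float:
    """Designer hours spent on one 2-input NAND equivalent."""
    return HOURS_PER_WEEK / gates_per_week(unique_transistors_per_week)


if __name__ == "__main__":
    for label, rate in (("average", 22), ("excellent", 33)):
        print(label, gates_per_week(rate), round(hours_per_gate(rate), 2))
```

Under the 40-hour assumption the average designer needs a little over seven hours per gate and the excellent designer a little under five, which agrees with the figures quoted on the slide.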

Survey of Semiconductor Processes from 1995 to 2002
- Transistor length shrank from 0.5 um to 0.13 um
- Clock frequencies increased from 180 MHz (HP PA-8000) to 2.2 GHz (Intel Pentium 4)
- The table lists six major processes

Applying the Metrics to Flagship Designs
- Custom is the best approach for the design of high-end microprocessors
- Custom can meet and even exceed both the reference clock rate and the maximum available gate capacity
- The combination of high density and large on-chip SRAM blocks makes it possible to exploit the large number of available transistors effectively
- Many flagship ASICs are not microprocessors and do not require large SRAM blocks
- The maximum size of a custom chip is less constrained
- ASIC designs may vary a lot in their ability to approach the gate counts advertised by the foundries

Performance Penalties in ASIC Designs
- Impact of circuit style on performance: overview
  - Domino circuits: single-ended (b), differential (a)
  - Differential static circuits (c)
  - Standard static CMOS (d), (e)
- Custom circuit styles are more efficient than static CMOS
  - Restriction of logic depth imposed by pull-up trees of PMOS transistors

Different Logic Styles from Dynamic to Static Logic
- Custom differential domino design
  - Custom techniques, single-phase clock and latch-based timing
  - Clock period equal to 20 FO4 delays: 3 FO4 for register overhead, 1 FO4 for timing margin, 16 FO4 for the combinational logic
- Custom differential static techniques (DCVSL, DSL, CNTL)
  - Minimize the effect of the parasitic PMOS
  - Decrease the overhead of the inverters
  - Increase the combinational logic delay, due to the fight on the PMOS
  - They require 21 FO4 for the combinational logic, so they are slower than domino (25 FO4 instead of 20 FO4)
- Semi-custom single-ended static
  - Despite the semi-custom approach, all three parameters degrade
  - 5 more FO4 required for the combinational logic, 1 more for the timing margin
  - 32 FO4 are required in total
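The clock periods quoted on this slide are sums of three budget items, which the sketch below tabulates. The class and field names are hypothetical. One value is an inference rather than a quoted figure: the slide gives the semi-custom total as 32 FO4 but only itemizes the increases in combinational delay and timing margin, so the register overhead of 4 FO4 is assumed here to make the total match.

```python
# FO4 clock-period budgets per logic style, as read from the slide.
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockBudget:
    name: str
    register_overhead: int  # FO4 delays
    timing_margin: int      # FO4 delays
    combinational: int      # FO4 delays

    @property
    def period_fo4(self) -> int:
        """Total clock period in FO4 delays."""
        return self.register_overhead + self.timing_margin + self.combinational


BUDGETS = [
    ClockBudget("custom differential domino", 3, 1, 16),
    ClockBudget("custom differential static", 3, 1, 21),
    # Register overhead of 4 is inferred so that the total equals the
    # 32 FO4 stated on the slide; it is not quoted directly.
    ClockBudget("semi-custom single-ended static", 4, 2, 26),
]

if __name__ == "__main__":
    fastest = min(b.period_fo4 for b in BUDGETS)
    for b in BUDGETS:
        print(f"{b.name}: {b.period_fo4} FO4, {b.period_fo4 / fastest:.2f}x domino")
```

Keeping the three items separate makes it easy to see which part of the budget each logic style spends its extra delay on.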

The Additional Effects of Logic Design, Cell Design and Layout in ASICs
- Flip-flops vs. latches
  - Flip-flops are 2-3 FO4 slower than latches
  - They do not allow slack passing: 4 additional FO4 of delay
- Coarser control of clock distribution
  - 2 FO4 delay penalty
- Differences in the design and layout of the combinational logic
  - Design not implemented optimally for speed
  - Structure not exploited in P&R
  - Mismatched cell drive strengths and loads
  - Inaccurate wire load models
  - Altogether 25 FO4

Summary of the Comparison
- Approximate clock periods for different logic and design styles
- Resulting calculated performance of the chips

Comparison of Dynamic and Static 64-Bit Adders
- Logic design:
  - (a) Radix-4 domino logic Ling adder cells
  - (b) Radix-2 static logic
  - (c) Radix-2 static logic carry look-ahead adder cells
- Relative transistor size
  - For the static logic, pull-up and pull-down have the same strength

Comparison of Dynamic and Static 64-Bit Adders
- The choice of logic style changes which logic design is best:
  - Radix-4 for domino (domino gates can implement more complex cells)
  - Radix-2 for static logic
- Cells with skewed drive strength can improve the design
  - Smaller area and lower power consumption
  - But they are not usually provided in a standard cell library
- The logic design affects speed, area and power
  - A carry look-ahead adder would have larger cells than the Ling adder
  - These cells are slower, and their larger input capacitance increases the power consumption

Comparison of Skew-Tolerant Domino Logic with Synthesized Static Logic
- An execution unit is implemented in both styles for the comparison
  - Register file, adder, shifter, Boolean operations
- The domino logic is 2.1x faster
- The clocking scheme contributes a factor of 1.2x
  - The skew-tolerant domino logic used a four-phase clocking scheme with semi-custom clock trees
    - Less affected by clock skew
    - Reduced impact of setup time on the timing overhead
  - The static logic used flip-flops
    - Slack passing not possible
- The remaining factor depends on the logic style
  - The domino implementation was hand-designed
  - The static implementation was synthesized from RTL
  - No difference due to layout
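The "remaining factor" mentioned on this slide can be estimated from the two quoted numbers. The sketch below assumes that the contributions combine multiplicatively, which the slide does not state explicitly. The function name is hypothetical.

```python
# Splitting an overall speedup into a known factor and the residual.
# Assumption: the individual contributions multiply.


def residual_speedup(total: float, known: float) -> float:
    """Speedup left unexplained once a known factor is divided out."""
    if total <= 0 or known <= 0:
        raise ValueError("speedup factors must be positive")
    return total / known


if __name__ == "__main__":
    # 2.1x overall, 1.2x of which is attributed to the clocking scheme.
    print(round(residual_speedup(2.1, 1.2), 2))
```

Under that assumption roughly 1.75x of the 2.1x advantage is left to the logic style itself.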

Performance Impact of Cell Libraries
- There are key differences between the cell libraries
- Custom libraries:
  - Designed for specific applications
  - Greater functionality, high number of transistors (~1000-10000)
  - Wide range of logic styles
  - Cell heights are taller, better location of ports
  - Drive strengths targeted at the intended load
  - Transistor size and P:N ratio locally tunable
  - Few restrictions on layout
- Design-specific libraries
- ASIC libraries:
  - Limited support in synthesis tools
  - They cannot yet exploit the most aggressive techniques

Performance of Static CMOS Standard Cell Libraries
- Limited ability of ASIC libraries to support transistor sizing and other transistor-level features
  - On non-critical paths, smaller transistors: less power consumption, less load
  - On the critical path, high-speed devices
  - Iterative transistor resizing and resynthesis
- Not enough drive strengths and only one polarity
  - Buffers used to drive loads
  - Loss of compactness; compact cells are usually faster
- Usage of different cells from the same library
  - Scan flip-flops instead of D flip-flops
  - Exclusion of cells from synthesis for practical reasons

Better Performance with Better Layout: Wire Load Models
- Used to estimate the load for a given fan-out
- Standard model versus a design-specific one
  - After P&R, capacitances can be extracted
  - Resynthesis
- Percentile: percentage of gates with a given fan-out that have less load than the capacitance given by the model
  - A larger percentile is more conservative
  - Conservative design: oversized gates, too many buffers
  - Typically 50-60% is the best choice (experimentation is suggested)
- Partitioning the design into smaller blocks, with separate wire load models for intra-block and inter-block communication
  - Useful to improve the wire load estimation
- Long nets with large fan-out require a different model
- In deep submicron technologies, inaccuracies in the model cannot be tolerated
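The percentile idea on this slide can be illustrated with a few lines of code. The following is a minimal sketch, assuming that wire capacitances for nets of one fan-out have already been extracted after P&R. It uses the nearest-rank percentile definition, which is one of several possible choices. The function name and the sample data are hypothetical.

```python
# Picking a wire load model capacitance for one fan-out value from
# extracted post-layout capacitances, using a nearest-rank percentile.
import math
from typing import Sequence


def wire_load_capacitance(extracted_caps_ff: Sequence[float],
                          percentile: float) -> float:
    """Smallest extracted capacitance that covers `percentile` % of the nets."""
    if not extracted_caps_ff:
        raise ValueError("need at least one extracted capacitance")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(extracted_caps_ff)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical capacitances (fF) for nets with the same fan-out.
    caps = [2, 4, 6, 8, 10, 12, 14, 16, 18, 40]
    for p in (50, 60, 90):
        print(p, wire_load_capacitance(caps, p))
```

With this made-up sample, a single long net pulls the top of the distribution far above the median, which shows why a high percentile leads to oversized gates and extra buffers on most nets.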

Better Performance with Better Layout: Wire Load Models

Floorplanning and Global Routing
- Better floorplanning can improve the speed after final placement
- It can be optimized by:
  - Partitioning the design into smaller blocks
  - Performing resizing and resynthesis as the design is floorplanned and globally routed
  - Considering the impact of noise (e.g. crosstalk)
- IP vendors cannot provide a manual floorplan for each configuration
  - But Lexra developed a methodology to calculate the allowable configurations and build a floorplan based on them
- Custom design uses similar techniques
- Hopefully EDA floorplanning tools will automate the layout of carefully partitioned designs

Summary of the Performance Differences between ASICs and Custom Chips
- Summary of the maximum performance penalty of ASIC static CMOS

Comparison of ASIC and Custom Cell Areas
- Overhead for a datapath mapped to standard cells:
  - 10%-25% for the alignment of the cell interface to the global grid
  - 0.5%-1.0% per extra protection diode contact
  - 20%-40% due to the differences in the P:N ratio used and in the absolute number of PMOS devices
- Datapath cells using only NMOS or domino style save area by:
  - Reducing the number of inverters required
  - Eliminating the parasitic PMOS
  - Reducing the logic depth

Cost of Grid Alignment and Reduced Encompassed Functionality
- Automatic place and route (P&R)
  - Input and output ports must be aligned to the routing grid
  - All edge geometries must be “on-grid”
- Considering one cell, with abutment in both directions
  - The maximum cost is one routing track in each direction
- Considering a more complex datapath, the difference in area and transistor cost ranges from 11% to 25%

Impact of Extra PMOS Devices and Differing P:N Sizes and Ratios
- Increase in cell area and transistor cost of 25%-50%
- In standard cell design, the P:N ratio is between 1.5:1 (optimum speed) and 2:1 (balanced rising and falling edges)
  - The N-well occupies slightly more than 50% of the cell
  - Handicap when custom functions are implemented
- Mapping N-dominated datapaths into the balanced standard cell circuit style results in area inefficiency
  - Smaller useful area for NMOS
  - The included P-transistors increase transistor cost and layout inefficiency
- Mapping N-dominated datapaths into the balanced standard cell circuit style also results in timing degradation
  - Datapath cells can vary P:N ratios to achieve more performance
  - Additional parasitic load of the extra PMOS

Cell Bit Height – Partitioning Global and Local Resources
- “High density” cell libraries
  - Density counts the number of 2-input NAND gates that can be fabricated, but maybe not usefully connected
- Cell height is minimized to achieve high density
  - Reduction of inter-cell routing resources
  - Adverse impact on block and global assembly
  - Less functionality inside the cell and fewer ports

Extraneous Diode and Substrate Contacts
- Overhead of internal diode contacts on inputs and of full well and substrate ties
- In custom design:
  - Cells are optimized for a specific set of input drivers, output loads and routing paths
  - Power distribution and required ties are usually included in stitch cells
- For standard cells:
  - Diode contacts increase the layout inefficiency
  - Diode contacts are required where a silicon gate is connected to metal routing beyond a certain length threshold and is not already driven
  - Only a few design tools provide the ability to insert diode cells
  - The penalty due to the diode contacts depends on the specific design rules of the selected process

Energy Tradeoffs Between ASIC Cells and Custom Cells
- The energy-delay product is a metric for efficiency
- Delay is a function of the load, the power supply voltage and the device current
- A reduction in operating voltage yields a quadratic reduction in energy and power with only a linear decrease in performance
- As the power supply voltage approaches the threshold voltage, the increase in delay cancels the decrease in energy
- A low energy-delay product enables:
  - Higher speed with the same energy
  - The same speed with lower energy
- Dynamic circuits can further decrease the energy-delay product
- Automatically customizing standard cells to match drive strengths can reduce power by 50%
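The slide's equations are missing from the transcript, so the sketch below substitutes a commonly used simplified model to show the trade-off it describes: energy proportional to C * Vdd^2, and delay proportional to C * Vdd / I with the drive current I proportional to (Vdd - Vt) raised to a velocity exponent. The model, the exponent of 2, the threshold of 0.4 V and all names are assumptions chosen for illustration, not values from the source.

```python
# Energy-delay product versus supply voltage under an assumed
# simplified current model (not the slide's own equations).
from typing import Iterable


def energy(c_load: float, vdd: float) -> float:
    """Switching energy, proportional to C * Vdd^2 (arbitrary units)."""
    return c_load * vdd ** 2


def delay(c_load: float, vdd: float, vt: float,
          velocity_exponent: float = 2.0) -> float:
    """Gate delay, proportional to C * Vdd / (Vdd - Vt)^exponent."""
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_load * vdd / (vdd - vt) ** velocity_exponent


def energy_delay_product(c_load: float, vdd: float, vt: float,
                         velocity_exponent: float = 2.0) -> float:
    """Product of switching energy and gate delay at one supply voltage."""
    return energy(c_load, vdd) * delay(c_load, vdd, vt, velocity_exponent)


def best_supply_voltage(candidates: Iterable[float], vt: float,
                        c_load: float = 1.0,
                        velocity_exponent: float = 2.0) -> float:
    """Candidate supply voltage with the lowest energy-delay product."""
    return min(candidates,
               key=lambda v: energy_delay_product(c_load, v, vt, velocity_exponent))


if __name__ == "__main__":
    vt = 0.4  # assumed threshold voltage in volts
    sweep = [0.5 + 0.1 * i for i in range(20)]
    print(best_supply_voltage(sweep, vt))
```

In this model the energy-delay product stops improving well above the threshold voltage, because the delay grows sharply as the supply approaches Vt. That is the cancellation effect the slide refers to.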

Circuit Efficiency Leads to Low Power Energy Dissipation
- Reducing parasitic capacitance enables the design to run faster
  - This makes it possible to lower the power supply voltage and the power dissipation
- Static / dynamic comparison:
  - 24 instead of 22 transistors
  - In the domino, 8 are minimum sized
  - 12 PMOS instead of 9
  - Wp = 19.8 um instead of 5.4 um
  - 3.85 FO4 / 2.3 FO4
  - 95 nW / 222 nW
- Reducing the supply voltage gives an 8% increase in speed at the same power, or the same speed with an 11% reduction in power

Energy Efficiency vs. Performance
- Custom circuit techniques increase the maximum efficient circuit speed
- If that is not enough, energy efficiency is traded for speed
  - Delay is reduced simply by increasing the drive strength

Future Trends
- Four primary trends:
  - Combination of aggressive operating voltage scaling and limited threshold voltage scaling
    - Increased leakage current and leakage power (domino problem)
    - Limit on transistor stacking (pass-gate problem)
  - Interconnect delay
  - Complexity of design
    - Limits the ability to customize
  - Layout constraints
- Standard cells will be implemented in static CMOS
  - Robustness, scaling properties, existing infrastructure
- Custom will migrate to DCVSL and CML
  - Domino will require bigger keepers
- The difference between ASIC and custom will continue to shrink