Closing the Gap between ASIC & Custom
Chapter 4: High Speed Logic, Circuits, Libraries and Layout
Fabio Garzia
15/11/2018
Overview
- Set of technology-independent metrics to compare ASIC and custom chips
- Four criteria:
  - Performance
  - Area efficiency
  - Energy efficiency
  - Design effort
- Performance differences due to logic style, circuit design, libraries and layout
- Area and density differences
- Impact of circuit styles on energy efficiency
- Future trends and impact of technology scaling
Performance: Fan-out-of-4 Inverter Delay
- Technology-independent performance metric
- Useful to compare different designs and to grade the performance of successive generations of fabrication processes, both from the same foundry and across different foundries
- Four basic characteristics of a process:
  - Effective channel length, oxide thickness and nominal voltage: they affect gate delay and energy
  - Metal pitch rules: they affect the achievable density
- Transistor rules (FEOL) and metal pitch rules (BEOL) are continuously optimized, separately and concurrently
- The performance achievable at a given time can be estimated using the FO4 delay
- Rule of thumb: the FO4 delay in ps is about half the process channel length in nm, under TTLH assumptions (typical NMOS, typical PMOS, 90% of Vdd and 100°C operating temperature)
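As a rough illustration of the FO4 rule of thumb above, the sketch below converts a channel length in nm into an estimated FO4 delay in ps. The function name, the `ps_per_nm` parameter and the sample lengths (500 nm and 130 nm) are illustrative assumptions, not data for a specific process.

```python
def fo4_delay_ps(channel_length_nm: float, ps_per_nm: float = 0.5) -> float:
    """Estimate the FO4 inverter delay in ps from a channel length in nm.

    The default factor of 0.5 ps per nm encodes the rule of thumb
    "FO4 delay is about half the channel length, with nm replaced by ps".
    It is an approximation, not a characterized process value.
    """
    if channel_length_nm <= 0:
        raise ValueError("channel length must be positive")
    return ps_per_nm * channel_length_nm


# Hypothetical sample points: a 0.5 um and a 0.13 um generation.
estimates = {length: fo4_delay_ps(length) for length in (500, 130)}
```

With these inputs the estimate is 250 ps for 500 nm and 65 ps for 130 nm, which is only as accurate as the rule of thumb itself.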
Area Efficiency: Cost
- Area efficiency is achieved by balancing transistor density against routing resource usage
- λ (one half the minimum device length) was the normalized size metric
- But since the 0.5 um technology, the drawn and effective transistor lengths have continued to diverge
- Nowadays the M2 pitch is used as the size metric
  - The area of gates remains constant over a wide range of process generations
- The transistor cost is the technology-independent density metric
  - Lower bound of 3, derived from SRAM bit-cells
Energy
- Energy to switch a gate: set by the switched capacitance and the square of the supply voltage
- Dynamic power: the switching energy multiplied by the activity factor and the frequency
- Three reference energies:
  - Energy required to write a bit in an SRAM bit-cell: reference for flip-flops, latches, registers and memory
  - Energy required to modify/transform the state of a NAND: reference for combinational logic
  - Energy required to switch a minimally sized inverter driving an FO4 load: useful to benchmark data transfer across different blocks
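The equations on this slide did not survive extraction. A minimal sketch using the standard textbook forms (energy per charge/discharge cycle C·Vdd², dynamic power α·C·Vdd²·f) is given below. These forms, the function names and the sample values are assumptions and may differ from the notation originally shown.

```python
def switching_energy_j(c_load_f: float, vdd_v: float) -> float:
    """Energy in joules drawn from the supply per charge/discharge cycle."""
    return c_load_f * vdd_v ** 2


def dynamic_power_w(activity: float, c_load_f: float,
                    vdd_v: float, freq_hz: float) -> float:
    """Dynamic power in watts: activity factor x switching energy x frequency."""
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity factor must be between 0 and 1")
    return activity * switching_energy_j(c_load_f, vdd_v) * freq_hz


# Hypothetical node: 10 fF load, 1.2 V supply, 10% activity, 1 GHz clock.
example_power_w = dynamic_power_w(0.1, 10e-15, 1.2, 1e9)
```

Because the supply voltage enters squared, halving Vdd in this model cuts the switching energy to one quarter.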
Measuring Design Effort
- Full-chip design time
  - Not reliable, because it depends on the actual complexity of the chip
- Unique transistors per week
  - Includes design, schematic entry, timing and functional simulation, layout, backend verification, and integration overhead
  - An average designer produces 22 unique transistors per week, versus 33 for an excellent designer
  - For a 2-input NAND gate made of four unique transistors, this corresponds to a productivity of about 5 and 8 gates per week respectively
  - Compared with an older study, very little progress has been made in the last 20 years in improving custom design productivity
- Considering the time per gate, an excellent designer can produce a 2-input NAND gate in under five hours and an 8 KB SRAM block in 10 weeks
  - An average designer would take seven hours and 17 weeks respectively
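The productivity figures above follow from simple arithmetic, sketched below. The 40-hour working week is an assumption introduced here to convert gates per week into hours per gate. The function names are illustrative.

```python
HOURS_PER_WEEK = 40          # assumption: a standard working week
TRANSISTORS_PER_NAND2 = 4    # from the slide: four unique transistors per 2-NAND


def gates_per_week(unique_transistors_per_week: float) -> float:
    """Convert unique transistors per week into 2-input NAND gates per week."""
    return unique_transistors_per_week / TRANSISTORS_PER_NAND2


def hours_per_gate(unique_transistors_per_week: float) -> float:
    """Working hours needed to produce one 2-input NAND gate."""
    return HOURS_PER_WEEK / gates_per_week(unique_transistors_per_week)


average = (gates_per_week(22), hours_per_gate(22))      # average designer
excellent = (gates_per_week(33), hours_per_gate(33))    # excellent designer
```

Under the 40-hour assumption this gives 5.5 gates per week and roughly 7.3 hours per gate for the average designer, and 8.25 gates per week and roughly 4.8 hours per gate for the excellent one, in line with the rounded figures on the slide.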
Survey of Semiconductor Processes from 1995 to 2002
- Transistor length shrank from 0.5 um to 0.13 um
- Clock frequencies increased from 180 MHz (HP PA-8000) to 2.2 GHz (Intel Pentium 4)
- The table lists six major processes
Applying the Metrics to Flagship Designs
- Custom design is best for high-end microprocessors
- Custom can meet and even exceed both the reference clock rate and the maximum available gate capacity
  - The combination of high density and large on-chip SRAM blocks makes it possible to exploit the large number of available transistors effectively
- Many flagship ASICs are not microprocessors and don’t require large SRAM blocks
- The maximum size of a custom chip is less constrained
- ASIC designs vary a lot in their ability to approach the gate counts advertised by the foundries
Performance Penalties in ASIC Designs
- Impact of circuit style on performance
- Overview of circuit styles:
  - Domino circuits: single-ended (b), differential (a)
  - Differential static circuits (c)
  - Standard static CMOS (d) (e)
- Custom circuit styles are more efficient than static CMOS
  - Restriction on logic depth imposed by the pull-up trees of PMOS transistors
Different Logic Styles from Dynamic to Static Logic
- Custom differential domino design
  - Custom techniques, single-phase clock and latch-based timing
  - Clock period equal to 20 FO4 delays: 3 FO4 for register overhead, 1 FO4 for timing margin, 16 FO4 for the combinational logic
- Custom differential static techniques (DCVSL, DSL, CNTL)
  - Minimize the effect of the parasitic PMOS
  - Reduce the overhead of the inverters
  - Increase the combinational logic delay, due to the fight with the PMOS devices
  - They require 21 FO4 for the combinational logic, so they are slower than domino (25 FO4 instead of 20 FO4)
- Semi-custom single-ended static
  - Despite the semi-custom approach, all three parameters degrade
  - 5 more FO4 are required for the combinational logic and 1 more for the timing margin
  - 32 FO4 are required in total
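The clock period budgets above can be tabulated as a sum of register overhead, timing margin and combinational logic delay, all in FO4 units. In the sketch below the domino and differential static rows use the stated numbers. For the semi-custom row, the 26 FO4 of logic (21 + 5) and the 4 FO4 of register overhead are inferred here so that the parts add up to the stated 32 FO4, and should be read as an assumption. The class and key names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleBudget:
    """Clock period budget, every field in FO4 delays."""
    register_overhead: int
    timing_margin: int
    combinational: int

    @property
    def total(self) -> int:
        return self.register_overhead + self.timing_margin + self.combinational


BUDGETS = {
    "custom differential domino": CycleBudget(3, 1, 16),
    "custom differential static": CycleBudget(3, 1, 21),
    # Inferred split (assumption): 32 total = 4 register + 2 margin + 26 logic.
    "semi-custom single-ended static": CycleBudget(4, 2, 26),
}


def relative_period(style: str, baseline: str = "custom differential domino") -> float:
    """Clock period of a style relative to the baseline style."""
    return BUDGETS[style].total / BUDGETS[baseline].total
```

For example, `relative_period("semi-custom single-ended static")` evaluates to 32 / 20, so under these budgets the semi-custom static style needs a clock period about 1.6 times that of custom domino.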
The Additional Effects of Logic Design, Cell Design and Layout in ASICs
- Flip-flops vs. latches
  - Flip-flops are 2-3 FO4 slower than latches
  - They do not allow slack passing: 4 additional FO4 of delay
- Coarser control of clock distribution: 2 FO4 delay penalty
- Differences in the design and layout of the combinational logic
  - Design not implemented optimally for speed
  - Structure not exploited in P&R
  - Mismatched cell drive strengths and loads
  - Inaccurate wire load models
  - Altogether 25 FO4
Summary of the Comparison
- Approximate clock periods for different logic and design styles
- Resulting calculated performance of the chips
Comparison of Dynamic and Static 64-Bit Adders
- Logic design
  - (a) Radix-4 domino logic Ling adder cells
  - (b) Radix-2 static logic
  - (c) Radix-2 static logic carry-lookahead adder cells
- Relative transistor sizes
  - For the static logic, pull-up and pull-down have the same strength
Comparison of Dynamic and Static 64-Bit Adders
- The choice of logic style changes which logic design is best:
  - Radix-4 for domino (which can implement more complex cells)
  - Radix-2 for static logic
- Cells with skewed drive strengths can improve the design
  - Smaller area and lower power consumption
  - But they are not usually provided in a standard cell library
- The logic design affects speed, area and power
  - A carry-lookahead adder would have larger cells than the Ling adder
  - These cells are slower, and their larger input capacitance increases power consumption
Comparison of Skew-Tolerant Domino Logic with Synthesized Static Logic
- An execution unit is implemented in both styles for the comparison
  - Register file, adder, shifter, Boolean operations
- The domino logic is 2.1x faster
- The clocking scheme contributes a factor of 1.2x
  - The skew-tolerant domino logic used a four-phase clocking scheme with semi-custom clock trees
    - Less affected by clock skew
    - Reduced impact of setup time on the timing overhead
  - The static logic used flip-flops
    - Slack passing not possible
- The remaining factor (about 2.1 / 1.2 = 1.75x, if the two factors multiply) depends on the logic style
  - The domino implementation was hand-designed
  - The static implementation was synthesized from RTL
  - No difference due to layout
Performance Impact of Cell Libraries
- There are key differences between the cell libraries
- Custom libraries:
  - Designed for specific applications
  - Greater functionality, high number of transistors (~1000-10000)
  - Wide range of logic styles
  - Cell heights are taller, with better location of ports
  - Drive strengths targeted to the intended load
  - Transistor size and P:N ratio locally tunable
  - Few restrictions on layout
- Design-specific libraries
- ASIC libraries:
  - Limited support in synthesis tools
  - They cannot yet exploit the most aggressive techniques
Performance of Static CMOS Standard Cell Libraries
- ASIC libraries are limited in their ability to support transistor sizing and other transistor-level features
  - On non-critical paths, smaller transistors: less power consumption, less load
  - On the critical path, high-speed devices
  - Iterative transistor resizing and resynthesis
- Not enough drive strengths and only one polarity
  - Buffers used to drive loads
- Loss of compactness
  - Compact cells are usually faster
- Use of different cells from the same library
  - Scan flip-flops instead of D flip-flops
  - Exclusion of cells from synthesis for practical reasons
Better Performance with Better Layout
- Wire load models
  - Used to estimate the load for a given fan-out
  - Standard model versus design-specific one
  - After P&R, capacitances can be extracted and used for resynthesis
- Percentile: percentage of gates with a given fan-out that have less load than the capacitance given by the model
  - A larger percentile is more conservative
  - Overly conservative design: oversized gates, too many buffers
  - Typically 50-60% is the best choice (experimentation is suggested)
- Partition the design into smaller blocks, with separate wire load models for intra-block and inter-block communication
  - Useful to improve the wire load estimation
  - Long nets with large fan-out require a different model
- In deep submicron technologies, the inaccuracies of the model cannot be tolerated
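The percentile definition above can be made concrete with a small sketch. It assumes a list of extracted capacitances (in fF) for nets that share one fan-out. The function names, the nearest-rank selection rule and the sample data are hypothetical and are not tied to any particular synthesis tool.

```python
import math


def wire_load_percentile(extracted_caps_ff: list[float], model_cap_ff: float) -> float:
    """Percentage of nets whose extracted load is below the model's capacitance."""
    if not extracted_caps_ff:
        raise ValueError("need at least one extracted capacitance")
    below = sum(1 for cap in extracted_caps_ff if cap < model_cap_ff)
    return 100.0 * below / len(extracted_caps_ff)


def model_cap_for_percentile(extracted_caps_ff: list[float], percentile: float) -> float:
    """Smallest extracted value that covers at least `percentile` % of the nets."""
    if not extracted_caps_ff or not 0 < percentile <= 100:
        raise ValueError("need data and a percentile in (0, 100]")
    ordered = sorted(extracted_caps_ff)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))  # nearest-rank rule
    return ordered[rank - 1]


# Hypothetical extracted loads (fF) for ten nets with the same fan-out.
caps = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 15.0, 30.0]
```

With this sample, a 60th-percentile model picks 9.0 fF while a 100th-percentile model picks 30.0 fF, showing how a single long net drives a conservative model toward oversized gates.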
Floorplanning and Global Routing
- Better floorplanning can improve the speed after final placement
- It can be optimized by:
  - Partitioning the design into smaller blocks
  - Performing resizing and resynthesis as the design is floorplanned and globally routed
  - Considering the impact of noise (e.g. crosstalk)
- IP vendors cannot provide a manual floorplan for each configuration
  - But Lexra developed a methodology to calculate the allowable configurations and build a floorplan based on them
- Custom design uses similar techniques
- Hopefully EDA floorplanning tools will automate the layout of carefully partitioned designs
Summary of the Performance Differences between ASICs and Custom Chips
- Summary of the maximum performance penalty of ASIC static CMOS
Comparison of ASIC and Custom Cell Areas
- Overhead for datapaths mapped to standard cells:
  - 10%-25% for the alignment of the cell interface to the global grid
  - 0.5%-1.0% per extra protection diode contact
  - 20%-40% due to differences in the P:N ratio used and in the absolute number of PMOS devices
- Datapath cells using only NMOS or domino style save area by:
  - Reducing the number of inverters required
  - Eliminating the parasitic PMOS
  - Reducing the logic depth
Cost of Grid Alignment and Reduced Encompassed Functionality
- Automatic place and route
  - Input and output ports must be aligned to the routing grid
  - All edge geometries must be “on-grid”
- Considering one cell, with abutment in both directions
  - The maximum cost is one routing track in each direction
- Considering a more complex datapath, the difference in area and transistor cost ranges from 11% to 25%
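The single-cell case above can be sketched by rounding each cell dimension up to the next multiple of the routing pitch. The dimensions, the pitch and the function names below are hypothetical. Integer nanometres are used to avoid floating-point rounding at grid boundaries.

```python
def snap_up(dim_nm: int, pitch_nm: int) -> int:
    """Round a dimension up to the next multiple of the routing pitch."""
    if dim_nm <= 0 or pitch_nm <= 0:
        raise ValueError("dimensions and pitch must be positive")
    return -(-dim_nm // pitch_nm) * pitch_nm  # integer ceiling division


def grid_alignment_overhead(width_nm: int, height_nm: int,
                            pitch_x_nm: int, pitch_y_nm: int) -> float:
    """Fractional area increase caused by placing all cell edges on-grid."""
    raw_area = width_nm * height_nm
    aligned_area = snap_up(width_nm, pitch_x_nm) * snap_up(height_nm, pitch_y_nm)
    return aligned_area / raw_area - 1.0


# Hypothetical custom cell of 2300 nm x 3500 nm on a 400 nm routing pitch.
overhead = grid_alignment_overhead(2300, 3500, 400, 400)
```

Here the cell grows to 2400 nm x 3600 nm, an area increase of about 7%. Each dimension grows by less than one pitch, which matches the stated bound of one routing track per direction.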
Impact of Extra PMOS Devices and Differing P:N Sizes and Ratios
- Cell area and transistor cost increase by 25%-50%
- In standard cell design, the P:N ratio is between 1.5:1 (optimum speed) and 2:1 (balanced rising and falling edges)
  - The N-well occupies slightly more than 50% of the cell
  - A handicap when custom functions are implemented
- Mapping N-dominated datapaths onto the balanced standard cell circuit style results in area inefficiency
  - Smaller useful area for NMOS
  - The included P-transistors increase transistor cost and layout inefficiency
- Mapping N-dominated datapaths onto the balanced standard cell circuit style also degrades timing
  - Datapath cells can vary P:N ratios to achieve more performance
  - Additional parasitic load of the extra PMOS
Cell Bit Height – Partitioning Global and Local Resources
- “High density” cell libraries
  - Density is quoted as the number of 2-input NAND gates that can be fabricated, but maybe not usefully connected
  - Cell height is minimized and high density achieved
- Reduction of inter-cell routing resources
  - Adverse impact on block and global assembly
- Less functionality inside the cell and fewer ports
Extraneous Diode and Substrate Contacts
- Overhead of internal diode contacts on inputs and of full well and substrate ties
- In custom design
  - Cells are optimized for a specific set of input drivers, output loads and routing paths
  - Power distribution and the required ties are usually included in stitch cells
- For standard cells
  - Diode contacts increase the layout inefficiency
  - Diode contacts are required where a silicon gate is connected to metal routing beyond a certain length threshold and is not already driven
  - Only a few design tools provide the ability to insert diode cells
  - The penalty due to the diode contacts depends on the specific design rules of the selected process
Energy Tradeoffs Between ASIC Cells and Custom Cells
- The energy-delay product is a metric for efficiency
- Delay is a function of the load, the power supply voltage and the device current
- Reducing the operating voltage yields a quadratic reduction in energy and power with only a linear decrease in performance
  - As the power supply voltage approaches the threshold voltage, the increase in delay cancels the decrease in energy
- A low energy-delay product enables
  - Higher speed with the same energy
  - The same speed with lower energy
- Dynamic circuits can further decrease the energy-delay product
- Automatically customizing standard cells to match drive strengths can reduce power by 50%
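The slide's equations were lost in extraction. To illustrate the voltage tradeoff described above, the sketch below uses a simplified alpha-power-law model: normalized energy Vdd² and normalized delay Vdd / (Vdd - Vt)^alpha. The model, the exponent `alpha = 2`, the 0.4 V threshold and all names are assumptions made for illustration. They are not taken from the slides.

```python
def energy(vdd: float) -> float:
    """Normalized switching energy (unit capacitance): quadratic in Vdd."""
    return vdd ** 2


def delay(vdd: float, vt: float, alpha: float = 2.0) -> float:
    """Normalized gate delay: load charge over an alpha-power-law drive current."""
    if vdd <= vt:
        raise ValueError("supply must exceed the threshold voltage")
    return vdd / (vdd - vt) ** alpha


def energy_delay_product(vdd: float, vt: float, alpha: float = 2.0) -> float:
    return energy(vdd) * delay(vdd, vt, alpha)


def best_supply_mv(vt_mv: int, lo_mv: int, hi_mv: int,
                   step_mv: int = 10, alpha: float = 2.0) -> int:
    """Supply voltage (mV) on a grid that minimizes the energy-delay product."""
    candidates = range(lo_mv, hi_mv + 1, step_mv)
    return min(candidates,
               key=lambda mv: energy_delay_product(mv / 1000, vt_mv / 1000, alpha))


# Hypothetical 0.4 V threshold, sweeping the supply from 0.5 V to 2.5 V.
optimum_mv = best_supply_mv(400, 500, 2500)
```

In this model the delay grows without bound as Vdd approaches Vt, so lowering the supply eventually stops paying off. For `alpha = 2` the energy-delay product is minimized at Vdd = 3·Vt, which the sweep reproduces as 1200 mV.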
Circuit Efficiency Leads to Low Power Energy Dissipation
- Reducing parasitic capacitance enables the design to run faster
  - This makes it possible to lower the power supply voltage and the power dissipation
- Static / dynamic comparison
  - 24 instead of 22 transistors
  - In the domino, 8 are minimum sized
  - 12 PMOS instead of 9
  - Wp = 19.8 um instead of 5.4 um
  - 3.85 FO4 / 2.3 FO4
  - 95 nW / 222 nW
- By reducing the supply voltage: 8% increase in speed at the same power, or the same speed with an 11% reduction in power
Energy Efficiency vs. Performance
- Custom circuit techniques increase the maximum efficient circuit speed
- If that is not enough, energy efficiency is traded for speed
  - Delay is reduced simply by increasing the drive strength
Future Trends
- Four primary trends
  - Combination of aggressive operating voltage scaling and limited threshold voltage scaling
    - Increased leakage current and leakage power (domino problem)
    - Limit on transistor stacking (pass-gate problem)
  - Interconnect delay
  - Design complexity
    - Limits the ability to customize
  - Layout constraints
- Standard cells will be implemented in static CMOS
  - Robustness, scaling properties, existing infrastructure
- Custom will migrate to DCVSL and CML
- Domino will require bigger keepers
- The difference between ASIC and custom will continue to shrink