# Optimizing Power @ Design Time: Circuits


Optimizing Power @ Design Time Circuits
Dejan Marković, Borivoje Nikolić

Power reduction is a recurring theme in many phases of the integrated-circuit lifetime. This chapter focuses on techniques for power reduction at design time and at the circuit level. We will discuss practical questions commonly encountered by designers: whether gate sizing or supply voltage yields larger returns in terms of power-performance, how many supplies we need, what the ratio of discrete supplies and thresholds should be, and so on. A unified sensitivity-based optimization framework will be used as a tool for finding the best tradeoff between power and delay at the circuit level by changing gate size, supply voltage, and threshold voltage. The power-performance analysis framework will be illustrated on several circuit examples that represent common topologies in logic design. The results presented in this chapter will serve as the foundation for optimization at the architecture level (Chapter 5) and at higher layers of design abstraction.

Chapter Outline

- Optimization framework for energy-delay trade-off
- Dynamic power optimization
  - Multiple supply voltages
  - Transistor sizing
  - Technology mapping
- Static power optimization
  - Multiple thresholds
  - Transistor stacking

Energy/Power Optimization Strategy
- For a given function and activity, an optimal operating point can be derived in the energy-performance space
- Time of optimization depends upon the activity profile
- Different optimizations apply to active and static power

[Table: when active and static power are optimized, by activity profile: fixed activity (design time), variable activity (run time), no activity or standby (sleep)]

Before attempting any optimization, we should recall that power efficiency and energy efficiency are the same measure: the amount of energy required to perform an average operation (for example, an addition). Since energy is proportional to charge and voltage, efficiency could be improved by simply scaling down the supply voltage or downsizing the gates. The situation is complicated by the fact that the amount of charge stored on gate capacitances is a function of activity (and voltage). Furthermore, a system can transition randomly between low-activity and high-activity operating modes. Because the optimal design parameters (e.g. supply and threshold voltage) largely depend on the activity profile, they could change dynamically.

The goal of the overall power/energy optimization is then to reduce the dominant component of power (or energy) at any given time while meeting real-time performance requirements. This naturally motivates the use of the energy-performance space as a coordinate system in which designers can evaluate the effectiveness of their techniques. The energy-delay space is a compact representation of a design that provides intuition about the fundamental tradeoff between energy and performance.

Both the active and the static components of power can be optimized at various phases of the integrated-circuit life cycle: design time, run time, or idle (sleep) time. The optimization techniques for fixed or variable activity, active and static power, and design and run time are different and will be treated in the upcoming chapters. This chapter reviews techniques for power minimization at design time.

Energy-Delay Optimization and Trade-off

[Figure: energy/op versus delay, showing the trade-off space, an unoptimized design, and the bounds Emax, Emin, Dmin, Dmax]

- Maximize throughput for given energy, or
- Minimize energy for given throughput
- Other important metrics: Area, Reliability, Reusability

What do we mean by power-performance optimization? This slide shows the energy-delay plane: by changing various parameters in the design, each design maps to a region of the plane where energy can be traded off for delay. Starting from an unoptimized design, we either want to speed up the system while bringing the design under the power cap (indicated by Emax), or we want to minimize energy while satisfying the throughput constraint (Dmax). The goal is to stay on the optimal boundary of the energy-performance space, the curve obtained by optimally tuning the design variables (gate size, supply voltage, and threshold voltage). This curve is optimal because all other points consume more energy for the same delay or have longer delay for the same energy. While it looks like a simple task in this graph, in real life it is more complex …

The Design Abstraction Stack
A very rich set of design parameters to consider! It helps to consider options in relation to their abstraction layer.

- System/Application: choice of algorithm
- Software: amount of concurrency
- (Micro-)Architecture: parallel versus pipelined, general purpose versus application specific
- Logic/RT: logic family, standard cell versus custom
- Circuit (this chapter): sizing, supply, thresholds
- Device: bulk versus SOI

Optimization Can/Must Span Multiple Levels
[Figure: optimization layers, from Architecture through Micro-Architecture down to Circuit (Logic & FFs)]

Design optimization combines top-down and bottom-up: “meet-in-the-middle”

The most energy-efficient solution is achieved by simultaneously optimizing across multiple abstraction layers: circuit level, micro-architecture, and macro-architecture. Going upward through the design abstraction layers provides more degrees of freedom for circuit optimization and results in a larger overall impact on power. To manage the complexity of top-level design, the goal is to decompose the problem into several abstraction layers and to identify independent sub-spaces, so that tradeoffs can be exchanged effectively between layers.

At the circuit level, we minimize energy subject to a delay constraint using gate size, supply voltage, and threshold voltage. The result is the optimal value of Vdd, Vth, and W, as well as the energy-delay tradeoff. These results are then used to form models for micro-architecture-level optimization. The micro-architecture level also introduces additional degrees of freedom: the choice of circuit topology, parallelism/pipelining, or time-multiplexing. Optimization along these variables then yields the energy-delay tradeoff for the system architect. At the system level, we also have to worry about area, because architectural techniques such as parallelism and time-multiplexing can change the area of the design considerably. Additionally, incorporating flexibility typically incurs energy and area overhead, which has to be accounted for by the system designers.

Energy-Delay Optimization

[Figure: energy/op versus delay for topology A and topology B; their optimal curves combine into the globally optimal energy-delay curve for a given function]

The problem is that there are many sets of parameters to adjust. Some of these variables are continuous, like transistor sizes and supply and threshold voltages, and some are discrete, like logic styles, topologies, and micro-architectures. Using optimization, we can determine the optimal boundary curve in the energy-delay space for a given topology (topology A, for example). However, another topology (topology B) will have a different boundary curve. The optimal boundary curves of the individual topologies combine to define the composite boundary curve for the logic function being implemented. For example, topology B is better in the energy-performance sense for large target delays, while topology A is more effective for shorter delays. The goal of this chapter is to demonstrate how we can quickly search for this global optimum, based on an understanding of the scope and effectiveness of all variables in the optimization. How do we do that?

Some Optimization Observations
Energy-Delay Sensitivities

$$S_A = \left.\frac{\partial E / \partial A}{\partial D / \partial A}\right|_{A = A_0}$$

[Figure: energy versus delay around the point (A0, B0) at delay D0, with the curves f(A, B0) and f(A0, B) and their slopes SA and SB]

Energy-delay sensitivity is the formal way to evaluate the effectiveness of the various variables in the design, and it is the core of the circuit optimization infrastructure. It relies on simple gradient expressions that describe the profitability of optimization: how much energy and delay change when one of the design variables is tuned, for example design variable A at point A0. At the point (A0, B0) illustrated in the graph, the sensitivity to each variable is simply the slope of the curve obtained by tuning that variable. Observe that the sensitivities are negative, due to the nature of the energy-delay tradeoff. When we compare sensitivities, we compare their absolute values, because a larger absolute value indicates a higher potential for energy reduction. For example, variable B has a higher energy-delay sensitivity at point (A0, B0) than variable A.

[Ref: V. Stojanovic, ESSCIRC’02]

Finding the Optimal Energy-Delay Curve

Pareto-optimal: the best that can be achieved without disadvantaging at least one metric.

$$\Delta E = S_A \cdot (-\Delta D) + S_B \cdot \Delta D$$

[Figure: energy versus delay around the point (A0, B0) at delay D0, with the curves f(A, B0), f(A0, B), and f(A1, B) and the delay step ∆D]

The key concept to realize is that at the solution point the sensitivities should be equal. Intuitively, this makes sense. If the sensitivities are not equal, we can use the low-energy-cost variable (variable A) to create some timing slack ∆D, which increases energy in proportion to the sensitivity of that variable. We are then at point (A1, B0), and can use the higher-energy-cost variable B to consume the slack, achieving the overall energy reduction indicated by the formula. The fixed point of the optimization is reached when all sensitivities are equal.

On the optimal curve, all sensitivities must be equal.
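The slack-exchange argument can be checked with a few lines of arithmetic. The following is a minimal sketch that is not part of the original slides; the function name and the numeric sensitivities are illustrative assumptions.

```python
def energy_change(s_a, s_b, delta_d):
    """Net energy change from trading delay between two variables.

    s_a, s_b: energy-delay sensitivities (dE/dD, negative) of variables A and B.
    delta_d:  timing slack created with A and then consumed with B (positive).
    Implements dE = s_a * (-delta_d) + s_b * delta_d.
    """
    return s_a * (-delta_d) + s_b * delta_d


# Hypothetical numbers: B is three times more sensitive than A.
print(energy_change(s_a=-1.0, s_b=-3.0, delta_d=0.1))  # negative: net energy saving
# Equal sensitivities: the exchange no longer pays off.
print(energy_change(s_a=-2.0, s_b=-2.0, delta_d=0.1))
```

As long as the two sensitivities differ, the exchange yields a net saving, which is why the optimization only stops once they are equal.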

Reducing Active Energy @ Design Time
- Reducing voltages
  - Lowering the supply voltage (VDD) at the expense of clock speed
  - Lowering the logic swing (Vswing)
- Reducing transistor sizes (CL)
  - Slows down logic
- Reducing activity (a)
  - Reducing switching activity through transformations
  - Reducing glitching by balancing logic

Let us first focus on the active component of power. Active power is the product of the switching activity at the output of a gate, the load capacitance at the output, the logic swing, the supply voltage, and the frequency. A simple guideline for power reduction is therefore to reduce each of the terms in the product. Some variables, however, are more efficient than others. The largest impact on active power seemingly comes from supply voltage scaling, because it also affects the logic swing, which results in a quadratic impact on power. All other terms have a linear impact. For example, smaller transistors have less capacitance and consume less energy. Frequency is another factor that depends on both supply voltage and gate sizing. Switching activity mostly depends on the choice of circuit topology.

For a fixed circuit topology, the most interesting tradeoff exists between supply voltage and gate sizing, since these tuning knobs affect both power and performance. In fact, gate sizing might have a higher indirect impact through the load capacitance if we keep in mind that performance and power have to be considered jointly in the optimization. To fully evaluate the impact of the design variables (supply voltage, threshold voltage, and gate sizing) on power and performance, we need an optimization framework that relates the design variables to power and performance.
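The product form described in the notes can be written down directly. This is a minimal sketch with hypothetical parameter values; it only illustrates why the supply has a quadratic effect when the logic swing tracks VDD.

```python
def active_power(activity, c_load, v_swing, v_dd, freq):
    """Active power as the product of the five terms named in the notes."""
    return activity * c_load * v_swing * v_dd * freq


# Hypothetical gate: 10 fF load, 10% activity, 1 GHz, full-swing logic.
full = active_power(0.1, 10e-15, v_swing=1.2, v_dd=1.2, freq=1e9)
half = active_power(0.1, 10e-15, v_swing=0.6, v_dd=0.6, freq=1e9)

# With rail-to-rail swing, halving VDD cuts power to about one quarter
# (at fixed frequency, which a real circuit may no longer be able to meet).
print(half / full)
```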

Observation

- Downsizing and/or lowering the supply on the critical path lowers the operating frequency
- Downsizing non-critical paths reduces energy for free, but
  - Narrows down the path delay distribution
  - Increases the impact of variations and impacts robustness

[Figure: histograms of the number of paths versus path delay tp, with the target delay marked, before (left) and after (right) downsizing]

We also need to keep in mind the implications of the design variables for other important design properties that are not captured in the power and delay formulas. For example, we may ask ourselves how sizing or supply affects circuit reliability. It is interesting to observe that trimming only the gates on the non-critical paths can save power without a performance penalty. Pruning of the gate sizes should continue until all paths are critical or we reach the minimum gate size constraint. This power reduction does not come entirely for free, however, because it increases the impact of process parameter variations on performance. The effect is illustrated in the figure: the delay distribution of the design with few critical paths (shown on the left) narrows down (shown on the right) as the gates are downsized.

Circuit Optimization Framework
$$\begin{aligned} \text{minimize} \quad & \text{Energy}(V_{DD}, V_{TH}, W) \\ \text{subject to} \quad & \text{Delay}(V_{DD}, V_{TH}, W) \le D_{con} \end{aligned}$$

Constraints: $V_{DD}^{min} < V_{DD} < V_{DD}^{max}$, $V_{TH}^{min} < V_{TH} < V_{TH}^{max}$, $W^{min} < W$

[Figure: energy/op versus delay for topology A and topology B, with the reference case marked at Dmin (VDDmax, VTHref)]

We can formalize the search for the globally optimal power-performance curve of a given circuit topology by formulating it as an optimization problem. The goal is to minimize energy subject to a delay constraint and to bounds on the optimization variables (Vdd, Vth, W). Optimization is performed with respect to a reference design, sized for minimum delay at the reference supply and threshold voltages specified by the technology (for example, Vdd = 1.2V for a 90nm process). This point is convenient for optimization because it is well defined. To fully establish the grounds for optimization, we need to formulate models of energy and delay.

[Ref: V. Stojanovic, ESSCIRC’02]
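To make the formulation concrete, here is a brute-force sketch of the constrained minimization on a toy single-gate model. Everything in it is an assumption made for illustration: the energy and delay expressions are simplified stand-ins, the threshold voltage is held fixed, γ is treated as the ratio of parasitic to input capacitance, and an upper bound on W is added so that a minimum-delay reference exists. It is not the optimizer used by the authors.

```python
from itertools import product

V_ON, ALPHA_D, GAMMA = 0.37, 1.53, 1.35  # fit values quoted in this chapter
C_LOAD = 8.0  # assumed fixed load, in units of a minimum gate's input capacitance


def delay(vdd, w):
    """Toy normalized delay of one gate of size w driving C_LOAD."""
    return vdd / (vdd - V_ON) ** ALPHA_D * (GAMMA + C_LOAD / w)


def energy(vdd, w):
    """Toy normalized switching energy of input, parasitic and load capacitance."""
    return (w + GAMMA * w + C_LOAD) * vdd ** 2


def minimize_energy(vdd_values, w_values, d_con):
    """Exhaustive search: lowest-energy grid point with delay <= d_con."""
    best = None
    for vdd, w in product(vdd_values, w_values):
        if delay(vdd, w) > d_con:
            continue  # violates the delay constraint
        candidate = (energy(vdd, w), vdd, w)
        if best is None or candidate < best:
            best = candidate
    return best


vdd_grid = [0.6 + 0.05 * i for i in range(13)]  # roughly 0.6 V to 1.2 V
w_grid = [1.0 + 0.5 * i for i in range(15)]     # sizes 1 to 8 (assumed bounds)

# Reference: highest supply and largest size, the minimum-delay corner of this toy.
ref = (vdd_grid[-1], w_grid[-1])
d_con = 1.1 * delay(*ref)  # allow a 10% delay increase over the reference

best_e, best_vdd, best_w = minimize_energy(vdd_grid, w_grid, d_con)
print(best_e, best_vdd, best_w, energy(*ref))
```

Sweeping `d_con` and recording the best energy at each value traces out an approximation of the optimal energy-delay curve for this toy model.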

Optimization Framework: Generic Network
[Figure: gate in stage i, supplied by VDD,i, loaded by its fanout in stage i+1, supplied by VDD,i+1; capacitances Ci (input), γCi (parasitic), Cw (wire), and Ci+1 (fanout input)]

The energy of a logic gate is modeled by its switching component. In this model, KeWout is the total load at the output, including wire and gate loads, and KeWpar is the self-loading of the gate. The total energy stored on these three capacitances is the energy taken out of the supply voltage in stage i. On the other hand, if we change the size of the gate in stage i, it affects only the energy stored on the gate itself, at its input capacitance and its parasitic capacitance. The slide shows that eci is the energy that the gate in stage i contributes to the overall energy; this is another parameter to remember for the analysis of optimization results.

Alpha-power based Delay Model
Fit parameters: Von, d, Kd, g (90nm technology) 2 4 6 8 10 20 30 40 50 60 Fanout (Ci+1/Ci) Delay (ps) t p 0.5 0.6 0.7 0.8 0.9 1 1.5 2.5 3 3.5 V DD / V ref FO4 delay (norm.) on = 0.37 V a d = 1.53 simulation model tnom = 6 ps g = 1.35 The delay of a logic gate is expressed using a simple linear delay model, based on the alpha-power law for the drain current. This is a curve-fitting expression and the parameters Von and d are intrinsically related yet not equal to the transistor threshold and the velocity saturation index. Kd is another fitting parameter, and W’s correspond to various gate capacitances, with indices meaning output, parasitic and input. The model fits SPICE simulated data quite nicely, across a range of supply voltages, normalized to the nominal supply voltage set by the technology, which is 1.2V in our case, for a 90nm CMOS technology. {Kd = 15.5} Using the logical effort notation, the delay formula can be expressed simply as a product of the process-dependent time constant nom and unitless delay, where g is the logical effort that quantifies the relative ability of a gate to deliver current, h is the ratio of the total output to input capacitance, and p represents the delay component due to the self-loading of the gate. The product of the logical effort and the electrical effort is the effective fanout from the logical effort theory. VDDref = 1.2V, technology 90 nm

Combined with Logical Effort Formulation
For complex gates:

- Parasitic delay pi: depends upon gate topology
- Electrical effort fi ≈ Si+1/Si
- Logical effort gi: depends upon gate topology
- Effective fanout hi = figi

[Ref: I. Sutherland, Morgan Kaufmann’99]

Dynamic Energy

[Figure: stages i and i+1 with supplies VDD,i and VDD,i+1 and capacitances Ci, γCi, Cw, and Ci+1]

eci = energy consumed by logic gate i
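The switching-energy bookkeeping described for the generic network can be sketched as follows. This is an interpretation under stated assumptions: energy is taken as capacitance times the square of the supply that charges it, the input capacitance of gate i is charged from the previous stage's supply, and constant factors such as switching activity and the fitting constant Ke are left out. The function names are hypothetical.

```python
def stage_energy(c_i, c_next, c_wire, gamma, vdd_i):
    """Energy drawn from VDD,i when stage i switches its output once.

    Charges the gate's own parasitic (gamma * c_i), the wire, and the
    input capacitance of the fanout gate.
    """
    return (gamma * c_i + c_wire + c_next) * vdd_i ** 2


def gate_energy(c_i, gamma, vdd_prev, vdd_i):
    """Energy attributable to gate i itself (e_ci): its input capacitance,
    charged by the previous stage's supply, plus its parasitic capacitance,
    charged by its own supply."""
    return c_i * vdd_prev ** 2 + gamma * c_i * vdd_i ** 2


# Hypothetical normalized values: unit gate driving a gate four times larger.
print(stage_energy(c_i=1.0, c_next=4.0, c_wire=0.5, gamma=1.35, vdd_i=1.0))
print(gate_energy(c_i=1.0, gamma=1.35, vdd_prev=1.0, vdd_i=1.0))
```

Doubling the size of gate i doubles `gate_energy` but leaves the wire and fanout terms untouched, which is why eci is the quantity that enters the sizing sensitivity.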

Optimizing Return on Investment (ROI)

- Depends on sensitivity (∂E/∂D)
- Gate sizing: sensitivity is infinite for equal h (Dmin)
- Supply voltage: sensitivity is at its maximum at VDD(max) (Dmin)

Sensitivity analysis provides intuition about the profitability of optimization. The values that ultimately matter are the marginal returns: the amount of energy that can be traded for delay at each point along the energy-delay curve when using each of the control variables. These can be derived from the gradient expressions (sensitivities), which quantify which parameter offers the largest energy reduction for the same incremental increase in delay.

The formulas indicate that the largest potential for energy savings is at the starting point, the minimum delay. In sizing, a design that is sized for minimum delay has equal effective fanouts, which means infinite sensitivity. This makes sense, because at minimum delay no amount of energy can be spent to improve the delay. When the supply is scaled down, delay increases, energy decreases, and the energy reduction potential from further supply reduction diminishes. The key point to realize is that the optimization always exploits the tuning variable with the largest sensitivity, which ultimately leads to the solution where all sensitivities are equal. You will see this concept at work in many examples.

Example: Inverter Chain
- Properties of the inverter chain
  - Single-path topology
  - Energy increases geometrically from input to output
- Goal
  - Find the optimal sizing S = [S1, S2, …, SN], supply voltage, and buffering strategy to achieve the best energy-delay tradeoff

[Figure: inverter chain with stage sizes S1 = 1, S2, S3, …, SN]

To illustrate circuit optimization, we look at several examples that help us examine some common circuit topologies. The circuit topologies differ in the amount of off-path loading and path reconvergence. By analyzing how these properties affect a circuit's energy profile, we can better define principles for energy reduction in logic blocks. In this chapter, we analyze an inverter chain and a tree adder to illustrate designs with single and multiple paths and with path reconvergence. Let’s begin with the inverter chain. The goal is to find the optimal sizing and supply voltage that result in the best energy-delay tradeoff.

Inverter Chain: Gate Sizing
[Figure: effective fanout h (0 to 25) versus stage (1 to 7) for delay increments dinc = 0% (nominal), 1%, 10%, 30%, and 50%]

- Variable taper achieves minimum energy
- Reduce the number of stages at large dinc

The inverter chain has been the focus of many related papers, as much can be inferred from this problem. An inverter chain is a topology whose energy increases geometrically towards the output. Most of the energy is stored in the last few stages, with the largest energy in the final load. Shown here is the effective fanout at each stage of the chain, for a family of curves that correspond to various delay increments. A general result for the optimal stage size was derived by Ma and Franzon, and we explain it here using sensitivity analysis.

Recall the result from a few slides back: the sensitivity to gate sizing is proportional to the energy stored on the gate and inversely proportional to the difference in effective fanouts. This means that, for equal sensitivity in all stages, the difference in effective fanouts must increase in proportion to the energy of the gate, so the difference in effective fanouts grows exponentially towards the output. An energy-efficient solution may sometimes require a reduced number of stages. In this example, reducing the number of stages is beneficial at large delay increments.

[Ref: Ma, JSSC’94]
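The minimum-delay starting point of this example, where every stage has the same effective fanout, is easy to reproduce. The sketch below assumes an idealized chain without wire capacitance and with the input size fixed at S1 = 1; the function name is illustrative. It shows only the 0% delay-increment reference, not the energy-optimal taper.

```python
def min_delay_sizing(n_stages, load, s1=1.0):
    """Sizes of an n-stage inverter chain with equal effective fanout.

    load is the final load capacitance expressed in the same units as the
    stage sizes. Returns (fanout per stage, list of stage sizes).
    """
    h = (load / s1) ** (1.0 / n_stages)
    sizes = [s1 * h ** i for i in range(n_stages)]
    return h, sizes


h, sizes = min_delay_sizing(n_stages=3, load=64.0)
print(h, sizes)  # fanout close to 4, sizes close to [1, 4, 16]

# Capacitance switched by each stage grows geometrically towards the output,
# so the last stages and the final load dominate the energy.
driven = sizes[1:] + [64.0]
print(driven)
```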

Inverter Chain: VDD Optimization
[Figure: optimal per-stage supply VDD/Vnom (0.2 to 1.0) versus stage (1 to 7) for delay increments dinc = 0%, 1%, 10%, 30%, and 50%]

- VDD reduces the energy of the final load first
- Variable taper achieved by voltage scaling

This slide shows the optimized per-stage supply and the resulting effective fanout. As in sizing, supply voltage optimization adds incremental delay beginning with the stages of highest energy consumption, increasing the effective fanout of these stages by lowering their supply voltage. The important difference between sizing and supply optimization is that sizing does not affect the energy stored in the final output load, while supply reduces this energy first, by lowering the supply voltage of the gate that drives the load. Now, how good can all this be in terms of energy reduction?

Inverter Chain: Optimization Results
[Figure: two plots versus delay increment dinc (%), with curves for sizing (S), global VDD (gVDD), two discrete supplies (2VDD), and per-stage VDD (cVDD). Left: energy reduction (%). Right: normalized sensitivity]

- The parameter with the largest sensitivity has the largest potential for energy reduction
- Two discrete supplies mimic per-stage VDD

Here is the result of various optimizations performed on the inverter chain: sizing, global Vdd, two discrete Vdd’s, and per-stage Vdd. The graphs show energy reduction and sensitivity versus delay increment. The key concept to realize is that the parameter with the largest sensitivity has the largest potential for energy reduction. For example, at small delay increments sizing has the largest sensitivity, so it offers the largest energy reduction, but the potential for energy reduction from sizing quickly falls off. At large delay increments, it pays to scale the supply voltage of the entire circuit; its sensitivity equals that of sizing at around 25% excess delay. We also see from this graph that a dual supply voltage closely approximates the optimal per-stage supply reduction, meaning that there is almost no additional benefit from having more than two discrete supplies in this topology.

An inverter chain has a particularly simple energy distribution, which grows geometrically until the final stage. This type of profile drives the optimization over sizing and Vdd to focus on the final stages first. However, most practical circuits have a more complex energy profile.

Example: Kogge-Stone Tree Adder
- Long wires
- Re-convergent paths
- Multiple active outputs

We next analyze a tree adder, which has long wires, reconvergent fanout, and multiple active outputs reached through paths of various logic depths. The adder is an interesting arithmetic block, so let’s take a closer look at this example and see what we can learn. This is a standard Kogge-Stone tree adder: propagate/generate blocks at the input (the squares), followed by carry-merge operators, which in circuit terms are and-or-invert operations, and finally XORs for the final sum generation.

[Ref: P. Kogge, Trans. Comp’73]

Tree Adder: Sizing vs. Dual-VDD Optimization
Reference design: all paths are critical

[Figure: energy maps of the adder over bit slices and gate stages. Reference: D = Dmin. Sizing: E (-54%) at dinc = 10%. 2Vdd: E (-27%) at dinc = 10%]

Internal energy ⇒ S more effective than VDD. S: E(-54%), 2Vdd: E(-27%) at dinc = 10%

This adder is convenient for a three-dimensional representation: we can partition the circuit into bit slices and gate stages, which gives a matrix organization that describes the energy at each internal node. Since the paths through an adder roughly correspond to different bit slices, we allocate each gate in the adder to a bit slice. We start from a reference design that is optimized for minimum delay and then see how we can trade off energy and delay from that point. To be fair, the initial sizing makes all paths in the adder equal to the critical path. As a result, any further reduction in size causes the delay of the adder to increase. The figure shows the resulting energy map for the minimum delay, as well as the situation when a 10% delay increase is allowed.

Observe that this adder has internal energy peaks, which makes gate sizing more effective than supply voltage scaling. With gate sizing, we are free to choose any of the gates to attack the problem, so we can efficiently push these peaks down, achieving a 54% energy saving for only a 10% delay increase. With dual supply voltage optimization, assuming that the supply voltage is only allowed to decrease from input to output, we have to start from the output energy peaks. By the time we reach the energy peaks in the middle, we have already spent a lot of the delay increment on the intermediate nodes, which is sub-optimal. As a result, only 27% is saved using two supplies: dual Vdd has a smaller potential for energy reduction than sizing.

So far we have analyzed individual sizing and supply optimizations. Let’s see what happens when we combine them.

Tree Adder: Multi-dimensional Search
[Figure: Energy/Eref (0.2 to 1) versus Delay/Dmin (up to 2) around the reference point, with optimal curves for the variable sets (S, VDD), (VDD, VTH), (S, VTH), and (S, VDD, VTH)]

- Can get pretty close to the optimum with only 2 variables
- Getting the minimum energy or the minimum delay is very expensive

How about combining the benefits of multiple design variables? Which variables should be used first, and how? How many variables… It turns out that in circuit optimization two variables can achieve nearly the optimal gain. In our case these are sizing and threshold. Sizing has the largest and threshold the smallest sensitivity around the minimum-delay point. Closing the sensitivity gap in the way illustrated in Slide 4-7 yields a significant power saving. We should also observe that circuit optimization is effective in a small region around the reference point; outside of this region the optimization becomes too costly in terms of energy (for aggressive delay targets) or delay (for aggressive energy targets).

Multiple Supply Voltages
- Block-level supply assignment
  - Higher throughput/lower latency functions are implemented in higher VDD
  - Slower functions are implemented with lower VDD
  - This leads to so-called “voltage islands” with separate supply grids
  - Level conversion performed at block boundaries
- Multiple supplies inside a block
  - Non-critical paths moved to lower supply voltage
  - Level conversion within the block
  - Physical design challenging

Another popular technique for power reduction is the use of multiple supply voltages. This technique is especially useful in logic macros that have significantly different activity profiles and thus require different optimal supply and pipelining strategies for equal performance. Multiple supply voltages can be introduced at the block/macro level or all the way down at the gate level. Gate-level Vdd assignment can essentially be viewed from the energy-delay standpoint: timing slack in non-critical paths can be used for energy reduction by scaling down the supply voltage. Unlike having multiple gate sizes, having multiple supplies inside a block comes with the penalty of generating and distributing the different voltages.

Using Three VDD’s

[Figure: power reduction ratio as a function of V2 (V) and V3 (V), shown as a surface and as a contour plot with the minimum marked, for V1 = 1.5V and VTH = 0.3V. © IEEE 2002]

An interesting question is, if multiple supply voltages are employed, how many discrete supplies are sufficient and what their values should be. This slide illustrates the use of three supply voltages. Supply assignment to individual logic gates is done by an optimization routine that minimizes energy for a given clock period. With the main supply fixed, the second and third supplies provide a nearly two-fold power reduction compared to the case of a single supply V1 = 1.5V. The key observation from these plots is that the power minimum is quite shallow. This is good, because the values of V2 and V3 do not need to be set exactly: small deviations from their optimal values, for example due to IR drop and other secondary effects, cost little. The question is how much impact on power each additional supply brings.

[Ref: T. Kuroda, ICCAD’02]

Optimum Number of VDD’s
[Figure: optimum VDD ratios (V2/V1, V3/V1, V4/V1) and power ratios (P2/P1, P3/P1, P4/P1) versus V1 (V), for the supply sets {V1, V2}, {V1, V2, V3}, and {V1, V2, V3, V4}. © IEEE 2001]

- The more VDD’s, the less power, but the effect saturates
- The power reduction effect decreases with scaling of VDD
- Optimum V2/V1 is around 0.7

The power reduction quickly saturates with an increasing number of supplies. The additional power savings from having three or four supplies are quite marginal. This makes sense, primarily because the number of gates assigned to each additional supply shrinks due to the larger granularity of the delay increment. For example, a fourth supply works only with the non-critical-path gates close to the tail of the delay distribution. Another complication is the area, power, and routing overhead of supporting additional supplies.

[Ref: M. Hamada, CICC’01]

Lessons: Multiple Supply Voltages
- Two supply voltages per block are optimal
- Optimal ratio between the supply voltages is 0.7
- Level conversion is performed on the voltage boundary, using a level-converting flip-flop (LCFF)
- An option is to use an asynchronous level converter
  - More sensitive to coupling and supply noise

The key lessons from the multiple-Vdd analysis so far are as follows. The optimal ratio between the discrete supplies is about 0.7, and the power reduction saturates after two supplies. The largest benefit comes from two supplies; a third supply provides only an additional 5-10% incremental saving. Level conversion circuitry is needed to cross the supply boundaries.
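A first-order estimate shows what a second supply can buy. The sketch below assumes that switching energy scales with the square of the supply and ignores level-converter and distribution overhead; the fraction of switched capacitance moved to the lower supply is a hypothetical input, not a figure from the slides.

```python
def dual_supply_power_ratio(frac_low, v_ratio):
    """Dual-supply power relative to single-supply power.

    frac_low: fraction of the switched capacitance moved to the lower supply.
    v_ratio:  V2 / V1, the ratio of the lower to the higher supply.
    Assumes energy scales as V^2 and ignores level-conversion overhead.
    """
    return 1.0 - frac_low * (1.0 - v_ratio ** 2)


# With V2/V1 = 0.7, each gate moved to V2 uses about half (0.49) of its energy.
for frac in (0.25, 0.5, 0.75):
    print(frac, dual_supply_power_ratio(frac, 0.7))
```

The saving is bounded by how much of the capacitance sits on paths with enough slack, which is one way to see why further supplies add little.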

Distributing Multiple Supply Voltages
[Figure: two cell layouts, Conventional and Shared N-well, each showing a VDDH circuit and a VDDL circuit with inputs i1, i2, outputs o1, o2, and rails VDDH, VDDL, VSS]

Distribution of multiple supply voltages requires careful examination of the system-level design methodology. The conventional way to support multiple Vdd’s (two in this example) is to use separate wells for low-Vdd and high-Vdd cells. This approach does not require a redesign of the standard cells. Another way is to introduce the second Vdd inside the standard cells and selectively route the cells to the appropriate Vdd. Both approaches come with an area overhead: N-well spacing in the conventional approach, and a second supply rail in the shared N-well approach. Let’s analyze both techniques further to see what kind of system-level tradeoffs they introduce.

Conventional

[Figure: (a) Dedicated row: alternating VDDH and VDDL rows separated by N-well isolation, with rails VDDH, VDDL, VSS. (b) Dedicated region: a VDDH region and a VDDL region holding the VDDH and VDDL circuits]

The straightforward approach is to organize the two groups of gates into dedicated regions and route the corresponding supplies (scheme b). This scheme seems nice, but making a voltage island is not trivial. For instance, logic paths consisting of both high-Vdd and low-Vdd cells incur additional wire delay due to the long wires between the island regions. The extra wire capacitance may also reduce the power savings. Maintaining the spatial locality of combinational logic gates is therefore important.

Another approach is to use dedicated rows of cells (scheme a). We can then distribute both VddL and VddH near the edge of the rows and introduce another standard cell that selects which supply is connected (standard-cell methodology). The dedicated-row approach is better in terms of area utilization, because of the smaller spatial granularity of the individual rows.

Shared N-Well
[Figure: shared N-well layout with VDDH, VDDL and VSS rails, VDDH and VDDL circuits abutted; (a) floor plan image]
Another approach is to redesign the standard cells so that both VddL and VddH rails run inside every cell. This approach is quite attractive, because we do not have to worry about area partitioning: low-Vdd and high-Vdd cells can be abutted to each other. It was demonstrated on a high-speed adder/ALU circuit by Shimazaki et al. [Shimazaki et al, ISSCC’03]

Example: Multiple Supplies in a Block
[Figure: conventional design versus CVS structure with level-shifting flip-flops (FF); the critical path is highlighted in both, and the lower-VDD portion is shaded. © IEEE 1998]
Level conversion is an important issue with multiple discrete Vdd's. A high-Vdd gate can easily drive a low-Vdd gate, but the opposite transition is hard due to leakage and skewing of the logic transitions. Let's examine this effect more closely…
The simplest way to accommodate two supplies is to start from the flip-flops, move to the lower-voltage structures, and never switch back. The up-conversion happens in the flip-flops, because they have feedback and are able to regenerate low-swing (low-Vdd) signals. Supply voltage assignment starts from the critical paths and works backwards to find non-critical paths where the supply voltage can be reduced. This is illustrated by the two datapaths. The conventional design on the left has all gates operating at the nominal supply, with the critical path highlighted. The dual-Vdd design on the right still has all the gates on the critical path at high Vdd, but uses low Vdd for the gates on non-critical paths (shaded). This technique of grouping gates into clusters at different supplies is called “clustered voltage scaling” (CVS). Let's see what kind of flip-flops are suitable for level conversion.
“Clustered voltage scaling” [Ref: M. Takahashi, ISSCC’98]
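The assignment procedure described above can be sketched as a greedy pass over a gate-level netlist. This is a minimal illustration under stated assumptions, not the algorithm from the cited work: the netlist representation, the per-gate delays at the two supplies, and the function names are all hypothetical, and level conversion is assumed to happen only in the flip-flops at the outputs.

```python
from collections import defaultdict


def arrival_times(order, fanin, delay):
    """Longest-path arrival time at each gate output (inputs arrive at t = 0)."""
    at = {}
    for g in order:  # topological order: fanins come before fanouts
        at[g] = max((at[i] for i in fanin[g]), default=0.0) + delay[g]
    return at


def clustered_voltage_scaling(order, fanin, d_high, d_low, t_clk):
    """Greedy CVS sketch: return the set of gates assigned to low Vdd.

    order  : gate names in topological order
    fanin  : gate -> list of driving gates (primary inputs omitted)
    d_high : gate -> delay at high Vdd
    d_low  : gate -> delay at low Vdd
    t_clk  : latest allowed arrival time at any gate output
    """
    fanout = defaultdict(list)
    for g in order:
        for i in fanin[g]:
            fanout[i].append(g)

    low = set()
    delay = dict(d_high)
    # Walk from the flip-flop side back toward the inputs.
    for g in reversed(order):
        # CVS rule: a low-Vdd gate may only drive low-Vdd gates or a
        # level-converting flip-flop (modelled here as "no gate fanout").
        if any(f not in low for f in fanout[g]):
            continue
        delay[g] = d_low[g]
        if max(arrival_times(order, fanin, delay).values()) <= t_clk:
            low.add(g)
        else:
            delay[g] = d_high[g]  # would violate timing: keep at high Vdd
    return low


if __name__ == "__main__":
    # Hypothetical netlist: critical path a -> b -> c, short path e -> d.
    order = ["a", "b", "c", "e", "d"]
    fanin = {"a": [], "b": ["a"], "c": ["b"], "e": [], "d": ["e"]}
    d_high = {g: 1.0 for g in order}
    d_low = {g: 1.5 for g in order}
    print(sorted(clustered_voltage_scaling(order, fanin, d_high, d_low, 3.0)))
```

In this toy netlist the three gates on the critical path stay at high Vdd, while the two gates on the short path are moved to low Vdd as one cluster feeding a flip-flop.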

Level Converting Flip-Flops (LCFFs)
[Figure: master-slave LCFF and pulsed half-latch LCFF schematics. © IEEE 2003]
A number of flip-flops can perform level conversion and maintain good speed. The example shown in this slide is a typical master-slave configuration, with the master stage operating at low voltage and the slave stage operating at high voltage. Level conversion from low to high Vdd relies on the positive feedback in the slave latch. Level conversion can also be done in a pulsed structure, illustrated on the right, where the output switches during the short transparency period of the pulsed latch.
Pulsed half-latch versus master-slave LCFFs: smaller number of MOSFETs and lower clock loading; faster level conversion using the half-latch structure; shorter D-Q path from the pulsed circuit.
[Ref: F. Ishihara, ISLPED’03]

Dynamic Realization of Pulsed LCFF
Pulsed precharge LCFF (PPR): fast level conversion by the precharge mechanism; charge/discharge toggling suppressed by conditional capture; short D-Q path.
Dynamic gates with NMOS-only evaluation transistors are naturally suited to operation with reduced logic swing, because the input signal does not need to develop the full high-Vdd swing to drive the pull-down to logic zero. A reduced swing only results in longer delay. The schematic shows the use of a dynamic structure with implicit level conversion.
[Figure: pulsed precharge latch schematic. © IEEE 2003]
[Ref: F. Ishihara, ISLPED’03]

Case Study: ALU for 64-bit μProcessor
[Figure: block diagram of the ALU module: clock generator (clk), 5:1 and 9:1 input MUXes (ain, ain0, bin), gp generator, carry generator, partial sum, logical unit, sum selector (s0/s1), 2:1 MUX, and INV1/INV2 on the long loop-back bus sumb (0.5pF); VDDH and VDDL circuits are marked. © IEEE 2003]
A real-life example that demonstrates the effective use of dual Vdd is the high-performance datapath of Itanium. Observe that the output capacitance is very large, so lowering the supply voltage on the output bus yields the largest potential for power reduction. The shared-well technique was implemented in a 64-bit ALU module, whose block diagram is shown. It uses domino circuits and is composed of the ALU, the loop-back bus driver and the input operand selectors. Since carry generation is the most critical operation, the circuits in the carry tree are assigned to the VDDH domain. The partial sum generator and the logical unit, on the other hand, are assigned to the VDDL domain. The bus driver, as the gate with the largest load, is also supplied from VDDL. The level conversion from a VDDL signal to a VDDH signal is performed by the sum selector and the 9:1 multiplexer.
[Ref: Y. Shimazaki, ISSCC’03]

Low-Swing Bus and Level Converter
[Figure: schematic of the low-swing loop-back bus (sumb) with INV1 and INV2 in the VDDL domain, and the domino level converter (9:1 MUX) in the VDDH domain with keeper, pc and sel (VDDH) signals. © IEEE 2003]
This schematic shows the low-swing loop-back bus and the domino level converter. Since the loop-back bus sumb has a large capacitive load, using a VDDL circuit for the bus driver is important for saving power. Note that the NMOS transistor in the VDDL circuit does not suffer from negative back-biasing. Since sum is a monotonically rising signal, the delay of INV1 does not increase. In dual-supply design, noise is one of the critical issues. To eliminate the effect of a disturbance on the loop-back bus, the receiver INV2 is placed near the 9:1 multiplexer to increase noise immunity. The output of INV2, which is a VDDL signal, is converted to a VDDH signal by the 9:1 multiplexer. The level conversion is very fast, because the domino circuit has a precharge-type structure and the precharge level is determined not by an input signal but by a precharge control signal.
INV2 is placed near the 9:1 MUX to increase noise immunity. Level conversion is done by a domino 9:1 MUX.
[Ref: Y. Shimazaki, ISSCC’03]

Measured Results: Energy and Delay
[Figure: measured energy (pJ) versus cycle time TCYCLE (ns) at room temperature, for single-supply and shared-well (VDDH=1.8V) operation. Marked points: 1.16GHz; VDDL=1.4V: energy -25.3%, delay +2.8%; VDDL=1.2V: energy -33.3%, delay +8.3%. © IEEE 2003]
This figure shows the measured ALU delay and energy consumption. The horizontal axis is the minimum cycle time of the ALU module, measured using critical-path operation. The vertical axis is the maximum energy consumption. Single-supply operation is drawn as a reference with diamond markers. When VDDH and VDDL are both equal to 1.8V, the chip operates at 1.16GHz. As VDDL decreases, the energy dissipation starts to fall. If we accept a 2.8% delay increase, the total energy saving is 25.3%; if we accept an 8.3% delay increase, the saving is 33.3%. As this figure shows, the dual-supply technique expands the power-delay optimization space, as expected.
[Ref: Y. Shimazaki, ISSCC’03]

Practical Transistor Sizing
Continuous sizing of transistors is only an option in custom design.
In ASIC design flows, the options are set by the available library.
Discrete sizing options are made possible in standard-cell design methodology by providing multiple options for the same cell.
This leads to larger libraries (> 800 cells).
Easily integrated into technology mapping.

Technology Mapping
[Figure: example logic network with signals a, b, d, f; slack=1]
Larger (higher fan-in) gates reduce capacitance, but are slower.

Technology Mapping Example: 4-input AND
(a) Implemented using 4-input NAND + INV
(b) Implemented using 2-input NAND + 2-input NOR
Library 1: High-Speed. Library 2: Low-Power.
Gate type, Area (cell unit), Input cap. (fF), Average delay (ps): INV 3 1.8 CL CL; NAND2 4 2.0 CL CL; NAND4 5 CL CL; NOR2 2.2 CL CL
(delay formula: CL in fF) (numbers calibrated for 90 nm)

Technology Mapping – Example
4-input AND: (a) NAND4 + INV, (b) NAND2 + NOR2
Area: 8 (a), 11 (b)
HS: Delay (ps): CL, CL
LP: Delay (ps): CL, CL
Sw Energy (fF): CL, CL
Area: the 4-input realization is more compact than the 2-input one (2 gates vs. 3 gates).
Timing: both implementations are 2-stage realizations. The 2nd-stage INV in (a) is a better driver than the NOR2 in (b). For more complex blocks, simpler gates will show better performance.
Energy: internal switching increases energy in the 2-input case.
The low-power library has worse delay, but lower leakage (see later).

Gate-Level Tradeoffs for Power
Technology mapping
Gate selection
Sizing
Pin assignment
Logical optimizations
Factoring
Restructuring
Buffer insertion/deletion
Don't-care optimization

Logic Restructuring
Logic restructuring to minimize spurious transitions.
Buffer insertion for path balancing.

Algebraic Transformations
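Idea: modify the network to reduce capacitance.
[Figure: two implementations of f from inputs a, b, c, with node probabilities p1=0.05, p2=0.05, p3=0.075, p4=0.75, p5=0.075; input probabilities pa = 0.1, pb = 0.5, pc = 0.5]
Caveat: this may increase activity!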
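The node probabilities on this slide can be reproduced by exhaustive enumeration. The sketch below assumes the two networks are f = a·b + a·c and f = a·(b + c); that choice is consistent with the listed probabilities but is not stated in the slide text. The helper names are illustrative, and the activity estimate p·(1 - p) assumes inputs that are independent from cycle to cycle.

```python
from itertools import product


def node_probabilities(nodes, input_probs):
    """Exact probability that each node is 1, by enumerating all input states.

    nodes       : dict name -> function(env) returning the node's logic value
    input_probs : dict input name -> probability that the input is 1
    Inputs are assumed statistically independent of each other.
    """
    names = list(input_probs)
    probs = {name: 0.0 for name in nodes}
    for values in product((0, 1), repeat=len(names)):
        env = dict(zip(names, values))
        weight = 1.0
        for name, value in env.items():
            weight *= input_probs[name] if value else 1.0 - input_probs[name]
        for name, fn in nodes.items():
            if fn(env):
                probs[name] += weight
    return probs


def activity(p):
    """0->1 transition probability, assuming no correlation between cycles."""
    return p * (1.0 - p)


INPUTS = {"a": 0.1, "b": 0.5, "c": 0.5}

# Assumed network 1: f = a*b + a*c (three gates).
sum_of_products = {
    "p1": lambda e: e["a"] and e["b"],
    "p2": lambda e: e["a"] and e["c"],
    "p3": lambda e: (e["a"] and e["b"]) or (e["a"] and e["c"]),
}
# Assumed network 2: f = a*(b + c) (two gates).
factored = {
    "p4": lambda e: e["b"] or e["c"],
    "p5": lambda e: e["a"] and (e["b"] or e["c"]),
}

if __name__ == "__main__":
    for label, net in (("a*b + a*c", sum_of_products), ("a*(b + c)", factored)):
        probs = node_probabilities(net, INPUTS)
        total = sum(activity(p) for p in probs.values())
        print(label, {k: round(v, 4) for k, v in probs.items()},
              "sum of activities:", round(total, 4))
```

Under these assumptions the factored network has one gate fewer, but its internal node b + c toggles far more often than either product term, so the summed activity goes up. That is the caveat on the slide.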

Lessons from Circuit Optimization
Joint optimization over multiple design parameters is possible using a sensitivity-based optimization framework.
Equal marginal costs ⇔ energy-efficient design.
Peak performance is VERY power inefficient: about 70% energy reduction for 20% delay penalty.
Additional variables give higher energy efficiency.
Two supply voltages are in general sufficient; 3 or more supply voltages offer only a small advantage.
The choice between sizing and supply voltage parameters depends upon circuit topology.
But … leakage has not been considered so far.

Considering Leakage @ Design Time
Considering leakage as well as dynamic power is essential in sub-100 nm technologies. Leakage is not necessarily a bad thing: increased leakage leads to improved performance, allowing for lower supply voltages. Again a trade-off issue …

Leakage – Not Necessarily a Bad Thing
[Figure: normalized energy Enorm versus Estatic/Edynamic (log scale) for Version 1 (Vth ref -180mV, 0.81 VDD max) and Version 2 (Vth ref -140mV, 0.52 VDD max). © IEEE 2004]
Topology: Inv, Add, Dec. (ELk/ESw)opt: 0.8, 0.5, 0.2.
This plot shows energy per operation for the reference design and its parallel and pipelined implementations, as the ratio of leakage to switching energy changes. Notice that the most energy-efficient designs have considerable leakage energy: at the minimum-energy point the leakage is about 50% of the switching energy, as marked by the dots on the curves. The leakage energy in the parallel design is larger than in the pipelined design because of the larger area. For that reason, the optimal threshold voltage in the parallel design is larger than in the pipelined design. Although pipelining appears marginally better overall, parallelism is much more practical because it requires no extra effort to re-time the underlying circuit blocks. Observe that the energy minimum is fairly flat over a wide range of leakage-to-switching ratios. The optimal ratio does not change much across topologies unless the activity changes by orders of magnitude, since it is a logarithmic function of activity and logic depth. Looking at the significantly different circuit topologies from a few slides back, I found that the optimal leakage-to-switching ratio did not change much. Moreover, in the range defined by these extreme cases, from 0.2 to 0.8, the energy of adder-based implementations is still very close to the minimum, as shown in this graph. A similar situation occurs if we analyze the inverter chain and the memory decoder and assume a leakage-to-switching ratio of 0.5. From this analysis we can derive a very simple general result: energy is minimized when the leakage-to-switching ratio is about 0.5, regardless of logic topology or function. This is an extremely important practical result. We can use this knowledge to determine the optimal Vdd and Vth in any design.
Optimal designs have high leakage (ELk/ESw ≈ 0.5). Must adapt to process and activity variations. [Ref: D. Markovic, JSSC’04]

Refining the Optimization Model
Switching energy
Leakage energy
with: I0(Y): normalized leakage current with inputs in state Y
The switching energy model consists of the energy-consuming transition probability, the supply voltage, and the parasitic and output load capacitances. The leakage energy model uses the standard input-state-dependent exponential leakage current model with the DIBL effect.
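One common way to write models of this kind is sketched below. These are not necessarily the exact expressions used in the chapter: the symbols α (energy-consuming transition probability), C_par and C_L (parasitic and output load capacitance), W (gate width), λ_d (DIBL coefficient), n (subthreshold slope factor), kT/q (thermal voltage) and T_cycle (cycle time) are assumptions chosen to match the verbal description, with I_0(Y) the state-dependent normalized leakage current named in the text.

```latex
% Switching energy: transition probability times total switched capacitance times Vdd^2
E_{Sw} = \alpha_{0 \to 1} \, \left( C_{par} + C_{L} \right) V_{DD}^{2}

% Leakage energy: state-dependent subthreshold current with DIBL,
% integrated over one cycle at supply V_DD
E_{Lk} = W \, I_{0}(Y) \, \exp\!\left( \frac{-\left( V_{TH} - \lambda_{d} V_{DD} \right)}{n \, kT/q} \right) V_{DD} \, T_{cycle}
```

In this form a higher V_TH reduces E_Lk exponentially, while a higher V_DD raises both terms: quadratically for E_Sw, and through both the DIBL term and the linear V_DD factor for E_Lk.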

Reducing Leakage @ Design Time
Using longer transistors: limited benefit; increase in active current.
Using higher thresholds: channel doping; stacked devices; body biasing.
Reducing the voltage!!
We can adjust three parameters at the design stage. When the threshold is higher, leakage drops exponentially. Most technologies today come with multiple thresholds, which are purely the result of channel doping. Body biasing can be used both at design time and at run time. Finally, stacking devices gives the appearance of a higher Vt, because of the body effect. Another way is to use longer transistors: they have a higher threshold (less threshold roll-off). The obvious choice is to scale the supply voltage (leakage power is quadratic in Vdd, except at very high Vdd, where it is exponential).

Longer Channels
[Figure: normalized leakage power and normalized switching energy versus transistor length (100 to 200 nm), 90 nm CMOS]
10% longer gates reduce leakage by 50%. This increases switching power by 18% with W/L = const. Doubling L reduces leakage by 5x. Impacts performance. Attractive when W does not have to be increased (e.g. memory).
Leakage current drops very rapidly when backing off just a little from the nominal transistor length: a 10% increase in length results in a 35% reduction in leakage current. This does not come for free: performance drops and the capacitive load increases (dynamic power goes up). Initially the benefit is very attractive because the curve is steep. For some sensitive circuits it makes sense to play with L.
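Whether a longer channel pays off depends on how much of the total power is leakage. The sketch below takes the two slide figures (leakage halved, switching power up by 18%) at face value and treats total power as the sum of the two components; the function names and the example leakage fractions are assumptions for illustration.

```python
def total_power_change(leak_fraction, leak_scale=0.5, switch_scale=1.18):
    """Relative change in total power after lengthening the gates.

    leak_fraction : share of leakage in total power before the change
    leak_scale    : leakage multiplier (0.5 = leakage reduced by 50%)
    switch_scale  : switching-power multiplier (1.18 = increased by 18%)
    """
    if not 0.0 <= leak_fraction <= 1.0:
        raise ValueError("leak_fraction must be in [0, 1]")
    return leak_fraction * leak_scale + (1.0 - leak_fraction) * switch_scale - 1.0


def break_even_leak_fraction(leak_scale=0.5, switch_scale=1.18):
    """Leakage share above which the longer channel reduces total power."""
    return (switch_scale - 1.0) / (switch_scale - leak_scale)


if __name__ == "__main__":
    print(f"break-even leakage share: {break_even_leak_fraction():.1%}")
    for share in (0.1, 0.3, 0.5):
        print(f"leakage share {share:.0%}: total power {total_power_change(share):+.1%}")
```

With these numbers the longer channel only helps when leakage is more than roughly a quarter of the total power, which fits the remark that the technique is attractive for leakage-dominated structures such as memory.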

Using Multiple Thresholds
There is no need for level conversion.
Dual thresholds can be added to standard design flows. High-VTh and low-VTh libraries are standard in sub-0.18 μm processes. For example, one can synthesize using only high-VTh cells and then swap in low-VTh cells in place, only where needed to improve timing. Second-VTh insertion can be combined with resizing.
Only two thresholds are needed per block. Using more than two yields small improvements.
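The "synthesize with high-VTh, then swap in low-VTh cells" flow can be sketched as a simple loop over the critical path. This is an illustrative greedy heuristic, not the procedure used by any particular synthesis tool: the netlist format, the per-cell delays, and the function names are hypothetical.

```python
def critical_path(order, fanin, delay):
    """Return (worst arrival time, list of gates on the worst path)."""
    at, prev = {}, {}
    for g in order:  # topological order: fanins come before fanouts
        latest = max(fanin[g], key=lambda i: at[i], default=None)
        prev[g] = latest
        at[g] = (at[latest] if latest is not None else 0.0) + delay[g]
    end = max(order, key=lambda g: at[g])
    path, g = [], end
    while g is not None:
        path.append(g)
        g = prev[g]
    return at[end], path[::-1]


def dual_vth_assign(order, fanin, d_high_vth, d_low_vth, t_clk):
    """Start all-high-Vth and swap low-Vth cells into the critical path
    until the worst arrival time meets t_clk. Returns the low-Vth set."""
    delay = dict(d_high_vth)
    low_vth = set()
    while True:
        worst, path = critical_path(order, fanin, delay)
        if worst <= t_clk:
            return low_vth
        candidates = [g for g in path if g not in low_vth]
        if not candidates:
            raise ValueError("timing cannot be met even with low-Vth cells")
        # Swap the cell whose low-Vth version buys the most delay.
        g = max(candidates, key=lambda c: d_high_vth[c] - d_low_vth[c])
        low_vth.add(g)
        delay[g] = d_low_vth[g]


if __name__ == "__main__":
    # Hypothetical netlist: critical path a -> b -> c, plus a lone gate d.
    order = ["a", "b", "c", "d"]
    fanin = {"a": [], "b": ["a"], "c": ["b"], "d": []}
    d_high_vth = {"a": 1.2, "b": 1.2, "c": 1.2, "d": 1.0}
    d_low_vth = {"a": 1.0, "b": 0.9, "c": 1.1, "d": 0.8}
    print(sorted(dual_vth_assign(order, fanin, d_high_vth, d_low_vth, 3.2)))
```

Because the swap is in place and needs no level conversion, cells off the critical path (gate d here) simply stay high-VTh and keep their low leakage.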

Leakage Reduction Ratio
[Figure: leakage reduction ratio versus VTH.2 (V) and VTH.3 (V) for three VTH's; VDD = 1.5V, VTH.1 = 0.3V. © IEEE 2002]
How many Vt's should we have? Vt variance determines the number of thresholds. Practically, there is not much benefit in having more than 2 thresholds, similar to the plot shown for multiple Vdd's. For a Vt2 of about 0.35V, the impact of the third threshold is very small because the minimum is shallow. This supports the conclusion that two thresholds are sufficient.
Impact of third threshold very limited. [Ref: T. Kuroda, ICCAD’02]

Using Multiple Thresholds
Cell-by-cell VTH assignment (not at block level). Achieves all-low-VTH performance with a substantial reduction in leakage.
[Figure: logic between flip-flops (FF), with high-VTH and low-VTH cells marked]
Stacked devices are used mostly for standby power reduction. Subthreshold operation is not a good mode because of its high performance penalty. Scaling Vdd and Vt simultaneously, and dealing with the lower Vt by stacking, is a good idea. If the technology offers 2 or 3 Vt's and standby power is the main concern, we can do the same as with multiple Vdd's: on the critical path we use low-Vt devices, trading leakage for performance, and on non-critical paths we use the slack to raise Vth and cut leakage power, instead of lowering Vdd to cut dynamic power. The nice thing compared to multi-Vdd is that going from low to high threshold carries no penalty, so different thresholds can be introduced arbitrarily without any design changes.
[Ref: S. Date, SLPE’94]

Dual-VT Domino
Low-threshold transistors are used only in critical paths.
[Figure: domino stages with inverters Inv1, Inv2, Inv3, data Dn and Dn+1, clocks Clkn and Clkn+1; shaded transistors are low threshold]
The same can be done for dynamic logic. Low-Vt devices are used where fast switching matters, and all other devices are high-Vt: evaluation goes through low-Vt NMOS devices, while precharge uses high-Vt PMOS devices. This is a very clever combination of NMOS and PMOS. To summarize what we do at design time: longer devices, stacking, and multiple Vth.

Multiple Thresholds and Design Methodology
Easily introduced in standard-cell design methodology by extending cell libraries with cells with different thresholds.
Selection of cells during technology mapping.
No impact on dynamic power.
No interface issues (as was the case with multiple VDD's).
Impact: can reduce leakage power substantially.

Dual-VTH Design for High-Performance Design
Columns: High-VTH Only, Low-VTH Only, Dual VTH
Total Slack: -53 psec, 0 psec
Dynamic Power: 3.2 mW, 3.3 mW
Static Power: 914 nW, 3873 nW, 1519 nW
All designs synthesized automatically using Synopsys flows. [Courtesy: Synopsys, Toshiba, 2004]

Example: High- vs. Low-Threshold Libraries
[Figure: leakage power (nW) for selected combinational tests, 130 nm CMOS] [Courtesy: Synopsys 2004]

Complex Gates Increase Ion/Ioff Ratio
[Figure: Ioff (nA) versus VDD (V) and Ion (μA) versus VDD (V), each for stack and no stack, 90nm technology]
Ion and Ioff of a single NMOS versus a stack of 10 NMOS transistors. The transistors in the stack are sized up to give similar drive.

Complex Gates Increase Ion/Ioff Ratio
[Figure: Ion/Ioff ratio (x 10^5) versus VDD (V) for stack and no stack, 90nm technology; annotated "Factor 10!"]
Stacking transistors suppresses submicron effects: reduced velocity saturation; reduced DIBL effect.
Allows for operation at lower thresholds.

Complex Gates Increase Ion/Ioff Ratio
Example: 4-input NAND, fan-in (4) versus fan-in (2).
[Figure: leakage current (nA) versus input pattern for fan-in (2) and fan-in (4)]
With transistors sized for similar performance: leakage of fan-in (2) = leakage of fan-in (4) x 3 (averaged over all possible input patterns).

Example: 32 bit Kogge-Stone Adder
[Figure: % of input vectors versus standby leakage current (μA); annotated "factor 18". © Springer 2001]
Reducing the threshold by 150 mV increases the leakage of a single NMOS transistor by a factor of 60. [Ref: S. Narendra, ISLPED’01]

Summary
Circuit optimization can lead to substantial energy reduction at limited performance loss.
Energy-delay plots are the perfect mechanism for analyzing energy-delay trade-offs.
Well-defined optimization problem over the W, VDD and VTH parameters.
Increasingly better support by today's CAD flows.
Observe: leakage is not necessarily bad, if appropriately managed.

References
Books:
A. Bellaouar, M.I. Elmasry, Low-Power Digital VLSI Design: Circuits and Systems, Kluwer Academic Publishers, 1st ed., 1995.
D. Chinnery, K. Keutzer, Closing the Gap Between ASIC and Custom, Springer, 2002.
D. Chinnery, K. Keutzer, Closing the Power Gap Between ASIC and Custom, Springer, 2007.
J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed., Prentice Hall, 2003.
I. Sutherland, B. Sproull, D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan-Kaufmann, 1st ed., 1999.
Articles:
R.W. Brodersen, M.A. Horowitz, D. Markovic, B. Nikolic, V. Stojanovic, “Methods for True Power Minimization,” Int. Conf. on Computer-Aided Design (ICCAD), pp , Nov
S. Date, N. Shibata, S. Mutoh, and J. Yamada, “1V 30MHz Memory-Macrocell-Circuit Technology with a 0.5μm Multi-Threshold CMOS,” Proceedings of the 1994 Symposium on Low Power Electronics, San Diego, CA, pp , Oct
M. Hamada, Y. Ootaguro, T. Kuroda, “Utilizing Surplus Timing for Power Reduction,” IEEE Custom Integrated Circuits Conf. (CICC), pp , Sept
F. Ishihara, F. Sheikh, B. Nikolic, “Level conversion for dual-supply systems,” Int. Conf. Low Power Electronics and Design (ISLPED), pp , Aug
P.M. Kogge and H.S. Stone, “A Parallel Algorithm for the Efficient Solution of General Class of Recurrence Equations,” IEEE Trans. Comput., vol. C-22, no. 8, pp , Aug 1973.
T. Kuroda, “Optimization and control of VDD and VTH for low-power, high-speed CMOS design,” Proceedings ICCAD 2002, pp. , San Jose, Nov

References
Articles (cont.):
H.C. Lin and L.W. Linholm, “An Optimized Output Stage for MOS Integrated Circuits,” IEEE J. Solid-State Circuits, vol. SC-10, no. 2, pp , Apr
S. Ma and P. Franzon, “Energy Control and Accurate Delay Estimation in the Design of CMOS Buffers,” IEEE J. Solid-State Circuits, vol. 29, no. 9, pp , Sept
D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, R.W. Brodersen, “Methods for True Energy-Performance Optimization,” IEEE Journal of Solid-State Circuits, vol. 39, no. 8, pp , Aug
MathWorks,
S. Narendra, S. Borkar, V. De, D. Antoniadis, A. Chandrakasan, “Scaling of stack effect and its applications for leakage reduction,” Int. Conf. Low Power Electronics and Design (ISLPED), pp , Aug
T. Sakurai and R. Newton, “Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas,” IEEE J. Solid-State Circuits, vol. 25, no. 2, pp , Apr
Y. Shimazaki, R. Zlatanovici, B. Nikolic, “A shared-well dual-supply-voltage 64-bit ALU,” Int. Conf. Solid-State Circuits (ISSCC), pp , Feb
V. Stojanovic, D. Markovic, B. Nikolic, M.A. Horowitz, R.W. Brodersen, “Energy-Delay Tradeoffs in Combinational Logic using Gate Sizing and Supply Voltage Optimization,” European Solid-State Circuits Conf. (ESSCIRC), pp , Sept
M. Takahashi et al., “A 60mW MPEG video codec using clustered voltage scaling with variable supply-voltage scheme,” IEEE Int. Solid-State Circuits Conf. (ISSCC), pp , Feb
