Lower Power Design Guide

1 Lower Power Design Guide
Prof. Jun-Dong Cho, SungKyunKwan Univ.

2 Contents
1. Introduction: Trends for High-Level Lower Power Design
2. Power Management: Clock/Cache/Memory Management
3. Architecture Level Design: Architecture Trade-offs, Transformation
4. RTL Level Design: Retiming, Loop Unrolling, Clock Selection, Scheduling, Resource Sharing, Register Allocation
5. Partitioning
6. Logic Level Design
7. Circuit Level Design
8. Quarter Sub-Micron Layout Design: Lower Power Clock Designs
9. CAD Tools
10. References

3 1. Introduction

4 Motivation
Portable, mobile (= ubiquitous = nomadic) systems with limited room for heat sinks.
Lowering power at fixed performance: DSPs in modems and cellular phones.
Reliability: increasing power increases electromigration; telecommunication products carry a 40-year reliability guarantee (product life cycle of the telecommunication industry).
Adding fans to cope with higher power causes reliability to plummet.
Higher power leads to higher packaging costs: a 2-watt package can cost four times as much as a 1-watt package.
Myriad constraints: timing, power, testability, area, packaging, time-to-market.
Ad hoc design: lacks a systematic process leading to universal applicability.

5 Power! Power! Power!

6 Power Dissipation in VLSI's
[Figure: power breakdown by clock, memory, I/O, and logic for four chips]
MPU1: low-end microprocessor for embedded use
MPU2: high-end CPU with large amount of cache
ASSP1: MPEG2 decoder
ASSP2: ATM switch

7 Current Design Issues in Lower Power Problem
Energy-hungry functions handled by network servers: InfoPad (Univ. of California, Berkeley), weight < 1 pound, 0.5 W (reflective color display) + 0.5 W (computation, communication, I/O support) = 1 W (for comparison, Alpha chip: 25 W; StrongARM at 215 MHz and 2.0 V: 0.3 W), runtime 50 hours, target: 100 MIPS/mW.
Deep sub-micron technology with low voltage for a portable full-motion video terminal: 0.5 µm: 40 AA NiMH; 1 µm: 1 AA NiMH.
System-on-a-chip to reduce external interconnection capacitances.
Power management: shut down idle units.
Power optimization techniques in the software, architecture, logic/circuit, and layout phases to reduce operations, frequency, capacitance, and switching activity while maintaining the same throughput.

8 Battery Trends

9 Road-Map in Semiconductor Device Integration

10 Road-Map in Semiconductor Device Complexity

11 Power Component
Static: leakage current (<< 1%).
Dynamic:
  Short-circuit power (10-30%): short-circuit current flows during transitions.
  Switching (or capacitive) power (70-90%): charging/discharging of capacitive loads during transitions.

12 Vdd vs Delay
Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.
Scale down device sizes to compensate for delay (interconnects do not scale proportionately and can become dominant).

13 Good Design Methodologies

14 Synthesis and Optimization
Pareto point

15 2. Power Management

16 Power Consumption in Multimedia Systems
LCD: 54.1%, HDD: 16.8%, CPU: 10.7%, VGA/VRAM: 9.6%, SysLogic: 4.5%, DRAM: 1.1%, Others: 3.2%
5-55 mode:
  Display mode (55 minutes): CPU is in sleep mode; LCD (VRAM + LCDC) is active.
  CPU mode (5 minutes): display is idle; looking up data (data retrieval).
Handwriting recognition draws the most power (memory and system bus active).

17 Power Management
DPM (Dynamic Power Management): stops the clock switching of a specific unit. The clocks are generated by clock regenerators, which produce two clocks, C1 and C2. The logic overhead is 0.3%, for 10-20% power savings.
SPM (Static Power Management): saves power dissipation in the steady state. When the system (or subsystem) remains idle for a significant period of time, the entire chip (or subsystem) is shut down.
Identify power-hungry modules and look for opportunities to reduce power.
If f is increased, one has to increase the transistor size or Vdd.

18 Power Management (christian.piguet@csemne.ch)
Use the right supply and the right frequency for each part of the system.
If one has to wait for the occurrence of some input, only a small circuit needs to wait and wake up the main circuit when the input occurs.
Another technique is to reduce the basic frequency for tasks that can be executed slowly.
PowerPC 603 is a 2-issue processor (2 instructions read at a time) with 5 parallel execution units and 4 modes:
  Full-on mode for full speed
  Doze mode, in which the execution units are not running
  Nap mode, which also stops the bus clocking
  Sleep mode, which stops the clock generator with or without the PLL (20-100 mW)
Superpipelined MIPS R4200: 5-stage pipeline; MIPS R4400: 8 stages, 2 execution units, f/2 in reduced mode.

19 TI
Two DSPs, the TMS320C541 and TMS320C542, reduce power, chip count, and system cost for wireless communication applications.
C54x DSPs (2.7 V, 5 V) belong to the Low-Power Enhanced Architecture DSP (LEAD) family. With three different power-down modes, these devices are well suited for wireless communications products such as digital cellular phones, personal digital assistants, and wireless modems, and for low-power voice coding and decoding.
The TMS320LC548 features:
  15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
  3.0- and 3.3-V operation
  32K 16-bit words of RAM and 2K 16-bit words of boot ROM on chip
  Integrated Viterbi accelerator that reduces the Viterbi butterfly update to four instruction cycles for GSM channel decoding
  Powerful single-cycle instructions (dual operand, parallel instructions, conditional instructions)
  Low-power standby modes

20 Power Estimation Techniques
Circuit simulation (SPICE): uses a set of input vectors; accurate, but limited by memory and time constraints.
Monte Carlo: randomly generated input patterns; the power per time interval T, measured with a simulator, is normally distributed.
Switch-level simulation (IRSIM): activity is defined as the number of rising and falling transitions over the total number of inputs.
PowerMill (transistor level): covers steady-state transitions, hazards and glitches, transient short-circuit current, and leakage current; measures current density and voltage drop in the power net and identifies reliability problems caused by EM failures, ground bounce, and excessive voltage drops.
DesignPower (Synopsys): simulation-based analysis is within 8-15% of SPICE; probability-based analysis is within 15-20% of SPICE.

21 Cache/Memory Management
Clock and memory consume between 15% and 45% of the total power in digital computers.
As block size increases, the energy required to service a miss increases due to increased memory access.
External-memory access (530 mA) vs. on-chip access (300 mA): replace excessive accesses to background memory with accesses to foreground memory.
Cache vertical partitioning (buffering): multi-level variable-size caches; caches are powered down when idle.
Cache horizontal partitioning (subarray access): several segments can be powered individually. Only the cache sub-bank where the requested data is located consumes power in each cache access.
Use distributed memory instead of a single centralized memory.
Exploit locality of reference to eliminate expensive data transfers across high-capacitance buses.
Cache misses consume more energy (direct-mapped or k-way set-associative mapping?); page faults consume more energy.

22 Power Management
Power management scheme by enabling the clock
Power management scheme by adding a clock generation block

23 3. Architectural Level Design

24 Architectural-level Synthesis
Translate HDL models into sequencing graphs.
Behavioral-level optimization: optimize abstract models independently from the implementation parameters.
Architectural synthesis and optimization: create the macroscopic structure (data path and control unit); consider area and delay information of the implementation.
Hardware compilation: compile the HDL model into a sequencing graph, optimize the sequencing graph, and generate gate-level interconnection for a cell library.

25 Power Measure of P

26 System-Level Solutions
Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
Temporal locality: average lifetimes of variables (less temporary storage; data referenced in the recent past is likely to be accessed again).
Precompute the physical capacitance of interconnect and the switching activity (number of bus accesses).
Architecture-driven voltage scaling: choose a more parallel architecture.
Supply voltage scaling: lowering Vdd reduces energy but increases delays.

27 Software Power Issues
Up to 40% of the on-chip power is dissipated on the buses!
System software: OS, BIOS, compilers. Software can affect energy consumption at various levels.
Inter-instruction effects:
  The energy cost of an instruction varies depending on the previous instruction.
  For example, XOR BX,1; ADD AX,DX: Iest = (319.2 + 313.6)/2 = 316.4 mA, Iobs = 323.2 mA.
  The difference is defined as the circuit state overhead.
  The overhead needs to be specified as a function of pairs of instructions.
Other effects are due to pipeline stalls and cache misses.
Instruction reordering to improve the cache hit ratio.
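
The inter-instruction arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name circuit_state_overhead is hypothetical, and the estimate is assumed to be the plain average of the per-instruction base currents, as the slide's formula suggests.

```python
def circuit_state_overhead(base_currents_ma, observed_ma):
    """Return (estimated current, circuit state overhead) in mA."""
    # Estimate: average of the per-instruction base currents.
    estimated = sum(base_currents_ma) / len(base_currents_ma)
    # Overhead: what the instruction pair draws beyond that estimate.
    return estimated, observed_ma - estimated


# Base currents for XOR BX,1 and ADD AX,DX and the observed current (slide values).
estimated, overhead = circuit_state_overhead([319.2, 313.6], 323.2)
print(f"estimated = {estimated:.1f} mA, overhead = {overhead:.1f} mA")
```

With the slide's numbers the overhead for this instruction pair comes out at about 6.8 mA.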

28 Avoiding Wasteful Computation
Preservation of data correlation
Distributed computing / locality of reference
Application-specific processing
Demand-driven operation
Transformation for memory size reduction:
  Consider that arrays A and C are already available in memory.
  When A is consumed, another array B is generated; when C is consumed, a scalar value D is produced.
  Memory size can be reduced by executing the j loop before the i loop, so that C is consumed before B is generated and the same memory space can be used for both arrays.
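
A minimal Python sketch of the loop reordering follows. The loop bodies (doubling A to get B, summing C to get D) and the function names are assumptions made for illustration only; the slide does not give the actual computations. Both arrays are assumed to have the same length.

```python
def i_loop_then_j_loop(A, C):
    """Original order: C is still live while B is generated, so B needs new storage."""
    B = [2 * a for a in A]                 # i loop: consume A, generate B
    D = sum(C)                             # j loop: consume C, produce scalar D
    peak_words = len(A) + len(C) + len(B)  # A, C and B coexist
    return B, D, peak_words


def j_loop_then_i_loop(A, C):
    """Reordered: C is consumed first, so B can overwrite C's storage."""
    D = sum(C)                             # j loop first: C is dead afterwards
    B = C                                  # reuse the same memory for B
    for i, a in enumerate(A):              # i loop writes B in place
        B[i] = 2 * a
    peak_words = len(A) + len(C)           # only two arrays ever coexist
    return B, D, peak_words


A, C = [1, 2, 3], [4, 5, 6]
print(i_loop_then_j_loop(A, list(C)))
print(j_loop_then_i_loop(A, list(C)))
```

Both orders produce the same B and D; only the peak number of live words differs.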

29 Avoiding Wasteful Computation

30 Architecture Lower Power Design
Optimum supply voltage architecture through hardware duplication (trading area for lower power) and/or pipelining.
Complex and fewer instructions require less encoding but larger decode logic!
Use small complex instructions with a shorter instruction length (e.g., Hitachi SH: 16-bit fixed-length instructions, arithmetic instructions use only two operands; NEC V800: variable-length instructions, with decoding overhead).
Superscalar: CPI < 1, parallel instruction execution. VLIW architecture.

31 Variable Supply Voltage Block Diagram
Computational work varies with time. An approach to reducing the energy consumption of such systems beyond shutdown involves the dynamic adjustment of the supply voltage based on computational workload. The basic idea is to lower the power supply when a fixed-supply system would otherwise idle for some fraction of the time. The supply voltage and clock rate are increased during high-workload periods.

32 Power Reduction using Variable Supply
Circuits with a fixed supply voltage work at a fixed speed and idle if the data sample requires less than the maximum amount of computation. Power is reduced in a linear fashion since the energy per operation is fixed.
If the workload for a given sample period is less than peak, the delay of the processing element can be increased by a factor of 1/workload without loss in throughput, allowing the processor to operate at a lower supply voltage. Thus, energy per operation varies.

33 Data Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply. Using a ping-pong buffering scheme, data samples In+2 and In+3 are buffered while In and In+1 are processed.

34 Architecture of Microcoded Instruction Set Processor

35 Power and Area
At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.

36 Datapath Parallelization

37 Memory Parallelization
To first order, P = C * (f/2) * Vdd^2

38 Pipelined Micro-P

39 Architecture Trade-Off
Pipelined implementation: Ppipeline = (1.15C)(0.58V)^2 (f) = 0.39P
Parallel implementation: Pparallel = (2.15C)(0.58V)^2 (0.5f) = 0.36P
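
The two ratios above follow from the proportionality P ~ C * Vdd^2 * f. A short Python check, using only the scaling factors given on the slide (the helper name relative_power is hypothetical):

```python
def relative_power(cap_factor, vdd_factor, freq_factor):
    """Power relative to the reference design, assuming P ~ C * Vdd^2 * f."""
    return cap_factor * vdd_factor ** 2 * freq_factor


# Pipelined: 1.15x capacitance, 0.58x supply voltage, same clock frequency.
p_pipeline = relative_power(1.15, 0.58, 1.0)
# Parallel: 2.15x capacitance, 0.58x supply voltage, half the clock frequency.
p_parallel = relative_power(2.15, 0.58, 0.5)

print(f"pipeline: {p_pipeline:.2f} P, parallel: {p_parallel:.2f} P")
```

Both options trade extra capacitance for a lower supply voltage, and the quadratic dependence on Vdd dominates.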

40 Through WAVE PIPELINING

41 Different Classes of RISC Micro-P

42 Application Specific Coprocessor
DSPs are increasingly called upon to perform tasks for which they are not ideally suited, for example, Viterbi decoding. They may also take considerably more energy than a custom solution.
Use the DSP for the portions of algorithms for which it is well suited, and craft an application-specific coprocessor (i.e., custom hardware) for other tasks.
This is an example of the difference between power and energy: the application-specific coprocessor may actually consume more power than the DSP, but it may be able to accomplish the same task in far less time, resulting in a net energy savings.
Power consumption varies dramatically with the instruction being executed.

43 Clocks per Instruction (CPI)

44 SUPERPIPELINE micro-P

45 VLIW Architecture
The compiler takes responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing these operations.
VLIW instruction decoding is easier than superscalar instruction decoding because of the fixed format and the absence of instruction dependencies.
The fixed format could place more limitations on the combination of operations.
Intel P6: CISC instructions are combined on chip to provide a set of micro-operations (i.e., a long instruction word) that can be executed in parallel.
As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one.
VLIW architectures, being simpler than N-issue machines, could be considered promising architectures for achieving high speed and low power simultaneously.

46 Synchronous VS. Asynchronous SYSTEMS
Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in computation but are required for synchronization. As technology advances, systems tend to get bigger and bigger, and as a result the delay on the clock wires can no longer be ignored. The problem of clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have to process new inputs. The biggest gate is the clock driver itself, which must switch.
Asynchronous system (self-timed): an input signal (request) starts the computation on a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems are potentially responsive to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.

47 Synchronous VS. Asynchronous SYSTEMS
Asynchronous systems are more difficult to implement, requiring explicit synchronization between communicating blocks without clocks.
If a signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.
Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
Asynchronous designs are not widely used because designers can't find the supporting design tools and methodologies they need.
The DCC error corrector of a compact cassette player saves 80% of the power compared to its synchronous counterpart.
An asynchronous design offers more architectural options/freedom, encourages distributed, localized control, and offers more freedom to adapt the supply voltage.

48 Asynchronous Modules

49 Example: ABCS protocol
6% more logic

50 Control Synthesis Flow

51 PIPELINED SELF-TIMED micro P

52 Programming Style

53 Speed vs. Power Optimization

54 VON NEUMANN VERSUS HARVARD

55 Low Vdd Main Memories

56 CACHE MEMORIES

57 Low Power Memory
Hierarchical word line: divide the memory into different blocks and access only the bit cells of the desired block.
Selective precharge: many bit lines are discharged even when their locations are not accessed, so only the bit lines that will be accessed are precharged.
Minimization of non-zero terms in the ROM table: zero terms do not switch bit lines and reduce capacitance in both bit lines and row lines.
Inverted ROM: if the number of ones is very high, the whole ROM core can be inverted.
Inverted row: a given row is inverted if more than half of its bits are non-zero terms. An extra bit is required to perform the encoding.
Sign-magnitude representation: the ROM is used to store the coefficients of a digital filter, and a significant number of the non-zero terms are due to the sign extension of the negative coefficients. The main drawback of this representation is that a conversion to two's complement is required at the end of a cycle, which slows down the ROM.
Sign magnitude and inverted block:
Difference encoding: reduces the size of the ROM core. If the values of adjacent data do not change significantly, the ROM core stores the differences between the data.
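
The inverted-row scheme can be sketched in Python as follows. This is an illustration under assumptions: rows are modeled as lists of 0/1 values, the extra bit is returned as a separate flag, and the function names encode_row and decode_row are hypothetical.

```python
def encode_row(bits):
    """Invert the row if more than half of its bits are 1.

    Returns (flag, stored_bits); flag is the extra bit marking an inverted row.
    """
    if sum(bits) > len(bits) / 2:
        return 1, [1 - b for b in bits]
    return 0, list(bits)


def decode_row(flag, stored_bits):
    """Recover the original row from the flag and the stored bits."""
    return [1 - b for b in stored_bits] if flag else list(stored_bits)


row = [1, 1, 1, 0, 1, 1, 0, 1]           # six non-zero terms out of eight
flag, stored = encode_row(row)
print(flag, stored, sum(stored))         # stored row has only two non-zero terms
```

After encoding, no stored row has more than half of its bits set, at the cost of one flag bit per row and an inversion step on read.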

58 Low Power Memory
Smaller ROMs: in a 102-tap filter, more than 70% of the coefficients are below 18 bits, yet the largest coefficient goes up to 24 bits. A better implementation can be achieved if the large coefficients are stored in a wide ROM with fewer addresses and the small coefficients are stored in a narrow ROM with many addresses. A similar approach is applied to locations in ROM that are accessed often: locations that are accessed frequently are stored in a small, fast ROM, while the other locations are stored in a larger ROM.
NMOS precharge: bit lines are precharged to Vdd - Vt. A drawback of this technique is degradation of noise margins and the body bias effect.
Buffer sizing: a large set of buffers is required in the control logic to drive the address lines through the decoder, generate the control signals for the column multiplexers, drive the row lines, and drive the precharge signals.
Voltage scaling:

59 Memory Architecture

60 Exploiting Locality for Low-Power Design
A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flow graph representation.
Two nodes are tightly connected to each other in the flow graph representation if the shortest distance between them, in terms of the number of edges traversed, is small.
Power consumption (mW) in the maximally time-shared and fully parallel versions of the QMF sub-band coder filter: an improvement by a factor of 10.5 at the expense of a 20% increase in area.
The interconnect elements (buses, multiplexers, and buffers) consume 43% and 28% of the total power in the time-shared and parallel versions, respectively.

61 Cascade filter layouts
(a) Non-local implementation from Hyper (b) Local implementation from Hyper-LP

62 Stage-Skip Pipeline
Power savings are achieved by stopping the instruction fetch and decode stages of the processor during loop execution, except for the first iteration.
DIB = Decoded Instruction Buffer
40% power savings using a DSP or RISC processor.

63 Stage-Skip Pipeline
Selector: selects the output from either the instruction decoder or the DIB.
The decoded instruction signals for a loop are temporarily stored in the DIB and are reused in each iteration of the loop.
The power wasted in a conventional pipeline is saved in the stage-skip pipeline by stopping instruction fetching and decoding during each loop execution.

64 Stage-Skip Pipeline
The majority of execution cycles in signal processing programs are used for loop execution: 40% reduction in power with a 2% area increase.

65 Parallel LIFO Scenario

66 Parallel-serial Converter

67 D flip-flop Parallelization

68 State Machine

69 Frequency Multipliers and Dividers

70 Data Reuse Exploration
A memory hierarchy (MH) introduces copies of data from larger to smaller memories in the DFG.
Power consumption decreases because data is now read mostly from smaller memories, but it also increases because extra memory transfers are introduced.
Moreover, adding another layer of hierarchy has a negative effect on area and interconnect cost.

71 Instruction Decoding Architecture of Control Logic in Microprocessor
State transition diagram, binary code mapping, hardware implementation

72 Optimizing Power using Transformation

73 Summary of Results
Optimum voltage for low power is around 1.5 V

74 Data-flow based transformations
Tree-height reduction.
Constant and variable propagation.
Common subexpression elimination.
Code motion.
Dead-code elimination.
The application of algebraic laws such as commutativity, distributivity, and associativity.
Most of the parallelism in an algorithm is embodied in the loops: loop jamming, partial and complete loop unrolling, strength reduction, loop retiming, and software pipelining.
Retiming: maximize the resource utilization.

75 Tree-height reduction
Example of tree-height reduction using commutativity and associativity
Example of tree-height reduction using distributivity

76 Sub-expression elimination
Logic expressions: performed by logic optimization; kernel-based methods.
Arithmetic expressions: search for isomorphic patterns in the parse trees.
Example: a = x + y; b = a + 1; c = x + y; becomes a = x + y; b = a + 1; c = a;
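
A toy Python pass over straight-line code illustrates the arithmetic case. It is a simplified sketch, not the pattern search in parse trees that the slide refers to: statements are assumed to be (target, (op, left, right)) tuples, only syntactically identical expressions are matched (commutativity is ignored), and the function name is hypothetical.

```python
def eliminate_common_subexpressions(statements):
    """Replace a repeated expression by the variable that already holds it."""
    available = {}                        # expression -> variable holding its value
    result = []
    for target, expr in statements:
        holder = available.get(expr)
        result.append((target, holder if holder is not None else expr))
        # Redefining target invalidates expressions held by it or reading it.
        available = {e: v for e, v in available.items()
                     if v != target and target not in e[1:]}
        if holder is None and target not in expr[1:]:
            available[expr] = target
    return result


program = [("a", ("+", "x", "y")),
           ("b", ("+", "a", 1)),
           ("c", ("+", "x", "y"))]
for statement in eliminate_common_subexpressions(program):
    print(statement)
```

On the slide's example the third statement becomes a copy of a, which saves one addition.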

77 Examples of other transformations
Dead-code elimination: in a = x; b = x + 1; c = 2 * x; the statement a = x; can be removed if a is not referenced.
Operator-strength reduction: a = x^2; b = 3 * x; becomes a = x * x; t = x << 1; b = x + t;
Code motion: for (i = 1; i < a * b) { } becomes t = a * b; for (i = 1; i < t) { }
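
The strength-reduction and code-motion examples can be written out in Python to confirm that the transformed code computes the same values. Integer inputs are assumed (the shift requires them), and the loop body, which the slide leaves empty, is assumed here to just count iterations.

```python
def original(x, a, b):
    square = x ** 2            # power operator
    triple = 3 * x             # multiplication by a constant
    iterations = 0
    i = 1
    while i < a * b:           # a * b is re-evaluated on every test
        iterations += 1
        i += 1
    return square, triple, iterations


def transformed(x, a, b):
    square = x * x             # strength reduction: power -> multiply
    t = x << 1                 # 2 * x as a shift
    triple = x + t             # 3 * x = x + 2 * x
    limit = a * b              # code motion: loop-invariant product hoisted
    iterations = 0
    i = 1
    while i < limit:
        iterations += 1
        i += 1
    return square, triple, iterations


print(original(5, 3, 4), transformed(5, 3, 4))
```

In hardware the shift is free wiring and the hoisted product is computed once, so both changes reduce switched capacitance.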

78 Strength reduction

79 Strength Reduction

80 Control-flow based transformations
Model expansion:
  Expand a subroutine to flatten the hierarchy.
  Useful to expand the scope of other optimization techniques.
  Problematic when the routine is called more than once.
  Example: x = a + b; y = a * b; z = foo(x, y); foo(p, q) { t = q - p; return(t); }
  By expanding foo: x = a + b; y = a * b; z = y - x;
Conditional expansion:
  Transform a conditional into parallel execution with the test at the end.
  Useful when the test depends on late signals.
  May preclude hardware sharing.
  Always useful for logic expressions.
  Example: y = ab; if (a) x = b + d; else x = bd;
  can be expanded to: x = a(b + d) + a'bd;
  and simplified to: y = ab; x = y + d(a + b);
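
The conditional-expansion example is a Boolean identity, so it can be checked exhaustively. The Python sketch below compares the branching form, the expanded form, and the form that shares y = ab over all eight input combinations; the function names are hypothetical.

```python
from itertools import product


def branching(a, b, d):
    # if (a) x = b + d; else x = bd;   ('+' is OR, juxtaposition is AND)
    return (b or d) if a else (b and d)


def expanded(a, b, d):
    # x = a(b + d) + a'bd
    return (a and (b or d)) or ((not a) and b and d)


def shared(a, b, d):
    # y = ab; x = y + d(a + b)
    y = a and b
    return y or (d and (a or b))


all_inputs = list(product([False, True], repeat=3))
equivalent = all(branching(*v) == expanded(*v) == shared(*v) for v in all_inputs)
print("all three forms agree:", equivalent)
```

The shared form reuses the already computed y, which is why the expansion is described as always useful for logic expressions.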

81 Pipelining

82 Associativity Transformation

83 FIR Parallelization

84 FIR PARALLELIZATION

85 FIR Filter Parallelization

86 FIR parallelization: two working phases

87 IIR filter recursive function

88 Recursive Function

89 Interlaced Accumulation Programming for Low Power

90 4. Register Transfer Level Design

91 FIR3 Block Diagram and Flow Graph

92 High-Level Power Estimation
Pcore = PDP + PMEM + PCNTR + PPROC
PDP = PREG + PMUX + PFU + PINT, where
  PREG is the power of the registers
  PMUX is the power of the multiplexers
  PFU is the power of the functional units
  PINT is the power of the physical interconnect capacitance

93 High-Level Power Estimation: PREG
Compute the lifetimes of all the variables in the given VHDL code.
Represent the lifetime of each variable as a vertical line from statement i through statement i + n in the column j reserved for the corresponding variable v_j.
Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
Estimate the minimal number N of sets of registers necessary to implement the code by using register sharing. Register sharing is applied to each group of variables with the same bit-width b_i.
Select a possible mapping of variables into registers by using register sharing.
Compute the number w_i of writes to the variables mapped to the same set of registers.
Estimate n_i for each set of registers by dividing w_i by the number of statements S: n_i = w_i / S; hence TR_i,max = n_i * f_clk.
The power of latches and flip-flops is consumed not only during output transitions, but also during all clock edges by the internal clock buffers.
The non-switching power PNSK dissipated by the internal clock buffers accounts for 30% of the average power for the 0.38-micron, 3.3 V operating system. In total,
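
The cut-line step and the write-rate estimate can be sketched in Python. Lifetimes are assumed to be given as inclusive (first statement, last statement) pairs; the variable names, the example values, and the function names are hypothetical.

```python
def max_overlapping_lifetimes(lifetimes):
    """Largest number of lifetimes crossed by any horizontal cut-line."""
    first = min(start for start, _ in lifetimes.values())
    last = max(end for _, end in lifetimes.values())
    return max(
        sum(1 for start, end in lifetimes.values() if start <= cut <= end)
        for cut in range(first, last + 1)
    )


def max_transition_rate(writes, statements, f_clk):
    """TR_i,max = n_i * f_clk with n_i = w_i / S."""
    return (writes / statements) * f_clk


# Hypothetical lifetimes: variable -> (first statement, last statement).
lifetimes = {"v1": (1, 3), "v2": (2, 5), "v3": (2, 4), "v4": (6, 7)}
registers = max_overlapping_lifetimes(lifetimes)
rate = max_transition_rate(writes=4, statements=8, f_clk=10e6)
print(registers, rate)
```

In this example v1, v2 and v3 overlap at statements 2 and 3, so at least three registers are needed, while v4 can share one of them.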

94 PCNTR
After scheduling, the control is defined and optimized by the hardware mapper, and further by the logic synthesis process, before mapping to layout. Like interconnect, therefore, the control needs to be estimated statistically.
Global control model:
Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.
Here Ntrans is the number of transitions, nstates is the number of states, Bf is the bus factor, and Clc is the capacitance switched in any local controller in one sample period. Bf is the ratio of the number of bus accesses to the number of busses.

95 Ntrans
The number of transitions depends on assignment, scheduling, optimizations, logic optimization, the standard cell library used, the amount of glitching, and the statistics of the inputs.

96 Behavioral Synthesis
Loop unrolling: localize the data to reduce the activity at the inputs of the functional units, or compute two output samples in parallel from two input samples. Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation, the transformation yields a critical path of 3, so the voltage can be dropped.
Clock selection: choose the optimal system clock period; eliminate slack, improve resource utilization, and enable greater voltage scaling.
Module selection: for each operation, choose a library template.
Flow graph restructuring: pull operations out of the critical cycle.

97 High-Level Power Estimation: PMUX and PFU

98 Critical Path
Critical path: the longest-delay path from input to output in combinational logic.
It determines the operating clock frequency.
Resizing of non-critical-path transistors (in-place optimization).
Critical path in synchronous sequential logic

99 Loop Unrolling for Low Power

100 Retiming
Flip-flop insertion to minimize hazard activity
Moving a flip-flop in a circuit

101 Exploiting spatial locality for interconnect power reduction
[Figure: global vs. local interconnect between Adder1 and Adder2]

102 Balancing maximal time-sharing and fully-parallel implementation
A fourth-order parallel-form IIR filter: (a) local assignment (2 global transfers), (b) non-local assignment (20 global transfers)

103 Retiming/pipelining for Critical path

104 Effective Resource Utilization

105 Hazard propagation elimination by clocked sampling
By sampling a steady-state signal at a register input, no more glitches are propagated through the following combinational logic.

106 Latched Retiming

107 Latched retiming

108 Regularity
Common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers, and buses). Regular designs often have less control hardware.

109 Module Selection
Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (or where to put registers), such that a minimal hardware cost is obtained under the given timing and throughput constraints.
Full pipelining is ineffective because of clock period mismatches between the execution times of the operators. Performing operations in sequence without intermediate buffering can result in a reduction of the critical path.
When clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph should be maximized.
During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.

110 Estimation
Estimate min and max bounds on the required resources in order to:
  delimit the design space (the min bounds serve as an initial solution)
  serve as entries in a resource utilization table that guides the transformation, assignment, and scheduling operations
Max bound on execution time is tmax: topological ordering of the DFG using ASAP and ALAP.
Minimum bounds on the number of resources for each resource class, where
  NRi: the number of resources of class Ri
  dRi: the duration of a single operation
  ORi: the number of operations

111 Exploring the Design Space
Find the minimal-area solution subject to the timing constraints.
By checking the critical paths, determine whether the proposed graph violates the timing constraints. If so, retiming, pipelining, and tree-height reduction can be applied.
After an acceptable graph is obtained, the resource allocation process is initiated:
  change the available hardware (FUs, registers, busses)
  redistribute the time allocation over the sub-graphs
  transform the graph to reduce the hardware requirements
Use a rejectionless probabilistic iterative search technique (a variant of simulated annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.

112 Data path Synthesis

113 Scheduling and Binding
The scheduling task selects the control step in which a given operation will happen, i.e., it assigns each operation to an execution cycle.
Sharing: bind a resource to more than one operation. Operations must not execute concurrently.
Graphs are scheduled hierarchically in a bottom-up fashion.
Power tradeoffs:
  Shorter schedules enable supply voltage (Vdd) scaling.
  The schedule directly impacts resource sharing.
  Energy consumption depends on what the previous instruction was.
  Reordering to minimize the switching on the control path.
Clock selection:
  Eliminate slack.
  Choose the optimal system clock period.

114 ASAP Scheduling Algorithm
HAL Example

115 ALAP Scheduling Algorithm
HAL Example
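
Since the slide figures are not available, here is a minimal Python sketch of ASAP and ALAP scheduling. It assumes unit-delay operations and unlimited resources, and it uses a small hypothetical five-operation DFG rather than the HAL example from the slides.

```python
def asap(preds):
    """Earliest control step (1-based) for each operation."""
    step = {}

    def visit(op):
        if op not in step:
            # One step after the latest predecessor; sources start at step 1.
            step[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return step[op]

    for op in preds:
        visit(op)
    return step


def alap(preds, latency):
    """Latest control step for each operation under a latency bound."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    step = {}

    def visit(op):
        if op not in step:
            # One step before the earliest successor; sinks end at the bound.
            step[op] = min((visit(s) for s in succs[op]), default=latency + 1) - 1
        return step[op]

    for op in preds:
        visit(op)
    return step


# Hypothetical DFG: operation -> list of predecessors.
dfg = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
early, late = asap(dfg), alap(dfg, latency=3)
mobility = {op: late[op] - early[op] for op in dfg}
print(early, late, mobility)
```

The mobility (ALAP step minus ASAP step) is the slack that list scheduling and force-directed scheduling use when choosing where to place an operation.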

116 Force Directed Scheduling
Used as priority function. Force is related to concurrency. Sort operations for least force. Mechanical analogy: Force = constant displacement. constant = operation-type distribution. displacement = change in probability. SungKyunKwan Univ.

117 Force Directed Scheduling

118 Example : Operation V6 SungKyunKwan Univ.

119 Force-Directed Scheduling
Algorithm (Paulin) SungKyunKwan Univ.

120 Force-Directed Scheduling Example
Figure panels: (a) probability of scheduling operations into control steps; (b) operator cost for multiplications in (a); (c) probability of scheduling operations into control steps after operation o3 is scheduled to step s2; (d) operator cost for multiplications in (c).

121 List Scheduling
Figure panels: DFG with mobility labeling (inside <>); ready operation list and resource constraint; the scheduled DFG.
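A resource-constrained list scheduler of the kind pictured here can be sketched in a few lines. The sketch assumes unit-delay operations that all compete for one resource type and uses mobility as the priority (smallest mobility first); the graph, the priorities, and the function name are hypothetical.

```python
def list_schedule(ops, preds, priority, limit):
    """Schedule unit-delay operations using at most `limit` units per step."""
    schedule, done, step = {}, set(), 1
    while len(schedule) < len(ops):
        # Ready list: unscheduled operations whose predecessors have finished.
        ready = [v for v in ops
                 if v not in schedule and all(p in done for p in preds[v])]
        ready.sort(key=lambda v: priority[v])      # most urgent first
        chosen = ready[:limit]                      # resource constraint
        for v in chosen:
            schedule[v] = step
        done.update(chosen)                         # results usable next step
        step += 1
    return schedule

ops = ["add1", "add2", "add3", "mul1"]
preds = {"add1": [], "add2": [], "add3": ["add1"], "mul1": ["add3", "add2"]}
mobility = {"add1": 0, "add2": 1, "add3": 0, "mul1": 0}

one_unit = list_schedule(ops, preds, mobility, limit=1)
two_units = list_schedule(ops, preds, mobility, limit=2)
print(one_unit, two_units)
```

With one unit the operation with slack (add2) is deferred until the critical operations have been placed; with two units the schedule reaches the critical path length.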

122 Static-List Scheduling
DFG Partial schedule of five nodes Priority list The final schedule SungKyunKwan Univ.

123 Choosing Optimal Clock Period

124 Supply Voltage Scaling
Lowering Vdd reduces energy but increases delay.

125 Scheduling Using Shut-down: |a-b|

126 Loop Scheduling Sequential Execution Partial loop unrolling
Loop folding SungKyunKwan Univ.

127 Loop folding Reduce execution delay of a loop.
Pipeline operations inside a loop. Overlap execution of operations. Need a prologue and epilogue. Use pipeline scheduling for loop graph model. SungKyunKwan Univ.

128 DFG Restructuring DFG2 DFG2 after redundant operation insertion

129 Minimizing the bit transitions for constants during Scheduling

130 Control Synthesis Synthesize circuit that:
Executes scheduled operations. Provides synchronization. Supports: Iteration. Branching. Hierarchy. Interfaces. SungKyunKwan Univ.

131 Allocation Bind a resource to more than one operation.

132 Optimum binding SungKyunKwan Univ.

133 Example SungKyunKwan Univ.

134 RESOURCE SHARING SungKyunKwan Univ.
Parallel vs. time-shared buses (or execution units). Resource sharing can destroy signal correlations and increase switching activity, so it should be done between operations that are strongly connected. Map operations with correlated input signals to the same units. Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify the interconnect (busses, multiplexers, buffers).

135 Datapath interconnections
Bus-oriented datapath Multiplexer-oriented datapath SungKyunKwan Univ.

136 Sequential Execution Example of three micro-operations in the same clock period SungKyunKwan Univ.

137 Insertion of Latch (out)
Insertion of latches at the output ports of the functional units SungKyunKwan Univ.

138 Insertion of Latch (in/out)
Insertion of latches at both the input and output ports of the functional units SungKyunKwan Univ.

139 Overlapping Data Transfer(in)
Overlapping read and write data transfers SungKyunKwan Univ.

140 Overlapping of Data Transfer (in/out)
Overlapping data transfer with functional-unit execution SungKyunKwan Univ.

141 Register Allocation Using Clique Partitioning
Scheduled DFG Graph model Lifetime intervals of variable Clique-partitioning solution SungKyunKwan Univ.

142 Left-Edge Algorithm Register allocation using Left-Edge Algorithm

143 Register Allocation: Left-Edge Algorithm
Sorted variable lifetime intervals Five-register allocation result SungKyunKwan Univ.
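The left-edge algorithm can be sketched as follows. The lifetimes below are hypothetical and are not the five-register example from the slide; the sketch assumes half-open lifetimes, so a variable that starts exactly when another ends may reuse its register.

```python
def left_edge(lifetimes):
    """Pack variable lifetimes (name -> (start, end)) into registers."""
    pending = sorted(lifetimes, key=lambda v: lifetimes[v])  # by left edge
    registers = []
    while pending:
        current, free_at, rest = [], None, []
        for v in pending:
            start, end = lifetimes[v]
            if free_at is None or start >= free_at:
                current.append(v)       # fits after the last variable placed
                free_at = end
            else:
                rest.append(v)          # overlaps, try the next register
        registers.append(current)
        pending = rest
    return registers

lifetimes = {"a": (0, 2), "b": (0, 1), "c": (1, 3), "d": (2, 4), "e": (3, 5)}
registers = left_edge(lifetimes)
print(registers)
```

Each pass fills one register with as many non-overlapping lifetimes as possible, scanning in order of increasing start time.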

144 Register Allocation Allocation: bind registers and functional modules to variables and operations in the CDFG, and specify the interconnection among modules and registers in terms of MUXes or buses. Reduce capacitance during allocation by minimizing the number of functional modules, registers, and multiplexers. A composite weight with respect to transition activity and capacitive load is incorporated into the CDFG. Find the edge with the highest composite weight and merge the two nodes it joins, i.e., map the corresponding variables to the same register. Allocation continues, updating the composite weight values, until no edges are left in the CDFG. Set the maximum number of operations alive in any control step to one. Sequence operations and variables to enhance signal correlations.

145 Exploiting spatial locality for interconnect power reduction
A spatially local cluster is a group of algorithm operations that are tightly connected to each other in the flowgraph representation. Two nodes are tightly connected on the flowgraph if the shortest distance between them, in terms of the number of edges traversed, is low. A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware. Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (over a local bus) and relatively few occur between clusters (over a global bus). The partitioning information is passed to the architecture netlist and floorplanning tools. Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to another adder's inputs.

146 Hardware Mapping The last step in the synthesis process maps the allocated, assigned and scheduled flow graph (called the decorated flow graph) onto the available hardware blocks. The result of this process is a structural description of the processor architecture, (e.g., sdl input to the Lager IV silicon assembly environment). The mapping process transforms the flow graph into three structural sub-graphs: the data path structure graph the controller state machine graph the interface graph (between data path control inputs and the controller output signals) SungKyunKwan Univ.

147 Spectral Partitioning in High-Level Synthesis
The eigenvector placement forms an ordering in which nodes that are tightly connected to each other are placed close together. The relative distances are a measure of the tightness of the connections. Use the eigenvector ordering to generate several partitioning solutions. The area estimates are based on distribution graphs; a distribution graph displays the expected number of operations executed in each time slot. Local bus power: the number of local data transfers times the area of the cluster. Global bus power: the number of global data transfers times the total area.

148 Finding a good Partition

149 Interconnection Estimation
For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about 30-40% of the datapath height. Average global bus length: the square root of the estimated chip area. The three terms of the area estimate represent white space, active area of the components, and wiring area; the coefficients are derived statistically.

150 Incorporating into HYPER-LP

151 Experiments SungKyunKwan Univ.

152 Datapath Generation Register file recognition and multiplexer reduction: individual registers are merged as much as possible into register files, which reduces the number of bus multiplexers, the overall number of busses (since all registers in a file share the input and output busses), and the number of control signals (since a register file uses a local decoder). Minimize the multiplexers and I/O busses simultaneously (clique partitioning is NP-complete, so simulated annealing is used). Datapath partitioning optimizes the processor floorplan. The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.

153 Hardware Mapper SungKyunKwan Univ.

154 Hyper's Basic Architecture Model

155 Hyper's Crossbar Network

156 Refined Architecture Model

157 Bus Merging SungKyunKwan Univ.

158 Fanin Bus Merging SungKyunKwan Univ.

159 Fanout Bus merging SungKyunKwan Univ.

160 Global bus Merging SungKyunKwan Univ.

161 Test Example SungKyunKwan Univ.

162 Control Signal Assignment

163 Efficient High Level Synthesis Algorithm for Lower Power Design
임세진, 조 준 동 SungKyunKwan Univ.

164 Contents High-Level Synthesis Existing High-Level Low-Power Methods
Minimum Cost Flow Algorithm. Scheduling for low power. Register and resource allocation methods. Experimental method and results. Conclusion.

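165 The Need for Low-Power Design Continuing increase in the power consumption of today's ICs
- Increase in the number of transistors on a single chip - Increasingly complex circuit functions - Increase in clock speed. Recent emergence of systems that require low power: - Battery-powered products such as portable cellular phones, pagers, notebook computers, PDAs, and LCDs - ULSI microprocessors - Parallel computers. Other: - High cost of special cooling devices and limited heat dissipation of circuits - Slow growth in battery life.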

166 High-Level Synthesis Stages

167 High-Level Synthesis
Behavioral description of the circuit (code fragment as it appears on the slide): for(I=0;I<=2;I=I+1begin @(posedge clk); if(fgb[I]%8; begin p=rgb[I]%8; g=filter(x,y)*8; end. Elements of the behavioral description: instructions, operations, variables, arrays, signals. High-level synthesis tasks: scheduling, hardware allocation, memory inferencing, register sharing, control inferencing, under constraints. Resulting RTL (register transfer level) architecture: control, datapath, and memory, built from operators, registers, memory, and multiplexers.

168 Basic High-Level Synthesis Flow

169 Low-Power Methods Proposed at the High Level
Operator sharing for sibling operations [Fang, 96]. Resource sharing that takes data correlation into account [Gebotys, 97]. Shutting down FUs (demand-driven operation) [Alidina, 94]. Exploiting the regularity of operations [Rabaey, 96]. Using dual supply voltages [Sarrafzadeh, 96]. Minimizing spurious operations [Hwang, 96]. Need to account for switching activity, which depends on the sequence of operations/variables assigned to a resource. Proposed algorithm: minimize switching activity using a minimum cost flow algorithm, and minimize capacitance by simplifying the interconnect structure [Cho, 97].

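170 Register Power Consumption Model
$\mathrm{Power(Register)} = \mathrm{switching}(x) \times (C_{out,Mux} + C_{in,Register}) + \mathrm{switching}(y) \times (C_{out,Register} + C_{in,DeMux})$. Since $\mathrm{switching}(x) = \mathrm{switching}(y)$, $\mathrm{Power(Register)} = \mathrm{switching}(y) \times C_{total}$.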

171 Determining the Number of Registers and Resources [Figure: variables a, b, c, d, e, f, g, h over control steps 1 to 4]

172 CDFG (control data flow graph)
Example code: e=a+b; g=c+d; f=e+b; h=f*g; [Figure: CDFG of the example with adders +1 (output e), +2 (output g), +3 (output f) and multiplier *1 (output h)]

173 Minimizing Spurious Operations

174 Scheduling Method for Low Power Scheduling without low-power considerations Scheduling with low-power considerations

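175 Weights and Capacities for Register Allocation
Input: V, the network for register sharing. Output: V disjoint paths. W = -(M - Wa*N): edge weight. Wa: switching probability between two nodes. N: factor that converts the switching probability to an integer. M: an integer greater than or equal to the maximum of Wa*N. L: an integer greater than or equal to the maximum of W. V: the number of registers. Edge labels in the figure: Cij=1, Bij=L and Cij=1, Bij=W.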

176 Objective Function and Constraints of the Minimum Cost Flow Algorithm
Minimize Z=

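177 Minimum Cost Flow Algorithm Step 1: flow = 0;
Step 2: Define the modified cost Bij*, which is determined by the flow currently present in the network, as follows. Step 3: Using a shortest-path algorithm from S to T with the modified costs from Step 2, send δ units of flow along that path, where δ = min(δ1, δ2), δ1 = min{Cij - Xij} over all forward edges, and δ2 = min{Xij} over all backward edges. Step 4: Increase the current flow by δ and return to Step 2. Step 5: Repeat the steps above until the current flow equals V.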
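These steps correspond to the successive-shortest-path method, which can be sketched as below. The sketch assumes the usual objective, the sum of Bij * Xij over all edges subject to 0 <= Xij <= Cij (the slide's formula is not in this transcript), models the modified costs with reverse residual edges of negated cost, and uses Bellman-Ford as the shortest-path routine. The small network and all names are hypothetical.

```python
def min_cost_flow(n, edges, s, t, target):
    """edges: (u, v, capacity, cost) tuples. Returns (flow sent, total cost)."""
    graph = [[] for _ in range(n)]
    to, cap, cost = [], [], []
    for u, v, c, b in edges:
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(b)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-b)
    inf = float("inf")
    flow = total = 0
    while flow < target:
        dist, prev = [inf] * n, [-1] * n
        dist[s] = 0
        for _ in range(n - 1):                 # Bellman-Ford allows negative costs
            for u in range(n):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev[to[e]] = e
        if dist[t] == inf:
            break                              # no augmenting path is left
        delta, v = target - flow, t            # bottleneck along the path
        while v != s:
            e = prev[v]
            delta = min(delta, cap[e])
            v = to[e ^ 1]                      # paired edge points back to the tail
        v = t
        while v != s:
            e = prev[v]
            cap[e] -= delta
            cap[e ^ 1] += delta
            v = to[e ^ 1]
        flow += delta
        total += delta * dist[t]
    return flow, total

# Hypothetical network: S=0, nodes 1 and 2, T=3.
edges = [(0, 1, 2, 1), (0, 2, 1, 4), (1, 2, 1, 1), (1, 3, 1, 5), (2, 3, 2, 1)]
result = min_cost_flow(4, edges, s=0, t=3, target=2)
print(result)
```

Here the first augmentation uses the path S-1-2-T (cost 3) and the second uses S-2-T (cost 5), for a total cost of 8 at flow 2.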

178 Constructing the Initial Network Graph G (capacity/cost) Residual graph: Gr Flow graph: Gf

179 Shortest Path S-1-2-T < G > < Gf > < Gr >

180 Shortest Path S-1-T < G > < Gr > < Gf >

181 Shortest Path S-2-T < G > < Gr > < Gf >

182 Shortest Path S-2-1-T < G > < Gr > < Gf >

183 Result Minimum cost Z=48

184 Forming the Network from the Register Compatibility Graph Compatibility graph Network before node splitting

185 Node Splitting and Applying the Algorithm Network after node splitting Result of applying the algorithm

186 Result PATH 1 : S-a-e-f-T REG1 : a, e, f PATH 2 : S-b-T REG2 : b
PATH 3 : S-c-g-T REG3 : c, g PATH 4 : S-d-T REG4 : d

187 Resource Allocation Method for Low Power
When two operators are shared, the inputs arrive as a sequence; choose the smaller of the two cases (a,e) + (b,b) and (a,b) + (b,e).

188 Increase and Decrease in Multiplexers Depending on Where Variables Are Stored

189 Weights and Capacities for Resource Allocation
Input: V, the network for resource sharing. Output: V disjoint paths. W = -[M - (Wa*N + Wmux*K)]: edge weight. Wmux: interconnect-structure weight. K: normalization constant. Wmux = 0: when variables i and j are assigned to the same register and to the same input port of the module. Wmux = 1: when variables i and j are assigned to different registers and the module's ... V: the number of resources. Edge labels in the figure: Cij=1, Bij=L and Cij=1, Bij=W.

190 Applying the Minimum Cost Flow Algorithm to Resource Allocation
Compatibility graph. Network before node splitting. Network after node splitting. PATH 1 : S-1-2-T adder 1 : +1 , +2 PATH 2 : S-3-T adder 2 : +3

191 Final Datapath after Register and Resource Allocation

192 Experimental Procedure

193 Computing the Switching Rate (Hamming Distance Ratio, Wsa)
Functional simulation of the CDFG. Exclusive-OR of two variables, normalized by the bit-width. Faster to measure than at the logic level or layout level.
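The Hamming-distance ratio described here can be sketched directly: XOR two values, count the differing bits, and divide by the bit-width. The function names and the sample values are hypothetical; in practice the values would come from functional simulation of the CDFG.

```python
def switching_ratio(x, y, bit_width):
    """Fraction of bits that toggle when a resource goes from value x to y."""
    mask = (1 << bit_width) - 1
    return bin((x ^ y) & mask).count("1") / bit_width

def average_switching(values, bit_width):
    """Mean toggle ratio over consecutive values seen by one register or unit."""
    pairs = list(zip(values, values[1:]))
    return sum(switching_ratio(x, y, bit_width) for x, y in pairs) / len(pairs)

single = switching_ratio(0b1010, 0b0110, bit_width=4)
trace = average_switching([0b0000, 0b1111, 0b1110], bit_width=4)
print(single, trace)
```

A ratio computed this way can serve as the switching probability between two variables when edge weights are assigned.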

194 Characteristics of the Benchmark Circuits Resource allocation +(2), *(2), reg(7)
+(2) *(2),sub(1), reg(7)

195 Experimental Results

196 The Datapath after Placement & Routing in Compass (using a 0.6 micron gate array)

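197 Conclusion Hamming-distance objective function for minimizing switching activity
Scheduling and resource allocation for low-power implementation. Average power reduction of 15%. Uses an algorithm that gives the optimal result in bounded time (a polynomial-time, optimal-solution algorithm). Applicable to high-performance, low-power top-down high-level design (DSP, microcontroller, ASIC, etc.).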

198 Cascade Filter SungKyunKwan Univ.

199 Cascade Filter Scheduling

200 Finite Impulse Response Filter

201 FIR Scheduling SungKyunKwan Univ.

202 Infinite Impulse Response Filter

203 IIR Filter Scheduling SungKyunKwan Univ.

204 References
[1] D. Gajski and N. Dutt, High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.
[2] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, Inc., 1994.
[3] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE J. of Solid-State Circuits, 1992.
[4] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, "Optimizing power using transformations," IEEE Tr. on CAD/ICAS, Jan.
[5] E. Musoll and J. Cortadella, "Scheduling and resource binding for low power," Int'l Symp. on System Synthesis, Apr.
[6] Y. Fang and A. Albicki, "Joint scheduling and allocation for low power," in Proc. of Int'l Symp. on Circuits & Systems, May.
[7] J. Monteiro and Pranav Ashar, "Scheduling techniques to enable power management," 33rd Design Automation Conference, 1996.
[8] R. S. Martin and J. P. Knight, "Optimizing power in ASIC behavioral synthesis," IEEE Design & Test of Computers, 1995.

205 [9] R. Mehra and J. Rabaey, "Exploiting regularity for low power design," IEEE Custom Integrated Circuits Conference.
[10] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low power CMOS digital design," Journal of Solid State Circuits, 1992.
[11] R. Mehra and J. Rabaey, "Behavioral level power estimation and exploration," in Proc. of Int'l Symp. on Low Power Design, Apr.
[12] A. Raghunathan and N. K. Jha, "An iterative improvement algorithm for low power data path synthesis," in Proc. of Int'l Conf. on Computer-Aided Design, Nov.
[13] R. Mehra and J. Rabaey, "Low power architectural synthesis and the impact of exploiting locality," Journal of VLSI Signal Processing.
[14] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, "Predictive system shutdown and other architectural techniques for energy efficient programmable computation," IEEE Tr. on VLSI Systems, Mar.
[15] A. Abnous and J. M. Rabaey, "Ultra low power domain specific multimedia processors," in Proc. of IEEE VLSI Signal Processing Workshop, Oct.

206 [16] M. C. McFarland, A. C. Parker, and R. Camposano, "The high level synthesis of digital systems," Proceedings of the IEEE, no. 2, February 1990.
[17] A. Chandrakasan, S. Sheng, and R. Brodersen, "Low power CMOS digital design," IEEE Solid State Circuit, April 1992.
[18] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design, Kluwer Academic Publishers, 1995.
[19] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation based sequential logic optimization for low power," IEEE International Conference on Computer Aided Design, 1994.
[20] J. Monteiro, S. Devadas, and A. Ghosh, "Retiming sequential circuits for low power," in Proceedings of the IEEE International Conference on Computer Aided Design, November 1993.
[21] F. J. Kurdahi and A. C. Parker, "REAL: A program for register allocation," in Proc. of the 24th Design Automation Conference, ACM/IEEE, June 1987.

207 5. Partitioning SungKyunKwan Univ.

208 Partitioning in VLSI CAD
Partitioning is a technique widely used to solve diverse problems occurring in VLSI CAD. Applications of partitioning can be found in logic synthesis, logic optimization, testing, and layout synthesis. High-quality partitioning is critical in high-level synthesis. To be useful, high-level synthesis algorithms should be able to handle very large systems. Typically, designers partition high-level design specifications manually into procedures, each of which is then synthesized individually. However, logic decomposition of the design into procedures may not be appropriate for high-level and logic-level synthesis [60]. Different partitionings of the high-level specifications may produce substantial differences in the resulting IC chip areas and overall system performance. Some technology mapping programs use partitioning techniques to map a circuit specified as a network of modules performing simple Boolean operations onto a network composed of specific modules available in an FPGA. SungKyunKwan Univ.

209 Partitioning in VLSI CAD
Since the test generation problem for large circuits may be extremely computationally intensive, circuit partitioning may provide the means to speed it up. In general, the problem of test pattern generation is NP-complete. To date, all test generation algorithms that guarantee finding a test for a given fault exhibit worst-case behavior requiring CPU time that increases exponentially with the circuit size. If the circuit can be partitioned into k parts (k not fixed), each of bounded size c, then the worst-case test generation time is reduced to a time linearly related to the circuit size. Partitioning is often used in layout synthesis to produce and/or improve the placement of the circuit modules. It is used to find strongly connected subcircuits in the design, and some placement algorithms use the resulting information to place components belonging to such subcircuits in mutual proximity, thus minimizing delays and routing lengths.

210 Partitioning in VLSI CAD
Another important class of partitioning problems occurs at the system design level. Since IC packages can hold only a limited number of logic components and external terminals, the components must be partitioned into subcircuits small enough to be implemented in the available packages. Partitioning has been used as well to estimate some properties of physical IC designs, such as the expected IC area. SungKyunKwan Univ.

211 Circuit Partitioning SungKyunKwan Univ.
The early attempts to solve the circuit partitioning problem were based on the representation of the circuit as a graph G = (V,E), where V is a set of nodes (vertices) representing the fundamental components, such as gates, flip-flops, inputs and outputs and E is a set of edges representing nets present in the network. Graph partitioning problems representing VLSI design problems usually involve separating the set of the graph nodes into disjoint subsets while optimizing some objective function defined on the graph vertices and edges. In the partitioned graph, edges can be divided into two classes: inter-subset edges whose vertices belong to different subsets, and intra-subset edges whose vertices belong to the same subset. The objective functions associated with the graph partitioning problems usually treat these classes of edges in different ways. One classic graph partitioning problem is the minimum cut (mincut) problem. Its objective is to divide V into two disjoint parts, U and W, such that the number of the inter-subset edges is minimized. The set e(U,W) is referred to as a cut set, and the number of edges in cut set as the cut value. SungKyunKwan Univ.

212 Circuit Partitioning graph and physical representation

213 VHDL example Behavioral description process communication
control/data flow graph Behavioral description SungKyunKwan Univ.

214 Mincut Partitioning SungKyunKwan Univ.
An exact solution to the mincut problem was provided by Ford and Fulkerson [11], who transformed the mincut problem into the maximum flow (maxflow) problem. The maxflow-mincut algorithm finds a maximum flow in a network; the maxflow value is equal to the mincut value. The first heuristic algorithm for two-way graph partitioning into equal-sized subsets was proposed by Kernighan and Lin. Their method consists of choosing an initial partition randomly and reducing the cut value by exchanging appropriately selected pairs of nodes from the subsets. After exchanging positions, the nodes are locked in their new positions. In subsequent steps, pairs of unlocked nodes are selected and exchanged until all nodes are locked. The algorithm stops when it reaches a local minimum. Most nets in digital circuits are multi-point connections among more than two modules (logic gates, flip-flops, etc.). Therefore, modeling VLSI circuit partitioning problems as graph partitioning problems may lead to poor results, caused by the inadequate representation of multi-point nets, which have to be decomposed into two-point connections. One way to approximate circuit partitioning problems is to transform the circuit into a weighted graph representation G' via a net model. For example, a multi-point net connecting n nodes may be modeled as a complete graph (clique) spanned on these nodes, i.e., containing all possible edges among these nodes.

215 Clustering (Cont’d) Clustering based on criterion B below the first cut-line, then criterion A Clustering based on criterion A below the second cut-line, then criterion B SungKyunKwan Univ.

216 Clustering Example Two-cluster Partition Three-cluster Partition
SungKyunKwan Univ.

217 Complexity of Partitioning
In general, computing the optimal partitioning is an NP-complete problem, which means that the best known algorithms take time which is an exponential function of n=|N| and p, and it is widely believed that no algorithm whose running time is a polynomial function of n=|N| and p exists (see ``Computers and Intractability'', M. Garey and D. Johnson, W. H. Freeman, 1979, for details.) Therefore we need to use heuristics to get approximate solutions for problems where n is large. The picture below illustrates a larger graph partitioning problem; it was generated using the spectral partitioning algorithm as implemented in the graph partitioning software by Gilbert et al, described below. The partition is N = Nblue U Nblack, with red edges connecting nodes in the two partitions. SungKyunKwan Univ.

218 Edge Separator and Vertex Separator
Bisecting a graph G=(N,E) can be done in two ways. In the last section, we discussed finding the smallest subset Es of E such that removing Es from E divides G into two disconnected subgraphs G1 and G2, with nodes N1 and N2 respectively, where N1 U N2 = N and N1 and N2 are disjoint and equally large. (If the number of nodes is odd, we obviously cannot make |N1|=|N2|, so we will call Es an edge separator if |N1| and |N2| are sufficiently close; we will be more explicit about how different |N1| and |N2| can be only when necessary.) The edges in Es connect nodes in N1 to nodes in N2. Since removing Es disconnects G, Es is called an edge separator. The other way to bisect a graph is to find a vertex separator, a subset Ns of N such that removing Ns and all incident edges from G also results in two disconnected subgraphs G1 and G2 of G. In other words, N = N1 U Ns U N2, where all three subsets of N are disjoint, N1 and N2 are equally large, and no edges connect N1 and N2. The following figure illustrates these ideas. The green edges, Es1, form an edge separator, as do the blue edges, Es2. The red nodes, Ns, are a vertex separator, since removing them and the incident edges (Es1, Es2, and the purple edges) leaves two disjoint subgraphs. Theorem. (Tarjan, Lipton, "A separator theorem for planar graphs", SIAM J. Appl. Math., 36, April 1979). Let G=(N,E) be a planar graph. Then we can find a vertex separator Ns, so that N = N1 U Ns U N2 is a disjoint partition of N, |N1| <= (2/3)*|N|, |N2| <= (2/3)*|N|, and |Ns| <= sqrt(8*|N|).

219 Kernighan and Lin Algorithm
The algorithm is due to B. Kernighan and S. Lin ("An effective heuristic procedure for partitioning graphs", The Bell System Technical Journal, Feb 1970) and takes O(|N|^3) time per iteration. A more complicated and more efficient implementation, which takes only O(|E|) time per iteration, was presented by C. Fiduccia and R. Mattheyses, "A linear-time heuristic for improving network partitions", Technical Report 82CRD130, General Electric Co., Corporate Research and Development Center, Schenectady, NY, 1982.
We start with an edge-weighted graph G=(N,E,WE) and a partitioning N = A U B into equal parts: |A| = |B|. Let w(e) = w(i,j) be the weight of edge e=(i,j), where the weight is 0 if no edge e=(i,j) exists. The goal is to find equal-sized subsets X in A and Y in B such that exchanging X and Y reduces the total cost of edges from A to B. More precisely, we let
   T = sum[ a in A and b in B ] w(a,b) = cost of edges from A to B
and seek X and Y such that new_A = (A - X) U Y and new_B = (B - Y) U X have a lower cost new_T. To compute new_T efficiently, we introduce:
   E(a) = external cost of a = sum[ b in B ] w(a,b)
   I(a) = internal cost of a = sum[ a' in A, a' != a ] w(a,a')
   D(a) = cost of a = E(a) - I(a)
and analogously
   E(b) = external cost of b = sum[ a in A ] w(a,b)
   I(b) = internal cost of b = sum[ b' in B, b' != b ] w(b,b')
   D(b) = cost of b = E(b) - I(b)
Then it is easy to show that swapping a in A and b in B changes T to
   new_T = T - ( D(a) + D(b) - 2*w(a,b) ) = T - gain(a,b)
In other words, gain(a,b) = D(a) + D(b) - 2*w(a,b) measures the improvement in the partitioning obtained by swapping a and b. D(a') and D(b') also change to
   new_D(a') = D(a') + 2*w(a',a) - 2*w(a',b) for all a' in A, a' != a
   new_D(b') = D(b') + 2*w(b',b) - 2*w(b',a) for all b' in B, b' != b

220 Kernighan and Lin Algorithm
(0) Compute T = cost of partition N = A U B ... cost = O(|N|^2)
Repeat
   (1) Compute costs D(n) for all n in N
   (2) Unmark all nodes in G ... cost = O(|N|)
   (3) While there are unmarked nodes ... |N|/2 iterations
      (3.1) Find an unmarked pair (a,b) maximizing gain(a,b)
      (3.2) Mark a and b (but do not swap them) ... cost = O(1)
      (3.3) Update D(n) for all unmarked n, as though a and b had been swapped
   End while
   ... At this point, we have computed a sequence of pairs (a1,b1), ..., (ak,bk)
   ... and gains gain(1), ..., gain(k), where k = |N|/2, ordered by the order in which we marked them
   (4) Pick j maximizing Gain = sum[ i=1...j ] gain(i)
   ... Gain is the reduction in cost from swapping (a1,b1), ..., (aj,bj)
   (5) If Gain > 0 then
      (5.1) Update A = A - {a1,...,aj} U {b1,...,bj} ... cost = O(|N|)
      (5.2) Update B = B - {b1,...,bj} U {a1,...,aj}
      (5.3) Update T = T - Gain ... cost = O(1)
   End if
Until Gain <= 0
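One pass of the procedure (steps 1 to 5) can be sketched as follows. This is a straightforward, unoptimized illustration rather than the authors' implementation: edge weights are stored in a dictionary keyed by unordered node pairs, and the four-node graph and the function name are hypothetical.

```python
def kl_pass(w, A, B):
    """One Kernighan-Lin pass; w maps frozenset({u, v}) to an edge weight."""
    def wt(u, v):
        return w.get(frozenset((u, v)), 0)

    A, B = set(A), set(B)
    D = {}                                   # D(n) = external - internal cost
    for a in A:
        D[a] = sum(wt(a, b) for b in B) - sum(wt(a, x) for x in A if x != a)
    for b in B:
        D[b] = sum(wt(b, a) for a in A) - sum(wt(b, x) for x in B if x != b)

    free_a, free_b = set(A), set(B)          # unmarked nodes
    pairs, gains = [], []
    while free_a and free_b:
        a, b = max(((a, b) for a in sorted(free_a) for b in sorted(free_b)),
                   key=lambda p: D[p[0]] + D[p[1]] - 2 * wt(p[0], p[1]))
        gains.append(D[a] + D[b] - 2 * wt(a, b))
        pairs.append((a, b))
        free_a.remove(a)
        free_b.remove(b)
        for x in free_a:                     # update as though a, b were swapped
            D[x] += 2 * wt(x, a) - 2 * wt(x, b)
        for y in free_b:
            D[y] += 2 * wt(y, b) - 2 * wt(y, a)

    best_gain, best_j, running = 0, 0, 0     # best prefix of the pair sequence
    for j, g in enumerate(gains, start=1):
        running += g
        if running > best_gain:
            best_gain, best_j = running, j
    for a, b in pairs[:best_j]:
        A.remove(a); A.add(b)
        B.remove(b); B.add(a)
    return A, B, best_gain

w = {frozenset((1, 3)): 3, frozenset((1, 2)): 1,
     frozenset((2, 4)): 2, frozenset((3, 4)): 1}
new_A, new_B, gain = kl_pass(w, {1, 2}, {3, 4})
print(new_A, new_B, gain)
```

In this example the initial cut costs 5; the pass swaps nodes 1 and 4, which reduces the cut to 2. Repeating the pass until the returned gain is 0 gives the outer Repeat/Until loop.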

221 Spectral Partitioning
This is a powerful but expensive technique, based on techniques introduced by Fiedler in the 1970s but popularized in 1990 by A. Pothen, H. Simon, and K.-P. Liou, "Partitioning sparse matrices with eigenvectors of graphs", SIAM J. Matrix Anal. Appl., 11. We will first describe the algorithm, and then give three related justifications for its efficacy. Let G=(N,E) be an undirected, unweighted graph without self edges (i,i) or multiple edges from one node to another. We define two matrices related to this graph.
Definition. The incidence matrix In(G) of G is an |N|-by-|E| matrix, with one row for each node and one column for each edge. Suppose edge e=(i,j). Then column e of In(G) is zero except for the i-th and j-th entries, which are +1 and -1, respectively. Note that there is some ambiguity in this definition, since G is undirected; writing edge e=(i,j) instead of (j,i) is equivalent to multiplying column e of In(G) by -1. We will see that this ambiguity will not be important to us.
Definition. The Laplacian matrix L(G) of G is an |N|-by-|N| symmetric matrix, with one row and column for each node. It is defined as follows:
   (L(G))(i,j) = degree of node i (number of incident edges) if i=j
               = -1 if i!=j and there is an edge (i,j)
               = 0 otherwise
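Both definitions can be checked on a small graph with a sketch like the one below. Nodes are numbered from 0, matrices are plain nested lists, and the function names are hypothetical. The last helper multiplies the incidence matrix by its own transpose, which for these definitions reproduces the Laplacian regardless of the sign chosen in each column.

```python
def incidence(n, edges):
    """|N|-by-|E| matrix with +1 and -1 marking the two ends of each edge."""
    m = [[0] * len(edges) for _ in range(n)]
    for e, (i, j) in enumerate(edges):
        m[i][e], m[j][e] = 1, -1
    return m

def laplacian(n, edges):
    """Degree on the diagonal, -1 for every edge, 0 elsewhere."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def times_transpose(m):
    """Compute m * m' for a nested-list matrix m."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in m]
            for row_i in m]

edges = [(0, 1), (1, 2)]                 # a path on three nodes
In = incidence(3, edges)
L = laplacian(3, edges)
print(In, L, times_transpose(In) == L)
```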

222 Spatial Locality: Hardware Partitioning
The interface logic should be properly partitioned for area and timing reasons. Minimizing global busses leads to lower bus capacitance and thus lower interconnect power. Signal values within the clusters tend to be more highly correlated. The datapath should be partitioned into parts of approximately equal size. In the DSP area, datapaths tend to occupy far more area than the control paths. Wiring is still one of the dominant area consumers. The method used to identify clusters is based on the eigenvalues and eigenvectors of the Laplacian of the graph. The eigenvector corresponding to the second smallest eigenvalue provides a 1-D placement of the nodes that minimizes the mean-squared connection length.

223 Spectral Partitioning in VLSI placement

224 Spectral Partitioning in VLSI placement
Setting the derivative of the Lagrangian, L, to zero gives an eigenvalue equation: its solutions are the pairs in which λ is an eigenvalue and x is the corresponding eigenvector. The smallest eigenvalue, 0, gives a trivial solution with all nodes at the same point. The eigenvector corresponding to the second smallest eigenvalue minimizes the cost function while giving a non-trivial solution.
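The step from the Lagrangian to the eigenvalue problem can be written out as follows. This is a sketch under assumed notation, since the slide's own equations did not survive in the transcript: Q denotes the Laplacian of the graph with symmetric connection weights c_ij, x is the vector of 1-D node coordinates, and the normalization x^T x = 1 is the assumed constraint that rules out the all-zero placement.

```latex
\begin{aligned}
z &= \tfrac{1}{2}\sum_{i,j} c_{ij}\,(x_i - x_j)^2 = x^{T} Q x,
  \qquad \text{subject to } x^{T}x = 1,\\
\mathcal{L} &= x^{T} Q x - \lambda\,(x^{T}x - 1),\\
\frac{\partial \mathcal{L}}{\partial x} &= 2Qx - 2\lambda x = 0
  \;\Longrightarrow\; Qx = \lambda x,\\
z &= x^{T} Q x = \lambda\, x^{T} x = \lambda .
\end{aligned}
```

The last line shows why the cost of a solution equals its eigenvalue, so the smallest nonzero eigenvalue gives the cheapest non-trivial placement.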

225 Key Ideas in Spectral Partitioning

226 Spectral Partitioning

227 Spectral Partitioning
The following theorem states some important facts about In(G) and L(G). It introduces the idea that the eigenvalues and eigenvectors of L(G) are related to the connectivity of G.
Theorem 1. Given a graph G, its associated matrices In(G) and L(G) have the following properties.
1. L(G) is a symmetric matrix. This means the eigenvalues of L(G) are real, and its eigenvectors are real and orthogonal.
2. Let e=[1,...,1]', where ' means transpose, i.e. the column vector of all ones. Then L(G)*e = 0.
3. In(G)*(In(G))' = L(G). This is independent of the signs chosen in each column of In(G).
4. Suppose L(G)*v = lambda*v, where v is nonzero. Then lambda = norm(In(G)'*v)^2 / norm(v)^2 = ( sum[ all edges e=(i,j) ] (v(i)-v(j))^2 ) / ( sum[ i ] v(i)^2 ), where norm(z)^2 = sum[ i ] z(i)^2.
5. The eigenvalues of L(G) are nonnegative: 0 <= lambda1 <= lambda2 <= ... <= lambdan.
6. The number of connected components of G is equal to the number of lambdai equal to 0. In particular, lambda2 != 0 if and only if G is connected.

228 Spectral Partitioning
Compute the eigenvector v2 corresponding to lambda2 of L(G)
for each node n of G
   if v2(n) < 0
      put node n in partition N-
   else
      put node n in partition N+
   endif
endfor
First we show that this partition is at least reasonable, because it tends to give connected components N- and N+:
Theorem 2. (M. Fiedler, "A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory", Czech. Math. J. 25, 1975.) Let G be connected, and N- and N+ be defined by the above algorithm. Then N- is connected. If no v2(n) = 0, N+ is also connected.
There are a number of reasons lambda2 is called the algebraic connectivity. Here is another.
Theorem 3. (Fiedler). Let G=(N,E) be a graph, and G1=(N,E1) a subgraph, i.e. with the same nodes and a subset of the edges, so that G1 is "less connected" than G. Then lambda2(L(G1)) <= lambda2(L(G)), i.e. the algebraic connectivity of G1 is also less than or equal to the algebraic connectivity of G.
Motivation for spectral bisection, by analogy with a vibrating string. How does a taut string vibrate when it is plucked? From our background in either physics or music, we know that it has certain modes of vibration or harmonics. If we were to take snapshots of these modes, they would look like this:
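The bisection loop above can be tried on a small graph with the sketch below. To stay dependency-free it approximates v2 with shifted power iteration (iterating with shift*I - L while projecting out the all-ones vector) instead of the Lanczos-type solvers normally used, so it is only suitable for tiny examples. The graph, the iteration count, and the function names are assumptions.

```python
import math

def laplacian(n, edges):
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def fiedler_vector(L, iters=500):
    """Approximate the eigenvector of the second smallest eigenvalue of L."""
    n = len(L)
    shift = 2 * max(L[i][i] for i in range(n)) + 1   # larger than any eigenvalue
    v = [math.sin(i + 1) for i in range(n)]          # arbitrary fixed start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]                    # remove the all-ones part
        w = [shift * v[i] - sum(L[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def spectral_bisection(n, edges):
    v2 = fiedler_vector(laplacian(n, edges))
    minus = [i for i in range(n) if v2[i] < 0]
    plus = [i for i in range(n) if v2[i] >= 0]
    return minus, plus

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
minus, plus = spectral_bisection(6, edges)
print(minus, plus)
```

The two parts come out as the two triangles, so only the bridge edge is cut. Which triangle lands in N- depends on the sign of the computed eigenvector.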

229 Spectral Partitioning
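The spectral bisection procedure (compute v2, then split the nodes by sign) can be sketched as follows. The slides do not say how v2 is computed; this sketch assumes a shifted power iteration with the all-ones vector projected out, which is one simple option and not necessarily what a production partitioner uses. The example graph (two triangles joined by one edge) and all names are hypothetical.

```python
import math

def laplacian(n, edges):
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def fiedler_vector(L, iters=500):
    """Approximate v2 by power iteration on (shift*I - L), orthogonal to e = [1,...,1]'."""
    n = len(L)
    # All eigenvalues of L lie in [0, 2*max degree], so the shifted matrix is nonnegative definite.
    shift = 2.0 * max(L[i][i] for i in range(n))
    v = [math.sin(i + 1.0) for i in range(n)]      # deterministic, non-symmetric start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]                  # remove the lambda1 = 0 component
        w = [shift * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def spectral_bisection(n, edges):
    v2 = fiedler_vector(laplacian(n, edges))
    n_minus = [k for k in range(n) if v2[k] < 0]
    n_plus = [k for k in range(n) if v2[k] >= 0]
    return n_minus, n_plus

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n_minus, n_plus = spectral_bisection(6, edges)
```

On this graph the two triangles end up in different partitions, so only the bridging edge is cut. Which triangle is labelled N- depends on the sign of the computed eigenvector, which is arbitrary.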

230 Multilevel Kernighan-Lin
Gc is computed in step (1) of Recursive_partition as follows. We define a matching of a graph G=(N,E) as a subset Em of the edges E with the property that no two edges in Em share an endpoint. A maximal matching is one to which no more edges can be added while remaining a matching. We can compute a maximal matching by a simple random algorithm:
let Em be empty
mark all nodes in N as unmatched
for i = 1 to |N| ... visit the nodes in a random order
    if node i has not been matched,
        choose an edge e=(i,j) where j is also unmatched, and add it to Em
        mark i and j as matched
    end if
end for
Given a matching, Gc is computed as follows. We let there be a node r in Nc for each edge in Em. Then we construct Ec as follows:
for r = 1 to |Em| ... for each node in Nc
    let (i,j) be the edge in Em corresponding to node r
    for each other edge e=(i,k) in E incident on i
        let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
        add the edge (r,rk) to Ec
    end for
    for each other edge e=(j,k) in E incident on j
        let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
        add the edge (r,rk) to Ec
    end for
end for
if there are multiple edges between pairs of nodes of Nc, collapse them into single edges
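A minimal sketch of the matching and coarsening steps is given below. Function names, the seed parameter, and the example path graph are assumptions for illustration. The slide only creates coarse nodes for matched edges; this sketch additionally assumes that each unmatched node becomes its own coarse node, so that no node of G is lost.

```python
import random

def maximal_matching(n, edges, seed=0):
    """Random maximal matching: visit nodes in random order, match with an unmatched neighbour."""
    rng = random.Random(seed)
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    matched = [False] * n
    Em = []
    order = list(range(n))
    rng.shuffle(order)
    for i in order:
        if matched[i]:
            continue
        free = [j for j in adj[i] if not matched[j]]
        if free:
            j = rng.choice(free)
            Em.append((i, j))
            matched[i] = matched[j] = True
    return Em

def coarsen(n, edges, Em):
    """Build Gc = (Nc, Ec): one coarse node per matched edge (assumption: singletons kept)."""
    coarse_of = {}
    for r, (i, j) in enumerate(Em):
        coarse_of[i] = r
        coarse_of[j] = r
    next_id = len(Em)
    for i in range(n):
        if i not in coarse_of:           # unmatched node becomes its own coarse node
            coarse_of[i] = next_id
            next_id += 1
    Ec = set()
    for i, j in edges:
        a, b = coarse_of[i], coarse_of[j]
        if a != b:                       # edges inside a matched pair disappear
            Ec.add((min(a, b), max(a, b)))   # the set collapses multiple edges
    return next_id, sorted(Ec), coarse_of

# Example: a path of 8 nodes.
n = 8
edges = [(i, i + 1) for i in range(n - 1)]
Em = maximal_matching(n, edges, seed=1)
nc, Ec, coarse_of = coarsen(n, edges, Em)
```

Because every matched edge merges two nodes into one, the coarse graph has |N| - |Em| nodes, which is what makes the recursive scheme cheaper at each level.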

231 Multilevel Kernighan-Lin
Note that we can take node weights into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS, and is illustrated on the right.
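Heavy edge matching differs from the random matching only in how the partner j is chosen. The sketch below is an illustration, not METIS code: it visits nodes in index order instead of random order (an assumption made to keep the example deterministic), and the function name and the weighted example graph are hypothetical.

```python
def heavy_edge_matching(n, weighted_edges):
    """weighted_edges: dict {(i, j): weight}. Each node picks its heaviest free edge."""
    adj = {i: [] for i in range(n)}
    for (i, j), w in weighted_edges.items():
        adj[i].append((w, j))
        adj[j].append((w, i))
    matched = [False] * n
    Em = []
    for i in range(n):                       # assumption: index order, not random order
        if matched[i]:
            continue
        free = [(w, j) for w, j in adj[i] if not matched[j]]
        if free:
            w, j = max(free)                 # heaviest edge to an unmatched neighbour
            Em.append((i, j))
            matched[i] = matched[j] = True
    return Em

# Hypothetical 4-node graph with two heavy and two light edges.
weights = {(0, 1): 1, (0, 2): 5, (1, 3): 4, (2, 3): 1}
Em = heavy_edge_matching(4, weights)
hidden_weight = sum(weights[e] for e in Em)   # weight removed from any future cut
```

Here the two heavy edges are absorbed into coarse nodes, so only the two light edges remain available to be cut at the coarser level.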

232 Multilevel Kernighan-Lin
Given a partition (Nc+,Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+,N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. This is again shown below:
Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.

233 Multilevel Spectral Partitioning
Now we turn to the divide-and-conquer algorithm of Barnard and Simon, which is based on spectral partitioning rather than Kernighan-Lin. The expensive part of spectral bisection is finding the eigenvector v2, which requires a possibly large number of matrix-vector multiplications with the Laplacian matrix L(G) of the graph G. The divide-and-conquer approach of Recursive_partition will dramatically decrease the cost. Barnard and Simon perform step (1) of Recursive_partition, computing Gc = (Nc,Ec) from G=(N,E), slightly differently than above: they find a maximal independent subset Nc of N. This means that N contains Nc and E contains Ec, no nodes in Nc are directly connected by edges in E (independence), and no further node of N can be added to Nc without violating independence (maximality). There is a simple "greedy" algorithm for finding an Nc:
Nc = empty set
for i = 1 to |N|
    if node i is not adjacent to any node already in Nc
        add i to Nc
    end if
end for
This is shown below in the case where G is simply a chain of 9 nodes with nearest neighbor connections, in which case Nc consists simply of every other node of N.
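The greedy algorithm and the 9-node chain example can be reproduced with a few lines. Nodes are numbered from 0 here rather than from 1 as on the slide, and the function name is an assumption.

```python
def greedy_independent_set(n, edges):
    """Greedy maximal independent set: add node i if none of its neighbours is already in Nc."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    Nc = []
    in_nc = set()
    for i in range(n):
        if adj[i].isdisjoint(in_nc):
            Nc.append(i)
            in_nc.add(i)
    return Nc

# Chain of 9 nodes with nearest-neighbour connections.
chain = [(i, i + 1) for i in range(8)]
Nc = greedy_independent_set(9, chain)   # every other node: 0, 2, 4, 6, 8
```

The result depends on the visiting order; a different order can give a different (possibly smaller) maximal independent set.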

234 hMETIS
hMETIS is a set of programs for partitioning hypergraphs such as those corresponding to VLSI circuits. The algorithms implemented by hMETIS are based on the multilevel hypergraph partitioning scheme described in [KAKS97].
hMETIS produces bisections that cut 10% to 300% fewer hyperedges than those cut by other popular algorithms such as PARABOLI, PROP, and CLIP-PROP, especially for circuits with over 100,000 cells, and circuits with non-unit cell area.
It is extremely fast! A single run of hMETIS is faster than a single run of simpler schemes such as FM, KL, or CLIP. Furthermore, because of its very good average cut characteristics, it produces high quality partitionings in significantly fewer runs. It can bisect circuits with over 100,000 vertices in a couple of minutes on Pentium-class workstations.
The performance of hMETIS on the new ISPD98 benchmark suite can be found in the paper by Chuck Alpert.

235 How good is Recursive Bisection?
Horst D. Simon and Shang-Hua Teng, Report RNR, August 1993
The most commonly used p-way partitioning method is recursive bisection. It first "optimally" divides the graph (mesh) into two equal sized pieces and then recursively divides the two pieces. We show that, due to its greedy nature and the lack of global information, recursive bisection may, in the worst case, produce a partition that is very far from the optimal one. Our negative result is complemented by two positive ones. First, we show that for some important classes of graphs that occur in practical applications, such as well shaped finite element and finite difference meshes, recursive bisection is normally within a constant factor of the optimal one. Secondly, we show that if the balance condition is relaxed so that each block in the partition is bounded by (1+e)n/p, then there exists an approximately balanced recursive partitioning scheme that finds a partition whose cost is within an O(log p) factor of the cost of the optimal p-way partition.

236 Partitioning Algorithm with Multiple Constraints
Jun-Dong Cho, SungKyunKwan Univ.

237 Charging and Discharging Due to Switching
Accounts for up to 90% of total power consumption

238 Partitioning for Low Power
Conventional method: the number of edges crossing the cut
Low power: the amount of switching activity on the edges crossing the cut
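The difference between the two objectives can be made concrete with a small sketch. The four-node path, the switching activities, and the function name are all hypothetical values chosen for illustration.

```python
def cut_costs(edges, side):
    """edges: list of (u, v, switching_activity); side: dict mapping node -> 0 or 1.

    Returns (number of cut edges, total switching activity on cut edges)."""
    crossing = [(u, v, act) for u, v, act in edges if side[u] != side[v]]
    return len(crossing), sum(act for _, _, act in crossing)

# Path a - b - c - d; the net between a and b toggles often, the one between b and c rarely.
edges = [("a", "b", 0.9), ("b", "c", 0.1), ("c", "d", 0.5)]

cut_after_a = {"a": 0, "b": 1, "c": 1, "d": 1}
cut_after_b = {"a": 0, "b": 0, "c": 1, "d": 1}

cost_a = cut_costs(edges, cut_after_a)   # one cut edge, activity 0.9
cost_b = cut_costs(edges, cut_after_b)   # one cut edge, activity 0.1
```

Both partitions cut exactly one edge, so a conventional min-cut objective cannot tell them apart, while the low-power objective prefers the second because the cut net switches far less often.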

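239 Min-Cost Flow Algorithm
Finds the cheapest way to send a given amount of flow to the desired destination
Each arc has a capacity and a cost
Max-flow min-cut: considers only the number of edges
Min-cost flow: assigns each edge a weight for its switching activity
Cost: switching activity vs. number of edges
Capacity: the maximum amount that can flow through an edge
Larger capacity is given so that edges with lower cost are more likely to be selected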

240 Network and Mincost Flow

241 Graph Transformation Algorithm
Find the min-cost flow paths
A graph transformation is needed in order to find the cut
Topological sort by level

242 Graph Transformation Algorithm
Added nodes and edges

243 Graph Transformation

244 Partitioning with constraints

245 Algorithm
Input: Flow f, Network
Output: Partition of the network into f subnetworks
Step 1: Push flow f through the graph and run the min-cost flow algorithm. If every partition satisfies A_upper or P_upper, stop; otherwise increase f (f = f+1) and repeat Step 1 until the upper bounds are satisfied.
Step 2: If there are two partitions p and q that do not satisfy A_lower or P_lower, and [merge condition missing in the source], then p and q can be merged; min-cost matching is applied over all possible {p,q} sets to reduce the number of partitions.


248 6. Logic Level Design

249 Node Transition Activity

250 Low Activity XOR Function

251 GLITCH (Spurious transitions)
15-20% of the total power is due to glitching.

252 Glitches

253 Hazard Generation in Logic Circuits
Static hazard: a transient pulse of width w (= the delay of the inverter). Dynamic hazard: the transient consists of three edges, two rising and one falling, with w of two units. Each input can have several arriving paths.

254 High-Performance Power Distribution
(S: switching probability; C: capacitance)
Start with all logic at the lowest power level; then successive iterations of delay calculation, identification of the failing blocks, and powering up are done until either all of the nets pass their delay criteria or the maximum power level is reached.
Voltage drops in ground and supply wires use up a more serious fraction of the total noise margin.

255 Logic Transformation
Use a signal with low switching activity to reduce the activity on a highly active signal. This is done by adding a redundant connection from the gate with low activity (the source gate) to the gate with high switching activity (the target gate).
Signals a, b, and g1 have very high switching activity. Signal c is zero most of the time.
Suppose c and g1 are selected as the source and target of a new connection. The new connection is undetectable, hence the function of the new circuit remains the same.
Because signal c has long runs of zeros, and zero is the controlling value of the AND gate g1, most of the switching activity at the inputs of g1 will not be seen at its output, so the switching activity of gate g1 is reduced.
A redundant connection added to a circuit may cause some irredundant connections to become redundant. After the new connection is added, the connections from c to g3 become redundant.

256 Logic Transformation

257 Logic Transformation

258 Frequency Reduction
Power saving
    Reduces capacitance on the clock network
    Reduces internal power in the affected registers
    Reduces need for muxes (data recirculation)
Opportunity
    Large opportunity for power reduction, dependent on:
    the number of registers gated
    the percentage of time the clock is enabled
Cost
    Testability
    Complicates clock tree synthesis
    Complicates clock skew balancing

259 GATED-CLOCK D-FLIP-FLOP
Flip-flops present a large internal capacitance on the internal clock node. If the DFF output does not switch, the DFF does not have to be clocked.

260 Frequency Reduction Clock Gating Example - When D is not equal to Q

261 Frequency Reduction Clock Gating Example - Before Code
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity nongate is
  port (clk, rst : in  std_logic;
        data_in  : in  std_logic_vector(31 downto 0);
        data_out : out std_logic_vector(31 downto 0));
end nongate;

architecture behave of nongate is
  signal load_en  : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin
  -- Modulo-10 counter with synchronous, active-low reset
  FSM : process
  begin
    wait until clk'event and clk = '1';
    if rst = '0' then
      count <= 0;
    elsif count = 9 then
      count <= 0;
    else
      count <= count + 1;
    end if;
  end process FSM;

  -- load_en is high for one cycle out of every ten
  enable_logic : process (count)
  begin
    if count = 9 then
      load_en <= '1';
    else
      load_en <= '0';
    end if;
  end process enable_logic;

  -- The register is clocked every cycle; it takes new data only when load_en = '1'
  datapath : process
  begin
    wait until clk'event and clk = '1';
    if load_en = '1' then
      data_reg <= data_in;
    end if;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_nongate of nongate is
  for behave
  end for;
end cfg_nongate;

262 Frequency Reduction Clock Gating Example - After Code
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity gate is
  port (clk, rst : in  std_logic;
        data_in  : in  std_logic_vector(31 downto 0);
        data_out : out std_logic_vector(31 downto 0));
end gate;

architecture behave of gate is
  signal load_en, load_en_latched, clk_en : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin

263 Frequency Reduction
  FSM : process
  begin
    wait until clk'event and clk = '1';
    if rst = '0' then
      count <= 0;
    elsif count = 9 then
      count <= 0;
    else
      count <= count + 1;
    end if;
  end process FSM;

  enable_logic : process (count)
  begin
    if count = 9 then
      load_en <= '1';
    else
      load_en <= '0';
    end if;
  end process enable_logic;

  -- Latch that is transparent while clk = '0', so load_en cannot glitch the gated clock
  deglitch : process (clk, load_en)
  begin
    if clk = '0' then
      load_en_latched <= load_en;
    end if;
  end process deglitch;

  clk_en <= clk and load_en_latched;

  -- The register sees a clock edge only in the cycle in which it loads
  datapath : process
  begin
    wait until clk_en'event and clk_en = '1';
    data_reg <= data_in;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_gate of gate is
  for behave
  end for;
end cfg_gate;

264 Frequency Reduction Clock Gating Example - Report

265 Frequency Reduction 4-bit Synchronous & Ripple counter - code
4-bit Synchronous Counter
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity BINARY is
  Port (clk   : In std_logic;
        reset : In std_logic;
        count : BUFFER UNSIGNED (3 downto 0));
end BINARY;

architecture BEHAVIORAL of BINARY is
begin
  -- All four bits are clocked by clk; asynchronous, active-low reset
  process (reset, clk)
  begin
    if (reset = '0') then
      count <= "0000";
    elsif (clk'event and clk = '1') then
      if (count = UNSIGNED'("1111")) then
        count <= "0000";
      else
        count <= count + UNSIGNED'("1");
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_BINARY_BLOCK_BEHAVIORAL of BINARY is
  for BEHAVIORAL
  end for;
end CFG_BINARY_BLOCK_BEHAVIORAL;

266 Frequency Reduction
4-bit Ripple Counter
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity RIPPLE is
  Port (clk   : In std_logic;
        reset : In std_logic;
        count : BUFFER UNSIGNED (3 downto 0));
end RIPPLE;

architecture BEHAVIORAL of RIPPLE is
  signal count0, count1, count2 : std_logic;
begin
  -- Copies of the lower bits, used as clocks for the next stage
  process (count)
  begin
    count0 <= count(0);
    count1 <= count(1);
    count2 <= count(2);
  end process;

  -- Bit 0 toggles on every rising edge of clk
  process (reset, clk)
  begin
    if (reset = '0') then
      count(0) <= '0';
    elsif (clk'event and clk = '1') then
      if (count(0) = '1') then
        count(0) <= '0';
      else
        count(0) <= '1';
      end if;
    end if;
  end process;

  -- Bit 1 toggles on every rising edge of bit 0
  process (reset, count0)
  begin
    if (reset = '0') then
      count(1) <= '0';
    elsif (count0'event and count0 = '1') then

267 Frequency Reduction
      if (count(1) = '1') then
        count(1) <= '0';
      else
        count(1) <= '1';
      end if;
    end if;
  end process;

  -- Bit 2 toggles on every rising edge of bit 1
  process (reset, count1)
  begin
    if (reset = '0') then
      count(2) <= '0';
    elsif (count1'event and count1 = '1') then
      if (count(2) = '1') then
        count(2) <= '0';
      else
        count(2) <= '1';
      end if;
    end if;
  end process;

  -- Bit 3 toggles on every rising edge of bit 2
  process (reset, count2)
  begin
    if (reset = '0') then
      count(3) <= '0';
    elsif (count2'event and count2 = '1') then
      if (count(3) = '1') then
        count(3) <= '0';
      else
        count(3) <= '1';
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_RIPPLE_BLOCK_BEHAVIORAL of RIPPLE is
  for BEHAVIORAL
  end for;
end CFG_RIPPLE_BLOCK_BEHAVIORAL;

268 Frequency Reduction 4-bit Synchronous & Ripple counter - Report

269 Bus-Invert Coding for Low Power I/O
An eight-bit bus on which all eight lines toggle at the same time and which has a high peak (worst-case) power dissipation. There are 16 transitions over 16 clock cycles (average 1 transition per clock cycle).

270 Peak Power Dissipation
An eight-bit bus on which the eight lines toggle at different moments and which has a low peak power dissipation. There are the same 16 transitions over 16 clock cycles and thus the same average power dissipation.

271 Bus-Invert - Coding for low power
The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value; when invert = 1 the bus value is the inverted data value. The peak power dissipation can then be decreased by half by coding the I/O as follows:
1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value.
2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value).
3. Otherwise, let invert = 0 (and let the next bus value equal the next data value).
4. At the receiver side the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored encoded as it is (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
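The four coding steps can be sketched in a few lines. This is a minimal software model under stated assumptions: the bus and the invert line both start at 0, words are n-bit integers, and the function names and the test sequence are made up for illustration.

```python
def popcount(x):
    return bin(x).count("1")

def bus_invert_encode(words, n=8):
    """Return a list of (bus_value, invert) pairs for a sequence of n-bit data words."""
    mask = (1 << n) - 1
    bus, inv = 0, 0                       # assumption: bus and invert line start at 0
    coded = []
    for data in words:
        # Step 1: Hamming distance between (bus, invert) and (data, 0).
        dist = popcount((bus ^ data) & mask) + inv
        if dist > n / 2:                  # Step 2: send the inverted word
            bus, inv = (~data) & mask, 1
        else:                             # Step 3: send the word unchanged
            bus, inv = data & mask, 0
        coded.append((bus, inv))
    return coded

def bus_invert_decode(coded, n=8):
    """Step 4: the receiver conditionally inverts according to the invert line."""
    mask = (1 << n) - 1
    return [(~bus) & mask if inv else bus for bus, inv in coded]

def transitions_per_slot(coded):
    """Number of toggling lines (data lines plus invert line) in each time slot."""
    prev_bus, prev_inv = 0, 0
    counts = []
    for bus, inv in coded:
        counts.append(popcount(prev_bus ^ bus) + (prev_inv ^ inv))
        prev_bus, prev_inv = bus, inv
    return counts

data = [0x00, 0xFF, 0x00, 0xFF, 0x0F, 0xF0, 0xAA, 0x55]
coded = bus_invert_encode(data)
per_slot = transitions_per_slot(coded)
```

With the invert line counted in the distance, a slot that would toggle d of the n + 1 lines is replaced by one that toggles n + 1 - d lines whenever d > n/2, so no slot toggles more than n/2 lines. For the unencoded sequence above the worst slot toggles all 8 lines.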

272 Example
A typical eight-bit synchronous data bus. The transitions between two consecutive time slots are "clean". There are 64 transitions for a period of 16 time slots. This represents an average of 4 transitions per time slot, or 0.5 transitions per bus line per time slot.

273 Bus encoding
The same sequence of data coded using the Bus-Invert method. There are now only 53 transitions over a period of 16 time slots. This represents an average of 3.3 transitions per time slot, or 0.41 transitions per bus line per time slot. The maximum number of transitions for any time slot is now 4.

274 Comparisons
Comparison of unencoded I/O and coded I/O with one or more invert lines. The comparison looks at the average and maximum number of transitions per time slot, per bus line per time slot, and I/O power dissipation for different bus widths.

275 Remarks
The increase in the delay of the data path: by looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolute lower number of transitions. It is also relatively easy to pipeline the bus activity. The extra pipeline stage and the extra latency must then be considered.
The increased number of I/O pins: as was mentioned before, ground bounce is a big problem for simultaneous switching in high speed designs. That is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side effect of decreasing the maximum ground bounce by approximately 50%. Thus circuits using the Bus-Invert method can use a lower number of Vdd and GND pins, and by using the method the total number of pins might even decrease.
The Bus-Invert method decreases the total power dissipation although both the total number of transitions increases (counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions get redistributed very nonuniformly, more on the low-capacitance side and fewer on the high-capacitance side.

276 References
[1] H. B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Addison-Wesley, 1990.
[2] T. K. Callaway, E. E. Swartzlander, "Estimating the Power Consumption of CMOS Adders", 11th Symp. on Comp. Arithmetic, Windsor, Ontario, 1993.
[3] A. P. Chandrakasan, S. Sheng, R. W. Brodersen, "Low-Power CMOS Digital Design", IEEE Journal of Solid-State Circuits, April 1992.
[4] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "HYPER-LP: A System for Power Minimization Using Architectural Transformations", ICCAD-92, Nov. 1992, Santa Clara, CA.
[5] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "An Approach to Power Minimization Using Transformations", IEEE VLSI for Signal Processing Workshop, 1992, CA.
[6] S. Devadas, K. Keutzer, J. White, "Estimation of Power Dissipation in CMOS Combinational Circuits", IEEE Custom Integrated Circuits Conference, 1990.
[7] D. Dobberpuhl et al., "A 200-MHz 64-bit Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, pp 1567, Nov
[8] R. J. Fletcher, "Integrated Circuit Having Outputs Configured for Reduced State Changes", U.S. Patent no. 4,667,337, May, 1987.
[9] D. Gajski, N. Dutt, A. Wu, S. Lin, High-Level Synthesis, Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[10] J. S. Gardner, "Designing with the IDT SyncFIFO: the Architecture of the Future", 1992 Synchronous (Clocked) FIFO Design Guide, Integrated Device Technology AN-60, pp. 7-10, 1992, Santa Clara, CA.
[11] A. Ghosh, S. Devadas, K. Keutzer, J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits", Proceedings of the 29th DAC, June 1992, Anaheim, CA.
[12] J. L. Hennessy, D. A. Patterson, Computer Architecture - A Quantitative Approach, Morgan Kaufmann Publishers, Palo Alto, CA, 1990.
[13] S. Kodical, "Simultaneous Switching Noise", 1993 IDT High-Speed CMOS Logic Design Guide, Integrated Device Technology AN-47, 1993, Santa Clara, CA.
[14] F. Najm, "Transition Density, A Stochastic Measure of Activity in Digital Circuits", Proceedings of the 28th DAC, June 1991, Anaheim, CA.

277 References
[16] A. Park, R. Maeder, "Codes to Reduce Switching Transients Across VLSI I/O Pins", Computer Architecture News, Sept
[17] Rambus - Architectural Overview, Rambus Inc., Mountain View, CA.
[18] A. Shen, A. Ghosh, S. Devadas, K. Keutzer, "On Average Power Dissipation and Random Pattern Testability", ICCAD-92, Nov. 1992, Santa Clara, CA.
[19] M. R. Stan, "Shift register generators for circular FIFOs", Electronic Engineering, February 1991, Morgan Grampian House, London, England.
[20] M. R. Stan, W. P. Burleson, "Limited-weight codes for low power I/O", International Workshop on Low Power Design, April 1994, Napa, CA.
[21] J. Tabor, Noise Reduction Using Low Weight and Constant Weight Coding Techniques, Master's Thesis, EECS Dept., MIT, May 1990.
[22] W.-C. Tan, T. H.-Y. Meng, "Low-power polygon renderer for computer graphics", Int. Conf. on A.S.A.P., 1993.
[23] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective, Addison-Wesley Publishing Company, 1988.
[24] R. Wilson, "Low power and paradox", Electronic Engineering Times, pp. 38, November 1, 1993.
[25] J. Ziv, A. Lempel, "A universal Algorithm for Sequential Data Compression", IEEE Trans. on Inf. Theory, vol. IT-23, 1977.

278 DesignPower Gate Level Power Model
Switching Power
Power dissipated when a load capacitance (gate + wire) is charged or discharged at the driver's output
If the technology library contains the correct capacitance value of the cell and the capacitive_load_unit attribute is specified, then no additional information is needed for switching power modeling
Output pin capacitance need not be modeled if the switching power is incorporated into the internal power

279 DesignPower Gate Level Power Model
Internal Power
Power dissipated internal to a library cell
Modeled using an energy lookup table indexed by input transition time and output load
Library cells may contain one or more internal energy lookup tables

280 DesignPower Gate Level Power Model
Leakage Power
The leakage power model supports a single value for each library cell
State-dependent leakage power is not supported

281 Operand Isolation
Combinational logic dissipates significant power when its output is unused
Inputs to the combinational logic are held stable when the output is unused

282 Operand Isolation Example - Diagram
Before Operand Isolation / After Operand Isolation

283 Operand Isolation Example - Before Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic is
  Port (a, b, c : in  std_logic_vector(7 downto 0);
        do      : out std_logic_vector(15 downto 0);
        rst     : in  std_logic;
        clk     : in  std_logic);
End Logic;

Architecture Behave of Logic is
  Signal Count           : integer;
  Signal Load_En         : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En          : std_logic;
  Signal Data_Add        : std_logic_vector(7 downto 0);
  Signal Data_Mul        : std_logic_vector(15 downto 0);
Begin
  Process(clk, rst)  -- Counter Logic in FSM
  Begin
    If (clk = '1' and clk'event) then
      If (rst = '0') then
        Count <= 0;
      Elsif (Count = 9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

284 Operand Isolation Example - Before Code
  Process(Count)  -- Enable Logic in FSM
  Begin
    If (Count = 9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Process(clk, Load_En)  -- Latch (for Deglitch) Logic
  Begin
    If (clk = '0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  -- The adder and multiplier toggle every cycle, even when the result is not loaded
  Data_Add <= a + b;
  Data_Mul <= Data_Add * c;

  Process(Data_Mul, Clk_En)  -- Data Reg Logic
  Begin
    If (Clk_En = '1' and Clk_En'event) then
      Do <= Data_Mul;
    End If;
  End Process;
End Behave;

Configuration CFG_Logic of Logic is
  for Behave
  End for;
End CFG_Logic;

285 Operand Isolation Example - After Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic1 is
  Port (a, b, c : in  std_logic_vector(7 downto 0);
        do      : out std_logic_vector(15 downto 0);
        rst     : in  std_logic;
        clk     : in  std_logic);
End Logic1;

Architecture Behave of Logic1 is
  Signal Count           : integer;
  Signal Load_En         : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En          : std_logic;
  Signal Data_Add        : std_logic_vector(7 downto 0);
  Signal Data_Mul        : std_logic_vector(15 downto 0);
  Signal Iso_Data_Add    : std_logic_vector(7 downto 0);
Begin
  Process(clk, rst)  -- Counter Logic in FSM
  Begin
    If (clk = '1' and clk'event) then
      If (rst = '0') then
        Count <= 0;
      Elsif (Count = 9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

286 Operand Isolation Example - After Code
  Process(Count)  -- Enable Logic in FSM
  Begin
    If (Count = 9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Process(clk, Load_En)  -- Latch (for Deglitch) Logic
  Begin
    If (clk = '0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  Data_Add <= a + b;

  -- Operand isolation: the multiplier input changes only when the result will be loaded
  Process(Load_En_Latched, Data_Add)
  Begin
    If (Load_En_Latched = '1' and Load_En_Latched'event) then
      Iso_Data_Add <= Data_Add;
    End If;
  End Process;

  Data_Mul <= Iso_Data_Add * c;

  Process(Data_Mul, Clk_En)  -- Data Reg Logic
  Begin
    If (Clk_En = '1' and Clk_En'event) then
      Do <= Data_Mul;
    End If;
  End Process;
End Behave;

287 Operand Isolation Example - Report
Before Code / After Code

288 Precomputation
Power saving
    Reduces power dissipation of combinational logic
    Reduces internal power to precomputed registers
Opportunity
    Can be significant, dependent on the percentage of time latch precomputation is successful
Cost
    Increases area
    Impacts circuit timing
    Increases design complexity (number of bits to precompute)
    Testability (may generate redundant logic)

289 Precomputation
Entire function is computed.
Smaller function is defined, Enable is precomputed.

290 Precomputation Before Precomputation Diagram

291 Precomputation After Precomputation Diagram

292 Precomputation Before Precomputation - Report

293 Precomputation After Precomputation - Report

294 Precomputation Example - Before Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;

Entity before_precomputation is
  port (a, b  : in  std_logic_vector(7 downto 0);
        CLK   : in  std_logic;
        D_out : out std_logic);
end before_precomputation;

Architecture Behav of before_precomputation is
  signal a_in, b_in : std_logic_vector(7 downto 0);
  signal comp       : std_logic;
Begin
  -- All 8 bits of both operands are registered on every clock edge
  process (CLK)
  begin
    if (CLK = '1' and CLK'event) then
      a_in <= a;
      b_in <= b;
    end if;
  end process;

  -- Full 8-bit comparator
  process (a_in, b_in)
  begin
    if (a_in > b_in) then
      comp <= '1';
    else
      comp <= '0';
    end if;
  end process;

  -- Output register
  process (CLK)
  begin
    if (CLK'event and CLK = '1') then
      D_out <= comp;
    end if;
  end process;
end Behav;

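295 Precomputation Example - After Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;

Entity after_precomputation is
  port (a, b  : in  std_logic_vector(7 downto 0);
        CLK   : in  std_logic;
        D_out : out std_logic);
end after_precomputation;

Architecture Behav of after_precomputation is
  signal a_in, b_in   : std_logic_vector(7 downto 0);
  signal pcom, pcom_D : std_logic;
  signal CLK_en, comp : std_logic;
Begin
  -- The MSBs are registered on every clock edge
  process (CLK)
  begin
    if (CLK = '1' and CLK'event) then
      a_in(7) <= a(7);
      b_in(7) <= b(7);
    end if;
  end process;

  -- Precomputation: the lower bits matter only when the MSBs are equal.
  -- (The slide shows "a xor b"; the bit select and the polarity are corrected here.)
  pcom <= not (a(7) xor b(7));

  -- Latch the enable while CLK is low so the gated clock cannot glitch
  process (CLK, pcom)
  begin
    if (CLK = '0') then
      pcom_D <= pcom;
    end if;
  end process;

  CLK_en <= pcom_D and CLK;

296 Precomputation - Example After Code
  -- The lower 7 bits are loaded only when the comparison needs them
  process (CLK_en)
  begin
    if (CLK_en = '1' and CLK_en'event) then
      a_in(6 downto 0) <= a(6 downto 0);
      b_in(6 downto 0) <= b(6 downto 0);
    end if;
  end process;

  process (a_in, b_in)
  begin
    if (a_in > b_in) then
      comp <= '1';
    else
      comp <= '0';
    end if;
  end process;

  process (CLK)
  begin
    if (CLK = '1' and CLK'event) then
      D_out <= comp;
    end if;
  end process;
end Behav;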

297 Peak Power Reduction
Peak power is related to EMI
Reducing concurrent switching reduces peak power
Adjust the delay in the bus/port drivers, within the system clock period
Consider the power consumption of the delay elements
EMI is improved by peak power reduction while the total power consumption is maintained
Before Peak Power Reduction / After Peak Power Reduction

298 Factoring Example f = ad + bc + cd
Function: f = ad + bc + cd
The function f is not on the critical path.
The signals a, b, c and d all have the same bit width.
Signal b is a high activity net.
The two factorings below are equivalent in terms of both timing and area.
Net result: network toggling and power are reduced.

299 Block Diagram of a Low-Voltage, High-Speed LSI
The power management processor (PMP) controls the low-Vt circuit through the sleep signal. The sleep period should be extended as much as possible, because leakage power is reduced while the circuit sleeps.

300 Operation of the Low-Vt LSI
The LSI receives a request signal from an I/O device, outputs the results, and waits for the next request signal. During the waiting period, the low-Vt circuit can sleep.

301 Waking/Sleeping operation
[Figure: waking operation; sleeping operation]

302 Creating a Sleep Period: Operation During Calculation
Heavy operations such as voice CODEC processing and light operations such as data collection can be distributed between the low-Vt circuit and the PMP, so the low-Vt circuit can sleep while the PMP executes the light operations. This reduces power by 10%.

303 Low Power Logic Gate Resynthesis on Mapped Circuit
Hyun-Sang Kim, Jun-Dong Cho, School of Electrical and Computer Engineering, SungKyunKwan University

304 Transition Probability
- Transition probability: the probability of a transition at the output of a gate, given a change at the inputs.
- Computed from signal probabilities.
- Example: F = X'Y + XY'
  - Signal probability of F: Pf = Px(1-Py) + (1-Px)Py
  - Transition probability of F: 2Pf(1-Pf)
- Assumes that the inputs are independent.
- BDDs can be used to compute these probabilities.
- Reference: Najm'91
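The two formulas on this slide can be checked numerically. The sketch below is a minimal illustration, assuming independent inputs as the slide does; the function names are hypothetical.

```python
def xor_signal_probability(px, py):
    """Signal probability of F = X'Y + XY' for independent inputs."""
    return px * (1.0 - py) + (1.0 - px) * py

def transition_probability(pf):
    """Probability that F differs between two consecutive, independent samples."""
    return 2.0 * pf * (1.0 - pf)

for px, py in [(0.5, 0.5), (0.1, 0.1), (0.9, 0.1)]:
    pf = xor_signal_probability(px, py)
    print(f"Px={px} Py={py} -> Pf={pf:.4f}, transition prob={transition_probability(pf):.4f}")
```

The transition probability peaks at 0.5 when Pf = 0.5 and falls toward zero as the output becomes mostly 0 or mostly 1.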

305 Technology Mapping
- Implements a Boolean network in terms of gates from a given library.
- Popular technique: tree-based mapping
  - Library gates and circuits are decomposed into canonical patterns.
  - Pattern matching and dynamic programming find the best cover.
- NP-complete for general DAG circuits.
- References: Keutzer'87, Rudell'89
- Idea: high transition probability points are hidden within gates.

306 Low Power Cell Mapping
[Figure: example of a high switching activity node; internal mapping in a complex gate]

307 Signal Probability vs. Power

308 Spatial Correlation

309 Low Power Logic Synthesis

310 Technology Mapping

311 Tree Decomposition

312 Huffman Algorithm

313 Depth-Constrained Decomposition
Algorithm
problem: minimize $\sum_{i=1}^{m} p_t(x_i)$
input: input signal probabilities (p1, p2, ..., pn), height (h), number of leaf nodes (n), fanin limit per gate (k)
output: k-ary tree topology
Begin
  sort (signal probabilities p1, p2, ..., pn);
  while (n != 0)
    if (h > log_k n)
      assign k nodes to level L (= h+1);
      h = h-1, n = n-(k-1);  /* upward */
    else if (h < log_k n)
      assign k nodes to level L (= h+2);  /* the previous level */
      h = h, n = n-(k-1);  /* downward */
    else  /* h = log_k n */
      assign the remaining nodes to level L (= h+1);  /* complete: build a complete k-ary tree from all remaining nodes */
  for (bottom level L; L > 1; L--)
    min_edge_weight_matching (nodes in level L);
End
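The objective of the algorithm above, the sum of transition probabilities over the internal nodes of the decomposition tree, can be evaluated for any candidate tree. The sketch below is not the slide's algorithm; it only scores a given topology, assuming AND gates, independent inputs, and the transition probability 2p(1-p) from slide 304. Trees are written as nested tuples with float leaves, and the function name is hypothetical.

```python
def evaluate(tree):
    """Return (signal probability, summed transition probability) of an AND tree.

    A leaf is a float signal probability; an internal node is a tuple of subtrees.
    """
    if isinstance(tree, float):
        return tree, 0.0
    prob = 1.0
    cost = 0.0
    for child in tree:
        child_prob, child_cost = evaluate(child)
        prob *= child_prob          # AND gate with independent inputs
        cost += child_cost
    return prob, cost + 2.0 * prob * (1.0 - prob)

balanced = ((0.5, 0.5), (0.5, 0.5))   # depth 2
chain = (((0.5, 0.5), 0.5), 0.5)      # depth 3

print(evaluate(balanced)[1])  # 0.8671875
print(evaluate(chain)[1])     # 0.7109375
```

In this example the deeper chain has the lower switching cost, which shows why a depth limit h makes the problem a trade-off rather than a pure minimization.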

314 Example

315 After Decomposition

316 After Tech. Mapping

317 7. Circuit Level Design

318 Buffer Chain
Delay analysis of a buffer chain, considering the parasitic capacitance Cp
- Ck, Pk: total capacitance and power at the output of the stage-k buffer
- PT: power consumption of the whole buffer chain
- Pn: power consumed in the load capacitance CL
- Eff: power efficiency, Pn/PT

319 Slew Rate
Determining rise/fall time

320 Slew Rate (Cont'd)
Power consumption due to short-circuit current in an oscillation circuit

321 Pass Transistor Logic
Reducing area/power
- Macro cells (a large part of the chip area) built from XOR/XNOR/MUX primitives → pass transistor logic
- Does not use a charge/discharge scheme → appropriate for low-power logic
Pass transistor logic families
- CPL (Complementary Pass Transistor Logic)
- DPL (Dual Pass Transistor Logic)
- SRPL (Swing Restored Pass Transistor Logic)
[Figure: CPL basic scheme; inverter buffering]

322 Pass Transistor Logic (Cont'd)
DPL
- Pass transistor network + dual p-MOS
- Enables rail-to-rail swing
- Characteristics: increased input capacitance (delay); increased driving ability because two ON-paths exist; input loading capacitance equal to that of CPL
SRPL
- Pass transistor network + cross-coupled inverter
- Restores the logic level
- The inverter size must not be too big

323 Dynamic Logic
- Uses a precharge/evaluation scheme
- Families: Domino logic, NORA (NO RAce) logic
- Characteristics: decreased input loading capacitance; power consumed by the precharge clock; more useless switching during the precharge period
[Figure: basic architecture of Domino logic]

324 Input Pin Ordering
- Reorder the equivalent inputs to a transistor based on critical path delays and power consumption.
- An N-input primitive CMOS gate is symmetrical at the function level but asymmetrical at the transistor level (capacitance of the output stage, body effect).
- Scheme: the signal that has many transitions must be far from the output. If the switching frequency is hard to estimate, determine the pin ordering by considering the path and path-delay balance from the primary inputs to the transistor inputs.
- Experiment with a TI gate array: for a 4-input NAND gate in TI's BiCMOS gate array library (with a load of 13 inverters), delay varies by 20% and power dissipation by 10% between a good and a bad ordering.
[Figure: example of N-input CMOS logic]

325 INPUT PIN Reordering
[Figure: 4-input CMOS gate with inputs A, B, C, D; PMOS transistors MPA-MPD; NMOS transistors MNA-MND; load capacitance CL; internal node capacitances CB, CC, CD; supply VDD; panels (a)-(d)]
Simulation result (tcycle = 50 ns, tf/tr = 1 ns): 38.4 uW when A is the critical input, 47.2 uW when D is the critical input.

326 Sensitization
Definition
- Sensitization: an input signal that forces an output transition event
- Sensitization vector: the values of the other inputs when one signal is sensitized
[Figure: example]

327 Sensitization (Cont'd)
- Considering sensitization in combinational logic removes unnecessary transitions in the combinational logic.
- Considering sensitization in sequential logic also reduces the power consumption in the flip-flops.

328 TTL-Compatible
TTL-level signal → CMOS input
[Figure: characteristic curve of a CMOS inverter]

329 TTL-Compatible (Cont'd)
CMOS output signal → TTL input
- Because of the sink current IOL, the CMOS device generates a large amount of heat.
- The chip operating temperature increases.
- This adds to the power consumption of the whole system.

330 INPUT PIN Reordering
To reduce power dissipation, place the input with low transition density near the ground end.
(a) If MNA turns off, only CL needs to be charged.
(b) If MND turns off, CL, CB, CC, and CD all need to be charged.
(c) If the critical input is rising and is placed near the output node, the initial charge on CB, CC, and CD is zero, and CL discharges faster than in (d).
(d) If the critical input is rising and is placed near the ground end, CB, CC, and CD must discharge before CL can discharge to zero.
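Cases (a) and (b) can be turned into a rough switched-capacitance estimate. The sketch below is a simplified model under stated assumptions: the NMOS transistors form a series stack from the output (MNA) down to ground (MND), only one input turns off at a time while the others stay high, and all capacitance values and transition densities are hypothetical numbers chosen for illustration.

```python
def charged_capacitance(off_index, c_load, c_internal):
    """Capacitance recharged when the NMOS at off_index turns off.

    off_index 0 is the transistor nearest the output (MNA); the last index is
    the one nearest ground (MND). c_internal lists CB, CC, CD from output to ground.
    """
    return c_load + sum(c_internal[:off_index])

def expected_switched_capacitance(densities, c_load, c_internal):
    """Weight each case by the transition density of the input at that position."""
    return sum(d * charged_capacitance(i, c_load, c_internal)
               for i, d in enumerate(densities))

C_LOAD = 10.0                  # fF, hypothetical
C_INTERNAL = [2.0, 2.0, 2.0]   # CB, CC, CD in fF, hypothetical

busy_near_output = [0.8, 0.1, 0.1, 0.1]  # high-activity input drives MNA
busy_near_ground = [0.1, 0.1, 0.1, 0.8]  # high-activity input drives MND

print(expected_switched_capacitance(busy_near_output, C_LOAD, C_INTERNAL))
print(expected_switched_capacitance(busy_near_ground, C_LOAD, C_INTERNAL))
```

With these numbers the ordering that keeps the low-activity inputs near ground switches less capacitance, in line with the rule stated on the slide.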

331 Low-Power Booth Multiplier Design
Jin-Hyuk Kim, Jun-Sung Lee, Jun-Dong Cho, School of Electrical and Computer Engineering, SungKyunKwan University

332 Modified Booth Multiplier
- Multibit recoding halves the number of partial products, which enables high-speed multiplication.
- Multiplicand: X; multiplier: Y
- Recoded digit = $Y_{2i-1} + Y_{2i} - 2Y_{2i+1}$, with $Y_{-1} = 0$
[Figure: generation and operation of recoded digit]
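The recoding formula can be exercised directly. The sketch below applies it to a two's-complement multiplier of even width and checks that the radix-4 digits reproduce the original value; the function name and the 8-bit width are assumptions made for illustration.

```python
def booth_radix4_digits(y, bits=8):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Digit i is Y[2i-1] + Y[2i] - 2*Y[2i+1], with Y[-1] = 0.
    Returns the digits least significant first; each digit is in -2..2.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even")

    def bit(k):
        return 0 if k < 0 else (y >> k) & 1

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(bits // 2)]

digits = booth_radix4_digits(7, bits=4)
print(digits)                                        # [-1, 2]
print(sum(d * 4 ** i for i, d in enumerate(digits)))  # 7
```

An n-bit multiplier yields n/2 digits, so the number of partial products is halved, and every zero digit is a partial product that needs no addition at all.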

333 Modified Booth Multiplier - Example

334 Wallace Tree - 4:2 Compressor

335 Multipliers - Area
[Figure: 16-bit multiplier area]

336 Multiplier - Delay
[Figure: average power dissipation (16-bit)]

337 Multiplier - Power
[Figure: worst-case delay (16-bit)]

338 Instruction Level Power Analysis
Estimate the power dissipation of instruction sequences and of a whole program.
- Eb: base cost of individual instructions
- Es: circuit state change effects
- EM: overall energy cost of a program
- Bi: base cost of a type i instruction
- Ni: number of type i instructions
- Oi,j: cost incurred when a type i instruction is followed by a type j instruction
- Ni,j: number of times a type i instruction is immediately followed by a type j instruction
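The slide's equation did not survive in the transcript. From the definitions, the model presumably has the form EM = sum_i Bi*Ni + sum_{i,j} Oi,j*Ni,j, that is, base cost plus circuit-state overhead. The sketch below evaluates that assumed form for a short instruction sequence; the instruction names and all cost values are hypothetical.

```python
from collections import Counter

def program_energy(sequence, base_cost, overhead):
    """Assumed model: E_M = sum_i B_i * N_i + sum_(i,j) O_ij * N_ij."""
    counts = Counter(sequence)                        # N_i
    pair_counts = Counter(zip(sequence, sequence[1:]))  # N_ij for adjacent pairs
    e_base = sum(base_cost[i] * n for i, n in counts.items())
    e_state = sum(overhead.get(pair, 0.0) * n for pair, n in pair_counts.items())
    return e_base + e_state

# Hypothetical per-instruction base costs and inter-instruction overheads.
BASE = {"load": 2.0, "add": 1.0, "mul": 3.0}
OVERHEAD = {("load", "add"): 0.25, ("add", "mul"): 0.5, ("mul", "add"): 0.5}

original = ["load", "add", "mul", "add"]
reordered = ["load", "add", "add", "mul"]

print(program_energy(original, BASE, OVERHEAD))   # 8.25
print(program_energy(reordered, BASE, OVERHEAD))  # 7.75
```

Both sequences have the same base cost; only the circuit-state term changes, which is the part that instruction reordering can reduce.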

339 Instruction Ordering
Develop a technique of operand swapping.
- Recoding weight: the operation cost required by an operand
- Wtotal: total recoding weight of an input operand
- Wi: weight of the individual recoded digit i in the Booth multiplier
- Wb: base weight of an instruction
- Winter: inter-operation weight of instructions
Therefore, if an operand has the lower Wtotal, put it in the second input (multiplier).
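The swapping rule can be sketched as follows. The slide does not give the per-digit weights Wi, so this example assumes the simplest choice: a weight of 1 for every nonzero radix-4 Booth digit and 0 for a zero digit. The function names and the 8-bit width are also assumptions.

```python
def booth_radix4_digits(y, bits=8):
    """Radix-4 Booth digits of a two's-complement value, least significant first."""
    def bit(k):
        return 0 if k < 0 else (y >> k) & 1
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(bits // 2)]

def recoding_weight(value, bits=8):
    """Assumed W_total: the number of nonzero recoded digits."""
    return sum(1 for d in booth_radix4_digits(value, bits) if d != 0)

def order_operands(x, y, bits=8):
    """Return (multiplicand, multiplier), putting the lower-weight operand second."""
    if recoding_weight(x, bits) < recoding_weight(y, bits):
        return y, x
    return x, y

print(recoding_weight(85), recoding_weight(16))  # 4 1
print(order_operands(16, 85))                    # (85, 16)
```

Because multiplication is commutative, the swap does not change the product; it only changes which operand the Booth recoder sees.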

340 RESULT

341 Conclusion
[Figure: power (pJ) and % of instances with circuit state effects, plotted against operand bits; annotated reductions of 9.0%, 12.0%, and 4.0%]

342 8. Layout Level Design

343 Device Scaling by a Factor of S
- Constant scaled wire: coupling capacitance increases by S and wire resistance by S
- Supply voltage by 1/S, threshold voltage by 1/S, current drive by 1/S
- Gate capacitance by 1/S, gate delay by 1/S
- Global interconnection delay (RC load + parasitics) by S
- Interconnect delay: 50-70% of the clock cycle
- Area: 1/S^2
- Power dissipation by 1/S to 1/S^2 (P = nC Vdd^2 f, where nC is the sum of capacitance times number of transitions)
- SIA (Semiconductor Industry Association) projection for 2007: physical limit of 0.1 µm, 20 billion transistors, 10 square centimeters, 12- or 16-inch wafers

344 Delay Variations at Low Voltage
- At high supply voltage, delay increases with temperature (mobility decreases with temperature), while at very low supply voltage delay decreases with temperature (VT decreases with temperature).
- At low supply voltages, the delay ratio between large and minimum transistor widths W increases by a factor of several.
- Clock trees are delay-balanced by wire snaking in order to avoid clock skew. In this case, at low supply voltages, slight VT variations can significantly modify the delay balancing.

345 Quarter Micron Challenge
- Computers/peripherals (SOC): $50 billion in 1996, $70 billion in 1999
- Wiring dominates delay: wire R is comparable to gate driver R; wire-to-wire coupling C exceeds C to ground
- Push beyond 0.07 micron
- Quest for area (past), speed-speed (now), power-power-power (future)
- Accelerated increases of clock frequencies
- Signal integrity-based tools
- Design styles (chip + packages)
- System-level design (system partitioning)
- Synthesis with multiple constraints (power, area, timing)
- Partitioning/MCM
- Increasing speed limits complicate clock and power distribution
- Design bounded by wires, vias, via resistance, coupling
- Reverse scaling: adding area/spacing as needed (widening and thickening of wires, metal shielding and noise avoidance by adding metal)

346 CLOCK POWER CONSUMPTION
- Clock power consumption is as large as the logic power. Because the clock signal carries the heaviest load and switches at high frequency, clock distribution is a major source of power dissipation.
- In a microprocessor, 18% of the total power is consumed by clocking.
- Clock distribution is designed as a hierarchical clock tree, according to the decomposition principle.

347 Power Consumption per Block in a Typical Microprocessor

348 Crosstalk

349 Solution for Clock Skew
Dynamic effects on skew
- Capacitance coupling
- Supply voltage deviation (clock driver and receiver voltage difference)
- Capacitance deviation due to circuit operation
- Global and local temperature
Layout issues (clocks are routed first)
- Must be aware of all sources of delay
- Increased spacing
- Wider wires
- Insert buffers
- Specialized clocks need net matching
- Two approaches: single driver, H-tree driver
Gated clocks
- Local clocks are conditionally enabled so that the registers are clocked only during write cycles.
- The clock is partitioned into different blocks and each block is clocked with its own clock.
- Gating the clocks to infrequently used blocks does not provide an acceptable level of power savings.
- Divide the basic clock frequency to provide the lowest clock frequency needed by different parts of the circuit.
Clock distribution
- Large clock buffers waste power. Use smaller clock buffers with a well-balanced clock tree.

350 PowerPC Clocking Scheme

351 CLOCK DRIVERS IN THE DEC ALPHA 21164

352 DRIVER for PADS or LARGE CAPACITANCES
Off-chip power (drivers and pads) is increasing and is very difficult to reduce, because pad and driver sizes cannot be decreased with new technologies.

353 Layout-Driven Resynthesis for Lower Power

354 Low Power Process
[Figure: dynamic power dissipation]

355 Crosstalk
In deep-submicron layouts, some of the nets connecting modules can be so long that their resistance is comparable to the resistance of the driver.
Each net in a mixed analog/digital circuit is classified by its crosstalk sensitivity:
1. Noisy = high-impedance signals that can disturb other signals, e.g., clock signals.
2. High-Sensitivity = high-impedance analog nets; the most noise-sensitive nets, such as the input nets to operational amplifiers.
3. Mid-Sensitivity = low/medium-impedance analog nets.
4. Low-Sensitivity = digital nets that directly affect the analog part in some cells, such as control signals.
5. Non-Sensitivity = the most noise-insensitive nets, such as pure digital nets.
The crosstalk between two interconnection wires also depends on the frequencies (i.e., signal activities) of the signals traveling on the wires. Recently, deep-submicron designs have come to require crosstalk-free channel routing.

356 Power Measure in Layout
The average dynamic power consumed by a CMOS gate is
$$P_{av} = \frac{0.5\,V_{dd}^2}{T_{cycle}}\,C_l\,N = \frac{0.5\,V_{dd}^2}{T_{cycle}}\,(C_g + C_w)\,N$$
where $C_l$ is the load capacitance at the output of the node, $V_{dd}$ is the supply voltage, $T_{cycle}$ is the global clock period, $N$ is the number of transitions of the gate output per clock cycle, $C_g$ is the load capacitance due to the input capacitance of the fanout gates, and $C_w$ is the load capacitance due to the interconnection tree formed between the driver and its fanout gates.
- Logic synthesis for low power attempts to minimize $\sum_i C_{g_i} N_i$.
- Physical design for low power tries to minimize $\sum_i C_{w_i} N_i$. Here $C_{w_i} = C_{x_i} + C_{s_i}$, where $C_{x_i}$ is the capacitance of net i due to crosstalk and $C_{s_i}$ is the substrate capacitance of net i.
- For low-power layout, power dissipation due to crosstalk is minimized by placing wires that carry high-activity signals sufficiently far from other wires.
- Similarly, power dissipation due to substrate capacitance is proportional to the wire length and its signal activity.
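As a worked instance of the average-power expression, the sketch below evaluates Pav = (0.5 Vdd^2 / Tcycle)(Cg + Cw) N for one gate and shows the effect of halving the wire capacitance. The supply voltage, clock period, capacitances, and activity are hypothetical values chosen only to make the arithmetic concrete.

```python
def average_dynamic_power(vdd, t_cycle, c_gate, c_wire, n_transitions):
    """P_av = 0.5 * Vdd^2 / T_cycle * (C_g + C_w) * N, in watts for SI inputs."""
    return 0.5 * vdd ** 2 / t_cycle * (c_gate + c_wire) * n_transitions

VDD = 3.3          # volts, hypothetical
T_CYCLE = 10e-9    # seconds (100 MHz clock), hypothetical
C_GATE = 20e-15    # farads of fanout gate input capacitance, hypothetical
N = 0.5            # output transitions per clock cycle, hypothetical

long_wire = average_dynamic_power(VDD, T_CYCLE, C_GATE, 30e-15, N)
short_wire = average_dynamic_power(VDD, T_CYCLE, C_GATE, 15e-15, N)

print(f"{long_wire * 1e6:.3f} uW with 30 fF of wire")
print(f"{short_wire * 1e6:.3f} uW with 15 fF of wire")
```

Since the wire term is weighted by N, shortening the nets with the highest activity gives the largest saving, which is the motivation for activity-driven placement and routing.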

357 Low-Power Layout Design Using Dual Supply Voltages
Jin-Hyuk Kim, Jun-Sung Lee, Jun-Dong Cho, School of Electrical and Computer Engineering, SungKyunKwan University

358 Contents
- Research objective
- Research background
- Clustered Voltage Scaling structure
- Row by Row Power Supply structure
- Mix-And-Match Power Supply structure
- Level Converter structure
- Mix-And-Match Power Supply design flow
- Experimental results
- Conclusion

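359 Research Objective and Background
- Propose a dual-voltage layout technique that reduces the power consumption of combinational circuits.
- When dual-voltage cells are used, cells of the same voltage are placed in the same cell row, which increases wiring and the number of tracks; the proposed technique reduces this increase.
- Implement a level converter circuit that uses the minimum number of transistors.
- Apply Clustered Voltage Scaling [Usami, '95], which uses dual voltages while maintaining device performance.
- The proposed Mix-And-Match Power Supply layout structure improves on the existing Row by Row Power Supply [Usami, '97] layout structure and reduces power and area.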

360 Clustered Voltage Scaling
Generates a low-power netlist.

361 Row by Row Power Supply Structure

362 Mix-And-Match Power Supply Structure

363 Structure Comparison
[Figure: conventional, RRPS, and MAMPS circuits]

364 Level Converter Structure
- Number of transistors: 6 → 4
- Effective in terms of both power and area
[Figure: conventional vs. proposed]

365 Mix-And-Match Power Supply Design Flow

366 Experimental Results
[Figure: total area; total power]

367 Conclusion
- Compared with a single-voltage circuit, power is reduced by 49.4% at the cost of a 5.6% area overhead.
- Compared with the existing RRPS structure, area is reduced by 10% and power by 2%.
- The proposed level converter reduces area by 30% and power by 35% compared with the conventional level converter.

368 9. CAD tools

369 Low Power Design Tools
- Transistor-level tools (within 5-10% of silicon): SPICE, PowerMill (Epic), ADM (Avanti/Anagram), Lsim Power Analyst (Mentor)
- Logic-level tools (10-15%): DesignPower and PowerGate (Synopsys), WattWatcher/Gate (Sente), PowerSim (System Sciences), POET (Viewlogic), and QuickPower (Mentor)
- Architectural (RTL) level tools (20-25%): WattWatcher/Architect (Sente), 20-25% accuracy
- Behavioral (spreadsheet) level tools (50-100%): an active area of academic research

370 Commercial synthesis systems

371 Research synthesis systems
A - Architectural synthesis. L - Logic synthesis.

372 Low-Power CAD sites
- Alternative System Concepts, Inc.: 7X power reduction through optimization; contact and Jake Karrfalt at or (603). Reduction of glitch and clock power; modeling and optimization of interconnect power; power optimization for data-dominated designs with limited control flow.
- Mentor Graphics QuickPower: hierarchical determination of the overall benefit of exchanging blocks for lower power; powering down or disabling blocks when not in use by clock gating; choosing candidates for power-down; calculating the effect of the power-down logic.
- Synopsys's Power Compiler
- Sente's WattWatcher/Architect: the first commercial tool operating at the architecture level (20-25% accuracy)
- Behavioral tools: Hyper-LP (optimization), Explore (estimation) by J. Rabaey

373 DesignPower (Synopsys)
DesignPower(TM) provides a single, integrated environment for power analysis in multiple phases of the design process:
- Early, quick feedback at the HDL or gate level through probabilistic analysis.
- Improved accuracy through simulation-based analysis for gate level and library exploration.
DesignPower estimates switching, internal cell, and leakage power. It accepts user-defined probabilities, simulation toggle data, or a combination of both as input. DesignPower propagates switching information through sequential devices, including flip-flops and latches. It supports sequential, hierarchical, gated-clock, and multiple-clock designs. For simulation toggle data, it links directly to Verilog and VHDL simulators, including Synopsys' VSS.

374 10. References

375 References
[1] Gary K. Yeap, "Practical Low Power Digital VLSI Design", Kluwer Academic Publishers.
[2] Jan M. Rabaey, Massoud Pedram, "Low Power Design Methodologies".
[3] Abdellatif Bellaouar, Mohamed I. Elmasry, "Low-Power Digital VLSI Design: Circuits and Systems", Kluwer Academic Publishers.
[4] Anantha P. Chandrakasan, Robert W. Brodersen, "Low Power Digital CMOS Design", Kluwer Academic Publishers.
[5] Dr. Ralph Cavin, Dr. Wentai Liu, "1996 Emerging Technologies: Designing Low Power Digital Systems".
[6] Muhammad S. Elrabaa, Issam S. Abu-Khater, Mohamed I. Elmasry, "Advanced Low-Power Digital Circuit Techniques", Kluwer Academic Publishers.

376 References
[BFKea94] R. Bechade, R. Flaker, B. Kauffmann, et al. "A 32b 66 MHz 1.8W Microprocessor". In IEEE Int. Solid-State Circuit Conference, pages , 1994.
[BM95] Bohr and T. Mark. "Interconnect Scaling - The real limiter to high performance ULSI". In Proceedings of 1995 IEEE International Electron Devices Meeting, pages , 1995.
[BSM94] L. Benini, P. Siegel, and G. De Micheli. "Saving Power by Synthesizing Gated Clocks for Sequential Circuits". IEEE Design and Test of Computers, 11(4):32-41, 1994.
[GH95] S. Ganguly and S. Hojat. "Clock Distribution Design and Verification for PowerPC Microprocessor". In International Conference on Computer-Aided Design, page Issues in Clock Designs, 1995.
[MGR96] R. Mehra, L. M. Guerra, and J. Rabaey. "Low Power Architecture Synthesis and the Impact of Exploiting Locality". In Journal of VLSI Signal Processing, 1996.

