Download presentation
Presentation is loading. Please wait.
Published byKelly Sparks Modified over 5 years ago
1
Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy
Seongmoo Heo, Ronny Krashinsky, Krste Asanović MIT - Laboratory for Computer Science ARVLSI March 15, 2001
2
Flip-Flop and Latch (collectively timing elements)
Critical Timing Elements (TEs) in modern synchronous VLSI systems Significant impact on cycle time Big portion of energy consumption Energy breakdown of a MIPS 5 stage pipeline datapath for SPECint 95 programs Flip-flop Latch [Heo, MS Thesis, ’00]
3
Motivation Previous work tried to find the most energy-efficient and fastest TEs assuming a single TE design used uniformly throughout a circuit. using a very limited set of data patterns and un-gated clock signal. Two important observations There is a wide variation in clock and data activity across different TEs. Many TEs are not in the critical path, and thus have ample time slack.
4
Basic Idea Selection from a heterogeneous library of designs, each tuned to different operating regimes Operating regimes : Different input and clock signal activities Different speed requirements
5
Related Work The use of timing slack for reduced energy Examples :
- Traditional transistor sizing - Cluster voltage scaling [Usami and Horowitz ’95] - Multiple threshold voltage or series transistor for reducing leakage current [McPherson et al. ’00, Yamashita et al. ’00, Johnson et al. ’99]
6
Our Contribution Detailed energy characterization of wide range of TEs as a function of signal activities. Detailed measurement of TE signal activities for a micro-processor running complete programs Exploit signal activity to reduce TE energy by using different TE structures.
7
Overview Flip-Flop and Latch Designs Test Bench and Simulation Setup
Delay and Energy Characterization Energy Analysis with Test Waveforms Evaluation with Processor Conclusion
8
Latch Designs Transistor sizes optimized for two extremes:
Highest speed vs. Lowest power
9
Flip-Flop Designs Transistor sizes optimized for two extremes:
Highest speed vs. Lowest power
10
Flip-Flop Designs Transistor sizes optimized for two extremes:
Highest speed vs. Lowest power
11
Flip-Flop Designs Transistor sizes optimized for two extremes:
Highest speed vs. Lowest power
12
Flip-Flop Designs Transistor sizes optimized for two extremes:
Highest speed vs. Lowest power
13
Flip-Flop Designs Transistor sizes optimized for two extremes:
Highest speed vs. Lowest power
14
Flip-Flop Designs Transistor sizes optimized for two extremes:
Highest speed vs. Lowest power
15
Test Bench Used fixed, realistic input driver
Determined appropriate output load As large as 200fF output load was used by previous work. We used 7.2fF (4 min-inv cap) because 60% of output loads in the VP microprocessor datapath are smaller than 14.4fF. Further work on load-sensitive analysis at upcoming WVLSI Sized clock buffer to give equal rise/fall time 7.2fF
16
Simulation Setup Custom layout in 0.25μm TSMC CMOS process with Magic layout program Layout extraction with SPACE 2D extractor Circuit simulation with Hspice under nominal condition of Vdd=2.5V and T=25°C Hspice .Measure command to measure delay and energy
17
Delay Characterization
Flip-flop : Minimum D-Q delay [Stojanovic et al. ’99] Latch : D-Q delay (a) Flip-flops (b) Latches
18
Energy Characterization
Total energy = input energy + internal energy + clock energy – output energy Accurate energy characterization State-transition technique based on [Zyuban and Kogge ’99] D Q C 1 1 2 3 3 2 C D Q
19
Energy Tables (a) Flip-flops (b) Latches
20
Energy Tables (a) Flip-flops (b) Latches Low-Power Flip-Flop 000 100
001 010 111 011 110 101 Low-Power Flip-Flop PPCFF 48.4 95.5 95.4 89.2 89.0 47.6 46.3 46.0 91.5 49.1 46.8 68.1 19.4 19.2 68.0 49.7 6.9 51.2 (b) Latches
21
Test Waveforms Test 1 and 2 : high clock activity, no data and output activity Test 3 and 4 : high data activity, no clock and output activity Test 5, 6, and 7 : high clock, data, and output activity (Traditional) Test 8 : high clock and data activity, no output activity
22
Energy Analysis Low-power flip-flops and latches (a) Flip-flops
1131 (b) Latches Low-power flip-flops and latches
23
Processor Design and Simulation
Evaluation on a microprocessor datapath Vanilla Pekoe Processor A classic 32-bit MIPS RISC 5 stage pipeline with caches and system coprocessor registers (R3000-compatible) Aggressive clock gating to save energy 22 multi-bit flip-flops and latches, totaling 675 individual bits Simulation with 5 programs of SPECint95 benchmarks A fast cycle-accurate simulator [Krashinsky, Heo, Zhang, and Asanovic ’00] with the ability of counting TE state transitions 1.71 billion instructions and 2.69 billion cycles Some constraints Cannot track the exact timing of signals Cannot model glitches
24
Flip-Flops and Latches in Processor
25
Flip-Flops and Latches in Processor
26
Flip-Flops and Latches in Processor
27
Energy Breakdown 32-bit MIPS 5 stage pipeline datapath
Flip-flops HLFF-hs Lowest-Energy f_recovpc 25.1 SSAFF-lp 3.57 d_inst 31.2 6.52 d_epc 20.5 2.74 x_epc 20.3 2.62 m_epc 20.2 2.55 x_sd 2.6 SAFF-lp 1.06 x_addr 8.0 2.57 m_exe 24.6 4.76 cp0_count 42.6 4.80 cp0_comp 0.1 HLFF-lp 0.03 cp0_baddr 0.3 0.18 cp0_epc 0.05 Latches PPCLA-hs Lowest-Energy p_pc 3.22 SSALA-lp 2.25 f_pc 2.95 1.72 d_rsalu 3.27 3.16 d_rtalu 2.81 2.28 d_rsshmd 0.75 PPCLA-lp 0.70 d_rtshmd 0.65 0.63 d_aluctrl 1.26 0.97 m_exe 3.88 3.65 x_sdalign 0.30 SSA2LA-lp 0.27 w_result 2.74 2.42 (unit: mJ) (unit: mJ) 32-bit MIPS 5 stage pipeline datapath SPECint95 benchmarks: perl(test, primes), ijpeg(test), m88ksim(test), go(20,9), and lzw(medtest)
28
Energy Breakdown 32-bit MIPS 5 stage pipeline datapath
Flip-flops HLFF-hs Lowest-Energy f_recovpc 25.1 SSAFF-lp 3.57 d_inst 31.2 6.52 d_epc 20.5 2.74 x_epc 20.3 2.62 m_epc 20.2 2.55 x_sd 2.6 SAFF-lp 1.06 x_addr 8.0 2.57 m_exe 24.6 4.76 cp0_count 42.6 4.80 cp0_comp 0.1 HLFF-lp 0.03 cp0_baddr 0.3 0.18 cp0_epc 0.05 Latches PPCLA-hs Lowest-Energy p_pc 3.22 SSALA-lp 2.25 f_pc 2.95 1.72 d_rsalu 3.27 3.16 d_rtalu 2.81 2.28 d_rsshmd 0.75 PPCLA-lp 0.70 d_rtshmd 0.65 0.63 d_aluctrl 1.26 0.97 m_exe 3.88 3.65 x_sdalign 0.30 SSA2LA-lp 0.27 w_result 2.74 2.42 (unit: mJ) (unit: mJ) 32-bit MIPS 5 stage pipeline datapath SPECint95 benchmarks: perl(test, primes), ijpeg(test), m88ksim(test), go(20,9), and lzw(medtest)
29
Energy Breakdown 32-bit MIPS 5 stage pipeline datapath
Flip-flops HLFF-hs Lowest-Energy f_recovpc 25.1 SSAFF-lp 3.57 d_inst 31.2 6.52 d_epc 20.5 2.74 x_epc 20.3 2.62 m_epc 20.2 2.55 x_sd 2.6 SAFF-lp 1.06 x_addr 8.0 2.57 m_exe 24.6 4.76 cp0_count 42.6 4.80 cp0_comp 0.1 HLFF-lp 0.03 cp0_baddr 0.3 0.18 cp0_epc 0.05 Latches PPCLA-hs Lowest-Energy p_pc 3.22 SSALA-lp 2.25 f_pc 2.95 1.72 d_rsalu 3.27 3.16 d_rtalu 2.81 2.28 d_rsshmd 0.75 PPCLA-lp 0.70 d_rtshmd 0.65 0.63 d_aluctrl 1.26 0.97 m_exe 3.88 3.65 x_sdalign 0.30 SSA2LA-lp 0.27 w_result 2.74 2.42 (unit: mJ) (unit: mJ) 32-bit MIPS 5 stage pipeline datapath SPECint95 benchmarks: perl(test, primes), ijpeg(test), m88ksim(test), go(20,9), and lzw(medtest)
30
Processor Energy Results - Flip-Flop
HS: Highest-Speed LP: Lowest-Power HLFF-hs HLFF-lp (A single design used uniformly throughout a circuit) SSAFF-hs SSASPL-hs SSAFF-lp SSASPL-lp Ref : Total datapath energy – Total TE energy = around 0.21J
31
Processor Energy Results - Flip-Flop
HLFF-hs 34% energy saving 34% energy saving with conventional transistor sizing
32
Processor Energy Results - Flip-Flop
HSLE: Activity-Sensitive selection HLFF-hs 69% energy saving 52% energy saving 52% energy saving over just transistor sizing with the best performance (HLFF-hs)
33
Processor Energy Results - Latch
2 1 PPCLA-hs SSA2LA-lp 6.1% energy saving over just transistor sizing (1) 8.3% energy saving compared to homogeneous design with PPCLA-hs (2) PPCLA is the fastest and also very energy-efficient.
34
Summary of Energy Results
63% TE energy saving compared to a homogeneous design with HLFF-hs and PPCLA-hs 46% TE energy saving compared to a design with conventional transistor sizing while keeping the best performance
35
Conclusion We showed that activation patterns for various TEs in a circuit differ considerably. We found that there is wide variation in the optimal TE designs for different regimes. We provided complete energy and delay characterization. We applied our technique to a real processor which we simulated 2.7 billion cycles of programs and showed over 63% TE energy reduction without losing any performance. Difficulty of using a heterogeneous mix of TEs? - Already designers have been doing verification for each local clock and added complexity is minimal. - Timing verification for non-critical TEs is simple.
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.