CS 7810 Lecture 13 Pipeline Gating: Speculation Control For Energy Reduction S. Manne, A. Klauser, D. Grunwald Proceedings of ISCA-25 June 1998.

Cost of Speculation
Mispredict rates

Pipeline Gating
Low-confidence branches throttle instruction fetch until they are resolved
Pipeline gating usually lasts for fewer than five cycles
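A minimal sketch of this fetch-throttling mechanism (the counter-based structure, the gating threshold, and all names are illustrative assumptions, not the paper's exact hardware):

```python
# Sketch of pipeline gating: stall fetch while too many unresolved
# low-confidence branches are in flight. Names and the threshold are
# illustrative assumptions.

class PipelineGate:
    def __init__(self, gating_threshold=1):
        self.gating_threshold = gating_threshold  # N low-confidence branches
        self.low_conf_in_flight = 0               # unresolved low-confidence branches

    def on_fetch_branch(self, is_low_confidence):
        """Called when fetch encounters a predicted branch."""
        if is_low_confidence:
            self.low_conf_in_flight += 1

    def on_branch_resolve(self, was_low_confidence):
        """Called when execute resolves a branch."""
        if was_low_confidence:
            self.low_conf_in_flight -= 1

    def fetch_enabled(self):
        """Gate (stall) fetch while the threshold is met or exceeded."""
        return self.low_conf_in_flight < self.gating_threshold
```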

Metrics
SPEC (specificity): the fraction of all mispredicted branches that the confidence estimator flags as low-confidence (coverage)
PVN (predictive value of a negative test): the probability that a low-confidence branch is in fact mispredicted (accuracy)
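As a concrete illustration (a hypothetical helper, not from the paper; the counts in the example are made up), both metrics follow from simple event counts gathered over a run:

```python
def spec_and_pvn(mispredicted_and_low_conf, total_mispredicted, total_low_conf):
    """Compute the two confidence-estimation metrics from raw counts.

    SPEC (coverage): of all mispredicted branches, how many were flagged low-confidence.
    PVN  (accuracy): of all branches flagged low-confidence, how many were mispredicted.
    """
    spec = mispredicted_and_low_conf / total_mispredicted
    pvn = mispredicted_and_low_conf / total_low_conf
    return spec, pvn

# Example: 600 of 800 mispredictions were flagged, out of 2000 low-confidence flags.
print(spec_and_pvn(600, 800, 2000))  # -> (0.75, 0.3)
```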

Confidence Estimators
Perfect: used to gauge potential benefits
Static: branches with low prediction rates are statically marked low-confidence
JRS: if a branch has yielded N successive correct predictions, it has high confidence (sketched below)
Saturating counters: an unbiased (weak) counter value or disagreement between two predictors implies low confidence
Distance: mispredicts are clustered, hence the first 4 branches after a mispredict are treated as low confidence
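A sketch of a JRS-style estimator built from a table of resetting counters (table size, indexing, and the threshold N are illustrative assumptions):

```python
# JRS-style confidence estimator: one resetting counter per table entry.
# A counter counts successive correct predictions and resets on a mispredict.

class JRSConfidence:
    def __init__(self, table_size=1024, threshold_n=8, max_count=15):
        self.table = [0] * table_size      # resetting counters
        self.table_size = table_size
        self.threshold_n = threshold_n     # N successive correct predictions
        self.max_count = max_count         # saturate (e.g., 4-bit counter)

    def _index(self, branch_pc):
        return branch_pc % self.table_size

    def is_high_confidence(self, branch_pc):
        return self.table[self._index(branch_pc)] >= self.threshold_n

    def update(self, branch_pc, prediction_correct):
        i = self._index(branch_pc)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0              # reset on a misprediction
```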

SPEC and PVN
It is easier to achieve a high SPEC value than a high PVN value
A higher PVN can be achieved by requiring N outstanding low-confidence branches before invoking gating: if PVN is 30%, re-defining low confidence as two outstanding low-confidence branches raises the effective PVN to 51% (the arithmetic is worked out below)
SPEC (coverage): mispredicted branches detected by the low-confidence estimator
PVN (accuracy): percentage of low-confidence branches that are branch mispredicts
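The 30% to 51% figure follows from treating the two low-confidence branches as roughly independent events, a simplifying assumption for this back-of-the-envelope check:

```latex
% Probability that at least one of two low-confidence branches is mispredicted,
% assuming independence and a per-branch PVN of 0.30:
\[
  1 - (1 - 0.30)^2 \;=\; 1 - 0.49 \;=\; 0.51
\]
```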

Perfect

Gating Results

Results
Can gating improve performance? Only if cache pollution from wrong-path instructions is significant
Less than 1% performance loss and up to 38% reduction in extra work
Energy consumption could go up: some work is independent of the number of executed instructions (e.g., clock distribution), so increased execution time can increase energy
Pipeline gating should nevertheless reduce power consumption
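One way to read the energy caveat (a sketch with assumed symbols, not the paper's model): split chip power into an activity-dependent part and a part that burns regardless of useful work.

```latex
% E: total energy, T: execution time,
% P_fixed: power independent of instruction activity (e.g., clock distribution),
% P_activity: power proportional to executed (including wrong-path) instructions.
\[
  E \;=\; \big(P_{\mathrm{fixed}} + P_{\mathrm{activity}}\big)\, T
\]
% Gating lowers P_activity by suppressing wrong-path work, but if it also
% stretches T, the P_fixed * T term can grow and offset the savings.
```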

Results

CS 7810 Lecture 13 Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power S. Kaxiras, Z. Hu, M. Martonosi Proceedings of ISCA-28 July 2001

Leakage Power Trends
Circuit delay ∝ 1/(V – Vth), so as supply voltages drop, designers lower the threshold voltage to keep circuits fast
Leakage grows with the number of transistors (increasing), scales with supply voltage (decreasing), and rises exponentially as the threshold voltage is lowered (and low-Vth devices are increasingly used)
L1 and L2 caches are the biggest contributors (high transistor budgets)
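For reference, the textbook subthreshold-leakage relation behind the exponential claim (not taken from the slide; n is the subthreshold slope factor and V_T = kT/q the thermal voltage):

```latex
% Subthreshold leakage current grows exponentially as V_th is lowered:
\[
  I_{\mathrm{leak}} \;\propto\; e^{-V_{th} / (n\, V_T)},
  \qquad V_T = kT/q
\]
```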

Vdd-Gating
Leakage can be reduced by gating off the supply voltage to the circuit
When applied to a cache, the contents of the SRAM cells are lost
Cache decay: apply Vdd-gating to a line only when you no longer care about its contents

Lifetime of a Cache Line

Overheads
Hardware to determine when to decay
Introduces additional cache misses
Normalized cache leakage power = ActiveRatio (fraction of the cache that is powered on) + (counter overhead : leakage energy ratio) × counter activity + (L2 access energy : leakage energy ratio) × number of extra misses (restated as an equation below)
Increased execution time (< 0.7%)
The L2-access-to-leakage energy ratio is ~9
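Restating the slide's expression with assumed symbol names (a transcription, not the paper's exact notation):

```latex
% A     : ActiveRatio (fraction of the cache still powered on)
% R_ctr : counter-overhead-to-leakage energy ratio
% R_L2  : L2-access-to-leakage energy ratio (~9 here)
% act   : counter activity,  M : extra misses induced by decay
\[
  E_{\mathrm{norm}} \;=\; A \;+\; R_{\mathrm{ctr}}\cdot \mathrm{act} \;+\; R_{L2}\cdot M
\]
```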

Skier's Dilemma
New skis: $400; ski rentals: $100 per trip
Heuristic: buy skis once the accumulated rental cost equals the purchase price
Cumulative cost after 1..6 ski trips:
  Optimal:   $100  $200  $300  $400  $400  $400
  Heuristic: $100  $200  $300  $800  $800  $800
Likewise, decay a cache line when the leakage dissipated so far equals the cost of an additional miss
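The break-even rule, written with assumed symbols (P_leak is the per-line leakage power, E_miss the energy of the extra L2 access that a premature decay may cause):

```latex
% Decay a line once the leakage spent waiting since its last access
% matches the energy of the extra miss that decaying it might cause:
\[
  t_{\mathrm{decay}} \cdot P_{\mathrm{leak}} \;=\; E_{\mathrm{miss}}
\]
% As in the ski-rental argument, this bounds the energy wasted on any
% access pattern to at most roughly twice what an oracle policy spends.
```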

Tracking Dead Time
Each line has a 2-bit counter that is reset on every access and incremented every 2500 cycles by a global signal (negligible overhead)
After 10,000 idle cycles the counter reaches its maximum value and triggers a decay
Adaptive decay: start with a short decay period; if a decayed line suffers a quick miss, double the period; if there is no such miss, halve the period (see the sketch below)
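A minimal sketch of the counter scheme and the adaptive rule described above (MAX_COUNT, the interval bounds, and all names are illustrative assumptions, not the paper's exact design):

```python
# Hierarchical decay counters: a coarse global tick plus a small per-line counter.

class DecayLine:
    TICK_CYCLES = 2500        # global tick period
    MAX_COUNT = 4             # 4 ticks ~ 10,000 idle cycles, as in the slide

    def __init__(self):
        self.counter = 0
        self.powered_on = True

    def on_access(self):
        self.counter = 0       # any access marks the line live again
        self.powered_on = True

    def on_global_tick(self):
        if not self.powered_on:
            return
        self.counter += 1
        if self.counter >= self.MAX_COUNT:
            self.powered_on = False   # gate Vdd: contents are lost


def adapt_decay_interval(interval, decayed_line_missed_quickly,
                         min_interval=2500, max_interval=160000):
    """Per-line adaptive decay: lengthen the interval after a premature decay
    (a quick miss to a decayed line), shorten it otherwise."""
    if decayed_line_missed_quickly:
        return min(interval * 2, max_interval)
    return max(interval // 2, min_interval)
```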

Results

Overheads

Other Results
The L2 cache is equally suitable for decay techniques: lifetimes scale by roughly a factor of 10, but an extra miss also costs a lot more
For their experiments, there is little interference from multiprogramming
Some instructions can easily be identified as last touches to a cache block, creating potential for early cache decay
Can this apply to the branch predictor or register file?
