Dynamic Power Redistribution in Failure-Prone CMPs
Paula Petrica, Jonathan A. Winter* and David H. Albonesi
Cornell University; *Google, Inc.

Paula Petrica, WEED

Motivation
- Hardware failures are expected to become prominent in future generations.
[Diagram: core pipeline with Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]

Motivation (cont.)
- Deconfiguration tolerates defects at the expense of performance.
- Pipeline imbalance: units correlated with the deconfigured one may become overprovisioned, leading to power inefficiencies.
- These effects are application-specific.
[Diagram: core pipeline with Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]

Research Goal
Given a CMP with a set of failures and a power budget:
- Eliminate power inefficiencies
- Improve performance

Outline
- Motivation
- Architecture
- Power Harnessing
- Performance Boosting
- Power Transfer Runtime Manager
- Conclusions and future work

Architecture
- Two-step approach: harness power, then transfer power.
[Diagram: Core 1 and Core 2, each with Front End (FE), Back End (BE), and Load-Store Queue (LSQ); harvested power moves between cores]

Power Harnessing
[Diagram: pipeline with I-Cache, BPred, fetch queue (FQ), Decode/Rename, Dispatch, ROB, IQ, Select, register file (RF), and D-Cache, grouped into the FE, BE, and LSQ]

Pipeline Imbalance
[Chart: performance loss vs. power saved]

Performance Boosting
- Distribute the accumulated power margin to boost performance.
- Temporarily enable a previously dormant feature.
- Requirements: small area, fast power-up, and a small PPR (Power-Performance Ratio).
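The PPR requirement suggests a simple selection rule: spend the harvested power on the features that buy the most performance per watt. The sketch below illustrates that rule only; the feature names and power/performance numbers are invented for illustration and do not come from the talk's data:

```python
# Hypothetical sketch: pick boosting features by Power-Performance Ratio
# (PPR). All names and numbers below are illustrative assumptions.

def choose_boosts(features, power_budget):
    """Greedily enable the features with the lowest PPR (power spent per
    unit of performance gained) until the harvested budget is exhausted."""
    # Sort by PPR ascending: cheapest performance first.
    ranked = sorted(features, key=lambda f: f["power"] / f["perf_gain"])
    enabled, remaining = [], power_budget
    for f in ranked:
        if f["power"] <= remaining:
            enabled.append(f["name"])
            remaining -= f["power"]
    return enabled, remaining

features = [
    {"name": "spec_l2_access",   "power": 0.2, "perf_gain": 0.04},
    {"name": "clear_value_pred", "power": 0.5, "perf_gain": 0.08},
    {"name": "dvfs_step",        "power": 1.0, "perf_gain": 0.05},
]
print(choose_boosts(features, power_budget=0.8))
```

With these made-up numbers, the two low-PPR speculative features fit the budget while the DVFS step (cubic power cost, hence high PPR) does not.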

Performance Boosting Techniques: Speculative Cache Access
- Speculatively send L1 requests to the L2 cache.
- Speculatively access both tag and data in the L2 cache at the same time (rather than serially).
- The two options can be turned on independently or in combination.
- Approximately linear power-performance relationship.
- Benefits applications limited by L1 cache capacity.
[Diagram: load path through the L1 and L2 caches, contrasting serial tag-then-data L2 access with combined tag/data access before the lower hierarchy level]
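The latency benefit of the parallel tag/data access can be shown with a toy timing model; the cycle counts below are assumptions made up for illustration, not figures from the slides:

```python
# Toy latency model comparing serial tag-then-data L2 access with the
# speculative parallel access described above. Cycle counts are invented.

TAG_LAT, DATA_LAT = 3, 6  # hypothetical L2 tag/data array latencies (cycles)

def l2_hit_latency(parallel: bool) -> int:
    # Serial: read the tag array first, then the data array on a match.
    # Parallel: read both arrays at once and discard the data on a tag
    # mismatch, trading extra data-array energy for a shorter hit time.
    return max(TAG_LAT, DATA_LAT) if parallel else TAG_LAT + DATA_LAT

print(l2_hit_latency(parallel=False), l2_hit_latency(parallel=True))  # 9 6
```

The parallel mode burns data-array power even on tag misses, which is why it only makes sense when spare (harvested) power is available.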

Performance Boosting Techniques: Boosting Main Memory Performance
- CLEAR [N. Kirman et al., HPCA 2005]
- Predict and speculatively retire long-latency loads.
- Supply predicted values to destination registers, freeing processor resources for non-dependent instructions.
- Linear power-performance relationship.
- Benefits memory-bound applications.
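The idea of speculatively retiring a long-latency load can be sketched with a toy last-value predictor: retire the load with a guessed value so dependents proceed, then verify when memory finally responds and roll back on a mismatch. This is an illustration of the concept only, not the CLEAR microarchitecture:

```python
# Toy concept sketch (NOT the actual CLEAR design): early load retirement
# with a simple last-value predictor and a verify-or-rollback check.

class EarlyRetire:
    def __init__(self):
        self.last_value = {}              # last-value predictor table

    def predict(self, addr):
        # Value handed to the destination register at early retirement.
        return self.last_value.get(addr, 0)

    def verify(self, addr, actual):
        # Called when the memory response arrives: True means the
        # speculation was correct; False means roll back to checkpoint.
        ok = self.predict(addr) == actual
        self.last_value[addr] = actual    # train the predictor
        return ok

p = EarlyRetire()
print(p.verify(0x40, 7))  # cold prediction (0) mispredicts -> False
print(p.verify(0x40, 7))  # last-value predictor now hits -> True
```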

Performance Boosting Techniques: DVFS
- Scale up voltage and frequency.
- Already built into the processor.
- Cubic power cost for a linear performance benefit.
- Benefits high-IPC applications.
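The cubic cost follows from dynamic power scaling as CV²f, with voltage having to rise roughly linearly with frequency. A back-of-the-envelope sketch of that scaling assumption (not measured data):

```python
# Dynamic power ~ C * V^2 * f; if V must track f roughly linearly,
# power grows as f^3 while performance grows at most linearly with f.
# This linear V-f tracking is a simplifying assumption.

def relative_power(freq_scale: float) -> float:
    """Power relative to nominal when frequency (and, proportionally,
    voltage) is scaled by freq_scale."""
    return freq_scale ** 3

# A 10% frequency boost costs ~33% more power for at most 10% speedup,
# which is why DVFS boosting pays off mainly for high-IPC code.
print(f"{relative_power(1.1):.3f}")
```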

Comparison of Boosting Techniques
[Chart: performance improvement of each boosting technique]

Architecture (revisited)
- Two-step approach: harness power, then transfer power.
[Diagram: Core 1 and Core 2, each with Front End (FE), Back End (BE), and Load-Store Queue (LSQ); harvested power moves between cores]

Power Transfer Runtime Manager
Periodically coordinates a chip-wide effort to relocate power among cores:
- Obtain the current local hardware deconfiguration status (due to faults).
- Determine additional components to be deconfigured.
- Transfer power to one or more mechanisms that make the best use of it.

Power Transfer Runtime Manager
Sampling phase (local decisions):
- Sample deconfigurations
- Choose additional deconfiguration
- Sample performance boosting
Steady phase (global decisions):
- Compute global throughput with fairness
- Choose the best 4-core configuration
- Apply DVFS (greedy)
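One way to picture the two phases is as a local candidate-pruning step followed by a global search over per-core configurations. Everything below — the data layout, the numbers, and the geometric-mean fairness score — is an invented sketch under those assumptions, not the talk's actual algorithm:

```python
# Hypothetical sketch of the two-phase runtime manager. All APIs,
# numbers, and the fairness metric are assumptions for illustration.
import math
from itertools import product

def fair_score(combo):
    # Geometric mean of per-core performance: a stand-in for "global
    # throughput with fairness" (an assumed metric, not the talk's).
    return math.prod(cfg["perf"] for cfg in combo) ** (1 / len(combo))

def runtime_manager_interval(cores, power_budget, score):
    # Sampling phase (local decisions): each core prunes its sampled
    # deconfiguration/boosting options down to affordable candidates.
    for core in cores:
        core["candidates"] = [cfg for cfg in core["configs"]
                              if cfg["power"] <= power_budget]
    # Steady phase (global decision): exhaustively evaluate per-core
    # combinations under the chip budget -- feasible at 4 cores; the
    # future-work heuristics target larger core counts.
    best, best_score = None, float("-inf")
    for combo in product(*(c["candidates"] for c in cores)):
        if sum(cfg["power"] for cfg in combo) <= power_budget:
            s = score(combo)
            if s > best_score:
                best, best_score = combo, s
    return best, best_score

cores = [
    {"configs": [{"power": 1.0, "perf": 1.0}, {"power": 2.0, "perf": 1.4}]},
    {"configs": [{"power": 1.0, "perf": 0.9}, {"power": 2.0, "perf": 1.2}]},
]
combo, s = runtime_manager_interval(cores, power_budget=3.0, score=fair_score)
```

The exhaustive steady phase is exactly the part that motivates the heuristic future-work directions: its cost grows exponentially with core count.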

Global vs. Local Optimization
[Chart: speedup over core configurations with random errors and random SPEC CPU2000 benchmarks; bars at 22.2% and 10.0%]

Diversity of Boosting Techniques
[Chart: speedup over core configurations with random errors and random SPEC CPU2000 benchmarks; bars at 22.2% and 6.3%]

Power Transfer Runtime Manager
[Chart: speedup over core configurations with random errors and random SPEC CPU2000 benchmarks; bars at 22.2%, 15.3%, 10.0%, and 6.3%]

Conclusions
- Proposed a technique to increase performance under a given power budget in the presence of hard faults.
- Exploited the deconfiguration capabilities already built into microprocessors.
- Demonstrated that pipeline imbalances and additional deconfiguration are application-dependent.
- Proposed several boosting techniques.
- Demonstrated the potential for substantial performance gains on a 4-core CMP.

Future Work
- Heuristic approaches to scale this problem to many cores: simulated annealing, genetic algorithms.
- Pareto-optimal fronts to reduce the number of combinations.
- Hierarchical optimization.

Questions?