Dynamic Power Redistribution in Failure-Prone CMPs
Paula Petrica, Jonathan A. Winter* and David H. Albonesi
Cornell University; *Google, Inc.
Paula Petrica, WEED

Motivation
- Hardware failures are expected to become prominent in future technology generations
[Figure: core diagram with Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]
Motivation
- Deconfiguration tolerates defects at the expense of performance
  - Pipeline imbalance: units correlated with the deconfigured one may become overprovisioned
  - Power inefficiencies
  - Application-specific behavior
[Figure: core diagram with Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]
Research Goal
Given a CMP with a set of failures and a power budget:
- Eliminate power inefficiencies
- Improve performance
Outline
- Motivation
- Architecture
- Power Harnessing
- Performance Boosting
- Power Transfer Runtime Manager
- Conclusions and Future Work
Architecture
- Two-step approach: harness power, then transfer power
[Figure: two cores (Core 1, Core 2), each with Front End (FE), Back End (BE), and Load-Store Queue (LSQ); power is harnessed within a core and transferred between cores]
Power Harnessing
[Figure: pipeline diagram with I-Cache, BPred, FQ, Decode/Rename, Dispatch, ROB, IQ, Select, RF, and D-Cache, grouped into FE, BE, and LSQ]
Pipeline Imbalance
[Chart: performance loss vs. power saved under deconfiguration]
Performance Boosting
- Distribute the accumulated power margin to boost performance
- Temporarily enable a previously dormant feature
- Requirements:
  - Small area and fast power-up
  - Small power-performance ratio (PPR)
Performance Boosting Techniques: Speculative Cache Access
- Speculatively send L1 requests to the L2 cache
- Speculatively access both tag and data in the L2 cache in parallel (rather than serially)
- The two options can be turned on independently or in combination
- Approximately linear power-performance relationship
- Benefits applications limited by L1 cache capacity
[Figure: load path through the L1 and L2 caches, contrasting serial tag-then-data access with speculative parallel access]
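The interplay of the two speculative options can be sketched with a toy latency model. All cycle counts below are illustrative assumptions, not figures from the talk; the point is only how the two forms of speculation overlap latencies that would otherwise be serialized.

```python
# Illustrative latency model for the two speculative L2 access modes.
# All cycle counts are assumed for illustration only.

L1_LAT, TAG_LAT, DATA_LAT = 2, 4, 6  # assumed cycle counts

def l2_latency(parallel_tag_data: bool) -> int:
    """Cycles spent in the L2 on an L1 miss."""
    if parallel_tag_data:
        return max(TAG_LAT, DATA_LAT)   # tag and data probed together
    return TAG_LAT + DATA_LAT           # serial: tag first, then data

def load_latency(speculative_l2: bool, parallel_tag_data: bool) -> int:
    """L1-miss load latency; a speculative L2 request overlaps the L1 probe."""
    l2 = l2_latency(parallel_tag_data)
    if speculative_l2:
        return max(L1_LAT, l2)          # L2 probed alongside the L1
    return L1_LAT + l2                  # L2 probed only after the L1 misses

baseline = load_latency(False, False)   # both serializations: 12 cycles
boosted  = load_latency(True, True)     # both speculations on: 6 cycles
```

The roughly linear power-performance relationship follows from the model: each option independently removes one serialization, at the cost of extra (sometimes wasted) L2 activity.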
Performance Boosting Techniques: Boosting Main Memory Performance
- CLEAR [N. Kirman et al., HPCA 2005]
  - Predict and speculatively retire long-latency loads
  - Supply predicted values to destination registers
  - Free processor resources for independent instructions
- Linear power-performance relationship
- Benefits memory-bound applications
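The value prediction CLEAR relies on can be illustrated with a toy last-value predictor. This is a generic sketch of the idea, not the predictor used in the CLEAR paper: a table keyed by load PC supplies a guessed value so dependents can proceed, and the guess is checked when the real load returns.

```python
# Toy last-value predictor for long-latency loads (a generic sketch,
# not the specific predictor from the CLEAR paper).

class LastValuePredictor:
    def __init__(self):
        self.table = {}        # load PC -> last observed value

    def predict(self, pc):
        """Return a predicted value (or None) so dependents can proceed."""
        return self.table.get(pc)

    def verify(self, pc, actual):
        """When the real load returns: check the speculation and train."""
        correct = self.table.get(pc) == actual
        self.table[pc] = actual
        return correct         # a misprediction would trigger recovery

vp = LastValuePredictor()
vp.predict(0x40)               # cold entry: no prediction yet
vp.verify(0x40, 7)             # train on the first returned value
vp.predict(0x40)               # later dynamic instance is predicted early
```

In the real mechanism, a correct prediction lets the load retire early and release its resources; an incorrect one requires rolling back the speculatively retired instructions.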
Performance Boosting Techniques: DVFS
- Scale up voltage and frequency
- Already built into the processor
- Cubic power cost for a linear performance benefit
- Benefits high-IPC applications
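The cubic cost comes from the standard dynamic power model P ∝ C·V²·f: under the simplifying assumption that supply voltage must scale roughly linearly with frequency, power grows as f³ while a high-IPC core's performance grows only about linearly with f. A minimal sketch of that arithmetic:

```python
# Dynamic power model P ~ C * V^2 * f, under the simplifying
# assumption that V scales linearly with f (hence the cubic cost).

def relative_power(freq_scale: float) -> float:
    """Dynamic power relative to nominal when frequency is scaled."""
    v_scale = freq_scale              # assumption: V scales with f
    return (v_scale ** 2) * freq_scale

# A 10% frequency (and hence ~10% performance) boost for a high-IPC
# core costs roughly 33% more dynamic power: 1.1^3 ~= 1.331
cost = relative_power(1.10)
```

This is why the runtime manager treats DVFS as the most expensive boosting option per unit of performance gained.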
Comparison of Boosting Techniques
[Chart: performance improvement of the boosting techniques]
Architecture (revisited)
- Two-step approach: harness power, then transfer power
[Figure: two cores (Core 1, Core 2), each with Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]
Power Transfer Runtime Manager
- Periodically coordinates a chip-wide effort to relocate power among cores:
  - Obtain the current local hardware deconfiguration status (due to faults)
  - Determine additional components to be deconfigured
  - Transfer power to one or more mechanisms that make the best use of it
Power Transfer Runtime Manager
- Sampling phase (local decisions):
  - Sample deconfigurations
  - Choose additional deconfiguration
  - Sample performance boosting
- Steady phase (global decisions):
  - Compute global throughput with fairness
  - Choose the best 4-core configuration
  - Apply DVFS (greedy)
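The greedy allocation in the steady phase can be sketched as follows. This is a hedged illustration of the general idea, not the manager's actual algorithm: spend the harvested power budget on the (core, technique) options with the best performance gain per watt first. All costs and gains below are made-up illustrative numbers.

```python
# Sketch of a greedy power-allocation step: give the harvested budget
# to the boosting options with the best gain-per-watt first.
# The option names, costs, and gains are illustrative assumptions.

def greedy_allocate(budget, candidates):
    """candidates: list of (name, power_cost_watts, perf_gain) options."""
    chosen, total_gain = [], 0.0
    # Best power-performance ratio first: most gain per watt.
    for name, cost, gain in sorted(candidates,
                                   key=lambda c: c[2] / c[1],
                                   reverse=True):
        if cost <= budget:
            budget -= cost
            total_gain += gain
            chosen.append(name)
    return chosen, total_gain

opts = [("core0:spec-L2", 0.5, 0.06),   # illustrative (watts, speedup)
        ("core1:CLEAR",   1.0, 0.10),
        ("core2:DVFS",    2.0, 0.08)]
picked, gain = greedy_allocate(2.0, opts)
```

With the numbers above, the cheap speculative-L2 and CLEAR boosts are funded and the power-hungry DVFS boost is skipped, mirroring why DVFS is applied last and greedily.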
Global vs. Local Optimization
[Chart: speedup across core configurations with random errors and random SPEC CPU2000 benchmarks; speedups of 22.2% and 10.0%]
Diversity of Boosting Techniques
[Chart: speedup across core configurations with random errors and random SPEC CPU2000 benchmarks; speedups of 22.2% and 6.3%]
Power Transfer Runtime Manager
[Chart: speedup across core configurations with random errors and random SPEC CPU2000 benchmarks; speedups of 22.2%, 15.3%, 10.0%, and 6.3%]
Conclusions
- Proposed a technique to increase performance under a fixed power budget in the presence of hard faults
- Exploited the deconfiguration capabilities already built into microprocessors
- Demonstrated that pipeline imbalances and additional deconfiguration opportunities are application-dependent
- Proposed several boosting techniques
- Demonstrated the potential for substantial performance gains on a 4-core CMP
Future Work
- Heuristic approaches to scale the problem to many cores:
  - Simulated annealing, genetic algorithms
  - Pareto-optimal fronts to reduce the number of combinations
  - Hierarchical optimization
Questions?