Author: D. Brooks, V. Tiwari, and M. Martonosi. Reviewer: Junxia Ma

Presentation transcript:

Wattch: A Framework for Architecture-Level Power Analysis and Optimizations
Author: D. Brooks, V. Tiwari, and M. Martonosi. Reviewer: Junxia Ma
University of Connecticut, School of Engineering, Department of Electrical & Computer Engineering
May 1st, 2008

Contents
Overview of this Work
Power Modeling Methodology
Model Validation
Case Studies
Conclusions

Overview: Motivation
Power is increasingly important in modern processors.
Power/performance tradeoffs need to be made more visible.
Circuit-level power analysis tools are slow and come too late in the design process.
Power dissipation is an increasingly important issue in modern processors. It is crucial that power/performance tradeoffs be made visible to chip architects and even compiler writers, in addition to circuit designers. As clock rates and die sizes increase, power dissipation is predicted to become the key limiting factor on the performance of single-chip systems.

Overview: Contribution
Wattch: a framework for analyzing and optimizing CPU power consumption at the architecture level.
1000X or more faster than layout-level power tools, while maintaining accuracy within 10% of their estimates.
Based on parameterizable power models plus per-cycle resource usage counts: the power estimates come from a suite of parameterizable power models for different hardware structures, combined with per-cycle resource usage counts generated through cycle-level simulation.

Overview: Contribution
Scenario A: Microarchitectural tradeoffs — one application binary is run through SimPower under two hardware configurations (Config 1 vs. Config 2), and the resulting power estimates (Watts-1 vs. Watts-2) are compared.
Scenario B: Compiler optimizations — two binaries of the same application are run through SimPower under a common configuration, and the power of the two code versions is compared.

Overview: Contribution
Scenario C: Hardware optimizations — a proposed hardware addition (additional hardware or a custom structure) is evaluated by asking whether it is an array structure: if so, the current models are reused; if not, the power of the structure is estimated separately. SimPower then compares the configurations with and without it (Watts-1 vs. Watts-2).

Power Modeling Methodology-1
Overall structure of the power simulator: a binary and a hardware configuration feed a cycle-level performance simulator, which produces cycle-by-cycle hardware access counts; these counts drive parameterizable power models to yield both a power estimate and a performance estimate.
The foundation of this power modeling infrastructure is a set of parameterized power models of common structures present in modern superscalar microprocessors.

Power Modeling Methodology-2
Main processor units modeled: array structures, fully associative content-addressable memories, combinational logic and wires, and clocking.
Dynamic power is modeled as P = C * Vdd^2 * a * f, where C is the load capacitance, Vdd the supply voltage, f the clock frequency, and a the activity factor — a fraction between 0 and 1 indicating how often clock ticks lead to switching activity on average.
The model estimates C based on the circuit and the transistor sizing. Vdd and f depend on the assumed process technology; this work uses 0.35um process technology parameters.
The activity factor depends on the benchmark programs being executed. For circuits that precharge and discharge on every cycle, an activity factor of 1 is used. The activity factors for certain critical subcircuits (e.g., single-ended array bitlines) are measured from the benchmark programs using the architectural simulator. For subcircuits where activity factors cannot be measured, a base activity factor of 0.5 (random switching activity) is assumed.
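The per-cycle dynamic power equation on this slide can be written out directly. This is a minimal sketch; the example parameter values below are illustrative only, not the paper's 0.35um process figures.

```python
# Dynamic power model underlying Wattch-style estimation:
#   P = C * Vdd^2 * a * f
# C: load capacitance (F), Vdd: supply voltage (V),
# a: activity factor in [0, 1], f: clock frequency (Hz).

def dynamic_power(c_load, vdd, activity, freq):
    assert 0.0 <= activity <= 1.0, "activity factor must lie in [0, 1]"
    return c_load * vdd * vdd * activity * freq

# e.g. a 10 pF load switched with activity 0.5 at 3.3 V and 600 MHz
p = dynamic_power(10e-12, 3.3, 0.5, 600e6)
```

Per-unit estimates like this, weighted by the measured activity factors, are what the simulator sums each cycle.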

Power Modeling Methodology-3 Equations for Capacitance of Critical Nodes

Array Structure
[Figure: SRAM array schematic — precharge circuitry, wordline drivers fed from the decoder, cell access transistors, and sense amps; sized by the number of wordlines and the number of bitlines.]
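The capacitance equations for the array's critical nodes take roughly the shape sketched below: wordline capacitance grows with the number of bitlines, and bitline capacitance with the number of wordlines. The per-element capacitance constants here are placeholder values for illustration, not the actual Table 1 figures.

```python
# Rough shape of the array-structure capacitance model. Each constant
# (gate, drain, and per-cell metal capacitance) is a made-up placeholder.

def wordline_capacitance(n_bitlines, c_gate_access=1.8e-15, c_metal_per_cell=0.6e-15):
    # Each cell along the wordline adds two access-transistor gate loads
    # plus the metal wire capacitance of one cell pitch.
    return n_bitlines * (2 * c_gate_access + c_metal_per_cell)

def bitline_capacitance(n_wordlines, c_drain_access=2.0e-15, c_metal_per_cell=0.6e-15):
    # Each cell along the bitline adds one access-transistor drain load
    # plus the metal wire capacitance of one cell pitch.
    return n_wordlines * (c_drain_access + c_metal_per_cell)
```

These capacitances then feed the P = C * Vdd^2 * a * f model for the wordline-driver and bitline-precharge power.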

CAM Structure
Key sizing parameters in this CAM: the issue/commit width of the machine (W); the instruction window size (impacts the CAM's height); the physical register tag size (impacts the CAM's width).

Complex Logic Blocks
Two large complex logic blocks are considered: i) instruction selection logic and ii) dependency check logic.
For result buses: power consumption is modeled by estimating the bus lengths; these lengths are multiplied by the metal capacitance per unit length (equation shown in Table 1).
For ALUs: power is scaled based on previous research results.

Clocking
The clocking network can be the most significant source of power consumption. Sources of clock power: i) global clock metal lines, ii) global clock buffers, iii) clock loading.
Global clock metal lines: long metal lines route the clock throughout the processor.
Clock buffers: very large transistors are used to drive the clock throughout the processor.
Clock loading: both explicit and implicit clock loading are considered.

Common CPU Hardware Structures and Model Types Used
SimpleScalar's hardware configuration parameters are used as inputs.

SimpleScalar Interface
The power models are interfaced with SimpleScalar, which keeps track of which units are accessed each cycle and records the total energy consumed by an application.
SimpleScalar provides a simulation environment for out-of-order processors with 5-stage pipelines; speculative execution is supported. This work extended SimpleScalar to provide a variable number of additional pipestages between fetch and issue, and assumes a 7-cycle misprediction penalty.
The SimpleScalar tool set is an architecture simulator that reproduces the behavior of a computing device: it takes system inputs and produces system outputs and system metrics. Program code written in either C or FORTRAN is compiled and executed by the simulator. The simulator tracks microarchitectural state each cycle and produces a detailed history of all instructions executed.
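The interface described above — per-cycle access counts driving per-unit power models — can be sketched as a small accounting loop. The unit names and per-cycle power numbers here are illustrative placeholders, not Wattch's actual models.

```python
# Sketch of Wattch-style power accounting on top of a cycle-level simulator:
# every cycle, each unit's access flag is sampled and converted to energy.
# Unit names and power values are made-up placeholders for illustration.

UNIT_POWER_WATTS = {   # average power drawn by a unit in a cycle it is active
    "icache": 0.9,
    "dcache": 1.1,
    "regfile": 0.7,
    "alu": 0.5,
}

def simulate_energy(access_trace, clock_hz=600e6):
    """access_trace: list of per-cycle dicts mapping unit name -> accessed?"""
    cycle_time = 1.0 / clock_hz
    total_energy = 0.0
    for cycle in access_trace:
        for unit, accessed in cycle.items():
            if accessed:  # simplest gating: pay power only when the unit is used
                total_energy += UNIT_POWER_WATTS[unit] * cycle_time
    return total_energy
```

A real integration would hook these counters into the simulator's pipeline stages rather than replaying a trace.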

Conditional Clocking Styles
Current CPU designs increasingly use conditional clocking to disable all or part of a hardware unit when it is not needed, reducing power consumption. This paper considers three options for clock gating; the figure shows the power consumption of benchmarks under each style on multi-ported hardware.
The first bar assumes simple clock gating: the full modeled power is consumed if any of a unit's ports are accessed in a given cycle, and zero power otherwise.
The second bar assumes clock gating where power scales linearly with port usage, and disabled ports consume 10% of their maximum power. For example, if two ports of a 4-port register file are used in a given cycle, the usage-dependent part of the power estimate is one-half of the power when all four ports are used.
The third bar assumes ideal clock gating: power scales linearly with port usage as in the second style, but disabled ports and units are entirely shut off.
Wattch tracks how many ports are used on each hardware structure per cycle and scales the power numbers accordingly.
Figure: power consumption of benchmarks with conditional clocking on multi-ported hardware.
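The three gating styles can be captured in one small function. This is a sketch of the styles as described on the slide, not Wattch's implementation; `full_power` stands for the unit's power with all ports active.

```python
# The three clock-gating styles for a multi-ported unit:
#   "simple": fully on if any port is used this cycle, otherwise fully off
#   "linear": power scales with port usage; idle ports still draw 10%
#   "ideal":  power scales with port usage; idle ports draw nothing

def gated_power(style, full_power, ports_used, total_ports, idle_fraction=0.10):
    active = ports_used / total_ports
    if style == "simple":
        return full_power if ports_used > 0 else 0.0
    if style == "linear":
        return full_power * (active + idle_fraction * (1 - active))
    if style == "ideal":
        return full_power * active
    raise ValueError(f"unknown gating style: {style}")
```

For a 4-port register file drawing 4 W fully active with 2 ports in use, the three styles give 4 W, 2.2 W, and 2 W respectively, matching the ordering of the three bars in the figure.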

Simulation Speed
For lower-level tools, running PowerMill on a 64-bit adder with 100 test vectors takes about an hour. In the same amount of time, Wattch can simulate a full CPU running roughly 280M SimpleScalar instructions and generate both power and performance estimates.

Model Validation-1
Three methods of validation are used — fast power simulation is only useful if it is reasonably accurate.
Validation 1: model capacitance vs. physical schematics. The percentage difference between lower-level tool capacitance values and the values estimated by the model is measured; total capacitance values are within 6~11%.

Model Validation-2
Validation 2: relative power consumption by structure. The model is compared against published power breakdown numbers available for several high-end microprocessors (Pentium Pro and Alpha 21264).
The clock power model used by Wattch is based on the H-tree style used in the Alpha 21264, not in Intel processors. The downside of this comparison is that there is no way of knowing whether the design style modeled for each unit matches the design style actually used. In spite of this, it is reassuring that the power breakdowns track quite well: as shown in the table, the relative power breakdown numbers for the Wattch model are within 10~13% on average of the reported data.

Model Validation-3
Validation 3: maximum power consumption for three CPUs (modeled vs. reported; processor configurations shown). The modeled values average 30% lower than reported, reflecting systematic underestimation:
1) this work concentrates on the units that are most immediately important for architects to consider, neglecting I/O circuitry, fuse and test circuits, and other miscellaneous logic;
2) circuit implementation techniques, transistor sizing, and process parameters vary from company to company.

Validation Summary
Capacitance estimates: ~10% (validation 1). Relative accuracy: 10~13% (validation 2).
Limitations: not all of the miscellaneous logic in real microprocessors is modeled; different circuit design styles can lead to different results; the most up-to-date industrial fabrication data is unavailable. The model is most accurate when comparing CPUs of similar fabrication technology.

Case Studies
Baseline configuration of the simulated processor: the SPECint95 and SPECfp95 benchmark suites are used; benchmarks are compiled with the Compaq Alpha cc compiler; 200M instructions are simulated per program.
Metrics used: i) power, ii) performance, iii) energy, iv) energy-delay product.

Case Study – A Microarchitectural Exploration
(Figures: IPC, power, and energy-delay product for gcc and turb3d, varying data cache size and RUU size. IPC = instructions per cycle.)
The IPC graphs show that gcc gets a significant performance benefit from increasing the data cache size. In contrast, turb3d gets relatively little performance benefit from a larger data cache, but is highly sensitive to increases in the RUU size. The power graphs are similar: both show steady increases in average power as the size of either unit is increased.
The two benchmarks have strikingly different energy characteristics. The energy-delay product combines performance and power consumption into a single metric in which lower values are better from both standpoints. The energy-delay product for gcc reaches its optimum for moderate (64KB) caches and small RUUs, indicating that although large caches continue to offer gcc small performance improvements, their contribution to increased power begins to outweigh the performance gain. For turb3d, energy-delay increases monotonically with cache size, reflecting the fact that larger caches draw more power yet offer this benchmark little performance improvement in return; moderate-sized RUUs offer the optimal energy-delay for turb3d.
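The energy and energy-delay metrics used in this case study reduce to two multiplications. A minimal sketch; the numbers in the usage example are made up for illustration, not the paper's measurements.

```python
# Metrics from the case study:
#   energy = average power * execution time        (joules)
#   energy-delay product (EDP) = energy * time     (joule-seconds)
# Lower is better for both; EDP penalizes slow-but-frugal designs.

def energy_delay(avg_power_watts, exec_time_s):
    energy = avg_power_watts * exec_time_s
    edp = energy * exec_time_s
    return energy, edp
```

Comparing two configurations then amounts to comparing their (energy, EDP) pairs: a larger cache that raises average power without reducing execution time strictly worsens both, which is exactly the turb3d behavior described above.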

Case Study – Power Analysis of Loop Unrolling
A simple matrix multiply benchmark with 200x200-entry matrices is used. As one would hope, the execution time and the number of total instructions committed decrease, because loop unrolling reduces loop overhead and address calculation instructions. The power results are more complicated, however, which makes the tradeoffs interesting for a power-aware compiler.
The total instruction count continues to decrease steadily for larger unrolling factors, even though the execution time tends to level out after unrolling by 4. The combined effect is that the energy-delay product continues to decrease slightly for larger unrolling factors, even though execution time does not. Thus a power-aware compiler might unroll more aggressively than other compilers. This example highlights that design choices can differ when power metrics are taken into account; Wattch is intended to help explore these tradeoffs.
Figure: effect of loop unrolling on performance and power.
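The transformation being studied can be illustrated on a dot-product kernel (the inner loop of matrix multiply). This sketch shows unrolling by a factor of 4 at the source level; the paper's compiler performs the equivalent transformation on the generated code, amortizing loop overhead and address calculations across more useful work per iteration.

```python
# Baseline inner loop: one multiply-add plus loop overhead per element.
def dot(a, b):
    s = 0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

# Unrolled by 4: four multiply-adds per trip through the loop body,
# so loop-control overhead is paid one quarter as often. A cleanup
# loop handles lengths that are not a multiple of 4.
def dot_unrolled4(a, b):
    s = 0
    n = len(a)
    main = n - n % 4
    for i in range(0, main, 4):
        s += a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3]
    for i in range(main, n):  # leftover elements
        s += a[i] * b[i]
    return s
```

Both functions compute the same result; the unrolled version trades a larger code body (and, per the slide, worse branch-predictor warm-up) for fewer dynamic branch and loop-overhead instructions.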

Case Study – Power Analysis of Loop Unrolling
This figure shows a breakdown of the power dissipation of individual processor units, normalized to the case with no unrolling. There are two important side effects of loop unrolling:
1) Loop unrolling decreases branch predictor accuracy, because the branch predictor has fewer branch accesses with which to warm up, and mispredicting the final fall-through branch represents a larger fraction of total predictions.
2) Removing branches leads to a more efficient front end: the fetch unit can fetch large basic blocks without being interrupted by taken branches. This in turn provides more work for the renaming unit and fills up the RUU faster.
Average fetch unit power dissipation decreases for two reasons: because branch prediction accuracy has decreased, there are more misprediction stall cycles in which no instructions are fetched; and at larger unrolling factors, the fetch unit is stalled during cycles when the instruction queue and RUU are full. The reduced number of branch instructions also significantly reduces the power dissipation of the branch prediction hardware. (The energy required to run the full program would increase because of the stall cycles.)
The renaming hardware shows a small increase in power dissipation at an unrolling factor of 2, because the front end is operating at full tilt, sending more instructions to the renamer per cycle. As the fetch unit starts to experience more stalls at unrolling factors of 4 and beyond, the renamer also begins to idle more frequently, leading to lower power dissipation.
Figure: detailed breakdown of power dissipation.

Conclusions
This paper presents a simulation framework for a wide range of architectural and compiler evaluations.
Wattch has the benefit of low-level validation, which can help researchers do power modeling at higher abstraction levels.
Wattch can provide feedback to compilers on power consumption, enabling power-aware compilation.
Design choices can differ when power metrics are taken into account; Wattch is intended to help explore these tradeoffs.
Wattch has limitations and needs improvement.