Runtime Software Power Estimation and Minimization Tao Li.



Power-aware Computing

Power: Software Perspective & Impact
- Power estimation: the first step to power management & optimization
- Software contributes to & largely impacts power consumption

Power: Software Perspective & Impact (Contd.)
- It is crucial to model power from the perspective of software
  - Evaluate software energy in the early design stage
  - Understand the impact of software optimizations on energy
  - Support run-time power management and optimizations

Software Power Estimation: Current Techniques
- Instruction level modeling
  - Computation intensive
- High level macro-modeling
  - Difficult to apply to general code
- Event counting based modeling
  - Impacted by the availability of performance counters
- Architecture level simulation
  - Large slowdown

Challenges in Run-time Power Estimation
- High fidelity & fast speed
- On-the-fly estimation capability, non-intrusive & low overhead
- Simplicity, availability and generality

Experimental Methodology
- SoftWatt: cycle-accurate & full-system power simulation framework
  - SimOS infrastructure, Wattch power model
  - Commercial OS & real applications
  - Out-of-order superscalar processor
  - Caches & memory hierarchy
  - Low-power disk

Experimental Methodology (Contd.)
- Applications
  - Mail and file management (sendmail, fileman)
  - Java (SPECjvm98: db, jess, javac, jack, mtrt, compress)
  - SPECint95 (gcc, vortex)
  - Database (Postgres: select, update, join)
  - Miscellaneous (pmake, osboot)

OS Power Characterization
- OS power varies from one application to another
  - 29 W (gcc) to 66 W (fileman)
- Power consumption also varies across OS service routines and across their invocations

OS Power Characterization (Contd.)
- OS routine power correlates with its performance
  - Circuits used to exploit ILP burn a significant portion of power
  - The number of in-flight instructions flowing through these circuits affects their switching activity
  - For a given OS routine, similar IPC indicates similar circuit switching activity and, therefore, similar power

OS Routine Power-Performance Correlation
[Figure panels: SCSI Disk Interrupt Handler; Read File System Call]

Routine Level OS Power Model
- Idea: use a linear regression model $P_{routine} = k_1 \cdot IPC_{routine} + k_0$ to track the power of an OS routine as its performance varies
- Energy(OS) = Sum [ Energy(OS routines) ] = Sum [ Power(OS routines) * Time(OS routines) ]
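
A minimal Python sketch of the routine-level model on this slide, assuming an ordinary least-squares fit (the slide says only "linear regression"). The function names, the routine name `read`, and every IPC, power and time value are hypothetical and chosen for illustration; they are not measurements from the study.

```python
from statistics import mean


def fit_linear(ipc, power):
    """Least-squares fit of power = k1 * ipc + k0 for one OS routine."""
    ipc_mean, power_mean = mean(ipc), mean(power)
    sxx = sum((x - ipc_mean) ** 2 for x in ipc)
    sxy = sum((x - ipc_mean) * (y - power_mean) for x, y in zip(ipc, power))
    k1 = sxy / sxx
    k0 = power_mean - k1 * ipc_mean
    return k1, k0


def os_energy(invocations, models):
    """Sum Power(routine) * Time(routine) over all OS routine invocations.

    invocations: iterable of (routine_name, ipc, seconds)
    models: dict mapping routine_name -> (k1, k0)
    """
    total = 0.0
    for name, ipc, seconds in invocations:
        k1, k0 = models[name]
        total += (k1 * ipc + k0) * seconds
    return total


# Hypothetical pre-characterization samples for a routine called "read".
models = {"read": fit_linear([0.5, 1.0, 1.5], [20.0, 30.0, 40.0])}

# Two hypothetical invocations: (routine, measured IPC, duration in seconds).
energy_joules = os_energy([("read", 1.0, 2.0), ("read", 0.5, 1.0)], models)
```

Each routine gets its own pair (k1, k0), so the energy sum uses a different line for each routine rather than one line for the whole OS.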

Routine Level OS Power Model (Contd.)
[Figure: model fitting error]

Routine Level OS Power Modeling
- Pre-characterization
  - Low level energy simulation
  - Model fitting
- Run-time estimation
  - OS routine boundaries
  - Evaluation using counter values
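
The run-time estimation step can be sketched as follows, assuming the estimator is handed cycle and retired-instruction counter snapshots at each OS routine entry and exit. The class name, method names, clock frequency and model coefficients are all hypothetical. The sketch also ignores nested routines, which a real implementation would need to handle.

```python
class RoutineEnergyEstimator:
    """Accumulates estimated OS energy from counter snapshots at routine boundaries."""

    def __init__(self, models, clock_hz):
        self.models = models          # routine name -> (k1, k0), fitted offline
        self.clock_hz = clock_hz      # assumed fixed clock frequency
        self.energy_joules = 0.0
        self._entry = None

    def enter(self, routine, cycles, instructions):
        """Record counter values at routine entry."""
        self._entry = (routine, cycles, instructions)

    def exit(self, cycles, instructions):
        """Evaluate the routine's model at routine exit and add its energy."""
        routine, start_cycles, start_instructions = self._entry
        self._entry = None
        elapsed_cycles = cycles - start_cycles
        if elapsed_cycles <= 0:
            return 0.0
        ipc = (instructions - start_instructions) / elapsed_cycles
        k1, k0 = self.models[routine]
        power_watts = k1 * ipc + k0
        energy = power_watts * elapsed_cycles / self.clock_hz
        self.energy_joules += energy
        return energy


# Hypothetical coefficients and a hypothetical 1 GHz clock.
estimator = RoutineEnergyEstimator({"read": (20.0, 10.0)}, clock_hz=1e9)
estimator.enter("read", cycles=0, instructions=0)
estimator.exit(cycles=1_000_000, instructions=1_000_000)
```

Only two counters are read per boundary, which is in line with the low-overhead, non-intrusive goal stated earlier in the deck.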

Cumulative Estimation Error
- Routine based Regression Model: $P_{routine} = k_1 \cdot IPC_{routine} + k_0$
- Flat Regression Model: $P_{OS} = g_1 \cdot IPC_{OS} + g_0$

Per-routine Estimation Error
- Flat Regression Model: $P_{OS} = g_1 \cdot IPC_{OS} + g_0$

Per-routine Estimation Error (Contd.)
- Routine based Regression Model: $P_{routine} = k_1 \cdot IPC_{routine} + k_0$
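
To illustrate how a flat model and a routine-based model can be compared per routine, here is a small Python sketch. The routine names and all (IPC, power) samples are invented for illustration and do not reproduce the results on these slides. Mean absolute percentage error is an assumed error metric.

```python
def fit_linear(samples):
    """Least-squares fit of power = slope * ipc + intercept over (ipc, power) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_pct_error(samples, model):
    """Mean absolute percentage error of a (slope, intercept) model on the samples."""
    slope, intercept = model
    errors = [abs(slope * x + intercept - y) / y for x, y in samples]
    return 100.0 * sum(errors) / len(errors)


# Invented (ipc, watts) samples for two hypothetical OS routines.
samples = {
    "routine_a": [(0.5, 25.0), (1.0, 30.0), (1.5, 35.0)],
    "routine_b": [(0.5, 20.0), (1.0, 35.0), (1.5, 50.0)],
}

# Flat model: one line (g1, g0) fitted to all OS samples pooled together.
flat_model = fit_linear([s for pts in samples.values() for s in pts])

# Routine-based model: one line (k1, k0) per routine.
routine_models = {name: fit_linear(pts) for name, pts in samples.items()}

flat_error = {n: mean_abs_pct_error(p, flat_model) for n, p in samples.items()}
routine_error = {n: mean_abs_pct_error(p, routine_models[n]) for n, p in samples.items()}
```

With these invented samples the two routines follow different lines, so the pooled fit cannot match either one exactly, while the per-routine fits can.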

OS Energy Dissipation 92% 89%

Phases in Programs (8-issue machine)
- Benchmark: SPECjvm98 jess
- Resources are utilized differently during different phases of program execution
- Average IPC: user 2.1, OS 1.1

Power Minimization via Processor Resource Adaptations
- Adapt processor resources to program needs
- What can be adapted?
  - Bandwidth of fetch/decode/issue/retire…
  - Size of instruction window, re-order buffer, load store queue…
- Reduce power, retain performance

Effects of Tuning Processor Resource for the OS
- 8-issue -> 4-issue
- OS performance degradation: 4%
- OS power savings: 50%
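
A short worked calculation of what these two figures imply for energy, using energy = power * time. It assumes the 4% performance degradation means 4% longer OS execution time, which the slide does not state explicitly. The function name is hypothetical.

```python
def relative_energy_and_edp(power_saving, slowdown):
    """Return (energy, energy-delay product) relative to the baseline configuration.

    power_saving: fractional power reduction, e.g. 0.50 for 50%
    slowdown: fractional increase in execution time, e.g. 0.04 for 4%
    """
    relative_power = 1.0 - power_saving
    relative_time = 1.0 + slowdown
    relative_energy = relative_power * relative_time
    relative_edp = relative_energy * relative_time
    return relative_energy, relative_edp


# 8-issue -> 4-issue figures from the slide, under the assumption stated above.
energy, edp = relative_energy_and_edp(power_saving=0.50, slowdown=0.04)
```

Under that assumption the OS uses roughly 52% of the baseline energy, and its energy-delay product falls to roughly 54% of the baseline.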

Previous Approach for Adaptations
- Sampling
[Figure: sampling-based adaptation timeline. Labels: Cycles, Sampling Window, IPC (Inst. Per Cycle), Adaptation; windows A to F]

Problems with Sampling based Adaptations (Contd.)
- OS executions
  - Short-lived

OS-aware Routine based Adaptations
- OS-aware:
  - Identify OS executions via processor execution modes
  - Just-in-time & full coverage of OS activities
- Routine-based:
  - Adapt processor resources at OS routine boundaries
  - Precise exceptions: drained pipeline
  - Achieve minimum adaptation overhead

OS-aware Routine based Adaptations (Contd.)
- Apply optimal adaptation for individual OS routine
- Exploit the routine level Energy-Delay Product variance
[Figure: OS Services]
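
One way to read "apply optimal adaptation for individual OS routine" is to pick, for each routine, the processor configuration with the lowest energy-delay product. The Python sketch below assumes a per-routine profile table is available from pre-characterization. The routine names, configuration names and all energy and delay values are invented for illustration.

```python
def best_config_per_routine(profiles):
    """Pick the configuration with the lowest energy-delay product for each routine.

    profiles: dict mapping routine -> {config_name: (energy_joules, delay_seconds)}
    """
    choice = {}
    for routine, configs in profiles.items():
        choice[routine] = min(
            configs,
            key=lambda name: configs[name][0] * configs[name][1],
        )
    return choice


# Invented profiles: the narrower machine wins for one routine but not the other.
profiles = {
    "read": {"8-issue": (2.0, 1.00), "4-issue": (1.0, 1.04)},
    "tlb_miss": {"8-issue": (1.0, 1.00), "4-issue": (0.9, 1.30)},
}

adaptation_table = best_config_per_routine(profiles)
```

At run time the resulting table would be consulted at each OS routine boundary, where the pipeline is already drained, to set the configuration for that routine.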

Routine based Adaptations: OS Power

OS Performance

OS Power & Performance Tradeoff