Green Governors: A Framework for Continuously Adaptive DVFS Vasileios Spiliopoulos, Stefanos Kaxiras Uppsala University, Sweden.

Presentation transcript:

Green Governors: A Framework for Continuously Adaptive DVFS Vasileios Spiliopoulos, Stefanos Kaxiras Uppsala University, Sweden

2 Introduction
- Optimize power efficiency
  - Reduce power without harming performance
  - Goal: minimize power-efficiency metrics such as energy-delay product (EDP), energy-delay-squared product (ED^2P), etc.
- Exploit memory slack
  - Applications with many LLC misses → memory becomes the bottleneck
  - Performance is insensitive to processor frequency
  - Scaling frequency down → high energy benefit at low performance cost
- Develop analytical models to predict the impact of frequency scaling
  - No empirical parameters
  - No training period
  - Suitable for run-time use

3 Modeling DVFS
- Theoretical (work in simulator)
  - Extend previous interval-based models (Karkhanis and Smith, ISCA 2004; Eyerman et al., ACM TOCS, 2010)
  - Two models for run-time DVFS management: miss-based and stall-based, which differ in accuracy and ease of implementation
  - Estimate energy benefits against performance loss
  - G. Keramidas, V. Spiliopoulos, and S. Kaxiras. "Interval-Based Models for Run-Time DVFS Orchestration in SuperScalar Processors." Proc. of Int. Conference on Computing Frontiers, 2010.
- Implementation in real hardware
  - Apply the model for power-performance adaptation in real processors
    - Case study: Intel Core i7
    - Approximate models based on available performance-monitoring hardware
  - Estimate power characteristics of real hardware
  - V. Spiliopoulos, S. Kaxiras, G. Keramidas. "Green Governors: A Framework for Continuously Adaptive DVFS." International Green Computing Conference (IGCC'11).

4 Interval-based Performance Model
- Break the execution time of a program into intervals
  - Steady-state intervals: the IPC is limited by the machine width and the program's ILP
  - Miss intervals: introduce stall cycles due to branch mispredictions, on-chip instruction/data misses, and LLC (off-chip) misses
[Figure: instruction rate (IPC) versus cycles; steady-state IPC interrupted by miss intervals: branch misprediction, on-chip instruction miss, on-chip data miss, LLC (off-chip) miss]

5 Interval-based DVFS Model (step 1)
- Miss intervals and frequency scaling (time measured in cycles)
  - Branch-misprediction intervals → same penalty (in cycles) at all frequencies
  - On-chip data/instruction miss intervals → same penalty (in cycles) at all frequencies
  - LLC (off-chip) miss intervals → the only intervals the DVFS model has to account for
[Figure: instruction rate (IPC) versus cycles; steady-state IPC interrupted by miss intervals: branch misprediction, on-chip instruction miss, on-chip data miss, LLC (off-chip) miss]

6 Interval-based DVFS Model (step 2)
- LLC miss interval and frequency scaling
  - Model core frequency scaling as a change in memory latency measured in cycles
  - Example: memory access time = 100 ns
    - f = 1 GHz → T = 1 ns → mem_lat = 100 cycles
    - f = 500 MHz → T = 2 ns → mem_lat = 50 cycles
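The conversion in the example above can be sketched in a few lines of Python. The function name and units are illustrative choices, not taken from the slides.

```python
def mem_latency_cycles(access_time_ns: float, freq_ghz: float) -> float:
    """Memory latency expressed in core cycles.

    The wall-clock access time is fixed by the memory system; only the
    core cycle time T = 1 / f changes with DVFS.
    """
    cycle_time_ns = 1.0 / freq_ghz
    return access_time_ns / cycle_time_ns


# The two cases from the slide: 100 ns access time at 1 GHz and at 500 MHz.
print(mem_latency_cycles(100.0, 1.0))  # 100.0
print(mem_latency_cycles(100.0, 0.5))  # 50.0
```

Halving the core frequency halves the number of cycles the core waits for each off-chip access, which is what the following slides exploit.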

7 Interval-based DVFS Model (step 2)
- LLC miss interval and frequency scaling
  - Model core frequency scaling as a change in memory latency measured in cycles
[Figure: instruction rate (IPC) versus cycles around an LLC (off-chip) miss; the miss interval is labeled ROB fill, IQ drain, full stall, and ramp-up, with the memory latency marked]

8 Frequency scaling == Change in memory latency
- Lower frequency → shorter memory latency (in cycles) → smaller full-stall area
  - Other areas (ROB fill, IQ drain, and ramp-up) remain intact
[Figure: instruction rate (IPC) versus cycles around an LLC miss, with ROB fill, IQ drain, full stall, and ramp-up areas, drawn for two different memory latencies]

9 DVFS target: Eliminate the slack
- Reduce the memory latency (in cycles) down to the ROB fill time
  - At that point no slack due to off-chip misses is left
  - Further reduction → performance penalty
[Figure: instruction rate (IPC) versus cycles around an LLC miss, shown twice: once with ROB fill, IQ drain, full stall, and ramp-up within the memory latency, and once with only ROB fill within the memory latency]

10 Elastic and Non-Elastic Areas
- Target: eliminate "slack" by reducing memory latency, but:
  - ROB fill area DOES NOT shrink → inelastic area
  - Full-stall, IQ drain, and ramp-up areas DO shrink → elastic areas
[Figure: instruction rate (IPC) versus cycles around an LLC miss, with ROB fill, IQ drain, full stall, and ramp-up areas and the memory latency marked]

11 Two Simple Interval-Based Models
- Stall-based model
  - Fed by in-core information
  - Assumes all stalls scale with frequency
    - Disregards the ROB fill area
  - Can be used in real hardware
- Miss-based model
  - Fed by information from the memory system
  - Accounts for both elastic and inelastic areas
  - Required information is not available in current hardware

12 Stall-based Model
- Assume (all) stalls scale with f
  - Not true, due to ROB fill
- Exec cycles at f/k: c_init - stalls + (stalls / k)
[Figure: instruction rate (IPC) versus cycles around an LLC miss, with the ROB fill area, the memory latency, and the stalls marked]
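A minimal sketch of the stall-based formula, assuming `c_init` is the cycle count measured at frequency f and `stalls` is the number of those cycles spent stalled on LLC misses. The function names and the sample numbers are hypothetical.

```python
def stall_based_cycles(c_init: float, stalls: float, k: float) -> float:
    """Predicted cycles at frequency f/k: c_init - stalls + stalls / k.

    Non-stall cycles are unchanged; stall cycles shrink by the factor k
    because memory latency, counted in core cycles, shrinks by k.
    """
    return c_init - stalls + stalls / k


def exec_time_s(cycles: float, freq_hz: float) -> float:
    """Wall-clock time for a given cycle count at a given frequency."""
    return cycles / freq_hz


# Hypothetical window: 1000 cycles at 1 GHz, 400 of them LLC stalls.
f = 1e9
new_cycles = stall_based_cycles(1000, 400, k=2)    # 800.0 cycles at 500 MHz
slowdown = exec_time_s(new_cycles, f / 2) / exec_time_s(1000, f)
print(new_cycles, slowdown)
```

Fewer cycles do not mean less time: each cycle at f/k is k times longer, so the predicted slowdown comes from the cycles that did not shrink.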

13 Miss-based Model
- Assumes the whole miss interval scales with f
- Exec cycles at f/k: c_init - misses * mem_lat + (misses * mem_lat / k)
[Figure: instruction rate (IPC) versus cycles around an LLC miss, with the ROB fill area and the memory latency marked]

14 Miss-based Model, more ...
- Important implication for overlapping misses: stalls of misses under a miss do not scale, because of the inelastic ROB fill
- The miss-based model predicts execution cycles based on the number of clusters of misses
[Figure: instruction rate (IPC) versus cycles for two overlapping misses (Miss1, Miss2); labels: steady-state IPC, mem. latency, d]
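The miss-based formula can be sketched as follows, with the miss count taken to be the number of miss clusters, as slide 14 states. The clustering rule in `count_clusters` (a miss joins the current cluster if it is issued before the cluster's first miss returns) is an assumption made for illustration; the slides do not spell out how clusters are delimited.

```python
def count_clusters(miss_cycles, mem_lat):
    """Count clusters of overlapping misses (hypothetical rule, see lead-in).

    miss_cycles: cycle numbers at which LLC misses are issued.
    mem_lat:     memory latency in cycles at the current frequency.
    """
    clusters = 0
    cluster_end = None
    for t in sorted(miss_cycles):
        if cluster_end is None or t >= cluster_end:
            clusters += 1              # this miss starts a new cluster
            cluster_end = t + mem_lat  # cluster lasts until its first miss returns
    return clusters


def miss_based_cycles(c_init, clusters, mem_lat, k):
    """Predicted cycles at f/k: c_init - n*mem_lat + n*mem_lat / k."""
    return c_init - clusters * mem_lat + clusters * mem_lat / k


# Hypothetical trace: five misses that form three clusters.
n = count_clusters([0, 30, 250, 260, 600], mem_lat=100)
print(n, miss_based_cycles(10_000, n, mem_lat=100, k=2))
```

Counting clusters rather than individual misses keeps overlapped misses from being charged a full memory latency each.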

15 Real Hardware Approximations
- Cannot apply the miss-based model
  - No counter for clusters of misses is available
- Cannot apply the stall-based model as is
  - No counter for stalls due to LLC misses is available
- Approximate stall-based model
  - Approximate LLC stalls with the minimum of all pipeline stalls and the worst-case stalls due to LLC misses (LLC misses * mem_lat)
- Good accuracy
  - Predict execution time going from f_min to f_max and vice versa
  - Less than 5% average error
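A minimal sketch of the approximation described above, combined with the stall-based formula from slide 12. The counter names are placeholders and do not correspond to specific Intel performance-monitoring events.

```python
def approx_llc_stalls(all_pipeline_stalls, llc_misses, mem_lat):
    """Approximate LLC stall cycles as min(all stalls, llc_misses * mem_lat).

    llc_misses * mem_lat is the worst case in which no miss overlaps with
    another; the total pipeline stall count caps that estimate.
    """
    return min(all_pipeline_stalls, llc_misses * mem_lat)


def predict_cycles(c_init, all_pipeline_stalls, llc_misses, mem_lat, k):
    """Stall-based prediction at f/k using the approximated LLC stalls."""
    stalls = approx_llc_stalls(all_pipeline_stalls, llc_misses, mem_lat)
    return c_init - stalls + stalls / k


# Hypothetical counter readings for one window.
print(approx_llc_stalls(500, 3, 100))         # worst-case LLC stalls are the smaller term
print(approx_llc_stalls(200, 5, 100))         # total pipeline stalls are the smaller term
print(predict_cycles(1000, 500, 3, 100, k=2))
```

Here k > 1 models scaling down and k < 1 models scaling up, which covers predictions in both directions between f_min and f_max.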

16 Measuring power

17 Power prediction
- Previous researchers correlated total power (P = a C f V^2 + P_static) with performance-counter events
- We correlate effective capacitance (P = a C f V^2 + P_static) with performance-counter events
  - Run a set of benchmarks
  - Compute the effective C of benchmark i as [equation not preserved in the transcript]
  - Estimate C_i as [equation not preserved in the transcript]
  - Minimize [equation not preserved in the transcript]

18 Power prediction
- Only need to train the model for a single frequency: [equation not preserved in the transcript]
- Prediction at other frequencies: [equation not preserved in the transcript]
- Events monitored
  - Uops executed
  - L2 misses
  - L2 accesses
  - Resource stalls
  - FP operations
  - Branch mispredictions
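The equations on these two slides did not survive in the transcript, so the following is only a sketch of how the stated power formula P = a C f V^2 + P_static could be rescaled across operating points. It assumes that the effective capacitance is the product a * C, that it is independent of frequency and voltage, and that P_static is a known constant; none of this is confirmed by the slides, and the numbers are made up.

```python
def effective_capacitance(power_w, p_static_w, freq_hz, volts):
    """Solve P = C_eff * f * V^2 + P_static for C_eff (assumed to equal a * C)."""
    return (power_w - p_static_w) / (freq_hz * volts ** 2)


def predict_power(c_eff, p_static_w, freq_hz, volts):
    """Apply the same formula at a different (f, V) operating point."""
    return c_eff * freq_hz * volts ** 2 + p_static_w


# Hypothetical measurement: 30 W at 2 GHz and 1.0 V, with 10 W static power.
c_eff = effective_capacitance(30.0, 10.0, 2e9, 1.0)
print(predict_power(c_eff, 10.0, 1e9, 0.8))  # predicted power at 1 GHz, 0.8 V
```

In the framework described on the slides, the effective capacitance is estimated from performance-counter events rather than from a direct power measurement; the measurement here only stands in for that estimate.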

19 Implementing Linux Frequency Governors
- Linux kernel module that selects the frequency
- Window-based approach
  - Run the application for a time window
  - Estimate performance (using the stall-based model) and power at any frequency
  - Scale frequency based on the policy of interest
- Implement different policies
  - Optimize EDP/ED^2P with or without performance constraints
  - Single- and multi-process management
- Experimental framework
  - Intel Core i7
  - SPEC2006 benchmark suite
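The window-based selection step can be sketched in user-space Python as below. This is not the authors' kernel module: the `Window` fields, the voltage/frequency table, the static-power constant, and the choice to measure slowdown against the highest available frequency are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    cycles: float   # cycles executed during the last time window
    stalls: float   # approximated LLC stall cycles in that window
    freq_hz: float  # frequency the window ran at
    c_eff: float    # effective capacitance estimated for the window


def predict_time(w, target_hz):
    """Stall-based model: stall cycles scale by k = f_current / f_target."""
    k = w.freq_hz / target_hz
    return (w.cycles - w.stalls + w.stalls / k) / target_hz


def predict_power(w, target_hz, volts, p_static):
    """P = C_eff * f * V^2 + P_static at the target operating point."""
    return w.c_eff * target_hz * volts ** 2 + p_static


def choose_frequency(w, vf_table, p_static, delay_exp=1, max_slowdown=None):
    """Pick the frequency minimizing E * D^delay_exp (1: EDP, 2: ED^2P).

    max_slowdown (>= 1.0), if given, rejects frequencies whose predicted time
    exceeds that multiple of the predicted time at the highest frequency.
    """
    base_time = predict_time(w, max(vf_table))
    best_metric, best_freq = None, None
    for freq, volts in vf_table.items():
        t = predict_time(w, freq)
        if max_slowdown is not None and t > base_time * max_slowdown:
            continue
        energy = predict_power(w, freq, volts, p_static) * t
        metric = energy * t ** delay_exp
        if best_metric is None or metric < best_metric:
            best_metric, best_freq = metric, freq
    return best_freq


# Hypothetical operating points (Hz -> V) and two synthetic windows.
VF = {1e9: 0.8, 2e9: 1.0, 3e9: 1.2}
memory_bound = Window(cycles=3e6, stalls=2.4e6, freq_hz=3e9, c_eff=5e-9)
compute_bound = Window(cycles=3e6, stalls=0.0, freq_hz=3e9, c_eff=5e-9)
print(choose_frequency(memory_bound, VF, p_static=5.0))
print(choose_frequency(memory_bound, VF, p_static=5.0, max_slowdown=1.2))
print(choose_frequency(compute_bound, VF, p_static=5.0))
```

With these made-up numbers the memory-bound window is sent to the lowest frequency, the performance constraint pushes it to the middle one, and the compute-bound window stays at the highest frequency, which mirrors the unconstrained and constrained policies named on the result slides (OptEDP and OptEDPlimit).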

20 Intel i7 single process (OptEDP)

21 Intel i7 single process (OptEDPlimit)

22 Intel i7 multi-process (OptEDP)

23 Conclusions
- DVFS modeling in simulators
- Implement the model in real processors
  - Apply, explain, and validate our model for SPEC2006
- Contribution: optimize power efficiency using Linux frequency governors
- Other uses of the models
  - PowerSleuth: combine the models with phase detection to characterize the power behavior of applications
- Future work
  - Multi-threaded applications

24 Thank You! Any questions?