1 High-Performance, Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU

2 Acknowledgements
• Students
  – Mark E. Femal – NCSU
  – Nandini Kappiah – NCSU
  – Feng Pan – NCSU
  – Robert Springer – Georgia
• Faculty
  – Vincent W. Freeh – NCSU
  – David K. Lowenthal – Georgia
• Sponsor
  – IBM UPP Award

3 The case for power management in HPC
• Power/energy consumption is a critical issue
  – Energy = heat, and heat dissipation is costly
  – Limited power supply
  – Non-trivial amount of money
• Consequence
  – Performance limited by available power
  – Fewer nodes can operate concurrently
• Opportunity: bottlenecks
  – A bottleneck component limits the performance of the other components
  – Reduce the power of some components without reducing overall performance
• Today, the CPU is:
  – the major power consumer (~100 W),
  – rarely the bottleneck, and
  – scalable in power/performance (frequency & voltage)
→ Power/performance "gears"

4 Is CPU scaling a win?
• Two reasons:
  1. Frequency and voltage scaling: the performance reduction is less than the power reduction
  2. Application throughput: the throughput reduction is less than the performance reduction
• Assumptions
  – CPU is a large power consumer
  – CPU is the driver of performance
  – Diminishing throughput gains at higher frequency
[Figure: power and application throughput vs. performance (frequency), regions (1) and (2)]
CPU dynamic power: P = ½CV²f

5 AMD Athlon-64
• x86 ISA, 64-bit technology
• HyperTransport technology – fast memory bus
• Performance
  – Slower clock frequency, shorter pipeline (12 stages vs. 20)
  – SPEC2K results: a 2GHz AMD-64 is comparable to a 2.8GHz P4
  – P4 better on average by 10% (INT) and 30% (FP)
• Frequency and voltage scaling: 2000 – 800 MHz, 1.5 – 1.1 V (worked through in the sketch below)
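To see why scaling can win, apply the dynamic-power relation from slide 4 to the endpoints of this gear range. A minimal sketch in Python; the effective capacitance C is left as an arbitrary constant, since it cancels out of the ratio:

```python
# Dynamic CMOS power: P = 1/2 * C * V^2 * f; C cancels out of the ratio.
def dynamic_power(freq_hz, volts, capacitance=1.0):
    return 0.5 * capacitance * volts ** 2 * freq_hz

p_top = dynamic_power(2000e6, 1.5)  # top gear: 2000 MHz at 1.5 V
p_low = dynamic_power(800e6, 1.1)   # lowest gear: 800 MHz at 1.1 V

print(f"frequency drops to {800 / 2000:.0%} of peak")         # 40%
print(f"dynamic power drops to {p_low / p_top:.0%} of peak")  # ~22%
```

Power falls to roughly a fifth of peak while frequency only falls to 40% – reason (1) above. Reason (2), diminishing throughput returns, depends on the application and is what the NAS measurements later in the talk quantify.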

6 LMBench results
• LMBench: benchmarking suite, low-level micro-benchmark data
• Test each "gear"

Gear | Frequency (MHz) | Voltage
(the table's values were not captured in this transcript)

7 Operations

8 Operating system functions

9 Communication

10 Energy-time tradeoff in HPC
• Measure application performance
  – Different from micro-benchmarks
  – Differs between applications
• Look at NAS
  – Standard suite
  – Several HPC applications
  – Scientific, regular

11 Single node – EP
CPU bound: big time penalty, no (or little) energy savings
[Chart annotations, per gear: +11% / -2%, +45% / +8%, +150% / +52%, +25% / +2%, +66% / +15% (time / energy)]

12 Single node – CG
Not CPU bound: little time penalty, large energy savings
[Chart annotations: +1% / -9%, +10% / -20% (time / energy)]

13 Operations per miss
• Metric for memory pressure
  – Must be independent of time
  – Uses hardware performance counters
• Micro-operations
  – x86 instructions become one or more micro-operations
  – A better measure of CPU activity
• Operations per miss for a subset of NAS: EP, BT, LU, MG, SP, CG
• Suggestion: decrease gear as ops/miss decreases (sketched below)
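As a sketch of how such a metric might drive gear selection, the following assumes two counter readings (retired micro-ops and cache misses, however a given platform exposes them) and a hypothetical threshold table; the thresholds are invented for illustration, not taken from the talk:

```python
# Hypothetical gear table: higher ops/miss (CPU-bound) -> faster gear.
# Thresholds are made up for illustration; the talk derives its policy
# empirically from the NAS benchmarks.
GEAR_THRESHOLDS = [(1000.0, 0),  # ops/miss above 1000: top gear
                   (100.0, 1),
                   (10.0, 2)]
LOWEST_GEAR = 3

def select_gear(micro_ops, misses):
    """Map an ops-per-miss sample to a frequency gear (0 = fastest)."""
    ops_per_miss = micro_ops / max(misses, 1)  # avoid division by zero
    for threshold, gear in GEAR_THRESHOLDS:
        if ops_per_miss >= threshold:
            return gear
    return LOWEST_GEAR  # heavy memory pressure: slowest gear

print(select_gear(micro_ops=5_000_000, misses=2_000))    # -> 0 (CPU-bound)
print(select_gear(micro_ops=5_000_000, misses=900_000))  # -> 3 (memory-bound)
```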

14 Single node – LU
Modest memory pressure: gears offer an E-T tradeoff
[Chart annotations: +4% / -8%, +10% / -10% (time / energy)]

15 Ops per miss, LU

16 Results – LU
• Gear 1: +5%, -8%
• Gear 2: +10%, -10%
• Shift 0/1: +1%, -6%
• Shift 1/2: +1%, -6%
• Shift 0/2: +5%, -8%
• Auto shift: +3%, -8%

17 Bottlenecks
• Intra-node: memory, disk
• Inter-node: communication, load (im)balance

18 Multiple nodes – EP
Perfect speedup: E constant as N increases
S_2 = 2.0, S_4 = 4.0, S_8 = 7.9; E = 1.02

19 Multiple nodes – LU
Good speedup: E-T tradeoff as N increases
S_2 = 1.9, E_2 = 1.03; S_4 = 3.3, E_4 = 1.15; S_8 = 5.8, E_8 = 1.28
Gear 2: S_8 = 5.3, E_8 = 1.16

20 Multiple nodes – MG
Poor speedup: increased E as N increases
S_2 = 1.2, E_2 = 1.41; S_4 = 1.6, E_4 = 1.99; S_8 = 2.7, E_8 = 2.29

21 Normalized – MG
With a communication bottleneck, the E-T tradeoff improves as N increases

22 Jacobi iteration
Increasing N can decrease T and decrease E

23 Future work
• We are working on the inter-node bottleneck

24 Safe overprovisioning

25 The problem
• Peak power limit, P
  – Rack power, room/utility power, heat dissipation
• Static solution: the number of servers is N = P / P_max, where P_max is the maximum power of an individual node
• Problem
  – Peak power > average power (P_max > P_average)
  – Does not use all the power: N · (P_max − P_average) goes unused
  – Underperforms: performance is proportional to N
  – Power consumption is not predictable

26 Safe overprovisioning in a cluster
• Allocate and manage power among M > N nodes
  – Pick M > N, e.g., M = P / P_average
  – Then M · P_max > P, so cap each node at P_limit = P / M
• Goal (see the sizing sketch below)
  – Use more power, safely under the limit
  – Reduce the power (and peak CPU performance) of individual nodes
  – Increase overall application performance
[Figures: power vs. time, P(t) against P_max and P_average; and the capped case, P(t) against P_limit]
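A minimal sketch of the sizing arithmetic in Python; the wattages are invented for illustration:

```python
import math

P_BUDGET = 10_000.0  # total power budget in watts (hypothetical)
P_MAX = 250.0        # per-node peak power (hypothetical)
P_AVERAGE = 180.0    # per-node average power (hypothetical)

# Static provisioning: size for the worst case.
n_static = math.floor(P_BUDGET / P_MAX)  # 40 nodes
unused = n_static * (P_MAX - P_AVERAGE)  # ~2800 W typically unused

# Safe overprovisioning: size for the average case, then cap each node.
m_over = math.floor(P_BUDGET / P_AVERAGE)  # 55 nodes
p_limit = P_BUDGET / m_over                # ~181.8 W per-node cap

print(n_static, unused, m_over, round(p_limit, 1))
```

The win condition on slide 28 then asks whether per-node performance at the lower cap, relative to peak, exceeds N/M (here 40/55 ≈ 0.73).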

27 Safe overprovisioning in a cluster
• Benefits
  – Less "unused" power/energy, i.e., more efficient power use
  – More performance under the same power limitation
• Let perf be per-node performance and perf* the capped per-node performance; then "more performance" means M · perf* > N · perf, i.e., perf*/perf > N/M = P_limit/P_max
[Figure: power vs. time for both cases, showing the unused energy]

28 When is this a win?
• When perf*/perf > N/M (equivalently, perf*/perf > P_limit/P_max)
  – In words: the power reduction exceeds the performance reduction
• Two reasons:
  1. Frequency and voltage scaling
  2. Application throughput
[Figure: power and application throughput vs. performance (frequency), with the regions perf*/perf < P_average/P_max and perf*/perf > P_average/P_max]

29 Feedback-directed, adaptive power control
• Uses feedback to control power/energy consumption
  – Given a power goal, monitor energy consumption and adjust the power/performance of the CPU
  – Paper: [COLP '02]
• Several policies
  – Average power
  – Maximum power
  – Energy efficiency: select the slowest gear g satisfying the criterion (given as a formula on the slide, not captured here)

30 Implementation
• Components
  – Two components, integrated into one daemon process
  – Daemons run on each node: each broadcasts information at intervals, receives information, calculates P_i^(k) (the individual power limit for node i in interval k), and controls power locally (sketched below)
• Research issues
  – Controlling local power: add a guarantee, a bound on instantaneous power
  – Interval length: shorter gives a tighter bound on power and is more responsive; longer has less overhead
  – The function f(L_0, …, L_M): depends on the relationship between power and performance
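A minimal sketch of the per-node control loop described above, in Python. The names `read_power_meter`, `set_gear`, `exchange_loads`, and `allocate` are hypothetical stand-ins injected as parameters; the actual implementation is a Linux daemon:

```python
import time

INTERVAL_SEC = 1.0  # quantum length; shorter = tighter bound, more overhead
NUM_GEARS = 6       # e.g., the 6-gear AMD system from the talk

def control_loop(node_id, read_power_meter, set_gear, exchange_loads, allocate):
    """Per-node daemon: measure power, share state, enforce P_i each interval."""
    gear = 0  # start in the fastest gear
    while True:
        watts = read_power_meter()               # local measurement
        loads = exchange_loads(node_id, watts)   # broadcast/receive at intervals
        p_i = allocate(node_id, loads)           # power limit for next interval
        # Local control: shift down a gear if over budget, back up if under.
        if watts > p_i and gear < NUM_GEARS - 1:
            gear += 1
        elif watts < p_i and gear > 0:
            gear -= 1
        set_gear(gear)
        time.sleep(INTERVAL_SEC)
```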

31 Results – fixed gear

32 Results – dynamic power control

33 Results – dynamic power control (2)

34 Summary

35 End

36 Summary
• Safe overprovisioning
  – Deploy M > N nodes
  – More performance, less "unused" power, more efficient power use
• Two autonomic managers
  – Local: built on prior research
  – Global: a new, distributed algorithm
• Implementation: Linux, AMD
• Contact: Vince Freeh

37 Autoshift

38 Phases

39 Allocate power based on energy efficiency
• Allocate power to maximize throughput
  – Maximize the number of tasks completed per unit energy
• Uses energy-time profiles
  – Statically generate a table for each task: (gear, energy/task) tuples (see the sketch below)
• Modifications
  – Nodes exchange pending tasks
  – P_i is determined using the table and the population of tasks
• Benefit: maximizes task throughput
• Problem: must avoid starvation
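A minimal sketch of using such a profile table, in Python. The energy numbers are invented, and this greedy per-task choice ignores the starvation issue noted above; in the talk the tables come from measured energy-time profiles:

```python
# Hypothetical static profiles: energy per task (joules) at each gear.
ENERGY_PER_TASK = {
    "cg": {0: 120.0, 1: 105.0, 2: 96.0},  # memory-bound: slower gear pays off
    "ep": {0: 80.0, 1: 95.0, 2: 130.0},   # CPU-bound: slowing down wastes energy
}

def most_efficient_gear(task_kind):
    """Pick the gear that completes the most tasks per joule."""
    profile = ENERGY_PER_TASK[task_kind]
    return min(profile, key=profile.get)  # lowest energy/task = most tasks/joule

print(most_efficient_gear("cg"))  # -> 2
print(most_efficient_gear("ep"))  # -> 0
```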

40 Memory bandwidth

41 Power management (ICK: need better 1st slide)
• What
  – Controlling power
  – Achieving a desired goal
• Why
  – Reduce energy consumption
  – Contain instantaneous power consumption
  – Reduce heat generation
  – Good engineering

42 Related work: Energy conservation
• Goal: conserve energy; performance degradation is acceptable
  – Usually in mobile environments (finite energy source: a battery)
• Primary goal: extend battery life
• Secondary goal: re-allocate energy to increase the "value" of energy use
• Tertiary goal: increase energy efficiency (more tasks per unit energy)
• Example: feedback-driven energy conservation
  – Control average power usage: P_ave = (E_0 − E_f) / T
[Figures: remaining energy vs. time, from E_0 to E_f over T; power vs. frequency]

43 Related work: Realtime DVS
• Goal: reduce energy consumption with no performance degradation
• Mechanism: eliminate slack time in the system (sketched below)
• Savings
  – E_idle, with frequency scaling
  – Additionally E_task − E_task′, with voltage scaling
[Figures: power vs. time showing E_task, the deadline, and P_max; and the scaled case showing E_task′ and the reclaimed E_idle]
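A minimal sketch of the slack-elimination idea, in Python. The task parameters are invented, and real realtime-DVS schedulers are considerably more careful about worst-case execution time:

```python
# A task needs `cycles` of work before `deadline_s`. Running at full speed
# finishes early and leaves idle slack; stretching to the deadline lets
# frequency (and, with it, voltage) drop.
F_MAX_HZ = 2.0e9

def slack_free_frequency(cycles, deadline_s):
    """Lowest frequency that still meets the deadline (no slack left)."""
    f_needed = cycles / deadline_s
    return min(f_needed, F_MAX_HZ)

cycles = 1.0e9        # hypothetical task length
deadline_s = 1.0      # hypothetical deadline
f = slack_free_frequency(cycles, deadline_s)
print(f / 1e9, "GHz")  # 1.0 GHz: half speed, finishing exactly at the deadline
```

Lowering f alone reclaims E_idle; lowering V along with f is what shrinks E_task to E_task′, per the ½CV²f relation.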

44 Related work: Fixed installations
• Goal: reduce cost (in heat generation or $); the goal is not to conserve a battery
• Mechanisms
  – Scaling: fine-grain (DVS) and coarse-grain (powering down)
  – Load balancing

45 Single node – MG

46 Single node – EP

47 Single node – LU

48 Power, energy, heat – oh, my
• Relationship
  – E = P · T
  – H ∝ E
  – Thus: control power
• Goal
  – Conserve (reduce) energy consumption
  – Reduce heat generation
  – Regulate instantaneous power consumption
• Situations (benefits)
  – Mobile/embedded computing (finite energy store)
  – Desktops (save $)
  – Servers, etc. (increase performance)

49 Power usage
• CPU power is dominated by dynamic power
• System power is dominated by the CPU, disk, and memory
• CPU notes
  – Scalable
  – Driver of the rest of the system
  – A measure of performance
CMOS dynamic power equation: P = ½CV²f
[Figure: power vs. performance (frequency)]

50 Power management in HPC
• Goals
  – Reduce heat generation (and $)
  – Increase performance
• Mechanisms
  – Scaling
  – Feedback
  – Load balancing

51 Single node – MG
Modest memory pressure: gears offer an E-T tradeoff
[Chart annotations: +6% / -7%, +12% / -8% (time / energy)]

52 Power management
• Power management vs. energy conservation
  – Power management is the mechanism; energy conservation is a policy
• Two elements
  – Energy efficiency, i.e., decrease the energy consumed per task
  – (Instantaneous) power consumption, i.e., limit the maximum watts used
• Power-performance tradeoff
  – Less power & less performance; ultimately an energy-time tradeoff
[Figure: 2GHz – 800MHz AMD system, 6 gears]

53 Autonomic managers
• The implementation uses two autonomic managers
  – Local: power control
  – Global: power allocation
• Local
  – Uses a prior research project (new implementation), with a new policy
  – A daemon process reads the power meter and adjusts the processor performance gear (frequency)
• Global
  – At regular intervals, collects the appropriate information from all nodes
  – Allocates the power budget for the next quantum
  – Optimizes for one of several objectives

54 Example: Load imbalance
• Uniform allocation of power
  – P_i = P_limit = P/M for each node i
  – Not ideal if nodes are unevenly loaded: tasks execute more slowly on busy nodes, and lightly loaded nodes may not use all of their power
• Allocate power based on load*
  – At regular intervals, nodes exchange load information
  – Each node computes its individual power limit P_i^(k) for the next interval k, ensuring the limits stay within the total budget (one possible scheme is sketched below)
*Note: load is one of several possible objective functions.
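A minimal sketch of one such objective function, in Python: proportional-to-load allocation with the budget constraint enforced by construction. This is an illustrative stand-in, not the talk's actual f(L_0, …, L_M):

```python
def allocate_by_load(total_budget, loads):
    """Split the cluster power budget P among nodes in proportion to load.

    loads: per-node load metrics (e.g., runnable tasks). An idle cluster
    falls back to a uniform split. The limits sum to total_budget, so the
    cluster-wide cap holds by construction.
    """
    total_load = sum(loads)
    if total_load == 0:
        return [total_budget / len(loads)] * len(loads)
    return [total_budget * load / total_load for load in loads]

# Example: 1000 W (hypothetical) across 4 nodes with uneven load.
print(allocate_by_load(1000.0, [8, 4, 2, 2]))  # -> [500.0, 250.0, 125.0, 125.0]
```

In practice one would also clamp each share to the node's physical P_max and redistribute the excess; the talk leaves the exact function as a research issue.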