Presentation on theme: "High-Performance, Power-Aware Computing — Vincent W. Freeh, Computer Science, NCSU" — Presentation transcript:

1 1 High-Performance, Power-Aware Computing Vincent W. Freeh Computer Science NCSU vin@csc.ncsu.edu

2 2 Acknowledgements  Students  Mark E. Femal – NCSU  Nandini Kappiah – NCSU  Feng Pan – NCSU  Robert Springer – Georgia  Faculty  Vincent W. Freeh – NCSU  David K. Lowenthal – Georgia  Sponsor  IBM UPP Award

3 3 The case for power management in HPC  Power/energy consumption is a critical issue  Energy = heat; heat dissipation is costly  Limited power supply  Non-trivial amount of money  Consequence  Performance limited by available power  Fewer nodes can operate concurrently  Opportunity: bottlenecks  A bottleneck component limits the performance of other components  Reduce power of some components without reducing overall performance  Today, the CPU is:  The major power consumer (~100 W),  Rarely the bottleneck, and  Scalable in power/performance (frequency & voltage) → power/performance “gears”

4 4 Is CPU scaling a win?  Two reasons: 1. Frequency and voltage scaling – performance drops less than power 2. Application throughput – throughput drops less than performance  Assumptions  CPU is a large power consumer  CPU is the performance driver  Diminishing throughput gains  [charts (1) and (2): power vs. performance (freq); application throughput vs. performance (freq)]  CPU power P = ½CV²f
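The two-part argument can be made concrete with a small sketch. The gear values below reuse the frequency/voltage pairs from the gear table on a later slide; the capacitance constant and the assumption that performance tracks frequency are illustrative, not measured Athlon-64 data.

```python
# Sketch: why frequency/voltage scaling can be an energy win.
# Gear values mirror the deck's gear table; C and the perf model are assumptions.

GEARS = [  # (frequency in GHz, voltage in V)
    (2.0, 1.5), (1.8, 1.4), (1.6, 1.3),
    (1.4, 1.2), (1.2, 1.1), (0.8, 0.9),
]

C = 1.0  # effective switched capacitance, arbitrary units

def dynamic_power(freq_ghz, volts, c=C):
    """CMOS dynamic power: P = 1/2 * C * V^2 * f."""
    return 0.5 * c * volts ** 2 * freq_ghz

base_f, base_v = GEARS[0]
base_p = dynamic_power(base_f, base_v)

for f, v in GEARS:
    p = dynamic_power(f, v)
    perf = f / base_f          # optimistic: performance tracks frequency
    power = p / base_p
    # Energy per unit of work ~ power / performance.
    print(f"{f:.1f} GHz @ {v:.1f} V: perf {perf:.2f}, power {power:.2f}, "
          f"energy/op {power / perf:.2f}")
```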

5 5 AMD Athlon-64  x86 ISA  64-bit technology  Hypertransport technology – fast memory bus  Performance  Slower clock frequency  Shorter pipeline (12 vs. 20)  SPEC2K results  2GHz AMD-64 is comparable to 2.8GHz P4  P4 better on average by 10% & 30% (INT & FP)  Frequency and voltage scaling  2000 – 800 MHz  1.5 – 1.1 Volts

6 6 LMBench results  LMBench  Benchmarking suite  Low-level, micro data  Test each “gear”

Gear   Frequency (MHz)   Voltage (V)
0      2000              1.5
1      1800              1.4
2      1600              1.3
3      1400              1.2
4      1200              1.1
6      800               0.9
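For reference, a minimal sketch of how a gear might be pinned on a Linux node through the cpufreq sysfs interface. The paths, the reliance on the userspace governor, and the gear-to-kHz mapping are assumptions about a typical cpufreq setup, not the project's actual mechanism.

```python
# Sketch: selecting a frequency "gear" via the Linux cpufreq sysfs interface.
# Assumes a cpufreq-capable kernel with the userspace governor available.

CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq"

GEAR_KHZ = {0: 2000000, 1: 1800000, 2: 1600000,
            3: 1400000, 4: 1200000, 6: 800000}  # from the gear table above

def set_gear(cpu: int, gear: int) -> None:
    base = CPUFREQ.format(cpu=cpu)
    # The "userspace" governor lets us pin an exact frequency.
    with open(f"{base}/scaling_governor", "w") as f:
        f.write("userspace")
    with open(f"{base}/scaling_setspeed", "w") as f:
        f.write(str(GEAR_KHZ[gear]))

# Example (requires root):
# set_gear(cpu=0, gear=2)
```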

7 7 Operations

8 8 Operating system functions

9 9 Communication

10 10 Energy-time tradeoff in HPC  Measure application performance  Different from microbenchmarks  Differs between applications  Look at NAS  Standard suite  Several HPC applications  Scientific  Regular

11 11 Single node – EP  CPU bound: big time penalty, no (or little) energy savings  [chart annotations (Δ time / Δ energy): +11% / −2%, +45% / +8%, +150% / +52%, +25% / +2%, +66% / +15%]

12 12 Single node – CG  Not CPU bound: little time penalty, large energy savings  [chart annotations (Δ time / Δ energy): +1% / −9%, +10% / −20%]

13 13 Operations per miss  Metric for memory pressure  Must be independent of time  Uses hardware performance counters  Micro-operations  x86 instructions become one or more micro-operations  Better measure of CPU activity  Operations per miss (subset of NAS)  Suggestion: decrease gear as ops/miss decreases

Benchmark:   EP    BT    LU    MG    SP    CG
Ops/miss:    844   79.6  73.5  70.6  49.5  8.60
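A sketch of how the ops/miss metric could drive gear selection. The counter inputs and the threshold values are hypothetical; the slide only gives the per-benchmark ops/miss figures and the rule of thumb that lower ops/miss (more memory pressure) warrants a lower gear.

```python
# Sketch: choosing a gear from an operations-per-miss measurement.
# Cut points are hypothetical, chosen only to separate the benchmarks above.

def ops_per_miss(retired_uops: int, cache_misses: int) -> float:
    """Micro-ops retired per cache miss (a time-independent ratio)."""
    return retired_uops / max(cache_misses, 1)

def pick_gear(opm: float) -> int:
    if opm > 500:
        return 0   # CPU bound, e.g. EP
    if opm > 60:
        return 1   # e.g. BT, LU, MG
    if opm > 20:
        return 2   # e.g. SP
    return 3       # memory bound, e.g. CG

print(pick_gear(ops_per_miss(retired_uops=844_000, cache_misses=1_000)))  # -> 0
```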

14 14 Single node – LU  Modest memory pressure: gears offer an E-T tradeoff  [chart annotations (Δ time / Δ energy): +4% / −8%, +10% / −10%]

15 15 Ops per miss, LU

16 16 Results – LU  (values are Δ time, Δ energy)
Shift 0/1: +1%, −6%
Auto shift: +3%, −8%
Gear 1: +5%, −8%
Gear 2: +10%, −10%
Shift 1/2: +1%, −6%
Shift 0/2: +5%, −8%

17 17 Bottlenecks  Intra-node  Memory  Disk  Inter-node  Communication  Load (im)balance

18 18 Multiple nodes – EP  S_2 = 2.0, S_4 = 4.0, S_8 = 7.9  Perfect speedup: E constant as N increases (E = 1.02)

19 19 Multiple nodes – LU  S_2 = 1.9, E_2 = 1.03; S_4 = 3.3, E_4 = 1.15; S_8 = 5.8, E_8 = 1.28  Good speedup: E-T tradeoff as N increases  At gear 2: S_8 = 5.3, E_8 = 1.16
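A small sketch of the metrics used on these multi-node slides, assuming S_N is time-based speedup over the single-node run and E_N is total cluster energy normalized to the single-node energy. Both the definitions and the sample numbers are inferred/hypothetical, not taken from the measurements.

```python
# Sketch of the multi-node metrics, assuming:
#   S_N = T_1 / T_N            (speedup over the single-node run)
#   E_N = Energy_N / Energy_1  (total cluster energy, normalized)

def speedup(t1: float, tn: float) -> float:
    return t1 / tn

def normalized_energy(e1: float, en_total: float) -> float:
    return en_total / e1

# Hypothetical numbers: 8 nodes, equal average power per node.
t1, tn, n = 800.0, 138.0, 8     # seconds
p_node = 120.0                   # assumed average Watts per node
e1 = p_node * t1
en = n * p_node * tn
print(f"S_{n} = {speedup(t1, tn):.1f}, E_{n} = {normalized_energy(e1, en):.2f}")
```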

20 20 Multiple nodes – MG  S_2 = 1.2, E_2 = 1.41; S_4 = 1.6, E_4 = 1.99; S_8 = 2.7, E_8 = 2.29  Poor speedup: E increases as N increases

21 21 Normalized – MG  With a communication bottleneck, the E-T tradeoff improves as N increases

22 22 Jacobi iteration  Can increase N, decrease T, and decrease E

23 23 Future work  We are working on inter-node bottleneck

24 24 Safe overprovisioning

25 25 The problem  Peak power limit, P  Rack power  Room/utility  Heat dissipation  Static solution: number of servers is  N = P / P_max  where P_max is the maximum power of an individual node  Problem  Peak power > average power (P_max > P_average)  Does not use all available power – N × (P_max − P_average) goes unused  Underperforms – performance is proportional to N  Power consumption is not predictable

26 26 Safe overprovisioning in a cluster  Allocate and manage power among M > N nodes  Pick M > N, e.g., M = P/P_average  Since M·P_max > P, enforce a per-node limit P_limit = P/M  Goal  Use more power, safely under the limit  Reduce power (& peak CPU performance) of individual nodes  Increase overall application performance  [figures: power vs. time, showing P(t), P_average, and P_max under static provisioning, and P(t) capped by P_limit under overprovisioning]
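A worked sketch of the sizing arithmetic, covering both the static N from the previous slide and the overprovisioned M here. The wattage numbers are hypothetical; only the formulas N = P/P_max, M = P/P_average, and P_limit = P/M come from the slides.

```python
# Sketch: sizing a safely overprovisioned cluster (hypothetical numbers).

P     = 10_000.0   # total power budget (W), e.g. a rack limit
P_max = 250.0      # peak power of one node (W)
P_avg = 180.0      # average power of one node (W)

N = int(P // P_max)     # static (conservative) provisioning
M = int(P // P_avg)     # overprovisioned node count
P_limit = P / M         # per-node power limit that keeps all M nodes under P

print(f"static N = {N}, overprovisioned M = {M}, per-node limit = {P_limit:.0f} W")
# -> static N = 40, overprovisioned M = 55, per-node limit = 182 W
```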

27 27 Safe overprovisioning in a cluster  Benefits  Less “unused” power/energy  More efficient power use  More performance under the same power limitation  Let Perf be per-node performance (distinct from power P)  Then more overall performance means M·Perf* > N·Perf  i.e., Perf*/Perf > N/M, equivalently Perf*/Perf > P_limit/P_max  [figures: power vs. time as on the previous slide, with the unused energy shaded]

28 28 When is this a win?  When Perf*/Perf > N/M, i.e., Perf*/Perf > P_limit/P_max  In words: the power reduction exceeds the performance reduction  Two reasons: 1. Frequency and voltage scaling 2. Application throughput  [charts (1) and (2): power and application throughput vs. performance (freq), marking the regions Perf*/Perf < P_average/P_max and Perf*/Perf > P_average/P_max]

29 29 Feedback-directed, adaptive power control  Uses feedback to control power/energy consumption  Given a power goal  Monitor energy consumption  Adjust power/performance of the CPU  Paper: [COLP ’02]  Several policies  Average power  Maximum power  Energy efficiency: select the slowest gear (g) satisfying an efficiency criterion [formula not captured in the transcript]
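A much-simplified sketch of a feedback loop of this kind: derive average power from an energy counter each interval, compare it to the goal, and shift gears. The energy counter is simulated here, the per-gear power numbers are assumptions, and the real policies in [COLP '02] are more sophisticated than this single-step adjustment.

```python
import random

# Sketch: feedback-directed average-power control with simulated hardware.
GEAR_POWER_W = [100, 90, 80, 72, 65, 55]   # assumed per-gear power draw
SLOWEST_GEAR = len(GEAR_POWER_W) - 1
_energy, _gear = 0.0, 0

def read_energy_joules() -> float:        # stands in for a power meter
    return _energy

def set_gear(gear: int) -> None:          # stands in for the cpufreq call
    global _gear
    _gear = gear

def simulate_interval(seconds: float) -> None:
    global _energy
    _energy += GEAR_POWER_W[_gear] * random.uniform(0.9, 1.1) * seconds

def control_step(power_goal_w: float, interval_s: float, gear: int, last_e: float):
    e = read_energy_joules()
    avg_power = (e - last_e) / interval_s
    if avg_power > power_goal_w and gear < SLOWEST_GEAR:
        gear += 1          # over budget: slow down
    elif avg_power < power_goal_w and gear > 0:
        gear -= 1          # headroom: speed back up
    set_gear(gear)
    return gear, e

gear, last_e = 0, read_energy_joules()
for _ in range(10):
    simulate_interval(1.0)
    gear, last_e = control_step(power_goal_w=75.0, interval_s=1.0,
                                gear=gear, last_e=last_e)
    print("gear:", gear)
```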

30 30 Implementation  Components  Two components  Integrated into one daemon process  A daemon on each node  Broadcasts information at intervals  Receives information and calculates P_i for the next interval  Controls power locally  Research issues  Controlling local power  Add a guarantee (bound) on instantaneous power  Interval length  Shorter: tighter bound on power; more responsive  Longer: less overhead  The function f(L_0, …, L_M)  Depends on the relationship between power and performance  [notation: P_i(k) is the individual power limit for node i at interval k]

31 31 Results – fixed gear  [chart: gears 0–6]

32 32 Results – dynamic power control  [chart: gears 0–6]

33 33 Results – dynamic power control (2)  [chart: gears 0–6]

34 34 Summary

35 35 End

36 36 Summary  Safe overprovisioning  Deploy M > N nodes  More performance  Less “unused” power  More efficient power use  Two autonomic managers  Local: built on prior research  Global: new, distributed algorithm  Implementation  Linux  AMD  Contact: Vince Freeh, 513-7196, vin@csc.ncsu.edu

37 37 Autoshift

38 38 Phases

39 39 Allocate power based on energy efficiency  Allocate power to maximize throughput  Maximize the number of tasks completed per unit energy  Using energy-time profiles  Statically generate a table for each task  Tuples of (gear, energy/task)  Modifications  Nodes exchange pending tasks  P_i determined using the table and the population of tasks  Benefit  Maximizes task throughput  Problems  Must avoid starvation
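A sketch of one way the energy/task profiles could be used: per node, pick the most energy-efficient gear that fits under that node's power budget. The profile numbers and the greedy selection rule are hypothetical; the slide only specifies the (gear, energy/task) tables and the throughput goal.

```python
# Sketch: pick the gear with the fewest Joules per task that fits the budget.
# gear -> (power_watts, energy_per_task_joules); values are hypothetical.
PROFILE = {0: (100.0, 50.0), 2: (80.0, 42.0), 4: (65.0, 40.0)}

def pick_gear(profile: dict, power_budget: float):
    """Most energy-efficient gear whose power fits under this node's budget."""
    feasible = {g: (p, ept) for g, (p, ept) in profile.items() if p <= power_budget}
    if not feasible:
        return None
    # Fewer Joules per task = more tasks per unit energy.
    return min(feasible, key=lambda g: feasible[g][1])

print(pick_gear(PROFILE, power_budget=85.0))   # -> 4 (40 J/task fits in 85 W)
```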

40 40 Memory bandwidth

41 41 Power management – ICK: need better 1st slide  What  Controlling power  Achieving a desired goal  Why  Conserve energy  Contain instantaneous power consumption  Reduce heat generation  Good engineering

42 42 Related work: Energy conservation  Goal: conserve energy  Performance degradation is acceptable  Usually in mobile environments (finite energy source: a battery)  Primary goal:  Extend battery life  Secondary goal:  Re-allocate energy  Increase the “value” of energy use  Tertiary goal:  Increase energy efficiency  More tasks per unit energy  Example  Feedback-driven energy conservation  Control average power usage: P_ave = (E_0 − E_f)/T  [figures: energy draining from E_0 to E_f over time T; power vs. freq]
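A tiny worked example of the average-power target used in that feedback scheme: spend the remaining energy (E_0 − E_f) evenly over the desired lifetime T. The battery numbers are hypothetical.

```python
# Sketch: average-power target for feedback-driven energy conservation.
E0 = 180_000.0   # Joules available now (roughly a 50 Wh battery)
Ef = 18_000.0    # Joules to hold in reserve at the end
T  = 4 * 3600.0  # desired remaining lifetime, seconds

P_ave = (E0 - Ef) / T
print(f"target average power: {P_ave:.1f} W")   # -> 11.2 W
```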

43 43 Related work: Realtime DVS  Goal:  Reduce energy consumption  With no performance degradation  Mechanism:  Eliminate slack time in the system  Savings  E_idle with frequency scaling  An additional E_task − E_task′ with voltage scaling  [figures: power vs. time, showing E_task finishing before the deadline under P_max with idle energy E_idle, and the stretched E_task′ under voltage scaling]

44 44 Related work: Fixed installations  Goal:  Reduce cost (in heat generation or $)  Goal is not to conserve a battery  Mechanisms  Scaling  Fine-grain – DVS  Coarse-grain – power down  Load balancing

45 45 Single node – MG

46 46 Single node – EP

47 47 Single node – LU

48 48 Power, energy, heat – oh, my  Relationship  E = P × T  H ∝ E  Thus: control power  Goal  Conserve (reduce) energy consumption  Reduce heat generation  Regulate instantaneous power consumption  Situations (benefits)  Mobile/embedded computing (finite energy store)  Desktops (save $)  Servers, etc. (increase performance)

49 49 Power usage  CPU power  Dominated by dynamic power  System power dominated by  CPU  Disk  Memory  CPU notes  Scalable  Drives the rest of the system  A measure of performance  [chart: power vs. performance (freq)]  CMOS dynamic power equation: P = ½CfV²

50 50 Power management in HPC  Goals  Reduce heat generation (and $)  Increase performance  Mechanisms  Scaling  Feedback  Load balancing

51 51 Single node – MG  Modest memory pressure: gears offer an E-T tradeoff  [chart annotations (Δ time / Δ energy): +6% / −7%, +12% / −8%]

52 52 Power management  Power management vs. energy conservation  Power management is the mechanism  Energy conservation is a policy  Two elements  Energy efficiency  i.e., decrease the energy consumed per task  (Instantaneous) power consumption  i.e., limit the maximum Watts used  Power-performance tradeoff  Less power & less performance  Ultimately an energy-time tradeoff  [annotation: AMD system, 2 GHz–800 MHz, 6 gears]

53 53 Autonomic managers  Implementation uses two autonomic managers  Local – power control  Global – power allocation  Local  Uses prior research project (new implementation)  Requires new policy  Daemon process  Reads power meter  Adjusts processor performance gear (freq)  Global  At regular intervals  Collects appropriate information from all nodes  Allocates power budget for next quantum  Optimize for one of several objectives

54 54 Example: Load imbalance  Uniform allocation of power  P_i = P_limit = P/M for every node i  Not ideal if nodes are unevenly loaded  Tasks execute more slowly on busy nodes  Lightly loaded nodes may not use all of their power  Allocate power based on load*  At regular intervals, nodes exchange load information  Each node computes P_i(k), its individual power limit for the next interval k, keeping the limits within the total budget  *Note: load is one of several possible objective functions.
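A minimal sketch of a load-proportional allocation for one interval. Splitting the budget in proportion to reported load is just one plausible instance of the objective function the slide leaves open; the per-node clamps and the rescaling step are assumptions to keep the limits feasible and under the total budget.

```python
# Sketch: load-proportional per-node power limits for interval k.

def allocate_power(loads: list, total_budget: float,
                   p_min: float, p_max: float) -> list:
    """Split the cluster power budget across nodes in proportion to load,
    clamped to each node's feasible power range."""
    total_load = sum(loads) or 1.0
    limits = []
    for load in loads:
        share = total_budget * load / total_load
        limits.append(min(max(share, p_min), p_max))
    # Clamping can push the sum over budget; scale down if needed.
    s = sum(limits)
    if s > total_budget:
        limits = [p * total_budget / s for p in limits]
    return limits

print(allocate_power(loads=[4, 1, 1, 2], total_budget=720.0,
                     p_min=60.0, p_max=250.0))
# -> [250.0, 90.0, 90.0, 180.0]
```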

