High-Performance Power-Aware Computing


1 High-Performance Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU

2 Acknowledgements
NCSU: Tyler K. Bletsch, Mark E. Femal, Nandini Kappiah, Feng Pan, Daniel M. Smith
U of Georgia: Robert Springer, Barry Rountree, Prof. David K. Lowenthal

3 The case for power management
Eric Schmidt, Google CEO: “it’s not speed but power—low power, because data centers can consume as much electricity as a small city.”
Power/energy consumption is becoming a key issue:
- Power limitations
- Energy = heat; heat dissipation is costly (a non-trivial amount of money)
Consequence:
- Excessive power consumption limits performance
- Fewer nodes can operate concurrently
Goal:
- Increase power/energy efficiency
- More performance per unit power/energy

4 CPU scaling
Power ∝ frequency × voltage²
How: CPU scaling
- Reduce frequency & voltage
- Reduce power & performance
Energy/power gears:
- Frequency-voltage pair
- Power-performance setting
- Energy-time tradeoff
Why CPU scaling?
- The CPU is a large power consumer
- The mechanism already exists
[Figures: power vs. frequency/voltage; application throughput vs. frequency/voltage]
A rough numeric sketch of this tradeoff follows.
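As a back-of-the-envelope illustration of why scaling can pay off, the sketch below applies P ∝ f·V² to a made-up gear table (the values are placeholders, not the actual Athlon-64 gears from the methodology slide) and compares relative power against the worst-case, CPU-bound slowdown.

/* Illustrative only: the gear values below are made up, not the actual
 * Athlon-64 frequency-voltage table used in the talk. */
#include <stdio.h>

int main(void) {
    /* hypothetical frequency (MHz) - voltage (V) gears */
    double freq[] = {2000, 1800, 1600, 1400, 1200, 1000, 800};
    double volt[] = {1.50, 1.45, 1.40, 1.30, 1.20, 1.10, 1.00};
    int ngears = sizeof freq / sizeof freq[0];

    double p0 = freq[0] * volt[0] * volt[0];           /* reference: top gear */
    for (int g = 0; g < ngears; g++) {
        double p = freq[g] * volt[g] * volt[g];        /* P ~ f * V^2 */
        double slowdown = freq[0] / freq[g];           /* worst case: CPU-bound code */
        /* E = P * T, so relative energy = relative power * relative time */
        printf("gear %d: %4.0f MHz  rel. power %.2f  worst-case rel. energy %.2f\n",
               g, freq[g], p / p0, (p / p0) * slowdown);
    }
    return 0;
}

For memory-bound codes the slowdown is much smaller than f0/f, which is exactly the gap the rest of the talk exploits.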

5 Is CPU scaling a win?
[Figure: power vs. time at the full gear; system power P_system splits into P_CPU and P_other, so the energy over time T splits into E_CPU and E_other.]

6 Is CPU scaling a win?
[Figure: full vs. reduced gear. Lowering P_CPU shrinks E_CPU (the benefit), but the run stretches from T to T+ΔT, so E_other grows (the cost). Scaling wins when the benefit outweighs the cost.]

7 Our work
Exploit bottlenecks:
- The application is waiting on a bottleneck resource
- Reduce power consumption of non-critical resources
- Generally the CPU is not on the critical path
Bottlenecks we exploit:
- Intra-node (memory)
- Inter-node (load imbalance)
Contributions:
- Impact studies [HPPAC ’05] [IPDPS ’05]
- Varying gears/nodes [PPoPP ’05] [PPoPP ’06 (submitted)]
- Leveraging load imbalance [SC ’05]

8 Methodology
Cluster used: 10 nodes, AMD Athlon-64
- Processor supports 7 frequency-voltage settings (gears); [table of Frequency (MHz) vs. Voltage (V) omitted]
Measurements:
- Wall clock time (gettimeofday system call)
- Energy (external power meter)
A minimal timing sketch appears below.
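The energy numbers come from an external meter and cannot be reproduced in code, but the timing side is just gettimeofday around the region of interest. A minimal sketch; work() is a stand-in for the benchmark being measured:

#include <stdio.h>
#include <sys/time.h>

/* stand-in for the region being measured (e.g., one NAS benchmark run) */
static void work(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++) x += 1.0 / (double)(i + 1);
}

int main(void) {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    work();
    gettimeofday(&t1, NULL);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("wall clock time: %.3f s\n", elapsed);
    return 0;
}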

9 NAS

10 CG – 1 node
[Figure: time and energy at 2000 MHz vs. 800 MHz; +1% time, -17% energy.]
Not CPU bound: little time penalty, large energy savings.

11 EP – 1 node
[Figure: +11% time, -3% energy.]
CPU bound: big time penalty, little or no energy savings.

12 Operations per miss
SP: 49.5, CG: 8.60, BT: 79.6, EP: 844

13 Multiple nodes – EP Perfect speedup: E constant as N increases

14 Multiple nodes – LU
Good speedup: E-T tradeoff as N increases.
[Figure annotations: S8 = 5.3; at gear 2, S2 = 1.9, E2 = 1.03; S4 = 3.3, E4 = 1.15; S8 = 5.8, E8 = 1.28.]

15 Phases

16 Phases: LU

17 Phase detection
First, divide the program into blocks:
- All code in a block executes in the same gear
- Block boundaries: MPI operations, or points where an OPM change is expected
Then, merge adjacent blocks into phases:
- Merge if memory pressure is similar, using OPM: |OPM_i − OPM_i+1| small
- Merge if the block is small (short time)
Note, in future: leverage the large body of phase detection research [Kennedy & Kremer 1998] [Sherwood et al. 2002]
A sketch of the merge step follows.
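The exact merge procedure is not spelled out on the slide; the sketch below is one plausible greedy pass over per-block profiles, with the thresholds (OPM_TOL, MIN_TIME) as assumed parameters, not the project's actual values.

#include <math.h>
#include <stdio.h>

#define OPM_TOL   10.0   /* assumed: max OPM difference to call two blocks "similar" */
#define MIN_TIME   0.05  /* assumed: blocks shorter than this (s) are always merged  */

struct block { double opm; double time; };

/* Greedy pass: merge adjacent blocks with similar memory pressure (OPM)
 * or negligible duration.  Returns the number of phases. */
static int merge_phases(struct block *b, int n) {
    int out = 0;
    for (int i = 1; i < n; i++) {
        struct block *cur = &b[out];
        if (fabs(cur->opm - b[i].opm) < OPM_TOL || b[i].time < MIN_TIME) {
            /* time-weighted OPM of the merged phase */
            cur->opm = (cur->opm * cur->time + b[i].opm * b[i].time)
                       / (cur->time + b[i].time);
            cur->time += b[i].time;
        } else {
            b[++out] = b[i];
        }
    }
    return out + 1;
}

int main(void) {
    struct block blocks[] = { {8.6, 1.2}, {9.1, 0.9}, {80.0, 0.02}, {79.6, 2.0} };
    int nphases = merge_phases(blocks, 4);
    for (int i = 0; i < nphases; i++)
        printf("phase %d: OPM %.1f, time %.2f s\n", i, blocks[i].opm, blocks[i].time);
    return 0;
}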

18 Data collection
[Figure: MPI-jack code interposed between the application and the MPI library.]
Use MPI-jack pre and post hooks, for example for:
- Program tracing
- Gear shifting
Gather profile data during execution:
- Define an MPI-jack hook for every MPI operation
- Insert a pseudo MPI call at the end of loops
Information collected:
- Type of call and location (PC)
- Status (gear, time, etc.)
- Statistics (uops and L2 misses for the OPM calculation)
A minimal interposition sketch follows.
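This is not MPI-jack itself (its source is not shown in the talk); the sketch below gets the same pre/post hook effect through the standard PMPI profiling interface, with placeholder hook bodies. Compile it into a library and link it ahead of the MPI library.

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void pre_hook(const char *call)  { (void)call; /* e.g., record PC, shift gear */ }
static void post_hook(const char *call, double elapsed) {
    fprintf(stderr, "%s took %.6f s\n", call, elapsed);   /* e.g., log a trace record */
}

/* Wrapper: the application's MPI_Barrier lands here; PMPI_Barrier is the real call. */
int MPI_Barrier(MPI_Comm comm) {
    pre_hook("MPI_Barrier");
    double t0 = now();
    int rc = PMPI_Barrier(comm);
    post_hook("MPI_Barrier", now() - t0);
    return rc;
}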

19 Example: bt

20 Comparing two schedules
What is the “best” schedule? It depends on the user:
- The user supplies a “better” function: bool better(i, j)
- Several metrics can be used: energy-delay, energy-delay squared [Cameron et al. SC 2004]
One possible better() using energy-delay is sketched below.
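A minimal sketch of one such user-supplied predicate, here using the energy-delay product; the struct layout and field names are assumptions, not the project's actual interface.

#include <stdbool.h>
#include <stdio.h>

/* assumed representation of a measured schedule */
struct schedule { double energy; /* J */ double time; /* s */ };

/* better(i, j): is schedule i preferable to schedule j under energy-delay? */
static bool better(struct schedule i, struct schedule j) {
    return i.energy * i.time < j.energy * j.time;
    /* energy-delay squared would multiply by time * time instead */
}

int main(void) {
    struct schedule full    = { 100.0, 10.0 };   /* illustrative numbers */
    struct schedule reduced = {  85.0, 10.8 };
    printf("reduced better? %s\n", better(reduced, full) ? "yes" : "no");
    return 0;
}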

21 Slope metric
[Figure: schedules i and j on the energy-time plane, with a limit line.]
The project uses slope, the energy-time tradeoff between two schedules:
- Slope = -1 → energy savings equals time delay
- The user defines the limit: limit = 0 → minimize energy; limit = -∞ → minimize time
- If slope < limit, then the new schedule is better
We do not advocate this metric over others.
A sketch of the slope test follows.
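A minimal sketch of the slope test, assuming the slope is taken over relative (percentage) changes in energy and time so that slope = -1 means the energy saved matches the time lost; the normalization is my assumption.

#include <stdbool.h>
#include <stdio.h>

struct schedule { double energy; double time; };

/* Slope of the energy-time tradeoff when moving from schedule "from" to "to". */
static double slope(struct schedule from, struct schedule to) {
    double d_energy = (to.energy - from.energy) / from.energy;
    double d_time   = (to.time   - from.time)   / from.time;
    return d_energy / d_time;
}

static bool better(struct schedule from, struct schedule to, double limit) {
    return slope(from, to) < limit;     /* e.g., limit = -1.5 as on the bt slide */
}

int main(void) {
    struct schedule s00 = { 100.0, 10.0 };   /* illustrative numbers */
    struct schedule s01 = {  90.0, 10.1 };
    printf("slope = %.2f, better? %s\n",
           slope(s00, s01), better(s00, s01, -1.5) ? "yes" : "no");
    return 0;
}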

22 Example: bt
Solutions (limit = -1.5):
  Step | Transition | Slope | Slope < -1.5?
  1    | 00 → 01    | -11.7 | true
  2    | 01 → 02    | -1.78 | true
  3    | 02 → 03    | -1.19 | false
  4    | 02 → 12    | -1.44 | false
02 is the best.

23 Benefit of multiple gears: mg

24 Current work: no. of nodes, gear/phase

25 Load imbalance

26 Node bottleneck
The best course is to keep the load balanced, but load balancing is hard.
Instead, slow down a node if it is not the critical node. How to tell whether a node is not critical?
- Suppose a barrier: all nodes must arrive before any leave, so there is no benefit to arriving early
- Measure block time; assume it is (mostly) the same between iterations
Assumptions:
- Iterative application
- Past predicts future

27 Example
[Figure: two iterations between synchronization points; in iteration k the node runs at performance = 1 and finishes with slack; in iteration k+1 the predicted slack lets it run at performance = (t - slack)/t.]
Reduced performance & power → energy savings.

28 Measuring slack
Blocking operations: Receive, Wait, Barrier
- Measure with MPI-jack
- Individual operations are too frequent (can be hundreds or thousands per second), so aggregate slack over one or more iterations
Computing slack, S:
- Measure times for the computing and blocking phases: T = C1 + B1 + C2 + B2 + … + Cn + Bn
- Compute aggregate slack: S = (B1 + B2 + … + Bn) / T
A small sketch of this computation follows.
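A minimal sketch of the bookkeeping, assuming the pre/post hooks from the data-collection slide feed it per-phase times; the accumulator names are mine.

#include <stdio.h>

/* assumed per-iteration accumulators, fed by the MPI pre/post hooks */
static double compute_time = 0.0;
static double block_time   = 0.0;

static void account_compute(double c) { compute_time += c; }
static void account_block(double b)   { block_time   += b; }

/* S = (B1 + B2 + ... + Bn) / T, where T is the sum of all compute and block phases */
static double aggregate_slack(void) {
    double total = compute_time + block_time;
    return total > 0.0 ? block_time / total : 0.0;
}

int main(void) {
    /* illustrative iteration: three compute phases, three blocking phases */
    account_compute(0.40); account_block(0.05);
    account_compute(0.35); account_block(0.10);
    account_compute(0.30); account_block(0.02);
    printf("aggregate slack S = %.1f%%\n", 100.0 * aggregate_slack());
    return 0;
}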

29 Slack
[Figure: communication slack for Aztec, Sweep3d, and CG.]
Slack varies between nodes and between applications.
Use net slack:
- Each node individually determines its slack
- A reduction finds the minimum slack (see the sketch below)
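The slide does not say how the minimum is distributed; MPI_Allreduce with MPI_MIN is one natural choice and is what the sketch below assumes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* illustrative per-node slack fraction; in practice this comes from the
     * aggregate-slack bookkeeping on the previous slide */
    double my_slack = 0.05 + 0.02 * rank;
    double net_slack;
    MPI_Allreduce(&my_slack, &net_slack, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        printf("net (minimum) slack = %.2f\n", net_slack);
    MPI_Finalize();
    return 0;
}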

30 Shifting
When to reduce performance? When there is enough slack.
When to increase performance? When application performance suffers.
Create high and low limits for slack:
- Need damping
- Limits are learned dynamically (not the same for all applications): the range starts small and increases if necessary
[Figure: slack over time T, with bands for reduce gear / same gear / increase gear.]
A sketch of the shifting decision follows.
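A sketch of the shifting decision. The band values and the widening step are placeholders; the talk only says that the range starts small and is learned per application.

#include <stdio.h>

#define NGEARS 7

static double low  = 0.05;   /* below this slack: performance may be suffering */
static double high = 0.10;   /* above this slack: safe to slow down            */

/* Returns the new gear (0 = fastest, NGEARS-1 = slowest). */
static int shift(int gear, double slack) {
    if (slack > high && gear < NGEARS - 1) {
        return gear + 1;                       /* enough slack: reduce gear      */
    } else if (slack < low && gear > 0) {
        high += 0.02;                          /* damping: widen the band so we  */
        return gear - 1;                       /* don't shift right back down    */
    }
    return gear;                               /* inside the band: hold the gear */
}

int main(void) {
    double slack_trace[] = { 0.15, 0.12, 0.11, 0.03, 0.04, 0.09 };
    int gear = 0;
    for (int i = 0; i < 6; i++) {
        gear = shift(gear, slack_trace[i]);
        printf("iter %d: slack %.2f -> gear %d\n", i, slack_trace[i], gear);
    }
    return 0;
}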

31 Aztec gears

32 Performance Aztec Sweep3d

33 Synthetic benchmark

34 Summary
Contributions:
- Improved energy efficiency of HPC applications
- Found a simple metric for phase boundary location
- Developed a simple, effective linear-time algorithm for determining proper gears
- Leveraged load imbalance
Future work:
- Reduce the sampling interval to a handful of iterations
- Reduce algorithm time with modeling and prediction
- Develop AMPERE, a message passing environment for reducing energy

35 End

36 Shifting test
NAS LU – 1 node
[Figure annotations: 7.7%, 1%, 1%, 4.5%.]

37 Beta
Hsu & Kremer [PLDI ’03]
β relates application slowdown to CPU slowdown: β = (application slowdown) / (CPU slowdown)
- β = 1 → execution time is CPU dependent
- β = 0 → execution time is independent of the CPU
OPM vs. β: correlated; log(OPM) predicts β
A sketch of computing β from two runs follows.
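A minimal sketch of computing β from two timed runs, using the ratio-of-slowdowns form implied by the slide (β = 1 when the program slows down exactly as much as the CPU does, 0 when it does not slow down at all); the numbers are illustrative, not measurements from the talk.

#include <stdio.h>

static double beta(double t_fast, double f_fast, double t_slow, double f_slow) {
    double app_slowdown = t_slow / t_fast - 1.0;   /* how much the program slowed */
    double cpu_slowdown = f_fast / f_slow - 1.0;   /* how much the clock slowed   */
    return app_slowdown / cpu_slowdown;
}

int main(void) {
    /* illustrative: a memory-bound code barely slows at 800 MHz,
     * a CPU-bound code slows nearly as much as the clock */
    printf("memory-bound beta: %.2f\n", beta(100.0, 2000.0, 101.0, 800.0));
    printf("CPU-bound beta:    %.2f\n", beta(100.0, 2000.0, 240.0, 800.0));
    return 0;
}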

38 OPM and β and slack
OPM is not strongly correlated with β in the multi-node case. Why? There is another bottleneck: communication slack, i.e., waiting time in MPI_Receive, MPI_Wait, MPI_Barrier.
- MG: OPM = 70.6, slack = 25%
- LU: OPM = 73.5, slack = 11%
β can be predicted from log(OPM) and slack.

39 Energy savings (synthetic)

40 Normalized – MG
With a communication bottleneck, the E-T tradeoff improves as N increases.

41 SPEC FP

42 SPEC INT

43 Single node – MG
[Figure annotations: +6%, -7%, +12%, -8%.]
Modest memory pressure: gears offer an E-T tradeoff.

44 Dynamically adjust performance
[Figure: net slack vs. time, with gear changes marked.]

45 Adjust performance
[Figure: net slack vs. time.]

46 Dampening
[Figure: net slack vs. time.]

47 Power consumption Average for NAS suite

48 Related work: Energy conservation
Goal: conserve energy; performance degradation is acceptable. Usually in mobile environments (finite energy source, i.e., a battery).
- Primary goal: extend battery life
- Secondary goal: re-allocate energy to increase the “value” of energy use
- Tertiary goal: increase energy efficiency (more tasks per unit energy)
Example: feedback-driven energy conservation, which controls average power usage, P_ave = (E0 − Ef)/T.
[Figure: power and frequency over time, with energy falling from E0 to Ef over interval T.]

49 Related work: Realtime DVS
Goal: reduce energy consumption with no performance degradation.
Mechanism: eliminate slack time in the system.
Savings:
- E_idle, with frequency scaling
- An additional E_task − E_task', with voltage scaling
[Figure: power vs. time before and after DVS; the task stretches to its deadline, eliminating E_idle and shrinking E_task to E_task'.]

50 Related work
Previous studies in power-aware HPC:
- Cameron et al., SC 2004 & IPDPS 2005; Freeh et al., IPDPS 2005
Energy-aware server clusters:
- Many projects; e.g., Heath, PPoPP 2005
Low-power supercomputer design:
- Green Destiny (Warren et al., 2002)
- Orion Multisystems

51 Related work: Fixed installations
Goal: reduce cost (in heat generation or dollars); the goal is not to conserve a battery.
Mechanisms:
- Scaling: fine-grain (DVS) or coarse-grain (power down)
- Load balancing

52 Memory pressure
Why the different tradeoffs?
- CG is memory bound: the CPU is not on the critical path
- EP is CPU bound: the CPU is on the critical path
Operations per miss (OPM):
- A metric of memory pressure; indicates the criticality of the CPU
- Computed from performance counters: count micro-operations and cache misses (see the sketch below)
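A sketch of measuring OPM with hardware counters. This is not the project's actual instrumentation; it assumes the PAPI library, uses retired instructions (PAPI_TOT_INS) as a stand-in for micro-operations, and PAPI_L2_TCM may not be available on every CPU. Compile with -lpapi.

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

static void work(void) {                       /* stand-in for a program phase */
    static double a[1 << 20];
    for (int i = 0; i < (1 << 20); i++) a[i] = a[i] * 1.5 + 2.0;
}

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);
    if (PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK) exit(1);
    if (PAPI_add_event(evset, PAPI_L2_TCM) != PAPI_OK) exit(1);

    PAPI_start(evset);
    work();
    PAPI_stop(evset, counts);

    double opm = counts[1] ? (double)counts[0] / (double)counts[1] : 0.0;
    printf("instructions = %lld, L2 misses = %lld, OPM = %.1f\n",
           counts[0], counts[1], opm);
    return 0;
}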

53 Single node – MG

54 Single node – LU

