Power-Aware Parallel Job Scheduling


1 Power-Aware Parallel Job Scheduling
Maja Etinski, Julita Corbalan, Jesus Labarta, Mateo Valero

2 Power Consumption of Supercomputing Systems
Striving for performance has led to enormous power dissipation of HPC centers.
(Chart: power of the top systems from the Top500 list, in kilowatts.)

3 Power reduction approaches in HPC
Application level:
- Runtime systems:
  - exploit certain application characteristics (load imbalance, communication-intensive regions)
  - based on very fine-grained application of DVFS
System level:
- Turning off idle nodes:
  - resource allocation such that there are more completely idle nodes
  - determining the number of online nodes
- Operating system power management via DVFS:
  - Linux governors: per core, unaware of the rest of the system
  - DVFS taking into account the entire system workload?

4 Parallel Job Scheduling
The job scheduler has a global view of the whole system.
(Diagram: a job with its requirements is submitted to the wait queue of queued jobs; the HPC job scheduler performs job scheduling and hands jobs to the resource manager.)

5 DVFS and Job Scheduling
(Diagram: the same scheduling path as before, extended with a power-aware component that assigns a CPU frequency to each job based on goals/constraints.)

6 Outline
Parallel job scheduling:
- short introduction to parallel job scheduling
- the EASY backfilling policy
Power and run time modelling:
- first we need to understand how frequency scaling affects CPU power dissipation and run time
Energy-saving parallel job scheduling policies:
- Utilization-driven power-aware scheduling [Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Utilization driven power-aware parallel job scheduling. Energy Aware High Performance Computing Conference, Hamburg, September 2010]
- BSLD-driven power-aware scheduling [Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. BSLD threshold driven power management policy for HPC centers. IEEE International Parallel and Distributed Processing Symposium, HPPAC Workshop, Atlanta, GA, April 2010]
Power budgeting:
- how to maximize job performance under a given power budget? [Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Optimizing job performance under a given power constraint in HPC centers. IEEE International Green Computing Conference, Chicago, IL, August 2010]

7 About Parallel Job Scheduling
Parallel job scheduling can be seen as finding a free rectangle in the CPUs × time plane for the job being scheduled:
- FCFS policy used in the beginning
- backfilling policies introduced to improve system utilization
Job performance metrics:
- Response time: WaitTime(J) + RunTime(J)
- Slowdown: (WaitTime(J) + RunTime(J)) / RunTime(J)
- Bounded slowdown: max((WaitTime(J) + RunTime(J)) / max(Th, RunTime(J)), 1)
(Diagram: jobs placed as rectangles over CPUs and time; a job's performance is split into wait time and run time.)
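The three metrics translate directly into code; a minimal Python sketch (the 10-second bound Th is only an illustrative value, not taken from the slides):

def response_time(wait_time, run_time):
    # Response time: time from submission to completion.
    return wait_time + run_time

def slowdown(wait_time, run_time):
    # Slowdown: response time relative to the job's own run time.
    return (wait_time + run_time) / run_time

def bounded_slowdown(wait_time, run_time, th=10.0):
    # Bounded slowdown: the bound Th keeps very short jobs from dominating
    # the metric; the result is never reported below 1.
    return max((wait_time + run_time) / max(th, run_time), 1.0)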

8 The EASY backfilling policy
Jobs are executed in FCFS order except when the first job in the wait queue cannot start.
Users have to submit an estimate of the job's run time – the requested time.
When the first job in the WQ cannot start, a reservation is made for it based on the requested times of the running jobs.
A job is executed before previously arrived ones only if it does not delay the first job in the wait queue.
(Diagram: MakeJobReservation(Job5) reserves CPUs for Job 5 when it cannot start on arrival; BackfillJob(Job6) starts Job 6 earlier because it does not delay Job 5's reservation.)
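A simplified Python sketch of the EASY logic described above; the Job fields and helper structure are assumptions made for the example, and node-level placement details are omitted:

from dataclasses import dataclass
from typing import List

@dataclass
class Job:
    id: int
    cpus: int
    requested_time: float          # user-supplied run time estimate
    end_time: float = 0.0          # requested end time, set when the job starts

def easy_schedule(now: float, wait_queue: List[Job], running: List[Job],
                  total_cpus: int) -> List[Job]:
    """One EASY-backfilling pass; starts jobs at time `now` and returns them."""
    free = total_cpus - sum(j.cpus for j in running)
    started = []
    # 1. Start jobs in strict FCFS order while they fit.
    while wait_queue and wait_queue[0].cpus <= free:
        job = wait_queue.pop(0)
        job.end_time = now + job.requested_time
        running.append(job); started.append(job); free -= job.cpus
    if not wait_queue:
        return started
    # 2. Reservation for the blocked head job, based on requested run times.
    head = wait_queue[0]
    avail, shadow, spare = free, now, free
    for j in sorted(running, key=lambda j: j.end_time):
        avail += j.cpus
        if avail >= head.cpus:
            shadow = j.end_time        # earliest guaranteed start of the head job
            spare = avail - head.cpus  # processors left over at that moment
            break
    # 3. Backfill later jobs only if they cannot delay the reservation.
    for job in list(wait_queue[1:]):
        if job.cpus > free:
            continue
        if now + job.requested_time <= shadow:   # finishes before the reservation starts
            pass
        elif job.cpus <= spare:                  # or fits in the CPUs left at the shadow time
            spare -= job.cpus
        else:
            continue
        wait_queue.remove(job)
        job.end_time = now + job.requested_time
        running.append(job); started.append(job); free -= job.cpus
    return started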

9 High-Level DVFS modelling

10 Power Model
CPU power is one of the main system power components.
It consists of dynamic and static power:
- Pcpu = Pdynamic + Pstatic
- Pdynamic = A·C·f·V²
- Pstatic = α·V
The fraction of static power in total CPU power is a model parameter:
- Pstatic(Vtop) = X · (Pstatic(Vtop) + Pdynamic(ftop, Vtop))   (X = 25% in our experiments)
The average activity factor is assumed to be the same for all jobs (2.5 times higher than the idle activity).
Idle processors: two scenarios – they do not consume power, or they consume power at the lowest frequency.
(Table: DVFS gear set – frequency/voltage pairs.)
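A sketch of the power model in Python; the frequency/voltage pairs of the gear set and the activity values are illustrative assumptions, since the slide's gear-set table did not survive the transcript:

# Illustrative DVFS gear set (frequency in GHz -> core voltage in V); assumed values.
GEARS = {1.4: 0.95, 1.6: 1.00, 1.8: 1.05, 2.0: 1.10, 2.3: 1.20}
F_TOP = 2.3
V_TOP = GEARS[F_TOP]
X = 0.25                    # fraction of static power in total CPU power at the top gear
A_RUN = 0.25                # average activity factor of running jobs (assumed value)
A_IDLE = A_RUN / 2.5        # running activity is 2.5x the idle activity (from the slide)

# Calibrate alpha so that Pstatic(Vtop) = X * (Pstatic(Vtop) + Pdynamic(ftop, Vtop))
# for a running processor at the top gear (the capacitance C is folded into A).
P_DYN_TOP = A_RUN * F_TOP * V_TOP ** 2
ALPHA = X * P_DYN_TOP / ((1.0 - X) * V_TOP)

def cpu_power(f, activity=A_RUN):
    """Pcpu = Pdynamic + Pstatic with Pdynamic = A*f*V^2 and Pstatic = alpha*V."""
    v = GEARS[f]
    return activity * f * v ** 2 + ALPHA * v

p_busy = cpu_power(2.3)                    # running processor at the nominal gear
p_idle = cpu_power(1.4, activity=A_IDLE)   # idle scenario: lowest gear, idle activity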

11 Time Model
Execution time dependence on frequency is captured by the following model:
- F(f, β) = T(f) / T(ftop) = β·(ftop/f − 1) + 1
[Hsu, Feng, SC'05: A Power-Aware Run-Time System for High-Performance Computing]
β is assumed to follow a normal distribution that depends on job size:

Number of CPUs        Distribution
Less or equal to 4    N(0.5, 0.01)
Between 4 and 32      N(0.4, 0.01)
More than 32          N(0.3, 0.064)

The global application β depends on the communication/computation ratio.
Two β scenarios:
- β is known in advance (at the moment of scheduling)
- β is not known in advance (at the moment of scheduling the worst case, β = 1, is assumed)
(Chart: normalized run time versus frequency for β = 0.3, 0.5, 0.7.)
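The time model as a Python helper; the 2.3 GHz nominal frequency used as a default is an assumption for illustration:

def time_factor(f, beta, f_top=2.3):
    """F(f, beta) = T(f) / T(ftop) = beta * (ftop / f - 1) + 1."""
    return beta * (f_top / f - 1.0) + 1.0

def run_time_at(f, run_time_at_top, beta, f_top=2.3):
    # beta = 1: fully CPU-bound, run time scales inversely with frequency;
    # beta = 0: run time is insensitive to frequency (e.g. communication dominated).
    return run_time_at_top * time_factor(f, beta, f_top)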

12 Energy Saving Parallel Job Scheduling Policies

13 Utilization-Driven Policy
Frequency is assigned once (at the job's start time) for the entire job execution, based on system utilization.
Utilization is computed for each interval of length T.
An additional control over system load, WQthreshold:
- if there are more than WQthreshold jobs in the wait queue, no frequency scaling is applied
- otherwise, a job started during interval k runs at a frequency Fk determined by the utilization Uk-1 of the previous interval:
  Fk = flower   if Uk-1 < Ulower
  Fk = fupper   if Ulower ≤ Uk-1 < Uupper
  Fk = ftop     otherwise
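A Python sketch of the frequency selection read off the step function above, using the threshold and frequency values given on the next slide; the 2.3 GHz nominal frequency is an assumption:

def select_frequency(u_prev, wq_len,
                     u_lower=0.50, u_upper=0.80,
                     f_lower=1.4, f_upper=2.0, f_top=2.3,
                     wq_threshold=4):
    """Frequency for jobs started in the current interval, from the previous interval's utilization."""
    if wq_len > wq_threshold:     # too many queued jobs: no frequency scaling
        return f_top
    if u_prev < u_lower:          # lightly loaded: most aggressive reduction
        return f_lower
    if u_prev < u_upper:          # moderately loaded: milder reduction
        return f_upper
    return f_top                  # highly utilized: nominal frequency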

14 Evaluation: the Alvio simulator
The Alvio C++ event-driven parallel job scheduling simulator has been extended.
Policy parameters:
- utilization thresholds: Ulower = 50%, Uupper = 80%
- reduced frequencies: flower = 1.4 GHz, fupper = 2.0 GHz
- utilization computation interval: T = 10 min
- wait queue length threshold: WQthreshold = 0, 4, 16, no limit
Metric of job performance – Bounded Slowdown (BSLD), evaluated at the frequency f the job runs at (see the sketch below).
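The slide's formula for BSLD at a reduced frequency did not survive the transcript; combining the bounded-slowdown definition (slide 7) with the time model (slide 11) presumably gives something like the following sketch, where the bound Th and the nominal frequency default are illustrative assumptions:

def bsld_at(wait_time, run_time_at_top, f, beta, th=10.0, f_top=2.3):
    # Run time is inflated by the time-model factor F(f, beta) = beta*(ftop/f - 1) + 1.
    scaled_run = run_time_at_top * (beta * (f_top / f - 1.0) + 1.0)
    return max((wait_time + scaled_run) / max(th, run_time_at_top), 1.0)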

15 Workloads Five workloads from production use have been simulated:
- Cornell Theory Center: large jobs with relatively low level of parallelism
- San Diego Supercomputing Center: fewer sequential jobs than CTC, similar run time distribution
- Lawrence Livermore National Lab: small to medium size jobs
- Lawrence Livermore National Lab: large parallel jobs
- San Diego Supercomputing Center: no sequential jobs
(Traces taken from the Parallel Workloads Archive.)

16 Results: Normalized CPU Energy
- short wait queues
- very similar results for both energy scenarios
- savings of up to 12% for workloads that are not highly loaded

17 Results: Normalized Performance
- high penalty in the least conservative case for the highly loaded workload
- the WQ threshold has almost no impact
- an increase in the number of backfilled jobs

18 Average frequency - SDSCBlue

19 BSLD-Driven Policy
Frequency is assigned based on the job's predicted performance.
Lower frequency -> longer execution time -> worse job performance metric.
BSLDth controls the allowable performance penalty ("target BSLD").
In order to be run at a lower frequency f, a job has to satisfy the BSLD condition at frequency f:
- if the job's predicted BSLD at frequency f is lower than BSLDth, then it satisfies the BSLD condition at frequency f
Predicted BSLD: the bounded slowdown computed with the run time scaled by the time model F(f, β).
(Flowchart: for a job Ji, if WQsize > WQthreshold run Ji at Ftop; otherwise start at f = Flowest, find an allocation Alloc, and move to the next higher frequency until satisfiesBSLD(Alloc, Ji, f) holds or f = Ftop, then run Ji at frequency f.)
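A Python sketch of the flowchart's frequency selection; the gear set, the BSLD threshold value and the predicted-BSLD form are assumptions, and the allocation search is omitted:

GEAR_SET = [1.4, 1.6, 1.8, 2.0, 2.3]   # illustrative gears; the last one is Ftop

def choose_frequency(wait_time, run_time, beta, wq_size,
                     bsld_th=2.0, wq_threshold=4, th=10.0):
    """Pick the lowest gear whose predicted BSLD stays below BSLDth."""
    f_top = GEAR_SET[-1]
    if wq_size > wq_threshold:           # long wait queue: run at the nominal frequency
        return f_top
    for f in GEAR_SET:                   # from Flowest upwards
        scaled_run = run_time * (beta * (f_top / f - 1.0) + 1.0)
        predicted_bsld = max((wait_time + scaled_run) / max(th, run_time), 1.0)
        if predicted_bsld < bsld_th or f == f_top:
            return f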

20 Results: Normalized CPU Energy
- normalized energies in the two energy scenarios behave in the same way
- average savings in the most aggressive case: 5% – 23%
- the difference in savings per workload between the most conservative and the most aggressive threshold combinations goes from 5% (SDSC) to 15% (LLNLThunder)
- WQthreshold controls DVFS aggressiveness much better than BSLDthreshold
- BSLDthreshold has a stronger impact when WQthreshold is higher

21 Average BSLD
- strong impact on performance in the most aggressive case
- the impact of WQthreshold is higher than that of BSLDthreshold
- BSLDthreshold has a stronger impact when WQthreshold is higher
- the decrease in performance is proportional to the energy savings
(Chart: average BSLD per workload; values shown include 24.91, 5.15, 4.66, 1.08 and 1.)

22 Reduced jobs (out of 5000)
- performance depends on the number of jobs run at reduced frequency
- it depends on the frequencies used as well
- it was observed that the performance of jobs run at the nominal frequency was affected as well
- when load is very high (SDSC) no DVFS is applied (in order to apply it, the thresholds would have to be set to higher values)

23 Wait time
Main problem observed: high impact on wait time.
(Chart: zoom of the SDSCBlue wait time behavior.)

24 Power-Budgeting Policy

25 PB-Guided Policy: How DVFS can improve overall job performance
(Diagram: the same jobs J1–J5 scheduled without DVFS at ftop and with DVFS at flower under a power budget; with DVFS more jobs fit under the budget at the same time.)
There is a penalty in run time due to frequency scaling, but more jobs can run simultaneously.

26 Power Budgeting: PB-Guided Policy
Frequency assignment is guided by predicted job performance and the current power draw.
Prediction of BSLD when selecting a frequency – the BSLD condition:
- a job satisfies the BSLD condition at a reduced frequency f if its predicted BSLD at frequency f is lower than the current value of the BSLD threshold
The policy is power conservative:
- a job will be scheduled at the lowest frequency at which both the BSLD condition and the power limit are satisfied
The closer the current power draw is to the power budget, the higher the BSLD threshold.
The higher the BSLD threshold, the lower the frequency that will be selected.
(Chart: BSLD threshold as a function of Pcurrent, rising between Plower and Pupper of the power budget.)
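The chart only shows the BSLD threshold growing between Plower and Pupper of the budget; a linear ramp is one simple realization and is purely an assumption of this sketch (the BSLDlower/BSLDupper defaults mirror the CTC values on slide 28):

def bsld_threshold(p_current, power_budget,
                   p_lower=0.6, p_upper=0.9,
                   bsld_lower=4.66, bsld_upper=9.32):
    """BSLD threshold as a function of the current power draw (linear ramp assumed)."""
    frac = p_current / power_budget
    if frac <= p_lower:
        return bsld_lower            # far below the budget: strict performance target
    if frac >= p_upper:
        return bsld_upper            # close to the budget: tolerate more slowdown
    ramp = (frac - p_lower) / (p_upper - p_lower)
    return bsld_lower + ramp * (bsld_upper - bsld_lower)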

27 Power Budgeting: PB-Guided Policy
A job can be scheduled with one of two functions:

MakeJobReservation(J):
  scheduled <- false
  shiftInTime <- 0
  nextFinishJob <- next(OrderedRunningQueue)
  while (!scheduled) {
    f <- FlowestReduced
    while (f < Fnominal) {
      Alloc = findAllocation(J, currentTime + shiftInTime, f)
      if (satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f)) {
        schedule(J, Alloc)
        scheduled <- true
        break
      }
      f <- nextHigherFrequency
    }
    if (f == Fnominal) {
      Alloc = findAllocation(J, currentTime + shiftInTime, Fnominal)
      if (satisfiesPowerLimit(Alloc, J, Fnominal)) {
        schedule(J, Alloc)
        break
      }
    }
    shiftInTime <- FinishTime(nextFinishJob) - currentTime
    nextFinishJob <- next(OrderedRunningQueue)
  }

BackfillJob(J):
  f <- Flowest
  while (f < Fnominal) {
    Alloc = TryToFindBackfilledAllocation(J, f)
    if (correct(Alloc) and satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f)) {
      schedule(J, Alloc)
      break
    }
    f <- nextHigherFrequency
  }
  if (f == Fnominal) {
    Alloc = TryToFindBackfilledAllocation(J, Fnominal)
    if (correct(Alloc) and satisfiesPowerLimit(Alloc, J, Fnominal)) {
      schedule(J, Alloc)
    }
  }

BSLD condition: the lowest frequency that satisfies it will be selected.
Power limit: the power budget must not be violated during the entire job execution.

28 Evaluation
Policy parameters:
- power budget thresholds: Plower = 0.6, Pupper = 0.9
- BSLD threshold values used: BSLDlower = avg(BSLD) without power budgeting, BSLDupper = 2 · BSLDlower
- power budget set to 80% of the total CPU power consumed by the whole system when running at Fnominal
Four workloads from production use have been simulated (20 – 25 K jobs each):

Workload – #CPUs     Avg BSLD    Utilization    Over PB
CTC – 430            4.66        70%            72%
SDSC – 128           24.91       85%            95%
SDSCBlue – 1152      5.15        69%            74%
LLNLThunder          1           80%            89%

29 Baseline Power Budgeting Policy
Power limited without DVFS:
- no job will start if it would violate the budget, even though there are available processors
- this case is equivalent to EASY scheduling on a smaller machine
(Diagram: EASY scheduling under the power limit; Job 6 cannot start because of the power budget.)
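A sketch of the baseline admission check: without DVFS every job runs at Fnominal, so a job starts only if the resulting power stays within the budget; the per-CPU power values are assumptions:

def can_start_baseline(job_cpus, busy_cpus, total_cpus,
                       power_budget, p_busy=95.0, p_idle=20.0):
    """EASY with a power limit and no DVFS: admit a job only if the budget still holds."""
    if job_cpus > total_cpus - busy_cpus:
        return False                                  # not enough free processors
    new_busy = busy_cpus + job_cpus
    power = new_busy * p_busy + (total_cpus - new_busy) * p_idle
    return power <= power_budget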

30 Results: Performance
- oracle case: it is assumed that β values are known at scheduling time
- the PB-guided policy shows better performance for all workloads!
- the average wait time decreases with DVFS under a power constraint

31 Results: Normalized CPU Energy (idle=0)
Oracle case: it is assumed that β values are known at scheduling time.

32 Utilization Over Time

33 Power Budget Consumed

34 Comparison of Unknown and Known β
Avg. BSLD, Avg. WT and Avg. Energy values are normalized with respect to the corresponding baseline values (EASY backfilling with a power limit and without DVFS).

35 Conclusions
- The energy-performance trade-off must be made carefully: DVFS does not affect only job run time, it can also significantly increase job wait time and further degrade job performance.
- The performance-energy trade-off needs to be done at the job scheduling level, as it affects jobs in the wait queue and only the scheduler can estimate the potential negative impact on queued jobs.
- Applying DVFS to highly loaded workloads (SDSC) leads to a very high performance penalty.
- Parallel job scheduling policies can be designed to maximize job performance under a given power constraint.
- It has been shown that DVFS can improve performance in power-constrained HPC centers (using lower CPU frequencies allows more jobs to run simultaneously).
- It is not necessary to know β values in advance; moreover, assuming the worst case at scheduling time can give better performance than when they are known in advance.

36 Thank you for your attention!
HPPAC 2010, Atlanta

