Ian C. Smith Towards a greener Condor pool: adapting Condor for use with energy-efficient PCs.


1 Ian C. Smith Towards a greener Condor pool: adapting Condor for use with energy-efficient PCs

2 Overview  Quick description of the University of Liverpool Condor Pool  Power saving at Liverpool  A home-grown approach to dealing with power-saving PCs  Power management using Condor 7.4.X  Implementing Condor power management  Results  Future directions

3 University of Liverpool Condor Pool  Contains around 300 machines running the University’s Managed Windows (XP, soon Windows 7) Service.  Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine.  Single combined submit host / central manager running on Sun V445 SMP server.  Currently running Condor 7.0.2 on execute hosts (moving to 7.2.x soon).  Policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours  Jobs are killed rather than suspended

4 Power saving at Liverpool  We have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations.  Original power saving policy was to “power-off” machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity  Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.  Makes extensive use of the PowerMAN system from Data Synergy comprising:  service which forces machines into a low-power state and reports machine activity to Management Reporting Platform  Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser
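
As a rough consistency check on the figures above, the sketch below derives the average power draw per idle machine and the electricity price that the quoted numbers imply. These derived values are not stated in the slides; the function names are hypothetical and the calculation assumes the weekly saving applies for all 52 weeks of the year.

```python
def implied_power_watts(hours_saved_per_week, mwh_saved_per_week):
    """Average power draw implied by an energy saving over a number of idle hours."""
    watt_hours = mwh_saved_per_week * 1_000_000  # 1 MWh = 1 000 000 Wh
    return watt_hours / hours_saved_per_week


def implied_pence_per_kwh(mwh_saved_per_week, annual_saving_gbp, weeks_per_year=52):
    """Electricity price implied by the annual saving (assumes saving in every week)."""
    kwh_per_year = mwh_saved_per_week * 1_000 * weeks_per_year
    return annual_saving_gbp * 100 / kwh_per_year


# Lower and upper ends of the ranges quoted on the slide.
for hours, mwh in ((200_000, 20), (250_000, 25)):
    print(hours, mwh,
          round(implied_power_watts(hours, mwh)),
          round(implied_pence_per_kwh(mwh, 125_000), 1))
```

Both ends of the range correspond to about 100 W per idle machine, and to an implied price of roughly 10 to 12 pence per kWh under the 52-week assumption.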

5 Typical monthly Condor activity

6 A home-grown approach to power management  Two main problems to deal with:  how to ensure Condor jobs are not evicted by hibernating PCs  how to wake up dormant PCs to run Condor jobs on-demand  PowerMAN service prevents job eviction:  can provide PowerMAN with a list of “protected programs”; the machine remains active while any of them is running  include condor_starter process as a protected program (only present while a Condor job is running).  Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:  NICs must remain powered up during hibernation  NICs must be capable of waking machines on receipt of a “magic packet”  network must be able to route “magic packets” – not a problem for us but YMMV
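
For readers unfamiliar with Wake-on-LAN, here is a minimal sketch of what a "magic packet" is: six 0xFF bytes followed by the target MAC address repeated sixteen times, normally sent as a UDP broadcast. The function names, the default broadcast address and the port (9) are illustrative assumptions, not details taken from the Liverpool setup.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Synchronisation stream (6 x 0xFF) followed by 16 copies of the MAC.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Whether the broadcast reaches a machine on another subnet depends on the network's routing of such packets, which is the caveat the slide raises.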

7 Adapting Condor for use with power-saving PCs  cron job runs on the submit host and periodically examines the state of the queue ( condor_status -schedd ) and the pool ( condor_status )  if more idle jobs in queue than Unclaimed machines then need to wake up hibernating machines  find out the number of powered-up machines in each “teaching centre” (classroom)  estimate the number of hibernating machines in each teaching centre from total number of machines in each  sort centres from highest number of available machines to lowest  wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up)  MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
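
The selection logic on this slide can be sketched as follows. This is an illustration under stated assumptions rather than the actual cron script: the function name and data layout are hypothetical, and "available machines" is interpreted here as the estimated number of hibernating machines in a centre (total minus powered-up).

```python
def centres_to_wake(idle_jobs, unclaimed_machines, centres):
    """Pick teaching centres to wake, largest estimated hibernating count first.

    centres maps a centre name to (total_machines, powered_up_machines).
    """
    shortfall = idle_jobs - unclaimed_machines
    if shortfall <= 0:
        return []  # enough Unclaimed machines already; nothing to wake

    # Estimate hibernating machines per centre (assumption: total - powered up).
    hibernating = {name: max(total - up, 0) for name, (total, up) in centres.items()}
    ranked = sorted(hibernating, key=hibernating.get, reverse=True)

    chosen, woken = [], 0
    for name in ranked:
        if woken >= shortfall or hibernating[name] == 0:
            break  # demand met, or only fully awake centres remain
        chosen.append(name)
        woken += hibernating[name]
    return chosen


example = {"A": (30, 10), "B": (50, 5), "C": (20, 20)}
print(centres_to_wake(idle_jobs=60, unclaimed_machines=10, centres=example))
```

Because whole centres are woken at a time, this kind of scheme can wake more machines than the shortfall strictly requires.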

8 Problems with the home-grown approach  Assumes that any job can run on any machine:  users cannot choose particular teaching centres or machines in their job Requirements  ideally, pool needs to be homogeneous  errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate again)  cron includes a “sanity check” for this  Can only estimate number of hibernating machines in each centre  Same machines get woken up first

9 Power management in Condor 7.4.X  Condor daemons can now place an execute host in a low-power state according to a given policy  An execute host signals to the Condor central manager that it is about to enter a low-power state  Central manager records persistent offline ClassAds for hibernating machines  Negotiator can perform matchmaking with offline ClassAds  Matches are passed to condor_rooster  condor_rooster pipes information to condor_power which wakes up machines using WoL

10 Implementing Condor power management  Still use PowerMAN to power-down inactive PCs rather than using Condor  Need a way of advertising available offline machines to the condor_collector  If we know which machines are currently active (A) and which machines make up the pool in total (P), then the offline machines form the subset O = P – A  cron periodically advertises the offline machines and updates the timestamps (ClockMin / ClockDay)  Finding P (the total set of machines which are out there) turns out to be a very difficult problem
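
The relation O = P – A is a plain set difference. A minimal sketch, with hypothetical host names and a hypothetical function name:

```python
def offline_machines(pool, active):
    """Return the machines in the full pool P that are not currently active (O = P - A)."""
    return set(pool) - set(active)


pool = {"pc001", "pc002", "pc003", "pc004"}   # P: every machine believed to be in the pool
active = {"pc002", "pc004"}                   # A: machines currently reporting to the collector
print(sorted(offline_machines(pool, active)))
```

The computation itself is trivial; as the slide notes, the hard part is obtaining a trustworthy P.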

11 How do we determine which machines are available to Condor?  Try waking them up!  Wake up all machines in each teaching centre once a week using WoL  After wakeup call, wait a few minutes and test each machine in turn with: condor_status -direct  Sanity check similar to UNIX ping  Record which machines respond and publish ClassAds for them
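
The weekly probe described above might look something like the following sketch. The helper names and the timeout are hypothetical, and treating "exit status 0 with non-empty output" as a response is an assumption about how to interpret condor_status, not something the slides specify.

```python
import subprocess


def responds_to_condor_status(host, timeout=30.0):
    """Query one machine's startd directly; True if it appears to answer."""
    try:
        result = subprocess.run(
            ["condor_status", "-direct", host],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # condor_status missing, or the machine never answered
    # Assumption: a live machine yields exit status 0 and some output.
    return result.returncode == 0 and bool(result.stdout.strip())


def find_responding(hosts, probe=responds_to_condor_status):
    """Return the hosts, in order, for which the probe succeeds."""
    return [host for host in hosts if probe(host)]
```

Passing the probe as a parameter makes the bookkeeping testable without a live pool, for example `find_responding(hosts, probe=lambda h: h in known_good)`.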

12 Unforeseen problems  Not all woken-up machines begin to run jobs  number of wakeups is limited by our “roll-your-own” version of condor_power  condor_rooster originally attempted to wake up all offline machines which matched job requirements  Included another limit in our condor_power script (number of wakeups must be < number of idle jobs)  Condor 7.4.3 should fix this, 7.5.3 adds ROOSTER_MAX_UNHIBERNATE configuration option  Wanted to wake up machines in random order so same machines not used repeatedly  Found that condor_negotiator ignored Rank values  Used condor_power script to implement this (“shuffles the deck”)  Should be fixed in 7.5.3 using ROOSTER_UNHIBERNATE_RANK config option
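
The two workarounds on this slide (capping the number of wakeups and "shuffling the deck") can be sketched together. This is an illustration only, not the actual replacement condor_power script; the function name is hypothetical, and the cap follows the slide's strict inequality literally.

```python
import random


def choose_wakeups(matched_offline, idle_jobs, rng=None):
    """Pick which matched offline machines to wake, in random order and capped.

    The slide states that the number of wakeups must be < the number of idle
    jobs, so at most idle_jobs - 1 machines are returned.
    """
    rng = rng or random.Random()
    candidates = list(matched_offline)
    rng.shuffle(candidates)           # avoid always waking the same machines first
    limit = max(idle_jobs - 1, 0)     # strict "<" as written on the slide
    return candidates[:limit]


print(choose_wakeups(["pc001", "pc002", "pc003", "pc004"], idle_jobs=3))
```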

13 Unforeseen problems / cont’d  Condor continued to wake up machines after jobs were removed (or completed)  Use Unhibernate = CurrentTime - MachineLastMatchTime < 300 not Unhibernate =!= Undefined  Difficult to distinguish Unclaimed offline machines from online ones in condor_status:  Also difficult to distinguish in Condor View graphs  to see all offline machines  $ condor_status -constraint Offline==True  to see all powered-up machines  $ condor_status -constraint Offline=!=True
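
To make the effect of the revised Unhibernate expression concrete, the sketch below mirrors it in Python: a machine is only a wake-up candidate if it was matched within the last 300 seconds. The function name is hypothetical, and returning False for a machine that has never been matched is a simplification of ClassAd semantics, where the expression would evaluate to Undefined.

```python
def should_unhibernate(current_time, machine_last_match_time, window=300):
    """Python analogue of: CurrentTime - MachineLastMatchTime < 300 (times in seconds)."""
    if machine_last_match_time is None:
        return False  # simplification: ClassAds would yield Undefined here
    return current_time - machine_last_match_time < window


now = 1_000_000
print(should_unhibernate(now, now - 60))    # matched one minute ago
print(should_unhibernate(now, now - 3600))  # matched an hour ago; job likely gone
```

Under this rule a stale match, left over from a job that has since been removed or has completed, stops triggering wakeups once it is more than five minutes old.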

14 Results – wakeup test

15 Future Directions  Condor power management will allow us to expand the pool to include even low-spec machines  If machines are not needed or are unsuitable they need not be woken up  Rank can be used so that newer (more energy-efficient) machines are used first  We would like a more accurate way of determining which machines are available. One possible method:  Record the amount of time since each machine last appeared in the pool and/or ran a job  Confidence in waking a PC can be described by a monotonically decreasing function of this  May still need to wake machines for testing occasionally  Encourage users to incorporate their own checkpointing code to reduce “badput” and energy wastage (see Liverpool Condor website for details).
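
One possible shape for the "monotonically decreasing function" mentioned above is exponential decay. The slides do not name a particular function, so both the choice of decay curve and the one-week half-life below are assumptions for illustration only.

```python
ONE_WEEK = 7 * 24 * 3600  # seconds; hypothetical half-life


def wake_confidence(seconds_since_seen, half_life=ONE_WEEK):
    """Confidence (0..1] that a PC is still available, halving every half_life seconds."""
    if seconds_since_seen < 0:
        raise ValueError("elapsed time cannot be negative")
    return 0.5 ** (seconds_since_seen / half_life)


for days in (0, 7, 14, 28):
    print(days, wake_confidence(days * 24 * 3600))
```

Machines whose confidence falls below some threshold could then be the ones selected for an occasional test wake-up.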

16 Further Information http://www.liv.ac.uk/e-science/condor i.c.smith@liverpool.ac.uk

