Ian C. Smith The University of Liverpool Condor Pool.


1 Ian C. Smith The University of Liverpool Condor Pool

2 University of Liverpool Condor Pool
- contains around 300 machines running the University’s Managed Windows (XP) Service
- most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per machine
- software updates via a weekly re-imaging process
- single combined submit host / central manager running on a Sun Solaris V440 SMP server
- restricted access to submit host for registered Condor users
- currently running Condor 7.0.2 (moving to 7.4.2 soon, hopefully)
- policy is to run jobs during office hours only after at least 5 minutes of inactivity and with a low load average, and at any time outside office hours
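The job-start policy on this slide can be sketched as a small predicate. This is an illustrative Python sketch only, not the pool's actual Condor configuration: the office hours (09:00 to 17:00 on weekdays) and the load-average threshold are assumptions, since the slide gives only the 5-minute inactivity figure.

```python
from datetime import datetime, time

# Assumed values; the slide states only the 5-minute inactivity threshold.
OFFICE_START = time(9, 0)    # hypothetical start of office hours
OFFICE_END = time(17, 0)     # hypothetical end of office hours
MIN_IDLE_SECONDS = 5 * 60    # from the slide: at least 5 minutes of inactivity
MAX_LOAD_AVG = 0.3           # hypothetical reading of "low load average"


def in_office_hours(now: datetime) -> bool:
    """True on weekdays between OFFICE_START and OFFICE_END."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def may_start_job(now: datetime, idle_seconds: float, load_avg: float) -> bool:
    """Unrestricted outside office hours; otherwise require an idle
    machine (keyboard/mouse) and a low load average."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= MIN_IDLE_SECONDS and load_avg <= MAX_LOAD_AVG
```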

3 Condor service caveats
- only suitable for DOS-based applications running in batch mode
- no communication between processes possible (“pleasantly parallel” applications only)
- statically linked executables work best (although can cope with DLLs)
- all files needed by the application must be present on local disk (cannot access network drives)
- no built-in check-pointing or standard output/error streaming
- shorter jobs are more likely to run to completion (10-20 min seems to work best)
- very long-running jobs can be accommodated using Condor DAGMan or user-level check-pointing
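User-level check-pointing, mentioned in the last caveat, can be sketched as follows. This is a generic Python illustration under stated assumptions, not the code used on the Liverpool pool: the function name, the JSON state file, and the placeholder "work" are all hypothetical. Each short job resumes from the saved state, does a bounded amount of work, and saves the state again for the next job.

```python
import json
import os


def run_with_checkpoints(total_steps, checkpoint_path, steps_per_run):
    """Advance a long computation by at most steps_per_run steps,
    resuming from checkpoint_path if it exists. Returns the state dict."""
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    end = min(total_steps, state["step"] + steps_per_run)
    while state["step"] < end:
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1

    # Write to a temporary file first so an eviction mid-write
    # cannot corrupt the previous checkpoint.
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_path)
    return state
```

Keeping `steps_per_run` small enough that each run finishes within the 10-20 minute window makes it more likely that every job in the chain completes.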

4 MATLAB advantages
- originally developed for linear algebra algorithms but now contains many built-in functions geared to different disciplines, divided into toolboxes
- intuitive interactive environment allows rapid code development
- simple but powerful file I/O: save, load (useful for check-pointing)
- allows users to create their own functions stored as M-files
- “standalone” applications can be built from M-files:
  - can run on platforms without MATLAB installed
  - do not need a licence to be able to run
  - can include all toolbox functions
- APIs available for FORTRAN and C codes (“MEX files”)

5 MATLAB disadvantages
- even standalone applications can run slower than equivalent C or FORTRAN implementations
- standalone applications aren’t quite what they may seem:
  - more than just an .exe; a “manifest” file is needed to locate run-time libraries
  - need access to MATLAB run-time libraries, usually via the MATLAB Component Runtime (150 MB self-extracting .exe)
  - luckily we have MATLAB pre-installed on all PCs in the Condor pool (originally used a network drive)
- run-time errors can be difficult to trace when MATLAB jobs are run under Condor:
  - need to run under Condor on a local PC
  - configure with USE_VISIBLE_DESKTOP=True to see pop-up messages

6 Condor/MATLAB Research Applications
- predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science)
- modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science)
- modelling of disease propagation in fish farms (Mathematical Sciences / Earth and Ocean Science)
- testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics)
- simulation of the infection of a bacterial cell by a virus (Mathematical Sciences)
- modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy)

7 Avian influenza results


10 Power-saving at Liverpool
- have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations
- original power-saving policy was to power-off machines after 30 minutes of inactivity, now hibernate them after 10 minutes of inactivity
- policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.
- makes extensive use of PowerMAN system from Data Synergy comprising:
  - service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
  - Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser

11 Power-saving at Liverpool
- Have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations
- Original power-saving policy was to power-off machines after 30 minutes of inactivity, now hibernate them after 15 minutes of inactivity
- Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.
- Makes extensive use of PowerMAN system from Data Synergy comprising:
  - service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
  - Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser
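The savings figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: the per-machine power draw and the electricity price are values implied by the slide's numbers, not figures stated in the presentation, and the 52-week year and the 22.5 MWh midpoint in the usage note are assumptions.

```python
def implied_power_watts(hours_saved: float, energy_mwh: float) -> float:
    """Average draw per idle machine implied by hours saved and energy saved."""
    return energy_mwh * 1_000_000 / hours_saved  # MWh -> Wh, then Wh / h = W


def implied_price_per_kwh(annual_saving_gbp: float, weekly_mwh: float,
                          weeks_per_year: int = 52) -> float:
    """Electricity price (GBP per kWh) implied by the annual saving.
    Assumes the weekly saving holds for every week of the year."""
    annual_kwh = weekly_mwh * 1000 * weeks_per_year
    return annual_saving_gbp / annual_kwh
```

Both ends of the quoted range (200 000 hours for 20 MWh, 250 000 hours for 25 MWh) imply about 100 W per idle machine. Taking the midpoint of 22.5 MWh per week, `implied_price_per_kwh(125_000, 22.5)` gives roughly £0.11 per kWh.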

12 Adapting Condor for use with power-saving PCs
- Two main problems:
  - how to ensure Condor jobs are not evicted by hibernating PCs
  - how to wake up dormant PCs to run Condor jobs on demand
- Originally used a Microsoft system service to power down PCs after 30 min of inactivity:
  - runs a .bat file which checks if a user is logged in and shuts the machine down if not
  - doesn’t detect the owner of a Condor job as a logged-in user
  - need to check for presence of condor_exe.bat
- PowerMAN service now prevents job eviction:
  - can provide PowerMAN with a list of “protected programs”
  - ensures that the system remains active if a protected program is running
  - include the condor_starter process as a protected program (only present while a Condor job is running)

13 Adapting Condor for use with power-saving PCs
- Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:
  - NICs must remain powered up during hibernation/power-off
  - NICs must be capable of waking machines on receipt of a “magic packet”
  - network must be able to route “magic packets”
- cron job runs on the submit host and examines the state of the queue (condor_q) and pool (condor_status):
  - if there are more idle jobs in the queue than Unclaimed machines, hibernating machines need to be woken up
  - find the number of powered-up machines in each “teaching centre” (classroom)
  - estimate the number of hibernating machines in each teaching centre from the total number of machines in each
  - sort centres from highest number of available machines to lowest
  - wake up centres in turn until sufficient machines are woken to meet the demand (or all centres are woken up)
- MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
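The wake-up logic on this slide can be sketched in Python. This is a minimal illustration, not the cron script used at Liverpool: the function names, the dictionary inputs, the broadcast address and the UDP port are assumptions, and "available machines" is read here as the estimated number of hibernating machines in a centre. The magic packet layout itself is the standard Wake-on-LAN one: six 0xFF bytes followed by the target MAC address repeated 16 times.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))


def centres_to_wake(idle_jobs, unclaimed, centre_totals, centre_powered_up):
    """Return the teaching centres to wake, in order.

    idle_jobs / unclaimed: counts taken from condor_q / condor_status.
    centre_totals: {centre: total machines}; centre_powered_up: {centre: count}.
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already
    # Estimate hibernating machines as total minus powered-up machines.
    hibernating = {c: max(total - centre_powered_up.get(c, 0), 0)
                   for c, total in centre_totals.items()}
    chosen, woken = [], 0
    for centre in sorted(hibernating, key=hibernating.get, reverse=True):
        if woken >= shortfall:
            break
        if hibernating[centre] == 0:
            continue  # nothing to wake here
        chosen.append(centre)
        woken += hibernating[centre]
    return chosen
```

Because whole centres are woken at a time, the number of machines woken can exceed the shortfall.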

14 Automatic wake up issues
- Assumes that any job can run on any machine:
  - users cannot choose particular teaching centres or machines in their job Requirements
  - ideally, pool needs to be homogeneous
  - errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate)
  - cron now includes a “sanity check” for this
- Can only estimate number of hibernating machines in each centre
- May wake up more machines than needed


16 Automatic wake up in action – Condor pool machine statistics

17 Automatic wake up in action – PowerMAN statistics

18 Recent and Future Developments
- starting to make use of automatic wake-up features of Condor 7.4.1 (condor_rooster):
  - cron advertises/updates ClassAds for offline machines
  - Condor matches offline machines to jobs and wakes up machines as needed
  - use slow ramp-up of wake-ups to prevent server “overload”
- users can now specify memory requirements, processor speed, when to run jobs etc.
- local tools available to assist in the preparation and running of MATLAB jobs: m_file_submit, matlab_build, matlab_submit
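The idea of a slow ramp-up can be illustrated with a generic batching sketch in Python. This is not how condor_rooster implements it, and the function names, batch size and pause length are all hypothetical: machines are woken a few at a time with a pause between batches, so that newly woken machines do not all contact the server at once.

```python
import time


def ramp_up_batches(macs, batch_size):
    """Split the list of MAC addresses into consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [macs[i:i + batch_size] for i in range(0, len(macs), batch_size)]


def wake_slowly(macs, wake, batch_size=10, pause_seconds=60.0, sleep=time.sleep):
    """Call wake(mac) for every machine, pausing between batches.

    wake is any callable that wakes one machine (for example a Wake-on-LAN
    sender); sleep is injectable so the pause can be skipped in tests.
    Returns the number of batches sent.
    """
    batches = ramp_up_batches(macs, batch_size)
    for index, batch in enumerate(batches):
        for mac in batch:
            wake(mac)
        if index < len(batches) - 1:
            sleep(pause_seconds)  # no pause needed after the final batch
    return len(batches)
```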

