Ian C. Smith* Harvesting unused clock cycles with Condor *Advanced Research Computing The University of Liverpool.

Ian C. Smith* Harvesting unused clock cycles with Condor *Advanced Research Computing The University of Liverpool

Overview  what is Condor ?  High Performance versus High Throughput Computing  Condor fundamentals  setting up and running a Condor Pool  The University of Liverpool Condor Pool  example applications

What is Condor ?  a specialized system for delivering High Throughput Computing  a harvester of unused computing resources  developed by Computer Science Dept at University of Wisconsin in late ‘80s  free and (now) open source software  widely used in academia and increasing in industry  available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS

HPC vs HTC (1)  High Performance Computing (HPC)  delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important)  can also provide lots of memory, large amounts of fast (parallel) storage  fairly exotic hardware, may need plenty of TLC  large capital outlay on hardware  need to run specialised parallel (MPI) codes to get the benefit (can run serial codes but these are a poor use of resources)  users run relatively small numbers of parallel jobs  essential for certain time-critical applications

HPC vs HTC (2)  High Throughput Computing (HTC)  allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important)  users more concerned with running large numbers of jobs over a long time span than a few short burst computations  makes use of existing commodity hardware (e.g. desktop PCs)  small capital outlay on hardware possible  limited memory and storage available generally  mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)

Types of Condor application  large numbers of independent calculations typically (“pleasantly parallel”)  data parallel applications – split large datasets into smaller parts and analyse independently  biological sequence analysis  processing of census data  optimisation problems  microprocessor design and testing  applications based on Monte Carlo methods  radiotherapy treatment analysis  epidemiological studies

A “typical” Condor pool Central manager Submit/execute host Submit host Execute hosts

A “typical” Condor pool Central manager Submit/execute host Submit host Execute hosts ClassAds

A “typical” Condor pool Central manager Submit/execute host Submit host Execute hosts Match Info

A “typical” Condor pool Central manager Submit/execute host Submit host Execute hosts Jobs

A “typical” Condor pool Central manager Submit/execute host Submit host Execute hosts Output

ClassAds and Matchmaking  ClassAds are a fundamental part of Condor  similar to classified advertisements in a paper  “Job Ads” represent jobs to Condor (similar to “wanted” ads)  “Machine Ads” represent compute resources in a Condor Pool (similar to “for sale” ads)  Condor central manager matches Machine Ads to Job Ads and hence machines to jobs  Job Ads are created using submit description files

Simple submit description file # simple submit description file # (anything following a # is comment and is ignored by Condor) #this would be used for Windows XP based execute hosts universe = vanilla executable = example.exe # what to run output = stdout.out$(PROCESS)# job`s standard output log = mylog.log$(PROCESS)# log job`s activities transfer_input_files = common.txt, myinput$(PROCESS).txt # input files needed requirements = ( Arch=="Intel") && ( OpSys=="WINNT51" )# what machines to run on queue 2# number of jobs to queue

Requirements and Rank  Requirements expression determines where (and when) a job will run e.g.  Rank is used to express a preference Requirements = ( OpSys==“WINNT51” ) && # Windows XP OS wanted ( Arch==“Intel” ) && \# Intel/compatible processor ( Memory >= 2000 ) && \# want a least 2GB memory and ( Disk >= 33554432 ) && \# at least 32 GB of free disk ( HAS_MATLAB == TRUE ) && \# must have MATLAB installed ( ( ClockMin > 1020 ) || \# only run jobs after 5 pm OR... ( ClockMin == 6 ) || ( ClockDay == 0) )# at weekends Rank = Kflops# run on machines with best floating point performance first

Job submission and monitoring [einstein@submit ~]$ condor_submit example.sub Submitting job(s). 2 job(s) submitted to cluster 100. [einstein@submit ~]$ condor_q -- Submitter: submit.chtc.wisc.edu : : submit.chtc.wisc.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 sagan 7/22 14:19 172+21:28:36 R 0 22.0 checkprogress.cron 2.0 heisenberg 1/13 13:59 0+00:00:00 I 0 0.0 env 3.0 hawking 1/15 19:18 0+04:29:33 R 0 0.0 script.sh 4.0 hawking 1/15 19:33 0+00:00:00 R 0 0.0 script.sh 5.0 hawking 1/15 19:33 0+00:00:00 H 0 0.0 script.sh 6.0 hawking 1/15 19:34 0+00:00:00 R 0 0.0 script.sh... 96.0 bohr 4/5 13:46 0+00:00:00 I 0 0.0 c2b_dops.sh 97.0 bohr 4/5 13:46 0+00:00:00 I 0 0.0 c2b_dops.sh 98.0 bohr 4/5 13:52 0+00:00:00 I 0 0.0 c2b_dopc.sh 99.0 bohr 4/5 13:52 0+00:00:00 I 0 0.0 c2b_dopc.sh 100.0 einstein 4/5 13:55 0+00:00:00 I 0 0.0 cosmos 557 jobs; 402 idle, 145 running, 1 held [einstein@submit ~]$

Condor policies  Condor supports a wide range of policies for when to start jobs e.g.  run jobs only outside office hours  run jobs only if load average on host is small and there has been no recent activity  run jobs at any time on one core (at low priority)  run jobs only submitted by certain users  also a wide choice of what to do when a job is about to be interrupted e.g.  suspend the job for a limited time then let it resume  checkpoint the job and migrate it to another machine  kill off the job immediately

UNIX or Windows execute hosts ? (1)  UNIX  Condor’s natural environment  not widely installed on desktop machines (but depends on institution...)  supports the Condor “standard universe” containing many useful features  checkpointing allows jobs to be migrated from one machine to another without loss of useful work  Remote Procedure Calls give transparent access to files on submit host  streaming of standard output (stdout) from jobs to submit host  Network filesystems work well making installation and configration much easier  leverages large amount of scientific and engineering codes which have been developed under UNIX

UNIX or Windows execute hosts ? (2)  Windows  world’s most widely installed OS – rich source of execute hosts  many commercial 3 rd party applications run on Windows  using shared (network) filesystems can be difficult under Condor  only supports the “vanilla” Condor universe  no checkpointing – evicted jobs may waste a lot of cycles  all input and output files need to be transferred to/from execute host  output streaming not supported  may be difficult to port “legacy” UNIX codes (although Cygwin and Co-Linux can make life easier)  Windows support from the U-W Condor Team tends to lag behind UNIX

Setting up a Condor pool  best to start off small and build up pool slowly  need to understand Condor fundamentals:  role of Condor processes and how they interact  life-cycle of jobs  ClassAds and Matchmaking  avoid firewalls if possible (may be easier said than done...)  talk to central IT services (particularly network and PC teams)  submit hosts may need to be fairly high spec if large numbers of jobs are to be run - ideally want  multi-core/processor machine (quad core at least)  plenty of memory (say 8 GB or more)  large fast access filestore (e.g. 1 TB RAID)

Where to go for help  Read The Fine Manual !  log files contain a lot of useful information  take a look at the presentations, tutorials and “how-to recipes”on the Condor website: (www.cs.wisc.edu/condor)www.cs.wisc.edu/condor  search the condor-users mail list archive: (lists.cs.wisc.edu/archive/condor-users)lists.cs.wisc.edu/archive/condor-users  subscribe to the condor-users mail list  join the Campus Grids SIG: (wikis.nesc.ac.uk/escinet/Campus_Grids)wikis.nesc.ac.uk/escinet/Campus_Grids  commercial support is also available (e.g. Cycle Computing)

University of Liverpool Condor Pool  contains around 400 machines running the University’s Managed Windows Service (currently XP but moving to Windows 7 soon)  most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine  single submission point for Condor jobs provided by Sun Solaris V445 SMP server  policy is to run jobs only if a least 5 minutes of inactivity and low load average during office hours and at anytime outside of office hours  job will be killed off if running when a user logs in to a PC  web interface for specific applications  support for running large numbers of MATLAB jobs

Condor service caveats  only suitable for DOS-based applications running in batch mode  no communication between processes possible (“pleasantly parallel” applications only)  statically linked executables work best (although can cope with DLLs)  all files needed by application must be present on local disk (cannot access network drives)  shorter jobs more likely to run to completion (10-20 min seems to work best)  very long running jobs can accommodated using Condor DAGMan or user level check-pointing (details available soon on the Condor website)

Running MATLAB jobs under Condor  many users prefer to create applications using MATLAB rather than traditional compiled languages (e.g. FORTRAN, C)  need to create standalone application from M-file(s) using MATLAB compiler  standalone application can run without a MATLAB license  run-time libraries still need to be accessible to MATLAB jobs  nearly all toolbox functions available to standalone applications  simple (but powerful) file I/O makes checkpointing easier  see Liverpool Condor website for more information

Power-saving and Green IT at Liverpool  we have around 2 000 centrally managed classroom PCs across campus which were powered up overnight, at weekends and during vacations.  original power-saving policy was to power-off machines after 30 minutes of inactivity, we now hibernate them after 15 minutes of inactivity  policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.  3 rd party power management software (PowerMAN) prevents machines hibernating whilst Condor jobs are running  Condor’s own power management features allows machines to be woken up automatically according to demand

Condor-G and Grid Computing  Condor-G is an extension to Condor allowing job submission to remote resources using Globus  provides familiar Condor-like interface to users hiding the underlying middleware complexity  we have used Condor-G to give users grid access to a variety of HPC resources:  local HPC clusters (UL-Grid)  NW-Grid resources at Daresbury Lab, Lancaster and Manchester  National Grid Service facilities  Grid Computing Server tools provide a batch environment similar to that of cluster systems (e.g. Sun Grid Engine)  Web portal removes the need for command line use completely

Radiotherapy example  3D model of normal tissue was developed in which complications are generated when ‘irradiated’ [1]  aim is to provide insight into connection between dose- distribution characteristics, different organ architectures and complication rates beyond that of analytical methods  code written in MATLAB and compiled into standalone executable  set of 800 simulations took ~ 36 hours to run on Condor pool  would require 4-5 months of computing time on a single PC  several dozen sets of simulations have since been completed [1] Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic simulation of normal-tissue damage in radiotherapy—implications for dose–volume analyses. Phys. Med. Biol. 55 (2010) 2121–2136.

Personalised Medicine example  project is a Genome-Wide Association Study  aims to identify genetic predictors of response to anti-epileptic drugs  try to identify regions of the human genome that differ between individuals (referred to as SNPs)  800 patients genotyped at 500 000 SNPs along the entire genome  test statistically the association between SNPs and outcomes (e.g. time to withdrawl of drug due to adverse effects)  very large data-parallel problem – ideal for Condor  divide datasets into small partitions so that individual jobs run for 15-30 minutes  batch of 26 chromosomes (2 600 jobs) required ~ 5 hours compute time on Condor but ~ 5 weeks on a single PC

Epidemiology example  researchers have simulated the consequences of an incursion of H5N1 avian influenza into the British poultry flock [2]  Monte Carlo type method - highly parallel  original code written in MATLAB and compiled into standalone application  individual simulations take only 10-15 minutes to run – ideal for Condor  require ~ 10 000 - 20 000 simulations per scenario  would have needed several years of compute time on single machine, on Condor needed a few weeks [2] Sharkey, K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M. Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock. Proc. R. Soc. B 2008 275, 19-28

Further Information http://www.liv.ac.uk/e-science/condor i.c.smith@liverpool.ac.uk

Ian C. Smith* Harvesting unused clock cycles with Condor *Advanced Research Computing The University of Liverpool.

Similar presentations

Presentation on theme: "Ian C. Smith* Harvesting unused clock cycles with Condor *Advanced Research Computing The University of Liverpool."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Ian C. Smith* Harvesting unused clock cycles with Condor *Advanced Research Computing The University of Liverpool.

Similar presentations

Presentation on theme: "Ian C. Smith* Harvesting unused clock cycles with Condor *Advanced Research Computing The University of Liverpool."— Presentation transcript:

Similar presentations

About project

Feedback