Ian C. Smith* Introduction to research computing using Condor *Advanced Research Computing University of Liverpool.

1 Ian C. Smith* Introduction to research computing using Condor *Advanced Research Computing University of Liverpool

2 Overview
- what is Condor and what can it be used for?
- typical Condor pool operation
- University of Liverpool Condor Pool
- support for MATLAB and R applications
- some research computing examples
- quick introduction to UNIX with a walk-through example

3 What is Condor?
- a specialized system for delivering High Throughput Computing
- a harvester of unused computing resources
- developed by the Computer Science Dept at the University of Wisconsin in the late ‘80s
- free and (now) open source software
- widely used in academia and increasingly in industry
- available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS

4 Types of Condor application
- typically: large numbers of independent calculations (“pleasantly parallel”)
- data parallel applications: split large datasets into smaller parts and process them in parallel
- biological sequence analysis (e.g. BLAST)
- processing of field trial data
- optimisation problems
- microprocessor design and testing
- applications based on Monte Carlo methods
- radiotherapy treatment analysis
- epidemiological studies
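
As an illustration of the data-parallel idea on this slide, here is a minimal Python sketch (not part of the original slides) that splits a dataset into nearly equal parts, one per job. The function name `split_into_chunks` and the toy data are assumptions made for the example; a real workflow would write each part to its own input file.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Toy dataset of 10 records split across 3 hypothetical jobs.
records = list(range(10))
parts = split_into_chunks(records, 3)
print(parts)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each part can then be processed independently, which is what makes the workload suitable for a pool of unconnected PCs.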

5 A “typical” Condor pool (diagram: Desktop PC, Condor Server, Execute hosts). Desktop PC to Condor Server: login and upload input data

6 A “typical” Condor pool (same diagram). Condor Server to Execute hosts: jobs

7 A “typical” Condor pool (same diagram). Execute hosts to Condor Server: results

8 A “typical” Condor pool (same diagram). Condor Server to Desktop PC: download results

9 University of Liverpool Condor Pool
- contains around 700 classroom PCs running the CSD Managed Windows 7 Service (mostly 64-bit from next year)
- most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per PC (total of 1400 job slots)
- single job submission point for Condor jobs provided by a powerful UNIX server
- jobs continue to run while classroom PCs are unused but...
- if load (or memory use) becomes significant, the job will be killed and usually any results will be lost (the job will start again from scratch)
- tools provided for running large numbers of MATLAB and R jobs

10 Condor caveats
- only suitable for non-interactive applications
- no communication between jobs possible
- all files needed by the application must be present on local disk
- shorter jobs are more likely to run to completion (10-20 min seems to work best)
- long running jobs can be run if a save/restore mechanism (checkpointing) is built into them
- tricky to begin with but usually worth the initial effort
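
The save/restore (checkpointing) idea can be sketched in a few lines of Python. This is an illustrative example only, not the mechanism used on the Liverpool pool: the file name `checkpoint.json`, the function `run_with_checkpoint` and the `stop_after` argument (which simulates an eviction) are all assumptions. How a checkpoint file is carried back from an evicted job depends on the Condor configuration and is not covered here.

```python
import json
import os
import tempfile


def run_with_checkpoint(n_steps, checkpoint_path, stop_after=None):
    """Sum the integers 0..n_steps-1, saving progress after every step.

    If a checkpoint file exists, work resumes from it instead of from scratch.
    stop_after simulates an eviction after that many steps in this run.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    steps_done = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # simulated eviction: no final result yet
        state["total"] += state["next_step"]
        state["next_step"] += 1
        # Write to a temporary file, then rename, so that an interruption
        # during the write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
        steps_done += 1
    return state["total"]


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.json")
first = run_with_checkpoint(10, ckpt, stop_after=4)  # interrupted after 4 steps
second = run_with_checkpoint(10, ckpt)               # resumes at step 4
print(first, second)  # None 45
```

The second run only performs the remaining six steps, which is the point of checkpointing: an eviction costs the work since the last save rather than the whole job.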

11 Running MATLAB jobs under Condor
- need to create a standalone application from M-file(s) using the MATLAB compiler
- standalone application can run without a MATLAB license
- run-time libraries still need to be accessible to MATLAB jobs
- nearly all toolbox functions are available to standalone applications
- simple (but powerful) file input/output makes checkpointing easier
- tools available to simplify job submission: see the Liverpool Condor website for more information

12 Running R jobs under Condor
- limited support at present
- R is installed on-the-fly as part of the job
- currently only R version 2.6.2 is available, with standard packages
- tools available to simplify job submission
- checkpointing may be possible for long running jobs

13 Personalised Medicine example
- project is a Genome-Wide Association Study
- aims to identify genetic predictors of response to anti-epileptic drugs
- try to identify regions of the human genome that differ between individuals (referred to as SNPs)
- 800 patients genotyped at 500 000 SNPs along the entire genome
- test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)
- very large data-parallel problem using R: ideal for Condor
- divide datasets into small partitions so that individual jobs run for 15-30 minutes
- batch of 26 chromosomes (2 600 jobs) required ~ 5 hours wallclock time on Condor but ~ 5 weeks on a single PC
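
The figures on this slide can be cross-checked with a little arithmetic. The Python sketch below uses only the numbers quoted above and assumes, for the per-job estimate, that the single PC runs the jobs back to back; treat the results as rough estimates, since the slide values are themselves approximate.

```python
chromosomes = 26
jobs_total = 2600
jobs_per_chromosome = jobs_total // chromosomes

single_pc_hours = 5 * 7 * 24   # ~5 weeks on a single PC, in hours
condor_hours = 5               # ~5 hours wallclock time on Condor
speedup = single_pc_hours / condor_hours

# Implied average run time per job if a single PC ran all jobs one after another.
minutes_per_job = single_pc_hours * 60 / jobs_total

print(jobs_per_chromosome)        # 100
print(speedup)                    # 168.0
print(round(minutes_per_job, 1))  # 19.4
```

The implied ~19 minutes per job sits inside the 15-30 minute target range stated on the slide, so the quoted numbers are mutually consistent.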

14 Radiotherapy example
- large 3rd party application code which simulates photon beam radiotherapy treatment using Monte Carlo methods
- tried running the simulation on 56 cores of a high performance computing cluster but no progress after 5 weeks
- divided the problem into 250, then 5 000 and eventually 50 000 Condor jobs
- required ~ 2 600 days of CPU time (equivalent to ~ 3.5 years on a dual core PC)
- Condor simulation completed in less than one week
- average run time was ~ 70 min
- only ~ 10 % of compute time wasted due to evictions
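
Monte Carlo simulations split well into Condor jobs because each job can draw its own random samples and the results are combined afterwards. The Python sketch below shows the pattern on a toy problem (estimating pi), not on the radiotherapy code itself; the names `run_job` and `combine`, the seeds and the sample counts are assumptions made for illustration. Using the job index as the seed is a simplification; production codes usually take more care to obtain independent random streams.

```python
import random


def run_job(seed, n_samples):
    """One independent job: count random points in the unit square
    that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # each job gets its own generator
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def combine(hit_counts, n_samples_per_job):
    """Merge the per-job counts into a single estimate of pi."""
    total_samples = len(hit_counts) * n_samples_per_job
    return 4.0 * sum(hit_counts) / total_samples


n_jobs, n_samples = 20, 5000
# On a Condor pool each call would be a separate job; here they run in a loop.
results = [run_job(seed, n_samples) for seed in range(n_jobs)]
pi_estimate = combine(results, n_samples)
print(round(pi_estimate, 2))
```

Because the jobs never communicate, an evicted job can simply be rerun with the same seed without affecting the others.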

15 Condor service prerequisites
- will need a Sun UNIX service account (contact CSD helpdesk@liv.ac.uk) and a Condor account (http://www.liv.ac.uk/csd/registration/eScienceform.pdf)
- to log in to the Condor server:
  - on MWS use PuTTy: Install University Applications | Internet | PuTTy 0.60
  - Mac/Linux: open terminal window and use ssh
  - off campus: use Apps Anywhere (PuTTy is in Utilities group)
- to upload/download files to/from the Condor server:
  - on MWS use CoreFTPLite: Install University Applications | Internet | CoreFTP LE2.1
  - Mac/Linux: open terminal window, use sftp/scp
  - off campus: need to use virtual private network (VPN), then FTP

16 PuTTy login

19 CoreFTP Lite

25 CoreFTP Lite – download files

27 Condor server directory tree (diagram): / (or ‘root’) contains /usr, /bin, /sbin, /tmp, /home and /condor_data

28 Condor server directory tree (diagram): /home contains the login ‘home’ directories /home/fred, /home/smithic and /home/jim

29 Condor server directory tree (diagram): /condor_data contains the ‘home’ directories for Condor, /condor_data/smithic and /condor_data/jim

30 MATLAB Condor example
calculate the sum of p matrix-matrix products:
- each product calculation is independent and can be performed in parallel
- MATLAB M-file (product.m):
    function product
    load input.mat;
    C=A*B;
    save( 'output.mat', 'C' );
    quit;
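
For readers who do not use MATLAB, the same decomposition can be sketched in plain Python. This is an illustration under stated assumptions, not a translation of the Liverpool tools: the helper names `matmul` and `matadd` and the small 2x2 matrices are invented for the example. Each product corresponds to one independent job; the final sum is a cheap step performed once all products are available.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


def matadd(x, y):
    """Add two matrices of the same shape element by element."""
    return [[xv + yv for xv, yv in zip(xrow, yrow)] for xrow, yrow in zip(x, y)]


# p = 2 pairs (A, B); on Condor each pair would live in its own input file.
pairs = [
    ([[1, 2], [3, 4]], [[1, 0], [0, 1]]),
    ([[0, 1], [1, 0]], [[2, 0], [0, 2]]),
]

# Independent step: one product per job.
products = [matmul(a, b) for a, b in pairs]

# Final step: sum the per-job results.
total = products[0]
for c in products[1:]:
    total = matadd(total, c)

print(total)  # [[1, 4], [5, 4]]
```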

37 Job submission example
    [smithic@ulgp5 multiple]$ cd /condor_data/smithic    #change directory
    [smithic@ulgp5 smithic]$ tar xf /opt1/condor/examples/handson.tar    #get examples
    [smithic@ulgp5 smithic]$ cd matlab    #now in /condor_data/smithic/matlab
    [smithic@ulgp5 matlab]$ ls    #list files
    input0.mat input2.mat input4.mat product
    input1.mat input3.mat product.m
    [smithic@ulgp5 matlab]$ matlab_build product.m    #create standalone executable (product.exe)
    Submitting job(s).
    1 job(s) submitted to cluster 503.
    [smithic@ulgp5 matlab]$ condor_q    #get Condor queue status
    -- Schedd: Q6@ulgp5.liv.ac.uk :
     ID     OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
     503.0  smithic  6/7 15:19   0+00:00:10 R  0   0.0  runscript.bat wrap
    1 jobs; 0 idle, 1 running, 0 held
    [smithic@ulgp5 matlab]$ condor_q    #job has finished when gone from queue
    -- Schedd: Q6@ulgp5.liv.ac.uk :
     ID     OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
    0 jobs; 0 idle, 0 running, 0 held
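
Rather than reading the queue listing by eye, the summary line can be checked by a script. The Python sketch below parses a summary line of the form shown in this transcript ("1 jobs; 0 idle, 1 running, 0 held"); the function names are assumptions, and other Condor versions may print the summary differently, so the pattern should be checked against real output before relying on it.

```python
import re

# Pattern for the summary line as it appears in the transcript above.
SUMMARY_RE = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_queue_summary(text):
    """Return the job counts from condor_q output, or None if no summary is found."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    jobs, idle, running, held = (int(g) for g in match.groups())
    return {"jobs": jobs, "idle": idle, "running": running, "held": held}


def all_jobs_finished(text):
    """True when the summary line reports an empty queue."""
    summary = parse_queue_summary(text)
    return summary is not None and summary["jobs"] == 0


running_output = "503.0 smithic 6/7 15:19 0+00:00:10 R 0 0.0 runscript.bat wrap\n1 jobs; 0 idle, 1 running, 0 held"
finished_output = "0 jobs; 0 idle, 0 running, 0 held"

print(parse_queue_summary(running_output))
print(all_jobs_finished(running_output), all_jobs_finished(finished_output))  # False True
```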

41 Job submission example
    [smithic@ulgp5 matlab]$ ls
    input0.mat input2.mat input4.mat product.bat product.exe.manifest product.sub
    input1.mat input3.mat product    product.exe product.m
    [smithic@ulgp5 matlab]$ cat product    #display file contents
    executable=product.exe
    indexed_input_files=input.mat
    indexed_output_files=output.mat
    total_jobs=5
    [smithic@ulgp5 matlab]$ matlab_submit product    #submit multiple Matlab jobs
    Submitting job(s).....
    5 job(s) submitted to cluster 511.
    [smithic@ulgp5 matlab]$ condor_q    #get status of jobs
    -- Schedd: Q6@ulgp5.liv.ac.uk :
     ID     OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
     511.0  smithic  6/7 16:01   0+00:00:02 R  0   0.0  product.bat produc
     511.1  smithic  6/7 16:01   0+00:00:02 R  0   0.0  product.bat produc
     511.2  smithic  6/7 16:01   0+00:00:02 R  0   0.0  product.bat produc
     511.3  smithic  6/7 16:01   0+00:00:02 R  0   0.0  product.bat produc
     511.4  smithic  6/7 16:01   0+00:00:02 R  0   0.0  product.bat produc
    5 jobs; 0 idle, 5 running, 0 held
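
The directory listings suggest how the indexed file names relate to the job description: with indexed_input_files=input.mat and total_jobs=5, the inputs are input0.mat to input4.mat. The Python sketch below reproduces that naming convention. It is inferred from the listings only; the internals of matlab_submit are not shown in the slides, and the function names here are hypothetical.

```python
import os


def parse_job_description(text):
    """Parse key=value lines such as those printed by `cat product`."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def indexed_name(filename, index):
    """Insert the job index before the extension: input.mat -> input3.mat."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}{index}{ext}"


def expand_jobs(settings):
    """List the (input, output) file pair expected for each job index."""
    total = int(settings["total_jobs"])
    return [
        (indexed_name(settings["indexed_input_files"], i),
         indexed_name(settings["indexed_output_files"], i))
        for i in range(total)
    ]


description = """executable=product.exe
indexed_input_files=input.mat
indexed_output_files=output.mat
total_jobs=5"""

jobs = expand_jobs(parse_job_description(description))
print(jobs[0], jobs[-1])  # ('input0.mat', 'output0.mat') ('input4.mat', 'output4.mat')
```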

45 Job submission example
    [smithic@ulgp5 matlab]$ condor_q    #some jobs completed, one still running
    -- Schedd: Q6@ulgp5.liv.ac.uk :
     ID     OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
     511.0  smithic  6/7 16:01   0+00:00:25 R  0   0.0  product.bat produc
    1 jobs; 0 idle, 1 running, 0 held
    [smithic@ulgp5 matlab]$ condor_q    #all jobs complete
    -- Schedd: Q6@ulgp5.liv.ac.uk :
     ID     OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
    0 jobs; 0 idle, 0 running, 0 held
    [smithic@ulgp5 matlab]$ ls    #check output files
    input0.mat input3.mat  output1.mat output4.mat product.exe product.sub
    input1.mat input4.mat  output2.mat product     product.exe.manifest
    input2.mat output0.mat output3.mat product.bat product.m
    [smithic@ulgp5 matlab]$ zip output.zip output*.mat    #bundle output files
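
With five jobs the output files are easy to check by eye, but with thousands of jobs a script helps. The Python sketch below reports which indexed output files are missing from a directory listing, assuming the outputN.mat naming seen above; the function name `missing_outputs` and its default arguments are assumptions made for the example.

```python
import re


def missing_outputs(filenames, total_jobs, stem="output", ext=".mat"):
    """Return the job indices whose output file is absent from filenames."""
    pattern = re.compile(re.escape(stem) + r"(\d+)" + re.escape(ext) + r"$")
    found = set()
    for name in filenames:
        match = pattern.match(name)
        if match:
            found.add(int(match.group(1)))
    return [i for i in range(total_jobs) if i not in found]


# File names taken from the listing in the transcript above.
listing = [
    "input0.mat", "input1.mat", "input2.mat", "input3.mat", "input4.mat",
    "output0.mat", "output1.mat", "output2.mat", "output3.mat", "output4.mat",
    "product", "product.bat", "product.exe", "product.exe.manifest",
    "product.m", "product.sub",
]

print(missing_outputs(listing, 5))  # []

# If one job had failed to return its result:
partial = [name for name in listing if name != "output3.mat"]
print(missing_outputs(partial, 5))  # [3]
```

An empty list means every job returned a result, so the outputs are ready to bundle.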

46 Summary
- Condor can speed up processing by running large numbers of jobs in parallel
- shorter jobs work best, but Condor can deal with jobs of arbitrary length
- user-written codes are easiest to run (MATLAB, R, C/C++, FORTRAN etc)
- commercial 3rd party software may work
- applications need to run on a standard MWS PC without user interaction
- all Condor jobs are submitted via the central UNIX server

47 Further Information
- Condor: http://www.liv.ac.uk/e-science/condor (i.c.smith@liverpool.ac.uk)
- other research computing services: http://www.liv.ac.uk/csd/research/ (arc-support@liverpool.ac.uk)

