
Douglas Thain Computer Sciences Department University of Wisconsin-Madison (In Bologna for June 2000) Condor.


1 Douglas Thain Computer Sciences Department University of Wisconsin-Madison (In Bologna for June 2000) thain@cs.wisc.edu http://www.cs.wisc.edu/condor Condor by Example

2 www.cs.wisc.edu/condor Outline › Overview › Submitting Jobs, Getting Feedback › Setting Requirements with ClassAds › Using LOTS of Machines › Which Universe? › Conclusion

3 www.cs.wisc.edu/condor What is Condor? › Condor converts a collection of unrelated workstations into a high-throughput computing facility. › Condor uses matchmaking to make sure that everyone is happy.

4 www.cs.wisc.edu/condor What is High-Throughput Computing? › High-performance: CPU cycles/second under ideal circumstances. • “How fast can I run simulation X on this machine?” › High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. • “How many times can I run simulation X in the next week using all available machines?”

5 www.cs.wisc.edu/condor What is High-Throughput Computing? › Condor does whatever it takes to run your jobs, even if some machines… • Crash! • Are disconnected • Run out of disk space • Are removed from or added to the pool • Are put to other uses

6 www.cs.wisc.edu/condor What is Matchmaking? › Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners. › Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” › Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
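
The two-sided check on this slide can be sketched in a few lines of Python. This is an illustration only, not Condor's matchmaker: the dictionaries, the `owner_away` flag, the `banned_users` set, and the function names are all hypothetical stand-ins for the user and owner constraints quoted above.

```python
# Illustrative two-sided match; NOT Condor's actual matchmaker.
# Every attribute name below is a hypothetical stand-in.

def job_accepts(job, machine):
    # User constraint: "I need an Alpha with 256 MB RAM"
    return (machine["arch"] == job["arch"]
            and machine["memory_mb"] >= job["min_memory_mb"])

def machine_accepts(machine, job):
    # Owner constraint: only while the owner is away, never for banned users
    return machine["owner_away"] and job["owner"] not in machine["banned_users"]

def matches(job, machine):
    # A match requires BOTH sides to agree
    return job_accepts(job, machine) and machine_accepts(machine, job)

machine = {"arch": "ALPHA", "memory_mb": 256,
           "owner_away": True, "banned_users": {"bob"}}
alice_job = {"owner": "alice", "arch": "ALPHA", "min_memory_mb": 256}
bob_job = {"owner": "bob", "arch": "ALPHA", "min_memory_mb": 256}

print(matches(alice_job, machine))  # True
print(matches(bob_job, machine))    # False: the owner refuses Bob's jobs
```

The point of the sketch is that neither side can force a match: the same machine is acceptable to both jobs, but only one job is acceptable to the machine.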

7 www.cs.wisc.edu/condor “What can Condor do for me?” Condor can… › …increase your throughput. › …do your housekeeping. › …improve reliability. › …give performance feedback.

8 www.cs.wisc.edu/condor The INFN Condor Pool

9 www.cs.wisc.edu/condor How many machines now? › The map is out of date! › The system is always changing. › First example: What machines (and of what kind) are in the pool now?

10 www.cs.wisc.edu/condor First Things First › Set your path: setenv PATH /library/condor_nfs/XXX/bin › XXX should be your system: OSF1, LINUX, SOLARIS26, HPUX10 …

11 www.cs.wisc.edu/condor How Many Machines?
    % condor_status
    Name           OpSys        Arch   State      Activity  LoadAv  Mem
    lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000    30
    axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266    96
    vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
    ...
                       Machines  Owner  Claimed  Unclaimed  Matched  Preempting
    ALPHA/OSF1              115     67       46          1        0           1
    INTEL/LINUX              53     18        0         35        0           0
    INTEL/LINUX-GLIBC        16      7        0          9        0           0
    SUN4u/SOLARIS251          1      1        0          0        0           0
    SUN4u/SOLARIS26           6      2        0          4        0           0
    SUN4u/SOLARIS27           1      1        0          0        0           0
    SUN4x/SOLARIS26           2      1        0          1        0           0
    Total                   194     97       46         50        0           1

12 www.cs.wisc.edu/condor Machine States › Most machines will be: • Owner: The machine’s owner is busy at the console, so no Condor jobs may run. • Claimed: Condor has selected the machine to run jobs for other users.

13 www.cs.wisc.edu/condor Machine States › Only a few should be: • Unclaimed: The owner is gone, but Condor has not yet selected the machine. • Matched: Between claimed and unclaimed. • Preempting: Condor is busy removing a job.

14 www.cs.wisc.edu/condor More Examples
    % condor_status -help
    % condor_status -avail
    % condor_status -run
    % condor_status -total
    % condor_status -pool condor.cs.wisc.edu

15 www.cs.wisc.edu/condor Submitting Jobs

16 www.cs.wisc.edu/condor Steps to Running a Job › Re-link for Condor. › Submit the job. › Watch the progress. › Receive email when done.

17 www.cs.wisc.edu/condor Example Job › Compute the nth Fibonacci number. › Fib(40) takes about one minute to compute on an Alpha.
    % ./fib 40
    fib(40) = 102334155

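18 www.cs.wisc.edu/condor
    #include <stdio.h>   /* printf, fprintf */
    #include <stdlib.h>  /* atoi */

    /* Naive recursive Fibonacci: exponential time, which makes it a handy test job. */
    int fib( int x )
    {
        if( x<=0 ) return 0;
        if( x==1 ) return 1;
        return fib(x-1) + fib(x-2);
    }

    int main(int argc, char *argv[])
    {
        int n;
        if( argc<2 ) {   /* avoid reading argv[1] when no argument is given */
            fprintf(stderr, "usage: %s n\n", argv[0]);
            return 1;
        }
        n = atoi(argv[1]);
        printf("fib(%d) = %d\n", n, fib(n));
        return 0;
    }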
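
As a cross-check on the value quoted for Fib(40), here is an iterative Python version with the same base cases as the C program. It is not part of the talk. Because Python integers do not overflow, it can also show where the C version's result stops fitting, assuming `int` is a 32-bit signed type on the machine in question.

```python
def fib(n):
    """Iterative Fibonacci with the same base cases as the C program."""
    if n <= 0:
        return 0
    a, b = 0, 1  # fib(0), fib(1)
    for _ in range(n - 1):
        a, b = b, a + b
    return b

INT32_MAX = 2**31 - 1  # assumption: the C compiler uses a 32-bit signed int

print(fib(40))  # 102334155

# Largest n whose result still fits in a 32-bit signed int
largest_safe_n = max(n for n in range(60) if fib(n) <= INT32_MAX)
print(largest_safe_n)  # 46
```

Under that assumption, arguments above 46 would print a wrapped value from the C program, which is worth remembering when the later slides sweep the argument across many processes.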

19 www.cs.wisc.edu/condor Re-link for Condor › Normal compile: gcc -c fib.c -o fib.o › Normal link: gcc fib.o -o fib › Use the same command, but add condor_compile: condor_compile gcc fib.o -o fib

20 www.cs.wisc.edu/condor Submit the Job › Create a submit file: vi fib.submit
    Executable = fib
    Arguments  = 40
    Output     = fib.out
    Log        = fib.log
    queue
› Submit the job: condor_submit fib.submit

21 www.cs.wisc.edu/condor Watch the Progress
    % condor_q
    -- Submitter: axpbo8.bo.infn.it : :
     ID    OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
     5.0   thain  6/21 12:40   0+00:00:15  R  0   2.5  fib 40
› ID: Each job gets a unique number. › ST: Status: Unexpanded, Running or Idle. › SIZE: Size of program image (MB).

22 www.cs.wisc.edu/condor Receive E-mail When Done
    This is an automated email from the Condor system on machine "axpbo8.bo.infn.it". Do not reply.
    Your condor job /tmp_mnt/usr/users/ccl/thain/test/fib 40 exited with status 0.
    Submitted at:    Wed Jun 21 14:24:42 2000
    Completed at:    Wed Jun 21 14:36:36 2000
    Real Time:       0 00:11:54
    Run Time:        0 00:06:52
    Committed Time:  0 00:01:37
    ...

23 www.cs.wisc.edu/condor Running Many Processes › 100 processes are almost as easy as 1. › Each condor_submit makes one cluster of one or more processes. › Add the number of processes to run to the Queue statement. › Use the $(PROCESS) variable to give each process slightly different instructions.

24 www.cs.wisc.edu/condor Running Many Processes › Compute Fib(0) through Fib(49): $(PROCESS) counts from 0, so Queue 50 yields processes 0 to 49. › Output goes in fib.out.0, fib.out.1, and so on…
    Executable = fib
    Arguments  = $(PROCESS)
    Output     = fib.out.$(PROCESS)
    Log        = fib.log
    Queue 50
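
To see what this submit file produces, the substitution can be mimicked in Python. The `expand` helper is hypothetical (Condor performs the expansion itself). It assumes process numbers start at 0, which agrees with the job ad shown later for job 9.49, whose Args is "49" and whose Out is "fib.out.49".

```python
# Hypothetical helper that mimics $(PROCESS) substitution for "Queue N".
TEMPLATE = {
    "Executable": "fib",
    "Arguments": "$(PROCESS)",
    "Output": "fib.out.$(PROCESS)",
    "Log": "fib.log",
}

def expand(template, count):
    """Return one settings dict per process, numbered 0 .. count-1."""
    return [
        {key: value.replace("$(PROCESS)", str(proc))
         for key, value in template.items()}
        for proc in range(count)
    ]

jobs = expand(TEMPLATE, 50)
print(jobs[0]["Output"])   # fib.out.0
print(jobs[49]["Output"])  # fib.out.49
```

All 50 processes share one log file, fib.log, because that line contains no $(PROCESS) reference.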

25 www.cs.wisc.edu/condor Running Many Processes › Another approach: Each process gets its own directory (dir0, dir1, …) and sends the output to dirX/fib.out.
    Executable  = fib
    Arguments   = $(PROCESS)
    Initial_Dir = dir$(PROCESS)
    Output      = fib.out
    Log         = fib.log
    Queue 50
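
The slide does not say who creates the per-process directories. Assuming they must already exist when the job is submitted, a small script can prepare them. The `make_job_dirs` helper and the use of a scratch directory are illustrative choices, not part of Condor.

```python
import os
import tempfile

def make_job_dirs(base, count, prefix="dir"):
    """Create base/dir0 .. base/dir<count-1> and return the paths."""
    paths = []
    for proc in range(count):
        path = os.path.join(base, f"{prefix}{proc}")
        os.makedirs(path, exist_ok=True)  # harmless if it already exists
        paths.append(path)
    return paths

# Demonstrated in a scratch directory so nothing real is touched.
with tempfile.TemporaryDirectory() as scratch:
    created = make_job_dirs(scratch, 50)
    names = sorted(os.listdir(scratch))
    print(len(names))  # 50
```

In real use, `base` would be the directory that holds the submit file, so that `dir$(PROCESS)` resolves relative to it.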

26 www.cs.wisc.edu/condor Running Many Processes
    % condor_q
    -- Submitter: axpbo8.bo.infn.it :
     ID    OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE CMD
     9.3   thain  6/23 10:47   0+00:05:40  R  0   2.5  fib 3
     9.6   thain  6/23 10:47   0+00:05:11  R  0   2.5  fib 6
     9.7   thain  6/23 10:47   0+00:05:09  R  0   2.5  fib 7
     ...
    21 jobs; 2 idle, 19 running, 0 held
› In an ID such as 9.3, the first part (9) is the cluster number and the second part (3) is the process number.

27 www.cs.wisc.edu/condor Where Are They Running?
    % condor_q -run
    -- Submitter: axpbo8.bo.infn.it : :
     ID    OWNER  SUBMITTED    RUN_TIME    HOST(S)
     9.47  thain  6/23 10:47   0+00:07:03  ax4bbt.bo.infn.it
     9.48  thain  6/23 10:47   0+00:06:51  pewobo1.bo.infn.it
     9.49  thain  6/23 10:47   0+00:06:30  osde01.pd.infn.it
› HOST(S): Current location.

28 www.cs.wisc.edu/condor Help! I’m buried in Email! › By default, Condor sends one email for each completed process. › To cut this down, add one of these to your submit file: • notification = error • notification = never › To send it to someone else: • notify_user = mazzanti@bo.infn.it

29 www.cs.wisc.edu/condor Removing Processes › Remove one process: • condor_rm 9.47 › Remove a whole cluster: • condor_rm 9 › Remove everything! • condor_rm -a

30 www.cs.wisc.edu/condor Getting Feedback

31 www.cs.wisc.edu/condor What have I done? › The user log file (fib.log) shows a chronological list of everything important that happened to a job.
    001 (007.035.000) 06/21 17:03:44 Job executing on host:
    004 (007.035.000) 06/21 17:04:58 Job was evicted.
    009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
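
Because each event in the user log starts with a fixed-shape header, the file is easy to post-process. The sketch below handles only the line shape shown in the three sample lines above; the regular expression and the `parse_events` helper are assumptions made for illustration, and a real log file may contain other lines, which this parser simply skips.

```python
import re

# Matches lines like: 001 (007.035.000) 06/21 17:03:44 Job executing on host:
EVENT = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)

SAMPLE = """\
001 (007.035.000) 06/21 17:03:44 Job executing on host:
004 (007.035.000) 06/21 17:04:58 Job was evicted.
009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
"""

def parse_events(log_text):
    """Yield (job_id, time, text) for each line that looks like an event header."""
    for line in log_text.splitlines():
        m = EVENT.match(line)
        if m:
            # 007.035 in the log corresponds to job ID 7.35 in condor_q
            job_id = f"{int(m['cluster'])}.{int(m['proc'])}"
            yield job_id, m["time"], m["text"]

events = list(parse_events(SAMPLE))
for job_id, when, text in events:
    print(job_id, when, text)
```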

32 www.cs.wisc.edu/condor What have I done?
    % condor_history
     ID    OWNER  SUBMITTED    CPU_USAGE   ST COMPLETED    CMD
     9.3   thain  6/23 10:47   0+00:00:00  C  6/23 10:58   fib 3
     9.40  thain  6/23 10:47   0+00:00:24  C  6/23 10:59   fib 40
     9.10  thain  6/23 10:47   0+00:00:00  C  6/23 11:01   fib 10
     9.47  thain  6/23 10:47   0+00:05:45  C  6/23 11:01   fib 47
     9.7   thain  6/23 10:47   0+00:00:00  C  6/23 11:01   fib 7

33 www.cs.wisc.edu/condor Brief I/O Summary
    % condor_q -io
    -- Schedd: c01.cs.wisc.edu :
     ID      OWNER  READ      WRITE     SEEK  XPUT      BUFSIZE   BLKSIZE
     756.15  joe    244.9 KB  379.8 KB    71  1.3 KB/s  512.0 KB  32.0 KB
     758.24  joe    198.8 KB  219.5 KB    78  45.0 B/s  512.0 KB  32.0 KB
     758.26  joe     44.7 KB   22.1 KB  2727  13.0 B/s  512.0 KB  32.0 KB
    3 jobs; 0 idle, 3 running, 0 held

34 www.cs.wisc.edu/condor Complete I/O Summary in Email (Only since Condor Version 6.1.11)
    Your condor job "/usr/joe/records.remote input output" exited with status 0.
    Total I/O:
        104.2 KB/s effective throughput
        5 files opened
        104 reads totaling 411.0 KB
        316 writes totaling 1.2 MB
        102 seeks
    I/O by File:
        buffered file /usr/joe/input
            opened 2 times
            100 reads totaling 398.6 KB
            311 write totaling 1.2 MB
            101 seeks

35 www.cs.wisc.edu/condor Complete I/O Summary in Email › The summary helps identify performance problems. Even advanced users don't know exactly how their programs and libraries operate.

36 www.cs.wisc.edu/condor Complete I/O Summary in Email › Example: • CMSSIM - physics analysis program. • “Why is this job so slow?” • Data summary: read 250 MB from 20 MB file. • Very high SEEK total -> random access. • Solution: Increase buffer to 20 MB.

37 www.cs.wisc.edu/condor Who Uses Condor?
    % condor_q -global
    -- Schedd: to02xd.to.infn.it :
     ID     OWNER     SUBMITTED    RUN_TIME     ST PRI SIZE  CMD
     127.0  garzelli  6/21 18:45    1+14:18:16  R  0   17.2  tosti2trisdn
    -- Schedd: quark.ts.infn.it :
     ID     OWNER     SUBMITTED    RUN_TIME     ST PRI SIZE  CMD
     600.0  dellaric  4/10 14:57   55+09:20:31  R  0    9.1  john p2.dat
     665.0  dellaric  6/2  11:14   20+03:27:30  R  0    9.2  john p1.dat
     788.0  pamela    6/20 09:27    3+04:41:43  R  0   15.4  montepamela

38 www.cs.wisc.edu/condor Who uses Condor?
    % condor_status -submitters
    Name                 Machine     Running  IdleJobs  MaxJobsRunning
    rebuzzin@pv.infn.it  decux1.pv.       22        34             200
    pamela@ts.infn.it    quark.ts.i        6         1             200
    giunti@to.infn.it    to05xd.to.       21        49             200
    ...
                         RunningJobs  IdleJobs
    cattaneo@pv.infn.it            0         1
    pamela@ts.infn.it              6         1
    rebuzzin@pv.infn.it           22        34
    Total                         59        86

39 www.cs.wisc.edu/condor Who Uses Condor?
    % condor_userprio
    Last Priority Update:  6/23 16:27
                                   Effective
    User Name                      Priority
    ------------------------------ ---------
    meucci@pv.infn.it                   0.50
    longof@ts.infn.it                   0.50
    thain@bo.infn.it                    0.50
    dellaric@ts.infn.it                 2.00
    clueoff@pd.infn.it                  3.00
    pamela@ts.infn.it                   5.81
    rebuzzin@pv.infn.it                18.18
    giunti@to.infn.it                  19.72
    ------------------------------ ---------
    Number of users shown: 8

40 www.cs.wisc.edu/condor Who Uses Condor? › The user priority is computed by Condor to estimate how much of the pool’s CPU resources have been used by each submitter. › Lighter users receive a lower priority value, and a lower value is better: they will be allocated CPUs before heavy users. › Users consuming the same amount of CPU will be allocated an equal amount.
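
The ordering rule can be illustrated with the priority values from the condor_userprio listing. The sketch below is not Condor's allocation algorithm. In particular, `fair_shares` assumes, purely as a simplification, that a user's share of the pool is proportional to 1/priority; the slide only states that lighter users go first and that equal usage means equal allocation.

```python
# Effective priorities copied from the condor_userprio output in the talk.
priorities = {
    "meucci@pv.infn.it": 0.50,
    "longof@ts.infn.it": 0.50,
    "thain@bo.infn.it": 0.50,
    "dellaric@ts.infn.it": 2.00,
    "clueoff@pd.infn.it": 3.00,
    "pamela@ts.infn.it": 5.81,
    "rebuzzin@pv.infn.it": 18.18,
    "giunti@to.infn.it": 19.72,
}

def service_order(prios):
    """Users sorted so that the lowest (best) priority value comes first."""
    return sorted(prios, key=lambda user: (prios[user], user))

def fair_shares(prios):
    """ASSUMPTION: share of the pool is proportional to 1/priority."""
    weights = {user: 1.0 / value for user, value in prios.items()}
    total = sum(weights.values())
    return {user: weight / total for user, weight in weights.items()}

order = service_order(priorities)
shares = fair_shares(priorities)
print(order[-1])  # giunti@to.infn.it, the heaviest user, is served last
```

Under this assumption the three users at 0.50 receive identical shares, which matches the slide's statement about equal consumers.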

41 www.cs.wisc.edu/condor Measuring Goodput › Goodput is the amount of time a workstation spends making forward progress on work assigned by Condor. › This is a big topic all by itself: http://www.cs.wisc.edu/condor/goodput

42 www.cs.wisc.edu/condor Measuring Goodput
    % condor_q -goodput
    -- Submitter: coral.cs.wisc.edu : : coral.cs.wisc.edu
     ID      OWNER  SUBMITTED    RUN_TIME    GOODPUT  CPU_UTIL  Mb/s
     719.74  thain  6/23 07:35   2+20:47:59   100.0%     87.6%  0.00
     719.75  thain  6/23 07:35   2+20:38:45    40.5%     99.8%  0.00
     719.76  thain  6/23 07:35   2+20:38:16    96.9%     98.7%  0.00
     719.77  thain  6/23 07:35   2+21:10:06   100.0%     99.8%  0.00
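
To get a feel for what a goodput figure means in hours, the RUN_TIME and GOODPUT columns can be combined. This sketch assumes that GOODPUT is the fraction of RUN_TIME spent on work that was kept; the slides do not define the column precisely, so treat the result as an estimate. Both helper names are hypothetical.

```python
def parse_run_time(text):
    """Convert Condor's 'days+HH:MM:SS' run-time format to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds

def useful_seconds(run_time, goodput_percent):
    # ASSUMPTION: GOODPUT is the share of RUN_TIME spent on work that was kept.
    return parse_run_time(run_time) * goodput_percent / 100.0

total = parse_run_time("2+20:38:45")        # job 719.75 in the listing
kept = useful_seconds("2+20:38:45", 40.5)
print(total)                 # run time in seconds
print(round(kept / 3600, 1)) # estimated hours of useful work
```

Read this way, job 719.75 spent more than half of its almost three days of run time on work that was later lost, which is the kind of problem the goodput column is meant to expose.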

43 www.cs.wisc.edu/condor Setting Requirements with ClassAds

44 www.cs.wisc.edu/condor Setting Requirements › We believe that Condor must allow both users (jobs) and owners (machines) to set requirements. › This is an absolute necessity in order to convince people to participate in the community.

45 www.cs.wisc.edu/condor ClassAds › ClassAds are a simple language for describing both the properties and the requirements of jobs and machines. › Condor stores nearly everything in ClassAds: use the -l option to condor_q and condor_status to get the full details.

46 www.cs.wisc.edu/condor ClassAd for a Machine
    % condor_status -l axpbo8
    MyType = "Machine"
    TargetType = "Job"
    Name = "axpbo8.bo.infn.it"
    START = TRUE
    VirtualMemory = 342696
    Disk = 28728536
    Memory = 160
    Cpus = 1
    Arch = "ALPHA"
    OpSys = "OSF1"

47 www.cs.wisc.edu/condor ClassAd for a Job
    % condor_q -l 9.49
    MyType = "Job"
    TargetType = "Machine"
    Owner = "thain"
    Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
    Out = "fib.out.49"
    Args = "49"
    ImageSize = 2544
    DiskUsage = 2544
    Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)

48 www.cs.wisc.edu/condor Default Requirements › By default, Condor assumes the requirements for your job are: “I need a machine with…” • The same operating system and architecture as my workstation. • Enough disk to store the program. • Enough virtual memory to run the program.

49 www.cs.wisc.edu/condor Default Requirements › Expressed in ClassAds as:
    Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
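
The default expression can be checked by hand against the two ads shown earlier. The Python below is only a rendering of that one expression, not a ClassAd evaluator. The attribute values are copied from the machine ad for axpbo8 and the job ad for 9.49; the function name and the `linux_box` variant are hypothetical.

```python
# Values copied from the machine ad (axpbo8) and job ad (9.49) in the talk.
machine_ad = {"Arch": "ALPHA", "OpSys": "OSF1",
              "Disk": 28728536, "VirtualMemory": 342696, "Memory": 160}
job_ad = {"DiskUsage": 2544, "ImageSize": 2544}

def default_requirements(machine, job, arch="ALPHA", opsys="OSF1"):
    """Python rendering of the default Requirements expression."""
    return (
        machine["Arch"] == arch
        and machine["OpSys"] == opsys
        and machine["Disk"] >= job["DiskUsage"]
        and machine["VirtualMemory"] >= job["ImageSize"]
    )

print(default_requirements(machine_ad, job_ad))  # True

# A hypothetical Intel/Linux machine with the same resources fails the test.
linux_box = dict(machine_ad, Arch="INTEL", OpSys="LINUX")
print(default_requirements(linux_box, job_ad))   # False
```

Note that the right-hand sides DiskUsage and ImageSize come from the job's own ad, while Arch, OpSys, Disk, and VirtualMemory come from the candidate machine.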

50 www.cs.wisc.edu/condor ClassAd Requirements › Similar to C/C++/Java expressions: • Symbols: Arch, OpSys, Memory, Mips • Values: 15, 6.5, "LINUX" • Operators: ==, <, >, <=, >=, &&, ||, ( )

51 www.cs.wisc.edu/condor Adding Requirements › In the submit file, add a line beginning with “requirements =”
    Executable   = fib
    Arguments    = 40
    Output       = fib.out
    Log          = fib.log
    Requirements = (Memory > 64)
    queue

52 www.cs.wisc.edu/condor Example Requirements › (Memory > 64) › (Machine == "axpbo3.bo.infn.it") › (Mips > 100) || (Kflops > 10000) › (Subnet != "131.154.10") && (Disk > 20000000)

53 www.cs.wisc.edu/condor Are the Requirements Reasonable? › Two ways to find out: • Before running, use condor_status to list all machines matching certain requirements. • While running, use condor_q -analyze to see if a match is possible.

54 www.cs.wisc.edu/condor Are the Requirements Reasonable?
    % condor_status -constraint '(Memory>640)'
        → Only axpd30.
    % condor_status -constraint '(Memory>512)'
        → Five machines: ax4mcs, axpd30, axppv3, axzds0, and stonehenge.

55 www.cs.wisc.edu/condor Are the Requirements Reasonable? › Suppose that I submit a job like this:
    Executable   = fib
    Arguments    = 40
    Output       = fib.out
    Requirements = ( Mips > 5000 )
    queue
› “My job isn’t running – Why?”

56 www.cs.wisc.edu/condor Are the Requirements Reasonable?
    % condor_q -analyze
    WARNING: Be advised: No resources matched request's constraints
    Check the Requirements expression below:
    Requirements = ((Mips > 5000)) && (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)

57 www.cs.wisc.edu/condor Preferences › Condor assumes that any machines that match your requirements are suitable. › However, you may prefer some machines over others. (100 Mips is better than 10) › To indicate a preference, you may provide a ClassAd expression which ranks all matches.

58 www.cs.wisc.edu/condor Rank › The rank expression is evaluated into a number for every potential matching machine. › A machine with a higher number will be preferred over a machine with a lower number.

59 www.cs.wisc.edu/condor Rank Examples › Prefer machines with more Mips: Rank = Mips › Prefer more memory, but add 100 to the rank if the machine is Solaris 2.7: Rank = Memory + 100*(OpSys=="SOLARIS27") › Prefer machines with a high ratio of memory to cpu performance: Rank = Memory/Mips › Prefer machines that will checkpoint in Bologna: Rank = (CkptServer=="ckpt.bo.infn.it")
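
Ranking amounts to sorting the machines that already satisfy the requirements. The sketch below applies two of the Rank expressions from this slide to three made-up machine ads. Only the attribute names come from the talk; the machines and helper functions are hypothetical, and Python's treatment of a true comparison as 1 stands in for the ClassAd behaviour the slide relies on.

```python
# Hypothetical machine ads; only the attribute names come from the slides.
machines = [
    {"Name": "a", "Memory": 256, "Mips": 100, "OpSys": "OSF1"},
    {"Name": "b", "Memory": 128, "Mips": 200, "OpSys": "SOLARIS27"},
    {"Name": "c", "Memory": 192, "Mips": 50, "OpSys": "LINUX"},
]

def rank_memory_plus_solaris(ad):
    # Rank = Memory + 100*(OpSys=="SOLARIS27"); a true comparison counts as 1
    return ad["Memory"] + 100 * (ad["OpSys"] == "SOLARIS27")

def best_first(ads, rank):
    """Order candidate machines from highest rank to lowest."""
    return sorted(ads, key=rank, reverse=True)

by_mips = [ad["Name"] for ad in best_first(machines, lambda ad: ad["Mips"])]
by_mem = [ad["Name"] for ad in best_first(machines, rank_memory_plus_solaris)]
print(by_mips)  # ['b', 'a', 'c']
print(by_mem)   # ['a', 'b', 'c']
```

Machine b gains 100 points for running Solaris 2.7 (128 + 100 = 228), which lifts it above c (192) but not above a (256).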

60 www.cs.wisc.edu/condor Use MORE Machines! › The Condor pool has several architectures: • 115 Alpha/OSF1 • 62 Intel/Linux • 11 Sun4u/Solaris › To get maximum throughput, you must use all that are available. Be greedy!

61 www.cs.wisc.edu/condor Compile for Each System. › Make an executable for each kind of system you wish to use. Give each a unique name. › On an Alpha/OSF1: • condor_compile gcc fib.c -o fib.ALPHA.OSF1 › On an Intel/Linux: • condor_compile gcc fib.c -o fib.INTEL.LINUX

62 www.cs.wisc.edu/condor Change the Submit File › Make the executable name a function of the machine selected:
    Executable = fib.$$Arch.$$Opsys
› Allow either ALPHA/OSF1 or INTEL/LINUX machines to be selected:
    Requirements = ( ((Arch=="ALPHA") && (OpSys=="OSF1")) || ((Arch=="INTEL") && (OpSys=="LINUX")) )
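
The effect of the `$$` references is that the executable name is filled in from the ad of whichever machine is matched. The Python helper below mimics that substitution for illustration. It is hypothetical, accepts both the bare spelling used on the slide and a parenthesised spelling such as `$$(Arch)`, and looks attributes up without regard to case so that `$$Opsys` finds `OpSys`.

```python
import re

def expand_executable(template, machine_ad):
    """Replace each $$Attr or $$(Attr) with the matched machine's value."""
    # Case-insensitive lookup, so $$Opsys finds the OpSys attribute
    lookup = {key.lower(): str(value) for key, value in machine_ad.items()}
    return re.sub(
        r"\$\$\(?(\w+)\)?",
        lambda m: lookup[m.group(1).lower()],
        template,
    )

alpha = {"Arch": "ALPHA", "OpSys": "OSF1"}
intel = {"Arch": "INTEL", "OpSys": "LINUX"}

print(expand_executable("fib.$$Arch.$$Opsys", alpha))  # fib.ALPHA.OSF1
print(expand_executable("fib.$$Arch.$$Opsys", intel))  # fib.INTEL.LINUX
```

The two results are exactly the file names produced by the per-system condor_compile commands, which is why each executable needs a name built from its architecture and operating system.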

63 www.cs.wisc.edu/condor Condor Will Decide at the Last Minute [Figure: "You" submit Fib.$$Arch.$$Opsys; it becomes Fib.ALPHA.OSF1 on an Alpha/OSF1 machine or Fib.INTEL.LINUX on an Intel/Linux machine.]

64 www.cs.wisc.edu/condor Standard or Vanilla?

65 www.cs.wisc.edu/condor Which Universe? › Each Condor universe provides different services to different kinds of programs: • Standard – Relinked UNIX programs • Vanilla – Unmodified UNIX programs • PVM • Scheduler (Not described here) • Globus

66 www.cs.wisc.edu/condor Which Universe? [Figure: two panels labelled VANILLA and STANDARD, each showing a cluster and a file server.]

67 www.cs.wisc.edu/condor Standard Universe › Submit a specially-linked UNIX application to the Condor system. › Advantages: • Checkpointing for fault tolerance. • Remote I/O services: friendly environment anywhere in the world; data buffering and staging; I/O performance feedback; user remapping of data sources.

68 www.cs.wisc.edu/condor Standard Universe › Disadvantages: • Must statically link with Condor library. • Limited class of applications: single-process UNIX binaries; certain system calls prohibited.

69 www.cs.wisc.edu/condor System Call Limitations › Standard universe does not allow: • Multiple processes: fork(), exec(), system() • Inter-process communication: semaphores, messages, shared memory • Complex I/O: mmap(), select(), poll(), non-blocking I/O, … • Kernel-level threads (User level threads are OK.)

70 www.cs.wisc.edu/condor System Call Limitations › Too restrictive? • Use the vanilla universe.

71 www.cs.wisc.edu/condor Vanilla Universe › Submit any sort of UNIX program to the Condor system. › Advantages: • No relinking required. • Any program at all, including binaries, shell scripts, interpreted programs (java, perl), and multiple processes.

72 www.cs.wisc.edu/condor Vanilla Universe › Disadvantages: • No checkpointing. • Very limited remote I/O services: specify input files explicitly; specify output files explicitly. • Condor will refuse to start a vanilla job on a machine that is unfriendly. ClassAds: FilesystemDomain and UIDDomain

73 www.cs.wisc.edu/condor Which Universe? › Standard: • Good for mixed Condor pools, flocked pools, and the Grid at large. › Vanilla: • Good for a Condor pool of identical machines.

74 www.cs.wisc.edu/condor Conclusion

75 www.cs.wisc.edu/condor Conclusion › Condor expands your reach to many CPUs – even those you cannot log in to. › Condor makes it easy to run and manage large numbers of jobs. › Good candidates for the standard universe are single-process CPU-bound jobs with simple I/O. › Too restrictive? Use the vanilla universe, but expect fewer available machines.

76 www.cs.wisc.edu/condor Conclusion › Need more info? › Douglas Thain (thain@cs.wisc.edu) › INFN CCL (ccl@bo.infn.it) › Condor Web Page (http://www.cs.wisc.edu/condor) › This talk: (http://www.cs.wisc.edu/~thain)

