
1 Condor Project Computer Sciences Department University of Wisconsin-Madison condor-admin@cs.wisc.edu http://www.cs.wisc.edu/condor A Scientist’s Introduction to Condor Condor Week, Milan, 2006

2 2 http://www.cs.wisc.edu/condor Meet Frieda. Frieda is a scientist. She has a big problem (computationally speaking).

3 3 http://www.cs.wisc.edu/condor Frieda’s Problem, a Parameter Sweep: find F(x,y,z) for 20 values of x, 10 values of y, and 3 values of z. 20 x 10 x 3 = 600 combinations!

4 4 http://www.cs.wisc.edu/condor F takes about 6 hours to compute on a typical workstation: 600 runs x 6 hours = 3600 hours. F requires a moderate amount of memory: 256 MBytes. F performs a moderate amount of I/O: (x,y,z) is 5 MBytes; F(x,y,z) is 50 MBytes.

5 5 http://www.cs.wisc.edu/condor I have 600 simulations to run. Where can I get help?

6 6 http://www.cs.wisc.edu/condor Frieda needs a batch processing system  execution of a series of jobs  run without human interaction

7 7 http://www.cs.wisc.edu/condor Examples of batch processing systems  PBS (Portable Batch System)  LSF (Load Sharing Facility)  Sun Grid Engine  Condor

8 8 http://www.cs.wisc.edu/condor Condor’s strengths  Cycle scavenging works!  High throughput computing  Very configurable, adaptable  Supports strong security methods  Interoperates with many types of computing Grids

9 9 http://www.cs.wisc.edu/condor A very simple installation of Condor will...  Watch jobs and notify you of their progress  Implement your desired ordering of jobs  Log your job’s activities  Add fault tolerance to your jobs  Implement your policy on when the jobs can run on your workstation

10 10 http://www.cs.wisc.edu/condor With Condor running, Frieda’s 600 simulations are 600 Condor jobs, described by 1 file, submitted with 1 command

11 11 http://www.cs.wisc.edu/condor a Condor pool One or more machines running Condor Each machine sets its own policy for when to run jobs

12 12 http://www.cs.wisc.edu/condor The Magic of Matchmaking Jobs and machines state their requirements and preferences Condor matches jobs with machines based on requirements and preferences

13 13 http://www.cs.wisc.edu/condor Job Jobs state their requirements and preferences: I need a Linux/x86 platform I want the machine with the most memory I prefer a machine in the chemistry department

14 14 http://www.cs.wisc.edu/condor Machine Machines state their requirements and preferences: Run jobs only when there is no keyboard activity I prefer to run Frieda’s jobs I am a machine in the physics department Never run jobs belonging to Dr. Smith

15 15 http://www.cs.wisc.edu/condor Getting Started: Submitting Jobs to Condor 1. Choose a universe for the job 2. Make the job batch-ready 3. Create a submit description file 4. Run condor_submit to put the job in the queue

16 16 http://www.cs.wisc.edu/condor 1. Choose a Universe  Controls how Condor handles jobs  some universes: vanilla standard grid java mpi scheduler

17 17 http://www.cs.wisc.edu/condor Using the Vanilla Universe The vanilla universe:  For “normal” jobs  serial execution  Like vanilla ice cream, can be used in (almost) all situations

18 18 http://www.cs.wisc.edu/condor 2. Make the job batch-ready Must run in the background:  no interactive input  no windows  no GUI

19 19 http://www.cs.wisc.edu/condor 2. Make the job batch-ready  May use STDIN, STDOUT, and STDERR (the input and output), but these are files (not devices)  Similar to Unix redirection: $ ./myprogram <input.txt >output.txt

20 20 http://www.cs.wisc.edu/condor 3. Create a Submit Description File  A plain ASCII text file  File name extensions are irrelevant  Tells Condor about the job  Can describe many jobs at once, each with different settings (e.g. different input)

21 21 http://www.cs.wisc.edu/condor About the Job Items that may appear in the submit description file describe:  executable  input and output files  command-line arguments  environment variables  requirements  preferences (called rank)  …

22 22 http://www.cs.wisc.edu/condor Simple Submit Description File
# file name is sim.submit
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but file names are!
universe   = vanilla
executable = sim.exe
output     = output.txt
queue

23 23 http://www.cs.wisc.edu/condor 4. Run condor_submit  Give condor_submit the name of the submit description file: condor_submit sim.submit  condor_submit then  checks the submit description file for errors  creates a ClassAd that describes the job or jobs  places the job or jobs in the queue

24 24 http://www.cs.wisc.edu/condor ClassAds  Condor’s internal data representation Similar to classified advertisements  Each ClassAd may have many attributes  Represents an object and its attributes WANTED: Dog, must be brown. Willing to pay $15.

25 25 http://www.cs.wisc.edu/condor ClassAds  ClassAds state facts  The job’s executable is analysis.exe  The machine’s load average is 5.6  ClassAds state requirements  I require a machine with Linux  And preferences  This machine prefers to run jobs from the physics group

26 26 http://www.cs.wisc.edu/condor ClassAds ClassAds are: semi-structured, user-extensible, and written as Attribute = Expression, where an expression may be a String, Number, or Boolean. Example:
MyType = "Job"
TargetType = "Machine"
ClusterId = 1377
Owner = "roy"
Cmd = "sim.exe"
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
…

27 27 http://www.cs.wisc.edu/condor The Job Queue  condor_submit sends the job’s ClassAd(s) to the queue  Jobs enter the queue  with an atomic operation: a two-phase commit transaction  well defined; no partial failure!  View the queue with condor_q

28 28 http://www.cs.wisc.edu/condor Example condor_submit and condor_q
% condor_submit sim.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   6/16 06:52   0+00:00:00 I  0   0.0  sim.exe
1 jobs; 1 idle, 0 running, 0 held
%

29 29 http://www.cs.wisc.edu/condor Inspect the full ClassAd
% condor_q -l
-- Submitter: perdita.cs.wisc.edu : :
MyType = "Job"
TargetType = "Machine"
ClusterId = 1
QDate = 1150921369
CompletionDate = 0
Owner = "frieda"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
…

30 30 http://www.cs.wisc.edu/condor Input, Output, and Error Files
universe   = vanilla
executable = sim.exe
input      = input.txt    (where standard input comes from)
output     = output.txt   (where standard output goes to)
error      = error.txt    (where standard error goes to)
log        = sim.log
queue

31 31 http://www.cs.wisc.edu/condor Feedback about jobs  Condor sends e-mail about events to the person that submits the job  An entry in the submit description file specifies when: Notification = Never / Error / Always / Complete (Complete is the default)

32 32 http://www.cs.wisc.edu/condor Feedback about jobs  Create a log of job events  Add to submit description file: log = sim.log  Becomes the Life Story of a Job

33 33 http://www.cs.wisc.edu/condor Sample Condor User Log
000 (0001.000.000) 05/25 19:10:03 Job submitted from host:...
001 (0001.000.000) 05/25 19:12:17 Job executing on host:...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)...

34 34 http://www.cs.wisc.edu/condor Job Numbering  Cluster (simplified) monotonically increasing integer ID for each new submission to the queue  Process monotonically increasing integer, for each individual job within a cluster; starts at 0

35 35 http://www.cs.wisc.edu/condor Job Numbering Each cluster.process pair is called a job ID, e.g. 2.6 or 3.0.

36 36 http://www.cs.wisc.edu/condor Another Submit Description File for Frieda
# Example for one cluster with 2 processes
Universe   = vanilla
Executable = analyze
Input      = a1.in
Output     = a1.out
Error      = a1.err
Log        = a1.log
Queue
Input      = a2.in
Output     = a2.out
Error      = a2.err
Log        = a2.log
Queue

37 37 http://www.cs.wisc.edu/condor Frieda’s Jobs, So Far
% condor_submit a.submit
Submitting job(s).
2 job(s) submitted to cluster 2.
% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   4/15 06:52   0+00:02:11 R  0   0.0  sim.exe
 2.0     frieda   4/15 06:56   0+00:00:00 I  0   0.0  analyze
 2.1     frieda   4/15 06:56   0+00:00:00 I  0   0.0  analyze
3 jobs; 2 idle, 1 running, 0 held
%

38 38 http://www.cs.wisc.edu/condor Back to Frieda’s 600 jobs…  Place all input, output, error and log files in one directory  One file of each type for each job  4 files x 600 jobs = 2400 files  Alternative: Create a subdirectory for each job

39 39 http://www.cs.wisc.edu/condor Frieda’s simulation directory
sim.exe
sim.submit
run_0/    input.txt  output.txt  error.txt  sim.log
run_1/    input.txt  output.txt  error.txt  sim.log
...
run_599/  input.txt  output.txt  error.txt  sim.log
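
One possible way to create that layout ahead of time is a small shell loop; this is only a sketch, and the make_input helper that produces each job's input file is hypothetical:
# Create run_0 ... run_599 and stage each job's input file
for i in $(seq 0 599); do
    mkdir -p run_$i
    ./make_input $i > run_$i/input.txt    # make_input is a placeholder for your own input generator
done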

40 40 http://www.cs.wisc.edu/condor Frieda’s 600 Jobs
Universe   = vanilla
Executable = sim.exe
Input      = input.txt
Output     = output.txt
Error      = error.txt
Log        = sim.log
InitialDir = run_0
Queue
InitialDir = run_1
Queue
...
InitialDir = run_599
Queue
(600 repetitions of InitialDir / Queue)

41 41 http://www.cs.wisc.edu/condor Submit Description file is Too Big! > 1200 lines

42 42 http://www.cs.wisc.edu/condor Use a Substitution Macro Syntax: $(MacroName) The ClassAd attribute is created with the value of MacroName substituted. We will use $(Process) to create a compact template that is run 600 times.

43 43 http://www.cs.wisc.edu/condor Frieda’s 600 Jobs
Universe   = vanilla
Executable = sim.exe
Input      = input.txt
Output     = output.txt
Error      = error.txt
Log        = sim.log
InitialDir = run_0
Queue
InitialDir = run_1
Queue
...
InitialDir = run_599
Queue

44 44 http://www.cs.wisc.edu/condor Frieda’s 600 Jobs
Universe   = vanilla
Executable = sim.exe
Input      = input.txt
Output     = output.txt
Error      = error.txt
Log        = sim.log
InitialDir = run_$(Process)
Queue 600

45 45 http://www.cs.wisc.edu/condor Frieda submits the 600 …
$ condor_submit sim.submit
Submitting job(s)
................................................................
Logging submit event(s)
................................................................
600 job(s) submitted to cluster 3.

46 46 http://www.cs.wisc.edu/condor And, check the queue
% condor_q
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 3.0     frieda   4/20 12:08   0+00:00:05 R  0   9.8  sim.exe
 3.1     frieda   4/20 12:08   0+00:00:03 I  0   9.8  sim.exe
 3.2     frieda   4/20 12:08   0+00:00:01 I  0   9.8  sim.exe
 3.3     frieda   4/20 12:08   0+00:00:00 I  0   9.8  sim.exe
...
 3.598   frieda   4/20 12:08   0+00:00:00 I  0   9.8  sim.exe
 3.599   frieda   4/20 12:08   0+00:00:00 I  0   9.8  sim.exe
600 jobs; 599 idle, 1 running, 0 held

47 47 http://www.cs.wisc.edu/condor Command-line Arguments  For a command like this:
% sim.exe 26 100
place these lines in the submit description file:
Executable = sim.exe
Arguments  = 26 100
 Or, make use of the substitution macro:
Arguments  = $(Process) 100

48 48 http://www.cs.wisc.edu/condor Removing jobs  condor_rm removes a job or set of jobs from the queue  You may only remove your own jobs  Queue super-users (e.g. root on Unix or administrator on Windows) may remove any jobs

49 49 http://www.cs.wisc.edu/condor Removing jobs  Specify job IDs: condor_rm 4.1 (removes job ID cluster 4, process 1) condor_rm 4 (removes all cluster 4 jobs)  Remove all of your own jobs with the -a option condor_rm -a

50 50 http://www.cs.wisc.edu/condor Installation  Start small

51 51 http://www.cs.wisc.edu/condor Frieda starts with Personal Condor  Condor runs on Frieda’s workstation  No root / administrator access required  No system administrator needed  After installation, Frieda submits her jobs to her Personal Condor…

52 52 http://www.cs.wisc.edu/condor Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?

53 53 http://www.cs.wisc.edu/condor A Personal Condor will...  Keep an eye on your jobs and inform you of their progress  Implement your policy on the execution order of the jobs  Keep a log of your job activities  Add fault tolerance to your jobs  Implement your policy on when the jobs can run on your workstation

54 54 http://www.cs.wisc.edu/condor Getting Condor  Available as a free download from http://www.cs.wisc.edu/condor  Download Condor for your platform (operating system and architecture)  Available for most Unix platforms (including Linux and Apple’s OS X)  Available for Windows NT / XP

55 55 http://www.cs.wisc.edu/condor Condor Version Numbering  Version numbering scheme similar to that of the (pre 2.6) Linux kernels  Stable releases, mostly bug fixes  Numbering: major.minor.release, minor value is even  Current stable release: 6.6.11  Developer releases, with new features, but may have some bugs  Numbering: major.minor.release, minor value is odd  Current developer release: 6.7.20

56 56 http://www.cs.wisc.edu/condor Personal Condor Nice, but not enough! Frieda’s 600, 6-hour jobs take 150 days ! (If jobs run 24 hours per day.)

57 57 http://www.cs.wisc.edu/condor Good News Frieda has friends (with machines): Fred, Ford, Fiona, and Francine.

58 58 http://www.cs.wisc.edu/condor Frieda’s friends
Fred: only run jobs at night
Fiona: never run Ford’s jobs
Ford: never run Fiona’s jobs
Francine

59 59 http://www.cs.wisc.edu/condor Frieda’s Condor pool  Install Condor on each friend’s machine to form a pool  Configure each machine to implement individual policies defining when the machine may run jobs
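
As a rough illustration (not the friends' exact settings), such per-machine policies are written as START expressions in each machine's local Condor configuration; the attribute names below are standard, but the thresholds are assumptions:
# Fred's machine: start jobs only when the keyboard has been idle for
# 15 minutes and it is night (ClockMin is minutes since midnight)
START = (KeyboardIdle > 15 * $(MINUTE)) && ((ClockMin < 480) || (ClockMin >= 1080))

# Fiona's machine: never run Ford's jobs
START = (Owner != "ford")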

60 60 http://www.cs.wisc.edu/condor Frieda’s jobs finish faster!  Frieda’s friends also submit their jobs to Frieda’s pool  Everyone is happy-er

61 61 http://www.cs.wisc.edu/condor condor_status gives information about the pool:
% condor_status
Name          OpSys  Arch  State     Activ LoadAv Mem  ActvtyTime
perdita.cs.wi LINUX  INTEL Owner     Idle  0.020  511  0+02:28:42
coral.cs.wisc LINUX  INTEL Claimed   Busy  0.990  511  0+01:27:21
doc.cs.wisc.e LINUX  INTEL Unclaimed Idle  0.260  511  0+00:20:04
dsonokwa.cs.w LINUX  INTEL Claimed   Busy  0.810  511  0+00:01:45
ferdinand.cs. LINUX  INTEL Claimed   Suspe 1.130  511  0+00:00:55
To inspect full ClassAds: condor_status -l

62 62 http://www.cs.wisc.edu/condor Frieda’s new question: How can my jobs access their data files?

63 63 http://www.cs.wisc.edu/condor Access to Data in Condor  Use shared file system as available  No shared file system?  Condor can transfer files  Automatically send back changed files or you specify which files to send  Can be encrypted over the wire  Remote I/O Socket  Condor’s “Standard universe” can use remote system calls

64 64 http://www.cs.wisc.edu/condor Condor File Transfer In the submit description file:
should_transfer_files   = YES | NO | IF_NEEDED
when_to_transfer_output = ON_EXIT | ON_EXIT_OR_EVICT
Note: IF_NEEDED simply compares the configured FileSystemDomain on both computers

65 65 http://www.cs.wisc.edu/condor Example Submit File
Universe   = vanilla
Executable = sim.exe
Input      = input.txt
Output     = output.txt
Error      = error.txt
Log        = sim.log
Initialdir = run_$(Process)
ShouldTransferFiles   = Yes
WhenToTransferOutput  = ON_EXIT
Transfer_input_files  = /data/common.dat,job.dat
Transfer_output_files = TheAnswer.dat
Queue 600

66 66 http://www.cs.wisc.edu/condor Remote I/O Socket  Access files on submit machine while the job is running.  Syntax in submit file: +WantIOProxy=True  Libraries provided for Java and for C Java: FileInputStream -> ChirpInputStream C: open() -> chirp_open()

67 67 http://www.cs.wisc.edu/condor Remote System Calls in the Standard Universe  I/O system calls are trapped and sent back to the submit machine  No source code changes typically required  Programming language independent  Opportunities for application steering  Example: Condor tells the job how to open files: redirect to a different path, use larger buffers, use shared filesystem when available, etc.  More on standard universe next.

68 68 http://www.cs.wisc.edu/condor Frieda is happy; Fred is not happy  Fred’s jobs run for a long time  Before a job finishes, the machine becomes unavailable  The job goes back into the queue, and starts over again from the beginning

69 69 http://www.cs.wisc.edu/condor Why Fred’s jobs do not finish Preemption: Condor’s decision to stop a currently running job Why? 1.The machine’s policy together with the machine’s state lead to this decision 2.Another job or another user’s job is prioritized higher, and should be run instead of Fred’s job

70 70 http://www.cs.wisc.edu/condor What / How? To preempt a job, Condor first vacates it (the nice way); if the job’s shutdown takes too long, Condor kills it (the nasty way).

71 71 http://www.cs.wisc.edu/condor Condor’s standard universe rescues Fred!  Supports transparent process checkpoint and restart Fred’s jobs can resume where they left off.

72 72 http://www.cs.wisc.edu/condor Checkpoints checkpoint: the entire in-memory state of a program, saved in a file  CPU registers, memory image, I/O

73 73 http://www.cs.wisc.edu/condor Checkpoints (Diagram: a job’s timeline with checkpoints taken at points 1, 2, and 3.)

74 74 http://www.cs.wisc.edu/condor Checkpoints (Diagram: the job is killed after checkpoint 3 and can restart from checkpoint 3.)

75 75 http://www.cs.wisc.edu/condor Checkpoints (Diagram: the same timeline labeled goodput, badput, goodput: work preserved by a checkpoint is goodput; work lost between the last checkpoint and the kill is badput.)

76 76 http://www.cs.wisc.edu/condor Checkpoint details  Typically, no changes to the job’s source code are needed  The job must be relinked with Condor’s standard universe support library  I/O requests are handled with remote system calls (by default)  Checkpoint files are stored on the submit machine or optionally on dedicated checkpoint servers

77 77 http://www.cs.wisc.edu/condor Relinking the Job for Standard Universe To relink, place condor_compile in front of the command used to link the job:
% condor_compile gcc -o myjob myjob.c
- OR -
% condor_compile f77 -o myjob filea.f fileb.f
- OR -
% condor_compile make -f MyMakefile

78 78 http://www.cs.wisc.edu/condor Limitations of the Standard Universe  Condor’s checkpoints are not at the kernel level  Standard Universe jobs may not:  Fork()  Use kernel threads  Use some forms of IPC, such as pipes and shared memory  Use unsupported compilers (e.g. icc)  Use “clipped” platforms (e.g. Windows)  Must have access to object code to relink  Many typical scientific jobs are OK

79 79 http://www.cs.wisc.edu/condor When will Condor produce a checkpoint?  Periodically, if desired, for fault tolerance  When the job is preempted, either by a higher priority job, or because the execution machine becomes busy  When the user explicitly runs a condor_checkpoint, condor_vacate, condor_off or condor_restart command
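
For instance, checkpoints can be forced by hand from the command line; the hostname below is just a placeholder:
% condor_checkpoint bionic.cs.wisc.edu   # ask jobs running on that machine to write a checkpoint
% condor_vacate bionic.cs.wisc.edu       # checkpoint (if possible) and evict the jobs from that machine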

80 80 http://www.cs.wisc.edu/condor Alternatives to Standard Universe Checkpointing When standard universe is not an option  break work into smaller pieces  configure machines to suspend jobs instead of killing them  configure preemption to wait longer for jobs to finish  use 3rd-party checkpointing  write interruptible vanilla jobs

81 81 http://www.cs.wisc.edu/condor Interruptible Jobs  Job must save state and shut down when it receives SIGTERM.  If using file-transfers, save intermediate output: WhenToTransferOutput = ON_EXIT_OR_EVICT
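
A minimal sketch in C of such an interruptible job, assuming the job can represent its progress as a loop counter saved to a file named state.dat (the file name and the work loop are illustrative, not a Condor API):
/* Interruptible vanilla-universe job: saves its counter on SIGTERM,
   resumes from the saved counter on the next run. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigterm = 0;

static void handle_sigterm(int sig) { (void)sig; got_sigterm = 1; }

int main(void)
{
    long i = 0, n = 1000000;
    FILE *f = fopen("state.dat", "r");           /* resume if a state file exists */
    if (f) { fscanf(f, "%ld", &i); fclose(f); }

    signal(SIGTERM, handle_sigterm);

    for (; i < n; i++) {
        usleep(1000);                            /* stand-in for one unit of real work */
        if (got_sigterm) {                       /* Condor is evicting the job */
            f = fopen("state.dat", "w");
            if (f) { fprintf(f, "%ld\n", i); fclose(f); }
            exit(0);                             /* exit cleanly so intermediate output can be transferred (ON_EXIT_OR_EVICT) */
        }
    }
    remove("state.dat");                         /* finished: no restart file needed */
    return 0;
}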

82 82 http://www.cs.wisc.edu/condor We Always Want More  Condor is managing and running our jobs, but  Our CPU requirements are greater than our resources  Jobs are preempted more often than we like

83 83 http://www.cs.wisc.edu/condor Happy Day! Frieda’s organization purchased a Dedicated Cluster!  Frieda installs Condor on all the dedicated Cluster nodes  Frieda also adds a dedicated central manager  She configures the entire pool with this new host as the central manager…
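
A sketch of the key configuration change (the hostname is a placeholder): every machine in the pool points CONDOR_HOST at the new central manager.
# In each machine's condor_config file
CONDOR_HOST = new-cm.friedas-org.example.edu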

84 84 http://www.cs.wisc.edu/condor Condor Pool With the additional resources, Frieda and her co-workers can get their jobs completed even faster. (Diagram: 600 Condor jobs flowing into Frieda’s Condor pool, which now includes the dedicated cluster.)

85 85 http://www.cs.wisc.edu/condor New problem for Frieda: Some of the machines in the pool do not have enough memory or scratch disk space for my job!

86 86 http://www.cs.wisc.edu/condor Specify Requirements  An expression (syntax similar to C or Java)  Must evaluate to True for a match to be made
Universe     = vanilla
Executable   = b
Input        = b.input
Log          = b.log
InitialDir   = run_$(Process)
Requirements = Memory >= 256 && Disk > 10000
Queue 8

87 87 http://www.cs.wisc.edu/condor Or, Specify Rank  All matches which meet the requirements can be sorted by preference with a Rank expression.  The higher the rank, the better the match
Universe     = vanilla
Executable   = b
Log          = b.log
InitialDir   = run_$(Process)
Requirements = Memory >= 256 && Disk > 10000
Rank         = (KFLOPS*10000) + Memory
Queue 8

88 88 http://www.cs.wisc.edu/condor Now my jobs are not running… What’s wrong?

89 89 http://www.cs.wisc.edu/condor Check the queue
% condor_q
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 5.0     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 5.1     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 5.2     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 5.3     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 5.4     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 5.5     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 5.6     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 5.7     frieda   4/20 12:23   0+00:00:00 I  0   9.8  b
 6.0     frieda   4/20 13:22   0+00:00:00 H  0   9.8  my_job
9 jobs; 8 idle, 0 running, 1 held

90 90 http://www.cs.wisc.edu/condor Looking at held jobs  See why jobs are on hold:
% condor_q -held
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID      OWNER     HELD_SINCE  HOLD_REASON
 6.0     frieda    4/20 13:23  Error from starter on vm1@skywalker.cs.wisc
9 jobs; 8 idle, 0 running, 1 held
 See full details for a job:
% condor_q -l 6.0

91 91 http://www.cs.wisc.edu/condor Look in the Job Log Look in the job log for clues:
% cat b.log
000 (031.000.000) 04/20 14:47:31 Job submitted from host:...
007 (031.000.000) 04/20 15:02:00 Shadow exception!
    Error from starter on gig06.stat.wisc.edu:
    Failed to open '/scratch.1/frieda/workspace/v67/condor-test/test3/run_0/b.input' as standard input: No such file or directory (errno 2)
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job...

92 92 http://www.cs.wisc.edu/condor Check machine status Verify that there are idle machines with condor_status:
% condor_status
Name        OpSys  Arch  State    Activity LoadAv Mem  ActvtyTime
vm1@tonic.c LINUX  INTEL Claimed  Busy     0.000  501  0+00:00:20
vm2@tonic.c LINUX  INTEL Claimed  Busy     0.000  501  0+00:00:19
vm3@tonic.c LINUX  INTEL Claimed  Busy     0.040  501  0+00:00:17
vm4@tonic.c LINUX  INTEL Claimed  Busy     0.000  501  0+00:00:05
             Total Owner Claimed Unclaimed Matched Preempting
 INTEL/LINUX     4     0       4         0       0          0
       Total     4     0       4         0       0          0

93 93 http://www.cs.wisc.edu/condor Still not running? How long should I wait?  Wait at least one negotiation cycle  This can be a few minutes or more, depending on conditions  The schedd reuses matches to machines when possible, so following jobs run quickly

94 94 http://www.cs.wisc.edu/condor Look to condor_q for help:
% condor_q -analyze 29
--- 029.000: Run analysis summary. Of 1243 machines,
  1243 are rejected by your job's requirements
     0 are available to run your job
WARNING: Be advised: No resources matched request's constraints
Check the Requirements expression below:
Requirements = ((Memory > 8192)) && (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && (TARGET.FileSystemDomain == MY.FileSystemDomain)

95 95 http://www.cs.wisc.edu/condor Better, but slower: condor_q -better-analyze
% condor_q -better-analyze 29
The Requirements expression for your job is:
( ( target.Memory > 8192 ) ) && ( target.Arch == "INTEL" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )
    Condition                                     Machines Matched   Suggestion
    ---------                                     ----------------   ----------
1   ( ( target.Memory > 8192 ) )                                 0   MODIFY TO 4000
2   ( TARGET.FileSystemDomain == "cs.wisc.edu" )               584
3   ( target.Arch == "INTEL" )                                1078
4   ( target.OpSys == "LINUX" )                               1100
5   ( target.Disk >= 13 )                                     1243

96 96 http://www.cs.wisc.edu/condor Use condor_status to learn about resources:
% condor_status -constraint 'Memory > 8192'
(no output means no matches)
% condor_status -constraint 'Memory > 4096'
Name          OpSys  Arch   State     Activ LoadAv Mem   ActvtyTime
vm1@s0-03.cs. LINUX  X86_64 Unclaimed Idle  0.000  5980   1+05:35:05
vm2@s0-03.cs. LINUX  X86_64 Unclaimed Idle  0.000  5980  13+05:37:03
vm1@s0-04.cs. LINUX  X86_64 Unclaimed Idle  0.000  7988   1+06:00:05
vm2@s0-04.cs. LINUX  X86_64 Unclaimed Idle  0.000  7988  13+06:03:47
              Total Owner Claimed Unclaimed Matched Preempting
 X86_64/LINUX     4     0       0         4       0          0
        Total     4     0       0         4       0          0

97 97 http://www.cs.wisc.edu/condor Sometimes my job fails! Can I tell Condor to automatically run it again?

98 98 http://www.cs.wisc.edu/condor Job Policy Expressions  User can supply job policy expressions in the submit description file  Can be used to describe a successful run:
on_exit_remove  = <expression>
on_exit_hold    = <expression>
periodic_remove = <expression>
periodic_hold   = <expression>
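
For example, one possible periodic policy (the one-day threshold is arbitrary): remove any job that has been on hold for more than 24 hours. In the job ClassAd, JobStatus 5 means held and EnteredCurrentStatus records when the job entered that state.
periodic_remove = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 86400)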

99 99 http://www.cs.wisc.edu/condor Job Policy Example  Do not remove if exits with a signal: on_exit_remove = ExitBySignal == False

100 100 http://www.cs.wisc.edu/condor Job Policy Example  Place on hold if exits with nonzero status or ran for less than an hour: on_exit_hold = ( (ExitBySignal==False) && (ExitCode != 0) ) || ((CurrentTime - JobStartDate) < 3600)

101 101 http://www.cs.wisc.edu/condor What Condor Daemons are running on my machine, and what do they do?

102 102 http://www.cs.wisc.edu/condor Behind the Scenes  There is a fair amount of software running to make Condor work  The various pieces of the software are called daemons.  Condor daemons communicate with each other  Condor daemons are responsible for specific tasks

103 103 http://www.cs.wisc.edu/condor Condor daemons
master:     takes care of the other daemon processes
collector:  stores ClassAds
negotiator: performs matchmaking
schedd:     manages the job queue
shadow:     manages a running job (submit side)
startd:     manages the execution computer
starter:    manages a running job (execution side)

104 104 http://www.cs.wisc.edu/condor Roles of machines within a Condor pool  Central manager: central repository and match maker for whole pool  Execute machine: a machine that may run user jobs  Submit machine: a machine upon which users may submit jobs

105 105 http://www.cs.wisc.edu/condor Condor diagram (Diagram: the central manager runs condor_collector and condor_negotiator; the submit machine runs condor_schedd, which holds the job queue, is fed by condor_submit, and starts a condor_shadow for each running job; the execute machine runs condor_startd and condor_starter, which manages the job.)

106 106 http://www.cs.wisc.edu/condor How to watch and record what Condor is doing

107 107 http://www.cs.wisc.edu/condor Use CondorView!  visual graphs of current and past utilization  Data is derived from Condor's own accounting statistics  Interactive Java applet  Quickly and easily view:  How much Condor is being used  How many cycles are being delivered  Who is using cycles  Utilization by machine platform, or by user

108 108 http://www.cs.wisc.edu/condor CondorView Usage Graph

109 109 http://www.cs.wisc.edu/condor Some jobs depend on other jobs… Can Condor manage my whole workflow?

110 110 http://www.cs.wisc.edu/condor DAGMan  DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.  Example: “Don’t run job B until job A has completed successfully.” Directed Acyclic Graph Manager

111 111 http://www.cs.wisc.edu/condor What is a DAG?  A DAG is the data structure used by DAGMan to represent these dependencies.  Each job is a node in the DAG.  Each node can have any number of “parent” or “children” nodes – as long as there are no loops! (Diagram: a diamond-shaped DAG with nodes A, B, C, D is OK; a graph with a cycle among A, B, C is not OK.)

112 112 http://www.cs.wisc.edu/condor Defining a DAG to Condor A DAG input file defines a DAG:
# file name: diamond.dag
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
Parent A Child B C
Parent B C Child D
(Diagram: the diamond DAG, in which A is the parent of B and C, and B and C are the parents of D.)

113 113 http://www.cs.wisc.edu/condor Jobs are still defined by their own submit files.
For node B (# file name: b.submit):
universe   = vanilla
executable = B
input      = B.in
output     = B.out
error      = B.err
log        = B.log
queue
For node C (# file name: c.submit):
universe   = standard
executable = C
input      = C.in
output     = C.out
error      = C.err
log        = C.log
queue

114 114 http://www.cs.wisc.edu/condor Submitting the DAG to Condor  To submit the entire DAG, run condor_submit_dag diamond.dag  condor_submit_dag creates a submit description file for DAGMan, and DAGMan itself is submitted as a Condor job

115 115 http://www.cs.wisc.edu/condor a DAGMan requirement  The submit description file for each job must specify a log file  The log files tell DAGMan when the jobs finish.  Jobs within a DAG may have separate log files or share the same log file.

116 116 http://www.cs.wisc.edu/condor DAGMan & Log Files If DAGMan dies (crash, power failure, etc…)  Condor will restart DAGMan  DAGMan re-reads job log files  DAGMan knows everything it needs to know to resume managing jobs.

117 117 http://www.cs.wisc.edu/condor Nodes  Job execution at a node is either successful or fails  Based on the return value of the job: 0 = success, non-0 = failure (by default)

118 118 http://www.cs.wisc.edu/condor Rescue DAGs  When a job fails, DAGMan continues to run all jobs that do not depend on it  It then creates a rescue DAG  You may retry the unfinished jobs using the rescue DAG
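
After fixing whatever caused the failure, the retry is typically just a resubmission of the rescue file; the exact rescue-file name varies with the Condor version, so treat this as illustrative:
% condor_submit_dag diamond.dag.rescue   # re-runs only the nodes that did not finish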

119 119 http://www.cs.wisc.edu/condor Advanced DAGMan Tricks  Retry of a node  Abort the entire DAG  Throttles and DAGs  PRE and POST scripts: editing the DAG  Nested DAGs: loops and more

120 120 http://www.cs.wisc.edu/condor Retry  Before a node is marked as failed...  Retry N times. In the DAG input file: Retry C 4 (to rerun node C four times before calling the node failed)  Retry N times, unless a node returns specific exit code. In the DAG input file: Retry C 4 UNLESS-EXIT 2

121 121 http://www.cs.wisc.edu/condor Abort the Entire DAG  If a specific error value should cause the entire DAG to stop  Place in the DAG input file: Abort-DAG-On B 3 (B is the name of the node; 3 is the returned error code that triggers the abort)

122 122 http://www.cs.wisc.edu/condor Throttles  Throttles control the number of job submissions at one time  Maximum number of jobs submitted: % condor_submit_dag -maxjobs 40 bigdag.dag  Maximum number of jobs idle: % condor_submit_dag -maxidle 10 bigdag.dag

123 123 http://www.cs.wisc.edu/condor Throttling Example  Submit a DAG with  200,000 nodes  No dependencies between jobs  Use DAGMan to throttle the jobs, so that the job queue is not burdened with 200,000 simultaneous job submissions (Diagram: independent nodes A1, A2, A3, …, A200000.)

124 124 http://www.cs.wisc.edu/condor DAGMan scripts  DAGMan allows PRE and/or POST scripts  Not necessarily a script: any executable  Run before (PRE) or after (POST) the job  Run on the submit machine  In the DAG input file:
Job A a.submit
Script PRE A before-script
Script POST A after-script

125 125 http://www.cs.wisc.edu/condor (Diagram: node A within the DAG consists of the PRE before-script, the Condor job described in a.submit, and the POST after-script.)

126 126 http://www.cs.wisc.edu/condor PRE script PRE script can make decisions  Where should the job run? This is particularly useful to make the job run in the same place as the last job.  Should I pass different arguments to the job?  Lazy decision making

127 127 http://www.cs.wisc.edu/condor POST script  POST script is always run, independent of the Condor job’s return value  POST script can change return value  DAGMan marks the node failed for a non-zero return value from the POST script  POST script can look at error code or output files and return 0 (success) or non-zero (failure) based on deeper knowledge.

128 128 http://www.cs.wisc.edu/condor Pre-defined variables  In the DAG input file:
Job A a.submit
Script PRE A before-script $JOB
Script POST A after-script $JOB $RETURN
The (optional) arguments to the script: $JOB becomes the string that defines the node name; $RETURN becomes the return value from the Condor job defined by the node.

129 129 http://www.cs.wisc.edu/condor Script Throttles  Throttles to control the number of scripts running at one time: % condor_submit_dag -maxpre 10 bigdag.dag OR % condor_submit_dag -maxpost 30 bigdag.dag

130 130 http://www.cs.wisc.edu/condor Nested DAGs  Idea: any DAG node can be a script that does: 1. Make decision 2. Create DAG input file 3. Call condor_submit_dag 4. Wait for DAG to exit  DAG node will not complete until the inner (nested) DAG finishes  Uses?  Implement a fixed-length loop  Modify behavior on the fly
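
A sketch of such a node script in shell, under stated assumptions: decide_subjobs is a hypothetical helper that picks how many sub-jobs to run, and inner.dag.dagman.log assumes the log file name that condor_submit_dag configures for the DAGMan job.
#!/bin/sh
# Hypothetical DAG-node script: decide, build an inner DAG, submit it, and wait.
N=$(./decide_subjobs)                     # 1. make a decision (placeholder helper)
rm -f inner.dag
i=1
while [ "$i" -le "$N" ]; do               # 2. create the DAG input file
    echo "Job step$i step.submit" >> inner.dag
    i=$((i + 1))
done
condor_submit_dag inner.dag               # 3. call condor_submit_dag
condor_wait inner.dag.dagman.log          # 4. wait for the inner DAG to exit (log name is an assumption)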

131 131 http://www.cs.wisc.edu/condor Nested DAG Example (Diagram: in the diamond DAG A, B, C, D, node C is itself a nested DAG containing nodes V, W, X, Y, Z.)

132 132 http://www.cs.wisc.edu/condor DAGMan is close but not quite right… How can I manage a large number of tasks that are dynamically created as the workflow progresses with very low latency?

133 133 http://www.cs.wisc.edu/condor Using Master Worker  An alternative to DAGMan  DAGMan  Create a bunch of Condor jobs  Run them in a pre-determined order  Master Worker (MW)  C++ API  Your master code may dynamically create new tasks  Your worker code executes them.  Submission of workers to Condor and communication with them is taken care of for you.

134 134 http://www.cs.wisc.edu/condor Lightweight MW tasks multiplexed on top of Condor jobs  An MW task is dispatched in milliseconds, while a Condor job can take minutes  Opens the possibility of many small tasks  Easy to be opportunistic.

135 135 http://www.cs.wisc.edu/condor You Must Write 3 Classes, subclasses of MWDriver, MWTask, and MWWorker. (Diagram: the MWDriver and MWTask subclasses are linked into the master executable; the MWWorker and MWTask subclasses are linked into the worker executable.)

136 136 http://www.cs.wisc.edu/condor MW in action (Diagram: on the submit machine, the master executable holds a queue of tasks [T]; workers are started as Condor jobs via condor_submit and receive tasks from the master.)

137 137 http://www.cs.wisc.edu/condor More on MW  http://www.cs.wisc.edu/condor/mw  Online Manual  Mailing list

138 138 http://www.cs.wisc.edu/condor MW is close but not quite right! My jobs need to interact with each other.

139 139 http://www.cs.wisc.edu/condor Condor Universes for Parallel Jobs  PVM: schedule PVM jobs on at least N CPUs.  MPI: schedule MPI jobs on exactly N CPUs.  Parallel Universe: generalized replacement for the MPI universe.

140 140 http://www.cs.wisc.edu/condor How can I run jobs on a mix of different platforms?

141 141 http://www.cs.wisc.edu/condor Running on multiple platforms  Let matchmaker dynamically choose appropriate executable  Or use a cross-platform solution such as java

142 142 http://www.cs.wisc.edu/condor Using Match-substitution Macros  In your submit file, specify:
Executable = myjob.$$(OpSys).$$(Arch)
Requirements = (Arch=="INTEL" && OpSys=="LINUX") || (Arch=="SUN4u" && OpSys=="SOLARIS8") || (Arch=="SGI" && OpSys=="IRIX65")
 The $$(MachineAttribute) macros are replaced with attributes from the ClassAd of the machine matched with your job.

143 143 http://www.cs.wisc.edu/condor Match-substitution Macros  You can use $$(MachineAttribute) in string attributes of your job executable, environment, arguments, …  Standard machine attributes may be augmented with additional custom ones, e.g. a path to a software cache  See condor_status -l

144 144 http://www.cs.wisc.edu/condor Using Java Universe  Java Universe provides more than just inserting “java” at the start of the execute line of a vanilla job:  Knows which machines have a JVM installed  Knows the location, version, and performance of JVM on each machine  Knows about jar files, etc.  Provides more information about Java job completion than just JVM exit code Program runs in a Java wrapper, allowing Condor to report Java exceptions, etc.

145 145 http://www.cs.wisc.edu/condor Java Universe Job  Example Java Universe Submit file:
Universe   = java
Executable = Main.class
jar_files  = MyLibrary.jar
Input      = infile
Output     = outfile
Arguments  = Main 1 2 3
Queue

146 146 http://www.cs.wisc.edu/condor Java support, cont.
bash-2.05a$ condor_status -java
Name        JavaVendor  Ver    State     Actv LoadAv Mem
abulafia.cs Sun Microsy 1.5.0_ Claimed   Busy 0.180   503
acme.cs.wis Sun Microsy 1.5.0_ Unclaimed Idle 0.000   503
adelie01.cs Sun Microsy 1.5.0_ Claimed   Busy 0.000  1002
adelie02.cs Sun Microsy 1.5.0_ Claimed   Busy 0.000  1002
…
                 Total Owner Claimed Unclaimed Matched Preempting
 INTEL/LINUX       965   179     516       250      20          0
 INTEL/WINNT50     102     6      65        31       0          0
 SUN4u/SOLARIS28     1     0       0         1       0          0
 X86_64/LINUX      128     2     106        20       0          0
 Total            1196   187     687       302      20          0

147 147 http://www.cs.wisc.edu/condor My collaborators have a cluster too. Can we share our resources?

148 148 http://www.cs.wisc.edu/condor Condor and the Grid  Condor can send jobs to other sites  Condor can receive jobs from other sites

149 149 http://www.cs.wisc.edu/condor “Grid” Universe  All handled in your submit file  Supports a number of “back end” types:  Globus: GT2, GT3, GT4  NorduGrid  UNICORE  Condor  PBS  LSF Thanks to recent contributions from INFN!

150 150 http://www.cs.wisc.edu/condor Example Globus Job (a.k.a. Condor-G Job)  Used for a Globus GT2 / GT3 gatekeeper  Syntax:  Grid_Resource = (gt2|gt3) Head-Node  Globus_rsl = <RSL string>  Example:
Universe = grid
Grid_Resource = gt2 beak.cs.wisc.edu/jobmanager-pbs
Globus_rsl = (queue=long)(project=atom-smasher)

151 151 http://www.cs.wisc.edu/condor Example Condor Grid Job (a.k.a. Condor-C)  Used for a Condor back-end  Syntax:  Grid_Resource = condor <schedd name> <collector name>  Remote_<attribute> = <value>  Example:
Universe = grid
Grid_Resource = condor beak.cs.wisc.edu condor.cs.wisc.edu
Remote_Universe = standard

152 152 http://www.cs.wisc.edu/condor In Review We have seen how Condor can  Provide a multi-purpose batch queue  Use computers opportunistically  Work without a shared filesystem  Manage workflows of varying complexity

153 153 http://www.cs.wisc.edu/condor Summary of User Commands
condor_submit      Submit new jobs
condor_submit_dag  Submit a DAG
condor_q           View the job queue
condor_qedit       Modify jobs in the queue
condor_status      View pool status
condor_rm          Remove jobs
condor_prio        Change job priorities
condor_history     View completed jobs
condor_compile     Link a job with the Condor library

154 154 http://www.cs.wisc.edu/condor Condor Job Universes
Vanilla             ordinary jobs
Standard            checkpointing, remote I/O
Scheduler           DAGMan
PVM, MPI, Parallel  interconnected jobs
Java                Java conveniences
Grid                exporting jobs

155 155 http://www.cs.wisc.edu/condor Thanks! Check us out on the Web: http://www.condorproject.org Email: condor-admin@cs.wisc.edu

