1  Using Condor: An Introduction (ICE 2008)
http://www.cs.wisc.edu/condor

2  The Condor Project (established 1985)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.

3  Definitions
› Job: the Condor representation of your work
› Machine: the Condor representation of a computer (a "resource") that can perform the work
› Matchmaking: matching a job with a machine

4  Job
Jobs state their requirements and preferences:
  I need a Linux/x86 platform
  I need at least 500 MB of memory
  I prefer a machine with more memory
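In a submit description file, these statements typically become Requirements and Rank expressions. A minimal sketch, assuming the standard machine ClassAd attributes OpSys, Arch and Memory (Memory is in MB; the threshold just mirrors the 500 MB example above):

  Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Memory >= 500)
  Rank         = Memory

Requirements must evaluate to true for a match to be made; Rank expresses a preference, with larger values preferred.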

5  Machine
Machines state their requirements and preferences:
  Run jobs only when there is no keyboard activity
  I prefer to run Frieda's jobs
  I am a machine in the econ department
  Never run jobs belonging to Dr. Smith
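On the machine side, such policies are usually written as expressions in the machine's Condor configuration; a hedged sketch (the attribute names are standard, but the exact expressions and user names here are illustrative, and they are evaluated against each candidate job's ClassAd):

  START = (KeyboardIdle > 15 * 60) && (Owner != "smith")
  RANK  = (Owner == "frieda")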

6  The Magic of Matchmaking
› Jobs and machines state their requirements and preferences
› Condor matches jobs with machines based on requirements and preferences

7  Getting Started: Submitting Jobs to Condor
› Overview:
  - Choose a "Universe" for your job
  - Make your job "batch-ready"
  - Create a submit description file
  - Run condor_submit to put your job in the queue

8  1. Choose the "Universe"
› Controls how Condor handles jobs
› Choices include:
  - Vanilla
  - Standard
  - Grid
  - Java
  - Parallel
  - VM

9  Using the Vanilla Universe
The Vanilla Universe:
  - Allows running almost any "serial" job
  - Provides automatic file transfer, etc.
  - Like vanilla ice cream, it can be used in just about any situation

10  2. Make your job batch-ready
  - Must be able to run in the background
  - No interactive input
  - No GUI/window clicks
  - No music ;^)

11  Make your job batch-ready (continued)
  - The job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
  - Similar to UNIX shell redirection:
      $ ./myprogram <input.txt >output.txt

12  3. Create a Submit Description File
› A plain ASCII text file
› Condor does not care about file extensions
› Tells Condor about your job:
  - Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)
› Can describe many jobs at once (a "cluster"), each with different input, arguments, output, etc.

13  Simple Submit Description File
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
Executable = my_job
Output     = output.txt
Queue

14  4. Run condor_submit
› You give condor_submit the name of the submit file you have created:
    condor_submit my_job.submit
› condor_submit:
  - Parses the submit file, checks for errors
  - Creates a "ClassAd" that describes your job(s)
  - Puts the job(s) in the Job Queue

15  The Job Queue
› condor_submit sends your job's ClassAd(s) to the schedd
› The schedd (more details later):
  - Manages the local job queue
  - Stores the job in the job queue
      Atomic operation, two-phase commit
      "Like money in the bank"
› View the queue with condor_q

16  Example condor_submit and condor_q
% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda  6/16 06:52   0+00:00:00 I  0   0.0  my_job

1 jobs; 1 idle, 0 running, 0 held
%

17  Input, output & error files
› Controlled by submit file settings
› You can define the job's standard input, standard output and standard error:
  - Read the job's standard input from "input_file":
      Input = input_file
      Shell equivalent: program <input_file
  - Write the job's standard output to "output_file":
      Output = output_file
      Shell equivalent: program >output_file
  - Write the job's standard error to "error_file":
      Error = error_file
      Shell equivalent: program 2>error_file

18  Email about your job
Condor sends email about job events to the submitting user.
Specify "notification" in your submit file to control which events trigger mail:
  Notification = complete   (the default)
  Notification = error
  Notification = always
  Notification = never
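For example, a submit file that only wants mail on errors, sent to a specific address, might contain (notify_user is a standard submit command; the address here is illustrative):

  Notification = error
  Notify_user  = frieda@cs.wisc.edu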

19  Feedback on your job
› Create a log of job events
› Add to the submit description file:
    log = sim.log
› Becomes the Life Story of a Job
  - Shows all events in the life of a job
  - Always have a log file

20  Sample Condor User Log
000 (0001.000.000) 05/25 19:10:03 Job submitted from host: ...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: ...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
        (1) Normal termination (return value 0)
        ...

21  Example Submit Description File With Logging
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/frieda/condor/my_job.condor
Log        = my_job.log      · job log (from Condor)
Input      = my_job.in       · program's standard input
Output     = my_job.out      · program's standard output
Error      = my_job.err      · program's standard error
Arguments  = -a1 -a2         · command-line arguments
InitialDir = /home/frieda/condor/run
Queue

22  Let's run a job
› First, you need a terminal emulator: http://www.putty.org (or similar)
› Log in to chopin.cs.wisc.edu as cguserXX, with the given password
› source /scratch/ice08

23  Logged in?
› condor_q
› condor_status

24  Create submit file
› nano submit

universe                = vanilla
executable              = /bin/echo
Arguments               = hello world
Should_transfer_files   = always
When_to_transfer_output = on_exit
Output                  = out
Log                     = log
queue

25  And submit it…
› condor_submit submit
› (wait… remember the HTC bit?)
› condor_q xx
› cat out

26  "Clusters" and "Processes"
› If your submit file describes multiple jobs, we call this a "cluster"
› Each cluster has a unique "cluster number"
› Each job in a cluster is called a "process"
  - Process numbers always start at zero
› A Condor "Job ID" is the cluster number, a period, and the process number (e.g., 2.1)
  - A cluster can have a single process
      Job ID = 20.0                  · cluster 20, process 0
  - Or, a cluster can have more than one process
      Job IDs: 21.0, 21.1, 21.2      · cluster 21, processes 0, 1, 2
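Most Condor tools accept these Job IDs directly, so you can act on a whole cluster or on a single process; for example (the cluster number is illustrative):

  % condor_q 21        · show every job in cluster 21
  % condor_q 21.2      · show only cluster 21, process 2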

27  Submit File for a Cluster
# Example submit file for a cluster of 2 jobs
# with separate input, output, error and log files
Universe   = vanilla
Executable = my_job

Arguments  = -x 0
log        = my_job_0.log
Input      = my_job_0.in
Output     = my_job_0.out
Error      = my_job_0.err
Queue                        · Job 2.0 (cluster 2, process 0)

Arguments  = -x 1
log        = my_job_1.log
Input      = my_job_1.in
Output     = my_job_1.out
Error      = my_job_1.err
Queue                        · Job 2.1 (cluster 2, process 1)

28  Submitting the Job
% condor_submit my_job.submit-file
Submitting job(s).
2 job(s) submitted to cluster 2.

% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda  4/15 06:52   0+00:02:11 R  0   0.0  my_job -a1 -a2
 2.0     frieda  4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 0
 2.1     frieda  4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 1

3 jobs; 2 idle, 1 running, 0 held
%

29  Organize your files and directories for big runs
› Create subdirectories for each "run"
  - run_0, run_1, … run_599
› Create input files in each of these
  - run_0/simulation.in
  - run_1/simulation.in
  - …
  - run_599/simulation.in
› The output, error & log files for each job will be created by Condor from your job's output
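Creating the 600 run directories by hand would be tedious; a small shell sketch lays them out (it only makes the directories — the per-run simulation.in files still have to be generated separately):

  % for i in $(seq 0 599); do mkdir run_$i; done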

30  Submit Description File for 600 Jobs
# Cluster of 600 jobs with different directories
Universe   = vanilla
Executable = sim
Log        = simulation.log
...
Arguments  = -x 0
InitialDir = run_0           · log, input, output & error files -> run_0
Queue                        · Job 3.0 (cluster 3, process 0)

Arguments  = -x 1
InitialDir = run_1           · log, input, output & error files -> run_1
Queue                        · Job 3.1 (cluster 3, process 1)

· Do this 598 more times…

31  Submit File for a Big Cluster of Jobs
› We just submitted 1 cluster with 600 processes
› All the input/output files will be in different directories
› The submit file is pretty unwieldy (over 1200 lines)
› Isn't there a better way?

32  Submit File for a Big Cluster of Jobs (the better way) #1
› We can queue all 600 in one "Queue" command
    Queue 600
› Condor provides $(Process) and $(Cluster)
  - $(Process) will be expanded to the process number for each job in the cluster: 0, 1, … 599
  - $(Cluster) will be expanded to the cluster number: 4 for all jobs in this cluster

33  Submit File for a Big Cluster of Jobs (the better way) #2
› The initial directory for each job can be specified using $(Process)
    InitialDir = run_$(Process)
  - Condor will expand these to the "run_0", "run_1", … "run_599" directories
› Similarly, arguments can be variable
    Arguments = -x $(Process)
  - Condor will expand these to "-x 0", "-x 1", … "-x 599"

34  Better Submit File for 600 Jobs
# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe   = vanilla
Executable = my_job
Log        = my_job.log
Input      = my_job.in
Output     = my_job.out
Error      = my_job.err
Arguments  = -x $(Process)      · -x 0, -x 1, … -x 599
InitialDir = run_$(Process)     · run_0 … run_599
Queue 600                       · Jobs 4.0 … 4.599

35  Now, we submit it…
$ condor_submit my_job.submit
Submitting job(s) ................................................................
Logging submit event(s) ................................................................
600 job(s) submitted to cluster 4.

36  And, check the queue
$ condor_q
-- Submitter: x.cs.wisc.edu : : x.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 4.0     frieda  4/20 12:08   0+00:00:05 R  0   9.8  my_job -arg1 -x 0
 4.1     frieda  4/20 12:08   0+00:00:03 I  0   9.8  my_job -arg1 -x 1
 4.2     frieda  4/20 12:08   0+00:00:01 I  0   9.8  my_job -arg1 -x 2
 4.3     frieda  4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 3
 ...
 4.598   frieda  4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 598
 4.599   frieda  4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 599

600 jobs; 599 idle, 1 running, 0 held

37  Removing jobs
› If you want to remove a job from the Condor queue, use condor_rm
› You can only remove jobs that you own
› A privileged user can remove any jobs
  - "root" on UNIX
  - "administrator" on Windows

38  Removing jobs (continued)
› Remove an entire cluster:
    condor_rm 4        · removes the whole cluster
› Remove a specific job from a cluster:
    condor_rm 4.0      · removes a single job
› Or, remove all of your jobs with "-a":
    condor_rm -a       · removes all jobs / clusters
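condor_rm (like condor_q) also accepts a ClassAd constraint, which is handy for bulk cleanup; a sketch assuming the standard JobStatus attribute, where 5 means "held":

  % condor_rm -constraint 'JobStatus == 5'

As always, only jobs you own (or, for the privileged user, any jobs) are actually removed.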

39  Submit a cluster of 10 jobs
› nano submit

universe                = vanilla
executable              = /bin/echo
Arguments               = hello world $(PROCESS)
Should_transfer_files   = always
When_to_transfer_output = on_exit
Output                  = out.$(PROCESS)
Log                     = log
Queue 10

40  And submit it…
› condor_submit submit
› (wait…)
› condor_q xx
› cat log
› cat out.yy

41  My new jobs run for 20 days…
› What happens when a job is forced off its CPU?
  - Preempted by a higher-priority user or job
  - Vacated because of user activity
› How can I add fault tolerance to my jobs?

42  Condor's Standard Universe to the rescue!
› Support for transparent process checkpoint and restart
› Remote system calls (remote I/O)
  - Your job can read/write files as if they were local

43  Remote System Calls in the Standard Universe
› I/O system calls are trapped and sent back to the submit machine
  - Examples: open a file, write to a file
› No source code changes typically required
› Programming language independent

44  Process Checkpointing in the Standard Universe
› Condor's process checkpointing provides a mechanism to automatically save the state of a job
› The process can then be restarted from right where it was checkpointed
  - After preemption, crash, etc.

45  Checkpointing: Process Starts
Checkpoint: the entire state of a program, saved in a file
  - CPU registers, memory image, I/O
[Figure: a job progressing along a time axis]

46  Checkpointing: Process Checkpointed
[Figure: checkpoints 1, 2 and 3 are taken as the job progresses along the time axis]

47  Checkpointing: Process Killed
[Figure: the job is killed some time after checkpoint 3]

48  Checkpointing: Process Resumed
[Figure: the job resumes from checkpoint 3; work preserved by the checkpoint is "goodput", work lost since checkpoint 3 is "badput"]

49  When will Condor checkpoint your job?
› Periodically, if desired
  - For fault tolerance
› When your job is preempted by a higher-priority job
› When your job is vacated because the execution machine becomes busy
› When you explicitly run the condor_checkpoint, condor_vacate, condor_off or condor_restart commands

50  Making the Standard Universe Work
› The job must be relinked with Condor's standard universe support library
› To relink, place condor_compile in front of the command used to link the job:
    % condor_compile gcc -o myjob myjob.c
      - OR -
    % condor_compile f77 -o myjob filea.f fileb.f
      - OR -
    % condor_compile make -f MyMakefile

51  Limitations of the Standard Universe
› Condor's checkpointing is not at the kernel level. In the Standard Universe the job may not:
  - fork()
  - Use kernel threads
  - Use some forms of IPC, such as pipes and shared memory
› Must have access to source code to relink
› Many typical scientific jobs are OK

52  Submitting a Standard Universe job
#include <stdio.h>   /* header name lost in transcription; stdio.h assumed */

int main(int argc, char **argv) {
    int i;
    /* burn some CPU so the job runs long enough to be worth checkpointing */
    for (i = 0; i < 10000000; i++) { }
    return 0;
}

53  And submit…
› condor_compile gcc -o foo foo.c
› condor_submit
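The bare condor_submit above still needs a submit description file that names the relinked binary and selects the standard universe; a minimal sketch (the file names are illustrative):

  # foo.submit
  Universe   = standard
  Executable = foo
  Output     = foo.out
  Error      = foo.err
  Log        = foo.log
  Queue

  % condor_submit foo.submit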

54  My jobs have dependencies…
Can Condor help solve my dependency problems?

55  Condor Universes: Scheduler and Local
› Scheduler Universe
  - Plug in a meta-scheduler
  - Developed for DAGMan (more later)
  - Similar to Globus's fork job manager
› Local Universe
  - Very similar to vanilla, but jobs run on the local host
  - Has more control over jobs than the scheduler universe
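Selecting one of these universes is just another submit-file setting; a minimal local-universe sketch (the script name is illustrative):

  Universe   = local
  Executable = my_script.sh
  Output     = local.out
  Log        = local.log
  Queue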

56  Frieda learns DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., "Don't run job B until job A has completed successfully.")

57  What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a "node" in the DAG.
› Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
[Figure: a diamond-shaped DAG with nodes Job A, Job B, Job C, Job D]

58  Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor job specified by its accompanying Condor submit file
[Figure: the diamond DAG: A at the top, B and C in the middle, D at the bottom]

59  Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
› The DAGMan daemon itself runs under the schedd (as a scheduler-universe job)
  - It is "watched" by Condor, so you don't have to

60  Running a DAG
› DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan reads the .dag file and feeds nodes A, B, C, D into the Condor job queue]

61  Running a DAG (cont'd)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan submits the ready nodes (B and C) to the Condor job queue while D waits]

62  Running a DAG (cont'd)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
[Figure: a node fails, and DAGMan writes a rescue file recording which nodes have completed]

63  Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan reads the rescue file and resubmits the failed node to the Condor job queue]

64  Recovering a DAG (cont'd)
› Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: the remaining node is submitted to the Condor job queue]

65  Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: all nodes done; the Condor job queue is empty]

66  Additional DAGMan Features
› Provides other handy features for job management…
  - Nodes can have PRE & POST scripts
  - Failed nodes can be automatically retried a configurable number of times
  - Job submission can be "throttled"
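In the DAG file these features look roughly like the sketch below, extending the diamond.dag example (the script names and retry count are illustrative); throttling is typically requested when the DAG is submitted, e.g. with condor_submit_dag -maxjobs:

  # fragment of diamond.dag, with extras
  Job A a.sub
  Job B b.sub
  Script PRE  A prepare_inputs.sh
  Script POST A check_results.sh
  Retry B 3
  Parent A Child B

  % condor_submit_dag -maxjobs 50 diamond.dag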

67  General User Commands
condor_status        View pool status
condor_q             View job queue
condor_submit        Submit new jobs
condor_rm            Remove jobs
condor_prio          Intra-user priorities
condor_history       Completed job info
condor_submit_dag    Submit new DAG
condor_checkpoint    Force a checkpoint
condor_compile       Link with the Condor library

68  Thank you!
Check us out on the Web: http://www.condorproject.org
Email: condor-admin@cs.wisc.edu

