Condor Tutorial for Users INFN-Bologna, 6/29/99 Derek Wright Computer Sciences Department University of Wisconsin-Madison

1 Condor Tutorial for Users INFN-Bologna, 6/29/99 Derek Wright Computer Sciences Department University of Wisconsin-Madison wright@cs.wisc.edu

2 2 Conventions Used In This Presentation  A slide with an all-yellow background is the beginning of a new “chapter” The slides after it will describe each entry on the yellow slide in great detail  A Condor tool that users would use will be in red italics  A ClassAd attribute name will be in blue  A UNIX shell command or file name will be in courier font

3 3 What is Condor?  A system for “High-Throughput Computing”  Lots of jobs over a long period of time, not a short burst of “high-performance”  Condor manages both resources (machines) and resource requests (jobs)  Supports additional features for jobs that are re-linked with Condor libraries: checkpointing remote system calls

4 4 What’s Condor Good For?  Managing a large number of jobs You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. Condor can handle inter-job dependencies (DAGMan)

5 5 What’s Condor Good For? (cont’d)  Robustness Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion If an execute machine crashes, you only lose work done since the last checkpoint Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover

6 6 What’s Condor Good For? (cont’d)  Giving you access to more computing resources Checkpointing allows your job to run on “opportunistic resources” (not dedicated) Checkpointing also provides “migration” - if a machine is no longer available, move! With remote system calls, you don’t even need an account on a machine where your job executes

7 7 What is a Condor Pool?  “Pool” can be a single machine, or a group of machines  Determined by a “central manager” - the matchmaker and centralized information repository  Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

8 8 What Kind of Job Do You Have?  You must know some things about your job to decide if and how it will work with Condor: What kind of I/O does it do? Does it use TCP/IP? (network sockets) Can the job be resumed? Is the job multi-process (fork(), pvm_addhost(), etc.)?

9 9 What Kind of I/O Does Your Job Do?  Interactive TTY  “Batch” TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files)  X Windows  NFS, AFS, or another network file system  Local file system  TCP/IP

10 10 What Does Condor Support?  Condor can support various combinations of these features in different “Universes”  Different Universes provide different functionality for your job: Vanilla Standard Scheduler PVM

11 11 What Does Condor Support?

12 12 Condor Universes  A Universe specifies a Condor runtime environment: STANDARD –Supports Checkpointing –Supports Remote System Calls –Has some limitations ( no fork(), socket(), etc.) VANILLA –Any Unix executable (shell scripts, etc) –No Condor Checkpointing or Remote I/O

13 13 Condor Universes (cont’d) PVM (Parallel Virtual Machine) –Allows you to run parallel jobs in Condor (more on this later) SCHEDULER –Special kind of Condor job: the job is run on the submit machine, not a remote execute machine –Job is automatically restarted if the condor_schedd is shut down –Used to schedule jobs (e.g. DAGMan)

14 14 Submitting Jobs to Condor  Choosing a “Universe” for your job (already covered this)  Preparing your job Making it “batch-ready” Re-linking if checkpointing and remote system calls are desired (condor_compile)  Creating a submit description file  Running condor_submit Sends your request to the User Agent (condor_schedd)

15 15 Preparing Your Job  Making your job “batch-ready” Must be able to run in the background: no interactive input, windows, GUI, etc. Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices If your job expects input from the keyboard, you have to put the input you want into a file

16 16 Preparing Your Job (cont’d)  If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor’s special libraries  To do this, you use condor_compile Place “condor_compile” in front of the command you normally use to link your job: condor_compile gcc -o myjob myjob.c
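The same pattern applies to other link commands as well; a couple of hedged illustrations (program and target names are hypothetical):
% condor_compile f77 -o mysim mysim.f
% condor_compile make myjob
Because condor_compile only wraps the command that performs the final link, an existing Makefile can usually be reused unchanged.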

17 17 Creating a Submit Description File  A plain ASCII text file  Tells Condor about your job: Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)  Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.

18 18 Example Submit Description File # Example condor_submit input file # (Lines beginning with # are comments) # NOTE: the words on the left side are not # case sensitive, but filenames are! Universe = standard Executable = /home/wright/condor/my_job.condor Input = my_job.stdin Output = my_job.stdout Error = my_job.stderr Log = my_job.log Arguments = -arg1 -arg2 InitialDir = /home/wright/condor/run_1 Queue

19 19 Example Submit Description File Described  Submits a single job to the standard universe, specifies files for STDIN, STDOUT and STDERR, creates a UserLog, defines command-line arguments, and specifies the directory the job should be run in  Equivalent to running the following outside of Condor: % cd /home/wright/condor/run_1 % /home/wright/condor/my_job.condor -arg1 -arg2 \ > my_job.stdout 2> my_job.stderr \ < my_job.stdin

20 20 “Clusters” and “Processes”  If your submit file describes multiple jobs, we call this a “cluster”  Each job within a cluster is called a “process” or “proc”  If you only specify one job, you still get a cluster, but it has only one process  A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)  Process numbers always start at 0

21 21 Example Submit Description File for a Cluster # Example condor_submit input file that defines # a whole cluster of jobs at once Universe = standard Executable = /home/wright/condor/my_job.condor Input = my_job.stdin Output = my_job.stdout Error = my_job.stderr Log = my_job.log Arguments = -arg1 -arg2 InitialDir = /home/wright/condor/run_$(Process) Queue 500

22 22 Example Submit Description File for a Cluster - Described  Now, the initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 500” to submit 500 jobs at once  $(Process) will be expanded to the process number for each job in the cluster (from 0 up to 499 in this case), so we’ll have “run_0”, “run_1”, … “run_499” directories  All the input/output files will be in different directories!
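Condor does not create those run_* directories for you, so they must exist before the cluster is submitted. A minimal shell sketch, assuming a Bourne-style shell and the 500-job cluster above:
% i=0
% while [ $i -lt 500 ]; do mkdir run_$i; i=`expr $i + 1`; done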

23 23 Running condor_submit  You give condor_submit the name of the submit file you have created  condor_submit parses the file and creates a “ClassAd” that describes your job(s)  Creates the files you specified for STDOUT and STDERR  Sends your job’s ClassAd(s) and executable to the condor_schedd, which stores the job in its queue

24 24 Monitoring Your Jobs  Using condor_q  Using a “User Log” file  Using condor_status  Using condor_rm  Getting email from Condor  Once they complete, you can use condor_history to examine them

25 25 Using condor_q  To view the jobs you have submitted, you use condor_q  Displays the status of your job, how much compute time it has accumulated, etc.  Many different options: A single job, a single cluster, all jobs that match a certain constraint, or all jobs Can view remote job queues (either individual queues, or “-global”)
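For illustration, a few typical invocations using only the options mentioned above (the cluster numbers are hypothetical):
% condor_q              (all of your jobs in the local queue)
% condor_q 42           (just cluster 42)
% condor_q 42.3         (just process 3 of cluster 42)
% condor_q -global      (the queues of every submit machine in the pool)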

26 26 Using a “User Log” file  A UserLog must be specified in your submit file: Log = filename  You get a log entry for everything that happens to your job: When it was submitted, when it starts executing, if it is checkpointed or vacated, if there are any problems, etc.  Very useful! Highly recommended!

27 27 Using condor_status  To view the status of the whole Condor pool, you use condor_status  Can use the “-run” option to see which machines are running jobs, as well as: The user who submitted each job The machine they submitted from  Can also view the status of various submitters with “-submitter ”
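A hedged sketch of typical usage, sticking to the options named on this slide (the submitter name is hypothetical):
% condor_status                     (one line of status per machine in the pool)
% condor_status -run                (only machines currently running jobs, with the submitting user and machine)
% condor_status -submitter wright   (resources being used by that submitter)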

28 28 Using condor_rm  If you want to remove a job from the Condor queue, you use condor_rm  You can only remove jobs that you own (you can’t run condor_rm on someone else’s jobs unless you are root)  You can give specific job ID’s (cluster or cluster.proc), or you can remove all of your jobs with the “-a” option.
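For example (job IDs hypothetical):
% condor_rm 42       (removes every process in cluster 42)
% condor_rm 42.3     (removes only job 42.3)
% condor_rm -a       (removes all of your jobs)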

29 29 Getting Email from Condor  By default, Condor will send you email when your job completes  If you don’t want this email, put this in your submit file: notification = never  If you want email every time something happens to your job (checkpoint, exit, etc), use this: notification = always

30 30 Getting Email from Condor (cont’d)  If you only want email if your job exits with an error, use this: notification = error  By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this: notify_user = email@address.here

31 31 Using condor_history  Once your job completes, it will no longer show up in condor_q  Now, you must use condor_history to view the job’s ClassAd  The status field (“ST”) will have either a “C” for “completed”, or an “X” if the job was removed with condor_rm

32 32 Any questions?  Nothing is too basic  If I was unclear, you probably are not the only person who doesn’t understand, and the rest of the day will be even more confusing

33 Hands-On Exercise #1 Submitting and Monitoring a Simple Test Job

34 34 Hands-On Exercise #1  Login to your machine as user “condor”  You will see two windows: Netscape, with instructions An xterm, where you execute commands  To begin, click on Simple Test Job  Please follow the directions carefully  Any lines beginning with % are commands that you should execute in your xterm  If you accidentally exit Netscape, click on “Tutorial” in the Start menu

35 Lunch break Please be back by 13:30

36 Welcome Back

37 37 Classified Advertisements  ClassAds Language for expressing attributes Semantics for evaluating them  Intuitively, a ClassAd is a set of named expressions Each named expression is an attribute  Expressions are similar to C … Constants, attribute references, operators

38 38 Classified Advertisements: Example MyType = "Machine" TargetType = "Job" Name = "froth.cs.wisc.edu" StartdIpAddr=" " Arch = "INTEL" OpSys = "SOLARIS26" VirtualMemory = 225312 Disk = 35957 KFlops = 21058 Mips = 103 LoadAvg = 0.011719 KeyboardIdle = 12 Cpus = 1 Memory = 128 Requirements = LoadAvg <= 0.3 && KeyboardIdle > 15 * 60 Rank = 0

39 39 Classified Advertisements: Matching  ClassAds are always considered in pairs: Does ClassAd A match ClassAd B (and vice versa)? This is called “2-way matching”  If the same attribute appears in both ClassAds, you can specify which attribute you mean by putting “MY.” or “TARGET.” in front of the attribute name

40 40 Classified Advertisements: Examples  ClassAd A MyType = "Apartment" TargetType = "ApartmentRenter" SquareArea = 3500 RentOffer = 1000 HeatIncluded = False OnBusLine = True Rank = UnderGrad==False + TARGET.RentOffer Requirements = MY.RentOffer - TARGET.RentOffer < 150  ClassAd B MyType = "ApartmentRenter" TargetType = "Apartment" UnderGrad = False RentOffer = 900 Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded Requirements = OnBusLine && SquareArea > 2700

41 41 ClassAds in the Condor System  ClassAds allow Condor to be a general system Constraints and ranks on matches expressed by the entities themselves Only priority logic integrated into the Match-Maker  All principal entities in the Condor system are represented by ClassAds Machines, Jobs, Submitters

42 42 ClassAds in Condor: Requirements and Rank (Example for Machines) Friend = Owner == "tannenba" || Owner == "wright" ResearchGroup = Owner == "jbasney" || Owner == "raman" Trusted = Owner != "rival" && Owner != "riffraff" Requirements = Trusted && ( ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60) ) Rank = Friend + ResearchGroup*10

43 43 Requirements for Machine Example Described  Machine will never start a job submitted by “rival” or “riffraff”  If someone from ResearchGroup (“jbasney” or “raman”) submits a job, it will always run, regardless of keyboard activity or load average  If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3

44 44 Machine Rank Example Described  If the machine is running a job submitted by owner “foo”, it will give this a Rank of 0, since foo is neither a friend nor in the same research group  If “wright” or “tannenba” submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup is 0)  If “raman” or “jbasney” submits a job, it will have a rank of 10  While a machine is running a job, that job can be preempted in favor of a higher-ranked job

45 45 ClassAds in Condor: Requirements and Rank (Example for Jobs) Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20 Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )

46 46 Job Example Described  The job must run on an Intel CPU, running Linux, with at least 20 megs of RAM  All machines with 32 megs of RAM or less are Ranked at 0  Machines with more than 32 megs of RAM are ranked according to how much RAM they have, if the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in Million Instructions Per Second

47 47 Finding and Using the ClassAd Attributes in your Pool  Condor defines a number of attributes by default, which are listed in the User Manual (“About Requirements and Rank”)  To see if machines in your pool have other attributes defined, use: condor_status -long  A custom-defined attribute might not be defined on all machines in your pool, so you’ll probably want to use “meta-operators”

48 48 ClassAd “Meta-Operators”  Meta-operators allow you to compare against “UNDEFINED” as if it were a real value: =?= is “meta-equal-to” =!= is “meta-not-equal-to” Color != "Red" (non-meta) would evaluate to UNDEFINED if Color is not defined Color =!= "Red" would evaluate to True if Color is not defined, since UNDEFINED is not "Red"
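A hedged sketch of how this is used in practice (SpecialReserved is a made-up custom attribute defined on only some machines): to keep your job away from the machines that set it, you might write
Requirements = (SpecialReserved =!= True) && Arch == "INTEL"
Written with the ordinary != operator, the first clause would evaluate to UNDEFINED on every machine that does not define SpecialReserved, so the whole Requirements expression would not match exactly the machines you wanted; with =!= it evaluates to True there.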

49 Hands-On Exercise #2 Submitting Jobs with Requirements and Rank

50 50 Hands-On Exercise #2  Please point your browser to the new instructions: Go back to the tutorial homepage Click on Requirements and Rank Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm  If you exited Netscape, just click on “Tutorial” from your Start menu

51 51 Priorities In Condor  Two kinds of priorities: User Priorities –Priorities between users in the pool to ensure fairness –The lower the value, the better the priority Job Priorities –Priorities that users give to their own jobs to determine the order in which they will run –The higher the value, the better the priority –Only matters within a given user’s jobs

52 52 User Priorities in Condor  Each active user in the pool has a user priority  Viewed or changed with condor_userprio  The lower the number, the better  A given user’s share of available machines is inversely related to the ratio between user priorities. Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
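To make the inverse ratio concrete (all numbers hypothetical): if 90 machines are claimable and both users have plenty of idle jobs, priorities of 10 (Fred) and 20 (Joe) give shares proportional to 1/10 : 1/20, i.e. 2 : 1, so roughly 60 machines for Fred and 30 for Joe.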

53 53 User Priorities in Condor, cont.  Condor continuously adjusts user priorities over time: if the number of machines allocated to a user is greater than their priority value, the priority worsens; if it is less, the priority improves  Priority Preemption Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…) Starvation is prevented Priority “thrashing” is prevented

54 54 Job Priorities in Condor  Can be set at submit-time in your description file with: prio =  Can be viewed with condor_q  Can be changed at any time with condor_prio  The higher the number, the more likely the job will run (only among the jobs of an individual user)
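For instance (the job ID and value are hypothetical, and the exact option syntax is worth confirming against the condor_prio man page):
% condor_prio -p 15 42.0
would set job 42.0 to priority 15, so it is considered before your other queued jobs that have lower job priorities.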

55 55 Managing a Large Cluster of Jobs  Condor can manage huge numbers of jobs  Special features of the submit description file make this easier  Condor can also manage inter-job dependencies with condor_dagman For example: job A should run first, then, run jobs B and C, when those finish, submit D, etc… We’ll discuss DAGMan later

56 56 Submitting a Large Cluster  Anywhere in your submit file, if you use $(Process), that will expand to the process number of each job in the cluster: input = my_input.$(process) arguments = $(process)  It is common to use $(Process) to specify InitialDir, so that each process runs in its own directory: InitialDir = dir.$(process)

57 57 Submitting a Large Cluster (cont’d)  Can either have multiple Queue entries, or put a number after Queue to tell Condor how many to submit: Queue 1000  A single large cluster is more efficient than many separate clusters: submitting is faster, and the jobs use less space because they share one copy of the executable  Can only have one executable per cluster: Different executables must be different clusters!
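A hedged sketch of the “multiple Queue entries” style (file names hypothetical): submit-file commands keep their most recent value, so you only restate what changes between jobs:
Universe   = standard
Executable = my_analysis
Input      = dataset_A.in
Queue
Input      = dataset_B.in
Queue
This submits two processes into one cluster, identical except for their input files.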

58 Hands-On Exercise #3 Submitting a Large Cluster of Jobs

59 59 Hands-On Exercise #3  Please point your browser to the new instructions: Go back to the tutorial homepage Click on Large Clusters Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm  If you exited Netscape, just click on “Tutorial” from your Start menu

60 10 Minute Break Questions are welcome….

61 61 Inter-Job Dependencies with DAGMan  DAGMan can be used to handle a set of jobs that must be run in a certain order  Also provides “pre” and “post” operations, so you can have a program or script run before each job is submitted and after it completes  Robust: handles errors and submit-machine crashes

62 62 Using DAGMan  You define a DAG description file, which is similar in function to the submit file you give to condor_submit  DAGMan restrictions: Each job in the DAG must be in its own cluster (this is a limitation we will remove in future versions) All jobs in the DAG must have a User Log and must share the same file
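A minimal sketch of the shared-log restriction (file names hypothetical): every submit file referenced by the DAG, say A.submit through D.submit, would contain the same line
Log = diamond.log
so that DAGMan can watch a single User Log to learn when each node job starts and finishes.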

63 63 Format of the DAGMan Description File  # is a comment  First section names the jobs in your DAG and associates a submit description file with each job  Second (optional) section defines PRE and POST scripts to run  Final section defines the job dependencies

64 64 Example DAGMan Description File # Example DAGMan input file Job A A.submit Job B B.submit Job C C.submit Job D D.submit Script PRE D d_input_checker Script POST A a_output_processor A.out PARENT A CHILD B C PARENT B C CHILD D

65 65 Setting up a DAG for Condor  Must create the DAG description file  Must create all the submit description files for the individual jobs  Must prepare any executables you plan to use  If you want, you can have a mix of Vanilla and Standard jobs  Must setup any PRE/POST commands or scripts you wish to use

66 66 Submitting a DAG to Condor  Once you have everything in place, to submit a DAG, you use condor_submit_dag and give it the name of your DAG description file  This will check your input file for errors and submit a copy of condor_dagman as a scheduler universe job with all the necessary command-line arguments
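For example, if the DAG description file from the earlier slide were saved as diamond.dag (a hypothetical name), you would submit it with:
% condor_submit_dag diamond.dag
and could then watch both the condor_dagman job and the individual node jobs with condor_q.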

67 67 Removing a DAG  Removing a DAG is easy: Just use condor_rm on the scheduler universe job (condor_dagman) On shutdown, DAGMan will remove any jobs that are currently in the queue that are associated with its DAG Once all jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue

68 Hands-On Exercise #4 Using DAGMan

69 69 Hands-On Exercise #4  Please point your browser to the new instructions: Go back to the tutorial homepage Click on Using_DAGMan Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm  If you exited Netscape, just click on “Tutorial” from your Start menu

70 70 What’s Wrong with my Vanilla Job?  Special requirements expressions for vanilla jobs  You didn’t submit it from a directory that is shared  Condor isn’t running as root (more on this later)  You don’t have your file permissions setup correctly (more on this later)

71 71 Special Requirements Expressions for Vanilla Jobs  When you submit a vanilla job, Condor automatically appends two extra Requirements: UID_DOMAIN == (the UID_DOMAIN of your submit machine) FILESYSTEM_DOMAIN == (the FILESYSTEM_DOMAIN of your submit machine)  Since there are no remote system calls with Vanilla jobs, they depend on a shared file system and a common UID space to run as you and access your files
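For illustration only (the domain name is hypothetical): if your submit machine advertises cs.wisc.edu for both settings, the appended clauses look roughly like
(UID_DOMAIN == "cs.wisc.edu") && (FILESYSTEM_DOMAIN == "cs.wisc.edu")
so only execute machines advertising the same two domains can be matched with your vanilla job.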

72 72 Special Requirements Expressions for Vanilla Jobs  By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure your pool specially if there really is a common UID space and a network file system  If you don’t have an account on the remote system, Vanilla jobs won’t work

73 73 Shared File Systems for Vanilla Jobs  Just because you have AFS or NFS doesn’t mean ALL files are shared Initialdir = /tmp will probably cause trouble for Vanilla jobs!  You must be sure to set Initialdir to a shared directory (or cd into it to run condor_submit) for Vanilla jobs

74 74 Why Don’t My Jobs Run?  Try using condor_q -analyze  Try specifying a User Log for your job  Look at condor_userprio: maybe you have a bad priority and higher priority users are being served  Problems with file permissions or network file systems  Look at the SchedLog

75 75 Using condor_q -analyze  condor_q -analyze will analyze your job’s ClassAd, get all the ClassAds of the machines in the pool, and tell you what’s going on: Will report errors in your Requirements expression (impossible to match, etc.) Will tell you about user priorities in the pool (other people have better priority)
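A typical invocation (the job ID is hypothetical):
% condor_q -analyze 42.0
Roughly speaking, the output reports how many machines in the pool can match the job’s Requirements, how many reject it and why, and whether the job is simply waiting behind users with better priority.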

76 76 Looking at condor_userprio  You can look at condor_userprio yourself  If your priority value is a really high number (because you’ve been running a lot of Condor jobs), other users will have priority to run jobs in your pool

77 77 File Permissions in Condor  If Condor isn’t running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually “condor”)  You must grant this user write access to your output files, and read access to your input files (both STDOUT, STDIN from your submit file, as well as files your job explicitly opens)

78 78 File Permissions in Condor (cont’d)  Often, there will be a “condor” group and you can make your files owned and writable by this group  For vanilla jobs, even if the UID_DOMAIN setting is correct, and they match for your submit and execute machines, if Condor isn’t running as root, your job will be started as user “condor”, not as you!
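A hedged example of the group-permission approach (file names hypothetical, matching the earlier submit file):
% chgrp condor my_job.stdout my_job.stderr my_job.log
% chmod g+rw my_job.stdout my_job.stderr my_job.log
% chgrp condor my_job.stdin ; chmod g+r my_job.stdin
This lets the condor_shadow, running as the “condor” user, write your output and log files and read your input.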

79 79 Problems with NFS in Condor  For NFS, sometimes the administrators will setup read-only mounts, or have UIDs remapped for certain partitions (the classic example is root = nobody, but modern NFS can do arbitrary remappings)

80 80 Problems with NFS in Condor (cont’d)  If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir (the directory you were in when you ran condor_submit) might not exist on a remote machine E.g. you’re in /mnt/tmp/home/me/...  With automounting, you always need to specify InitialDir explicitly InitialDir = /home/me/...

81 81 Problems with AFS in Condor  If your pool uses AFS, the condor_shadow, even if it’s running with your UID, will not have your AFS token You must grant an unauthenticated AFS user the appropriate access to your files Some sites provide a better alternative than world-writable files –Host ACLs –Network-specific ACLs

82 82 Looking at the SchedLog  Looking at the log file of the condor_schedd, the “SchedLog” file can possibly give you a clue if there are problems Find it with: condor_config_val schedd_log You might need your pool administrator to turn on a higher “debugging level” to see more verbose output

83 83 Other User Features  Submit-Only installation  Heterogeneous Submit  PVM jobs

84 84 Submit-Only Installation  Can install just a condor_master and condor_schedd on your machine  Can submit jobs into a remote pool  Special option to condor_install

85 85 Heterogeneous Submit  The job you submit doesn’t have to be the same platform as the machine you submit from Maybe you have access to a pool that’s full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain  You can take an Alpha binary, copy it to your Sparc, and submit it with a requirements expression that says you need to run on ALPHA/OSF1
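A hedged sketch of such a submit-file fragment (the executable name is hypothetical):
Executable   = my_job.alpha
Requirements = Arch == "ALPHA" && OpSys == "OSF1"
Queue
If you do not set Arch and OpSys yourself, condor_submit requires the same platform as your submit machine by default, so the explicit Requirements line is what allows the Alpha binary to leave your Sparc desktop.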

86 86 Parallel Jobs in Condor  Condor can run parallel applications Written to the popular PVM message passing library Future work includes support for MPI  Master-Worker Paradigm  What does Condor-PVM do?  How to compile and submit Condor-PVM jobs

87 87 Master-Worker Paradigm Condor-PVM is designed to run PVM applications which follow the master-worker paradigm.  Master has a pool of work, sends pieces of work to the workers, manages the work and the workers  Worker gets a piece of work, does the computation, sends the result back

88 88 What does Condor-PVM do? Condor acts as the PVM resource manager.  All pvm_addhost requests get re-mapped to Condor. Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.  When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.

89 89 How to compile and submit Condor-PVM jobs  Binary Compatible Compile and link with PVM library just as normal PVM applications. No need to link with Condor.  Submit In the submit description file, set: universe = PVM machine_count =..
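A hedged sketch of a complete Condor-PVM submit file (names and counts hypothetical; the machine_count syntax should be checked against the manual):
universe      = PVM
executable    = master.pvm
machine_count = 1..10
Queue
Here 1..10 asks Condor to start the master once at least one machine is available and to keep adding hosts, up to ten, as they can be matched.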

90 90 Obtaining Condor  Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor  Complete Users and Administrators manual available http://www.cs.wisc.edu/condor/manual  Contracted Support is available  Questions? Email: condor-admin@cs.wisc.edu

