Condor Tutorial for Users INFN-Bologna, 6/29/99 Derek Wright Computer Sciences Department University of Wisconsin-Madison



2 Conventions Used In This Presentation  A slide with an all-yellow background is the beginning of a new “chapter” The slides after it will describe each entry on the yellow slide in great detail  A Condor tool that users would use will be in red italics  A ClassAd attribute name will be in blue  A UNIX shell command or file name will be in courier font

3 What is Condor?  A system for “High-Throughput Computing”  Lots of jobs over a long period of time, not a short burst of “high-performance”  Condor manages both resources (machines) and resource requests (jobs)  Supports additional features for jobs that are re-linked with Condor libraries: checkpointing remote system calls

4 What’s Condor Good For?  Managing a large number of jobs You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. Condor can handle inter-job dependencies (DAGMan)

5 What’s Condor Good For? (cont’d)  Robustness Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion If an execute machine crashes, you only lose work done since the last checkpoint Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover

6 What’s Condor Good For? (cont’d)  Giving you access to more computing resources Checkpointing allows your job to run on “opportunistic resources” (not dedicated) Checkpointing also provides “migration” - if a machine is no longer available, move! With remote system calls, you don’t even need an account on a machine where your job executes

7 What is a Condor Pool?  “Pool” can be a single machine, or a group of machines  Determined by a “central manager” - the matchmaker and centralized information repository  Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

8 What Kind of Job Do You Have?  You must know some things about your job to decide if and how it will work with Condor: What kind of I/O does it do? Does it use TCP/IP? (network sockets) Can the job be resumed? Is the job multi-process? (fork(), pvm_addhost(), etc.)

9 What Kind of I/O Does Your Job Do?  Interactive TTY  “Batch” TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files)  X Windows  NFS, AFS, or another network file system  Local file system  TCP/IP

10 What Does Condor Support?  Condor can support various combinations of these features in different “Universes”  Different Universes provide different functionality for your job: Vanilla Standard Scheduler PVM

11 What Does Condor Support?

12 Condor Universes  A Universe specifies a Condor runtime environment: STANDARD –Supports Checkpointing –Supports Remote System Calls –Has some limitations ( no fork(), socket(), etc.) VANILLA –Any Unix executable (shell scripts, etc) –No Condor Checkpointing or Remote I/O

13 Condor Universes (cont’d) PVM (Parallel Virtual Machine) –Allows you to run parallel jobs in Condor (more on this later) SCHEDULER –Special kind of Condor job: the job is run on the submit machine, not a remote execute machine –Job is automatically restarted if the condor_schedd is shut down –Used to schedule jobs (e.g. DAGMan)

14 Submitting Jobs to Condor  Choosing a “Universe” for your job (already covered this)  Preparing your job Making it “batch-ready” Re-linking if checkpointing and remote system calls are desired (condor_compile)  Creating a submit description file  Running condor_submit Sends your request to the User Agent (condor_schedd)

15 Preparing Your Job  Making your job “batch-ready” Must be able to run in the background: no interactive input, windows, GUI, etc. Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices If your job expects input from the keyboard, you have to put the input you want into a file

16 Preparing Your Job (cont’d)  If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor’s special libraries  To do this, you use condor_compile Place “condor_compile” in front of the command you normally use to link your job: condor_compile gcc -o myjob myjob.c

17 Creating a Submit Description File  A plain ASCII text file  Tells Condor about your job: Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)  Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.

18 Example Submit Description File
    # Example condor_submit input file
    # (Lines beginning with # are comments)
    # NOTE: the words on the left side are not
    # case sensitive, but filenames are!
    Universe   = standard
    Executable = /home/wright/condor/my_job.condor
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    Log        = my_job.log
    Arguments  = -arg1 -arg2
    InitialDir = /home/wright/condor/run_1
    Queue

19 Example Submit Description File Described  Submits a single job to the standard universe, specifies files for STDIN, STDOUT and STDERR, creates a UserLog, defines command-line arguments, and specifies the directory the job should be run in  Equivalent to (outside of Condor):
    % cd /home/wright/condor/run_1
    % /home/wright/condor/my_job.condor -arg1 -arg2 \
        > my_job.stdout 2> my_job.stderr \
        < my_job.stdin

20 “Clusters” and “Processes”  If your submit file describes multiple jobs, we call this a “cluster”  Each job within a cluster is called a “process” or “proc”  If you only specify one job, you still get a cluster, but it has only one process  A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)  Process numbers always start at 0
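The Job ID convention above can be illustrated with a short Python sketch. This is not part of Condor; `parse_job_id` and `job_ids` are hypothetical helper names chosen only for this example.

```python
def parse_job_id(job_id):
    """Split a Condor-style job ID 'cluster.proc' into two integers."""
    cluster, proc = job_id.split(".")
    return int(cluster), int(proc)


def job_ids(cluster, count):
    """List the job IDs of a cluster with `count` processes, numbered from 0."""
    return [f"{cluster}.{proc}" for proc in range(count)]


print(parse_job_id("23.5"))
print(job_ids(23, 3))
```

A single-job submission would correspond to `job_ids(cluster, 1)`, which yields one ID ending in `.0`.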

21 Example Submit Description File for a Cluster
    # Example condor_submit input file that defines
    # a whole cluster of jobs at once
    Universe   = standard
    Executable = /home/wright/condor/my_job.condor
    Input      = my_job.stdin
    Output     = my_job.stdout
    Error      = my_job.stderr
    Log        = my_job.log
    Arguments  = -arg1 -arg2
    InitialDir = /home/wright/condor/run_$(Process)
    Queue 500

22 Example Submit Description File for a Cluster - Described  Now, the initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 500” to submit 500 jobs at once  $(Process) will be expanded to the process number for each job in the cluster (from 0 up to 499 in this case), so we’ll have “run_0”, “run_1”, … “run_499” directories  All the input/output files will be in different directories!
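To make the macro expansion concrete, here is a minimal Python sketch that mimics how $(Process) turns one template into 500 directory names. It only imitates the substitution described on the slide; `expand_process` is a hypothetical name, and matching the macro in any letter case is an assumption based on the slides writing both $(Process) and $(process).

```python
import re

# Matches $(Process) in any letter case (an assumption for this sketch).
PROCESS_MACRO = re.compile(r"\$\(process\)", re.IGNORECASE)


def expand_process(template, count):
    """Return `template` with $(Process) replaced by 0 .. count-1."""
    return [PROCESS_MACRO.sub(str(proc), template) for proc in range(count)]


dirs = expand_process("/home/wright/condor/run_$(Process)", 500)
print(dirs[0], dirs[-1])
```

Since condor_submit does the real expansion, a sketch like this is mainly useful for creating the run_0 … run_499 directories before submitting.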

23 Running condor_submit  You give condor_submit the name of the submit file you have created  condor_submit parses the file and creates a “ClassAd” that describes your job(s)  Creates the files you specified for STDOUT and STDERR  Sends your job’s ClassAd(s) and executable to the condor_schedd, which stores the job in its queue

24 Monitoring Your Jobs  Using condor_q  Using a “User Log” file  Using condor_status  Using condor_rm  Getting email from Condor  Once they complete, you can use condor_history to examine them

25 Using condor_q  To view the jobs you have submitted, you use condor_q  Displays the status of your job, how much compute time it has accumulated, etc.  Many different options: A single job, a single cluster, all jobs that match a certain constraint, or all jobs Can view remote job queues (either individual queues, or “-global”)

26 Using a “User Log” file  A UserLog must be specified in your submit file: Log = filename  You get a log entry for everything that happens to your job: When it was submitted, when it starts executing, if it is checkpointed or vacated, if there are any problems, etc.  Very useful! Highly recommended!

27 Using condor_status  To view the status of the whole Condor pool, you use condor_status  Can use the “-run” option to see which machines are running jobs, as well as: The user who submitted each job The machine they submitted from  Can also view the status of various submitters with “-submitter”

28 Using condor_rm  If you want to remove a job from the Condor queue, you use condor_rm  You can only remove jobs that you own (you can’t run condor_rm on someone else’s jobs unless you are root)  You can give specific job ID’s (cluster or cluster.proc), or you can remove all of your jobs with the “-a” option.

29 Getting email from Condor  By default, Condor will send you email when your job completes  If you don’t want this email, put this in your submit file: notification = never  If you want email every time something happens to your job (checkpoint, exit, etc), use this: notification = always

30 Getting email from Condor (cont’d)  If you only want email if your job exits with an error, use this: notification = error  By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this: notify_user =

31 Using condor_history  Once your job completes, it will no longer show up in condor_q  Now, you must use condor_history to view the job’s ClassAd  The status field (“ST”) will have either a “C” for “completed”, or an “X” if the job was removed with condor_rm

32 Any questions?  Nothing is too basic  If I was unclear, you probably are not the only person who doesn’t understand, and the rest of the day will be even more confusing

Hands-On Exercise #1 Submitting and Monitoring a Simple Test Job

34 Hands-On Exercise #1  Login to your machine as user “condor”  You will see two windows: Netscape, with instructions An xterm, where you execute commands  To begin, click on Simple Test Job  Please follow the directions carefully  Any lines beginning with % are commands that you should execute in your xterm  If you accidentally exit Netscape, click on “Tutorial” in the Start menu

Lunch break Please be back by 13:30

Welcome Back

37 Classified Advertisements  ClassAds Language for expressing attributes Semantics for evaluating them  Intuitively, a ClassAd is a set of named expressions Each named expression is an attribute  Expressions are similar to C … Constants, attribute references, operators

38 Classified Advertisements: Example
    MyType = "Machine"
    TargetType = "Job"
    Name = "froth.cs.wisc.edu"
    StartdIpAddr=" "
    Arch = "INTEL"
    OpSys = "SOLARIS26"
    VirtualMemory =
    Disk =
    KFlops =
    Mips = 103
    LoadAvg =
    KeyboardIdle = 12
    Cpus = 1
    Memory = 128
    Requirements = LoadAvg 15 * 60
    Rank = 0

39 Classified Advertisements: Matching  ClassAds are always considered in pairs: Does ClassAd A match ClassAd B (and vice versa)? This is called “2-way matching”  If the same attribute appears in both ClassAds, you can specify which attribute you mean by putting “MY.” or “TARGET.” in front of the attribute name

40 Classified Advertisements: Examples
 ClassAd A
    MyType = "Apartment"
    TargetType = "ApartmentRenter"
    SquareArea = 3500
    RentOffer = 1000
    HeatIncluded = False
    OnBusLine = True
    Rank = UnderGrad==False + TARGET.RentOffer
    Requirements = MY.RentOffer - TARGET.RentOffer < 150
 ClassAd B
    MyType = "ApartmentRenter"
    TargetType = "Apartment"
    UnderGrad = False
    RentOffer = 900
    Rank = 1/(TARGET.RentOffer ) + 50*HeatIncluded
    Requirements = OnBusLine && SquareArea > 2700
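The two-way matching of the apartment and renter ads can be sketched in Python. This is only an illustration, not Condor’s ClassAd evaluator: the dict layout, the lambda-based Requirements, and the name `two_way_match` are assumptions made for this example, and Rank is left out.

```python
# Each ad is a dict of attributes plus a Requirements function taking
# (my, target), mirroring the MY. and TARGET. prefixes on the slide.
apartment = {
    "MyType": "Apartment",
    "SquareArea": 3500,
    "RentOffer": 1000,
    "HeatIncluded": False,
    "OnBusLine": True,
    "Requirements": lambda my, target: my["RentOffer"] - target["RentOffer"] < 150,
}

renter = {
    "MyType": "ApartmentRenter",
    "UnderGrad": False,
    "RentOffer": 900,
    # OnBusLine and SquareArea are not defined in the renter's own ad,
    # so this sketch looks them up in the other (target) ad.
    "Requirements": lambda my, target: target["OnBusLine"] and target["SquareArea"] > 2700,
}


def two_way_match(a, b):
    """True only if each ad's Requirements accepts the other ad."""
    return bool(a["Requirements"](a, b)) and bool(b["Requirements"](b, a))


print(two_way_match(apartment, renter))  # True: 1000 - 900 < 150, and 3500 > 2700
```

A renter offering only 800 would fail the apartment’s Requirements (1000 - 800 is not below 150), so the pair would no longer match even though the renter’s own Requirements still hold.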

41 ClassAds in the Condor System  ClassAds allow Condor to be a general system Constraints and ranks on matches expressed by the entities themselves Only priority logic integrated into the Match-Maker  All principal entities in the Condor system are represented by ClassAds Machines, Jobs, Submitters

42 ClassAds in Condor: Requirements and Rank (Example for Machines)
    Friend = Owner == "tannenba" || Owner == "wright"
    ResearchGroup = Owner == "jbasney" || Owner == "raman"
    Trusted = Owner != "rival" && Owner != "riffraff"
    Requirements = Trusted && ( ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60) )
    Rank = Friend + ResearchGroup*10

43 Requirements for Machine Example Described  Machine will never start a job submitted by “rival” or “riffraff”  If someone from ResearchGroup (“jbasney” or “raman”) submits a job, it will always run, regardless of keyboard activity or load average  If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3

44 Machine Rank Example Described  If the machine is running a job submitted by owner “foo”, it will give this a Rank of 0, since foo is neither a friend nor in the same research group  If “wright” or “tannenba” submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup is 0)  If “raman” or “jbasney” submit a job, it will have a rank of 10  While a machine is running a job, it will be preempted for a higher ranked job
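The machine policy described on the last two slides can be checked with a small Python sketch. It restates the slide’s expressions as ordinary Python rather than ClassAd syntax; `machine_policy` is a hypothetical name, and the thresholds (load average below 0.3, keyboard idle for more than 15 minutes) are taken from the description above.

```python
def machine_policy(owner, load_avg, keyboard_idle):
    """Return (requirements, rank) for a job submitted by `owner`.

    keyboard_idle is in seconds, as in the 15 * 60 on the slide.
    """
    friend = owner in ("tannenba", "wright")
    research_group = owner in ("jbasney", "raman")
    trusted = owner not in ("rival", "riffraff")
    idle_enough = load_avg < 0.3 and keyboard_idle > 15 * 60
    requirements = trusted and (research_group or idle_enough)
    # Booleans count as 0 or 1 in arithmetic, as in the Rank expression.
    rank = int(friend) + int(research_group) * 10
    return requirements, rank


for owner in ("foo", "wright", "raman", "rival"):
    print(owner, machine_policy(owner, load_avg=0.1, keyboard_idle=20 * 60))
```

With an idle, lightly loaded machine, only “rival” is refused; on a busy machine, only the research group members would still be accepted.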

45 ClassAds in Condor: Requirements and Rank (Example for Jobs)
    Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20
    Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )

46 Job Example Described  The job must run on an Intel CPU, running Linux, with more than 20 megs of RAM  All machines with 32 megs of RAM or less are Ranked at 0  Machines with more than 32 megs of RAM are ranked according to how much RAM they have, if the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in Million Instructions Per Second
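The arithmetic in the job’s Rank expression is easier to follow with numbers plugged in. The Python sketch below restates the two expressions from the job example; the function names and the sample machine values are assumptions chosen for illustration.

```python
def job_requirements(arch, opsys, memory):
    """Evaluate the job's Requirements expression for one machine."""
    return arch == "INTEL" and opsys == "LINUX" and memory > 20


def job_rank(memory, is_dedicated, mips):
    """Evaluate the job's Rank expression for one machine.

    memory is in megabytes; is_dedicated is 0 or 1.
    """
    return int(memory > 32) * (memory * 100 + is_dedicated * 10000 + mips)


print(job_rank(32, 1, 500))   # 0: machines with 32 MB or less always rank 0
print(job_rank(64, 0, 103))   # 6503 = 64*100 + 0 + 103
print(job_rank(64, 1, 103))   # 16503: being dedicated adds 10000
```

The multiplication by (Memory > 32) is what zeroes out the rank for small machines, no matter how fast or dedicated they are.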

47 Finding and Using the ClassAd Attributes in your Pool  Condor defines a number of attributes by default, which are listed in the User Manual (“About Requirements and Rank”)  To see if machines in your pool have other attributes defined, use: condor_status -long  A custom-defined attribute might not be defined on all machines in your pool, so you’ll probably want to use “meta-operators”

48 ClassAd “Meta-Operators”  Meta operators allow you to compare against “UNDEFINED” as if it were a real value: =?= is “meta-equal-to” =!= is “meta-not-equal-to” Color != “Red” (non-meta) would evaluate to UNDEFINED if Color is not defined Color =!= “Red” would evaluate to True if Color is not defined, since UNDEFINED is not “Red”
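The difference between the ordinary and the meta operators can be sketched in Python with a sentinel standing in for UNDEFINED. This models only the behavior described on the slide, not the full ClassAd language; the `Undefined` class and the function names are assumptions for this example.

```python
class Undefined:
    """Stand-in for the ClassAd UNDEFINED value."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = Undefined()


def not_equal(ad, name, value):
    """Ordinary != : the result is UNDEFINED if the attribute is missing."""
    attr = ad.get(name, UNDEFINED)
    if attr is UNDEFINED:
        return UNDEFINED
    return attr != value


def meta_not_equal(ad, name, value):
    """=!= : always True or False; a missing attribute is simply not `value`."""
    attr = ad.get(name, UNDEFINED)
    return attr is UNDEFINED or attr != value


def meta_equal(ad, name, value):
    """=?= : always True or False; a missing attribute never equals `value`."""
    attr = ad.get(name, UNDEFINED)
    return attr is not UNDEFINED and attr == value


machine = {"Arch": "INTEL"}                   # no Color attribute defined
print(not_equal(machine, "Color", "Red"))       # UNDEFINED
print(meta_not_equal(machine, "Color", "Red"))  # True
```

In a Requirements expression, the meta form is the safer choice for custom attributes, because it always yields a definite True or False.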

Hands-On Exercise #2 Submitting Jobs with Requirements and Rank

50 Hands-On Exercise #2  Please point your browser to the new instructions: Go back to the tutorial homepage Click on Requirements and Rank Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm  If you exited Netscape, just click on “Tutorial” from your Start menu

51 Priorities In Condor  Two kinds of priorities: User Priorities –Priorities between users in the pool to ensure fairness –The lower the value, the better the priority Job Priorities –Priorities that users give to their own jobs to determine the order in which they will run –The higher the value, the better the priority –Only matters within a given user’s jobs

52 User Priorities in Condor  Each active user in the pool has a user priority  Viewed or changed with condor_userprio  The lower the number, the better  A given user’s share of available machines is inversely related to the ratio between user priorities. Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
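The Fred and Joe example can be worked through in a few lines of Python. This is a simplified model that assumes each user’s share is exactly proportional to 1/priority; Condor’s real accounting involves more than this, and `machine_shares` and the pool size of 30 machines are assumptions for illustration.

```python
def machine_shares(priorities, machines):
    """Split `machines` among users in inverse proportion to priority value."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: machines * weight / total for user, weight in weights.items()}


shares = machine_shares({"fred": 10, "joe": 20}, machines=30)
print(shares)
```

With 30 machines, Fred’s share comes out at about 20 and Joe’s at about 10, which is the two-to-one split stated on the slide.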

53 User Priorities in Condor, cont.  Condor continuously adjusts user priorities over time machines allocated > priority, priority worsens machines allocated < priority, priority improves  Priority Preemption Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…) Starvation is prevented Priority “thrashing” is prevented

54 Job Priorities in Condor  Can be set at submit-time in your description file with: prio =  Can be viewed with condor_q  Can be changed at any time with condor_prio  The higher the number, the more likely the job will run (only among the jobs of an individual user)

55 Managing a Large Cluster of Jobs  Condor can manage huge numbers of jobs  Special features of the submit description file make this easier  Condor can also manage inter-job dependencies with condor_dagman For example: job A should run first, then, run jobs B and C, when those finish, submit D, etc… We’ll discuss DAGMan later

56 Submitting a Large Cluster  Anywhere in your submit file, if you use $(Process), that will expand to the process number of each job in the cluster: input = my_input.$(process) arguments = $(process)  It is common to use $(Process) to specify InitialDir, so that each process runs in its own directory: InitialDir = dir.$(process)

57 Submitting a Large Cluster (cont’d)  Can either have multiple Queue entries, or put a number after Queue to tell Condor how many to submit: Queue 1000  A cluster is more efficient: Your jobs will run faster, and they’ll use less space  Can only have one executable per cluster: Different executables must be different clusters!

Hands-On Exercise #3 Submitting a Large Cluster of Jobs

59 Hands-On Exercise #3  Please point your browser to the new instructions: Go back to the tutorial homepage Click on Large Clusters Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm  If you exited Netscape, just click on “Tutorial” from your Start menu

10 Minute Break Questions are welcome….

61 Inter-Job Dependencies with DAGMan  DAGMan can be used to handle a set of jobs that must be run in a certain order  Also provides “pre” and “post” operations, so you can have a program or script run before each job is submitted and after it completes  Robust: handles errors and submit-machine crashes

62 Using DAGMan  You define a DAG description file, which is similar in function to the submit file you give to condor_submit  DAGMan restrictions: Each job in the DAG must be in its own cluster (this is a limitation we will remove in future versions) All jobs in the DAG must have a User Log and must share the same file

63 Format of the DAGMan Description File  # is a comment  First section names the jobs in your DAG and associates a submit description file with each job  Second (optional) section defines PRE and POST scripts to run  Final section defines the job dependencies

64 Example DAGMan Description File
    # Example DAGMan input file
    Job A A.submit
    Job B B.submit
    Job C C.submit
    Job D D.submit
    Script PRE D d_input_checker
    Script POST A a_output_processor A.out
    PARENT A CHILD B C
    PARENT B C CHILD D
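To see what order the PARENT/CHILD lines imply, the following Python sketch parses them and computes one valid run order with the standard library’s `graphlib` (Python 3.9 or later). It illustrates the dependency graph only and is not how DAGMan itself is implemented; `parse_dependencies` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

DAG_LINES = [
    "PARENT A CHILD B C",
    "PARENT B C CHILD D",
]


def parse_dependencies(lines):
    """Map each job to the set of parent jobs it must wait for."""
    waits_for = {}
    for line in lines:
        words = line.split()
        split_at = words.index("CHILD")
        parents, children = words[1:split_at], words[split_at + 1:]
        for parent in parents:
            waits_for.setdefault(parent, set())
        for child in children:
            waits_for.setdefault(child, set()).update(parents)
    return waits_for


order = list(TopologicalSorter(parse_dependencies(DAG_LINES)).static_order())
print(order)
```

A is always first and D always last; B and C may appear in either order, since neither depends on the other.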

65 Setting up a DAG for Condor  Must create the DAG description file  Must create all the submit description files for the individual jobs  Must prepare any executables you plan to use  If you want, you can have a mix of Vanilla and Standard jobs  Must set up any PRE/POST commands or scripts you wish to use

66 Submitting a DAG to Condor  Once you have everything in place, to submit a DAG, you use condor_submit_dag and give it the name of your DAG description file  This will check your input file for errors and submit a copy of condor_dagman as a scheduler universe job with all the necessary command-line arguments

67 Removing a DAG  Removing a DAG is easy: Just use condor_rm on the scheduler universe job (condor_dagman) On shutdown, DAGMan will remove any jobs that are currently in the queue that are associated with its DAG Once all jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue

Hands-On Exercise #4 Using DAGMan

69 Hands-On Exercise #4  Please point your browser to the new instructions: Go back to the tutorial homepage Click on Using_DAGMan Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm  If you exited Netscape, just click on “Tutorial” from your Start menu

70 What’s Wrong with my Vanilla Job?  Special requirements expressions for vanilla jobs  You didn’t submit it from a directory that is shared  Condor isn’t running as root (more on this later)  You don’t have your file permissions set up correctly (more on this later)

71 Special Requirements Expressions for Vanilla Jobs  When you submit a vanilla job, Condor automatically appends two extra Requirements, so that the job only matches machines in the same domains as the submit machine: UID_DOMAIN == (the submit machine’s UID domain) FILESYSTEM_DOMAIN == (the submit machine’s file system domain)  Since there are no remote system calls with Vanilla jobs, they depend on a shared file system and a common UID space to run as you and access your files

72 Special Requirements Expressions for Vanilla Jobs  By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure your pool specially if there really is a common UID space and a network file system  If you don’t have an account on the remote system, Vanilla jobs won’t work

73 Shared File Systems for Vanilla Jobs  Just because you have AFS or NFS doesn’t mean ALL files are shared Initialdir = /tmp will probably cause trouble for Vanilla jobs!  You must be sure to set Initialdir to a shared directory (or cd into it to run condor_submit) for Vanilla jobs

74 Why Don’t My Jobs Run?  Try using condor_q -analyze  Try specifying a User Log for your job  Look at condor_userprio: maybe you have a bad priority and higher priority users are being served  Problems with file permissions or network file systems  Look at the SchedLog

75 Using condor_q -analyze  condor_q -analyze will analyze your job’s ClassAd, get all the ClassAds of the machines in the pool, and tell you what’s going on: Will report errors in your Requirements expression (impossible to match, etc.) Will tell you about user priorities in the pool (other people have better priority)

76 Looking at condor_userprio  You can look at condor_userprio yourself  If your priority value is a really high number (because you’ve been running a lot of Condor jobs), other users will have priority to run jobs in your pool

77 File Permissions in Condor  If Condor isn’t running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually “condor”)  You must grant this user write access to your output files, and read access to your input files (both STDOUT, STDIN from your submit file, as well as files your job explicitly opens)

78 File Permissions in Condor (cont’d)  Often, there will be a “condor” group and you can make your files owned and writable by this group  For vanilla jobs, even if the UID_DOMAIN setting is correct and matches on your submit and execute machines, if Condor isn’t running as root, your job will be started as user “condor”, not as you!

79 Problems with NFS in Condor  For NFS, sometimes the administrators will set up read-only mounts, or have UIDs remapped for certain partitions (the classic example is root = nobody, but modern NFS can do arbitrary remappings)

80 Problems with NFS in Condor (cont’d)  If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir (the directory you were in when you ran condor_submit) might not exist on a remote machine E.g. you’re in /mnt/tmp/home/me/...  With automounting, you always need to specify InitialDir explicitly InitialDir = /home/me/...

81 Problems with AFS in Condor  If your pool uses AFS, the condor_shadow, even if it’s running with your UID, will not have your AFS token You must grant an unauthenticated AFS user the appropriate access to your files Some sites provide a better alternative than world-writable files –Host ACLs –Network-specific ACLs

82 Looking at the SchedLog  Looking at the log file of the condor_schedd, the “SchedLog” file can possibly give you a clue if there are problems Find it with: condor_config_val schedd_log You might need your pool administrator to turn on a higher “debugging level” to see more verbose output

83 Other User Features  Submit-Only installation  Heterogeneous Submit  PVM jobs

84 Submit-Only Installation  Can install just a condor_master and condor_schedd on your machine  Can submit jobs into a remote pool  Special option to condor_install

85 Heterogeneous Submit  The job you submit doesn’t have to be the same platform as the machine you submit from Maybe you have access to a pool that’s full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain  You can take an Alpha binary, copy it to your Sparc, and submit it with a requirements expression that says you need to run on ALPHA/OSF1

86 Parallel Jobs in Condor  Condor can run parallel applications Written to the popular PVM message passing library Future work includes support for MPI  Master-Worker Paradigm  What does Condor-PVM do?  How to compile and submit Condor-PVM jobs

87 Master-Worker Paradigm Condor-PVM is designed to run PVM applications which follow the master-worker paradigm.  Master has a pool of work, sends pieces of work to the workers, manages the work and the workers  Worker gets a piece of work, does the computation, sends the result back
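The master-worker pattern itself can be sketched in plain Python. This example uses threads on a single machine from the standard library, not PVM or Condor-PVM, and the squaring “computation” is a placeholder chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(piece):
    """Worker: take one piece of work, compute, and return the result."""
    return piece * piece


def master(work, workers=4):
    """Master: hand out pieces of work and collect the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, work))


results = master(range(10))
print(results)
```

The property that matters for Condor-PVM is that each piece of work is independent, so the master can hand a piece to another worker if one disappears.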

88 What does Condor-PVM do? Condor acts as the PVM resource manager.  All pvm_addhost requests get re-mapped to Condor. Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.  When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.

89 How to compile and submit Condor-PVM jobs  Binary Compatible Compile and link with PVM library just as normal PVM applications. No need to link with Condor.  Submit In the submit description file, set: universe = PVM machine_count =..

90 Obtaining Condor  Condor can be downloaded from the Condor web site  Complete Users and Administrators manuals are available  Contracted support is available  Questions?