
1 Installing and Using Condor
Condor Project, Computer Sciences Department, University of Wisconsin-Madison

2 What is Condor?
› High-Throughput Computing system
   Emphasizes long-term productivity
› Many features for local and global computing
› Limited focus for today
   Managing a cluster of machines and the jobs that will run on them

3 Condor Pool Machine Roles
› Central Manager
   Matches jobs to machines
   Daemons: master, collector, negotiator
› Submit Machine
   Manages jobs
   Daemons: master, schedd
› Execute Machine
   Runs jobs
   Daemons: master, startd
› Every machine plays one or more of these roles

4 Condor Daemon Layout
[diagram: on a Personal Condor / Central Manager host, the master spawns the collector, negotiator, schedd, and startd]

5 condor_master
› Starts up all other Condor daemons
› Runs on all Condor hosts
› If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
› Acts as the server for many Condor remote administration commands (see the example below):
   condor_reconfig, condor_restart
   condor_off, condor_on
   condor_config_val
   etc.
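For instance, because the master answers these commands, an administrator can manage a specific host from anywhere in the pool (hostname assumed for illustration):

   condor_reconfig -name node4.example.edu    ← re-read configuration on that host
   condor_off -name node4.example.edu         ← shut its daemons down
   condor_on -name node4.example.edu          ← start them back up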

6 Central Manager: condor_collector
› Collects information from all other Condor daemons in the pool
   "Directory Service" / Database for a Condor pool
   Each daemon sends a periodic update ClassAd to the collector
› Services queries for information:
   Queries from other Condor daemons
   Queries from users (condor_status)
› Only on the Central Manager(s)
› At least one collector per pool

7 Condor Pool Layout: Collector
[diagram: the Central Manager's master spawns the collector and negotiator; ClassAd updates flow from the pool's daemons to the collector]

8 Central Manager: condor_negotiator
› Performs "matchmaking" in Condor
› Each "Negotiation Cycle" (typically 5 minutes; see how to check this below):
   Gets information from the collector about all available machines and all idle jobs
   Tries to match jobs with machines that will serve them
   Both the job and the machine must satisfy each other's requirements
› Only one negotiator per pool
   Ignoring HAD (the High Availability Daemon)
› Only on the Central Manager(s)
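The cycle length is itself a configuration setting, NEGOTIATOR_INTERVAL, which you can inspect with condor_config_val (300 seconds matches the "typically 5 minutes" above):

   condor_config_val NEGOTIATOR_INTERVAL
   300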

9 Condor Pool Layout: Negotiator
[diagram: same layout as above, highlighting the negotiator spawned by the Central Manager's master]

10 Execute Hosts: condor_startd
› Represents a machine to the Condor system
› Responsible for starting, suspending, and stopping jobs
› Enforces the wishes of the machine owner (the owner's "policy"... more on this in the administrator's tutorial)
› Creates a "starter" for each running job
› One startd runs on each execute node

11 Condor Pool Layout: startd
[diagram: cluster nodes run master + startd; workstations run master + startd + schedd; all startds report ClassAds to the Central Manager's collector]

12 Submit Hosts: condor_schedd
› Condor's Scheduler Daemon
› One schedd runs on each submit host
› Maintains the persistent queue of jobs
› Responsible for contacting available machines and sending them jobs
› Services user commands which manipulate the job queue:
   condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, ...
› Creates a "shadow" for each running job

13 Condor Pool Layout: schedd
[diagram: same pool, highlighting the schedds on the Central Manager and workstations; cluster nodes run only master + startd]

14 Condor Pool Layout: master
[diagram: same pool, highlighting that every machine — Central Manager, cluster nodes, workstations — runs a master, which spawns that machine's other daemons]

15 Job Startup
[diagram: condor_submit hands the job to the submit machine's schedd, which holds it in the queue (Q) and spawns a shadow (S) for the running job; the Central Manager's collector and negotiator make the match; the execute machine's startd spawns a starter, which runs the job (J) linked with the Condor syscall library]

16 Condor ClassAds

17 What is a ClassAd?
› Condor's internal data representation
   Similar to a classified ad in a newspaper
    Or Craigslist
    Or 58.com, baixing.com, ganji.com
   Represents an object & its attributes
    Usually many attributes
   Can also describe what an object matches with

18 ClassAd Types
› Condor has many types of ClassAds
   A Job ClassAd represents a job to Condor
    condor_q -long shows full job ClassAds
   A Machine ClassAd represents a machine within the Condor pool
    condor_status -long shows full machine ClassAds
   Other ClassAds represent other pieces of the Condor pool
   Job and Machine ClassAds are matched to each other by the negotiator daemon

19 ClassAds Explained
› ClassAds can contain a lot of details
   The job's executable is "cosmos"
   The machine's load average is 5.6
› ClassAds can specify requirements
   My job requires a machine with Linux
› ClassAds can specify rank
   This machine prefers to run jobs from the physics group (sketched as expressions below)
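Written as ClassAd expressions, those examples might look like this (a sketch; the exact attribute name used for the group is an assumption, not necessarily what a real pool defines):

In a job ClassAd:
   Cmd          = "cosmos"
   Requirements = (TARGET.OpSys == "LINUX")

In a machine ClassAd:
   LoadAvg = 5.6
   Rank    = (TARGET.AccountingGroup == "group_physics")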

20 Example Machine Ad
~]# condor_status -l
Machine = "creamce.foo"
EnteredCurrentState = ...
JavaVersion = "1.4.2"
CpuIsBusy = false
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= ... )
TotalVirtualMemory = ...
LoadAvg = 0.0
CondorLoadAvg = ...
~]#

22 Normal Condor Installation (Don't Do This Today)
› Go to Condor's Yum repository page
› Follow the instructions there
   Use condor-stable-rhel4.repo
   Ignore the optional steps

23 Normal Condor Installation (Don't Do This Today)
› Example
   cd /etc/yum.repos.d
   wget .../repo.d/condor-stable-rhel5.repo
   yum install condor.x86_64
   service condor start
   ps -ef | grep condor

24 Condor Install For Today
› We'll use a locally-cached copy of Condor
   cd /root
   wget .../condor-<version>.rhel5.x86_64.rpm
   yum localinstall condor-<version>.rhel5.x86_64.rpm
   service condor start
   ps -ef | grep condor

25 Good Install Results
~]# ps -ef | grep condor
condor  ...  /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
condor  ...  condor_collector -f
condor  ...  condor_negotiator -f
condor  ...  condor_schedd -f
condor  ...  condor_startd -f
root    ...  condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R ... -S 60 -C 101
root    ...  grep condor
~]# condor_status
Name         OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime
creamce.foo  LINUX  X86_64  Unclaimed  Idle      ...     ...  ...:04:42
                     Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
        X86_64/LINUX     1      0        0          1        0           0         0
               Total     1      0        0          1        0           0         0
~]#

26 Running a Job
› Create a regular user account and switch to it
   adduser joe
   su - joe
› Create a submit description file
› Call condor_submit
› Monitor the job's status with condor_q (the full session is sketched below)
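End to end, the session looks roughly like this (the submit file name is assumed; its contents are on the next slide):

   [root@host ~]# adduser joe
   [root@host ~]# su - joe
   [joe@host ~]$ condor_submit date.sub
   [joe@host ~]$ condor_q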

27 Simple Submit Description File
# simple submit description file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
Executable = /bin/date   ← Job's executable
#Input     = /dev/null   ← Job's STDIN
Output     = date.out    ← Job's STDOUT
Error      = date.err    ← Job's STDERR
Log        = date.log    ← Log the job's activities
Queue                    ← Put the job in the queue

28 Submitting the Job
~]$ condor_submit date.sub
Submitting job(s).
1 job(s) submitted to cluster 4.
~]$ condor_q
-- Submitter: creamce.foo : <...> : creamce.foo
 ID     OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 4.0    jfrey   5/10 22:...  ...:00:00   I  ...      date
1 jobs; 1 idle, 0 running, 0 held
~]$ condor_q
-- Submitter: creamce.foo : <...> : creamce.foo
 ID     OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
~]$

29 Try a Longer Job
› The 'I' in condor_q means the job is idle
› While a job is running, condor_q will show an 'R' and the RUN_TIME will increase
› To see a job as it runs, try making a script that sleeps for a minute:
   #!/bin/sh
   echo Hello
   sleep 60
   echo Goodbye
› Don't forget to run chmod 755 on it (a matching submit file is sketched below)
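A minimal submit file for that script could look like this (a sketch; the script is assumed to be saved as sleep.sh):

   Universe   = vanilla
   Executable = sleep.sh
   Output     = sleep.out
   Error      = sleep.err
   Log        = sleep.log
   Queue

Run condor_q a few times while it runs to watch the job go from I to R and then leave the queue.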

30 Sample Job Log
~]$ cat date.log
000 (004.000.000) 05/10 22:28:41 Job submitted from host: <...>
...
001 (004.000.000) 05/10 22:28:42 Job executing on host: <...>
...
005 (004.000.000) 05/10 22:28:42 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
~]$

31 Jobs, Clusters, and Processes
› If the submit description file describes multiple jobs, it is called a cluster
› Each cluster has a cluster number, which is unique to the job queue on a machine
› Each individual job within a cluster is called a process, and process numbers always start at zero
› A Condor Job ID is the cluster number, a period, and the process number (e.g. 2.1)
   A cluster can have a single process
    Job ID = 20.0 (cluster 20, process 0)
   Or, a cluster can have more than one process
    Job IDs: 21.0, 21.1, 21.2 (cluster 21, processes 0, 1, 2)

32 Submitting Several Jobs
# Example submit file for a cluster of 2 jobs
# with separate output, error and log files
Universe   = vanilla
Executable = /bin/date

Log    = date_0.log
Output = date_0.out
Error  = date_0.err
Queue          ← Job 102.0 (cluster 102, process 0)

Log    = date_1.log
Output = date_1.out
Error  = date_1.err
Queue          ← Job 102.1 (cluster 102, process 1)

33 Submitting Many Jobs
# Example submit file for a cluster of 10 jobs
# with separate output, error and log files
Universe   = vanilla
Executable = /bin/date
Log    = date_$(cluster).$(process).log
Output = date_$(cluster).$(process).out
Error  = date_$(cluster).$(process).err
Queue 10   ← 10 jobs; $(cluster) and $(process) are replaced with each job's cluster and process id (see below)
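So if this file is submitted as, say, cluster 103 (a number assumed for illustration), the jobs are 103.0 through 103.9, writing date_103.0.out through date_103.9.out, and likewise for the error and log files.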

34 Removing Jobs
› If you want to remove a job from the Condor queue, use condor_rm
› You can only remove jobs that you own
› A privileged user can remove any job
   "root" on UNIX / Linux
   "administrator" on Windows

35 Removing Jobs (continued)
› Remove an entire cluster:
   condor_rm 4      ← removes the whole cluster
› Remove a specific job from a cluster:
   condor_rm 4.0    ← removes a single job
› Or, remove all of your jobs with -a
   DANGEROUS!!
   condor_rm -a     ← removes all jobs / clusters

36 My Jobs Are Idle
› Our scientist runs condor_q and finds all his jobs are idle:
~]$ condor_q
-- Submitter: x.cs.wisc.edu : <...> : x.cs.wisc.edu
 ID     OWNER     SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 4.0    einstein  4/20 13:...  ...:00:00   I  ...      cosmos -arg1 -arg2
 5.0    einstein  4/20 12:...  ...:00:00   I  ...      cosmos -arg1 -n 0
 ...    einstein  4/20 12:...  ...:00:00   I  ...      cosmos -arg1 -n ...
 ...    einstein  4/20 12:...  ...:00:00   I  ...      cosmos -arg1 -n 7
8 jobs; 8 idle, 0 running, 0 held

37 Exercise a Little Patience
› On a busy pool, it can take a while to match and start your jobs
› Wait at least a negotiation cycle or two (typically a few minutes)

38 Check Machines' Status
~]$ condor_status
Name  OpSys    Arch    State      Activity  LoadAv  Mem  ActvtyTime
...   LINUX    X86_64  Claimed    Busy      ...     ...  ...:10:13
...   LINUX    X86_64  Claimed    Busy      ...     ...  ...:10:36
...   LINUX    X86_64  Claimed    Busy      ...     ...  ...:42:20
...   LINUX    X86_64  Claimed    Busy      ...     ...  ...:22:10
...
...   WINNT51  INTEL   Owner      Idle      ...     ...  [Unknown]
...   WINNT51  INTEL   Unclaimed  Idle      ...     ...  [Unknown]
...   WINNT51  INTEL   Claimed    Busy      ...     ...  [Unknown]
...
               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/WINNT51    ...    ...      ...        ...      ...         ...       ...
 X86_64/LINUX    ...    ...      ...        ...      ...         ...       ...
        Total    ...    ...      ...        ...      ...         ...       ...

39 Not Matching at All? condor_q -analyze
~]$ condor_q -analyze 29
The Requirements expression for your job is:

( ( target.Memory > 8192 ) ) && ( target.Arch == "INTEL" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

    Condition                                     Machines Matched  Suggestion
1   ( ( target.Memory > 8192 ) )                  0                 MODIFY TO ...
2   ( TARGET.FileSystemDomain == "cs.wisc.edu" )  584
3   ( target.Arch == "INTEL" )                    ...
4   ( target.OpSys == "LINUX" )                   ...
5   ( target.Disk >= 13 )                         1243

40 Learn about available resources:
~]$ condor_status -const 'Memory > 8192'
(no output means no matches)
~]$ condor_status -const 'Memory > 4096'
Name  OpSys  Arch    State      Activ  LoadAv  Mem  ActvtyTime
...   LINUX  X86_64  Unclaimed  Idle   ...     ...  ...:35:05
...   LINUX  X86_64  Unclaimed  Idle   ...     ...  ...:37:03
...   LINUX  X86_64  Unclaimed  Idle   ...     ...  ...:00:05
...   LINUX  X86_64  Unclaimed  Idle   ...     ...  ...:03:47
              Total  Owner  Claimed  Unclaimed  Matched  Preempting
X86_64/LINUX    ...    ...      ...        ...      ...         ...
       Total    ...    ...      ...        ...      ...         ...

41 Submit a Job That Won't Run
Universe   = vanilla
Executable = /bin/date
Output     = date.out
Error      = date.err
# Our machine doesn't have this much memory
Requirements = Memory > 8192
Log        = date.log
Queue

42 Submit and Run condor_q -analyze
-- Submitter: test17.epikh : <...> : test17.epikh
...: Run analysis summary. Of 4 machines,
      4 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
WARNING: Be advised: No resources matched request's constraints

The Requirements expression for your job is:

( target.Memory > 8192 ) && ( TARGET.Arch == "X86_64" ) &&
( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

    Condition                                      Machines Matched  Suggestion
1   ( target.Memory > 8192 )                       0                 MODIFY TO ...
2   ( TARGET.Arch == "X86_64" )                    4
3   ( TARGET.OpSys == "LINUX" )                    4
4   ( TARGET.Disk >= 1 )                           4
5   ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt undefined, JobVMMemory, ...E-04)) ) >= 1 )  4
6   ( TARGET.FileSystemDomain == "test17.epikh" )  4

43 Held Jobs
› Condor may place your jobs on hold if there's a problem running them...
~]$ condor_q
-- Submitter: x.cs.wisc.edu : <...> : x.cs.wisc.edu
 ID     OWNER     SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 4.0    einstein  4/20 13:...  ...:00:00   H  ...      cosmos -arg1 -arg2
 5.0    einstein  4/20 12:...  ...:00:00   H  ...      cosmos -arg1 -n 0
 ...    einstein  4/20 12:...  ...:00:00   H  ...      cosmos -arg1 -n ...
 ...    einstein  4/20 12:...  ...:00:00   H  ...      cosmos -arg1 -n 7
8 jobs; 0 idle, 0 running, 8 held

44 Look at Jobs on Hold
~]$ condor_q -hold
-- Submitter: submit.chtc.wisc.edu : <...> : submit.chtc.wisc.edu
 ID     OWNER     HELD_SINCE   HOLD_REASON
 6.0    einstein  4/20 13:23   Error from starter on skywalker.cs.wisc.edu
9 jobs; 8 idle, 0 running, 1 held
Or, see full details for a job:
~]$ condor_q -l 6.0
...
HoldReason = "Error from starter"
...

45 Look in the Job Log
› The job log will likely contain clues:
~]$ cat cosmos.log
000 (...) 04/20 14:47:31 Job submitted from host: <...>
...
007 (...) 04/20 15:02:00 Shadow exception!
        Error from starter on gig06.stat.wisc.edu:
        Failed to open '/scratch.1/einstein/workspace/v67/condor-test/test3/run_0/cosmos.in'
        as standard input: No such file or directory (errno 2)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...

46 Holding Jobs
› You can put jobs in the HELD state yourself, using condor_hold
   Same syntax and rules as condor_rm
› You can take jobs out of the HELD state with the condor_release command
   Again, same syntax and rules as condor_rm (see the example below)
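For example, to pause and later resume the cluster submitted earlier (cluster number assumed):

   condor_hold 4       ← the cluster's jobs show ST = H in condor_q
   condor_release 4    ← the jobs return to idle and can match again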

47 Configuration Files
("amp wiring" by "fbz_" © 2005, licensed under the Creative Commons Attribution 2.0 license, flickr.com/photos/fbz/...)

48 Configuration File
› Found in the file pointed to by the CONDOR_CONFIG environment variable (see the example below), /etc/condor/condor_config, or ~condor/condor_config
› All settings can be in this one file
› Might want to share it between all machines (NFS, automated copies, Wallaby, etc.)
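For example, to point the Condor tools at a non-default configuration file (path assumed for illustration):

   export CONDOR_CONFIG=/opt/condor/etc/condor_config
   condor_config_val -config    ← confirm which files are now in use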

49 Other Configuration Files
› LOCAL_CONFIG_FILE setting
   Comma separated, processed in order
   LOCAL_CONFIG_FILE = \
       /var/condor/config.local,\
       /var/condor/policy.local,\
       /shared/condor/config.$(HOSTNAME),\
       /shared/condor/config.$(OPSYS)

50 Configuration File Syntax
# I'm a comment!
CREATE_CORE_FILES = TRUE
MAX_JOBS_RUNNING  = 50
# Condor ignores case:
log = /var/log/condor
# Long entries:
collector_host = condor.cs.wisc.edu,\
    secondary.cs.wisc.edu

51 Configuration File Macros
› You reference other macros (settings) with $(NAME):
   SBIN   = /usr/sbin
   SCHEDD = $(SBIN)/condor_schedd
› Can create additional macros for organizational purposes (sketched below)
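A sketch of such an organizational macro (EMAIL_DOMAIN is a made-up name of our own; CONDOR_ADMIN is a real setting that takes the administrator's email address):

   EMAIL_DOMAIN = example.edu
   CONDOR_ADMIN = condor-admin@$(EMAIL_DOMAIN)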

52 Tools
("Tools" by "batega" © 2007, licensed under the Creative Commons Attribution 2.0 license)

53 Administrator Commands
› condor_vacate      Leave a machine now
› condor_on          Start Condor
› condor_off         Stop Condor
› condor_reconfig    Reconfig on-the-fly
› condor_config_val  View/set config
› condor_userprio    User priorities
› condor_stats       View detailed usage accounting stats

54 condor_config_val
› Find current configuration values:
% condor_config_val MASTER_LOG
/var/condor/logs/MasterLog
% cd `condor_config_val LOG`

55 condor_config_val -v
› Can identify the source of a setting:
% condor_config_val -v CONDOR_HOST
CONDOR_HOST: condor.cs.wisc.edu
  Defined in '/etc/condor_config.hosts', line 6

56 condor_config_val -config
› What configuration files are being used?
% condor_config_val -config
Config source:
    /var/home/condor/condor_config
Local config sources:
    /unsup/condor/etc/condor_config.hosts
    /unsup/condor/etc/condor_config.global
    /unsup/condor/etc/condor_config.policy
    /unsup/condor-test/etc/hosts/puffin.local

57 condor_fetchlog
› Retrieve daemon logs from a remote machine:
   condor_fetchlog beak.cs.wisc.edu Master

58 Querying Daemons: condor_status
› Queries the collector for information about daemons in your pool
› Defaults to finding condor_startd daemons
› condor_status -schedd summarizes all job queues
› condor_status -master returns a list of all condor_master daemons

59 condor_status
› -long displays the full ClassAd
› Optionally specify a machine name to limit results to a single host:
   condor_status -l node4.cs.wisc.edu

60 condor_status -constraint
› Only return ClassAds that match an expression you specify
› Show me idle machines with 1 GB or more memory:
   condor_status -constraint 'Memory >= 1024 && Activity == "Idle"'

61 condor_status -format
› Controls format of output
› Useful for writing scripts
› Uses C printf-style formats
   One field per argument
("slanting" by Stefano Mortellaro ("fazen") © 2005, licensed under the Creative Commons Attribution 2.0 license)

62 condor_status -format
› Census of systems in your pool:
% condor_status -format '%s ' Arch -format '%s\n' OpSys | sort | uniq -c
    797 INTEL LINUX
    118 INTEL WINNT...
    ... SUN4u SOLARIS28
      6 SUN4x SOLARIS28

63 Examining Queues: condor_q
› View the job queue
› The -long option is useful to see the entire ClassAd for a given job
› Supports -constraint and -format
› Can view job queues on remote machines with the -name option (example below)
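For example, to list just the idle jobs in a queue on another submit host (hostname assumed; JobStatus == 1 is the ClassAd encoding of the idle state):

   condor_q -name submit2.example.edu -constraint 'JobStatus == 1'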

64 condor_q -format
› Census of jobs per user:
% condor_q -format '%s ' Owner -format '%s\n' Cmd | sort | uniq -c
     64 adesmet /scratch/submit/a.out
      2 adesmet /home/bin/run_events
      4 smith /nfs/sim1/em2d3d
      4 smith /nfs/sim2/em2d3d

65 condor_q -analyze
› condor_q will try to figure out why the job isn't running
› Good at determining that no machine matches the job's Requirements expression

66 condor_q -analyze
› Typical intro:
% condor_q -analyze
...: Run analysis summary. Of 820 machines,
    458 are rejected by your job's requirements
     25 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      4 match, but reject the job for unknown reasons
      6 match, but will not currently preempt their existing job
    327 are available to run your job
Last successful match: Sun Apr 27 14:32:...

67 condor_q -analyze
› Continued, and heavily truncated:
The Requirements expression for your job is:
( ( target.Arch == "SUN4u" ) && ( target.OpSys == "WINNT50" ) && [snip]

    Condition                      Machines  Suggestion
1   ( target.Disk > ... )          0         MODIFY TO ...
2   ( target.Memory > 10000 )      0         MODIFY TO ...
3   ( target.Arch == "SUN4u" )     ...
4   ( target.OpSys == "WINNT50" )  110       MODIFY TO "SOLARIS28"

Conflicts: conditions: 3, 4

68 Adding Machines to Your Pool
› Install Condor on new machines
› Modify security settings on all machines to trust each other
› Modify condor_config.local on new machines (a sketch follows below)
   DAEMON_LIST: remove unwanted daemons
   CONDOR_HOST: set to hostname of central manager
› Start Condor on new machines
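A sketch of the condor_config.local for a dedicated execute node (central manager hostname assumed):

   CONDOR_HOST = cm.example.edu
   DAEMON_LIST = MASTER, STARTD    ← no schedd: this node only runs jobs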

69 Let's Make a Big Pool
› Edit /etc/condor/condor_config.local:
   DAEMON_LIST = MASTER, SCHEDD, STARTD
   CONDOR_HOST = test17.epikh
   ALLOW_WRITE = *
   ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME), $(CONDOR_HOST)
   NUM_CPUS = 4
› Run condor_restart -master
› condor_status should show more machines
   May take a couple of minutes

70 Security
› We're using host-based security
   Trust all packets from given IP addresses
   Only OK on a private network
› Stronger security options:
   Pool password
   OpenSSL
   GSI (with optional VOMS)
   Kerberos

71 File Transfer
› If your job needs data files, you'll need to have Condor transfer them for you
› Likewise, Condor can transfer result files back for you
› You need to place your data files where Condor can access them
› Sounds great! What do I need to do?

72 Specify File Transfer Lists
In your submit file:
› Transfer_Input_Files
   List of files for Condor to transfer from the submit machine to the execute machine
› Transfer_Output_Files
   List of files for Condor to transfer back from the execute machine to the submit machine
   If not specified, Condor will transfer back all "new" files in the execute directory

73 Condor File Transfer Controls
› Should_Transfer_Files
   YES: Always transfer files to execution site
   NO: Always rely on a shared file system
   IF_NEEDED: Condor will automatically transfer the files if the submit and execute machines are not in the same FileSystemDomain (translation: use a shared file system if available)
› When_To_Transfer_Output
   ON_EXIT: Transfer the job's output files back to the submitting machine only when the job completes
   ON_EXIT_OR_EVICT: Like above, but also when the job is evicted

74 File Transfer Example
# Example using file transfer
Universe                = vanilla
Executable              = cosmos
Log                     = cosmos.log
Should_Transfer_Files   = YES
Transfer_Input_Files    = cosmos.dat
Transfer_Output_Files   = results.dat
When_To_Transfer_Output = ON_EXIT
Queue

75 Create a Job That Uses Input and Output Files
› Sample script:
   #!/bin/sh
   echo Directory listing
   /bin/ls -l
   echo Here is my input file
   cat $1
   sleep 5
› Sample input file:
   I am the job's input!

76 Submit Your New Job
› Submit description file:
   universe                = vanilla
   executable              = test.sh
   arguments               = test.input
   output                  = out.$(cluster).$(process)
   error                   = err.$(cluster).$(process)
   transfer_input_files    = test.input
   should_transfer_files   = YES
   when_to_transfer_output = ON_EXIT
   queue 10

77 More Information
› https://condor-wiki.cs.wisc.edu/index.cgi/wiki
› condor-users mailing list
› support

