Monitoring HTCondor
Andrew Lahiff, STFC Rutherford Appleton Laboratory
European HTCondor Site Admins Meeting 2014
Introduction
Two aspects of monitoring
–General overview of the system: how many running/idle jobs? By user/VO? By schedd? How full is the farm? How many draining worker nodes?
–More detailed views: what are individual jobs doing? What’s happening on individual worker nodes? Health of the different components of the HTCondor pool... in addition to Nagios
Introduction
Methods
–Command-line utilities
–Ganglia
–Third-party applications (which run command-line tools or use the Python API)
Command line
Three useful commands
–condor_status: overview of the pool (including jobs, machines); information about specific worker nodes
–condor_q: information about jobs in the queue
–condor_history: information about completed jobs
Overview of jobs -bash-4.1$ condor_status -collector Name Machine RunningJobs IdleJobs HostsTotal condor01.gridpp.rl condor02.gridpp.rl
Overview of machines -bash-4.1$ condor_status -total Total Owner Claimed Unclaimed Matched Preempting Backfill X86_64/LINUX Total
Jobs by schedd -bash-4.1$ condor_status -schedd Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs arc-ce01.gridpp.rl.a arc-ce01.g arc-ce02.gridpp.rl.a arc-ce02.g arc-ce03.gridpp.rl.a arc-ce03.g arc-ce04.gridpp.rl.a arc-ce04.g arc-ce05.gridpp.rl.a arc-ce05.g cream-ce01.gridpp.rl cream-ce cream-ce02.gridpp.rl cream-ce lcg0955.gridpp.rl.ac lcg0955.gr lcgui03.gridpp.rl.ac lcgui03.gr lcgui04.gridpp.rl.ac lcgui04.gr lcgvm21.gridpp.rl.ac lcgvm21.gr TotalRunningJobs TotalIdleJobs TotalHeldJobs Total
Jobs by user, schedd -bash-4.1$ condor_status -submitters Name Machine RunningJobs IdleJobs HeldJobs arc-ce01.gridpp.rl arc-ce01.gridpp.rl group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl arc-ce01.gridpp.rl group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl arc-ce01.gridpp.rl arc-ce01.gridpp.rl group_CMS.prodcms_multicore. arc-ce01.gridpp.rl arc-ce01.gridpp.rl group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl …
…Jobs by user RunningJobs IdleJobs HeldJobs group_ALICE.alice.al group_ALICE.alice.al group_ALICE.alice_pi group_ATLAS.atlas.at group_ATLAS.atlas_pi group_ATLAS.atlas_pi group_ATLAS.prodatls group_CMS.cms.cmssgm group_CMS.cms_pilot group_CMS.cms_pilot group_CMS.cms_pilot group_CMS.prodcms.pc group_CMS.prodcms.pc group_CMS.prodcms_mu …
condor_q ~]# condor_q -- Submitter: arc-ce01.gridpp.rl.ac.uk : : arc-ce01.gridpp.rl.ac.uk ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) pcms054 12/3 12: :00:00 I (gridjob ) … 3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended
Multi-core jobs -bash-4.1$ condor_q -global -constraint 'RequestCpus > 1' -- Schedd: arc-ce01.gridpp.rl.ac.uk : ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD pcms004 12/5 14: :15:07 R (gridjob ) pcms004 12/5 14: :12:02 R (gridjob ) pcms004 12/5 14: :00:00 I (gridjob ) pcms004 12/5 14: :00:00 I (gridjob ) pcms004 12/5 14: :00:00 I (gridjob ) pcms004 12/5 14: :00:00 I (gridjob ) …
Multi-core jobs Custom print format -bash-4.1$ condor_q -global -pr queue_mc.cpf -- Schedd: arc-ce01.gridpp.rl.ac.uk : ID OWNER SUBMITTED RUN_TIME ST SIZE CMD CORES pcms004 12/5 14: :00:00 R 2.0 (gridjob) pcms004 12/5 14: :00:00 R 0.0 (gridjob) pcms004 12/5 14: :00:00 I 0.0 (gridjob) pcms004 12/5 14: :00:00 I 0.0 (gridjob) pcms004 12/5 14: :00:00 I 0.0 (gridjob) pcms004 12/5 14: :00:00 I 0.0 (gridjob) 8 …
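The contents of queue_mc.cpf are not shown in the slides. A minimal sketch of what such a custom print format file might look like, adding a CORES column from RequestCpus (the exact widths and PRINTAS choices here are assumptions; see the condor_q manual for the full syntax):

```
# queue_mc.cpf -- hypothetical reconstruction, not the file used at RAL
SELECT
   ClusterId     AS " ID"          NOSUFFIX WIDTH 5
   ProcId        AS " "            NOPREFIX PRINTF ".%-3d"
   Owner         AS "OWNER"        WIDTH -14
   QDate         AS "  SUBMITTED"  WIDTH 11  PRINTAS QDATE
   RemoteUserCpu AS "    RUN_TIME" WIDTH 12  PRINTAS CPU_TIME
   JobStatus     AS ST                       PRINTAS JOB_STATUS
   ImageSize     AS SIZE           WIDTH 6   PRINTAS MEMORY_USAGE
   Cmd           AS CMD            WIDTH -18 PRINTAS JOB_DESCRIPTION
   RequestCpus   AS CORES
SUMMARY STANDARD
```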
Jobs with specific DN -bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"' -- Schedd: arc-ce03.gridpp.rl.ac.uk : ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD tatls015 12/2 17: :07:15 R (arc_pilot ) tatls015 12/3 03: :12:31 R (arc_pilot ) tatls015 12/4 07: :49:12 R (arc_pilot ) tatls015 12/4 08: :09:27 R (arc_pilot ) tatls015 12/4 08: :09:27 R (arc_pilot ) tatls015 12/4 09: :09:37 R (arc_pilot ) tatls015 12/4 09: :09:26 R (arc_pilot ) …
Jobs killed Jobs which were removed ~]# condor_history -constraint 'JobStatus == 3' ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD alicesgm 12/5 01: :13:22 X ??? /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi tlhcb005 12/5 13: :52:26 X ??? /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi tlhcb005 12/5 14: :07:07 X ??? /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi alicesgm 12/4 19: :13:56 X ??? /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi alicesgm 12/5 03: :52:10 X ??? /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi alicesgm 12/5 00: :58:15 X ??? /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi alicesgm 12/4 19: :43:22 X ??? /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi alicesgm 12/5 16: :06:34 X ??? /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi tlhcb005 12/2 05: :00:10 X ??? /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi …
Jobs killed Jobs removed for exceeding memory limit ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory alicesgm alicesgm alicesgm alicesgm … ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c 515 alice 5 cms 70 lhcb
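The factor of 1024 in the constraint above is there because HTCondor reports ResidentSetSize in KiB while RequestMemory is in MiB. A small Python sketch of the same check (the function name is ours, not part of HTCondor):

```python
def exceeded_request_memory(resident_set_size_kib, request_memory_mib):
    """Mirror of the ClassAd expression ResidentSetSize > 1024*RequestMemory.

    ResidentSetSize is reported in KiB, RequestMemory in MiB,
    hence the factor of 1024 to bring both sides to KiB.
    """
    return resident_set_size_kib > 1024 * request_memory_mib

# A job that requested 2048 MiB but grew to ~2.5 GiB of resident memory:
print(exceeded_request_memory(2_500_000, 2048))  # True
```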
condor_who What jobs are currently running on a worker node? ~]# condor_who OWNER CLIENT SLOT JOB RUNTIME PID PROGRAM arc-ce02.gridpp.rl.ac.uk 1_ :01: /usr/libexec/condor/co arc-ce02.gridpp.rl.ac.uk 1_ :56: /usr/libexec/condor/co arc-ce04.gridpp.rl.ac.uk 1_ :51: /usr/libexec/condor/co arc-ce04.gridpp.rl.ac.uk 1_ :06: /usr/libexec/condor/co arc-ce02.gridpp.rl.ac.uk 1_ :02: /usr/libexec/condor/co arc-ce03.gridpp.rl.ac.uk 1_ :44: /usr/libexec/condor/co arc-ce04.gridpp.rl.ac.uk 1_ :42: /usr/libexec/condor/co arc-ce01.gridpp.rl.ac.uk 1_ :50: /usr/libexec/condor/co arc-ce03.gridpp.rl.ac.uk 1_ :44: /usr/libexec/condor/co
Startd history If STARTD_HISTORY defined on your WNs ~]# condor_history ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD tatls015 12/6 07: :02:39 C 12/6 08:01 /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi tatls015 12/6 07: :02:40 C 12/6 07:59 /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi tatls015 12/6 07: :02:33 C 12/6 07:56 /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi tatls015 12/6 07: :02:35 C 12/6 07:54 /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi tatls015 12/6 07: :02:36 C 12/6 07:51 /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi tatls015 12/6 07: :02:33 C 12/6 07:48 /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi alicesgm 12/4 18: :15:07 C 12/6 07:44 /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI …
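Enabling this is a one-line worker node configuration change; a sketch, assuming the history file lives under the daemon log directory (the path is an example, any writable location works):

```
# Worker node config: keep a per-node history of completed jobs,
# queryable locally with condor_history
STARTD_HISTORY = $(LOG)/startd_history
```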
Ganglia
condor_gangliad
–Runs on a single host (can be any host)
–Gathers daemon ClassAds from the collector
–Publishes metrics to Ganglia with host spoofing
At RAL we have on one host:
GANGLIAD_VERBOSITY = 2
GANGLIAD_PER_EXECUTE_NODE_METRICS = False
GANGLIAD = $(LIBEXEC)/condor_gangliad
GANGLIA_CONFIG = /etc/gmond.conf
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
DAEMON_LIST = MASTER, GANGLIAD
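The files in GANGLIAD_METRICS_CONFIG_DIR define which ClassAd attributes become Ganglia metrics, using ClassAd-style metric definitions. A sketch of a single metric definition (the attribute values here are illustrative, not taken from the RAL configuration):

```
[
  Name       = "RunningJobs";
  Desc       = "Number of running jobs reported by the schedd";
  Units      = "jobs";
  TargetType = "Scheduler";
]
```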
Ganglia Small subset from schedd
Ganglia Small subset from central manager
Easy to make custom plots
Total running, idle, held jobs
Running jobs by schedd
Negotiator health
–Negotiation cycle duration
–Number of autoclusters
Draining & multi-core slots
(Some) third-party tools
Job overview Condor Job Overview Monitor
Mimic Internal RAL application
htcondor-sysview
Hover mouse over a core to get job information
Nagios
Most (all?) sites probably use Nagios or an alternative
At RAL:
–Process checks for condor_master on all nodes
–Central managers: check for at least 1 collector; check for the negotiator; check for worker nodes (number of startd ClassAds needs to be above a threshold; number of non-broken worker nodes above a threshold)
–CEs: check for schedd; job submission test
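The per-node condor_master process check can be done with the standard check_procs plugin, e.g. via NRPE; a minimal sketch (the plugin path and command name are examples, adjust for your installation):

```
# NRPE command: alert unless exactly one condor_master process is running
command[check_condor_master]=/usr/lib64/nagios/plugins/check_procs -c 1:1 -C condor_master
```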