Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014

Introduction
Two aspects of monitoring
– General overview of the system
  How many running/idle jobs? By user/VO? By schedd?
  How full is the farm? How many draining worker nodes?
– More detailed views
  What are individual jobs doing?
  What’s happening on individual worker nodes?
Health of the different components of the HTCondor pool
...in addition to Nagios

Introduction
Methods
– Command line utilities
– Ganglia
– Third-party applications (which run command-line tools or use python API)
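The python API mentioned above exposes the same information as the commands shown on the following slides. As a rough sketch (not from the talk; the hostnames are the RAL ones used later and are purely illustrative), a pool overview via the bindings might look like:

# Sketch: pool overview via the HTCondor python bindings (hostnames are examples)
import htcondor

coll = htcondor.Collector("condor01.gridpp.rl.ac.uk")

# Running/idle totals per schedd, taken from the schedd ClassAds in the collector
for ad in coll.query(htcondor.AdTypes.Schedd, "true",
                     ["Name", "TotalRunningJobs", "TotalIdleJobs"]):
    print("%s %s %s" % (ad.get("Name"), ad.get("TotalRunningJobs", 0),
                        ad.get("TotalIdleJobs", 0)))

# Held jobs on one schedd (JobStatus == 5 means held)
ce = coll.locate(htcondor.DaemonTypes.Schedd, "arc-ce01.gridpp.rl.ac.uk")
held = htcondor.Schedd(ce).query("JobStatus == 5", ["ClusterId", "ProcId", "Owner"])
print("%d held jobs" % len(held))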

Command line
Three useful commands
– condor_status
  Overview of the pool (including jobs, machines)
  Information about specific worker nodes
– condor_q
  Information about jobs in the queue
– condor_history
  Information about completed jobs

Overview of jobs
-bash-4.1$ condor_status -collector
Name                Machine             RunningJobs  IdleJobs  HostsTotal
condor01.gridpp.rl
condor02.gridpp.rl

Overview of machines
-bash-4.1$ condor_status -total
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX
       Total

Jobs by schedd
-bash-4.1$ condor_status -schedd
Name                  Machine     TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
arc-ce01.gridpp.rl.a  arc-ce01.g
arc-ce02.gridpp.rl.a  arc-ce02.g
arc-ce03.gridpp.rl.a  arc-ce03.g
arc-ce04.gridpp.rl.a  arc-ce04.g
arc-ce05.gridpp.rl.a  arc-ce05.g
cream-ce01.gridpp.rl  cream-ce
cream-ce02.gridpp.rl  cream-ce
lcg0955.gridpp.rl.ac  lcg0955.gr
lcgui03.gridpp.rl.ac  lcgui03.gr
lcgui04.gridpp.rl.ac  lcgui04.gr
lcgvm21.gridpp.rl.ac  lcgvm21.gr
                      TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
               Total

Jobs by user, schedd
-bash-4.1$ condor_status -submitters
Name                          Machine             RunningJobs  IdleJobs  HeldJobs
arc-ce01.gridpp.rl
arc-ce01.gridpp.rl
group_ATLAS.atlas_pilot.tatl  arc-ce01.gridpp.rl
group_ATLAS.prodatls.patls00  arc-ce01.gridpp.rl
arc-ce01.gridpp.rl
group_CMS.cms_pilot.ttcms022  arc-ce01.gridpp.rl
group_CMS.cms_pilot.ttcms043  arc-ce01.gridpp.rl
arc-ce01.gridpp.rl
arc-ce01.gridpp.rl
group_CMS.prodcms_multicore.  arc-ce01.gridpp.rl
arc-ce01.gridpp.rl
group_LHCB.lhcb_pilot.tlhcb0  arc-ce01.gridpp.rl
group_NONLHC.snoplus.snoplus  arc-ce01.gridpp.rl
…

…Jobs by user
                      RunningJobs  IdleJobs  HeldJobs
group_ALICE.alice.al
group_ALICE.alice.al
group_ALICE.alice_pi
group_ATLAS.atlas.at
group_ATLAS.atlas_pi
group_ATLAS.atlas_pi
group_ATLAS.prodatls
group_CMS.cms.cmssgm
group_CMS.cms_pilot
group_CMS.cms_pilot
group_CMS.cms_pilot
group_CMS.prodcms.pc
group_CMS.prodcms.pc
group_CMS.prodcms_mu
…

condor_q
~]# condor_q
-- Submitter: arc-ce01.gridpp.rl.ac.uk : : arc-ce01.gridpp.rl.ac.uk
ID  OWNER    SUBMITTED  RUN_TIME  ST  PRI  SIZE  CMD
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
    pcms054  12/3 12:    :00:00   I              (gridjob )
…
3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended

Multi-core jobs
-bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'
-- Schedd: arc-ce01.gridpp.rl.ac.uk :
ID  OWNER    SUBMITTED  RUN_TIME  ST  PRI  SIZE  CMD
    pcms004  12/5 14:    :15:07   R              (gridjob )
    pcms004  12/5 14:    :12:02   R              (gridjob )
    pcms004  12/5 14:    :00:00   I              (gridjob )
    pcms004  12/5 14:    :00:00   I              (gridjob )
    pcms004  12/5 14:    :00:00   I              (gridjob )
    pcms004  12/5 14:    :00:00   I              (gridjob )
…

Multi-core jobs
Custom print format
-bash-4.1$ condor_q -global -pr queue_mc.cpf
-- Schedd: arc-ce01.gridpp.rl.ac.uk :
ID  OWNER    SUBMITTED  RUN_TIME  ST  SIZE  CMD        CORES
    pcms004  12/5 14:    :00:00   R   2.0   (gridjob)
    pcms004  12/5 14:    :00:00   R   0.0   (gridjob)
    pcms004  12/5 14:    :00:00   I   0.0   (gridjob)
    pcms004  12/5 14:    :00:00   I   0.0   (gridjob)
    pcms004  12/5 14:    :00:00   I   0.0   (gridjob)
    pcms004  12/5 14:    :00:00   I   0.0   (gridjob)  8
…
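The contents of queue_mc.cpf are not shown in the talk. A guess at what such a file might contain, using the SELECT / AS / PRINTAS keywords from the condor_q custom print format documentation and adding RequestCpus as the CORES column (widths and formats here are assumptions):

# queue_mc.cpf (illustrative sketch, not the file used at RAL)
SELECT
   ClusterId     AS " ID"         NOSUFFIX WIDTH 5
   ProcId        AS " "           NOPREFIX PRINTF ".%-3d"
   Owner         AS "OWNER"       WIDTH -14 PRINTAS OWNER
   QDate         AS "  SUBMITTED" WIDTH 11  PRINTAS QDATE
   RemoteUserCpu AS "    RUN_TIME" WIDTH 12 PRINTAS CPU_TIME
   JobStatus     AS ST            PRINTAS JOB_STATUS
   ImageSize     AS SIZE          WIDTH 6   PRINTAS MEMORY_USAGE
   Cmd           AS CMD           PRINTAS JOB_DESCRIPTION
   RequestCpus   AS CORES
WHERE RequestCpus > 1
SUMMARY STANDARD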

Jobs with specific DN
-bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'
-- Schedd: arc-ce03.gridpp.rl.ac.uk :
ID  OWNER     SUBMITTED  RUN_TIME  ST  PRI  SIZE  CMD
    tatls015  12/2 17:    :07:15   R              (arc_pilot )
    tatls015  12/3 03:    :12:31   R              (arc_pilot )
    tatls015  12/4 07:    :49:12   R              (arc_pilot )
    tatls015  12/4 08:    :09:27   R              (arc_pilot )
    tatls015  12/4 08:    :09:27   R              (arc_pilot )
    tatls015  12/4 09:    :09:37   R              (arc_pilot )
    tatls015  12/4 09:    :09:26   R              (arc_pilot )
…

Jobs killed
Jobs which were removed
~]# condor_history -constraint 'JobStatus == 3'
ID  OWNER     SUBMITTED  RUN_TIME  ST  COMPLETED  CMD
    alicesgm  12/5 01:    :13:22   X   ???        /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi
    tlhcb005  12/5 13:    :52:26   X   ???        /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi
    tlhcb005  12/5 14:    :07:07   X   ???        /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi
    alicesgm  12/4 19:    :13:56   X   ???        /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi
    alicesgm  12/5 03:    :52:10   X   ???        /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi
    alicesgm  12/5 00:    :58:15   X   ???        /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi
    alicesgm  12/4 19:    :43:22   X   ???        /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi
    alicesgm  12/5 16:    :06:34   X   ???        /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi
    tlhcb005  12/2 05:    :00:10   X   ???        /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi
…

Jobs killed
Jobs removed for exceeding memory limit
~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory
alicesgm
alicesgm
alicesgm
alicesgm
…
~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c
515 alice
5 cms
70 lhcb

condor_who
What jobs are currently running on a worker node?
~]# condor_who
OWNER                     CLIENT  SLOT  JOB  RUNTIME  PID  PROGRAM
arc-ce02.gridpp.rl.ac.uk  1_       :01:            /usr/libexec/condor/co
arc-ce02.gridpp.rl.ac.uk  1_       :56:            /usr/libexec/condor/co
arc-ce04.gridpp.rl.ac.uk  1_       :51:            /usr/libexec/condor/co
arc-ce04.gridpp.rl.ac.uk  1_       :06:            /usr/libexec/condor/co
arc-ce02.gridpp.rl.ac.uk  1_       :02:            /usr/libexec/condor/co
arc-ce03.gridpp.rl.ac.uk  1_       :44:            /usr/libexec/condor/co
arc-ce04.gridpp.rl.ac.uk  1_       :42:            /usr/libexec/condor/co
arc-ce01.gridpp.rl.ac.uk  1_       :50:            /usr/libexec/condor/co
arc-ce03.gridpp.rl.ac.uk  1_       :44:            /usr/libexec/condor/co

Startd history
If STARTD_HISTORY is defined on your WNs
~]# condor_history
ID  OWNER     SUBMITTED  RUN_TIME  ST  COMPLETED   CMD
    tatls015  12/6 07:    :02:39   C   12/6 08:01  /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi
    tatls015  12/6 07:    :02:40   C   12/6 07:59  /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi
    tatls015  12/6 07:    :02:33   C   12/6 07:56  /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi
    tatls015  12/6 07:    :02:35   C   12/6 07:54  /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi
    tatls015  12/6 07:    :02:36   C   12/6 07:51  /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi
    tatls015  12/6 07:    :02:33   C   12/6 07:48  /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi
    alicesgm  12/4 18:    :15:07   C   12/6 07:44  /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI
…
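Enabling this only requires pointing the startd at a writable history file in the worker-node configuration; a one-line sketch (the path is an example, not necessarily what RAL uses):

# On the worker nodes: keep a local history of jobs run by the startd
STARTD_HISTORY = $(LOG)/startd_history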

Ganglia
condor_gangliad
– Runs on a single host (can be any host)
– Gathers daemon ClassAds from the collector
– Publishes metrics to ganglia with host spoofing
At RAL we have on one host:
GANGLIAD_VERBOSITY = 2
GANGLIAD_PER_EXECUTE_NODE_METRICS = False
GANGLIAD = $(LIBEXEC)/condor_gangliad
GANGLIA_CONFIG = /etc/gmond.conf
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
DAEMON_LIST = MASTER, GANGLIAD
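The files under GANGLIAD_METRICS_CONFIG_DIR declare which daemon ClassAd attributes become ganglia metrics. The fragment below is a sketch of one such metric definition (attribute names follow the condor_gangliad documentation; the choice of metric is illustrative, not necessarily one RAL publishes; with no Value expression the attribute named by Name is published, and Aggregate also produces a pool-wide sum):

[
  Name       = "TotalRunningJobs";
  Desc       = "Running jobs reported by each schedd";
  Units      = "jobs";
  TargetType = "Scheduler";
  Aggregate  = "SUM";
  Verbosity  = 0
]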

Ganglia
Small subset from schedd

Ganglia
Small subset from central manager

Easy to make custom plots

Total running, idle, held jobs

Running jobs by schedd

Negotiator health
Negotiation cycle duration / Number of AutoClusters

Draining & multi-core slots

(Some) Third party tools

Job overview
Condor Job Overview Monitor

Mimic
Internal RAL application

htcondor-sysview

Hover mouse over a core to get job information

Nagios
Most (all?) sites probably use Nagios or an alternative
At RAL
– Process checks for condor_master on all nodes
– Central managers
  Check for at least 1 collector
  Check for the negotiator
  Check for worker nodes
    Number of startd ClassAds needs to be above a threshold (see the sketch below)
    Number of non-broken worker nodes above a threshold
– CEs
  Check for schedd
  Job submission test
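The RAL check scripts themselves are not shown in the talk; a minimal sketch of the "number of startd ClassAds above a threshold" style of check, written against the python bindings with invented thresholds and pool name, and returning standard Nagios exit codes, could be:

#!/usr/bin/env python
# Sketch of a Nagios check: alert if too few worker nodes are advertising
# startd ClassAds to the collector. Thresholds and pool name are examples.
import sys
import htcondor

POOL, WARN, CRIT = "condor01.gridpp.rl.ac.uk", 500, 450

coll = htcondor.Collector(POOL)
machines = set(ad["Machine"] for ad in
               coll.query(htcondor.AdTypes.Startd, "true", ["Machine"]))
n = len(machines)

if n < CRIT:
    print("CRITICAL: only %d worker nodes have startd ClassAds" % n)
    sys.exit(2)
elif n < WARN:
    print("WARNING: only %d worker nodes have startd ClassAds" % n)
    sys.exit(1)
print("OK: %d worker nodes have startd ClassAds" % n)
sys.exit(0)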