HTCondor at the RAL Tier-1


1 HTCondor at the RAL Tier-1
Andrew Lahiff

2 Overview
Current status at RAL
Multi-core jobs
Interesting features we're using
Some interesting features & commands
Future plans

3 Current status at RAL

4 Background
Computing resources
784 worker nodes, over 14K cores
Generally have 40-60K jobs submitted per day
Torque / Maui had been used for many years
Many issues
Severity & number of problems increased as the size of the farm increased
Doesn't like dynamic resources
A problem if we want to extend the batch system into the cloud
In 2012 decided it was time to start investigating a move to a new batch system

5 Choosing a new batch system
Considered, tested & eventually rejected the following:
LSF, Univa Grid Engine*
Requirement: avoid commercial products unless absolutely necessary
Open source Grid Engines
Competing products, not sure which has a long-term future
Communities appear less active than HTCondor & SLURM
Existing Tier-1s running Grid Engine use the commercial version
Torque 4 / Maui
Maui problematic
Torque 4 seems less scalable than the alternatives (but better than Torque 2)
SLURM
Carried out extensive testing & comparison with HTCondor
Found that for our use case it was very fragile, easy to break
Unable to get it working reliably above 6,000 running jobs
* Only tested open source Grid Engine, not Univa Grid Engine

6 Choosing a new batch system
HTCondor chosen as the replacement for Torque/Maui
Has the features we require
Seems very stable
Easily able to run 16,000 simultaneous jobs
Didn't do any tuning – it "just worked"
Have since tested > 30,000 running jobs
More customizable than the other batch systems we looked at

7 The story so far
History of HTCondor at RAL
Jun 2013: started testing with real ATLAS & CMS jobs
Sep 2013: 50% of pledged resources moved to HTCondor
Nov 2013: fully migrated to HTCondor
Experience
Very stable operation
No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores
Job start rate much higher than Torque / Maui, even when throttled
Very good support

8 Architecture
CEs: condor_schedd
Central manager: condor_collector, condor_negotiator
Worker nodes: condor_startd

9 Our setup
2 central managers
condor_master
condor_collector
condor_had (responsible for high availability)
condor_replication (responsible for high availability)
condor_negotiator (only running on at most 1 machine at a time)
8 submission hosts (4 ARC CE, 2 CREAM CE, 2 UI)
condor_schedd
Lots of worker nodes
condor_startd
Monitoring box (runs 8.1.x, which contains Ganglia integration)
condor_gangliad
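For reference, the high-availability pieces above are wired together with configuration along these lines. This is a minimal sketch following the standard HTCondor HA recipe, with hypothetical hostnames rather than RAL's actual machines:

# Minimal sketch of a two-node HA central manager setup (hostnames are hypothetical)
CENTRAL_MANAGER1 = condor-cm1.example.com
CENTRAL_MANAGER2 = condor-cm2.example.com
CONDOR_HOST = $(CENTRAL_MANAGER1), $(CENTRAL_MANAGER2)

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION

# condor_had decides which machine runs the negotiator; condor_replication
# keeps the negotiator/accountant state synchronised between the two
HAD_PORT = 51450
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
REPLICATION_PORT = 41450
REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT), $(CENTRAL_MANAGER2):$(REPLICATION_PORT)
HAD_USE_REPLICATION = TRUE
HAD_CONTROLLEE = NEGOTIATOR
MASTER_NEGOTIATOR_CONTROLLER = HAD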

10 Computing elements
ARC experience so far
Have run over 9.4 million jobs so far across our ARC CEs
Generally ignore them & they "just work"
VO status
ATLAS & CMS: fine from the beginning
LHCb: ability to submit to ARC has been added to DIRAC
ALICE: not yet able to submit to ARC, have said they will work on this
Non-LHC VOs: some use DIRAC, which can now submit to ARC; some use EMI WMS, which can submit to ARC

11 Computing elements
ARC 4.1.0 released recently
Will be in UMD very soon (has just passed through staged rollout)
Contains all of our fixes to the HTCondor backend scripts
Plans
When VOs start using RFC proxies we could enable the web service interface
Doesn't affect ATLAS/CMS
VOs using NorduGrid client commands (e.g. LHCb) can get job status information more quickly

12 Computing elements
Alternative: HTCondor-CE
Special configuration of HTCondor, not a brand new service
Some sites in the US are starting to use this
Note: contains no BDII (!)
Diagram: ATLAS & CMS pilot factories submit via HTCondor-G to the HTCondor-CE schedd, whose job router feeds jobs into the site batch system (schedd, central manager collector(s) & negotiator, worker node startds)

13 Multi-core jobs

14 Getting multi-core jobs to work
Job submission
Haven't set up dedicated queues
VO has to request how many cores they want in their JDL
Fine for ATLAS & CMS, not sure yet for LHCb/DIRAC…
Could set up additional queues if necessary
Did 5 things on the HTCondor side to support multi-core jobs…

15 Getting multi-core jobs to work
Worker nodes configured to use partitionable slots
WN resources divided up as necessary amongst jobs
We had this configuration anyway
Set up multi-core accounting groups & associated quotas
Configured so that multi-core jobs automatically get assigned to the appropriate groups
Specified group quotas (fairshares) for the multi-core groups
Adjusted the order in which the negotiator considers groups
Consider multi-core groups before single-core groups
8 free slots are "expensive" to obtain, so try not to lose them too quickly
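As an illustration of the points above, a minimal configuration sketch; the group names and quota values are hypothetical, not the production settings used at RAL:

# Partitionable slots: one big slot per WN, carved up per job
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Hypothetical accounting groups & quotas, with multi-core groups alongside single-core ones;
# jobs are placed into the appropriate group at submission time (not shown here)
GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.1
GROUP_QUOTA_DYNAMIC_group_cms_multicore = 0.1

# Consider multi-core groups before single-core groups in the negotiation cycle
GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.4e+38, \
                  ifThenElse(regexp("_multicore", AccountingGroup), 1.0, 2.0))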

16 Getting multi-core jobs to work
Set up the condor_defrag daemon
Finds WNs to drain, triggers draining & cancels draining as required
Picks WNs to drain based on how many cores they have that can be freed up
E.g. getting 8 free cores by draining a full 32-core WN is generally faster than draining a full 8-core WN
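A condor_defrag configuration in this spirit might look like the sketch below; the numbers and the rank expression are illustrative assumptions, not necessarily the values used at RAL:

# Run condor_defrag on the central manager
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# Limits on how much draining can happen at once
DEFRAG_MAX_CONCURRENT_DRAINING = 6
DEFRAG_DRAINING_MACHINES_PER_HOUR = 4
DEFRAG_MAX_WHOLE_MACHINES = 40

# A WN counts as "whole" once 8 or more cores are free, so draining can stop there
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)

# Prefer WNs that will yield 8 free cores soonest (e.g. large WNs that are already partly free)
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus) / (8.0 - Cpus))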

17 Getting multi-core jobs to work
Improvement to the condor_defrag daemon
Demand for multi-core jobs is not known by condor_defrag
Set up a simple cron which adjusts the number of concurrently draining WNs based on demand
If many idle multi-core jobs but few running, drain aggressively
Otherwise very little draining
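A minimal sketch of such a cron job, assuming it runs on the central manager where condor_defrag lives; the thresholds and config file path are assumptions, not the actual RAL script:

#!/bin/sh
# Adjust condor_defrag aggressiveness according to multi-core demand (illustrative only)

IDLE=$(condor_q -global -constraint 'RequestCpus >= 8 && JobStatus == 1' -af ClusterId | wc -l)
RUNNING=$(condor_q -global -constraint 'RequestCpus >= 8 && JobStatus == 2' -af ClusterId | wc -l)

if [ "$IDLE" -gt 50 ] && [ "$RUNNING" -lt 100 ]; then
    MAX=12   # many idle multi-core jobs but few running: drain aggressively
else
    MAX=1    # otherwise keep draining (and therefore CPU wastage) to a minimum
fi

echo "DEFRAG_MAX_CONCURRENT_DRAINING = $MAX" > /etc/condor/config.d/99-defrag-demand.conf
condor_reconfig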

18 Results
Plots: running & idle multi-core jobs; number of WNs running multi-core jobs & draining WNs
Gaps in submission by ATLAS result in loss of multi-core slots
Significantly reduced CPU wastage due to the cron
Aggressive draining: 3% waste
Less-aggressive draining: < 1% waste

19 Multi-core jobs
Current status
Haven't made any changes over the past few months
Now only CMS running multi-core jobs
Waiting for ATLAS to start up again
Will be interesting to see what happens when multiple VOs run multi-core jobs
May look at making improvements if necessary
Details about our configuration here:

20 Interesting features we’re using (or are about to use)

21 Startd cron
Worker node health check script
Script run on the WN at regular intervals by HTCondor
Can place custom information into the WN ClassAd. In our case:
NODE_IS_HEALTHY (should the WN start jobs or not)
NODE_STATUS (list of any problems)
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WN_HEALTHCHECK
STARTD_CRON_WN_HEALTHCHECK_EXECUTABLE = /usr/local/bin/healthcheck_wn_condor
STARTD_CRON_WN_HEALTHCHECK_PERIOD = 10m
STARTD_CRON_WN_HEALTHCHECK_MODE = periodic
STARTD_CRON_WN_HEALTHCHECK_RECONFIG = false
STARTD_CRON_WN_HEALTHCHECK_KILL = true
## When is this node willing to run jobs?
START = (NODE_IS_HEALTHY =?= True)
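The health-check script itself is not shown on the slide; a hypothetical minimal version follows, just to illustrate the output format startd cron expects. The checks and repository list are assumptions, not the contents of the real /usr/local/bin/healthcheck_wn_condor:

#!/bin/sh
# Each "Attribute = value" line printed here ends up in the startd ClassAd

STATUS="All_OK"
HEALTHY="True"

# Example check: are the CVMFS repositories we care about readable?
for repo in atlas.cern.ch cms.cern.ch; do
    if ! ls /cvmfs/$repo >/dev/null 2>&1; then
        STATUS="Problem: CVMFS for $repo"
        HEALTHY="False"
    fi
done

echo "NODE_IS_HEALTHY = $HEALTHY"
echo "NODE_STATUS = \"$STATUS\""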

22 Startd cron
Current checks
CVMFS
Filesystem problems (e.g. read-only)
Swap usage
Plans
May add more checks, e.g. CPU load
More thorough tests only run when HTCondor first starts up?
E.g. checks for grid middleware, checks for other essential software & configuration, …
Want to be sure that jobs will never be started unless the WN is correctly set up
Will be more important for dynamic virtual WNs

23 Startd cron
Easily list any WNs with problems, e.g.
# condor_status -constraint 'PartitionableSlot == True && NODE_STATUS != "All_OK"' -autoformat Machine NODE_STATUS
lcg1209.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1248.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1309.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1340.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
gmetric script using the HTCondor Python API for making Ganglia plots

24 PID namespaces
We have USE_PID_NAMESPACES = True on WNs
Jobs can't see any system processes or processes associated with other jobs on the WN
Example stdout of a job running "ps -e":
  PID TTY          TIME CMD
    1 ?        00:00:00 condor_exec.exe
    3 ?        00:00:00 ps

25 MOUNT_UNDER_SCRATCH
Each job sees a different /tmp, /var/tmp
Uses bind mounts to directories inside the job scratch area
No more junk left behind in /tmp
Jobs can't fill /tmp & cause problems
Jobs can't see what other jobs have written into /tmp
For glexec jobs to work, need a special lcmaps plugin enabled: lcmaps-plugins-mount-under-scratch
Minor tweak to lcmaps.db
We have tested this but it's not yet rolled out
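The corresponding WN configuration is a one-liner; as a sketch, the knob takes a comma-separated list of the directories to bind-mount under the job scratch area:

MOUNT_UNDER_SCRATCH = /tmp,/var/tmp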

26 CPU affinity
Can set ASSIGN_CPU_AFFINITY = True on the WN
Jobs are locked to specific cores
Problem: when PID namespaces are also used, CPU affinity doesn't work

27 Control groups
Cgroups: a mechanism for managing a set of processes
We're starting with the most basic option: CGROUP_MEMORY_LIMIT_POLICY = none
No memory limits applied
Cgroups used for:
Process tracking
Memory accounting
CPU usage assigned proportionally to the number of CPUs in the slot
Currently configured on 2 WNs for initial testing with production jobs
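A sketch of the WN configuration this implies; the cgroup name shown is the HTCondor default, and it assumes the cgroup hierarchy is already mounted on the WN:

# Track & account for job processes via cgroups, but apply no memory limits
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = none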

28 Some interesting features & commands

29 Upgrades
Central managers, CEs
Update the RPMs
condor_master will notice the binaries have changed & restart daemons as required
Worker nodes
Make sure the WN config contains MASTER_NEW_BINARY_RESTART = PEACEFUL
condor_master will notice the binaries have changed, drain running jobs, then restart daemons as required

30 Dealing with held jobs
Held jobs have failed in some way & remain in the queue waiting for user intervention
E.g. input file(s) missing from the CE
Can configure HTCondor to deal with them automatically
Try re-running held jobs once only, after waiting 30 minutes:
SYSTEM_PERIODIC_RELEASE = ((CurrentTime - EnteredCurrentStatus > 30 * 60) && (JobRunCount < 2))
Remove held jobs after 24 hours:
SYSTEM_PERIODIC_REMOVE = ((CurrentTime - EnteredCurrentStatus > 24 * 60 * 60) && (JobStatus == 5))

31 condor_who
See what jobs are running on a worker node
 ~]# condor_who
OWNER                      CLIENT  SLOT  JOB  RUNTIME  PID  PROGRAM
arc-ce03.gridpp.rl.ac.uk   1_  :30:   /pool/condor/dir_30180/condor_exec.exe
arc-ce01.gridpp.rl.ac.uk   1_  :38:   /pool/condor/dir_21262/condor_exec.exe
arc-ce03.gridpp.rl.ac.uk   1_  :40:   /pool/condor/dir_6938/condor_exec.exe
arc-ce02.gridpp.rl.ac.uk   1_  :42:   /pool/condor/dir_26665/condor_exec.exe
arc-ce01.gridpp.rl.ac.uk   1_  :57:   /pool/condor/dir_12342/condor_exec.exe
arc-ce01.gridpp.rl.ac.uk   1_  :03:   /pool/condor/dir_26331/condor_exec.exe
arc-ce01.gridpp.rl.ac.uk   1_  :55:   /pool/condor/dir_31268/condor_exec.exe

32 condor_q -analyze (1)
Why isn't a job running?
-bash-4.1$ condor_q -analyze 16244
-- Submitter: lcgui03.gridpp.rl.ac.uk : < :45033> : lcgui03.gridpp.rl.ac.uk
User priority for is not available, attempting to analyze without it.
--- : Run analysis summary. Of machines,
  13180 are rejected by your job's requirements
  0 reject your job because of their own requirements
  0 match and are already running your jobs
  0 match but are serving other users
  0 are available to run your job
WARNING: Be advised: No resources matched request's constraints

33 condor_q -analyze (2)
The Requirements expression for your job is:
( ( Ceph is true ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )
Suggestions:
    Condition                        Machines Matched   Suggestion
1   ( TARGET.Cpus >= 16 )            0                  MODIFY TO 8
2   ( ( target.Ceph is true ) )
    ( TARGET.Memory >= 1 )
    ( TARGET.Arch == "X86_64" )
    ( TARGET.OpSys == "LINUX" )
    ( TARGET.Disk >= 1 )
    ( TARGET.HasFileTransfer )       13180

34 condor_ssh_to_job
ssh into a job from a CE
 ~]# condor_ssh_to_job
Welcome to Your condor job is running with pid(s) 2402.
 dir_2393]$ hostname
lcg1554.gridpp.rl.ac.uk
 dir_2393]$ ls
condor_exec.exe
_condor_stderr
_condor_stderr.schedd_glideins4_vocms32.cern.ch_ _
_condor_stdout
_condor_stdout.schedd_glideins4_vocms32.cern.ch_ _
glide_adXvcQ
glidein_startup.sh
job.MCJMDmk8W6jnCIXDjqiBL5XqABFKDmABFKDmb5JKDmEBFKDmQ8FM6m.proxy

35 condor_fetchlog
Retrieve log files from daemons on other machines
 ~]# condor_fetchlog arc-ce04.gridpp.rl.ac.uk SCHEDD
05/20/14 12:39:09 (pid:2388) ******************************************************
05/20/14 12:39:09 (pid:2388) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/20/14 12:39:09 (pid:2388) ** /usr/sbin/condor_schedd
05/20/14 12:39:09 (pid:2388) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/20/14 12:39:09 (pid:2388) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/20/14 12:39:09 (pid:2388) ** $CondorVersion: Feb BuildID: $
05/20/14 12:39:09 (pid:2388) ** $CondorPlatform: x86_RedHat6 $
05/20/14 12:39:09 (pid:2388) ** PID = 2388
05/20/14 12:39:09 (pid:2388) ** Log last touched time unavailable (No such file or directory)
...

36 condor_gather_info
Gathers information about a job, including log files from the schedd & startd
 ~]# condor_gather_info --jobid
cgi-root-jid _36_59PM-BST/
cgi-root-jid _36_59PM-BST/ShadowLog.old
cgi-root-jid _36_59PM-BST/ShadowLog
cgi-root-jid _36_59PM-BST/StartLog
cgi-root-jid _36_59PM-BST/MasterLog
cgi-root-jid _36_59PM-BST/MasterLog.lcg1043.gridpp.rl.ac.uk
cgi-root-jid _36_59PM-BST/condor-profile.txt
cgi-root-jid _36_59PM-BST/ /
cgi-root-jid _36_59PM-BST/ /job_q
cgi-root-jid _36_59PM-BST/ /job_userlog_lines
cgi-root-jid _36_59PM-BST/ /job_ad_analysis
cgi-root-jid _36_59PM-BST/ /job_ad
cgi-root-jid _36_59PM-BST/SchedLog.old
cgi-root-jid _36_59PM-BST/StarterLog.slot1

37 Job ClassAds
Lots of useful information in job ClassAds
Including the email address from the proxy
Easy to contact the users of problematic jobs
# condor_q -autoformat x509UserProxyEmail

38 condor_chirp
Jobs can put custom information into their job ClassAds
Example: lcmaps-plugin-condor-update
Puts information into the job ClassAd about the glexec payload user & DN
Can then use condor_q to see this information
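For illustration, this is roughly how a job-side tool can publish an attribute with condor_chirp; the attribute name and value here are hypothetical, and the job needs +WantIOProxy = True in its submit description for chirp to work:

# Run from inside the job; condor_chirp is shipped in HTCondor's libexec directory
condor_chirp set_job_attr GlexecPayloadUser '"pilot042"'

# Then, from a submit host, the attribute is visible in the job ClassAd
condor_q -autoformat ClusterId GlexecPayloadUser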

39 Job router
The job router daemon transforms jobs from one type to another according to configurable policies
E.g. submit jobs to a different batch system or a CE
Example: sending excess jobs to the GridPP Cloud using glideinWMS
JOB_ROUTER_DEFAULTS = \
  [ \
    MaxIdleJobs = 10; \
    MaxJobs = 50; \
  ]
JOB_ROUTER_ENTRIES = \
  [ \
    Requirements = true; \
    GridResource = "condor lcggwms02.gridpp.rl.ac.uk lcggwms02.gridpp.rl.ac.uk:9618"; \
    name = "GridPP_Cloud"; \
  ]

40 Job router
Example: initially have 5 idle jobs
-bash-4.1$ condor_q
-- Submitter: lcgvm21.gridpp.rl.ac.uk : < :40557> : lcgvm21.gridpp.rl.ac.uk
 ID   OWNER    SUBMITTED   RUN_TIME  ST PRI SIZE CMD
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended

41 Job router
Routed copies of the jobs soon appear
The original job mirrors the status of its routed copy
-bash-4.1$ condor_q
-- Submitter: lcgvm21.gridpp.rl.ac.uk : < :40557> : lcgvm21.gridpp.rl.ac.uk
 ID   OWNER    SUBMITTED   RUN_TIME  ST PRI SIZE CMD
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
 alahiff   /2 20:   :00:00 I  (CMSAnalysis )
10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended

42 Job router
Can check that the new jobs have been sent to a remote resource
 ~]# condor_q -grid
-- Submitter: lcgvm21.gridpp.rl.ac.uk : < :40557> : lcgvm21.gridpp.rl.ac.uk
 ID   OWNER    STATUS  GRID->MANAGER                HOST  GRID_JOB_ID
 alahiff  IDLE    condor->lcggwms02.gridpp.rl            20.0
 alahiff  IDLE    condor->lcggwms02.gridpp.rl            21.0
 alahiff  IDLE    condor->lcggwms02.gridpp.rl            17.0
 alahiff  IDLE    condor->lcggwms02.gridpp.rl            18.0
 alahiff  IDLE    condor->lcggwms02.gridpp.rl            19.0
The GRID_JOB_ID column shows the job ids on the remote resource (the glideinWMS HTCondor pool)

43 Future plans
Setting up an ARC CE & some WNs using Ceph as a shared storage system (CephFS)
ATLAS testing with arcControlTower
Pulls jobs from PanDA, pushes jobs to ARC CEs
Input files pre-staged & cached on Ceph by ARC
Currently in progress…

44 Future plans
Test power management features
HTCondor can power down idle WNs and wake them as required
Batch system expanding into the cloud
Make use of idle private cloud resources
We have tested that condor_rooster can be used to dynamically provision VMs as they are needed
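A sketch of the condor_rooster side of this, with the wake-up command swapped for a hypothetical site script that provisions a cloud VM instead of sending a wake-on-LAN packet; the values shown are assumptions, not the tested RAL configuration:

# Run condor_rooster on the central manager (illustrative values)
DAEMON_LIST = $(DAEMON_LIST) ROOSTER
ROOSTER_INTERVAL = 300

# Which offline machine ads are candidates for waking / provisioning
ROOSTER_UNHIBERNATE = Offline =?= True

# Hypothetical site-specific script that boots a VM rather than waking physical hardware
ROOSTER_WAKEUP_CMD = "/usr/local/bin/provision_vm"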

45 Questions?

46 Backup slides

47 Monitoring

48 Overview
Batch system monitoring:
Mimic
Jobview
Ganglia
Elasticsearch

49 Mimic Overview of state of worker nodes

50 Jobview

51 Ganglia Custom gmetric scripts + condor_gangliad

52 Ganglia
Aim to move, as much as possible, to using only metrics from condor_gangliad (easy to share with other sites)

53 Jobs monitoring
The CASTOR team at RAL have been testing Elasticsearch
Why not try using it with HTCondor?
Elasticsearch ELK stack
Logstash: parses log files
Elasticsearch: search & analyze data in real time
Kibana: data visualization
Hardware setup
Test cluster of 13 servers (old disk servers & worker nodes)
But 3 servers could handle 16 GB of CASTOR logs per day
Adding HTCondor
Early testing phase only
Wrote a config file for Logstash to enable history files to be parsed
Add Logstash to the machines running schedds

54 Jobs monitoring Full job ClassAds visible & can be queried

55 Jobs monitoring Can make custom plots & dashboards

