1 A Year of HTCondor at the RAL Tier-1
Ian Collier, Andrew Lahiff
STFC Rutherford Appleton Laboratory
HEPiX Spring 2014 Workshop

2 Outline
Overview of HTCondor at RAL
Computing elements
Multi-core jobs
Monitoring

3 Introduction
RAL is a Tier-1 for all 4 LHC experiments
– In terms of Tier-1 computing requirements, RAL provides 2% ALICE, 13% ATLAS, 8% CMS, 32% LHCb
– Also supports ~12 non-LHC experiments, including non-HEP
Computing resources
– 784 worker nodes, over 14K cores
– Generally 40-60K jobs submitted per day
Torque / Maui had been used for many years
– Many issues
– Severity & number of problems increased as the size of the farm increased
– In 2012 we decided it was time to start investigating a move to a new batch system

4 Choosing a new batch system
Considered, tested & eventually rejected the following:
– LSF, Univa Grid Engine*
  Requirement: avoid commercial products unless absolutely necessary
– Open source Grid Engines
  Competing products; unclear which has a long-term future
  Communities appear less active than those of HTCondor & SLURM
  Existing Tier-1s running Grid Engine use the commercial version
– Torque 4 / Maui
  Maui problematic
  Torque 4 seems less scalable than the alternatives (but better than Torque 2)
– SLURM
  Carried out extensive testing & comparison with HTCondor
  Found that for our use case it was:
  – Very fragile, easy to break
  – Unable to run reliably above 6000 running jobs
* Only tested open source Grid Engine, not Univa Grid Engine

5 Choosing a new batch system
HTCondor chosen as the replacement for Torque / Maui
– Has the features we require
– Seems very stable
– Easily able to run 16,000 simultaneous jobs
  Didn't do any tuning – it "just worked"
  Have since tested > 30,000 running jobs
– Is more customizable than all other batch systems

6 Migration to HTCondor
Strategy
– Start with a small test pool
– Gain experience & slowly move resources from Torque / Maui
Migration timeline
– 2012 Aug: Started evaluating alternatives to Torque / Maui (LSF, Grid Engine, Torque 4, HTCondor, SLURM)
– 2013 Jun: Began testing HTCondor with ATLAS & CMS (~1000 cores from old WNs, beyond MoU commitments)
– 2013 Aug: Choice of HTCondor approved by management
– 2013 Sep: HTCondor declared a production service; moved 50% of pledged CPU resources to HTCondor
– 2013 Nov: Migrated remaining resources to HTCondor

7 Experience so far
Experience
– Very stable operation
  Generally just ignore the batch system & everything works fine
  Staff don't need to spend all their time fire-fighting problems
– No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores
– Job start rate much higher than Torque / Maui, even when throttled
  Farm utilization much better
– Very good support

8 Problems
A few issues found, but fixed quickly by the developers
– Job submission hung when one of an HA pair of central managers was down
  Fixed & released in 8.0.2
– Problem affecting HTCondor-G job submission to ARC with HTCondor as the LRMS
  Fixed & released in 8.0.5
– Jobs died 2 hours after a network break between CEs and WNs
  Fixed & released in 8.1.4

9 Computing elements
All job submission to RAL is via the Grid
– No local users
Currently have 5 CEs
– 2 CREAM CEs
– 3 ARC CEs
CREAM doesn't currently support HTCondor
– We developed the missing functionality ourselves
– Will feed this back so that it can be included in an official release
ARC is better
– But didn't originally handle partitionable slots, passing CPU/memory requirements to HTCondor, …
– We wrote lots of patches, all included in the recent 4.1.0 release
  Will make it easier for more European sites to move to HTCondor

10 Computing elements
ARC CE experience
– Have run almost 9 million jobs so far across our 3 ARC CEs
– Generally ignore them and they "just work"
– VOs
  ATLAS & CMS fine from the beginning
  LHCb have added the ability to submit to ARC CEs to DIRAC
  – Seem to be ready to move entirely to ARC
  ALICE not yet able to submit to ARC
  – They have said they will work on this
  Non-LHC VOs
  – Some use DIRAC, which can now submit to ARC
  – Others use the EMI WMS, which can submit to ARC
CREAM CE status
– Plan to phase out the CREAM CEs this year

11 HTCondor & ARC in the UK
Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC
– RAL T2: HTCondor + ARC (in production)
– Bristol: HTCondor + ARC (in production)
– Oxford: HTCondor + ARC (small pool in production, migration in progress)
– Durham: SLURM + ARC
– Glasgow: testing HTCondor + ARC
– Liverpool: testing HTCondor
7 more sites considering moving to HTCondor or SLURM
Configuration management: community effort
– The Tier-2s using HTCondor and ARC have been sharing Puppet modules

12 Multi-core jobs
Current situation
– ATLAS have been running multi-core jobs at RAL since November
– CMS started submitting multi-core jobs in early May
– Interest so far only in multi-core jobs, not whole-node jobs
  Only 8-core jobs
Our aims
– Fully dynamic
  No manual partitioning of resources
– Number of running multi-core jobs determined by fairshares

13 Getting multi-core jobs to work
Job submission
– Haven't set up dedicated multi-core queues
– The VO has to request how many cores it wants in its JDL, e.g. (count=8)
Worker nodes configured to use partitionable slots
– Resources of each WN (CPU, memory, …) are divided up as necessary amongst jobs
Set up multi-core groups & associated fairshares
– HTCondor configured to assign multi-core jobs to the appropriate groups
Adjusted the order in which the negotiator considers groups
– Consider multi-core groups before single-core groups
  8 free cores are "expensive" to obtain, so try not to lose them to single-core jobs too quickly
(A configuration sketch follows below.)
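
The following is a minimal sketch of how these pieces fit together in HTCondor configuration. The knob names are standard HTCondor ones, but the group names, quota values and expressions are illustrative assumptions, not RAL's actual settings; how jobs are steered into the multi-core groups (e.g. by the CE setting the job's accounting group) is not shown.

# Worker nodes: one partitionable slot owning all of the machine's resources,
# carved up per job according to RequestCpus / RequestMemory
# (the ARC CE turns the xRSL (count=8) into RequestCpus = 8 on the job)
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True

# Central manager: accounting groups with dynamic quotas ("fairshares");
# names and numbers below are purely illustrative
GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.10
GROUP_QUOTA_DYNAMIC_group_cms_multicore = 0.05
GROUP_ACCEPT_SURPLUS = True

# Negotiate multi-core groups first, so freshly drained 8-core blocks are not
# immediately consumed by single-core jobs (lower sort value = considered earlier)
GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.4e+38, \
                    ifThenElse(regexp("_multicore$", AccountingGroup), 0, 1))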

14 Getting multi-core jobs to work
If lots of single-core jobs are idle & running, how does a multi-core job start?
– By default it probably won't
condor_defrag daemon
– Finds WNs to drain, triggers draining & cancels draining as required
– Configuration changes from the defaults (sketched below):
  Drain 8 cores only, not whole WNs
  Pick WNs to drain based on how many cores can be freed up
  – E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN
– Demand for multi-core jobs is not known by condor_defrag
  Set up a simple cron to adjust the number of concurrently draining WNs based on demand
  – If many multi-core jobs are idle but few are running, drain aggressively
  – Otherwise very little draining
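
A rough sketch of what those changes and the cron could look like. The configuration knobs are condor_defrag's real ones, but the expressions, thresholds and file paths are illustrative assumptions, not RAL's exact setup.

# Central manager: run the defrag daemon
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# Only consider machines that could actually yield an 8-core block
DEFRAG_REQUIREMENTS = PartitionableSlot && TotalCpus >= 8

# A machine counts as "drained" once 8 cores are free; stop draining it then
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus >= 8
DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)

# Prefer machines expected to free 8 cores with the least wasted CPU time
DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput

# Conservative baseline; the cron below raises the concurrency with demand
DEFRAG_MAX_CONCURRENT_DRAINING = 1
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0

And a minimal version of the demand-driven cron (again purely illustrative):

#!/bin/sh
# Hypothetical cron on the central manager: drain aggressively only when
# there is unmet multi-core demand (thresholds invented for illustration)
idle=$(condor_q -global -constraint 'RequestCpus >= 8 && JobStatus == 1' -autoformat ClusterId | wc -l)
running=$(condor_q -global -constraint 'RequestCpus >= 8 && JobStatus == 2' -autoformat ClusterId | wc -l)
if [ "$idle" -gt 10 ] && [ "$running" -lt 100 ]; then
    target=8     # many idle multi-core jobs, few running: drain aggressively
else
    target=1     # otherwise: very little draining
fi
echo "DEFRAG_MAX_CONCURRENT_DRAINING = $target" > /etc/condor/config.d/99-defrag-demand.conf
condor_reconfig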

15 Results
Effect of changing the way WNs to drain are selected
– No change in the number of concurrently draining machines
– Rate of increase in the number of running multi-core jobs much higher
[Plot: running multi-core jobs]

16 Results: recent ATLAS activity
[Plot: running & idle multi-core jobs]
– Gaps in submission by ATLAS result in a loss of multi-core slots
[Plot: number of WNs running multi-core jobs & draining WNs]
CPU wastage significantly reduced by the cron
– Aggressive draining: 3% waste
– Less-aggressive draining: < 1% waste

17 Worker node health check
Startd cron
– Checks for problems on the worker node
  Disk full or read-only
  CVMFS
  Swap
  …
– Prevents jobs from starting in the event of problems
  If there is a problem with ATLAS CVMFS, only ATLAS jobs are prevented from starting
– Information about problems is made available in the machine ClassAds
  Can easily identify WNs with problems, e.g.:
  # condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS
  lcg0980.gridpp.rl.ac.uk  Problem: CVMFS for alice.cern.ch
  lcg0981.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch  Problem: CVMFS for lhcb.cern.ch
  lcg1069.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
  lcg1070.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
  lcg1197.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
  lcg1675.gridpp.rl.ac.uk  Problem: Swap in use, less than 25% free
(A startd cron configuration sketch follows below.)
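
The startd cron mechanism behind this can be configured roughly as follows. The knob names are HTCondor's; the script path, period and attribute names are illustrative, and the health-check script itself (with its per-VO logic) is not shown.

# Run a health-check script periodically on every worker node
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WNHEALTH
STARTD_CRON_WNHEALTH_EXECUTABLE = /usr/local/bin/healthcheck_wn
STARTD_CRON_WNHEALTH_PERIOD = 10m

# The script prints ClassAd attributes on stdout, e.g.
#   NODE_IS_HEALTHY = True
#   NODE_STATUS = "All_OK"
# (or per-VO attributes such as NODE_STATUS_ATLAS). These are merged into the
# machine ClassAd, so they are visible to condor_status and usable in policy.

# Refuse to start new jobs on unhealthy nodes; a per-VO version would also
# compare the job's VO against the per-VO attributes
START = $(START) && (NODE_IS_HEALTHY =?= True)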

18 Worker node health check
Can also put this data into Ganglia
– RAL tests new CVMFS releases, so it's important for us to detect increases in CVMFS problems
– Generally only a small number of WNs have issues
– Example: a user's "problematic" jobs affected CVMFS on many WNs
[Ganglia plots]

19 Jobs monitoring
The CASTOR team at RAL have been testing Elasticsearch
– Why not try using it with HTCondor?
Elasticsearch ELK stack
– Logstash: parses log files
– Elasticsearch: search & analyze data in real time
– Kibana: data visualization
Hardware setup
– Test cluster of 13 servers (old disk servers & worker nodes)
  But 3 servers could handle 16 GB of CASTOR logs per day
Adding HTCondor
– Wrote a config file for Logstash so that history files can be parsed (sketched below)
– Added Logstash to the machines running schedds
Data flow: HTCondor history files → Logstash → Elasticsearch → Kibana
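
A skeleton of what such a Logstash configuration could look like (the file path, index name and output options are assumptions, not RAL's actual config). HTCondor history files contain one block of "Attribute = value" lines per job, terminated by a "*** Offset = ..." banner, so a multiline codec can turn each record into a single event:

input {
  file {
    path => "/var/lib/condor/spool/history"
    start_position => "beginning"
    # join everything up to (and including) the "***" banner into one event
    codec => multiline {
      pattern => "^\*\*\*"
      negate => true
      what => "next"
    }
  }
}
filter {
  # each event is now one job record of "Attr = value" lines; a kv-style or
  # ruby filter can split these into fields (Owner, Cmd, RemoteWallClockTime, ...)
}
output {
  elasticsearch {
    host => "localhost"
    index => "htcondor-history"
  }
}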

20 Jobs monitoring
Can see full job ClassAds

21 Jobs monitoring
Custom plots
– E.g. completed jobs by schedd

22 Jobs monitoring
Custom dashboards

23 Jobs monitoring
Benefits
– Easy to set up
  Took less than a day to set up the initial cluster
– Seems able to handle the load from HTCondor
  For us (so far): < 1 GB, < 100K documents per day
– Arbitrary queries
  Seem faster than using the native HTCondor commands (condor_history); see the example below
– Horizontal scaling
  Need more capacity? Just add more nodes
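
A hypothetical example (assuming history records are indexed as sketched earlier, with an Owner field and an index named htcondor-history):

# Native HTCondor: condor_history scans flat history files on a schedd
condor_history -constraint 'Owner == "atlas001"'

# Elasticsearch: the same question asked of the central index
curl 'http://localhost:9200/htcondor-history/_search?q=Owner:atlas001&size=10'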

24 Summary
Due to scalability problems with Torque/Maui, migrated to HTCondor last year
We are happy with the choice we made based on our requirements
– Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future
Multi-core jobs working well
– Looking forward to ATLAS and CMS running multi-core jobs at the same time

25 Future plans
HTCondor
– Phase in cgroups on the WNs (see the sketch below)
Integration with the private cloud
– When the production cloud is ready, want to be able to expand the batch system into the cloud
– Using condor_rooster for provisioning resources
  HEPiX Fall 2013: http://indico.cern.ch/event/214784/session/9/contribution/205
Monitoring
– Move Elasticsearch into production
– Try sending all HTCondor & ARC CE log files to Elasticsearch
  E.g. could easily find information about a particular job from any log file
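
For the cgroups item, the relevant worker-node knobs are sketched below; the knob names are HTCondor's, the values are illustrative, and the WN must have the cgroup hierarchy mounted with the condor daemons started inside it.

# Place each job in its own cgroup under this base group
BASE_CGROUP = htcondor

# How RequestMemory is enforced: "none", "soft" (jobs may use spare RAM,
# reclaimed under pressure) or "hard" (strict limit)
CGROUP_MEMORY_LIMIT_POLICY = soft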

26 Future plans
Ceph
– Have set up a 1.8 PB test Ceph storage system
– Accessible from some WNs using CephFS
– Setting up an ARC CE with a shared filesystem (Ceph)
ATLAS testing with arcControlTower
– Pulls jobs from PanDA, pushes jobs to ARC CEs
– Unlike the normal pilot concept, jobs can have more precise resource requirements specified
  Input files are pre-staged & cached by ARC on Ceph

27 Thank you!

