HTCondor and the European Grid
Andrew Lahiff
STFC Rutherford Appleton Laboratory
European HTCondor Site Admins Meeting 2014

Outline
– Introduction
– CREAM CE
– ARC CE

Introduction
Computing element requirements
– Job submission from the LHC VOs
  AliEn: ALICE
  HTCondor-G: ATLAS, CMS
  DIRAC: LHCb
– EMI WMS job submission
  Still used by some non-LHC VOs
  Usage likely to keep decreasing
– Information system
  Information about jobs & worker nodes needs to go into the BDII

CREAM CE
Does not work out-of-the-box with HTCondor
– HTCondor is no longer officially supported
However
– BLAH supports HTCondor, so there is hope…
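For illustration, BLAH's HTCondor support is switched on in /etc/blah.config; a minimal sketch, assuming the standard EMI-3 layout (the values shown are examples, not a verified RAL configuration):

  supported_lrms=condor
  condor_binpath=/usr/bin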

CREAM CE
Problems
– No YAIM function for configuring the CE to use HTCondor
– Scripts to publish dynamic information about the state of jobs are missing
– The YAIM function for configuring BLAH doesn't support HTCondor
RAL solution
– Make use of scripts from very old versions of CREAM which did support HTCondor
  Updated these for the current EMI-3 CREAM CE
  They needed modernising, e.g. to support partitionable slots

CREAM CE: queues
We wanted the following queues
– Different memory limits (1 GB, 2 GB, 3 GB, 4 GB)
– Multi-core
Make sure site-info.def contains details about the queues
Modifications
– condor_submit.sh (from the glite-ce-blahp RPM)
  Adds request_memory and request_cpus to the submit file (see the example below)
– lrmsinfo-condor
  Runs condor_q with constraints for each queue
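As an illustration, the modified condor_submit.sh appends resource requests like these to the submit file it generates; the values below correspond to a hypothetical 4 GB multi-core queue, not RAL's exact settings:

  # added by the queue-aware condor_submit.sh (illustrative values)
  request_cpus = 8
  request_memory = 4000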

CREAM CE: accounting
APEL accounting
– At RAL
  Script which modifies blahp.log files
    Appends .hostname to job ids (required for sites with multiple schedds)
  Script which converts condor history files to PBS-style accounting files
    These are read by the standard PBS APEL parser
– Now there's a fork of APEL containing HTCondor integration
  Script which writes some information from condor history into new files
  HTCondor APEL parser which reads these files
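To give a flavour of the converter's output, a PBS-style job-end ("E") record looks roughly like the abridged example below, which APEL's PBS parser can read; every value here is invented:

  03/11/2014 10:15:02;E;1234.cream-ce01.example.ac.uk;user=atlas001 group=atlas queue=grid3000M start=1394528100 end=1394535302 Exit_status=0 resources_used.cput=01:55:32 resources_used.walltime=02:00:02 resources_used.mem=1843200kb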

CREAM CE
RAL has been running 2 CREAM CEs with HTCondor in production for over a year
– Initially used by ALICE, LHCb and non-LHC VOs; now mainly used by a single non-LHC VO
– Over 3 million jobs run

ARC CE
NorduGrid product
– Available in EMI and UMD
Features
– Simpler than the CREAM CE
– Can send APEL accounting data directly to the central broker
– File staging: can download & cache input files; upload output to an SE
Configuration
– Single config file, /etc/arc.conf (see the sketch below)
– No YAIM required
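A heavily abridged sketch of what such a config might look like with an HTCondor backend; the section and option names follow the ARC 4.x arc.conf style, and all values are placeholders:

  [common]
  hostname="arc-ce01.example.ac.uk"
  lrms="condor"

  [grid-manager]
  controldir="/var/spool/arc/jobstatus"
  sessiondir="/var/spool/arc/grid"

  [queue/grid]
  name="grid"
  comment="Default queue"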

ARC CE
Can the LHC VOs submit to ARC?
– ATLAS
  Uses HTCondor-G for job submission; able to submit to ARC
  ARC Control Tower for job submission
– CMS
  Uses HTCondor-G for job submission
– LHCb
  The ability to submit to ARC was added to DIRAC last year
– ALICE
  Recently regained the capability to submit to ARC
Submission via the EMI WMS
– Works (uses HTCondor-G)

ARC CE
Integration with HTCondor
– Like the CREAM CE scripts, the HTCondor backend scripts had become a bit out of date
– Older versions (< 4.1.0) required lots of patches
  e.g. assumptions were made which do not hold with partitionable slots
– The current version (4.2.0) works out-of-the-box with HTCondor
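For context, partitionable slots of the kind that tripped up the older scripts are configured on the worker nodes with standard HTCondor knobs along these lines (a generic example, not RAL's exact configuration):

  # condor_config on the worker nodes: one partitionable slot owning all resources
  NUM_SLOTS = 1
  NUM_SLOTS_TYPE_1 = 1
  SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
  SLOT_TYPE_1_PARTITIONABLE = TRUE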

ARC CE
Recent releases
– Contain many patches provided by RAL for the HTCondor backend scripts
– Bug fix for the memory limit of multi-core jobs (HTCondor)
Future release
– Have submitted a patch to enable the CE to make use of per-job history files
Repository
– At RAL we use the NorduGrid repository. For 4.2.0:

ARC CE: accounting
APEL node not required
– The JURA component of ARC sends accounting data directly to the APEL central broker
– Note that some sites prefer to have an APEL publisher node
Scaling factors
– Unlike Torque, HTCondor doesn't scale CPU & wall time
– How this is handled at RAL (sketched below)
  Startd ClassAds contain a scaling factor
  An ARC auth plugin applies the scaling factors to completed jobs (1 line added to /etc/arc.conf)
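One possible wiring of that mechanism; the attribute name, plugin path and timeout are hypothetical, and the exact authplugin arguments and substitutions depend on the ARC version:

  # condor_config on each worker node: advertise the machine's scaling factor
  ScalingFactor = 2.67
  STARTD_ATTRS = $(STARTD_ATTRS) ScalingFactor

  # /etc/arc.conf, [grid-manager] section: run a plugin that rescales the recorded
  # CPU & wall time once a job reaches the FINISHED state
  authplugin="FINISHED timeout=30 /usr/local/bin/apply_scaling_factor %C %I"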

ARC CE
Information passed to HTCondor from jobs
– Max wall time, max CPU time, number of cores, total memory
– Generates an appropriate Periodic_remove expression for the submit file
  Based on memory, wall time and CPU time
– Example snippet from an ATLAS job:
  request_cpus = 1
  request_memory =
  JobTimeLimit =
  JobMemoryLimit =
  Periodic_remove = FALSE || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit

ARC CE: queues
– Can specify an HTCondor requirements expression for each queue in /etc/arc.conf (example below)
– request_memory can be taken from the queue configuration (if the job doesn't specify memory)
– Cannot set up queues with different time limits (without hacking the scripts)
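An illustrative queue block of this kind, assuming the condor_requirements option of ARC's HTCondor backend; the queue name and the requirements expression are made up:

  [queue/grid3000M]
  name="grid3000M"
  comment="Jobs needing up to 3 GB of memory per core"
  condor_requirements="(TotalSlotMemory >= 3000)"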

ARC CE
Current issues
– Currently no proxy renewal for jobs submitted via the EMI WMS
  Affects non-LHC VOs only
  A workaround exists (*)
– Job status information is not real-time
  LHCb have commented on this
  If the ARC WS interface is used, the situation is much better
    Requires RFC proxies
*

Summary
There are no blocking issues preventing European sites from using HTCondor as a batch system
Both CREAM and ARC CEs work!
– CREAM currently requires more effort to set up
– HTCondor-CE?
– Integration with BDII