Delivering Experiment Software to WLCG sites
A new approach using the CernVM Filesystem (cvmfs)
Ian Collier – RAL Tier 1
HEPSYSMAN, November 22nd 2010, Birmingham

Contents
– What's the problem?
– What is cvmfs – and why might it help?
– Experiences at PIC
– Experiences at RAL
– Future

So, what's the problem?
Experiment application software (and conditions databases) is currently installed in a shared area at each individual site (NFS, AFS…). The software is installed by jobs that run on a privileged node of the computing farm, and these install jobs must be run at every site.

So, what's the problem? II
Issues observed with this model include:
– NFS scalability: at RAL we see big load issues, especially for Atlas. Upgrading the NFS server helped, but the problem has not gone away. At PIC they see huge loads for LHCb.
– The shared area is sometimes not reachable (stale mounts on the WN, or the NFS server is too busy to respond).
– Software installation at many Grid sites is a tough task (job failures, resubmission, tag publication...).
– Space on performant NFS servers is expensive: if VOs want to install new releases and keep the old ones, they have to ask for quota increases.
– In the three months to September there were 33 GGUS tickets related to shared software area issues for LHCb.

What might help?
– A caching file system would be nice – some sites use AFS, but install jobs still seem to be very problematic.
– A read-only file system would be good.
– A system optimised for many duplicated files would be nice.

What, exactly, is cvmfs?
– It is not, really, anything to do with virtualisation.
– It is a client-server file system, originally developed to deliver VO software distributions to (CernVM) virtual machines in a fast, scalable and reliable way.
– Implemented as a FUSE module, it makes a specially prepared directory tree stored on a web server look like a local read-only file system on the local (virtual or physical) machine.
– It uses only outgoing HTTP connections, avoiding most of the firewall issues of other network file systems.
– It transfers data file by file on demand, verifying the content against SHA-1 hashes.
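
The content-addressed transfer model is simple enough to sketch. The Python snippet below is only an illustration of the idea – fetch a file from the repository web server by its content hash, verify it with SHA-1, and keep it in a local cache. The URL layout, directory names and repository address are assumptions for the sketch, not the actual cvmfs wire format.

```python
import hashlib
import os
import urllib.request

# Hypothetical repository URL and cache directory -- illustrative only,
# not the real cvmfs on-disk or on-the-wire layout.
REPO_URL = "http://cvmfs-server.example.org/opt/atlas/data"
CACHE_DIR = "/var/cache/cvmfs-sketch"


def fetch_and_verify(sha1_hex: str) -> str:
    """Download a file addressed by its SHA-1 hash and verify the content.

    Returns the path of the cached copy; raises if the hash does not match.
    """
    cached = os.path.join(CACHE_DIR, sha1_hex)
    if os.path.exists(cached):                          # already in the local cache
        return cached

    url = f"{REPO_URL}/{sha1_hex[:2]}/{sha1_hex[2:]}"   # assumed URL scheme
    with urllib.request.urlopen(url) as response:       # plain outgoing HTTP
        data = response.read()

    if hashlib.sha1(data).hexdigest() != sha1_hex:
        raise ValueError("downloaded content does not match its SHA-1 key")

    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cached, "wb") as f:
        f.write(data)
    return cached
```

Because files are addressed by their content hash, a copy that is already in the cache never needs to be downloaded or re-validated against the server.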

What is cvmfs? – Software Distribution
Essential properties:
∙ Distribution of read-only binaries
∙ Files and file metadata are downloaded on demand and cached locally
∙ Intermediate squids reduce network traffic further
∙ File-based deduplication – a side effect of the hashing
∙ Self-contained (e.g. /opt/atlas), does not interfere with the base system
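
File-based deduplication falls out of the content hashing: two identical files produce the same key, so they are stored and transferred only once. A minimal sketch of the idea in Python, assuming nothing more than a directory of release files (the /opt/atlas path is just an example):

```python
import hashlib
from pathlib import Path


def group_by_content(release_dir: str) -> dict:
    """Group files under release_dir by the SHA-1 of their content.

    Every group with more than one member would be stored and transferred
    only once in a content-addressed repository.
    """
    groups = {}
    for path in Path(release_dir).rglob("*"):
        if path.is_file():
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path)
    return groups


if __name__ == "__main__":
    groups = group_by_content("/opt/atlas")       # example path
    total = sum(len(paths) for paths in groups.values())
    print(f"{total} files, {len(groups)} unique by content")
```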

What is cvmfs? – Integrity & authenticity
Principle: digitally signed repositories with a certificate whitelist.
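
The trust chain can be sketched as follows: the repository catalogue is hashed, the hash is signed with the repository certificate, and the client accepts only certificates whose fingerprints appear on a whitelist. The snippet below is a hedged illustration of that check using the third-party cryptography package; the exact data formats (what precisely is signed, how the whitelist is encoded, the assumption of an RSA key) are assumptions for the sketch, not the real cvmfs formats.

```python
import hashlib

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding


def verify_catalog(catalog: bytes, signature: bytes, cert_pem: bytes,
                   whitelisted_fingerprints: set) -> bool:
    """Accept a catalogue only if it is signed by a whitelisted certificate."""
    cert = x509.load_pem_x509_certificate(cert_pem)

    # 1. The signing certificate itself must be on the whitelist.
    fingerprint = cert.fingerprint(hashes.SHA1()).hex()
    if fingerprint not in whitelisted_fingerprints:
        return False

    # 2. The signature must cover the catalogue's content hash
    #    (assumed here to be the hex SHA-1 digest of the catalogue,
    #    signed with an RSA repository key).
    catalog_hash = hashlib.sha1(catalog).hexdigest().encode()
    try:
        cert.public_key().verify(signature, catalog_hash,
                                 padding.PKCS1v15(), hashes.SHA1())
    except Exception:
        return False
    return True
```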

What is cvmfs? – Repository statistics
Note how small the bars for unique files are – both in number and by volume. The hashing process identifies duplicate files, and once they are cached they are never transferred twice.

Tests at PIC
Set out to compare performance at the WN in detail. Metrics measured:
– Execution time for SetupProject – the most demanding phase of the job for the software area (a huge number of stat() and open() calls)
– Execution time for DaVinci
– Dependence on the number of concurrent jobs
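
The dependence on concurrency can be probed with a very small harness like the one below. This is a generic sketch of the methodology – time a setup command while N copies run in parallel – not the scripts actually used at PIC, and the command line is a placeholder.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

SETUP_CMD = ["bash", "-lc", "SetupProject DaVinci"]   # placeholder command


def timed_run() -> float:
    """Run the setup command once and return its wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(SETUP_CMD, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start


def average_setup_time(concurrency: int) -> float:
    """Average setup time when `concurrency` copies run at the same time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(lambda _: timed_run(), range(concurrency)))
    return sum(times) / len(times)


if __name__ == "__main__":
    for n in (1, 4, 8, 16, 32):
        print(f"{n:3d} concurrent jobs: {average_setup_time(n):7.1f} s average")
```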

Tests at PIC – setup time

Tests at PIC – local cache size
LHCb: one version of DaVinci (analysis package). The software takes 300 MB + CernVM-FS catalogue overhead; total space: 900 MB. The catalogue contains file metadata for all LHCb releases – it is downloaded once (for every new release) and then kept in cache. Each additional version of DaVinci executed adds 100 MB.
ATLAS: one release of Athena: 224 MB of software + catalogue files: 375 MB. The overhead is smaller because the catalogue has been better optimised for the ATLAS software structure.
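
From these figures one can estimate the cache footprint on an LHCb worker node. A trivial sketch, assuming only the numbers quoted above (900 MB for the first DaVinci version including the catalogue, roughly 100 MB per additional version):

```python
def lhcb_cache_mb(davinci_versions: int) -> int:
    """Rough LHCb cvmfs cache size in MB, using the figures quoted above."""
    FIRST_VERSION_MB = 900      # software plus catalogue overhead
    EXTRA_VERSION_MB = 100      # each additional DaVinci version executed
    if davinci_versions < 1:
        return 0
    return FIRST_VERSION_MB + (davinci_versions - 1) * EXTRA_VERSION_MB


for n in (1, 2, 5):
    print(f"{n} DaVinci version(s): about {lhcb_cache_mb(n)} MB in cache")
```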

Tests at RAL
More qualitative – interested in scalability for the servers, with a focus on how well cvmfs might replace the overloaded NFS server.
– Have not examined in as much detail what happens at the client
– Have confirmed that we can run through jobs with no problems

Tests at RAL – Atlas Monte Carlo
Early test with 800 or so jobs – the squid barely missed a beat.

Tests at RAL – Atlas Monte Carlo
For comparison, the NFS Atlas software server in the same period, with 1500 or so jobs running.

Tests at RAL – Atlas HammerCloud tests
Average over several sites.

Tests at RAL – Atlas HammerCloud tests
6150 jobs at RAL.

Tests at RAL – Atlas HammerCloud tests
Let's look in more detail – CPU/walltime, average over several sites vs RAL.

Tests at RAL – Atlas HammerCloud tests
Let's look in more detail – events/wallclock, average over several sites vs RAL.

Tests at RAL – Atlas Monte Carlo
Since reconfiguring cvmfs and starting analysis jobs, the squid is again happy. Worth noting that even at the point where the network is busiest, the CPU is not.

Tests at RAL – Load on Squids
Not quite negligible, but very manageable.
– For the initial tests the squid was running on the retired Atlas software server (a 5-year-old WN), separate from the other Frontier squids.
– For resilience (rather than load) we have added a second squid – each client randomly mounts one or the other, and failover appears transparent (the selection logic is sketched below).
– We are starting to accept Atlas user analysis jobs, which will all use cvmfs, on an experimental basis. We will learn much more: they will use a wider range of releases, which should stress the caches more than the current production and test jobs.
– Over the last weekend we removed the limits – the squids are still happy.
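
The "pick one squid at random, fall back to the other" behaviour mentioned above is easy to sketch. In a real deployment this lives in the cvmfs client proxy configuration rather than in code; the snippet below only illustrates the selection and failover logic, with hypothetical host names.

```python
import random
import urllib.error
import urllib.request

# Hypothetical site squids -- the host names are examples only.
SQUIDS = ["http://squid01.example.ac.uk:3128",
          "http://squid02.example.ac.uk:3128"]


def fetch_via_squid(url: str) -> bytes:
    """Try a randomly chosen squid first; fail over to the remaining ones."""
    last_error = None
    for proxy in random.sample(SQUIDS, len(SQUIDS)):   # random order per call
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy}))
        try:
            with opener.open(url, timeout=10) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc                           # try the next squid
    raise RuntimeError(f"all squids failed: {last_error}")
```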

Current State of CernVM-FS
– Network file system designed for software repositories
– Serves 600 GB and 18.5 million files and directories
– Revision control based on catalogue snapshots
– Outperforms NFS and AFS; warm-cache speed comparable to a local file system
– Scalable infrastructure
– Integrated with automount/autofs
– Delivered as an rpm/yum package – and as part of CernVM
– Atlas, LHCb, Alice, CMS… all supported
Active developments:
– Failover mirror of the source repositories (CernVM-FS already supports automatic host failover)
– Extend to conditions databases
– Service to be supported by CERN IT
– Security audit – in progress
– Client submitted for inclusion in SL

Summary
Good for sites:
– Performance at the client is better
– Squid is very easy to set up and maintain – and very scalable
– Much less network traffic
Good for VOs:
– VOs can install each release once – at CERN; no more local install jobs
– Potentially useful for hot files too

Acknowledgements & Links
Thanks to:
– Jakob Blomer, who developed cvmfs
– Elisa Lanciotti, who carried out the tests at PIC
– Alastair Dewhurst & Rod Walker, who have been running the Atlas tests at RAL
Links:
– PIC tests: Elisa's talk at the September 2010 GDB
– RAL cvmfs squids:
– CVMFS downloads:
– CVMFS technical paper:
– Jakob's talk from CHEP: