Considerations on Using CernVM-FS for Datasets Sharing Within Various Research Communities Catalin Condurache STFC RAL UK ISGC, Taipei, 18 March 2016

Outline: Introduction – Bit of history – EGI CernVM-FS infrastructure status – What is a dataset? – Dataset sharing with CernVM-FS – Use cases

Introduction - CernVM-FS? The CernVM File System is a read-only file system designed to deliver experiment software to grid worker nodes over HTTP in a fast, scalable and reliable way. It is built using standard technologies (FUSE, SQLite, HTTP, Squid and caches).

Introduction - CernVM-FS? Files and directories are hosted on standard web servers and distributed through a hierarchy of caches to individual grid worker nodes. Repositories are mounted in the universal /cvmfs namespace on the client.
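Once the client is configured, repository content appears as an ordinary read-only POSIX tree, so applications need no special APIs. A minimal sketch, assuming a hypothetical repository name and file layout:

```python
# Minimal sketch: repository content is just a read-only POSIX tree under /cvmfs.
# The repository name and file paths below are hypothetical examples.
from pathlib import Path

repo = Path("/cvmfs/example.egi.eu")

# Listing the top level makes the client fetch the needed catalogs over HTTP on demand.
for entry in sorted(repo.iterdir()):
    print(entry.name)

# Reading a file downloads it (into the local cache) on first access only.
setup_script = repo / "software" / "v1.0" / "setup.sh"
print(setup_script.read_text())
```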

Introduction - CernVM-FS? Software needs a single installation; it is then available at any site with the CernVM-FS client installed and configured – relocatable, dependencies included. It is the standard method for distributing HEP experiment software in the WLCG, and has also been adopted by other grid computing communities outside HEP.

Introduction - CernVM-FS? Due to increasing interest, the technology has also been optimized for access to conditions data and other auxiliary data – file chunking for large files – garbage collection at revision level – file system history
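As a purely conceptual illustration of the file-chunking idea (not the actual CernVM-FS implementation; the chunk size is an arbitrary choice here), a large file can be split into content-addressed pieces so that only the pieces actually needed are transferred and cached:

```python
# Illustrative only: split a large file into fixed-size chunks and address each
# chunk by its content hash, so identical chunks are stored and transferred once.
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, an arbitrary value for this sketch

def chunk_file(path):
    """Yield (sha1_hex, chunk_bytes) pairs for a large input file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield hashlib.sha1(chunk).hexdigest(), chunk

# A client needing only part of the file can request just the relevant chunks.
```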

Introduction - CernVM-FS? Also, because it uses standard technologies (HTTP, Squid), CernVM-FS can be used everywhere, e.g. cloud environments and local clusters (not only the grid). For example, add the CernVM-FS client to a VM image and the /cvmfs space is automatically available.

Bit of History – CernVM-FS Over the last 5+ years the distribution of experiment software and conditions data to WLCG sites has changed massively – CernVM-FS became the primary method – no need for local installation jobs or (heavily loaded) network file servers – it relies on a robust, decentralised network of repository replicas and caches

Bit of History – CernVM-FS In parallel, the use of CernVM-FS in communities outside WLCG and HEP has been increasing steadily – a growing number of repositories and CernVM-FS servers around the world

Bit of History – CernVM-FS About 3 years ago the RAL Tier-1 started a non-LHC Stratum-0 service – the inception of the EGI CernVM-FS infrastructure. The European Grid Infrastructure (EGI) enables access to computing resources for European scientists and researchers from all fields of science, from High Energy Physics to the Humanities.

EGI CernVM-FS Infrastructure Status [Diagram: current topology] Stratum-0 services at RAL (egi.eu), NIKHEF (nikhef.nl) and DESY (desy.de); Stratum-1 replicas at RAL, NIKHEF, DESY, CERN, ASGC and TRIUMF; clients are served through proxy hierarchies.

EGI CernVM-FS Infrastructure Status The topology follows the WLCG model. The Stratum-0s are disjoint and hold the source repositories where software is installed by the VOs. Stratum-0 and Stratum-1 services may or may not be geographically co-located. The infrastructure partially reuses the existing hierarchy of proxy servers used for LHC software distribution.

EGI CernVM-FS Infrastructure Status 37 software repositories currently hosted and replicated – HEP and non-HEP – Stratum-0s at RAL, NIKHEF, DESY – Stratum-1s at RAL, NIKHEF, TRIUMF, ASGC – 5 repositories at the time of the EGI CernVM-FS Task Force kick-off (Aug 2013) – a BIG change in two and a half years!

EGI CernVM-FS Infrastructure Status Stratum-0 at RAL – 29 repositories – 2.15 million files (6.65 million on the uploader) – 829 GB – ~107 KB average file size (between 36 KB and 3 MB)

EGI CernVM-FS Infrastructure Status At RAL – 29 repositories on the Stratum-0 – some more / less dynamic: pheno.egi.eu – 377 releases, phys-ibergrid.egi.eu – 5 releases, t2k.egi.eu – 139 releases, glast.egi.eu – 4 releases – some big / not so big: chipster.egi.eu – 313 GB, phys-ibergrid.egi.eu – 69 MB, biomed.egi.eu – 177 GB, ligo.egi.eu – 2.8 GB, t2k.egi.eu – 59 GB, wenmr.egi.eu – 3.8 GB

EGI CernVM-FS Infrastructure Status Stratum-1 at RAL – it replicates 58 repositories from RAL, NIKHEF, DESY, OSG and CERN – over 10 TB

EGI CernVM-FS Infrastructure Status [Chart: last 12 months of activity for egi.eu – hits by repository, in thousands]

What Is a Dataset? A collection of data. A file is the smallest operational unit of data. Files are aggregated into datasets (a named set of files). Datasets can be grouped into containers (a named set of datasets).

What Is a Dataset? If possible, files are kept large (several GB). Datasets often contain thousands of files, with events recorded under “constant” conditions. All this is from an LHC / HEP perspective – is it also true from the perspective of EGI's non-HEP research communities?
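The file / dataset / container hierarchy described above can be pictured with a small data-structure sketch; the class and field names are illustrative, not any experiment's actual data model:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataFile:
    name: str
    size_bytes: int          # files are kept large where possible, often several GB

@dataclass
class Dataset:
    name: str                # a named set of files, recorded under "constant" conditions
    files: List[DataFile] = field(default_factory=list)

@dataclass
class Container:
    name: str                # a named set of datasets
    datasets: List[Dataset] = field(default_factory=list)
```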

Datasets Sharing with CernVM-FS? Nowadays CernVM-FS is used for software (and conditions data) distribution – grid jobs tend to start in large batches running the same code – giving a high cache hit ratio at the proxy level. In what situations can datasets be shared via CernVM-FS? Well…

Datasets Sharing with CernVM-FS? Datasets are usually large amounts of data (big files) and not really appropriate for distribution via CernVM-FS… – while the system can cope with big files nowadays (file chunking), pure dataset distribution would render any kind of cache useless (especially at client level) – the low cache hit ratio would hit proxy caches engineered for relatively low bandwidth
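A back-of-the-envelope sketch of why the cache hit ratio matters for the squid layer; all numbers are invented assumptions, not measurements from the EGI infrastructure:

```python
# Rough estimate: traffic a site proxy must pull from upstream Stratum-1s.
clients    = 1000          # worker nodes requesting the data (assumed)
dataset_gb = 50            # size of the data each job reads (assumed)

def upstream_traffic_gb(hit_ratio):
    """Data the proxy fetches from upstream; cache misses cross the WAN."""
    return clients * dataset_gb * (1.0 - hit_ratio)

# Software distribution: many jobs run the same code -> very high hit ratio.
print(upstream_traffic_gb(0.99))   # ~500 GB crosses the WAN
# Unique data per job: almost no sharing -> the cache barely helps.
print(upstream_traffic_gb(0.05))   # ~47500 GB crosses the WAN
```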

Datasets Sharing with CernVM-FS? …but it could be considered, within limits, when… – there is at least some sharing involved: very frequent re-access of data, or datasets required by many nodes simultaneously (e.g. conditions data) – there are bursts of co-located clients using the same data – datasets are small and proper dedicated storage is not the first option (or not available)

Use cases Educational purpose – small test datasets < O(10 MB) included within the repositories – new users can get started with the code straight away – also useful for software unit tests
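A sketch of how such a small test dataset shipped inside a repository might be used in a software unit test; the repository and file names are hypothetical:

```python
# Hypothetical example: a unit test reading a small (<10 MB) reference dataset
# that is published in the same CernVM-FS repository as the code.
import unittest
from pathlib import Path

TEST_DATA = Path("/cvmfs/example.egi.eu/testdata/sample_events.csv")

class SmallDatasetTest(unittest.TestCase):
    def test_sample_file_is_readable(self):
        self.assertTrue(TEST_DATA.is_file())
        lines = TEST_DATA.read_text().splitlines()
        self.assertGreater(len(lines), 0)

if __name__ == "__main__":
    unittest.main()
```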

Use cases As for "production" datasets – currently relying on a multi-VO DIRAC setup to manage the data via the DIRAC File Catalog (DFC), which has the needed metadata capabilities

Use cases Using CernVM-FS to distribute several genomes, their application indexes and some microarray reference datasets to Chipster servers. Chipster could also use CernVM-FS to distribute other research datasets – BLAST database indexes, needed for sequence similarity searches on protein or nucleotide sequences (~100 GB)
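One way a Chipster server could point BLAST+ at databases published on CernVM-FS is via the BLASTDB environment variable; the repository layout, database name and invocation below are assumptions, not Chipster's actual setup:

```python
# Hypothetical sketch: run a BLAST+ search against databases served from /cvmfs.
import os
import subprocess

# Invented path under the chipster.egi.eu repository mentioned earlier.
os.environ["BLASTDB"] = "/cvmfs/chipster.egi.eu/blast-databases"

subprocess.run(
    ["blastn",                        # NCBI BLAST+ nucleotide search
     "-db", "nt",                     # database resolved via BLASTDB
     "-query", "input_sequences.fasta",
     "-out", "hits.txt"],
    check=True,
)
```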

Use cases Adding the BLAST DBs to the current Chipster virtual image would increase its size by 100 GB – too much. New releases of the databases are published about once a month; if incorporated into the Chipster image, users would need to update their (already large) virtual images regularly.

Use cases By using CernVM-FS, the latest versions of the BLAST datasets could be provided to Chipster users in a centralized and more storage-efficient way. The community is very interested in continuing to use CernVM-FS to share both scientific tools and datasets for the Chipster servers running in the EGI Federated Cloud.

Use cases Some dataset distribution: ~180 MB of data that always has to be fed as input to the simulation – it sets up the environment. Another set of input data may have to be uploaded in order to obtain a different setup.

Use cases LIGO (UK) – a recent dataset was 1.5 TB in 5400 files – 12 directories, each file up to 450 MB – not suitable, but…

Use cases …from time to time people create fake data with fake signals – fewer files, so it might be a good fit for CernVM-FS – waiting for the opportunity to arise

Use cases AUGER – does not plan to use CernVM-FS for data distribution in the near future – may consider it for distributing some relatively small databases, but there is not even a plan for tests yet

CernVM-FS vs Datasets Sharing - Conclusions CernVM-FS should not be the first option… – it may affect the performance of the service at Stratum-1 and squid level …unless certain conditions are met – at least some sharing is involved, such as very frequent re-access of the data – or the datasets are required by many nodes simultaneously (e.g. conditions data)

CernVM-FS vs Datasets Sharing - Conclusions Instead, use a Storage Element (SE)… – not always possible …or set up a Web server… – /cvmfs/ -> wget – it could still make use of squids and caches …or use NFS for local storage
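The plain web-server alternative can be as simple as fetching the files over HTTP (squids and caches can still sit in front of such a server); a sketch with an invented URL:

```python
# Hypothetical sketch: fetch a dataset file from a plain web server instead of
# distributing it through CernVM-FS. The URL below is an invented example.
import urllib.request

url = "http://data.example.org/datasets/run2016/file_0001.root"
urllib.request.urlretrieve(url, "file_0001.root")
```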

CernVM Users Workshop, RAL, 6-8 June – registration open!

Other Conclusions?

Thank you!