Report from the CASTOR external operations F2F meeting held at RAL in February 2009. Barbara Martelli, INFN-CNAF.

Outline
Face-to-face workshop overview
Site status reports
Development status and plans
 ◦ CASTOR
 ◦ Tape
 ◦ Monitoring
Development agenda
Open issues
Load testing
Conclusions

CASTOR workshop overview
Held at RAL in February 2009 with the following goals:
 ◦ Exchange experience between the sites currently using CASTOR about day-by-day operations and issues
 ◦ Plan for 2009: scaling up the deployments, number of instances, size of disk caches, tape infrastructure, s/w upgrades, migration to SLC5, usage of the 32-bit architecture
 ◦ Describe and discuss the development plans for 2009 and beyond: planned releases, feature changes, timelines, support for the different software versions
 ◦ Train the newcomers (several new entries in the CASTOR staff at various sites)
~30 participants from CERN, RAL, CNAF and ASGC (via phone conference).

CASTOR Sites Reports
Staff:
 ◦ CERN: CASTOR2 & SRM operations staff decreases from ~3 to ~2 FTEs; tape team from 4 to 2 FTEs
 ◦ RAL: various changes in the staff
 ◦ CNAF: the main CASTOR2 expert has left and has been replaced by a new administrator
Hardware survey, per site (tape robots / tape drives / tape space in PB / disk servers / disk space in PB / Oracle servers):
 ◦ CERN: per exp
 ◦ CNAF: 2 / 18 (soon 58) / (soon 6) / 6
 ◦ RAL: 1 (soon 2) / 7 dedicated + 32 shared / 5k slots (soon 10k)
 ◦ ASGC:

CASTOR Sites Reports
Software versions:
Monitoring in brief:

Databases
The time slot dedicated to CASTOR in the 3D meetings was found very useful.
Working together to move all CASTOR databases into the CERN 3D Enterprise Manager, in order to have a single entry point for CASTOR DB monitoring.
Future development:
 ◦ DB-side connection pooling for SRM?
 ◦ Get rid of the optimiser hints in the code
 ◦ Create SQL profiles via OMS and distribute them as part of the CASTOR software? (see the sketch below)
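
As an illustration of the last bullet, here is a minimal, hypothetical sketch of how an SQL profile could be created programmatically and then shipped with the software, rather than only through the OMS GUI. It assumes Python with the cx_Oracle driver and Oracle's DBMS_SQLTUNE package; the connection string, sql_id and profile name are placeholders and do not come from the actual CASTOR code.

```python
# Hypothetical sketch: create and accept an SQL profile for one problematic
# statement with Oracle's SQL Tuning Advisor (DBMS_SQLTUNE).
# DSN, sql_id and profile name are placeholders for illustration only.
import cx_Oracle

conn = cx_Oracle.connect("castor_dba/secret@castor-db")   # placeholder DSN
cur = conn.cursor()

# 1. Create and run a tuning task for the statement identified by its sql_id.
task = cur.var(cx_Oracle.STRING)
cur.execute("""
    BEGIN
      :task := DBMS_SQLTUNE.CREATE_TUNING_TASK(sql_id => :sqlid);
      DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => :task);
    END;""", task=task, sqlid="abcd1234efgh5")

# 2. If the advisor recommends an SQL profile, accept it under a fixed name so
#    that it can later be exported and distributed together with the software.
profile = cur.var(cx_Oracle.STRING)
cur.execute("""
    BEGIN
      :prof := DBMS_SQLTUNE.ACCEPT_SQL_PROFILE(
                 task_name   => :task,
                 name        => 'CASTOR_EXAMPLE_PROFILE',
                 force_match => TRUE);
    END;""", prof=profile, task=task.getvalue())
conn.commit()
print("accepted profile:", profile.getvalue())
```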

Development (Sebastien Ponce)
A task force from January to June 2008 reviewed the design choices of CASTOR and started 5 projects to address its weak points:
 ◦ File access protocols and data latency: CASTOR needs to support analysis activity, i.e. small files, many concurrent streams, mainly disk-based access with aggregated tape recalls, and low latency for file opening. The XROOT protocol and I/O server have been chosen to achieve this, with CASTOR-specific extensions.
 ◦ Tape efficiency and repack: file aggregation on tape (e.g. tar/zip files of 10 GB), a new tape format with fewer tape marks and no metadata, and new repack strategies.
 ◦ Security: the goal is to ensure that every user is authenticated (authentication), every action is logged (accounting), every resource can be protected (authorization), and that Grid and local users fully interoperate. Kerberos 5 or GSI authentication for all client connections.
 ◦ SRM and database schema: plan to combine the stager and the SRM software, which would also allow merging the two databases.
 ◦ Monitoring: ease operation; real-time, automatic detection of pathological cases.

File access protocols (Sebastien Ponce)

Tape Development (German Cancio's presentation)
Recall/migration policies: write more data per tape mount
 ◦ Hold back requests based on the amount of data and the elapsed time
Writing small files to tape is slow, due to the tape format (ANSI AUL):
 ◦ 3 tape marks per file
 ◦ 9 seconds per data file, independent of its size
The new tape format reduces the metadata overhead with a multi-file block format within the ANSI AUL format:
 ◦ A header per block for "self description"
 ◦ 3 tape marks per n files
Layout per n-file aggregate: header labels (hdr1, hdr2, uh1), tape mark, data file 1 … data file n, tape mark, trailer labels (eof1, eof2, utl1), tape mark. Each 256 KB data file block written to tape includes a 1 KB header.
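
A rough back-of-the-envelope, using only the figures quoted on the slide (3 tape marks and ~9 s of fixed overhead per file in the old format, 3 tape marks per aggregate in the new one, a 1 KB header per 256 KB data block), illustrates why aggregation matters for small files. The drive speed used below is an arbitrary assumption for the example, not a number from the slides.

```python
# Back-of-the-envelope comparison of the old (AUL, per-file tape marks) and
# new (multi-file) CASTOR tape formats, using the numbers quoted on the slide.
# The 120 MB/s drive speed is an arbitrary assumption for illustration only.

n_files = 1000            # number of small files in one migration
file_size_mb = 50         # size of each file in MB
drive_speed_mbs = 120.0   # assumed sustained drive speed (MB/s)

data_time = n_files * file_size_mb / drive_speed_mbs   # pure data transfer time

old_overhead = n_files * 9.0   # ~9 s of tape-mark overhead per file (old format)
new_overhead = 9.0             # ~3 tape marks for the whole aggregate (new format)

# Per-block header cost of the new format: 1 KB header per 256 KB data block.
header_fraction = 1.0 / 256.0  # ~0.4% extra volume written

print(f"data transfer time    : {data_time:7.0f} s")
print(f"old format total      : {data_time + old_overhead:7.0f} s")
print(f"new format total      : {data_time * (1 + header_fraction) + new_overhead:7.0f} s")
print(f"header overhead (new) : {header_fraction:.1%} of the data volume")
```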

Monitoring Development
Monitoring CASTOR today requires collecting information from many different tools.
Both CERN and RAL are putting a big effort into enhancing and consolidating the existing monitoring tools.
RAL is starting from CMS monitoring code to bring together, in a single view, information scattered at RAL but also across the 4 sites (see Brian Davies's presentation).
The new monitoring system developed at CERN is part of the release. Strategy: use the existing logging system (DLF) as the starting point, and eventually improve and extend the existing log messages to contain any missing information (see Dennis's presentations).
Data flow: DLF tables → (SQL) → monitoring tables → PHP → web interface. A sketch of this kind of aggregation follows below.
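
To make the data flow above concrete, here is a minimal, hypothetical sketch of the kind of aggregation step that could fill a monitoring table from the DLF tables. All table and column names (dlf_messages, facility, severity, timestamp, castor_mon_errors) are invented for the example; they are not the actual DLF or monitoring schema.

```python
# Hypothetical aggregation step from DLF log tables into a monitoring summary
# table read by a PHP web interface. All table/column names are invented for
# illustration; the real DLF and monitoring schemas may differ.
import cx_Oracle

conn = cx_Oracle.connect("castor_mon/secret@castor-db")   # placeholder DSN
cur = conn.cursor()

# Count error-level messages per facility over the last 10 minutes and store
# one summary row per facility together with the sampling time.
cur.execute("""
    INSERT INTO castor_mon_errors (facility, n_errors, sample_time)
    SELECT facility, COUNT(*), SYSTIMESTAMP
      FROM dlf_messages
     WHERE severity = 'Error'
       AND timestamp > SYSTIMESTAMP - INTERVAL '10' MINUTE
     GROUP BY facility""")
conn.commit()
```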

CASTOR Agenda: current production release
Is in maintenance mode:
 ◦ Stable, deployed in production at all sites
 ◦ No new features included for quite a long time
 ◦ Bug-fix releases for major issues only
Will be phased out in the next few months at CERN:
 ◦ Repack and analysis setups are already running the newer release
 ◦ Plans to upgrade the other production instances in March/April
Would no longer be supported after the release of the following version, according to the current rules, i.e. sometime in spring.

CASTOR Agenda: the new release (2.1.8)
Some of the new features:
 ◦ Support for replication on close
 ◦ Basic user space accounting
 ◦ High-level monitoring interface
 ◦ Ordering of requests is now respected
 ◦ Support for OFFLINE libraries
Now stabilized: running for ~2 months on CERN's repack and analysis setups.
The CERN operations team proposes to build a release with the important fixes/features backported and then to switch to maintenance mode; it should be available by the end of February and deployed 2-4 weeks later on the CERN production instances.

CASTOR Agenda: development version
The current development version (head of CVS) will include:
 ◦ Improved nameserver
 ◦ Further xroot integration (write case)
 ◦ Revisited build infrastructure
 ◦ Ability to build only the client/tape part
Timelines are not yet very precise: Spring/Summer 2009.
Deployment before LHC startup is unlikely for the T0 with the current LHC schedule.

Load tests (Dennis)
The certification setup at CERN is based on virtual machines and is aimed at certifying the functionality of the CASTOR components.
Many test scripts exist (/afs/cern.ch/project/castor/stresstest):
 ◦ Heavily customised to the CERN environment
 ◦ Each test should run for hours
 ◦ Requires expert knowledge
 ◦ Not all elements of CASTOR are tested
 ◦ Tests are customised for each CASTOR version
RAL is thinking of creating a test bed in order to try out new CASTOR releases in a T1-like environment. A sketch of what such a stress driver could look like follows below.
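
For illustration only, a minimal sketch of the kind of stress driver such scripts implement: many concurrent clients writing files into a stager and querying their status. It assumes the standard CASTOR command-line clients (rfcp, stager_qry) are installed and that STAGE_HOST/STAGE_SVCCLASS point at a test instance; the hosts, paths and service class below are placeholders, not the actual CERN stress-test scripts.

```python
# Illustrative stress-driver sketch (not the actual CERN stress-test scripts):
# N concurrent workers each copy a local file into CASTOR with rfcp and then
# query its status with stager_qry. Hosts, paths and service class are
# placeholders; assumes the CASTOR client tools are installed and configured.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 20
LOCAL_FILE = "/tmp/1MB.dat"                      # test payload
CASTOR_DIR = "/castor/example.org/test/stress"   # placeholder namespace path

env = dict(os.environ, STAGE_HOST="castor-test", STAGE_SVCCLASS="testSvcClass")

def one_iteration(i: int) -> int:
    """Write one file into CASTOR and query it back; return 0 on success."""
    castor_file = f"{CASTOR_DIR}/file_{i:05d}"
    put = subprocess.run(["rfcp", LOCAL_FILE, castor_file], env=env)
    if put.returncode != 0:
        return put.returncode
    qry = subprocess.run(["stager_qry", "-M", castor_file], env=env)
    return qry.returncode

with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    results = list(pool.map(one_iteration, range(200)))

print(f"{results.count(0)}/{len(results)} iterations succeeded")
```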

Some of the Main Open Issues
Hotspots: a disk server goes into high I/O wait and delivers (almost) no data
 ◦ Can only recover by killing off all RUNning requests (i.e. setting the server state to DISABLED)
 ◦ Correlations to specific RAID configurations have been observed: reducing the RAID units shows higher latencies and severely limits I/O performance (see Ignacio's talk)
Very big IDs are sometimes inserted into the id2type table (aka the "BigIDs" problem):
 ◦ e.g. "select count(*) from id2type where id > 10E18" finds such entries, yet no "no type found for ID" errors appear; the daemon keeps trying to process requests in status=0 which have no id2type entry
 ◦ A hand-written recipe from Shaun de Witt works around it, but a more stable solution is needed
Occasional unresponsiveness of the JobManager for 2-3 minutes:
 ◦ Delay with jobs reaching the job manager from the stager
 ◦ Delay with jobs reaching LSF
Oracle unique constraint violations in RH
Possible crosstalk between the ATLAS and LHCb stagers (Oracle bug)

Conclusions
The CASTOR workshop was useful for exchanging experience between sites on administration and on the new problems that arise.
One of the main topics of discussion was the upgrade path for CASTOR software releases at the non-CERN sites (upgrade to 2.1.8 or not?).
Significant effort is being put into the development of monitoring integration tools, tape efficiency and the core software.