Operational experiences
Castor deployment team
Castor Readiness Review – June 2006

Jan van Eldik (IT/FIO/FS) 2 Outline
- Operational model
- Castor-2 usage characteristics by the different experiments
- What works, what doesn’t, what can be improved?
- Database operations
- Services (DLF, LSF, stager, rmmaster)
- Grid-related complications, just to give you a flavour

Jan van Eldik (IT/FIO/FS) 3 Current operational model
- Many distinct but similar Remedy/GGUS flows…
  - Castor.support
  - Castor2.support
  - WAN data operations
  - Data service interventions
- … plus additional requests/alarms/information via …
  - Castor.operations, castor-deployment
  - Experiment and Service Challenge mailing lists
  - Private e-mails…
- … but always the same group of people
  - currently 4 service managers (but <4 FTEs) + 2 core developers
- Drawbacks:
  - As things go wrong on many different levels (hardware failures, network problems, system crashes, configuration problems), debugging is not trivial. Neither is problem reporting…
  - Often many-to-many reporting leads to silence…
  - SLA: best effort, no formal 24x7 coverage, no measurable QoS

Jan van Eldik (IT/FIO/FS) 4 Apply standard operational model
- Re-factor the support areas
  - tape-related problems will be handled by a separate flow
- Rely more on CC operators and the SysAdmin team
  - to provide 24x7 coverage
  - already used for H/W and OS-level problems
  - we will need to provide procedures or fix bugs
- 3-level support structure, with escalations
  - 1st level: FIO Service Manager on Duty
    - one person + backup, from a group of 5 Service Managers
    - handles “simple cases”, responsible for follow-up of difficult cases
    - as for lxplus/lxbatch
  - 2nd level: Castor Service Manager
    - Castor Operations team
  - 3rd level: Developers
    - Castor, SRM, GridFTP, rootd, FTS, LFC, …

Jan van Eldik (IT/FIO/FS) 5 Castor usage
- Group activities in disk pools
  - T0: one user, large files, limited number of streams
  - Durable data: “write once, read many”
    - need to manage additional SRM endpoints
  - User analysis: many users, many small files
    - ATLAS, CMS: average file size <100 MB
    - 40K files per server!
    - overhead of tape writing (see the small worked example below)
    - many (looooong-lived!) open file handles
      - exception: ALICE copies data files before opening them
  - There will be more…
- Very different access patterns!
- Castor is handling these access patterns in production
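To make the small-file point above concrete, here is a back-of-the-envelope sketch. Apart from the <100 MB average file size and the ~40K files per server quoted on the slide, the numbers (server capacity, tape-drive speed, per-file overhead) are illustrative assumptions, not CASTOR measurements.

```python
# Back-of-the-envelope illustration (assumed numbers, not CASTOR measurements):
# why an average file size below 100 MB is painful for disk servers and tape writing.

avg_file_size_mb = 100          # average file size reported for ATLAS/CMS analysis pools
server_capacity_tb = 4          # assumed usable capacity of a 2006-era disk server
drive_speed_mb_s = 80           # assumed native streaming speed of a tape drive
per_file_overhead_s = 3         # assumed per-file overhead (file marks, metadata, positioning)

files_per_server = server_capacity_tb * 1024 * 1024 / avg_file_size_mb
print(f"files per full server: {files_per_server:,.0f}")   # roughly 40K files

# Effective tape rate when each file adds a fixed overhead on top of the streaming time
stream_time_s = avg_file_size_mb / drive_speed_mb_s
effective_rate = avg_file_size_mb / (stream_time_s + per_file_overhead_s)
print(f"effective tape rate: {effective_rate:.0f} MB/s vs {drive_speed_mb_s} MB/s native")
```

With these assumed figures a full server holds on the order of 40,000 files and the tape drive spends most of its time on per-file overhead rather than streaming.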

Jan van Eldik (IT/FIO/FS) 6 Workarounds
- We have workarounds for various bugs
  - cleanup procedures in databases
  - Lemon actuators, cron jobs
  - throw hardware at the problem
- Good
  - service continues to function
  - operational flexibility, a big improvement over Castor-1!
- Bad
  - bug fixes take time, as does their rollout
  - workarounds tend to lower the priority of bug fixes
  - “temporary fixes” become harder to remove as time goes by…

Jan van Eldik (IT/FIO/FS) 7 Stager database operations
- Examples
  - stuck tape recalls and migration streams (caused by a full tape pool, improper shutdown of rtcpclientd)
  - stuck file requests (scheduled for a disk server that “disappeared”)
- We can stop/restart/rebuild/fix them through SQL statements (was not the case in Castor-1); a hedged sketch of such a check follows below
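As an illustration of the SQL-level intervention described above, the sketch below lists requests that have not progressed for several hours, so that an operator can decide what to restart or fail by hand. It is not taken from the CASTOR tooling: the table and column names (stager_requests, status, last_update), the status values and the connection details are hypothetical placeholders; a real check would use the actual stager schema.

```python
# Hedged sketch: flag requests that have not progressed for several hours.
# Schema names below (stager_requests, status, last_update) are hypothetical.

import cx_Oracle  # Oracle client library, as the stager DB runs on Oracle

STUCK_AFTER_HOURS = 6

def list_stuck_requests(dsn: str, user: str, password: str) -> None:
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cur = conn.cursor()
        cur.execute(
            """
            SELECT id, status, last_update
              FROM stager_requests                -- hypothetical table name
             WHERE status IN ('PENDING', 'SCHEDULED')
               AND last_update < SYSDATE - :hours / 24
            """,
            hours=STUCK_AFTER_HOURS,
        )
        for req_id, status, last_update in cur:
            print(f"stuck request {req_id}: {status} since {last_update}")
    finally:
        conn.close()

if __name__ == "__main__":
    # Connection parameters are placeholders
    list_stuck_requests("stagerdb.example.cern.ch/STAGER", "castor_ro", "secret")
```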

Jan van Eldik (IT/FIO/FS) 8 LSF issues
- We suffer (infrequently…) from LSF meltdowns
  - very tight scheduling required (mbatchd is basically …)
  - overhead in the CASTOR LSF plugin
- Castor schedules LSF jobs to the disk servers for file access
  - scheduling is complicated (over-complicated?)
  - scheduling takes time
- Each running job involves 5 processes
  - non-negligible overhead
- We observe load spikes on disk servers…

Jan van Eldik (IT/FIO/FS) 9 Load spikes on disk servers
- Not fully understood yet… but clearly related to bursts of user activity
- To be studied:
  - minimize the footprint of an individual job (reduce the number of forked processes)
- Workarounds:
  - limit the number of jobs per server (inefficient use of disk space, need more servers…)
  - temporarily disable the server (automatic, based on Lemon alarms; see the sketch below)
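The “temporarily disable the server” workaround can be pictured with a small watchdog like the one below. This is a hedged sketch, not the actual Lemon actuator: the load threshold is arbitrary and the disable command (castor_disable_diskserver) is a hypothetical placeholder for whatever the real actuator invokes.

```python
# Hedged sketch of a load-spike watchdog on a disk server.
# The disable command below is a hypothetical placeholder, not a real CASTOR CLI.

import subprocess

LOAD_THRESHOLD = 30.0          # assumed 1-minute load average that signals a spike
DISABLE_CMD = ["castor_disable_diskserver", "--reason", "load spike"]  # hypothetical

def one_minute_load() -> float:
    # /proc/loadavg looks like: "0.42 0.35 0.30 1/123 4567"
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def main() -> None:
    load = one_minute_load()
    if load > LOAD_THRESHOLD:
        print(f"load {load:.1f} above threshold, taking server out of production")
        subprocess.run(DISABLE_CMD, check=False)
    else:
        print(f"load {load:.1f} OK")

if __name__ == "__main__":
    main()
```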

Jan van Eldik (IT/FIO/FS) 10 …but the system performs!
- [plot] Disk server from the CMS user pool running at NIC speed

Jan van Eldik (IT/FIO/FS) 11 Access protocols
- All Castor-2 disk servers have 3 access protocols configured
- rfio
  - dangling connections when clients die (fix to be deployed)
- rootd
  - no big problems, good response from the developers
  - already worked on Castor-1
  - does not log to DLF
- gridftp
  - all nodes have (managed!) grid certificates, monitoring, etc.
  - well-behaved (CPU-hungry, but the disk servers have spare cycles)
  - does not log to DLF
  - expertise will be needed:
    - gridftp is not a ‘native’ protocol yet
    - cope with an evolving environment (SLC4, 64-bit, VDT)
    - castor-gridftp v2 (when required)

Jan van Eldik (IT/FIO/FS) 12 DLF
- Distributed Logging Facility: logs events from instrumented components into Oracle
- Queries take too long (lack of partitioning)
  - fixed, but not yet deployed
- Database full? Drop the tables!
- The stager needs dlfserver up and running (bug!)
- The stager leaks memory
  - a cron job restarts it every 3 hours
- dlfserver dies
  - a Lemon actuator restarts it
- But to stop the service for a DB upgrade, we need to stop the cron job and deactivate the actuator first! (see the watchdog sketch below)
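The restart workarounds above can be pictured as a single small watchdog of the kind below. This is a hedged sketch, not the production cron job or Lemon actuator: the daemon check, the restart command and the maintenance-flag path are assumptions. It does illustrate one way to address the last bullet: a single maintenance flag silences the watchdog before a database upgrade, instead of editing crontabs and actuators separately.

```python
# Hedged sketch of a restart watchdog for a leaky/crashing daemon (e.g. dlfserver).
# Paths and commands are assumptions, not the actual CERN cron job or Lemon actuator.

import os
import subprocess

DAEMON = "dlfserver"
MAINTENANCE_FLAG = "/var/lock/castor/maintenance"   # touch this before a DB upgrade
RESTART_CMD = ["service", DAEMON, "restart"]        # assumed init-script interface

def daemon_running(name: str) -> bool:
    # pidof exits with code 0 when at least one matching process exists
    return subprocess.run(["pidof", name], capture_output=True).returncode == 0

def main() -> None:
    if os.path.exists(MAINTENANCE_FLAG):
        print("maintenance flag set, leaving the daemon alone")
        return
    if not daemon_running(DAEMON):
        print(f"{DAEMON} not running, restarting it")
        subprocess.run(RESTART_CMD, check=False)

if __name__ == "__main__":
    main()
```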

Jan van Eldik (IT/FIO/FS) 13 Configuration consistency
- Castor components are often individually configured, and store the information independently
  - LSF
  - stager database
  - rmmaster
- Inconsistencies lead to confusion (at best…), inefficiencies (unused disk servers), problems…
- We derive configurations as much as we can from our Quattor Configuration DataBase
- Workaround: run regular consistency checks, correct configurations by hand (illustrated in the sketch below)
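A consistency check of the kind mentioned in the last bullet can be as simple as comparing the disk-server lists that each component believes in. The sketch below is illustrative only: the three loader functions and the host names are placeholders, since in reality they would query the Quattor CDB, the LSF configuration and the stager database through their own interfaces.

```python
# Hedged sketch of a configuration consistency check between CASTOR components.
# The three loader functions return placeholder data; real ones would query the
# Quattor CDB, the LSF host configuration and the stager database respectively.

def diskservers_in_quattor() -> set[str]:
    return {"lxfsrk001", "lxfsrk002", "lxfsrk003"}   # placeholder host names

def diskservers_in_lsf() -> set[str]:
    return {"lxfsrk001", "lxfsrk002"}                # placeholder host names

def diskservers_in_stager_db() -> set[str]:
    return {"lxfsrk001", "lxfsrk002", "lxfsrk004"}   # placeholder host names

def report(name: str, reference: set[str], other: set[str]) -> None:
    missing = reference - other
    extra = other - reference
    if missing:
        print(f"{name}: missing from this component: {sorted(missing)}")
    if extra:
        print(f"{name}: unknown to Quattor (stale entry?): {sorted(extra)}")

def main() -> None:
    reference = diskservers_in_quattor()             # treat the Quattor CDB as the master copy
    report("LSF", reference, diskservers_in_lsf())
    report("stager DB", reference, diskservers_in_stager_db())

if __name__ == "__main__":
    main()
```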

Jan van Eldik (IT/FIO/FS) 14 rmmaster…
- Is not stateless at all!!!
  - loses state when the daemon is restarted
  - the administrator must store and restore it by hand when upgrading
- IP renumbering of disk servers is not properly handled
  - rmmaster writes filesystem/disk-server information into the stager DB, once and for all…
- This component needs a re-design!

Jan van Eldik (IT/FIO/FS) 15 Castor and the grid
- We run multiple SRM endpoints
  - castorgrid.cern.ch (aka castorsrm.cern.ch)
  - srm.cern.ch (aka castorgridsc.cern.ch)
  - soon: 4 endpoints for “durable” experiment data
  - later: SRM v2.1
- castorgrid.cern.ch and srm.cern.ch provide similar functionality…
  - castorgrid.cern.ch proxies data, necessary for accessing Castor-1 stagers
    - but with different connectivity (this should no longer be the case)
  - …but software packages and their configurations are very different
    - historical reasons…
    - leading to different grid-mapfiles (compared in the sketch below)
  - rationalizing this is slow, painful, and necessary
- We have very little monitoring of SRM and GridFTP problems
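A first step towards rationalizing the two endpoints is simply diffing their grid-mapfiles. The sketch below compares two files in the standard grid-mapfile format (a quoted DN followed by a mapped account); the file paths are assumptions, not the actual locations on the CERN nodes.

```python
# Hedged sketch: compare two grid-mapfiles and report DNs present in only one,
# or mapped to different accounts. The file paths are assumptions.

import shlex

def load_gridmap(path: str) -> dict[str, str]:
    """Return {DN: account} from a grid-mapfile ("/DC=ch/.../CN=Some User" account)."""
    mapping = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            tokens = shlex.split(line)      # handles the quoted DN
            if len(tokens) >= 2:
                mapping[tokens[0]] = tokens[1]
    return mapping

def compare(path_a: str, path_b: str) -> None:
    a, b = load_gridmap(path_a), load_gridmap(path_b)
    for dn in sorted(a.keys() - b.keys()):
        print(f"only in {path_a}: {dn}")
    for dn in sorted(b.keys() - a.keys()):
        print(f"only in {path_b}: {dn}")
    for dn in sorted(a.keys() & b.keys()):
        if a[dn] != b[dn]:
            print(f"different mapping for {dn}: {a[dn]} vs {b[dn]}")

if __name__ == "__main__":
    compare("/etc/grid-security/grid-mapfile.castorgrid",   # assumed paths
            "/etc/grid-security/grid-mapfile.srm")
```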

Jan van Eldik (IT/FIO/FS) 16 Conclusions
- Castor instances are large (LHC experiments: 1.7M files, 463 TB) and complex (very different file sizes and access patterns)
  - And they are up and running!
- Running the Castor services requires workarounds
  - We have workarounds for many problems (good!)
  - We have workarounds for many problems (bad!)
  - It takes us too long to get bugs fixed and deployed (and bugs are exported to external institutes…)
- To be able to scale the operability
  - We are streamlining our support structure
  - We need to replace workarounds with fixes!