Published by Eleanore Dorsey. Modified over 9 years ago.
1
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
Some Hints for “Best Practice” Regarding VO Boxes Running Critical Services and Real Use-cases
Véronique Lefébure, CERN-IT-FIO/FS
2
These slides are about:
Nothing new
Common sense
– Sometimes good to be repeated
There are some little details that can make a big difference
3
VOBOX: the CERN-IT-FIO definition:
A box dedicated to a VO, running one (or more) VO service(s)
IT-FIO “VOBOX Service” handles:
– Choice of hardware according to user specifications
– Base OS installation & software upgrades
– Hardware monitoring & maintenance
– Installation & monitoring of common services
  E.g. Apache
SLA document in preparation
User-specific service installation & configuration is managed by the VO, in compliance with the SLA
4
VOBOX Hardware:
Resource requirements and planning
– It is not always easy to get an additional disk on demand because “/data” has become full
Hardware warranty
– Plan for hardware renewal
– Check warranty duration before moving to production
Hardware naming and labeling
– Make use of aliases to facilitate hardware replacement
– Have a “good” name on the sticker
  E.g. all lxbiiii machines may be switched off by hand in case of a cooling problem
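The warranty check above can be sketched in a few lines of Python. This is only an illustration: the host names and dates are hypothetical, and the function names are not part of any CERN tool.

```python
from __future__ import annotations

from datetime import date


def warranty_covers(warranty_end: date, production_end: date) -> bool:
    """Return True if the warranty lasts at least until production_end."""
    return warranty_end >= production_end


def boxes_needing_renewal(warranties: dict[str, date], production_end: date) -> list[str]:
    """List the hosts whose warranty expires before the planned end of production."""
    return sorted(
        host
        for host, warranty_end in warranties.items()
        if not warranty_covers(warranty_end, production_end)
    )


if __name__ == "__main__":
    # Hypothetical inventory; real dates would come from the hardware database.
    inventory = {"box-a": date(2009, 10, 31), "box-b": date(2009, 5, 31)}
    print(boxes_needing_renewal(inventory, production_end=date(2009, 9, 30)))
```

Running such a check before a box moves to production makes the renewal plan explicit instead of relying on memory.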
5
VOBOX software:
Be informed of coming software upgrades
– Register on the appropriate announcement mailing lists
Test software upgrades
– Have a “test” machine
– Check that there are no package conflicts
– Test that your applications are not broken
Be ready for a reboot
– Scheduled reboot: kernel upgrades, … (see SLA)
– Unscheduled reboot: power cut, human mistake, …
– Use init scripts
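One simple way to test that applications are not broken after an upgrade or a reboot is a smoke test that checks whether each service still accepts TCP connections. The sketch below assumes the services listen on known ports; the service names, hosts and ports passed in are placeholders chosen by the caller.

```python
from __future__ import annotations

import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_test(services: dict[str, tuple[str, int]]) -> list[str]:
    """Return the names of the services whose port does not accept connections."""
    return [
        name
        for name, (host, port) in services.items()
        if not port_is_open(host, port)
    ]
```

An open port only shows that the daemon came back up. A real check on the test machine would also exercise the application itself.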
6
User Software and Data:
Be ready for a full OS reinstallation
– Hardware replacement
– Security incident
Have important data and configuration files regularly backed up
Use central configuration database as much as possible
– At CERN CC: Quattor/CDB
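As a minimal illustration of a regular backup of configuration files, the Python sketch below packs a list of files into a time-stamped archive. The paths, the label and the destination directory are assumptions supplied by the caller; a site backup service, where available, would normally take the place of such a script.

```python
import tarfile
import time
from pathlib import Path


def backup_files(paths, dest_dir, label="config"):
    """Pack the existing files in `paths` into a time-stamped .tar.gz under dest_dir.

    Missing files are skipped. Returns the path of the archive that was written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{label}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for path in paths:
            path = Path(path)
            if path.exists():
                # tarfile drops the leading "/" so members are stored as relative paths.
                tar.add(str(path))
    return archive
```

Such a function could be called from a scheduled job, with the archive written to storage that survives a full OS reinstallation of the box.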
7
Monitoring:
Have your daemons monitored
– E.g. with LEMON
  Automatic restart of daemons
  Automatic notification by email, by SMS
Use the CC Operator service
– The operator reacts to alarms
  Check that your machine is alarmed (i.e. not in “maintenance” state)
– Provide your procedures and exact contact information
– Use the “hot-line” mailing list for emergencies (one per VO)
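The logic behind automatic restart and notification can be sketched as follows. This is not the LEMON interface; it is a generic illustration in Python where `is_alive`, `restart` and `notify` are hypothetical callables that the service owner would supply.

```python
from typing import Callable


def watch_once(
    name: str,
    is_alive: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    max_restarts: int = 3,
) -> bool:
    """Check one daemon, restart it if it is down, and report what happened.

    Returns True if the daemon is alive when the function returns.
    """
    if is_alive():
        return True
    for attempt in range(1, max_restarts + 1):
        notify(f"{name} is down, restart attempt {attempt}")
        restart()
        if is_alive():
            notify(f"{name} recovered after {attempt} attempt(s)")
            return True
    # Automatic recovery failed: this is where an operator would take over.
    notify(f"{name} still down after {max_restarts} attempts, escalate to operator")
    return False
```

Capping the number of restarts matters: a daemon that keeps crashing should end up as an alarm for a person, with the procedure and contact information at hand.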
8
Service Reliability:
Where needed, have a fail-over system
– Use a load-balancing alias, …
  TEST the fail-over mechanism
  Make sure that no other machine is introduced under that alias
– Have it on a different network switch
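Two of the checks above, no unexpected machine under the alias and members on different network switches, can be expressed as a small Python function. The host and switch names in the usage example are hypothetical, and the list of resolved members is assumed to be obtained separately (for instance from a DNS lookup of the alias).

```python
from __future__ import annotations


def check_alias(
    resolved: list[str],
    expected: list[str],
    switch_of: dict[str, str],
) -> list[str]:
    """Return a list of problems found with a load-balanced alias.

    resolved  -- hosts currently published under the alias
    expected  -- hosts that are supposed to serve the alias
    switch_of -- network switch each expected host is connected to
    """
    problems = []
    unexpected = sorted(set(resolved) - set(expected))
    if unexpected:
        problems.append("unexpected member(s) under alias: " + ", ".join(unexpected))
    switches = [switch_of[host] for host in expected if host in switch_of]
    if len(set(switches)) < len(switches):
        problems.append("some members share a network switch")
    return problems


if __name__ == "__main__":
    # Hypothetical two-node alias with each node on its own switch.
    print(check_alias(
        resolved=["node-a", "node-b"],
        expected=["node-a", "node-b"],
        switch_of={"node-a": "sw-1", "node-b": "sw-2"},
    ))
```

This only validates the configuration. Testing the fail-over mechanism itself still means taking one member out of service and confirming that clients keep working.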
9
Communication:
Regularly meet in person (or at least use the phone from time to time)
– Improved communication
– Clarifications
– Collaboration
10
Use-Cases (1/6) CMS DBS:
Criticality level ‘10’
Boxes: vocms02 + vocms05
Hw warranty till Oct 2009
IP switches: S513-C-IP218 and S513-C-IP216
Load-balanced alias: “cmsdbsprod”
Load-balanced alias name defined at profile level
Contact information: cms-dbs-support@cern.ch
Importance = “50”
Piquet Call if needed
? Monitoring
11
Use-Cases (2/6) CMS “Cessy->T0 transfer system”:
Criticality level ‘10’ (lxgate39)
Importance = “45”
NO Piquet Call if needed
Only ONE machine
? Monitoring (xrootd monitored by LEMON)
CMS considerations:
Machine essential for us, somehow part of the online system
Software can't be load-balanced
– Why? What if the machine breaks? Would a spare and a test machine be useful?
Once real data operations start, the machine needs to be up whenever there is detector activity (beam, cosmics, calibration). We have buffer space to bridge downtime of components and machines, and there are provisions to shut down and restart our software. But we design for steady-state operations, and everything that gets us out of steady state is a very big deal, as it causes ripple effects through the rest of the system.
12
Use-Cases (3/6) CMS “PHEDEX”:
Criticality level ‘9’
Boxes: vocms01 + vocms20
Hw warranty till Oct 2009
IP switches: S513-C-IP217 and S513-C-IP305
vocms20 = hot spare
Contact information: cms-phedex-admins@cern.ch, phedex.admin@cern.ch
Importance = “50”
Piquet Call if needed
? Monitoring. “Phedex Monitoring” currently runs on a CMS machine (not in CC yet)
13
Use-Cases (4/6) LHCb:
volhcb01 & volhcb02 “will be the critical boxes for CCRC08”
– But not yet really in production; “these two machines will be different and they will run the various DIRAC3 services: WMS, Bookkeeping, transfer agent”
HW warranty till May and Oct 2009
Network switches: S513-C-IP36 & S513-C-IP218
14
Use-Cases (5/6) ATLAS “voatlas10”:
Criticality level ‘high’
DDM (Distributed Data Management)
HW warranty till May 2009
– “During LHC period ATLAS will have 4 computers + 2 spares (hot backup) to run DDM central services, 10 computers + 3 spares to run site services (VO boxes)”
Note: dependency on many ARDA boxes still named “lxb7iii” etc.
15
Use-Cases (6/6) ALICE Voalice0i:
Criticality level ‘10’
Box functionality not specified in CDB (except for the 2 xrootd control nodes)
User contact: one person only
? Use of special procedures?
Recent experience: a kernel upgrade broke ALICE applications
Usefulness of a test machine!
Now ALICE has 2 machines in our Preprod name space
16
Conclusions:
Think of reviewing
– Configuration
– Procedures
– Hardware warranty
regularly, with the IT Service Manager
Foresee and use test machines