Large Farm 'Real Life Problems' and their Solutions
Thorsten Kleinwort, CERN IT/FIO
HEPiX II/2004, BNL, 20.10.2004



20 October 2004, Thorsten Kleinwort, IT/FIO/FS

Outline
Farms at the CERN CC:
- The Tools Framework
- The Working Teams
- Real Life Use Cases
- Collaborations
- Summary
- Useful Links

The Tools Framework
ELFms = Quattor + Lemon + LEAF
- Quattor:
  - Installation (Kickstart + SWREP)
  - Configuration (CDB + NCM)
  - Management (SPMA + NCM)
- Lemon:
  - Monitoring
  - Batch system statistics
- LEAF:
  - State management (SMS)
  - Hardware management (HMS)

The Tools Framework (cont’d)
The evolution of the ELFms tools is described in various previous presentations:
- HEPiX II/2003 (Vancouver): ‘The new Fabric Management Tools in Production at CERN’
- HEPiX I/2004 (Edinburgh):
  - ‘ELFms, status, deployment’ by German Cancio
  - ‘Lemon Web Monitoring’ by Miroslav Siket
- CHEP 2004 (Interlaken): ‘Current Status of Fabric Management at CERN’ by German Cancio
- This HEPiX: ‘Experience in the use of quattor tool suite outside CERN’
=> Progress has been made, improvements are ongoing, and Quattor is used more and more outside CERN

Tools (cont’d): Other tools [interfacing CDB]
- Script PrepareInstall.pl:
  - Does all necessary steps to prepare a machine install
  - Can run with a list of hosts (for mass installs)
  - Gets all the necessary information from CDB
  - Creates a kickstart file for each node
- Local script maintenance: script to run down a node:
  - Drains batch nodes
  - Warns users on interactive nodes
  - Can execute a configurable script at the end, e.g. reboot
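The per-host kickstart generation described for PrepareInstall.pl can be illustrated with a minimal Python sketch. This is not the actual script (which is Perl and not shown in the slides). The dictionary standing in for CDB, the host names, the OS label, the install-server URL, and the kickstart contents are all assumptions made for illustration.

```python
from string import Template

# Hypothetical stand-in for CDB: host name -> configuration values.
# The real CDB schema is not shown in the slides.
FAKE_CDB = {
    "lxb0001": {"cluster": "lxbatch", "os": "slc3", "disk": "hda"},
    "lxb0002": {"cluster": "lxbatch", "os": "slc3", "disk": "sda"},
}

# Illustrative kickstart skeleton; a real one would carry far more settings.
KICKSTART_TEMPLATE = Template(
    "# kickstart for $host (cluster $cluster)\n"
    "install\n"
    "url --url http://installserver.example/$os\n"
    "clearpart --all --drives=$disk\n"
    "reboot\n"
)


def prepare_install(hosts, cdb):
    """Build one kickstart text per host from the configuration database.

    Returns (kickstarts, missing): hosts unknown to the database are
    reported in `missing` rather than installed with guessed values.
    """
    kickstarts, missing = {}, []
    for host in hosts:
        profile = cdb.get(host)
        if profile is None:
            missing.append(host)
            continue
        kickstarts[host] = KICKSTART_TEMPLATE.substitute(host=host, **profile)
    return kickstarts, missing


if __name__ == "__main__":
    files, unknown = prepare_install(["lxb0001", "lxb0002", "nohost"], FAKE_CDB)
    print(files["lxb0002"])
    print("not in CDB:", unknown)
```

Taking a list of hosts and a single configuration source is what makes the same code path usable for one machine or for a mass install.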

Tools (cont’d): Automated Fabric [LEAF]
- State Management System (SMS):
  - Other CDB changes are done by SMS: change OS/cluster
  - Systems have state: ‘production’ or ‘standby’
- Hardware Management System (HMS):
  - Workflow to track hardware changes [interfaces CDB]:
    - New machine arrival
    - Machine moves
    - Machine interventions (vendor calls), retirements
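The idea of a per-node state plus state-guarded cluster/OS changes can be sketched as follows. This is only an illustration, not the SMS implementation. The two state names come from the slide, while the `Node` class, the function names, and the rule that a node must be in standby before its cluster or OS changes are assumptions.

```python
from dataclasses import dataclass, field

VALID_STATES = {"production", "standby"}  # the two states named on the slide


@dataclass
class Node:
    """Hypothetical record for one machine; not the real CDB/SMS data model."""
    name: str
    cluster: str
    os: str
    state: str = "standby"
    history: list = field(default_factory=list)


def set_state(node, new_state):
    """Move a node between 'production' and 'standby', recording the change."""
    if new_state not in VALID_STATES:
        raise ValueError(f"unknown state: {new_state}")
    node.history.append((node.state, new_state))
    node.state = new_state


def change_cluster(node, cluster, os):
    """Reassign cluster/OS. Assumed rule: only allowed while in standby."""
    if node.state != "standby":
        raise RuntimeError(f"{node.name} must be in standby first")
    node.cluster, node.os = cluster, os


if __name__ == "__main__":
    n = Node("lxb0001", "lxbatch", "slc3", state="production")
    set_state(n, "standby")
    change_cluster(n, "lxplus", "slc3")
    set_state(n, "production")
    print(n)
```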

The Working Teams
- Operator:
  - 24/7
  - Alarm display
  - Following procedures:
    - Acting on alarms
    - Open Remedy tickets/phone notification
    - Machine reboots
- SysAdmins:
  - New team
  - Now 7 staff, more to hire
  - Running more and more services in the CC
  - Doing most of the install and maintenance work on farm PCs
  - Following up h/w failures (‘vendor calls’)
- Service Manager:
  - Farm/cluster resource planning
  - Writing/improving the procedures/tools
  - Following up on new problems
- “Customers”:
  - Other groups/teams in CERN-IT, like DB (ORACLE), GD (LCG), GM (EGEE)
  - Experiments (Data Challenges)
  - Changing requirements

Another Management Tool
Remedy: the problem tracking tool in CERN IT
- Used in different workflows, e.g. by:
  - The Operator, to open tickets following up on alarms
  - The Service Managers, to ask for machine interventions
  - The SysAdmins, to follow up on problems/general issues
- HMS is implemented as a Remedy workflow as well
- Recently started to get statistics on hardware failures

Real Life Use Cases
Kernel upgrade (on LXBATCH, ~1500 hosts):
1. Put the new software into the repository (SWREP, precaching)
2. Put the new kernel RPM on the nodes: SPMA, with multi-package option (old kernel is still running!)
3. Configure the new kernel version for the cluster in CDB, and run the GRUB NCM component to configure the node
4. Drain the nodes by disabling new batch jobs (maintenance)

Real Life Use Cases
Kernel upgrade (cont’d):
5. Node reboots when it is drained (could be at any time)
6. Machine comes up with the new kernel and goes back into production immediately
=> Least downtime for each node
- Capacity is always available: the first reboot is instantaneous, the last one can be several days later
- Everything runs automatically; some cleanup has to be done for a few machines (node doesn’t shut down, or h/w failure on startup) => caught by the monitoring/alarm
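The drain-then-reboot behaviour of steps 4 to 6 can be made concrete with a small simulation. This is a toy model under stated assumptions, not the maintenance script itself. Time advances in abstract steps, each node is described only by how many steps its last running job still needs, and the node names and kernel label are invented.

```python
def rolling_upgrade(running_jobs, new_kernel):
    """Simulate draining nodes and rebooting each one as soon as it is empty.

    running_jobs maps node name -> time steps until its last job finishes.
    No new jobs are accepted from step 0 on (the 'drain').
    Returns (events, kernel): the ordered (step, node) reboot events and
    the kernel each node runs afterwards.
    """
    remaining = dict(running_jobs)
    kernel = {}
    events = []
    step = 0
    while remaining:
        # sorted() makes a copy, so deleting from `remaining` below is safe
        for node in sorted(remaining):
            if remaining[node] == 0:
                kernel[node] = new_kernel      # node reboots into the new kernel
                events.append((step, node))    # and is back in production
                del remaining[node]
            else:
                remaining[node] -= 1           # job still running, keep waiting
        step += 1
    return events, kernel


if __name__ == "__main__":
    events, kernel = rolling_upgrade({"a": 0, "b": 2, "c": 1}, "new-kernel")
    print(events)   # idle node reboots at once, busy nodes follow later
```

An idle node reboots in the very first step while a node with a long job reboots last, which matches the slide's point that the first reboot is instantaneous and the last can come much later without taking the whole farm down.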

Real Life Use Cases (cont’d)
Configure batch resources (LSF):
- LSF resources are defined depending on availability, power and cluster of machines
- Resources are defined in CDB
- Configured on the node using NCM
- The master file is generated from CDB2SQL in a cron job every day (reconfig takes several minutes)
- Consistency of client/master due to CDB
- Resource assignments are done in CDB on (sub-)cluster level (template structure)
- Reassignments of (sub-)clusters in CDB are done with SMS tools
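Why a single source gives client/master consistency can be shown with a short sketch. It does not reproduce CDB templates, CDB2SQL, or any LSF file format. The dictionaries, the resource names, and the sub-cluster name are hypothetical. The point is only that the per-node view and the master table are both derived from the same (sub-)cluster definitions.

```python
# Hypothetical cluster-level defaults and sub-cluster overrides.
CLUSTER_RESOURCES = {"lxbatch": {"queue_share": "public", "slots": 2}}
SUBCLUSTER_RESOURCES = {("lxbatch", "cms_dc"): {"queue_share": "cms"}}

# Hypothetical node membership: node -> (cluster, sub-cluster or None).
NODES = {
    "lxb0001": ("lxbatch", None),
    "lxb0002": ("lxbatch", "cms_dc"),
}


def resources_for(node):
    """Client view: cluster defaults, overridden by the sub-cluster if any."""
    cluster, sub = NODES[node]
    merged = dict(CLUSTER_RESOURCES.get(cluster, {}))
    merged.update(SUBCLUSTER_RESOURCES.get((cluster, sub), {}))
    return merged


def master_table():
    """Master view: one (node, resource, value) row per setting, same source."""
    return [
        (node, key, value)
        for node in sorted(NODES)
        for key, value in sorted(resources_for(node).items())
    ]


if __name__ == "__main__":
    for row in master_table():
        print(row)
```

Moving a node to another sub-cluster changes one entry in the membership map, and both views follow from it.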

Real Life Use Cases (cont’d)
Emptying the Computer Centre:
- For the refurbishment of the CERN Computer Centre all machines had to be moved, either from one side to the other, or downstairs (vault)
- ~2000 machines had to be moved
- Taking the opportunity to add machines to CDB, as quattor and non-quattor nodes
- Batch machines were moved in racks (1 rack = 44 nodes):
  - HMS was used to steer the moves
  - SMS/maintenance to shut down the machines
  - Rename/PrepareInstall to bring machines back
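Moving machines rack by rack amounts to processing the host list in batches of 44. A minimal sketch of that grouping follows. The host names are invented, and the actual move was steered through HMS rather than a script like this.

```python
RACK_SIZE = 44  # nodes per rack, as stated on the slide


def racks(hosts, size=RACK_SIZE):
    """Split a host list into consecutive rack-sized batches."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


if __name__ == "__main__":
    hosts = [f"lxb{i:04d}" for i in range(100)]   # made-up host names
    for number, batch in enumerate(racks(hosts), start=1):
        # each batch would be shut down, moved, and reinstalled together
        print(f"rack {number}: {batch[0]} .. {batch[-1]} ({len(batch)} nodes)")
```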


Real Life Use Cases (cont’d)
New h/w arrival => mass installation:
- New machines (~400) arrive at CERN (in bunches of 50-100)
- Racks have to be prepared:
  - Network equipment
  - Power supply
  - (Console service)
- Plan machine membership (cluster)
- Put machine into CDB:
  - h/w type
  - Cluster type/OS

Real Life Use Cases
New h/w arrival (cont’d):
- Physical machine installation (HMS):
  - New DNS entry
- OS installation:
  - PrepareInstall
  - Installation by the SysAdmin
- Burn-in test (h/w test, several days to weeks)
- Follow up on h/w problems with vendor
- Add the machines to the alarm display (SURE)
- Put machines into production


Collaborations
- External ‘customers’: EGEE, LCG, and other groups at CERN are now using Quattor-managed machines:
  - They benefit from standard, manageable, and reproducible machine setups
  - They are able to, and should learn to, do modifications themselves
- External sites using Quattor:
  - IN2P3, NIKHEF, UAM Madrid, ... are discussing using Quattor or already use it => see Rafael’s talk
- This helps to enhance the tools:
  - Service nodes (for LCG-2)
  - Having a wider usage
  - Generalizing components

Summary
- ELFms is deployed in production at CERN:
  - Established technology: from prototype to production
  - Though enhancements are ongoing
- Fundamental part of our infrastructure:
  - Merged with our existing environment
- Quattor and Lemon are generic software:
  - Used by others inside/outside CERN
  - Hopefully a fruitful collaboration in the future

Useful Links
- ELFms:
- Quattor:
- Lemon:
- LEAF:
Previous presentations:
- HEPiX II/2003 (Vancouver): ‘The new Fabric Management Tools in Production at CERN’
- HEPiX I/2004 (Edinburgh):
  - ‘ELFms, status, deployment’ by German Cancio
  - ‘Lemon Web Monitoring’ by Miroslav Siket
- CHEP 2004 (Interlaken): ‘Current Status of Fabric Management at CERN’ by German Cancio

Questions?