Status and plans of central CERN Linux facilities


Status and plans of central CERN Linux facilities
Thorsten Kleinwort, IT/FIO-FS, for the PH/SFT Group, 10.06.2005

Introduction
Two years ago: Post-C5 presentation on the migration from RH6 to RH7. Now: migration from RH7 to SLC3.
Achievements: Scalability, Tools framework, Scope. Conclusions & outlook.
Notes: Scale: o(500). Tools: still in the migration phase from old to new tools; the new tools were written with the final scale in mind. Scope: mostly LXBATCH and LXPLUS, a well-defined environment for which the new tools were forged.

Outline: Operating System, Scalability, Tools framework, Scope

Operating System
SLC3 is the new default platform:
LXPLUS: fully migrated (new h/w); small rest on RH7, o(5).
LXBATCH: 95% on SLC3; the rest (even old h/w) to be migrated soon.
Other clusters are being migrated as well: LXGATE, LXBUILD, LXSERV, …
Still some problems on special clusters with special hardware (disk and tape servers).
Notes: Today we have finished the migration and are diminishing RH7.

Operating System
Besides this 'main' OS, we have RHES: RH ES 2 as well as RH ES 3, needed for ORACLE.
Now also supporting other architectures: ia64 (and possibly x86_64), needed for the Service Challenge (CASTORGRID).
No major problems, but: additional work to provide and maintain those, plus minor differences, e.g. no AFS on ES, lilo on ia64.
Notes: In the meantime, we had to start supporting the licensed version of Linux, RH ES 2 and now RH ES 3, which added some complexity, e.g. RH ES does not support AFS. We are now about to increase the complexity further by going from the single 32-bit i386 architecture to supporting ia64, and possibly x86_64 as well (CASTORGRID, Service Challenge), on SLC3. No major problems, but additional work, such as recompiling the binary RPMs, and minor issues, like different boot loaders on i386 and ia64.
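Much of the extra work mentioned in the notes is mechanical: binary RPMs built for i386 have to be rebuilt for every additional architecture. As a rough, purely illustrative sketch (not the actual CERN build procedure), a wrapper around the standard rpmbuild --rebuild --target invocation could look like the following; the package name and target list are made up for the example.

```python
#!/usr/bin/env python
# Illustrative sketch only: rebuild a list of source RPMs for extra target
# architectures with the standard rpmbuild tool. The package name and the
# target list are invented for the example; this is not the CERN build setup.
import subprocess
import sys

SRC_RPMS = ["openafs-1.2.13-1.src.rpm"]   # hypothetical example package
TARGETS = ["ia64", "x86_64"]              # extra architectures besides i386

def rebuild(src_rpm, target):
    """Rebuild one source RPM for the given target architecture."""
    cmd = ["rpmbuild", "--rebuild", "--target", target, src_rpm]
    print("running:", " ".join(cmd))
    return subprocess.call(cmd)

if __name__ == "__main__":
    failures = []
    for src in SRC_RPMS:
        for arch in TARGETS:
            if rebuild(src, arch) != 0:
                failures.append((src, arch))
    if failures:
        print("failed rebuilds:", failures)
        sys.exit(1)
```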

Outline: Operating System, Scalability, Tools framework, Scope

Scalability
Already reached 1000 nodes with RH7 and automated node installation; now at 2200 Quattor-managed machines.
Machines arrive in bunches of o(100): installed, stress-tested, moved.
Cluster management is now automated as well: kernel upgrades on LXBATCH, vault move/renumbering, cluster upgrades to a new version of the OS.
Notes: At our current scale, you always have machines down, broken, being reinstalled, or on vendor call. Machines are now usually bought, moved and handled in big numbers (~100), e.g. for the vault move. (Re-)installations are done in big numbers as well, but single reinstalls still happen. Everything had to be automated to scale: the upgrade of OS/kernel is fully automated and applied whenever a machine is drained.

Cluster upgrade workflow (diagram)
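To make the automated upgrade workflow concrete, here is a minimal sketch of the drain-and-upgrade loop described in the notes above: close the node in LSF, wait for running jobs to finish, trigger the (re)installation, and reopen it. The LSF commands (badmin hclose/hopen, bjobs) are standard, but reinstall_node is only a placeholder for the site installation service (AIMS/Quattor at CERN), and the host names and polling interval are invented; this is not the actual CERN tooling.

```python
#!/usr/bin/env python
# Sketch of a drain-then-upgrade loop for batch nodes (illustration only).
# LSF commands (badmin hclose/hopen, bjobs) are standard; the reinstall step
# is a placeholder for the site installation service (e.g. AIMS/Quattor).
import subprocess
import time

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True)

def jobs_running(host):
    """Return True if LSF still reports running jobs on this host."""
    out = run(["bjobs", "-u", "all", "-r", "-m", host]).stdout
    return "JOBID" in out              # header only appears when jobs exist

def reinstall_node(host):
    """Placeholder: trigger OS/kernel upgrade or reinstallation of the node."""
    print("would trigger reinstallation of", host)

def upgrade(host, poll_seconds=300):
    run(["badmin", "hclose", host])    # stop dispatching new jobs (drain)
    while jobs_running(host):          # wait until running jobs have finished
        time.sleep(poll_seconds)
    reinstall_node(host)
    run(["badmin", "hopen", host])     # put the node back into production

if __name__ == "__main__":
    for node in ["lxb0001", "lxb0002"]:   # hypothetical host names
        upgrade(node)
```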

Scalability (continued)
Batch system LSF: we are up to 50000 jobs in ~2500 slots. So far o.k., except for the AFS copy (-> NFS).
The infrastructure has to scale as well: power, cooling, space, network, …

Outline: Operating System, Scalability, Tools framework, Scope

Tools framework
We adapted the EDG-WP4 tools for our needs: with RH7 still hybrid with the old tools (SUE, ASIS), now clean on SLC3.
We improved and strengthened them in ELFms: Quattor, with the SPMA and NCM configuration framework, and CDB, the configuration database, with an SQL interface.
Notes: With RH7, we had the old ASIS as well as the new SPMA; now only SPMA is used for s/w (RPM) installation. We had the old SUE as well as the new NCM; now only NCM is used as the configuration tool. With RH7 we managed to bring together, behind a common interface, all (>25) different places where configuration information was stored; now only one database (CDB) is used for storage.
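The core idea behind SPMA-style software management is declarative: compare the set of installed RPMs against the desired package list for the node and derive the install/remove actions. The following is only a conceptual sketch of that comparison, not SPMA itself; the desired list is hard-coded here, whereas in reality it would be generated from the node's profile in CDB.

```python
#!/usr/bin/env python
# Conceptual sketch of desired-state package management (not SPMA itself):
# compare installed RPMs with a target list and report the difference.
import subprocess

def installed_packages():
    """Return the set of installed package names, via the standard rpm tool."""
    out = subprocess.run(["rpm", "-qa", "--qf", "%{NAME}\n"],
                         capture_output=True, text=True).stdout
    return set(line.strip() for line in out.splitlines() if line.strip())

# Hypothetical desired state; in a real setup this would come from the
# node's profile in the configuration database (CDB).
DESIRED = {"openssh-server", "afs-client", "lsf-batch"}

def plan(desired, installed):
    to_install = sorted(desired - installed)
    to_remove = sorted(installed - desired)
    return to_install, to_remove

if __name__ == "__main__":
    inst = installed_packages()
    add, remove = plan(DESIRED, inst)
    print("missing packages to install:", add)
    print("packages not in the desired list (first 10 shown):", remove[:10])
```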

CDB: Web access tool (screenshot)

Tools framework (continued)
ELFms also includes Lemon monitoring, with a web interface.
Notes: On RH7 we started to deploy the WP4 monitoring (Lemon) on our machines; later we migrated its configuration into CDB. Now we are starting to implement automatic recovery and fault tolerance. We have already met some scalability problems while going up in scale: the monitoring information in CDB caused a problem, and CDB had to be sped up (>2:00 to recompile the whole lot).
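As an illustration of what monitoring plus automatic recovery amounts to on a single node (sample a metric, compare it against a threshold, trigger a recovery action), here is a small self-contained sketch. It is not the Lemon sensor API; the metric choice, threshold and recovery hook are invented for the example.

```python
#!/usr/bin/env python
# Minimal illustration of the sample / check / recover loop that a monitoring
# agent performs on a node. Not the Lemon API; threshold and action are
# invented for the example.
import time

LOAD_THRESHOLD = 50.0        # hypothetical alarm threshold for the 1-min load

def sample_loadavg():
    """Read the 1-minute load average from /proc/loadavg (Linux only)."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def recover(metric, value):
    """Placeholder recovery action; a real system would restart a daemon,
    drain the node, or raise an operator alarm."""
    print("ALARM: %s=%.1f above threshold, triggering recovery" % (metric, value))

def monitor(interval=60, cycles=5):
    for _ in range(cycles):
        load = sample_loadavg()
        print("loadavg_1min =", load)
        if load > LOAD_THRESHOLD:
            recover("loadavg_1min", load)
        time.sleep(interval)

if __name__ == "__main__":
    monitor(interval=5, cycles=3)    # short run for demonstration
```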

Lemon start page (screenshot)

Lemon: e.g. LXBATCH (screenshot)

Tools framework (continued)
The full ELFms stack: Quattor, with the SPMA and NCM configuration framework; CDB, the configuration database, with an SQL (read-only) interface; Lemon monitoring, including the web interface; LEAF, the SMS and HMS framework.
Notes: The state management was taken out of CDB and put into CDBSQL. LEAF is a completely new tool suite: the state management system SMS allows traceable states for machines not in production and controlled state transitions; the hardware management system HMS allows better control of the lifetime of h/w from arrival to retirement, including e.g. vendor calls or moves/reassignments.
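The SMS idea of traceable machine states with controlled transitions can be pictured as a small state machine. The state names and allowed transitions below are invented for illustration and are not the actual SMS workflow definitions.

```python
#!/usr/bin/env python
# Toy state machine in the spirit of a state-management system (SMS):
# machines carry a traceable state and may only follow allowed transitions.
# States and transitions are invented for illustration, not the real SMS ones.

ALLOWED = {
    "production":  {"standby", "maintenance"},
    "standby":     {"production", "maintenance", "retired"},
    "maintenance": {"standby"},
    "retired":     set(),
}

class Machine:
    def __init__(self, name, state="standby"):
        self.name = name
        self.state = state
        self.history = [(state, "initial")]

    def set_state(self, new_state, reason):
        """Apply a transition if allowed, keeping an audit trail."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError("%s: %s -> %s not allowed"
                             % (self.name, self.state, new_state))
        self.state = new_state
        self.history.append((new_state, reason))

if __name__ == "__main__":
    m = Machine("lxb0001")                      # hypothetical node name
    m.set_state("production", "passed burn-in tests")
    m.set_state("maintenance", "kernel upgrade")
    m.set_state("standby", "upgrade finished")
    print(m.name, "history:", m.history)
```

Keeping the transition table explicit is what makes machine states traceable: every change is validated against the allowed set and recorded together with a reason.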

LEAF: CCTracker & HMS (screenshot)

Tools framework (continued)
We rely on other tools and groups: all Linux versions come from Linux Support (the need for new versions increases their workload, too); AIMS, our boot/installation service; LANDB, now accessed through a SOAP interface instead of the web.
Good collaboration, but the scale increases the pressure for robust tools on their side as well.

Outline: Operating System, Scalability, Tools framework, Scope

Scope
The original scope of the framework was LXBATCH/LXPLUS. The framework has been adapted to other clusters:
LXGATE, LXBUILD: similar to LXPLUS/LXBATCH.
Disk servers, tape servers: different h/w, larger variety, more special configuration.
Non-FIO clusters: LXPARC; GM (EGEE): several clusters used for tests and prototyping; GD (LCG test clusters).
Notes: Originally, while going from RH6 to RH7, we were also diminishing our other platforms to reduce the diversity. With the new tools in place, the number of diverse clusters increased again, e.g. disk servers and tape servers. For these two new types of cluster the tools had to be enhanced: new configuration components had to be written, while others were not used; the h/w variety is much bigger on disk servers, so more kernel drivers are needed and kernel parameters have to be tweaked. There is still an ongoing issue with some tape-server kernel drivers and SLC3. The level of automation is lower for disk/tape servers than for CPU servers.

Scope (continued)
These new clusters: increase the scale even further; enlarge the requirements on the tools (e.g. new NCM components, new SMS/HMS states/workflows, additional local users, …); and come with new OS requirements (e.g. RH ES for ORACLE servers, ia64 support for the new CASTORGRID machines).
Proper testing of new s/w, OS and kernel versions has to be done at the cluster level.

Fabric Services as part of the GRID
Additional LCG s/w was incorporated into our framework:
All SLC3 LXBATCH nodes (>800 MHz) are Worker Nodes (WN); CERN-PROD is the biggest site, with >1800 CPUs.
A UI (User Interface) is available on LXPLUS.
CEs (Computing Elements) run on LXGATE, 2 at the moment.
The SE (Storage Element) is a cluster of 6 machines running SRM and CASTORGRID.
All upgraded to LCG_2_4_0.

GOC entry for CERN-PROD (screenshot)

GRID monitoring (screenshot)

GRID resource information (screenshot)

Conclusions & outlook
OS: not only migrated to one new OS; the next one: SLC4 or SLC5? The tools are ready, no major problems foreseen.
Scalability: we have overcome some scalability issues and are prepared to go to LHC scale.
Tools: gone from machine automation to cluster automation; next steps are to improve usability, increase robustness and decrease the necessary expert level.
Scope: from LXBATCH/LXPLUS to many different clusters. How to manage non-FIO clusters?