Automated management of large fabrics with ELFms. Germán Cancio for CERN IT/FIO. LCG-Asia Workshop, Taipei, 26/7/2004

ELFms – German Cancio - n° 2
Outline
- ELFms and its subsystems:
  - Quattor
  - Lemon
  - LEAF
- Deployment status

ELFms in a nutshell
ELFms stands for ‘Extremely Large Fabric management system’
Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
- ELFms manages and controls most of the nodes in the CERN CC
  - ~2100 nodes out of ~2400
  - Multiple functionality and cluster size (batch nodes, disk servers, tape servers, DB, web, …)
  - Heterogeneous hardware (CPU, memory, HD size, …)
  - Supported OS: Linux (RH7, RHES2.1, RHES3) and Solaris (9)
[Diagram labels: Node Configuration Management; Node Management]

Quattor
Quattor takes care of the configuration, installation and management of fabric nodes.
- A Configuration Database holds the ‘desired state’ of all fabric elements
  - Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info…)
  - Cluster (name and type, batch system, load balancing info…)
  - Defined in templates arranged in hierarchies – common properties set only once
- Autonomous management agents running on the node for
  - Base installation
  - Service (re-)configuration
  - Software installation and management
Quattor was developed in the scope of EU DataGrid. Development and maintenance are now coordinated by CERN/IT.

Configuration Database
[Architecture diagram. Elements shown: CDB with the pan compiler; GUI, scripts and CLI (SOAP); RDBMS (SQL) for LEAF, LEMON and others; XML profiles delivered over HTTP to the CCM cache on each node, used by the Node Management Agents.]

Configuration Database
[Same architecture diagram as the previous slide, with an example configuration hierarchy:]
- CERN CC
  - name_srv1:
  - time_srv1: ip-time-1
  - lxbatch
    - cluster/name: lxbatch
    - master: lxmaster01
    - pkg_add (lsf5.1)
  - lxplus
    - cluster/name: lxplus
    - pkg_add (lsf5.1)
    - lxplus001
      - eth0/ip:
      - pkg_add (lsf6_beta)
    - lxplus020
      - eth0/ip:
    - lxplus029
  - disk_srv
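The idea of templates arranged in hierarchies, with common properties set only once, can be illustrated with a small Python sketch. This is not pan, the actual Quattor configuration language, and it is not Quattor code. The dictionary layout, the `merge` and `profile` helpers, and the choice to let a node-level package entry override the cluster-level one are assumptions made for illustration; only the names `lxplus`, `lxplus001`, `lxplus020`, `ip-time-1` and the LSF versions come from the slide.

```python
from copy import deepcopy


def merge(parent: dict, child: dict) -> dict:
    """Return a copy of parent overlaid with child (nested dicts merged key by key)."""
    result = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def profile(*levels: dict) -> dict:
    """Build a node profile by applying templates from most general to most specific."""
    out: dict = {}
    for level in levels:
        out = merge(out, level)
    return out


# Hypothetical template contents, loosely following the slide's example.
site = {"time_srv1": "ip-time-1"}
lxplus = {"cluster": {"name": "lxplus"}, "packages": {"lsf": "5.1"}}
lxplus001 = {"packages": {"lsf": "6_beta"}}  # node-level override (assumed semantics)
lxplus020 = {}                               # inherits everything from site and cluster

p001 = profile(site, lxplus, lxplus001)
p020 = profile(site, lxplus, lxplus020)
print(p001["packages"], p020["packages"])
```

In this sketch the time server is stated once at site level and appears in every node profile, while a single node can deviate from its cluster without touching the cluster template.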

Managing (cluster) nodes
[Diagram. Labels recoverable from the slide; the grouping below is inferred from their order:]
- Install server: Install Manager; node (re)install of the base OS via dhcp, pxe, nfs/http, using the vendor system installer (RH73, RHES, Fedora, …)
- Software servers: SWRep; packages (RPM, PKG) served over nfs, http, ftp
- CDB
- Managed nodes: CCM; Node Configuration Manager (NCM); SW Package Manager (SPMA) with cache; system services (AFS, LSF, SSH, accounting, …); installed software (kernel, system, applications, …)
- Standard nodes

Node Management Agents
- NCM (Node Configuration Manager): framework system in which service-specific plug-ins called Components make the necessary system changes to bring the node to its CDB desired state
  - Regenerate local config files (e.g. /etc/ssh/sshd_config), restart/reload services (SysV scripts)
  - Large number of components available (system and Grid services)
- SPMA (Software Package Mgmt Agent) and SWRep: manage all or a subset of packages on the nodes
  - On production nodes: full control. On development nodes: non-intrusive, configurable management of system and security updates
  - Package manager, not only upgrader (roll-back and transactions)
- Portability: generic framework; plug-ins for NCM and SPMA available for RHL (RH7, RHES3) and Solaris 9
- Scalability to O(10K) nodes
  - Automated replication for redundant / load-balanced CDB/SWRep servers
  - Use of scalable protocols, e.g. HTTP, and replication/proxy/caching technology
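The difference between a package manager and a pure upgrader can be sketched as a comparison of the desired package list with what is installed. The Python below is an illustration only, not SPMA code. The function `plan_transaction`, its `full_control` flag and the package names other than `lsf` are hypothetical.

```python
def plan_transaction(desired: dict, installed: dict, full_control: bool = True) -> dict:
    """Compute the operations that bring `installed` to `desired`.

    Both arguments map package name -> version. With full_control, packages
    that are installed but not desired are removed; otherwise they are left alone.
    """
    install = {name: ver for name, ver in desired.items() if name not in installed}
    # "replace" covers upgrades and roll-backs alike: any version mismatch.
    replace = {
        name: (installed[name], ver)
        for name, ver in desired.items()
        if name in installed and installed[name] != ver
    }
    remove = sorted(set(installed) - set(desired)) if full_control else []
    return {"install": install, "replace": replace, "remove": remove}


# Hypothetical node state: lsf must be rolled back, an unmanaged package is present.
desired = {"lsf": "5.1", "openssh": "3.6"}
installed = {"lsf": "6_beta", "extra-tool": "1.0"}

production_plan = plan_transaction(desired, installed, full_control=True)
development_plan = plan_transaction(desired, installed, full_control=False)
print(production_plan)
print(development_plan)
```

A real agent would apply the whole plan as one transaction, so that a failure leaves the node in its previous state. The sketch only computes the plan.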

Lemon – LHC Era Monitoring

LEMON
- Monitoring sensors and agent
  - Large number of metrics (~10 sensors implementing 150 metrics)
  - Plug-in architecture: new sensors and metrics can easily be added
  - Asynchronous push/pull protocol between sensors and agent
  - Available for Linux and Solaris
- Repository
  - Data insertion via TCP or UDP
  - Data retrieval via SOAP
  - Backend implementations for text file and Oracle SQL
  - Keeps current and historical samples – no aging out of data, but archiving on TSM and CASTOR
- Correlation Engines and ‘self-healing’ Fault Recovery
  - Allow plug-in correlations that access collected metrics and external information (e.g. quattor CDB, LSF) and also launch configured recovery actions
  - E.g. average number of users on LXPLUS, total number of active LCG batch nodes
  - E.g. cleaning up /tmp if occupancy > x %, restart daemon D if dead, …
- Visualization
  - Next slide (image only)
- As with Quattor, LEMON is an EDG development now maintained by CERN/IT
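The correlation and recovery examples on this slide can be sketched as rules evaluated over collected samples. The Python below is a minimal illustration and does not use the LEMON API. The metric names (`users`, `tmp_occupancy_pct`, `daemon_d_alive`), the 90 % threshold standing in for "x %", the sample values and the action names are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

Samples = Dict[str, float]


@dataclass
class RecoveryRule:
    name: str
    condition: Callable[[Samples], bool]
    action: str  # name of the recovery action to launch (hypothetical)


TMP_LIMIT_PCT = 90.0  # hypothetical value for the slide's "x %"

RULES = [
    RecoveryRule("tmp_full", lambda s: s.get("tmp_occupancy_pct", 0.0) > TMP_LIMIT_PCT, "clean_tmp"),
    RecoveryRule("daemon_dead", lambda s: s.get("daemon_d_alive", 1.0) == 0.0, "restart_daemon_d"),
]


def recovery_actions(samples: Samples, rules: List[RecoveryRule]) -> List[str]:
    """Return the actions whose conditions hold for one node's latest samples."""
    return [rule.action for rule in rules if rule.condition(samples)]


def cluster_average(per_node: Dict[str, Samples], metric: str) -> float:
    """Correlate one metric across nodes, e.g. average number of users on a cluster."""
    return mean(s[metric] for s in per_node.values() if metric in s)


# Hypothetical latest samples for two nodes.
lxplus_samples = {
    "lxplus001": {"users": 40.0, "tmp_occupancy_pct": 95.0, "daemon_d_alive": 1.0},
    "lxplus020": {"users": 20.0, "tmp_occupancy_pct": 10.0, "daemon_d_alive": 0.0},
}

average_users = cluster_average(lxplus_samples, "users")
actions = {node: recovery_actions(s, RULES) for node, s in lxplus_samples.items()}
print(average_users, actions)
```

Keeping the rules as data separate from the evaluation loop mirrors the plug-in idea: a new correlation is added by appending a rule, without changing the engine.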

LEAF - LHC Era Automated Fabric
LEAF (LHC Era Automated Fabric): collection of workflows for automated node hardware and state management
- HMS (Hardware Management System):
  - Tracks systems through all steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
  - Automatically issues requests for installs, retirements etc. to technicians
  - GUI to locate equipment physically
  - HMS implementation is CERN specific, but concepts and design should be generic
- SMS (State Management System):
  - Automated handling of high-level configuration steps, e.g.
    - Reconfigure and reboot all LXPLUS nodes for new kernel
    - Reallocate nodes inside LXBATCH for Data Challenges
    - Drain and reconfigure node X for diagnosis / repair operations
  - Extensible framework – plug-ins for site-specific operations possible
  - Issues all necessary (re)configuration commands on top of quattor CDB and NCM
    - Uses a state transition engine
- HMS and SMS interface to Quattor and LEMON (or rather: sit on top!) for setting/getting node information respectively
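A state transition engine of the kind mentioned for SMS can be sketched as a table of allowed transitions plus a check before each change. The Python below is not SMS code. The state names (`production`, `draining`, `maintenance`, `standby`), the transition table and the `Node` class are hypothetical, chosen to match the "drain and reconfigure node X" example.

```python
# Hypothetical states and allowed transitions.
ALLOWED = {
    "production": {"draining"},
    "draining": {"maintenance", "production"},
    "maintenance": {"standby"},
    "standby": {"production"},
}


class Node:
    def __init__(self, name: str, state: str = "production") -> None:
        self.name = name
        self.state = state
        self.history = [state]

    def transition(self, target: str) -> None:
        """Move to `target` if the table allows it, otherwise refuse."""
        if target not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.name}: {self.state} -> {target} is not allowed")
        # A real engine would issue the (re)configuration commands here.
        self.state = target
        self.history.append(target)


# Drain node X, take it out for repair, then bring it back.
node_x = Node("node-x")
for step in ("draining", "maintenance", "standby", "production"):
    node_x.transition(step)
print(node_x.history)

# A node in production cannot jump straight to maintenance in this table.
try:
    Node("node-y").transition("maintenance")
    skipped_drain_allowed = True
except ValueError:
    skipped_drain_allowed = False
print(skipped_drain_allowed)
```

Encoding the workflow as a table makes site-specific extensions a matter of adding states and edges, which fits the plug-in framework described on the slide.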

LEAF screenshots

ELFms status – Quattor (I)
- Manages (almost) all Linux boxes in the computer centre
  - ~ 2100 nodes, to grow to ~ 8000 in
  - LXPLUS, LXBATCH, LXBUILD, disk and tape servers, Oracle DB servers
  - Solaris clusters, server nodes and desktops to come for Solaris9
- Starting: head nodes using Apache proxy technology for software and configuration distribution
- Misc developments pending, like
  - Fine-grained ACL protection to templates
  - HTTPS instead of HTTP for CDB profile and SW transport

ELFms status – Quattor (II)
- LCG-2 WN configuration components available
  - Configuration components for RM, EDG/LCG setup, Globus
  - Progressive reconfiguration of LXBATCH nodes as LCG-2 WNs
- Community-driven effort to use quattor for general LCG-2 configuration
  - Coordinated by staff from IN2P3 and NIKHEF
  - Aim is to provide a complete port of EDG-LCFG config components to Quattor for all LCG services
  - CERN and UAM Madrid providing generic installation instructions and site-independent packaging, as well as a Savannah development portal
    - Installation toolkit, user’s guide, tutorials available
- EGEE has chosen quattor for managing their integration testbeds
- Tier1/2 sites as well as LHC experiments are evaluating quattor for managing their own farms

ELFms status – LEMON (I)
- Smooth production running of MSA agent and Oracle-based repository at CERN-CC
  - 150 metrics sampled at intervals ranging from 30 s to 1 d
  - ~ 1 GB of monitoring data / day on ~ 2100 nodes
  - New sensors and metrics, e.g. tape robots, temperature, SMART disk info
- GridICE project uses LEMON for data collection
- Gathering experiment requirements and interfacing to grid-wide monitoring systems (MonALISA, GridICE)
  - Good interaction with, and gathered feedback from, CMS DC04
  - Archived raw monitoring data will be used for CMS computing TDR
- Visualization:
  - Operators - Test interface to new generation alarm systems (LHC control alarm system)
  - Finish status display pages
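As a rough order-of-magnitude check on the figures above, the daily volume can be broken down per node and per metric. The short Python calculation below assumes that "~ 1 GB" means 10^9 bytes and that the load is spread evenly over nodes and metrics. Neither assumption comes from the slides, so the results are averages for orientation only.

```python
nodes = 2100               # from the slide
metrics = 150              # from the slide
total_bytes_per_day = 1e9  # assumption: "~ 1 GB" taken as 10**9 bytes

per_node = total_bytes_per_day / nodes
per_metric_per_node = per_node / metrics

print(f"{per_node / 1e3:.0f} kB per node per day")
print(f"{per_metric_per_node / 1e3:.1f} kB per metric per node per day")
```

Under these assumptions each node contributes a little under half a megabyte per day, which suggests why a single central repository is workable at this scale.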

ELFms status – LEMON (II)
- Work on redundancy solutions for Monitoring Repository (homegrown and/or Oracle Streams)
- Quality of Service indicators, correlations and actuators (in collaboration with BARC India)
  - E.g. “tell LEAF to reassign two more nodes from LXBATCH to LXPLUS since capacity is insufficient”
  - Provide batch job mix indicators for improved I/O and CPU load equilibrium

ELFms status - LEAF
- HMS in full production for all nodes in CC
  - HMS heavily used during CC node migration
- SMS in production for LXBATCH
- Next steps:
  - Deploy SMS across more clusters
  - Tighter HMS/SMS integration (automatically put nodes in and out of production during e.g. rack moves)
- Developing ‘asset management’ GUI replacing PC finder
  - Client of HMS and SMS
  - Drag & drop nodes to automatically initiate HMS moves
  - Multiple-select nodes, then initiate an action, e.g. kernel upgrade
  - Interface to LEMON GUI

Summary
- ELFms is deployed in production at CERN
  - Stabilized results from 3 years of development within EDG and LCG
  - Established technology
  - Providing real added value for day-to-day operations
- Quattor and LEMON are generic software
  - Other projects and sites getting involved
- Site-specific workflows and “glue scripts” can be put on top for smooth integration with existing fabric environments
  - LEAF HMS and SMS
- CERN will help with Quattor (and LEMON) deployment at other sites
  - We provide site-independent software and installation instructions
  - Collaboration for providing missing pieces, e.g. configuration components, GUIs, beginner’s user guides?
- More information:

WP4 architecture concepts
- Information model: configuration is distinct from monitoring
  - Configuration == desired state (what we want)
  - Monitoring == actual state (what we have)
- Modularity
  - Open interfaces and protocols
- Extensibility
  - Allow for 3rd-party and site-specific plug-ins and add-ons
- Scalability
  - Thousands of nodes
- Automation
  - Minimize manual interventions
- Node autonomy
  - Operations are handled locally whenever possible
- Site autonomy
  - A site must keep control of its local resources

The Use of Quattor: a status report
Some people from LCG participating institutes took the initiative to develop some essential Quattor modules for the installation, configuration and updates of the LCG2 software suite.
1. First workshop:
  - 8 dedicated testing sites and some others participated
  - In March, just after the LCG workshop
  - A critical analysis was made of the usage of LCFGng for the EDG software
  - Decided on a global configuration schema for the various grid components
2. Priorities:
  - Primarily for LCG2
  - For non-CERN worker nodes initially, then CE, BDII, SE
3. Work done:
  - Some modules written
  - Proper testbed defined and operational
4. Outlook:
  - LCG-2 complete install expected by the end of the summer
  - Use in the EGEE JRA1 testing test bed
  - CERN is expected to keep supporting the Quattor core team

Improvements wrt EDG-LCFG
- New and powerful configuration language
  - True hierarchical structures
  - Extendable data manipulation language
  - (User-defined) typing and validation
- SQL query backend
- Portability
  - Plug-in architecture -> Linux and Solaris
- Enhanced components
  - Sharing of configuration data between components now possible
  - New component support libraries
  - Native configuration access API (NVA-API)
- Stick to the standards where possible
  - Installation subsystem uses system installer
  - Components don’t replace SysV init.d subsystem
- Modularity
  - Clearly defined interfaces and protocols
  - Mostly independent modules
  - “Light” functionality built in (e.g. package management)
- Improved scalability
  - Enabled for proxy technology
  - NFS mounts no longer necessary
- Enhanced management of software packages
  - ACLs for SWRep
  - Multiple versions installable
  - No need for RPM ‘header’ files
- Last but not least: support!
  - EDG-LCFG is frozen and obsoleted (no ports to newer Linux versions)
  - LCFG -> EDG-LCFGng -> quattor

Differences with ASIS/SUE
SUE:
- Focus on configuration, not installation
- Powerful configuration language
  - True hierarchical structures
  - Extendable data manipulation language
  - (User-defined) typing and validation
  - Sharing of configuration data between components now possible
- Central Configuration Database
- Supports unconfiguring services
- Improved dependency model
  - Pre/post dependencies
- Revamped component support libraries
ASIS:
- Scalability
  - HTTP vs. shared file system
- Supports native packaging system (RPM, PKG)
- Manages all software on the node
- ‘Real’ Central Configuration database
- (But: no end-user GUI, no package generation tool)

Differences with ROCKS
- Rocks: better documentation, nice GUI, easy to set up
- Design principle: reinstall nodes in case of configuration changes
  - No configuration or software updates on running systems
  - Suited for production? Efficiency on batch nodes, upgrades / reconfigs on 24/7 servers (e.g. gzip security fix, reconfig of CE address on WNs)
- Assumptions on network structure (private, public parts) and node naming
- No indication of how to extend the predefined node types or the configured services
- Limited configuration capabilities (key/value)
- No multiple package versions (neither in the repository, nor simultaneously on a single node)
  - E.g. different kernel versions on specific node types
- Works only for RH Linux (Anaconda installer extensions)

NCM Component example

    [...]
    sub Configure {
        my ($self, $config) = @_;
        # access configuration information
        my $arch = $config->getValue('/system/architecture');  # CDB API
        $self->Fail("not supported") unless ($arch eq 'i386');
        # (re)generate and/or update local config file(s)
        open (myconfig, '/etc/myconfig');
        …
        # notify affected (SysV) services if required
        if ($changed) {
            system('/sbin/service myservice reload');
            …
        }
    }

    sub Unconfigure {
        ...
    }

Key concepts behind quattor
- Autonomous nodes:
  - Local configuration files
  - No remote management scripts
  - No reliance on global file systems (AFS/NFS)
- Central control:
  - Primary configuration is kept centrally (and replicated on the nodes)
  - A single source for all configuration information
- Reproducibility:
  - Idempotent operations
  - Atomicity of operations
- Scalability:
  - Load-balanced servers, scalable protocols
- Use of standards:
  - HTTP, XML, RPM/PKG, SysV init scripts, …
- Portability:
  - Linux, Solaris
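The reproducibility points, idempotent and atomic operations, can be illustrated for the common case of regenerating a local configuration file. The Python sketch below is one possible approach and is not taken from Quattor. The function `write_if_changed`, the file name `myconfig` inside a temporary directory and its contents are hypothetical.

```python
import os
import tempfile


def write_if_changed(path: str, content: str) -> bool:
    """Idempotently and atomically set the content of `path`.

    Returns True if the file was (re)written, False if it already matched.
    """
    try:
        with open(path, "r", encoding="utf-8") as fh:
            if fh.read() == content:
                return False  # already in the desired state: nothing to do
    except FileNotFoundError:
        pass

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(content)
            fh.flush()
            os.fsync(fh.fileno())
        # Rename within one directory: readers see the old or the new file, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True


# Demonstration in a scratch directory (hypothetical file name and content).
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "myconfig")
    first_run = write_if_changed(target, "arch = i386\n")
    second_run = write_if_changed(target, "arch = i386\n")
    with open(target, encoding="utf-8") as fh:
        final_content = fh.read()

print(first_run, second_run)
```

The return value plays the role of a "changed" flag: a caller would reload the affected service only when it is True, so repeated runs leave an already-correct node untouched.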