DataGrid is a project funded by the European Commission under contract IST-2000-25182 IT Post-C5, 12.12.2003 Managing Computer Centre machines with Quattor.

Presentation transcript:

Managing Computer Centre machines with Quattor
Germán Cancio and Piotr Poznański, IT/FIO
Post-C5, 12/12/03
DataGrid is a project funded by the European Commission under contract IST-2000-25182

Managing Computer Centre Machines with Quattor – Post-C5 – Cancio, Poznański - n° 2
Outline
- Concepts
- Architecture and Functionality
- Deployment, next steps

quattor in a nutshell
- quattor: fabric management system developed by EDG WP4
  - Configuration, installation and management of fabric nodes
- Part of ELFms, together with
  - LEMON monitoring system
  - LEAF Hardware and State Mgmt system
- Used to manage most of the Linux nodes in the CERN CC
  - >1700 nodes out of ~2000
  - Multiple functionality (batch nodes, disk servers, tape servers, DB, web, …)
  - Heterogeneous hardware (memory, HD size, ...)

Key concepts behind quattor
- Autonomous nodes:
  - Local configuration files
  - No remote management scripts
  - No reliance on global file systems (AFS/NFS)
- Central control:
  - Primary configuration is kept centrally (and replicated on the nodes)
  - A single source for all configuration information
- Reproducibility:
  - Idempotent operations
  - Atomicity of operations
- Scalability:
  - Load balanced servers, scalable protocols
- Use of standards:
  - HTTP, XML, RPM/PKG, SysV init scripts, …
- Portability:
  - Linux, Solaris
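The "Reproducibility" bullets can be illustrated with a small sketch. This is not quattor code: the helper name `write_if_changed` is hypothetical, and the sketch only shows one common way to make a configuration file update both idempotent (running it twice changes nothing the second time) and atomic (readers never see a half-written file).

```python
import os
import tempfile


def write_if_changed(path: str, content: str) -> bool:
    """Write content to path atomically; return True only if the file changed."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            if handle.read() == content:
                return False  # already in the desired state: nothing to do
    except FileNotFoundError:
        pass  # the file does not exist yet, so it must be created

    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it over
    # the target. os.replace is atomic on POSIX file systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # do not leave a stray temporary file behind
        raise
    return True
```

The boolean return value is what lets a caller decide whether an affected service needs to be reloaded.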

quattor architecture - overview
- Configuration Management
  - Configuration Database
  - Configuration access and caching
  - Graphical and Command Line Interfaces
- Node and Cluster Management
  - Automated node installation
  - Node Configuration Management
  - Software distribution and management

Configuration Management

Configuration Information
- Configuration is expressed using the Pan language
- Information is arranged in templates
  - Common properties set only once
- Using templates it is possible to create hierarchies to match service structure

[Figure: template hierarchy. Recoverable content:]
CERN CC: name_srv1: , time_srv1: ip-time-1
  lxbatch: cluster_name: lxbatch, master: lxmaster01, pkg_add (lsf5.1)
  lxplus: cluster_name: lxplus, pkg_add (lsf5.1)
    lxplus001: eth0/ip: , pkg_add (lsf5.1_debug)
    lxplus020: eth0/ip:
    lxplus029
  disk_srv
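The idea of setting common properties once and overriding them further down the hierarchy can be sketched as follows. This is plain Python, not Pan, and the `TEMPLATES` structure and `resolve` function are hypothetical; the names and values are loosely modelled on the figure.

```python
from typing import Any, Dict, List, Optional

# Hypothetical templates: each one names its parent, sets some values and
# adds some packages. Values are illustrative only.
TEMPLATES: Dict[str, Dict[str, Any]] = {
    "cern_cc": {"parent": None,
                "values": {"time_srv1": "ip-time-1"},
                "packages": []},
    "lxplus": {"parent": "cern_cc",
               "values": {"cluster_name": "lxplus"},
               "packages": ["lsf5.1"]},
    "lxplus001": {"parent": "lxplus",
                  "values": {},
                  "packages": ["lsf5.1_debug"]},
}


def resolve(name: str) -> Dict[str, Any]:
    """Build a node profile by applying templates from the root downwards."""
    chain: List[Dict[str, Any]] = []
    current: Optional[str] = name
    while current is not None:
        chain.append(TEMPLATES[current])
        current = TEMPLATES[current]["parent"]

    profile: Dict[str, Any] = {"packages": []}
    for template in reversed(chain):  # root first, so children override
        profile.update(template["values"])
        profile["packages"].extend(template["packages"])
    return profile
```

Here `resolve("lxplus001")` inherits the time server from the site level and the cluster name from `lxplus`, and accumulates the packages added at each level.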

Configuration Management Infrastructure
[Diagram. Recoverable labels: CDB, pan, GUI, CLI, Scripts, RDBMS, CCM, Cache, Node; interfaces and formats: Perl, XML, SQL, SOAP, HTTP]

Configuration Database (CDB)
- Keeps complete configuration information
- Configuration describes the desired state of the managed machines
- Data consistency is enforced by a transaction mechanism
  - All changes are done in transactions
- Configuration is validated and kept under version control
  - Built-in validation (e.g. types), user defined validation
- Going back to previous versions of the configuration is possible
  - Full history is kept in CVS
- Conflicts of concurrent modification of the same configuration information are detected
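The combination of transactions, validation, history and conflict detection can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the CDB implementation: the class `ConfigStore`, its methods and the two exception types are all hypothetical, and history is kept in a list rather than in CVS.

```python
from typing import Any, Callable, Dict, List, Optional

_MISSING = object()  # marks "path not set" when comparing versions


class ConflictError(Exception):
    """Someone else changed the same path since the caller's base version."""


class ValidationError(Exception):
    """A value failed its validator; nothing was committed."""


class ConfigStore:
    def __init__(self, validators: Optional[Dict[str, Callable[[Any], bool]]] = None):
        self._history: List[Dict[str, Any]] = [{}]  # version 0 is empty
        self._validators = validators or {}

    @property
    def version(self) -> int:
        return len(self._history) - 1

    def get(self, version: Optional[int] = None) -> Dict[str, Any]:
        """Return a copy of the latest configuration, or of an older version."""
        index = self.version if version is None else version
        return dict(self._history[index])

    def commit(self, base_version: int, changes: Dict[str, Any]) -> int:
        """Apply all changes or none; return the new version number."""
        base, head = self._history[base_version], self._history[-1]
        for path in changes:
            if base.get(path, _MISSING) != head.get(path, _MISSING):
                raise ConflictError(path)
        for path, value in changes.items():
            check = self._validators.get(path)
            if check is not None and not check(value):
                raise ValidationError(path)
        candidate = dict(head)
        candidate.update(changes)
        self._history.append(candidate)
        return self.version
```

All checks run before anything is stored, so a failed commit leaves the history untouched, and any earlier version stays readable through `get(version)`.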

SQL Query Interface
- We can ask about properties spanning across machines
- We can run SQL queries (SELECT) and create views:
  - "give me all machines with more than 512 Mbytes of memory"
  - "give me all machines that belong to lxplus"
- Portability: available for Oracle and MySQL
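The two example questions translate directly into SELECT statements. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for the Oracle or MySQL back end, and the `machines` table, its columns and its rows are hypothetical; the real CDB schema is not shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.execute(
    "CREATE TABLE machines ("
    "name TEXT PRIMARY KEY, cluster_name TEXT, memory_mb INTEGER)"
)
conn.executemany(
    "INSERT INTO machines VALUES (?, ?, ?)",
    [
        ("lxplus001", "lxplus", 512),
        ("lxplus020", "lxplus", 1024),
        ("lxbatch001", "lxbatch", 2048),
    ],
)

# "give me all machines with more than 512 Mbytes of memory"
big_memory = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM machines WHERE memory_mb > 512 ORDER BY name"
    )
]

# "give me all machines that belong to lxplus", kept as a reusable view
conn.execute(
    "CREATE VIEW lxplus_nodes AS "
    "SELECT name FROM machines WHERE cluster_name = 'lxplus'"
)
lxplus = [row[0] for row in conn.execute("SELECT name FROM lxplus_nodes ORDER BY name")]
```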

Examples of information in CDB
- Hardware
  - CPU
  - Hard disk
  - Network card
  - Memory size
  - Node location in CC
- Software
  - Repository definitions
  - Service definitions = groups of packages (RPMs)
- System
  - Partition table
  - Load balancing information
- Cluster information
  - Cluster name and type
  - Batch master
- Audit information
  - Contract type and number
  - Purchase date

Graphical User Interface - PanGUIn

Configuration Cache Manager (CCM)
- Runs on every managed node
- Provides a local interface to the node's configuration information
- Information is downloaded from CDB and cached:
  - Access to the configuration is fast
  - Load peaks on the CDB servers are avoided
  - Disconnected operation is supported
  - Information is kept in sync with CDB using a notification/polling mechanism
- Access to local configuration information is performed through an easy-to-use API
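A minimal sketch of the caching idea, assuming a JSON file as the local cache format and a caller-supplied `fetch` function that downloads the profile. Both are assumptions made for illustration; the class `ProfileCache` is hypothetical and is not the CCM API.

```python
import json
import os
from typing import Any, Callable, Dict


class ProfileCache:
    """Keeps the last successfully fetched profile on local disk."""

    def __init__(self, cache_path: str, fetch: Callable[[], Dict[str, Any]]):
        self._cache_path = cache_path
        self._fetch = fetch  # hypothetical: downloads the profile from the server

    def refresh(self) -> bool:
        """Try to update the cache; return False if the server is unreachable."""
        try:
            profile = self._fetch()
        except OSError:
            return False  # disconnected: keep serving the cached copy
        tmp_path = self._cache_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(profile, handle)
        os.replace(tmp_path, self._cache_path)  # swap in the new profile atomically
        return True

    def get_value(self, path: str) -> Any:
        """Read one value from the local cache, never from the network."""
        with open(self._cache_path, "r", encoding="utf-8") as handle:
            return json.load(handle)[path]
```

Because `get_value` only touches the local file, lookups stay fast and keep working while the server is down; `refresh` would be driven by a notification or a polling timer.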

Node (Cluster) Management

Managing (cluster) nodes
[Diagram. Recoverable labels:
- Install server: base OS; dhcp, pxe, nfs/http; Install Manager; node (re)install
- Vendor system installer: RH73, RHES, Fedora, …
- Software servers: SWRep, packages (RPM, PKG); nfs, http, ftp
- CDB
- On the node: CCM, Node Configuration Manager (NCM), SW Package Manager (SPMA), cache
- System services: AFS, LSF, SSH, accounting, ...
- Installed software: kernel, system, applications, ...
- Standard nodes, managed nodes]

Install Manager
- Sits on top of the standard vendor installer, and configures it
  - Which OS version to install
  - Network and partition information
  - What core packages
  - Custom post-installation instructions
- Automated generation of control file (KickStart)
- It also takes care of managing DHCP (and TFTP/PXE) entries
- Can get its configuration information from CDB or via command line
- Available for RedHat Linux (Anaconda installer)
  - Allows for plugins for other distributions (SuSE, Debian) or Solaris
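Generating a control file from configuration data could look roughly like this. The function `render_kickstart` and the layout of the `node` dictionary are hypothetical, and the output is only a fragment built from a handful of well-known KickStart directives (`url`, `network`, `part`, `%packages`, `%post`). It is not a complete or validated KickStart file, and recent Anaconda releases additionally expect each section to be closed with `%end`.

```python
from typing import Any, Dict, List


def render_kickstart(node: Dict[str, Any]) -> str:
    """Render a KickStart fragment from a node description (illustrative only)."""
    lines: List[str] = [
        # which OS version to install
        f"url --url {node['os_url']}",
        # network information
        f"network --bootproto=static --ip={node['ip']} --hostname={node['hostname']}",
    ]
    # partition information
    for part in node["partitions"]:
        lines.append(f"part {part['mount']} --size={part['size_mb']}")
    # core packages
    lines.append("%packages")
    lines.extend(node["packages"])
    # custom post-installation instructions
    lines.append("%post")
    lines.extend(node["post"])
    return "\n".join(lines) + "\n"
```

Keeping the node description as data and the rendering as a pure function means the same description can come from a configuration database or from the command line.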

Node Configuration (I)
- NCM (Node Configuration Manager) is responsible for ensuring that reality on a node reflects the desired state in CDB
- Framework system, where service specific plug-ins called Components make the necessary system changes
  - Regenerate local config files (e.g. /etc/ssh/sshd_config)
  - Restart/reload services (SysV scripts)
  - Configuration dependencies (e.g. configure network before sendmail)
- Components invoked on boot, via cron or on CDB config changes
- Porting of SUE features to NCM components started, to be completed with next CERN certified Linux and Solaris versions
  - Currently available: grub, quota, snmp, dns, automounter, network, inetd, globuscfg, spma, sysaccounting, edg-cfg, …
  - Keep portability between Linux/Solaris whenever possible
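Ordering components by their dependencies ("configure network before sendmail") is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the function `run_components` and the component table are hypothetical and only illustrate the ordering, not the NCM dispatcher itself.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_components(
    components: Dict[str, Callable[[], None]],
    runs_after: Dict[str, Set[str]],
) -> List[str]:
    """Run every component after the components it depends on.

    runs_after maps a component name to the names that must run before it.
    Returns the order that was used. graphlib raises CycleError if the
    dependencies are circular.
    """
    graph = {name: runs_after.get(name, set()) for name in components}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        components[name]()
    return order
```

For example, with `runs_after = {"sendmail": {"network"}}` the `network` component is always called before `sendmail`, whatever order the components were registered in.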

Node Configuration (II)
- Component support libraries for ease of component development
  - Configuration information access
  - Configuration file manipulation
  - Advanced file operations
  - Process management
  - Exception handling libraries
- A tool geared towards sysadmins/operators allows querying and visualizing the node's configuration profile

Software Management (I - Server)
(SPMA and SWRep were introduced in post-C5 14/3/03)
- SWRep = Software Repository
- Universal repository for storing software:
  - Extendable to multiple platforms and packagers (RH Linux RPM, Solaris PKG, others like Debian pkg)
  - Multiple package versions/releases
- Management ("product maintainers") interface:
  - ACL based mechanism to grant/deny modification rights (packages associated to "areas")
- Client access: via standard protocols
  - HTTP, AFS/NFS, FTP
- Replication: using standard tools (rsync)
  - Load balancing, redundancy

Software Management (II - Clients)
- SPMA = Software Package Management Agent
- Manage all or a subset of packages on the nodes
  - On production nodes: wipe out unknown packages, (re)install missing ones
  - On development nodes (or desktops): non-intrusive, configurable management of system and security updates
- Package manager, not only upgrader
  - Can roll back package versions
  - Transactional verification of operations
- Portability: generic plug-in framework
  - Plug-ins available for Linux RPM and Solaris PKG (can be extended)
- Scalability:
  - Supports HTTP (also FTP, AFS/NFS)
  - Time smearing
  - Package pre-caching
- Possible to access multiple repositories (division/experiment specific)
- Modularity: can be configured via CDB, or locally
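Two of these ideas lend themselves to a short sketch: comparing the desired package list with what is installed, and time smearing so that nodes do not all contact the servers at once. Both functions are hypothetical illustrations, not SPMA code, and for brevity the sketch assumes a single version per package, whereas SPMA can handle several.

```python
import hashlib
from typing import Dict, List, Tuple

Operation = Tuple[str, str, str]  # (action, package name, version)


def plan_operations(
    desired: Dict[str, str],
    installed: Dict[str, str],
    remove_unknown: bool = True,
) -> List[Operation]:
    """Work out what must change so that installed matches desired."""
    ops: List[Operation] = []
    for name, version in sorted(desired.items()):
        if name not in installed:
            ops.append(("install", name, version))
        elif installed[name] != version:
            # covers upgrades and roll-backs alike
            ops.append(("replace", name, version))
    if remove_unknown:  # production policy: wipe out unknown packages
        for name in sorted(set(installed) - set(desired)):
            ops.append(("remove", name, installed[name]))
    return ops


def smear_delay(hostname: str, window_seconds: int) -> int:
    """Derive a stable per-node delay within the window from the hostname."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_seconds
```

Passing `remove_unknown=False` models the non-intrusive mode described for development nodes and desktops. Hashing the hostname is one simple way to smear load; a random delay would work as well but would not be reproducible per node.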

Deployment

Quattor at CERN
- Quattor is used by FIO to manage most CC Linux nodes:
  - >1700 nodes, 15 clusters, to be scaled up to >5000 in (LHC)
  - LXPLUS, LXBATCH, LXSHARE, LXBUILD, disk and tape servers, Oracle DB servers
  - RedHat 7.3 and RHES 2.1
  - RHES30 (also on IA64) to come soon
- Not deployed: InstallMgr, as the CERN legacy solution (AIMS) is still in use
  - AIMS interfaced to CDB by FIO (not fully automatic)
- Server cluster (LXSERV) hosting CDB and SWRep replicas
  - 4 RH73 nodes
  - CDB: ~260 general templates, and 2 templates per node (one derived from LANDB): >3800 templates in total
  - SWRep: >5900 software packages
- Solaris clusters, server nodes and desktops to come for Solaris9
  - Cf. Ignacio's C5 presentation

FIO usage at CERN-CC
- LSF batch system upgrade:
  - Upgrade from LSF 4.2 to LSF 5.1 on >1000 nodes within 15 minutes, without service interruption
- Security updates:
  - All security upgrades are done by SPMA
    - SSH security updates
    - KDE upgrades (~400 MB per node) on >700 nodes
    - etc. (~once a week!)
- Kernel upgrades:
  - SPMA can handle multiple versions of the same package
  - This allows the installation of a new kernel to be separated in time from its activation (after reboot)
  - An NCM component configures which kernel version to use

Deployment outside CERN-CC
- EDG: no time for wide deployment
  - Estimated effort for moving from LCFG to quattor exceeded remaining EDG lifetime
  - EDG focus on stability rather than middleware functionality
- Tutorials held at HEPiX and EDG conferences have generated positive feedback and interest:
  - Experiments: LHCb, Atlas
  - HEP institutes: UAM Madrid, LAL/IN2P3, Liverpool University, NIKHEF
  - Projects: Grille 5K (CNRS France)

Work in Progress (I)
- EDG finish-up
  - End user documentation, install guide, packaging
  - FIO will continue maintaining Quattor (as part of ELFms) after EDG finishes
- Remaining developments
  - CDB fine grained access control
  - Scalability audit
  - Data encryption mechanisms: for sensitive data (ACLs)
  - Specialized GUIs and CLIs (e.g. operators)
- Improved procedures and workflows
  - Take into account new commands and functionality
  - Release cycle ('test', 'new', 'production' branches of CDB information)
- Finish migration out of legacy tools
  - Finish SUE migration: port SUE features to NCM components in time for next certified Linux
  - ASIS + SUE/security + rpmupdate phaseout was finished in August

Work in Progress (II)
- Single CDB for CERN-CC
  - Inclusion of Solaris clusters and nodes
- More integration with ELFms LEMON/LEAF
  - Interfaces from LEAF HMS/SMS to quattor being developed
  - LEMON sensors for quattor
- Upgrade LXSERV service cluster
  - New hardware
  - Split in front-end / back-end nodes
- LCG-2 integration for LXBATCH nodes (WorkerNodes)
  - Deploy software from EDG, LCG, experiment SW
  - Deploy NCM components for local grid services


Differences with ASIS/SUE
SUE:
- Focus on configuration, not installation
- Powerful configuration language
  - True hierarchical structures
  - Extendable data manipulation language
  - (User defined) typing and validation
  - Sharing of configuration data between components now possible
- Central Configuration Database
- Supports unconfiguring services
- Improved dependency model
  - Pre/post dependencies
- Revamped component support libraries
ASIS (see post-C5 14/3/2003):
- Scalability
  - HTTP vs. shared file system
- Supports native packaging system (RPM, PKG)
- Manages all software on the node
- 'Real' Central Configuration Database
- (But: no end-user GUI, no package generation tool)

Differences with EDG-LCFG
- New and powerful configuration language
  - True hierarchical structures
  - Extendable data manipulation language
  - (User defined) typing and validation
- Portability
  - Plug-in architecture -> Linux and Solaris
- Enhanced components
  - Sharing of configuration data between components now possible
  - New component support libraries
  - Native configuration access API (NVA-API)
- Stick to the standards where possible
  - Installation subsystem uses system installer
  - Components don't replace SysV init.d subsystem
- Modularity
  - Clearly defined interfaces and protocols
  - Mostly independent modules
  - "Light" functionality built in (e.g. package management)
- Removed non-scalable protocols
  - NFS mounts not necessary any longer
- Enhanced management of software packages
  - ACLs for SWRep
  - No need for RPM 'header' files

NCM Component example

    [...]
    sub Configure {
        my ($self, $config) = @_;

        # access configuration information
        my $arch = $config->getValue('/system/architecture');  # CDB API
        $self->Fail("not supported") unless ($arch eq 'i386');

        # (re)generate and/or update local config file(s)
        open(my $myconfig, '<', '/etc/myconfig');
        …

        # notify affected (SysV) services if required
        if ($changed) {
            system('/sbin/service myservice reload');
            …
        }
    }

    sub Unconfigure {
        ...
    }