Current Status of Fabric Management at CERN
CHEP 2004, Interlaken, 27/9/2004
CERN IT/FIO: G. Cancio, T. Kleinwort, W. Tomlin et al.
Presented by G. Cancio

Outline

- CERN-CC and the ELFms framework
  - Quattor
  - Lemon
  - LEAF
- Deployment status

Fabric Management with ELFms

ELFms stands for 'Extremely Large Fabric management system'. Subsystems:

- quattor: configuration, installation and management of nodes
- LEMON: system / service monitoring
- LEAF: hardware / state management

ELFms manages and controls most of the nodes in the CERN CC:

- ~2100 nodes out of ~2700
- Multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, ...)
- Heterogeneous hardware (CPU, memory, HD size, ...)
- Supported OS: Linux (RH7, RHES 2.1, Scientific Linux 3 on IA32 & IA64) and Solaris (9)

Quattor

Quattor takes care of the configuration, installation and management of fabric nodes.

- A Configuration Database holds the 'desired state' of all fabric elements:
  - Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info, ...)
  - Cluster (name and type, batch system, load-balancing info, ...)
  - Defined in templates arranged in hierarchies, so common properties are set only once (see the sketch below)
- Autonomous management agents run on each node for:
  - Base installation
  - Service (re-)configuration
  - Software installation and management

Quattor was initially developed in the scope of the EU DataGrid project. Development and maintenance are now coordinated by CERN/IT.
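The hierarchy idea can be illustrated with a small sketch. Quattor's CDB templates are written in its own configuration language (Pan), not Python; the template names and properties below are invented for illustration.

```python
# Illustrative sketch only: models templates arranged in hierarchies, with
# common properties set once and more specific layers overriding them.

def merge(base, override):
    """Recursively overlay 'override' onto 'base' and return the result."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# Hypothetical templates, from most generic to most specific.
common = {
    "os": "Scientific Linux 3",
    "services": {"ssh": "enabled", "afs": "enabled"},
}
lxbatch_cluster = {
    "cluster": {"name": "lxbatch", "batch_system": "LSF"},
    "services": {"lsf": "enabled"},
}
node = {
    "hardware": {"cpu_count": 2, "memory_mb": 1024},
}

# A node's 'desired state' is the composition of all template layers.
profile = merge(merge(common, lxbatch_cluster), node)
print(profile["services"])
# -> {'ssh': 'enabled', 'afs': 'enabled', 'lsf': 'enabled'}
```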

Architecture

[Diagram: quattor architecture. A configuration server (CDB, with SQL and XML backends, accessed via CLI, GUI and scripts over SOAP) serves XML configuration profiles over HTTP. An install server (an Install Manager driving the system installer over HTTP/PXE) provides the base OS. SW server(s) host the SW Repository, which serves RPMs over HTTP to the SW Package Manager (SPMA) on the managed nodes, where the Node Configuration Manager (NCM) runs components (CompA, CompB, CompC) that configure the corresponding services (ServiceA, ServiceB, ServiceC).]
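A minimal sketch of the node-side flow in this diagram, assuming a hypothetical profile URL and XML layout; the real NCM is not a Python program and its profile schema is not shown on the slide.

```python
# Node fetches its XML configuration profile over HTTP and dispatches one
# configuration component per service, as in the diagram above.

import urllib.request
import xml.etree.ElementTree as ET

PROFILE_URL = "http://cdb.example.org/profiles/{host}.xml"   # hypothetical

def fetch_profile(hostname):
    """Fetch the node's XML configuration profile from the config server."""
    with urllib.request.urlopen(PROFILE_URL.format(host=hostname)) as resp:
        return ET.fromstring(resp.read())

def configure(profile):
    """Run one configuration component per <component> entry in the profile."""
    for comp in profile.iter("component"):
        print("running configuration component:", comp.get("name"))
        # A real component would write config files and restart the service.

if __name__ == "__main__":
    configure(fetch_profile("lxb0001"))
```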

Quattor Deployment

- Quattor is in complete control of the Linux boxes (~2100 nodes, expected to grow to ~8000)
  - Replaced the legacy tools (SUE and ASIS) at CERN during 2003
- CDB holds information on > 95% of the systems in CERN-CC
- Over 90 NCM configuration components developed
  - From basic system configuration to Grid services setup (including desktops)
- SPMA used for managing all software (see the sketch below)
  - ~2 security and functional updates per week (including kernel upgrades)
  - E.g. a KDE security upgrade (~300 MB per node) and an LSF client upgrade (v4 to v5) deployed in 15 minutes, without service interruption
  - Handles (occasional) downgrades as well
- Developments ongoing:
  - Fine-grained ACL protection of templates
  - Deployment of HTTPS instead of HTTP (use of host certificates)
  - Speed-up of XML configuration profile generation (e.g. parallel generation)
  - Proxy architecture for enhanced scalability ...
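A hedged sketch of the idea behind SPMA, not its actual implementation: compare the desired package list from the node's profile with what is installed, and derive the actions to apply. The string comparison of versions is a deliberate simplification; real package managers use RPM version-comparison semantics.

```python
def plan(desired, installed):
    """Both arguments map package name -> version; returns a list of actions."""
    actions = []
    for name, version in desired.items():
        have = installed.get(name)
        if have is None:
            actions.append(("install", name, version))
        elif have != version:
            # The configured version wins even if it is older -- this is
            # how (occasional) downgrades are handled.
            op = "upgrade" if version > have else "downgrade"
            actions.append((op, name, version))
    for name, version in installed.items():
        if name not in desired:
            actions.append(("remove", name, version))
    return actions

print(plan({"kernel": "2.4.21-20", "lsf": "5.1"},
           {"kernel": "2.4.21-15", "lsf": "4.2", "oldtool": "1.0"}))
# -> [('upgrade', 'kernel', '2.4.21-20'), ('upgrade', 'lsf', '5.1'),
#     ('remove', 'oldtool', '1.0')]
```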

Back-end/front-end setup

[Diagram: a backend ("master") server holding the installation images, RPMs and configuration profiles is replicated via rsync to a frontend server cluster, from which the nodes in racks 1..N fetch over DNS-load-balanced HTTP.]

Proxy server setup

[Diagram: the same content (installation images, RPMs, configuration profiles) flows from the backend ("master" M, with a replica M') through frontend L1 proxies to the L2 proxies, the "head" nodes (H) of each rack, from which the nodes in racks 1..N fetch over DNS-load-balanced HTTP.]
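A toy illustration of the head-node idea, with invented host names (not the CERN implementation): a rack-level caching proxy downloads each file from the upper tiers once and serves it to every node in the rack.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PARENT = "http://l1-proxy.example.org"   # hypothetical upstream L1 proxy
CACHE = {}                               # path -> bytes (in-memory, simplified)

class HeadNodeProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in CACHE:
            # Cache miss: fetch once from the parent tier.
            with urllib.request.urlopen(PARENT + self.path) as resp:
                CACHE[self.path] = resp.read()
        body = CACHE[self.path]
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Nodes in the rack point their HTTP requests at this head node.
    HTTPServer(("", 8080), HeadNodeProxy).serve_forever()
```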

LCG/EGEE

- EGEE and LCG have chosen quattor for managing their integration testbeds
- Community effort to use quattor for fully automated LCG-2 configuration of all services
  - Aim is to provide a complete port of the LCFG configuration components
  - Most service configurations (WN, CE, UI, ...) already available
  - Minimal intrusiveness into site-specific environments
- More and more sites (IN2P3, NIKHEF, UAM Madrid, ...) and projects (GridPP) discussing or adopting quattor as their basic fabric management framework...
- ... leading to improved core software robustness and completeness
  - Identified and removed site dependencies and assumptions
  - Documentation, installation guides, bug tracking, release cycles

Lemon – LHC Era Monitoring

[Diagram: Lemon architecture. On each node a Monitoring Agent drives local Sensors and forwards samples to the central Monitoring Repository over TCP/UDP; the repository stores them in an SQL backend and answers SOAP queries from the Correlation Engines, the Lemon CLI and user workstations; web browsers access status pages generated with RRDTool/PHP on an Apache HTTP server.]
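The agent-to-repository path can be sketched as follows; the metric name, endpoint and plain-text message format are invented, since Lemon's actual wire protocol is not described on the slide.

```python
# A monitoring agent samples local sensors periodically and forwards the
# values to the repository over UDP.

import os
import socket
import time

REPOSITORY = ("lemon-repo.example.org", 12409)   # hypothetical endpoint

def sample_loadavg():
    return os.getloadavg()[0]                    # 1-minute load average

SENSORS = {"loadavg_1min": sample_loadavg}       # one sensor per metric

def run_agent(hostname, interval=30):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        for metric, sensor in SENSORS.items():
            msg = f"{hostname} {metric} {time.time():.0f} {sensor():.2f}"
            sock.sendto(msg.encode(), REPOSITORY)  # fire-and-forget over UDP
        time.sleep(interval)   # the slide quotes intervals from 30 s to 1 day

run_agent(socket.gethostname())
```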

Deployment and Enhancements

- Smooth production running of the Monitoring Agent and the Oracle-based repository at CERN-CC
  - ~150 metrics per node, sampled at intervals from 30 s to 1 day; ~1 GB of data per day from ~1800 nodes
  - No aging-out of data; instead, archiving on MSS (CASTOR)
- Usage outside CERN-CC, collaborations:
  - GridICE, CMS-Online (DAQ nodes)
  - BARC India (collaboration on QoS)
  - Interface with MonALISA being discussed
- Hardened and enhanced EDG software
  - Rich sensor set, from generic to service-specific (e.g. IPMI/SMART for disk/tape servers)
- Re-engineered Correlation and Fault Recovery
  - Perl-plugin-based correlation engine for derived metrics, e.g. average number of LXPLUS users, or load average and total active LXBATCH nodes (see the sketch below)
  - Lightweight local self-healing module (e.g. /tmp cleanup, restarting daemons)
- Alarm system for operators – gateway to the future LHC control alarm system (LASER)
- Developing a redundancy layer for the repository (Oracle Streams)
- Status and performance visualization pages ...
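A hedged sketch of a derived-metric correlation, in Python rather than the Perl plugin API the slide mentions; the node names and sample values are invented.

```python
samples = {   # latest per-node samples: node -> {metric: value}
    "lxplus001": {"users": 31, "loadavg": 2.1},
    "lxplus002": {"users": 24, "loadavg": 1.4},
    "lxb0001":   {"active": 1, "loadavg": 3.8},
    "lxb0002":   {"active": 0, "loadavg": 0.1},
}

def derived_metrics(samples):
    """Compute cluster-level metrics from per-node samples."""
    plus = [s for n, s in samples.items() if n.startswith("lxplus")]
    batch = [s for n, s in samples.items() if n.startswith("lxb")]
    return {
        "lxplus_avg_users": sum(s["users"] for s in plus) / len(plus),
        "lxbatch_active_nodes": sum(s["active"] for s in batch),
    }

print(derived_metrics(samples))
# -> {'lxplus_avg_users': 27.5, 'lxbatch_active_nodes': 1}
```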

lemon-status

[Screenshot: the lemon-status web page.]

LEAF - LHC Era Automated Fabric

LEAF is a collection of workflows for high-level node hardware and state management, on top of quattor and LEMON:

- HMS (Hardware Management System):
  - Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
  - Automatically issues install, retire etc. requests to technicians
  - GUI to locate equipment physically
  - The HMS implementation is CERN-specific, but concepts and design should be generic
- SMS (State Management System):
  - Automated handling (and tracking) of high-level configuration steps, e.g.:
    - Reconfigure and reboot all LXPLUS nodes for a new kernel and/or a physical move
    - Drain and reconfigure nodes for diagnosis / repair operations
  - Issues all necessary (re)configuration commands via quattor
  - Extensible framework – plug-ins for site-specific operations are possible (see the sketch below)
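A minimal sketch of SMS-style state management; the state names and allowed transitions below are assumptions for illustration, not the actual SMS state model.

```python
# A node may only move between states along allowed transitions; each
# transition triggers the corresponding reconfiguration via quattor.

ALLOWED = {
    "production":  {"draining", "standby"},
    "draining":    {"maintenance"},
    "maintenance": {"standby"},
    "standby":     {"production"},
}

def set_state(node, current, target):
    """Move a node to a new state, triggering the reconfiguration hooks."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{node}: illegal transition {current} -> {target}")
    if target == "draining":
        print(f"{node}: closing batch queues and letting jobs drain")
    elif target == "production":
        print(f"{node}: refreshing quattor profile and reopening queues")
    # Further hooks (alarm masking, reboots) would be issued via quattor here.
    return target

state = set_state("lxb0001", "production", "draining")
```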

LEAF Deployment

- HMS in full production for all nodes in the CC
  - HMS was heavily used during the CC node migration (~1500 nodes)
- SMS in production for all quattor-managed nodes
- Next steps:
  - More automation, and handling of other hardware types, in HMS
  - More service-specific SMS clients (e.g. for tape and disk servers)
- Developing an 'asset management' GUI
  - Multi-select and drag & drop of nodes to automatically initiate HMS moves and SMS operations
  - Interface to the LEMON GUI

Summary

- ELFms is deployed in production at CERN
  - Stabilized results of three years of development within EDG and LCG
  - Established technology – from prototype to production
  - Consistent full-lifecycle management and a high level of automation
  - Providing real added value for day-to-day operations
- Quattor and LEMON are generic software
  - Other projects and sites are getting involved
- Site-specific workflows and "glue scripts" can be put on top for smooth integration with existing fabric environments
  - LEAF HMS and SMS
- ELFms = quattor + LEMON + LEAF
- More information:

Use Case: Move a rack of machines

[Sequence diagram between Operations, HMS, the network DB, SMS, the quattor CDB, the technicians and the nodes:]

1. Import
2. Set to standby
3. Update
4. Refresh
5. Take out of production (close queues and drain jobs, disable alarms)
6. Shutdown work order
7. Request move
8. Update
9. Update
10. Install work order
11. Set to production
12. Update
13. Refresh
14. Put into production
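The same workflow as a linear script; the grouping of steps under particular actors is my reading of the diagram residue, not something the slide states explicitly.

```python
def take_out_of_production(node):
    print(f"{node}: close queues, drain jobs, disable alarms")   # step 5

def put_into_production(node):
    print(f"{node}: refresh quattor profile, re-enable alarms")  # steps 13-14

def move_rack(nodes):
    for n in nodes:
        take_out_of_production(n)
    print("HMS: shutdown work order, request move, install work order")  # 6-10
    for n in nodes:
        put_into_production(n)

move_rack(["lxb0001", "lxb0002"])
```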

Improvements w.r.t. EDG-LCFG

- New and powerful configuration language
  - True hierarchical structures
  - Extendable data-manipulation language
  - (User-defined) typing and validation (see the sketch below)
- SQL query backend
- Portability
  - Plug-in architecture -> Linux and Solaris
- Enhanced components
  - Sharing of configuration data between components now possible
  - New component support libraries
  - Native configuration access API (NVA-API)
- Stick to standards where possible
  - Installation subsystem uses the system installer
  - Components don't replace the SysV init.d subsystem
- Modularity
  - Clearly defined interfaces and protocols
  - Mostly independent modules
  - 'Light' functionality built in (e.g. package management)
- Improved scalability
  - Enabled for proxy technology
  - NFS mounts no longer necessary
- Enhanced management of software packages
  - ACLs for SWRep
  - Multiple versions installable
  - No need for RPM 'header' files
- Last but not least: support!
  - EDG-LCFG is frozen and obsolete (no ports to newer Linux versions)
  - LCFG -> EDG-LCFGng -> quattor
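The "(user-defined) typing and validation" point can be illustrated conceptually; quattor's actual language is Pan, and the schema and profile here are invented.

```python
# A declared schema is checked against a profile before it is deployed.

SCHEMA = {
    "hardware": {"cpu_count": int, "memory_mb": int},
    "cluster":  {"name": str},
}

def validate(profile, schema, path=""):
    """Check a configuration profile against a declared schema."""
    for key, expected in schema.items():
        value = profile.get(key)
        if isinstance(expected, dict):
            validate(value or {}, expected, f"{path}/{key}")
        elif not isinstance(value, expected):
            raise TypeError(f"{path}/{key}: expected {expected.__name__}, "
                            f"got {value!r}")

validate({"hardware": {"cpu_count": 2, "memory_mb": 1024},
          "cluster": {"name": "lxbatch"}}, SCHEMA)   # passes silently
```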

Differences with ASIS/SUE

Compared with SUE:
- Focus on configuration, not installation
- Powerful configuration language
  - True hierarchical structures
  - Extendable data-manipulation language
  - (User-defined) typing and validation
  - Sharing of configuration data between components now possible
- Central Configuration Database
- Supports unconfiguring services
- Improved dependency model
  - Pre/post dependencies
- Revamped component support libraries

Compared with ASIS:
- Scalability
  - HTTP vs. a shared file system
- Supports the native packaging system (RPM, PKG)
- Manages all software on the node
- A 'real' central configuration database
- (But: no end-user GUI, no package-generation tool)

Differences with ROCKS

- Rocks: better documentation, nice GUI, easy to set up
- Design principle: reinstall nodes in case of configuration changes
  - No configuration or software updates on running systems
  - Suited for production? Efficient on batch nodes, but what about upgrades / reconfigurations of 24/7 servers (e.g. a gzip security fix, or reconfiguring the CE address on WNs)?
- Assumptions about the network structure (private/public parts) and node naming
- No indication of how to extend the predefined node types or the configured services
- Limited configuration capabilities (key/value)
- No multiple package versions (neither in the repository, nor simultaneously on a single node)
  - E.g. different kernel versions on specific node types
- Works only for RH Linux (Anaconda installer extensions)