Fabric Management with ELFms
BARC-CERN collaboration meeting, B.A.R.C. Mumbai, 28/10/05
Presented by G. Cancio, CERN/IT

German Cancio – CERN/IT - n° 2
Outline
- The ELFms framework
  - Quattor
  - Lemon
  - LEAF
- Deployment status

German Cancio – CERN/IT - n° 3
Fabric Management with ELFms (I)
ELFms stands for 'Extremely Large Fabric management system'. Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
ELFms manages and controls most of the nodes in the CERN CC:
- ~2600 nodes out of ~3500
- Multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, ...)
- Heterogeneous hardware (CPU, memory, HD size, ...)
- Supported OS: Linux (RH7, RHES 2/3/4, Scientific Linux 3/4, 32/64 bit) and Solaris

German Cancio – CERN/IT - n° 4
Fabric Management with ELFms (II)
- ELFms (Quattor/Lemon) was started in the scope of the EU DataGrid project.
- Development is now coordinated by CERN/IT in collaboration with other HEP institutes.
- Quattor/Lemon are used in production in and outside CERN, at LCG T1/T2 sites of varying size.
- Complete configuration of the system and of the LCG Grid middleware via Quattor.
- Integration with Grid services, e.g. monitoring (GridICE, MonALISA).

German Cancio – CERN/IT - n° 5

German Cancio – CERN/IT - n° 6
Quattor
Quattor takes care of the configuration, installation and management of fabric nodes:
- A Configuration Database holds the 'desired state' of all fabric elements
  - Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info, ...)
  - Cluster (name and type, batch system, load-balancing info, ...)
- Autonomous management agents running on the node for
  - Base installation
  - Service (re-)configuration
  - Software installation and management
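The convergence idea behind the agents can be sketched in a few lines of Python. This is an illustrative sketch only, not Quattor code: an agent such as SPMA derives its actions by comparing the CDB's desired package state with what is installed on the node. All package names and versions below are hypothetical.

```python
# Sketch: derive install/upgrade/remove actions from desired vs installed state.
# This is not the real SPMA logic, just the desired-state principle it follows.

def package_actions(desired: dict, installed: dict) -> dict:
    """Return the actions needed to converge a node to its desired state."""
    actions = {"install": [], "upgrade": [], "remove": []}
    for pkg, version in desired.items():
        if pkg not in installed:
            actions["install"].append((pkg, version))
        elif installed[pkg] != version:
            actions["upgrade"].append((pkg, version))  # also covers downgrades
    for pkg in installed:
        if pkg not in desired:
            actions["remove"].append(pkg)
    return actions

# Hypothetical example: LSF client behind the desired version, stale package present.
desired = {"lsf": "5.1", "kernel": "2.4.21-27"}
installed = {"lsf": "4.2", "gzip-old": "1.2"}
print(package_actions(desired, installed))
```

Because the comparison is against a full desired state rather than a list of pending updates, occasional downgrades fall out of the same logic for free, matching the deployment notes later in the talk.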

German Cancio – CERN/IT - n° 7
Architecture (diagram)
- Managed nodes run the Node Configuration Manager (NCM), which drives per-service components (CompA/B/C for ServiceA/B/C), and the SW Package Manager (SPMA), which installs RPMs/PKGs fetched over HTTP from the SW Repository on the SW server(s).
- An Install Manager on the install server drives the base OS system installer via HTTP / PXE.
- A configuration server serves XML configuration profiles over HTTP from CDB, which has SQL and XML backends and is accessed by CLI, GUI and scripts via SOAP.

German Cancio – CERN/IT - n° 8
Configuration Information
- Configuration is expressed using a language called Pan
- Information is arranged into templates
  - Common properties are set only once
- Using templates it is possible to create hierarchies that match service structures
Example hierarchy:
  CERN CC (name_srv1, time_srv1: ip-time-1)
    lxbatch (cluster_name: lxbatch, master: lxmaster01, pkg_add(lsf5.1))
    lxplus (cluster_name: lxplus, pkg_add(lsf5.1))
      lxplus001 (eth0/ip, pkg_add(lsf6_beta))
      lxplus020 (eth0/ip)
      lxplus029
    disk_srv
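The layering idea can be mimicked in plain Python (Pan itself has its own syntax and much richer semantics such as validation and typing). In this sketch, site-wide properties are set once in a base layer, and cluster- and node-level layers refine or override them; the values are taken loosely from the example hierarchy above and are otherwise hypothetical.

```python
# Sketch of template layering: later layers override or extend earlier ones,
# just as a node template derives from cluster and site templates.

def compose(*layers: dict) -> dict:
    """Build a node profile by applying layers in order of increasing specificity."""
    profile = {}
    for layer in layers:
        profile.update(layer)
    return profile

cern_cc = {"name_srv1": "ip-name-1", "time_srv1": "ip-time-1"}   # set only once
lxplus = {"cluster_name": "lxplus", "packages": ["lsf5.1"]}
lxplus001 = {"packages": ["lsf6_beta"]}                          # node-level override

profile = compose(cern_cc, lxplus, lxplus001)
print(profile["cluster_name"], profile["packages"])
```

The node profile ends up with the site's common settings plus the most specific value for anything redefined lower down, which is the behaviour the slide describes for the template hierarchy.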

German Cancio – CERN/IT - n° 9
Quattor Deployment
- Quattor is in complete control of the Linux boxes (~2600 nodes, set to grow further by 2008)
- CDB holds information on all systems in the CERN CC
- Over 90 NCM configuration components developed
  - From basic system configuration to Grid services setup (including desktops)
- SPMA is used for managing all software
  - ~2 weekly security and functional updates (including kernel upgrades)
  - E.g. a KDE security upgrade (~300 MB per node) and an LSF client upgrade (v4 to v5) in 15 minutes, without service interruption
  - Handles (occasional) downgrades as well
- Developments ongoing:
  - CDB: fine-grained ACL protection for templates, namespaces, stronger typing, improved SQL/XMLDB backend, ...
  - Security: deployment of HTTPS instead of HTTP (usage of host certificates)
  - Re-engineering of the Software Repository (BARC)
- Proxy architecture for enhanced scalability ...

German Cancio – CERN/IT - n° 10
Proxy server setup (diagram)
- A backend ('master') M, with replica M', feeds a frontend server cluster of L1 proxies via DNS-load-balanced HTTP.
- L2 proxies ('head' nodes, H) in each rack (Rack 1 .. Rack N) serve installation images, RPMs and configuration profiles to the nodes.

German Cancio – CERN/IT - n° 11
LCG/EGEE
- EGEE and LCG have chosen Quattor for managing their integration testbeds
- Components are available for a fully automated LCG-2 configuration
- Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, ...) adopt Quattor as their fabric management framework
  - In India: BARC, VECC (ALICE experiment)
- ... leading to improved core software robustness and completeness
  - Identified and removed site dependencies and assumptions
  - Documentation, installation guides, bug tracking, release cycles

German Cancio – CERN/IT - n° 12

German Cancio – CERN/IT - n° 13
Lemon – LHC Era Monitoring (architecture diagram)
- On each node, a Monitoring Agent runs plug-in sensors and forwards samples over TCP/UDP to the Monitoring Repository.
- The repository has an SQL backend and a SOAP interface.
- Users access the data from workstations via the Lemon CLI and via web pages (RRDTool / PHP on Apache, over HTTP).
- Correlation Engines consume repository data.
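The agent-to-repository path can be illustrated with a toy sensor. This is a sketch only: the flat line format and the port number below are invented for illustration and do not match Lemon's actual wire protocol; UDP is used because, as in the diagram, samples are fire-and-forget.

```python
# Toy "sensor" sample plus the kind of flat line an agent might ship over UDP.
# Metric id, format and port are hypothetical, not Lemon's real protocol.
import socket
import time

def format_sample(node: str, metric_id: int, value: float, ts: int) -> str:
    """Encode one measurement as a single whitespace-separated line."""
    return f"{node} {metric_id} {ts} {value}"

def send_sample(line: str, host: str = "127.0.0.1", port: int = 12409) -> None:
    """Fire-and-forget delivery, as with UDP sample transport."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(line.encode(), (host, port))

# Hypothetical CPU-load sample from one node.
line = format_sample("lxplus001", 4101, 0.73, int(time.time()))
send_sample(line)  # repository address here is a placeholder
print(line)
```

A real agent would batch samples, buffer them while the repository is unreachable, and optionally use TCP when loss matters, which is why the slide mentions both transports.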

German Cancio – CERN/IT - n° 14
Deployment and Enhancements
- Smooth production running of the Monitoring Agent and the Oracle-based repository at the CERN CC
  - ~200 metrics, sampled at intervals from 30 s to 1 day; ~1 GB of data / day from ~1800 nodes
  - No aging-out of data, but archiving on MSS (CASTOR)
- Usage outside the CERN CC, collaborations
  - GridICE (>100 LCG sites), CMS-Online, IN2P3, others ...
- Hardened and enhanced EDG software
  - Rich sensor set (from general to service-specific, e.g. IPMI/SMART for disk/tape, ...)
  - Generic multi-purpose sensor by BARC
- Correlation and fault recovery
  - Lightweight local self-healing module (e.g. /tmp cleanup, restarting daemons)
  - Being re-engineered by BARC
- Security for sample transport (TCP and UDP) (BARC)
- Status and performance visualization pages ...
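A back-of-envelope check of the slide's figures (~200 metrics on ~1800 nodes, ~1 GB/day) gives a feel for the per-sample storage cost. The average sampling interval below is an assumption, since the slide only gives the 30 s to 1 day range.

```python
# Rough estimate of repository load, under an assumed average sampling interval.
avg_interval_s = 300                      # assumption: 5 min average per metric
nodes, metrics = 1800, 200                # figures from the slide
samples_per_day = nodes * metrics * (86_400 // avg_interval_s)
bytes_per_sample = 1e9 / samples_per_day  # implied cost if ~1 GB/day is stored
print(samples_per_day, round(bytes_per_sample, 1))
```

Under that assumption the repository absorbs on the order of 10^8 samples per day at roughly 10 bytes each, which is consistent with compact numeric rows plus indexing overhead in an Oracle backend.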

German Cancio – CERN/IT - n° 15–20
Monitoring the Fabric
Using a web-based status display:
- CC overview
- Clusters and nodes
- VOs
- Power
- Error trending
- Batch system

German Cancio – CERN/IT - n° 21
Next Steps ...
- Service-based views (user/management perspective)
  - Synoptic view of which services are running, and how: appropriate for end users and managers
  - Needs to be built on top of Quattor and Lemon
  - Would require a separate service-definition DB
- Alarm system for operators
  - Allow 24h/7d operators to receive, acknowledge, ignore, hide and process alarms received via Lemon
  - Integrated into the Lemon status pages
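The operator actions listed above (receive, acknowledge, ignore, hide, process) suggest a small state machine. The states and allowed transitions below are an illustrative guess at such a design, not the actual alarm system's model.

```python
# Sketch of an operator alarm lifecycle: which transitions are allowed from
# which state. States and rules are assumptions for illustration.

VALID_TRANSITIONS = {
    "new": {"acknowledged", "ignored", "hidden"},
    "acknowledged": {"processed"},
    "ignored": set(),
    "hidden": set(),
    "processed": set(),
}

class Alarm:
    def __init__(self, node: str, text: str):
        self.node, self.text, self.state = node, text, "new"

    def transition(self, new_state: str) -> bool:
        """Apply an operator action; reject anything the lifecycle forbids."""
        if new_state in VALID_TRANSITIONS[self.state]:
            self.state = new_state
            return True
        return False  # e.g. an alarm cannot be processed before acknowledgement

a = Alarm("lxplus001", "daemon not running")
a.transition("acknowledged")
print(a.state)
```

Encoding the lifecycle as data makes it easy for a status page to show exactly the buttons that are valid for each alarm, rather than scattering the rules through the UI code.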

German Cancio – CERN/IT - n° 22
Quattor-LEMON integration
Quattor and Lemon are tightly integrated at CERN:
- Configuration of the Lemon Agent and Server:
  - CDB holds the definitions of all sensors, metric classes, and metric instances
  - An NCM component (ncm-fmonagent) generates the Agent config file
  - Another NCM component updates the Oracle Server configuration
- Configuration of the Lemon web pages:
  - Information on which clusters exist, and which nodes belong to which cluster, is extracted from CDBSQL
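The idea behind a component like ncm-fmonagent can be sketched as rendering an agent configuration file from sensor and metric definitions that would come from CDB. The input structure and the output format below are hypothetical, for illustration only.

```python
# Sketch: turn CDB-style sensor/metric definitions into an agent config file.
# Both the definition schema and the emitted syntax are invented examples.

def render_agent_config(sensors: dict, metrics: list) -> str:
    """Render one config line per sensor and per metric instance."""
    lines = []
    for name, command in sorted(sensors.items()):
        lines.append(f"sensor {name} {command}")
    for m in metrics:
        lines.append(f"metric {m['id']} {m['sensor']} {m['class']} {m['interval']}")
    return "\n".join(lines)

# Hypothetical definitions, standing in for what CDB would provide.
sensors = {"linux": "/usr/libexec/sensor-linux"}
metrics = [{"id": 4101, "sensor": "linux", "class": "CPUUtil", "interval": 30}]
print(render_agent_config(sensors, metrics))
```

The benefit the slide points at is that the agent config is never edited by hand: regenerating it from CDB keeps the monitored metrics in lockstep with the declared node configuration.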

German Cancio – CERN/IT - n° 23
Quattor-LEMON integration (II)
- Visualization of the Quattor configuration
  - Indexed CDB templates available, linked to the node and cluster status pages
  - XML profile display
- Alarm generation
  - E.g. generate an alarm if the configured kernel version differs from the actual one
- Visualization of CC equipment
  - Geometry of the CC (racks, robots, etc.)
  - Location of each node in the CC (which rack)
- Examples (CERN server)
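The kernel-version alarm mentioned above amounts to comparing the desired value from the configuration profile with the running one. A minimal sketch, with hypothetical version strings and an invented message format:

```python
# Sketch: raise an alarm text when configured and actual kernel versions differ.
import platform

def kernel_alarm(configured, actual=None):
    """Return an alarm message on mismatch, or None when the node conforms."""
    actual = actual or platform.release()  # default to the running kernel
    if actual != configured:
        return f"kernel mismatch: configured {configured}, running {actual}"
    return None

# Hypothetical example: node rebooted into an older kernel than CDB prescribes.
print(kernel_alarm("2.4.21-27", actual="2.4.21-20"))
```

Run as a Lemon-style check, this turns configuration drift into an ordinary alarm stream instead of something discovered during the next intervention.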

German Cancio – CERN/IT - n° 24

German Cancio – CERN/IT - n° 25
LEAF - LHC Era Automated Fabric
LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and LEMON:
- HMS (Hardware Management System):
  - Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
  - Automatically requests installs, retirements etc. from technicians
  - GUI to locate equipment physically
  - The HMS implementation is CERN-specific (based on Remedy workflows), but the concepts and design should be generic
- SMS (State Management System):
  - Automated handling (and tracking) of high-level configuration steps
    - Reconfigure and reboot all cluster nodes for a new kernel and/or a physical move
    - Drain and reconfigure nodes for diagnosis / repair operations
  - Issues all necessary (re)configuration commands via Quattor
  - Extensible framework: plug-ins for site-specific operations are possible

German Cancio – CERN/IT - n° 26
Use Case: move a rack of machines (workflow diagram; actors: Node, HMS, network DB, SMS, Quattor CDB, Service Manager, Technicians)
1. New location
2. Set to standby
3. Update
4. Refresh
5. Take out of production (close queues and drain jobs, disable alarms)
6. Request move
7a. Update
7b. Update
9. Install work order
10. Set to production
11. Update
12. Refresh
13. Put into production
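The sequencing above can be sketched as a small orchestration loop that runs the steps in order and records each one for tracking. The step names below paraphrase the diagram, and the no-op actions are stand-ins for real calls into HMS, SMS, CDB and the technicians' work-order system.

```python
# Sketch of workflow orchestration: run each step, keep an audit trail.
# Actions are placeholders; a real workflow would call HMS/SMS/CDB services.

def run_workflow(steps, log):
    """Execute (name, action) pairs in order, logging each completed step."""
    for name, action in steps:
        action()          # would invoke the relevant subsystem here
        log.append(name)  # audit trail, as HMS/SMS track progress

log = []
noop = lambda: None  # placeholder for real subsystem calls
steps = [
    ("set standby (SMS)", noop),
    ("update CDB / refresh node", noop),
    ("drain jobs, disable alarms", noop),
    ("request move (HMS -> technicians)", noop),
    ("install work order", noop),
    ("set production (SMS)", noop),
    ("update CDB / refresh node", noop),
    ("put into production", noop),
]
run_workflow(steps, log)
print(len(log), "steps completed")
```

A real implementation would also need failure handling at each step (e.g. re-enabling alarms if the move is cancelled), which is where the SMS plug-in framework from the previous slide comes in.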

German Cancio – CERN/IT - n° 27
LEAF Deployment
- HMS is in full production for all nodes in the CC
  - HMS was heavily used during the CC node migration (~1500 nodes)
- SMS is in production for all Quattor-managed nodes
- Current work:
  - More automation, and handling of other hardware types, for HMS
  - More service-specific SMS clients (e.g. tape and disk servers)
- Developing an 'asset management' GUI (CCTracker) with BARC
  - Multiple select, drag & drop of nodes to automatically initiate HMS moves and SMS operations
  - Interface to the LEMON GUI

German Cancio – CERN/IT - n° 28–30
Managing the Fabric
Visualize, locate and manage CC objects using high-level workflows:
- Visualize
  - physical location of equipment
  - properties
- Initiate and track workflows on hardware and services
  - e.g. add/remove/retire operations, property updates, kernel and OS upgrades, etc.

German Cancio – CERN/IT - n° 31
Summary
ELFms = Quattor + LEMON + LEAF
- ELFms is deployed in production at CERN
  - Stabilized results from 3 years of development within EDG and LCG
  - Established technology: from prototype to production
  - Consistent full-lifecycle management and a high level of automation
  - Provides real added value for day-to-day operations
- Quattor and LEMON are generic software
  - Other projects and sites are getting involved
- Site-specific workflows and 'glue scripts' can be put on top for smooth integration with existing fabric environments
  - LEAF HMS and SMS
- More information:

German Cancio – CERN/IT - n° 32
Differences with ROCKS
- ROCKS: better documentation, nice GUI, easy to set up
- Design principle: reinstall nodes in case of configuration changes
  - No configuration or software updates on running systems
  - Suited for production? Efficiency on batch nodes; upgrades / reconfigurations on 24h/7d servers (e.g. a gzip security fix, or reconfiguring the CE address on WNs)
- Assumptions on network structure (private and public parts) and node naming
- No indication of how to extend the predefined node types or the configured services
- Limited configuration capabilities (key/value)
- No multiple package versions (neither in the repository, nor simultaneously on a single node)
  - E.g. different kernel versions on specific node types
- Works only for RH Linux (Anaconda installer extensions)