Status of Fabric Management at CERN

Status of Fabric Management at CERN
LHC Computing Comprehensive Review, 14/11/05
Presented by G. Cancio – IT-FIO

Fabric Management with ELFms
ELFms stands for 'Extremely Large Fabric management system'. Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
ELFms manages and controls the heterogeneous CERN Computer Centre environment:
- Supported OS: Linux (RHES 2/3, SLC3 32/64 bit) and Solaris 9
- Functionality: batch nodes, disk servers, tape servers, DB, web, …
- Heterogeneous hardware: CPU, memory, disk size, …

http://quattor.org

Quattor
Quattor takes care of the configuration, installation and management of fabric nodes.
A Configuration Database (CDB) holds the 'desired state' of all fabric elements:
- Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, …)
- Cluster setup (name and type, batch system, load-balancing info, …)
- Site setup
Configurations are defined in templates arranged in hierarchies, so common properties are set only once (see the sketch below).
Autonomous management agents run on each node for:
- Base installation
- Service (re-)configuration
- Software installation and management
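The hierarchy idea (site-wide defaults, refined per cluster and per node) can be illustrated with a minimal sketch. This is a conceptual illustration in Python, not the actual Pan template language used by Quattor; all keys, cluster and host names are hypothetical.

```python
# Conceptual sketch of hierarchical configuration templates (not Pan syntax).
# Common properties are set once at the site level and refined per cluster/node.

def merge(*layers):
    """Merge configuration layers; later layers override earlier ones."""
    result = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge(result[key], value)
            else:
                result[key] = value
    return result

site_defaults = {
    "os": "SLC3",
    "services": {"ssh": "on", "afs": "on"},
}

cluster_lxbatch = {
    "cluster": {"name": "lxbatch", "type": "batch"},
    "services": {"lsf": "on"},            # batch nodes also run the batch client
}

node_overrides = {
    "network": {"hostname": "lxb0001.cern.ch"},
}

# The node's 'desired state' is the merge of all layers, most specific last.
profile = merge(site_defaults, cluster_lxbatch, node_overrides)
print(profile)
```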

Architecture
[Architecture diagram] A configuration server exposes the CDB (SQL and XML back-ends; accessed via CLI, GUI, scripts and SOAP) and serves XML configuration profiles over HTTP. Software servers hold an SW Repository of RPMs, and an install server provides the base OS via HTTP/PXE through the Install Manager and system installer. On the managed nodes, the Node Configuration Manager (NCM) runs configuration components (CompA, CompB, CompC) that configure the corresponding services, while the SW Package Manager (SPMA) handles RPMs/PKGs.
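A minimal sketch of the node-side flow implied by the diagram: fetch the node's XML profile from the configuration server over HTTP and hand the per-service sections to configuration components. The URL scheme, profile layout and component behaviour here are assumptions for illustration, not the actual NCM interfaces.

```python
# Minimal sketch of a node fetching its configuration profile and dispatching
# the per-service sections to configuration components. Hypothetical layout.
import urllib.request
import xml.etree.ElementTree as ET

PROFILE_URL = "http://config-server.example.org/profiles/{host}.xml"  # assumed URL scheme

def fetch_profile(hostname: str) -> ET.Element:
    """Download and parse the XML configuration profile for this node."""
    with urllib.request.urlopen(PROFILE_URL.format(host=hostname)) as resp:
        return ET.fromstring(resp.read())

def configure_service(name: str, settings: dict) -> None:
    """Stand-in for an NCM-like component: apply settings for one service."""
    print(f"configuring {name}: {settings}")

def apply_profile(profile: ET.Element) -> None:
    # Assume one <service name="..."> element per managed service.
    for service in profile.findall("./services/service"):
        settings = {opt.get("key"): opt.get("value") for opt in service.findall("option")}
        configure_service(service.get("name"), settings)

if __name__ == "__main__":
    apply_profile(fetch_profile("lxb0001"))
```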

Quattor at CERN (I)
Quattor is in complete control of the CC Linux boxes (~2600 nodes).
Over 100 NCM configuration components have been developed for full automation of (almost) all Linux services.
LCG: components are available for a fully automated LCG-2 configuration.
EGEE: Quattor is used for managing the gLite integration testbeds; integration of the gLite configuration mechanism is underway.

Quattor at CERN (II)
Flexible and automated reconfiguration / reallocation of CC resources was demonstrated with the ATLAS TDAQ tests:
- Creation of an ATLAS DAQ / HLT test cluster
- Automated configuration of specific system and network parameters according to ATLAS requirements
- Automated installation of ATLAS software in RPM format
- A significant fraction of LXBATCH (700 nodes) was reallocated to the new cluster during June/July
- All resources were reinstalled and re-integrated into LXBATCH in 36 h
Linux for Control systems ("LinuxFC"):
- Quattor-based project for managing Linux servers used in LHC control systems
- Strict requirements on system configuration and software management, e.g. versioning and cloning of configurations, rollback of configuration and software, validation and verification of configurations, remote management capabilities, …

Quattor outside CERN
Sites using Quattor in production: 17 (13 LCG sites vs. 3 non-LCG sites), including NIKHEF, DESY, CNAF, IN2P3 (LAL/DAPNIA/CPPM), …
Used for managing grid and/or local services.
Installations range from 4 to 600 nodes; ~1000 nodes in total, with plans to grow to ~2600.

Quattor Next Steps
Improvements in the Configuration DB:
- Security: ACL support for CDB configuration templates, to control who can access which templates
- Support for CDB namespaces ("test" and "production" setups, managing multiple sites)
- Performance improvements (SQL back-end)
Security: deployment of secure XML profile transport (HTTPS).

http://cern.ch/lemon

Lemon
Lemon (LHC Era Monitoring) is a client-server tool suite for monitoring status and performance, comprising:
- a monitoring agent running on each node and sending data to the central repository (sketched below)
- sensors (managed by the agent) to measure the values of various metrics; sensors exist for node performance, process, hardware and software monitoring, database monitoring, security, alarms, and "external" metrics such as power consumption
- a central repository storing the full monitoring history, with two implementations (Oracle or flat-file based)
- an RRD/web based display framework
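As an illustration of the agent/sensor split, here is a minimal Python sketch of a sensor sampling one metric and an agent shipping samples to a repository over UDP. The metric name, message format and repository address are assumptions, not the actual Lemon protocol.

```python
# Minimal sketch of a sensor + agent loop (illustrative only, not the Lemon protocol).
import json
import socket
import time

REPOSITORY = ("monitoring-repo.example.org", 12409)  # hypothetical host/port

def sample_load_average() -> float:
    """A trivial 'sensor': read the 1-minute load average from /proc."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def run_agent(hostname: str, interval_s: int = 30) -> None:
    """Sample the metric every interval and push one UDP datagram per sample."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sample = {
            "node": hostname,
            "metric": "load_1min",       # hypothetical metric name
            "value": sample_load_average(),
            "timestamp": int(time.time()),
        }
        sock.sendto(json.dumps(sample).encode(), REPOSITORY)
        time.sleep(interval_s)

if __name__ == "__main__":
    run_agent(socket.gethostname())
```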

Architecture
[Architecture diagram] Monitoring agents with their sensors run on the nodes and send samples over TCP/UDP to the Monitoring Repository, which has SQL and SOAP back-ends. Correlation engines process the collected data. Users query the repository through the Lemon CLI or, from their workstations, through a web browser served by an RRDTool/PHP display behind Apache (HTTP).
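The repository side can be sketched the same way: a listener receiving the agents' datagrams and appending them to a flat-file store (one of the two repository implementations mentioned above). The port and record format are the same hypothetical ones used in the agent sketch.

```python
# Minimal sketch of a flat-file monitoring repository (illustrative only).
import json
import socket

LISTEN_ADDR = ("0.0.0.0", 12409)          # must match the agents' hypothetical port
STORE = "monitoring-history.log"          # flat-file store, one JSON record per line

def run_repository() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN_ADDR)
    with open(STORE, "a") as store:
        while True:
            datagram, addr = sock.recvfrom(65536)
            sample = json.loads(datagram)          # {"node": ..., "metric": ..., ...}
            store.write(json.dumps(sample) + "\n") # keep the full monitoring history
            store.flush()

if __name__ == "__main__":
    run_repository()
```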

Lemon Web interface
A web-based status display provides views of:
- CC Overview
- Clusters and nodes
- VOs
- Batch system
- Database (Oracle) monitoring

Lemon Deployment
CERN Computer Centre: ~400 metrics per node, sampled at intervals from 30 s to 1 day; ~1.5 GB of data per day from ~2600 nodes.
Interfaced to Quattor: the monitoring configuration is kept in CDB, and discrepancies against CDB are detected.
Outside the CERN CC:
- LCG sites (180 sites with 1,100 nodes – used by GridIce)
- AB department at CERN (~100 nodes), CMS online (64 nodes, planning for 400+)
- Others (TU Aachen, S3 Group/US, BARC India, evaluations by IN2P3, CNAF)
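As a rough consistency check of the quoted data volume, the figures can be reconciled under assumed averages; only the node count, metric count and GB/day come from the slide, while the average sampling period and bytes per stored sample below are illustrative assumptions.

```python
# Back-of-envelope check of the quoted data volume, under assumed averages.
# Only nodes, metrics_per_node and the ~1.5 GB/day target come from the slide;
# avg_period_s and bytes_per_sample are assumptions chosen for illustration.
nodes = 2600
metrics_per_node = 400
avg_period_s = 5 * 60        # assumed average over the 30 s .. 1 day range
bytes_per_sample = 5         # assumed average size of one stored sample

samples_per_day = nodes * metrics_per_node * (24 * 3600 // avg_period_s)
gb_per_day = samples_per_day * bytes_per_sample / 1e9
print(f"{samples_per_day:,} samples/day  ->  ~{gb_per_day:.1f} GB/day")
# ~300 million samples/day -> ~1.5 GB/day, consistent with the quoted figure.
```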

Lemon Next Steps
Service-based views (user perspective):
- Synoptic view of which services are running and how, appropriate for end users and managers
- Needs to be built on top of Quattor and Lemon; will require a separate service-definition DB
Alarm system for operators:
- Allow operators to receive, acknowledge, ignore, hide and process alarms received via Lemon
- Alarm-reduction facilities
Security:
- SSL (RSA, DSA or X.509) based authentication and, optionally, encryption of data between agent and server
- Access: XML-based secure access to repository data

http://cern.ch/leaf

LEAF - LHC Era Automated Fabric
LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and Lemon:
HMS (Hardware Management System):
- Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
- The HMS implementation is CERN-specific (based on Remedy workflows), but the concepts and design should be generic
SMS (State Management System):
- Automated handling (and tracking) of high-level configuration steps, e.g. reconfigure and reboot all cluster nodes for a new kernel and/or a physical move (a minimal state-machine sketch follows below)
- Heavily used during this year's ATLAS TDAQ tests
A GUI for HMS/SMS is being developed: CCTracker.
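A minimal sketch of the kind of state tracking SMS performs: nodes move between a few high-level states, and only certain transitions are allowed, each change being recorded. The state names and transition rules here are hypothetical examples, not the actual SMS state model.

```python
# Minimal sketch of high-level node state management (hypothetical states/transitions).
ALLOWED_TRANSITIONS = {
    "new":         {"installing"},
    "installing":  {"production"},
    "production":  {"draining", "maintenance"},
    "draining":    {"maintenance", "retired"},
    "maintenance": {"installing", "production", "retired"},
    "retired":     set(),
}

class Node:
    def __init__(self, name: str):
        self.name = name
        self.state = "new"
        self.history = [("new", "initial registration")]

    def set_state(self, new_state: str, reason: str) -> None:
        """Apply a state change only if the transition is allowed, and record it."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.name}: {self.state} -> {new_state} not allowed")
        self.state = new_state
        self.history.append((new_state, reason))

# Example: take a batch node out of production for a kernel upgrade.
node = Node("lxb0001")
node.set_state("installing", "initial installation")
node.set_state("production", "added to lxbatch")
node.set_state("maintenance", "kernel upgrade")
node.set_state("production", "back in lxbatch after reboot")
print(node.history)
```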

CCTracker
Visualize, locate and manage CC objects using high-level workflows:
- Visualize the physical location of equipment and its properties
- Initiate and track workflows on hardware and services, e.g. add/remove/retire operations, property updates, kernel and OS upgrades, etc.

Summary
ELFms is in smooth operation at CERN and at other T1/T2 institutes, for grid and local services.
New domains are being entered:
- Online DAQ/HLT farms: ATLAS TDAQ/HLT tests
- Accelerator controls: LinuxFC project
The core framework developments are finished and mature, but work remains:
- CDB extensions
- Displays for the Operator Console and Service Status Views
- Security
http://cern.ch/elfms