Status of Fabric Management at CERN


1 Status of Fabric Management at CERN
LHC Computing Comprehensive Review 14/11/05 Presented by G. Cancio – IT-FIO

2 Fabric Management with ELFms
ELFms stands for 'Extremely Large Fabric management system'. Subsystems:
Quattor: configuration, installation and management of nodes
Lemon: system / service monitoring
LEAF: hardware / state management
ELFms manages and controls the heterogeneous CERN-CC environment
Supported OS: Linux (RHES2/3, SLC3 32/64-bit) and Solaris 9
Functionality: batch nodes, disk servers, tape servers, DB, web, …
Heterogeneous hardware: CPU, memory, HD size, …

3

4 Quattor
Quattor takes care of the configuration, installation and management of fabric nodes.
A Configuration Database (CDB) holds the 'desired state' of all fabric elements:
Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, …)
Cluster setup (name and type, batch system, load-balancing info, …)
Site setup
Defined in templates arranged in hierarchies – common properties set only once.
Autonomous management agents running on the node for:
Base installation
Service (re-)configuration
Software installation and management
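The "templates arranged in hierarchies" idea can be sketched as follows. This is an illustrative Python model, not Quattor's actual Pan template compiler: site, cluster and node settings are nested dicts merged in order, so a common property is set once at the site level and only overridden where a cluster or node differs.

```python
# Illustrative sketch (not Quattor's real Pan compiler): hierarchical
# templates as nested dicts, merged so common properties are set once
# at the site level and overridden per cluster / per node.

def merge(base, override):
    """Recursively merge 'override' on top of 'base'."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# Hypothetical example settings for a site, a cluster and a node.
site = {"os": "SLC3", "dns_domain": "cern.ch", "batch": {"system": "LSF"}}
cluster = {"cluster": "lxbatch", "batch": {"queue": "default"}}
node = {"hostname": "lxb0001", "memory_mb": 2048}

# The node's effective 'desired state' profile:
profile = merge(merge(site, cluster), node)
```

The merge order is what gives "common properties set only once": changing `site["os"]` changes every node profile that does not override it.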

5 Node Configuration Manager NCM
[Architecture diagram: a central configuration server hosts the CDB (SQL and XML back-ends; CLI, GUI and script front-ends via SOAP) and serves XML configuration profiles over HTTP to the managed nodes; SW server(s) with the SW Repository serve RPMs/PKGs over HTTP; an install server (HTTP/PXE) drives the system installer for the base OS. On each managed node, the Install Manager, the Node Configuration Manager (NCM) with its components (CompA, CompB, CompC configuring ServiceA, ServiceB, ServiceC), and the SW Package Manager (SPMA) apply the desired state.]
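The node-side flow in the diagram can be sketched roughly as below. This is a hedged illustration, not the real NCM/SPMA interfaces: the XML layout, component names and settings are all hypothetical. The agent fetches the node's XML configuration profile and lets each registered component configure its service from the relevant subtree.

```python
# Hedged sketch of the managed-node flow (XML schema and component
# names are hypothetical, not the real NCM interfaces): parse the
# node's XML configuration profile, then hand each <component>
# element to a configuration step.
import xml.etree.ElementTree as ET

PROFILE_XML = """
<profile>
  <component name="sshd"><port>22</port></component>
  <component name="ntpd"><server>ip-time-1</server></component>
</profile>
"""

def configure_component(name, element):
    # A real NCM component would (re)write config files and restart
    # the service; here we only report the desired settings.
    settings = {child.tag: child.text for child in element}
    return f"{name}: {settings}"

def run_agent(profile_xml):
    root = ET.fromstring(profile_xml)
    return [configure_component(c.get("name"), c)
            for c in root.findall("component")]

results = run_agent(PROFILE_XML)
```

The point of the split is the same as in the slide: the profile carries only desired state; all imperative work lives in per-service components on the node.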

6 Quattor at CERN (I) Quattor is in complete control of the CC Linux boxes (~ 2600 nodes). Over 100 NCM configuration components developed for full automation of (almost) all Linux services. LCG: components available for a fully automated LCG-2 configuration. EGEE is using Quattor for managing its gLite integration testbeds; gLite configuration mechanism integration underway.

7 Quattor at CERN (II)
Flexible and automated reconfiguration / reallocation of CC resources demonstrated with the ATLAS TDAQ tests:
Creation of an ATLAS DAQ / HLT test cluster
Automated configuration of specific system and network parameters according to ATLAS requirements
Automated installation of ATLAS software in RPM format
Reallocated a significant fraction of LXBATCH (700 nodes) to the new cluster during June/July
All resources reinstalled and re-integrated into LXBATCH in 36h
Linux for Control systems ("LinuxFC"): Quattor-based project for managing Linux servers used in LHC control systems
Strict requirements on system configuration and software management, e.g. versioning and cloning of configurations, rollback of configuration and software, validation and verification of configurations, remote management capabilities, …

8 Quattor outside CERN Sites using Quattor in production: 17
13 LCG sites vs. 3 non-LCG sites, including NIKHEF, DESY, CNAF, IN2P3: LAL/DAPNIA/CPPM, …
Used for managing grid and/or local services
Ranging from 4 to 600 nodes; total ~ 1000, with plans to grow to ~ 2600

9 Quattor Next Steps
Improvements in the Configuration DB:
Security: ACL support for CDB configuration templates – control who can access which templates with an ACL-based mechanism
Support for CDB namespaces ("test" vs. "production" setups, managing multiple sites)
Performance improvements (SQL back-end)
Security: deployment of secure XML profile transport (HTTPS)
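The template-ACL idea can be illustrated with a small sketch. Everything here is an assumption for illustration, not CDB's actual ACL format: template paths are matched by prefix against a table of allowed groups, and an edit is accepted only if the user belongs to one of them.

```python
# Hypothetical sketch of ACLs on configuration templates (not CDB's
# real ACL mechanism): map template path prefixes to the groups
# allowed to modify them, deny by default.

ACLS = {
    "sites/cern/": {"cern-admins"},
    "clusters/lxbatch/": {"cern-admins", "batch-operators"},
}

def may_edit(template_path, user_groups):
    """Allow the edit if an ACL prefix matching the path admits
    one of the user's groups; paths with no ACL entry are denied."""
    for prefix, allowed in ACLS.items():
        if template_path.startswith(prefix):
            return bool(allowed & set(user_groups))
    return False  # no ACL entry: deny by default
```

A namespace scheme ("test" vs. "production") fits the same model by making the namespace the first path component.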

10

11 Lemon
Lemon (LHC Era Monitoring) is a client-server tool suite for monitoring status and performance, comprising:
a monitoring agent running on each node and sending data to the central repository
sensors to measure the values of various metrics (managed by the agent) – several sensors exist for node performance, process, HW and SW monitoring, database monitoring, security, alarms, and "external" metrics, e.g. power consumption
a central repository to store the full monitoring history (two implementations, Oracle- or flat-file-based)
an RRD/web-based display framework
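The agent/sensor split above can be sketched as follows. Class names, metrics and the repository interface are illustrative assumptions, not Lemon's real API: the agent periodically asks each registered sensor for a sample and forwards timestamped (node, metric, value) records to a repository.

```python
# Minimal sketch of an agent managing pluggable sensors (names are
# illustrative, not Lemon's real API).
import time

class LoadSensor:
    metric = "load_avg"
    def sample(self):
        # A real sensor would read /proc/loadavg; fixed value here.
        return 0.42

class DiskSensor:
    metric = "disk_free_gb"
    def sample(self):
        return 120.0

class Agent:
    def __init__(self, node, sensors, repository):
        self.node = node
        self.sensors = sensors
        self.repository = repository  # any callable accepting a record

    def collect_once(self):
        """One sampling cycle: query every sensor, forward the records."""
        now = int(time.time())
        for sensor in self.sensors:
            self.repository((now, self.node, sensor.metric, sensor.sample()))

records = []
agent = Agent("lxb0001", [LoadSensor(), DiskSensor()], records.append)
agent.collect_once()
```

Keeping the repository behind a callable is what allows the two back-end implementations (Oracle or flat file) mentioned in the slide to be swapped without touching the sensors.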

12 Architecture
[Architecture diagram: monitoring agents with their sensors on the nodes send samples over TCP/UDP to the central Monitoring Repository (SQL or RRDTool back-end); correlation engines, the Lemon CLI (SOAP) and an RRDTool/PHP display served by Apache over HTTP read from it; users access the display from their workstations with a web browser.]

13 Lemon Web interface Using a web-based status display: CC Overview

14 Lemon Web interface Using a web-based status display: CC Overview
Clusters and nodes

15 Lemon Web interface Using a web-based status display: CC Overview
Clusters and nodes VOs

16 Lemon Web interface Using a web-based status display: CC Overview
Clusters and nodes VOs Batch system

17 Lemon Web interface Using a web-based status display: CC Overview
Clusters and nodes VOs Batch system Database (Oracle) Monitoring

18 Lemon Deployment CERN Computer Centre:
~ 400 metrics sampled every 30s -> 1d; ~ 1.5 GB of data / day on ~ nodes
Interfaced to Quattor: monitoring configuration via CDB; discrepancy detection
Outside CERN-CC:
LCG sites (180 sites with 1,100 nodes – used by GridIce)
AB department at CERN (~ 100 nodes), CMS online (64 nodes, planning for 400+)
Others (TU Aachen, S3group/US, BARC India; evaluations by IN2P3, CNAF)

19 Lemon Next Steps
Service-based views (user perspective):
Synoptic view of what services are running and how – appropriate for end users and managers
Needs to be built on top of Quattor and Lemon; will require a separate service definition DB
Alarm system for operators:
Allow operators to receive, acknowledge, ignore, hide and process alarms received via Lemon
Alarm reduction facilities
Security:
SSL (RSA, DSA or X509) based authentication and possibility of encryption of data between agent and server
Access: XML-based secure access to Repository data
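The operator alarm handling and "alarm reduction" above can be sketched like this. States and the API are assumptions, not the real Lemon alarm system: repeated alarms for the same (node, metric) pair are collapsed into one entry with a counter, and operators move each alarm through acknowledge/hide actions.

```python
# Hedged sketch of operator alarm handling (states and methods are
# assumptions, not Lemon's real alarm system).

class AlarmConsole:
    def __init__(self):
        self.alarms = {}  # (node, metric) -> {"state": ..., "count": n}

    def receive(self, node, metric):
        key = (node, metric)
        if key in self.alarms:
            self.alarms[key]["count"] += 1  # alarm reduction: no duplicate row
        else:
            self.alarms[key] = {"state": "new", "count": 1}

    def acknowledge(self, node, metric):
        self.alarms[(node, metric)]["state"] = "acknowledged"

    def hide(self, node, metric):
        self.alarms[(node, metric)]["state"] = "hidden"

console = AlarmConsole()
console.receive("lxb0001", "disk_full")
console.receive("lxb0001", "disk_full")   # reduced into the first alarm
console.acknowledge("lxb0001", "disk_full")
```

Reduction matters at this scale: thousands of nodes raising the same alarm should appear to the operator as one actionable entry, not thousands.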

20

21 LEAF - LHC Era Automated Fabric
LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and Lemon:
HMS (Hardware Management System): tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
HMS implementation is CERN-specific (based on Remedy workflows), but concepts and design should be generic
SMS (State Management System): automated handling (and tracking) of high-level configuration steps, e.g. reconfigure and reboot all cluster nodes for a new kernel and/or a physical move
Heavily used during this year's ATLAS TDAQ tests
GUI for HMS/SMS being developed: CCTracker
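The SMS idea of tracked, high-level state changes can be sketched as a small state machine. The state names and transitions are illustrative assumptions, not LEAF's actual lifecycle model: nodes may only move along allowed transitions, and every change is recorded for tracking.

```python
# Illustrative sketch of an SMS-style state workflow (state names are
# assumptions, not LEAF's actual model).

ALLOWED = {
    "production":   {"draining"},
    "draining":     {"maintenance"},
    "maintenance":  {"reinstalling", "retired"},
    "reinstalling": {"production"},
}

class StateManager:
    def __init__(self):
        self.state = {}     # node -> current state
        self.history = []   # (node, old, new) audit trail

    def set_state(self, node, new_state):
        current = self.state.get(node, "maintenance")  # assumed initial state
        if new_state not in ALLOWED.get(current, set()):
            raise ValueError(f"{node}: {current} -> {new_state} not allowed")
        self.state[node] = new_state
        self.history.append((node, current, new_state))

sms = StateManager()
sms.set_state("lxb0001", "reinstalling")   # maintenance -> reinstalling
sms.set_state("lxb0001", "production")     # reinstalling -> production
```

Encoding the transitions explicitly is what makes a bulk operation like "reconfigure and reboot all cluster nodes for a new kernel" safe to automate and audit.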

22 CCTracker Visualize, locate and manage CC objects using high-level workflows Visualize physical location of equipment

23 CCTracker Visualize, locate and manage CC objects using high-level workflows Visualize physical location of equipment properties

24 CCTracker Visualize, locate and manage CC objects using high-level workflows Visualize physical location of equipment properties Initiate and track workflows on hardware and services e.g. add/remove/retire operations, update properties, kernel and OS upgrades, etc

25 Summary
ELFms in smooth operation at CERN and other T1/T2 institutes, for grid and local services
New domains being entered:
Online DAQ/HLT farms: ATLAS TDAQ/HLT tests
Accelerator controls: LinuxFC project
Core framework developments finished and matured, but work still to be done:
CDB extensions
Displays for operator console and service status views
Security

