Fabric Management with ELFms BARC-CERN collaboration meeting B.A.R.C. Mumbai 28/10/05 Presented by G. Cancio – CERN/IT.

1 Fabric Management with ELFms
BARC-CERN collaboration meeting, B.A.R.C. Mumbai, 28/10/05
Presented by G. Cancio – CERN/IT

2 German Cancio – CERN/IT - n° 2 Outline
- The ELFms framework
  - Quattor
  - Lemon
  - LEAF
- Deployment status

3 Fabric Management with ELFms (I)
ELFms stands for ‘Extremely Large Fabric management system’.
Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
- ELFms manages and controls most of the nodes in the CERN CC
  - ~2600 nodes out of ~3500
  - Multiple functionality and cluster size (batch nodes, disk servers, tape servers, DB, web, …)
  - Heterogeneous hardware (CPU, memory, HD size, …)
  - Supported OS: Linux (RH7, RHES2/3/4, Scientific Linux 3/4, 32/64-bit) and Solaris
[Diagram labels: Node Configuration Management, Node Management]

4 Fabric Management with ELFms (II)
- ELFms (Quattor/Lemon) were started in the scope of EU DataGrid.
- Development is now coordinated by CERN/IT in collaboration with other HEP institutes
- Quattor/Lemon are used in production in/outside CERN
  - LCG T1/T2 sites, ranging from 50-800 nodes/site
  - Complete configuration of system and LCG Grid middleware via Quattor
  - Integration with Grid services, e.g. monitoring (GridICE, MonALISA)

5 http://quattor.org

6 Quattor
Quattor takes care of the configuration, installation and management of fabric nodes.
- A Configuration Database holds the ‘desired state’ of all fabric elements
  - Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info…)
  - Cluster (name and type, batch system, load balancing info…)
- Autonomous management agents running on the node for
  - Base installation
  - Service (re-)configuration
  - Software installation and management
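The ‘desired state’ idea can be illustrated with a minimal Python sketch. It is not SPMA code: the function name `plan_package_changes`, the package names and the version strings are all hypothetical, and real package handling (dependencies, multiple installed versions) is left out. It only shows how an agent could derive actions by comparing the desired package list with what is installed.

```python
from typing import Dict, List, Tuple


def plan_package_changes(
    desired: Dict[str, str], installed: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (action, package, version) steps that bring `installed` to `desired`."""
    plan = []
    for name, version in sorted(desired.items()):
        if name not in installed:
            plan.append(("install", name, version))
        elif installed[name] != version:
            # Covers upgrades and downgrades alike: the desired version wins.
            plan.append(("replace", name, version))
    for name, version in sorted(installed.items()):
        if name not in desired:
            plan.append(("remove", name, version))
    return plan


# Hypothetical example data, not taken from a real node profile.
desired = {"lsf-client": "5.1", "kdelibs": "3.3.1", "openssh": "3.9"}
installed = {"lsf-client": "4.2", "openssh": "3.9", "telnet-server": "0.17"}

plan = plan_package_changes(desired, installed)
for action, name, version in plan:
    print(action, name, version)
```

Because the plan is computed from the difference between the two states, running the same comparison on an already compliant node yields an empty plan.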

7 Architecture
[Diagram: Quattor architecture]
- Configuration server: CDB with SQL backend (SQL) and XML backend; CLI, GUI and scripts (SOAP); XML configuration profiles served over HTTP
- Install server: Install Manager, system installer; base OS over HTTP / PXE
- SW server(s): SW Repository, RPMs over HTTP
- Managed nodes: Node Configuration Manager (NCM) with CompA, CompB, CompC for ServiceA, ServiceB, ServiceC; SW Package Manager (SPMA) for RPMs / PKGs

8 Configuration Information
- Configuration is expressed using a language called Pan
- Information is arranged into templates
  - Common properties set only once
- Using templates it is possible to create hierarchies to match service structures
[Diagram: template hierarchy]
- CERN CC: name_srv1: 137.138.16.5, time_srv1: ip-time-1
  - lxbatch: cluster_name: lxbatch, master: lxmaster01, pkg_add (lsf5.1)
  - lxplus: cluster_name: lxplus, pkg_add (lsf5.1)
    - lxplus001: eth0/ip: 137.138.4.246, pkg_add (lsf6_beta)
    - lxplus020: eth0/ip: 137.138.4.225
    - lxplus029
  - disk_srv
  - …
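The way common properties are set once and inherited down the hierarchy can be sketched in Python. This is not Pan and not how the Pan compiler works; the `TEMPLATES` structure and the `resolve` function are hypothetical, and the sketch assumes that `pkg_add` entries simply accumulate from the site level down to the node. The values are the ones shown on the slide.

```python
# Each template names its parent and contributes values and packages.
TEMPLATES = {
    "cern_cc": {
        "parent": None,
        "values": {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1"},
        "pkg_add": [],
    },
    "lxplus": {
        "parent": "cern_cc",
        "values": {"cluster_name": "lxplus"},
        "pkg_add": ["lsf5.1"],
    },
    "lxplus001": {
        "parent": "lxplus",
        "values": {"eth0/ip": "137.138.4.246"},
        "pkg_add": ["lsf6_beta"],
    },
}


def resolve(name):
    """Merge a template with all its ancestors; more specific levels override."""
    chain = []
    while name is not None:
        chain.append(TEMPLATES[name])
        name = TEMPLATES[name]["parent"]
    profile = {"pkgs": []}
    for template in reversed(chain):  # site first, node last
        profile.update(template["values"])
        profile["pkgs"].extend(template["pkg_add"])
    return profile


node_profile = resolve("lxplus001")
print(node_profile)
```

The node profile for lxplus001 ends up with the site-wide name and time servers, the cluster name from lxplus, and its own IP address, without any of the shared values being repeated in the node template.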

9 Quattor Deployment
- Quattor in complete control of Linux boxes (~2600 nodes, to grow to ~6000-8000 in 2008)
- CDB holding information of all systems in CERN-CC
- Over 90 NCM configuration components developed
  - From basic system configuration to Grid services setup… (including desktops)
- SPMA used for managing all software
  - ~2 weekly security and functional updates (including kernel upgrades)
  - E.g. KDE security upgrade (~300 MB per node) and LSF client upgrade (v4 to v5) in 15 mins, without service interruption
  - Handles (occasional) downgrades as well
- Developments ongoing:
  - CDB: Fine-grained ACL protection to templates, namespaces, stronger typing, improved SQL/XMLDB backend …
  - Security: Deployment of HTTPS instead of HTTP (usage of host certificates)
  - Re-engineering of Software Repository (BARC)
- Proxy architecture for enhanced scalability …

10 Proxy server setup
[Diagram: proxy hierarchy serving installation images, RPMs and configuration profiles]
- Backend (“Master”): M, M’
- Frontend: server cluster of L1 proxies, DNS-load balanced HTTP
- L2 proxies (“Head” nodes): H, H, H, … in Rack 1, Rack 2, …, Rack N

11 Quattor @ LCG/EGEE
- EGEE and LCG have chosen quattor for managing their integration testbeds
- Components available for a fully automated LCG-2 configuration
- Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, …) adopt quattor as fabric management framework…
  - In India: BARC, VECCAL (ALICE experiment)
- … leading to improved core software robustness and completeness
  - Identified and removed site dependencies and assumptions
  - Documentation, installation guides, bug tracking, release cycles

12 http://cern.ch/lemon

13 Lemon – LHC Era Monitoring
[Diagram: Lemon architecture]
- Nodes: Sensors and Monitoring Agent, sending to the Monitoring Repository over TCP/UDP
- Monitoring Repository: repository backend (SQL), SOAP access
- Correlation Engines
- User: Lemon CLI; web browser on user workstations over HTTP (apache, RRDTool / PHP)

14 Deployment and Enhancements
- Smooth production running of Monitoring Agent and Oracle-based repository at CERN-CC
  - ~200 metrics sampled at intervals ranging from 30 s to 1 day; ~1 GB of data / day on ~1800 nodes
  - No aging-out of data but archiving on MSS (CASTOR)
- Usage outside CERN-CC, collaborations
  - GridICE (>100 LCG sites)
  - CMS-Online
  - IN2P3
  - Others…
- Hardened and enhanced EDG software
  - Rich sensor set (from general to service specific, e.g. IPMI/SMART for disk/tape…)
  - Generic multi-purpose sensor by BARC
- Correlation and Fault Recovery
  - Light-weight local self-healing module (e.g. /tmp cleanup, restart daemons)
  - Being re-engineered by BARC
- Security for sample transport (TCP and UDP) (BARC)
- Status and performance visualization pages …

15-20 Monitoring the Fabric
Using a web-based status display (one view added per slide):
- CC Overview
- Clusters and nodes
- VO’s
- Power
- Error trending
- Batch system

21 Next Steps…
- Service based views (user/mgmt perspective)
  - Synoptic view of which services are running, and how – appropriate for end users and managers
  - Needs to be built on top of Quattor and Lemon
  - Would require a separate service definition DB
- Alarm system for operators
  - Allow 24/24h 7/7d operators to receive, acknowledge, ignore, hide, process alarms received via Lemon
  - Integrated into the Lemon Status pages

22 Quattor-LEMON integration
Quattor and Lemon are tightly integrated at CERN
- Configuration of Lemon Agent and Server:
  - CDB holds definitions of all sensors, metric classes, and metric instances
  - An NCM component (ncm-fmonagent) generates the Agent config file
  - Another NCM component updates the Oracle Server configuration
- Configuration of Lemon Web Pages:
  - Information on what clusters exist, and what nodes belong to which cluster, is extracted from CDBSQL

23 Quattor-LEMON integration (II)
- Visualization of Quattor configuration
  - Indexed CDB templates available, linked to node and cluster status pages
  - XML profiles display
- Alarm generation
  - E.g. generate an alarm if the configured kernel version differs from the actual one
- Visualization of CC equipment
  - Geometry of CC (racks, robots, etc)
  - Location of each node in the CC (what rack)
- Examples (CERN server)
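The kernel-version alarm mentioned above can be sketched as a simple comparison. This is an illustration only, not a Lemon sensor: the function name `kernel_alarm`, the message format and the example version strings are hypothetical. The running kernel is read with Python's standard `platform.release()`, and the configured value is assumed to come from the node's configuration profile.

```python
import platform
from typing import Optional


def kernel_alarm(configured: str, actual: Optional[str] = None) -> Optional[str]:
    """Return an alarm message if the running kernel differs from the configured one."""
    if actual is None:
        # Kernel release of the machine this code runs on.
        actual = platform.release()
    if configured != actual:
        return f"ALARM: kernel mismatch (configured {configured}, running {actual})"
    return None


# Hypothetical version strings, passed explicitly so the result is reproducible.
mismatch = kernel_alarm("2.4.21-37.EL", actual="2.4.21-32.EL")
match = kernel_alarm("2.4.21-37.EL", actual="2.4.21-37.EL")
print(mismatch)
print(match)
```

A typical cause of a mismatch is a kernel upgrade that has been installed but not yet activated by a reboot, which is exactly the kind of state an operator wants flagged.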

24 http://cern.ch/leaf

25 LEAF - LHC Era Automated Fabric
- LEAF is a collection of workflows for high level node hardware and state management, on top of Quattor and LEMON:
- HMS (Hardware Management System):
  - Track systems through all physical steps in lifecycle, e.g. installation, moves, vendor calls, retirement
  - Automatically requests installs, retires etc. to technicians
  - GUI to locate equipment physically
  - HMS implementation is CERN specific (based on Remedy workflows), but concepts and design should be generic
- SMS (State Management System):
  - Automated handling (and tracking of) high-level configuration steps
    - Reconfigure and reboot all cluster nodes for new kernel and/or physical move
    - Drain and reconfig nodes for diagnosis / repair operations
  - Issues all necessary (re)configuration commands via Quattor
  - Extensible framework – plug-ins for site-specific operations possible

26 Use Case: Move rack of machines
[Diagram: interactions between Node, HMS, NW DB, SMS, Quattor CDB, ServiceMgr and Technicians]
1. New location
2. Set to standby
3. Update
4. Refresh
5. Take out of production
   - Close queues and drain jobs
   - Disable alarms
6. Request move
7a. Update
7b. Update
9. Install work order
10. Set to production
11. Update
12. Refresh
13. Put into production
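The state changes a node goes through in this use case can be sketched as follows. This is a simplified illustration, not LEAF code: the `Node` class, the `move_node` function, the rack names and the grouping of the numbered steps into three phases are assumptions made for the example, and the interactions with HMS, the network database and CDB are reduced to log entries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    location: str
    state: str = "production"
    alarms_enabled: bool = True
    log: List[str] = field(default_factory=list)


def move_node(node: Node, new_location: str) -> Node:
    """Walk one node through standby, physical move, and return to production."""
    # Steps 2-5: set to standby, drain jobs, disable alarms.
    node.state = "standby"
    node.alarms_enabled = False
    node.log.append("taken out of production")
    # Steps 6-9: the physical move itself is done by technicians.
    node.location = new_location
    node.log.append(f"moved to {new_location}")
    # Steps 10-13: set to production again and re-enable alarms.
    node.state = "production"
    node.alarms_enabled = True
    node.log.append("put into production")
    return node


# Hypothetical node and rack names.
node = move_node(Node(name="lxbatch001", location="rack-A"), "rack-B")
print(node.state, node.location, node.log)
```

Keeping the drain and alarm handling in the same workflow as the move is what lets a whole rack be relocated without operators being flooded with alarms from nodes that are deliberately offline.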

27 LEAF Deployment
- HMS in full production for all nodes in CC
  - HMS heavily used during CC node migration (~1500 nodes)
- SMS in production for all quattor managed nodes
- Current work:
  - More automation, and handling of other HW types for HMS
  - More service specific SMS clients (e.g. tape & disk servers)
- Developing ‘asset management’ GUI (CCTracker) -> BARC
  - Multiple select, drag&drop nodes to automatically initiate HMS moves and SMS operations
  - Interface to LEMON GUI

28-30 Managing the Fabric
Visualize, locate and manage CC objects using high-level workflows (items added slide by slide):
- Visualize
  - physical location of equipment
  - properties
- Initiate and track workflows on hardware and services
  - e.g. add/remove/retire operations, update properties, kernel and OS upgrades, etc

31 Summary
- ELFms is deployed in production at CERN
  - Stabilized results from 3-year developments within EDG and LCG
  - Established technology - from Prototype to Production
  - Consistent full-lifecycle management and high automation level
  - Providing real added value for day-to-day operations
- Quattor and LEMON are generic software
  - Other projects and sites getting involved
- Site-specific workflows and “glue scripts” can be put on top for smooth integration with existing fabric environments
  - LEAF HMS and SMS
- More information: http://cern.ch/elfms

32 Differences with ROCKS
- Rocks: better documentation, nice GUI, easy to set up
- Design principle: reinstall nodes in case of configuration changes
  - No configuration or software updates on running systems
  - Suited for production? Efficiency on batch nodes, upgrades / reconfigs on 24/24, 7/7 servers (e.g. gzip security fix, reconfig of CE address on WN’s)
- Assumptions on network structure (private, public parts) and node naming
- No indication on how to extend the predefined node types or extend the configured services
- Limited configuration capacities (key/value)
- No multiple package versions (neither on repository, nor simultaneously on a single node)
  - E.g. different kernel versions on specific node types
- Works only for RH Linux (Anaconda installer extensions)

