1 Current Status of Fabric Management at CERN, 26/7/2004
CHEP 2004, Interlaken, 27/9/2004
CERN IT/FIO: G. Cancio, T. Kleinwort, W. Tomlin et al.
Presented by G. Cancio

2 Outline
Current Status of Fabric Management at CERN – CHEP 2004 – German Cancio et al. - n° 2
- CERN-CC and the ELFms framework
  - Quattor
  - Lemon
  - LEAF
- Deployment status

3 Fabric Management with ELFms
ELFms stands for ‘Extremely Large Fabric management system’.
Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
- ELFms manages and controls most of the nodes in the CERN CC
  - ~2100 nodes out of ~2700
  - Multiple functionality and cluster size (batch nodes, disk servers, tape servers, DB, web, …)
  - Heterogeneous hardware (CPU, memory, HD size, ...)
  - Supported OS: Linux (RH7, RHES2.1, Scientific Linux 3 – IA32 & IA64) and Solaris (9)
[Diagram labels: Node Configuration Management, Node Management]

4 http://quattor.org

5 Quattor
Quattor takes care of the configuration, installation and management of fabric nodes.
- A Configuration Database holds the ‘desired state’ of all fabric elements
  - Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info, …)
  - Cluster (name and type, batch system, load balancing info, …)
  - Defined in templates arranged in hierarchies: common properties are set only once
- Autonomous management agents running on the node for
  - Base installation
  - Service (re-)configuration
  - Software installation and management
Quattor was initially developed in the scope of EU DataGrid. Development and maintenance are now coordinated by CERN/IT.
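
The idea of templates arranged in hierarchies, with common properties set only once, can be illustrated with a minimal Python sketch. This is an analogy only: it is not the Quattor template language, and the template names and values (site_defaults, batch_cluster, node_specific) are hypothetical.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` is layered on top of `base`."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Nested sections are merged key by key, not replaced wholesale.
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def build_profile(*templates: dict) -> dict:
    """Layer templates from the most general to the most specific."""
    profile: dict = {}
    for template in templates:
        profile = merge(profile, template)
    return profile


# Hypothetical templates: names and values are illustrative only.
site_defaults = {"network": {"domain": "example.org"}, "services": {"ssh": True}}
batch_cluster = {"cluster": {"type": "batch"}, "services": {"batch_client": True}}
node_specific = {"hardware": {"memory_mb": 1024}, "services": {"ssh": False}}

profile = build_profile(site_defaults, batch_cluster, node_specific)
print(profile)
```

Here the domain is stated once in the site-wide template and inherited by every node, while the node-level template overrides only the single value it needs to change.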

6 Architecture
[Diagram; recoverable labels:]
- Configuration server: CDB, SQL backend, XML backend; SQL, CLI, GUI, scripts, SOAP; XML configuration profiles; HTTP
- Install server: Install Manager, system installer, base OS; HTTP / PXE
- SW server(s): SW Repository, RPMs; HTTP
- Managed nodes: Node Configuration Manager (NCM) with CompA, CompB, CompC and ServiceA, ServiceB, ServiceC; SW Package Manager (SPMA) with RPMs / PKGs

7 Quattor Deployment
- Quattor in complete control of Linux boxes (~2100 nodes, to grow to ~8000 in 2006-8)
  - Replacement of legacy tools (SUE and ASIS) at CERN during 2003
- CDB holding information on > 95% of systems in CERN-CC
- Over 90 NCM configuration components developed
  - From basic system configuration to Grid services setup (including desktops)
- SPMA used for managing all software
  - ~2 weekly security and functional updates (including kernel upgrades)
  - E.g. KDE security upgrade (~300 MB per node) and LSF client upgrade (v4 to v5) in 15 mins, without service interruption
  - Handles (occasional) downgrades as well
- Developments ongoing:
  - Fine-grained ACL protection for templates
  - Deployment of HTTPS instead of HTTP (usage of host certificates)
  - XML configuration profile generation speedup (e.g. parallel generation)
- Proxy architecture for enhanced scalability …
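
A package manager that handles upgrades and occasional downgrades can be thought of as comparing a desired package list with what is installed. The following Python sketch shows that comparison under stated assumptions: it is not SPMA code, the function name plan_actions is hypothetical, versions are simplified to integer tuples (real RPM version comparison also involves epochs and release strings), and the package versions are made up.

```python
def plan_actions(desired: dict, installed: dict) -> list:
    """Compare desired and installed {package: version tuple} maps.

    Returns a sorted list of (action, package, version) tuples.
    """
    actions = []
    for name, want in desired.items():
        have = installed.get(name)
        if have is None:
            actions.append(("install", name, want))
        elif want > have:
            actions.append(("upgrade", name, want))
        elif want < have:
            actions.append(("downgrade", name, want))
    for name, have in installed.items():
        if name not in desired:
            # Assumption: anything not in the desired list is removed.
            actions.append(("remove", name, have))
    return sorted(actions)


# Hypothetical package versions, for illustration only.
desired = {"kdelibs": (3, 1, 4), "lsf-client": (5, 1), "openssh": (3, 6)}
installed = {"kdelibs": (3, 1, 3), "lsf-client": (5, 1),
             "openssh": (3, 7), "telnet": (0, 17)}

plan = plan_actions(desired, installed)
for action in plan:
    print(action)
```

Because the plan is derived purely from the difference between the two lists, a downgrade is simply the case where the desired version is lower than the installed one.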

8 Back-end/front-end setup
[Diagram; recoverable labels:]
- Backend (“Master”), node label M
- Frontend, node label S: DNS-load balanced HTTP
- rsync
- Server cluster: Rack 1, Rack 2, …, Rack N
- Content: installation images, RPMs, configuration profiles

9 Proxy server setup
[Diagram; recoverable labels:]
- Backend (“Master”), node labels M and M’
- Frontend: L1 proxies, DNS-load balanced HTTP
- L2 proxies (“Head” nodes), node label H
- Server cluster: Rack 1, Rack 2, …, Rack N
- Content: installation images, RPMs, configuration profiles
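
Why a two-level proxy layout helps scalability can be sketched with a small simulation. The assumptions here are not taken from the slides: each proxy is modelled as a simple cache that contacts its upstream only on a miss, there is one head node per rack, and the class names, file path and node counts are hypothetical.

```python
class Origin:
    """Stands in for the backend ("Master") server."""

    def __init__(self, files: dict):
        self.files = files
        self.requests = 0

    def fetch(self, path: str) -> bytes:
        self.requests += 1
        return self.files[path]


class CachingProxy:
    """Serves from a local cache and asks its upstream only on a miss."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}
        self.requests = 0

    def fetch(self, path: str) -> bytes:
        self.requests += 1
        if path not in self.cache:
            self.cache[path] = self.upstream.fetch(path)
        return self.cache[path]


master = Origin({"/rpms/kernel.rpm": b"payload"})
l1_proxy = CachingProxy(master)
head_nodes = [CachingProxy(l1_proxy) for _ in range(3)]  # assumed: one per rack

# 40 nodes in each of 3 racks download the same package via their head node.
for head in head_nodes:
    for _ in range(40):
        head.fetch("/rpms/kernel.rpm")

print(master.requests, l1_proxy.requests, [h.requests for h in head_nodes])
```

In this model the 120 node downloads translate into three requests at the L1 proxy and a single request at the master, which is the effect a proxy hierarchy is meant to achieve.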

10 Quattor @ LCG/EGEE
- EGEE and LCG have chosen quattor for managing their integration testbeds
- Community effort to use quattor for fully automated LCG-2 configuration for all services
  - Aim is to provide a complete port of the LCFG configuration components
  - Most service configurations (WN, CE, UI, ...) already available
  - Minimal intrusiveness into site-specific environments
- More and more sites (IN2P3, NIKHEF, UAM Madrid, ...) and projects (GridPP) discussing or adopting quattor as basic fabric management framework…
- … leading to improved core software robustness and completeness
  - Identified and removed site dependencies and assumptions
  - Documentation, installation guides, bug tracking, release cycles

11 http://cern.ch/lemon

12 Lemon - LHC Era Monitoring
[Diagram; recoverable labels:]
- Nodes: Sensor, Monitoring Agent
- Monitoring Repository: TCP/UDP, SOAP; Repository backend: SQL
- Correlation Engines
- User: User Workstations, Web browser, Lemon CLI
- RRDTool / PHP, apache, HTTP

13 Deployment and Enhancements
- Smooth production running of Monitoring Agent and Oracle-based repository at CERN-CC
  - 150 metrics sampled at intervals from 30 s to 1 day; ~1 GB of data / day on ~1800 nodes
  - No aging-out of data, but archiving on MSS (CASTOR)
- Usage outside CERN-CC, collaborations
  - GridICE, CMS-Online (DAQ nodes)
  - BARC India (collaboration on QoS)
  - Interface with MonALISA being discussed
- Hardened and enhanced EDG software
  - Rich sensor set (from general to service-specific, e.g. IPMI/SMART for disk/tape, ...)
- Re-engineered Correlation and Fault Recovery
  - Perl-plugin based correlation engine for derived metrics (e.g. average number of LXPLUS users, load average and total active LXBATCH nodes)
  - Light-weight local self-healing module (e.g. /tmp cleanup, restart daemons)
- Alarm system for operators: gateway to future LHC control alarm system (LASER)
- Developing redundancy layer for Repository (Oracle Streams)
- Status and performance visualization pages …
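
A derived metric such as the number of active batch nodes and their average load can be computed from the latest per-node samples. The sketch below is written in Python for illustration, whereas the slides describe Perl plug-ins; the function name, the sample layout, the node names and the freshness rule for deciding whether a node is "active" are all assumptions.

```python
from statistics import mean


def derived_metrics(samples: dict, max_age_s: float, now: float) -> dict:
    """Compute cluster-level metrics from the latest per-node samples.

    `samples` maps node name -> {"timestamp": float, "loadavg": float}.
    A node counts as active if its latest sample is at most `max_age_s` old.
    """
    active = {
        node: s for node, s in samples.items()
        if now - s["timestamp"] <= max_age_s
    }
    return {
        "active_nodes": len(active),
        "avg_loadavg": mean(s["loadavg"] for s in active.values()) if active else None,
    }


# Hypothetical samples: the third node has not reported recently.
samples = {
    "lxbatch001": {"timestamp": 100.0, "loadavg": 1.0},
    "lxbatch002": {"timestamp": 95.0, "loadavg": 3.0},
    "lxbatch003": {"timestamp": 10.0, "loadavg": 9.0},
}

result = derived_metrics(samples, max_age_s=60.0, now=110.0)
print(result)
```

Stale nodes are excluded before averaging, so one silent machine does not distort the cluster-level value.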

14 lemon-status

15 http://cern.ch/leaf

16 LEAF - LHC Era Automated Fabric
- LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and LEMON:
- HMS (Hardware Management System):
  - Tracks systems through all physical steps in the lifecycle, e.g. installation, moves, vendor calls, retirement
  - Automatically issues install, retirement etc. requests to technicians
  - GUI to locate equipment physically
  - HMS implementation is CERN-specific, but concepts and design should be generic
- SMS (State Management System):
  - Automated handling (and tracking) of high-level configuration steps
    - Reconfigure and reboot all LXPLUS nodes for new kernel and/or physical move
    - Drain and reconfigure nodes for diagnosis / repair operations
  - Issues all necessary (re)configuration commands via Quattor
  - Extensible framework: plug-ins for site-specific operations possible
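
High-level state handling of the kind attributed to SMS can be pictured as a small state machine that permits only certain transitions and records the actions each one triggers. This Python sketch is not SMS code: the "maintenance" state, the allowed transitions, the Node class and the logged action strings are hypothetical, and a real system would issue commands to other tools instead of appending to a list.

```python
# Assumed transition table; only "production" and "standby" appear in the slides.
ALLOWED = {
    ("production", "standby"),
    ("standby", "maintenance"),
    ("maintenance", "standby"),
    ("standby", "production"),
}


class Node:
    def __init__(self, name: str, state: str = "production"):
        self.name = name
        self.state = state
        self.log = []  # actions a real system would send to other tools

    def set_state(self, target: str) -> None:
        if (self.state, target) not in ALLOWED:
            raise ValueError(f"{self.name}: {self.state} -> {target} not allowed")
        if self.state == "production":
            # Leaving production: let running work finish first.
            self.log.append("drain jobs")
        self.log.append(f"reconfigure for {target}")
        self.state = target


node = Node("lxplus001")
node.set_state("standby")
print(node.state, node.log)
```

Forcing every change through one transition table is what makes the steps trackable: a node cannot jump from production straight to maintenance without passing through the drain step.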

17 LEAF Deployment
- HMS in full production for all nodes in CC
  - HMS heavily used during CC node migration (~1500 nodes)
- SMS in production for all quattor-managed nodes
- Next steps:
  - More automation, and handling of other HW types for HMS
  - More service-specific SMS clients (e.g. tape and disk servers)
- Developing ‘asset management’ GUI
  - Multiple select, drag & drop nodes to automatically initiate HMS moves and SMS operations
  - Interface to LEMON GUI

18 Summary
- ELFms is deployed in production at CERN
  - Stabilized results from 3 years of development within EDG and LCG
  - Established technology: from prototype to production
  - Consistent full-lifecycle management and high automation level
  - Providing real added value for day-to-day operations
- Quattor and LEMON are generic software
  - Other projects and sites getting involved
- Site-specific workflows and “glue scripts” can be put on top for smooth integration with existing fabric environments
  - LEAF HMS and SMS
- ELFms = Quattor + Lemon + LEAF
- More information: http://cern.ch/elfms


20 Use Case: Move rack of machines
[Sequence diagram; participants as labelled: Node, HMS, NW DB, SMS, Quattor CDB, Operations, technicians]
1. Import
2. Set to standby
3. Update
4. Refresh
5. Take out of production
   - Close queues and drain jobs
   - Disable alarms
6. Shutdown work order
7. Request move
8. Update
9. Update
10. Install work order
11. Set to production
12. Update
13. Refresh
14. Put into production

21 Improvements wrt EDG-LCFG
- New and powerful configuration language
  - True hierarchical structures
  - Extendable data manipulation language
  - (User-defined) typing and validation
- SQL query backend
- Portability
  - Plug-in architecture -> Linux and Solaris
- Enhanced components
  - Sharing of configuration data between components now possible
  - New component support libraries
  - Native configuration access API (NVA-API)
- Stick to the standards where possible
  - Installation subsystem uses system installer
  - Components don’t replace SysV init.d subsystem
- Modularity
  - Clearly defined interfaces and protocols
  - Mostly independent modules
  - “Light” functionality built in (e.g. package management)
- Improved scalability
  - Enabled for proxy technology
  - NFS mounts no longer necessary
- Enhanced management of software packages
  - ACLs for SWRep
  - Multiple versions installable
  - No need for RPM ‘header’ files
- Last but not least: Support!
  - EDG-LCFG is frozen and obsoleted (no ports to newer Linux versions)
  - LCFG -> EDG-LCFGng -> quattor
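
User-defined typing and validation of configuration data can be illustrated as checking a profile against a schema before it is deployed. The Python sketch below is a generic illustration, not the quattor configuration language: the validate function, the schema layout and the field names are hypothetical.

```python
def validate(profile: dict, schema: dict) -> list:
    """Return a list of error strings; an empty list means the profile is valid."""
    errors = []
    for key, (expected_type, check) in schema.items():
        if key not in profile:
            errors.append(f"{key}: missing")
        elif not isinstance(profile[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
        elif check is not None and not check(profile[key]):
            errors.append(f"{key}: failed validation")
    return errors


# Hypothetical schema: a type plus an optional user-defined check per field.
schema = {
    "hostname": (str, lambda v: len(v) > 0),
    "memory_mb": (int, lambda v: v > 0),
    "cluster": (str, None),
}

good = {"hostname": "lxb001", "memory_mb": 1024, "cluster": "lxbatch"}
bad = {"hostname": "", "memory_mb": "1024"}

print(validate(good, schema))
print(validate(bad, schema))
```

Catching a wrong type or a missing field centrally means the error is reported once, at configuration time, and not on every node that would otherwise receive the faulty profile.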

22 Differences with ASIS/SUE
SUE:
- Focus on configuration, not installation
- Powerful configuration language
  - True hierarchical structures
  - Extendable data manipulation language
  - (User-defined) typing and validation
  - Sharing of configuration data between components now possible
- Central Configuration Database
- Supports unconfiguring services
- Improved dependency model
  - Pre/post dependencies
- Revamped component support libraries
ASIS:
- Scalability
  - HTTP vs. shared file system
- Supports native packaging system (RPM, PKG)
- Manages all software on the node
- ‘Real’ Central Configuration database
- (But: no end-user GUI, no package generation tool)

23 Differences with ROCKS
- Rocks: better documentation, nice GUI, easy to set up
- Design principle: reinstall nodes in case of configuration changes
  - No configuration or software updates on running systems
  - Suited for production? Concerns: efficiency on batch nodes, upgrades / reconfigurations on 24/7 servers (e.g. gzip security fix, reconfiguration of CE address on WNs)
- Assumptions on network structure (private and public parts) and node naming
- No indication of how to extend the predefined node types or the configured services
- Limited configuration capabilities (key/value)
- No multiple package versions (neither in the repository, nor simultaneously on a single node)
  - E.g. different kernel versions on specific node types
- Works only for RH Linux (Anaconda installer extensions)

