Fabric Management: Progress and Plans
PEB
Tim Smith, IT/FIO

2003/09/23 Fabric Management: Contents
- Manageable Nodes
- Framework for management
- Deployed at CERN
- Plans

2003/09/23 Fabric Management: Standard / Autonomous Nodes
[Diagram: a node with its local configuration areas: /var/run, /var/log, /etc/init.d, /etc, /etc/sysconfig]
- Linux Standards Base
- Distributed / Autonomous
  - Decoupling of nodes
  - Local config files
  - Local programs do all work
- Avoid inherent drift:
  - No external crontabs
  - No remote mgmt scripts
  - No remote application updates
  - No parallel cmd engines
  - No global file-systems (AFS/NFS)


2003/09/23 Fabric Management: Standard / Autonomous Nodes
[Diagram: a Config Manager and a Package Manager feeding the node's local configuration areas and RPMs]
- All SW in RPM
- Control complete SW set
- XML desired state
- Transactions
- Update local cfgs
- Manage through SysV scripts


2003/09/23 Fabric Management: Framework Principles
- Installation
  - Automation of complete process
  - Reproducible
  - Clear installation:
    - Short, clean, and reproducible Kickstart base installation
    - All software installation by RPM
    - System configuration by one tool
    - Configuration information through unique interface, from unique store
- Maintenance:
  - The running system must be 'updateable'
  - The updates must go into new installations afterwards
  - System updates have to be automated, but steerable

2003/09/23 Fabric Management: Framework for Management
[Diagram: CDB feeds the Config Manager, which delivers the desired state to the Cfg Agent and its Cfg Cache on the node; the Mon Agent reports the actual state to the Monitoring Manager]

2003/09/23 Fabric Management: Framework for Management
[Diagram: the full framework. CDB feeds the Config Manager, which serves XML over http to the node's Cfg Agent and Cfg Cache; the SW Manager drives the SW Agent and SW Cache with RPMs over http from the SW Rep; the Mon Agent feeds the Monitoring Manager; a State Manager, Fault Manager and HardWare Manager complete the picture]

2003/09/23 Fabric Management: ELFms
[Diagram: the framework instantiated with ELFms tools. quattor: CDB, NCM (Cfg Agent), SPMA (SW Agent), SWRep; LEMON: MSA (Mon Agent) and OraMon; LEAF: SMS and HMS]

2003/09/23 Fabric Management: quattor configuration
[Diagram: GUI and CLI feed Pan templates into CDB; Server Modules expose the configuration via SQL/LDAP/HTTP and SOAP; XML profiles flow over HTTP to the installation system and to the CCM cache on each node, accessed through the NVA-API]
- Configuration Data Base (CDB): the Configuration Information store. The information is updated in transactions; it is validated and versioned.
- Pan: templates are compiled into XML profiles.
- Server Modules: provide different access patterns (SQL/LDAP/HTTP) to the Configuration Information.
- CCM: Configuration Information is stored in a local cache on each node; it is accessed via the NVA-API.
- HTTP + notifications: nodes are notified about changes to their configuration and fetch the XML profiles via HTTP (the pattern is sketched below).
- GUI & CLI: Pan templates with configuration information are input into CDB.
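A minimal sketch of the notify-then-fetch pattern described above. The URL layout and cache path are hypothetical, not quattor's actual ones; it only illustrates pulling the XML profile on notification and replacing the local cache atomically.

    # Illustrative sketch of notify-then-fetch; URL scheme and paths are hypothetical.
    import os
    import urllib.request

    PROFILE_URL = "http://cdb-server.example.org/profiles/{host}.xml"  # hypothetical
    CACHE_PATH = "/var/lib/ccm/profile.xml"                            # hypothetical

    def on_notification(hostname: str) -> None:
        """Called when the node is told its configuration changed in CDB."""
        url = PROFILE_URL.format(host=hostname)
        with urllib.request.urlopen(url) as resp:
            profile = resp.read()
        # Replace the cache atomically so agents always see a complete profile.
        tmp = CACHE_PATH + ".tmp"
        with open(tmp, "wb") as f:
            f.write(profile)
        os.replace(tmp, CACHE_PATH)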

2003/09/23 Fabric Management: Config/SW Considerations
- Hierarchical configuration specification
  - Graph rather than tree structure
  - Common properties set only once (see the sketch after this list)
- Node profiles
  - Complete specification in one XML file
  - Local cache
  - Transactions / Notifications
- Externally specified, versioned: CVS repository
- One tool to manage all SW: SPMA
  - System and application
- Update verification nodes + release cycle
- Procedures and Workflows
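To illustrate the "common properties set only once" idea, here is a minimal Python sketch of hierarchical overlay. The deep-merge scheme and the example values are illustrative assumptions; quattor's Pan language implements this far more richly.

    # Hypothetical illustration of hierarchical configuration overlay:
    # shared defaults are declared once and node templates carry only the delta.
    def overlay(base: dict, override: dict) -> dict:
        """Deep-merge override onto base; override wins on conflicts."""
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = overlay(merged[key], value)
            else:
                merged[key] = value
        return merged

    common = {"cluster": "lxbatch", "packages": {"openssh": "3.6"}}
    node = {"packages": {"lsf": "4.2"}}  # only the node-specific delta
    profile = overlay(common, node)
    # -> {'cluster': 'lxbatch', 'packages': {'openssh': '3.6', 'lsf': '4.2'}}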

2003/09/23 Fabric Management: NCM details
- Modules:
  - "Components"
  - Invocation and notification framework
  - Component support libraries
  - Configuration query tool
- NCM is responsible for ensuring that reality on a node reflects the desired state in CDB.
  - Components make the necessary system changes.
  - Components do not install software; they just ensure the config is correct.
  - Components register with the framework to be notified when the node configuration is updated in CDB.
  - Usually, this implies regenerating and/or updating local config files (e.g. /etc/sshd_config); this pattern is sketched below.
- Use standard system facilities (SysV scripts) for managing services
  - Components notify services via SysV scripts when their config changes.
- Possible to define configuration dependencies between components
  - E.g. configure the network before sendmail.
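A pseudosketch of the regenerate-config-then-notify-the-service pattern a component follows. Real NCM components are Perl modules with their own framework API; everything here (class name, profile layout) is a hypothetical Python stand-in.

    # Pseudosketch of an NCM-style component; real components are Perl modules,
    # and every name here is hypothetical, for illustration only.
    import subprocess

    class SshComponent:
        """Keeps /etc/sshd_config in line with the desired state from CDB."""

        def configure(self, profile: dict) -> None:
            desired = profile["ssh"]  # the slice of the profile this component owns
            new_config = "\n".join(f"{k} {v}" for k, v in desired.items()) + "\n"
            try:
                with open("/etc/sshd_config") as f:
                    current = f.read()
            except FileNotFoundError:
                current = ""
            if new_config != current:
                with open("/etc/sshd_config", "w") as f:
                    f.write(new_config)
                # Notify the service through its standard SysV script.
                subprocess.run(["/etc/init.d/sshd", "reload"], check=True)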

2003/09/23 Fabric Management: Hardware variety
Hardware templates in use:
hardware_diskserv_elonex_1100
hardware_elonex_500
hardware_elonex_600
hardware_elonex_800
hardware_elonex_800_mem1024mb
hardware_elonex_800_mem128mb
hardware_seil_2002
hardware_seil_2002_interactiv
hardware_seil_2003
hardware_siemens_550
hardware_techas_600
hardware_techas_600_2
hardware_techas_600_mem512mb
hardware_techas_800

template hardware_diskserver_elonex_1100;
"/hardware/cpus" = list(create("hardware_cpu_GenuineIntel_Pentium_III_1100"),
                        create("hardware_cpu_GenuineIntel_Pentium_III_1100"));
"/hardware/harddisks" = nlist("sda", create("pro_hardware_harddisk_WDC_20"));
"/hardware/ram" = list(create("hardware_ram_1024"));
"/hardware/cards/nic" = list(create("hardware_card_nic_Broadcom_BCM5701"));

structure template hardware_cpu_GenuineIntel_Pentium_III_1100;
"vendor" = "GenuineIntel";
"model" = "Intel(R) Pentium(R) III CPU family 1133MHz";
"speed" = 1100;

structure template hardware_card_nic_Broadcom_BCM5701;
"manufacturer" = "Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet";
"name" = "3Com Corporation 3C996B-T 1000BaseTX";
"media" = "GigaBit Ethernet";
"bus" = "pci";

2003/09/23 Fabric Management: What is in CDB?
- Hardware
  - CPU
  - Hard disk
  - Network card
  - Memory size
  - Location
- Software
  - Repository definitions
  - Service definitions = groups of packages (RPMs)
- System
  - Partition table
  - Cluster name and type
  - CCDB name
  - Site release
  - Load balancing information

2003/09/23 Fabric Management: quattor installation
[Diagram: SWRep servers hold packages (RPM, PKG) behind a management API with ACLs and serve client nodes over nfs/http/ftp into a local cache for SPMA (driven by SPMA.cfg); CDB feeds the CCM and, via Cdispd registration/notification, the NCM components; an installation server (management API, ACLs) handles DHCP and PXE, with a KS/JS generator for node (re)installation]

2003/09/23 Fabric Management: SWRep
- Client-server toolsuite for the storage of software packages
- Universal repository:
  - Extendable to multiple platforms and package formats (RHLinux/RPM, Solaris/PKG, … others like Debian dpkg)
  - Multiple package versions/releases
  - Currently RPM for RH73, RH21ES and soon RH10, on the LXSERV cluster
- Management ("product maintainers") interface:
  - ACL-based mechanism to grant/deny modification rights (packages associated to "areas")
  - Management done by FIO/FS
- Client access via standard protocols
  - On LXSERV: HTTP
- Replication using standard tools (e.g. rsync)
  - Availability, load balancing

2003/09/23 Fabric Management: SPMA
- Runs on every target node
  - All LXPLUS/LXBATCH/LXSHARE/… nodes, except old RH61 nodes
- Plug-in framework allows for portability (sketched below)
  - FIO/FS uses it for RPM; the PS group will use it for Solaris (PKG)
- Can manage either all or a subset of the packages on a node
  - Configured by FIO/FS to manage ALL packages
  - Which means: packages not in the configuration are wiped out
  - And: packages missing on the node but present in the configuration are (re)installed
- Addresses scalability
  - Packages can be stored ahead of time in a local cache, avoiding peak loads on the software repository servers (simultaneous upgrades of large farms)
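A minimal sketch of the plug-in idea behind SPMA's portability: the core stays platform-neutral and delegates to a packager backend. The interface and class names are hypothetical, not SPMA's real API.

    # Hypothetical sketch of a packager plug-in interface; SPMA's real API differs.
    from abc import ABC, abstractmethod

    class Packager(ABC):
        """Platform-specific backend behind a common interface."""

        @abstractmethod
        def installed(self) -> dict:
            """Return {package name: version} for this node."""

        @abstractmethod
        def apply(self, operations: list) -> None:
            """Execute one transaction of install/upgrade/deinstall operations."""

    class RpmPackager(Packager):
        """Linux backend: would read the RPM database and hand the
        whole transaction to RPMT."""
        def installed(self) -> dict: ...
        def apply(self, operations: list) -> None: ...

    class PkgPackager(Packager):
        """Solaris backend: the same interface over PKG/PKGT."""
        def installed(self) -> dict: ...
        def apply(self, operations: list) -> None: ...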

2003/09/23 Fabric Management: SPMA (II)
- SPMA functionality (steps 1 and 2 are sketched below):
  1. Compares the packages currently installed on the local node with the packages listed in the configuration
  2. Computes the necessary install/deinstall/upgrade operations
  3. Invokes the packager RPMT/PKGT with the right transaction set
- RPMT (RPM transactions) is a small tool on top of the RPM libraries which, unlike plain RPM, allows multiple simultaneous package operations with dependency resolution
  - Example: "upgrade X, deinstall Y, downgrade Z, install T", verifying/resolving the appropriate dependencies
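A minimal Python sketch of steps 1 and 2 above: diff the installed set against the desired set and emit an operation list. Version comparison is deliberately simplified (string inequality); real RPM version ordering is more involved, and the package names are made up.

    # Sketch of SPMA steps 1-2: diff installed vs desired packages into operations.
    def compute_operations(installed: dict, desired: dict) -> list:
        """Both arguments map package name -> version string."""
        ops = []
        for name, version in desired.items():
            if name not in installed:
                ops.append(("install", name, version))
            elif installed[name] != version:
                ops.append(("upgrade", name, version))  # or downgrade
        for name in installed:
            if name not in desired:
                ops.append(("deinstall", name, installed[name]))
        return ops

    # Example: node has an outdated package and an unconfigured extra one.
    installed = {"openssh": "3.5", "games": "1.0"}
    desired = {"openssh": "3.6", "lsf": "4.2"}
    print(compute_operations(installed, desired))
    # [('upgrade', 'openssh', '3.6'), ('install', 'lsf', '4.2'),
    #  ('deinstall', 'games', '1.0')]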

2003/09/23 Fabric Management: Currently deployed at CERN
- Scale
  - 1400 node profiles
  - 3M lines of node XML specifications
  - Generated from 28k lines of Pan definitions
- Scalability
  - 3-node load-balanced web server cluster
    - For configuration profiles
    - For the SW repository
  - Time-smeared transactions (see the sketch below)
  - Pre-deployment caching
- Security
  - GPG key pair per host, generated during installation
  - Per-host encrypted files served
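One common way to implement the time smearing mentioned above is a deterministic per-host delay, so a farm's fetches spread evenly over a window without coordination. The hash-based scheme below is an assumption for illustration, not necessarily what was deployed.

    # Sketch of time-smeared updates: each node waits a deterministic,
    # host-specific delay before fetching, spreading load over a window.
    import hashlib
    import socket
    import time

    SMEAR_WINDOW_SECONDS = 15 * 60  # e.g. the 15-minute window cited on the next slide

    def smear_delay(hostname: str) -> int:
        """Map a hostname to a stable delay in [0, SMEAR_WINDOW_SECONDS)."""
        digest = hashlib.md5(hostname.encode()).digest()
        return int.from_bytes(digest[:4], "big") % SMEAR_WINDOW_SECONDS

    time.sleep(smear_delay(socket.gethostname()))
    # ...then proceed to fetch the profile / packages.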

2003/09/23 Fabric Management: Examples
- Clusters
  - LXPLUS, LXBATCH, LXBUILD, LXSHARE, OracleDBs
  - TapeServers, DiskServers, StageServers
- Supported platforms
  - RH packages
  - RHES 1300 packages
  - RH10
- KDE/Gnome security patch rollout
  - 0.5 GB onto 700 nodes
  - 15-minute time smearing from 3 lb-servers
- LSF 4 to 5 transition
  - 10 minutes
  - No queues or jobs stopped
  - Cf. last time: 3 weeks, 3 people!

2003/09/23 Fabric Management: Plans
- Deploying NCM
  - Port SUE features to NCM for RH10
  - Use for Solaris 10
- Deploying the State Manager (SMS)
- Use-case targeted GUIs
- Web servers
  - Split: FrontEnd / BackEnd architecture

2003/09/23 Fabric Management: Conclusions
- Maturity brings…
  - Degradation of the initial state definition (HW + SW)
  - Accumulation of innocuous temporary procedures
- Scale brings…
  - Marginal activities become full time
  - Many hands on the systems
- Combat these with strong management automation