Large Farm 'Real Life Problems' and their Solutions. Thorsten Kleinwort, CERN IT/FIO. HEPiX II/2004, BNL, 20.10.2004.

1 Large Farm 'Real Life Problems' and their Solutions
Thorsten Kleinwort, CERN IT/FIO
HEPiX II/2004, BNL, 20.10.2004

2 Outline (20 October 2004, Thorsten Kleinwort, IT/FIO/FS)
- Farms at the CERN CC:
- The Tools Framework
- The Working Teams
- Real Life Use Cases
- Collaborations
- Summary
- Useful Links

3 The Tools Framework
ELFms = Quattor + Lemon + LEAF
- Quattor:
  - Installation (Kickstart + SWREP)
  - Configuration (CDB + NCM)
  - Management (SPMA + NCM)
- Lemon:
  - Monitoring
  - Batch system statistics
- LEAF:
  - State management (SMS)
  - Hardware management (HMS)

4 The Tools Framework (cont’d)
The evolution of the ELFms tools is described in several previous presentations:
- HEPiX II/2003 (Vancouver): ‘The new Fabric Management Tools in Production at CERN’
- HEPiX I/2004 (Edinburgh): ‘ELFms, status, deployment’ by German Cancio; ‘Lemon Web Monitoring’ by Miroslav Siket
- CHEP 2004 (Interlaken): ‘Current Status of Fabric Management at CERN’ by German Cancio
- This HEPiX: ‘Experience in the use of quattor tool suite outside CERN’
=> Progress has been made, improvements are ongoing, and Quattor is used more and more outside CERN.

5 Tools (cont’d)
Other tools [interfacing CDB]:
- Script PrepareInstall.pl:
  - Does all the steps necessary to prepare a machine install
  - Can run with a list of hosts (for mass installs)
  - Gets all the necessary information from CDB
  - Creates a kickstart file for each node
- Local script ‘maintenance’, used to run down a node:
  - Drains batch nodes
  - Warns users on interactive nodes
  - Can execute a configurable script at the end, e.g. reboot
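The per-host loop that the slide attributes to PrepareInstall.pl can be illustrated with a short Python sketch. This is not the actual script: the `FAKE_CDB` dictionary, its field names, the host names, and the `.ks` file naming are all assumptions made for illustration, and the generated file contains only comments plus a single kickstart `network` directive.

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in for a CDB lookup: host name -> profile fields.
FAKE_CDB = {
    "node001": {"cluster": "example_batch", "os": "example_os", "hw_type": "example_hw_a"},
    "node002": {"cluster": "example_batch", "os": "example_os", "hw_type": "example_hw_b"},
}


def render_kickstart(host, profile):
    """Render a minimal, illustrative kickstart file for one host."""
    return "\n".join([
        f"# kickstart for {host} (cluster={profile['cluster']}, os={profile['os']})",
        f"# hardware type: {profile['hw_type']}",
        f"network --bootproto=dhcp --hostname={host}",
        "",
    ])


def prepare_install(hosts, cdb, out_dir):
    """Write one kickstart file per known host.

    Returns (paths written, hosts that were not found in cdb).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, missing = [], []
    for host in hosts:
        profile = cdb.get(host)
        if profile is None:
            missing.append(host)  # report instead of guessing a profile
            continue
        path = out_dir / f"{host}.ks"
        path.write_text(render_kickstart(host, profile))
        written.append(path)
    return written, missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written, missing = prepare_install(["node001", "node002", "node999"], FAKE_CDB, tmp)
        print([p.name for p in written], missing)
```

Keeping all host-specific data in the lookup and none in the script is what makes a mass install a matter of passing a longer host list.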

6 Tools (cont’d)
Automated Fabric [LEAF]:
- State Management System (SMS):
  - Other CDB changes are done by SMS: change of OS/cluster
  - Systems have a state: ‘production’ or ‘standby’
- Hardware Management System (HMS):
  - Workflow to track hardware changes [interfaces CDB]:
    - New machine arrival
    - Machine moves
    - Machine interventions (vendor calls), retirements

7 The Working Teams
- Operator:
  - 24/7
  - Alarm display
  - Following procedures:
    - Acting on alarms
    - Open Remedy tickets
    - Email/phone notification
    - Machine reboots
- SysAdmins:
  - New team
  - Now 7 staff, more to hire
  - Running more and more services in the CC
  - Doing most of the install and maintenance work on farm PCs
  - Following up h/w failures (‘vendor calls’)
- Service Manager:
  - Farm/cluster resource planning
  - Writing/improving the procedures/tools
  - Following up on new problems
- “Customers”:
  - Other groups/teams in CERN-IT, like: DB (ORACLE), GD (LCG), GM (EGEE)
  - Experiments (Data Challenges)
  - Changing requirements

8 Another Management Tool
Remedy: the problem tracking tool in CERN IT
- Used in different workflows, e.g. by:
  - The Operator, to open tickets following up on alarms
  - The Service Managers, to ask for machine interventions
  - The SysAdmins, to follow up on problems/general issues
- HMS is implemented as a Remedy workflow as well
- Recently started to collect statistics on hardware failures

9 Real Life Use Cases
Kernel upgrade (on LXBATCH, ~1500 hosts):
1. Put the new software into the repository (SWREP, precaching)
2. Put the new kernel RPM on the nodes: SPMA, with the multi-package option (the old kernel is still running!)
3. Configure the new kernel version for the cluster in CDB, and run the GRUB NCM component to configure the node
4. Drain the nodes by disabling new batch jobs (maintenance)

10 Real Life Use Cases
Kernel upgrade (cont’d):
5. The node reboots when it is drained (this could be at any time)
6. The machine comes up with the new kernel and goes back into production immediately
=> Least downtime for each node.
- Capacity is always available: the first reboot is instantaneous, the last one can be several days later
- Everything runs automatically; some cleanup has to be done for a few machines (they do not shut down, or have a h/w failure on startup) => caught by the monitoring/alarm
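The drain-then-reboot behaviour of the kernel upgrade can be made concrete with a toy simulation in Python. It is only a sketch under simplifying assumptions: the `Node` class, the one-job-finishes-per-time-step model, and all names are hypothetical and do not reflect how the maintenance script or the batch system work internally.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running_jobs: int            # jobs still running on the node
    accepting_jobs: bool = True  # False while the node is draining
    kernel: str = "old"


def start_drain(node):
    """Stop dispatching new batch jobs to the node (the draining step)."""
    node.accepting_jobs = False


def tick(node, new_kernel):
    """Advance one time step: a running job finishes, or a drained node reboots."""
    if node.kernel == new_kernel:
        return
    if node.running_jobs > 0:
        node.running_jobs -= 1      # assumption: one job finishes per step
    elif not node.accepting_jobs:
        node.kernel = new_kernel    # reboot into the new kernel ...
        node.accepting_jobs = True  # ... and return to production at once


def rolling_upgrade(nodes, new_kernel):
    """Drain all nodes, then return {node name: time step of its reboot}."""
    for node in nodes:
        start_drain(node)
    rebooted = {}
    step = 0
    while len(rebooted) < len(nodes):
        for node in nodes:
            before = node.kernel
            tick(node, new_kernel)
            if before != new_kernel and node.kernel == new_kernel:
                rebooted[node.name] = step
        step += 1
    return rebooted


if __name__ == "__main__":
    farm = [Node("a", 0), Node("b", 3), Node("c", 1)]
    print(rolling_upgrade(farm, "new"))
```

An idle node reboots in the first step while a busy one reboots only after its last job ends, which mirrors the observation that the first reboot is instantaneous and the last can come much later.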

11 Real Life Use Cases (cont’d)
Configure batch resources (LSF):
- LSF resources are defined depending on the availability, power, and cluster of the machines
- Resources are defined in CDB
- Configured on the node using NCM
- The master file is generated from CDB2SQL in a daily cron job (a reconfig takes several minutes)
- Consistency of client and master is ensured by CDB
- Resource assignments are done in CDB at (sub-)cluster level (template structure)
- Reassignments of (sub-)clusters in CDB are done with SMS tools
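The idea of assigning resources at (sub-)cluster level and deriving both the node view and the master view from one source can be sketched in Python. The two dictionaries below are hypothetical stand-ins for CDB templates, the resource and cluster names are invented, and the output line format is purely illustrative; it is not LSF configuration syntax.

```python
# Hypothetical stand-in for cluster-level resource templates in CDB.
CLUSTER_RESOURCES = {
    "example_batch": ["batch"],
    "example_batch/sub_a": ["batch", "experiment_a"],
}

# Hypothetical stand-in for host -> (sub-)cluster membership in CDB.
HOST_CLUSTER = {
    "node001": "example_batch",
    "node002": "example_batch/sub_a",
}


def resources_for(host):
    """Resources a host inherits from its (sub-)cluster (the node-side view)."""
    return CLUSTER_RESOURCES[HOST_CLUSTER[host]]


def master_lines():
    """One illustrative line per host for the master side (not LSF syntax)."""
    return [f"{host} {' '.join(resources_for(host))}" for host in sorted(HOST_CLUSTER)]


def reassign(host, new_cluster):
    """Move a host to another (sub-)cluster; its resources follow automatically."""
    if new_cluster not in CLUSTER_RESOURCES:
        raise KeyError(f"unknown cluster: {new_cluster}")
    HOST_CLUSTER[host] = new_cluster


if __name__ == "__main__":
    print(master_lines())
    reassign("node001", "example_batch/sub_a")
    print(master_lines())
```

Because both views are computed from the same membership table, a reassignment cannot leave the node and the master disagreeing once both have been regenerated.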

12 Real Life Use Cases (cont’d)
Emptying the Computer Centre
- For the refurbishment of the CERN Computer Centre, all machines had to be moved, either from one side to the other or downstairs (vault)
- ~2000 machines had to be moved
- Taking the opportunity to add machines to CDB:
  - As quattor and non-quattor nodes
- Batch machines were moved rack by rack (1 rack = 44 nodes):
  - HMS was used to steer the moves
  - SMS/maintenance to shut down the machines
  - Rename/PrepareInstall to bring machines back
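Moving batch machines one rack at a time amounts to splitting the host list into groups of 44 and running the same steps for each group. A minimal Python sketch follows; the host names are hypothetical and the step labels are paraphrases of the tools named on the slide, not real commands.

```python
RACK_SIZE = 44  # nodes per rack, as stated on the slide

# Per-rack steps; labels paraphrase the tools named on the slide.
STEPS = (
    "shut down (SMS/maintenance)",
    "move (steered by HMS)",
    "bring back (Rename/PrepareInstall)",
)


def plan_rack_moves(hosts, rack_size=RACK_SIZE):
    """Split hosts into rack-sized batches, preserving order."""
    return [hosts[i:i + rack_size] for i in range(0, len(hosts), rack_size)]


def move_schedule(hosts):
    """Return (rack index, step, number of hosts) tuples in execution order."""
    return [
        (index, step, len(rack))
        for index, rack in enumerate(plan_rack_moves(hosts))
        for step in STEPS
    ]


if __name__ == "__main__":
    hosts = [f"node{i:04d}" for i in range(100)]  # hypothetical host names
    for entry in move_schedule(hosts):
        print(entry)
```

With 100 hosts this yields two full racks of 44 and a final partial rack of 12, each passing through the same three steps.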

14 Real Life Use Cases (cont’d)
New h/w arrival => mass installation
- New machines (~400) arrive at CERN (in batches of 50-100)
- Racks have to be prepared:
  - Network equipment
  - Power supply
  - (Console service)
- Plan machine membership (cluster)
- Put machine into CDB:
  - h/w type
  - Cluster type/OS

15 Real Life Use Cases: New h/w arrival (cont’d)
- Physical machine installation (HMS):
  - New DNS entry
- OS installation:
  - PrepareInstall
  - Installation by the SysAdmin
- Burn-in test (h/w test, several days to weeks)
- Follow up on h/w problems with the vendor
- Add the machines to the alarm display (SURE)
- Put machines into production

17 Collaborations
- External ‘customers’: EGEE, LCG, and other groups at CERN are now using Quattor-managed machines:
  - They benefit from standard, manageable, and reproducible machine setups
  - They are able to, and should learn to, make modifications themselves
- External sites using Quattor:
  - IN2P3, NIKHEF, UAM Madrid, … are discussing using Quattor or already use it => see Rafael’s talk
- This helps to enhance the tools:
  - Service nodes (for LCG-2)
  - Wider usage
  - Generalized components

18 Summary
- ELFms is deployed in production at CERN:
  - Established technology: from prototype to production
  - Enhancements are still ongoing
- Fundamental part of our infrastructure:
  - Merged with our existing environment
- Quattor and Lemon are generic software:
  - Used by others inside and outside CERN
  - Hopefully a fruitful collaboration in the future

19 Useful Links
- ELFms: http://cern.ch/elfms
- Quattor: http://quattor.org/
- Lemon: http://cern.ch/lemon
- LEAF: http://cern.ch/leaf
Previous presentations:
- HEPiX II/2003 (Vancouver): http://www.triumf.ca/hepix2003
  - ‘The new Fabric Management Tools in Production at CERN’
- HEPiX I/2004 (Edinburgh): http://www.nesc.ac.uk/esi/events/291/
  - ‘ELFms, status, deployment’ by German Cancio
  - ‘Lemon Web Monitoring’ by Miroslav Siket
- CHEP 2004 (Interlaken): http://chep2004.web.cern.ch/chep2004/
  - ‘Current Status of Fabric Management at CERN’ by German Cancio

20 Questions?

