Managing A Large Farm: CSF Andrew Sansum 26 November 2002.

Overview
Will cover many of the large-scale issues associated with big CPU/disk farms.
The intent is to provoke discussion rather than provide answers: I don't claim to be an expert!
Many RAL solutions are dated, but new staff will soon be making changes.

Large Farms: The BIG differences
BIG is not beautiful:
–A small mistake can proliferate
–Problems can multiply
–Many components can become involved
–THINK before you make changes!
–Manual login on 500 nodes is a major disaster!
Funding bodies often expect big farms to be run more professionally.

Hardware Specification
Good quality hardware is vital.
Go with a reputable company.
Evaluate the quality of the solution.
Check for component compatibility.
Consider long warranties, or be prepared for major interventions yourself (e.g. replacing all the fans).

Power Requirements
Is there enough (steady state)?
Right plugs!!
Cope with the surge on power up (think about power sequencing).
What impact do PSUs have on the power supply (cf. SLAC): neutral current imbalance, higher-order harmonics…
Remote/automated power up/down is nice (e.g. APC units).
Worry about equipment on different phases.
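
The power-sequencing point can be sketched as a staggered start-up plan. This is a minimal illustration, not the RAL procedure: the function name, batch size and delay are assumptions, and the actual switching (e.g. via a remote power unit) is left out.

```python
from typing import List, Tuple


def power_up_schedule(
    hosts: List[str], batch_size: int, delay_s: int
) -> List[Tuple[int, List[str]]]:
    """Split hosts into batches and give each batch a start offset in seconds.

    Powering on a few nodes at a time limits the combined inrush surge.
    batch_size and delay_s are site-specific assumptions, not recommendations.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if delay_s < 0:
        raise ValueError("delay_s must not be negative")
    schedule = []
    for start in range(0, len(hosts), batch_size):
        offset = (start // batch_size) * delay_s
        schedule.append((offset, hosts[start:start + batch_size]))
    return schedule


if __name__ == "__main__":
    # Hypothetical host names, two nodes every 30 seconds.
    for offset, batch in power_up_schedule(["n1", "n2", "n3", "n4", "n5"], 2, 30):
        print(f"t+{offset:>3}s: power on {', '.join(batch)}")
```

Keeping the plan as plain data makes it easy to review before anything is switched on.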

Cooling
Cooling must be sufficient!
Must be able to cope with local hot spots.
If cooling fails, things get hot very fast: monitoring/automated shutdown.
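
A rough sketch of the monitoring/automated shutdown idea follows. The thresholds, host names and function names are placeholders chosen for illustration; how temperatures are read and how nodes are halted depends on the hardware and is not shown.

```python
from typing import Dict, List


def thermal_action(temp_c: float, warn_c: float = 30.0, shutdown_c: float = 40.0) -> str:
    """Classify one temperature reading in degrees Celsius.

    The default thresholds are placeholders, not vendor limits.
    """
    if temp_c >= shutdown_c:
        return "shutdown"
    if temp_c >= warn_c:
        return "warn"
    return "ok"


def nodes_to_shut_down(readings: Dict[str, float], shutdown_c: float = 40.0) -> List[str]:
    """Return the hosts whose reading has reached the shutdown threshold, sorted by name."""
    return sorted(
        host
        for host, temp in readings.items()
        if thermal_action(temp, shutdown_c=shutdown_c) == "shutdown"
    )


if __name__ == "__main__":
    # Hypothetical readings showing a local hot spot in one rack.
    sample = {"rack1-n1": 24.0, "rack1-n2": 45.0, "rack2-n1": 41.5}
    print(nodes_to_shut_down(sample))
```

Per-node readings, rather than one room average, are what let a check like this catch local hot spots.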

Installation
Netboot/PXE avoids the need for manual insertion of floppies.
Use something like kickstart to:
–Speed up the installation task
–Maintain a record of configuration
–Allow automated reconfiguration
LCFG not recommended, but maybe its successors?
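
One way to keep a record of configuration is to generate the per-host part of each kickstart file from a single inventory. The sketch below is illustrative only: the inventory, host names and roles are hypothetical, and the output is a fragment, not a complete kickstart file.

```python
from string import Template
from typing import Dict

# Per-host fragment: a comment recording the role, plus the kickstart
# "network" command. Everything else in a real file is omitted here.
KS_FRAGMENT = Template(
    "# generated for $host (role: $role)\n"
    "network --bootproto=dhcp --hostname=$host\n"
)


def render_fragments(inventory: Dict[str, str]) -> Dict[str, str]:
    """Map each host name to its rendered fragment.

    inventory maps host name to role; both are hypothetical examples.
    """
    return {
        host: KS_FRAGMENT.substitute(host=host, role=role)
        for host, role in sorted(inventory.items())
    }


if __name__ == "__main__":
    for host, text in render_fragments({"csflnx001": "worker", "csflnx002": "worker"}).items():
        print(f"--- {host} ---")
        print(text, end="")
```

Because the inventory is the only input, rerunning the generator after a change is a simple form of automated reconfiguration.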

Configuration Management
Autorpm is useful for maintaining updates, but update from a locally managed copy: control changes!
Test changes before rolling out!!!!!!!!
Need to ensure coherent, reproducible configuration: tricky!
–LCFG is good at this but cumbersome
–Kickstart needs great care: update kickstart AND systems independently?
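
"Test changes before rolling out" can be made mechanical by applying an update in growing stages, starting with a small test group. This is a sketch under assumed parameters (first stage size and growth factor); it only plans the stages and does not apply any update.

```python
from typing import List


def rollout_stages(hosts: List[str], first: int = 1, growth: int = 2) -> List[List[str]]:
    """Split hosts into stages of increasing size: first, first*growth, ...

    A change is applied to one stage, checked, and only then applied to the next.
    The defaults are illustrative assumptions.
    """
    if first < 1 or growth < 1:
        raise ValueError("first and growth must be at least 1")
    stages = []
    index, size = 0, first
    while index < len(hosts):
        stages.append(hosts[index:index + size])
        index += size
        size *= growth
    return stages


if __name__ == "__main__":
    nodes = [f"node{i:03d}" for i in range(1, 11)]  # hypothetical names
    for number, stage in enumerate(rollout_stages(nodes), start=1):
        print(f"stage {number}: {len(stage)} host(s)")
```

A mistake caught in the first stage affects one node rather than the whole farm.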

Management Tools
Very simple at RAL. Local parallel ssh.
Parallel rsh/ssh commands: prsh seems popular.
Project C3 seems worth a look.
Oscar bundles many interesting tools together.
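
For illustration, a local parallel ssh wrapper can be only a few lines. The sketch below is not the RAL tool or prsh; it assumes key-based ssh access to every host, and the function names, worker count and timeout are arbitrary choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def ssh_argv(host: str, command: str) -> List[str]:
    """Build the ssh command line; BatchMode stops ssh prompting for passwords."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, command]


def run_on_hosts(
    hosts: List[str],
    command: str,
    build_argv: Callable[[str, str], List[str]] = ssh_argv,
    max_workers: int = 32,
    timeout_s: int = 60,
) -> Dict[str, Tuple[int, str]]:
    """Run command on every host in parallel; return host -> (exit code, stdout).

    A host that cannot be reached or times out is reported with exit code -1.
    Host names are assumed to be unique.
    """
    def run_one(host: str) -> Tuple[int, str]:
        try:
            proc = subprocess.run(
                build_argv(host, command),
                capture_output=True, text=True, timeout=timeout_s,
            )
            return proc.returncode, proc.stdout.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            return -1, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, hosts))
    return dict(zip(hosts, results))


if __name__ == "__main__":
    # Hypothetical hosts; print only the ones that failed.
    for host, (code, out) in run_on_hosts(["node001", "node002"], "uptime").items():
        if code != 0:
            print(f"{host}: failed ({code}) {out}")
```

The build_argv parameter exists so the command line can be swapped (for rsh, or for a dry run) without touching the parallel logic.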

Exception monitoring
Need to spot problems before users do.
Run a daemon or crontab checking for errors.
On detection:
–Notify: SURE, Bigbrother,... (not !)
–Automated fixup (daemon restart, filesystem cleanup...)
–Automated drain/remove from configuration. Automated power down/up. Automated DNS updates.
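
The on-detection steps can be sketched as a small decision function: always notify, try an automated fixup when one is known, and drain the node once fixups stop helping. The problem names, fixup table and retry limit below are assumptions for illustration; the actions are returned as labels rather than executed.

```python
from typing import Dict, List

# Hypothetical mapping from a detected problem to an automated fixup.
FIXUPS: Dict[str, str] = {
    "daemon_down": "restart daemon",
    "tmp_full": "clean filesystem",
}


def plan_response(problem: str, previous_attempts: int, max_attempts: int = 2) -> List[str]:
    """Decide what to do about one detected problem on one node.

    Always notify. Attempt a known fixup while the retry limit has not been
    reached; otherwise drain the node so that no new work lands on it.
    """
    actions = ["notify"]
    if problem in FIXUPS and previous_attempts < max_attempts:
        actions.append(FIXUPS[problem])
    else:
        actions.append("drain")
    return actions


if __name__ == "__main__":
    print(plan_response("daemon_down", previous_attempts=0))
    print(plan_response("daemon_down", previous_attempts=2))
    print(plan_response("memory_errors", previous_attempts=0))
```

Capping the number of fixup attempts stops a node with a real fault from being restarted indefinitely.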

Incident Tracking
Keep track of significant interventions:
–Which hosts keep crashing: dates, times, errors etc.
–Which disks failed: serial numbers of returns, returns outstanding...
Keep track of tasks outstanding, e.g. why is csflnx231 currently offline, and who is fixing it...
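
Even a very small incident log answers the "which hosts keep crashing" question. The sketch below assumes an in-memory list of records with hypothetical fields and example entries; a real log would live in a file or database.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Incident:
    host: str
    day: date
    kind: str       # e.g. "crash" or "disk"; the categories are an assumption
    note: str = ""  # free text: error seen, serial number of a returned disk, ...


def repeat_offenders(
    incidents: Iterable[Incident], kind: str = "crash", min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return (host, count) for hosts with at least min_count incidents of one kind.

    Sorted by count (highest first), then by host name.
    """
    counts = Counter(i.host for i in incidents if i.kind == kind)
    return sorted(
        ((host, n) for host, n in counts.items() if n >= min_count),
        key=lambda pair: (-pair[1], pair[0]),
    )


if __name__ == "__main__":
    # Invented example entries.
    log = [
        Incident("csflnx231", date(2002, 11, 1), "crash", "hung, power cycled"),
        Incident("csflnx231", date(2002, 11, 9), "crash", "hung again"),
        Incident("csflnx007", date(2002, 11, 3), "disk", "disk returned to vendor"),
    ]
    print(repeat_offenders(log))
```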

Hardware Management
Many systems eventually means:
–Many system crashes
–Many hardware failures
Consider purchasing a 3-year warranty. On-site is easier.
Define a standard hardware (re)certification procedure.
Make use of junior staff (operators, postgrads, gran,...!)

Utilisation/Capacity planning
Monitor everything you can conveniently manage:
–MRTG is standard for network monitoring
–Ganglia appears to be popular for system utilisation etc.
–PBS accounting records (or process accounting).
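
As a worked example of what accounting records give you, farm utilisation over a period is the CPU time consumed by jobs divided by the CPU time available. The sketch assumes the per-job CPU times have already been extracted from the accounting records as HH:MM:SS strings; the parsing of the records themselves is not shown, and the function names are illustrative.

```python
from typing import Iterable


def hms_to_seconds(text: str) -> int:
    """Convert an HH:MM:SS string (hours may exceed 99) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def farm_utilisation(job_cpu_times: Iterable[str], n_cpus: int, period_s: int) -> float:
    """Fraction of available CPU time used by the given jobs.

    utilisation = sum(job CPU seconds) / (n_cpus * period_s)
    """
    if n_cpus <= 0 or period_s <= 0:
        raise ValueError("n_cpus and period_s must be positive")
    used = sum(hms_to_seconds(t) for t in job_cpu_times)
    return used / (n_cpus * period_s)


if __name__ == "__main__":
    # Two jobs on a two-CPU farm over one hour.
    print(farm_utilisation(["01:00:00", "00:30:00"], n_cpus=2, period_s=3600))
```

Tracked week by week, the same figure shows when the farm is approaching saturation and new capacity needs to be planned.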

Conclusions
Careful planning, specification and hardware selection can pay dividends.
Get smart or invest in lots of staff.
Monitor so you know what is going on.
Many issues raised, few solutions offered.
Wide range of experience out in the UK HEPSYSMAN community. Make use of it!