Managing A Large Farm: CSF Andrew Sansum 26 November 2002
Overview Will cover many of the large-scale issues associated with big CPU/disk farms. Intent is to provoke discussion rather than provide answers: I don't claim to be an expert! Many RAL solutions are dated, but new staff will soon be making changes.
Large Farms: The BIG differences. BIG is not beautiful: –A small mistake can proliferate: –problems can multiply, –many components can become involved. –THINK before you make changes! –Manual login on 500 nodes is a major disaster! Funding bodies often expect big farms to be run more professionally.
Hardware Specification Good quality hardware is vital. Go with a reputable company. Evaluate quality of solution. Check for component compatibility. Consider long warranties or be prepared for major interventions yourself (e.g. replacing all the fans).
Power Requirements Is there enough (steady state)? Right plugs!! Cope with surge on power up (think about power sequencing). What impact do PSUs have on power supply (cf. SLAC): neutral current imbalance, higher-order harmonics… Remote/automated power up/down is nice (e.g. APC units). Worry about equipment on different phases.
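The power sequencing point can be made concrete with a small scheduling sketch. It only works out who powers on when; the group size, the delay, and the node names are hypothetical values, and the actual switching would go through whatever remote power unit is installed.

```python
from typing import Iterator, List, Tuple


def power_on_schedule(
    nodes: List[str], group_size: int, delay_s: float
) -> Iterator[Tuple[float, List[str]]]:
    """Yield (offset_seconds, nodes) so at most group_size nodes start together."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for i in range(0, len(nodes), group_size):
        # Each successive group starts delay_s seconds after the previous one.
        yield (i // group_size) * delay_s, nodes[i:i + group_size]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    nodes = [f"node{n:03d}" for n in range(1, 11)]
    for offset, group in power_on_schedule(nodes, group_size=4, delay_s=30.0):
        print(f"t+{offset:5.1f}s: {', '.join(group)}")
```

Spreading the start-up over several groups keeps the inrush current of any one step bounded, at the cost of a longer total power-up time.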
Cooling Cooling must be sufficient! Must be able to cope with local hot spots. If cooling fails, things get hot very fast: need monitoring/automated shutdown.
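A minimal sketch of the monitoring/automated shutdown idea follows. The thresholds are hypothetical, and how the temperatures are read (and how a shutdown is actually issued) is left to the local tooling.

```python
from typing import Dict


def thermal_actions(
    temps_c: Dict[str, float],
    warn_c: float = 35.0,      # hypothetical warning threshold
    shutdown_c: float = 45.0,  # hypothetical shutdown threshold
) -> Dict[str, str]:
    """Map each overheating host to "warn" or "shutdown"; healthy hosts are omitted."""
    actions: Dict[str, str] = {}
    for host, temp in temps_c.items():
        if temp >= shutdown_c:
            actions[host] = "shutdown"
        elif temp >= warn_c:
            actions[host] = "warn"
    return actions
```

Checking each host separately, rather than a room average, is what lets a check like this catch a local hot spot.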
Installation Netboot/PXE avoids need for manual insertion of floppies. Use something like kickstart to: –Speed up installation task –Maintain record of configuration –Allow automated reconfiguration. LCFG not recommended, but maybe successors?
Configuration Management Autorpm is useful for maintaining updates, but update from local managed copy - control changes! Test changes before rolling out!!!!!!!! Need to ensure coherent, reproducible configuration - tricky! –LCFG is good at this but cumbersome –Kickstart needs great care - update kickstart AND systems independently?
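One way to check for a coherent, reproducible configuration is to compare each node against a reference package list. The sketch below assumes the package name to version mappings have already been collected somehow; it is not how Autorpm, LCFG or kickstart work internally.

```python
from typing import Dict, Optional, Tuple

Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def config_drift(reference: Dict[str, str], actual: Dict[str, str]) -> Drift:
    """Return {package: (wanted_version, installed_version)} for every mismatch.

    None on either side means the package is missing there.
    """
    drift: Drift = {}
    for name in sorted(set(reference) | set(actual)):
        want, have = reference.get(name), actual.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift
```

An empty result for every node suggests the farm matches the reference; anything else is a node to look at before the next change is rolled out.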
Management Tools Very simple at RAL: local parallel ssh. Parallel rsh/ssh commands: prsh seems popular. Project C3 seems worth a look. Oscar bundles many interesting tools together.
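The parallel ssh idea can be sketched in a few lines. This is not prsh, C3, or the RAL tool, just an illustration; it assumes key-based ssh access to every host, and the worker count and timeout are arbitrary.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Result = Tuple[int, str]  # (exit status, combined output)


def ssh_run(host: str, command: str, timeout_s: float = 30.0) -> Result:
    """Run one command on one host with the system ssh client."""
    try:
        proc = subprocess.run(
            # BatchMode=yes makes ssh fail instead of prompting for a password.
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode, proc.stdout + proc.stderr
    except (subprocess.TimeoutExpired, OSError) as exc:
        return 255, str(exc)


def parallel_run(
    hosts: List[str],
    command: str,
    runner: Callable[[str, str], Result] = ssh_run,
    max_workers: int = 32,
) -> Dict[str, Result]:
    """Run command on all hosts concurrently and collect results per host."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda h: runner(h, command), hosts)
        return dict(zip(hosts, results))
```

Passing the runner as a parameter means the fan-out logic can be tried with a fake runner before it is pointed at 500 real nodes.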
Exception monitoring Need to spot problems before users do. Run daemon or crontab checking for errors. On detection: –Notify: SURE, Bigbrother,... (not !) –Automated fixup (Daemon restart, filesystem cleanup...) –Automated Drain/Remove from configuration. Automated power down/up. Automated DNS updates.
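The detect, notify, fix up, drain sequence above might look something like the following sketch. The `Check` class and the `notify` and `drain` callbacks are hypothetical stand-ins for whatever alerting and batch system hooks exist locally.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]                    # True when healthy
    fixup: Optional[Callable[[], bool]] = None   # True when repair was attempted OK


def run_checks(
    checks: List[Check],
    notify: Callable[[str], None],
    drain: Callable[[str], None],
) -> List[str]:
    """Run every check once; notify, try a fixup, and drain if still broken."""
    log: List[str] = []
    for check in checks:
        if check.probe():
            continue
        notify(f"{check.name} failed")
        log.append(f"notify:{check.name}")
        # Re-probe after the fixup so a "successful" repair is actually verified.
        if check.fixup is not None and check.fixup() and check.probe():
            log.append(f"fixed:{check.name}")
            continue
        drain(check.name)
        log.append(f"drain:{check.name}")
    return log
```

A function like this could be called from a daemon loop or a crontab entry; the returned log doubles as a record of what was done automatically.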
Incident Tracking Keep track of significant interventions. –Which hosts keep crashing. Dates, times, errors etc. –What disks failed, serial numbers of returns, returns outstanding... Keep track of tasks outstanding, e.g. why is csflnx231 currently offline, who is fixing it...
Hardware Management Many systems eventually means: –Many system crashes. –Many hardware failures. Consider purchasing a 3-year warranty. On-site is easier. Define standard hardware (re)certification procedure. Make use of junior staff (operators, postgrads, gran, ...!)
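Even a very small record structure answers the two questions above: which hosts keep crashing, and what is still outstanding. The field names and the incident kinds below are illustrative assumptions, not an existing RAL schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Incident:
    host: str
    when: datetime
    kind: str              # e.g. "crash" or "disk-failure" (illustrative labels)
    note: str = ""         # error text, disk serial number, return reference...
    owner: str = ""        # who is fixing it
    resolved: bool = False


def repeat_offenders(
    incidents: List[Incident], kind: str = "crash", min_count: int = 2
) -> List[str]:
    """Hosts with at least min_count incidents of the given kind, sorted by name."""
    counts = Counter(i.host for i in incidents if i.kind == kind)
    return sorted(host for host, n in counts.items() if n >= min_count)


def outstanding(incidents: List[Incident]) -> List[Incident]:
    """Incidents that nobody has closed yet, in their original order."""
    return [i for i in incidents if not i.resolved]
```

Keeping the owner on the record is what makes "why is this host offline and who is fixing it" a lookup rather than a hallway conversation.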
Utilisation/Capacity planning Monitor everything you can conveniently manage. –MRTG is standard network monitoring –Ganglia appears to be popular for system utilisation etc. –PBS accounting records (or process accounting).
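As an illustration of using accounting records for capacity planning, the sketch below totals wall-clock time per user. It assumes the usual PBS-style layout of `date time;record type;job id;key=value ...` with `E` marking job-end records and a `resources_used.walltime` field; check both against the local accounting files before relying on it.

```python
from collections import defaultdict
from typing import Dict, Iterable


def hms_to_seconds(text: str) -> int:
    """Convert "HH:MM:SS" to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_by_user(lines: Iterable[str]) -> Dict[str, int]:
    """Sum walltime (seconds) per user over job-end ("E") records."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        parts = line.strip().split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue  # skip malformed lines and non job-end records
        fields = dict(kv.split("=", 1) for kv in parts[3].split() if "=" in kv)
        user = fields.get("user")
        wall = fields.get("resources_used.walltime")
        if user and wall:
            totals[user] += hms_to_seconds(wall)
    return dict(totals)
```

The same loop extends naturally to per-queue or per-group totals by keying on a different field.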
Conclusions Careful planning, specification and hardware selection can pay dividends. Get smart or invest in lots of staff. Monitor so you know what is going on. Many issues raised, few solutions offered. Wide range of experience out in the UK HEPSYSMAN community. Make use of it!