Chris Brew RAL PPD Site Report Chris Brew SciTech/PPD

Outline
Hardware
–Current
    Grid
    User
–New
Machine Room Issues
–Power, Air Conditioning & Space
Plans
–Tier 3
–Configuration Management
–Common Backup
Issues
–Log processing
Windows

Current Grid Cluster
CPU:
–52 x Dual Opteron 270 Dual Core CPUs, 4GB RAM
–40 x Dual PIV Xeon 2.8GHz, 2GB RAM
–All running SL3 glite-WN
Disk:
–8 x 24 Slot dCache Pool Servers
    Areca ARC RAID cards
    22 x WD5000YS RAID 6 (Storage), 10TB
    2 x WD1600YD RAID 1 (System)
    64 bit SL4, single large xfs file system
Misc:
–GridPP Front Ends running Torque, LFC/NFS, R-GMA, dCache Head
–Ex WNs running CE, DHCPD/TFTP pxeboot server
Network now at 10Gb/s but external link still limited by Firewall
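
The 10TB figure per pool server follows from RAID 6 giving up two disks' worth of space to parity. A minimal sketch of that arithmetic, assuming the WD5000YS is a 500GB drive counted in decimal units and ignoring file system overhead; the function name is illustrative only:

```python
def raid6_usable_gb(num_disks: int, disk_size_gb: float) -> float:
    """Usable capacity of a RAID 6 array: two disks' worth of space holds parity."""
    if num_disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (num_disks - 2) * disk_size_gb


# Assumption: 500 GB per storage disk, decimal units, no file system overhead.
per_server_tb = raid6_usable_gb(22, 500) / 1000
total_tb = per_server_tb * 8  # 8 pool servers listed on the slide

print(f"per server: {per_server_tb:.0f} TB, all pool servers: {total_tb:.0f} TB")
```

Under these assumptions the eight pool servers together would offer roughly 80TB before formatting.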

Current User Cluster
User Interfaces
–7 ex WNs from dual 1.4GHz PIII to dual 2.8GHz PIV
    6 x SL3 (1 test, 2 general, 3 expt)
    1 SL4 test UI
2 x Dell PowerEdge 1850 Disk Servers
–Dell PERC 4/DC RAID card
–6 x 300GB disks in Dell PowerVault 220 SCSI shelf
–Serves Home and experiment areas via NFS
    Master copy on one server rsync’d to backup server 1-4 times daily
    Home area backed up to ADS daily
Same hardware as Windows solution, common spares

Other Miscellaneous Boxen
Extra Boxes
–Install/Scratch/Internal Web server
–Monitoring Server
–External Web Server
–Minos CVS Server
–NIS Master
–Security Box (Central Logger and Tripwire)
New Kit (undergoing burn-in now)
–32 x Dual Intel Woodcrest 5130 Dual Core CPUs, 8GB RAM (Streamline)
–13 Viglen HS160a Disk servers

Machine Room Issues
Too much equipment for our small departmental Computer room
Taken over adjacent “Display” area
–Historically part of computer room
–Already has raised floor and three phase power, though a new distribution panel is needed for the latter
–Common air conditioning with Computer Room
Refurbished power distribution, installed kit and powered on:
–Temp in new area rose to 26°C, temp in old area fell by 1°C
–“Consulting” engineer called in by estates to “rebalance” air conditioning. Very successful: Old/New now 21.5/22.7°C
–Also calculated total capacity of plant at 50kW of cooling; currently we are using ~30kW
Next step is to refurbish the power in the old machine room to reinstate the three phase supply
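
The cooling figures quoted above imply a fair amount of headroom. A small sketch of that arithmetic, using only the 50kW capacity and the approximate 30kW load from the slide; the function name is illustrative:

```python
from typing import Tuple


def cooling_headroom(capacity_kw: float, load_kw: float) -> Tuple[float, float]:
    """Return (spare cooling in kW, fraction of capacity in use)."""
    if capacity_kw <= 0:
        raise ValueError("capacity must be positive")
    return capacity_kw - load_kw, load_kw / capacity_kw


# Figures from the slide; the 30 kW load is approximate.
spare_kw, utilisation = cooling_headroom(50, 30)
print(f"spare: {spare_kw:.0f} kW, in use: {utilisation:.0%}")
```

On these numbers about 20kW of cooling remains, which is the budget any further kit would have to fit within.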

Monitoring
2 different monitoring systems
–Ganglia: Monitors per-host metrics and records histories to produce graphs; good for trending and viewing current and historic status
–Nagios: Monitors “services” and issues alerts; good for raising alerts and viewing “what’s currently bad”. See other talk
In view of current lack of effort, there is a program to get as much monitoring as possible into Nagios so that we are alerted automatically.
–Recently added alerts for SAM tests and Yumit/Pakiti updates
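
Nagios alerts of the kind described above are driven by small check plugins that print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a hedged sketch of such a check for pending package updates, not the check actually used at the site; the thresholds, the `UPDATES` label and the way the count is supplied on the command line are all assumptions made for illustration:

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_pending(pending, warn=1, crit=10):
    """Map a count of pending package updates to a (code, status line) pair.

    The warn/crit thresholds are hypothetical defaults.
    """
    if pending is None or pending < 0:
        return UNKNOWN, "UPDATES UNKNOWN - could not determine pending updates"
    if pending >= crit:
        code = CRITICAL
    elif pending >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"UPDATES {LABELS[code]} - {pending} package update(s) pending"


def main(argv):
    # Hypothetical input: the pending-update count arrives as the first argument.
    try:
        pending = int(argv[1])
    except (IndexError, ValueError):
        pending = None
    code, message = classify_pending(pending)
    print(message)  # Nagios displays the first line of plugin output
    return code

# A real plugin would finish with: sys.exit(main(sys.argv))
```

Keeping the classification in a pure function separate from `main` makes the thresholds easy to test without running Nagios.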

Plans 1: Tier 3
Physicists seem to want access to batch other than on the grid, so we need to provide local access
Rather than run 2 batch systems, we want to give local users access to the Grid batch workers
Need to:
–Merge grid and user cluster account databases
    Modify YAIM to use NIS pool accounts
–Change maui settings to Fairshare Grid/Non-Grid before VO before Users
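
The intended ordering, Grid/Non-Grid before VO before Users, can be pictured as a layered fair-share score in which each level outweighs everything below it. The sketch below is an illustration of that idea only; it is not Maui's algorithm or configuration syntax, and the weights, targets and usage figures are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Share:
    target: float  # desired share of the cluster, in percent
    used: float    # share recently consumed, in percent

    def deficit(self) -> float:
        """Positive when under target (deserves a boost), negative when over."""
        return self.target - self.used


# Hypothetical weights. Deficits lie between -100 and +100, so a one-point
# difference at one level outweighs any possible difference at the levels below.
WEIGHTS = {"origin": 1_000_000, "vo": 1_000, "user": 1}


def priority(origin: Share, vo: Share, user: Share) -> float:
    """Combine the three fair-share deficits, most significant level first."""
    return (WEIGHTS["origin"] * origin.deficit()
            + WEIGHTS["vo"] * vo.deficit()
            + WEIGHTS["user"] * user.deficit())


# Made-up example: grid jobs are over their share, local jobs are under theirs.
grid_job = priority(Share(70, 80), Share(60, 10), Share(60, 10))
local_job = priority(Share(30, 20), Share(0, 0), Share(10, 15))
print(local_job > grid_job)
```

In this example the local job wins even though the grid job's VO and user are far below their targets, because the Grid/Non-Grid level is evaluated first.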

Plans 2: cfengine
Getting to be too many worker nodes to manage with the current ad hoc system; need to move towards a full configuration management system
After asking around, decided upon cfengine
Test deployment promising
Working on re-implementing the Worker Node install in cfengine
Still need to find a good solution for secure key distribution to newly installed nodes

Plans 3: Common Backup
Current backup of important files for Unix is to the Atlas Data Store
–Not sure how much longer the ADS is going to be around, need to look for another solution
Was intending to look at Amanda but…
–Dept bought new 30 slot tape robot for Windows Backup
–Veritas Backup software in use on Windows supports Linux Clients
Just starting tests on a single node. Will keep you posted.

Plans 4: Reliable Hardware
Plan to purchase a new class of “more reliable” worker node type machines
–Dual system disks in hot swap caddies
–Possibly redundant hot swap power supplies
Use this type of machine for running Grid services, Local services (Databases, web servers etc.) and User Interfaces

Issues 1: Log Processing
Already running Central Syslog Server (soon to be expanded to 2 hosts for redundancy).
As with our Tripwire, a fairly passive system
–Hope to get enough info off the system to be useful after the event
Would like some system to monitor these logs and flag “interesting” events.
Would prefer little or no training required.
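
As a rough picture of what flagging “interesting” events could mean in the simplest case, the sketch below scans syslog-style lines against a short list of regular expressions and counts the matches. It is an assumption-laden illustration rather than a recommendation of a tool: the patterns, labels and sample lines are all made up, and a real deployment would need rules tuned to the site's own logs.

```python
import re
from collections import Counter

# Hypothetical patterns; labels and expressions are illustrative only.
INTERESTING = {
    "ssh_auth_failure": re.compile(r"sshd\[\d+\]: Failed password for"),
    "disk_error": re.compile(r"kernel: .*I/O error"),
}


def flag_events(lines):
    """Yield (label, line) for every log line matching an interesting pattern."""
    for line in lines:
        for label, pattern in INTERESTING.items():
            if pattern.search(line):
                yield label, line.rstrip("\n")
                break  # report each line once, under the first matching label


def summarise(lines):
    """Count flagged events per label, e.g. for a daily digest."""
    return Counter(label for label, _ in flag_events(lines))


# Made-up sample lines for illustration, not taken from any real host.
sample = [
    "Mar  1 02:14:07 wn012 sshd[4312]: Failed password for root from 192.0.2.10 port 4022 ssh2",
    "Mar  1 02:15:00 wn012 crond[991]: (root) CMD (run-parts /etc/cron.hourly)",
    "Mar  1 03:01:22 pool03 kernel: end_request: I/O error, dev sda, sector 1234",
]
print(dict(summarise(sample)))
```

A pattern list like this requires no training, but it only catches what someone has already thought to look for.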

Windows, etc.
Still using Windows XP, with Office 2003 and Hummingbird eXceed
–Are looking at Vista and Office 2007 but not yet seriously and have no plans for rollout yet
Now managed at Business Unit level rather than department
Looking for synergies between Unix and Windows support:
–Common file server hardware
–Common Backup Solution
Recently equipped PPD Meeting room with Polycom rollabout VideoConferencing system.