Northgrid Status. Alessandra Forti, GridPP22, UCL, 2 April 2009
Outline
Resilience
Hardware resilience
Software changes resilience
Manpower resilience
Communication
Site resilience status
General status
Conclusions
Resilience
Definition:
1. The power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.
2. Ability to recover readily from illness, depression, adversity, or the like; buoyancy.
Translation:
– Hardware resilience: redundancy and capacity
– Manpower resilience: continuity
– Software resilience: simplicity and ease of maintenance
– Communication resilience: effectiveness
Hardware resilience
The system has to be redundant and have enough capacity to take the load.
There are many levels of redundancy and capacity, with increasing cost:
– Single machine components: disks, memory, CPUs
– Full redundancy: replication of services in the same room
– Full redundancy, paranoid: replication of services in different places
Clearly there is a trade-off between how important a service is and how much money a site has to replicate it.
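The trade-off between replication and cost can be made concrete with a rough availability estimate. The sketch below assumes that replicas fail independently, which machines sharing a room, power feed or cooling do not strictly satisfy. The function name and the 0.99 figure are illustrative assumptions, not measurements from any Northgrid site.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that is up while at least one replica is up.

    Assumes independent failures, which is an assumption and not a
    measured property of any real site.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only if every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # 0.99 is a hypothetical single-machine availability.
    for n in (1, 2, 3):
        print(n, combined_availability(0.99, n))
```

Under this model each extra replica multiplies the expected downtime by the single-machine failure probability, so the gain per replica shrinks quickly while the cost grows roughly linearly. Replication in a different place mainly helps by making the independence assumption more realistic.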
Manpower resilience
The manpower has to ensure continuity of service. This continuity is lost when people change.
– It takes many months to train a new system administrator
– It takes even longer in the grid environment, where there are no well-defined guidelines, the documentation is dispersed and most of the knowledge is passed on by word of mouth
Protocols and procedures for almost every action should be written down to ensure continuity.
– How to shut down a service for maintenance
– What to do in case of a security breach
– Who to call if the main link to JANET goes down
– What to do to update the software
– What to do to reinsert a node in the batch system after a memory replacement
– ...
Software resilience
Simplicity and ease of maintenance are key to at least two things:
– Service recovery in case disaster strikes
– A less steep learning curve for new people
The grid software is neither simple nor easy to maintain. At the very least it is complicated, poorly documented and continuously changing.
– dCache is a flagship example of this, which is why it is being abandoned by many sites.
– But there is also a problem with continuous changes in the software itself: lcg-CE, glite-CE, cream-CE; 4 or 5 storage systems that are almost incompatible with each other; RB or WMS or experiments' pilot frameworks; SRM yes, no, SRM is dead; ...
Communication
Communication has to be effective. If one means of communication is not effective, it should be replaced with a more effective one.
– I was always missing SA1 ACL requests for the SVN repository, so I redirected them to the Manchester helpdesk. Now I respond within 2 hours during working hours.
– System admins in Manchester weren't listening to each other during meetings. Now there is a rule to write EVERYTHING in the tickets.
– ATLAS taking sites offline was a problem because the action was recorded in the ATLAS shifter elogs. Now they will write it in the ticket, so the site is made aware immediately of what is happening.
Lancaster
Twin CEs
New kit has dual PSUs
All systems in cfengine
Daily backup of databases
Current machine room has new redundant air conditioning
Temperature sensors with Nagios alarms have been installed
2nd machine room with modern chilled racks
– Available in July
Only one fibre uplink to JANET
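As an illustration of how a temperature sensor can feed a Nagios alarm, the sketch below maps one reading to the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the threshold values are hypothetical and do not describe Lancaster's actual configuration, and reading the sensor itself is left out.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(celsius, warn=27.0, crit=32.0):
    """Return (exit_code, status_line) for one machine-room temperature reading.

    warn and crit are hypothetical thresholds in degrees Celsius.
    A reading of None stands for a sensor that could not be queried.
    """
    if celsius is None:
        return UNKNOWN, "TEMPERATURE UNKNOWN - no reading from sensor"
    if celsius >= crit:
        return CRITICAL, f"TEMPERATURE CRITICAL - {celsius:.1f} C"
    if celsius >= warn:
        return WARNING, f"TEMPERATURE WARNING - {celsius:.1f} C"
    return OK, f"TEMPERATURE OK - {celsius:.1f} C"
```

A plugin script built on this would print the status line and exit with the returned code, which is how Nagios decides whether to raise an alarm.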
Liverpool
Strong points:
– Reviewed and fixed single points of failure 2 years ago.
– High-spec servers with RAID1 and dual PSUs.
– UPS on critical servers, RAIDs and switches.
– Distributed software servers with a high level of redundancy.
– Active rack monitoring, Nagios, Ganglia and custom scripts.
– RAID6 on SE data servers.
– WAN connection has redundancy and automatic failover recovery.
– Spares for long lead time items.
– Capability of maintaining our own hardware.
Liverpool (cont.)
Weak points:
– BDII and MON nodes are old hardware.
– Single CE is a single point of failure.
– Only 0.75 FTE over 3 years dedicated to grid admin.
– Air-con is ageing and in need of constant maintenance.
University has agreed to install new water-cooled racks for future new hardware.
Manchester
Machine room: 2 generators + 3 UPS + 3 air conditioning units
– University staff dedicated to the maintenance
Two independent clusters (2 CEs, 2x2 SEs, 2 SW servers)
All main services have RAID1, and memory and disks have also been upgraded
They are in the same rack, attached to different PDUs
Services can be restarted remotely
All services and worker nodes are installed and maintained with kickstart+cfengine, which allows the system to be reinstalled within an hour
– Anything that cannot go in cfengine goes in YAIM pre/local/post, in an effort to eliminate any forgettable manual steps
All services are monitored
A backup system for all databases is in place
Manchester (cont.)
We lack protocols and procedures for handling a situation in a consistent way when it occurs
– We have started writing them, beginning with things as simple as switching off machines for maintenance
Disaster recovery is exercised only when a disaster happens
Irregular maintenance periods led to clashes with the generators' routine tests
The RT system is used for communication with users but also to log everything that is done in the T2
– Bad communication between sys admins has been a major problem
Sheffield
The main weak point for Sheffield is the limited physical access to the cluster. We have access 9-17 on weekdays only.
We use a fairly expensive SCSI disk for the experiment software; because of the cost we do not have a spare disk in case of failure, so we would need some time to order a replacement and to write all the experiment software back.
The CE and the Mon Box have only one power supply and only one disk each.
In future, perhaps a RAID1 system with 2 PSUs for the CE and the Mon Box.
It would be good to have a UPS.
The DPM head node already has 2 PSUs and a RAID5 system with an extra disk.
We have similar WNs, CE and Mon Box, so we can find spare parts.
We have managed to keep reliability fairly stable.
General Status (1)
Site | Middleware | OS | SRM2.2 | Space Tokens | SRM brand | CPU (kSI2K) | Storage (TB) | Used Storage (TB) | Storage usage %
Lancaster | gLite 3.1 | SL4 | | yes | DPM | 1040 | 200 | 39.6 | 19%
Liverpool | gLite 3.1 | SL4 | | yes | dCache -> DPM | 559 | 130 | 13.7 | 10%
Manchester | gLite 3.1 | SL4 | | yes | dCache/DPM/xrootd | 2160 | 142/104/20 | 39.2 | 15%
Sheffield | gLite 3.1 | SL4 | | yes | DPM | 182.5 | 25.6 | 4.5 | 17%
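The storage usage percentages in the General Status table can be re-derived from the used and total storage figures. The sketch below does that arithmetic. Treating Manchester's total as the sum of its three storage figures (142 + 104 + 20 TB) is an assumption about how the slide should be read, and the names `STORAGE_TB` and `usage_percent` are illustrative.

```python
# (used TB, total TB) per site, as read from the General Status table.
# Assumption: Manchester's total is the sum of its three storage figures.
STORAGE_TB = {
    "Lancaster": (39.6, 200.0),
    "Liverpool": (13.7, 130.0),
    "Manchester": (39.2, 142.0 + 104.0 + 20.0),
    "Sheffield": (4.5, 25.6),
}


def usage_percent(used_tb: float, total_tb: float) -> float:
    """Used storage as a percentage of total storage."""
    if total_tb <= 0:
        raise ValueError("total storage must be positive")
    return 100.0 * used_tb / total_tb


# Computed usage per site, unrounded.
USAGE = {site: usage_percent(used, total) for site, (used, total) in STORAGE_TB.items()}

if __name__ == "__main__":
    for site, pct in USAGE.items():
        print(f"{site}: {pct:.1f}%")
```

The computed values can be compared with the percentages quoted on the slide, which appear to be rounded or truncated to whole numbers.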