1 Tier-1 – Final preparations for data. Andrew Sansum, 9th September 2009

2 Themes (last 9 months)
– Improve planning
– Recruitment
– Re-engineer production and operations processes
– Enhance resilience
– Test it works (STEP09)
– Move to R89
– Test new Disaster Management System
– Final preparations for data taking

3 The Plan
[Timeline chart, April–November: prepare for STEP, STEP, prepare for R89, R89 migration, CASTOR upgrade (SRM + nameserver), CASTOR hardware resilience, new hardware, SL5 upgrade, LFC/FTS, 3D, test disaster management system, update contingency, prepare for data taking, freeze.]

4 Recruitment complete
Recruitment has been tough (but a good team is in place now):
– Initially the STFC recruitment freeze
– Later, hard to recruit

5 Meeting Experiment Needs
VO survey carried out in April:
– Based on a series of qualitative and quantitative questions
– Very helpful and considered feedback from the most significant VOs
Generally very positive. Key findings:
– Communication between Tier-1 and VOs generally working well
– Production team have made a big difference
– Meeting commitments/expectations of the LHC VOs
– VOs not always clear on Tier-1 priorities (since tried to address this via the liaison meeting)
– Non-LHC VOs commented that although support was good, the Tier-1 did not always deliver service on agreed timescales (unfortunately intentional, reflecting priorities – expectations management?)
– Documentation poor (still need to work on this)

6 Production Team/Production Ops
Daytime team of 3 staff (Gareth Smith, John Kelly, Tiju Idiculla):
– Handle operational exceptions (Nagios alerts/pager callouts)
– Track tickets
– Monitor routine metrics, loads, network rates
– Ensure operational status is communicated to VOs
– Represent the Tier-1 at WLCG daily operations
– Oversee downtime planning, agree the near-term downtime plan
– Oversee progression of Service Incident Reports
– (Re-)engineer operational processes
Night-time/weekend team of 5 staff on-call at any time (2-hour response):
– Primary on-call (triage and fix easy faults)
– Secondary on-call: CASTOR, Grid, Fabric, Database

7 Callout rate
Big improvement over 2009, with a recent deterioration owing to development activity and major incidents.

8 Process Improvement
The service is complex, with frequent routine interventions – e.g. adding disk servers to a service class, taking disk servers offline.
Mistakes occur if they are not engineered out.
Work in progress, but critical if we are to meet high expectations.

9 CASTOR (I)
Process of gradual improvement, tracking down causes of individual transfer failures and improving processes (e.g. disk server intervention status):
– Applied ORACLE patch to fix the "Big ID" bug
– Series of CASTOR minor version upgrades to 2.1.7-27, predominantly bug-fixes, including a workaround to prevent the ORACLE crosstalk bug from recurring
– Reconfiguration of the internal LSF scheduler to improve stability and scalability (move from NFS to HTTP)
– Tuning changes
– ORACLE migration to new hardware (two EMC RAID arrays), providing additional resilience, improved performance and better maintenance
– SRM upgrades to version 2.7.15

10 CASTOR: Downtime (2008-2009)
[Chart of CASTOR downtime across 2008-2009, with the 2.1.7 upgrade and the R89 move marked.]

11 CASTOR (III): Plans
September:
– Nameserver upgrade to 2.1.8
– SRM upgrade to version 2.8
– CIP upgrade to version 2 (in progress)
2009 Q4:
– Optimising the ORACLE database
– Additional resilience
– Disaster recovery testing

12 STEP09: Operations Overview
Generally very smooth operation:
– Most service systems relatively unloaded, plenty of spare capacity
– Calm atmosphere. Daytime production team monitored the service. Only one callout; most of the team even took two days off-site for a department meeting!
– Very good liaison with VOs and a good idea of what was going on. In regular informal contact with UK representatives.
– Some problems with CASTOR tape migration (3 days) on the ATLAS instance, but all handled satisfactorily and fixed. Did not visibly impact experiments.
– Robot broke down for several hours (a stuck handbot led to all drives being de-configured in CASTOR). Caught up quickly.
Very useful exercise – learned a lot, but very reassuring.
More at: http://www.gridpp.rl.ac.uk/blog/category/step09/

13 STEP09: Batch Service
Farm typically running > 2000 jobs. By 9th June at equilibrium (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%).
Problem 1: ATLAS job submission exceeded 32K files on the CE.
– See the hole on the 9th. We thought ATLAS had paused, so it took time to spot.
Problem 2: Fair shares not honoured, as aggressive ALICE submission beat ATLAS to job starts.
– Need more ATLAS jobs in the queue faster. Manually capped ALICE. Fixed by 9th June – see the decrease in (red) ALICE work.
Problem 3: Occupancy initially poor (around 90%). Short on memory (2GB/core, but ATLAS jobs needed 3GB vmem). Gradually increased the MAUI over-commit on memory to 50%; occupancy rose to ~98%. (The over-commit arithmetic is sketched below.)
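The over-commit fix comes down to simple arithmetic: let the scheduler treat each core as having more schedulable memory than is physically installed. A minimal sketch of that arithmetic, assuming only the 2GB/core and 3GB vmem figures quoted above (the function and constant names are illustrative, not MAUI configuration syntax):

```python
# Minimal sketch (not the Tier-1's actual scheduler configuration) of the
# memory over-commit arithmetic described on the slide.

PHYSICAL_MEM_PER_CORE_GB = 2.0   # farm nodes: 2 GB RAM per core
ATLAS_VMEM_PER_JOB_GB = 3.0      # ATLAS jobs request 3 GB virtual memory

def schedulable_mem(physical_gb: float, overcommit_fraction: float) -> float:
    """Memory the scheduler treats as available per core, for a given
    over-commit fraction (0.5 == 50% over-commit)."""
    return physical_gb * (1.0 + overcommit_fraction)

# With no over-commit, a 3 GB vmem job cannot be matched to a 2 GB core:
assert schedulable_mem(PHYSICAL_MEM_PER_CORE_GB, 0.0) < ATLAS_VMEM_PER_JOB_GB

# With the 50% over-commit mentioned on the slide, 2 GB * 1.5 = 3 GB,
# just enough to place one ATLAS job per core:
assert schedulable_mem(PHYSICAL_MEM_PER_CORE_GB, 0.5) >= ATLAS_VMEM_PER_JOB_GB
```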

14 STEP09: Network
– Batch farm drawing approx 3Gb/s from CASTOR during reprocessing. Peaked at 30Gb/s for CMS reprocessing without lazy download.
– Total OPN traffic: inbound 3.5Gb/s, outbound 1Gb/s.
– RAL to Tier-2 outbound rate averaged 1.5Gb/s, but with 6Gb/s spikes!

15 STEP09: Tape
Tape system worked well. Sustained 4Gb/s during peak load on 13 drives (ATLAS+CMS), 15 drives with LHCb. We ran a mix of dedicated and shared drives (4 ATLAS, 4 CMS, 2 LHCb, 5 shared).
– Typical average rate of 35MB/s per drive (1-day average)
– Lower than we would like (looking for nearer 45MB/s)
– On the CMS instance, a modified write policy gave > 60MB/s, but reads are more challenging to optimise
(A quick consistency check of the per-drive rate is sketched below.)
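As a rough consistency check of the quoted figures (assuming decimal units, 1 Gb/s = 1000 Mb/s, and 8 bits per byte), the aggregate 4Gb/s spread over 13 drives lands close to the observed per-drive average:

```python
# Quick consistency check relating the aggregate tape rate to the
# per-drive figure quoted on the slide.

sustained_gbps = 4.0      # aggregate sustained rate during peak load (Gb/s)
drives = 13               # drives in use for ATLAS+CMS at that point

aggregate_MBps = sustained_gbps * 1000 / 8    # = 500 MB/s
per_drive_MBps = aggregate_MBps / drives      # ~38 MB/s

# ~38 MB/s per drive, in line with the quoted ~35 MB/s one-day average
print(f"~{per_drive_MBps:.0f} MB/s per drive")
```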

16 R89: Migration
Migration planning started early 2008 (building early 2006).
Detailed equipment documentation, together with a requirements document, was sent to vendors during September 2008. Workshop hosted during November. Vendors committed to 3 racks (each) per day (we believe 5-6 was feasible).
Orders placed at the end of November to move 77 racks of equipment (and the robot) to an agreed schedule (Tier-1 = 43 racks).
Started 22nd July and ended 6th August. Completed to schedule.

17 R89 Migration: 43 racks moved
[Timeline chart of the move (Wed 17 through Mon 6): drain WMS, drain CEs, batch workers start, drain FTS, critical services, CASTOR core + disk, batch workers complete, disk complete, CASTOR restarting, restart.]

18 Disasters: Swine Flu
First test of the new disaster management system. Easy to handle – trivial to generate a contingency plan based on the existing template. Situation regularly assessed. Tier-1 response initially ran ahead of RAL site planning.
Reached level 2 in the DMS, with assessment meetings every 2 weeks. Work mainly on remote working and a communication strategy.
Now downgraded to level 1 until a significant rise in case frequency. Expect to dust off again before Christmas.

19 Disasters: Air-conditioning (I)
Two cooling failures in 3 days.
[Chart of cold-aisle and hot-aisle temperatures (15-45°C), annotated with the shutdown, the room reaching equilibrium and the chiller restart.]
– Monday (daytime): both chiller systems shut down, restarted quickly
– Tuesday: one chiller shut down and failed over to the second chiller
– Wednesday night: both chillers shut down and could not be restarted
– After the third event, decided not to restart the Tier-1

20 Disasters: Air-conditioning (II)
Initial post-mortem started after the first (daytime) event:
– Thermal monitoring, callout and automated shutdown in R89 not fully implemented/working correctly
– Urgent remedial work underway
The second, night-time incident raised further concerns:
– Tier-1 called out and rapidly escalated
– But automated shutdown still in test mode – forced to do a manual shutdown
– Operations thermal callout failed to work as required
– Site security did not escalate the BMS alarm (not an expected alarm)
– Escalation to building services very slow (owing to R89 still being under warranty/acceptance)
– Chillers could not be restarted
– No explanation of the cause of the outage
Concluded we would not restart the Tier-1 until these issues were resolved.

21 Disasters: Air-conditioning (III)
Critical services continued to run:
– Separate, redundant cooling system in the UPS room
– Tape robotics and CASTOR core OK too (low-temperature room)
By Friday:
– Tier-1 response at disaster level 3 (meeting held with VOs and PMB)
– Building services believed that cooling was stable and the fault could not recur
– All necessary automation, callout and escalation processes in place
– Nevertheless, the Tier-1 team was not prepared to run hardware unattended over the weekend
On Monday:
– Full service restart
– Plan to baby-sit the service during Monday/Tuesday evenings
Forensics and post-mortem continued.

22 Disasters: Air-conditioning (IV)
Monday 10th: incident believed to be caused by a planned reboot of the Building Management System (BMS):
– Caused pumps to stop
– Low pressure caused chiller valves to close
– BMS returned but the system was deadlocked
Tuesday 11th: single chiller trip followed by failover:
– Logs do not allow diagnosis
Wednesday 12th: BMS detected over-pressure in the cooling system and triggered a shutdown:
– Probably a true over-pressure (1.9 bar)
– The 1.7 bar trip setting is considered too low; now raised to 2.5 bar and only triggers a callout
– System tested to 6 bar
– Investigations continue

23 Disasters: Water Leak
Water found dripping on the tape robot!!! An "I don't believe this is happening" moment.
Should not be able to happen, as there are no planned water supplies above the machine room. Fortunately the Tier-1 was already shut down, so the robot was turned off too.
The STK engineer investigated and concluded that the damage is mainly superficial splash damage: drive heads not contaminated, tapes (60 splashed) probably OK.
Indications that this had been occurring occasionally for several weeks.

24 Disasters: Water Leak
Cause: condensation from the 1st-floor cooling system.
– Incorrect damper setting (air intake) led to excess condensation
– Condensation collected in a drip tray and was pumped away
– Tray too small and pump inadequate
– Water overflowed the tray and tracked along the floor to a hole
Remedy:
– Place an umbrella over the robot
– Chillers switched off – 1st floor inspected daily!
– Planning underway to re-engineer drip trays, pumps, alarms, etc.
– Monitor the tape error rate

25 Procurements
Disk, CPU and robotics procurements delayed from January/February delivery dates:
– New SL8500 tape robot entirely for GridPP; 2PB of disk in 24-drive units (50% Areca/WD, 50% 3Ware/Seagate); CPU capacity
Eventually delivered in May, but entangled in the R89 migration:
– New robot in production in July
– CPU completed acceptance testing and is being deployed into SL5
– One lot of disk (1PB) ready for deployment
– Second lot failed acceptance (many drive ejects)
Positive aspects of the acceptance failure:
– The two-lot risk-avoidance strategy worked
– The vendor's 1-week load test failed to find the fault
– Our 28-day acceptance caught the fault before the kit reached production

26 LFC, FTS and 3D
Now complete. Upgraded back-end RAID arrays and Oracle servers:
– Replaced elderly RAID arrays with a pair of new EMC RAID arrays
– Better support (we hope)
– Better performance
– Moved to ORACLE RAC for LFC/FTS (increased resilience)
– Separated the ATLAS LFC from the general LFC
– Upgraded 3D servers and moved them to the new RAID arrays
Work has commenced on testing replication of the LFC for disaster contingency.

27 Quattor – Story So Far
Began work in earnest in June 2009. Set up a Quattor Working Group (QWG) instance to manage deployment and configuration of new hardware:
– Leverages strong QWG support for gLite
Have the SL5 torque/maui server under Quattor control. Are (as of today) deploying 220+ new WNs into the SL5 batch service.
Significant work to get up and running – a new way of working. Have uncovered and helped fix a number of bugs and issues in the process.

28 Quattor – Next Steps
As we move existing WNs to SL5 (need 75% of our capacity in SL5) we will quattorise them.
Move CEs and other grid service nodes to Quattor. Gradually migrate non-grid services to Quattor control.
AQUILON:
– Database back-end to Quattor developed by Morgan Stanley; improves scalability and manageability (MS are managing >15,000 nodes)
– Will deploy at RAL first
– Then plan to make Aquilon usable by other grid sites as well

29 Dashboard
Available at http://www.gridpp.rl.ac.uk/status
Constantly evolving – components can be added/updated/removed.
Present components:
– SAM tests: latest test results for critical services, locally cached for 10 minutes to reduce load
– Downtimes: ongoing and upcoming downtimes pulled from the GOCDB; red for OUTAGE, yellow for AT_RISK
– Notices: latest information on Tier-1 operations; only Tier-1 staff can post
– Ganglia plots of key components from the Tier-1 farm
Feedback welcome. (A sketch of the 10-minute caching idea follows below.)
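The dashboard's own code is not shown in the talk; the sketch below only illustrates the "cache remote results locally for 10 minutes" idea used for the SAM component. The URL handling and function names are placeholders, not the real dashboard implementation:

```python
# Minimal sketch of time-limited local caching of remotely fetched results,
# as used conceptually by the dashboard's SAM tests component.
import time
import urllib.request

CACHE_TTL_SECONDS = 10 * 60          # re-use cached results for 10 minutes
_cache = {"fetched_at": 0.0, "payload": None}

def get_sam_results(url: str) -> bytes:
    """Return the latest results, hitting the remote service at most once
    per CACHE_TTL_SECONDS to reduce load on it."""
    now = time.time()
    if _cache["payload"] is None or now - _cache["fetched_at"] > CACHE_TTL_SECONDS:
        with urllib.request.urlopen(url) as resp:   # refresh the local copy
            _cache["payload"] = resp.read()
        _cache["fetched_at"] = now
    return _cache["payload"]
```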


31 SL5 Migration (I)
Next week, 14th-18th September! LHC only (for now) – but all VOs affected.
New batch service – lcgbatch01:
– Quattorised torque/maui server
– Quattorised worker nodes
New LCG-CEs (6-8) for LHC VOs – old LHC CEs (3-5) being retired, other CEs reconfigured.
Same queue configuration. A submit filter script on the CEs adds the SLX property requirement as required (a sketch of the idea follows below).
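Torque supports a site-defined submit filter: the filter receives each job script on stdin and writes a possibly modified script to stdout, so a CE can tack a node-property requirement onto the resource request. A minimal sketch, with "sl5" standing in for the SLX property mentioned above (this is not the actual filter used on the CEs):

```python
#!/usr/bin/env python
# Minimal sketch of a Torque submit filter that appends a node-property
# requirement to '#PBS -l nodes=...' resource requests. The "sl5" property
# name is a placeholder for the SLX property on the slide.
import sys

REQUIRED_PROPERTY = "sl5"   # hypothetical node property marking SL5 workers

def add_property(line: str) -> str:
    """Append the property to a '#PBS -l nodes=...' directive, if missing."""
    if line.startswith("#PBS") and "-l" in line and "nodes=" in line \
            and REQUIRED_PROPERTY not in line:
        return line.rstrip("\n") + ":" + REQUIRED_PROPERTY + "\n"
    return line

if __name__ == "__main__":
    for job_script_line in sys.stdin:   # job script arrives on stdin
        sys.stdout.write(add_property(job_script_line))
```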

32 SL5 Migration (II)
CPU08 going straight into SL5 now (~1800 job slots).
All 64-bit-capable existing WNs will eventually be reinstalled.
Non-LHC VOs will get a new CE for migration after the dust settles.
No plan to retire SL4 WNs completely yet.

33 October Freeze
No planned upgrades beyond September, except possibly a network upgrade. Recognise that some change will still have to take place.
Need to put in place a lightweight change-control process:
– Allow changes where the benefit outweighs the risk
Expect increased stability as downtimes reduce. Apply pressure once more to reduce low-grade failures.

34 Conclusion
Recent staff additions have had a huge impact on the quality of service we operate.
The Tier-1 development plan for 2009 is nearly complete. Positive feedback from STEP09 that the service meets requirements.
Still a few major items (like SL5) to get through (fingers crossed). Probably still some R89 surprises in the pipeline.
Looking forward to the start of data taking.

