Tier-1 – Final preparations for data. Andrew Sansum, 9th September 2009.

Themes (last 9 months)
– Improve planning
– Recruitment
– Re-engineer production and operations processes
– Enhance resilience
– Test it works (STEP09)
– Move to R89
– Test the new Disaster Management System
– Final preparations for data taking

The Plan
[Timeline chart, April–November: update contingency; prepare for STEP; STEP; prepare for R89; R89 migration; CASTOR hardware resilience; new hardware; CASTOR upgrade (SRM + nameserver); SL5 upgrade; test of the disaster management system; LFC/FTS and 3D; prepare for data taking; October freeze.]

Recruitment complete
Recruitment has been tough (but a good team is in place now):
– Initially the STFC recruitment freeze
– Later, hard to recruit

Meeting Experiment Needs
VO survey carried out in April:
– Based on a series of qualitative and quantitative questions
– Very helpful and considered feedback from the most significant VOs
Generally very positive. Key findings:
– Communication between the Tier-1 and the VOs is generally working well
– The production team have made a big difference
– Meeting the commitments/expectations of the LHC VOs
– VOs not always clear on Tier-1 priorities (since addressed via a liaison meeting)
– Non-LHC VOs in particular commented that, although support was good, the Tier-1 did not always deliver services on the agreed timescales (unfortunately intentional, reflecting priorities – an expectations-management issue)
– Documentation poor (still needs work)

Production Team / Production Ops
Daytime team of 3 staff (Gareth Smith, John Kelly, Tiju Idiculla):
– Handle operational exceptions (Nagios alerts / pager callouts; probe sketch below)
– Track tickets
– Monitor routine metrics, loads and network rates
– Ensure operational status is communicated to the VOs
– Represent the Tier-1 at WLCG daily operations
– Oversee downtime planning and agree the near-term downtime plan
– Oversee progression of Service Incident Reports
– (Re-)engineer operational processes
Night-time/weekend team of 5 staff on-call at any time (2-hour response):
– Primary on-call (triage and fix easy faults)
– Secondary on-call: CASTOR, Grid, Fabric, Database
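The daytime exception handling is driven by checks of this general shape. Below is a minimal, hypothetical Nagios-plugin-style probe in Python; the metric, file path and thresholds are illustrative assumptions, not the Tier-1's actual checks. Nagios turns a CRITICAL exit status into an alert and, out of hours, a pager callout.

```python
#!/usr/bin/env python3
# Minimal Nagios-plugin-style probe (illustrative sketch, not an actual RAL Tier-1 check).
# Nagios interprets the exit status: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def read_queue_depth(path="/var/run/transfer_queue_depth"):
    """Hypothetical metric source, e.g. the number of pending transfer requests."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def main(warn=500, crit=2000):
    depth = read_queue_depth()
    if depth is None:
        print("UNKNOWN - could not read queue depth")
        return UNKNOWN
    if depth >= crit:
        print("CRITICAL - queue depth %d (threshold %d)" % (depth, crit))
        return CRITICAL   # this is what would page the on-call team
    if depth >= warn:
        print("WARNING - queue depth %d (threshold %d)" % (depth, warn))
        return WARNING
    print("OK - queue depth %d" % depth)
    return OK

if __name__ == "__main__":
    sys.exit(main())
```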

Callout rate
Big improvement over 2009 – some deterioration lately, owing to recent development activity and major incidents.

Process Improvement
The service is complex. Frequent routine interventions, e.g.:
– Add disk servers to a service class
– Take disk servers offline
Mistakes occur if they are not engineered out. Work in progress, but critical if we are to meet high expectations.

CASTOR (I)
A process of gradual improvement: tracking down the causes of individual transfer failures and improving processes (e.g. disk server intervention status):
– Applied an Oracle patch to fix the "Big ID" bug
– A series of CASTOR minor-version upgrades, predominantly bug fixes, including a workaround to prevent the Oracle "crosstalk" bug from recurring
– Reconfiguration of the internal LSF scheduler to improve stability and scalability (move from NFS to HTTP)
– Tuning changes
– Oracle migration to new hardware (two EMC RAID arrays), providing additional resilience, improved performance and better maintenance
– SRM upgrades

CASTOR: Downtime
[Chart of CASTOR downtime, with the CASTOR upgrade and the R89 move marked.]

CASTOR (III): Plans
September:
– Nameserver upgraded
– SRM upgrade to version 2.8
– CIP upgrade to version 2 (in progress)
2009 Q4:
– Optimising the Oracle database
– Additional resilience
– Disaster recovery testing

STEP09: Operations Overview
Generally very smooth operation:
– Most service systems relatively unloaded, with plenty of spare capacity
– Calm atmosphere. The daytime production team monitored the service. Only one callout; most of the team even took two days off site for a department meeting!
– Very good liaison with the VOs and a good idea of what was going on. In regular informal contact with UK representatives
– Some problems with CASTOR tape migration (3 days) on the ATLAS instance, but all handled satisfactorily and fixed. Did not visibly impact the experiments. The robot broke down for several hours (a stuck handbot led to all drives being de-configured in CASTOR). Caught up quickly.
– A very useful exercise – we learned a lot, and it was very reassuring
– More at:

STEP09: Batch Service
Farm typically running > 2000 jobs. By 9th June at equilibrium (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%).
Problem 1: ATLAS job submission exceeded 32K files on a CE.
– See the hole on the 9th. We thought ATLAS had paused, so it took time to spot.
Problem 2: Fair shares not honoured, as aggressive ALICE submission beat ATLAS to job starts.
– Need more ATLAS jobs in the queue faster. Manually capped ALICE. Fixed by 9th June – see the decrease in (red) ALICE work.
Problem 3: Occupancy initially poor (around 90%). Short on memory (2 GB/core, but ATLAS jobs needed 3 GB vmem). Gradually increased the Maui memory over-commit to 50%. Occupancy --> 98% (arithmetic below).
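As an illustrative check of the memory figures quoted above (not taken from the slides themselves): with 2 GB of physical memory per core and the Maui memory over-commit raised to 50%, the schedulable virtual memory per core becomes

\[
2\,\mathrm{GB} \times (1 + 0.5) = 3\,\mathrm{GB},
\]

which is just enough for the roughly 3 GB of vmem that ATLAS jobs requested, explaining the rise in occupancy from about 90% to about 98%.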

STEP09: Network
– Batch farm drawing approx. 3 Gb/s from CASTOR during reprocessing; peaked at 30 Gb/s for CMS reprocessing without lazy download.
– Total OPN traffic: inbound 3.5 Gb/s, outbound 1 Gb/s.
– RAL -> Tier-2 outbound rate averaged 1.5 Gb/s, but with 6 Gb/s spikes!

STEP09: Tape
The tape system worked well. Sustained 4 Gb/s during peak load on 13 drives (ATLAS+CMS), 15 drives with LHCb. We experimented with a mix of dedicated and shared drives (4 ATLAS, 4 CMS, 2 LHCb, 5 shared).
– Typical average rate of 35 MB/s per drive (1-day average; see the consistency check below)
– Lower than we would like (looking for nearer 45 MB/s)
– On the CMS instance, a modified write policy gave > 60 MB/s, but reads are more challenging to optimise.
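A rough consistency check of the quoted tape rates (decimal units assumed, an illustration rather than measured data):

\[
4\,\mathrm{Gb/s} \approx 500\,\mathrm{MB/s},
\qquad
\frac{500\,\mathrm{MB/s}}{13\ \mathrm{drives}} \approx 38\,\mathrm{MB/s\ per\ drive},
\]

in line with the 35 MB/s one-day average; reaching the 45 MB/s per-drive target across 15 drives would correspond to roughly 675 MB/s, i.e. about 5.4 Gb/s aggregate.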

R89: Migration
– Migration planning started in early 2008 (building: early 2006).
– Detailed equipment documentation, together with a requirements document, was sent to vendors during September 2008.
– Workshop hosted during November. Vendors committed to 3 racks (each) per day (we believe 5-6 was feasible).
– Orders placed at the end of November to move 77 racks of equipment (and the robot) to an agreed schedule (Tier-1 = 43 racks).
– Started 22nd July and ended 6th August. Completed to schedule.

R89 Migration
43 racks moved.
[Timeline chart (Wed 17 – Mon 6): drain WMS, CEs and FTS; batch workers start moving; critical services, CASTOR core + disk; batch workers complete; disk complete; CASTOR restarting; restart.]

Disasters: Swine Flu
– First test of the new disaster management system.
– Easy to handle – trivial to generate a contingency plan based on the existing template. Situation regularly assessed.
– Tier-1 response initially ran ahead of RAL site planning. Reached level 2 in the DMS, with assessment meetings every 2 weeks.
– Work mainly on remote working and a communication strategy.
– Now downgraded to level 1 until a significant rise in case frequency. Expect to dust it off again before Christmas.

Disasters: Air-conditioning (I)
Two cooling failures in 3 days.
[Temperature chart: cold-aisle and hot-aisle temperatures, showing the shutdown, the room reaching equilibrium, and the chiller restart.]
– Monday (daytime): both chiller systems shut down, restarted quickly.
– Tuesday: one chiller shut down and failed over to the second chiller.
– Wednesday night: both chillers shut down and could not be restarted.
– After the third event we decided not to restart the Tier-1.

Disasters: Air-conditioning (II)
Initial post-mortem started after the first (daytime) event:
– Thermal monitoring, callout and automated shutdown in R89 not fully implemented / not working correctly
– Urgent remedial work underway
The second, night-time incident raised further concerns:
– Tier-1 called out and rapidly escalated
– But automated shutdown was still in test mode – forced to do a manual shutdown
– Operations thermal callout failed to work as required
– Site security did not escalate the BMS alarm (not an expected alarm)
– Escalation to building services was very slow (owing to R89 still being under warranty/acceptance)
– Chillers could not be restarted
– No explanation of the cause of the outage
Concluded we would not restart the Tier-1 until these issues were resolved.

Disasters: Air-conditioning (III)
Critical services continued to run:
– Separate, redundant cooling system in the UPS room
– Tape robotics and the CASTOR core were OK too (low-temperature room)
By Friday:
– Tier-1 response at disaster level 3 (meeting held with the VOs and the PMB)
– Building services believed the cooling was stable and the fault could not recur
– All necessary automation, callout and escalation processes in place
– Nevertheless, the Tier-1 team was not prepared to run hardware unattended over the weekend
On Monday:
– Full service restart
– Plan to baby-sit the service during Monday/Tuesday evenings
Forensics and the post-mortem continued.

Disasters: Air-conditioning (IV)
Monday 10th: incident believed to have been caused by a planned reboot of the Building Management System (BMS):
– Caused the pumps to stop
– Low pressure caused the chiller valves to close
– The BMS returned, but the system deadlocked
Tuesday 11th: a single chiller trip followed by failover:
– The logs do not allow diagnosis
Wednesday 12th: the BMS detected overpressure in the cooling system and triggered a shutdown:
– Probably a true over-pressure (1.9 bar)
– The setting (1.7 bar) is considered too low
– Now raised to 2.5 bar, and it only calls out
– System tested to 6 bar
– Investigations continue

Disasters: Water Leak
– Water found dripping on the tape robot! An "I don't believe this is happening" moment.
– Should not be able to happen, as there are no planned water supplies above the machine room.
– Fortunately the Tier-1 was already shut down, so the robot was turned off too.
– The STK engineer investigated and concluded that the damage is mainly superficial splash damage: drive heads not contaminated, tapes (60 splashed) probably OK.
– Indications that this had been occurring occasionally for several weeks.

Disasters: Water leak
Cause: condensation from the 1st-floor cooling system:
– An incorrect damper setting (air intake) led to excess condensation
– Condensation collected in a drip tray and was pumped away
– The tray was too small and the pump inadequate
– Water overflowed the tray and tracked along the floor to a hole
Remedy:
– Place an umbrella over the robot
– Chillers switched off – 1st floor inspected daily!
– Planning underway to re-engineer the drip trays/pumps, alarms, etc.
– Monitor the tape error rate

Procurements
Disk, CPU and robotics procurements were delayed from their January/February delivery dates:
– New SL8500 tape robot entirely for GridPP; 2 PB of disk (24-drive units, 50% Areca/WD, 50% 3ware/Seagate); CPU capacity
Eventually delivered in May, but entangled in the R89 migration:
– New robot in production in July
– CPU completed acceptance testing and is being deployed into SL5
– One lot of disk (1 PB) ready for deployment
– Second lot failed acceptance (many drive ejects)
Positive aspects of the acceptance failure:
– The two-lot risk-avoidance strategy worked
– The vendor's 1-week load test failed to find the fault
– Our 28-day acceptance test caught the fault before the kit reached production

LFC, FTS and 3D
Now complete: upgrade of the back-end RAID arrays and Oracle servers:
– Replaced the elderly RAID arrays with a pair of new EMC RAID arrays
– Better support (we hope)
– Better performance
– Move to Oracle RAC for LFC/FTS (increased resilience)
– Separate the ATLAS LFC from the general LFC
– Upgrade the 3D servers and move them to the new RAID arrays
Work has commenced on testing replication of the LFC for disaster contingency.

Quattor – Story So Far
– Began work in earnest in June 2009.
– Set up a Quattor Working Group (QWG) instance to manage deployment and configuration of new hardware; this leverages the strong QWG support for gLite.
– Have the SL5 Torque/Maui server under Quattor control.
– Are (as of today) deploying 220+ new WNs into the SL5 batch service.
– Significant work to get up and running; a new way of working. Have uncovered and helped fix a number of bugs and issues in the process.

Quattor – Next Steps
– As we move existing WNs to SL5 (we need 75% of our capacity in SL5), we will Quattorise them.
– Move CEs and other grid service nodes to Quattor.
– Gradually migrate non-grid services to Quattor control.
Aquilon:
– A database back end to Quattor developed by Morgan Stanley. Improves scalability and manageability (MS are managing >15,000 nodes).
– Will first deploy it at RAL.
– Then plan to make Aquilon usable by other grid sites as well.

Dashboard
Available at:
Constantly evolving – components can be added/updated/removed. Present components:
– SAM tests: latest test results for the critical services, locally cached for 10 minutes to reduce load (sketched below)
– Downtimes: ongoing and upcoming downtimes pulled from the GOCDB; red for OUTAGE, yellow for AT_RISK
– Notices: the latest information on Tier-1 operations; only Tier-1 staff can post
– Ganglia plots of key components from the Tier-1 farm
Feedback welcome.
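A minimal sketch of the 10-minute local cache in front of the SAM results; the fetch URL and JSON payload are hypothetical, since the dashboard's real data source and format are not shown in the slides.

```python
# Sketch of a 10-minute local cache for SAM test results.
# SAM_URL and the JSON layout are illustrative assumptions, not the real SAM endpoint.
import json
import time
from urllib.request import urlopen

CACHE_TTL = 600  # seconds: hit the remote source at most once every 10 minutes
SAM_URL = "http://example.org/sam/latest?site=RAL-LCG2"  # hypothetical endpoint

_cache = {"fetched_at": 0.0, "results": None}

def get_sam_results():
    """Return the latest SAM results, served from the local cache while it is
    younger than CACHE_TTL, otherwise refreshed from the remote service."""
    now = time.time()
    if _cache["results"] is None or now - _cache["fetched_at"] > CACHE_TTL:
        with urlopen(SAM_URL, timeout=30) as response:
            _cache["results"] = json.loads(response.read())
        _cache["fetched_at"] = now
    return _cache["results"]
```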

SL5 Migration (I)
Next week – 14th-18th September! LHC only (for now), but all VOs are affected.
– New batch service – lcgbatch01
– Quattorised Torque/Maui server
– Quattorised worker nodes
– New LCG-CEs (6-8) for the LHC VOs; old LHC CEs (3-5) being retired, other CEs reconfigured
– Same queue configuration
– Use a submit filter script on the CEs to add an SLx property requirement as required (sketched below)
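A sketch of what such a submit filter could look like, assuming Torque's standard mechanism (the filter receives the job script on stdin and whatever it writes to stdout is submitted). The sl5/sl4 node-property names and the injected directive are illustrative assumptions, not the actual RAL configuration.

```python
#!/usr/bin/env python3
# Sketch of a Torque submit filter that injects an OS node-property requirement
# (e.g. nodes=1:sl5) when the incoming job script does not already request one.
# Torque passes the job script on stdin and submits the filter's stdout.
# The sl5/sl4 property names are illustrative assumptions.
import re
import sys

NODE_REQ = "#PBS -l nodes=1:sl5\n"   # assumed property marking SL5 worker nodes
OS_PROPERTY = re.compile(r"#PBS\s+-l\s+nodes=.*:(sl4|sl5)\b")

def filter_script(lines):
    if any(OS_PROPERTY.match(line) for line in lines):
        return lines                  # user pinned an OS themselves: leave untouched
    out = list(lines)
    # Insert the requirement just after the shebang (or at the very top if none).
    pos = 1 if out and out[0].startswith("#!") else 0
    out.insert(pos, NODE_REQ)
    return out

if __name__ == "__main__":
    sys.stdout.writelines(filter_script(sys.stdin.readlines()))
```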

SL5 Migration (II)
– CPU08 going straight into SL5 now (~1800 job slots)
– All 64-bit-capable existing WNs will eventually be reinstalled
– Non-LHC VOs will get a new CE for migration after the dust settles
– No plan to retire the SL4 WNs completely yet

October Freeze
– No planned upgrades beyond September, except possibly a network upgrade.
– Recognise that some change will have to take place. Need to put in place a lightweight change-control process that allows changes where the benefit outweighs the risk.
– Expect increased stability as downtimes reduce.
– Apply pressure once more to reduce low-grade failures.

Conclusion
– Recent staff additions have had a huge impact on the quality of the service we operate.
– The Tier-1 development plan for 2009 is nearly complete.
– Positive feedback from STEP09 that the service meets requirements.
– Still a few major items (like SL5) to get through (fingers crossed).
– Probably still some R89 surprises in the pipeline.
– Looking forward to the start of data taking.