Summary of 2008 LCG operation ~~~ Performance and Experience ~~~ LCG-LHCC Mini Review, 16th February 2009

Overview

The last LHCC Mini-Review of LCG (July 2008) covered the first half of 2008, including the two phases of CCRC’08.

This report covers the experience gained throughout the entire year, but focuses primarily on issues since the time of the last mini-review.

N.B. “Performance” is interpreted as performance of the service – not throughput, number of jobs, etc. The latter will be covered by the experiments, but some representative plots are included for completeness.

One of the concerns at that time was that “the overlap between the experiments was less than optimal and some important aspects of the Computing Models were not fully tested even in May”.

Whilst there has been much progress in this area, further inter- and intra-VO tests – particularly in the areas of reprocessing and analysis – are still outstanding.

WLCG Service Summary

Great strides have been made in the past year, witnessed by key achievements such as the wide-scale production deployment of SRM v2.2 services, the successful completion of CCRC’08 and support for experiment production and data taking.

Daily operations con-calls – together with the weekly summary – are key to the follow-up of service problems.

Some straightforward steps for improving service delivery have been identified and are being carried out.

Full 2009-scale testing of the remaining production and analysis Use Cases is urgently required – without a successful and repeatable demonstration we cannot assume that this will work!

Key Performance Indicators

Since the beginning of last year we have held week-daily conference calls, open to all experiments and sites, to follow up on short-term operations issues.

These have been well attended by the experiments, with somewhat more patchy attendance from sites, but the minutes are widely and rapidly read by members of the WLCG Management Board and beyond.

A weekly summary is given to the Management Board, where we have tried to evolve towards a small set of Key Performance Indicators.

These currently include a summary of the GGUS tickets opened in the previous week by the LHC VOs, as well as the more important service incidents requiring follow-up: Service Incident Reports.
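As an illustration of how such a weekly per-VO ticket summary can be assembled, here is a minimal Python sketch; the ticket records and their (VO, type) layout are invented for the example and do not reflect the actual GGUS export format.

```python
# Minimal sketch: tally one week's GGUS tickets per LHC VO into the
# User/Team/Alarm/Total columns used in the weekly MB report.
# The input records are invented; the real GGUS export format differs.
from collections import defaultdict

tickets = [  # (vo, ticket_type) pairs, hypothetical sample data
    ("ALICE", "USER"), ("CMS", "USER"), ("CMS", "USER"), ("CMS", "USER"),
    ("LHCb", "USER"), ("LHCb", "USER"), ("LHCb", "ALARM"),
]

counts = defaultdict(lambda: {"USER": 0, "TEAM": 0, "ALARM": 0})
for vo, kind in tickets:
    counts[vo][kind] += 1

print(f"{'VO':6} {'User':>5} {'Team':>5} {'Alarm':>6} {'Total':>6}")
for vo in sorted(counts):
    c = counts[vo]
    print(f"{vo:6} {c['USER']:>5} {c['TEAM']:>5} {c['ALARM']:>6} {sum(c.values()):>6}")
```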

Critical Services: Targets

Targets (not commitments) proposed for Tier0 services; similar targets requested for Tier1s/Tier2s.

Experience from the first week of CCRC’08 suggests that targets for problem resolution should not be set too high (if they are to be ~achievable). The MoU lists targets for responding to problems (12 hours for Tier1s):

Tier1s: 95% of problems resolved < 1 working day?
Tier2s: 90% of problems resolved < 1 working day?

A post-mortem is triggered when targets are not met!

Time Interval | Issue (Tier0 Services)                             | Target
End 2008      | Consistent use of all WLCG Service Standards       | 100%
30'           | Operator response to alarm / call to x5011 / alarm | 99%
1 hour        | Operator response to alarm / call to x5011 / alarm | 100%
4 hours       | Expert intervention in response to above           | 95%
8 hours       | Problem resolved                                   | 90%
24 hours      | Problem resolved                                   | 99%

This activity was triggered by the experiments at the WLCG Overview Board. The proposal has been formally accepted by the CMS Computing Management and presented to the MB and other bodies on several occasions. It continues to be the baseline against which we regularly check our response, particularly in the case of alarm tickets.
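To make the resolution targets concrete, the following sketch checks an invented set of problem-resolution times against the 8-hour (90%) and 24-hour (99%) Tier0 targets from the table above; the data and the pass/fail wording are illustrative only.

```python
# Illustrative check of the Tier0 problem-resolution targets quoted above:
# 90% of problems resolved within 8 hours, 99% within 24 hours.
# The resolution times (in hours) are invented sample data.
resolution_hours = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5, 20.0]

def fraction_within(times, limit):
    """Fraction of problems resolved within `limit` hours."""
    return sum(1 for t in times if t <= limit) / len(times)

for limit, target in [(8, 0.90), (24, 0.99)]:
    frac = fraction_within(resolution_hours, limit)
    verdict = "met" if frac >= target else "MISSED -> post-mortem"
    print(f"<= {limit:>2} h: {frac:.0%} resolved (target {target:.0%}) - {verdict}")
```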

GGUS Summary – First Week of

VO    | User | Team | Alarm | Total
ALICE | 1    | 0    | 0     | 1
ATLAS |      |      |       |
CMS   | 3    | 0    | 0     | 3
LHCb  | 2    | 0    | 1     | 3

This summary is rather representative, although somewhat more systematic usage of GGUS would help (constantly encouraged…). This is strongly requested by the sites (amongst others…).

As well as this top-level summary, tables of tickets submitted by, assigned to and affecting each VO are available.

There are also non-VO-related tickets, e.g. for infrastructure issues – typically gLite upgrades etc.

How Are We Doing?

The total number of VO-specific tickets is rather low – alarm tickets in particular – with (usually) correct and timely follow-up.

The experiments are doing a lot of useful work, day in, day out – work-day, weekend and holiday! And they appreciate it and acknowledge it!

But there are still avoidable holes in service delivery and some areas of significant concern.

However, we could – relatively easily – do better in some areas, with LESS effort and stress…

What Needs to be Improved?

We still see “emergency” interventions that could be “avoided” – or at least foreseen and scheduled at a more convenient time. These are often DB applications where e.g. tables grow and grow until performance hits a wall, leading to an emergency cleanup (that goes wrong?), “lost” indices, bad manipulations, …

We still see scheduled interventions that are not sufficiently well planned – that run well into “overtime” or have to be “redone”. E.g. several Oracle upgrades at the end of last year overran – we should be able to schedule such an upgrade by now (after 25+ years!)

Or interventions that are not well planned or discussed with service providers / users and have a big negative impact on ongoing production. E.g. some network reconfigurations and other interventions – particularly on more complex links, e.g. to the US; debugging and follow-up are not always satisfactory.

There are numerous concrete examples of the above concerning many sites: they are covered in the weekly reports to the MB and are systematically followed up.

Much more serious are chronic (weeks, months) problems that have affected a number of Tier1 sites – more later…

Major Service Incidents

Quite a few such incidents are “DB-related”, in the sense that they concern services with a DB backend.

The execution of a “not quite tested” procedure on ATLAS online led – partly due to the Xmas shutdown – to a break of over one month in the replication of ATLAS conditions from online out to the Tier1s (online–offline replication was restored much earlier).

Various Oracle problems over many weeks affected numerous services (CASTOR, SRM, FTS, LFC, ATLAS conditions) at ASGC. This points to the need for ~1 FTE of suitably qualified personnel at WLCG Tier1 sites, particularly those running CASTOR; the recommendations are to follow the CERN/3D DB configuration and perform a clean Oracle+CASTOR install; there were also communication issues.

Various problems affected the CASTOR+SRM services at RAL over a prolonged period, including “Oracle bugs” strongly reminiscent of those seen at CERN with an earlier Oracle version: very similar (but not identical) problems were seen recently at CERN and ASGC (not CNAF…).

Plus not infrequent power and cooling problems [+ weather!]. These can take out an entire site – the main concern is controlled recovery (and communication).

At the November 2008 WLCG workshops a recommendation was made that each WLCG Tier1 site should have at least 1 FTE of DBA effort. This effort (preferably spread over multiple people) should proactively monitor the databases behind the WLCG services at that site: CASTOR/dCache, LFC/FTS, conditions and other relevant applications. The skills required include the ability to back up and recover, tune and debug the database and associated applications. At least one WLCG Tier1 does not have this effort available today.

How Can We Improve?

Change Management: plan and communicate changes carefully; do not make untested changes on production systems – these can be extremely costly to recover from.

Incident Management: the point is to learn from the experience and hopefully avoid similar problems in the future; documenting clearly what happened, together with possible action items, is essential.

All teams must buy into this: it does not work simply by high-level management decision (which might not even filter down to the technical teams involved).

CERN IT plans to address this systematically (ITIL) as part of its Programme of Work.

Additional KPIs

1. Downtimes: scheduled, overrun and unscheduled downtimes all give measures of service quality at a given site. The data exist (scheduled downtimes that overrun are not always easy to identify), but no convenient summary such as the one provided by GGUS exists (AFAIK – AIA). A minimal computation is sketched after this slide.

2. “VO views”: existing “dashboard” information, used daily by the experiments to monitor their production and the status of sites. Work is ongoing to present this existing information in views that summarize:
   i. a VO’s activities across all the sites it is using;
   ii. a site’s performance in terms of the VOs it is serving.

But all of these KPIs really only work when things are ~under control – e.g. GGUS. We need (first) to reduce the rate of critical incidents – those that trigger a Service Incident Report – particularly for item 2.
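As a rough sketch of the downtime KPI mentioned in item 1, the fragment below totals scheduled, overrun and unscheduled downtime per site and turns it into an approximate monthly availability; the record layout, site names and numbers are invented, not the actual downtime database schema.

```python
# Sketch of a downtime-based KPI: total hours lost per site and per downtime
# class, plus a rough monthly availability. Records and numbers are invented.
from collections import defaultdict

downtimes = [  # (site, class, hours) - hypothetical records for one month
    ("Site-A", "scheduled", 8.0), ("Site-A", "unscheduled", 5.0),
    ("Site-B", "unscheduled", 36.0), ("Site-B", "overrun", 4.0),
]

HOURS_IN_MONTH = 30 * 24
per_site = defaultdict(lambda: defaultdict(float))
for site, kind, hours in downtimes:
    per_site[site][kind] += hours

for site in sorted(per_site):
    lost = sum(per_site[site].values())
    availability = 1.0 - lost / HOURS_IN_MONTH
    breakdown = ", ".join(f"{k}: {v:.0f}h" for k, v in sorted(per_site[site].items()))
    print(f"{site}: {breakdown}; availability ~ {availability:.1%}")
```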

GridMap Test Page (Template)

Concrete Actions

1. Review on a regular (3–6 monthly?) basis the open Oracle “Service Requests” that are significant risk factors for the WLCG service (Tier0 + Tier1s + Oracle). The first such meeting is being set up and will hopefully take place prior to CHEP.

2. Perform “technology-oriented” reviews of the main storage solutions (CASTOR, dCache), focussing on service and operational issues. Follow-on to the Jan/Feb workshops in these areas; again report at the pre-CHEP WLCG Collaboration Workshop.

3. Perform Site Reviews – initially of Tier0 and Tier1 sites – focussing again on service and operational issues. This will take some time to cover all sites; the proposal is for the review panel to include members of the site to be reviewed, who will also participate in the reviews before and after their own site’s.

The Goal

The goal is that – by the end of 2009 – the weekly WLCG operations / service report is quasi-automatically generated 3 weeks out of 4, with no major service incidents – just a (tabular?) summary of the KPIs.

We are currently very far from this target, with (typically) multiple service incidents that are either new in a given week, or still being investigated or resolved several to many weeks later.

By definition, such incidents are characterized by severe (or total) loss of a service or even of a complete site (or even of a Cloud, in the case of ATLAS).
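A minimal sketch of what such a quasi-automatic weekly report could look like is given below; the KPI names, the notion of an “open SIR list” and all values are assumptions for illustration only, not an existing tool.

```python
# Sketch of a quasi-automatically generated weekly report: if no Service
# Incident Reports (SIRs) are open, print only a tabular KPI summary.
# KPI names, values and the SIR list are invented for illustration.
open_sirs = []  # e.g. ["Site-X Oracle/CASTOR outage"] in a bad week

kpis = [
    ("GGUS tickets opened (all LHC VOs)", 12),
    ("Alarm tickets", 1),
    ("Unscheduled downtimes", 2),
]

if not open_sirs:
    print("WLCG weekly service report (auto-generated KPI summary)")
    for name, value in kpis:
        print(f"  {name:36} {value}")
else:
    print("Manual report required; open SIRs: " + ", ".join(open_sirs))
```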

CMS jobs in the last 3 months of 2008, sorted by activity: more than 100k jobs submitted and terminated per day.

A big fraction of CMS jobs (25–30%) are analysis-user jobs, running on a wide variety of sites.

CMS analysis jobs in the last 3 months of 2008, sorted by site.

CMS distributed MC production: ~5×10^9 events produced over the last quarter of 2008.

CMS transfer activity in the last quarter of 2008: average rate ~370 MB/s.

Constant improvement in the quality of the infrastructure: comparison of CMS site availability, based on the results of SAM tests specific to the CMS VO, between the first and last quarters of 2008.
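To make the availability figure concrete, here is a minimal sketch that computes per-site availability as the fraction of passed VO-specific SAM tests; the site names and test outcomes are invented, and the production SAM availability algorithm (critical tests, time-weighting) is more involved than this.

```python
# Sketch: per-site availability as the fraction of passed VO-specific SAM
# tests over a period. Site names and results are invented; the production
# SAM availability computation (critical tests, time windows) is more complex.
results = {  # site -> list of test outcomes, True = passed
    "T2_Example_A": [True] * 90 + [False] * 10,
    "T2_Example_B": [True] * 70 + [False] * 30,
}

for site, outcomes in sorted(results.items()):
    availability = sum(outcomes) / len(outcomes)
    print(f"{site}: {availability:.0%} of SAM tests passed")
```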

Service Priorities

1. Stability – up to the limits that are currently possible
2. Clarity – where not (well, always…)

All sites and experiments should consistently use the existing meetings and infrastructure where applicable:

Join the daily WLCG operations con-call regularly – particularly when there have been problems at your site and/or when upcoming interventions are foreseen.

Always submit GGUS tickets – private e-mails / phone calls do not leave a trace.

Use the established mailing lists – quite a few mails still “get lost” or do not reach all of the intended people / sites.

The LCG Home Page is your entry point!

Outstanding Tests

It is essential that key activities – such as reprocessing – are tested under realistic 2009 conditions as soon as possible. This requires inter-VO scheduling.

Without such testing the sites cannot guarantee that they can support this key activity, nor can the experiments assume it!

Similar testing of the 2009 Analysis Use Cases is also required – this is likely to stress the systems further, but is essential for the same reasons cited above.

2009 Data Taking – The Prognosis

Production activities will work sufficiently well – many Use Cases have been tested extensively, and for prolonged periods, at a level equal to (or even greater than) the peak loads that can be expected from 2009 LHC operation.

Yes, there will be problems, but we must focus on restoring the service as rapidly and as systematically as possible.

Analysis activities are an area of larger concern – by definition the load is much less predictable. Flexible Analysis Services and Analysis User Support will be key.

In parallel, we must transition to a post-EGEE III environment – whilst still not knowing exactly what this entails… But we do know what we need to run stable Grid Services!

WLCG Service Summary

Great strides have been made in the past year, witnessed by key achievements such as the wide-scale production deployment of SRM v2.2 services, the successful completion of CCRC’08 and support for experiment production and data taking.

Daily operations con-calls – together with the weekly summary – are key to the follow-up of service problems.

Some straightforward steps for improving service delivery have been identified and are being carried out.

Full 2009-scale testing of the remaining production and analysis Use Cases is urgently required – without a successful and repeatable demonstration we cannot assume that this will work!