
1 Summary of 2008 LCG operation ~~~ Performance and Experience ~~~ Jamie.Shiers@cern.ch ~~~ LCG-LHCC Mini Review, 16th February 2009

2 Overview
The last LHCC Mini-Review of LCG (July 2008) covered the first half of 2008, including the two phases of CCRC'08.
This report covers the experience gained throughout the entire year but focuses primarily on issues arising since the last mini-review.
N.B. "Performance" is interpreted as performance of the service – not throughput or number of jobs; the latter will be covered by the experiments, but some representative plots are included for completeness.
 One of the concerns at that time was that "the overlap between the experiments was less than optimal and some important aspects of the Computing Models were not fully tested even in May".
 Whilst there has been much progress in this area, further inter- and intra-VO tests – particularly in the areas of reprocessing and analysis – are still outstanding.

3 WLCG Service Summary
Great strides have been made in the past year, witnessed by key achievements such as the wide-scale production deployment of SRM v2.2 services, the successful completion of CCRC'08, and support for experiment production and data taking.
Daily operations con-calls – together with the weekly summary – are key to the follow-up of service problems.
Some straightforward steps for improving service delivery have been identified and are being carried out.
 Full 2009-scale testing of the remaining production and analysis Use Cases is urgently required – without a successful and repeatable demonstration we cannot assume that this will work!

4 Key Performance Indicators
Since the beginning of last year we have held week-daily conference calls, open to all experiments and sites, to follow up on short-term operations issues.
These have been well attended by the experiments, with somewhat more patchy attendance from sites, but the minutes are widely and rapidly read by members of the WLCG Management Board and beyond.
A weekly summary is given to the Management Board, where we have tried to evolve towards a small set of Key Performance Indicators.
These currently include a summary of the GGUS tickets opened in the previous week by the LHC VOs, as well as the more important service incidents requiring follow-up: Service Incident Reports.

5 Critical Services: Targets
Targets (not commitments) proposed for Tier0 services; similar targets requested for Tier1s/Tier2s.
Experience from the first week of CCRC'08 suggests that targets for problem resolution should not be set too high (if they are to be ~achievable).
The MoU lists targets for responding to problems (12 hours for Tier1s).
Tier1s: 95% of problems resolved < 1 working day?
Tier2s: 90% of problems resolved < 1 working day?
 Post-mortem triggered when targets are not met!

Time Interval | Issue (Tier0 Services)                                    | Target
End 2008      | Consistent use of all WLCG Service Standards              | 100%
30'           | Operator response to alarm / call to x5011 / alarm e-mail | 99%
1 hour        | Operator response to alarm / call to x5011 / alarm e-mail | 100%
4 hours       | Expert intervention in response to above                  | 95%
8 hours       | Problem resolved                                          | 90%
24 hours      | Problem resolved                                          | 99%

This activity was triggered by the experiments at the WLCG Overview Board in 2007. The proposal above has been formally accepted by the CMS Computing Management and presented to the MB and other bodies on several occasions. It continues to be the baseline against which we regularly check our response, particularly in the case of alarm tickets.
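To make the comparison against these targets concrete, here is a minimal, hypothetical Python sketch (not an existing WLCG tool) of how a week's measured response and resolution times could be checked against the Tier0 targets in the table above; the stage names and input format are assumptions for illustration only.

```python
# Illustrative sketch (not an official WLCG tool): check a week's incidents
# against the Tier0-style response/resolution targets listed in the table.
from datetime import timedelta

# (time window, stage, required fraction) taken from the table above
TARGETS = [
    (timedelta(minutes=30), "operator response", 0.99),
    (timedelta(hours=1),    "operator response", 1.00),
    (timedelta(hours=4),    "expert intervention", 0.95),
    (timedelta(hours=8),    "problem resolved", 0.90),
    (timedelta(hours=24),   "problem resolved", 0.99),
]

def check_targets(durations_by_stage):
    """durations_by_stage maps a stage name ('operator response', ...)
    to a list of timedeltas measured for that stage over the week."""
    results = []
    for window, stage, required in TARGETS:
        measured = durations_by_stage.get(stage, [])
        if not measured:
            continue
        achieved = sum(d <= window for d in measured) / len(measured)
        results.append((stage, window, achieved, achieved >= required))
    return results

# Example: a few tickets with made-up measured times
example = {
    "operator response": [timedelta(minutes=10), timedelta(minutes=45)],
    "problem resolved":  [timedelta(hours=5), timedelta(hours=30)],
}
for stage, window, achieved, ok in check_targets(example):
    print(f"{stage:<20} within {window}: {achieved:.0%} ({'met' if ok else 'MISSED'})")
```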

6 GGUS Summary – First Week of 2009

VO    | USER | TEAM | ALARM | TOTAL
ALICE |    1 |    0 |     0 |     1
ATLAS |   26 |    5 |     0 |    31
CMS   |    3 |    0 |     0 |     3
LHCb  |    2 |    0 |     1 |     3

This summary is rather representative, although somewhat more systematic usage of GGUS would help (constantly encouraged…).
This is strongly requested by sites (amongst others…).
As well as this top-level summary, tables of tickets submitted by, assigned to and affecting VOs are available.
There are also non-VO-related tickets, e.g. for infrastructure issues – typically gLite upgrades etc.
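As an illustration of how such a weekly table can be produced, the following minimal Python sketch (hypothetical, not the actual GGUS reporting code) tallies a week's tickets by VO and ticket type; the (vo, ticket_type) record format is an assumption for this example.

```python
# A minimal sketch: tally one week of tickets by VO and ticket type to
# produce a summary table of the kind shown above.
from collections import Counter

TYPES = ("USER", "TEAM", "ALARM")

def weekly_summary(tickets):
    """tickets: iterable of (vo, ticket_type) pairs for the reporting week."""
    counts = Counter(tickets)
    vos = sorted({vo for vo, _ in tickets})
    print("VO     " + "  ".join(f"{t:>5}" for t in TYPES) + "  TOTAL")
    for vo in vos:
        row = [counts[(vo, t)] for t in TYPES]
        print(f"{vo:<6} " + "  ".join(f"{n:>5}" for n in row) + f"  {sum(row):>5}")

# Example input reproducing part of the table above
weekly_summary([("ALICE", "USER")] + [("ATLAS", "USER")] * 26 +
               [("ATLAS", "TEAM")] * 5 + [("CMS", "USER")] * 3 +
               [("LHCb", "USER")] * 2 + [("LHCb", "ALARM")])
```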

7 How Are We Doing?
The total number of VO-specific tickets is rather low – alarm tickets in particular – with (usually) correct and timely follow-up.
 The experiments are doing a lot of useful work, day in, day out – work-day, weekend and holiday! And they appreciate it and acknowledge it!
 But there are still avoidable holes in service delivery and some areas of significant concern.
However, we could – relatively easily – do better in some areas, with LESS effort and stress…

8 What Needs to be Improved?
We still see "emergency" interventions that could be "avoided" – or at least foreseen and scheduled at a more convenient time.
These are often DB applications where e.g. tables grow and grow until performance hits a wall  emergency cleanup (that goes wrong?); indices "lost"; bad manipulations; …
We still see scheduled interventions that are not sufficiently well planned – that run well into "overtime" or have to be "redone".
e.g. several Oracle upgrades at the end of last year overran – we should be able to schedule such an upgrade by now (after 25+ years!)
Or interventions that are not well planned or discussed with service providers / users and have a big negative impact on ongoing production.
e.g. some network reconfigurations and other interventions – particularly on more complex links, e.g. to the US; debugging and follow-up not always satisfactory.
There are numerous concrete examples of the above concerning many sites: they are covered in the weekly reports to the MB and are systematically followed up.
 Much more serious are chronic (weeks, months) problems that have affected a number of Tier1 sites – more later…

9 Major Service Incidents
Quite a few such incidents are "DB-related" in the sense that they concern services with a DB backend.
The execution of a "not quite tested" procedure on ATLAS online led – partly due to the Xmas shutdown – to a break of over one month in the replication of ATLAS conditions from online out to the Tier1s (online–offline replication was restored much earlier).
Various Oracle problems over many weeks affected numerous services (CASTOR, SRM, FTS, LFC, ATLAS conditions) at ASGC  need for ~1 FTE of suitably qualified personnel at WLCG Tier1 sites, particularly those running CASTOR; recommendations to follow the CERN/3D DB configuration and perform a clean Oracle+CASTOR install; communication issues.
Various problems affected CASTOR+SRM services at RAL over a prolonged period, including "Oracle bugs" strongly reminiscent of those seen at CERN with an earlier Oracle version: very similar (but not identical) problems have been seen recently at CERN & ASGC (not CNAF…).
Plus not-infrequent power and cooling problems [+ weather!] – these can take out an entire site; the main concern is controlled recovery (and communication).

10 At the November 2008 WLCG workshops a recommendation was made that each WLCG Tier1 site should have at least 1 FTE of DBA effort. This effort (preferably spread over multiple people) should proactively monitor the databases behind the WLCG services at that site: CASTOR/dCache, LFC/FTS, conditions and other relevant applications. The skills required include the ability to back up and recover, tune and debug the database and associated applications. At least one WLCG Tier1 does not have this effort available today.

11 How Can We Improve?
Change Management: plan and communicate changes carefully; do not make untested changes on production systems – these can be extremely costly to recover from.
Incident Management: the point is to learn from the experience and hopefully avoid similar problems in the future; documenting clearly what happened, together with possible action items, is essential.
 All teams must buy into this: it does not work simply by high-level management decision (which might not even filter down to the technical teams involved).
CERN IT plans to address this systematically (ITIL) as part of its 2009+ Programme of Work.

12 Additional KPIs
1. Downtimes: scheduled, overrun & unscheduled downtimes all give measures of service quality at a given site. The data exist (scheduled downtimes that overrun are not always easy to identify) but no convenient summary, such as that provided by GGUS, exists (AFAIK – AIA); a sketch of how such overruns might be flagged follows after this slide.
2. "VO views": existing "dashboard" information – used daily by the experiments to monitor their production and the status of sites. Ongoing work aims to present this existing information in views that summarize:
   i. a VO's activities across all the sites it is using;
   ii. a site's performance in terms of the VOs it is serving.
 But all of these KPIs really only work when things are ~under control – e.g. GGUS. We need (first) to reduce the rate of critical incidents – those that trigger a Service Incident Report – particularly for KPI 2.
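The following is a minimal, hypothetical Python sketch of the downtime check referred to in point 1 above: it flags scheduled downtimes whose actual end exceeded the announced end by more than a grace period. The record format and the one-hour grace period are assumptions for illustration, not an existing tool.

```python
# Illustrative sketch: flag scheduled downtimes that overran their
# announced end time, ignoring small overruns below a grace period.
from datetime import datetime, timedelta

def overrunning_downtimes(downtimes, grace=timedelta(hours=1)):
    """downtimes: iterable of (site, announced_start, announced_end, actual_end).
    Returns (site, overrun) pairs for downtimes ending later than announced."""
    flagged = []
    for site, _start, announced_end, actual_end in downtimes:
        overrun = actual_end - announced_end
        if overrun > grace:
            flagged.append((site, overrun))
    return flagged

# Example with made-up entries
example = [
    ("Tier1-A", datetime(2009, 1, 12, 8), datetime(2009, 1, 12, 18), datetime(2009, 1, 13, 9)),
    ("Tier1-B", datetime(2009, 1, 14, 9), datetime(2009, 1, 14, 12), datetime(2009, 1, 14, 12, 30)),
]
for site, overrun in overrunning_downtimes(example):
    print(f"{site}: scheduled downtime overran by {overrun}")
```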

13 GridMap Test Page (Template)

14 Concrete Actions
1. Review on a regular (3-6 monthly?) basis the open Oracle "Service Requests" that are significant risk factors for the WLCG service (Tier0 + Tier1s + Oracle). The first such meeting is being set up and will hopefully take place prior to CHEP 2009.
2. Perform "technology-oriented" reviews of the main storage solutions (CASTOR, dCache), focussing on service and operational issues. Follow-on to the Jan/Feb workshops in these areas; again report at the pre-CHEP WLCG Collaboration Workshop.
3. Perform Site Reviews – initially of Tier0 and Tier1 sites – focussing again on service and operational issues. This will take some time to cover all sites; the proposal is for the review panel to include members of the site being reviewed, who will also participate in the reviews before and after their own site's.

15 The Goal
The goal is that – by end 2009 – the weekly WLCG operations / service report is quasi-automatically generated three weeks out of four, with no major service incidents – just a (tabular?) summary of the KPIs.
 We are currently very far from this target, with (typically) multiple service incidents that are either:
new in a given week; or
still being investigated or resolved several to many weeks later.
By definition, such incidents are characterized by severe (or total) loss of a service or even of a complete site (or even a Cloud in the case of ATLAS).
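Purely as an illustration of the kind of quasi-automatic tabular summary this goal describes, here is a hypothetical Python sketch; the KPI names, data sources and report layout are invented for the example and are not taken from the actual reporting machinery.

```python
# Hypothetical sketch: assemble a small tabular KPI summary for the weekly
# report, listing Service Incident Reports only when there are any.
def weekly_report(kpis, incidents):
    """kpis: dict of KPI name -> value; incidents: list of open Service
    Incident Reports. Returns the report text."""
    lines = ["WLCG weekly operations summary", "-" * 32]
    lines += [f"{name:<30} {value}" for name, value in kpis.items()]
    if incidents:
        lines.append("")
        lines.append("Service Incident Reports requiring follow-up:")
        lines += [f"  - {item}" for item in incidents]
    return "\n".join(lines)

print(weekly_report(
    {"GGUS tickets (all VOs)": 38, "Alarm tickets": 1, "Unscheduled downtimes": 2},
    incidents=[],  # the goal: an empty list three weeks out of four
))
```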

16 CMS jobs in the last 3 months of 2008, sorted by activity.
More than 100K jobs submitted and terminated per day.
A big fraction of CMS jobs (25-30%) are analysis user jobs, running on a large variety of sites.
CMS analysis jobs in the last 3 months of 2008, sorted by site.

17 CMS distributed MC production: ~5×10^9 events produced over the last quarter of 2008.
CMS transfer activity in the last quarter of 2008. Average rate: ~370 MB/s.
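As a rough, illustrative cross-check of scale (a back-of-the-envelope calculation, not a number from the slides): sustaining ~370 MB/s over the ~92 days of a quarter corresponds to roughly 370 MB/s × 92 × 86 400 s ≈ 2.9 × 10^9 MB, i.e. on the order of 3 PB transferred in the quarter.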

18 Constant improvement in the quality of the infrastructure.
Comparison of CMS site availability, based on the results of SAM tests specific to the CMS VO, between the first and last quarters of 2008.
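For illustration, the per-site availability compared here can be thought of as the fraction of passed VO-specific SAM test results over the period. The sketch below is hypothetical (not the actual SAM/dashboard code) and assumes a simple (site, passed) input format.

```python
# Illustrative sketch: per-site availability as the fraction of passed
# VO-specific SAM test results over a chosen period.
from collections import defaultdict

def site_availability(results):
    """results: iterable of (site, passed) pairs over the chosen period."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, ok in results:
        total[site] += 1
        passed[site] += bool(ok)
    return {site: passed[site] / total[site] for site in total}

# Example with made-up results
sample = [("T2_Example_A", True), ("T2_Example_A", True), ("T2_Example_A", False),
          ("T1_Example_B", True), ("T1_Example_B", True)]
for site, avail in sorted(site_availability(sample).items()):
    print(f"{site}: {avail:.0%}")
```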

19 Service Priorities
1. Stability – up to the limits that are currently possible
2. Clarity – where not (well, always…)
 All sites and experiments should consistently use the existing meetings and infrastructure where applicable:
Join the daily WLCG operations con-call regularly – particularly when there have been problems at your site and / or when upcoming interventions are foreseen.
Always submit GGUS tickets – private emails / phone calls do not leave a trace.
Use the established mailing lists – quite a few mails still "get lost" or do not reach all of the intended people / sites.
 The LCG Home Page is your entry point!

20 Outstanding Tests
It is essential that key activities – such as reprocessing – are tested under realistic 2009 conditions as soon as possible. This requires inter-VO scheduling.
 Without such testing the sites cannot guarantee that they can support this key activity, nor can the experiments assume it!
Similar testing of the 2009 Analysis Use Cases is also required – this is likely to stress the systems further, but is essential for the same reasons cited above.

21 2009 Data Taking – The Prognosis
Production activities will work sufficiently well – many Use Cases have been tested extensively and for prolonged periods at a level equal to (or even greater than) the peak loads that can be expected from 2009 LHC operation.
Yes, there will be problems, but we must focus on restoring the service as rapidly and as systematically as possible.
Analysis activities are an area of larger concern – by definition the load is much less predictable.
Flexible Analysis Services and Analysis User Support will be key.
In parallel, we must transition to a post-EGEE III environment – whilst still not knowing exactly what this entails…
But we do know what we need to run stable Grid Services!

22 WLCG Service Summary
Great strides have been made in the past year, witnessed by key achievements such as the wide-scale production deployment of SRM v2.2 services, the successful completion of CCRC'08, and support for experiment production and data taking.
Daily operations con-calls – together with the weekly summary – are key to the follow-up of service problems.
Some straightforward steps for improving service delivery have been identified and are being carried out.
 Full 2009-scale testing of the remaining production and analysis Use Cases is urgently required – without a successful and repeatable demonstration we cannot assume that this will work!

