
1 WLCG Service Status – Jamie Shiers, WLCG Service Coordination, WLCG – Worldwide LHC Computing Grid. LCG Comprehensive Review, November 20th 2007

2 Agenda
- State of readiness of the WLCG service – in particular, progress since the LHCC Open Session in September (not a repeat…) and in the context of CCRC'08 / data-taking preparation
- Status of services in terms of reliability – benchmarks: WLCG MoU targets & experiment lists of "Critical Services" and their criticality
- Outlook & conclusions

3 WLCG Commissioning Schedule (LHCC Comprehensive Review – September 2006). Note the emphasis: residual services, increased reliability, performance, capacity, monitoring, operation.

4 LCG Component Service Readiness – Update
- SL4: BDII, CE, WN, UI delivered. Other components being ported…
- LFC: Stable & in production. Additional bulk methods (ATLAS) delivered and being tested. Issues with R/O LFC → development (done & certified); deployment at phase I Tier1s < February CCRC'08; all < May CCRC'08.
- FTS: Service runs at CERN and (most) T1s (clients also at T2s). Still some issues related to CMs ("clouds") to be worked out.
- VOMS: Server and management interfaces work. Still issues over how proxies, roles, groups and attributes will be used – being analysed.
- WLM: Well-performing gLite WMS, including gLite 3.1 for SL4 (32-bit mode).
- 3D: Tier0 and Tier1 database infrastructure in place; Streams replication used by ATLAS, CMS (T0 online-offline) and LHCb; 3D monitoring integrated with experiment dashboards.
- SRM v2.2: Extensive testing for about a year. Production deployment in progress (CERN + major T1s < end 2007; rest < end Feb 2008). Experiment adaptation planned & in progress.

5 LCG Top 5 Issues (MB – May/June)
- Issue #1 – ALICE: xrootd-CASTOR2; ATLAS: CASTOR@CERN; CMS: CASTOR (functionality & performance); LHCb: data access from T1 MSS
- Issue #2 – ALICE: xrootd-DPM; ATLAS: integration of DDM/FTS/SRM/GridFTP/LFC etc.; CMS: SRM interface (functionality & performance); LHCb: glexec usage
- Issue #3 – ALICE: FTS service; ATLAS: (lack of) SRM 2.2; CMS: FTS service; LHCb: file management
- Issue #4 – ALICE: gLite WMS; ATLAS: data storage management tools; CMS: workload management; LHCb: deployment procedure
- Issue #5 – ALICE: VOMS; ATLAS: stability of the Information System; CMS: Information System

6 SRM v2.2 Production Deployment
- SRM v2.2 production deployment at CERN is now underway: one 'endpoint' per LHC experiment, plus a public one (as for CASTOR2). The LHCb endpoint is already configured & ready for production; the others are well advanced and will be available shortly. Tier1 sites running CASTOR2 follow at least one month after CERN (to benefit from the experience).
- SRM v2.2 is being deployed at Tier1 sites running dCache according to the agreed plan: steady progress. Remaining sites – including those that source dCache through OSG – by end-Feb 2008.
- DPM is already available (and deployed) for Tier2s; StoRM also for INFN(+) sites.
- CCRC'08 – the Common Computing Readiness Challenge, described later – is foreseen to run on SRM v2.2 (Feb + May).
- Adaptation of experiment frameworks & use of SRM v2.2 features is planned. We need to agree concrete details of site setup for January testing prior to February's CCRC'08 run; details of the transition period and site configuration still require work.

7 dCache Upgrade Schedule
- Oct 29: NDGF (done)
- Nov 5: GridKa (done), SARA (likely)
- Nov 12: Edinburgh workshop (done)
- Nov 19: (nothing yet)
- Nov 26: IN2P3 (likely)
- Dec 3: (nothing yet)
- Dec 10: BNL (rather sure)
- Dec 17: PIC, RAL (they actually said: end of the year)
- Fermilab: December to January
- TRIUMF: no reply yet

8 SRM Clients: FTS, lcg_util/GFAL
- lcg_util / GFAL testing done: various bugs / issues reported, now fixed in the latest release.
- Tests on the pilot FTS service: both dteam tests and experiment tests – wide-scale testing to discover issues early. The latest release of FTS 2.0 fixes the known integration issues.
- Many SRM issues identified and resolved: specification-conformity issues were reported to the SRM providers and solved.
- The major work on client integration is now done. We do anticipate more integration / conformity issues as we ramp up production, but these will be solved as they occur.

9 DPM
- SRM 2.2 has been deployed in production since January 2007, although not all T2 sites have yet upgraded.
- SRM 2.2 conformance tests and stress tests are done; all issues resolved.
- DPM-managed SRM copy functionality has still to be provided.

10 SRM v2.2 – Summary
- The 'SRM saga' has clearly been much longer than desirable – and at times fraught.
- Special thanks are due to all those involved, for their hard work over an extended period and (in anticipation) for the delivery of successful production services.
- When the time is right, it would make sense to learn from this exercise. Large-scale collaboration is part of our world: what did we do well? What could we improve?

11 Explicit Requirements – CCRC'08
- SRM v2.2 (ATLAS, CMS, LHCb): Roll-out schedule defined and now in progress. Expect ~all Tier1s < end 2007, ~1/2 of Tier2s by end January 2008, ~all Tier2s by end March 2008.
- xrootd interface (ALICE): Draft document on support for this being discussed.
- R/O LFC (LHCb): Developments for R/O replicas done – patch through certification, but the new code path needs to be validated in R/O mode, e.g. at CNAF, then at other Tier1s.
- Generic agents, aka "pilot jobs" (LHCb): See discussions at MB & GDB – glexec security audited; experiments' pilot-job frameworks to follow.
- Commissioned links (CMS): According to CMS definition & measurement (DDT programme – underway, reports regularly).
- Conditions DB (ATLAS, LHCb): In production. To be tested at CCRC'08 scale…
Target: services deployed in production 2 months prior to the start of the challenge. Neither all services nor all resources will be available in February – an "integration challenge" – helping us understand problem areas prior to May's "full challenge".

12 CCRC'08 – Proposed Schedule (M. Kasemann, CCRC f2f planning meeting, October 9, 2007)
Phase 1 – February 2008. Possible scenario: blocks of functional tests, trying to reach 2008 scale at:
1. CERN: data recording, processing, CAF, data export
2. Tier-1s: data handling (import, mass storage, export), processing, analysis
3. Tier-2s: data analysis, Monte Carlo, data import and export
- Experiments have been asked to present these 'blocks' in detail at the December CCRC'08 planning meeting, including the services (WLCG, experiment) involved as well as the corresponding 'scaling factors'.
- Resource availability at sites is expected to limit the scope / scale of the challenge (e.g. not all sites will have full 2008 resources in production by then – no / reduced re-processing pass at these – e.g. read each event in the file, including a conditions DB lookup?).
Phase 2. Duration of challenge: 1 week setup, 4 weeks challenge. Ideas:
- Use the January (pre-)GDB to review the metric, the tools to drive the tests and the monitoring tools – this means we must preview the metric etc. already in the December meeting – more later!
- Use the March GDB to analyse CCRC phase 1.
- Launch the challenge at the WLCG workshop (April 21-25, 2008).
- Schedule a mini-workshop after the challenge to summarize and extract lessons learned (June 12/13 in the IT amphitheatre or Council Chamber).
- Document performance and lessons learned within 4 weeks.
Recent CSA07 experience 'suggests' that doing these things concurrently is indeed harder than separately, e.g. load on storage due to transfers + production.

13 [no text content transcribed for this slide]

14 ATLAS Scaling Factors (a quick arithmetic check follows below)
- T0 rate: 200 Hz
- T0→T1 traffic: 1020 MB/s
- T1→T2 traffic: 10-40 MB/s depending on the T2 (5-20 from real data and 5-20 from reprocessing)
- T1→T1 traffic: 40 MB/s (20 from ESD + ~20 from AOD; this assumes everybody will reprocess in the Feb CCRC'08)
- Job submission at T1s: 6000 jobs/day (over all T1s)
- MC simulation: 20% of RAW data = 30 Hz = 2.5M events/day = 100k simulation jobs/day + 10k reconstruction jobs/day at T1s
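As a sanity check on these figures, here is a minimal arithmetic sketch (not from the slides; the 86,400 s day and the MB→TB conversion are my own assumptions) showing how the per-day numbers follow from the quoted rates.

```python
# Back-of-the-envelope check of the ATLAS CCRC'08 scaling figures quoted above.
# Assumptions (mine, not from the slides): a 100%-live 86,400 s day and 1 TB = 1e6 MB.

SECONDS_PER_DAY = 86_400

t0_rate_hz = 200       # T0 trigger rate quoted above
t0_t1_mb_s = 1020      # aggregate T0->T1 export rate quoted above
mc_rate_hz = 30        # "20% of RAW data = 30 Hz" as quoted above

events_per_day = t0_rate_hz * SECONDS_PER_DAY
mc_events_per_day = mc_rate_hz * SECONDS_PER_DAY
export_tb_per_day = t0_t1_mb_s * SECONDS_PER_DAY / 1e6

print(f"T0 events/day:        {events_per_day:,.0f}")      # ~17.3M
print(f"MC events/day:        {mc_events_per_day:,.0f}")   # ~2.6M, consistent with the ~2.5M quoted
print(f"T0->T1 export TB/day: {export_tb_per_day:,.1f}")   # ~88 TB/day at 1020 MB/s
```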

15 Reliability
- Operational complexity is now the weakest link.
- Inconsistent error reporting – confused by many layers of software: local system, grid middleware, application layers.
- Communications are difficult: sites supporting several (not just LHC) experiments and sometimes other sciences; experiments coordinating across a large number of sites; multiple sites and services implicated in difficult data problems.
- Sites have different histories, different procedures, different priorities.
- A major effort is now going into monitoring**: integrating grid monitoring with site operations; experiment-specific dashboards for experiment operations and end-users; and standard metrics – comparing sites, experiments.
**Session on monitoring – Grid Middleware and Tools – Wednesday afternoon

16 WLCG Service Concerns for 2008
Scalability:
- 5-6x needed for resource capacity and number of jobs
- 2-3x needed for data transfer
- Live for now with the functionality we have
- Need to understand better how analysis will be done
Reliability:
- Not yet good enough
- Data transfer is still the most worrying – despite many years of planning and testing
- Many errors → complicated recovery procedures
- Many sources of error – storage systems, site operations, experiment data management systems, databases, grid middleware and services, networks, … Hard to get to the roots of the problems

17 CMS Critical Services (wiki)
- Rank 11: CMS stops operating – max downtime 0.5 h – not covered yet
- Rank 10: CMS stops transferring data from Cessy – max downtime: Cessy output buffer time
- Rank 9: T0 production stops – max downtime: min(T0 input buffer / Cessy output buffer) or a defined time to catch up
- Rank 8: T1/T2 production/analysis stops
- Rank 7: Services critical when needed but not needed all the time (currently includes documentation) – max downtime 0.5 h
- Rank 6: A service monitoring or documenting a critical service – max downtime 8 h
- Rank 5: CMS development stops if the service is unavailable – max downtime 24 h
- Rank 4: CMS development at CERN stops if the service is unavailable
… more …

18 ATLAS Critical Services (PDF)
- Tier 0 – Oracle database RAC (online, ATONR) – Very high – Possible loss of DCS, Run Control and Luminosity Block data while running. Run start needs configuration data from the online database. Buffering possibilities being investigated.
- Tier 0 – DDM central services – Very high – No access to the data catalogues for production or analysis. All activities stop.
- Tier 0 – Data transfer from Point 1 to CASTOR – High – Short ( 1 day): loss of data. …
- Tier 0-1 – 3D streaming – Moderate – No export of database data. The backlog can be transferred as soon as connections are resumed.
… more …

19 LHCb Critical Services (CCRC08 wiki)
Criticality 10 (= critical = 0.5 h max downtime):
- CERN VO boxes
- CERN LFC service
- VOMS proxy service
Criticality 7 (= serious = 8 h max downtime):
- T0 SE
- T1 VO boxes
- SE access from WN
- FTS channel
- WN misconfiguration
- CE access
- Conditions DB access
- LHCb Bookkeeping service
- Oracle streaming from CERN
… more …
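The rank-to-downtime mapping above (10 = critical = 0.5 h, 7 = serious = 8 h) lends itself to simple automation. The sketch below is hypothetical and not taken from the slides: it only illustrates how an alarm script might turn an outage duration and a service's rank into an escalation decision. The ranks and thresholds mirror the LHCb table; the function and dictionary names are my own.

```python
# Hypothetical sketch: map LHCb criticality ranks to the maximum tolerated downtime
# and flag services whose current outage has exceeded the target.
# Ranks and thresholds come from the LHCb table above; everything else is assumed.

MAX_DOWNTIME_HOURS = {10: 0.5, 7: 8.0}   # 10 = critical, 7 = serious

SERVICES = {
    "CERN VO boxes": 10,
    "CERN LFC service": 10,
    "VOMS proxy service": 10,
    "T0 SE": 7,
    "FTS channel": 7,
    "Conditions DB access": 7,
}

def needs_escalation(service: str, outage_hours: float) -> bool:
    """True if the outage has run past the maximum downtime for this service's rank."""
    rank = SERVICES[service]
    return outage_hours > MAX_DOWNTIME_HOURS[rank]

if __name__ == "__main__":
    for svc, hours in [("CERN LFC service", 0.75), ("FTS channel", 2.0)]:
        print(svc, "ESCALATE" if needs_escalation(svc, hours) else "within target")
```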

20 ALICE Critical Services List
- WLCG WMS (hybrid mode OK)
- LCG RB
- gLite WMS (gLite VO-box suite a must)
- FTS for T0→T1 data replication
- SRM v2.2 @ T0 + T1s
- CASTOR2 + xrootd @ T0
- MSS with xrootd (dCache, CASTOR2) @ T1s
- PROOF@CAF @ T0

21 Some First Observations
- Largely speaking, the requirements on services are more stringent for the Tier0 than for Tier1s than for Tier2s… though there are also some lower-priority services at the Tier0.
- Maximum downtimes of 30' can only be met by robust, carefully managed services with extensive automation: humans cannot intervene on these timescales if anything beyond a restart of daemons / a reboot is needed (automate…).
- Small number of discrepancies (1?): ATLAS streaming to Tier1s is classified as "Moderate" – the backlog can be cleared when back online – whereas LHCb classify this as "Serious" – max 8 hours' interruption.
- Also, the ATLAS AMI database is hosted (exclusively?) at LPSC Grenoble and is rated as "high".
- We now need to work through all services and understand whether the "standards" are being followed and whether the necessary monitoring and alarms are set up…
- Do we have measurable criteria by which to judge all of these services? Do we have the tools? (Again, < CCRC'08…)

22 Robust Services
- Services are deployed at CERN with a view to robustness: h/w, m/w, operational procedures, alarms, redundancy (power, network, middle tier, DB back-end etc.).
- This was done using a 'service dashboard' & checklist at the time of SC3 & re-visited recently. Extensive documentation on robustness to specific failure modes highlights where to improve (FTS 2.0).
- Some degree of 'rot' – needs to be followed regularly.
- Some middleware improvements are still required…
- Sharing of experience / training would be valuable.

23 Main Techniques
- Understanding the implications of service downtime / degradation.
- Database clusters – not a non-stop solution; they require significant (but understood) work on behalf of the application developer & close cooperation with the DBAs.
- Load-balanced middle tier – well proven; simple(!).
- H/A Linux as a stop-gap (VOM(R)S); has limitations.
- Follow-up: workshop at CERN in November, following a recent re-analysis with Tier1s and m/w developers; m/w & DB developers will share knowledge.
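A load-balanced middle tier of the kind mentioned above typically hides several servers behind one DNS alias. The fragment below is only an illustrative check, not part of any WLCG tool, and the alias name is a placeholder rather than a real CERN hostname: it resolves an alias and lists the addresses behind it, a quick way to confirm that more than one node is actually being served.

```python
# Illustrative only: resolve a load-balanced alias and list the hosts behind it.
# "lfc-loadbalanced.example.org" is a placeholder, not a real endpoint.
import socket

def hosts_behind_alias(alias: str, port: int = 443) -> set:
    """Return the set of IP addresses currently published for a DNS alias."""
    infos = socket.getaddrinfo(alias, port, proto=socket.IPPROTO_TCP)
    return {sockaddr[0] for _family, _type, _proto, _canon, sockaddr in infos}

if __name__ == "__main__":
    print(hosts_behind_alias("lfc-loadbalanced.example.org"))
```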

24 Running the Services
- A daily operations meeting serves as the central point for following up service problems & interventions.
- Excellent progress in integrating grid services into standard operations.
- Consistent follow-up – monitoring, logging, alarming – efficient problem dispatching.
- Still some holes – not everyone is convinced of the necessity of this… despite what experience tells us…

25 Scheduled Interventions
- Still the reason for most downtime – security patches, upgrades (fixes) etc. – often several interventions per week at the main sites.
- The impact can be reduced by performing some interventions in parallel (where this makes sense).
- An increasing number of interventions can already be done with zero user-visible downtime. This is particularly true for the LFC; FTS has some features to minimize impact, with a downtime of ~½ day per year to introduce a new version (schema changes – forward planning reduces this). CASTOR interventions (a few per year) are also ½-day downtimes, done per VO – during shutdowns / technical stops?
- There is significant pressure to look at data also during these periods – is zero user-visible downtime possible for storage services?

26 Unscheduled Interventions
- By far the worst – power & cooling! These have to be addressed by sites directly.
- Beyond that, relatively few major downtimes (this is not about on-going reliability – that has to be addressed too!):
- LFC: short-term panic last summer (ATLAS) – a problem with alarms – solved by escalation in a few hours (PK / expert call-out overnight).
- FTS: service degradations – solved by a restart of daemons (a power cycle would also have worked!).
- CASTOR: 'stuck' processes/daemons; still some improvements in monitoring needed; some well-known problems have required new versions – need to test extensively to minimize the risk of 'late surprises'.
- DB services: again some stuck clients / services – rapidly resolved by expert intervention.
- Single points of failure – and complexity – are the enemies!

27 Other Problems
- We still see far too many 'file system full' & 'system overload' type problems. This is being actively addressed by the monitoring working groups – "you can't manage what you don't measure" (a minimal measurement sketch follows below).
- Another problem that has affected many services – and is independent of DB technology – is 'DB house-keeping': largely table defragmentation or pruning… Tom Kyte: "It's a team effort…" In some cases, there is even a need for "DB 101"…
- This (to me) underlines the need for a 'WLCG Service' view, following Ian Foster's vision (quoted on the next slide).
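Given that 'file system full' is still a common failure mode, the minimal sketch below shows the kind of measurement the monitoring working groups are after. It is my own illustration, not one of the WLCG monitoring tools; the paths and thresholds are placeholders.

```python
# Minimal, hypothetical disk-space probe: warn when a filesystem used by a
# service (placeholder paths below) crosses a usage threshold.
import shutil

WATCHED = {"/var/lib/mysvc": 0.90, "/tmp": 0.95}   # path -> max allowed fraction used

def check_filesystems(watched=WATCHED):
    for path, limit in watched.items():
        usage = shutil.disk_usage(path)            # named tuple: total, used, free (bytes)
        frac = usage.used / usage.total
        status = "ALARM" if frac >= limit else "ok"
        print(f"{path}: {frac:5.1%} used (limit {limit:.0%}) -> {status}")

if __name__ == "__main__":
    check_filesystems()
```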

28 Ian Foster's Grid Checklist
3. A non-trivial level of service is achieved: "A Grid allows its constituent resources to be used in a coordinated fashion to deliver various qualities of service, relating for example to response time, throughput, availability, and security, and/or co-allocation of multiple resource types to meet complex user demands, so that the utility of the combined system is significantly greater than that of the sum of its parts."

29 On-Call Services
- The 'PK' working group identified the need for on-call services – at least for the initial years of LHC running – in the following areas: CASTOR (and related services, such as SRM); Database Services for Physics (also Streams replication?); Grid DM (FTS, LFC etc.).
- These are clearly not the only services required for e.g. data acquisition / first-pass processing, but it has been shown over a period of years that on-call experts (at CERN) can readily solve problems, and on-call teams (still under discussion) appear viable in these areas (and are needed!).
- What is (urgently) needed at T1/T2 sites? Batch and storage services? File transfer and conditions support?

30 Other Services
- There are clearly other services that are essential to the experiments' production – AFS, LSF, …, phones, web services, …
- However, it is not obvious (see the PK WG report) that an on-call rota for these services could realistically be staffed, or that it is actually needed: these are relatively stable services with infrequent expert call-out.
- This reflects not only the maturity of these services but also the care taken in setting them up.
- (A named experiment contact decides when an intervention is needed & calls the console operators, who have the list of experts.)

31 Cross-Site Services
- This is an area that is still not resolved.
- There was an excellent exposé of the issues involved at the WLCG Collaboration Workshop in Victoria – but it was not presented due to lack of time!
- Will follow up at the WLCG Service Reliability workshop at CERN in November (26+).
- This emphasizes the need for consistent (available) logging and good communication between teams.
- "UTC" – the Time of the Grid!
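One concrete way to honour "UTC – the Time of the Grid" is to make every service log in UTC rather than local site time, so that logs from different sites can be correlated directly. The sketch below is a generic Python illustration of that idea, not taken from any WLCG component; the logger name and message are placeholders.

```python
# Illustration: emit log timestamps in UTC so cross-site logs line up without
# timezone arithmetic. Generic Python; not taken from any WLCG service.
import logging
import time

formatter = logging.Formatter(
    fmt="%(asctime)sZ %(levelname)s %(name)s: %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime      # timestamps in UTC instead of local time

handler = logging.StreamHandler()
handler.setFormatter(formatter)

log = logging.getLogger("transfer-agent")   # placeholder logger name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("transfer started")                # e.g. 2007-11-20T14:03:12Z INFO transfer-agent: ...
```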

32 Guess-timates
- Network cut between the pit & B513: ~1 per decade, fixed in ~4 hours (redundant).
- Oracle cluster-ware "crash": ~1 per year (per RAC?) – recovery < 1 hour.
- Logical data corruption – database level: ~1 per decade, painful recovery (consistency checks). For scripts run directly against the (LFC) DB – much higher.
- Data corruption – file level: being addressed – otherwise a certainty!
- Power & cooling: will we get to (<) ~1 per site per year? Soon?
- Critical service interruption: 1 per year per VO? Most likely higher in 2008…
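To put these guess-timates in perspective, a rough availability calculation (my own arithmetic, using the figures on this slide and an assumed 8,760-hour year with independent incidents) shows what they imply for a single site.

```python
# Rough availability arithmetic based on the guess-timates above.
# Assumptions (mine): 8,760 hours per year; incidents are independent and the
# quoted repair times are typical.

HOURS_PER_YEAR = 8_760

# (description, incidents per year, hours of downtime per incident)
incidents = [
    ("network cut pit<->B513", 0.1, 4.0),    # ~1 per decade, ~4 h to fix
    ("Oracle cluster-ware crash", 1.0, 1.0), # ~1 per year, recovery < 1 h
]

total_down = sum(rate * hours for _name, rate, hours in incidents)
availability = 1 - total_down / HOURS_PER_YEAR

print(f"Expected downtime/year from these two causes: {total_down:.1f} h")
print(f"Implied availability (these causes only):     {availability:.4%}")
# ~1.4 h/year -> ~99.98% from these two causes alone; as the previous slide notes,
# power & cooling is by far the worse problem in practice.
```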

33 Specific Actions
- We need the 'Critical Service' list from all 4 experiments (we have ATLAS, CMS, LHCb). A large degree of commonality is expected (but a few differences have been found…). Target: the November workshop, to present these lists (or, better, first results…).
- We need to work through the existing checklist and ensure all issues are addressed – using a 'Service Dashboard' just as for the core 'WLCG' services.
- Proposal: all such services are followed up on a daily / weekly basis using the standard meetings and procedures. This includes all the things we have come to know and love: intervention plans, announcements & post-mortems. This is basically an extension and formalisation of what is done now.
- A formal programme to follow up on this work is recommended – e.g. specific targets & deadlines; actions on named individuals…
- Establish & document some "best practices" (requirements?) for new developments → future projects? (Largely done – see next slide.)

34 In a Nutshell…
- All services: WLCG / "Grid" standards.
- Key production services: + expert call-out by operator.
- CASTOR / Physics DBs / Grid Data Management: + 24 x 7 on-call (if agreed).
Several papers on this work exist (see the CERN Document Server), covering various aspects of the subject (often limited by the page count of the conference in question). We foresee a single paper summarizing the techniques and experience – a deliverable of the WLCG Service Reliability workshop.

35 Summary – We Know How to Do It!
- Well-proven technologies & procedures can have a significant impact on service reliability and even permit transparent interventions.
- We have established a well-tested checklist for setting up and running such services.
- These services must then be run together with – and in the same manner as – the 'IT (Grid) ones'.
- These techniques are both applicable and available to other sites (T1, T2, …).
- Follow-up: WLCG Service Reliability workshop, Nov 26+; report back to the next OB on Dec 4.

36 Conclusions
- The "Residual Services" identified at last year's LHCC Comprehensive Review have (largely) been delivered – or are in the final stages of preparation & deployment.
- We are now in a much better position w.r.t. monitoring, accounting and reporting – running the service – but there is no time to discuss these aspects!
- Data taking with cosmics, the Dress Rehearsals and CCRC'08 will further shake down the service. Ramping up in reliability, throughput and capacity are key priorities.
- We are on target – but not ahead – for first pp collisions. A busy – but rewarding – year ahead!

37 BACKUP

38 Ticklist for a New Service – 09/2005
- User support procedures (GGUS): troubleshooting guides + FAQs; user guides
- Operations team training: site admins; CIC personnel; GGUS personnel
- Monitoring: service status reporting; performance data
- Accounting: usage data
- Service parameters: scope – global/local/regional; SLAs; impact of service outage; security implications
- Contact info: developers; support contact; escalation procedure to developers
- Interoperation: ???
- First-level support procedures: how to start/stop/restart the service; how to check it's up; which logs are useful to send to CIC/developers and where they are
- SFT tests: client validation; server validation; procedure to analyse the error messages and likely causes; tools for CIC to spot problems; GIIS monitor validation rules (e.g. only one "global" component); definition of normal behaviour; metrics
- CIC dashboard: alarms
- Deployment info: RPM list; configuration details (for YAIM); security audit
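For the "how to check it's up" item in the ticklist, a first-level check is often just a TCP probe of the service port. The sketch below is a generic, hypothetical example, not one of the actual SFT/SAM tests; the hostname is a placeholder and the port is only what I recall as the usual LFC port.

```python
# Hypothetical first-level "is it up?" probe: try to open a TCP connection to
# the service port within a short timeout. Host and port below are placeholders.
import socket

def service_is_up(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 5010 is, as far as I recall, the default LFC port; adjust for the real service.
    print(service_is_up("lfc.example.org", 5010))
```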

39 becla@slac.stanford.edu – CHEP2K, Padua: BaBar OPR Performance Tests

