WLCG Service Report ~~~ WLCG Management Board, 17th February 2009.


1 WLCG Service Report Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 17th February 2009

2 Overview
Today’s report will be short: no “major” incidents last week.
There was the usual background of problems that were addressed as they arose; see the minutes from the daily calls. [ Copied from last week’s report! ]
As mentioned at yesterday’s LHCC mini-review, it would be nice to include some additional “key performance indicators”, such as:
1. Summary of (un)scheduled interventions (including overruns) at main sites
2. Summary of sites “suspended” by VOs. Do sites always (even?) know they have been suspended?
3. Production / analysis summaries (e.g. “VOviews”)

3 Daily Reports
“I (Daniele Bonacorsi) have been filling - on behalf of CMS, and just for the WLCG Ops daily calls of ours, now since 2 weeks - the twiki: https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports It seems to me it works, both as a reference for discussion, and for your minutes. If so, and if you agree, I propose I keep this habit of mine for the future.”
IMHO this is very useful and it would be good if it could be adopted for the reports from the other experiments. [ This also facilitates reporting to other meetings ]

4 GGUS Summary
VO concerned  USER  TEAM  ALARM  TOTAL
ALICE            3     0      0      3
ATLAS           20     2      0     22
CMS              8     0      1      9
LHCb             3     3      0      6
Totals          34     5      1     40
The one alarm ticket was a test (Daniele Bonacorsi to CNAF): “To be sure that a problem I had with GGUS alarm to CNAF is now solved, please anybody at CNAF receiving this 1) be aware it's a TEST and not a problem report, and 2) just CLOSE IT and mail me any details. Regards, DanieleB (CMS)”
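
The totals in the GGUS table can be cross-checked with a few lines of Python. This is only a sketch: it assumes the flattened rows read as USER, TEAM and ALARM counts per VO, and the names `tickets` and `summarise` are made up for illustration rather than taken from any GGUS tool.

```python
# Ticket counts per VO as (USER, TEAM, ALARM), as read from the table above.
tickets = {
    "ALICE": (3, 0, 0),
    "ATLAS": (20, 2, 0),
    "CMS": (8, 0, 1),
    "LHCb": (3, 3, 0),
}


def summarise(tickets):
    """Return (total per VO, totals per ticket type) for the given counts."""
    per_vo = {vo: sum(counts) for vo, counts in tickets.items()}
    # zip(*...) transposes the rows so that each column can be summed.
    per_type = tuple(sum(column) for column in zip(*tickets.values()))
    return per_vo, per_type


per_vo, per_type = summarise(tickets)
grand_total = sum(per_vo.values())

for vo, total in per_vo.items():
    print(f"{vo:<8}{total:>4}")
print(f"{'Totals':<8}{grand_total:>4}  (USER, TEAM, ALARM) = {per_type}")
```

Under this reading the row and column sums agree with the Totals line, including the single (test) alarm ticket.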

5 Experiment-specific Issues
Experiment: Issue
ALICE: On-going WMS issues still being debugged; seriously impacted experiment’s production: next steps
ATLAS: Some issues related to scheduling / communication of cleaning of PNFS database at FZK: now completed! (see announcement below)
CMS: Several issues reported but promptly followed up by experts / site contacts
LHCb: Some issues related to low numbers of running batch jobs; on-going reconfiguration and investigation. (Believed to be related to implementing the pilot role at CERN, which gave problems with the LSF shares; now reported as fixed.)
Start of downtime [UTC]: 17-02-2009 08:00
End downtime [UTC]: 17-02-2009 12:00
FZK-LCG2/gridka-dCache.fzk.de/SRM
1. installing a dCache patch to fix queue allocation and improve throughput
2. shrinking ATLAS PNFS database (may improve throughput for ATLAS)
3. upgrade Postgres DB (which prevents uncontrolled PNFS DB growth)

6 WMS / ALICE
1. Setup of 2 new WMS at CERN with the latest 4.3 version, which will be deployed for ALICE use only. These two new WMS will be put in production alongside the current ones, so the experts can stop them, drain them, or carry out any other operation they consider necessary in a way that is totally transparent for ALICE.
2. In addition we are putting the egee-rb-09 WMS in production at CNAF. It also has some fixes for ALICE, for example the drain flag. This procedure will directly put the WMS in drain mode as soon as the number of input requests becomes impossible to manage.
3. The CNAF procedure has been sent to the WMS experts at CERN so that they can follow the same procedure, but it seems it is not yet in production. We hope to gain enough familiarity with these procedures to provide feedback to the developers and also to the site admins.
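
The drain-flag idea in point 2 can be pictured with a small sketch. It is not the CNAF procedure itself, whose details are not given here: the class `WmsNode`, the functions `apply_drain_flag` and `pick_target`, the threshold `max_pending` and the node names are all hypothetical, and "impossible to manage" is simply assumed to mean "backlog above a fixed limit".

```python
from dataclasses import dataclass


@dataclass
class WmsNode:
    """Hypothetical view of one WMS instance."""
    name: str
    pending_requests: int
    draining: bool = False


def apply_drain_flag(node, max_pending):
    """Set the drain flag once the backlog exceeds max_pending.

    The flag is never cleared here; leaving drain mode is assumed to be
    a manual decision by the experts.
    """
    if node.pending_requests > max_pending:
        node.draining = True
    return node.draining


def pick_target(nodes, max_pending):
    """Return the least loaded node that is not draining, or None."""
    candidates = [n for n in nodes if not apply_drain_flag(n, max_pending)]
    return min(candidates, key=lambda n: n.pending_requests, default=None)


nodes = [WmsNode("wms-a", 12000), WmsNode("wms-b", 300)]
target = pick_target(nodes, max_pending=5000)
print(target.name if target else "no WMS available")
```

With several instances in production, an overloaded one drops out of the pool automatically while submission continues on the others, which is what makes the operation transparent for the experiment.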

7 Intervention Summary (fake)
Columns: Site, # scheduled, # overran, # unscheduled, Hours sched., Hours unsched.
Bilbo 501104
Frodo 110222
Drogo 27001650
As with the GGUS summary, we will drill down in case of exceptions (examples highlighted above).
Q: what are reasonable thresholds?
Proposal: look briefly at ALL unscheduled interventions, ALL overruns and a “high” (TBD) # of scheduled interventions.
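
The proposed drill-down rule could be expressed roughly as follows. This is a sketch under stated assumptions: the site names and counts are invented and do not come from the table above, and `high_scheduled` stands in for the threshold that the slide leaves as TBD.

```python
from typing import NamedTuple


class SiteWeek(NamedTuple):
    """One site's interventions for the week (hypothetical record layout)."""
    site: str
    scheduled: int
    overran: int
    unscheduled: int


def drilldown_reasons(row, high_scheduled):
    """Return the reasons, if any, why this site deserves a closer look."""
    reasons = []
    if row.unscheduled > 0:
        reasons.append("unscheduled")      # ALL unscheduled interventions
    if row.overran > 0:
        reasons.append("overrun")          # ALL overruns
    if row.scheduled >= high_scheduled:
        reasons.append("many scheduled")   # "high" number, threshold TBD
    return reasons


def drilldown_report(rows, high_scheduled):
    """Map each flagged site to its reasons; unflagged sites are omitted."""
    report = {}
    for row in rows:
        reasons = drilldown_reasons(row, high_scheduled)
        if reasons:
            report[row.site] = reasons
    return report


rows = [
    SiteWeek("site-a", scheduled=4, overran=0, unscheduled=2),
    SiteWeek("site-b", scheduled=1, overran=1, unscheduled=0),
    SiteWeek("site-c", scheduled=30, overran=0, unscheduled=0),
    SiteWeek("site-d", scheduled=2, overran=0, unscheduled=0),
]
print(drilldown_report(rows, high_scheduled=20))
```

Keeping the threshold as a parameter makes it easy to try different values while the question of reasonable thresholds is still open.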

8 Site / Cloud status (examples)
VO     Site / Cloud  Status   Duration  Reason
ATLAS  NL-T1         Offline  8 hours   Network reconfiguration
Where do we harvest this information?
Could be useful to report at daily operations meeting (change of state)
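
Reporting only the changes of state could look something like this. Where the status information would be harvested from is still an open question, so this sketch simply assumes two snapshots keyed by (VO, site); the function name `state_changes` and the "unknown" placeholder are illustrative choices.

```python
def state_changes(previous, current):
    """Compare two status snapshots and list every (key, old, new) change.

    Both arguments map (VO, site) to a status string. A key missing from
    one snapshot is treated as "unknown" in that snapshot.
    """
    changes = []
    for key in sorted(set(previous) | set(current)):
        old = previous.get(key, "unknown")
        new = current.get(key, "unknown")
        if old != new:
            changes.append((key, old, new))
    return changes


yesterday = {("ATLAS", "NL-T1"): "Online", ("CMS", "site-x"): "Online"}
today = {("ATLAS", "NL-T1"): "Offline", ("CMS", "site-x"): "Online"}

for (vo, site), old, new in state_changes(yesterday, today):
    print(f"{vo} {site}: {old} -> {new}")
```

Only entries whose status differs between the two snapshots are listed, which keeps the item short enough for a daily meeting.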

9 CMS Dashboard – Site Availability

10 Summary
Another calm week, the 2nd in a row.
Start of a trend or correlation with school holidays in some areas??? Let’s hope the former…
Agree on WLCG Operations Roadmap 2009/2010 in Prague!

11 Workshop News
Some 220 people had registered by the end of last week, including 20 for the workshop only.
Numbers in Victoria and Mumbai were a little lower: 180 people on both occasions by the time of the event.
The agenda is now rather full; speakers should aim to leave at least 30% (or more…) of their time for questions and discussions…
Talks should be oriented towards operations / service delivery and not just status reports…

