WLCG Service Report ~~~ WLCG Management Board, 17th February 2009

Overview
Today’s report will be short: no “major” incidents last week.
There was the usual background of problems that were addressed as they arose – see minutes from daily calls. [ Copied from last week’s report! ] :-)
As mentioned at yesterday’s LHCC mini-review, it would be nice to include some additional “key performance indicators”, such as:
1. Summary of (un)scheduled interventions (including overruns) at main sites
2. Summary of sites “suspended” by VOs. Do sites always (even?) know they have been suspended?
3. Production / analysis summaries (e.g. “VOviews”)

Daily Reports
“I (Daniele Bonacorsi) have been filling - on behalf of CMS, and just for the WLCG Ops daily calls of ours, now since 2 weeks - the twiki: ilyreports
It seems to me it works, both as a reference for discussion, and for your minutes. If so, and if you agree, I propose I keep this habit of mine for the future.”
IMHO this is very useful, and it would be good if it could be adopted for the reports from the other experiments. [ This also facilitates reporting to other meetings ]

GGUS Summary
VO concerned   USER   TEAM   ALARM   TOTAL
ALICE          3      0      0       3
ATLAS
CMS            8      0      1       9
LHCb           3      3      0       6
Totals
The one alarm ticket was a test (Daniele Bonacorsi to CNAF):
To be sure that a problem I had with GGUS alarm to CNAF is now solved, please anybody at CNAF receiving this 1) be aware it's a TEST and not a problem report, and 2) just CLOSE IT and mail me any details. Regards, DanieleB (CMS)
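A quick sanity check on the ticket counts, as a minimal sketch: it assumes the run-together digits on this slide (ALICE3003, CMS8019, LHCb3306) split into USER, TEAM, ALARM and TOTAL, and verifies that each row's total is the sum of the other three columns. The ATLAS and Totals rows are left out because their values are not legible here. The function name `row_is_consistent` is illustrative only.

```python
# Per-VO ticket counts as read from the slide: (user, team, alarm, total).
# Assumption: the fused digits split one digit per column.
# The ATLAS and Totals rows are not legible in the source, so they are omitted.
tickets = {
    "ALICE": (3, 0, 0, 3),
    "CMS": (8, 0, 1, 9),
    "LHCb": (3, 3, 0, 6),
}


def row_is_consistent(user, team, alarm, total):
    """Return True when the TOTAL column equals USER + TEAM + ALARM."""
    return user + team + alarm == total


# Collect any VO whose row does not add up.
inconsistent = [vo for vo, row in tickets.items() if not row_is_consistent(*row)]
print(inconsistent)
```

All three legible rows add up under this reading, and the single ALARM ticket falls in the CMS row, which matches the test ticket described on the slide.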

Experiment-specific Issues
Experiment   Issue
ALICE        On-going WMS issues still being debugged; seriously impacted experiment’s production: next steps
ATLAS        Some issues related to scheduling / communication of cleaning of PNFS now completed! (see announcement below)
CMS          Several issues reported but promptly followed up by experts / site contacts
LHCb         Some issues related to low numbers of running batch jobs – on-going reconfiguration and investigation. (Believed to be related to implementing the pilot role at CERN which gave problems with the LSF shares – now reported as fixed).
Start of downtime [UTC]: :00
End downtime [UTC]: :00
FZK-LCG2/gridka-dCache.fzk.de/SRM
1. installing a dCache patch to fix queue allocation and improve throughput
2. shrinking ATLAS pnfs database (may improve throughput for ATLAS)
3. upgrade Postgres DB (which prevents uncontrolled PNFS DB growth)

WMS / ALICE
1. Set up 2 new WMS at CERN with the latest 4.3 version, to be deployed for ALICE use only. These two new WMS will be put in production alongside the current ones, so the experts can stop them, drain them, or perform any other operation they consider necessary in a way that is totally transparent for ALICE.
2. In addition, we are putting the egee-rb-09 WMS in production at CNAF. It also has some fixes for ALICE, for example the drain flag: this procedure puts the WMS directly into drain mode as soon as the number of input requests becomes impossible to manage.
3. The CNAF procedure has been sent to the WMS experts at CERN so that they can follow the same procedure, but it seems it is not yet in production. We hope to gain enough familiarity with these procedures to provide feedback to the developers and also to the site admins.

Intervention Summary (fake)
Site    # scheduled   # overran   # unscheduled   Hours sched.   Hours unsched.
Bilbo
Frodo
Drogo
As with the GGUS summary, we will drill down in case of exceptions (examples highlighted above).
Q: what are reasonable thresholds?
Proposal: look briefly at ALL unscheduled interventions, ALL overruns and “high” (TBD) # of scheduled
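The proposed drill-down rule can be sketched as a small filter. This is only an illustration under stated assumptions: the counts below are invented (the slide's table is itself marked fake and its values are not shown), the names `InterventionRow` and `drill_down_reasons` are hypothetical, and `HIGH_SCHEDULED = 3` is a placeholder for the threshold the slide leaves as TBD.

```python
from typing import List, NamedTuple


class InterventionRow(NamedTuple):
    """One row of the intervention summary (hypothetical structure)."""
    site: str
    scheduled: int
    overran: int
    unscheduled: int


# Placeholder value: the slide leaves the "high" threshold as TBD.
HIGH_SCHEDULED = 3


def drill_down_reasons(row: InterventionRow,
                       high_scheduled: int = HIGH_SCHEDULED) -> List[str]:
    """Return the reasons a site should be looked at, or an empty list."""
    reasons = []
    if row.unscheduled > 0:           # ALL unscheduled interventions
        reasons.append("unscheduled")
    if row.overran > 0:               # ALL overruns
        reasons.append("overrun")
    if row.scheduled > high_scheduled:  # "high" number of scheduled
        reasons.append("high scheduled count")
    return reasons


# Invented example counts; site names are the fake ones from the slide.
rows = [
    InterventionRow("Bilbo", scheduled=1, overran=0, unscheduled=0),
    InterventionRow("Frodo", scheduled=2, overran=1, unscheduled=0),
    InterventionRow("Drogo", scheduled=5, overran=0, unscheduled=2),
]

flagged = {r.site: drill_down_reasons(r) for r in rows if drill_down_reasons(r)}
print(flagged)
```

With these invented numbers, Bilbo is not flagged, Frodo is flagged for an overrun, and Drogo for unscheduled interventions and a high scheduled count. Changing `high_scheduled` is the only tuning knob, which mirrors the open question on the slide about reasonable thresholds.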

Site / Cloud status (examples)
VO      Site / Cloud   Status    Duration   Reason
ATLAS   NL-T1          Offline   8 hours    Network reconfiguration
Where do we harvest this information?
Could be useful to report at daily operations meeting (change of state)

CMS Dashboard – Site Availability

Summary
Another calm week – the 2nd in a row
Start of a trend or correlation with school holidays in some areas??? Let’s hope the former… :-)
Agree on WLCG Operations Roadmap 2009/2010 in Prague!

Workshop News
Some 220 people had registered by the end of last week, including 20 for the workshop only
Numbers in Victoria and Mumbai were a little lower – 180 people on both occasions by time of event
The agenda is now rather full – speakers should aim to leave at least 30% (or more…) time for questions and discussions…
Talks should be oriented towards operations / service delivery and not just status reports…