8th CIC on Duty meeting, Krakow, 16-18/05/2006. Enabling Grids for E-sciencE. Feedback from SEE first COD shift. Emanoil Atanassov, Todor Gurov.


Outline
- Operational tools
- Observations on site problems
- Other problems and issues
- Proposals for improvement
- Conclusions

Operational Tools
- GSTAT: sometimes GSTAT gets stuck. It would be nice if it gave a warning, so that we can switch to the CNAF mirror. Downtimes are not consistent with GOC DB (a difference of about 1 hour). Please double-check whether this is now fixed.
- GGUS: it was down at the start of the shift and was fixed within about an hour. If GGUS is down, the CIC dashboard is also unavailable, which is bad.
- Certificate lifetime monitor: no problems, but it shows information only about CEs.

Operational Tools (2)
- CIC portal: very useful. A new version was put in place during the shift, but it had some problems showing a site with more than one ticket. The problems were reported and resolved.
- SFTs: it takes some time to realize that SFTs are not being updated. A warning would be helpful, so that new SFTs could be run manually.
- SFTs for the PPS on the gLite site failed throughout the shift because of middleware problems that could not be resolved. PPS sites seem to respond well to tickets, but the problems could not be solved easily.
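The warning requested above for stuck GSTAT or SFT results could be as simple as a staleness check on the timestamp of the newest result. The following is a minimal sketch only: the function name `is_stale` and the 3-hour threshold are assumptions made for illustration, not part of any existing operational tool.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how old the newest result may be before we warn.
MAX_AGE = timedelta(hours=3)

def is_stale(last_update, now, max_age=MAX_AGE):
    """Return True when the newest result is older than max_age."""
    return now - last_update > max_age

now = datetime(2006, 5, 16, 12, 0)
for name, last_update in [
    ("site-A", datetime(2006, 5, 16, 11, 0)),  # updated 1 hour ago
    ("site-B", datetime(2006, 5, 16, 6, 0)),   # updated 6 hours ago
]:
    if is_stale(last_update, now):
        print(f"WARNING: results for {name} are stale, consider a manual run")
```

With a check like this, the operator on shift would learn about stuck results immediately instead of noticing them by chance.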

Observations on site problems
- JS problems: several sites had JS problems. We noticed that LHCb created several tickets with the subject “More than 1000 failed jobs at …” for such sites. Our impression is that many of these problems are caused by some, but not all, worker nodes being misconfigured, and that they are not really resolved by the site administrators (“We did nothing, but the problem is resolved.”). Especially in the case of the “Maradona problem”, site administrators could be asked to identify the WN where the job failed.
- Sites with replica management problems: mostly sites using some other BDII, not lcg-bdii.cern.ch. There are lots of tickets and lots of e-mails on such issues, but the problem is easily solved by changing two files on the WNs, lcgenv.sh and lcgenv.csh.
- R-GMA problems: there was no time to deal with them, because the number of affected sites is large. Many sites did not upgrade their MON boxes to secure R-GMA and consequently fail the R-GMA test. Is this test really a critical test? If it is, why does GSTAT not show the site in CT in such a case?
- JL problems: several of the big sites with more than one CE had non-trivial JL problems. They appear to be caused by high load on the CE, which usually results from users submitting jobs directly to globus, or something similar. This seems to be a middleware problem.
- The current procedure does not deal efficiently with sites that mark a problem as “solved” but do not really do anything about it. Frequently it is immediately obvious that the problem is not solved, but what are we supposed to do?

Other problems and issues
- Downtimes: the CIC portal gives a warning if a ticket is about to be issued for a site that has been, or is going to be, in downtime during the same day. However, it is unclear whether tickets should be issued for sites that are down outside their scheduled downtime; this is a question about the procedure. This led to some confusion with one USA site. In the new GOC DB2 the downtimes are entered in the local time zone, which solves this problem. However, we still believe GSTAT is not entirely in sync with that.
- The 4444 problem: many sites showed this problem. It appears to have been a middleware problem, and it should result in a ticket only if the site shows 4444 waiting jobs for an extended period of time.
- If the GIIS for a site is down, the site appears in the CIC portal as having 0 CPUs and is shown at the bottom of the list, even if it usually has 1000 CPUs. The number of CPUs should be taken as an average over some period of time to avoid this problem.
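The averaging proposed in the last point can be illustrated as follows. This is a sketch under stated assumptions: the function name `effective_cpu_count`, the window of 24 samples and the sample values are hypothetical and are not taken from the CIC portal.

```python
from statistics import mean

def effective_cpu_count(samples, window=24):
    """Average the most recent `window` CPU-count samples (oldest first)."""
    recent = samples[-window:]
    return round(mean(recent)) if recent else 0

# A site that normally publishes 1000 CPUs; its GIIS is down for the last sample.
samples = [1000] * 23 + [0]
print(effective_cpu_count(samples))  # 958, instead of the instantaneous 0
```

A single missing reading then moves the site only slightly in a list sorted by size, rather than dropping it to the bottom.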

Proposals for improvement: operational tools
- The CIC portal shows a site as down if it is down in GSTAT. However, it is very annoying to open the site's page, run ldapsearch, and see that the site is actually up and nothing needs to be done. The same applies to SFTs: one SFT failure is not enough. Therefore some history of failures should be visible in the CIC portal.
- The history of tickets issued for a site should be easily visible, because some sites may have “replication failed” most of the time and always be in quarantine, which creates a vicious cycle of “1st, 2nd, 1st, quarantine, 1st” and so on.
- However, ticket reopening should be avoided, because it is not clear how relevant the old ticket is to the new situation.
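One way to use a failure history, so that a single failed SFT does not trigger action, is to require several consecutive failures. The sketch below assumes a hypothetical helper `needs_attention` and a threshold of 3 consecutive failures; neither comes from the CIC portal.

```python
def needs_attention(history, min_consecutive=3):
    """history: test results, oldest first, True = passed.

    Flag the site only when the last `min_consecutive` results all failed.
    """
    if len(history) < min_consecutive:
        return False
    return not any(history[-min_consecutive:])

print(needs_attention([True, True, False]))          # one isolated failure
print(needs_attention([True, False, False, False]))  # persistent failure
```

The threshold is a policy choice: a larger value avoids false alarms from short network glitches but delays the reaction to real problems.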

Proposals for improvement (2): operational metrics
- The metrics are not ideal.
- A good metric for a site's operation is the “number of tickets”.
- We propose to call downtime as measured by SFTs “SFT-downtime”, because sometimes a site is up and running but some test fails for some reason, perhaps just because of a network problem that lasts for 1 minute.
- Another possible metric is “ticket downtime”: the time during which the site has a critical ticket whose status is “1st” or “2nd” (or even quarantine). We believe that there are sites that are less than 50% available with respect to “ticket downtime”.
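As an illustration of how “ticket downtime” might be computed, the sketch below sums the time covered by critical-ticket intervals in the relevant states and divides by the reporting period. The data layout (one `(start, end, status)` tuple per interval), the status labels and the dates are assumptions made for this example, not the format of any real ticketing system.

```python
from datetime import datetime, timedelta

# Statuses that count as downtime under the proposed metric (labels assumed).
DOWN_STATES = {"1st", "2nd", "quarantine"}

def ticket_downtime_fraction(tickets, period_start, period_end):
    """Fraction of the period covered by critical tickets in DOWN_STATES.

    Overlapping intervals are counted only once.
    """
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end, status in tickets
        if status in DOWN_STATES and start < period_end and end > period_start
    )
    covered = timedelta(0)
    cursor = period_start  # end of the time already counted
    for start, end in clipped:
        start = max(start, cursor)
        if end > start:
            covered += end - start
            cursor = end
    return covered / (period_end - period_start)

tickets = [
    (datetime(2006, 5, 1), datetime(2006, 5, 4), "1st"),
    (datetime(2006, 5, 3), datetime(2006, 5, 7), "2nd"),     # overlaps the first
    (datetime(2006, 5, 9), datetime(2006, 5, 10), "solved"),  # not counted
]
fraction = ticket_downtime_fraction(
    tickets, datetime(2006, 5, 1), datetime(2006, 5, 11)
)
print(f"ticket downtime: {fraction:.0%}")
```

In this made-up example 6 of the 10 days are covered, so the site would be 40% available with respect to “ticket downtime”.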

Proposals for improvement (3): SA1 operational procedures (in the document)
- Page 4, bottom, point 2 in subsection 3.2, “CIC-on-duty team regular tasks”: the text says “….at the end of the day”. Maybe the correct phrase should be “…at the end of the shift (week).”
- Subsection 3.3, the 2nd item: the correct mailing list is project-eu-egee-
- All tables should be numbered.
- Page 10: there are 3 steps in the 1st table (subsection 5.4), but step 4 is mentioned in the paragraph under the table.
- Page 15, from the top, point 4, in the 1st item: “…in the table 3”. No such table exists.
- In table 2, “Phone call to ROC”: does this escalation step exist or not?

Conclusion
- The Grid is not about solving tickets; the Grid is for users.
- It is extremely annoying to see 50% of your jobs being aborted at some sites.
- Example: a site that has 100 nodes, with 1 WN misconfigured, will fail 1% of the SFTs and may not even get a ticket. But if a user submits 200 jobs to the site while it is empty, the user will see 99 jobs OK and 101 jobs aborted.
- That is why we believe the monitoring should be more stringent.
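The arithmetic behind the 99/101 example can be reproduced with a toy model. It assumes, for illustration only, that each node runs one job at a time, that a job on a good node keeps the node busy for the whole burst, and that the misconfigured node aborts its job at once and is immediately free again. The function `run_burst` is hypothetical and does not model any real batch scheduler.

```python
from collections import deque

def run_burst(total_nodes=100, bad_nodes=1, jobs=200):
    """Return (ok, aborted) for a burst of jobs sent to an empty site."""
    free = deque(["bad"] * bad_nodes + ["good"] * (total_nodes - bad_nodes))
    ok = aborted = 0
    for _ in range(jobs):
        if not free:
            break  # no free node left: remaining jobs would wait in the queue
        node = free.popleft()
        if node == "bad":
            aborted += 1
            free.append(node)  # job fails at once, so the node is free again
        else:
            ok += 1  # good node stays busy for the rest of the burst
    return ok, aborted

ok, aborted = run_burst()
print(ok, aborted)  # 99 101
```

Under these assumptions the 99 good nodes each take one job, and the single bad node absorbs the remaining 101, even though it is only 1% of the site.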