EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE Operations procedures: summary for round table Maite Barroso OCC, CERN


EGEE-II INFSO-RI Enabling Grids for E-sciencE Operations procedures: summary for round table Maite Barroso OCC, CERN 12 June 2007

Enabling Grids for E-sciencE EGEE-II INFSO-RI Ops workshop – Stockholm

Tools used in daily grid operations
- SAM - Gstat 7
- SAM admin 2
- GridView 2
- CIC 3
- GOCDB 3
- GGUS 5
- Other:
  - MoniFarm 1
  - FailRank 1
  - Unix Monitoring Tools 1
  - Central Logging Server 1
  - Nagios 4
  - Ganglia 3
  - GridICE 1
  - Experiment dashboards 1
  - Local scripts 2
  - ELFMS (quattor) 3
  - Remedy 1
  - Monalisa 1

What features are missing to make your work easier
- Middleware (mw) logs 4
- Replicated SAM admin 1
- Operational site admin manual 3
- Firewall rules 1
- Adding a stamp from the local batch system to a job's ID to improve traceability of jobs from UI to WN 1
- Tool to do correlation statistics for difficult problems related to load etc. 1
- SAM alarms directed to the site 1
- Support for scheduled upgrades 1
- Search engine in CIC portal 1
- Service monitoring 1
- Interaction CIC - CIC operator - SAM - GOCDB - GGUS 1

Scheduled/unscheduled interventions
- Similar scheduled/unscheduled causes at most sites
- MANY unscheduled interventions (as many as scheduled?):
  - Cause? Some related to mw instabilities
  - Is it realistic to think that the number will go down?
  - Or should we rather prepare for good information and recovery?

Deployment
- Service redundancy 1
- Test/validate before deploying (in PPS or SA3 or by site) 3
- Sequential updates 1
- Only apply sw updates between LHC runs 1
- Overhead of porting to fabric management tools 2
- Less frequent updates: 1 per month? 1

Communication with users
- Remote users
  - GGUS 4
  - Helpdesk 2
  - 5
- Local users
  - phone 3
  - face-to-face meetings 2
  - 6
  - training events and workshops 2
  - local mailing list 2
- Other:
  - “Grid Clinic” day establishment (open grid lab)
  - High level meeting with users 1
  - When to use broadcast? 1

Communication - cross-site issues
- ROC communication via regional mail, helpdesk, meetings 5
- Weekly ops meetings 3
- Rollout mailing list (fast and expert advice) 5
- GGUS 3 (slow)
- Broadcast: too many! 2

Problems found by COD
- Most problems are discovered and corrected before COD ticket
- Most issues detected by site admins through monitoring tools
- In some cases, issues have been reported by local users
  - 5-20% of problems detected by COD
  - 80%-10% of problems detected by users
- CIC are very useful overall 4 – links provided help to solve the issue
- Make sure problems are solved (escalation)
  - spent more time on detecting/analysing central failures 1
  - COD people filtering information 1

Operations bodies/meetings
- Weekly reports: some duplication 1
- Weekly ops meeting not very useful (preferred meeting with ROC) 1
- Not attending 1

Questions sent to the sites
- Tools used in daily grid operations
- What features are missing to make your work easier
- Examples of the most frequent scheduled interventions at your site
- Examples of the most frequent unscheduled interventions at your site
- Points to improve in communication with ROC, other sites, VOs, rest of the world...
- How do you plan deployment of updates/new versions so continuous production is not interrupted?
- Communication with users: how are you informed about operational problems at your site reported by local/remote users? Mail/GGUS/phone/other?
- Correlation of cross-site issues: is the operations meeting enough for this? How do you do it otherwise?
- What percentage of real site problems are detected and reported by central monitoring (SAM, COD) before you know about them?
- Usefulness of the following operations bodies/meetings and suggestions to improve them:
  - COD
  - your ROC support team
  - operations meeting