LCG/EGEE Operational Issues
Stephen Burke, RAL

November 1st 2004, LCG Operations - Issues

Introduction
List of problems to initiate discussion
  – A personal selection
Organisation
Operation
Monitoring & Accounting
User Support
Design
Middleware

Organisation

Release management
Sysadmins would like more information about what will be in future releases
  – With likely timescales
  – Status of dCache? VOMS?
  – Resource implications, migration etc.
Input to developments, site requirements
  – E.g. Tank & Spark, disk pool manager
How do we upgrade in a safe way?
  – Mixed version systems
Installation tools – Quattor support?

VO management
How to add a new VO
  – Who runs services
  – How to add VOs at sites
  – Resource allocation policy
  – User registration policy
Lightweight system for small VOs
  – Must scale to tens of VOs, hundreds of users
  – NA4 reports 47 VOs already!
What happens if a VO ends?

Pre-production system
How many sites?
What level of service?
What middleware?
Who will use it? When?
SA1 requirements to gLite

Operation

Current LCG Performance
Efficiency measured with a test job submitted once per hour
Plot shows weekly average over a ten-week period
Most problems are site-specific
Submitted: broker down (x2), myproxy problem, cron failure
Executed: mainly BDII empty/overloaded
Total efficiency: many detailed site-specific problems
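The efficiency figure described above (hourly test jobs, averaged per week) could be computed along these lines. This is only a minimal sketch: the function name, the `(week, succeeded)` record layout, and the sample outcomes are assumptions for illustration, not the format used by the actual LCG test framework.

```python
from collections import defaultdict

def weekly_efficiency(results):
    """results: iterable of (week_number, succeeded) pairs, one per hourly test job.

    Returns {week_number: fraction of that week's test jobs that succeeded}.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for week, ok in results:
        totals[week] += 1
        if ok:
            passed[week] += 1
    return {week: passed[week] / totals[week] for week in totals}

# Invented sample data: 168 hourly test jobs in each of two weeks.
sample = ([(1, True)] * 150 + [(1, False)] * 18
          + [(2, True)] * 84 + [(2, False)] * 84)
print(weekly_efficiency(sample))
```

Separate tallies for the "submitted" and "executed" stages would follow the same pattern with a different success flag.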

Site Management
Tools for sites to check installation
  – Most errors due to site misconfiguration or faults, e.g. NFS, clocks, ssh keys, full disks, …
  – Scaling: 1/century*10,000 = 2/week!
Ongoing certification
  – Sites seem to “decay” with time
How to certify for each VO – standard tests?
Remote control of services?
  – How fast do problems need to be fixed?
  – Small sites not used to 24*7 cover
Safe service shutdown procedure
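The scaling estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "1/century" means one fault per node per hundred years and that "10,000" is the number of nodes; the slide does not spell out either unit, and the function name is made up for the example.

```python
WEEKS_PER_YEAR = 365.25 / 7

def faults_per_week(faults_per_node_per_century, nodes):
    """Expected fault rate across the whole Grid, in faults per week."""
    per_year = faults_per_node_per_century * nodes / 100.0
    return per_year / WEEKS_PER_YEAR

rate = faults_per_week(1, 10_000)
print(f"{rate:.2f} faults per week")
```

Under those assumptions the result is 100 faults per year, which rounds to the two per week quoted on the slide.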

Flexibility
Overall system is very complex, hard to predict the effect of actions
Need to balance the freedom of sysadmins against usability of the overall system
  – The more flexibility in configuration, the harder to certify/validate
How do people know what they can do safely, what the consequences are?

Security
Incident response
  – Several intrusions lately, what if a stolen proxy is used?
  – Are all sites passing on information about incidents? No such thing as a local incident!
  – Shut down sites remotely?
Logging
  – Can use of proxies be traced?
  – Are the logs secure?
Care of private keys and proxies
Outbound IP access

Monitoring and Accounting

Monitoring
Lots of tools
  – GridICE, Ganglia, MonALISA, R-GMA, GIIS monitor …
  – Coherency?
Different clients, purposes
  – Users, sysadmins, CIC/GOC/ROC, funding bodies
  – Routine monitoring, alerts, problem tracing, measuring resources, PR, …
Info system schema not sufficient?
  – Job information missing
Must fix problems, not just monitor them!

How many sites?
Map: 82
GridICE: 77
BDII: 84
CPUs
  – GIIS: 8805
  – GridICE: 34222

Test jobs
Frequency of job submission
  – Load on CE
  – Time taken to catch problems
All jobs run as dteam
  – How to test for other VOs?
Test coverage
  – As problems are found, new tests should be added
Automatic reporting back to sites?

Accounting
Several clients
  – VOs, funding bodies, resource providers, SLAs
What granularity?
  – VOs, users, single jobs?
  – What about failed jobs?
  – Separation of jobs run at low priority?
Data protection/privacy?
Normalising different CPUs
  – CPU time or real time? Hyperthreading?
Accounting for disk space, networking
Enforcement of quotas?
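One common way to normalise usage across different CPUs is to scale raw time by a per-node benchmark rating relative to a reference value. The sketch below illustrates that idea only. The function name, the choice of CPU seconds as input, and the reference rating of 1000 are assumptions made for the example; it leaves open the slide's questions about CPU time versus real time and about hyperthreading.

```python
def normalised_cpu_hours(cpu_seconds, node_rating, reference_rating=1000.0):
    """Scale raw CPU time by the node's benchmark rating relative to a reference.

    node_rating and reference_rating are in the same (hypothetical) benchmark units.
    """
    if node_rating <= 0 or reference_rating <= 0:
        raise ValueError("ratings must be positive")
    return (cpu_seconds / 3600.0) * (node_rating / reference_rating)

# Two hours on a node rated at half the reference count the same as
# one hour on a node rated at the reference.
slow = normalised_cpu_hours(7200, node_rating=500)
fast = normalised_cpu_hours(3600, node_rating=1000)
print(slow, fast)
```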

User Support

Support needed
Single users, VO managers, sysadmins …
How do sites and VOs communicate?
  – Or sites and individual users?
Different kinds of support
  – “How do I …?”
  – Bugs
  – Requirements for extra features
  – Problems with installation/configuration
  – Site/service faults
  – Problems with applications
Training – pre-emptive support?

User support channels
Want single point of contact – GGUS?
Several FAQs – need a single FAQ database
Many problems very hard, need experts
  – Limited number of experts
  – All reading rollout list …
Mailing lists may be good for some things
  – ~150 lists, hard to know where to go!
Use of Savannah?
  – 186 open bugs, 46 assigned to “none”
  – Oldest from February

Documentation
LCG user guide fairly good, but still need EDG manuals for details
  – LCG has modified many things, need updated documentation
Need sysadmin guide, beyond installation
LCG web site is complex, hard to find info
  – Use Google to find things!
Many other web sites
  – GridPP, INFN, EGEE, …
  – Lots of info, how do you find it?

System Design

Design
EDG did not really have a complete system design
LCG has made short-term decisions on a pragmatic basis, no real overall design
  – Complicated BDII/RB structure
  – Separate services per VO
  – Config for data management tools
gLite has an architecture, but not a design for a deployed system?
How much can LCG do?
  – High level design for medium term?

UI configuration
UI needs to point to RBs, BDIIs etc.
  – Will also need VOMS config
  – How does the admin find the info? There is no single list of RBs
  – Need fallback if services are down, e.g. can configure multiple RBs but not multiple myproxies, BDIIs
Get info from info system rather than static configuration?
  – Bootstrap - need a single info system
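The fallback behaviour asked for here amounts to trying a list of configured services in order and using the first one that responds. The following is a minimal sketch of that pattern and not how the LCG UI is implemented; the host names are placeholders and the `is_up` check is a stand-in for whatever probe a real client would use.

```python
def first_available(endpoints, is_up):
    """Return the first endpoint for which is_up(endpoint) is true, else None."""
    for endpoint in endpoints:
        try:
            if is_up(endpoint):
                return endpoint
        except OSError:
            continue  # treat a connection error as "down" and try the next one
    return None

# Hypothetical host names; a real check would probe the service port.
resource_brokers = ["rb1.example.org", "rb2.example.org"]
down = {"rb1.example.org"}
chosen = first_available(resource_brokers, lambda host: host not in down)
print(chosen)
```

The same helper would apply to myproxy servers or BDIIs if the configuration allowed more than one of each.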

Security
Is VOMS being deployed?
  – If so, how will it be used?
Secure services needed
ACLs, ownership for files
Need VO groups and roles
  – Software managers lose the ability to run as normal users
Support for multi-VO membership?
  – e.g. access for atlas *and* gridpp

Glue schema
Several known problems
  – CE can’t describe scheduling policies
  – Can’t deal with inhomogeneous WNs
  – Various SE problems
  – Clear definition of measurement units
  – Need “failsafe” defaults
Who controls the schema?
New objects, e.g. RB?
Info about running jobs?

Middleware

Error handling
Services need to be fault tolerant
  – In a large Grid some things are always broken
Error messages need to say what went wrong
  – Often very hard even for experts to diagnose problems
  – Can’t tell if a job exceeded time limit!
Need consistent logging, remote diagnostics

Job submission
Passing requirements (memory, CPU time etc.) to LRMS
No simple link from RB job ID to PBS ID
Some successful jobs get resubmitted, some failed jobs look like successes, some jobs are just lost
  – If an RB fails the jobs are in limbo
RB doesn’t notice if a site keeps failing – black holes
Local environment with inhomogeneous sites
  – lcg-*, rgma, …
  – Installed software: versions, locations
  – Outbound
Ranking algorithm – EstimatedResponseTime!
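To make the "black hole" problem concrete, the sketch below flags sites whose most recent jobs have all failed. It is an illustration of the kind of check the RB is said not to perform, not a description of any existing LCG component. The function name, the `(site, succeeded)` record layout, and the threshold of five consecutive failures are all assumptions.

```python
from collections import defaultdict

def black_hole_sites(job_outcomes, threshold=5):
    """job_outcomes: (site, succeeded) pairs in submission order.

    Returns the set of sites whose most recent `threshold` jobs all failed.
    """
    streak = defaultdict(int)  # current run of consecutive failures per site
    for site, ok in job_outcomes:
        streak[site] = 0 if ok else streak[site] + 1
    return {site for site, n in streak.items() if n >= threshold}

# Invented outcomes: siteA fails five times in a row, siteB fails once.
outcomes = [("siteA", False)] * 5 + [("siteB", True), ("siteB", False)]
print(black_hole_sites(outcomes))
```

A site that fails quickly drains the queue because it always looks free, so a check like this would need to feed back into ranking to be useful.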

Data management
Check consistency between SE and LRC
  – Sites should avoid deleting files!
“Close SE” concept is not well defined
  – Info system, brokerinfo, edg-rm config, environment variables
Does the classic SE disappear?
  – Could improve some things
Control access, e.g. to shut down?
  – iptables -> block SYN packets!
File ownership?
Space management on WNs
Software installation
  – Is Tank & Spark sufficient?
  – What about the gLite solution?

Information system
Unique service index to locate all services
  – How to find BDIIs?
  – Need to know about RB/BDII/myproxy relations
  – How to understand the structure of the system
Info provider configuration is complex, hard to diagnose problems
  – Need management tool?
How to find out service status info?
  – Use web pages as well?
  – Entries in the GOC DB not enough

Conclusion

Priorities
Security incident response
Ongoing site validation
  – Problems must be fixed and not just detected
Pre-production system
VO management (VOMS?)

Next Steps
So many problems, so little time …
Organise targeted working groups
Develop operations model for next year
  – Have to please many constituencies: VOs, users, sysadmins, middleware developers, funding bodies, …
  – Not always with the same goals
Who agrees to do what?