WLCG Service Report ~~~ WLCG Management Board, 18th November 2008

Overview – from last week

– Overall goal is to move rapidly to a situation where (weekly) reporting is largely automatic, and then focus on (i.e. talk about) the exceptions…
– Have recently added a table of Service Incident Reports; the follow-up on each case cannot be automatic!
– Propose adding:
  – A table of alarm / team tickets and a timeline; maybe also non-team tickets, plus some indicator of activity / issues
  – A summary of scheduled / unscheduled interventions, including a cross-check of the advance warning against the WLCG / EGEE targets – e.g. you can’t “schedule” a 5 h downtime 5 minutes beforehand (see the sketch below)
– Some “high-level” view of experiment / site activities is also being considered: how to define views that are representative and comprehensible?
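The advance-warning cross-check could be automated along these lines; a minimal sketch in Python, where the notice-period thresholds are illustrative assumptions and not the actual WLCG / EGEE targets.

```python
from datetime import datetime, timedelta

# Assumed (illustrative) notice-period targets: interventions longer than
# 4 hours should be announced at least 24 hours in advance, shorter ones
# at least 1 hour in advance.  The real WLCG / EGEE targets may differ.
def sufficient_notice(announced_at, start, end):
    duration = end - start
    notice = start - announced_at
    required = timedelta(hours=24) if duration > timedelta(hours=4) else timedelta(hours=1)
    return notice >= required

# Example: a 5-hour downtime "scheduled" 5 minutes beforehand fails the check.
announced = datetime(2008, 11, 17, 9, 55)
start = datetime(2008, 11, 17, 10, 0)
end = datetime(2008, 11, 17, 15, 0)
print(sufficient_notice(announced, start, end))  # False
```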

Summary of the week

– Relatively smooth week – perhaps because many people were busy with numerous overlapping workshops and other events!
– ASGC – more problems started on Friday and by Monday only 70% efficiency was seen by ATLAS (CMS also sees a degradation); the expectation is that this will degrade further. In contact with ASGC – possible con-call tomorrow.
– FZK – some confusion regarding the LFC read-only replica for LHCb. A simple test shows that entries do not appear in the FZK replica whereas they do in others (e.g. CNAF); a sketch of such a test follows this slide. This should be relatively low-impact but needs to be followed up and resolved. It is not entirely clear how LHCb distinguishes between an LFC with stale data that is otherwise functioning normally and an LFC with current data…
– Weekly summaries: GGUS tickets, GOCDB intervention summary and baseline versions are now all part of the meeting template.

⇒ A relatively smooth week!
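A minimal sketch of the kind of replica-consistency probe described above, using the standard LFC client command lfc-ls and the LFC_HOST environment variable; the host names and the LFN used here are placeholders, not the actual LHCb catalogue endpoints.

```python
import os
import subprocess

def entry_visible(lfc_host, lfn):
    """Return True if the LFN is listed by the given LFC instance."""
    env = dict(os.environ, LFC_HOST=lfc_host)  # the LFC client reads the server name from LFC_HOST
    result = subprocess.run(["lfc-ls", lfn], env=env,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

# Placeholder hosts and LFN: compare the master catalogue with a read-only replica.
MASTER = "lfc-master.example.cern.ch"
REPLICA = "lfc-replica.example.gridka.de"
lfn = "/grid/lhcb/test/consistency-probe"

if entry_visible(MASTER, lfn) and not entry_visible(REPLICA, lfn):
    print("Entry present in master but missing in replica: replication lag or stale replica")
```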

WLCG “Data Taking Readiness” Workshop
Some Brief Comments…

Overview

– Over 90 people registered and at (many) times it was “standing room only” in the IT amphitheatre (100 seats). Attendance was pretty even throughout the entire two days – even early morning – with a slight dip after lunch.
– Not really an event where major decisions were – or even could be – made; more an on-going operations event. Probably the main point: on-going, essentially non-stop production usage from the experiments ⇒ an on-going production service!
– Overlapping of specific activities between (and within) VOs should be scheduled where possible… To be followed…
– Personal concern: we are still seeing too frequent and too long service / site degradations. Maybe we can “tolerate” this for the main production activities – but what will happen when the full analysis workload is added?
– Matching opportunity: compared to this time last year – when we were still “arguing” about SRM v2.2 service roll-out – we have (again) made huge advances. We can expect significant service improvement in the coming months / year. But remember: late testing (so far) has meant late surprises (aka the “Fisk phenomenon”).

Actions

– A dCache workshop – most likely hosted by FZK – is being organised for January 2009.
– Discussions on similar workshops for the other main storage implementations, with overlapping agendas:
  1. Summary of experiment problems with the storage: what does not work that should work because it is needed; what does not work but you just have to deal with it
  2. Instabilities of the setup: timeouts, single points of failure, VO reporting, site monitoring, known (site-specific) bottlenecks
  3. Instabilities inherent in the use of the storage middleware at hand: bugs/problems with SRM, FTS, gridFTP, the clients used

Summary on Data-taking Readiness WS
Patricia Mendez, R. Santinelli on behalf of the “Rapporteurs”; Alberto Aimar, Jeremy Coles

LHC status

– Interesting (technical) insights into the 19th September incident and a description of the operation to bring the machine back to life.
– Question: when will it restart operating? Simply not known; a clearer picture is expected by the end of December.
– Enormous temptation to relax a bit: never mind. [Ed: I think that this means “don’t”]

CRSG report to C-RRB

– Harry summarised the report to the C-RRB on the scrutiny performed by the C-RSG subgroup.
– First time that public assessments of the LHC experiments’ requirements have been scrutinised.
– Some discrepancies between the 2008/09 requests and the advocated resources for each experiment, because of the LHC shutdown and fairly old TDRs (2005).
– This is only advisory, the first step of a resource allocation process in which the final decision rests with the LHCC.
– Burning question from the C-RRB: why more resources for 2009, when site usage has usually shown only 2/3 of the resources being used?

Experiments plans: re-validation of the service

Common points:
– No relaxing: keep the production system running as it was during data taking.
– Cosmic data are available (at least for ATLAS and CMS; LHCb enjoys muons from the SPS beam dump).
– Plans are to be tuned with the LHC schedule. The assumption is that the LHC is operational in May, with then roughly 2/3 real data taking (cosmics / beam collisions) and 1/3 final dress-rehearsal tests.
– No need for a CCRC’09 like in 2008, since the experiments are more or less continuously running their activities.
– Some activities should at least be advertised (throughput tests).
– Some others need to be carried out (staging at the T1s for all experiments).

CMS

– Recent global runs: CRUZET and CRAFT (the latter with the 3.8 T magnetic field). Analysis of these data is ongoing; reprocessing with new releases of CMSSW in a few weeks and in January.
– End-to-end analysis: it is not a challenge, but CMS needs to go through it to validate the quality of the whole chain, from the RAW data to the final histograms.
– Another CRUZET global run with all subdetectors on will take place before the LHC start (April/May).
– 64-bit: CMS software is not ready yet and there is no timeline for the move. WLCG should provide 64-bit resources with 32-bit compatibility.
– As far as facilities are concerned (monitoring tools, procedures), CMS feels ready.

ATLAS

– September to December 2008: cosmic ray data taking and preparations for re-processing and analysis.
– Early 2009: reprocessing of the 2008 cosmic ray data (reduction to ~20%) and data analysis.
– Presented the resource requirements for these activities: pre-staging might be a problem (not right now, the cosmic data at the T1s having already been pre-staged); a 1000-CPU site needs ~1 TB/hour on-line.
– New DDM (10 times more performant, functional tests).
– The Distributed Analysis Test (done in Italy) should be carried out on all clouds (at a T2: 1.2 GB/s (10 Gb) per 400 CPUs).

LHCb

– DIRAC3 has been in production since July; DIRAC2 was retired in the fall.
– Some Monte Carlo production for physics studies of the detector and benchmark physics channels.
– Some analysis of DC06 data via DIRAC3.
– FEST09 has already started (its MC production is now going to the merging step). The Full Experiment System Test (from the HLT farm to the user desktop, exercising all operations procedures): tests in January/February and then the exercise in March/April. A full-scale FEST is expected if the LHC is delayed.
– Generic pilots.
– CPU normalisation.
– Concept of CPU peak requirements: sites are asked to provide a certain amount of CPU per experiment, but from time to time the needs fluctuate (from half, or zero, to twice this pledge); see the sketch after this slide. Opportunistic usage of other VOs’ allocated resources at a site does not always work.
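To make the “CPU peak requirements” point concrete, a small illustrative calculation is sketched below; the pledge value and the weekly demand fractions are invented for illustration only.

```python
# Illustrative numbers only: a site pledges some normalised CPU capacity
# (historically quoted in kSI2k) and the experiment's actual need fluctuates
# between roughly 0 and twice that pledge over time.
PLEDGE_KSI2K = 500                                # assumed pledge for one VO at one site
need_fraction = [0.2, 0.5, 1.0, 1.8, 0.0, 2.0]    # assumed weekly demand / pledge

for week, f in enumerate(need_fraction, start=1):
    need = f * PLEDGE_KSI2K
    status = "over pledge" if need > PLEDGE_KSI2K else "within pledge"
    print(f"week {week}: need {need:.0f} kSI2k ({status})")
```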

ALICE

– Data: 200 TB replicated to the T1s (via FTS), including cosmics and calibration data. A discussion about the cleanup of old MC productions is ongoing with the detector groups, to be coordinated with IT.
– Offline reconstruction: significant improvements of the processing configurations for all detectors; all reconstructible runs from the 2008 cosmics data taking have been processed; development of a quasi-online processing framework is ongoing with the new AliRoot releases.
– Grid batch user analysis: a high-importance task. Specific WMS configuration ongoing. High interest in CREAM-CE deployment, already tested by the experiment in summer 2008 with very impressive results.
– Tests of SLC5 going on at CERN; the system has already been tested on 64-bit nodes.
– New run of MC production from December 2008.
– Storage: gradually increasing the number of sites with xrootd-enabled SEs; emphasis on disk-based SEs for analysis, including capacity at the T0/T1s; storage types remain unchanged as defined in previous meetings.

Experiments Plans (timeline Oct 2008 – May 2009)

ALICE activities over this period:
– Validation of new SEs and analysis SEs at T0/T1s
– SRM testing via FTS
– Fine tuning of WMS use and start of CREAM-CE migration
– SLC5 migration
– Grid batch analysis – introduction of the analysis train for common analysis tasks
– RAW and old MC data – partial cleanup of storages
– Replication tools – moved to new FTS/SRM, testing through re-replication of portions of RAW
– High-volume p+p MC production
– RAW data processing – second-pass reconstruction of the cosmics data
– ALICE detector: end of the upgrade period
– New round of cosmics and calibration data taking

DM (I)

– Presented the status of the various DM services, versions and experiences. Fairly good reliability observed for LFC (<1 incident/month in WLCG) and FTS (1.5/month).
– Presented general storage issues: the main problems are still in the storage area (robustness and stability).
  – dCache PNFS performance (mainly affected when expensive srmLs calls are run with stat; the hammering from FTS staging files does not help)
  – Storage database issues (RAL, ASGC)
  – Files temporarily lost should be marked as UNAVAILABLE so that the experiments know they are not accessible
  – Issues in the ATLAS pre-staging exercise at various T1s: not a problem for CASTOR, but it is for dCache sites, which sit in front of different MSS systems
  – Various outages due to disk hotspots; disk balancing is difficult

DM (II)

– A pictorial view of the DM layers was also given. Errors frequently occur at any of these layers, or between them (sub-optimised interaction), and the errors are often obscure. We must invest in more exhaustive logging from the DM components.
– The operations cost is still quite high ⇒ sharing experiences across experiments, but also across sites, would save time.
– We can already survive with the current system! [Ed: watch out for the analysis use cases!]

Distributed DB

– Smooth running in 2008 at CERN, profiting from the larger headroom of the hardware upgrade: comfortable for operating the services; adding more resources can only be done with planning; use of Data Guard also requires additional resources.
– New hardware for the RAC2 replacement is expected early in 2009: 20 dual-CPU quad-core (possibly blade) servers and 32 disk arrays.
– Policies concerning the integration and production database services remain unchanged in 2009: software and security patch upgrades and backups.
– Hope that the funding discussions for online services with ATLAS and ALICE will converge soon.
– [Ed: ATLAS conditions strategy to be clarified and tested, in particular for Tier-2s. Gremlin over the LHCb LFC at FZK?]

Maria Girone, Physics Database Services

T2

– No major problems in 2008, but only CMS seems to have tried out analysis at the T2s.
– Worries for 2009:
  – Is the storage adequate for analysis? Will real data and real activity come to the T2s?
  – Monitoring of sites: too many web portals and the lack of a homogeneous source (site view) are hot topics for the T2s.
  – How to optimise CPU-bound versus I/O-bound work? The LRMS should be smart enough to send highly CPU-bound jobs to machines with a “poor” LAN connection and, vice versa, to have high-bandwidth boxes serve high-I/O applications.
  – How large should the shared software area be?

T1

– Stable CE / Information System; pilot jobs would further alleviate the load.
– Most of the feedback received was about the SEs: scalability of the current dCache system (Chimera and the new PostgreSQL would alleviate this). Currently the SRM system can handle SRM requests at 1 Hz. Sites would like to know how far they are from the experiments’ targets.
– A site view of experiment activity is also another important issue.

T1 reliability

– Too often the T1s seem to break the MoU targets.
– Two key questions were put to the T1s (only a few answered).
– Best practices are now known to the sysadmins, but are these recommendations really followed by the T1s?
– The proposal to have the T1s review each other seems interesting and would be a good way to share expertise.

T0

– Services are running smoothly now; mainly hardware issues and hotspot problems.
– Procedure documentation should be improved – the “Service on Duty” now covers 24x7 and more people have to be in the game; this is one main area of failure.
– Interesting rules of thumb for upgrading a service have been adopted at CERN – the PPS seems to help.
– Illustrated the Post-Mortem concept: “According to WLCG, a Post Mortem is triggered when a site’s MoU commitment has been breached”.
– The time spent by sysadmins to catch and understand problems could be dramatically reduced by a better logging system. Often a problem is already known, and effort might better be addressed at making it not happen again.

Middleware

– Clarified the meaning of the baseline and of the targets approved by the MB for data taking: the baseline defines the minimum required versions of packages.
– WN on SLC5 and new compiler versions (UI later).
– glexec/SCAS: the target is the deployment of SCAS (now under certification).
– GLUE 2: no real target yet; rationalise the publication of heterogeneous clusters.
– WMS: a patch fixing many problems is in certification; ICE for submission to CREAM is included there.

Storageware

– CASTOR: current versions are CASTOR Core and CASTOR SRM v2.2 on SLC3. The recommended release is CASTOR Core (released this week) and CASTOR SRM v2.2 on SLC4 (srmCopy, srmPurgeFromSpace, more robust). The next version of CASTOR, 2.1.8, is being considered for deployment at CERN for the experiment production instances.
– dCache: 1.9.0 recommended (also fast PNFS). 1.9.x (x>1) comes with new features: new pool code (November), Copy and Draining Manager, PinManager (December/January), gPlazma, tape protection (MoU), srmReleaseFiles based on FQAN (MoU), space-token protection (MoU), new information providers, ACLs (January), NFS v4.1, bug fixes (gsiDcap, UNAVAILABLE files, etc.).
– DPM: the last stable release is in production; the next version is currently in certification (srmCopy, write permission to spaces limited to multiple groups (MoU), srmReleaseFiles based on FQAN (MoU), srmLs can return a list of the spaces a given file is in, a new dpm-listspace command needed for the information providers (installed capacity), DPM retains the space token when the admin drains a file from a file system).
– StoRM: currently in production; major release 1.4 in December 2008.

Definition of readiness

Software readiness: High-level description of the service available? Middleware dependencies and versions defined? Code released and packaged correctly? Certification process exists? Admin guides available?

Service readiness: Disk, CPU, database, network requirements defined? Monitoring criteria described? Problem determination procedure documented? Support chain defined (2nd/3rd level)? Backup/restore procedure defined?

Site readiness: Suitable hardware used? Monitoring implemented? Test environment exists? Problem determination procedure implemented? Automatic configuration implemented? Backup procedures implemented and tested?

These questions were chosen to determine the level of “readiness”; the answers will eventually be published in the service Gridmaps (a sketch of how this could look follows below). Reliability, redundancy etc. are not part of this definition!

ALICE OK. ATLAS OK – but relying on experts. CMS services basically ready (but still debugging). LHCb lacks procedures/documentation in many areas.
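As an illustration of how the questionnaire could feed the service Gridmaps, here is a minimal sketch that stores the yes/no answers per category and derives a readiness fraction; the question labels and the scoring rule are assumptions, not the actual Gridmap implementation.

```python
# Minimal sketch: represent the readiness questionnaire as yes/no answers per
# category and compute a per-service readiness fraction that a gridmap could
# display.  Question wording and the scoring are illustrative assumptions.
READINESS_QUESTIONS = {
    "software": ["high-level description", "dependencies/versions defined",
                 "released and packaged", "certification process", "admin guides"],
    "service":  ["resource requirements", "monitoring criteria",
                 "problem determination procedure", "support chain", "backup/restore"],
    "site":     ["suitable hardware", "monitoring implemented", "test environment",
                 "problem determination implemented", "automatic configuration",
                 "backups implemented and tested"],
}

def readiness(answers):
    """answers: {category: {question: bool}} -> overall fraction of 'yes'."""
    total = sum(len(qs) for qs in READINESS_QUESTIONS.values())
    yes = sum(answers.get(cat, {}).get(q, False)
              for cat, qs in READINESS_QUESTIONS.items() for q in qs)
    return yes / total

# Example: a service with only the software questions answered positively.
example = {"software": {q: True for q in READINESS_QUESTIONS["software"]}}
print(f"readiness: {readiness(example):.0%}")
```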

Analysis WG

– Presented a working group set up for distributed analysis: what is the minimum set of services required, and what are the experiments’ analysis models?
– Deliverables: documented analysis models; a report on requirements for services and developments, with priorities and timescales.
– Understand how the experiments want to run analysis and advise sites on how best to configure their resources.

New metrics to be defined?

– “Simulated” downtime of 1–3 Tier-1s for up to – or exceeding – 5 days (window?), to understand how the system handles export, including recall from tape
– Scheduled(?) recovery from power outages
– Extensive concurrent batch load – do the shares match expectations?
– Extensive overlapping “functional blocks” – concurrent production and analysis activities
– Reprocessing and analysis use cases (Tier-1 and Tier-2) and conditions “DB” load – validation of the current deployment(s)
– Tape rates – can each site support the concurrent tape rates needed by all the VOs it supports?

Goals and metrics for data-taking readiness “challenges” – Ian/Jamie

– Summary of the C-RRB/MB discussion. Should there be a CCRC’09? No, but specific tests/validations are needed (e.g. combined running, reprocessing…). The experiment plans shown yesterday have no time coordination between activities. The areas to be considered are listed on the agenda (based on the list from the operations meeting).
– What tests? How will buffers deal with multiple site outages? Real power cuts break disks; simulated outages cause fewer problems! Many things happen anyway – the recovery needs to be documented better and things that went wrong followed up. Experiments have to expect problems and be able to deal with them in a smooth way.
– Sites are still waiting for data-flow targets from the experiments; the CCRC’08 rates are still valid. ATLAS plans on a continuous load, but the DDM exercise in two weeks will provide a peak load. Worry about tape rates – limits in the number of available drives.
– CMS – alignment of the smaller activities would be useful. ALICE – lives or dies by disk-based resources. The reprocessing and analysis cases are of particular concern. The conditions DB in ATLAS (T2 Frontier/Squid) needs testing. Other than concurrent tape rates, CMS has its own programme of work. When a more complete schedule exists, looking at overlaps would be interesting. ATLAS hopes all clouds will be involved with analysis by the end of the year. LHCb can plan modest work in parallel – sometimes problems need to be solved in production when they occur.
– Allow the experiments to continue with their programmes of work. When ATLAS is ready → common tape recall tests. Try a period when lots of analysis can be pushed. Things should be running, with the focus then on resolving issues in post-mortems. Need to test things that change (e.g. FTS).

Presentation of the DM s/w status and ongoing activities at the T0

Status:
– Used for the migration of LHC data to tape and for T1 replication
– System in fully production state
– Stressed during CCRC’08, but a few improvements are necessary from the point of view of the analysis DM software

Calendar of major improvements (timeline: Today → Spring 09 → Mid 09 → End 09 → End 10)

Security:
– End-user access to CASTOR has been secured, supporting both PKI (Grid certificates) and Kerberos
– CASTOR monitoring and the SRM interface are being improved

Tape efficiency:
– New tape queue management supporting recall / migration policies, access control lists, and user / group based priorities
– New tape format with no performance drop when handling small files
– Data aggregation to allow managed tape migration / recall to try to match the drive speed (increased efficiency)

Access latency:
a) LSF job scheduling removed for reading disk files
b) Direct access offered with the XROOT interface, which is already part of the current service definition
c) Mountable file system possible
– Plan to remove job scheduling also for write operations
– Additional access protocols can be added (e.g. NFS 4.1)

One of the answers desired by many sysadmins is on its way: the site-view gridmap.

Exp. Operations

– Experiment operations rely on a multi-level operation mode: a first-line shift crew; second-line experts on call; developers as third-line support, not necessarily on call.
– Experiment operations are strongly integrated with WLCG operations and Grid Service Support: expert support; escalation procedures, especially for critical or long-standing issues; incident post-mortems; communications and notifications. The EIS team personally like the daily 15:00 meeting.
– ATLAS and CMS rely on a more distributed operations model, with worldwide shifts and experts on call but central coordination always at CERN – possibly due to the geographical distribution of partner sites, especially for the US and Asia regions.
– All experiments recognise the importance of experiment-dedicated support at sites: CMS can rely on contacts at every T1 and T2; ATLAS and ALICE can rely on contacts per region/cloud (contacts at all T1s, usually dedicated, and some dedicated contacts also at some T2s); LHCb can rely on contacts at some T1s.
– A problem with a VO at a site should be notified to the local contact person, who is known centrally by the VO and will pass it on to the local responsible.

MoU Targets

– Do they describe well the real problems we face every day?
– GGUS is the system to log problem resolution:
  – MoU categories have been introduced in some critical tickets (TEAM and ALARM) – are these categories really comprehensive?
  – Should they provide a view by service or by functional block (e.g. RAW distribution)? (The former implies the submitter knows the source of the problem.)
  – Resolution times: are they realistic?
  – No conflict with the experiment monitoring systems.
– GGUS is not for measuring the efficiency of a site: the response time to a GGUS ticket is in the MoU, not the solving time. The sysadmin must in any case close the ticket as soon as he is confident the problem is fixed.
– In GGUS we have all the data to build up complete and reliable reports for assessing whether an intervention has been carried out properly and according to the MoU.
– Use SAM tests for each of the MoU categories, with GridView computing the site availability, for automatic monitoring; a sketch of such a computation follows below.
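A minimal sketch of the kind of automatic computation proposed in the last bullet: map each MoU category to a set of SAM-style test results and derive a site availability figure. The category-to-test mapping and the simple pass-fraction metric are assumptions, not the actual GridView algorithm.

```python
# Illustrative sketch: compute a site "availability" per MoU category from
# SAM-style test results (True = test OK).  The categories, the mapping of
# categories to probes and the pass-fraction metric are all assumptions.
sam_results = {
    "data transfer":  [True, True, False, True],   # e.g. SRM put/get probes
    "job submission": [True, True, True, True],    # e.g. CE submission probes
    "catalogue":      [True, False, True, True],   # e.g. LFC probes
}

def availability(results):
    """Per-category pass fraction and a simple overall average."""
    per_category = {cat: sum(r) / len(r) for cat, r in results.items()}
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall

per_cat, overall = availability(sam_results)
for cat, frac in per_cat.items():
    print(f"{cat:15s} {frac:.0%}")
print(f"{'overall':15s} {overall:.0%}")
```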