
1 Experiment support at IN2P3
Artem Trunov, CC-IN2P3
artem.trunov@cc.in2p3.fr

2 Experiment support at IN2P3
- Site introduction
- System support
- Grid support
- Experiment support

3 Introduction: the Centre de Calcul de l'IN2P3
- Serving 40+ experiments: HEP, astrophysics, biomedical
- 20 years of history
- 40+ employees
- http://cc.in2p3.fr/

4 CC-IN2P3 - Stats
- ~900 batch workers: 2200 CPUs (cores), ~3400 concurrent jobs
  - Dual-CPU P-IV 2.4 GHz, 2 GB RAM
  - Dual-CPU Xeon 2.8 GHz, 2 GB RAM
  - Dual-CPU dual-core Opteron 2.2 GHz, 8 GB RAM
- Mass storage: HPSS
  - 1.6 PB total volume stored
  - Daily average transfer volume ~10 TB
  - rfio access to the disk cache; also xrootd and dCache
- Network
  - 10 Gb/s link to CERN since January 2006
  - 1 core router (Catalyst 6500 series)
  - 1 Gb/s uplink per 24 worker nodes (upgradable to 2x1 Gb/s)

5 LCG Tier 1 Center in France

                2006   2007   2008   2009   2010
  CPU (kSI2K)   1170   2670   6770  10500  17300
  Disk (TB)      520   1600   3800   5400   8400
  MSS (TB)       535   1450   3460   6380   9900

(LCG MoU, January 31, 2006)

6 Systems support
- A guard is on duty 24 hours a day
  - Smoke and other alarm sensors; the service provider visits on alarm
  - Telecom equipment sends an SMS on failure
  - The storage robotics signals the service provider, who acts according to the service agreement
- A Unix administrator is on shift during evening hours
  - He can call an expert at home, but the expert is not obliged to work odd hours
  - Weekly reviews of the problems seen during shifts
  - Nights and weekends are not covered and overtime is not paid; still, after a major incident such as a power failure the systems group makes every effort to bring the site back as soon as possible
- During the day, routine monitoring of:
  - Slow or hung jobs, jobs that end too quickly, and jobs exceeding their requested resources; a message is sent to the submitter (a sketch of such a check follows this slide)
  - Storage services, for problems
- Some end-user services are delegated to experiment representatives
  - Password resets, manipulation of job resources
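The routine check described above lends itself to a small script. Below is a minimal sketch of that kind of batch monitoring, assuming job records arrive as plain dicts; the field names, thresholds, and addresses are hypothetical, since in practice they would come from the local batch system's accounting interface.

```python
"""Minimal sketch of the routine job monitoring described on slide 6.

Assumptions (not from the talk): jobs are plain dicts with wall/CPU
times and requested limits; real values would come from the batch
system's accounting.
"""
import smtplib
from email.message import EmailMessage

def classify(job):
    """Return a list of anomaly descriptions for one job record."""
    problems = []
    # A job with almost no CPU over a long wall time is likely hung.
    if job["wall_s"] > 3600 and job["cpu_s"] / job["wall_s"] < 0.05:
        problems.append("possibly hung: CPU efficiency below 5%")
    # Jobs ending within seconds usually failed before doing real work.
    if job["finished"] and job["wall_s"] < 60:
        problems.append("ended too quickly (<60 s)")
    # Jobs exceeding the resources they requested.
    if job["rss_mb"] > job["req_rss_mb"]:
        problems.append("memory above requested limit")
    return problems

def notify(job, problems, smtp_host="localhost"):
    """Send the submitter a message, as slide 6 describes."""
    msg = EmailMessage()
    msg["Subject"] = f"Batch job {job['id']}: {'; '.join(problems)}"
    msg["From"] = "batch-monitor@example.org"  # hypothetical address
    msg["To"] = job["submitter_email"]
    msg.set_content("Your job was flagged by routine monitoring:\n- "
                    + "\n- ".join(problems))
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)

if __name__ == "__main__":
    job = {"id": "12345", "wall_s": 7200, "cpu_s": 120, "rss_mb": 2500,
           "req_rss_mb": 2048, "finished": False,
           "submitter_email": "user@example.org"}
    issues = classify(job)
    print(job["id"], issues)
    # notify(job, issues)  # would email the submitter via a local SMTP relay
```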

7 Grid and admin support at IN2P3
- A dedicated Grid team; people have mixed responsibilities (FTE):
  - 2.5: all middleware (CEs, LFC, etc.) plus operations
  - 1.5: dCache + FTS
  - 1.5: HPSS administration
  - 1: Grid job support
  - A few people doing CIC and GOC development/support
- Plus many people not working on the Grid: Unix, network, databases, web
- Not actually providing 24/7 support; a few people do too much work
- Weekly meetings with a "Tour of the VOs" are very useful (Grid team, storage, production system)

8 Grid and admin support at IN2P3
- But Grid people alone are not enough to make the Grid work
  - Wasn't the Grid meant to help the experiments? Yes.
  - Are the experiments happy? No.
  - Are the Grid people happy with the experiments? No.
- "ATLAS is the most 'grid-compatible', but it gives us a lot of problems."
- "ALICE is the least 'grid-compatible', but it gives us a lot of problems."
- One of the answers: it is necessary to work closely with the experiments
  - At Lyon, part of the user support group is oriented toward LHC experiment support

9 Experiment support at CC-IN2P3
- Experiment support group: 7 people
  - 1: BaBar + astrophysics
  - 1: all of Biomed
  - 1: ALICE and CMS
  - 1: ATLAS
  - 1: D0
  - 1: CDF
  - 1: general support (software etc.)
  - All: general support, GGUS (in the near future)
- The plan is to have one person for each LHC experiment
- All are former physicists; the director of the CC is also a physicist

10 Having physicists on site helps
- They usually have in-depth experience with at least one experiment
- They understand the experiments' computing models and requirements
- They are proactive in problem solving and have a broad view of experiment computing
- They work to make physicists happy, bringing mutual benefit to the site and the experiment
- The current state of the Grid and of the experiments' middleware really requires this much effort, and the Grid is not getting easier, only more complex

11 CMS Tier1 people - who are they?
A little survey on CMS computing support at sites:
- Spain (PIC): has a dedicated person
- Italy (CNAF): has a dedicated person
- France (IN2P3): yes
- Germany (FZK): no; support comes from the physics community
- SARA/NIKHEF: no
- Nordic countries: terra incognita
- US (FNAL): yes
- UK (RAL): yes, but virtually no experiment support at UK Tier2 sites

12 What are they doing at their sites?
- Making sure the site setup works: integration work, optimization
  - Site admins usually cannot check whether their setup works and ask the VO to test it
- Reducing the "round-trip time" between the experiment and the site admins
  - Talking is much better for understanding than email exchange; sometimes it is simply necessary to sit together to resolve a problem
- Helping site admins understand the experiment's use of the site
  - Especially the "exotic" cases, such as ALICE asking for xrootd; this also requires a lot of iterations
- Testing new solutions, then deploying and supporting them
- Administration
  - At Lyon, all xrootd servers are managed by user support
  - Grid expertise, managing VO boxes, services (FTS)

13 VO support scenarios
- The "range" is better when the experiment expert is on site: they can interact directly with more people and systems
- Integration is the keyword
(Slide diagram: two scenarios, a) and b), each showing the experiment community interacting with Grid admins, site admins, Grid storage systems, and databases.)

14 At Lyon
Some extra privileges that the on-site people use:
- Installing and supporting xrootd servers (BaBar, ALICE, CMS, other experiments): root
- Managing VO boxes: root
- Installing all VO software at the site: special privileges
- Co-managing FTS channels: special privileges
- Developing software (Biomed)
- Debugging transfers and jobs: root
Transfers and storage really need attention!

15 Grid support problems
- SFT (Site Functional Tests) are not objective
  - A Grid service depends on external components; e.g. a foreign RB or BDII may be at fault
- Too many components are unreliable; it is impossible to keep a site up according to the MoU
- GGUS support is not 24-hour either
- Select some support areas and focus on them

16 Focus points
In short, what does an experiment want from a site?
- Keep the data: ingest the data stream and archive it on tape
- Access the data: Grid jobs must be able to read the data back
(A minimal round-trip check of these two points is sketched after this slide.)
When are the experiments unhappy?
- When transfers are failing
- When jobs are failing
What causes the failures? The network is stable and the local batch system is stable; the weak points are the Grid middleware and the storage.
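The two focus points suggest a simple site self-test: write a file in, read it back. Here is a minimal sketch assuming the site exposes an xrootd endpoint and has the xrdcp client installed; the endpoint and test path are hypothetical.

```python
"""Self-test sketch for slide 16's two focus points:
"keep the data" (write) and "access the data" (read it back).

Assumptions (not from the talk): an xrootd endpoint, xrdcp installed,
and a test area writable by this account.
"""
import filecmp
import os
import subprocess
import tempfile

REDIRECTOR = "root://se.example.org/"   # hypothetical endpoint
REMOTE_PATH = "/test/selftest.dat"      # hypothetical test area

def self_test():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "out.dat")
        back = os.path.join(tmp, "back.dat")
        with open(src, "wb") as f:
            f.write(os.urandom(1 << 20))   # 1 MiB of random payload
        # "Keep the data": write the file to the storage element.
        subprocess.run(["xrdcp", "-f", src, REDIRECTOR + REMOTE_PATH],
                       check=True)
        # "Access the data": read the same file back.
        subprocess.run(["xrdcp", "-f", REDIRECTOR + REMOTE_PATH, back],
                       check=True)
        # The round trip must be lossless.
        if not filecmp.cmp(src, back, shallow=False):
            raise RuntimeError("read-back differs from what was written")

if __name__ == "__main__":
    self_test()
    print("storage round trip OK")
```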

17 Debugging grid jobs
- Difficult: the log files are not available even to the superuser; they live on a Resource Broker
- Without VO expertise it is not clear whether the user's application is failing or the site's infrastructure is at fault (a heuristic for this is sketched after this slide)
- At Lyon, production people monitor slow jobs and jobs that end too quickly, and send messages to users; this does not scale
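One partial answer to the "user application or site infrastructure?" question is purely statistical: failures that cluster on a worker node point at the site, failures that cluster on a user point at the application. This is a hedged sketch of that idea, not anything the talk prescribes; the job-record fields and thresholds are assumptions.

```python
"""Heuristic sketch for slide 17: localize failures without VO expertise.

Assumption (not from the talk): minimal accounting data per finished
job (user, worker node, exit code) is available.
"""
from collections import defaultdict

def localize_failures(jobs, min_jobs=10, fail_frac=0.5):
    """Guess whether failures cluster by worker node (site issue)
    or by user (application issue)."""
    by_node = defaultdict(lambda: [0, 0])   # node -> [failed, total]
    by_user = defaultdict(lambda: [0, 0])   # user -> [failed, total]
    for j in jobs:
        failed = j["exit_code"] != 0
        for key, table in ((j["node"], by_node), (j["user"], by_user)):
            table[key][0] += failed
            table[key][1] += 1
    suspect_nodes = [n for n, (f, t) in by_node.items()
                     if t >= min_jobs and f / t > fail_frac]
    suspect_users = [u for u, (f, t) in by_user.items()
                     if t >= min_jobs and f / t > fail_frac]
    return suspect_nodes, suspect_users

# Failures concentrated on one node suggest the infrastructure;
# failures concentrated on one user suggest the application.
```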

18 Storage failures
- Storage is the other key component (transfers, data access)
- Transfers have been made too complex: SRM, non-interoperable storage solutions, exotic use cases
- Site storage experts cannot always debug transfers themselves: they have neither the VO certificates/roles nor access to the VO middleware that initiates the transfers
- The Storage Classes WG (Tier1 storage admins) is trying to reduce the complexity: bring the terminology to common ground, make developers understand the real use cases, make experiments aware of the real-life solutions, and make site storage experts fully comfortable with the experiments' demands; this is in everyone's mutual interest
- File loss is inevitable
  - An exchange mechanism for site-VO interactions is being developed by the Storage Classes WG initiative, but detecting file loss is still the site's responsibility! (A sketch of such a check follows this slide.)
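Since the slide says file-loss detection remains the site's job, the obvious check is to diff the VO catalogue against what the storage actually holds. A minimal sketch, assuming both inventories can be exported as flat text dumps with one logical file name per line; the dump format and file names are hypothetical.

```python
"""Sketch of the file-loss detection slide 18 says is a site
responsibility: compare the VO catalogue's view of the site against
the storage system's actual contents.

Assumptions (not from the talk): flat text dumps, one logical file
name per line; the file names here are hypothetical.
"""

def load_inventory(path):
    """One file name per line; blank lines ignored."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def find_lost_files(catalogue_dump, storage_dump):
    """Files the catalogue expects but storage no longer has."""
    expected = load_inventory(catalogue_dump)
    present = load_inventory(storage_dump)
    return sorted(expected - present)

if __name__ == "__main__":
    lost = find_lost_files("catalogue.txt", "storage.txt")
    for name in lost:
        print("LOST:", name)
    print(f"{len(lost)} file(s) to report to the VO")
```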

19 Storage
- Needs serious monitoring: pools down, staging failures, overloaded servers
- Make storage part of the 24-hour support?
  - Train operators to debug storage problems and to localize the damage to an experiment (e.g. by notifying the experiment)
- Is it possible to develop monitoring and a support scenario in which problems are fixed before users complain? It is a shame when a user has to tell an admin "you have a problem"; it should be the other way around. (A probe along these lines is sketched after this slide.)
- At Lyon we will try to expand the storage expertise; the experiment support people are already involved
- The VO will obviously have to train the data-taking shift crew to recognize storage problems at sites, since exporting data from CERN is part of the initial reconstruction workflow
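A proactive probe in the spirit of "fix it before users complain" could simply time reads of known files placed on each pool and alert when one fails or slows down. This is a sketch under stated assumptions: the probe files, endpoint, and threshold are hypothetical, and a timed xrdcp read only approximates a true tape-staging test.

```python
"""Sketch of a proactive storage probe for slide 19: notice pools that
are down, or staging that is slow, before users complain.

Assumptions (not from the talk): hypothetical probe files and
threshold; a timed read via xrdcp stands in for a real staging test.
"""
import subprocess
import time

PROBE_FILES = [
    "root://se.example.org//test/pool1.dat",  # hypothetical per-pool files
    "root://se.example.org//test/pool2.dat",
]
MAX_SECONDS = 300   # alert threshold, to be tuned per site

def probe(url, timeout=MAX_SECONDS):
    """Return (ok, seconds) for one timed read of a probe file."""
    start = time.monotonic()
    try:
        subprocess.run(["xrdcp", "-f", url, "/dev/null"],
                       check=True, timeout=timeout,
                       capture_output=True)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False, time.monotonic() - start
    return True, time.monotonic() - start

if __name__ == "__main__":
    for url in PROBE_FILES:
        ok, dt = probe(url)
        status = "OK" if ok else "ALERT"
        print(f"{status} {url} ({dt:.0f}s)")
        # In production the ALERT branch would page an operator and
        # notify the affected experiment, as slide 19 suggests.
```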

20 A little summary of what makes Grid support work
- A dedicated Grid team
- Deep interaction with the experiments, with regular meetings on progress and issues
- Storage and transfer monitoring
- The human factor

