CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Experiment Operations Simone Campana.

1 Experiment Operations
Simone Campana
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

2 Outline
Try to answer the following questions:
  –How are experiment operations organized?
  –Which communication channels are used?
  –What are the commonalities?
  –What are the differences?
Thanks to Patricia Mendez Lorenzo, Roberto Santinelli and Andrea Sciaba + many others from the experiments

3 CMS Computing Operations
Computing Shift Person (CSP) at the CMS centre at CERN or FNAL
  –Monitors the computing infrastructure and services going through a checklist
  –Identifies problems, triggers actions and calls
  –Creates eLog reports and support tickets
  –Reacts to unexpected events
Computing Run Coordinator (CRC) at CERN
  –Overview of offline computing plans and status, operational link with online, keeps track of open computing issues
  –Is a computing expert
Expert On Call (EOC), physically located anywhere in the world
  –Very expert in one or more aspects of the computing system (there can be more than one)
  –Must be on call

4 CMS Computing Operations
Data Operations expert on call:
  –Runs the T0 workflows and the T1 transfers
  –Monitors the above workflows
Time Coverage
  –During global runs:
    Computing Shift Person: 8-hour shifts, 16/7 coverage
    DataOps expert: 16/7 mandatory, 24/7 voluntary
  –Otherwise (local runs):
    CSP: 8/5 coverage
    DataOps expert: just on call
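The coverage figures above use an H/D shorthand, read here as hours per day / days per week. As a rough sketch of the arithmetic, assuming fixed 8-hour shifts as stated for the CSP, the number of shift slots to staff per week follows directly. The function name `shift_slots_per_week` is hypothetical and not part of any CMS tool.

```python
def shift_slots_per_week(coverage: str, shift_hours: int = 8) -> int:
    """Return the number of shift slots per week for an 'H/D' coverage string.

    'H/D' is read as H hours per day, D days per week (e.g. '16/7').
    Assumes fixed-length shifts and daily hours that are a multiple of the
    shift length; both are assumptions of this sketch.
    """
    hours_per_day, days_per_week = (int(part) for part in coverage.split("/"))
    if hours_per_day % shift_hours != 0:
        raise ValueError("daily coverage must be a multiple of the shift length")
    return (hours_per_day // shift_hours) * days_per_week


# Coverage figures quoted on the slide for the Computing Shift Person
global_run_csp = shift_slots_per_week("16/7")  # two 8-hour shifts a day, every day
local_run_csp = shift_slots_per_week("8/5")    # one 8-hour shift on five days
print(global_run_csp, local_run_csp)
```

Under these assumptions, moving from local runs to global runs raises the CSP staffing need from 5 to 14 shift slots per week.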

5 LHCb Computing Operations
Grid Shifters (a.k.a. production shifters)
  –Running production and data handling activities
  –Identifying and escalating problems
  –Some not-so-basic knowledge of Grid services and the LHCb framework
  –See tick list for more information: https://twiki.cern.ch/twiki/pub/LHCb/ProductionOperations/GridShifter170808.pdf
Grid Expert on call
  –Addressing problems
  –Defining/improving operational procedures
Production Manager (based at CERN)
  –Organizes the overall production
Dirac developer experts
  –Fraction of time dedicated to running Grid Operations
All Grid Operations are run from CERN
  –With the exception of some contact persons at T1s whose role also fits in one of the above

6 LHCb Time Coverage
For more information please check the production operations web page: https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperations
LHC down: decided to move to 1 shifter for working hours

7 ALICE Computing Operations
ALICE Computing Operations is a joint effort between:
  –ALICE Core offline team running ALICE operations. Centralized at CERN
  –WLCG ALICE experiment support, i.e. people offering Grid expertise to ALICE
Production manager organizing the overall activity
  –with workflow and component experts behind
    data expert, workload expert, Alien expert etc.
Offline shifts in the ALICE control room (P2)
  –Support the central GRID services and management tasks.
    RAW data registration (T0) and replication to T1s
    Conditions data gathering, storage and replication
    Quasi online first pass reconstruction at T0
      –and asynchronous second pass at T1s
    ALICE Central Services status
    ALICE Site Services (VO-box/WMS/storage) status

8 ALICE Time Coverage
Offline shifts 24/7 during data taking
First line support at CERN provided by IT/GS.
Site support is tiered and assured by regional experts
  –one per country/region, in contact with site experts.
  –supported by the Core Offline and/or by the WLCG experts for high level or complex Grid issues.
  –support at T2 sites is also very important

9 ATLAS Computing Operations
ATLAS Computing Shift at P1: 24(16)/7 during data taking
  –T0 shifter
    Monitor data collection and recording from P1 to T0
    Monitor first processing at T0
  –Distributed Computing Shifter
    Monitor T0-T1 and T1-T1 data distribution
  –Database shifter
ATLAS Distributed Computing Shifts (ADCoS)
  –Several levels of expertise: Trainee, Senior, Expert, Coordinator
  –Monitor Monte Carlo production and T2 transfer activities
ATLAS Expert On-Call: 24/7
  –Offers expertise for data distribution activities
Developers and single component experts: best effort
  –offering third level support

10 ADCoS Time Coverage
Europe: 5 experts + 10 seniors + 5 trainees
Asia: 4 seniors + 1 trainee
America: 2 experts + 5 seniors + 3 trainees
Covering 24h/day and 6 days/week, having people in three time-zones (no need for night shifts)
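To illustrate why three time zones remove the need for night shifts, here is a minimal sketch that checks round-the-clock coverage from daytime-only shifts. The UTC offsets and the 08:00 to 16:00 local shift window are assumptions chosen for illustration; the slide does not give the actual sites or shift hours.

```python
# Hypothetical UTC offsets and local shift hours. The slide only states that
# three time zones are used and that no night shifts are needed.
REGIONS = {
    "Europe": 1,    # assumed UTC+1
    "Asia": 9,      # assumed UTC+9
    "America": -7,  # assumed UTC-7
}
LOCAL_START, LOCAL_END = 8, 16  # assumed local day shift, 08:00 to 16:00


def regions_on_shift(utc_hour: int) -> list[str]:
    """Return the regions whose local day shift includes the given UTC hour."""
    on_shift = []
    for region, offset in REGIONS.items():
        local_hour = (utc_hour + offset) % 24
        if LOCAL_START <= local_hour < LOCAL_END:
            on_shift.append(region)
    return on_shift


# Map every UTC hour to the regions covering it and look for gaps.
coverage = {hour: regions_on_shift(hour) for hour in range(24)}
uncovered = [hour for hour, regions in coverage.items() if not regions]
print("uncovered UTC hours:", uncovered)
```

With offsets spaced eight hours apart, each UTC hour falls into exactly one region's day shift. Real sites are not spaced this evenly, so actual shift windows would need to overlap or stretch accordingly.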

11 CMS Communication Channels
eLog (using DAQ eLog + FNAL eLog, will have dedicated CERN box)
“Computing plan of the day” (by the CRC)
AIM accounts for shifters
Savannah
  –+ GGUS for EGEE sites
Sites → Operations: Savannah + HN
Operations → Sites: Savannah, GGUS (+HN)
Users → Operations: CMS user support (Savannah + email)

12 LHCb Communication Channels
Internally LHCb:
  –Elog book: http://lblogbook.cern.ch/Operations/
  –14X7: Expert cell-phone number: 16-1914
  –Daily meeting (14:30 – 15:??)
  –Mailing lists: lhcb-grid@cern.ch (for ops matters), lhcb-dirac@cern.ch (for dev matters), mailing list for each contact person.
Reaching out to services and sites:
  –GGUS and/or Remedy
    ALARM tickets just for test, TEAM tickets not extensively used yet
  –WLCG daily and weekly meetings
  –IT/LHCb coordination meeting, SCM meeting
  –Higher level meetings (GDB/MB)
  –Local contact person and central grid coordinator person useful for speeding up resolution of problems
Being reached by users and sites:
  –Support unit defined in GGUS
  –Mailing lists
  –Contact persons acting as liaison/reference for many site admins and service providers

13 ALICE Communication Channels
Internal ALICE communication
  –Mailing list
  –ALICE-LCG-EGEE Task Force
Communication with users and User Support
  –Mailing list for operational problems and Savannah tracker for bugs.
  –Monthly User Forums (EVO) for dissemination of new Grid related information and analysis news, and monthly Grid training for new users
Communication with sites and Grid operation support
  –Task Force mailing list for operational problems
  –GGUS
  –Daily WLCG ops meetings
  –Weekly ALICE-LCG Task Force meetings
  –Dedicated contacts with many sites

14 ATLAS Communication Channels
Internal Communication
  –ADCoS ELOG + T0 ELOG + ADCS@P1 ELOG
  –Savannah for DDM problem tracking
Communication with sites
  –Mainly GGUS
    Team Tickets for all shifts + ALARM tickets for a restricted list of experts
  –Support Mailing Lists mostly for CERN (CASTOR, FTS, LFC)
  –Cloud Mailing Lists
    Informational only
  –Many sites read ELOG
  –No clear site2ATLAS channel
    There is the ATLAS operations mailing list, but something better should be devised.
Communication with Users
  –Mostly HN for Operations2Users
  –GGUS + Savannah for Users2Operations
… and meetings: Daily WLCG Meeting, weekly ATLAS ops

15 Conclusions (I)
Experiment Operations rely on a multilevel operation mode
  –First line: shift crew
  –Second line: Experts On-Call
  –Developers as third line support
    not necessarily on-call
Experiment Operations are strongly integrated with WLCG operations and Grid Service Support
  –Expert support
  –Escalation procedures
    Especially for critical issues or long standing issues
    Incident Post Mortems
  –Communications and Notifications
    I personally like the daily 15:00h meeting
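The multilevel operation mode described above can be pictured as an ordered escalation chain. The following is only an illustrative sketch: the class `SupportLine`, the `availability` labels and the function `escalate` are hypothetical names, not taken from any experiment's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SupportLine:
    level: int
    name: str
    availability: str  # free-text label, as summarized on the slide


# The three lines named in the conclusions, in escalation order.
SUPPORT_LINES = [
    SupportLine(1, "Shift crew", "on shift"),
    SupportLine(2, "Experts On-Call", "on call"),
    SupportLine(3, "Developers", "not necessarily on call"),
]


def escalate(current_level: int) -> Optional[SupportLine]:
    """Return the next support line above current_level, or None at the top."""
    for line in SUPPORT_LINES:
        if line.level == current_level + 1:
            return line
    return None


# A problem the shift crew cannot solve goes to the on-call experts.
next_line = escalate(1)
print(next_line.name if next_line else "no further escalation")
```

Keeping the chain as plain data makes it easy to see that the third line has no guaranteed availability, which is why escalation procedures for critical issues matter.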

16 Conclusions (II)
ATLAS and CMS rely on a more distributed operation model
  –Worldwide shifts and experts on call
    Central Coordination always at CERN
  –Possibly due to geographical distribution of partner sites
    Especially for US and Asia regions
All experiments recognize the importance of experiment dedicated support at sites
  –CMS can rely on contacts at every T1 and T2
  –ATLAS and ALICE can rely on contacts per region/cloud
    Contact at all T1s, usually dedicated
    Some dedicated contact also at some T2
  –LHCb can rely on contacts at some T1

