CMS Computing Shift Personnel (CSP) Tutorial

10. January 2011

Tutorial Structure
Today:
- Brief Introduction to CMS Computing
- General Description of Computing Shift Procedure
- Subscription to the CMS Computing E-Log
- Organization of Vidyo access from local CMS center
- Questions
After this tutorial and >= 2 months prior to 1st shift:
- New shifters go through the Shift Procedure and shadow an experienced CSP by taking “passive” shifts (only E-log reports, NO alarms)
After 2 “passive” shifts:
- Sign-off by Peter/Oli
- Full participation as CSP
- Possibility to sign up via the Web

Brief Introduction to CMS Computing

Overview of the CMS Distributed Computing System
- Multi-tiered distributed computing infrastructure based on GRID technologies for resource access and data movement
- Many new challenges compared to established HEP experiments: data distribution, user localization, site monitoring, support responsibilities

Tier-0 / CAF:
- Data archival (cold copy)
- Prompt reconstruction
- Time-critical calibration & alignment

Tier-1:
- Data archival (hot copy)
- Reprocessing, skimming, MC production
- Data serving

Tier-2:
- Centralized simulation
- Distributed data analysis

Resources (figure labels)
- Transfer rates: 300 MB/s; 600 MB/s; Up: 20 MB/s sustained; Down: MB/s bursts
- Processing resources: Tier-1 level: ~35k jobs/day; Tier-2 level: ~100k jobs/day

In total: 7 Tier-1 across 3 continents, ~50 Tier-2 across 4 continents

CSP introduction

CSP Role and Required expertise
The CSP mainly monitors systems and raises alarms:
- Monitor computing infrastructure and services at checkpoint hours by going through a set of checklists
- Identify problems
- Create E-Log reports
- Trigger actions:
  - Open Savannah tickets, in particular to CMS Sites
  - Contact CRC, Core Computing Operators & Experts, Computing Experts On Call
➞ We are working on making the CSP role even more active in problem troubleshooting
Required expertise of the CSP:
- Fair understanding of the CMS distributed computing infrastructure and of the services required for data processing, transfers and analysis
- Physicist or technician from a collaborating CMS institute
- Tutorial; assisted “passive” shifts

CMS policy for Computing Shifts
The Computing Shifts are accounted within standard MoA service work defined by CMS (as Central CMS Shifts):
- Standard requirement: 8 points per author per institute
- 1 CSP shift == 0.75 points on weekdays / 1.25 points on week-ends
- No extra credit for night shifts, since the shifts cover all time zones (special arrangements not excluded)
During data taking, computing shifts are carried out:
- From Main CMS Centres: CMS CC or FNAL/ROC
- From Remote CMS Centres: see nice/cms-centre/www/CMS-Centres-Worldwide.pdf
- In 8-hour shifts (09-17/17-01/01-09), with 1 CSP per shift
- With the support of a Computing Run Coordinator who is on duty at CERN during 1-week periods
- With the support of CMS Core Computing Operators & Experts
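
The slot boundaries and point values on this slide can be captured in a small helper. The following is only an illustrative sketch: the function names are hypothetical, the hours are taken in whatever reference time zone the shift list uses, and deciding "week-end" by the weekday of the shift's start date is an assumption that the slide does not state.

```python
from datetime import date

# Shift slots from the slide, as (start_hour, end_hour); 17-01 wraps past midnight.
SLOTS = [(9, 17), (17, 1), (1, 9)]

WEEKDAY_POINTS = 0.75   # "0.75 points week"
WEEKEND_POINTS = 1.25   # "1.25 points week-end"


def shift_slot(hour):
    """Return the (start, end) slot that contains the given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    for start, end in SLOTS:
        if start < end:
            inside = start <= hour < end
        else:  # slot wraps around midnight
            inside = hour >= start or hour < end
        if inside:
            return (start, end)
    raise AssertionError("slots do not cover the full day")


def shift_points(start_date):
    """Points for one CSP shift; week-end is decided by the start date (assumption)."""
    return WEEKEND_POINTS if start_date.weekday() >= 5 else WEEKDAY_POINTS


if __name__ == "__main__":
    print(shift_slot(0))                    # hour 0 belongs to the 17-01 slot
    print(shift_points(date(2011, 1, 10)))  # 10 January 2011 was a Monday
```

Note that the 8-point requirement applies to service work per author per institute as a whole, so this helper only accounts for the CSP part of it.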

Other Roles & interactions with CSP
Computing Run Coordinator (CRC)
- Subscribes to all CSP E-log sub-sections
- Assists CSP in raising alarms/tickets for complex cases
- Calls EOC during off-working hours (see below)
Core Computing Operator or Expert (FacOps, DataOps, AnaOps)
- Subscribes to relevant CSP E-Log sub-sections
- Supports CSP during working hours
Computing Expert On Call (EOC)
- Responsible for a particular service
- Alarmed by CSP via e-mail/IM/Tel during working hours
- Alarmed by CRC, if really needed, during off-working hours
CMS Site Contact Person
- Responds to alarms (e.g. Savannah, GGUS tickets)
Other shifters (DQM, Online, Detector, …)
- In temporary absence of the CRC, the CSP is the Core Computing contact for any shifter at P5/CMS Center/FNAL ROC
CSP procedure responsible
- Assigns CSP shifts

CSP tools

Prerequisites
The CSP should:
- be a CMS member; if you are not, please fill in the Web registration form
  - After the form has been submitted, an e-mail is sent to your Institute Representative (Team Leader) for approval
  - If you have never been to CERN, it is necessary to send a copy of your passport to Anastasia Dolya, CMS Secretariat, CERN - PH Department, CH Geneva 23, Switzerland
- have a CMS Computer account; for the Computer account, please contact
- have a Hypernews account
- have a GRID certificate + CMS VO registration; please follow the link _Grid_certificate_and_the_r for a guideline on how to proceed

Most important CSP tools
- Main CSP Shift Instructions
- Vidyo connection to the Tandberg system (other CMS Centres)
- Shift Sign-Up tool
- daily Instant Messenger under “FacOpsShifter” account
- Computing Plan of the Day
- Account in the CSP E-log
- Savannah account (“cmscompinfrasup” member) for opening tickets
- Membership in e-group; subscribe via

Shift Subscription tool
type=25
- Shift selection: Blue == available on any slot that day / Green == available on a particular slot that day
- Preferably, always check the Green box corresponding to your time zone slot, to avoid being approved for other time zones
- Warning: when selecting Green, Blue gets selected automatically, so please deselect Blue to avoid confusion

Shift Subscription policies
By end 2010, we actually have more demand for shifts than available slots (95 potential shifters!), so approvals need to follow stricter policies:
- Shift requests can be made anytime for any open shift period
- Shift approvals follow a monthly schedule, where shifts are approved two months in advance to allow for a reasonable planning horizon for all shifters
  - Example: all shift requests for January are reviewed at the beginning of November, the shift requests are balanced between the different groups/regions, and shifts are approved
In the monthly approval process, we would like to follow this procedure:
- Shift requests from shifters in their own time zone have priority
- Within a time zone, balance shift requests first on the group/institute level, then on the level of individual shifters
We are also regularly publishing the CSP shift planning and accounting tables, per time zone, per group and per shifter; see next slide.
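
The "two months in advance" schedule can be written down as simple month arithmetic. This is a minimal sketch; the function name and the `lead_months` parameter are hypothetical and only encode the January/November example given on the slide.

```python
def review_month(shift_year, shift_month, lead_months=2):
    """Return (year, month) in which requests for the given shift month are reviewed.

    lead_months=2 follows the slide's example: January shifts are
    reviewed at the beginning of November of the previous year.
    """
    if not 1 <= shift_month <= 12:
        raise ValueError("shift_month must be in 1..12")
    # Count months from year 0 so that crossing a year boundary is handled.
    index = shift_year * 12 + (shift_month - 1) - lead_months
    return index // 12, index % 12 + 1


if __name__ == "__main__":
    # Requests for January 2011 are reviewed in November 2010.
    print(review_month(2011, 1))
```

A shifter can use the same arithmetic in reverse: a request for a given month should be entered before the start of the month two months earlier.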

CSP Planning and Accounting
Example for European time zone :

The CMS Computing Logbook
2 (unpleasant) features:
- You need to enter your elog password the first time you access a given section
- You need to regularly reload your browser to see updates

The Savannah ticketing tool
- Submit a ticket: main tool to communicate with sites and DataOps/FacOps/AnaOps to solve infrastructure problems
- Savannah Instructions for CSP:

Savannah
- Category: mostly SAM tests, Job Robot, Data transfers, ...
- Severity: You judge!
- Privacy: “Public”
- Assigned to: either DataOps, FacOps, AnaOps or T1/T2 site squad
- Use GGUS: YES for T1s, NO for T2s
- Site: T1/T2 site squad
- Subject: if connected to a specific site, begin with [SITE]. Example: [T1_US_FNAL]
- For Tier-1, please systematically bridge to GGUS (WLCG ticketing) via Use GGUS: Yes
- More information about that here:
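
The field rules above can be summarized as a small function that fills in a ticket for a site problem. This is a sketch only: the dictionary keys and the function name are hypothetical and do not correspond to a real Savannah API; the tier is read from the site-name prefix, following the T1_US_FNAL example.

```python
def build_savannah_ticket(site, problem, category, severity, assigned_to):
    """Assemble the ticket fields for a site problem, following the slide's rules.

    The keys are illustrative labels for the form fields, not a Savannah API.
    """
    tier = site.split("_", 1)[0]
    if tier not in ("T1", "T2"):
        raise ValueError("expected a site name starting with T1_ or T2_")
    return {
        "category": category,        # e.g. SAM tests, Job Robot, Data transfers
        "severity": severity,        # "You judge!"
        "privacy": "Public",
        "assigned_to": assigned_to,  # DataOps, FacOps, AnaOps or the site squad
        "use_ggus": "Yes" if tier == "T1" else "No",
        "site": site,
        "subject": f"[{site}] {problem}",
    }


if __name__ == "__main__":
    ticket = build_savannah_ticket(
        site="T1_US_FNAL",
        problem="failures in a SAM test",
        category="SAM tests",
        severity="Normal",           # hypothetical severity value
        assigned_to="T1 site squad",
    )
    print(ticket["subject"], ticket["use_ggus"])
```

Keeping the GGUS decision tied to the tier prefix means a Tier-1 ticket cannot be filed without the bridge by mistake.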

The Vidyo interface
- We have set up a permanent Vidyo ➞ MCU video bridge
  - Connects to the permanent video feed between the main CMS Centers and P5
  - Remote shifters can be in direct contact with the shifters at CMS CC, P5 and FNAL ROC
- To avoid having too many connections, only one CSP shifter is allowed to be connected at any time
  - The CSP has to log on at the beginning of the shift and log off at the end
- Every remote CMS Center needs:
  - A Remote Video Admin (to connect to the MCU): responsible for checking that the system is used properly and for holding the connection details
  - A Vidyo-capable PC (Windows and Mac clients OK, Linux client still a beta version)
- Sites with existing “Tandberg” or “Polycom” devices will be connected to the MCU directly

CSP procedures

General

Checklist 1 : Core
CERN/Core infrastructure monitoring :
Main checks: CERN/IT SSB, CMS Service Gridmaps, CMS Services scheduled upgrade, CASTORCMS instances

Checklist 2 : Tier-0
Tier-0 workflows monitoring :
Main checks: Storage Manager, T0Mon, tier0export pool, networking, batch/LSF farm, jobs

Checklist 3 : CAF
CAF workflows monitoring :
Main checks: free space/usage per CAF stakeholder on cmscaf pool, networking, batch/LSF farm, jobs

Checklist 4 : Data Transfers
Distributed Data Transfer monitoring :
Main checks: Queue-based monitoring for Tier-1s (not for T2s), Status of PhEDEx agents at sites
Soon obsolete, see next slide

New Checklist 4 : Data Transfers
Distributed Data Transfer monitoring :
Main checks: Status of PhEDEx agents at sites, Queue-based monitoring for Tier-1s (not for T2s)
This new tool will be tested with shifters during November and deployed by end of 2010, replacing the existing tool.

Checklist 5 : Grid Sites
Distributed Grid sites monitoring :
Main checks: SAM, JobRobot, Downtimes, Commissioning links, Savannah

Checklist 5 : Grid Sites
Important
- The CSP is asked to investigate the problem in as much detail as possible
- This helps the admin who will receive the Savannah ticket to solve the problem quickly and easily
- DON’T REPORT THAT SITE X HAS A MEDIUM SIZE RED BALL!!!
- Report that site X shows failures in the <to be filled> SAM test
- In the body, investigate further what the problem is by clicking through the information provided until you reach the detailed error report

Checklists 6&7 : T1/T2 workflows
Tier-1 workflows monitoring :
- Main checks: not covered so far, currently relying on T1 admins, T1 coordinators, DataOps
- Plan to add ProdMon/Dashboard monitoring + GlideIn Fabric monitoring
Tier-2 workflows monitoring :
- Main checks: not covered so far, currently relying on T2 admins, T2 coordinators and CRAB support team
- Plan to collaborate with AnalysisOps monitoring
- Plan to add ProdMon/Dashboard monitoring

Some real examples

CAF monitoring
- Free space on the CMS CAF disk starts to shrink, for an unexpected reason
- CSP instructions (CAF): If the fraction of free space on cmscaf as shown in URL1 goes below 10%, and this was not already mentioned in the Computing Plan of the Day, and there is no Savannah ticket already open, open an ELOG in the "CAF" category
- If there is no detection/alarm by the CSP, the free space might shrink to 0, with the consequence that the critical Tier-0 to CAF data flow breaks
- This really happened! …and some uncontrolled emergency data flushing on the CAF had to be done ➞ WORST CASE SCENARIO!
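
The CAF instruction is a three-part condition, which can be sketched as a single predicate. The function and parameter names below are hypothetical; the inputs would be read by the shifter from the monitoring page, the Computing Plan of the Day and Savannah, and nothing here queries those systems.

```python
CAF_FREE_SPACE_THRESHOLD = 0.10  # "below 10%" from the CSP instructions


def should_open_caf_elog(free_fraction, in_plan_of_day, ticket_already_open):
    """Return True if the CSP should open an ELOG in the "CAF" category.

    free_fraction:       free space on cmscaf as a fraction between 0 and 1
    in_plan_of_day:      issue already mentioned in the Computing Plan of the Day
    ticket_already_open: a Savannah ticket for this issue is already open
    """
    if not 0.0 <= free_fraction <= 1.0:
        raise ValueError("free_fraction must be between 0 and 1")
    return (
        free_fraction < CAF_FREE_SPACE_THRESHOLD
        and not in_plan_of_day
        and not ticket_already_open
    )


if __name__ == "__main__":
    # 7% free, not announced, no ticket yet: an ELOG entry is needed.
    print(should_open_caf_elog(0.07, in_plan_of_day=False, ticket_already_open=False))
```

All three conditions must hold; a low free-space reading that is already covered by the Plan of the Day or by an open ticket does not call for a new ELOG entry.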

Computing Plan of the Day
Note : 3 Russian sites in downtime !

Grid Site Monitoring
Example CMS Site Status Board :
- T2_CN_Beijing shows a red ball! Known by Comp. Plan of Day? No! So what to do?
- JINR in scheduled downtime
- Ignore Waiting Room

Grid Site Monitoring
Investigate further:
- Click on the link next to the “red ball”
- Check the different problem categories and drill further down to find the real problem
Report in E-log:
- Advanced CSP can open a Savannah ticket to the site
- Subject should include [SITE] and a short description of the problem that is as specific as possible
- Do not only mention that the site has a “red ball”!!!
- The ticket should contain as many details as were found during the investigation

Other news on GRID site monitoring
- “lens symbol” == already known issue. NO Elog/ticket needed (but check that it is still the same problem)
- “At work symbol” == site scheduled downtime. NO Elog/ticket needed
- Note: unscheduled downtimes are not yet marked with the “At work symbol”, so double-check with the Computing Plan of the Day and with the CMS Google Downtime Calendar (see next slide) before opening an Elog/ticket
- If a T1 is red (small ball), the CSP should open an Elog/Savannah quasi immediately (1-2h)
- If a T2 is red, follow the instructions on when/how to open an Elog/Savannah
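
The symbol rules can be read as a short decision sequence for a site that shows red. The sketch below is one possible encoding: the function name, the symbol labels `"lens"` and `"at_work"`, and the returned strings are hypothetical, and treating a downtime found in the Plan of the Day or in the Downtime Calendar as "no Elog/ticket" is an interpretation of the double-check note, not an explicit rule from the slide.

```python
def red_site_action(tier, symbol=None, downtime_announced=False):
    """Suggest what the CSP does for a site shown in red on the status board.

    tier:               "T1" or "T2"
    symbol:             "lens", "at_work" or None (no symbol next to the site)
    downtime_announced: downtime found in the Computing Plan of the Day or in
                        the CMS Google Downtime Calendar (assumption, see above)
    """
    if tier not in ("T1", "T2"):
        raise ValueError("tier must be 'T1' or 'T2'")
    if symbol == "lens":
        return "no Elog/ticket; check that it is still the same problem"
    if symbol == "at_work":
        return "no Elog/ticket; scheduled downtime"
    if downtime_announced:
        return "no Elog/ticket; downtime already announced"
    if tier == "T1":
        return "open Elog/Savannah within 1-2 hours"
    return "follow the T2 instructions on when/how to open Elog/Savannah"


if __name__ == "__main__":
    print(red_site_action("T1"))
    print(red_site_action("T2", symbol="lens"))
```

The symbols are checked before the tier so that a known issue or a downtime never triggers a ticket, whatever the tier.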

Other news on GRID site monitoring
CMS Google Downtime Calendar

PhEDEx Components Status Page
All Russian T2s have had their PhEDEx components down for ~3h. What to do? Check the Computing Plan of the Day!

Evolution of CSP procedure

Where we stand and where we go
- Summer 08: CMS Computing shift procedures created
- Fall 08: introduced the concept of Computing Shift Person (CSP) and Computing Run Coordinator (CRC)
- Winter 08: ~100 shifts done by a pool of ~30 computing experts at CMS & FNAL/ROC
- 2009: CSP shifts covered by CMS collaborators at remote CMS Centres
  - Pool of 45 CSPs from 3 time zones (Asia, America, Europe)
  - CMS Centres: Beijing, Rio, Sao Paulo, Texas Tech, Univ. of Florida, Aachen, DESY, FNAL, CERN
- 2010: extend the above philosophy
  - Pool of 70 CSPs (new remote Centres: GridKa, INFN Bologna, ...)
  - Encourage strong remote teams who can provide local CSP support
  - Strengthen the role of the CSP in troubleshooting issues
  - Enforce 24/7 coverage of critical services in shift procedures
  - Move away from “Twiki” to DQM-like monitoring (in progress)

Critical Services and Sites
We are currently revising the Criticality Level of all CMS services. CSP instructions will be adapted accordingly:
- Frequency of checks
- List of experts to contact
- Type of alarm: Elog, Savannah, telephone to CRC (who might raise a GGUS alarm or call the Expert on Call)
As a general rule, the closer you are to the detector data stream, the more critical:
- Tier-0: processing and storage
- CAF: processing and storage
- Central Services at CERN (Core): DBS, PhEDEx, …
- Tier-0 – Tier-1 transfers
- Tier-1 Site Availability
➞ Please pay special attention to these workflows, and always read the Computing Plan of the Day carefully

24/7 Critical Services&Sites Coverage (II)
Flow diagram (labels):
- Service/Facilities Monitoring: CSP checks every 2 hours. Status green?
- E-LogBook & Ticketing tool. Expert answer within 1 hour?
  - No: Service/Site Alarm Procedure
  - Yes: Expert Computing Operations. Problem solved?
- Core System Alarming: Computing Run Coordinator (CRC) reachable 24/7 for:
  - Critical Service recovery procedure
  - Priority (GGUS-Team) ticket to site
- CMS Core Computing experts / CMS Site admins (*):
  - Apply routine service / infrastructure operations and monitoring
  - Respond as On-Call Experts to Alarms
- Actors: CSP, CRC, CERN/IT
(*) CMS has dedicated site-contacts and site-admins
(**) Highly critical alarms to Tier-0/1s are sent via GGUS-Alarm tickets and can trigger phone calls
(***) CRC, Service Expert or Site Admin actions are systematically reported back to the E-LogBook or Savannah or GGUS, for transparency purposes.
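
One possible reading of this flow, written as a step function, is sketched below. The diagram itself is not reproduced in the transcript, so the ordering of the branches is an interpretation; the function name, parameters and returned strings are hypothetical. Only the 2-hour check interval and the 1-hour expert response window come from the slide.

```python
CHECK_INTERVAL_HOURS = 2    # "CSP checks every 2 hours"
EXPERT_RESPONSE_HOURS = 1   # "Expert answer within 1 hour ?"


def next_step(status_green, reported=False, expert_answered=False,
              hours_since_report=0.0):
    """Return the next action in the 24/7 coverage flow (interpretation).

    status_green:       the service or site looks healthy at this check
    reported:           the problem is already in the E-LogBook / ticketing tool
    expert_answered:    an expert has responded to the report
    hours_since_report: time elapsed since the report was filed
    """
    if status_green:
        return f"check again in {CHECK_INTERVAL_HOURS} hours"
    if not reported:
        return "report in E-LogBook / ticketing tool"
    if expert_answered:
        return "expert handles the problem; follow up until solved"
    if hours_since_report >= EXPERT_RESPONSE_HOURS:
        return "start Service/Site Alarm Procedure via CRC"
    return "wait for expert answer"


if __name__ == "__main__":
    print(next_step(status_green=False, reported=True, hours_since_report=1.5))
```

The escalation to the CRC only happens after a report exists and the response window has passed, which matches the slide's emphasis on reporting everything in the E-LogBook first.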

What CSP should always do ?
- Subscribe to CSP shifts well in advance (> 1 week). If you need to cancel, consult P.Kreuzer/O.Gutsche AND remove your shift subscription
- Carefully read the Computing Plan of the Day and keep an eye on it during the whole shift. If the Plan is missing, read the report by the previous shifter and complain via AIM or to the CRC
- Always connect to the instant messenger CSP account “FacOpsShifter”. When leaving the shift desk, inform the outside world by changing the messenger status (e.g. to “away for lunch”)
- When reporting an issue in the proper Elog section, provide details of the observed problem (not just the link)
- Regularly read Elog responses or announcements by the CRC or Computing Experts, in all Elog sections (reload your browser!)
- Write detailed final shift reports in the Elog; even if nothing new has occurred during the shift, report on the main open issues
- Once trained (2-3 passive shifts), open Savannah tickets in case of a well-identified site issue, carefully following the instructions

What CSP should never do ?
- Ignore a suspicious problem because it is too complex to understand ➞ solution: inform the CRC or Computing Experts via Elog
- Open a Savannah ticket without following the CSP instructions to identify a site problem (PhEDEx Component, SAM), or when confused about an observed problem ➞ solution: consult the CRC or Computing Experts via Elog
- Cancel shifts or be replaced without reporting ➞ solution: inform the shift responsible in advance and cancel the subscription in the shiftlist

Last steps

Passive shifts and Shift Subscription
Passive shifts:
- Go through the already signed-up shifts and determine a CSP time slot for doing passive shifts
- Contact the CSP shifter and check if she/he is willing to act as passive shift host
- Confirm with O.Gutsche/P.Kreuzer
Shift Subscription:
- Once the passive shifts are done, subscribe to shifts (ideally 2 months in advance) via system/Shiftlist/ShiftSelection?shift_type=25

Subscriptions
Assumption: the shifter already has a CERN account and a HyperNews account
- Sign up for elog access:
- Sign up for e-group
- Sign up for correct Savannah access to write tickets:
  - Log in to Savannah (CERN afs login)
  - Under "Request for inclusion" type "CMS" and "search"; this will display all groups, then click on "CMS Computing Infrastructure Support"
  - Peter & Oli will approve the request
- Get a valid Grid Certificate and CMS VO registration: _certificate_and_the_r

And now we can practice more if you wish
➞ Simply open
Many thanks for your attention; we are looking forward to working with you!
