Hammercloud and Nagios
Andrea Sciabà, Dan Van Der Ster, Nicolò Magini
Experiment Support (DBES), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it


1 Hammercloud and Nagios
Andrea Sciabà, Dan Van Der Ster, Nicolò Magini
Experiment Support (DBES), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it

2 Hammercloud: quick summary (1/2)
Hammercloud is a service to define and run Grid test jobs simulating analysis workflows
Similar to the Job Robot (which inspired it) but much more powerful
–User can choose the dataset, CMSSW version, job splitting parameters, sites or regions, throttling parameters
Two basic modes of operation:
–Functional tests: user defines a template and HC instantiates tests in a continuous way (exactly like the JR does)
–Stress tests: user instantiates tests by hand (ideal for site stress testing, could be applied to CMSSW and CRAB validation)
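To make the idea of a user-defined test template concrete, here is a minimal sketch in Python. It is not Hammercloud's actual data model: the class `TemplateSketch`, its field names, and the helper `instantiate` are all hypothetical, chosen only to mirror the parameters listed on the slide (dataset, CMSSW version, job splitting, sites, throttling, and the two modes).

```python
from dataclasses import dataclass
from typing import Dict, List

VALID_MODES = ("functional", "stress")


@dataclass
class TemplateSketch:
    """Hypothetical container for the parameters a user can choose."""
    name: str
    mode: str                       # "functional" or "stress"
    dataset: str                    # dataset to analyse
    cmssw_version: str              # CMSSW release to use
    events_per_job: int             # job splitting parameter
    sites: List[str]                # target sites (or a region expanded to sites)
    max_running_jobs_per_site: int  # throttling parameter

    def __post_init__(self) -> None:
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.events_per_job <= 0 or self.max_running_jobs_per_site <= 0:
            raise ValueError("splitting and throttling values must be positive")


def instantiate(template: TemplateSketch) -> List[Dict[str, object]]:
    """Expand a template into one test description per site."""
    return [
        {
            "template": template.name,
            "site": site,
            "dataset": template.dataset,
            "cmssw_version": template.cmssw_version,
            "events_per_job": template.events_per_job,
            "max_running_jobs": template.max_running_jobs_per_site,
        }
        for site in template.sites
    ]
```

In this sketch a functional test would call `instantiate` repeatedly on a schedule, while a stress test would call it once on demand; that scheduling is outside the scope of the example.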

3 Hammercloud: quick summary (2/2)
Status of tests visible in real time via plots and tables
–Any parameter in the FJR can be plotted as a histogram over the jobs in the test
Statistics by site available
Administrative interface to add/change templates, metrics to plot, CRAB parameters to use, etc.
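The histogram of a job-report parameter over the jobs in a test can be sketched as follows. This is an illustration only, not Hammercloud's plotting code: the job reports are modelled as plain dictionaries, and the function names and the example key are hypothetical.

```python
from typing import Dict, Iterable, List


def metric_values(job_reports: Iterable[Dict[str, float]], key: str) -> List[float]:
    """Collect one numeric parameter from every job report that has it."""
    return [report[key] for report in job_reports if key in report]


def histogram(values: Iterable[float], n_bins: int, lo: float, hi: float) -> List[int]:
    """Count values into n_bins equal-width bins covering [lo, hi].

    Values outside the range are ignored; a value equal to hi
    is counted in the last bin.
    """
    if n_bins <= 0 or hi <= lo:
        raise ValueError("need n_bins > 0 and hi > lo")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in values:
        if value < lo or value > hi:
            continue
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts
```

A caller would first extract the metric with `metric_values(reports, "exe_time")` (a made-up key) and then pass the result to `histogram` with a binning suited to that metric.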

4 Status of HC in CMS
HC server running on vocms38
HC web interface running on voatlas49 (common for ATLAS, CMS and LHCb)
–New users must request a login
Functional tests have been running for several months at all sites
(Note: right now there is a Python error preventing job submission, to be fixed)
New CMSSW releases need to be installed by hand
Only one CRAB version selectable
–Not using CRAB server

5 Planned improvements
Allow selecting the CRAB version
Allow using the CRAB server
Allow full access to standard output files via web server
Allow selecting the activity name for the Dashboard as a parameter
–Possible with the latest CRAB version
–Essential to replace the JobRobot as we need to separate “JR/HC” jobs from other HC tests
Use a non-ATLAS host for the web server
–Is this a requirement?
Develop SLS sensors

6 Possible improvements
Enable the possibility (already supported) to report which sites fail some freely defined criteria
–For example, sites with a low success rate in the last N minutes
–Easy to publish in the Site Status Board (done for ATLAS)
–May be used to implement automatic site exclusion mechanisms
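The example criterion above, "sites with a low success rate in the last N minutes", could be evaluated along these lines. This is a sketch under stated assumptions, not the mechanism Hammercloud implements: jobs are modelled as `(site, finish_time, succeeded)` tuples, and the function name, the threshold, and the `min_jobs` guard are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Job = Tuple[str, datetime, bool]  # (site, finish time, succeeded?)


def failing_sites(
    jobs: Iterable[Job],
    now: datetime,
    window_minutes: int,
    min_success_rate: float,
    min_jobs: int = 1,
) -> List[str]:
    """Return sites whose success rate in the last window_minutes is too low.

    Sites with fewer than min_jobs finished jobs in the window are skipped,
    so that a single failure does not flag an otherwise idle site.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    totals = defaultdict(int)
    successes = defaultdict(int)
    for site, finished, succeeded in jobs:
        if finished < cutoff or finished > now:
            continue  # outside the time window
        totals[site] += 1
        if succeeded:
            successes[site] += 1
    return sorted(
        site
        for site, total in totals.items()
        if total >= min_jobs and successes[site] / total < min_success_rate
    )
```

The returned list is the kind of output that could be published to a status board or fed to an exclusion mechanism; how that is done is left open here.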

7 Nagios: quick summary
A new framework, based on the Nagios monitoring system and the WLCG MSG service (a messaging system based on ActiveMQ)
It replaces the old SAM framework
Experiments can use it to run their own functional tests
Developed and supported by IT-GT for WLCG

8 Status of Nagios in CMS
CMS tests were ported from SAM to Nagios several months ago
–Nagios server very stable; preprod server available
CMS tests and their configuration must be packaged as an RPM
–Currently the RPM is not generated automatically when a test maintainer updates a test, so there is a risk of delays
Tests are run on all CEs and SRMs at all CMS sites
Test results and site availability published by the Dashboard and taken from the old SAM database
CMS Site Readiness will use Nagios availability as of today

9 Planned improvements
Integrate a CMS glexec test
Run tests only on specific services using a CMS topology feed (produced by the Dashboard)
Enable automatic site exclusion from BDII for CEs failing critical tests
Have the Dashboard take test results from ACE (the replacement for the old SAM database)
Proper procedure to generate new RPMs
Proper alarming (SLS + Lemon)
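Two of the planned items, restricting tests to the services named in a topology feed and excluding CEs that fail critical tests, can be illustrated together. The sketch below is hypothetical throughout: the topology is modelled as a plain mapping from site to CE hostnames, results as `(ce, test_name, passed)` tuples, and nothing here reflects the real feed format, the Nagios probes, or the BDII interface.

```python
from typing import Dict, Iterable, List, Set, Tuple

Result = Tuple[str, str, bool]  # (CE hostname, test name, passed?)


def ces_to_exclude(
    results: Iterable[Result],
    critical_tests: Set[str],
    topology: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return, per site, the CEs that failed at least one critical test.

    Only CEs listed in the topology are considered; results for unknown
    services and for non-critical tests are ignored.
    """
    site_of = {ce: site for site, ces in topology.items() for ce in ces}
    failing: Dict[str, Set[str]] = {}
    for ce, test_name, passed in results:
        if ce not in site_of or test_name not in critical_tests or passed:
            continue
        failing.setdefault(site_of[ce], set()).add(ce)
    return {site: sorted(ces) for site, ces in failing.items()}
```

Keeping the decision as a pure function of results and topology makes it easy to review what would be excluded before any automatic action is wired in.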

10 Open issues
IT-GT is maintaining the CMS Nagios production and preproduction servers for the time being, but not forever
Must determine whether to run them in CMS or in IT

11 Conclusions
Hammercloud
–Still some integration work needed
–Should define ASAP procedures for Facilities Operations
–Promote its usage to have a better fit with CMS needs and a quicker development cycle; aim at decommissioning the Job Robot ASAP
Nagios
–Basically already production quality
–Integrate new important tests
–Converge on a production infrastructure

12 Acknowledgements
Thanks to the IT-GT group for their support on the usage of SAM/Nagios and to Mario Úbeda for the HC-CMS integration

