
Monitoring the Readiness and Utilization of the Distributed CMS Computing Facilities
XVIII International Conference on Computing in High Energy and Nuclear Physics
J. Flix (CIEMAT/PIC), José M. Hernández (CIEMAT), A. Sciabà (CERN)
On behalf of CMS Computing

CMS Computing System
- The CMS computing system consists of 60+ sites distributed worldwide
- CMS computing requires stable and reliable behavior of the underlying infrastructure to sustain the various workflows
- CMS has established procedures to extensively and continuously test all relevant aspects of a Grid site
  - Ability to efficiently use their network to transfer data
  - Functionality of all the site services relevant for CMS
  - Capability to sustain the various CMS computing workflows at scale
    - Monte Carlo simulation, data processing and skimming, data analysis
CHEP’10, 18-22 Oct 2010, Taipei

Site Readiness Monitoring
- CMS has developed a monitoring framework to track site readiness
  - CMS Site Availability Monitoring (SAM) tests
    - Jobs sent to sites to test specific services
  - JobRobot job load generator
    - Tests data processing workflows
  - Data transfer load generator
    - Data transfer quality and commissioned links
- Site readiness metrics established to guarantee that data processing can be performed efficiently and reliably
  - Provide a list of ‘good’ sites for production and analysis activities
  - Provide sites with the information needed to solve any problems that arise
- Looked at by computing shifters and reviewed at the weekly Computing Operations meetings

Site Readiness: SAM tests
- Site Availability Monitoring: CMS SAM tests
  - Test CE, SE, experiment software, conditions cache, data read, stage out, etc.
  - High-priority jobs submitted every hour
  - Require daily availability > 80% for T2s and > 90% for T1s
[Plots: T2s, T1s]
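
The daily availability requirement can be sketched as below. This is a minimal illustration, not the CMS implementation: it assumes that availability is the fraction of the hourly test rounds in a day that passed, and the names `daily_availability` and `passes_sam` are hypothetical.

```python
# Thresholds quoted on the slide: daily availability must exceed
# 80% for Tier-2 sites and 90% for Tier-1 sites.
AVAILABILITY_THRESHOLD = {"T1": 0.90, "T2": 0.80}


def daily_availability(hourly_results):
    """Fraction of hourly test rounds that passed (assumed definition)."""
    if not hourly_results:
        return 0.0
    return sum(1 for ok in hourly_results if ok) / len(hourly_results)


def passes_sam(tier, hourly_results):
    """True if the site's daily availability exceeds its tier threshold."""
    return daily_availability(hourly_results) > AVAILABILITY_THRESHOLD[tier]


# Invented example: 20 of 24 hourly rounds passed (about 83%).
day = [True] * 20 + [False] * 4
print(passes_sam("T2", day), passes_sam("T1", day))
```

With these invented inputs the same day is sufficient for a Tier-2 but not for a Tier-1, which shows how the stricter Tier-1 threshold works in practice.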

Site Readiness: JobRobot
- JobRobot (JR) load generator
  - Tool for automatic job preparation, submission, collection and evaluation
  - Simple jobs reading data; JR will be replaced by HammerCloud
  - A few hundred jobs/site/day sent to more than 50 sites (~25k jobs/day)
  - Require daily success rate > 80% for T2s and > 90% for T1s
[Plots: T2s, T1s]

Site Readiness: data transfer links
- Commissioning of data transfer links
  - For sites to be usable, data transfer links need to be operational
  - The Debugging Data Transfers (DDT) task force defined metrics, procedure & tools to certify links and assisted sites in solving problems
  - The minimum requirements to commission a transfer link are:
    - 5 MB/s sustained for 24h for T2→T1 links (and T2→T2 links)
    - 20 MB/s sustained for 24h for T0→T1, T1↔T1 and T1→T2 links
  - Each commissioned link is enabled and is used in production
  - Based on the operational needs, site is considered OK if:
[Diagram: required links between T0, T1 and T2 sites; the criteria were shown graphically and are not recoverable from the transcript]
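
The minimum commissioning rates listed on this slide can be checked with a few lines of code. The sketch below makes two assumptions that the slide does not state: "sustained for 24h" is read as the average rate over the 24-hour window, and 1 MB is taken as 10^6 bytes. The function name `is_commissioned` is hypothetical.

```python
# Minimum rates from the slide, in MB/s, keyed by (source tier, destination tier).
MIN_RATE_MB_S = {
    ("T2", "T1"): 5.0,
    ("T2", "T2"): 5.0,
    ("T0", "T1"): 20.0,
    ("T1", "T1"): 20.0,
    ("T1", "T2"): 20.0,
}
SECONDS_PER_DAY = 24 * 3600


def is_commissioned(src_tier, dst_tier, bytes_in_24h):
    """Compare the 24h average rate (assumed metric) with the required minimum."""
    required = MIN_RATE_MB_S.get((src_tier, dst_tier))
    if required is None:
        raise ValueError(f"no requirement defined for {src_tier}->{dst_tier}")
    average_rate_mb_s = bytes_in_24h / 1e6 / SECONDS_PER_DAY
    return average_rate_mb_s >= required


# 5 MB/s for a day corresponds to 432 GB; 20 MB/s corresponds to 1728 GB.
print(is_commissioned("T2", "T1", 432e9))
print(is_commissioned("T1", "T2", 1000e9))
```

Under these assumptions a T2→T1 link needs to move at least 432 GB in a day, and a T0→T1, T1↔T1 or T1→T2 link at least about 1.7 TB.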

Site Readiness: data transfer quality
- Data transfer quality
  - Transfer quality continuously probed at a low rate on all links
    - 2000+ links, ~1 GB/s aggregate CMS-wide
  - Allows problems to be detected (network, storage, transfer services, etc.)
  - Require transfer quality > 50% on more than half of the commissioned links
  - Both monitoring and production data transfers are used in the metric
[Plot: Example: T1s → Spanish T2s (production and monitoring transfers)]
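
The transfer quality requirement above can be expressed as a short check. This sketch assumes that the quality of a link is the ratio of successful transfers to attempted transfers, and that a link with no attempts does not count as good; neither assumption is stated on the slide. The names `link_quality` and `passes_transfer_quality` and the link labels are hypothetical.

```python
def link_quality(successes, failures):
    """Successful transfers over attempts (assumed definition); None if idle."""
    attempts = successes + failures
    return successes / attempts if attempts else None


def passes_transfer_quality(links):
    """links maps a commissioned link name to (successes, failures).

    The site passes if quality is above 50% on more than half of its links.
    """
    good = 0
    for successes, failures in links.values():
        quality = link_quality(successes, failures)
        if quality is not None and quality > 0.5:
            good += 1
    return good > len(links) / 2


# Invented example: three of four commissioned links are above 50%.
links = {
    "T1_A->T2_X": (90, 10),
    "T1_B->T2_X": (60, 40),
    "T1_C->T2_X": (10, 90),
    "T1_D->T2_X": (70, 30),
}
print(passes_transfer_quality(links))
```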

Site Readiness: Site Status Board
- Collect and display all site readiness information in the Site Status Board
- Central point of information for sites and computing shifters

Site Readiness: metrics
- Combine all metrics into a single daily ‘site readiness status’
- Intermediate Warning state to give sites the time to recover
- Back to Ready state after some stability
[Diagram: Ready, Warning and Not-Ready states; only a truncated label "(> 2" survives from the original figure]
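
The three-state logic described on this slide could look roughly like the sketch below. The transcript does not preserve the actual transition rules, so everything here is an assumption made for illustration: a day is OK only if all metrics pass, a first failing day moves a Ready site to Warning, more than `max_bad_days` consecutive failing days moves it to Not-Ready, and `min_good_days` consecutive good days are needed to return from Not-Ready to Ready. All names and default values are hypothetical.

```python
READY, WARNING, NOT_READY = "Ready", "Warning", "Not-Ready"


def day_is_ok(metrics):
    """A day counts as OK only if every individual metric passed."""
    return all(metrics.values())


def readiness_history(daily_ok, max_bad_days=2, min_good_days=2):
    """Turn a sequence of daily OK flags into daily readiness statuses.

    The transition rules and the two thresholds are illustrative assumptions.
    """
    statuses = []
    status = READY
    bad_streak = good_streak = 0
    for ok in daily_ok:
        if ok:
            good_streak += 1
            bad_streak = 0
            if status == WARNING:
                status = READY  # recovered before reaching Not-Ready
            elif status == NOT_READY and good_streak >= min_good_days:
                status = READY  # back to Ready after some stability
        else:
            bad_streak += 1
            good_streak = 0
            if bad_streak > max_bad_days:
                status = NOT_READY
            elif status == READY:
                status = WARNING  # grace period to recover
        statuses.append(status)
    return statuses


print(readiness_history([True, False, True, False, False, False, True, True]))
```

The Warning state acts as a buffer: a single bad day does not remove a site from the pool, while a persistent problem does.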

Site Readiness status flag
- Use site readiness status history to flag good/bad sites
  - Use the history of the past 15 days to ensure stable sites are used for production and analysis activities
  - Site Readiness (15 days) > 80% for Tier-2s, > 90% for Tier-1s
  - List of good sites available to the WM tools
[Plot: T2s]
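
The 15-day flag can be sketched as a sliding-window fraction. The slide does not say which daily states count towards the readiness fraction, so the `ready_states` parameter below is an assumption (by default only days in the Ready state are counted); the function name `is_good_site` is hypothetical.

```python
# Thresholds from the slide for the 15-day site readiness fraction.
READINESS_THRESHOLD = {"T1": 0.90, "T2": 0.80}


def is_good_site(tier, statuses, window=15, ready_states=("Ready",)):
    """Flag a site as good if its readiness fraction over the last
    `window` days exceeds the tier threshold.

    statuses: daily status strings, oldest first.
    ready_states: states counted as ready (assumption, see lead-in).
    """
    recent = statuses[-window:]
    if not recent:
        return False
    fraction = sum(1 for s in recent if s in ready_states) / len(recent)
    return fraction > READINESS_THRESHOLD[tier]


# Invented example: 13 Ready days out of the last 15 (about 87%).
history = ["Ready"] * 13 + ["Not-Ready"] * 2
print(is_good_site("T2", history), is_good_site("T1", history))
```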

Site Readiness monitored
- Positive effects of the site readiness (SR) program
  - Continuous monitoring of Grid & CMS services at sites
  - Helps production and users to select reliable T2 sites
- Significant improvement when the SR program started
  - ~40 Tier-2s ready for CMS workflows
  - Some instability in Tier-1s
  - Still room for improvement

Monitoring Resource Utilization & Performance
- Closely monitor how efficiently we use our computing resources
  - Utilization
    - Slot usage, processing share among sites, utilization level with respect to pledges
  - Performance
    - Job success rates, CPU efficiency
- Investigate and overcome inefficiencies
  - Disentangle the various inefficiency effects: site problems, WMS tools/operations inefficiencies, lack of processing work, imbalance of data prelocation, etc.
  - Try to balance resource utilization to make the most efficient use of the available resources
  - Important once we become resource-constrained
- Reviewed weekly at the Computing Operations meeting

Tier-1 resource utilization
- Slot usage
  - Average number of slots occupied
  - Spiky usage due to intermittent data processing (irregular data taking, reprocessing passes)
  - Complemented by MC production
- Processing share
  - Fraction of the processing done at each Tier-1
  - Useful to balance resource usage

Tier-1 resource utilization
- Utilization level
  - Fraction of the pledged number of slots actually used
  - Not resource-constrained yet
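
The Tier-1 utilization quantities (processing share and utilization level) are simple ratios. The sketch below is illustrative only: the function names, site labels and slot numbers are invented, and the processing share is approximated here by each site's share of the occupied slots, which is an assumption.

```python
def processing_share(slots_by_site):
    """Fraction of the total occupied slots contributed by each site
    (used here as a stand-in for the share of processing done)."""
    total = sum(slots_by_site.values())
    if total == 0:
        return {site: 0.0 for site in slots_by_site}
    return {site: used / total for site, used in slots_by_site.items()}


def utilization_level(slots_used, slots_pledged):
    """Fraction of the pledged number of slots actually used."""
    if slots_pledged <= 0:
        raise ValueError("pledged slots must be positive")
    return slots_used / slots_pledged


# Invented numbers for two hypothetical Tier-1 sites.
print(processing_share({"T1_A": 300, "T1_B": 100}))
print(utilization_level(600, 1000))
```

A utilization level below 1 means the pledge is not exhausted, which matches the slide's remark that CMS was not yet resource-constrained.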

Tier-1 performance
- Job success rates
  - On average ~90%
  - After a few automatic resubmissions, nearly all jobs eventually complete
- Job CPU efficiency
  - Ratio of CPU time to wall-clock time
  - Data processing jobs are I/O-bound
  - A lot of work is ongoing to improve application CPU efficiency and to remove site bottlenecks in the data-serving infrastructure
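
Both Tier-1 performance figures can be illustrated numerically. The sketch below computes CPU efficiency as CPU time over wall-clock time, as defined on the slide, and estimates how a ~90% per-attempt success rate leads to nearly all jobs completing after a few resubmissions. The second function assumes that attempts fail independently, which is a simplification and not something the slide claims; the function names are hypothetical.

```python
def cpu_efficiency(cpu_seconds, wallclock_seconds):
    """CPU time divided by wall-clock time for a job."""
    if wallclock_seconds <= 0:
        raise ValueError("wall-clock time must be positive")
    return cpu_seconds / wallclock_seconds


def completed_fraction(success_rate, resubmissions):
    """Expected fraction of jobs done after the first attempt plus the
    given number of resubmissions, assuming independent failures."""
    return 1 - (1 - success_rate) ** (resubmissions + 1)


# An I/O-bound job that used 45 minutes of CPU in one hour of wall-clock time.
print(cpu_efficiency(45 * 60, 60 * 60))

# With a 90% per-attempt success rate, print the expected completed fraction
# after 0, 1 and 2 resubmissions.
for n in range(3):
    print(n, completed_fraction(0.9, n))
```

Under the independence assumption, two resubmissions already bring the expected completed fraction to about 99.9%.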

Tier-2 usage
- ~6000 slots continuously used for analysis
- Up to ~10000 slots used for MC production
- Nearly all T2s regularly used for analysis
- 400+ distinct analysis users/week

Tier-2 performance: analysis activities
- ~70% analysis job success rate
  - Site failures ~5%, application failures ~25% (remote stageout, configuration, crashes, etc.)
  - Jobs aborted by the Grid ~10%
- ~55% analysis job CPU efficiency

Conclusions
- Site readiness monitoring has been instrumental in bringing sites into stable and reliable operation, keeping them there, and scaling up the CMS distributed computing system
- Monitoring resource utilization and performance has helped improve the execution efficiency of the CMS workflows
Thanks to the Dashboard Team for providing the monitoring infrastructure
