Service Availability Monitor tests for ATLAS: Current Status, Tests in development, To Do. Alessandro Di Girolamo, CERN IT/PSS-ED.


1 Service Availability Monitor tests for ATLAS
Current Status, Tests in development, To Do
Alessandro Di Girolamo, CERN IT/PSS-ED

2 SAM Critical Tests: Current Status
4 Dec 2007, Alessandro Di Girolamo
Now running the standard OPS tests using ATLAS credentials (i.e. the original SAM tests run under the ATLAS VO)
List of sites from GOCDB
SE & SRM:
- put: lcg-cr using the cern-prod LFC, files in the SAM test directory
- get: lcg-cp from the site to the SAM UI
- del: lcg-del, to clean the catalog and the storage
CE:
- Check CA RPMs version
- Job Submission on a WN tests
- VO swdir (sw installation directory)
LFC:
- lfc-ls, lfc-mkdir
FTS:
- glite-transfer-channel-list, Information System configuration and publication
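The put / get / del sequence above can be sketched as a small test cycle. This is only an illustration under stated assumptions: the three steps are passed in as hypothetical callables (in a real test they would wrap lcg-cr, lcg-cp and lcg-del, whose options are not shown in the slides), and the rule "always attempt the delete once the put has succeeded" is an assumption about cleanup, not something the slides state.

```python
from typing import Callable, Dict


def run_se_cycle(put: Callable[[], None],
                 get: Callable[[], None],
                 delete: Callable[[], None]) -> Dict[str, str]:
    """Run put, get, del in order and report one status per step.

    The callables are hypothetical stand-ins for the real transfer commands.
    """
    results = {"put": "skipped", "get": "skipped", "del": "skipped"}

    try:
        put()
        results["put"] = "ok"
    except Exception as exc:
        results["put"] = f"error: {exc}"
        return results  # nothing was stored, so get and del are skipped

    try:
        get()
        results["get"] = "ok"
    except Exception as exc:
        results["get"] = f"error: {exc}"

    # Assumption: cleanup is attempted even if the get failed.
    try:
        delete()
        results["del"] = "ok"
    except Exception as exc:
        results["del"] = f"error: {exc}"

    return results


# In-memory stand-in for a storage element, used only for this demo.
store = {}


def fake_put() -> None:
    store["sam-test-file"] = b"payload"


def fake_get() -> None:
    _ = store["sam-test-file"]


def fake_del() -> None:
    del store["sam-test-file"]


print(run_se_cycle(fake_put, fake_get, fake_del))
```

Reporting each step separately keeps a failed get from hiding whether the test file was still cleaned up.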

3 Work in progress
We are developing and testing ATLAS-specific SAM tests in order to:
- monitor the availability of ATLAS critical Site Services
- verify the correct installation and proper functioning of the ATLAS software at each site
SE, SRM and CE endpoint definition: the intersection between GOCDB and TiersOfATLAS (the ATLAS-specific site configuration file, with the Cloud Model)
- different services and endpoints might need to be tested using different VOMS credentials
- ATLAS endpoints and paths must be explicitly tested (i.e. the /dq2 area)
- the LFC of the Cloud (residing at the T1) is used
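The endpoint definition above is a plain set intersection. A minimal sketch, assuming both sources have already been reduced to collections of endpoint host names (the host names below are made up, and how GOCDB or TiersOfATLAS are actually queried is not covered here):

```python
from typing import Iterable, List


def endpoints_to_test(gocdb_endpoints: Iterable[str],
                      toa_endpoints: Iterable[str]) -> List[str]:
    """Return the endpoints present in both GOCDB and TiersOfATLAS."""
    return sorted(set(gocdb_endpoints) & set(toa_endpoints))


# Hypothetical host names, for illustration only.
gocdb = ["srm.site-a.example", "srm.site-b.example", "srm.site-c.example"]
toa = ["srm.site-b.example", "srm.site-c.example", "srm.site-d.example"]

print(endpoints_to_test(gocdb, toa))
```

An endpoint known to only one of the two sources is left out, so a test is never sent to a service that ATLAS does not use or that is not registered.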

4 Development: Tests and Alarms
SE & SRM (centrally from the SAM UI):
- put: lcg-cr with the Cloud LFC, with and without using BDII information
- get: lcg-cp
CE (job submitted to each ATLAS CE):
- keep running a large part of the OPS suite
- for ATLAS Tier1 and Tier2:
  - check the presence of the required version of the ATLAS software
  - compile and execute a real analysis job based on a sample dataset
  - test put/get to local storage via native protocols (dccp, rfcp …)
Alarm system:
- SE / SRM / CE tests failing: site contact persons will be alerted via the SAM Alarm System (mail and/or SMS)
- Grid Services (FTS, LFC etc.) tests failing: alarms go to
  - the service responsible
  - the ATLAS dedicated services (DDM, etc.) that use those services
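The alarm routing described above can be sketched as a lookup on the type of the failing service. All names here are hypothetical, the grid-service set only lists the two examples the slide gives (the "etc." is left open), and the actual SAM Alarm System interface is not represented.

```python
from typing import Iterable, List

SITE_SERVICES = {"SE", "SRM", "CE"}
GRID_SERVICES = {"FTS", "LFC"}  # the slide says "FTS, LFC etc."; only these two are listed here


def alarm_recipients(service_type: str,
                     site_contacts: Iterable[str],
                     service_responsible: str,
                     dependent_atlas_services: Iterable[str]) -> List[str]:
    """Pick who is alerted when a test on the given service type fails."""
    if service_type in SITE_SERVICES:
        # Site-level services: alert the site contact persons.
        return list(site_contacts)
    if service_type in GRID_SERVICES:
        # Grid services: alert the service responsible and the ATLAS
        # services (e.g. DDM) that depend on the failing service.
        return [service_responsible] + list(dependent_atlas_services)
    raise ValueError(f"unknown service type: {service_type}")


print(alarm_recipients("SRM", ["site-admin"], "fts-responsible", ["DDM"]))
print(alarm_recipients("FTS", ["site-admin"], "fts-responsible", ["DDM"]))
```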

5 Reliability & Availability results
SAM Critical Tests not reliable for:
- France: BDII configuration (the ATLAS endpoint should be explicitly put)
- NDGF/BNL: different service setup
SAM Critical Tests failures in the last months:
- FZK: real SRM failures; problems under investigation with the site responsible
- SARA: (mainly) unscheduled network problems

6 To Do
New ATLAS-specific tests (now running in pre-production) will be more realistic for the Experiment
Improve the completeness of the monitoring information:
- information across TiersOfATLAS, GOCDB and BDII
- ATLAS Cloud topology view
- integration with Ganga Robot and other ATLAS tools
- integration with the ATLAS dashboard

7 Backup slides …

8 SAM ATLAS SE (SRM) tests
All SRM endpoints (v1 and v2) can be considered as SEs: SE tests are sent to the list of SRM endpoints resulting from the intersection of ToA & GOCDB

9 SAM ATLAS SE (SRM) tests
All SRM endpoints (v1 and v2) can be considered as SEs: SE tests are sent to the list of SRM endpoints resulting from the intersection of ToA & GOCDB

10 SAM results on Gridmap
Thanks to CERN openlab / EDS
Topology:
- possibility to include the ATLAS Cloud view
- possibility to change the metrics for the site size
The collaboration with the Gridmap developers has already started

11 Other SAM tests
Many more tests, not critical, are running

12 Site Availability: T0/T1
Site Services Availability:
- Site Services X = CE, SE, SRM
- Down: if all services of type X at a site are Down
- Ok: if all services of type X are Ok
- Degraded: if some services of type X are Ok and others are Down
- Site BDII: Ok or Down, taken from the status of the site BDII instance
Site Availability:
- the AND of each single Site Services Availability
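The availability rules above can be sketched in a few lines. Two points are assumptions, since the slide does not settle them: the site BDII status is included in the AND alongside CE, SE and SRM, and a Degraded service type counts as not available unless the caller says otherwise. Function and parameter names are illustrative.

```python
from typing import Dict, List

OK, DEGRADED, DOWN = "Ok", "Degraded", "Down"


def service_type_status(instances: List[str]) -> str:
    """Combine the statuses of all instances of one service type at a site."""
    if not instances:
        # The slide does not define this case; refusing is an assumption.
        raise ValueError("no instances for this service type")
    if all(s == OK for s in instances):
        return OK
    if all(s == DOWN for s in instances):
        return DOWN
    return DEGRADED


def site_available(services: Dict[str, List[str]],
                   bdii_status: str,
                   degraded_is_available: bool = False) -> bool:
    """AND of every service-type status and the site BDII status."""
    up = {OK, DEGRADED} if degraded_is_available else {OK}
    statuses = [service_type_status(v) for v in services.values()]
    statuses.append(bdii_status)
    return all(s in up for s in statuses)


site = {"CE": [OK, DOWN], "SE": [OK], "SRM": [OK]}
print(service_type_status(site["CE"]))
print(site_available(site, OK))
print(site_available(site, OK, degraded_is_available=True))
```

The `degraded_is_available` switch makes the open question explicit: the same site reads as unavailable under the strict interpretation and available under the lenient one.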

13 Site Availability: one example

14 Storage Space Monitor via SAM
A specific SAM test could be sent to the VOBOXes to check storage disk space, as already done for the IT cloud
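A disk-space check of this kind could look roughly like the sketch below. It is not the test used for the IT cloud: the 10% threshold, the status strings and the function names are all assumptions, and only the local filesystem is inspected through the standard library.

```python
import shutil


def classify_free_space(free_bytes: int, total_bytes: int,
                        min_free_fraction: float = 0.10) -> str:
    """Return "ok" if enough space is free, otherwise "error".

    The 10% default threshold is a hypothetical value.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return "ok" if free_bytes / total_bytes >= min_free_fraction else "error"


def check_path(path: str, min_free_fraction: float = 0.10) -> str:
    """Check the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_free_space(usage.free, usage.total, min_free_fraction)


print(check_path("."))
```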

