SAM Tests SAM Devel. & Support Team CERN IT/GD WLCG/EGEE/OSG Operations Workshop 25 Jan. 2007, CERN.

Presentation transcript:


Grid Operations, Tier2 Admin Workshop, 25 Jan. 2007, CERN
SAM sensors and tests

Outline
  about SAM
    – introduction
    – production service at CERN
    – official submissions
        ops
        other VOs
  framework structure
  sensors + tests
    – definition
    – existing sensors
    – JobWrapper tests
    – documentation

SAM - Overview
  Grid service-level monitoring framework
  used in Grid Operations
  basis for Availability Metrics
  VO-based submissions
    – VO-specific tests
  services tested currently:
        CE, gCE
        SE
        RB
        sBDII
        BDII
        FTS
        LFC
  JobWrapper tests

CERN infrastructure
  [diagram labels: serv, RB1, RB2, BDII1, BDII2, WMS1, WMS2, UI1, UI2, CE, gCE, dns-alias]

Central SAM submissions
  Official CERN submissions
    – Production and Certified sites
        ops (+ dteam) VO
        jobs submitted every hour
        basis of COD alarms
    – PPS
        ops VO
        hourly
    – SAM Admin Portal
        ops VO
        on-demand
        Certified + Uncertified sites

VO-specific submission
  LHCb
    – CE, gCE, SE
    – VO-specific tests used as Critical Tests for the VO!
  Atlas
    – all sensors
    – submitted from SAM UI
  Alice
    – CE, gCE tests
        RM tests not taken into account (special Alice LFC settings)
    – submitted from SAM UI
  CMS
    – coming soon

SAM Framework
  framework structure
    – client
        submission framework
        sensors
          – tests
    – server
        web services
        portal
        Oracle DB accessed by web services
        static (GOCDB) + dynamic (BDIIs) info
  developed by
    – server + submission framework: CERN Team
    – sensors + tests: CERN Team + external contributors

Submission workflow

Sensors and tests
  Sensor
    – types
        per GRID service (CE, gCE, SE, SRM, FTS, LFC, etc.)
        for multiple services (host-cert)
    – container object
        tests
        necessary configuration for the tests
        checks, preparation for the tests
        definition of the test execution sequence
    – invoked by the SAM command-line tool
    – execution environment for the defined list of contained tests
  Test
    – belongs to a sensor
    – scripts (executables) executed by the sensor
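The sensor/test relationship on this slide can be sketched as a small data structure. This is only an illustration in Python, not the SAM client code (the slides describe the real tests as bash scripts); the class names, the `ok`/`error` labels and the `execute` method are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical status labels; the real SAM result codes are not given in the slides.
OK, ERROR = "ok", "error"


@dataclass
class Test:
    """A test belongs to a sensor and wraps one executable step."""
    name: str
    run: Callable[[], str]  # returns OK or ERROR


@dataclass
class Sensor:
    """Container object: configuration, a preparation check and an ordered test list."""
    service: str
    config: Dict[str, str] = field(default_factory=dict)
    check: Callable[[Dict[str, str]], bool] = lambda config: True
    tests: List[Test] = field(default_factory=list)

    def execute(self) -> Dict[str, str]:
        """Run the check, then every contained test in the defined order."""
        if not self.check(self.config):
            return {"check": ERROR}
        results = {"check": OK}
        for test in self.tests:
            results[test.name] = test.run()
        return results


sensor = Sensor(
    service="CE",
    config={"vo": "ops"},
    check=lambda config: "vo" in config,
    tests=[Test("test1", lambda: OK), Test("test2", lambda: ERROR)],
)
print(sensor.execute())
```

In this sketch the sensor owns the execution order, so adding a test only means appending to the list, which mirrors the plug-in idea without claiming to match the actual framework.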

Test submission I.
  [diagram labels: SAM client executable; SENSOR containing prepare, check, test1, test2, test3, ..., testN; SAM DB]

Test submission II.
  [diagram labels: SAM client executable; SENSOR containing prepare, check, test1, test2, test3, ..., testN; SAM DB; parent test; publishing; submission]

SAM Sensors
  sensor for each GRID service
    – there are still missing ones at the moment
  plug-in modules → easy to add new ones
  sensor execution:
    – $SAME_HOME/client/bin/same-exec
    – some sensors have to publish results (CE, gCE, FTS)
  $SAME_HOME/client/sensors/
    – additional scripts
        checks, setting up environment, etc.
    – tests directory: tests

SAM Tests
  simple (bash) scripts
  plug-in modules → easy to add new ones
    – VO-specific tests
  most of the tests display the exact command they execute
    – a failing site can investigate and try to reproduce the error
        Note: the error may occur only for the executing VO!
  $SAME_HOME/client/sensors/ /tests
    – naming convention
        _ -testname
        where is optional, defaults to ops

Test failures
  test failures can come from
    – problems with the site (majority)
        misconfiguration
        hardware errors
        network errors
        etc.
    – problems with a central service
    – problems with the framework
        lcg-cr fails → lcg-del is still executed for the non-existing file
        FTS-infosites fails → the rest of the tests don't recognize the absence of the FTS-endpoint parameter
  recipes & hints:
    – GOC wiki:
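The framework problem noted on this slide (lcg-del still runs after lcg-cr has failed) is a missing dependency between steps. The sketch below shows one possible way to express such a dependency; it is not how the SAM framework works, and the `run_sequence` helper, the status labels and the lambda stand-ins for the real commands are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (step name, callable returning True on success, name of the step it depends on)
Step = Tuple[str, Callable[[], bool], Optional[str]]


def run_sequence(steps: List[Step]) -> Dict[str, str]:
    """Run steps in order; skip a step when the step it depends on did not succeed."""
    results: Dict[str, str] = {}
    for name, action, depends_on in steps:
        if depends_on is not None and results.get(depends_on) != "ok":
            results[name] = "skipped"
            continue
        results[name] = "ok" if action() else "error"
    return results


# Stand-ins for the real commands: here the upload (lcg-cr) fails.
steps: List[Step] = [
    ("lcg-cr", lambda: False, None),
    ("lcg-cp", lambda: True, "lcg-cr"),
    ("lcg-del", lambda: True, "lcg-cr"),
]
print(run_sequence(steps))
```

With the failing upload, the copy and delete steps are reported as skipped rather than run against a file that was never created.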

Sensors - CE, gCE I.
  job submission: UI → RB/WMS → CE/gCE → WN
    – errors may occur at any level (usually not the UI)
  job submission: parent job for the rest of the tests
    – individual tests publish themselves to SAM
    – job submission is published by invoking the SAM cmd-line tool
  every hour for the CERN submissions (except LHCb)
  job submission failure → job logging info is returned

Sensors - CE, gCE II.
  examples for site errors
    – Cannot read jobwrapper output, both from Condor and Maradona
        error on CE
    – Cannot plan: BrokerHelper: no compatible resources
        probably the site-BDII is not publishing results properly
    – Got a job held event, reason: Unspecified gridmanager error
    – Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)
        proxy lifetime (12 hours) expired before the job finished running
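A site admin reading job logging info might keep the messages from this slide in a small lookup. The sketch below does that; the message fragments and hints are copied from the slide, while the `hint_for` helper and the substring matching are illustrative assumptions, not part of SAM.

```python
from typing import Optional

# Message fragments and hints copied from the slide; the lookup itself is illustrative.
SITE_ERROR_HINTS = {
    "Cannot read jobwrapper output": "error on CE",
    "no compatible resources": "probably the site-BDII is not publishing results properly",
    "the user proxy expired": "proxy lifetime (12 hours) expired before the job finished running",
}


def hint_for(log_line: str) -> Optional[str]:
    """Return the first hint whose fragment occurs in the log line, else None."""
    for fragment, hint in SITE_ERROR_HINTS.items():
        if fragment in log_line:
            return hint
    return None


print(hint_for("Cannot plan: BrokerHelper: no compatible resources"))
```

The "Unspecified gridmanager error" message is left out on purpose, since the slide gives no likely cause for it; `hint_for` returns `None` in that case.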

Sensors - CE, gCE III.
  CA certificate check (on WN!)
    – algorithm:
        1. checking the RPMs
        2. if no RPM → checking CA files
    – new CA certificates released by EUGridPMA
        middleware repository + SAM: immediate upgrade
        sites have 7 days to upgrade
          – the tests report 'WARN' during this period, 'ERROR' after
  Software middleware version check
    – lcg-version command
  Brokerinfo
    – edg-/glite-brokerinfo command
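The 7-day grace period for new CA certificates can be written as a small decision function. This is a sketch under assumptions: the function name, the 'OK' label for an upgraded site and the choice to count day 7 as still inside the grace period are not stated on the slide.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=7)  # "sites have 7 days to upgrade"


def ca_check_status(release_day: date, today: date, site_upgraded: bool) -> str:
    """Status of the CA check for a site, given the day the new CA release came out."""
    if site_upgraded:
        return "OK"  # assumed label for a site that already has the new CAs
    if today - release_day <= GRACE_PERIOD:
        return "WARN"  # still inside the grace period (day 7 included, by assumption)
    return "ERROR"


release = date(2007, 1, 10)  # hypothetical release date
print(ca_check_status(release, date(2007, 1, 15), site_upgraded=False))
print(ca_check_status(release, date(2007, 1, 20), site_upgraded=False))
```

The first call falls 5 days after the release and yields WARN; the second is 10 days after and yields ERROR.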

Sensors - CE, gCE IV.
  UNIX shell env. vars
    – bash + csh
  Replica Management
    – testing default SE and 3rd-party replication
        checking GFAL_INFOSYS env. var.
        lcg-cr file to default SE
        lcg-cp file back from default SE
        lcg-rep file from def. SE to central CERN SE
        lcg-del file
  Notes:
    – LFC endpoint not connected
        if the site is using “its own” top-level BDII, then it might not publish it properly

Sensors - CE, gCE V.
  VO experiment software
    – check VO_ _SW_DIR env. var
    – check directory existence
  VO management tags
    – lcg-Manage-VOTags command

Sensors - CE, gCE VI. ('ops' non-crit. tests)
  RGMA
    – printing RGMA configuration file $RGMA_HOME/etc/rgma.conf
    – inserting & querying a tuple using RGMA shell
  Secure RGMA
    – running edg-java-security-tomcat-test.sh
  WN
    – getting the hostname of the worker node
  APEL
    – executed on the UI (not on the remote site)
    – RGMA query to get the number of entries per site
  test results from gstat
    – CE-totalcpu, CE-freecpu, CE-waitjob, CE-runjob

Sensors - SE, SRM, LFC
  SE, SRM
    – the same set of tests for both
        lcg-cr file from UI to SE/SRM
        lcg-cp file back to UI
        lcg-del file from SE/SRM
  LFC
    – lfc-ls: directory listing on '/grid'
    – lfc-mkdir: creating entry in '/grid/ '

Sensors - FTS
  executed for dteam VO
  lcg-infosites
    – check if FTS endpoint is correctly published in BDII
  glite-transfer-channel-list
    – ChannelManagement service
  transfer test
    – transfer jobs following the VO use cases
        tested T0 → all T1s (outgoing)
        tested T1 ← other T1s (incoming)
    – checking the status of jobs
    – using pre-defined static list of files
        SRM endpoints taken from this list (CVS), no dynamic discovery yet
    – Note: the test relies on SRM availability

Sensors - standalone sensors
  BDII (Sinica)
    – accessibility (response time)
    – reliability of data (number of entries)
  sBDII (Sinica)
    – accessibility (response time)
    – sanity checks (partial Glue schema validation)
    – totalCPU, freeCPU
    – waiting jobs
    – SEusedTB, SEavailTB
  RB (RAL)
    – job submission
        “important” RBs are tested using selected “reliable” CEs
    – measuring the time of matchmaking

SAM errors
  Examples:
    – Timeout after 600 seconds!
        error due to overloaded hardware
        solution: new (stronger) hardware with load balancing
    – proxy expires in 1-2 minutes
        middleware problem (RB)
          – patch applied
  problem reporting
    – GGUS:

JobWrapper tests I.
  Requested by experiments
    – motivation
        SAM jobs might not reach all WNs → broken WN not detected
  simplified set of tests
  test execution by the CE wrapper with every GRID job
    – → all WNs reached
  test results:
    – passed to the job
    – stored in the SAM DB
  installation:
    – modified job-wrapper scripts: part of the release
    – tests: signed tarball installed in the software area

JobWrapper tests II.
  Operations
    – infrastructure description with unique identification of WNs:
        relation between CE, batch queues, WNs
        detection of monitoring queues pointing to “carefully” selected WNs
        counting WNs at the site without the risk of double counting due to shared batch farms
    – detection of sites with broken WNs
        basic fabric monitoring for small sites
  Status:
    – wrapper scripts in production
    – tarball with tests has to be installed
    – visualization tools have to be developed

Missing sensors
  gLite WMS
  MyProxy
  VOMS
  Tier1 DB
  RGMA registry
    – being developed at RAL
  VOBOX
    – basic tests ready by end of Jan. (gsissh)
  volunteers are welcome! :)

SAM Documentation
  all documentation on GocWiki

Impact of the test results

Outline
  Critical Tests
  COD alarms
  VO-specific submissions
  Availability Metrics

Critical Tests
  set of SAM tests
    – defined by each VO
    – defined for each sensor
  defines the criteria for availability of a resource
  CT set manipulation
    – via FCR Admin Portal
    – only by the VO responsibles
  CT set display
    – on FCR User Portal (see later)
    – also visible on SAM portal

COD
  Operator on Duty
    – 12 ROCs in a weekly rotation
    – follow-up of site problems
  COD dashboard
    – main tool for CODs to use in the operations
  Critical (ops) SAM tests raise alarms
    – displayed on the COD dashboard
    – processed by the operators
    – alarm masking: focus on the real problems

COD dashboard

VO-specific results
  VO-specific test submissions
    – also: VO-specific tests for some VOs
  VO's Critical Tests
    – selected from tests submitted by the VO
        usually ops tests, if there's no SAM submission for the VO
    – determine the status of a resource for the VO
  resource status used
    – in Availability Metrics
    – by VOs to select resources that should be used (FCR)

FCR – User View
  checking service status for multiple VOs

SAM Portal

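Availability metrics - algorithm
  Status of node N:
    $$\mathrm{Status}(N) = \bigwedge_{t \in \mathrm{CriticalTests}} \mathrm{TestResult}(N, t)$$
  Status of service C:
    $$\mathrm{Status}(C) = \bigvee_{N \in \mathrm{instances}(C)} \mathrm{Status}(N)$$
  Status of site S:
    $$\mathrm{Status}(S) = (\mathrm{CE}_1 \vee \mathrm{CE}_2 \vee \dots \vee \mathrm{CE}_n) \wedge (\mathrm{SRM}_1 \vee \mathrm{SRM}_2 \vee \dots \vee \mathrm{SRM}_n) \wedge \mathrm{site\ BDII}$$
  $\wedge$ = boolean AND, $\vee$ = boolean OR
  Everything is calculated for each VO that defined critical tests in FCR
  Results make sense only if VO submits tests!!!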
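The AND/OR combination on this slide translates directly into boolean reductions. The Python sketch below follows the slide's three levels (node, service, site); the function names, the test names in the example and the rule that a critical test with no result counts as failed are assumptions added for illustration.

```python
from typing import Dict, List, Set


def node_status(test_results: Dict[str, bool], critical_tests: Set[str]) -> bool:
    """AND over the critical tests; a missing result counts as failed (assumption)."""
    return all(test_results.get(test, False) for test in critical_tests)


def service_status(node_statuses: List[bool]) -> bool:
    """OR over all instances of one service type."""
    return any(node_statuses)


def site_status(services: Dict[str, List[bool]]) -> bool:
    """AND over the service types present at the site."""
    return all(service_status(nodes) for nodes in services.values())


critical = {"job-submit", "ca-check"}  # hypothetical critical test names
ce1 = node_status({"job-submit": True, "ca-check": False}, critical)
ce2 = node_status({"job-submit": True, "ca-check": True, "rgma": False}, critical)
srm1 = node_status({"put": True}, {"put"})

print(site_status({"CE": [ce1, ce2], "SRM": [srm1], "site BDII": [True]}))
```

Here CE1 fails a critical test but CE2 passes, so the CE service is up and the site is up; the failing non-critical `rgma` result on CE2 has no effect. In this sketch a service type with no instances evaluates as down.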

Availability metrics - algorithm II
  service and site status every hour
  daily, weekly, monthly availability
  scheduled downtime information from GOCDB
  details of the algorithm on GOC:
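One simple reading of "hourly status, then daily availability" is the fraction of hourly samples in which the resource was up. The sketch below computes only that; it is an assumption, not the documented algorithm, and it leaves scheduled downtime out because the slide does not say how the GOCDB information enters the calculation.

```python
from typing import List


def availability(hourly_status: List[bool]) -> float:
    """Fraction of hourly samples in which the resource was up."""
    if not hourly_status:
        raise ValueError("no hourly samples")
    return sum(hourly_status) / len(hourly_status)


# Hypothetical day: up for 18 hours, down for 6.
day = [True] * 18 + [False] * 6
print(availability(day))
```

Weekly or monthly figures would follow the same pattern over a longer list of hourly samples.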

Availability metrics - GridView

Availability metrics - data export

Thanks for your attention!