Service Availability Monitoring

1 Service Availability Monitoring
Piotr Nyczyk, CERN/GD
GDB Meeting, CERN, 9th November 2005

2 Outline
- Aim of service availability monitoring
- Existing monitoring solutions: Site Functional Tests (SFT), GStat
- Integration using R-GMA
- Present use in the grid operations infrastructure: CIC Dashboard, CIC-on-duty operations
- Future plans and responsibilities

3 Purpose
Service availability monitoring:
- Two main goals:
  - Monitor grid services (without going into the site fabric level)
  - Verify SLAs and LCG MoUs
- Must be site-independent and monitor all sites in a uniform way
- Monitors both quantity and quality
- To be used by grid operations teams: the CIC-on-duty team
- Framework plus some metrics already in place: SFT and GStat

4 Existing monitoring solutions

5 Site Functional Tests (SFT)
Submits a short-lived job to all sites (Computing Elements) to test various aspects of functionality (using the WMS/RB):
- Job submission / CE availability: can I submit a job and retrieve the output?
- Basic environment tests: software version, BrokerInfo, CSH scripts
- Security tests: CA certificates, CRLs, ... (work in progress)
- Data management tests: basic operations using lcg-utils on the default Storage Element (SE) and a chosen "central" SE (usually at CERN, for 3rd-party replication tests)
- VO environment tests: tag management, software installation directory, plus VO-specific job submission and tests (maintained by the VOs; example: LHCb and dirac-test)
What is not covered?
- Only CEs and batch farms are really tested; other services (SEs, RB, top-level BDII, RLS/LFC, R-GMA registry/schema) are only tested indirectly, usually by causing failures at all sites at once
Maintained by the Operations Team at CERN (plus several external developers) and run as a cron job on LXPLUS (using an AFS-based LCG UI).
Jobs are submitted at least every 3 hours (plus on-demand resubmissions). A minimal sketch of such a test payload follows below.
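The payload is essentially a series of small checks run on the worker node, with one result line per test parsed from the retrieved job output. A minimal sketch, assuming the usual VO_<NAME>_SW_DIR and EDG_WL_RB_BROKERINFO environment conventions (check names and structure are invented for illustration; this is not the real SFT code):

```python
#!/usr/bin/env python
# Minimal sketch of an SFT-style test payload (illustration only, not the real
# SFT code). Each check returns (passed, detail); one result line is printed per
# test so the submitting framework can parse them from the retrieved job output.
import os

def check_vo_sw_dir(vo="dteam"):
    # Assumption: the VO software installation directory is advertised on the
    # worker node via the VO_<NAME>_SW_DIR convention.
    var = "VO_%s_SW_DIR" % vo.upper()
    path = os.environ.get(var, "")
    return bool(path) and os.path.isdir(path), "%s=%s" % (var, path or "<unset>")

def check_brokerinfo():
    # Assumption: the RB/WMS exports the BrokerInfo file location in
    # EDG_WL_RB_BROKERINFO (LCG-2 era convention).
    path = os.environ.get("EDG_WL_RB_BROKERINFO", "")
    return bool(path) and os.path.isfile(path), path or "<unset>"

CHECKS = [("vo-swdir", check_vo_sw_dir), ("brokerinfo", check_brokerinfo)]

if __name__ == "__main__":
    for name, check in CHECKS:
        passed, detail = check()
        print("%s %s %s" % (name, "OK" if passed else "FAIL", detail))
```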

6 SFT report
- Shows a results matrix with all sites
- A selection of "critical" tests per VO defines which sites are good or bad (see the sketch below)
- Detailed test logs are available for troubleshooting and debugging
- Deployed on two machines at CERN (load distribution, fault tolerance)
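In essence a site is good for a VO only if every test that VO marked as critical has passed. A minimal sketch of that rule (test names below are placeholders, not the actual SFT test identifiers):

```python
# Hypothetical illustration of the "critical tests" rule: a site counts as good
# for a VO only if all of that VO's critical tests passed.
CRITICAL_TESTS = {"lhcb": ["job-submission", "ca-certs", "sw-dir"]}  # placeholder names

def site_is_good(vo, results):
    """results maps test name -> True (pass) / False (fail)."""
    return all(results.get(test, False) for test in CRITICAL_TESTS.get(vo, []))

print(site_is_good("lhcb", {"job-submission": True, "ca-certs": True, "sw-dir": False}))  # False
```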

7 GIIS Monitor (GStat)
Monitoring tool for the Information System:
- Periodically queries all Site BDIIs (but does not monitor top-level BDIIs)
- Checks that the Site BDIIs are available
- Checks the integrity of the published information
- Checks for missing entities and attributes
- Detects and reports information about some of the services (RB, MyProxy, LFC) but does not monitor them
- Detects duplicated services in some cases (e.g. two global LFC servers for a single VO)
- Cron job maintained and run by Min-Hong Tsai in Taipei; updates every 5 minutes
The availability check amounts to an LDAP query against each Site BDII, as sketched below.
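A Site BDII is an LDAP server publishing the Glue schema, so the checks boil down to LDAP queries. A small sketch using the ldap3 library (an illustration, not GStat's actual implementation; the host and site name below are placeholders):

```python
# Sketch of the kind of LDAP query GStat issues against a Site BDII (Glue 1.x
# schema assumed); illustration only, not GStat's actual code.
from ldap3 import Server, Connection, ALL

def list_ces(host, site, port=2170):
    """Return the GlueCEUniqueIDs published by a Site BDII, or raise on failure."""
    server = Server(host, port=port, get_info=ALL)
    conn = Connection(server, auto_bind=True)      # BDIIs allow anonymous reads
    base = "mds-vo-name=%s,o=grid" % site          # Site BDII naming convention
    conn.search(base, "(objectClass=GlueCE)", attributes=["GlueCEUniqueID"])
    return [str(entry.GlueCEUniqueID) for entry in conn.entries]

# Example with placeholder values:
# print(list_ces("site-bdii.example.org", "EXAMPLE-SITE"))
```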

8 Integration using R-GMA
- R-GMA is now mature enough to be used as the "universal bus" for monitoring information
- SFT and GStat now publish their results to R-GMA
- The GOC DB is used to get the list of sites and nodes to monitor, which are published to R-GMA together with scheduled-downtime information
- We managed to solve the most obvious scalability problems:
  - ~170 sites
  - about 3.5 M tuples for one month of history at full detail
  - after one month, only summary information is kept
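Conceptually, each published result is a tuple keyed by site, node, test and timestamp, and the summaries keep only per-site aggregates. A hypothetical illustration of such a record (field names and example values are invented here; the real tables are defined in the R-GMA schema):

```python
# Hypothetical shape of the tuples SFT/GStat publish into R-GMA; the real table
# names and columns are defined in the R-GMA schema, not here.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TestResult:
    site: str            # site name as registered in the GOC DB
    node: str            # tested node, e.g. a CE hostname
    test: str            # test identifier, e.g. "job-submission"
    timestamp: datetime
    status: str          # "ok" / "warn" / "error"
    detail: str          # link or pointer to the full test log

example = TestResult("EXAMPLE-SITE", "ce.example.org", "job-submission",
                     datetime.now(timezone.utc), "ok", "https://example.org/log/1234")
```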

9 Prototype site availability metric
Using the current data schema and R-GMA we managed to integrate the monitoring information from SFT and GStat:
- A summary generator uses the list of critical tests to produce a summary per site: a binary value (good/bad), generated every hour
- A metric generator integrates the summaries over a time period (e.g. one day) to produce the availability metric, as sketched below
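Integrating the hourly good/bad summaries over a day reduces to the fraction of "good" hours. A minimal sketch, assuming the hourly summaries are already available as booleans:

```python
# Minimal sketch of the prototype availability metric: the fraction of hours in
# the period for which the site's summary was "good".
def availability(hourly_summaries):
    if not hourly_summaries:
        return 0.0
    return sum(1 for good in hourly_summaries if good) / len(hourly_summaries)

# One day of hourly summaries: good except for a 3-hour outage.
day = [True] * 21 + [False] * 3
print("availability = %.2f" % availability(day))   # 0.88
```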

10 Prototype site availability metric
Shows only the availability of computing resources.

11 CIC Dashboard
Main tool for the CIC-on-duty:
- Makes the CIC-on-duty job much easier
- Integrated view of the monitoring tools (summary): shows only failures and assigned tickets
- Detailed site view with a table of open tickets and links to monitoring results
- Single tool for ticket creation and notifications, with detailed problem categorisation and templates
- Ticket browser with highlighting of expired tickets
- Well maintained; adapts quickly to new requirements and suggestions (thanks to the developers!)

12 CIC Dashboard
[Dashboard screenshot: problem categories, sites list (reporting new problems), test summary (SFT, GStat), GGUS ticket status]

13 CIC-on-duty operations
- CIC-on-duty: currently 6 teams (CERN, IN2P3, RAL, INFN, Russia, Taipei) working in weekly shifts
- The operators watch emerging alarms (CIC Dashboard) and the monitoring tools (for details) and report problems
- Problems are submitted as tickets to GGUS (Remedy based), and both the ROC and the sites (Resource Centres, RC) are notified
- The ROC is responsible for timely problem resolution, otherwise the ticket is escalated
- Priorities and deadlines for tickets are set depending on site size (number of CPUs), as sketched below
- Everything here is described in detail in the Operations Manual, the primary document for the CIC-on-duty
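A sketch of how a size-dependent priority rule could look; the CPU thresholds and deadlines below are invented for illustration, the real values are defined in the Operations Manual:

```python
# Illustration only: ticket priority as a function of site size (number of CPUs).
# Thresholds and deadlines are invented; the real values come from the Operations Manual.
def ticket_priority(cpu_count):
    if cpu_count >= 1000:
        return "top priority", 1    # deadline in working days
    if cpu_count >= 100:
        return "urgent", 3
    return "less urgent", 5

print(ticket_priority(1200))        # ('top priority', 1)
```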

14 CIC-on-duty operations
But:
- Currently only failures of computing resources (CEs) are dealt with automatically
- Other services are only partially covered, as they are tested indirectly by the current tools (typically detected when all sites fail the same test)
- Top-down approach: the CIC-on-duty just spots and reports high-level problems and does not solve them
- Problems are then dispatched to lower-level instances (ROC, RC) for further analysis and resolution
- The CIC-on-duty provides expertise for ROCs (and sites) if necessary

15 Roadmap to service monitoring
What do we have?
- A framework and tests for computing resources (SFT) and the information system (GStat)
- An initial data schema that can be used for integration (R-GMA)
- A basic display and metric report for computing resources: a good example and starting point for other services
What do we need?
- Extend the SFT framework to meet the requirements of service sensors (a possible sensor interface is sketched below)
- Sensors measuring availability and performance for all grid services
- An integrated display
- An alarm system
- Integration of all the pieces into the existing operations tools
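One way to think about the service sensors is a small common interface that every service-specific probe implements, so the framework can measure availability and response time uniformly. A hypothetical sketch (the framework design is still being discussed between CERN and Lyon; all names here are invented):

```python
# Hypothetical sketch of a common service-sensor interface; not the agreed design.
import time
from abc import ABC, abstractmethod

class ServiceSensor(ABC):
    service = "generic"            # e.g. "SRM", "LFC", "FTS", "Top-level BDII"

    @abstractmethod
    def check(self, endpoint):
        """Probe one endpoint; return True if the service responds correctly."""

def run_sensor(sensor, endpoint):
    """Measure both availability and response time, as the roadmap asks for."""
    start = time.time()
    try:
        available = bool(sensor.check(endpoint))
    except Exception:
        available = False          # any probe failure counts as unavailable
    return {"service": sensor.service, "endpoint": endpoint,
            "available": available, "latency_s": time.time() - start}
```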

16 Responsibilities
Framework: being discussed between CERN and Lyon
Coordination of sensors: Piotr (CERN)

Service          Responsible        Class  Comments
SRM 2.1          Dave Kant          C      monitoring of Storage Elements t.b.d.
LFC              James Casey        C/H
FTS              FTS support
CE               Piotr Nyczyk              monitored by SFT today
RB                                         job monitor exists (few modifications needed)
Top-level BDII   Min-Hong Tsai             can be integrated with GStat
Site BDII                           H      monitored by GStat today
MyProxy          Maarten Litmaath
VOMS             Valerio Venturi
R-GMA            Laurence Field

A detailed plan with dates is being worked out and agreed with the responsibles.

