EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Service Availability Monitoring – Status.


1 EGEE-III INFSO-RI-222667
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Service Availability Monitoring – Status & Plans
John Shade, CERN
SA1 coordination meeting, Abingdon

2 Main Objectives
- Increase grid availability and reliability
- Provide Site managers with tools that pin-point malfunctioning services – and provide enough information for trouble-shooting
- Collect statistics on service availability for management purposes (WLCG/EGEE)
- Continue to provide a flexible framework that allows VOs to integrate their own tests
- Encourage sites/regions to become more autonomous
- Don’t break anything along the way!

3 What needs to be improved?
- Allow more flexible availability calculations
  – VO-specific; allow comparisons & consistent recalculations (i.e. maintain topology snapshots)
- If SAM web-service is down, test results go to the big bit-bucket in the sky
  – Use of Message Bus buffering will solve this particular problem
- Sites are often “blind” during central outages
  – Nagios at the site will address this issue
- Need to expand coverage of tests
  – Develop GFAL tests, refine LFC tests, etc.
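The buffering point above can be illustrated with a minimal store-and-forward sketch. It is not the SAM client or msg-publisher API: the class name `BufferedPublisher`, the `send` callback and the dictionary result format are all hypothetical, and the buffer lives in memory only, whereas a real message bus would normally persist queued messages.

```python
from collections import deque
from typing import Callable, Deque, Dict

Result = Dict[str, str]  # hypothetical shape of one test result


class BufferedPublisher:
    """Hypothetical store-and-forward publisher (illustration only)."""

    def __init__(self, send: Callable[[Result], None]) -> None:
        # `send` delivers one result and raises ConnectionError when the
        # receiving endpoint is unavailable (an assumption of this sketch).
        self._send = send
        self._pending: Deque[Result] = deque()

    def publish(self, result: Result) -> None:
        """Queue the result, then try to deliver everything queued so far."""
        self._pending.append(result)
        self.flush()

    def flush(self) -> int:
        """Send queued results oldest-first; return how many were delivered."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # endpoint is down: keep the backlog for a later attempt
            self._pending.popleft()  # drop only after a successful send
            delivered += 1
        return delivered

    @property
    def backlog(self) -> int:
        """Number of results still waiting to be delivered."""
        return len(self._pending)
```

A result is removed from the queue only after `send` returns without error, so an outage delays results instead of discarding them, and ordering is preserved once the endpoint comes back.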

4 Architecture – Regional Level

5 Architecture – phased approach
- Site Nagios runs more or less stand-alone
- Use NDO Utils to populate regional MySQL DB
- Use Nagios Checker Firefox plug-in
- Provide central DB for collecting availability metrics
- Run a large Nagios instance(s) that emulates SAM
- Send VM images of above to regions

6 Nagios Checker

7 Architecture of the regional solution
- Use Nagios to occasionally probe sites from ROC
- Have a standard set of components inside the region for:
  – Storing topology of regional grid
  – Storing metric results from probes
  – Raising alarms
  – Raising tickets
  – Viewing metric history and details for debugging
- Central data stores and components for project-level systems
  – Project-level metric store
  – Topology Database with history
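As an illustration of what a Nagios probe run from the ROC might look like, here is a minimal sketch of the plugin convention: a probe reports its state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of output, optionally followed by performance data after a `|`. The TCP connect check, the function names and the thresholds are assumptions chosen for illustration; they are not the actual SAM or EGEE probes.

```python
import socket
import time
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def measure_connect_time(host: str, port: int, timeout: float = 10.0) -> Optional[float]:
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def evaluate(elapsed: Optional[float], warn: float, crit: float) -> Tuple[int, str]:
    """Map a measurement to a (return code, status line) pair."""
    if warn > crit:
        return UNKNOWN, "UNKNOWN - warning threshold exceeds critical threshold"
    if elapsed is None:
        return CRITICAL, "CRITICAL - connection failed"
    if elapsed >= crit:
        code = CRITICAL
    elif elapsed >= warn:
        code = WARNING
    else:
        code = OK
    # Text after '|' is performance data: label=value;warn;crit
    line = f"{LABELS[code]} - connect time {elapsed:.3f}s | time={elapsed:.3f}s;{warn};{crit}"
    return code, line
```

A wrapper script would call `measure_connect_time` for the service endpoint, print the status line returned by `evaluate`, and exit with the returned code so that Nagios can record the state.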

8 SAM Work Plan
- December 2008
  – SAM ready to use Message Bus
    • SAM client on UI and WNs uses msg-publisher
    • Msg-consume2oracle used to retrieve messages from a topic and insert in SAM DB
    • Everything tested in Validation
  – Have Nagios equivalents of current SAM sensors used for WLCG/EGEE Availability calculations
    • SRMv2; CE & CREAM CE; sBDII
    • NCG templates available
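The consume-and-insert step can be sketched roughly as follows. This is not the msg-consume2oracle implementation: SQLite stands in for the Oracle SAM DB, an iterable of JSON strings stands in for the message topic, and the table name `test_results` and its fields are hypothetical.

```python
import json
import sqlite3
from typing import Iterable

# Hypothetical result table; the real SAM DB schema is not shown in the slides.
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_results (
    site       TEXT NOT NULL,
    service    TEXT NOT NULL,
    metric     TEXT NOT NULL,
    status     TEXT NOT NULL,
    checked_at TEXT NOT NULL,
    UNIQUE (site, service, metric, checked_at)
)
"""
FIELDS = ("site", "service", "metric", "status", "checked_at")


def consume(messages: Iterable[str], db: sqlite3.Connection) -> int:
    """Insert each well-formed message once; return the number of new rows."""
    db.execute(SCHEMA)
    inserted = 0
    for raw in messages:
        try:
            record = json.loads(raw)
            row = tuple(record[field] for field in FIELDS)
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed or incomplete messages
        # The UNIQUE constraint plus OR IGNORE makes a redelivered message a no-op.
        cursor = db.execute(
            "INSERT OR IGNORE INTO test_results VALUES (?, ?, ?, ?, ?)", row
        )
        inserted += cursor.rowcount
    db.commit()
    return inserted
```

Ignoring duplicates is a design choice of this sketch: it assumes the bus may deliver the same message more than once, so the insert is made idempotent rather than relying on exactly-once delivery.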

9 January 2009

10 Future plans
- Now we need a schema for the new metric store
  – Possibly an extension of the NDO schema
- Need to publish Nagios test results to new store and compare contents with current SAM DB
  – Use availability calculations to do this, rather than a portal
- Need tools to view test results and generate alarms
  – GridView to view metrics collected at project level
  – Use NDO add-ons to visualize contents of regional MySQL DB, or possibly build a custom portal?
  – Use built-in escalation functionality of Nagios for sending alarms to MsgBus, from where “alarm-DB” intelligence can feed GGUS
  – Use Firefox plug-in to view regional alerts from ROC Nagios
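One way to read "use availability calculations" as a comparison method is sketched below: compute a per-site availability figure from each store and flag sites where the two disagree. The availability definition here (fraction of samples with status OK) is a deliberate simplification and not the WLCG/GridView algorithm; the function names, the sample format and the 5% tolerance are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, str]  # (site, status), a hypothetical flattened test result


def availability(samples: Iterable[Sample]) -> Dict[str, float]:
    """Fraction of samples per site whose status is OK (simplified definition)."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for site, status in samples:
        totals[site] = totals.get(site, 0) + 1
        if status == "OK":
            passed[site] = passed.get(site, 0) + 1
    return {site: passed.get(site, 0) / count for site, count in totals.items()}


def disagreements(
    sam_db: Iterable[Sample], metric_store: Iterable[Sample], tolerance: float = 0.05
) -> List[str]:
    """Sites missing from one store or whose availability differs beyond tolerance."""
    old, new = availability(sam_db), availability(metric_store)
    flagged = []
    for site in sorted(set(old) | set(new)):
        if site not in old or site not in new:
            flagged.append(site)  # present in only one of the two stores
        elif abs(old[site] - new[site]) > tolerance:
            flagged.append(site)
    return flagged
```

Comparing an aggregate per site tolerates small timing differences between the two collection paths, which a row-by-row comparison of the stores would report as mismatches.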

11 …and Timelines
- April 2009:
  – 1st implementation of new metric store, new topology database and Probe Description Database
  – Design of new SAM portal completed
- June 2009:
  – Switch Regional Nagios metric publication to new metric store
  – Adapt submission framework to use new topology database
  – SAM portal layered on new metric store
  – 3rd party tool selected/customized for viewing NDO database
- This is ambitious!
  – VOs will need some hand-holding to migrate their tests
  – We rely on a few key individuals who have other engagements
  – We have an existing production service to maintain!

