EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Site Monitoring for Grid Services WLCG Grid.

1 EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Site Monitoring for Grid Services WLCG Grid Services Monitoring Working Group Ian Neilson – CERN HEPSYSMAN, Imperial College, London Material from Emir Imamagic SRCE, Ronald Starink NIKHEF, Max Boehm EDS/Openlab

2 Overview
Intro: WLCG Grid Services Monitoring WG
–Mandate and approach
Nagios prototype framework
–Nagios framework
–Nagios-based grid monitoring: Architecture, Grid extensions, Standard components
–Demo/Pics
–Current status & Future Work
Some real early-adopter feedback
–Experience and wishes
A little bit of ganglia
Aside (if there is time): Gridmap visualisation
All material extracted from presentations at EGEE’07
–http://indico.cern.ch/sessionDisplay.py?contribId=297&sessionId=37&confId=18714
–http://tinyurl.com/2gop7v

3 WLCG Monitoring Working Groups
3 groups created by Ian Bird, Oct’06
–“….to help improve the reliability of the grid infrastructure….”
–“…. provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service. …”
–“… stakeholder are site administrators, grid service managers and operations, VOs, Grid Project management”
Grid Services: Grid sensors, Transport, Metric Repositories, Views, …
System Management: Fabric management, Best Practices, Security, …
System Analysis: Application monitoring, …

4 WLCG Grid Monitoring Landscape
Columns: Domain; Monitoring Tools in use; 3 WLCG Monitoring Working Groups
–Local Resources (site): Lemon/SLS, Nagios, Ganglia...; Local monitoring
–Grid Middleware (central services, site services): Gstat, SAM/GridView, GridICE, GridPP Real Time Monitor...; Grid Services monitoring
–Grid Applications: Experiment Dashboards...; Application monitoring
Slide by Max Böhm, EDS

5 WLCG Grid Monitoring Landscape
[Diagram of data flows between monitoring tools. Recoverable labels:]
Layers: Fabric Resources, Site Services, Grid Services, Central Services, App Layer
Resources: CPUs, TBs, batch
Components: BDII, CE, SE, RB (Resource Broker), LFC (File Catalog), LB, FTS, Info System, GOCDB (site registry), GStat, GridICE, SAM, GridView, RTM, Nagios, Exp. Dashb. (one per experiment), Ganga/Panda apps, AtlasProdDB, Experiment/VO (ATLAS, ...)
Transports: HTTP/XML pull, HTTP/XML push, HTTP/SOAP push, LDAP, DB access, RGMA, MonALISA, agents
Data and outputs: html site status + graphs; real time 3D job view; job state; VO jobs, data, site reliability; data transfer, job status, service availability; BDII + fabric/job infos; fabric infos (other monitoring tools); submit test jobs; results

6 High-level View of Monitoring
Work in 4 areas
Initial focus to help sites - this session.
Also visualization - Gridmap

7 Aims of Grid Services WG
Beginning to look at:
–messaging systems for reliable transport
–more integration with external monitoring (dashboards…)
–management/operations visualisation & reporting requirements
BUT the aim is always:
–NOT to provide yet another complete technical solution
–incrementally improve service reliability by consolidating existing solutions where possible
Please see the twiki for all the information and the links to the other WGs
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo

9 Site Grid Services Monitoring
The rest of this session concentrates on Nagios
BUT
Nagios is one (good) choice
You use a tool suited for your site and fabric
BUT
The standardized probe-set should be reusable
Data exchange specification allows standardized access to metrics
Work on configuration building should be reusable
AND
We want to help sites to deploy monitoring to:
–Improve their reliability
–Make their life easier

10 Nagios Prototype

11 Nagios Framework
Open source monitoring framework
–widely used & actively developed
Host and service problem detection and recovery
Provides a wide set of basic sensors
–easy to develop custom sensors
Centralized vs. distributed deployment
High configurability
–service dependencies, fine-grained notification options
Web interface
–status view, administration
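To illustrate why custom sensors are easy to develop: a Nagios check is any executable that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is not from the slides; the "QUEUE" label, the function names and the simple upper-threshold logic are assumptions chosen for illustration.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value onto a status using upper thresholds (assumed policy)."""
    if value is None:
        return UNKNOWN  # nothing could be measured
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, value, status):
    """Build the one-line plugin output; text after '|' is performance data."""
    return f"{label} {STATUS_NAMES[status]} - value={value} | {label.lower()}={value}"


def run_check(label, value, warn, crit):
    """Return (exit_code, output_line) for a single measurement."""
    status = evaluate(value, warn, crit)
    return status, format_output(label, value, status)
```

A real plugin would print the returned line and pass the returned code to `sys.exit()`, which is all Nagios needs to schedule, display and alert on the check.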

12 Nagios-based Grid Monitoring
Monitoring CRO-GRID Infrastructure (2004-2006)
–Globus Toolkit Pre-WS & WS, UNICORE, other services
–active recovery of services
–http://www.cro-ngi.hr
Monitoring EGEE resources in Central Europe (CE)
–core services since mid 2006
–all CE sites for 1st line support since September 2006
–http://nagios.ce-egee.org
Grid Services Monitoring (GSM) WG
–site monitoring prototype, mid 2007
–http://crnjak.srce.hr/nagios (egee.srce.hr)
–https://pps-monitoring.cern.ch/nagios (CERN-PPS)

13 Architecture
[Architecture diagram. Recoverable labels:]
Components: Monitoring server; Site nodes (CE, SE, LFC, …); Site BDII; MyProxy; Probe descriptions; Site admins
Interactions: Refresh proxy; Get VOMS proxy; Service checks; Live node checks; Get remote results; Get site’s & nodes information; Get nodes information; Get Nagios results; Get site status; Issue alarms

14 Grid Extensions
Standard probes
–provided by SRCE, CERN, OSG
–Security facilities & services: CA distribution, Certificate lifetime, MyProxy
–Monitoring & information services: R-GMA, BDII, MDS, GridICE
–Job management services: Globus Gatekeeper, RB, WMS, WMProxy, Job matching
–File management services: GridFTP, SRM, DPNS, LFC, FTS
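As an illustration of what a probe such as "Certificate lifetime" might decide, the sketch below maps the remaining validity of a certificate to a Nagios status. It is not the SRCE, CERN or OSG probe; the 14-day and 3-day thresholds and the function name are assumptions, and reading the expiry date from the certificate itself is left out.

```python
from datetime import datetime, timedelta, timezone

# Nagios status codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_lifetime(not_after, now, warn=timedelta(days=14), crit=timedelta(days=3)):
    """Return (status, message) for a certificate expiring at `not_after`.

    The warn/crit thresholds are illustrative defaults, not values from the slides.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return CRITICAL, "certificate has expired"
    if remaining <= crit:
        return CRITICAL, f"certificate expires in {remaining.days} day(s)"
    if remaining <= warn:
        return WARNING, f"certificate expires in {remaining.days} day(s)"
    return OK, f"certificate valid for {remaining.days} more day(s)"
```

Passing `now` explicitly rather than calling the clock inside the function keeps the decision logic easy to test.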

15 Grid Extensions
Probe description database
–probe dependencies
Remote gatherers
–SAM & NPM
Certificate based authentication for the web interface
–enables authorization
Nagios Config Generator (NCG), Publisher, Credential management
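One way to picture probe dependencies: if each probe description lists the probes it depends on, a topological sort gives a safe execution order, and a failed probe lets everything downstream be skipped instead of raising duplicate alarms. The probe names and the dictionary layout below are hypothetical; the slides do not show the actual database schema.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical probe descriptions: probe name -> probes it depends on.
PROBE_DEPS = {
    "proxy-valid": [],
    "bdii-query": [],
    "ce-job-submit": ["proxy-valid", "bdii-query"],
    "srm-put": ["proxy-valid"],
    "srm-get": ["srm-put"],
}


def execution_order(deps):
    """Order probes so that every probe comes after the probes it depends on."""
    return list(TopologicalSorter(deps).static_order())


def skipped_probes(deps, failed):
    """Return probes that need not run because a direct or indirect dependency failed."""
    skipped = set()
    for probe in execution_order(deps):
        if any(d in failed or d in skipped for d in deps.get(probe, ())):
            skipped.add(probe)
    return skipped
```

With these descriptions, a failing "proxy-valid" probe suppresses the job-submission and SRM probes, so an administrator would see one root-cause alarm rather than four.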

16 Standard Components
Probe wrapper
–enables integration of standardized probes
–Grid Monitoring Probes Specification
–https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringProbeSpecification
Publisher & remote gatherers
–integration with other tools
–Grid Monitoring Data Exchange Standard
–https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
Comments, contributions & probes welcome!
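The idea of a probe wrapper can be sketched as a function that runs an arbitrary probe executable and normalises its result into one record. The record fields ("status", "summary") and the function name are assumptions for illustration only; they are not the format defined by the Grid Monitoring Probes Specification, which should be consulted for the real one.

```python
import subprocess
import sys

# Nagios exit codes mapped to status names.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_probe(command, timeout=30):
    """Run a probe command and return a normalised result record (hypothetical layout)."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"status": "UNKNOWN", "summary": f"probe timed out after {timeout}s"}
    except OSError as exc:
        # Covers a missing or non-executable probe binary.
        return {"status": "UNKNOWN", "summary": f"probe could not be started: {exc}"}
    lines = proc.stdout.strip().splitlines()
    summary = lines[0] if lines else "no output"
    # Any exit code outside 0-3 is treated as UNKNOWN.
    return {"status": STATUS_NAMES.get(proc.returncode, "UNKNOWN"), "summary": summary}


# Usage example: a stand-in "probe" implemented as a one-line Python command.
demo = run_probe([sys.executable, "-c", "print('probe ran')"])
```

Because the wrapper only relies on the exit code and the first output line, probes from different providers can be plugged in without changes to the monitoring server.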

20 [Screenshot labels: SAM; Standard probes; NPM]

22 Current Status
Three sets of standard probes integrated
–SRCE, CERN, OSG
RPMs in apt and yum repository
–http://www.sysadmin.hep.ac.uk
Mailing list for community support of sites
–wlcg-monitoring-discuss@cern.ch
Deployments
–CERN-PPS, SRCE, NIKHEF, PIC
New volunteers welcome!

23 Nagios Prototype
NIKHEF Experience as early adopter

24 NIKHEF Experience
NIKHEF part of Dutch T1
–~ 400 cores, 90 TB storage, ~ 150 hosts
–Will grow ~ 10 x
–Site without active monitoring, only Ganglia
Motivation
–Recognized need for site monitoring
–Planned to look at Nagios
–Opportunity to get started!
–Hesitation: investment of time on short notice
Expectation
–Let's see what it does...
–Gain experience with Nagios

25 NIKHEF Experience
Initial setup:
–Dedicated host for monitoring
–Remote probes only (SAM)
–No gLite
–Only secure web server
Extended setup:
–New VM as gLite 3.0 UI
–Remote probes
–Local probes: CE, RB, BDII, site BDII, LFC, SE DPM, classic SE, MON
–Only secure web server
Future:
–Add local probes to dedicated host
–Use notifications, perhaps event handlers

26 NIKHEF Experience
Not difficult to set up
–Some manual actions
–Early adopter: some small issues
–Configuration script complex
Very useful!
–Almost immediate feedback when services are failing
–Used on a daily basis
–Lots of tests (SAM and local)
–Not always clear what they test. Documentation?
Some issues
–Permanently failing tests: SAM + local probes
–Occasionally failing tests, spontaneous recovery
–Proxy nearly expired: many tests failing?

27 NIKHEF Experience - Conclusions
Not difficult to set up
–Manual configuration
–Help from mailing list
Very useful!
–Good overview of service status
–Fast feedback
–Extensible
–Jeff: “For me it's already worth the investment”
Not yet “production quality”, but close
–Some permanently failing tests
–Does not yet feel 100% stable
Documentation on tests
–What is tested?
–What does a failure mean?
Future
–Part of infrastructure
–Add more tests, perhaps contribute tests
–Use alarms

28 Nagios Prototype
Future and Conclusions

29 Future Work
NCG modularization
–enables reuse for other monitoring tools (e.g. Lemon)
Enabling “on-host” checks via NRPE
–process, logs, ports, files, etc.
Simplify local probe execution
–executing local probes on existing gLite-UI nodes
–executing local probes without dteam membership
Probe description & site topology databases definition
Migration of credential management to robot certificates

30 Conclusions
Nagios
–highly configurable monitoring framework with notifications, service dependencies, …
–widely used by site admins
Grid extensions
–integration with existing infrastructure (user certificates, VOMS, GOCDB, SAM)
–probes for key grid services
Implementation of GSM WG specifications
–probe wrapper, publisher & remote gatherers
–easy integration with existing probes and monitoring systems

31 Ganglia
Using publisher interface to populate ganglia

32 Contact Us
Special thanks to the original authors
Questions?
wlcg-monitoring-discuss@cern.ch
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo

33 Visualizing the State of the Grid with GridMaps
Max Böhm, Rolf Kubli
CERN openlab / EDS
EGEE'07 Conference, 1-5 Oct 2007

34 Outline
–Collaboration CERN openlab / EDS
–Motivation
–GridMap Visualization
–Prototype
–Conclusions

35 Collaboration CERN openlab / EDS
EDS is a contributor member of the CERN openlab
The purpose of the joint project between CERN and EDS is to carry out research and development in the field of monitoring, management and operation of Grid services.
The CERN openlab is a framework for evaluating and integrating cutting-edge IT technologies or services in partnership with industry
[Photo labels: Mont Blanc (4810 m); Downtown Geneva]

36 Motivation
Better understanding the state of the Grid helps improve the reliability of Grid services
"You can't manage what you don't measure"
Many Grid monitoring tools are in use
–Service Availability Monitoring (SAM)
–GridView
–GStat
–Experiment Dashboard
–GridICE
–...

37 Motivation
But...
The Grid is a large distributed infrastructure
Grid monitoring data are complex!
Current tools visualize data by sorted tables, bar charts, etc.
Difficult to present an easy-to-understand top-level view which
- provides quick, action-oriented oversight and insight
- helps understand job failures and availability patterns
Can new visualizations help?

38 GridMap Visualization

39 GridMap Visualization
Idea
–visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
Example GridMap [figure labels: site; regions]
Size of rectangle is e.g.
- size of site (#CPUs)
- #running jobs
- ...

40 GridMap Visualization
Idea
–visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
Example GridMap
Colour of rectangle is e.g.
- SAM status of site / service
- Availability of site / service
- ...
Colour key: ok, degraded, down
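The treemap idea, rectangle size from one metric and colour from another, can be sketched in a few lines. The slides do not describe the actual GridMap Layout Algorithm, so the sketch below substitutes the simplest treemap variant (slice-and-dice: regions split the width, sites split each region's height). The colour names, function names and input layout are assumptions.

```python
# Hypothetical colour key for the three states shown on the slide.
STATUS_COLOURS = {"ok": "green", "degraded": "orange", "down": "red"}


def slice_layout(items, x, y, w, h, horizontal=True):
    """Split rectangle (x, y, w, h) among (name, weight) items in proportion to weight."""
    total = sum(weight for _, weight in items)
    rects = {}
    offset = 0.0  # fraction of the rectangle already handed out
    for name, weight in items:
        share = weight / total if total else 0.0
        if horizontal:
            rects[name] = (x + offset * w, y, share * w, h)
        else:
            rects[name] = (x, y + offset * h, w, share * h)
        offset += share
    return rects


def gridmap(regions, width, height):
    """regions: {region: {site: (cpus, status)}} -> list of (site, rect, colour)."""
    region_weights = [(r, sum(c for c, _ in sites.values())) for r, sites in regions.items()]
    region_rects = slice_layout(region_weights, 0, 0, width, height, horizontal=True)
    cells = []
    for region, sites in regions.items():
        rx, ry, rw, rh = region_rects[region]
        site_weights = [(s, c) for s, (c, _) in sites.items()]
        site_rects = slice_layout(site_weights, rx, ry, rw, rh, horizontal=False)
        for site, (_, status) in sites.items():
            cells.append((site, site_rects[site], STATUS_COLOURS.get(status, "grey")))
    return cells
```

Each cell's area is proportional to the site's CPU count, so a large site that is down dominates the picture, which is the point of the visualization. Production treemaps usually use a squarified layout to avoid thin slivers.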

41 Multiple Views
GridMaps can be used for top-level, geographical and VO views
[Diagram labels: Top-level View; Geographical Views (Federation, Partner, Site, etc.); VO Views (cross-location); Next level of GridMaps; Large-scale Federated Grid Services Infrastructure; Global GridMap; Application Domain GridMap; Local GridMap; Alert; Corrective action; effect]

42 Trends
Trends can be understood by looking at a sequence of GridMaps
Site Availability over time: 25 Sep 2007, 24 Sep 2007, 23 Sep 2007, 22 Sep 2007, 21 Sep 2007, 20 Sep 2007

43 More Views
Correlations of metrics can be discovered by switching between different views
Site Availability from different VO perspectives: LHCb, CMS, Atlas, Alice, OPS
Site Status of different Site Services: site BDII, SRM, SE, CE, Overall
sites without colour do not support the VO

44 Prototype

45 GridMap Prototype Architecture
[Diagram labels: Grid sites; existing monitoring system(s); GridMap Server; Web Browser (Title, view1, view2, view3)]
GridMap View
- Browser based Web 2.0 type client component
- single interactive and responsive web page (no page reloads required, data is retrieved in the background)
- fast switching between views possible
- details of the site/service statuses are shown as a context sensitive Tooltip
- POC implementation is based on HTML, lightweight JavaScript libraries, AJAX type communication pattern
GridMap Server
- provides client side code and client supporting services
- implements GridMap Layout Algorithm
- retrieves and caches data from existing monitoring systems
- POC implementation is based on Apache / Python
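The server's "retrieves and caches data" role can be pictured as a small time-based cache in front of each monitoring system, so that many browser requests trigger only one upstream query per interval. This is a sketch, not the prototype's code; the class name, the 300-second default and the injectable clock are assumptions.

```python
import time


class CachedSource:
    """Cache the result of an expensive fetch for a fixed time-to-live."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch      # callable that queries the monitoring system
        self._ttl = ttl_seconds  # how long a fetched value stays fresh
        self._clock = clock      # injectable so the cache can be tested without sleeping
        self._value = None
        self._expires = None     # None means nothing has been fetched yet

    def get(self):
        """Return the cached value, refetching only once it has expired."""
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value
```

A server handling concurrent requests would also need a lock around the refetch; that is omitted here to keep the sketch short.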

46 GridMap Prototype View Component
Link: http://gridmap.cern.ch
[Screenshot callouts:]
- Metric selection for colour of rectangles
- Show SAM status
- Show GridView availability data
- Grid topology view (grouping)
- Metric selection for size of rectangles
- VO selection
- Overall Site or Site Service selection
- Drilldown into region by clicking on the title
- Context sensitive information
- Colour Key
- Description of current view

47 GridMap Prototype: Link to Existing Tools
Clicking on a site opens a page with details in GridView/SAM
[Screenshot labels: Site Detail Availability; SAM Test Results]

49 Conclusions
–GridMaps are a new approach to visualizing complex monitoring data of the Grid
–The same type of visualization can be used for top-level, regional, and VO specific views
–GridMaps can identify correlations and availability patterns
–A prototype for visualizing SAM data has been implemented
–Can be used for visualizing other data, e.g. of experiments, alarms
–GridMap web component can be embedded into other tools, e.g. Dashboards (if you are interested, please contact us)
–GridMaps are a result of the CERN openlab / EDS collaboration which takes place within the CERN-IT Grid Deployment group

50 Contacts:
Dr. Max Böhm, EDS / CERN openlab, max.boehm@eds.com, max.boehm@cern.ch
Dr. Rolf Kubli, EDS Switzerland, rolf.kubli@eds.com
EDS and the EDS logo are registered trademarks of Electronic Data Systems Corporation. EDS is an equal opportunity employer and values the diversity of its people. © 2007 Electronic Data Systems Corporation. All rights reserved.

