Grid Monitoring and Diagnostic Tools: GridICE, GSTAT, SAM Giuseppe Misurelli INFN-CNAF giuseppe.misurelli cnaf.infn.it.

Presentation transcript:

1 www.ccr.infn.it http://grid.infn.it/
Grid Monitoring and Diagnostic Tools: GridICE, GSTAT, SAM
Giuseppe Misurelli, INFN-CNAF, giuseppe.misurelli cnaf.infn.it
1st INFN Training Course for Grid Site Administrators, Martina Franca, 5-9 November 2007

2 Disclaimer
This presentation is based on materials provided and authorized by the EGEE project and is freely available to download and use under the terms of the following license: http://creativecommons.org/licenses/by-nc-sa/2.5/

3 Outline: Monitoring goals; Monitoring Procedure; Fabric Monitoring; INFNGrid Monitoring tools

4 Outline: Monitoring goals; Monitoring Procedure; Fabric Monitoring; INFNGrid Monitoring tools

5 Grid Monitoring
Grid monitoring has to provide knowledge of the type, state and features of the resources constituting the Grid by means of:
– Grid Resources Inventory
– Grid Resources Behavior
– Grid Resources Availability

6 Grid Resources Inventory
Instantaneous picture of the resources constituting the Grid, giving an idea of how Grid resources are shared among sites:
– Number of Computing Elements (CE), Worker Nodes (WN) and Storage Elements (SE)
– Number of jobs running and waiting across the whole Grid and per VO

7 Grid Resources Behavior
Measuring a set of evolving data to investigate historical/statistical aspects of a Grid:
– Percentage of jobs aborted at a site for a particular Virtual Organization (VO) in a given period of time
– Duration of a fault condition for a particular service or Grid process
– Percentage of CPU/RAM usage during Grid activity
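As an illustration of the first behavior metric, the aborted-job percentage for one site and one VO over a time window could be computed roughly as follows. This is only a sketch: the `JobRecord` fields, the status strings and the sample data are hypothetical and do not come from any of the tools presented here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class JobRecord:
    # Hypothetical record layout, chosen only for this example.
    site: str
    vo: str
    finished_at: datetime
    status: str  # assumed values: "done" or "aborted"


def aborted_percentage(jobs, site, vo, start, end):
    """Percentage of jobs aborted at `site` for `vo` with start <= finished_at < end.

    Returns None when no job matches, so that "no data" is not confused with 0%.
    """
    selected = [
        j for j in jobs
        if j.site == site and j.vo == vo and start <= j.finished_at < end
    ]
    if not selected:
        return None
    aborted = sum(1 for j in selected if j.status == "aborted")
    return 100.0 * aborted / len(selected)


# Made-up sample data.
jobs = [
    JobRecord("SITE-A", "atlas", datetime(2007, 11, 5, 10), "done"),
    JobRecord("SITE-A", "atlas", datetime(2007, 11, 5, 11), "aborted"),
    JobRecord("SITE-A", "atlas", datetime(2007, 11, 6, 9), "done"),
    JobRecord("SITE-A", "cms", datetime(2007, 11, 6, 9), "aborted"),      # other VO
    JobRecord("SITE-A", "atlas", datetime(2007, 11, 12, 9), "aborted"),   # outside window
]
window_start, window_end = datetime(2007, 11, 5), datetime(2007, 11, 10)
atlas_aborted = aborted_percentage(jobs, "SITE-A", "atlas", window_start, window_end)
```

Here three atlas jobs fall inside the window and one of them aborted, so `atlas_aborted` is one third expressed as a percentage.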

8 Grid Resources Availability
Evaluating the accessibility of the main Grid services at Regional, Site and VO level to improve Grid usage:
– Grid services currently down (e.g. CE, WN, SE)
– Grid site components currently not working properly (e.g. authentication and authorization, job submission, data management)
– Current job load at a given site
– Current min/max number of free slots where jobs can be submitted

9 Outline: Monitoring goals; Monitoring Procedure; Fabric Monitoring; INFNGrid Monitoring tools

10 Day by Day Operations /1
INFNGrid must be monitored daily by both the ROC team and the Site Managers to test its functionality
– Service Level Agreement according to the Memorandum of Understanding
  – Each site must provide a production-level Grid service

11 Day by Day Operations /2
The monitoring procedure is based on:
– Problem Detection and Diagnosis
  – Use of monitoring tools
  – Low-level checks on site
– Problem Tracking (see next talk on Support Systems)
  – Use of a helpdesk ticketing system

12 Grid Site Monitoring: General Requirements
Scale efficiently as the number of monitored nodes increases
Use lightweight sensors
– Avoid overloading computers
Publish reliable data
– A hard task in a Grid environment
Send notifications on daemon/machine problems
Take action in case of problems with services
Allow metrics to be added easily
– New parameters of interest must be added without too much work
Be “Grid Aware”

13 Outline: Monitoring goals; Monitoring Procedure; Fabric Monitoring; INFNGrid Monitoring tools

14 Monitoring Cluster Systems
Use of systems that spot outages and notify system administrators via email, pager or other alarms
Top systems used at Grid sites:
– Ganglia
  – http://ganglia.sourceforge.net/
– Lemon
  – http://lemon.web.cern.ch/lemon/doc/howto/lemonization_howto.shtml
– Monit
  – http://www.tildeslash.com/monit/

15 Ganglia
PRO:
Open source project developed by the University of California, Berkeley
Adopted by many sites
Easy to install and manage
Useful charts
– Spikes are easy to detect, thanks to the configurable update interval
Easy to add new metrics
CONS:
Alarms and reactions to failures are not available
Problems scaling to hundreds or thousands of nodes with high-frequency sampling
Not aware of gLite grid services
Data can be stored only in an RRD “DB”
– No detailed historical data are available
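The last point follows from how a round-robin database works: each archive has a fixed number of slots, old values are overwritten, and longer time ranges are kept only as consolidated (for example averaged) values. The sketch below illustrates that idea only; it is not RRDtool's implementation, and the class name, slot counts and sample values are invented for the example.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive: once full, each new slot overwrites the oldest one."""

    def __init__(self, slots, samples_per_slot=1):
        self.values = deque(maxlen=slots)      # deque drops the oldest entry itself
        self.samples_per_slot = samples_per_slot
        self._pending = []

    def update(self, value):
        # Collect raw samples and store their average once a slot is complete.
        self._pending.append(value)
        if len(self._pending) == self.samples_per_slot:
            self.values.append(sum(self._pending) / len(self._pending))
            self._pending = []


fine = RoundRobinArchive(slots=4)                        # recent data, full detail
coarse = RoundRobinArchive(slots=4, samples_per_slot=3)  # longer range, averaged

for load in [1, 1, 10, 1, 1, 1]:   # made-up load samples with one spike
    fine.update(load)
    coarse.update(load)
```

After the six updates the fine archive holds only the last four samples, while the coarse archive covers the whole period but shows the spike of 10 only as an average of 4. This is the sense in which detailed history is lost.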

16 Lemon /1
PRO:
Open source project developed by CERN
Its goal is to provide a monitoring system that scales to thousands of nodes without problems
A detailed history can be kept using an Oracle DB as RDBMS
Many advanced parameters can be monitored using standard sensors
Less PRO:
LEMON can also be installed without a DB back-end
– With less functionality
It has alarms and reactions on failure
– The complete set of functions is available only with a DB back-end installation
Configuration already available for some grid services
– Must be customized for the site

17 Lemon /2
CONS:
Not so easy to install and manage
Not so simple to add metrics or checks
A more “friendly” DB back-end is not available yet
No hourly graph, which can make spikes hard to detect

18 Monit
PRO:
Public open source project
Good base of standard checks for well-known services
Lightweight; easy to install, configure and manage
Simple built-in HTTP server to check the status of each machine
CONS:
Not really a “monitoring system” but an “alert system”
A single web page with the status of all monitored machines is not available yet
No charts available yet

19 Outline: Monitoring goals; Monitoring Procedure; Fabric Monitoring; INFNGrid Monitoring tools

20 Monitoring Grid Systems
The INFNGrid project adopts three main Grid monitoring tools to check whether its Grid resources and services work as expected:
– GridICE
  – http://gridice4.cnaf.infn.it:50080/gridice
– GSTAT
  – http://gstat2.gridops.org/gstat/Italy.html
– SAM
  – https://lcg-sam.cern.ch:8443/sam/sam.py

21 GridICE: Overview
Based on the gLite Information System
– Daily discovery of new GRISes
– Periodic queries to the discovered GRISes (every 10-30 min)
  – CE, SE, Site BDII: standard Glue info published
  – Extended GRIS (EX GRIS): host info (e.g. daemon monitoring), job monitoring, computing info gathered from the site Local Resource Management System
– Information collected in a central RDBMS and published on the Web
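To give an idea of what a periodic query yields, the sketch below parses LDIF text of the kind an LDAP query against a GRIS or Site BDII might return and sums running and waiting jobs over the published CE queues. The attribute names follow the GLUE 1.x schema; the host names, queue names and numbers are made up, the parser is deliberately simplified (no folded lines, no multi-valued attributes), and nothing here is GridICE's actual code.

```python
# Made-up sample of LDIF output for two CE queues.
SAMPLE_LDIF = """\
dn: GlueCEUniqueID=ce01.example.org:2119/jobmanager-lcgpbs-short,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce01.example.org:2119/jobmanager-lcgpbs-short
GlueCEStateRunningJobs: 12
GlueCEStateWaitingJobs: 3

dn: GlueCEUniqueID=ce01.example.org:2119/jobmanager-lcgpbs-long,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce01.example.org:2119/jobmanager-lcgpbs-long
GlueCEStateRunningJobs: 40
GlueCEStateWaitingJobs: 25
"""


def parse_entries(ldif_text):
    """Split LDIF text into one dict per entry (blank line = entry separator).

    Simplification: a repeated attribute overwrites the previous value and
    folded (continued) lines are not handled.
    """
    entries = []
    for block in ldif_text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition(": ")
            entry[key] = value
        if entry:
            entries.append(entry)
    return entries


def job_totals(entries):
    """Sum running and waiting jobs over all parsed CE entries."""
    running = sum(int(e.get("GlueCEStateRunningJobs", 0)) for e in entries)
    waiting = sum(int(e.get("GlueCEStateWaitingJobs", 0)) for e in entries)
    return running, waiting


entries = parse_entries(SAMPLE_LDIF)
running, waiting = job_totals(entries)
```

A real collector would also have to handle unreachable servers and stale data, which this sketch ignores.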

22 GridICE: Geo View

23 GridICE: Site View Standard Parameters /1
Downtime status (from GOC DB)
Country information (from GridICE detection mechanism)
Administrative information (from GOC DB)

24 GridICE: Site View Extended Parameters
Site job load, as a measure of how busy the site is: ((CPU# - CPUFree) / CPU#) * 100
Power estimation, calculated by adding the power value (SpecInt) of each CPU of the site
WN and CPU number
CPULoad, computed from the load1min reported by the LRMS for all the WNs
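The first two extended parameters can be written down directly from their definitions. The following is a minimal sketch with hypothetical function names and made-up input values; the CPULoad parameter is left out because the slide does not say how the per-WN load1min values are combined.

```python
def site_job_load(total_cpus, free_cpus):
    """Site job load in percent: ((CPU# - CPUFree) / CPU#) * 100."""
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    return (total_cpus - free_cpus) / total_cpus * 100


def power_estimation(specint_per_cpu):
    """Site power: sum of the SpecInt value of each CPU of the site."""
    return sum(specint_per_cpu)


# Made-up example: 200 CPUs, 50 of them free, and three CPUs with SpecInt ratings.
load_percent = site_job_load(total_cpus=200, free_cpus=50)
site_power = power_estimation([1000, 1000, 1500])
```

With these inputs the site is 75% busy and its estimated power is 3500 SpecInt.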

25 GridICE: Site View Standard Parameters /2
Number of available gatekeepers (CE)
Number of configured queues on CE
Running and waiting jobs

26 GridICE: Site View Standard Storage Parameters
Available space, total space and percentage used on the storage element of the site

27 GridICE: Site View Monitored Hosts
Number of monitored hosts per site

28 GridICE: Host View General
Use Case 2: Grid operator – Site administrator
Detecting Resource Brokers with problems

29 GridICE: Host View Details

30 GridICE: GRIS View General
Use Case 3: Grid operator – Site administrator
Detecting GRIS status

31 GridICE: GRIS View Detail

32 Job View
Job section to track VO user activity in order to:
– Search among a huge number of jobs
– Inspect job resource consumption
– Aggregate job info based on VOMS attributes (next release)
  – Info selected according to the consumer ID (group/role)
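Aggregation by VOMS group and role could look roughly like the sketch below, which sums CPU time per (group, role) pair extracted from each job's VOMS FQAN. The FQAN strings follow the usual `/group/Role=.../Capability=...` layout, but the helper names, the job tuples and the numbers are invented for illustration; this is not the GridICE feature announced for the next release.

```python
from collections import defaultdict


def split_fqan(fqan):
    """Split a VOMS FQAN into (group, role); the role defaults to "NULL"."""
    parts = [p for p in fqan.split("/") if p]
    # Group components are the path elements that are not key=value pairs.
    group = "/" + "/".join(p for p in parts if "=" not in p)
    role = next((p.split("=", 1)[1] for p in parts if p.startswith("Role=")), "NULL")
    return group, role


def cpu_time_by_group_role(jobs):
    """Sum CPU seconds per (group, role); `jobs` holds (fqan, cpu_seconds) pairs."""
    totals = defaultdict(float)
    for fqan, cpu_seconds in jobs:
        totals[split_fqan(fqan)] += cpu_seconds
    return dict(totals)


# Made-up job records.
jobs = [
    ("/atlas/production/Role=production/Capability=NULL", 3600),
    ("/atlas/production/Role=production/Capability=NULL", 1800),
    ("/atlas/Role=NULL/Capability=NULL", 600),
]
totals = cpu_time_by_group_role(jobs)
```

The same grouping key could be used to filter what a given consumer is allowed to see, which is what "info selected according to the consumer ID" suggests.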

33 Chart View: Site manager viewpoint

34 SAM: CE Functionality Tests
You can customize your personal SAM interface with the desired tests, chosen from a list of possibilities:
– Job submission
– CA certificate version installed on WN
– Middleware version installed on WN
– Host certificate validity
– Replica management tests using lcg-utils
– Accessibility of experiments software directory
– Accessibility of VO management tools
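As an example of what one of these tests involves, a host certificate validity check essentially compares the certificate's expiry date with the current time. The sketch below uses Python's standard `ssl.cert_time_to_seconds` to parse a `notAfter` string; the status names and the 30-day warning threshold are assumptions made for the example, not SAM's actual test logic.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now_epoch):
    """Days left before expiry; `not_after` uses the certificate notAfter
    format, e.g. "Jan  5 09:34:43 2018 GMT"."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_epoch) / SECONDS_PER_DAY


def certificate_status(not_after, now_epoch, warn_days=30):
    """Classify a certificate; status names and threshold are hypothetical."""
    remaining = days_until_expiry(not_after, now_epoch)
    if remaining < 0:
        return "error"      # already expired
    if remaining < warn_days:
        return "warning"    # expires soon
    return "ok"


# Made-up expiry date, checked at three different points in time.
not_after = "Jan  5 09:34:43 2018 GMT"
expiry = ssl.cert_time_to_seconds(not_after)
status_long_before = certificate_status(not_after, expiry - 60 * SECONDS_PER_DAY)
status_shortly_before = certificate_status(not_after, expiry - 10 * SECONDS_PER_DAY)
status_after = certificate_status(not_after, expiry + 1)
```

Passing the current time as a parameter keeps the check deterministic and easy to test.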

35 SAM: SE and LFC Functionality Tests
SE functionality tests
– File copy & register from UI using lcg-cr
– File retrieval to the UI using lcg-cp
– File delete using lcg-del
LFC functionality tests
– Directory listing using lfc-ls
– File entry creation

36 SAM: Error Investigation

37 GSTAT: Overview
Based on the gLite Information System
Uses scripts to generate web-accessible reports
Scripts are executed periodically (every 15 min) to query and collect the information published by each site
The retrieved information is processed by an analysis framework that checks for failures and errors
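The collect-then-analyse cycle described above can be sketched as a simple loop. Everything in this example is hypothetical: the site dictionary layout, the individual sanity checks and the function names are invented to show the pattern and are not GSTAT's actual checks.

```python
import time

CHECK_INTERVAL_SECONDS = 15 * 60  # the 15-minute period mentioned above


def analyse(site):
    """Return a list of problems found in the information published by a site."""
    problems = []
    if not site.get("reachable", False):
        return ["information system not reachable"]
    if site.get("total_cpus", 0) <= 0:
        problems.append("no CPUs published")
    elif site.get("free_cpus", 0) > site["total_cpus"]:
        problems.append("free CPUs exceed total CPUs")
    return problems


def run(fetch_sites, report, cycles, sleep=time.sleep):
    """Query all sites, analyse and report, then wait for the next cycle."""
    for cycle in range(cycles):
        for name, site in fetch_sites().items():
            report(name, analyse(site))
        if cycle < cycles - 1:
            sleep(CHECK_INTERVAL_SECONDS)


# Made-up data; the sleep is replaced by a no-op so the example returns at once.
def fake_fetch():
    return {
        "SITE-A": {"reachable": True, "total_cpus": 100, "free_cpus": 20},
        "SITE-B": {"reachable": True, "total_cpus": 10, "free_cpus": 50},
        "SITE-C": {"reachable": False},
    }


results = {}
run(fake_fetch, lambda name, problems: results.update({name: problems}),
    cycles=1, sleep=lambda seconds: None)
```

Injecting `fetch_sites`, `report` and `sleep` keeps the loop independent of how the information is actually queried and published.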

38 GSTAT: General View

39 GSTAT: Site Details

40 References
GridICE - Web site
– http://gridice.forge.cnaf.infn.it//
GSTAT - Web doc
– http://gstat2.gridops.org/gstat/filter_help.html
SAM - Article
– Global Grid Monitoring: the EGEE/WLCG case
  – High Performance Distributed Computing. Proceedings of the 2007 workshop on Grid monitoring
Overview of Grid Monitoring Tools - Article
– A taxonomy of grid monitoring systems
  – Future Generation Computer Systems, Volume 21, Issue 1, 1 January 2005, Pages 163-188

