EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks InGRID: A Generic Autonomous Expert System.

InGRID: A Generic Autonomous Expert System for Grid Nodes Real Time Monitor and Control
Francesc Pérez Picazo (Telefónica I+D)
June KTH, Stockholm

INDEX
– INTRODUCTION
– DESIGN CONSIDERATIONS
– ARCHITECTURE
– DECISIONS MAKER
– CASE OF USE
– CONCLUSIONS

INTRODUCTION
The InGRID project started with EGEE, as a collaboration between Telefónica I+D and PIC (South-West ROC). The project studies the problem of managing a Grid resource centre and how an expert system can improve this task.

INTRODUCTION
Resource centre monitoring and operation:
– Computing centres must provide a high Quality of Service (QoS), which requires optimal monitoring and operation.
– Many open source monitoring tools are used, such as Nagios, Ganglia, etc.
  • They are excellent at displaying the raw service incidents.
  • They can generate alarms and filter them based on simple dependencies before notifying the operator.
– Even so, when a large number of alarms appear simultaneously, the operator can get overloaded.
  • HUMANS STILL HAVE LIMITATIONS

INTRODUCTION
The operator's human limitations could be:
– Delay due to work overload.
– Fatigue.
– Subjectivity.
– Lack of experience.
– Inattentiveness.
– High cost of training the operator.
To overcome these limitations, we propose InGRID. InGRID will:
– Manage the large amount of monitoring information.
– Act as a virtual operator, automatically executing actions on services to restore their regular state.
– Decrease the operator workload: if the action succeeds, no intervention of the operator is required.

DESIGN CONSIDERATIONS
Modularity
– InGRID should be adaptable to any monitoring tool.
  • It will be independent of the monitor, and complement it.
– We only require that the monitor publishes the status data somewhere it can be parsed.
Integration
– Each monitoring tool has its own set of particular functionalities.
Scalability
– A huge load of monitoring data will be generated.
Autonomy
– Resource centres are required to be available 24x7x365.
  • Only a few of them can afford to do this with at least one operator in attendance.
– InGRID will respond automatically to well-defined incidents without operator attendance.

DESIGN CONSIDERATIONS
Real-time actions
– Once an incident is detected, the expert system must find out which action is required.
– Decisions will be made in real time.
Easy and clear to administer
– The hardest work with any expert system is the administration.
  • The administrator of the expert system must define an automatic procedure and design a set of rules for each monitored service.
– To facilitate this administration task, InGRID will provide a clear graphical interface.

ARCHITECTURE

WORKFLOW
The InGRID workflow contains three main phases:
– Information Collection: the expert system collects the monitored status of each element being supervised.
– Decision Making: the system analyses the information collected, and makes decisions according to defined rules.
– Correcting: each decision corresponds to an action that must be carried out.
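The three phases can be sketched as one cycle of function calls. This is only an illustration: the slides do not show InGRID's real interfaces, so the function names, the `sensor -> state` dictionary and the `action:sensor` strings are all assumptions.

```python
from typing import Callable, Dict, List

Status = Dict[str, str]  # sensor name -> monitored state (hypothetical shape)

def collect(source: Callable[[], Status]) -> Status:
    """Information Collection: read the state of every supervised element."""
    return dict(source())

def decide(status: Status, rules: Dict[str, str]) -> List[str]:
    """Decision Making: turn each state that has a rule into a named action."""
    return [f"{rules[state]}:{sensor}"
            for sensor, state in sorted(status.items())
            if state in rules]

def correct(actions: List[str], execute: Callable[[str], None]) -> int:
    """Correcting: carry out every decided action; return how many ran."""
    for action in actions:
        execute(action)
    return len(actions)

def run_cycle(source, rules, execute) -> int:
    """One pass through the three phases, in order."""
    return correct(decide(collect(source), rules), execute)

executed: List[str] = []
count = run_cycle(
    lambda: {"httpd": "CRITICAL", "sshd": "OK"},  # stand-in for a real monitor
    {"CRITICAL": "restart"},                      # state -> action name
    executed.append,
)
```

Here only the `httpd` sensor has a state covered by a rule, so a single action reaches the execution phase.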

MONITORING (I)
The Monitor collects information about the status of each sensor.

MONITORING (II)
The monitoring information collected must be stored in an output database or text file...

MONITORING (III)
…so it can be dumped into the InGRID database.
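As a rough sketch of this dump step, the snippet below parses a monitor's text output and loads it into a database. The `host;sensor;state` line format, the `status` table and the use of SQLite are assumptions made for the example; the slides do not say which file format or database InGRID uses.

```python
import sqlite3

# Hypothetical monitor output: one "host;sensor;state" record per line.
SAMPLE_OUTPUT = """\
node01;Test1_Client;CRITICAL
node01;load_average;OK
"""

def dump_status(db: sqlite3.Connection, text: str) -> int:
    """Parse the monitor output and store the latest state of each sensor."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS status ("
        "host TEXT, sensor TEXT, state TEXT, PRIMARY KEY (host, sensor))"
    )
    rows = [tuple(line.split(";")) for line in text.splitlines() if line.strip()]
    # A newer reading for the same (host, sensor) replaces the older one.
    db.executemany("INSERT OR REPLACE INTO status VALUES (?, ?, ?)", rows)
    db.commit()
    return len(rows)

status_db = sqlite3.connect(":memory:")
loaded = dump_status(status_db, SAMPLE_OUTPUT)
```

Because the only contract is a parseable output, a different monitor would need a different parser but the same table.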

DECISION MAKING (I)
Data stored in the InGRID database feeds the internal memory of the Decisions Maker. The facts are matched against the Web Rule conditions.

DECISION MAKING (II)
When the facts and the Web Rule conditions match, a rule is activated, and an action is scheduled.

CORRECTING (I)
The Agent listens for actions scheduled by the Decisions Maker…

CORRECTING (II)
…and executes them.

ADMINISTRATION
The operator uses the web interface to administer the expert system and configure the Web Rules. Both the Monitor and the Agent are modules independent of the Decisions Maker. Communication between them goes through the database.
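One way to picture modules that communicate only through the database is a shared actions table: the Decisions Maker inserts rows and the Agent picks up the ones not yet executed. The table layout, the `done` flag and the function names below are hypothetical, and SQLite stands in for whatever database InGRID actually uses.

```python
import sqlite3

def schedule(db: sqlite3.Connection, action: str) -> None:
    """Decisions Maker side: record an action for the Agent to pick up."""
    db.execute("INSERT INTO actions (name, done) VALUES (?, 0)", (action,))

def run_pending(db: sqlite3.Connection, handlers: dict) -> list:
    """Agent side: execute every action not yet done, then mark it done."""
    pending = db.execute(
        "SELECT id, name FROM actions WHERE done = 0 ORDER BY id"
    ).fetchall()
    for action_id, name in pending:
        handlers[name]()  # the corrective action itself
        db.execute("UPDATE actions SET done = 1 WHERE id = ?", (action_id,))
    return [name for _, name in pending]

queue_db = sqlite3.connect(":memory:")
queue_db.execute(
    "CREATE TABLE actions (id INTEGER PRIMARY KEY, name TEXT, done INTEGER)"
)
log = []
handlers = {"restart_dummy": lambda: log.append("restarted")}

schedule(queue_db, "restart_dummy")
first_pass = run_pending(queue_db, handlers)   # the Agent finds one action
second_pass = run_pending(queue_db, handlers)  # nothing is left to do
```

Neither function calls the other, so either side could be replaced without touching its counterpart.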

DECISIONS MAKER

CLIPS (I)
The Decisions Maker is the core of the expert system. This module is based on rule-based programming, which is one of the most commonly used techniques for developing expert systems.
– Rule-based programming: “IF clause, THEN action”
– CLIPS

CLIPS (II)
CLIPS is a productive public domain development tool. It provides a complete environment for the construction of rule-based expert systems. Key features:
– Knowledge Representation: cohesive tool for handling a wide variety of knowledge.
– Portability: written in C for portability and speed.
– Integration/Extensibility: can be embedded within procedural code and called as a subroutine.
– Fully Documented: extensive documentation.
– Low Cost: maintained as public domain software.

MEMORY FEEDING
Data stored in the InGRID database feeds the internal memory of the Decisions Maker. The Facts are composed of the monitoring data and the list of scheduled actions. The Web Rules are programmed using the web interface.

WEB RULES (I)
The Web Rules specify a set of patterns to find, and the actions to be performed when these patterns are found. A Web Rule is composed of an “if” component (antecedent) and a “then” component (consequent).
– The antecedent is a set of patterns (conditions) that specifies the facts (or monitoring data) which cause the rule to be applicable.
– The consequent is the set of actions to be executed when the rule is applicable.
To represent all kinds of events that may appear in real situations, InGRID considers four types of rules, depending on the state of the supervised elements.
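The antecedent/consequent structure can be mimicked in a few lines of Python. This is not CLIPS and not how InGRID stores its rules; the `WebRule` class, the `(sensor, state)` fact tuples and the naive loop (instead of a real inference engine) are assumptions chosen to keep the sketch small.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Fact = Tuple[str, str]  # (sensor, state); a hypothetical fact shape

@dataclass
class WebRule:
    name: str
    antecedent: List[Fact]  # patterns that must all be present in the facts
    consequent: List[str]   # actions to schedule when the rule is applicable

    def applicable(self, facts: Set[Fact]) -> bool:
        return all(pattern in facts for pattern in self.antecedent)

def fire(rules: List[WebRule], facts: Set[Fact]) -> List[str]:
    """Collect the actions of every rule whose antecedent matches the facts."""
    agenda: List[str] = []
    for rule in rules:
        if rule.applicable(facts):
            agenda.extend(rule.consequent)
    return agenda

facts = {("Test1_Client", "CRITICAL"), ("load_average", "OK")}
rules = [
    WebRule("restart when critical",
            [("Test1_Client", "CRITICAL"), ("load_average", "OK")],
            ["restart_dummy"]),
    WebRule("client healthy", [("Test1_Client", "OK")], ["noop"]),
]
agenda = fire(rules, facts)
```

Only the first rule has every pattern of its antecedent among the facts, so only its consequent is scheduled.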

WEB RULES (II)
– STABLE (OK state): no action is required.
– UNSTABLE (hard error state): an immediate action is required, independently of the state of any other element.
– DOWN (soft error state): dependencies must be checked first. The action will be scheduled only if this error is the real cause of the problem.
– NOTIFY: notifies the operator of the incident, without trying to schedule any action.
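The four rule types can be read as a small dispatch. In the sketch below the `resolve` function, the `notify_operator` marker and the test "every dependency reports OK" for deciding that an error is the real cause are assumptions; the slides only state that dependencies are checked before a DOWN rule schedules its action.

```python
from enum import Enum
from typing import Dict, List, Optional

class RuleType(Enum):
    STABLE = "stable"      # OK state
    UNSTABLE = "unstable"  # hard error state
    DOWN = "down"          # soft error state
    NOTIFY = "notify"

def resolve(rule_type: RuleType, action: str,
            depends_on: List[str], states: Dict[str, str]) -> Optional[str]:
    """Return what to schedule for an activated rule, or None for nothing."""
    if rule_type is RuleType.STABLE:
        return None                # no action is required
    if rule_type is RuleType.NOTIFY:
        return "notify_operator"   # tell the operator, schedule no action
    if rule_type is RuleType.UNSTABLE:
        return action              # act immediately, ignore other elements
    # DOWN: act only if no dependency is itself failing (assumed criterion).
    if all(states.get(dep) == "OK" for dep in depends_on):
        return action
    return None

states = {"snmp": "CRITICAL", "load_average": "OK"}
hard = resolve(RuleType.UNSTABLE, "restart_dummy", ["snmp"], states)
soft = resolve(RuleType.DOWN, "restart_dummy", ["snmp"], states)
```

With SNMP failing, the UNSTABLE rule still schedules its action while the DOWN rule holds back.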

MATCHING
[Three figure-only slides titled MATCHING.]

CASE OF USE

Module TEST1_CLIENT has only one action activated: restart_dummy.

Module TEST1_CLIENT has 3 rules defined: 2 of them stable, and 1 rule unstable, which activates the action restart_dummy.

– Sensor Test1_Client is OK. As it is expected to be RUNNING, this activates a stable rule, so nothing needs to be done here.
– Notice that here sensor Test1_Client is also monitored as CRITICAL, but in this case the operator expects the service to be STOPPED.
– Sensor Test1_Client is monitored as CRITICAL. That should activate the unstable rule defined, and schedule the action restart_dummy.
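The three rules of this example can be written as a lookup on the pair (monitored status, expected state). The table below is one reading of the slide, assuming the unstable rule applies when the service is expected to be RUNNING; the `evaluate` function and the `("none", None)` fallback are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (monitored status, expected state) -> (rule type, action to schedule)
TEST1_CLIENT_RULES: Dict[Tuple[str, str], Tuple[str, Optional[str]]] = {
    ("OK", "RUNNING"): ("stable", None),
    ("CRITICAL", "STOPPED"): ("stable", None),
    ("CRITICAL", "RUNNING"): ("unstable", "restart_dummy"),
}

def evaluate(monitored: str, expected: str) -> Tuple[str, Optional[str]]:
    """Look up which rule is activated; ("none", None) if no rule is defined."""
    return TEST1_CLIENT_RULES.get((monitored, expected), ("none", None))

healthy = evaluate("OK", "RUNNING")
stopped_on_purpose = evaluate("CRITICAL", "STOPPED")
broken = evaluate("CRITICAL", "RUNNING")
```

The same CRITICAL reading leads to different outcomes depending on what the operator expects.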

The rule for Test1_Client has been activated and the action restart_dummy has been scheduled. Now we must wait until the Agent executes the action.

The action restart_dummy has now been executed. We must wait until the monitor re-checks the sensor to see whether the action has solved the problem.

Having a look at the log, we see each reasoning iteration of the expert system. First, the action restart_dummy is scheduled; then the Agent executes it successfully; finally, the monitor re-checks the sensor and reports an OK status.

Sensor Test1_Client is now monitored as OK. That should activate the stable rule defined for this sensor, so the previous action is removed from the Agent queue.

The action has been removed, and the queue is empty. Everything is working fine now!

CONCLUSIONS
The main objectives of InGRID are:
– to minimize the number of incidents that the operator must attend to, and
– to restore automatically as many services as possible without the intervention of the operator.
InGRID is being tested on some production services of PIC (South-West ROC). We use PIC feedback to improve InGRID with new features.

CONCLUSIONS
During the first tests, we detected some outage problems:
– Some of the incidents were false errors caused by monitoring or execution problems (SNMP not responding).
– A large number of incidents were also caused by a very high load average on the monitored host, which resulted in timeouts.
– During high loads, a host usually returns critical status for all checks.
– We have tried to minimize these problems by also monitoring SNMP and the load average of each host, and by using dependencies.
InGRID demonstrates that it can detect the real cause of the problem when:
– service dependencies are fine-tuned, and
– adequate sensors and rules are configured.
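The effect of monitoring SNMP and the load average as dependencies can be illustrated with a small filter that keeps only the failing sensors whose own dependencies are healthy. The sensor names, the dependency map and the `root_causes` function are invented for the example and are not taken from InGRID.

```python
from typing import Dict, List

def root_causes(states: Dict[str, str],
                depends_on: Dict[str, List[str]]) -> List[str]:
    """Return the failing sensors that have no failing dependency."""
    failing = {sensor for sensor, state in states.items() if state != "OK"}
    return sorted(
        sensor for sensor in failing
        if not any(dep in failing for dep in depends_on.get(sensor, []))
    )

# Hypothetical host where SNMP stopped responding, so every check times out.
states = {
    "snmp": "CRITICAL",
    "load_average": "OK",
    "Test1_Client": "CRITICAL",
    "disk": "CRITICAL",
}
depends_on = {
    "Test1_Client": ["snmp", "load_average"],
    "disk": ["snmp"],
}
causes = root_causes(states, depends_on)
```

Three sensors report CRITICAL, but only `snmp` is kept as a candidate cause; the other two are treated as consequences of it.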

CONCLUSIONS
When an incident is caused by a standalone service, independent of any other element, InGRID has performed very well when acting on it.
– Usually, the only action needed is to restart the service.
Standalone services may run on a large number of hosts.
– To manage them, the administrator needs to create only one rule for that generic service.
– However, to make distinct rules depending on whether the service is running on a particular host, a group of hosts, or even a site, InGRID allows a rule to be associated with these kinds of groups.
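A possible way to combine one generic rule with host, group or site rules is to look for the most specific match first. The precedence order (host, then group, then site, then generic) is an assumption, since the slides do not say how InGRID resolves overlapping rules; the key layout and the names are also hypothetical.

```python
from typing import Dict, Optional, Tuple

RuleKey = Tuple[str, str, str]  # (service, scope kind, scope value)

def select_rule(rules: Dict[RuleKey, str], service: str,
                host: str, group: str, site: str) -> Optional[str]:
    """Pick the most specific rule defined for a service on a given host."""
    scopes = (("host", host), ("group", group), ("site", site), ("any", "*"))
    for kind, value in scopes:
        rule = rules.get((service, kind, value))
        if rule is not None:
            return rule
    return None

scoped_rules = {
    ("dummy", "any", "*"): "restart_dummy",      # one rule for the generic service
    ("dummy", "host", "node07"): "notify_only",  # exception for a single host
}
generic = select_rule(scoped_rules, "dummy", "node01", "workers", "PIC")
specific = select_rule(scoped_rules, "dummy", "node07", "workers", "PIC")
```

Every host except `node07` falls through to the single generic rule.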

QUESTIONS?
Thanks for your attention!
Francesc Pérez