
A System for Monitoring and Management of Computational Grids
Warren Smith
Computer Sciences Corporation
NASA Ames Research Center

Slide 2: Motivation (GGF8 PGM Workshop)
- Computational grids:
  - Many different types of resources
  - Services deployed on those resources
  - Applications executed by users
- There will be failures
  - Failures need to be observed
  - Observations of failures need to be communicated
- A grid must be managed
  - Failure management
  - General administration

Slide 3: Approach
- Develop a general framework for observation and control
  - Observe and control a variety of resources and services
  - Operate in a distributed environment
  - Secure
  - Scalable
- Use this framework to monitor and manage grids
  - Observe computer systems, storage systems, networks
  - Observe job submission, information, and file transfer services
  - Start, stop, and configure services
  - Notify administrators of problems
- Help develop and be compatible with standards
  - Global Grid Forum

Slide 4: Why not use an existing system?
- Commercial systems
  - Many fully-featured tools available
  - Cost could be too high for smaller partners
  - Incompatibility between different tools
  - Incompatible with grid security and authentication mechanisms
- Open source systems
  - Not as many features
  - Incompatible with each other
  - Not compatible with grid security mechanisms
- In either case
  - We want a testbed for standardization

Slide 5: High-Level Architecture
[Diagram: Manager, Observer, Actor, and Directory Service, connected by flows labeled Events, Commands, Advertise, and Search]
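To make the four roles concrete, here is a minimal in-process sketch. It assumes that observers and actors advertise themselves in a directory, and that a manager searches the directory, subscribes to observer events, and sends commands to actors. All class and method names are hypothetical. The framework described here communicates over TCP, UDP, or SSL with XML messages, not through direct method calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Directory:
    """Hypothetical directory: maps component names to advertised entries."""
    entries: Dict[str, dict] = field(default_factory=dict)

    def advertise(self, name: str, kind: str, component: object) -> None:
        self.entries[name] = {"kind": kind, "component": component}

    def search(self, kind: str) -> List[object]:
        return [e["component"] for e in self.entries.values() if e["kind"] == kind]


class Observer:
    """Produces events and pushes them to every subscriber."""

    def __init__(self) -> None:
        self.subscribers: List[Callable[[dict], None]] = []

    def subscribe(self, callback: Callable[[dict], None]) -> None:
        self.subscribers.append(callback)

    def publish(self, event: dict) -> None:
        for callback in self.subscribers:
            callback(event)


class Actor:
    """Executes commands; this sketch only records them."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def command(self, action: str) -> None:
        self.executed.append(action)


class Manager:
    """Finds observers and actors via the directory and reacts to events."""

    def __init__(self, directory: Directory) -> None:
        self.actors = directory.search("actor")
        for obs in directory.search("observer"):
            obs.subscribe(self.on_event)

    def on_event(self, event: dict) -> None:
        # Hypothetical policy: an event flagged as a problem triggers a command.
        if event.get("problem"):
            for act in self.actors:
                act.command(event["fix"])


directory = Directory()
observer, actor = Observer(), Actor()
directory.advertise("host1-observer", "observer", observer)
directory.advertise("host1-actor", "actor", actor)
manager = Manager(directory)
observer.publish({"problem": True, "fix": "clean-temp-disk"})
print(actor.executed)  # ['clean-temp-disk']
```

Because the manager discovers its peers through the directory rather than by hard-coded addresses, observers and actors can be added without reconfiguring it.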

Slide 6: Monitoring and Managing a Cluster
[Diagram]
- Management host: Cluster Manager
  - Receive observations
  - Decide if any actions need to be taken
  - Ask for actions
  - Log any problems
- Each of Host 1 through Host N:
  - Host Observer: CPU load, disk space, memory use
  - Host Actor: kill process, clean temp disk
- Directory Service
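The cluster manager's loop (receive observations, decide, ask for actions, log problems) can be sketched as follows. The thresholds, observation field names, and action names are hypothetical, because the slide does not specify any policy.

```python
import logging
from typing import List, Optional, Tuple

log = logging.getLogger("cluster-manager")

# Hypothetical thresholds; the slide does not give any.
MAX_CPU_LOAD = 8.0
MIN_FREE_DISK_FRACTION = 0.10


def decide(observation: dict) -> Optional[str]:
    """Return the name of a host-actor action, or None if nothing is needed."""
    if observation["free_disk_fraction"] < MIN_FREE_DISK_FRACTION:
        return "clean-temp-disk"
    if observation["cpu_load"] > MAX_CPU_LOAD and observation.get("runaway_pid"):
        return "kill-process"
    return None


def handle(observations: List[dict]) -> List[Tuple[str, str]]:
    """Receive observations, decide, collect action requests, and log problems."""
    requests = []
    for obs in observations:
        action = decide(obs)
        if action is not None:
            log.warning("problem on %s: requesting %s", obs["host"], action)
            requests.append((obs["host"], action))
    return requests


print(handle([
    {"host": "host1", "cpu_load": 1.2, "free_disk_fraction": 0.05},
    {"host": "hostN", "cpu_load": 12.0, "free_disk_fraction": 0.50, "runaway_pid": 4242},
]))
# [('host1', 'clean-temp-disk'), ('hostN', 'kill-process')]
```

Keeping `decide` a pure function makes the policy easy to test separately from the transport that carries observations and action requests.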

Slide 7: Observer
- Sensor
  - Performs a measurement and reports results
- Sensor manager
  - Manages sensors, subscriptions, and queries
- Event Producer
  - Subscribe
  - Query
  - Available events
  - Event schemas
- Service Hosting Environment
[Diagram: an Observer built from Sensors, a Sensor Manager, and an Event Producer inside a Service Hosting Environment; key distinguishes user-written, higher-level, and low-level components]
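A sensor manager that supports both one-shot queries and subscriptions might look like the sketch below. The class, event type names, and event fields are hypothetical. The two sensors use real standard-library calls (`shutil.disk_usage`, and `os.getloadavg`, which is Unix-only).

```python
import os
import shutil
import time
from typing import Callable, Dict, List


class SensorManager:
    """Hypothetical sensor manager: runs sensors for queries and subscriptions."""

    def __init__(self) -> None:
        self.sensors: Dict[str, Callable[[], dict]] = {}
        self.subscriptions: Dict[str, List[Callable[[dict], None]]] = {}

    def register(self, event_type: str, sensor: Callable[[], dict]) -> None:
        self.sensors[event_type] = sensor
        self.subscriptions.setdefault(event_type, [])

    def available_events(self) -> List[str]:
        return sorted(self.sensors)

    def query(self, event_type: str) -> dict:
        """One-shot request: run the sensor once and return the event."""
        return self._event(event_type)

    def subscribe(self, event_type: str, callback: Callable[[dict], None]) -> None:
        self.subscriptions[event_type].append(callback)

    def poll(self) -> None:
        """Run each sensor that has subscribers and push the event to them."""
        for event_type, callbacks in self.subscriptions.items():
            if callbacks:
                event = self._event(event_type)
                for callback in callbacks:
                    callback(event)

    def _event(self, event_type: str) -> dict:
        data = self.sensors[event_type]()
        return {"type": event_type, "time": time.time(), **data}


def disk_space_sensor() -> dict:
    usage = shutil.disk_usage("/")
    return {"free_bytes": usage.free, "total_bytes": usage.total}


def cpu_load_sensor() -> dict:
    one, five, fifteen = os.getloadavg()  # Unix only
    return {"load1": one, "load5": five, "load15": fifteen}


sensors = SensorManager()
sensors.register("disk.space", disk_space_sensor)
sensors.register("cpu.load", cpu_load_sensor)
received: List[dict] = []
sensors.subscribe("disk.space", received.append)
sensors.poll()  # only the subscribed disk sensor runs
print(sensors.available_events())  # ['cpu.load', 'disk.space']
```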

Slide 8: Actor
- Actuator
  - Performs an action
- Actuator Manager
  - Handles requests for actions by calling actuators
- Actor
  - Request action (RPC)
  - Available actions
  - Action schemas
- Service Hosting Environment
[Diagram: an Actor built from Actuators and an Actuator Manager inside a Service Hosting Environment; key distinguishes user-written, higher-level, and low-level components]
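Dispatching action requests to actuators, and advertising each action's schema, could be sketched as below. The action names, schema representation (a list of required parameter names), and actuator bodies are all hypothetical placeholders. The actuators only describe what they would do.

```python
from typing import Callable, Dict, List


class ActuatorManager:
    """Hypothetical actuator manager: dispatches action requests to actuators."""

    def __init__(self) -> None:
        self.actuators: Dict[str, Callable[..., str]] = {}
        self.schemas: Dict[str, List[str]] = {}

    def register(self, action: str, params: List[str], actuator: Callable[..., str]) -> None:
        self.actuators[action] = actuator
        self.schemas[action] = params

    def available_actions(self) -> Dict[str, List[str]]:
        """Each action name with the parameter names its schema requires."""
        return dict(self.schemas)

    def request(self, action: str, **args: str) -> str:
        """Entry point an RPC layer would call; checks args against the schema."""
        if action not in self.actuators:
            raise KeyError(f"unknown action: {action}")
        missing = [p for p in self.schemas[action] if p not in args]
        if missing:
            raise ValueError(f"missing parameters for {action}: {missing}")
        return self.actuators[action](**args)


def kill_process(pid: str) -> str:
    # Placeholder: a real actuator might call os.kill(int(pid), signal.SIGTERM).
    return f"would kill {pid}"


def clean_temp_disk(path: str) -> str:
    # Placeholder: a real actuator would delete old files under `path`.
    return f"would clean {path}"


actor = ActuatorManager()
actor.register("kill-process", ["pid"], kill_process)
actor.register("clean-temp-disk", ["path"], clean_temp_disk)
print(actor.request("kill-process", pid="4242"))  # would kill 4242
```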

Slide 9: Manager
- Two external interface components
  - Event Consumer Client
  - Actor Client
- Two approaches to higher-level components
  - User writes management logic
  - User writes management rules and uses an expert system
[Diagram: two Manager variants, each with an Event Consumer Client and an Actor Client; one driven by user-written Management Logic, the other by Management Rules evaluated by an Expert System; key distinguishes user-written, higher-level, and low-level components]
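The second approach keeps management knowledge as rules rather than code. Below is a deliberately tiny stand-in: each rule is a condition plus an action name, and every matching rule fires. It is not the CLIPS expert system used by the C++ version. The rule names, event fields, and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    action: str


def disk_nearly_full(e: dict) -> bool:
    return e.get("type") == "disk.space" and e["free_bytes"] < 0.1 * e["total_bytes"]


def host_down(e: dict) -> bool:
    return e.get("type") == "host.status" and not e["up"]


# Hypothetical management rules; names and thresholds are illustrative only.
RULES = [
    Rule("disk-nearly-full", disk_nearly_full, "clean-temp-disk"),
    Rule("host-down", host_down, "notify-admin"),
]


def fire(rules: List[Rule], event: dict) -> List[str]:
    """Return the action of every rule whose condition matches the event."""
    return [rule.action for rule in rules if rule.condition(event)]


print(fire(RULES, {"type": "disk.space", "free_bytes": 5, "total_bytes": 100}))
# ['clean-temp-disk']
```

A real expert system adds working memory, pattern matching across multiple facts, and conflict resolution. This sketch only shows why rules as data are easier for administrators to change than compiled management logic.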

Slide 10: Directory Service
- Information about observers and actors
  - Contact location and protocol
  - Available events and actions
  - Who has access
- Dictionary
  - Event and action schemas
- Future: information about event consumers
  - Archives
  - Channels
- Experimental component
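A directory record holding the fields listed above (contact location, protocol, offered events or actions, and who has access) might be modeled as in this sketch. The field names, host names, and subject names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DirectoryEntry:
    """Hypothetical record for one observer or actor."""
    name: str
    kind: str          # "observer" or "actor"
    contact: str       # host:port, made up in the example below
    protocol: str      # e.g. "tcp", "udp", "ssl"
    offers: List[str]  # event types or action names
    allowed: List[str] = field(default_factory=list)  # subjects with access


class DirectoryService:
    def __init__(self) -> None:
        self.entries: List[DirectoryEntry] = []

    def advertise(self, entry: DirectoryEntry) -> None:
        self.entries.append(entry)

    def search(self, offering: str, subject: str) -> List[DirectoryEntry]:
        """Entries that offer `offering` and grant access to `subject`."""
        return [e for e in self.entries
                if offering in e.offers and subject in e.allowed]


ds = DirectoryService()
admin = "/O=Grid/OU=example/CN=Admin"  # hypothetical certificate subject
ds.advertise(DirectoryEntry("host1-observer", "observer", "host1.example.org:7000",
                            "ssl", ["cpu.load", "disk.space"], [admin]))
ds.advertise(DirectoryEntry("host2-observer", "observer", "host2.example.org:7000",
                            "tcp", ["cpu.load"], []))
print([e.name for e in ds.search("cpu.load", admin)])  # ['host1-observer']
```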

Slide 11: Security
- GSI security
- Encrypted communication
  - SSL/TLS
- Authentication
  - X.509 certificates
  - Proxy certificates
- Authorization
  - Per-observer and per-actor
  - Pluggable user-defined authorization module
    - Module for X.509 subject-based access control lists available
  - Future: per-sensor and per-actuator
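A pluggable authorization hook with a subject-based ACL module could look like this sketch. The interface name and method signature are hypothetical. It assumes the transport layer has already authenticated the peer and extracted its X.509 subject. With proxy certificates, the identity checked would normally be the end-entity subject the proxy was derived from.

```python
from typing import Dict, Protocol, Set


class AuthorizationModule(Protocol):
    """Hypothetical plug-in interface: may `subject` use `component`?"""
    def authorize(self, subject: str, component: str) -> bool: ...


class SubjectAclModule:
    """Grant access only to X.509 subjects listed for each observer or actor."""

    def __init__(self, acl: Dict[str, Set[str]]) -> None:
        self.acl = acl

    def authorize(self, subject: str, component: str) -> bool:
        return subject in self.acl.get(component, set())


def handle_request(auth: AuthorizationModule, subject: str, component: str) -> str:
    # `subject` is assumed to come from the peer's authenticated certificate.
    return "accepted" if auth.authorize(subject, component) else "denied"


acl = SubjectAclModule({"host1-actor": {"/O=Grid/OU=example/CN=Admin"}})
print(handle_request(acl, "/O=Grid/OU=example/CN=Admin", "host1-actor"))  # accepted
print(handle_request(acl, "/O=Grid/OU=example/CN=Guest", "host1-actor"))  # denied
```

Because `handle_request` depends only on the `authorize` method, a site could swap in a different module (for example, one backed by a grid-mapfile) without touching the request path.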

Slide 12: Basic GUI

Slide 13: Monitoring and Managing a Cluster (component view)
[Diagram]
- Host 1 through Host N, each running:
  - Host Observer in a Service Hosting Environment: CPU Load Sensor, Disk Space Sensor, and Memory Sensor, managed by a Sensor Manager and exposed through an Event Producer
  - Host Actor in a Service Hosting Environment: Kill Process Actuator and File Deletion Actuator
- Management Host: Cluster Manager built from Management Logic, an Event Consumer Client, and a Directory Client
  - Receive observations
  - Decide if any actions need to be taken
  - Ask for actions
  - Log any problems
- Directory Service

Slide 14: Implementation
- Communicates using TCP, UDP, or SSL
- XML encoding of messages
- C++ version
  - pthreads
  - Xerces XML parser
  - Globus I/O for authenticated and secure communication
  - Currently runs under IRIX, Solaris, Linux
  - CLIPS expert system
- Java version
  - Xerces XML parser
  - Globus Java CoG for authenticated and secure communication
  - JDK 1.3.x or 1.4.x
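To illustrate what "XML encoding of messages" can mean for an event, here is an encode/decode round trip using Python's `xml.etree.ElementTree`. This is not the Xerces-based code, and the element and attribute names are invented. The real event schema is not shown on the slide.

```python
import xml.etree.ElementTree as ET


def encode_event(event_type: str, source: str, fields: dict) -> bytes:
    """Serialize an event as XML; element and attribute names are made up."""
    root = ET.Element("event", {"type": event_type, "source": source})
    for name, value in fields.items():
        ET.SubElement(root, "field", {"name": name}).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def decode_event(data: bytes) -> dict:
    """Parse the XML back into a dict; field values come back as strings."""
    root = ET.fromstring(data)
    return {
        "type": root.get("type"),
        "source": root.get("source"),
        "fields": {f.get("name"): f.text for f in root.findall("field")},
    }


msg = encode_event("cpu.load", "host1", {"load1": 0.42})
print(decode_event(msg))
```

A text encoding like this lets the C++ and Java versions exchange events without sharing a binary format, at the cost of parsing overhead.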

Slide 15: Grid Management System
- Things to observe:
  - Resource status and usage
    - Computer systems and networks
  - Grid services
    - GRAM, MDS
    - Includes processes, log files, and test queries
- Things to control:
  - Add/remove user mappings in grid-mapfiles
  - Start and stop MDS servers
  - Add/remove/update CA certificates
- Provide a nice GUI to do all this
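Adding and removing user mappings comes down to editing grid-mapfile lines, each holding a quoted certificate subject followed by one or more comma-separated local account names. The sketch below assumes that format and works on file contents as a string. It simplifies by dropping comments and blank lines when it rewrites the file. The subjects and account names are made up.

```python
import shlex
from typing import Dict, List


def parse_gridmap(text: str) -> Dict[str, List[str]]:
    """Parse lines of the form: "<subject DN>" user[,user...]"""
    mapping: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # handles the quoted subject DN
        if len(parts) < 2:
            continue               # skip malformed lines
        mapping[parts[0]] = parts[1].split(",")
    return mapping


def format_gridmap(mapping: Dict[str, List[str]]) -> str:
    return "".join(f'"{dn}" {",".join(users)}\n' for dn, users in mapping.items())


def add_mapping(text: str, dn: str, user: str) -> str:
    mapping = parse_gridmap(text)
    users = mapping.setdefault(dn, [])
    if user not in users:
        users.append(user)
    return format_gridmap(mapping)


def remove_mapping(text: str, dn: str) -> str:
    mapping = parse_gridmap(text)
    mapping.pop(dn, None)
    return format_gridmap(mapping)


original = '"/O=Grid/OU=example/CN=Jane Doe" jdoe\n'
updated = add_mapping(original, "/O=Grid/OU=example/CN=John Roe", "jroe")
print(updated)
```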

Slide 16: Grid Management System
[Diagram: Management GUI, GRAM Management Agent, MDS Management Agent, Directory Service, and Event Archive (experimental component). Labeled interactions: "advertise existence"; "find managers and archive"; "1. events describing current state, 2. action requests"; "1. subscribe, 2. events with problems"; "query for events that describe problems".]

Slide 17: Management Agent
- Management agents:
  - Perform observations
  - Perform actions
  - Manage local problems
    - Not doing any management right now
    - Handle local problems locally
[Diagram: a Management Agent containing an Observer, an Actor, and a Manager for local management, with flows labeled "local actions" and "local and remote observations"]
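The slide notes that local management was not yet implemented. The sketch below shows one way the planned "handle local problems locally" behavior could work. Every observation is still forwarded to remote consumers, and the agent also applies a local fix when it recognizes the problem. The problem and action names, and the `forward` callback, are hypothetical.

```python
from typing import Callable, Dict, List


class ManagementAgent:
    """Hypothetical agent bundling observation, local management, and actions."""

    def __init__(self, local_fixes: Dict[str, str],
                 forward: Callable[[dict], None]) -> None:
        self.local_fixes = local_fixes  # problem name -> local action name
        self.forward = forward          # sends observations to remote managers
        self.actions_taken: List[str] = []

    def on_observation(self, event: dict) -> None:
        self.forward(event)  # remote managers and the GUI still see everything
        fix = self.local_fixes.get(event.get("problem"))
        if fix is not None:
            self.actions_taken.append(fix)  # handle the local problem locally


sent: List[dict] = []
agent = ManagementAgent({"iperf-server-down": "start-iperf-server"}, sent.append)
agent.on_observation({"problem": "iperf-server-down"})
agent.on_observation({"type": "cpu.load", "load1": 0.3})
print(agent.actions_taken, len(sent))  # ['start-iperf-server'] 2
```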

Slide 18: GRAM Management Agent
- Observes:
  - Network latency between GRAM hosts: ping
  - Available network bandwidth between hosts: IPerf
  - CPU load: Unix uptime, PBS qstat, LSF bjobs
  - Available memory: vmstat?
  - Available disk space: df
  - The Globus GRAM service: log files
- Performs actions:
  - Modify Globus grid-mapfile
  - Start/stop IPerf server
  - Send
- In the future will manage local problems
  - Receive local observations
  - Perform local actions when necessary
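Several of these observations come from parsing command output. The sketch below parses `uptime` load averages and `df -P -k` available space from captured text. It assumes the common "load average:" wording (Linux) or "load averages:" wording (macOS) and the POSIX `df -P` column layout. Other platforms may differ. `run` shows how the text could be captured but is not called here.

```python
import re
import subprocess
from typing import Dict, List, Tuple

_LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def run(cmd: List[str]) -> str:
    """Capture a command's standard output, e.g. run(["df", "-P", "-k"])."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_uptime(output: str) -> Tuple[float, float, float]:
    """Extract the 1-, 5-, and 15-minute load averages from `uptime` output."""
    match = _LOAD_RE.search(output)
    if match is None:
        raise ValueError("no load average found")
    one, five, fifteen = (float(x) for x in match.groups())
    return one, five, fifteen


def parse_df(output: str) -> Dict[str, int]:
    """Map mount point -> available KiB from `df -P -k` output."""
    result = {}
    for line in output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        # Columns: filesystem, total, used, available, capacity, mount point
        result[" ".join(fields[5:])] = int(fields[3])
    return result


sample_uptime = " 10:15:01 up 3 days,  2:04,  2 users,  load average: 0.52, 0.58, 0.59"
sample_df = ("Filesystem 1024-blocks Used Available Capacity Mounted on\n"
             "/dev/sda1 1000000 800000 200000 80% /\n")
print(parse_uptime(sample_uptime))  # (0.52, 0.58, 0.59)
print(parse_df(sample_df))          # {'/': 200000}
```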

Slide 19: MDS Management Agent
- Observes:
  - Network connectivity between GIS hosts: ping
  - CPU load: uptime
  - Available memory: vmstat?
  - Available disk space: df
  - The status of the LDAP server
    - The LDAP server process: ps
    - Whether LDAP queries succeed: ldap_search()
- Performs actions:
  - Start and stop LDAP server
  - Send
- In the future will manage local problems
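The two LDAP server checks can be approximated as shown below. One function looks for the server process in `ps` output. The other is a TCP reachability test. The reachability test is a weaker stand-in for a real `ldap_search()` test query, because it only shows that something is listening on the port. The default process name `slapd` (OpenLDAP's server) is an assumption. The port must come from the local MDS configuration.

```python
import socket


def process_running(ps_output: str, name: str = "slapd") -> bool:
    """True if `name` appears as a command in `ps -e -o comm=` output.

    The default process name is an assumption (OpenLDAP's slapd)."""
    return any(line.strip().split("/")[-1] == name
               for line in ps_output.splitlines())


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Weaker stand-in for a test ldap_search(): can a TCP connection open?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(process_running("init\n/usr/sbin/slapd\nsshd\n"))  # True
```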

Slide 20: Event Archive
- Allows events to be archived and searched
- An XML database
  - Currently Xindice
  - Compatible with our XML-based events
- Queried using the XPath language
- Use for all events, just errors, …
- Experimental component
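Archiving XML events and querying them with XPath can be illustrated in memory with `xml.etree.ElementTree`. ElementTree implements only a subset of XPath, so this is a stand-in for querying an XML database like Xindice, not its API. The event attributes are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


class EventArchive:
    """Hypothetical in-memory archive of XML events, searched with XPath."""

    def __init__(self) -> None:
        self.root = ET.Element("events")

    def archive(self, xml_event: str) -> None:
        self.root.append(ET.fromstring(xml_event))

    def search(self, xpath: str) -> List[ET.Element]:
        # ElementTree supports a limited XPath subset (paths, [@attr='value']).
        return self.root.findall(xpath)


archive = EventArchive()
archive.archive('<event type="gram.log" severity="error" source="host1"/>')
archive.archive('<event type="cpu.load" severity="info" source="host2"/>')
errors = archive.search("./event[@severity='error']")
print([e.get("source") for e in errors])  # ['host1']
```

Filtering at insertion time (for example, archiving only `severity='error'` events) corresponds to the "just errors" option on the slide.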

Slide 21: Grid Management GUI

Slide 22: Grid Management GUI
- Similar to many you’ve seen before
- Java program
- Load on systems
  - System up or down
- Latency and bandwidth of network
  - Network up or down
- XML configuration file defines GUI
  - Which systems to monitor
  - Which sensors to use on each system
  - Where to place information on the screen
- More detailed information available as dialogs
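An XML file that says which systems to monitor, which sensors to use on each, and where to draw them could be loaded as in this sketch. The configuration schema (the `system` and `sensor` elements and the `x`/`y` attributes) is hypothetical, because the slide does not show the real format. The GUI itself is a Java program. Python is used here only to illustrate the parsing.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Tuple

# Hypothetical configuration format; the real schema is not shown.
CONFIG = """
<gui>
  <system name="host1.example.org" x="10" y="20">
    <sensor type="cpu.load"/>
    <sensor type="disk.space"/>
  </system>
  <system name="host2.example.org" x="10" y="60">
    <sensor type="cpu.load"/>
  </system>
</gui>
"""


class Panel(NamedTuple):
    system: str
    sensors: List[str]
    position: Tuple[int, int]


def load_config(text: str) -> List[Panel]:
    """One panel per monitored system, with its sensors and screen position."""
    root = ET.fromstring(text)
    return [
        Panel(
            system=s.get("name"),
            sensors=[sensor.get("type") for sensor in s.findall("sensor")],
            position=(int(s.get("x")), int(s.get("y"))),
        )
        for s in root.findall("system")
    ]


for panel in load_config(CONFIG):
    print(panel.system, panel.sensors, panel.position)
```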

Slide 23: Standardization
- Performance Working Group of the Grid Forum
  - Architecture
  - Event representations
  - Directory service schema
  - Producer-consumer communication protocols
- Grid Monitoring Architecture Working Group
- DAMED Working Group
- Grid Event Service Working Group?
  - BOF at next GGF, hopefully

Slide 24: Status and Future Work
- Current status:
  - Worldwide noncommercial release expected Real Soon Now
  - Release quality
    - CODE framework used day-to-day in the IPG
  - Preliminary grid management system
- Our future plans include:
  - Define and be compatible with Grid Forum standards
  - Use in the IPG (need a web interface)
  - Develop more sensors and actuators
  - Sensors and actuators as programs as well as classes
  - More sophisticated event service
    - Event routing network, more subscription models and options
  - OGSI as hosting environment
  - Work with IPG (and other) administrators to improve the grid management system
  - A public release! Open source!