Overview
- Goal
- Concepts
- Types of sensors
- User scenarios
- Architecture
- Near-term project
- Discussion topics
Goal of the PPDG Monitoring Group
- Collect and coordinate requirements from information consumers.
- Collect the existing monitoring tools: catalog available instruments and identify missing functionality.
- Coordinate with PPDG, the European Data Grid (EDG), GriPhyN, and the Global Grid Forum Performance working group.
- Communicate with Globus experts on how the monitoring data can be used by different Globus components: what do they need in order to improve system utilization (for example, data types, data attributes, and data specifications)?
- Identify the essential services/resources to be monitored in the grid infrastructure and define adequate and efficient metrics (measurements, data formats, and so on).
- Evaluate and select standard information systems (different information distribution models may be required for real-time vs. archived information, and also for active vs. passive monitoring). Obviously, the PPDG "standard" must be coordinated with the Grid Performance Area, the Globus Information Service (GIS), the EU Data Grid, etc.
- Build a higher-level diagnostic services package based on the Grid Monitoring Architecture.
- Promote this standard to experimenters, including providing assistance in integrating existing or new instruments into the selected information system(s).
Goal: The focus of the Grid monitoring effort is not to replace or reinvent any of these tools but to integrate them into a scalable distributed architecture. The goal is to provide a single infrastructure that will accommodate many different types of monitored information. Of course, we will start with well-understood examples and existing functions to integrate into the distributed architecture.
What is monitoring? Monitoring: using hardware and/or software tools to observe the activities of a given system or application resource, in order to:
- Analyze performance
- Detect faults
- Identify bottlenecks
- Tune performance
- Predict performance
- Schedule "the best" resources: where to get/put the data, where to execute the job
Concepts
Sensors: A sensor measures the characteristics of a target system and generates time-stamped performance statistics. A sensor typically executes a UNIX utility, such as top, ps, ping, iperf, or ndd, or reads continuously from system files such as /proc/* to extract sensor-specific measurements. Typical sensors are used to monitor CPU usage, memory usage, and network weather. Some sensors can monitor and capture abnormal system states. We call the type of measurement provided by a sensor its subject.
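The sensor concept above can be sketched in a few lines. This is a minimal illustration, not a PPDG component: the `subject` name "cpu.load" and the dictionary layout are invented for the example, and `os.getloadavg()` is used as a portable stand-in for reading /proc/loadavg directly.

```python
import os
import time

def cpu_load_sensor():
    """Minimal sensor: sample the 1-minute load average and return a
    time-stamped measurement labeled with its subject."""
    load1, load5, load15 = os.getloadavg()  # stand-in for parsing /proc/loadavg
    return {
        "subject": "cpu.load",     # the type of measurement this sensor provides
        "timestamp": time.time(),  # seconds since the epoch
        "value": load1,
    }

reading = cpu_load_sensor()
print(reading["subject"], reading["value"])
```

A real sensor would emit such readings periodically (or on demand from its information provider) rather than once.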
Concepts
Information Provider (IP)/Producer: An information provider supplies detailed, dynamic statistics about instrumentation. It either invokes and stops a set of sensors to do active probing, or interacts with already-running sensors to obtain the current status of a resource. An information provider can also query a database to retrieve historical information.
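The producer role can be sketched as a small class that actively probes a set of sensors and answers consumer queries from its latest readings. All names here are illustrative (a real PPDG producer would publish through MDS/GMA, not an in-process dictionary), and the sensor return values are hard-coded for the example.

```python
import time

def cpu_sensor():
    # illustrative sensor with a fixed value for demonstration
    return {"subject": "cpu.load", "timestamp": time.time(), "value": 0.42}

def mem_sensor():
    return {"subject": "mem.free_mb", "timestamp": time.time(), "value": 1024}

class InformationProvider:
    """Sketch of a producer that invokes sensors (active probing) and
    caches the most recent reading per subject."""
    def __init__(self, sensors):
        self.sensors = sensors
        self.latest = {}   # subject -> most recent reading

    def probe(self):
        # invoke each sensor and record its reading
        for sensor in self.sensors:
            reading = sensor()
            self.latest[reading["subject"]] = reading

    def current_status(self, subject):
        # answer a consumer query with the most recent measurement
        return self.latest.get(subject)

ip = InformationProvider([cpu_sensor, mem_sensor])
ip.probe()
print(ip.current_status("cpu.load")["value"])
```

Querying a telemetry database for historical data would be a second method alongside `current_status`, omitted here.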
Concepts
Aggregate Directory: The directory service is used to publish the locations of information providers and their associated sensors. This allows users to discover which sensors are currently active and which information provider they should contact to obtain information.
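The discovery pattern can be illustrated with a toy in-memory registry: providers register the subjects they can report together with a contact address, and consumers look up whom to ask. In a real deployment the directory would be MDS/LDAP; the class, subject names, and contact URLs below are all invented for the sketch.

```python
class AggregateDirectory:
    """Toy directory service: maps each subject to the contacts of the
    information providers that can report on it."""
    def __init__(self):
        self.registry = {}   # subject -> list of provider contacts

    def register(self, subject, contact):
        # a provider publishes that it serves this subject
        self.registry.setdefault(subject, []).append(contact)

    def lookup(self, subject):
        # a consumer discovers which providers to contact
        return self.registry.get(subject, [])

    def active_subjects(self):
        # which kinds of measurement are currently available
        return sorted(self.registry)

directory = AggregateDirectory()
directory.register("cpu.load", "ldap://host-a.example.org:2135")
directory.register("net.bandwidth", "ldap://host-b.example.org:2135")
print(directory.lookup("cpu.load"))
```

Note that the directory holds only locations, not measurement data; consumers fetch the data from the providers themselves, which is the key scalability property of the GMA design.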
Sensors from the PPDG meeting
- System configuration sensors: perform a software and hardware configuration survey periodically and report what software (version, producer) is installed on the system and what hardware is available.
- Network sensors: either sniff passively on a network connection or actively generate network traffic to obtain information about network bandwidth, packet loss, jitter, and round-trip time.
- Host sensors: collect host information such as CPU load, memory load, available memory, available disk space, and average disk I/O time.
Sensors from the PPDG meeting
- Process sensor (service sensor): monitors the running status of a process (e.g., number of processes of this type, number of users, start time). A process sensor may have a threshold set and trigger an event when the threshold is reached.
- Application sensor: reports an application's running status, such as the current progress of the application, what percentage of the job is finished, and how much CPU, memory, and disk have been allocated to the job. If an abnormal condition caused by the host, the network, or an interruption occurs, this sensor will trigger events to get attention.
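The threshold-triggered event mechanism described above can be sketched as a pure check function; the event format and subject name are made up for the example, and a real sensor would push the event to its producer rather than return it.

```python
def check_threshold(subject, value, threshold):
    """Return an event record when a measured value crosses its
    configured threshold, else None (illustrative event format)."""
    if value >= threshold:
        return {
            "event": "threshold_exceeded",
            "subject": subject,
            "value": value,
            "threshold": threshold,
        }
    return None

# e.g. a process sensor counting running daemons of one service type
event = check_threshold("process.count.gridftp", value=120, threshold=100)
print(event["event"])
```

Keeping the threshold test separate from the measurement makes the same trigger logic reusable across process, host, and application sensors.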
User Scenario 1: Service Tracking and Resource Selection
Description: Data transfer, replica selection, and job scheduling services are provided in the grid environment. In order to optimize these services, the grid monitoring system should be able to track a selected service.
- Data transfer: what is the performance of the data transfer, and which route is selected for the transfer? Target: find the optimal route for transfers.
- Replica catalog: which location is chosen among several candidates? The purpose of this type of tracking is to find the best location, minimizing transfer time and access time.
- Job scheduling: which grid resource pool is scheduled to run jobs? Get the best computing resource: the one closest to the data a job needs and the least loaded.
Some Grid-specific issues related to this case:
- Each type of service tracking needs consistent resource status information.
- Identify what type of data is required for each type of service tracking.
- To better manage grid resources, the scheduler needs to know host information, network status, storage distribution, and data locations.
- The cost of each type of resource should be recorded.
- Sensor data from grid resources should be accurate and consistent.
User Scenario 1: Implementation in the Grid Monitoring Architecture: Each resource should be registered in the grid index information service. The performance data will be archived in a database. A performance predictor should be able to summarize the historical service-tracking data and forecast what service will be available in the near future. Based on the predicted data from the telemetry database, the resource selection manager can choose the "best" resource for any pending service request.
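The predictor-plus-selector step can be sketched with a deliberately simple forecast (the mean of the last few observations, a stand-in for real forecasters such as the Network Weather Service). The candidate structure and sample transfer rates are invented for the example.

```python
def forecast(history, window=3):
    """Predict the next value as the mean of the last `window`
    observations from the telemetry database (a toy forecaster)."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def select_best(candidates):
    """Resource selection manager sketch: pick the candidate with the
    highest predicted transfer rate."""
    return max(candidates, key=lambda c: forecast(c["history"]))

# hypothetical historical transfer rates (MB/s) for two replica sites
candidates = [
    {"site": "site-a", "history": [8.0, 9.0, 10.0, 11.0, 12.0]},
    {"site": "site-b", "history": [15.0, 6.0, 5.0, 4.0, 3.0]},
]
print(select_best(candidates)["site"])
```

Even this toy shows why the archive matters: site-b looks best on its oldest sample, but the recent trend picks site-a.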
User Scenario 2: Network Advice Service
Description: The network stack has multiple layers: the physical layer, data link layer, network layer, and transport layer. The monitoring system could provide recommendations on the optimal TCP buffer size for a given network link.
Some Grid-specific issues related to this case:
- It is complicated by the multiple layers in the TCP stack; many parameters are involved in TCP tuning.
- Each layer requires monitoring information.
- The network path is divided into three segments: host to wall jack, wall jack to wall jack, and wall jack to host. Each segment requires monitoring.
- How can this information be sent to the Grid system?
- Implementation in the Grid Monitoring Architecture
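One concrete piece of advice such a service could compute is the classic bandwidth-delay product: to keep a link full, the TCP buffer should hold at least one round trip's worth of data. The function name and example figures are illustrative; the bandwidth and RTT would come from the network sensors described earlier.

```python
def optimal_tcp_buffer(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe.
    bandwidth_bps is in bits per second, rtt_seconds is the measured
    round-trip time."""
    return int(bandwidth_bps / 8 * rtt_seconds)   # bytes

# e.g. a 100 Mbit/s link with a measured 50 ms round-trip time
buf = optimal_tcp_buffer(100e6, 0.050)
print(buf)   # 625000 bytes, i.e. roughly 610 KB
```

This is only the transport-layer part of the advice; per-segment measurements (host to wall jack, etc.) would refine which segment limits the achievable bandwidth.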
Short-Term Goal
We defined a goal of integrating a simple set of site monitoring "sensors" (network and host) into an MDS infrastructure, so that a single display tool can act as a "consumer" showing status information for multiple sites, using MDS as a middleman to isolate the consumer from the sensors. Within a month (well before the DOE/NSF reviews at the end of November) we should be able to demonstrate this distributed monitoring capability. The benefits of this demonstration include:
- Force some "real" testing of the Grid information infrastructure
- Demonstrate a tangible common project
- Help move us toward other joint tool development
Discussion
- Can we integrate existing tools into this Grid monitoring architecture, and how? Where do the daemons, sensors, and producers run? How do they communicate with MDS/GMA?
- Directory services: central or distributed?
- Can we turn levels (amounts) of information on or off based on the current system-monitoring requirements? The monitoring overhead should be considered.
- Do we want to design a monolithic architecture or many small pieces of a monitoring toolkit?
- Should we include a telemetry database or not?
- Build the system from one scenario and extend it to more scenarios?