A Scalable Infrastructure for Job-centric Monitoring Data from Distributed Systems
Dipl.-Ing. (M.Sc.) Marcus Hilbrich (Marcus.Hilbrich@tu-dresden.de)
Center for Information Services and High Performance Computing (ZIH)
Cracow Grid Workshop 09, 13 October 2009

Outline
Motivation
  – Users' view on the Grid
  – How Grid jobs behave
AMon
  – What kind of data is recorded
  – How the user can access the data
  – How to find the interesting data (problematic jobs)
Monitoring infrastructure
  – Why centralized structures fail
  – A layered and distributed solution
Summary

Users' View on the Grid
The Grid is an environment for solving compute-intensive problems
Users do not need to be experts in computer science
The complexity of the Grid is mostly hidden from the user
  – Middleware: Globus, gLite, ...
  – Web portals: VO- or community-specific, but easy access to the Grid
  – Client applications: integration of the Grid into the user's desktop
  – Automatic submission of processes: the Grid is part of a workflow

How Grid Jobs Behave
Most jobs run fine
Some jobs show unexpected behavior or hang
Some jobs get lost
Job-centric monitoring can identify why some jobs behave abnormally

AMon: A Tool for Job-centric Monitoring

What Kind of Data Is Recorded
Job-specific, system-specific and general information
Job information
  – Job ID, username, computing element, worker node, ...
CPU information
  – Wallclock time, CPU time, ...
Memory
  – Used real/virtual main memory, free memory, swap, ...
Disk
  – Free space in HOME/TMP, disk usage, ...
File I/O
  – I/O rate of used files
Network
  – Sent/received traffic
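
The categories above could be held in a record like the following. This is only a minimal sketch: the slides do not show AMon's actual data format, so every class name, field name and unit here is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobInfo:
    """Static job information (hypothetical field names)."""
    job_id: str
    username: str
    computing_element: str
    worker_node: str


@dataclass(frozen=True)
class Sample:
    """One monitoring sample of a running job (units are assumptions)."""
    wallclock_s: float        # wallclock time since job start
    cpu_time_s: float         # CPU time consumed since job start
    mem_real_mb: float        # used real main memory
    mem_virtual_mb: float     # used virtual main memory
    mem_free_mb: float        # free memory on the worker node
    swap_mb: float            # used swap
    disk_free_home_mb: float  # free space in HOME
    disk_free_tmp_mb: float   # free space in TMP
    io_rate_mb_s: float       # I/O rate of used files
    net_sent_mb: float        # sent network traffic
    net_received_mb: float    # received network traffic


def cpu_utilization(prev: Sample, cur: Sample) -> float:
    """Fraction of the wallclock interval between two samples spent on the CPU."""
    interval = cur.wallclock_s - prev.wallclock_s
    if interval <= 0:
        return 0.0
    return (cur.cpu_time_s - prev.cpu_time_s) / interval
```

Deriving CPU utilization from two consecutive samples, as in `cpu_utilization`, is one way such raw counters could be turned into a value that is comparable across jobs.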

How to Find the Interesting Data (Problematic Jobs) (1)
Access AMon via a web browser (with a Grid certificate)
Use the interactive interface to analyze the jobs
Select jobs by current/exit state
Or filter by identified problems (such as idle CPU or lack of free memory)
Reduce thousands of monitored jobs to a few interesting/bad ones!
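
The selection by state and by identified problem can be sketched as a simple filter. The slides do not say how AMon detects problems, so the thresholds, problem labels and the `JobSummary` type below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the slides give no concrete values.
IDLE_CPU_THRESHOLD = 0.05        # below 5 % CPU utilization counts as idle
LOW_MEMORY_THRESHOLD_MB = 100.0  # below 100 MB free memory counts as critical


@dataclass(frozen=True)
class JobSummary:
    """Per-job aggregate used for filtering (hypothetical structure)."""
    job_id: str
    state: str                 # e.g. "running", "done", "failed"
    cpu_utilization: float     # mean fraction of wallclock time on the CPU
    min_free_memory_mb: float  # lowest free memory seen during the run


def identify_problems(job: JobSummary) -> set:
    """Return the set of problem labels that apply to one job."""
    problems = set()
    if job.cpu_utilization < IDLE_CPU_THRESHOLD:
        problems.add("idle_cpu")
    if job.min_free_memory_mb < LOW_MEMORY_THRESHOLD_MB:
        problems.add("low_free_memory")
    return problems


def select_jobs(jobs, state=None, problem=None):
    """Keep only jobs matching the given state and/or identified problem."""
    selected = []
    for job in jobs:
        if state is not None and job.state != state:
            continue
        if problem is not None and problem not in identify_problems(job):
            continue
        selected.append(job)
    return selected
```

Calling `select_jobs(jobs, problem="idle_cpu")` on a long job list would then return only the jobs flagged as idle, which mirrors the reduction from thousands of jobs to a few suspicious ones.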

How to Find the Interesting Data (Problematic Jobs) (2)
Color-coded bars to compare multiple jobs
  – Identify differences over the whole runtime (for one parameter)
Look at a single job
  – Interactive scroll and zoom for analyzing details

Monitoring Infrastructure for Job-centric Monitoring Data of Huge, Widely Distributed Computing Grids

Tasks the Monitoring System Has to Perform
AMon / User to the monitoring infrastructure:
  – Which jobs are mine?
  – What are the exact monitoring data of a specific job?
Running job to the monitoring infrastructure:
  – Give some space for storage!
  – Make the data accessible and searchable for the owner!

Constraints of the Monitoring System
Authorization: who can access whose data?
Security: how to deny unauthorized access?
Scalability / Performance: from a single user up to huge VOs
Authorization and security can rely on the Globus Toolkit 4 framework
Scalability has to be achieved by using collaborating Globus Toolkit 4 instances

Why Centralized Structures Fail
Storage performance has to grow with the number of users or computing elements
Some performance criteria can be increased quite easily (e.g. storage capacity)
Other criteria cannot be increased (e.g. network bandwidth)
Network bandwidth tends to be a major bottleneck!
Distributed storage of data can avoid this problem!
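
A back-of-the-envelope calculation illustrates the bandwidth argument. The job count, data rate and number of sites below are invented for illustration and are not measurements from the talk; the sketch also assumes that jobs are spread evenly over the sites.

```python
def central_ingest_mbit_s(jobs: int, rate_kbit_s_per_job: float) -> float:
    """Inbound bandwidth a single central storage server would have to sustain."""
    return jobs * rate_kbit_s_per_job / 1000.0


def per_site_ingest_mbit_s(jobs: int, rate_kbit_s_per_job: float, sites: int) -> float:
    """Inbound bandwidth per site when each site stores its own jobs' data.

    Assumes the jobs are distributed evenly over all sites.
    """
    return central_ingest_mbit_s(jobs, rate_kbit_s_per_job) / sites


# Hypothetical workload: 100,000 concurrent jobs, 10 kbit/s of monitoring data each.
central = central_ingest_mbit_s(100_000, 10.0)        # 1000.0 Mbit/s at one server
per_site = per_site_ingest_mbit_s(100_000, 10.0, 50)  # 20.0 Mbit/s at each of 50 sites
```

Under these assumptions the central server's load grows linearly with the number of jobs, while the per-site load stays constant if sites are added in proportion to the jobs.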

A Layered and Distributed Solution
Independent layer structures can be joined from an outer layer
  – Each (sub-)VO uses separate STS and LTS
  – The MDS layer makes all jobs visible from a single access point
STS (Short Time Storage): local at a site to reduce latency
LTS (Long Time Storage): stores the data in a distributed way
MDS (Meta Data Server): generates a global view and hides the inner layers
Each layer can consist of multiple, distributed, heterogeneous servers
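
The interplay of the three layers might look like the following in-memory sketch. The slides name the layers but not their interfaces, so all classes, methods and the data flow (STS buffers locally, flushes to an LTS, and the LTS registers the job with the MDS) are assumptions. The plain owner check stands in for the Globus Toolkit 4 authorization, which is not modeled here.

```python
from collections import defaultdict


class MetaDataServer:
    """Global view: knows each job's owner and which LTS holds its data."""

    def __init__(self):
        self._owner_of = {}   # job_id -> owner
        self._location = {}   # job_id -> LongTimeStorage instance

    def register(self, job_id, owner, lts):
        self._owner_of[job_id] = owner
        self._location[job_id] = lts

    def jobs_of(self, owner):
        """Answer 'Which jobs are mine?' from a single access point."""
        return sorted(j for j, o in self._owner_of.items() if o == owner)

    def data_of(self, requester, job_id):
        """Return a job's monitoring data, but only to its owner."""
        if self._owner_of.get(job_id) != requester:
            raise PermissionError(f"{requester} may not read {job_id}")
        return self._location[job_id].load(job_id)


class LongTimeStorage:
    """Persistent storage, one or more instances per (sub-)VO."""

    def __init__(self, mds):
        self._mds = mds
        self._samples = defaultdict(list)

    def store(self, job_id, owner, samples):
        self._samples[job_id].extend(samples)
        self._mds.register(job_id, owner, self)

    def load(self, job_id):
        return list(self._samples.get(job_id, []))


class ShortTimeStorage:
    """Site-local buffer that accepts samples with low latency."""

    def __init__(self, lts):
        self._lts = lts
        self._buffer = defaultdict(list)
        self._owner = {}

    def record(self, job_id, owner, sample):
        self._owner[job_id] = owner
        self._buffer[job_id].append(sample)

    def flush(self, job_id):
        """Move a job's buffered samples to the long time storage."""
        if job_id not in self._owner:
            return
        samples = self._buffer.pop(job_id, [])
        self._lts.store(job_id, self._owner.pop(job_id), samples)
```

A user would talk only to the `MetaDataServer`; which STS received the samples and which LTS stores them stays hidden, matching the idea that the MDS hides the inner layers.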

Summary
Using AMon
  – Grid users get a tool to find out why some jobs behave unusually
  – It is easy to take a look at jobs
  – The Grid becomes more transparent with respect to job behavior
The new monitoring data infrastructure scales with
  – The number of jobs
  – The number of resource providers
  – The increasing power of single resources
AMon already addresses the demands of future Grids

Open Discussion