1 Monitoring: Grid, Fabric, Network Jennifer M. Schopf, Argonne National Lab PPDG Review 28 April 2003, Fermilab

2 Monitoring and PPDG
Many monitoring tools currently available
– Different use cases
– Different strengths
– Legacy systems
Much of PPDG monitoring work is by non-funded collaborators
– Les Cottrell, SLAC, IEPM-BW
– Iosif Legrand, Caltech, MonALISA
– Brian Tierney, LBNL, NetLogger, pyGMA, NTAF
– Wisconsin group, Hawkeye

3 Tools in a Nutshell
(Matrix columns: Grid level / Fabric level / Network / Schema)
MDS (Globus): XXX
Hawkeye (Condor): XXX
MonALISA: XXX
Ganglia: X
IEPM-BW: XX
NetLogger: XXX
pyGMA: X
NTAF (LBNL): X
GLUE schema: X

4 PPDG Role in Monitoring
Deployment and evaluation
– Use on production testbeds
Requirements back to developers
– Additional information sources
– Realistic use cases
Furthering of interoperability goals
– GLUE schema
– Common interfaces

5 Deployment
(Matrix columns: CMS / ATLAS / STAR / BaBar / D0 / TJNAF)
MDS (Globus): XXXX
Hawkeye (Condor): X
MonALISA: X
Ganglia: XXXX
IEPM-BW: XXX
NetLogger: XX
pyGMA: X
GLUE schema: XX
Local solution: XXX

6 Interoperability between Efforts
(Matrix columns: MDS / Hawkeye / MonALISA / Ganglia / IEPM-BW / NetLogger / pyGMA / NTAF / GLUE schema)
MDS: XXXXUX
Hawkeye: XX
MonALISA: XXXXU
Ganglia: XXXX
IEPM-BW: XUXX
NetLogger: UXXXX
pyGMA: UXXXXXU
NTAF: UXXX
GLUE schema: XXXX
X – currently available, U – under consideration

7 Overview
Examples of interfacing between tools
– STAR use of Ganglia/MDS
– Ganglia extension in ATLAS
– MonALISA interfaces to Hawkeye and MDS in CMS
Scalability analysis
Some future steps

8 Ganglia–MDS Interface
STAR efforts and use: Stratos Efstathiadis, BNL
Developed a modified version of the Ganglia information provider (IP)
– Written in Perl
– Matches the current GLUE CE schema
– Can connect to the Ganglia Meta Daemon (gmetad) or the Ganglia Monitoring Daemon (gmond)
– Simpler and more flexible
Currently being tested at PDSF and BNL
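For illustration, here is a rough, hypothetical analogue of such an information provider. The STAR provider described above is written in Perl; this Python sketch only reads the XML that a gmond/gmetad daemon serves on connect (port 8649 is the usual Ganglia default) and prints LDIF-style entries with GLUE-like attribute names, which are illustrative rather than copied from the actual CE schema.

# Illustrative sketch of a Ganglia -> MDS information provider.
# Not the STAR Perl provider; attribute names are GLUE-like approximations.
import socket
import xml.etree.ElementTree as ET

def read_gmond_xml(host="localhost", port=8649):
    """Read the full XML cluster state that gmond/gmetad dumps on connect."""
    chunks = []
    with socket.create_connection((host, port)) as sock:
        while True:
            data = sock.recv(8192)
            if not data:
                break
            chunks.append(data)
    return ET.fromstring(b"".join(chunks))

def to_ldif(root, base_dn="Mds-Vo-name=local,o=grid"):
    """Emit one LDIF-style entry per host, using GLUE-like attribute names."""
    lines = []
    for host in root.iter("HOST"):
        name = host.get("NAME")
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        lines.append(f"dn: GlueHostName={name},{base_dn}")
        lines.append(f"GlueHostName: {name}")
        if "cpu_num" in metrics:
            lines.append(f"GlueHostArchitectureSMPSize: {metrics['cpu_num']}")
        if "load_one" in metrics:
            lines.append(f"GlueHostProcessorLoadLast1Min: {metrics['load_one']}")
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    print(to_ldif(read_gmond_xml()))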

9 Ganglia Extensions in ATLAS
Monitor cluster health
Information added through Ganglia creates an additional aggregation level, combining different clusters into a “metacluster”

10 MonALISA in CMS
MonALISA (Caltech)
– Dynamic information/resource discovery using intelligent agents
– Java/Jini with interfaces to SNMP, MDS, Ganglia, and Hawkeye
– WSDL/SOAP with UDDI
Aim to incorporate into a “Grid Control Room” service
Integration with MDS and Hawkeye
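The multi-source collection idea above (one service with interfaces to SNMP, MDS, Ganglia, and Hawkeye) can be sketched with a simple adapter pattern. This is not MonALISA's Java/Jini code; the classes and stubbed values below are invented purely to illustrate the design.

# Illustrative adapter-pattern sketch, not MonALISA's implementation:
# heterogeneous sources are collected behind one common interface.
from abc import ABC, abstractmethod
from typing import Dict

class MonitorAdapter(ABC):
    """Common interface every data-source adapter implements."""
    @abstractmethod
    def collect(self) -> Dict[str, float]:
        ...

class GangliaAdapter(MonitorAdapter):
    def collect(self) -> Dict[str, float]:
        # A real adapter would parse gmond's XML; stubbed here.
        return {"load_one": 0.42, "cpu_num": 16}

class SNMPAdapter(MonitorAdapter):
    def collect(self) -> Dict[str, float]:
        # A real adapter would walk SNMP OIDs on the target device; stubbed.
        return {"ifInOctets": 123456.0}

def collect_all(adapters):
    merged = {}
    for adapter in adapters:
        merged.update(adapter.collect())
    return merged

print(collect_all([GangliaAdapter(), SNMPAdapter()]))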

11 Scalability
Comparison of MDS, R-GMA, Hawkeye
Zhang, Freschl, and Schopf, “A Performance Study of Monitoring and Information Services for Distributed Systems,” to appear in HPDC 2003
– How many users can query an information server at a time?
– How many users can query a directory server?
– How does an information server scale with the amount of data in it?
– How does an aggregator scale with the number of information servers registered to it?
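The first question above (concurrent querying) is the kind of thing a small load-test harness can measure. The sketch below is not the benchmark used in the HPDC paper; the endpoint URL is hypothetical and query_once() is a placeholder to be pointed at whichever information server (MDS, R-GMA, Hawkeye) is under test.

# Minimal load-test sketch: aggregate query throughput vs. number of users.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

INFO_SERVER_URL = "http://info-server.example.org:8080/query"  # hypothetical

def query_once():
    """One user query against the information server (placeholder)."""
    with urlopen(INFO_SERVER_URL, timeout=10) as resp:
        return resp.read()

def run_load_test(num_users, queries_per_user=50):
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        futures = [pool.submit(query_once)
                   for _ in range(num_users * queries_per_user)]
        for f in futures:
            f.result()
    elapsed = time.time() - start
    return (num_users * queries_per_user) / elapsed

if __name__ == "__main__":
    for users in (1, 10, 50, 100):
        print(f"{users:4d} concurrent users: {run_load_test(users):8.1f} queries/s")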

12 Overall Results
Performance can be a matter of deployment
– Effect of background load
– Effect of network bandwidth
Performance can be affected by underlying infrastructure
– LDAP/Java strengths and weaknesses
Performance can be improved using standard techniques
– Caching, multi-threading, etc.
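As a generic illustration of the caching point (not code from any of the tools discussed), a short-lived cache in front of an expensive information probe lets repeated queries be answered without re-running the probe each time:

# Generic TTL-cache illustration for an information service front end.
import time

class TTLCache:
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get(self, key, compute):
        """Return a cached value if still fresh, otherwise recompute and cache it."""
        now = time.time()
        hit = self._store.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=30)

def expensive_probe():
    # Stand-in for an expensive information-provider run (e.g. a cluster scan).
    time.sleep(1.0)
    return {"free_cpus": 42}

# The first call pays the probe cost; later calls within 30 s hit the cache.
print(cache.get("ce-status", expensive_probe))
print(cache.get("ce-status", expensive_probe))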

13 MonALISA Performance
(Charts: I/O threads and CPU usage; Dell I8100, ~1 GHz)
Test: a large SNMP query (~200 metric values) on a 500-node farm every 60 s
~1600 metric values collected per second from one MonALISA service
“lxshare” cluster at CERN, ~600 nodes

14 Future: OGSA and Monitoring
Open Grid Services Architecture (OGSA) defines standard interfaces and behaviors for distributed system integration, especially:
– Standard XML-based service information model
– Standard interfaces for push- and pull-mode access to service data (notification and subscription)
Every service has its own service data
– OGSA has a common mechanism to expose a service instance’s state data to service requestors for query, update, and change notification
– Monitoring data is “baked right in”
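A toy sketch of the push/pull idea, not the OGSI/OGSA interfaces themselves: a service-data holder that requestors can either query (pull) or subscribe to for change notifications (push). The "gram" instance and "activeJobs" element below are made-up examples.

# Toy push/pull service-data illustration (not the OGSI API).
from typing import Any, Callable, Dict, List

class ServiceData:
    """Holds a service instance's state data and notifies subscribers on change."""

    def __init__(self):
        self._data: Dict[str, Any] = {}
        self._subscribers: List[Callable[[str, Any], None]] = []

    def query(self, name: str) -> Any:
        """Pull model: a requestor asks for a named service data element."""
        return self._data.get(name)

    def subscribe(self, callback: Callable[[str, Any], None]) -> None:
        """Push model: a requestor is called back whenever an element changes."""
        self._subscribers.append(callback)

    def update(self, name: str, value: Any) -> None:
        self._data[name] = value
        for cb in self._subscribers:
            cb(name, value)

gram = ServiceData()
gram.subscribe(lambda name, value: print(f"notify: {name} = {value}"))
gram.update("activeJobs", 17)          # pushes a change notification
print(gram.query("activeJobs"))        # pull-mode query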

15 OGSA-Compatible Monitoring
MDS3
– Part of the OGSA reference implementation, GT3
– Release will include full data in the GLUE schema for CE; service data from RFT, RLS, GRAM; GridFTP server data; SW version and path data
– Simplest higher-level service is the caching index service, much like the GIIS in MDS 2.x
MonALISA
– Will be compatible with OGSI-spec registration/subscription services
– Plans to have adapters that can interface to the OGSI service data
LBNL tools are also adapting to the OGSI spec

16 Future Work – Interoperability
Efforts will continue to make the tools more interoperable
– Many tools have the hooks to do this; it’s just a matter of filling in the slots
We need a better understanding of the requirements from the applications

17 Summary
Many monitoring solutions are in use by different experiments
Additional experience is leading towards common uses and deployments
Ongoing work towards the use of common tools, common schemas, and naming conventions
Still need better identification of requirements, and the involvement of application groups to work together on a common, consistent infrastructure

18 Additional Details

19 GLUE-Schema Effort
Part of the HICB/JTB GLUE framework
To address the need for common schemas between projects
– Framework independent
– Something to translate into, not a requirement within the fabric layer
Mailing list: glue-schema@hicb.org
www.hicb.org/glue/glue-schema/schema.html

20 GLUE Schema Status
Compute Element (CE) schema:
– Currently being used in EDG (MDS) and MDS2
– Found a couple of minor things missing, which will be added in the next version
– Will be in MDS-3
Storage Element (SE) schema:
– Lots of good discussion to finalize this at CHEP
– Will start to use this in the EDG (R-GMA) testbed 2 later this month
Network Element (NE) schema:
– Merged ideas from EDG (UK group) with DataTAG (Italian group)
– GGF NM-WG is now working on this too

21 Globus MDS2
Monitoring and Discovery Service
MDS has been accepted as core software for monitoring and presentation of information at the Grid level
GIIS set up as part of collaboration with iVDGL
– Presents an overall picture of the state of the Grid sites
Work continuing to interface it to local monitoring systems
– Each site/experiment has preferred local solutions
– Needed the GLUE schema to make this happen
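Since an MDS2 GRIS/GIIS is an LDAP server, it can be queried with any LDAP client. A minimal sketch using the python-ldap package follows; the hostname is hypothetical, and port 2135 with base "Mds-Vo-name=local,o=grid" are the usual MDS2 defaults rather than anything specific to the iVDGL setup described above.

# Minimal sketch of querying an MDS2 GIIS over LDAP (requires python-ldap).
import ldap

GIIS_HOST = "giis.example.org"          # hypothetical GIIS host
BASE_DN = "Mds-Vo-name=local,o=grid"    # usual MDS2 default base

conn = ldap.initialize(f"ldap://{GIIS_HOST}:2135")
conn.simple_bind_s()                    # MDS2 normally allows anonymous reads

# Pull every entry below the base and print a few attributes per DN.
results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, "(objectClass=*)")
for dn, attrs in results:
    print(dn)
    for name, values in list(attrs.items())[:5]:
        print(f"  {name}: {values[0].decode(errors='replace')}")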

22 MDS-3 in June Release
All the data currently in core MDS-2
Full data in the GLUE schema for CE
Service data from RFT, RLS, GRAM
GridFTP server data, SW version and path data
Simplest higher-level service is the caching index service
– Much like the GIIS in MDS 2.x
– Will have configurability like a GIIS hierarchy
– Will also have PHP-style scripts, much as available today

23 MonALISA Current Status
MonALISA has been running for several months at all the US-CMS production sites and at CERN; it has proved to be stable and scalable (at CERN it is monitoring ~600 nodes)
It is used to monitor several major Internet connections (CERN-US, CERN-GEANT, Taiwan-Chicago, the DataTAG link, …)
MonALISA is a prototype service under development. It is based on the code-mobility paradigm, which provides the mechanism for a consistent, dynamic invocation of components in large, distributed systems.
http://monalisa.cern.ch/MONALISA

24 Hawkeye
Developed by the Condor group
Focus: automatic problem detection
Underlying infrastructure builds on Condor ClassAd technology
– Condor ClassAd language to identify resources in a pool
– ClassAd matchmaking to execute jobs, based on attribute values of resources, to identify problems in a pool
Schema-free representation allows users to easily add new types of information to Hawkeye
Information probes run on individual cluster nodes and report to a central collector
Easy to add new information probes
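As a sketch of how adding a probe is meant to work, the following hypothetical probe gathers a few node metrics and prints ClassAd-style "Name = value" pairs; the attribute names and exact output convention here are assumptions for illustration, not taken from the Hawkeye documentation.

#!/usr/bin/env python3
# Hypothetical Hawkeye-style information probe: report a few node metrics
# as ClassAd-style attributes on stdout (attribute names are made up).
import os
import shutil

def disk_free_mb(path="/"):
    usage = shutil.disk_usage(path)
    return usage.free // (1024 * 1024)

attrs = {
    "RootDiskFreeMB": disk_free_mb("/"),
    "LoadAvg1Min": os.getloadavg()[0],          # Unix-only
    "NumProcs": len([p for p in os.listdir("/proc") if p.isdigit()]),
}

# Emit ClassAd-style "Name = value" pairs for the central collector.
for name, value in attrs.items():
    print(f"{name} = {value}")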

25 Hawkeye Recent Accomplishments
Release candidate for version 1.0 has been released
Used to monitor the USCMS testbed
Used to monitor the University of Wisconsin-Madison Condor pool

26 PingER
PIs: Les Cottrell, SLAC
PingER: active end-to-end performance monitoring for the Research and Education communities
High-Performance Network Research, SciDAC/Base; date prepared: 1/7/02
www-iepm.slac.stanford.edu

Impact:
– Increase network and Grid application bulk throughput over high-delay, high-bandwidth networks (like DOE’s ESnet)
– Provide troubleshooting information for networkers and users by identifying the onset and magnitude of performance changes, and whether they appear in the application or the network
– Provide a network performance database, analysis, and navigable reports from active monitoring
Connections:
– SciDAC: High Energy Nuclear Physics, Bandwidth Estimation, Data Grid, INCITE
– Base: Network Monitoring, Data Grid, Transport Protocols

Novel ideas:
– Low-impact network performance measurements to most of the Internet-connected world, providing delay, loss, and connectivity information over long time periods
– Network AND application high-throughput performance measurements allowing comparisons and identification of bottlenecks
– Continuous, robust measurement, analysis, and web-based reporting of results, available worldwide
– Simple infrastructure enabling rapid deployment, locating within an application host, and local site management to avoid security issues

Tasks:
– Develop/deploy a simple, robust, ssh-based active end-to-end measurement and management infrastructure
– Develop analysis/reporting tools
– Integrate new application and network measurement tools into the infrastructure
– Compare and validate various tools, and determine regions of applicability

Milestones/Dates/Status:
Infrastructure development
– Develop simple window tuning tool: 08/01 (done 08/01)
– Initial infrastructure developed: 12/01 (done 12/01)
– Infrastructure installed at one site: 01/02 (done 01/02)
– Improve and extend infrastructure: 06/02
– Deploy at 2nd site: 08/02
– Evaluate GIMI/DMF alternatives: 10/02
– Extend deployment to PPDG sites: 03/03
Develop analysis/reporting tools
– First version for standard apps: 02/02
Integrate new apps & net tools
– GridFTP and demo: 05/05
– INCITE tools: 08/02
– BW measurement tools (e.g. pathload): 01/03
Compare & validate tools
– GridFTP: 09/02
– BW tools: 04/03
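The low-impact, ping-based measurement idea can be illustrated with a toy script (this is not PingER's code): ping each remote site a few times and record loss and round-trip times, relying on the summary printed by the Linux ping command. The site list below is arbitrary.

# Toy illustration of ping-based end-to-end measurement (not PingER itself).
import re
import subprocess

SITES = ["www.cern.ch", "www.fnal.gov", "www.slac.stanford.edu"]

def ping_site(host, count=10):
    """Return (loss_percent, (min, avg, max) RTT in ms) or None if unparsable."""
    proc = subprocess.run(["ping", "-c", str(count), "-q", host],
                          capture_output=True, text=True)
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", proc.stdout)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", proc.stdout)
    if not loss:
        return None
    return float(loss.group(1)), rtt.groups() if rtt else None

for site in SITES:
    print(site, ping_site(site))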

27 IEPM-BW Status
Now measuring to about 55 sites (mainly Grid, HENP, and major networking sites)
10 measuring sites in 5 countries, 5 in production
Data and analyzed results are available at http://www.slac.stanford.edu/comp/net/bandwidth-tests/antonia/html/slac_wan_bw_tests.html
PingER results have been plugged into MDS
IEPM-BW and PingER data are available via web services; we are aligning the naming with the GGF NM-WG and emerging GGF schemas
We will incorporate and evaluate different tests (e.g. tsunami, GridFTP, UDPmon, new bandwidth estimators, new quick iperf)
We are also focusing on making the data useful, working with the Internet2 PiPES project on long- and short-term predictions and troubleshooting

28 28 Apr 2003J. Schopf, PPDG Review28 SAM-Grid

