
1 Monitoring the Earth System Grid with MDS4 Ann Chervenak USC Information Sciences Institute Jennifer M. Schopf, Laura Pearlman, Mei-Hui Su, Shishir Bharathi, Luca Cinquini, Mike D’Arcy, Neill Miller, David Bernholdt

2 Talk Outline
- Overview of the Earth System Grid
- Overview of Monitoring in the Globus Toolkit
- Globus Monitoring Services in ESG
  - Monitoring and Discovery System
  - Trigger Service
- Summary

3 The Earth System Grid: Turning Climate Datasets into Community Resources www.earthsystemgrid.org

4 The growing importance of climate simulation data
- DOE invests broadly in climate change research:
  - Development of climate models
  - Climate change simulation
  - Model intercomparisons
  - Observational programs
- Climate change research is increasingly data-intensive:
  - Analysis and intercomparison of simulations and observations from many sources
  - Data used by model developers, impacts analysts, and policymakers
(Figure: results from the Parallel Climate Model (PCM) depicting wind vectors, surface pressure, sea surface temperature, and sea ice concentration. Prepared from data published in the ESG using the FERRET analysis tool by Gary Strand, NCAR.)
Slide courtesy of Dave Bernholdt, ORNL

5 Earth System Grid objectives
To support the infrastructure needs of the national and international climate community, ESG provides crucial technology to securely access, monitor, catalog, transport, and distribute data in today's grid computing environment.
(Figure: HPC hardware running climate models feeding ESG sites and the ESG portal.)
Slide courtesy of Dave Bernholdt, ORNL

6 ESG facts and figures
Main ESG Portal:
- 146 TB of data at four locations; 1,059 datasets; 958,072 files
- Includes the past 6 years of joint DOE/NSF climate modeling experiments
- 4,910 registered users
- Downloads to date: 30 TB, 106,572 files
IPCC AR4 ESG Portal:
- 35 TB of data at one location; 77,400 files
- Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change; model data from 13 countries
- 1,245 registered analysis projects
- Downloads to date: 245 TB, 914,400 files, 500 GB/day (average)
More than 300 scientific papers published to date based on analysis of IPCC AR4 data
(Figures: worldwide ESG user base; IPCC daily downloads through 7/2/07)
Slide courtesy of Dave Bernholdt, ORNL

7 ESG architecture and underlying technologies
- Climate data tools
  - Metadata catalog
  - NcML (metadata schema)
  - OPeNDAP-G (aggregation and subsetting)
- Data management
  - Data Mover Lite (DML)
  - Storage Resource Manager (SRM)
- Globus Toolkit
  - Grid Security Infrastructure
  - GridFTP
  - Monitoring and Discovery Services
  - Replica Location Service (RLS)
- Security
  - Access control
  - MyProxy
  - User registration
(Figure: First Generation ESG Architecture. The ESG web portal offers data search, browsing, publishing, usage metrics, access control, user registration, and download; SRM and RLS instances run at NCAR, ORNL, LANL, and LBNL; MSS and HPSS are tertiary data storage systems; data users and providers connect via web browsers and DML.)
Slide courtesy of Dave Bernholdt, ORNL

8 Evolving ESG to petascale
ESG data system evolution:
- 2006 (CCSM, IPCC; terabytes):
  - Central database; centralized curated data archive
  - Time aggregation
  - Distribution by file transport
  - No ESG responsibility for analysis
  - Shopping-cart-oriented web portal
  - Manual data sharing; manual publishing
- Early 2009 (testbed data sharing):
  - Federated metadata; federated portals
  - Unified user interface
  - Selected server-side analysis
  - Location independence
  - Distributed aggregation
- 2011 (CCSM, IPCC, satellite, in situ, biogeochemistry, ecosystems; petabytes):
  - Full data sharing
  - Synchronized federation: metadata and data
  - Full suite of server-side analysis
  - Model/observation integration
  - ESG embedded into desktop productivity tools
  - GIS integration
  - Model intercomparison metrics
  - User support, life-cycle maintenance
Slide courtesy of Dave Bernholdt, ORNL

9 Architecture of the next-generation ESG
- Petascale data archives
- Broader geographical distribution of archives
  - across the United States
  - around the world
- Easy federation of sites
- Increased flexibility and robustness
(Figure: Second Generation ESG Architecture. A federated deployment of ESG gateways (CCES, IPCC, CCSM) and ESG nodes, each with web portal interfaces, applications, and data and metadata holdings; browser and remote application clients (CDAT, NCL, Ferret, GIS, Publishing, OPeNDAP, DML, Modeling, etc.); local, remote, and web services interfaces; application components for data transfer, data publishing, search, analysis, visualization, post-processing, and computation; cross-cutting concerns (security, logging, monitoring); workflow and orchestration; online data distribution backed by deep archives.)
Slide courtesy of Dave Bernholdt, ORNL

10 The team and sponsors
- National Center for Atmospheric Research
- Los Alamos National Laboratory
- Argonne National Laboratory
- Oak Ridge National Laboratory
- USC Information Sciences Institute
- Lawrence Livermore National Laboratory / PCMDI
- Lawrence Berkeley National Laboratory
- National Oceanic & Atmospheric Administration / PMEL (climate data repository and ESG participant)
Slide courtesy of Dave Bernholdt, ORNL

11 Monitoring ESG
- ESG consists of heterogeneous components deployed across multiple administrative domains
- The climate community has come to depend on the ESG infrastructure as a critical resource
  - Failures of ESG components or services can disrupt the work of many scientists
  - Infrastructure downtime must therefore be minimized
- Monitoring components to determine their current state and detect failures is essential
- Monitoring systems:
  - Collect, aggregate, and sometimes act upon data describing system state
  - Help users make resource selection decisions and help administrators detect problems

12 GT4 Monitoring and Discovery System
- A Web service adhering to the Web Services Resource Framework (WSRF) standards
- Consists of two higher-level services:
  - Index service: collects and publishes aggregated information about Grid resources
  - Trigger service: collects resource information from the Index service and performs actions when specified trigger conditions are met
- Information about resources is obtained from external components called information providers
  - In ESG these are currently simple scripts and programs that check the status of services
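The information-provider idea above can be illustrated with a minimal Python sketch. Everything here is invented for illustration: a real MDS4 provider publishes WSRF resource properties, while this toy version just checks TCP reachability and formats a status record (`probe_service`, `status_record`, and the XML attribute names are all hypothetical).

```python
import socket
from datetime import datetime, timezone

def probe_service(host, port, timeout=5.0):
    """Report whether a TCP connection to host:port succeeds.

    Stands in for the 'simple scripts that check the status of services'
    described above; a real check might instead run a protocol-level test
    (e.g., a GridFTP directory listing).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def status_record(name, site, up):
    """Format one provider report as a small XML fragment for the Index.

    The element and attribute names are invented, not the MDS4 schema.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    state = "UP" if up else "DOWN"
    return f'<Service name="{name}" site="{site}" status="{state}" checked="{stamp}"/>'
```

A provider like this would run on a schedule and push its records to the Index service for aggregation.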


14 ESG Services Currently Monitored
- GridFTP server: NCAR
- OPeNDAP server: NCAR
- Web portal: NCAR
- HTTP dataserver: LANL, NCAR
- RLS servers: LANL, LBNL, NCAR, ORNL
- Storage Resource Managers: LBNL, NCAR, ORNL
- Hierarchical mass storage systems: LBNL, NCAR, ORNL

15 Monitoring Overall System Status
- Monitored data are collected in the MDS4 Index service
- Information providers check resource status at a configured frequency (currently every 10 minutes) and report status to the Index service
- The ESG Web portal queries this resource information in the Index service
  - Used to generate an overall picture of the state of ESG resources
  - Displayed on the ESG Web portal page
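A sketch of how per-service reports might be rolled up into the portal's overall picture. The data shapes here, `(service, site, up)` tuples and an "OK"/"DEGRADED" summary, are assumptions for illustration, not the portal's actual query interface.

```python
def overall_status(records):
    """Roll individual service reports up into one per-site summary.

    records: iterable of (service, site, up) tuples, as one polling
    round might yield them.  A site is 'OK' only if every monitored
    service at that site is up.
    """
    by_site = {}
    for service, site, up in records:
        by_site.setdefault(site, []).append((service, up))
    return {site: ("OK" if all(up for _, up in svcs) else "DEGRADED")
            for site, svcs in by_site.items()}
```

The portal page would then render this dictionary as the at-a-glance status table described later in the talk.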

16 Trigger Actions Based on Monitoring Information
- The MDS4 Trigger service periodically polls the Index service
- Based on the current resource status, the Trigger service determines whether specified trigger rules and conditions are satisfied
  - If so, it performs the specified action for each trigger
- Current action: the Trigger service sends email to system administrators when services fail
  - Ideally, system failures can be detected and corrected by administrators before they affect the larger ESG community
- Future plans: richer recovery operations as trigger actions, e.g., automatic restart of failed services
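The poll-evaluate-act cycle can be sketched as follows. The rule representation (a name plus a predicate over polled status) and the `notify` callback standing in for the email action are illustrative choices, not the Trigger service's actual configuration format.

```python
def evaluate_triggers(status, triggers, notify):
    """Fire each trigger whose condition matches the polled status.

    status:   dict mapping (service, site) -> 'UP' or 'DOWN'
    triggers: list of (name, condition) pairs, where condition is a
              predicate over the status dict
    notify:   callable standing in for the email action
    """
    fired = []
    for name, condition in triggers:
        if condition(status):
            notify(f"ESG ALERT [{name}]")  # subject names the failed service/site
            fired.append(name)
    return fired
```

Running this once per polling interval reproduces the behavior described above: only rules whose conditions hold generate notifications.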

17 Example Monitoring Information
Total error messages for May 2006: 47
- Messages related to certificate and configuration problems at LANL: 38
- Failure messages due to a brief interruption in network service at ORNL on 5/13: 2
- HTTP data server failure at NCAR, 5/17: 1
- RLS failure at LLNL, 5/22: 1
- Simultaneous error messages for SRM services at NCAR, ORNL, and LBNL on 5/23: 3
- RLS failure at ORNL, 5/24: 1
- RLS failure at LBNL, 5/31: 1

18 Successes and Lessons Learned in ESG Monitoring
- Overview of current system state for users and system administrators
  - The ESG portal displays an overall picture of the current status of the ESG infrastructure
  - Gives users and administrators an at-a-glance understanding of which resources and services are currently available
- Failure notification
  - Failure messages from the Trigger service have helped system administrators identify and quickly address failed components and services
  - Before the monitoring system was deployed, services could fail and might not be detected until a user tried to access an ESG dataset
  - The MDS4 deployment has enabled a unified interface and notification system across ESG resources

19 Successes and Lessons Learned in ESG Monitoring (cont.)
- More information was needed on failure types
  - An enhancement to MDS4 based on our experience: include additional information about the location and type of the failed service in the subject line of trigger notification email messages
  - This allows message recipients to filter these messages and quickly identify which services need attention

20 Successes and Lessons Learned in ESG Monitoring (cont.)
- Validation of new deployments
  - We sometimes make significant changes to the Grid infrastructure, e.g., modifying service configurations or deploying a new version of a component
  - Such changes may produce a series of failure messages for particular classes of components over a period of days or weeks
  - Example: a pattern of failure messages for RLS servers corresponded to a configuration problem related to updates among the services
  - Example: a series of SRM failure messages related to a new feature with unexpected behavior
  - Monitoring messages helped identify problems with these newly deployed or reconfigured services
  - Conversely, the absence of failure messages can in part validate a new configuration or deployment

21 Successes and Lessons Learned in ESG Monitoring (cont.)
- Failure deduction
  - The monitoring system can be used to deduce the reason for complex failures: system-wide monitoring can detect a pattern of failures that occur close together in time and point to a problem at a different level of the system
  - Example: we used MDS4 to gain insight into why the ESG portal occasionally crashed due to a lack of available file descriptors; the monitoring infrastructure checked file descriptor usage by the different services running on the portal
  - Example: failure messages indicated that SRMs at three different locations had failed simultaneously. Such simultaneous independent failures are highly unlikely; we investigated and found a problem with a query expression in our monitoring software
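The "failures close together in time" heuristic from the SRM example can be sketched as a simple clustering pass over failure events. The time window, site threshold, and event shape are all invented for illustration.

```python
def correlated_failures(events, window=60.0, min_sites=3):
    """Flag clusters of failures close together in time.

    events: list of (timestamp_seconds, service, site) failure reports.
    Returns clusters spanning at least `min_sites` distinct sites within
    `window` seconds of the first event -- unlikely to be independent, so
    worth suspecting a shared cause (e.g., a bad monitoring query rather
    than three real outages).
    """
    events = sorted(events)
    clusters = []
    i = 0
    while i < len(events):
        j = i
        while j + 1 < len(events) and events[j + 1][0] - events[i][0] <= window:
            j += 1
        group = events[i:j + 1]
        if len({site for _, _, site in group}) >= min_sites:
            clusters.append(group)
        i = j + 1
    return clusters
```

A flagged cluster would prompt exactly the investigation described above: look for a common cause before assuming three independent failures.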

22 Successes and Lessons Learned in ESG Monitoring (cont.)
- Warn of certificate problems and imminent expirations
  - All ESG services at the LANL site reported failures simultaneously
  - The problem was the expiration of the host certificate for the ESG node at that site
  - Downtime resulted while the problem was diagnosed and administrators requested and installed a new host certificate
  - To avoid such downtime in the future, we implemented additional information providers and triggers that check the expiration date of host certificates on services where this information can be queried
  - The Trigger service now informs system administrators when certificate expiration is imminent
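A sketch of such a certificate-expiration check. This is not how the ESG information providers actually obtain the date (that detail is not in the slides): `host_cert_not_after` queries a live TLS endpoint via the Python standard library, and `days_until_expiry` does the date arithmetic a warning threshold would use.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Format of the notAfter field as returned by ssl, e.g. 'Jun  1 12:00:00 2026 GMT'
NOTAFTER_FMT = "%b %d %H:%M:%S %Y %Z"

def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp."""
    if now is None:
        now = datetime.now(timezone.utc).replace(tzinfo=None)
    expires = datetime.strptime(not_after, NOTAFTER_FMT)
    return (expires - now) / timedelta(days=1)

def host_cert_not_after(host, port, timeout=5.0):
    """Fetch the peer certificate's notAfter field from a live TLS service."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A trigger would fire when `days_until_expiry(...)` drops below some threshold, say 14 days, giving administrators time to request a replacement certificate.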

23 Successes and Lessons Learned in ESG Monitoring (cont.)
- Scheduled downtime
  - When a particular site has scheduled downtime for maintenance, it is not necessary to send failure messages to system administrators
  - We developed a simple mechanism that disables particular triggers for the specified downtime period
  - The monitoring infrastructure still collects information about service state during this period, but failure conditions do not trigger actions by the Trigger service
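The downtime mechanism can be sketched as a guard that trigger evaluation consults before acting. The schedule format, a site mapped to (start, end) windows, is an invented representation of the "simple mechanism" described above.

```python
from datetime import datetime

def suppressed(site, now, downtimes):
    """True if alerts for this site fall inside a scheduled downtime window.

    downtimes maps a site name to a list of (start, end) datetime pairs.
    Monitoring data is still collected during the window; only the alert
    action is skipped.
    """
    return any(start <= now < end for start, end in downtimes.get(site, []))

# Example schedule: ORNL offline for maintenance on the morning of 5/13.
MAINTENANCE = {
    "ORNL": [(datetime(2006, 5, 13, 8, 0), datetime(2006, 5, 13, 12, 0))],
}
```

Trigger evaluation would simply skip the notify step whenever `suppressed(site, now, MAINTENANCE)` is true.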

24 Acknowledgements
- ESG is funded by the US Department of Energy under the Scientific Discovery through Advanced Computing program
- MDS is funded by the US National Science Foundation under the Office of Cyberinfrastructure
- ESG team:
  - National Center for Atmospheric Research: Don Middleton, Luca Cinquini, Rob Markel, Peter Fox, Jose Garcia, and others
  - Lawrence Livermore National Laboratory: Dean Williams, Bob Drach, and others
  - Argonne National Laboratory: Veronika Nefedova, Ian Foster, Rachana Ananthakrishnan, Frank Siebenlist, and others
  - Lawrence Berkeley National Laboratory: Arie Shoshani, Alex Sim, and others
  - Oak Ridge National Laboratory: David Bernholdt, Meili Chen, and others
  - Los Alamos National Laboratory: Phillip Jones and others
  - USC Information Sciences Institute: Ann Chervenak, Robert Schuler, Shishir Bharathi, Mei-Hui Su
- MDS team:
  - Argonne National Laboratory: Jen Schopf, Neill Miller
  - USC ISI: Laura Pearlman, Mike D'Arcy

25 More on Metadata

26 Metadata Services
- Metadata is information that describes data
- Metadata services allow scientists to:
  - Record information about the creation, transformation, meaning, and quality of data items
  - Query for data items based on these descriptive attributes
- Accurate identification of the desired data items is essential for correct analysis of experimental and simulation results
- In the past, scientists have largely relied on ad hoc methods (descriptive file and directory names, lab notebooks, etc.) to record information about data items
- These methods do not scale to terabyte and petabyte datasets consisting of millions of data items
- Extensible, reliable, high-performance metadata services are required to support registration and query of metadata

27 Presentation from SC2003 talk by Gurmeet Singh

28 Example: ESG Collection-Level Metadata Class Definitions
- Project
  - A project is an organized activity that produces data. The scope and duration of a project may vary, from a few datasets generated over several weeks or months to a multi-year project generating many terabytes. Typically a project has one or more principal investigators and a single funding source.
  - A project may be associated with multiple ensembles, campaigns, and/or investigations. A project may be a subproject of another project.
  - Examples:
    - CMIP (Coupled Model Intercomparison Project)
    - CCSM (Community Climate System Model)
    - PCM (Parallel Climate Model)

29 ESG Collection-Level Metadata (cont.)
- Ensemble
  - An ensemble calculation is a set of closely related simulations: typically all aspects of the model configuration and boundary conditions are held constant, while the initial conditions and/or external forcing are varied in a prescribed manner. Each set of initial conditions generates one or more datasets.
- Campaign
  - A campaign is a set of observational activities that share a common goal (e.g., observation of the ozone layer during the winter/spring months) and are related either geographically (e.g., a campaign at the South Pole) and/or temporally (e.g., measurements of rainfall at several observation stations during December 2003).
- Investigation
  - An investigation is an activity, within a project, that produces data. Its scope is narrower and more focused than the project's. An investigation may be a simulation, experiment, observation, or analysis.

30 Example: ESG Collection-Level Metadata (Other Classes)
- Simulation
- Experiment
- Observation
- Analysis
- Dataset
- Service

31 Attributes of classes
- Project
  - Id: a unique identifier for the project
  - Name: a brief name for the project, intended for display in a browser, etc.
  - Topics: one or more keywords, qualified by an optional encoding, intended for use by specialized search and discovery engines (see, for example, http://gcmd.gsfc.nasa.gov/Resources/valids/gcmd_parameters.html)
  - Persons: project participants and their respective roles
  - Description: a textual description of the project, intended to provide more in-depth information than the Name
  - Notes: additional ad hoc information about the project
  - References: links or references to additional project information (web pages, publications, etc.)
  - Funding: funding agencies or sources
  - Rights: description of the ownership of, and access conditions to, the data holdings of the project

32 Attributes of classes (cont.)
- Ensemble
  - Id: a unique identifier for this ensemble
  - Name: a name for this ensemble
  - Description: a textual description of the ensemble, intended to provide more in-depth information than the Name
  - Notes: additional ad hoc information about the ensemble
  - Persons: those responsible for the ensemble data
  - References: optional links or references to additional information (web pages, publications, etc.)
  - Rights: optional description of the ownership and access conditions for the data holdings of the ensemble, if different from the project
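The two attribute lists above map naturally onto record types. This sketch uses Python dataclasses; the field names follow the slides, but the types, defaults, and the `effective_rights` fallback (modeling "if different from the project") are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Project:
    """Project metadata per the attribute list above (types are guesses)."""
    id: str
    name: str
    topics: List[str] = field(default_factory=list)
    persons: List[Tuple[str, str]] = field(default_factory=list)  # (person, role)
    description: str = ""
    notes: str = ""
    references: List[str] = field(default_factory=list)
    funding: str = ""
    rights: str = ""
    parent: Optional["Project"] = None  # a project may be a subproject

@dataclass
class Ensemble:
    """Ensemble metadata; Rights is optional and falls back to the project's."""
    id: str
    name: str
    project: Optional[Project] = None
    description: str = ""
    notes: str = ""
    persons: List[Tuple[str, str]] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    rights: Optional[str] = None

    def effective_rights(self):
        if self.rights is not None:
            return self.rights
        return self.project.rights if self.project else ""
```

A metadata service would persist and index such records; the dataclass form just makes the schema above concrete.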

33 Standard names
- A standard name is a description of a scientific quantity generated by a model run
- Standard names follow the CF standard name table and are hierarchical
- For example, 'atmosphere' is a standard name category that includes more specific quantities such as 'air pressure':
  - Atmosphere: Air Pressure, ...
  - Carbon Cycle: Biomass Burning Carbon Flux, ...
  - Cloud: Air Pressure at Cloud Base, ...
  - Hydrology: Atmosphere Water Content, ...
  - Ocean: Baroclinic Eastward Sea Water Velocity, ...
  - Radiation: Atmosphere Net Rate of Absorption of Longwave Energy, ...
  - Sea Ice: Direction of Sea-Ice Velocity, ...
  - Surface: Canopy and Surface Water Amount, ...
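The category-to-quantity hierarchy above can be represented as a simple mapping. The entries below are a toy slice of the list shown on the slide, and the lookup helper is invented; the real CF standard name table is far larger.

```python
# Toy slice of the CF hierarchy shown above: category -> example standard names.
CF_CATEGORIES = {
    "Atmosphere": ["Air Pressure"],
    "Cloud": ["Air Pressure at Cloud Base"],
    "Sea Ice": ["Direction of Sea-Ice Velocity"],
    "Surface": ["Canopy and Surface Water Amount"],
}

def category_of(standard_name):
    """Return the category containing a standard name, or None if unknown."""
    for category, names in CF_CATEGORIES.items():
        if standard_name in names:
            return category
    return None
```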

34 Metadata Services in Practice
- Generic metadata services have not proven very useful
  - MCS was used in the Pegasus workflow system to manage its metadata, provenance, etc.
  - It has not been widely used in science deployments
- Instead, virtual organizations (scientists) agree on an appropriate metadata schema to describe their data
- They typically deploy a specialized metadata service
  - A relational database with indexes on domain-specific attributes to support common queries
  - RDF tuple services
- These provide faster, more targeted queries on the agreed metadata than a generic catalog
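The "relational database with indexes on domain-specific attributes" pattern can be sketched with SQLite. The schema (dataset id, project, variable, start year) and the queries are invented examples of the kind of domain-specific attributes a climate virtual organization might agree on, not an actual ESG schema.

```python
import sqlite3

def build_catalog(rows):
    """In-memory sketch of a specialized metadata catalog: a relational
    table with an index on the domain attributes queried most often."""
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE dataset (
                    id TEXT PRIMARY KEY,
                    project TEXT, variable TEXT, start_year INT)""")
    # The index is the point: common domain queries hit it directly.
    db.execute("CREATE INDEX idx_proj_var ON dataset (project, variable)")
    db.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)
    return db

def find_datasets(db, project, variable):
    """A 'common query' over the agreed domain attributes."""
    cur = db.execute(
        "SELECT id FROM dataset WHERE project = ? AND variable = ?",
        (project, variable))
    return [row[0] for row in cur]
```

Because the schema is agreed in advance, queries like this stay fast and targeted, which is the advantage over a generic attribute-value catalog that the slide describes.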

