Presentation transcript:

1 Research and Development

2 R&D Agenda
- Security
- Bulk Data Movement
- Data Replication and Mirroring
- Monitoring
- Metrics
- Versioning
- Product Services

3 Security: Single Sign-On Solutions
- Goal: Single Sign-On (SSO) across browsers and non-browser clients
- Public Key Infrastructure (PKI) SSO
  - SSO for non-browser applications, like GridFTP
  - SSO through X.509 public key certificates issued by MyProxy Online Certification Authority (CA) with username/password
  - Auto-provisioning of trust configuration
- Web SSO
  - SSO for http/https applications through OpenID
  - OpenID Identity Provider (IdP) with username/password
- Web-SSO & PKI-SSO share username/password DB
  - Single primary authentication mechanism for end user

4 Security: Integrated WebSSO & PKI-SSO

5 Security: MyProxy as Online CA
- MyProxy: Open Source software from NCSA
- Online CA is one of its many capabilities
- Different primary authentication mechanisms through standardized Pluggable Authentication Module (PAM)
- Shipped with Globus Toolkit, supported on various platforms
- Client package as separate deployment, including Java clients and API

Earth System Grid Center for Enabling Technologies (ESG-CET)

6 Security: Auto-Provisioning
- PKI-SSO solutions require configuration of trust-roots
  - Identity providers (IdPs), Certification Authorities (CAs)
  - Revocation lists
- Up-to-date configuration required at servers and clients
  - Scalability issues with large numbers of clients
- MyProxy provides auto-provisioning option
  - Integrated with login
  - Transparently updates CAs and CRLs
  - Also extended for use in server provisioning

7 Security: OpenID
- OpenID provides SSO across multiple servers and can leverage multiple IdPs
- OpenID satisfies ESG security requirements
- OpenID uses standard HTTP/HTTPS protocol
- Use ESG-specific OpenID profile to ensure safe deployment
  - All communication with IdP requires SSL (client to IdP and IdP to RP)
  - Yadis IDs (URIs) for OpenID identifiers
  - Resource Providers (RP) enforce a white list of IdPs
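The white-list and SSL rules in the profile above can be illustrated with a minimal Python sketch. It is not the ESG implementation: the host names in `TRUSTED_IDP_HOSTS` and the function name `is_acceptable_idp` are hypothetical, and a real Resource Provider would perform full OpenID discovery and verification on top of this check.

```python
from urllib.parse import urlparse

# Hypothetical white list of trusted IdP hosts (illustrative names only).
TRUSTED_IDP_HOSTS = {"idp.example.org", "login.example.net"}


def is_acceptable_idp(endpoint_url, trusted_hosts=TRUSTED_IDP_HOSTS):
    """Accept an IdP endpoint only if it uses HTTPS and its host is white-listed."""
    parts = urlparse(endpoint_url)
    if parts.scheme != "https":
        # The profile requires SSL for all communication with the IdP.
        return False
    # urlparse lowercases the host name and strips any port number.
    host = parts.hostname or ""
    return host in trusted_hosts
```

A Resource Provider would call such a check before redirecting a user to the IdP, and reject the login attempt if it returns `False`.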

8 Security: OpenID4Java
- OpenID4Java: Open Source software
  - ESG developers contribute enhancements back
- Deployable as independent package into standard application servers
  - Integrates well with ESG’s application server software
- Built-in support:
  - SSL (encrypted communication)
  - User attributes push
- Java API to write authentication filters and identity providers
- Extended to support attributes and multiple identity providers

9 Bulk Data Movement
- Requirements
  - Access all data holdings through uniform interfaces, including disk pools and mass storage systems on various nodes, using various security models
  - Allocate space quotas to users dynamically on gateways in order to serve files to clients
  - Manage file lifetimes in the allocated spaces, and automatically clean up spaces for reuse
  - Provide easy-to-use user facilities to download many files
  - Manage large-scale robust data movement for replication of core data between nodes
- Storage Resource Management (SRM) tools support these requirements in ESG

10 Bulk Data Movement: SRM Technology and BeStMan
- Storage Resource Managers (SRMs) are middleware components over shared distributed storage components that provide:
  - Dynamic space allocation
  - Dynamic file management in spaces
  - Uniform interface to all storage systems
- The Berkeley Storage Manager (BeStMan) is an implementation of the SRM standard
  - The SRM specification is an OGF (Open Grid Forum) standard that was developed over the last 7 years
  - BeStMan is used in ESG, several High-Energy-Physics (HEP) experiments, and other applications
- BeStMan in ESG (see figure next slide)
  - Used for coordinating space allocation and transparent access and file movement between ESG nodes and the gateway
  - Currently interfaces to HPSS in NERSC and ORNL, to MSS at NCAR, and to disk systems at LLNL and LANL
  - Also used to manage space on the NCAR gateway
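The ideas of dynamic space allocation and file lifetimes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the SRM interface or BeStMan code: the class `SpaceReservation` and its methods are hypothetical names, and time is passed in explicitly as a number of seconds so the behavior is easy to follow.

```python
class SpaceReservation:
    """Toy model of a quota-limited space whose files expire after a lifetime."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.files = {}  # file name -> (size_bytes, expiry_time)

    def used_bytes(self):
        return sum(size for size, _ in self.files.values())

    def cleanup(self, now):
        """Remove files whose lifetime has ended and return their names."""
        expired = [name for name, (_, expiry) in self.files.items() if expiry <= now]
        for name in expired:
            del self.files[name]
        return expired

    def put(self, name, size_bytes, lifetime_s, now):
        """Store a file if it fits in the quota after expired files are reclaimed."""
        self.cleanup(now)
        # Replacing an existing file frees its old size first.
        existing = self.files.get(name, (0, 0))[0]
        if self.used_bytes() - existing + size_bytes > self.quota_bytes:
            return False
        self.files[name] = (size_bytes, now + lifetime_s)
        return True
```

In this sketch, a request that does not fit is simply refused; once earlier files pass their lifetime, the space is reclaimed automatically and the same request succeeds.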

11 Bulk Data Movement: Use of BeStMan in ESG
BeStMan at the gateway accesses the BeStMan instances at all other nodes to get requested files (highlighted in purple)

12 DataMoverLite (DML): Simplifying Data Movement to Clients
Goal: automate pulling of files into user’s workstation
- Uses various transfer protocols (GridFTP, bbcp, https, …)
- Has a GUI that shows transfer progress, or summary progress on the command line
- Supports entire directory transfers
- Supports suspend/resume operations
- DML available on Linux, PC, Mac
- GUI shows info on completed, active, pending transfers
- Also, file sizes, transfer times, transfer speed

13 Bulk Data Movement Service Requirements
- Move terabytes to petabytes (many thousands of files)
- Asynchronous long-lasting operation
- Recovery from transient failures and automatic restart
- Take advantage of (dynamic) network provisioning
- Use GridFTP, other protocols if necessary
- Space verification at target
- Support for data checksums
- On-demand transfer status information
- On-demand completion time estimates
- Statistics collection
- For security reasons bulk data movement needs to be done in “pull mode”
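Two of the requirements above, recovery from transient failures and checksum support, can be combined in a small Python sketch of a pull-mode transfer. The names `pull_with_retry` and `fetch` are hypothetical: `fetch` stands in for whatever transfer client (for example a GridFTP wrapper) actually pulls the bytes, and SHA-256 is chosen here only for illustration since the slides do not name a checksum algorithm.

```python
import hashlib
import time


def sha256_hex(data):
    """Return the SHA-256 checksum of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()


def pull_with_retry(fetch, expected_sha256, max_attempts=3, delay_s=0.0):
    """Pull data with a caller-supplied fetch(), retrying on errors or bad checksums.

    Returns (data, attempts_used); raises RuntimeError when all attempts fail.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch()
        except OSError as exc:  # treated as a transient failure in this sketch
            last_error = exc
        else:
            if sha256_hex(data) == expected_sha256:
                return data, attempt
            last_error = ValueError("checksum mismatch")
        if attempt < max_attempts:
            time.sleep(delay_s)
    raise RuntimeError(f"transfer failed after {max_attempts} attempts") from last_error
```

The target side drives the loop, which matches the pull-mode requirement: nothing is pushed into the target, and a file is accepted only after its checksum matches the published value.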

14 Workflow for Future Bulk Data Movement Service
[Workflow diagram. Box labels: Multi-file request coordinator; Verify storage at Target; Replicate directory structure; Generate plan using statistics; Monitor and generate statistics; Recovery and restart; On-demand status; Checksum comparison; Dynamic progress estimation; File transfer client; Request submission; Initial request estimation; Compose request for failed files; Initialization; Execution; Suspend and resume]
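Two of the workflow steps named on this slide, dynamic progress estimation and composing a request for failed files, lend themselves to a short Python sketch. Both functions are hypothetical illustrations; the estimate assumes the average transfer rate observed so far continues, which is only one of many possible estimation strategies.

```python
def estimate_remaining_seconds(total_bytes, transferred_bytes, elapsed_s):
    """Estimate time to completion from the average rate observed so far.

    Returns None when no data has moved yet, since no rate is available.
    """
    if transferred_bytes <= 0 or elapsed_s <= 0:
        return None
    rate = transferred_bytes / elapsed_s  # bytes per second
    return (total_bytes - transferred_bytes) / rate


def compose_retry_request(statuses):
    """Build a follow-up request listing only the files that failed.

    statuses maps file name -> True (transferred) or False (failed).
    """
    return sorted(name for name, ok in statuses.items() if not ok)
```

For example, after 250 of 1000 bytes in 10 seconds the estimate is 30 more seconds, and a request covering three files of which two failed yields a retry request for just those two.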

15 Data Replication and Mirroring
- Requirement: several mirror sites around the world want to host key subsets (called a “core”) of ESG data sets
- This is a new requirement for ESG
  - Replication of climate data sets was not originally an ESG goal
  - Originally considered impractical because of large size of climate data sets
- With increasing importance of the IPCC data, international sites want to replicate or “mirror” key data sets
  - Give scientists in a geographical region access to a “local” copy
  - Reduce wide area latencies for data access
  - Provide increased fault tolerance and disaster protection, since data sets are available at multiple sites

16 Impact of Data Replication/Mirroring
- This work will make ESG data sets more accessible to climate scientists outside of the ESG-CET project
- Initial planned mirror sites:
  - UK’s British Atmospheric Data Centre (BADC)
  - Germany’s Max Planck Institute for Meteorology (MPIM)
  - Both have participated in design discussions for mirroring functionality
- Other mirror sites likely (e.g., in Asia)
  - Global network topology considerations
- Impact will be to increase the use of ESG and CMIP5 data sets by scientists around the world, thus advancing climate science discoveries

17 Requirements for Data Mirroring
- Newly published data set(s) are added to a common core produced at a gateway
- A mirror site replicates some or all of the data sets from the common core published by a gateway
- Changes to existing data sets (additions, deletions, replacements, modifications) are propagated from publishing gateway to mirror sites
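The propagation requirement can be pictured as a comparison of two catalogs. The Python sketch below is an assumption-laden illustration, not the ESG design: it represents each catalog as a plain dictionary from a data set identifier to a version label, and the function name `plan_mirror_updates` is hypothetical.

```python
def plan_mirror_updates(source, mirror):
    """Compare a publishing gateway's catalog with a mirror's catalog.

    Both arguments map data set id -> version label (or checksum).
    Returns the ids the mirror must add, delete, or replace.
    """
    to_add = sorted(k for k in source if k not in mirror)
    to_delete = sorted(k for k in mirror if k not in source)
    to_replace = sorted(k for k in source if k in mirror and source[k] != mirror[k])
    return {"add": to_add, "delete": to_delete, "replace": to_replace}
```

In this model, a modification shows up as a changed version label, so it falls into the "replace" list together with outright replacements.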

18 Data Mirroring Plans Going Forward
- Implementation plan (currently in progress) involves integration of several key ESG components and new functionality
  - Choose among available source replicas for data and metadata
  - Invoke the Bulk Data Movement component to copy data sets reliably to the mirror site’s data node
  - Use existing ESG metadata API operations to query the relevant metadata at the publishing node
  - Use a modified version of the ESG publication client to publish newly replicated data sets at the mirror site’s gateway
  - Identify updates that need to be propagated to mirror sites using versioning functionality
- Technical objectives
  - In the next year: complete initial implementation and deployment; evaluate data mirroring at sites in ESG, Europe
  - Add functionality, including support for automatic subscription and notification of mirrored data sets

19 Replication Architecture (1)

20 Replication Architecture (2)

21 Monitoring
- Monitoring has contributed significantly to the robustness of the ESG infrastructure
- Based on the Globus Monitoring and Discovery System (MDS)
- ESG uses MDS to monitor the status of components in the distributed system
  - GridFTP data transfer services
  - Storage Resource Managers (SRMs)
  - NCAR portal
  - HTTP data services
  - OPeNDAP services
  - Replica Location Services (RLSs)

22 Globus Monitoring and Discovery System
- MDS Index Service
  - Collects status information from information providers at each component
  - Reports whether a particular service being monitored is currently working correctly
- MDS Trigger Service
  - Takes actions based on monitored conditions
  - Sends e-mails to the Earth System Grid administrators’ mailing list when components fail
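The index-plus-trigger pattern described above can be sketched generically in Python. This does not use or reproduce the Globus MDS API: `evaluate_triggers` is a hypothetical function, the status dictionary stands in for what an index service would collect, and `notify` stands in for the action a trigger would take, such as mailing an administrators' list.

```python
def evaluate_triggers(statuses, notify):
    """Fire a notification for every monitored component that is not working.

    statuses maps component name -> True (working) or False (failed).
    notify is any callable that accepts one message string.
    Returns the sorted list of failed component names.
    """
    failed = sorted(name for name, ok in statuses.items() if not ok)
    for name in failed:
        notify(f"ESG component failed: {name}")
    return failed
```

Passing `notify` in as a parameter keeps the condition check separate from the action, so the same check could drive e-mail, logging, or an automatic restart.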

23 Impact of Monitoring and Future Plans
- Has resulted in much faster recovery of failed services in the distributed ESG infrastructure
- Lower downtime of our infrastructure
- The ESG team is quickly informed when components fail
  - Allows the team to quickly restart failed services
  - Often before failures are encountered by users
- We plan to deploy yet more sophisticated monitoring
  - ESG infrastructure increasingly distributed, federated
  - Also want to monitor status of mirror sites worldwide
  - Monitor service performance as well as availability
- Investigating NetLogger, perfSONAR

24 Metrics
- Metrics are required to track and record user interactions with the ESG enterprise system
- Reporting is required to show the benefits of the ESG enterprise system to the scientific community at large
- ESG Gateway requests metric data from its Data Nodes
- An ESG Gateway will periodically download metrics data (SRM, OPeNDAP, LAS, server hardware performance) gathered by a Data Node for a given interval of time
- Returned metrics data will then be stored at the ESG Gateway for future metrics reports

25 Metrics Requirements

26 Metrics Requirements

27 Metrics Progress
The gathering of important metrics for the ESG Gateway has been completed
- User registrations
- User logins
- File downloads
- User clickstreams
- Browser type usage
Report generation for key metrics has been completed
- Total users registered, including monthly trends
- Total files downloaded, including monthly trends
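A monthly-trend report of the kind listed above reduces to counting events per calendar month. The Python sketch below shows one way to do that; the function name `monthly_counts` is hypothetical and the input is assumed to be a plain list of event dates (registrations or downloads), not the actual ESG metrics format.

```python
from collections import Counter
from datetime import date


def monthly_counts(event_dates):
    """Count events per calendar month, keyed as 'YYYY-MM' in chronological order."""
    counts = Counter(d.strftime("%Y-%m") for d in event_dates)
    return dict(sorted(counts.items()))
```

Running totals, such as total users registered to date, can then be obtained by accumulating these monthly values in order.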

28 Metrics Plan Going Forward
Several improvements are required in the near term for Metrics
- Design and development of the Data Node “black box” metrics gathering software
- Design and development of auto-generated report notifications via e-mail
- Design and development of a star schema for the metrics database
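A star schema places one fact table at the center and links it to small dimension tables. The Python sketch below builds such a layout in an in-memory SQLite database to show the general shape; every table name, column name, and sample row is an assumption made for illustration, since the slides do not describe the planned ESG schema.

```python
import sqlite3

# Hypothetical star schema: one fact table referencing three dimension tables.
SCHEMA = """
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_file (file_id INTEGER PRIMARY KEY, path TEXT, size_bytes INTEGER);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_download (
    user_id INTEGER REFERENCES dim_user(user_id),
    file_id INTEGER REFERENCES dim_file(file_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    bytes_sent INTEGER
);
"""


def build_demo_db():
    """Create the schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dim_user VALUES (1, 'alice')")
    conn.execute("INSERT INTO dim_file VALUES (1, '/data/a.nc', 100)")
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [(1, 2009, 1), (2, 2009, 2)])
    conn.executemany(
        "INSERT INTO fact_download VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 100), (1, 1, 1, 100), (1, 1, 2, 100)],
    )
    return conn


def downloads_per_month(conn):
    """Report download counts per month by joining the fact table to dim_date."""
    rows = conn.execute(
        "SELECT d.year, d.month, COUNT(*) FROM fact_download f "
        "JOIN dim_date d ON d.date_id = f.date_id "
        "GROUP BY d.year, d.month ORDER BY d.year, d.month"
    )
    return rows.fetchall()
```

The appeal of this layout for reporting is that each new report is usually just another join between the fact table and one or two dimensions, with no change to how the metrics are recorded.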

29 Data Versioning
- Data changes, even after publication
  - Errors in simulation, processing, metadata, etc.
- Critically important that data publishers and consumers can identify which version of data they are working with
  - Changes to data may affect results of analyses
- Versioning previously handled manually
  - Adequate for moderate amounts of closely controlled data (current production archives)
  - Insufficient for global scale, especially with replication (key driver)
- Now putting versioning on formal footing
  - In collaboration with BADC, MPIM
- Initial focus on identification of key use cases, developing and evaluating preliminary software designs

30 Proposed Versioning Software Design

31 Product Services: Delivering Visualization and Analysis to Users
- Product Services provide a web-based easy-to-use interface to a vast array of interactive, science-relevant information products
  - Make plots in 1 and 2 dimensions along any axis or combination of two axes, including animation along the time axis
  - Control plot appearance
  - Launch external tools, either via scripts to access data in desktop tools or by direct launch of Google Earth
  - Compare different data sets and variables in specialized user interface
  - Request server-side analysis and view the results
  - Supports plots of curvilinear data grids and on-the-fly regridding to rectangular grids
- Web-based administrative interface for cache management

32 Product Services Architecture
- Designed to integrate many data types and products from many legacy applications into a unified user-controlled environment
- Combines the incoming request with metadata to learn where the data are and what protocol is needed to read them, then instructs backend services to read the data and create products

33 Product Services Offer Diverse Capabilities (1/2)
Product Services provide a Web-based easy-to-use interface to a vast array of interactive, science-relevant information products
- Compute on-the-fly analysis via efficient server-side functions and plot the result
- Launch external tools like Google Earth, Matlab and others

34 Product Services Offer Diverse Capabilities (2/2)
- Make comparisons along an axis and/or between data sets
- Make comparisons along different cutting planes and/or between data sets