Data Management and Transfer in High-Performance Computational Grid Environments B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C. Kesselman,


Data Management and Transfer in High-Performance Computational Grid Environments B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal, S. Tuecke Parallel Computing Journal, Vol. 28 (5), May 2002, pp

Outline Introduction The Globus Toolkit GridFTP Replica Management Conclusions

Introduction 1 Two services are fundamental to any Data Grid: reliable, high-speed transport and replica management. The high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access. The replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas.

Introduction 2 Given these two services, a wide range of higher-level data management services can be constructed.  reliable creation of a copy of a large data collection  selection of the best replica  automatic creation of new replicas in response to application demands

The Globus Toolkit 1 The Globus Toolkit, developed within the Globus project, provides middleware services for Grid computing environments. Major components include the Grid Security Infrastructure, resource management services, information services, etc.

The Globus Toolkit 2 Data Grid services complement and build on these components.  The GridFTP transfer service and the replica management service use GSI for authentication and authorization.  Higher-level data replication services can use the information service to locate the “best” replica and the resource management service to reserve the computational, network, and storage resources required by a data movement operation.

Outline of GridFTP Why GridFTP? GridFTP Functionality The GridFTP Protocol Implementation GridFTP Performance

Why GridFTP? 1 The applications that we consider use a variety of storage systems, each designed to satisfy specific needs and requirements for storing, transferring and accessing large datasets. These include the Distributed Parallel Storage System (DPSS), the High Performance Storage System (HPSS), and the Storage Resource Broker (SRB).

Why GridFTP? 2 These storage systems typically use incompatible and often unpublished protocols for accessing data, and therefore each requires the use of its own client. These incompatible protocols and client libraries effectively partition the datasets available on the Grid. Applications that require access to data stored in different storage systems must use multiple access methods.

Why GridFTP? 3 It would therefore be advantageous to have a common level of interoperability between all of these disparate systems: a common, but extensible, underlying data transfer protocol. GridFTP: A Secure, Efficient Data Transport Mechanism  It extends the standard FTP protocol to include a superset of the features offered by the various Grid storage systems currently in use.

GridFTP Functionality 1 GridFTP functionality includes features that are supported by the FTP standard as well as extensions that have already been standardized.

GridFTP Functionality 2 Other features are new extensions to FTP.  Grid Security Infrastructure and Kerberos support  Third-party control of data transfer  Parallel data transfer  Striped data transfer  Partial file transfer  Automatic negotiation of TCP buffer/window size  Support for reliable and restartable data transfer
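The idea behind parallel data transfer can be sketched locally: split the file into contiguous byte ranges and move each range with its own worker, much as GridFTP assigns ranges to parallel TCP streams. This is a hypothetical illustration of the technique, not the Globus API:

```python
import os
import threading

def parallel_copy(src_path, dst_path, num_streams=4):
    """Copy a file using several concurrent workers, each moving one
    contiguous byte range (a local stand-in for one TCP stream)."""
    size = os.path.getsize(src_path)
    with open(dst_path, "wb") as f:
        f.truncate(size)  # pre-size so workers can write at any offset

    block = (size + num_streams - 1) // num_streams

    def worker(offset):
        length = min(block, size - offset)
        with open(src_path, "rb") as s, open(dst_path, "r+b") as d:
            s.seek(offset)
            d.seek(offset)
            d.write(s.read(length))

    threads = [threading.Thread(target=worker, args=(i * block,))
               for i in range(num_streams) if i * block < size]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

On a real network the gain comes from driving several TCP connections at once; locally the sketch only shows how the file is partitioned and reassembled.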

The GridFTP Protocol Implementation The implementation consists of two principal C libraries: the globus_ftp_control_library and the globus_ftp_client_library.

The globus_ftp_control_library The globus_ftp_control_library implements the control channel API.  This API provides routines for managing a GridFTP connection, including authentication, creation of control and data channels, and reading and writing data over data channels. Having separate control and data channels, as defined in the FTP protocol standard, greatly facilitates the support of such features as parallel, striped and third-party data transfers.
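A toy model of why separating control and data channels matters: the client issues only control commands to two servers, and the data channel is wired directly between them, which is the essence of a third-party transfer. The class and method names below are invented for illustration and are not the globus_ftp_control API:

```python
class ToyServer:
    """Minimal stand-in for an FTP server: control commands over an
    in-memory file store. Names are illustrative only."""
    def __init__(self):
        self.files = {}

    # Control-channel commands hand back a "data channel" callable,
    # so the bytes never pass through the commanding client.
    def open_retrieve(self, name):
        return lambda: self.files[name]

    def open_store(self, name):
        def sink(data):
            self.files[name] = data
        return sink

def third_party_transfer(src_server, dst_server, name):
    # The client only wires the two data channels together; the data
    # flows server-to-server, as in a GridFTP third-party transfer.
    source = src_server.open_retrieve(name)
    sink = dst_server.open_store(name)
    sink(source())
```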

The globus_ftp_client_library The globus_ftp_client_library implements the GridFTP client API.  This API provides higher-level client features on top of the globus_ftp_control library, including complete file get and put operations, calls to set the level of parallelism for parallel data transfers, partial file transfer operations, third-party transfers, and eventually, functions to set TCP buffer sizes.

GridFTP Performance 1 Figure 2 shows the performance of GridFTP transfers between two workstations, one at Argonne National Laboratory in Illinois and the other at Lawrence Berkeley National Laboratory in California, connected over the ES-Net network. Both workstations run the Linux operating system and have RAID storage systems with read/write bandwidth of approximately 60 megabytes per second.

GridFTP Performance 2 The experiment was run in random order relative to the number of streams, with the GridFTP measurement for a certain number of streams followed immediately by the iperf measurement for the same number of streams. Iperf measurements were made using a one megabyte window size and ran for 30 seconds. GridFTP measurements recorded the time to transfer a one gigabyte file.

GridFTP Performance 3

GridFTP Performance 4 The graph shows that GridFTP bandwidth increases with the number of parallel TCP streams between the two workstations, until bandwidth reaches about 200 megabits per second with seven to ten TCP streams. Differences between iperf and GridFTP performance can be attributed to overheads including authentication and protocol operations, reporting performance to the client, and checkpointing data to allow restart. GridFTP achieves on average approximately 78% of the iperf bandwidth, although there is a great deal of variability.

GridFTP Performance 5 The figure demonstrates GridFTP reliability.

Outline of Replica Management Introduction to Replica Management Key Architecture Decisions for the Replica Management Service A Replica Management System Scenario The Replica Catalog Replica Management

Grid Architecture

Introduction to Replica Management 1 This component is responsible for managing complete and partial copies of datasets, defined as collections of files. Replica management services include:  creating new copies of a complete or partial collection of files  registering these new copies in a Replica Catalog  allowing users and applications to query the catalog to find all existing copies of a particular file or collection of files

Introduction to Replica Management 2 The replica management architecture assumes the following data model.  Data are organized into files.  Users group files into collections.  A replica or location is a subset of a collection that is stored on a particular physical storage system.  There may be multiple, possibly overlapping subsets of a collection stored on multiple storage systems in a data grid.

Introduction to Replica Management 3 A logical file name is a globally unique identifier for a file within the data grid’s namespace. The purpose of the replica management service is to map a unique logical file name to a possibly different physical name for the file on a particular storage device.

Key Architecture Decisions for the Replica Management Service 1 1. Separation of Replication and Metadata Information  The objects that can be registered with the replica management service contain only the information required to map logical file and collection names to physical locations.  Any other information that might be associated with files or collections, such as descriptions of file contents or the experimental conditions under which files were created, should be stored in an orthogonal metadata management service.

Key Architecture Decisions for the Replica Management Service 2 2. Replication Semantics  For multiple replicas (locations) of a logical collection, we make no guarantees about file consistency, nor do we maintain any information on which was the “original” or “source” location from which one or more copies were made.  When users register files as replicas of a logical collection, they assert that these files are replicas under a user-specific definition of replication. Our replica management service does not perform any operations to check, guarantee or enforce the user’s assertion.

Key Architecture Decisions for the Replica Management Service 3 3. Replica Management Service Consistency  Although our architecture makes no guarantees about consistency among registered file replicas, we must make certain guarantees about the consistency of information stored in the replica management service itself.  Since computational and network failures are inevitable in distributed computing environments, the replica management service must be able to recover and return to a consistent state.  One way our architecture remains consistent is by guaranteeing that no file registration operation completes successfully unless the file exists completely on the corresponding storage system.

Key Architecture Decisions for the Replica Management Service 4 4. Rollback  When a failure occurs during a multi-part operation, the information registered in the replica management service may become corrupted.  We guarantee that if failures occur during complex operations, we will roll back the state of the replica management service to the consistent state that existed before the operation began.  This requires us to save sufficient state about outstanding complex operations to revert to a consistent state after failures.
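The consistency and rollback guarantees above can be sketched with a toy in-memory catalog (this is not the Globus implementation): registration fails unless the file is actually present on the storage system, and a multi-part operation snapshots catalog state so it can revert on failure.

```python
import copy

class ReplicaCatalog:
    """Illustrative sketch of the two guarantees: register-only-if-
    present, and rollback of multi-part operations."""
    def __init__(self, storage):
        self.storage = storage   # location -> set of files present there
        self.entries = {}        # logical name -> set of locations

    def register(self, logical_name, location):
        # Guarantee: never register a file that is not fully present
        # on the corresponding storage system.
        if logical_name not in self.storage.get(location, set()):
            raise ValueError("file not present on storage system")
        self.entries.setdefault(logical_name, set()).add(location)

    def multi_part(self, operations):
        # Save enough state to revert to the previously consistent
        # snapshot if any step of the complex operation fails.
        snapshot = copy.deepcopy(self.entries)
        try:
            for op in operations:
                op(self)
        except Exception:
            self.entries = snapshot   # rollback
            raise
```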

A Replica Management System Scenario

The Replica Catalog 1 The purpose of the replica catalog is to provide mappings between logical names for files or collections and one or more copies of those objects on physical storage systems. The catalog registers three types of entries: logical collections, locations, and logical files.

The Replica Catalog 2 A logical collection is a user-defined group of files.  Aggregating files should reduce both the number of entries in the catalog and the number of catalog manipulation operations required to manage replicas. Location entries in the replica catalog contain the information required for mapping a logical collection to a particular physical instance of that collection.  The location entry may register information about the physical storage system, such as the hostname, port and protocol.  Each location entry represents a complete or partial copy of a logical collection on a storage system.  One location entry corresponds to exactly one physical storage system location. Each logical collection may have an arbitrary number of associated location entries, each of which contains a (possibly overlapping) subset of the files in the collection.

The Replica Catalog 3 Logical files are entities with globally unique names that may have one or more physical instances. The catalog may optionally contain one logical file entry in the replica catalog for each logical file in a collection.
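A minimal sketch of how the three entry types fit together, with invented field names and hostnames (not the actual replica catalog schema): a logical collection lists its member files, each location entry describes a possibly partial physical copy, and a logical file name can then be mapped to every physical instance.

```python
# Logical collection: a user-defined group of files.
collections = {
    "climate-run-42": ["t0.dat", "t1.dat", "t2.dat"],
}

# Location entries: each one maps the collection to one physical
# storage system and holds a (possibly overlapping) subset of files.
locations = [
    {"collection": "climate-run-42", "host": "ftp.siteA.example",
     "protocol": "gsiftp", "files": ["t0.dat", "t1.dat"]},
    {"collection": "climate-run-42", "host": "ftp.siteB.example",
     "protocol": "gsiftp", "files": ["t1.dat", "t2.dat"]},
]

def find_replicas(collection, filename):
    """Map a logical file name to every physical instance of it."""
    return [f"{loc['protocol']}://{loc['host']}/{filename}"
            for loc in locations
            if loc["collection"] == collection and filename in loc["files"]]
```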

The Replica Catalog 4

Replica Management 1 Replica management operations combine storage system accesses with calls to the replica catalog. There are three kinds of operations: publish, copy and delete. The file_publish operation copies a file from a storage system that is not currently registered in the replica catalog onto a registered storage system and updates the replica catalog to record the presence of the new file.  It is intended to meet the needs of application communities that produce data files and then make them available for general use.

Replica Management 2 The file_copy operation is quite similar to the publish command. It copies a file from one registered replica catalog location’s storage system to another and updates the destination location and collection object to reflect the newly-copied file. The file_delete operation removes a filename from a registered replica catalog location entry and optionally also deletes the file from the storage system associated with the replica catalog location. The functions described in this section have been implemented in the globus_replica_management API.
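The three operations might be sketched against a toy storage layer as follows; the function names mirror the text, but the bodies are illustrative and are not the globus_replica_management API.

```python
class ReplicaManager:
    """Toy sketch of publish, copy, and delete: each operation
    combines a storage-system access with a catalog update."""
    def __init__(self):
        self.storage = {}   # location -> {filename: bytes}
        self.catalog = {}   # location -> set of registered filenames

    def file_publish(self, data, filename, dst_location):
        # Copy a file from an unregistered source onto a registered
        # storage system and record its presence in the catalog.
        self.storage.setdefault(dst_location, {})[filename] = data
        self.catalog.setdefault(dst_location, set()).add(filename)

    def file_copy(self, filename, src_location, dst_location):
        # Copy between two registered locations and update the
        # destination's catalog entry.
        data = self.storage[src_location][filename]
        self.storage.setdefault(dst_location, {})[filename] = data
        self.catalog.setdefault(dst_location, set()).add(filename)

    def file_delete(self, filename, location, delete_bytes=True):
        # Remove the catalog entry; optionally delete the bytes too.
        self.catalog[location].discard(filename)
        if delete_bytes:
            self.storage[location].pop(filename, None)
```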

Conclusions High-performance, distributed data-intensive applications require two fundamental services: secure, reliable, efficient data transfer and the ability to register, locate, and manage multiple copies of datasets. These two services can be used to build a range of higher-level capabilities.