DataGrid Middleware: Enabling Big Science on Big Data

One of the most demanding and important challenges we face as we attempt to construct the distributed computing machinery required to support SciDAC goals is the efficient, high-performance, reliable, secure, and policy-aware management of large-scale data movement. This problem is fundamental to application domains as diverse as experimental physics (high energy physics, nuclear physics, light sources), simulation science (climate, computational chemistry, fusion, astrophysics), and large-scale collaboration. In each case, highly distributed user communities require high-speed access to valuable data, whether for visualization or analysis. The quantities of data involved (terabytes to petabytes), the scale of the demand (hundreds or thousands of users, data-intensive analyses, real-time constraints), and the complexity of the infrastructure that must be managed (networks, tertiary storage systems, network caches, computers, visualization systems) make the problem extremely challenging.

The Problem: The general problem we focus on is the reliable, high-performance movement of data between Data Grid components. In tackling this problem, we address three key questions:
1. Data Movement: How do we move data efficiently and reliably between two Data Grid components?
2. Data Transfer Management: How do we coordinate end-to-end data transfers so as to meet performance goals and make efficient use of available resources?
3. Collective Data Management: How do we structure, schedule, and manage data movement operations to satisfy higher-level goals?

The eXtensible Input/Output System (XIO): XIO provides an abstraction layer over the familiar Read/Write/Open/Close (RWOC) API. Any stream-based data source can be accessed through the same API by writing an appropriate driver. The figure shows how drivers are arranged in a stack, with the XIO framework handling transitions between drivers; for efficiency, no buffer copies are made. There is exactly one transport driver, and it must be the first driver on the stack: it moves data in and out of the XIO process space. We provide file, TCP, and UDP transport drivers by default. Transform drivers are optional and alter the data before transport; GSI is an example of a transform driver. We are in the process of producing MODE E (parallel TCP) and GridFTP drivers. The GridFTP driver will allow a third-party application to access files under the control of a GridFTP server using RWOC semantics. A sketch of typical client-side XIO usage appears below.
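To make the driver-stack model concrete, here is a minimal sketch of client-side XIO usage in C, loosely following the Globus XIO API: a TCP transport driver is loaded and pushed onto a stack, and the resulting handle is then driven with plain open/read/close calls. The contact string is a placeholder, error checking is omitted for brevity, and exact signatures may differ across toolkit versions.

```c
#include <stdio.h>
#include "globus_xio.h"

int main(void)
{
    globus_xio_driver_t  tcp_driver;
    globus_xio_stack_t   stack;
    globus_xio_handle_t  handle;
    globus_size_t        nbytes;
    char                 buf[256];

    globus_module_activate(GLOBUS_XIO_MODULE);

    /* Build the stack: the transport driver must be pushed first.
     * A transform driver (e.g. "gsi") could be pushed on top of it. */
    globus_xio_driver_load("tcp", &tcp_driver);
    globus_xio_stack_init(&stack, GLOBUS_NULL);
    globus_xio_stack_push_driver(stack, tcp_driver);

    /* RWOC semantics, regardless of which drivers are underneath. */
    globus_xio_handle_create(&handle, stack);
    globus_xio_open(handle, "example.org:9999", GLOBUS_NULL);  /* placeholder contact */
    globus_xio_read(handle, (globus_byte_t *) buf, sizeof(buf) - 1,
                    1, &nbytes, GLOBUS_NULL);
    buf[nbytes] = '\0';
    printf("read %ld bytes: %s\n", (long) nbytes, buf);
    globus_xio_close(handle, GLOBUS_NULL);

    globus_xio_driver_unload(tcp_driver);
    globus_module_deactivate(GLOBUS_XIO_MODULE);
    return 0;
}
```

Swapping "tcp" for a file, UDT, or (eventually) GridFTP driver would leave the open/read/close calls untouched, which is precisely the point of the abstraction.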
[Figure: The XIO driver stack. A GridFTP server issues RWOC calls into the XIO framework, which passes buffers through optional transform drivers to a single transport driver; the underlying source/sink could be a file, TCP, UDT, SRB, HDF5, etc., given an appropriate XIO driver.]

Completely Re-implemented GridFTP Server: There are three primary reasons we replaced the wuftpd-based implementation: (1) ease of maintenance, (2) ease of extensibility, and (3) licensing issues. As shown in the figure above, the new server uses XIO; again, no buffer copies are made, for performance. Using XIO provides much greater extensibility: for instance, if a mass storage vendor writes an XIO driver for their system, it should be relatively easy to produce a GridFTP server to front their mass store. At SC2003 we demonstrated the new server running over a non-TCP protocol via XIO, using UDT from Bob Grossman's group at UIC; such aggressive protocols may be appropriate in an engineered, semi-private network. In addition, we achieved 9 Gb/s disk-to-disk using 15 servers with TCP over a 10 GigE link. A sketch of a client driving such servers appears below.
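As a hedged illustration of how an application drives these servers, the sketch below uses the GT3-era globus_ftp_client C API to request a third-party (server-to-server) transfer in MODE E with four parallel TCP streams. The URLs are placeholders, most error handling is omitted, and constants and signatures may differ between toolkit versions.

```c
#include <stdio.h>
#include "globus_ftp_client.h"

static globus_mutex_t lock;
static globus_cond_t  cond;
static globus_bool_t  done = GLOBUS_FALSE;

/* Called once when the transfer completes or fails. */
static void
xfer_done(void *arg, globus_ftp_client_handle_t *h, globus_object_t *error)
{
    if (error != GLOBUS_NULL)
        fprintf(stderr, "transfer failed: %s\n",
                globus_object_printable_to_string(error));
    globus_mutex_lock(&lock);
    done = GLOBUS_TRUE;
    globus_cond_signal(&cond);
    globus_mutex_unlock(&lock);
}

int main(void)
{
    globus_ftp_client_handle_t        handle;
    globus_ftp_client_operationattr_t attr;
    globus_ftp_control_parallelism_t  par;

    globus_module_activate(GLOBUS_FTP_CLIENT_MODULE);
    globus_mutex_init(&lock, GLOBUS_NULL);
    globus_cond_init(&cond, GLOBUS_NULL);

    globus_ftp_client_handle_init(&handle, GLOBUS_NULL);
    globus_ftp_client_operationattr_init(&attr);

    /* MODE E (extended block mode) is what enables parallel streams. */
    globus_ftp_client_operationattr_set_mode(
        &attr, GLOBUS_FTP_CONTROL_MODE_EXTENDED_BLOCK);
    par.mode = GLOBUS_FTP_CONTROL_PARALLELISM_FIXED;
    par.fixed.size = 4;                      /* four streams, illustrative */
    globus_ftp_client_operationattr_set_parallelism(&attr, &par);

    /* Server-to-server transfer between two GridFTP servers (placeholder URLs). */
    globus_ftp_client_third_party_transfer(&handle,
        "gsiftp://source.example.org/data/run42.dat", &attr,
        "gsiftp://dest.example.org/data/run42.dat",   &attr,
        GLOBUS_NULL, xfer_done, GLOBUS_NULL);

    /* Block until the completion callback fires. */
    globus_mutex_lock(&lock);
    while (!done)
        globus_cond_wait(&cond, &lock);
    globus_mutex_unlock(&lock);

    globus_ftp_client_handle_destroy(&handle);
    globus_module_deactivate(GLOBUS_FTP_CLIENT_MODULE);
    return 0;
}
```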
Future Work: Future areas of emphasis include:
• MODE E (parallel TCP) and GridFTP XIO drivers
• GridFTP servers for sources other than files (HPSS, SRB, etc.)
• Scalability of the Reliable File Transfer (RFT) service (millions of files)
• Release of striping support, and more flexible striping schemes
• Windows port of GridFTP
• Collective services that integrate and coordinate these and other services
• Improved scalability and robustness of the Replica Location Service
• Grid service design and implementation of the Replica Location Service

Replica Location Service (RLS): The RLS is a distributed registry service that records the locations of data copies and allows discovery of replicas. It has been released in Globus Toolkit Version 3.0. A performance and scalability study of the RLS, conducted in Fall 2003 and Spring 2004, demonstrates that individual RLS servers perform well, scaling to millions of entries and tens of simultaneous requesting processes. The study also demonstrates that soft-state updates of the distributed RLI index scale well when using incremental updates or Bloom filter compression (illustrated below). In January 2004, we released the Copy and Registration Service, an OGSI Grid service that wraps the GT3 RLS implementation and integrates file copy operations with RLS registration operations. Through the Global Grid Forum's OGSA Replication Service Working Group, we are standardizing Grid service interfaces for Replica Location Services. The figure shows the design of the Replica Location Grid Service, which is based on OGSI Service Groups and associates Data Services that are replicas according to the policies of the Replica Location Grid Service.
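The Bloom filter compression mentioned above can be illustrated with a small self-contained sketch (this is the general technique, not the RLS implementation): each Local Replica Catalog hashes its logical file names into a fixed-size bit vector and ships that vector to the index, which can then answer membership queries with no false negatives and a tunable false-positive rate.

```c
#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS   (1u << 20)       /* 1 Mbit filter; illustrative size */
#define FILTER_BYTES  (FILTER_BITS / 8)
#define NUM_HASHES    4                /* k independent hash functions     */

/* FNV-1a with a per-function seed; any decent hash family works here. */
static uint32_t bloom_hash(const char *key, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    while (*key) { h ^= (uint8_t) *key++; h *= 16777619u; }
    return h % FILTER_BITS;
}

/* Register a logical file name (LFN) in the catalog's summary filter. */
static void bloom_add(uint8_t *filter, const char *lfn)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = bloom_hash(lfn, i);
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

static int bloom_maybe_contains(const uint8_t *filter, const char *lfn)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = bloom_hash(lfn, i);
        if (!(filter[bit / 8] & (1u << (bit % 8))))
            return 0;   /* definitely not registered at this catalog */
    }
    return 1;           /* probably registered; false positives possible */
}

int main(void)
{
    static uint8_t filter[FILTER_BYTES];   /* the catalog's compressed summary */

    /* The catalog summarizes its LFNs into the filter and ships the
     * fixed-size filter to the index instead of every individual entry. */
    bloom_add(filter, "lfn://climate/run42/temperature.nc");
    printf("%d\n", bloom_maybe_contains(filter, "lfn://climate/run42/temperature.nc"));
    printf("%d\n", bloom_maybe_contains(filter, "lfn://climate/run43/pressure.nc"));
    return 0;
}
```

Because the filter's size is fixed, a soft-state update costs the same whether the catalog holds a thousand entries or a million, which is why such updates scale well.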


NeST: To be used effectively, every Grid resource needs a "manager" that can allocate the resource, manage reservations, load balance, and so on. Such management is common and very well developed for compute nodes; other resources, such as networks and storage, are less developed. NeST addresses, at least in part, the storage resource management issue. As shown in the diagram, there are three major components. The dispatcher is the main scheduler and is responsible for controlling the flow of information between the other components. Data movement requests are sent to the transfer manager; all other requests, such as resource management and directory operations, are handled by the storage manager. The dispatcher also periodically consolidates information about resource and data availability in the NeST and can publish this information into a global scheduling system. The storage manager has four main responsibilities: virtualizing and controlling the physical storage of the machine, directly executing non-transfer requests, implementing and enforcing access control, and managing guaranteed storage space in the form of lots (sketched below).
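The lot abstraction can be pictured with a small hypothetical sketch; the type and function names below are illustrative, not NeST's actual interface. A lot pairs a guaranteed capacity with a lease, and writes are admitted only while they fit within the guarantee.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical model of a NeST "lot": a guaranteed slice of storage
 * granted to a user for a bounded time. Illustrative only. */
typedef struct {
    char      owner[64];
    long long size_bytes;    /* guaranteed capacity  */
    long long used_bytes;    /* current consumption  */
    time_t    expires;       /* reservation lifetime */
} nest_lot_t;

/* Admit a write only if the lease is live and the data fits the guarantee. */
static int lot_charge(nest_lot_t *lot, long long nbytes, time_t now)
{
    if (now > lot->expires)
        return -1;                                  /* lease expired */
    if (lot->used_bytes + nbytes > lot->size_bytes)
        return -2;                                  /* over the guarantee */
    lot->used_bytes += nbytes;
    return 0;
}

int main(void)
{
    nest_lot_t lot = { "alice", 1LL << 30, 0, time(NULL) + 3600 }; /* 1 GiB for 1 h */
    printf("charge 512 MiB: %d\n", lot_charge(&lot, 512LL << 20, time(NULL))); /* ok */
    printf("charge 600 MiB: %d\n", lot_charge(&lot, 600LL << 20, time(NULL))); /* rejected */
    return 0;
}
```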

For Further Information:
Ian Foster (ANL, Univ. of Chicago) foster@mcs.anl.gov
Carl Kesselman (ISI, Univ. of Southern California) carl@isi.edu
Miron Livny (Univ. of Wisconsin – Madison) livny@cs.wisc.edu

