Secure, Collaborative, Web Service enabled and Bittorrent Inspired High-speed Scientific Data Transfer Framework.

Introduction
Scientific applications generate terabytes or even petabytes of data.
– High-energy physics
  – CERN is funded jointly by 20 European countries, with 3000 staff supporting 6500 researchers in 35 nations.
  – The Large Hadron Collider (LHC) project will create 15 petabytes of data per year.
– Fusion power
– Climate modeling
– Earthquake engineering
– Astronomy
– Biology

Data Intensive Science [1]
Scientific discovery is increasingly driven by data collection:
– Computationally intensive analyses
– Massive data collections
– Data distributed across networks of varying capability
– Internationally distributed collaborations
Dominant factor: data growth (1 petabyte = 1000 TB)
– 2000: ~0.5 petabytes
– 2005: ~10 petabytes
– 2010: ~100 petabytes
– 2015: ~1000 petabytes?
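The growth figures above imply a yearly multiplier that is easy to work out. The following is a small sketch of that arithmetic, assuming constant compound growth within each five-year interval (an assumption made here for illustration, not a claim from the slides). Under that assumption the estimates correspond to growth of between roughly 1.6 and 1.8 times per year.

```python
# Data-volume estimates quoted on the slide (year -> petabytes).
estimates_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}


def annual_growth_factor(start_pb, end_pb, years):
    """Constant yearly multiplier that turns start_pb into end_pb over `years`."""
    return (end_pb / start_pb) ** (1 / years)


years = sorted(estimates_pb)
for first, last in zip(years, years[1:]):
    factor = annual_growth_factor(
        estimates_pb[first], estimates_pb[last], last - first
    )
    print(f"{first}-{last}: x{factor:.2f} per year")
```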

Requirements for Scientific Data Transfer
Transferring scientific data at large scale requires
– efficient,
– high-performance,
– reliable,
– secure,
– policy-aware management
– optimum use of resources (CPU, storage, network bandwidth)

Background
Several existing systems attempt to meet the above requirements:
– GridFTP
– GridFTPXIO
– GridHTTP
– TeraGrid Copy (TGCP)
– The Replica Location Service (RLS)
– gLite

GridFTP
– Extension of the standard FTP protocol
– Reliable, secure, high-performance, and efficient
– The de facto standard for transferring data in many Grid projects
– However, GridFTP does not offer a web service interface.

GridFTP (cont.)
Additional features supported by the GridFTP protocol:
– Grid Security Infrastructure (GSI) and Kerberos support
– Reliable and restartable data transfer: a transfer restarts from the point of failure when a failure occurs.
– Partial file transfer: only selected regions of a file are transferred.
– Parallel data transfer: multiple TCP streams between two network endpoints improve bandwidth.
– Third-party control of data transfer: transfers between storage servers can be controlled from a remote (third-party) server.
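Partial and parallel transfer both rest on the same idea: a file is addressed as byte ranges that can be moved independently. The sketch below shows one way to divide a file among several streams. The function name and the even-split policy are assumptions for illustration; this is not GridFTP's actual block allocation.

```python
def split_into_ranges(file_size, streams):
    """Return (offset, length) pairs that cover [0, file_size) without gaps or overlap."""
    if streams < 1:
        raise ValueError("need at least one stream")
    base, extra = divmod(file_size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges


# Each range could be handed to its own stream; a single range on its own
# corresponds to a partial file transfer.
for offset, length in split_into_ranges(10, 3):
    print(f"stream reads bytes {offset}..{offset + length - 1}")
```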

GridHTTP
– Allows large (gigabyte) files to be transferred at optimal speeds using HTTP.
– Does not deviate from existing HTTP standards, but describes how to use existing headers and methods to produce an encrypted data stream.
– Supports bulk data transfers via unencrypted HTTP.
– Supports authentication and authorization with the usual grid credentials over HTTP.

GridFTPXIO
The Globus eXtensible Input/Output (XIO) System
– provides an abstraction layer over transport protocols.
– enables different I/O problems to be presented uniformly through a simple open/close/read/write (OCRW) interface.
– is a support framework for developing communication protocols.
– is an interface that enables an existing application written with XIO to access the underlying hardware.
Primary usage scenarios:
– Independence from the Transmission Control Protocol (TCP)
– Ease of adding GridFTP support to third-party applications
– Ease of providing GridFTP access to data storage
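The OCRW idea can be illustrated with a small sketch: every driver exposes the same four operations, so a transform driver can be stacked on a transport driver without the application changing. The class names below (`Driver`, `MemoryDriver`, `XorDriver`) are hypothetical and written for this example; they are not the Globus XIO API, which is a C library.

```python
from abc import ABC, abstractmethod


class Driver(ABC):
    """Hypothetical uniform open/close/read/write (OCRW) interface."""

    @abstractmethod
    def open(self, target): ...

    @abstractmethod
    def read(self, n): ...

    @abstractmethod
    def write(self, data): ...

    @abstractmethod
    def close(self): ...


class MemoryDriver(Driver):
    """Stand-in for a transport: stores written bytes in memory."""

    def __init__(self):
        self.buffer = bytearray()
        self.pos = 0

    def open(self, target):
        self.pos = 0

    def read(self, n):
        chunk = bytes(self.buffer[self.pos:self.pos + n])
        self.pos += len(chunk)
        return chunk

    def write(self, data):
        self.buffer.extend(data)
        return len(data)

    def close(self):
        pass


class XorDriver(Driver):
    """Transform driver stacked on a lower driver; XORs every byte with a key."""

    def __init__(self, lower, key=0x5A):
        self.lower, self.key = lower, key

    def open(self, target):
        self.lower.open(target)

    def read(self, n):
        return bytes(b ^ self.key for b in self.lower.read(n))

    def write(self, data):
        return self.lower.write(bytes(b ^ self.key for b in data))

    def close(self):
        self.lower.close()


def copy_message(driver, message):
    """Application code: identical whatever driver stack it is given."""
    driver.open("demo://target")
    driver.write(message)
    echoed = driver.read(len(message))
    driver.close()
    return echoed
```

The application function never names a concrete transport, which is the property the usage scenarios above rely on.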

TeraGrid Copy (TGCP)
The TeraGrid Copy (TGCP) solution includes three main components:
– GridFTP Service
– RFT Service
– TGCP shell script
In the striped configuration,
– the GridFTP service runs on several nodes of a cluster
– the data to be transferred is partitioned among the nodes
– each node may use several parallel streams to attain maximum performance

TGCP (cont.)
The tgcp script can use the globus-url-copy tool in either
– (A) third-party transfer mode, or
– (B) conventional GridFTP client mode

TGCP (cont.)
– The RFT Service is used to manage the transfer.
– It adds reliability to the transfer request.
– The transfer is completed even if a failure occurs during the transfer.
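The reliability described here amounts to remembering how far a transfer got and resuming from that point after a failure. The following is a minimal sketch of that idea. `FlakySource` and `reliable_copy` are hypothetical names invented for this example, and the single in-memory restart marker is a simplification; it does not reproduce how the RFT Service persists its state.

```python
import io


class FlakySource:
    """Hypothetical data source that drops the connection once, mid-transfer."""

    def __init__(self, data, fail_after):
        self.data, self.fail_after, self.failed = data, fail_after, False

    def read(self, offset, size):
        if not self.failed and offset + size > self.fail_after:
            self.failed = True
            raise ConnectionError("simulated connection drop")
        return self.data[offset:offset + size]


def reliable_copy(source, total_size, sink, chunk_size=4, max_retries=3):
    """Copy total_size bytes, resuming from the last written offset on failure."""
    offset = 0   # restart marker: number of bytes safely written so far
    retries = 0
    while offset < total_size:
        try:
            chunk = source.read(offset, min(chunk_size, total_size - offset))
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            continue  # retry from the restart marker, not from byte 0
        sink.write(chunk)
        offset += len(chunk)
    return retries


data = bytes(range(20))
sink = io.BytesIO()
retries_used = reliable_copy(FlakySource(data, fail_after=10), len(data), sink)
```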

The Replica Location Service (RLS)
– Provides a framework for tracking the physical locations of data that has been replicated.
– Maps logical names to physical names.
– Replication of data items can reduce access latency, improve data locality, and increase robustness, scalability, and performance for distributed applications.
– Does not operate in isolation; it is used with other components such as the Reliable File Transfer service, GridFTP, and the Metadata Catalog Service.

RLS (cont.)
The current RLS implementation has the following features:
– Local Replica Catalogs (LRCs)
– Replica Location Indices (RLIs)
– LRCs send information about their state to RLIs using soft state protocols.
– Optional "Bloom Filter" compression can be used to summarize the contents of the LRC.
– The current RLS implementation maintains static information about the LRCs and RLIs participating in the distributed system.
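The relationship between LRCs, RLIs, and Bloom filter summaries can be sketched as follows. This is an illustrative model only: the class names, the filter size, and the SHA-256 based hashing are assumptions chosen for the example, and the soft state update is reduced to overwriting the previous summary without any expiry timer.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)  # bits is assumed to be a multiple of 8

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class LocalReplicaCatalog:
    """Maps logical names to the physical names held at one site."""

    def __init__(self, name):
        self.name = name
        self.mappings = {}

    def register(self, logical, physical):
        self.mappings.setdefault(logical, []).append(physical)

    def summary(self):
        bloom = BloomFilter()
        for logical in self.mappings:
            bloom.add(logical)
        return bloom


class ReplicaLocationIndex:
    """Holds one summary per LRC and answers which LRCs may know a logical name."""

    def __init__(self):
        self.summaries = {}

    def update(self, lrc):
        # Soft state, simplified: each update replaces the previous summary.
        self.summaries[lrc.name] = lrc.summary()

    def candidates(self, logical):
        return [name for name, bloom in self.summaries.items() if logical in bloom]


lrc = LocalReplicaCatalog("lrc-a")
lrc.register("lfn://run42/events", "gsiftp://site-a.example/data/run42.dat")
rli = ReplicaLocationIndex()
rli.update(lrc)
```

A client would ask the index for candidate catalogs and then query each candidate for the physical names, since a Bloom filter can report a name that the catalog does not actually hold.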

Our proposal: GridTorrent
– We propose a new distributed peer-to-peer file transfer protocol that moves scientific data at an acceptable speed.
– The approach is similar to the way GridFTP redefined the FTP protocol to adapt it to scientific data transfer.
– Many studies show that BitTorrent can be used for scientific applications.

GridTorrent Architecture

Advantages
– Saves resources by taking advantage of the unused upload capacity of downloaders:
  – CPU
  – Network bandwidth
  – Disk
– Reliable
– Jobs can be started and stopped using a web interface
– Can be deployed on any system
– Secure

Initial Test Results
File size is around 185 MB.
LAN test result:
– Sources were on gridfarm machines (Bloomington, IN) and the client was on the complexity machine (Indianapolis, IN).
– Transfer speed is 71 Mbps.
– PTCP transfer speed is around 80 Mbps under the same conditions.
WAN test result:
– As in the LAN tests, sources were on gridfarm machines (Bloomington, IN); the client was on the pipeline3 machine (San Diego, CA).
– Transfer speed is 17 Mbps.
– PTCP transfer speed is around 27 Mbps under the same conditions.
Bandwidth usage of each source (two tables on the slide):
seed1   seed2   seed3   seed4
44 MB   53 MB   47 MB   41 MB
seed1   seed2   seed3   seed4
52 MB   45 MB   43 MB   48 MB
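The reported speeds can be turned into transfer times and relative throughput with a few lines of arithmetic. The sketch below assumes 1 MB = 8 megabits (decimal units) and takes the file size as exactly 185 MB, although the slide only says "around"; the results are therefore approximate.

```python
FILE_MB = 185  # approximate file size from the slide

# (GridTorrent Mbps, PTCP Mbps) as reported for each test.
results = {"LAN": (71, 80), "WAN": (17, 27)}


def transfer_seconds(size_mb, rate_mbps):
    """Time to move size_mb megabytes at rate_mbps megabits per second."""
    return size_mb * 8 / rate_mbps


for name, (gridtorrent, ptcp) in results.items():
    print(
        f"{name}: GridTorrent {transfer_seconds(FILE_MB, gridtorrent):.1f} s, "
        f"PTCP {transfer_seconds(FILE_MB, ptcp):.1f} s, "
        f"GridTorrent reaches {gridtorrent / ptcp:.0%} of PTCP throughput"
    )
```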

Why BitTorrent?
Alternative peer-to-peer protocols:
– FastTrack
– Gnutella
– eDonkey
– Direct Connect
– Ares
Why BitTorrent?
– Better bandwidth utilization
– Speeds not seen before: up to 7 MB/s over the Internet
– Limits free riding (tit-for-tat)
– Limits leech attacks (upload and download are coupled)
– Spurious files are not propagated
– Ability to resume a download
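Tit-for-tat is what couples upload to download: a peer preferentially uploads to those peers that have recently uploaded the most to it. The sketch below shows a simplified selection step. The slot count, peer names, and rates are made up for illustration, and real clients add timers and other rules that are omitted here.

```python
import random


def choose_unchoked(upload_to_us, regular_slots=3, rng=random):
    """Pick peers to upload to: best recent uploaders plus one random other peer.

    upload_to_us maps a peer id to the rate at which that peer uploaded to us.
    The extra random pick (an "optimistic unchoke") gives newcomers a chance.
    """
    ranked = sorted(upload_to_us, key=lambda peer: upload_to_us[peer], reverse=True)
    unchoked = ranked[:regular_slots]
    remaining = ranked[regular_slots:]
    if remaining:
        unchoked.append(rng.choice(remaining))
    return unchoked


# Hypothetical observed rates in KB/s; peer "d" is a free rider.
rates = {"a": 50, "b": 10, "c": 40, "d": 0, "e": 30}
selected = choose_unchoked(rates, rng=random.Random(0))
```

A peer that uploads nothing can only be served through the single random slot, which is how free riding is limited rather than eliminated.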

Why BitTorrent? (cont.)
– BitTorrent has proved suitable for distributing very large files.
– Many companies use BitTorrent as a distribution protocol:
  – Amazon S3
  – Microsoft's Avalanche (inspired by BitTorrent)
  – Blizzard (game production company)
  – Movie studios

Research Issues
The current BitTorrent protocol is designed for the current network environment.
Modifications needed to provide pure scientific data transfer:
– modification of message format and frequency
– parallel TCP/UDP
– UDP
– Web Service oriented client
Requirements needed to provide pure scientific data transfer:
– Security
– Content access management
– Searching capability