The SMB Archive System: Data Backup Across the Web Kenneth R. Sharp Stanford Synchrotron Radiation Laboratory.

1 The SMB Archive System: Data Backup Across the Web Kenneth R. Sharp Stanford Synchrotron Radiation Laboratory

2 Why a high-capacity, long-term data archive is needed

Need a replacement for tapes
- Tapes age, and media formats change rapidly.
- Storage capacity and reliability of tapes are limited.
- Much manual bookkeeping is needed to keep track of data stored on tapes.

Need to support large-area CCD detectors
- Three Q315 detectors will generate 20-80 MB files at a much higher rate once the SPEAR3 upgrade is complete.
- RAID data storage at SSRL will be 24 TB in 2004, and all of that data must be backed up somehow.
- Data must be archived as rapidly as it is collected.

Need to support high-throughput structural biology
- Automated beam lines will generate huge amounts of data.
- Large numbers of samples and targets require that metadata be stored and tracked systematically.
- Data must be archived automatically and be easy to retrieve.

3 SMB Archive Uses NPACI Resources at SDSC

High Performance Storage System (HPSS)
- Centralized long-term data storage system at SDSC.
- Stores over 344 TB of data in 18 million files (Jan 2002).
- Capacity: 2000 GBytes disk; 6000 TBytes tape storage.

Storage Resource Broker (SRB)
- Client-server middleware that provides a uniform interface for accessing heterogeneous resources over the network.
- Presents data in hierarchical folders with data and access controls.
- May be used to store and retrieve data on the HPSS at SDSC.
- Powerful metadata querying system allows data sets to be accessed based on their attributes.
- Data sets can be replicated over multiple resources.
- Organizations may install and maintain their own SRB servers. We use the SRB installation at SDSC.

National Partnership for Advanced Computational Infrastructure (NPACI)
- Mission: advance science by creating national computational infrastructure: the Grid.
- Maintains resources at the San Diego Supercomputer Center (SDSC), including HPSS and SRB.

4 Organizations Using SRB

- Digital Libraries: UCB, Umich, UCSB, Stanford, CDL; NSF NSDL - UCAR / DLESE
- NASA Information Power Grid
- Astronomy: National Virtual Observatory; 2MASS Project (2 Micron All Sky Survey)
- Particle Physics: Particle Physics Data Grid (DOE); GriPhyN
- Medicine: Digital Embryo (NLM)
- Earth Systems Sciences: ESIPS; LTER
- Persistent Archives: NARA; LOC
- Neuro Science & Molecular Science: TeleScience/NCMIR, BIRN; SLAC, AfCS, …

5 InQ SRB client for Microsoft Windows

SRB client applications
- Users must be able to upload data, download data, and view the data in the archive.
- Users perform these functions via SRB client applications.
- Available clients: command-line programs ("S Commands"), InQ, MySRB.
- Tools for custom clients: SRB C library; Java API.

InQ for Microsoft Windows
- InQ is the easiest-to-use client provided by NPACI.
- Individual files or entire folders may be uploaded or downloaded.
- Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of InQ
- Runs only on Microsoft Windows platforms. Windows is not the major platform used at synchrotron light sources or in crystallography research labs.
- No batch job capability for long archive jobs.
- Exposes confusing SRB features and terminology (resources, containers, collections, etc.).

6 MySRB web browser-based SRB client

MySRB
- MySRB is a powerful web-based SRB client that can be run from standard web browsers.
- Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of MySRB
- No way to upload or download more than one file at a time.
- The otherwise rich functionality and powerful features are confusing to users.

The bottom line
- Capabilities of HPSS and SRB far exceed the perceived needs of our beam line users.
- Our users need a customized interface with simplified functionality.
- Additional infrastructure had to be designed and implemented in order to make the SRB a viable storage system for crystallographic data.
- A browser-based user interface is ideal.

7 The SMB Archive interface for using the SRB

Simple archive job definition
- Users may rapidly browse their /home and /data directories at SSRL.
- Directory contents are listed in the browser window.
- Directories may be navigated by clicking on directory names.
- Files to be uploaded may be filtered according to a list of wildcards.
- Subdirectories may be archived recursively.
- The only SRB-related information required is the name of the new data collection to create.

Convenient web browser interface
- Users may define archive jobs over the web from anywhere in the world using any common type of computer.
- Users need only log in with their SMB Unix account name and password.
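The wildcard filtering and recursive archiving options on this slide can be illustrated with a small sketch. This is not the Archive System's code: the class name ArchiveFileSelector and its methods are hypothetical, and it uses the modern java.nio.file API, which did not exist in the Java 1.4 runtime the system ran on. It only shows one plausible way to turn a list of wildcards and a "recursive" flag into a list of files to upload.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: selects the files an archive job would upload.
class ArchiveFileSelector {

    // True if the bare file name matches at least one wildcard pattern.
    static boolean matchesAny(Path fileName, List<PathMatcher> matchers) {
        for (PathMatcher m : matchers) {
            if (m.matches(fileName)) {
                return true;
            }
        }
        return false;
    }

    // Lists regular files under root whose names match any wildcard.
    // With recursive == false only the top-level directory is scanned.
    static List<Path> select(Path root, List<String> wildcards, boolean recursive)
            throws IOException {
        List<PathMatcher> matchers = new ArrayList<>();
        for (String w : wildcards) {
            matchers.add(FileSystems.getDefault().getPathMatcher("glob:" + w));
        }
        int depth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> walk = Files.walk(root, depth)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> matchesAny(p.getFileName(), matchers))
                       .sorted()
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Example patterns only; a real job would take them from the user.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> wildcards = Arrays.asList("*.img", "*.log");
        for (Path p : select(root, wildcards, true)) {
            System.out.println(p);
        }
    }
}
```

Matching against the bare file name rather than the full path keeps a pattern such as *.img meaningful at every directory depth, which seems the natural reading of "filtered according to a list of wildcards".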

8 Monitoring archive jobs and downloading data

Batch operation
- Archive job runs in the background once its definition is confirmed.
- Browser does not hang during archival.
- New jobs may be started while previously defined jobs are in progress.
- Jobs are restarted automatically if HPSS is unavailable.
- A job status page shows the definitions and status of all running jobs.
- Users may abort running jobs.
- E-mail is sent to the user when a job is started and again when it is completed.

Similar interface for data download
- Users browse their archived data sets in exactly the same fashion.
- Data may be downloaded from the archive to a directory at SSRL (analogous to an upload job).
- Another option is to download selected files in one or more tar files directly to any computer on the Internet.
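The slide does not say how jobs are restarted when HPSS is unavailable. A minimal sketch of one common approach, a bounded retry loop with a wait between attempts, is given here. Everything in it is an assumption made for illustration: the class RestartingJobRunner, the exception StorageUnavailableException, and the attempt limit and wait time are hypothetical and do not come from the Archive System.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of restarting a background job when storage is down.
class RestartingJobRunner {

    // Stand-in for whatever error signals that HPSS cannot be reached.
    static class StorageUnavailableException extends Exception {
        StorageUnavailableException(String message) {
            super(message);
        }
    }

    // Runs the job, retrying only when storage is unavailable.
    // Any other exception is treated as a real failure and propagates at once.
    static <T> T runWithRestarts(Callable<T> job, int maxAttempts, long waitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        StorageUnavailableException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.call();
            } catch (StorageUnavailableException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis); // wait before the next attempt
                }
            }
        }
        throw last; // every attempt found storage unavailable
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = runWithRestarts(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new StorageUnavailableException("HPSS unavailable");
            }
            return "upload finished";
        }, 5, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying only on the "unavailable" condition keeps genuine errors, such as a missing source file, from being silently repeated.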

9 Archive System Infrastructure

But first, a word about SRB accounts
- An SRB account (independent of the SSRL Unix account) is required to archive data.
- Your SRB account permits you to upload and download any data using SRB clients.
- A web page on our site creates an SRB account: https://smb.slac.stanford.edu/secure/collaboratory/archive_system/SRBAccountForm.html

The Archive System uses the following software elements
- Apache Web Server (v1.3.27)
- Apache Tomcat Servlet Container (v4.1.24)
- Java 2 Runtime (v1.4.1)
- SMB Authentication Gateway Server
- SMB Impersonation Server
- SRB JARGON Java API (v1.1)
- Archive System Servlets (for Upload, Download, and Job Maintenance)
- Archive System Background Applications

- All Archive System applications and servlets are written in Java.
- The Archive System front end is made up of Java servlets.
- The Archive System back end is made up of Java applications.
- All infrastructure elements are either available for free or are home-grown.

10 Significant infrastructure is required to provide this "simple" interface, but the payoff is huge.

Authentication Gateway Server
- Java servlet that provides a common authentication protocol for all web-based and stand-alone applications.
- Used to authenticate archive system users.
- All web-based software developed at SSRL is being updated to use this single authentication server.
- Support for the authentication server has already been integrated into Blu-Ice/DCS.
- Allows users to navigate seamlessly between applications without authenticating multiple times.
- Will eventually allow access to beamline systems to be controlled automatically based on the beam schedule.
- Access to other resources (computing, data directories, etc.) is available 24/7.

Impersonation Server
- Unix daemon that can run any non-interactive program on behalf of any Unix user.
- Enables web applications to run background jobs for a user with the actual rights of the Unix user account.
- Accepts commands via the HTTP protocol.
- Verifies authentication information with the Authentication Server.
- Used by the archive system to list directories in the web browser and run background archive jobs as the user.
- Will allow further analyses to be automatically initiated by the beam line control system.
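The control flow described for the Impersonation Server (receive a command, verify the caller with the Authentication Server, then run the command as that user) can be sketched as below. This is only an illustration of the ordering of the checks. The interfaces Authenticator and CommandRunner, the class ImpersonationFlow, and the use of HTTP status codes 200 and 403 are all assumptions; the actual protocol, parameters, and the mechanism for switching Unix users are not given in the slides.

```java
// Hypothetical sketch: verify first, run as the user only if verified.
class ImpersonationFlow {

    // Stand-in for a call to the Authentication Gateway Server.
    interface Authenticator {
        boolean isValid(String user, String sessionId);
    }

    // Stand-in for running a non-interactive program as a Unix user.
    interface CommandRunner {
        String runAs(String user, String command);
    }

    // Minimal response: an HTTP-style status code plus a text body.
    static class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // Handles one request. The command is never run unless the session
    // has been confirmed by the authenticator.
    static Response handle(String user, String sessionId, String command,
                           Authenticator auth, CommandRunner runner) {
        if (!auth.isValid(user, sessionId)) {
            return new Response(403, "authentication failed");
        }
        return new Response(200, runner.runAs(user, command));
    }

    public static void main(String[] args) {
        // Toy implementations so the sketch can be run on its own.
        Authenticator auth = (u, s) -> "good-session".equals(s);
        CommandRunner runner = (u, c) -> "ran '" + c + "' as " + u;

        Response denied = handle("alice", "stale", "ls /data/alice", auth, runner);
        Response ok = handle("alice", "good-session", "ls /data/alice", auth, runner);
        System.out.println(denied.status + " " + denied.body);
        System.out.println(ok.status + " " + ok.body);
    }
}
```

Keeping authentication and command execution behind separate interfaces mirrors the split between the two servers on this slide: a web application such as the directory browser only needs to forward the user's session and the command it wants run.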

11 Archive System Web Architecture

[Diagram. Labeled components: Web Browser; Internet; Apache; Archive Servlets (Tomcat): Define Upload, Define Download, View Job Status; Authentication; SMB Impersonation; Archive Jobs (background): Upload Jobs, Download Jobs, Job Maintenance; Internet (Backbone); SDSC: SRB, MCAT, HPSS, Disk Cache, Tape Storage.]

12 Archive Projects for the next year

- Optimize data transfer rates between SSRL and SDSC.
- Provide a stand-alone application for users wishing to download data sets directly from the SRB.
- Implement other functions available in InQ and MySRB for manipulating existing collections (replicate, delete, etc.).
- Provide an option for automatic data upload from Blu-Ice.
- Provide a link from Blu-Ice that automatically starts the browser and loads the Archive page without the user having to log in again. (The new Authentication Server makes this possible.)
- Provide additional options for using the SRB Metadata Catalog (MCAT) to describe, index, and retrieve data files.

The Collaboratory for Macromolecular Crystallography is supported by the NIH, NCRR as a supplement to the SSRL Synchrotron Radiation Structural Biology Resource (P41-RR-01209). The SSRL Structural Molecular Biology program is funded by DOE BER, NIH NCRR, and NIH NIGMS.

