Data Infrastructure in the TeraGrid Chris Jordan Campus Champions Presentation May 6, 2009.


1 Data Infrastructure in the TeraGrid Chris Jordan Campus Champions Presentation May 6, 2009

2 What is the Data Working Group? Group of staff from all TeraGrid Resource Providers Responsible for planning and implementation of: –Software supporting data movement and management –Overall configuration of tools –Coordination with Software and other Working Groups Data Architecture effort –Smaller group of individuals –Looking at overall strategic plan for data in TG

3 Jordan’s Theory of Scientific Computation Campus TeraGrid Resource Data HPC Analysis/Viz Magic Results Analysis/Viz Data Campus TeraGrid Resource Profit

4 Data Collection and Preservation Data is the “stuff” of science Different forms: Simulation output, Sensor output, Experimental results, etc Data has significant, unpredictable reuse value Collections organize data from numerous sources Metadata for identification, search for location Preservation allows for long-term access and reuse

5 What tools do we provide? “Data Movement” Tools – get data from here to there and back again “Data Management” Tools – replicate, organize, and tag files, utilize databases “Data Collections” Tools – evolution of data management to include formal collections access

6 What resources are available? High-Performance Parallel File Systems –“Scratch” and “Work” Areas Archive systems –Use tapes for long-term storage –Very high capacity (petabytes or tens of petabytes) Wide-Area and Global file systems –Extension of parallel file systems over wide-area networks –One file system, available on multiple sites/resources

7 Kits and Information Services TeraGrid software organized into “capabilities” Capabilities collected in “Kits” Kits register their services and software TeraGrid Central Information Service collects this information for all RPs Users, applications query the Info Service

8 Data Movement Tools GridFTP Servers and Clients –Supports parallel transfers (threading) –Supports “striping” (use of multiple servers) –The globus-url-copy client allows selection of low-level options (network and storage block sizes, etc.) –Not the simplest syntax Secure Copy (scp/ssh) –TeraGrid supports high-performance network extensions –Simple syntax, relatively easy to use –Not always as featureful as GridFTP UberFTP –FTP-like command-line client for GridFTP, other protocols
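
The low-level options mentioned on this slide can be illustrated with a small helper that assembles a globus-url-copy command line. This is a sketch only: the helper name, the default values, and the gsiftp:// host names are hypothetical, and nothing is executed. The flags used (-vb, -p, -tcp-bs, -bs, -stripe) are standard globus-url-copy options, but suitable values depend on the sites and network involved.

```python
def build_globus_url_copy(src, dst, parallel=4, tcp_buffer_bytes=4194304,
                          block_bytes=1048576, striped=False):
    """Return an argv list for a globus-url-copy transfer (nothing is run)."""
    argv = ["globus-url-copy", "-vb"]            # -vb: show bytes moved and transfer rate
    argv += ["-p", str(parallel)]                # -p: number of parallel data streams
    argv += ["-tcp-bs", str(tcp_buffer_bytes)]   # -tcp-bs: TCP buffer size in bytes
    argv += ["-bs", str(block_bytes)]            # -bs: transfer block size in bytes
    if striped:
        argv.append("-stripe")                   # -stripe: use multiple servers if available
    argv += [src, dst]
    return argv


# Hypothetical endpoints, for illustration only.
command = build_globus_url_copy(
    "gsiftp://gridftp.site-a.example.org/scratch/user/run42.tar",
    "gsiftp://gridftp.site-b.example.org/work/user/run42.tar",
    parallel=8,
    striped=True,
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run once valid grid credentials and real endpoints are in place.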

9 Data Management Tools Storage Resource Broker client –SDSC provides a TeraGrid-wide SRB service –Many data collections currently managed through SRB –SRB is now deprecated, being phased out Reliable File Transfer service –Globus utility for managing transfers –Uses database for persistent state storage –Support automated retry, transfers of file lists
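
The idea behind a reliable transfer service (persistent state in a database, automated retry, transfers driven by a file list) can be sketched in a few lines. This is not the Globus RFT interface: the table layout, the function name run_transfers, and the copy_one callback are all assumptions made for illustration, with SQLite standing in for the service database.

```python
import sqlite3


def run_transfers(db_path, file_list, copy_one, max_attempts=3):
    """Copy each (src, dst) pair, recording progress so a rerun can resume.

    copy_one is a hypothetical callback that raises OSError on failure.
    Returns the sources that are still not transferred.
    """
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS transfers ("
        "src TEXT PRIMARY KEY, dst TEXT, "
        "attempts INTEGER DEFAULT 0, done INTEGER DEFAULT 0)"
    )
    # Register the file list; entries already known from an earlier run are kept.
    db.executemany(
        "INSERT OR IGNORE INTO transfers (src, dst) VALUES (?, ?)", file_list
    )
    db.commit()
    pending = db.execute(
        "SELECT src, dst, attempts FROM transfers "
        "WHERE done = 0 AND attempts < ?", (max_attempts,)
    ).fetchall()
    for src, dst, attempts in pending:
        while attempts < max_attempts:
            attempts += 1
            try:
                copy_one(src, dst)
                ok = 1
            except OSError:
                ok = 0
            # Persist the outcome of every attempt before moving on.
            db.execute(
                "UPDATE transfers SET attempts = ?, done = ? WHERE src = ?",
                (attempts, ok, src),
            )
            db.commit()
            if ok:
                break
    remaining = [row[0] for row in db.execute(
        "SELECT src FROM transfers WHERE done = 0")]
    db.close()
    return remaining
```

Because the state lives in a database file rather than in memory, a crashed or interrupted run can be restarted with the same arguments and will skip files already marked done.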

10 Data Collections Tools Integrated Rule-Oriented Data System –In testing at TACC, SDSC –Supports storage of Data and Metadata –Supports management of data in archives and file systems, replication and checksum management –Can manage data based on programmable “rule engine” Database clients –JDBC/ODBC – implementation-independent DB interface –MySQL and Postgres clients for the most common open source databases –Some sites support Oracle

11 Parallel File Systems GPFS and Lustre Multiple Servers, multiple disk arrays –Load is distributed across servers for high-performance –Files can be distributed on a per-file or per-block basis (striping) –Lustre allows per-directory and per-file user configuration of striping Basic technologies behind WAN file systems
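
Per-block striping can be made concrete with a little arithmetic. The sketch below assumes a simple round-robin layout, in which consecutive stripes of a file are placed on consecutive servers; real GPFS and Lustre layouts add further detail, and the function name and parameters are illustrative only.

```python
def stripe_location(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (server index, offset on that server).

    Assumes round-robin striping: stripe 0 goes to server 0, stripe 1 to
    server 1, and so on, wrapping around after stripe_count servers.
    """
    stripe_index = offset // stripe_size
    server = stripe_index % stripe_count
    # Each full pass over all servers adds one stripe to every server.
    local_offset = (stripe_index // stripe_count) * stripe_size + offset % stripe_size
    return server, local_offset


MIB = 1024 * 1024
for offset in (0, MIB, 3 * MIB + 10, 4 * MIB + 5):
    print(offset, stripe_location(offset, stripe_size=MIB, stripe_count=4))
```

With a 1 MiB stripe size and four servers, a large sequential read touches all four servers in turn, which is why load spreads across servers and disk arrays.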

12 Archive Systems Many different technologies and configurations HPSS and others use a custom command interface –Run special commands to store and retrieve files –Often referred to as “put and get” interfaces SAM-QFS at SDSC and TACC uses a file system interface –Looks just like any other file system –Can use GridFTP, SCP, etc to store and retrieve files –May have to wait to “stage” files from tape All archives support different classes of service with different storage characteristics

13 Classes of Service Disk-only –Never copy to tape, and/or never delete from disk –Used for small files –Often “bundle” many small files together for efficiency 1 Tape Copy –Copy to a tape, delete from disk –Most common type of service 2 Tape Copies –Replicate across two tapes in case of media failure –Usually has to be specially requested or configured Geographical replication – Coming soon …
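
Bundling many small files before sending them to an archive can be done with standard tools such as tar. A minimal sketch using Python's tarfile module follows; the function name and the 1 MB size threshold are assumptions for illustration, and the archive should be written outside the directory being bundled.

```python
import tarfile
from pathlib import Path


def bundle_small_files(directory, archive_path, max_bytes=1_000_000):
    """Pack every file of at most max_bytes under directory into one tar file.

    Returns the archive member names, relative to directory.
    """
    directory = Path(directory)
    bundled = []
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(directory.rglob("*")):
            if path.is_file() and path.stat().st_size <= max_bytes:
                name = path.relative_to(directory).as_posix()
                tar.add(path, arcname=name)   # store relative paths in the bundle
                bundled.append(name)
    return bundled
```

A tape archive then handles one large object instead of thousands of tiny ones, which generally reduces the number of tape operations needed to store or retrieve the data.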

14 Wide-Area File Systems Take advantage of parallel operation and wide network pipes Have been shown to utilize up to 30Gb/s cross-country Good for large datasets with distributed usage, e.g. compute-at-NICS, Visualize-at-TACC Significant technical accomplishment, still working to extend availability everywhere GPFS-WAN: SDSC, IU, NCSA (sometimes) Lustre-WAN: IU, PSC, LONI, TACC (coming soon)
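
To put the 30Gb/s figure in perspective, the time to move a dataset at a given sustained rate is a one-line calculation. The 10 TB dataset size below is an assumed example, and the result is a best case: sustained rates in practice are usually lower than a demonstrated peak.

```python
def transfer_seconds(size_bytes, rate_gbps):
    """Ideal transfer time in seconds at a sustained rate in gigabits per second."""
    return size_bytes * 8 / (rate_gbps * 1e9)


dataset_bytes = 10e12  # assumed example: a 10 TB dataset (decimal terabytes)
seconds = transfer_seconds(dataset_bytes, rate_gbps=30)
print(f"{seconds / 60:.1f} minutes at 30 Gb/s")
```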

15 Recommendations for New Users Develop a Data Management plan Understand your data workflow Understand the data resources you will use Automate the data workflow if possible Almost all data may be useful in collaboration Consider the long-term value of your data, and whether to donate it to a collection or organize it yourself

16 Input always welcome Data is an extraordinarily diverse field Lots of use cases, lots of needs Many needs have to do with policy, some have to do with tools Important to make sure we’re serving the user community Contact data-wg@teragrid.org or ctjordan@tacc.utexas.edu with comments and questions

