Published by Danielle Manning. Modified over 10 years ago.
1
San Diego Supercomputer Center & National Partnership for Advanced Computational Infrastructure
Storage Resource Broker
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE/
2
Data Management Objectives
Automate all aspects of data management
–Discovery (without knowing the file name)
–Access (without knowing its location)
–Retrieval (using your preferred API)
–Control (without having a personal account at the remote storage system)
–Performance (use latency management mechanisms to minimize the impact of wide-area networks)
3
Collections Replicated via SRB onto TeraGrid
2MASS
–10 TBs, 5 million images
DPOSS
–3 TBs, 6000 images
USNO-B
–In progress
SDSS
–In progress
MACHO
–In negotiation
4
SRB Implementations
Data collecting
–Sensor systems, object ring buffers and portals
Data organization
–Collections, manage data context
Data sharing
–Data grids, manage heterogeneity
Data publication
–Digital libraries, support discovery
Data preservation
–Persistent archives, manage technology evolution
Data analysis
–Processing pipelines, manage knowledge extraction
5
NSF Infrastructure Projects Using SRB
Partnership for Advanced Computational Infrastructure - PACI
–Data grid - Storage Resource Broker
Distributed Terascale Facility - DTF/ETF
–Compute, storage, network resources
Digital Library Initiative, Phase II - DLI2
–Publication, discovery, access
Information Technology Research projects - ITR
–SCEC Southern California Earthquake Center
–GEON GeoSciences Network
–SEEK Science Environment for Ecological Knowledge
–GriPhyN Grid Physics Network
–NVO National Virtual Observatory
National Science Digital Library - NSDL
–Support for education curricula modules
6
Federal Infrastructure Projects Using SRB
NASA
–Information Power Grid - IPG
–Advanced Data Grid - ADG
–Data Management System - Data Assimilation Office
  Integration of DODS with Storage Resource Broker data grid
–Earth Observing System (EOS) data pools
–Committee on Earth Observation Satellites (CEOS) data grid
Library of Congress
–National Digital Information Infrastructure and Preservation Program - NDIIPP
National Archives and Records Administration and National Historical Publications and Records Commission
–Prototype persistent archives
NIH
–Biomedical Informatics Research Network data grid
DOE
–Particle Physics Data Grid - BaBar, CMS
7
SDSC Collaborations
Hayden Planetarium Simulation & Visualization
Knowledge Network for BioComplexity (NSF)
Mol Science – JCSG, AfCS
Visual Embryo Project (NLM)
RoadNet (NSF)
Earth System Sciences – CEED, Bionome, SIO Explorer
Hyper LTER Grid Portal (NPACI)
Tera Scale Computing (NSF)
Long Term Archiving Project (NARA)
Education – Transana (NPACI)
NSDL – National Science Digital Library (NSF)
Digital Libraries – ADL, Stanford, UMichigan, UBerkeley, CDL
… 31 additional collaborations
8
Approach
Use collections to organize digital entities
–Digital entity - file, URL, SQL, directory, table, …
Create logical name space
–Location independent naming convention
–Map state information created by data access services to the logical name space
–Manage consistency constraints on the metadata update
Build an interoperability mechanism
–Map from storage repository protocols to preferred APIs
9
Basic Concepts
Logical name space
–Map administrative, descriptive, authenticity, consistency metadata onto the logical name
Storage repository abstraction
–Standard operations performed at remote storage
Information repository abstraction
–Standard operations to manage collection in a database
Access abstraction
–Standard operations supported for metadata and data access
Authentication abstraction
–Collection-owned data, ACLs for data and metadata
Latency management mechanisms
10
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: layered architecture diagram. Labels recovered from the slide:
Access APIs: Application; C, C++ Libraries; Unix Shell; Linux I/O; DLL / Python; Java, NT Browsers; OAI; WSDL; GridFTP
Core: Consistency Management / Authorization-Authentication; Logical Name Space; Latency Management; Data Transport; Metadata Transport; Prime Server
Catalog Abstraction: Databases (DB2, Oracle, Postgres, SQLServer, Informix)
Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX); HRM; ORB; Servers]
11
Production Data Grid
SDSC Storage Resource Broker
–Federated client-server system, managing
  Over 70 TBs of data at SDSC
  Over 10 million files
–Manages data collections stored in
  Archives (HPSS, UniTree, ADSM, DMF)
  Hierarchical Resource Managers
  Tapes, tape robots
  File systems (Unix, Linux, Mac OS X, Windows)
  FTP sites
  Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix)
  Virtual Object Ring Buffers
12
Federated SRB server model
[Figure: an Application issues a Read, by Logical Name Or Attribute Condition, against two SRB servers, each with an SRB agent, and the MCAT; replicas R1 and R2 sit behind the servers; arrows are numbered 1 to 6, with 5/6 marked at the replicas. Other labels: Peer-to-peer Brokering; Server(s) Spawning; Data Access; Parallel Data Access]
MCAT functions
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
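The three catalog functions named on this slide (logical-to-physical mapping, identification of replicas, access and audit control) can be illustrated with a toy model. This is only a sketch under stated assumptions: `Catalog`, `Replica`, `read`, and the server names are hypothetical stand-ins, not the MCAT schema or the SRB client API.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    server: str  # hypothetical server label, e.g. "R1"
    path: str    # physical path on that server


@dataclass
class Catalog:
    """Toy stand-in for a metadata catalog; not the real MCAT schema."""
    replicas: dict = field(default_factory=dict)  # logical name -> [Replica]
    acl: dict = field(default_factory=dict)       # logical name -> set of users
    audit: list = field(default_factory=list)     # (user, logical name, allowed)

    def resolve(self, user, logical_name):
        # Access control, with every attempt recorded in the audit trail.
        allowed = user in self.acl.get(logical_name, set())
        self.audit.append((user, logical_name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not read {logical_name}")
        # Logical-to-physical mapping: return every known replica.
        return list(self.replicas.get(logical_name, []))


def read(catalog, servers, user, logical_name):
    """Return the data from the first replica whose server can serve it."""
    for rep in catalog.resolve(user, logical_name):
        store = servers.get(rep.server)
        if store is not None and rep.path in store:
            return store[rep.path]
    raise FileNotFoundError(logical_name)


catalog = Catalog()
name = "/home/demo/sky.fits"
catalog.replicas[name] = [Replica("R1", "/hpss/a1"), Replica("R2", "/unix/b7")]
catalog.acl[name] = {"alice"}
servers = {"R2": {"/unix/b7": b"pixels"}}  # R1 is offline in this toy run

data = read(catalog, servers, "alice", name)
```

The caller only ever supplies the logical name; which replica answers is decided behind the catalog, which is the point of the brokering step.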
13
Logical Name Space Example - Hayden Planetarium
Generate fly-through of the evolution of the solar system
Access data distributed across multiple administration domains
Gigabyte files, total data size was 7 TBytes
Very tight production schedule - 3 months
15
Hayden Data Flow
[Figure: data flow diagram. Sites: NCSA, SDSC, AMNH, NYC, UVa, NY, CalTech, BIRN. Systems: IBM SP2, SGI, GPFS, HPSS, UniTree. Capacities shown: 7.5 TB, 7.5 TB, 2.5 TB. Flow labels: simulation; visualization; data; production parameters, movies, images]
16
Logical Name Space
Global, location-independent identifiers for digital entities
–Organized as collection hierarchy
–Attributes mapped to logical name space
Attributes managed in a database
Types of system metadata
–Physical location of file
–Owner, size, creation time, update time
–Access controls
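A minimal sketch of the idea, assuming a plain in-memory dictionary in place of the database: the logical path is the stable identifier, and the physical location is just one attribute that can change underneath it. The class name, paths, and URL-style location strings are all hypothetical.

```python
import posixpath


class LogicalNameSpace:
    """Toy logical name space: logical path -> system metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_path, physical_location, owner, size):
        self._entries[logical_path] = {
            "physical_location": physical_location,
            "owner": owner,
            "size": size,
        }

    def list_collection(self, collection):
        """Logical paths whose parent collection is exactly `collection`."""
        return sorted(p for p in self._entries
                      if posixpath.dirname(p) == collection)

    def relocate(self, logical_path, new_location):
        # Only the attribute changes; the logical name stays the same.
        self._entries[logical_path]["physical_location"] = new_location

    def stat(self, logical_path):
        return dict(self._entries[logical_path])


ns = LogicalNameSpace()
ns.register("/sky/2mass/img001.fits", "hpss://archive/tape17/0001", "demo_user", 2_000_000)
ns.register("/sky/2mass/img002.fits", "hpss://archive/tape17/0002", "demo_user", 2_100_000)
ns.relocate("/sky/2mass/img001.fits", "unix://cache/img001.fits")

listing = ns.list_collection("/sky/2mass")
info = ns.stat("/sky/2mass/img001.fits")
```

After `relocate`, anything that refers to `/sky/2mass/img001.fits` keeps working, because nothing outside the catalog ever held the physical address.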
17
Mappings on Name Space
Define logical resource name
–List of physical resources
Replication
–Write to a logical resource completes when all physical resources have a copy
Load balancing
–Write to a logical resource completes when a copy exists on the next physical resource in the list
Fault tolerance
–Write to a logical resource completes when copies exist on k of n physical resources
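The three completion rules can be compared side by side in a small model. This is a sketch only: `PhysicalResource` and `LogicalResource` are hypothetical classes, an `available` flag simulates an outage, and the round-robin cursor is one assumed reading of "next physical resource in the list".

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalResource:
    name: str
    available: bool = True
    files: dict = field(default_factory=dict)

    def store(self, key, data):
        """Store a copy; return False if the resource is unavailable."""
        if not self.available:
            return False
        self.files[key] = data
        return True


class LogicalResource:
    def __init__(self, resources):
        self.resources = list(resources)
        self._next = 0  # round-robin cursor (assumed policy)

    def write_replicated(self, key, data):
        """Complete only if every physical resource holds a copy."""
        results = [r.store(key, data) for r in self.resources]
        return all(results)

    def write_load_balanced(self, key, data):
        """Complete once the next resource in the list holds a copy."""
        target = self.resources[self._next]
        self._next = (self._next + 1) % len(self.resources)
        return target.store(key, data)

    def write_fault_tolerant(self, key, data, k):
        """Complete as soon as k of the n resources hold a copy."""
        copies = 0
        for r in self.resources:
            if r.store(key, data):
                copies += 1
            if copies >= k:
                return True
        return False


hpss = PhysicalResource("hpss")
unix = PhysicalResource("unix", available=False)  # simulated outage
unitree = PhysicalResource("unitree")
lr = LogicalResource([hpss, unix, unitree])

replicated_ok = lr.write_replicated("f1", b"x")            # one resource is down
fault_tolerant_ok = lr.write_fault_tolerant("f2", b"x", k=2)
first_balanced_ok = lr.write_load_balanced("f3", b"x")     # lands on hpss
second_balanced_ok = lr.write_load_balanced("f4", b"x")    # lands on unix (down)
```

With one of three resources down, the replicated write fails while the 2-of-3 write still completes, which is the trade-off the slide contrasts.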
18
Latency Management Example - Digital Sky Project
2MASS (2 Micron All Sky Survey):
–Bruce Berriman, IPAC, Caltech; John Good, IPAC, Caltech; Wen-Piao Lee, IPAC, Caltech
NVO (National Virtual Observatory):
–Tom Prince, Caltech; Roy Williams, CACR, Caltech; John Good, IPAC, Caltech
SDSC – SRB:
–Arcot Rajasekar, Mike Wan, George Kremenek, Reagan Moore
19
Digital Sky - 2MASS
http://www.ipac.caltech.edu/2mass
The input data was originally written to DLT tapes in the order seen by the telescope
–10 TBytes of data, 5 million files
Ingestion took nearly 1.5 years - almost daily reading of tapes, one at a time
Images aggregated into 147,000 containers by SRB
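The figures on this slide imply rough averages for file and container size. The arithmetic below is a back-of-the-envelope sketch that assumes decimal units (1 TByte = 10^12 bytes) and treats the slide's rounded totals as exact.

```python
total_bytes = 10 * 10**12   # 10 TBytes, decimal units assumed
files = 5_000_000           # 5 million files
containers = 147_000        # containers created by SRB

avg_file_mb = total_bytes / files / 10**6            # average image size in MB
files_per_container = files / containers             # average images per container
avg_container_mb = total_bytes / containers / 10**6  # average container size in MB

print(f"{avg_file_mb:.1f} MB per file")
print(f"{files_per_container:.0f} files per container")
print(f"{avg_container_mb:.0f} MB per container")
```

On these assumptions each image averages about 2 MB, and a container holds about 34 images for roughly 68 MB, so one archive request moves a few dozen images instead of one.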
20
Digital Sky Data Ingestion
[Figure: data ingestion diagram. Labels recovered from the slide: input tapes from telescopes; IPAC, CALTECH; SDSC; SUN; SUN E10K; SRB; Informix; star catalog; Data Cache; 800 GB; HPSS; 10 TB]
22
SRB Latency Management
[Figure: mechanisms placed along the path Source, Network, Destination. Labels recovered from the slide: Replication; Server-initiated I/O; Streaming; Parallel I/O; Caching; Client-initiated I/O; Remote Proxies, Staging; Data Aggregation; Containers; Prefetch]
23
Containers
Images sorted by spatial location
–Retrieving one container accesses related images
Minimizes impact on archive name space
–HPSS stores 680 TBytes in 17 million files
Minimizes distribution of images across tapes
Bulk unload by transport of containers
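One way to picture spatial aggregation is to bucket images by a sky-grid cell, so that neighbouring images land in the same container. The slide says only that images were sorted by spatial location; the fixed one-degree grid, the function names, and the sample coordinates below are assumptions made for illustration.

```python
import math
from collections import defaultdict


def container_key(ra_deg, dec_deg, cell_deg=1.0):
    """Grid cell containing a sky position (hypothetical scheme)."""
    return (math.floor(ra_deg / cell_deg), math.floor(dec_deg / cell_deg))


def pack(images, cell_deg=1.0):
    """Group (name, ra, dec) records into containers keyed by grid cell."""
    containers = defaultdict(list)
    for name, ra, dec in images:
        containers[container_key(ra, dec, cell_deg)].append(name)
    return dict(containers)


images = [
    ("a.fits", 10.2, -5.1),
    ("b.fits", 10.7, -5.9),   # same one-degree cell as a.fits
    ("c.fits", 200.0, 30.0),
]
containers = pack(images)
```

Three files become two archive objects here, and a request for the region around `a.fits` brings `b.fits` along in the same transfer.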
24
SRB Development
Peer-to-peer federation
–Support multiple independent MCAT catalogs
–Replicate metadata
mySQL/BerkeleyDB port
OGSA/OGSI compliant interface
GridFTP interfaces
–Waiting for next release of the software (4thQ)
25
MySRB Features
Data & File Management
Collection Creation and Management
Collection of Varied Objects
–Files, SQL Objects, Databases, URLs, directories, archives, …
Metadata Handling
Browsing & Querying Interface
Access Control
Version Control (soon)
Support proxy (remote) operations
26
MySRB
Web-based Access to the SRB
Secure HTTP
Uses Cookies for Session Control
Self Registration of Users Supported
–Currently limited to SDSC users
Self Registration of Resources (soon)
Access to Both Data and Metadata
27
Data Management
Browse in Hierarchical Collections
Registration of (remote) Legacy Files & Directories
Registration of SQL Objects
Registration of URLs
Data Movement Operations
–Ingest & Re-Ingest, Delete, Unlink
–Replicate, Copy, Move, S-Link
Access Control Operations
–Read, Write, Own, Curate, Annotate, …
–Ticket-based Access
Version Control Operations (soon)
–Read Lock, Write Lock, Unlock
–Check In, Check Out
28
Types of Metadata
System-level Metadata
–Size, resource, owner, date, access control, …
User-defined Metadata
–for data & collections – triples
–No limits in number of metadata
–Support for Collection-level schemas
  Comments, default values, drop-down lists
–Support for Standardized Schemas (e.g. Dublin Core)
Annotations
–Supports textual annotations
–Annotator, date, context also registered
29
Metadata Management
Insert, Update and Delete of Metadata
Access Control for Metadata (soon in mySRB)
Querying across system-level, user-defined metadata and annotations
–Query under collections & across collections
Browsing on user-defined metadata
Metadata supported for legacy files & directories
Extract Metadata (using proxy operations)
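A query that spans system-level metadata, user-defined metadata, and annotations, either under one collection or across all of them, might look like the sketch below. Everything here is hypothetical: the `Entry` record, the `query` signature, and the layout of user-defined metadata as (attribute, value, unit) triples, which is an assumed reading and not a documented SRB format.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    logical_path: str
    system: dict                                     # e.g. {"owner": ..., "size": ...}
    user: list = field(default_factory=list)         # (attribute, value, unit), assumed layout
    annotations: list = field(default_factory=list)  # (annotator, date, text)


def query(entries, collection="/", system=None, user_attr=None, annotation_text=None):
    """Return logical paths under `collection` that satisfy every given filter."""
    prefix = collection.rstrip("/") + "/"
    hits = []
    for e in entries:
        if not e.logical_path.startswith(prefix):
            continue
        if system and any(e.system.get(k) != v for k, v in system.items()):
            continue
        if user_attr and not any((a, v) == user_attr for a, v, _unit in e.user):
            continue
        if annotation_text and not any(annotation_text in text
                                       for _who, _date, text in e.annotations):
            continue
        hits.append(e.logical_path)
    return hits


entries = [
    Entry("/sky/2mass/img1", {"owner": "u1", "size": 2_000_000},
          [("band", "J", "")], [("u2", "2003-01-01", "check calibration")]),
    Entry("/sky/dposs/plate9", {"owner": "u1", "size": 1_000}, [("band", "F", "")]),
    Entry("/bio/scan3", {"owner": "u3"}, [("band", "J", "")]),
]

across_collections = query(entries, user_attr=("band", "J"))
under_sky = query(entries, collection="/sky", system={"owner": "u1"})
annotated = query(entries, collection="/sky", annotation_text="calibration")
```

The default `collection="/"` searches across collections, while passing a collection path restricts the same filters to that subtree.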