San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Integration of Data Grids, Digital Libraries, and Persistent.

1 San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Integration of Data Grids, Digital Libraries, and Persistent Archives (Storage Resource Broker - SRB) Arcot Rajasekar Michael Wan Reagan W. Moore (sekar, mwan,

2 SDSC SRB Team
Reagan Moore, Michael Wan, Arcot Rajasekar, Wayne Schroeder, Arun Jagatheesan, Charlie Cowart, Lucas Gilbert, George Kremenek, Sheau-Yen Chen, Bing Zhu, Roman Olschanowsky (BIRN), Vicky Rowley (BIRN), Marcio Faerman (SCEC), Antoine De Torcy (IN2P3)
Students & emeritus
–Erik Vandekieft
–Reena Mathew
–Xi (Cynthia) Sheng
–Allen Ding
–Grace Lin
–Qiao Xin
–Daniel Moore
–Ethan Chen
–Jon Weinburg

3 Topics
Concepts behind data management
Production data grid examples
Integration of data grids with digital libraries and persistent archives

4 Data Grid
Support data sharing between institutions
–Discover relevant data without knowing the file name
–Access data without knowing the storage location or storage access protocol
–Retrieve data using your preferred API
Organize distributed data in a collection hierarchy
Manage latency in wide-area networks
Manage PetaBytes of data and hundreds of millions of files

5 Digital Library
Provide curation services
–Organization, description, and management of data
–Support schema extension
Provide access services
–Discovery, browsing, presentation, and manipulation of data
Federate semantics across collections
–Digital library crosswalks

6 Persistent Archive
Support archival processes
–Appraisal, accession, arrangement, description, preservation, and access
Manage technology evolution while preserving integrity and authenticity of data
Minimize risk of data loss
–Preserve collections for hundreds of years
–Data replication

7 Challenges
Each community assigns different meanings to terms used to describe their requirements
Data grid community
–Persistent Archive is the infrastructure that manages storage technology evolution while preserving a collection
Archivist community
–Persistent Archive is the collection that is being preserved in some choice of infrastructure
Together they define a preservation environment

8 Challenges
Preservation community traditionally views technology evolution as the problem rather than the solution
–Preservation requires the ability to manipulate old formats
Digital library community attempts to assert exact meaning for semantics
–Metadata Encoding and Transmission Standard is one approach towards the creation of a metadata framework with the ability to support extension schema
Data grid community has not chosen standards for distributed data management
–Computer science is just starting to understand how to characterize and manage data, information, and knowledge

9 To Make Progress
Develop the simplest possible description of data, information, and knowledge management
Identify common infrastructure components
Apply in production settings
–Iterate, based on new expectations for data management

10 Common Requirements for Data Management
Distributed data sources
–Management across administrative domains
Heterogeneity
–Multiple types of storage repositories
Scalability
–Support for billions of digital entities, PetaBytes of data
Preservation
–Management of technology evolution

11 SRB Collections at SDSC

12 Data Management Concepts (Elements)
Collection
–The organization of digital entities to simplify management and access
Context
–The information that describes the digital entities in a collection
Content
–The digital entities in a collection

13 Types of Context Metadata
Descriptive
–Provenance information, discovery attributes
Administrative
–Location, ownership, size, time stamps
Structural
–Data model, internal components
Behavioral
–Display and manipulation operations
Authenticity
–Audit trails, checksums, access controls

14 Metadata Standards
METS - Metadata Encoding and Transmission Standard
–Defines standard structure and schema extension
OAIS - Open Archival Information System
–Preservation packages for submission, archiving, distribution
OAI - Open Archives Initiative
–Metadata retrieval based on Dublin Core provenance attributes

15 Data Management Concepts (Mechanisms)
Curation
–The process of creating the context
Closure
–Assertion that the collection has global properties, including completeness and homogeneity under specified operations
Consistency
–Assertion that the context represents the content

16 Information Technologies
Data collecting
–Sensor systems, object ring buffers and portals
Data organization
–Collections, manage data context
Data sharing
–Data grids, manage heterogeneity
Data publication
–Digital libraries, support discovery
Data preservation
–Persistent archives, manage technology evolution
Data analysis
–Processing pipelines, manage knowledge extraction

17 Assertion
Data Grids provide the underlying abstractions required to support
–Digital libraries: curation processes, distributed collections, discovery and presentation services
–Persistent archives: management of technology evolution, preservation of authenticity
The management of data requires the use of information (semantic labels).
The management of information requires the use of knowledge (relationships).

18 Data Grid Terms
Data
–Bits - zeros and ones
Digital Entity
–The bits that form an image of reality (file, object, image, data, metadata, string of bits, structured sets of strings of bits)
Information
–Semantic labels applied to data
Metadata
–Semantic label and the associated data (attribute name and attribute value)
Knowledge
–Relationships between semantic labels applied to data
–Relationships used to assert the application of a semantic label

19 Data Grid Components
Federated client-server architecture
–Servers can talk to each other independently of the client
Infrastructure independent naming
–Logical names for users, resources, files, applications
Collective ownership of data
–Collection-owned data, with infrastructure independent access control lists
Context management
–Record state information in a metadata catalog from data grid services such as replication
Abstractions for dealing with heterogeneity

20 Data Grid Abstractions
Logical name space for files
–Global persistent identifier
Storage repository virtualization
–Standard operations supported on storage systems
Information repository virtualization
–Standard operations to manage collections in databases
Access virtualization
–Standard interface to support alternate APIs
Latency management mechanisms
–Aggregation, parallel I/O, replication, caching
Security interoperability
–GSSAPI, inter-realm authentication, collection-based authorization

21 Storage Repository Virtualization
Diagram labels: User Application; Archive; Database; File System

22 Storage Repository Virtualization
Diagram labels: User Application; Archive; Database; File System
Common set of operations for interacting with every type of storage repository
Remote operations: Unix file system, latency management, procedures, transformations, third party transfer, filtering, queries
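
The idea of a common set of operations over every type of storage repository can be sketched as a driver interface. This is a minimal illustration, not the SRB driver API: the class names `StorageDriver`, `MemoryDriver`, and `LocalFileDriver`, and the helper `copy_between`, are hypothetical, and only three operations are shown.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical common interface; every repository type implements it."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...


class MemoryDriver(StorageDriver):
    """Stands in for a repository with its own protocol (archive, database)."""

    def __init__(self):
        self._objects = {}

    def write(self, path, data):
        self._objects[path] = bytes(data)

    def read(self, path):
        return self._objects[path]

    def delete(self, path):
        del self._objects[path]


class LocalFileDriver(StorageDriver):
    """Maps the same operations onto a local file system under `root`."""

    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def write(self, path, data):
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as handle:
            handle.write(data)

    def read(self, path):
        with open(self._full(path), "rb") as handle:
            return handle.read()

    def delete(self, path):
        os.remove(self._full(path))


def copy_between(source: StorageDriver, destination: StorageDriver, path: str):
    """The caller never needs to know which kind of repository is involved."""
    destination.write(path, source.read(path))
```

Because the application only sees `StorageDriver`, supporting a new repository type means writing one more driver class rather than changing the application.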

23 Mappings on Resource Name Space
Define logical resource name
–List of physical resources
Replication
–Write to a logical resource completes when all physical resources have a copy
Load balancing
–Write to a logical resource completes when a copy exists on the next physical resource in the list
Fault tolerance
–Write to a logical resource completes when copies exist on “k” of “n” physical resources
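
The three write-completion rules on this slide can be sketched as policies over one list of physical resources. This is an illustrative model only: `PhysicalResource`, `LogicalResource`, and the `online` flag used to simulate an outage are hypothetical names, not part of SRB.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalResource:
    """Hypothetical stand-in for one storage system; `online` simulates outages."""
    name: str
    online: bool = True
    files: dict = field(default_factory=dict)

    def store(self, logical_name: str, data: bytes) -> bool:
        if not self.online:
            return False
        self.files[logical_name] = data
        return True


class LogicalResource:
    """A logical resource name mapped to an ordered list of physical resources."""

    def __init__(self, resources):
        self.resources = list(resources)
        self._next = 0  # round-robin cursor used by the load-balancing policy

    def write_replicated(self, name, data):
        """Replication: complete only if every physical resource holds a copy."""
        results = [r.store(name, data) for r in self.resources]
        return all(results)

    def write_balanced(self, name, data):
        """Load balancing: one copy on the next physical resource in the list."""
        target = self.resources[self._next % len(self.resources)]
        self._next += 1
        return target.store(name, data)

    def write_k_of_n(self, name, data, k):
        """Fault tolerance: complete once copies exist on at least k resources."""
        copies = 0
        for resource in self.resources:
            if resource.store(name, data):
                copies += 1
            if copies >= k:
                return True
        return False
```

With one of three resources offline, the replicated write fails while a "2 of 3" write still completes, which is the trade-off the slide describes.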

24 Containers
Archivists store hardcopy in “cardboard boxes”
A container is the digital equivalent, the aggregation of digital files into a single file, with an associated “packing list”
Containers are used to minimize access latency, keep similar digital entities together
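
A container with a packing list can be sketched in a few lines. The layout below (members concatenated back to back, packing list kept as JSON with an offset and length per member) is an assumption made for illustration; the slide does not describe the actual SRB container format, and `pack_container` and `read_member` are hypothetical names.

```python
import io
import json


def pack_container(files):
    """Concatenate files into one byte string plus a packing list.

    `files` maps a logical name to bytes. The packing list records each
    member's offset and length inside the container.
    """
    buffer = io.BytesIO()
    packing_list = {}
    for name, data in files.items():
        packing_list[name] = {"offset": buffer.tell(), "length": len(data)}
        buffer.write(data)
    return buffer.getvalue(), json.dumps(packing_list)


def read_member(container, packing_list_json, name):
    """Extract one member using only the packing list, without scanning."""
    entry = json.loads(packing_list_json)[name]
    start = entry["offset"]
    return container[start:start + entry["length"]]
```

Storing many small files as one container means the archive handles a single large object, while the packing list still allows any one member to be located directly.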

25 Data Stored at SDSC
HPSS archive
–Stores 1 Petabyte of data
–Stores 17 million files
Storage Resource Broker data grid
–Stores 114 Terabytes of data
–Stores 31 million files
–Containers are used to aggregate files before loading into HPSS

26 Unix Shell Java, NT Browsers GridFTP OAI WSDL SDSC Storage Resource Broker & Meta-data Catalog HRM Archives HPSS, ADSM, UniTree, DMF Databases DB2, Oracle, Postgres File Systems Unix, NT, Mac OSX Application C, C++, Libraries Access APIs Drivers Storage Abstraction Catalog Abstraction Databases DB2, Oracle, Sybase, SQLServer Consistency Management / Authorization-Authentication Logical Name Space Latency Management Data Transport Metadata Transport SRB Server Linux I/O DLL / Python

27 Production Data Grid
SDSC Storage Resource Broker
–Federated client-server system, managing over 100 TBs of data at SDSC and over 25 million files
–Manages data collections stored in: archives (HPSS, UniTree, ADSM, DMF); Hierarchical Resource Managers; tapes, tape robots; file systems (Unix, Linux, Mac OS X, Windows); FTP sites; databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix); Virtual Object Ring Buffers

28 Data Virtualization
Diagram labels: User Application; Archive at SDSC; Database at U Md; File System at U Texas

29 Data Virtualization
Diagram labels: User Application; Archive at SDSC; Database at U Md; File System at U Texas
Common naming convention and set of attributes for describing digital entities
Logical name space
Location independent identifier
Persistent identifier
Collection owned data
Access controls
Audit trails
Checksums
Descriptive metadata
Inter-realm authentication
Single sign-on system

30 Logical Name Space
Persistent, location-independent identifiers for digital entities
–Organized as collection hierarchy
–Attributes mapped to logical name space
Attributes managed in a database
Types of administrative metadata
–Physical location of file
–Owner, size, creation time, update time
–Access controls

31 File Identifiers
Logical file name
–Infrastructure independent
–Used to organize files into a collection hierarchy
Globally unique identifier
–GUID for asserting equivalence across collections
Descriptive metadata
–Support discovery
Physical file name
–Location of file
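
The four kinds of file identifier can be seen working together in a small in-memory catalog. This is a sketch under stated assumptions: `CatalogEntry` and `LogicalNameSpace` are hypothetical names, the GUID is generated with Python's `uuid4`, and the physical names used with it are made-up strings rather than real storage addresses.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    guid: str                                      # globally unique identifier
    replicas: list = field(default_factory=list)   # physical file names
    metadata: dict = field(default_factory=dict)   # descriptive attributes


class LogicalNameSpace:
    """Maps infrastructure-independent logical names to catalog entries."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, physical_name, **metadata):
        """Register a physical copy under a logical name; returns the GUID."""
        entry = self._entries.get(logical_name)
        if entry is None:
            entry = CatalogEntry(guid=str(uuid.uuid4()))
            self._entries[logical_name] = entry
        entry.replicas.append(physical_name)
        entry.metadata.update(metadata)
        return entry.guid

    def resolve(self, logical_name):
        """Logical-to-physical mapping: where are the copies right now?"""
        return list(self._entries[logical_name].replicas)

    def move(self, logical_name, old_physical, new_physical):
        """Relocate a copy; the logical name and GUID do not change."""
        replicas = self._entries[logical_name].replicas
        replicas[replicas.index(old_physical)] = new_physical

    def find(self, **conditions):
        """Discovery by descriptive metadata instead of by file name."""
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry.metadata.get(k) == v for k, v in conditions.items())
        )

    def list_collection(self, collection):
        """Logical names form a hierarchy, so a collection is a path prefix."""
        prefix = collection.rstrip("/") + "/"
        return sorted(n for n in self._entries if n.startswith(prefix))
```

Only `move` ever touches a physical file name, which is why applications that hold logical names keep working after data is migrated to new storage.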

32 Information Repository Virtualization
Diagram labels: User Application; Choice of database for Metadata Catalog
Common operations for managing a catalog in a database
Operations used to manage administrative, descriptive, user-defined metadata: import from XML file, export to XML file, bulk load, bulk unload, schema extension, access controls, dynamic SQL generation
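
Two of the listed catalog operations, schema extension and dynamic SQL generation, can be sketched with Python's built-in `sqlite3` module. The attribute-value table layout and the function names `open_catalog`, `add_attributes`, `build_query`, and `find` are assumptions made for this example; the slide does not describe the MCAT schema.

```python
import sqlite3


def open_catalog():
    """Create an in-memory catalog with one attribute-value table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata (logical_name TEXT, attribute TEXT, value TEXT)"
    )
    return conn


def add_attributes(conn, logical_name, **attributes):
    """Schema extension: a new attribute is a new row, not a new column."""
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [(logical_name, key, str(value)) for key, value in attributes.items()],
    )


def build_query(conditions):
    """Generate SQL requiring every attribute condition to hold.

    Values are passed as bound parameters rather than pasted into the SQL.
    """
    if not conditions:
        raise ValueError("at least one attribute condition is required")
    clause = "SELECT logical_name FROM metadata WHERE attribute = ? AND value = ?"
    sql = " INTERSECT ".join([clause] * len(conditions)) + " ORDER BY logical_name"
    params = [item for key, value in conditions.items() for item in (key, str(value))]
    return sql, params


def find(conn, **conditions):
    """Attribute-based query: return logical names matching all conditions."""
    sql, params = build_query(conditions)
    return [row[0] for row in conn.execute(sql, params)]
```

Because the query text is generated from the conditions, the same `find` call works for attributes that did not exist when the catalog was created.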

33 Access Virtualization
Diagram labels: Application; C, C++, Libraries; Linux I/O; DLL / Python; Unix Shell; Java, NT Browsers; GridFTP; OAI; WSDL
Common operations performed on all storage repositories
Map from API to remote operations: Unix file system, latency management, procedures, transformations, third party transfer, filtering, queries

34 Technology Evolution
All components of the “Persistent Archive” will evolve
–Hardware systems
–Software systems
–Protocols
–Access methods
–Encoding syntax for digital entities
Create drivers for each new storage repository protocol
–Migrate data to each new storage system
Manage evolution of the encoding syntax through either transformative migration or emulation

35 Are Repeated Media Migrations Feasible?
At SDSC, cartridge capacity has increased from 200 Mbytes to 200 Gbytes for the same cartridge cost
Only migrate to new technology when the cost per Gigabyte is a factor of two lower
Then the total media cost, summed over all migrations, is bounded at twice the initial cost: (1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + …) = 2
SDSC migrates to new media to reduce cost
–All tapes are stored in robots to minimize labor costs
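
The cost argument is a geometric series, and a short calculation shows how quickly it approaches its limit. The function name `total_media_cost` and its parameters are hypothetical; the only input taken from the slide is the rule that each migration costs half as much as the previous one.

```python
def total_media_cost(initial_cost, migrations, cost_ratio=0.5):
    """Media cost of the first copy plus `migrations` later migrations.

    Each migration is assumed to cost `cost_ratio` times the previous one,
    which is the slide's "factor of two lower" rule when cost_ratio is 0.5.
    """
    return sum(initial_cost * cost_ratio ** i for i in range(migrations + 1))


def cost_limit(initial_cost, cost_ratio=0.5):
    """Closed form of the infinite series: initial_cost / (1 - cost_ratio)."""
    return initial_cost / (1 - cost_ratio)
```

For example, `total_media_cost(1.0, 5)` covers the six terms written on the slide and gives 1.96875, already close to the limit of 2 returned by `cost_limit(1.0)`.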

36 Transformative Migration versus Emulation versus Digital Ontology
Transformative Migration
–Transform the encoding format to a new standard
–Can combine encoding format transformation with media migration
Emulation
–Create a transportable parser for the original encoding format
–Migrate emulator forward in time
–Example - Multivalent Browser (written in Java) for parsing PDF, LaTeX, …
Digital ontology
–Characterize the structures and relationships present within the digital entity
–Migrate the characterization forward in time

37 Persistent Archives
When migrating from an old technology to a new technology, both versions are available.
Virtualization mechanisms used for federation across space can be used to manage migration over time
Persistent archives can be built on data grid infrastructure

38 Automation of Archival Processes
Archival Process   Functionality
Appraisal          Assessment of digital entities
Accession          Import of digital entities
Description        Assignment of preservation metadata
Arrangement        Logical organization of digital entities
Preservation       Long-term storage
Access             Discovery and retrieval

39 Data Grid Core Capabilities
Storage repository abstraction
–Storage interface to at least one repository
–Standard data access mechanism
–Standard data movement protocol support
–Containers for data
Logical name space
–Registration of files in logical name space
–Retrieval by logical name
–Logical name space structural independence from physical file
–Persistent handle

40 Information Repository Abstraction
Collection owned data
Collection hierarchy for organizing logical name space
Standard metadata attributes (controlled vocabulary)
Attribute creation and deletion
Scalable metadata insertion
Access control lists for logical name space
Attributes for mapping from logical file name to physical file
Encoding format specification attributes
Data referenced by catalog query
Containers for metadata

41 Distributed Resilient Architecture
Specification of system availability
Standard error messages
Status checking
Authentication mechanism
Specification of reliability against permanent data loss
Specification of mechanism to validate integrity of data
Specification of mechanism to assure integrity of data
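
One common mechanism for validating integrity is to record a checksum when a file is registered and compare it against each replica later. The sketch below assumes SHA-256 from Python's `hashlib`; the slides mention checksums but do not name an algorithm, and `checksum` and `validate_replicas` are hypothetical helper names.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Digest recorded in the catalog when the file is first registered."""
    return hashlib.sha256(data).hexdigest()


def validate_replicas(recorded: str, replicas: dict) -> list:
    """Return the names of replicas whose content no longer matches.

    `replicas` maps a replica name (for example a storage location) to the
    bytes currently read back from that location.
    """
    return sorted(
        name for name, data in replicas.items() if checksum(data) != recorded
    )
```

A damaged replica found this way can then be restored from a copy that still matches, which is where validation connects to assuring integrity through replication.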

42 Virtual Data Grid
Knowledge repositories for managing collection properties
Characterization of the application of transformative migrations on encoding format
Characterization of the application of archival processes

43 Federated SRB server model
Diagram labels: Read Application; SRB server; SRB agent; MCAT; Logical Name Or Attribute Condition; Peer-to-peer Brokering; Server(s) Spawning; Data Access; Parallel Data Access; R1; R2; 5/6
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control

44 Latency Management - Bulk Operations
Bulk register
–Create a logical name for a file
–Load context (metadata)
Bulk load
–Create a copy of the file on a data grid storage repository
Bulk unload
–Provide containers to hold small files and pointers to each file location
Bulk delete
Requests for bulk operations for access control, …
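
Why bulk operations help with latency can be shown by counting round trips. This is a toy model, not SRB code: `CountingCatalog` and `estimated_wait` are hypothetical, and the model ignores transfer time and server processing so that only the per-request network delay is visible.

```python
class CountingCatalog:
    """Hypothetical metadata catalog that counts client-server round trips."""

    def __init__(self):
        self.entries = {}
        self.round_trips = 0

    def register(self, logical_name, metadata):
        """One request per file: create the logical name and load its context."""
        self.round_trips += 1
        self.entries[logical_name] = dict(metadata)

    def bulk_register(self, records):
        """One request for many files: same result, a single round trip."""
        self.round_trips += 1
        for logical_name, metadata in records:
            self.entries[logical_name] = dict(metadata)


def estimated_wait(round_trips, latency_seconds):
    """Time spent waiting on the network, ignoring transfer and server time."""
    return round_trips * latency_seconds
```

Registering 1000 files one at a time over a link with 50 ms latency waits on the network for about 50 seconds in this model, while a single bulk request waits for about 50 ms.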

45 SRB Latency Management Replication Server-initiated I/O Streaming Parallel I/O Caching Client-initiated I/O Remote Proxies, Staging Data Aggregation Containers Source Destination Prefetch Network Destination Network

46 Southern California Earthquake Center
Build community digital library
Manage simulation and observational data
–Anelastic wave propagation output
–10 TBs, 1.5 million files
Provide web-based interface
–Support standard services on digital library
Manage data distributed across multiple sites
–USC, SDSC, UCSB, SDSU, SIO
Provide standard metadata
–Community based descriptive metadata
–Administrative metadata
–Application specific metadata

47 SCEC Digital Library Technologies
Portals
–Knowledge interface to the library, presenting a coherent view of the services
Knowledge Management Systems
–Organize relationships between SCEC concepts and semantic labels
Process management systems
–Data processing pipelines to create derived data products
Web services
–Uniform capabilities provided across SCEC collections
Data grid
–Management of collections of distributed data
Computational grid
–Access to distributed compute resources
Persistent archive
–Management of technology evolution

48 Metadata Organization (Domain View versus Run View)
Diagram labels: Domain List Formatting Output Run Provenance Velocity Model Fault Model Physical Numerical Spatial Temporal Domain... Simulation Model Program Computer System

51 Zone SRB Federation
Mechanisms to impose consistency and access constraints when sharing:
–Resources: controls on which zones may use a resource
–User names (user-name / domain / SRB-zone): users may be registered into another domain, but retain their home zone, similar to Shibboleth
–Data files: controls on who specifies replication of data
–Context metadata: controls on who manages updates to metadata

52 Data Grid Federation - zoneSRB
Diagram labels: Unix Shell Java, NT Browsers OAI, WSDL, OGSA HTTP Archives - Tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS Databases DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix File Systems Unix, NT, Mac OSX Application ORB Storage Repository Virtualization Catalog Abstraction Databases DB2, Oracle, Sybase, Postgres, mySQL, Informix C, C++, Java Libraries Logical Name Space Latency Management Data Transport Metadata Transport Consistency & Metadata Management / Authorization-Authentication Audit Linux I/O DLL / Python, Perl Federation Management

53 Peer-to-Peer Federation
1. Occasional Interchange - for specified users
2. Replicated Catalogs - entire state information replication
3. Resource Interaction - data replication
4. Replicated Data Zones - no user interactions between zones
5. Master-Slave Zones - slaves replicate data from master zone
6. Snow-Flake Zones - hierarchy of data replication zones
7. User / Data Replica Zones - user access from remote to home zone
8. Nomadic Zones “SRB in a Box” - synchronize local zone to parent zone
9. Free-floating “myZone” - synchronize without a parent zone
10. Archival “BackUp Zone” - synchronize to an archive
SRB Version released December 19, 2003

54 Principal peer-to-peer federation approaches (1536 possible combinations)

55 Replicated Catalog Archival Partial User-ID Sharing Partial Resource Sharing No Metadata Synch Hierarchical Zone Organization One Shared User-ID System Managed Replication Connection From Any Zone Complete Resource Sharing System Set Access Controls System Controlled Complete Synch Complete User-ID Sharing System Managed Replication System Set Access Controls System Controlled Partial Synch No Resource Sharing Super Administrator Zone Control System Controlled Complete Synch Complete User-ID Sharing Peer-to-Peer Zones Replication Zones Hierarchical Zones Occasional Interchange Free Floating Resource Interaction User and Data Replica Nomadic Snow Flake Master Slave Replicated Data

56 Deep Archive
Impose sharing constraints:
–Only system administrator access
–Selected replication of files
–Write once, with versions created on changes to data
Impose consistency constraints
–Coordinate update of preservation metadata with file replication
Manage replication of both data and metadata
Use federation to guarantee preservation against
–Local hardware and software failures
–Local operation errors
–Local disasters

57 Research
Information (semantic label) is an assertion that some criteria were met for the application of the label
–Need to describe and manage the assertions (rules and relationships) used to apply semantic labels
Information (semantic label) expresses a context-related meaning that should be associated with a digital entity
–Meaning is determined by the context
Characterization of information requires the ability to describe
–The context that defines the assertions for assigning the label
–The context that explains the meaning of the label
Organization of information requires the use of relationships (knowledge)

58 Knowledge Based Data Grid Roadmap Attributes Semantics Knowledge Information Data Ingest Services Management Access Services (Model-based Access) (Data Handling System) MCAT/HDF Grids XML DTD SDLIP XTM DTD Rules - KQL Information Repository Attribute-based Query Feature-based Query Knowledge or Topic-Based Query / Browse Knowledge Repository for Rules Relationships Between Concepts Fields Containers Folders Storage (Replicas, Persistent IDs)

59 For More Information
Reagan W. Moore
San Diego Supercomputer Center

