Collection-Based Persistent Archives Arcot Rajasekar, Richard Marciano, Reagan Moore San Diego Supercomputer Center Presented by: Preetham A Gowda.


1 Collection-Based Persistent Archives Arcot Rajasekar, Richard Marciano, Reagan Moore San Diego Supercomputer Center Presented by: Preetham A Gowda

2 INTRODUCTION (CS791 - Spr05)
- At present, archival storage systems support the storage of data but provide little or no support for managing the information needed to interpret or discover archived data.
- As the volume of data increases, so does the need for an efficient infrastructure that can provide automated means of information management, querying, and access.

3 DATA Collections
- There is a strong relationship between digital objects and other data sets from the same discipline.
- Data collections are created by identifying common attributes.
- The common attributes then serve as meta-data for the data collection.
- These collections are organized as an object-oriented database (OODB) or a relational database (RDB).
- In the RDB case, the schema consists of the common attributes, information about how the collection has been organized, …, and the information required to federate two collections.
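The relational case can be illustrated with a minimal sketch, assuming Python's built-in `sqlite3` module and a hypothetical image collection. The table and column names (`objects`, `collection_meta`, `instrument`, `observed_on`, `archive_path`) are illustrative assumptions and do not come from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The common attributes shared by every object become columns of the schema.
conn.execute(
    "CREATE TABLE objects ("
    " object_id TEXT PRIMARY KEY,"
    " instrument TEXT NOT NULL,"
    " observed_on TEXT NOT NULL,"
    " archive_path TEXT NOT NULL)"
)
# Information about how the collection itself is organized is kept alongside.
conn.execute("CREATE TABLE collection_meta (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO collection_meta VALUES (?, ?)",
    [("name", "sky_survey"), ("organized_by", "instrument, observed_on")],
)
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [
        ("img001", "camera_a", "1999-03-01", "/archive/vol1/img001"),
        ("img002", "camera_b", "1999-03-02", "/archive/vol1/img002"),
        ("img003", "camera_a", "1999-03-05", "/archive/vol2/img003"),
    ],
)

def find_by_instrument(instrument):
    """Discover objects through a common attribute instead of a file path."""
    rows = conn.execute(
        "SELECT object_id FROM objects WHERE instrument = ? ORDER BY object_id",
        (instrument,),
    )
    return [row[0] for row in rows]

matches = find_by_instrument("camera_a")
```

Here the attribute columns play the role of the collection's meta-data, while `collection_meta` stands in for the organizational information the slide mentions.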

4 Persistent Collections
- Based on the concept that both the original digital objects and the information required to assemble them into a data collection must be archived.
- Archive digital objects as members of the data collection.
- Dynamically build the data collection from the individual data objects stored in the archive.
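The idea of archiving both the objects and the assembly information can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the JSON layout, the field names (`collection`, `attributes`, `members`), and the function `rebuild_collection` are all hypothetical.

```python
import json

# Archived form: the digital objects plus a technology-neutral description
# of the collection they belong to (hypothetical layout).
archived_objects = {
    "obj1": {"title": "Survey plate 1", "year": 1998, "checksum": "ab12"},
    "obj2": {"title": "Survey plate 2", "year": 1999, "checksum": "cd34"},
}
archived_description = json.dumps(
    {
        "collection": "plates",
        "attributes": ["title", "year"],
        "members": ["obj1", "obj2"],
    }
)

def rebuild_collection(description_text, objects):
    """Re-instantiate the collection from the archived description and objects."""
    description = json.loads(description_text)
    rows = []
    for object_id in description["members"]:
        record = objects[object_id]
        row = {"id": object_id}
        # Only the attributes named in the description become collection columns.
        for attribute in description["attributes"]:
            row[attribute] = record[attribute]
        rows.append(row)
    return {"name": description["collection"], "rows": rows}

collection = rebuild_collection(archived_description, archived_objects)
```

Because the description is stored as plain text next to the objects, the same rebuild step could in principle target a different database technology later.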

5 Integration of collections requires the ability to interpret how a collection is organized and the ability to dynamically build an information discovery interface into the new collection.
- Persistent collection: integration of two collections in time (the same collection instantiated on two different sets of technology).
- Federated collection: integration of two collections in space.

6 Information Architecture
The technologies available to build an information infrastructure are:
- Archives: to manage data sets distributed across tertiary storage systems
- Databases: to organize information about the data sets
- Data-handling systems: to provide APIs for access to the data collections
- Digital libraries: to provide services for manipulating and presenting the data collections
The integration of these technologies will lead to a collection-based persistent archive.
The National Partnership for Advanced Computational Infrastructure is developing an information infrastructure architecture to support the creation of scientific data collections using these technologies. An information infrastructure called DICE is being set up at the San Diego Supercomputer Center as a first step toward this goal.

7 "Data Intensive Computing Environment" (DICE)
- A general digital library system is currently being set up for ingesting, managing, archiving, and accessing several collections of scientific data.
- It would contain documents, images, field-generated data, and simulation results for disciplines ranging from astronomy and earth systems science to social science, ecology, and neuroscience.
- Information in the archived digital libraries should be available through the web as well as through APIs for processing on supercomputing platforms.
- It should provide a means of interaction between the disciplines and their collections. This requires a meta-data catalog for schema-level attributes such as discipline-specific ontologies and semantics.
- Because the data stored in this system evolves quickly, the plan is to migrate the ontologies forward in time.
- Not only the digital objects but also the methods and procedures needed to access them must be migrated forward.
- Go beyond storing preservation-level meta-data for the objects and also consider preservation-level meta-data for methods and APIs.

8 DICE (Contd..)
DICE is built around the Meta Data Catalog (MCAT) developed at SDSC. MCAT is a repository that handles three levels of meta-data:
- Digital object meta-data about type, formats, lineage (creation characteristics), ingestion protocols, usage methods, and domain-specific data set attributes; typically created for every data collection to support information discovery.
- System-level meta-data about audit trails, authentication, access control, and replication and partitioning of data sets; used to provide location transparency, access transparency, and protocol transparency.
- Schema-level meta-data, including ontology information; used to provide a way to migrate the collection to new technology and to federate data collections.
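The three levels can be pictured as separate records attached to one catalog entry. The sketch below uses Python dataclasses; every class and field name is a hypothetical choice made for illustration and should not be read as MCAT's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObjectMetadata:
    """Per-object description used for information discovery."""
    object_type: str
    data_format: str
    lineage: str
    domain_attributes: dict = field(default_factory=dict)

@dataclass
class SystemMetadata:
    """Operational information: who may access the object and where it lives."""
    access_control: dict = field(default_factory=dict)  # user -> permission
    replicas: list = field(default_factory=list)        # (resource, path) pairs
    audit_trail: list = field(default_factory=list)

@dataclass
class SchemaMetadata:
    """Collection-wide description used for migration and federation."""
    collection: str
    attributes: list = field(default_factory=list)
    ontology: dict = field(default_factory=dict)

@dataclass
class CatalogEntry:
    logical_name: str
    digital_object: DigitalObjectMetadata
    system: SystemMetadata
    schema: SchemaMetadata

entry = CatalogEntry(
    logical_name="survey/plate1",
    digital_object=DigitalObjectMetadata(
        object_type="image",
        data_format="FITS",
        lineage="scanned from photographic plate",
        domain_attributes={"band": "red"},
    ),
    system=SystemMetadata(
        access_control={"alice": "read"},
        replicas=[("tape_archive", "/vol3/000991")],
    ),
    schema=SchemaMetadata(
        collection="sky_survey",
        attributes=["band"],
        ontology={"band": "spectral passband of the exposure"},
    ),
)
```

Keeping the levels in separate records mirrors the slide's point that each level serves a different purpose: discovery, transparency of access, and migration or federation.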

9 MCAT Architecture
The MCAT is a database-backed catalog that provides a repository of meta-information about digital objects. It holds:
- Application-dependent meta-data that provides information specific to particular data sets and their collections (e.g., Dublin Core values for text objects)
- System-level meta-data that provides operational information, including information about resources, users, methods, and data objects
- Schema-level meta-data
The first two types of meta-data are extensible. MCAT provides APIs for creating, modifying, and deleting the above structures.

10 Schema-level meta-data
This includes:
- Logical Structure
- Attribute Clusters
- Token Attributes
- Linkages

11 MCAT Architecture (Contd..)
Figure from original paper

12 Data collection federation
Figure from original paper

13 Storage Resource Broker
- Provides a uniform API for access to heterogeneous archival storage systems.
- Deals with federation of storage sites and replication of data objects.
- The MCAT information catalog system plays a vital role in publishing authenticated information and in storing and disseminating the information.

14 SRB-MCAT System
Provides a data integration environment that offers:
- uniform access APIs across heterogeneous file systems, databases, and archival storage
- protocol transparency and location transparency when accessing distributed systems
- a uniform name space abstraction over the file systems that are being brokered
- meta-data-based access to files
- facilities for replicating, copying, or moving files across heterogeneous systems
- resource-level operations (proxy operations) on data before delivery to the client
- an integrated encryption and authentication system that can range from no security to fully encrypted and fully authenticated data transfer, including protection against man-in-the-middle intrusions
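A uniform access API over heterogeneous storage can be sketched with one abstract interface and interchangeable backends. This is not the SRB API: `StorageDriver`, `FileSystemDriver`, and `DatabaseDriver` are hypothetical names, and a temporary directory and an in-memory SQLite database stand in for real file systems and databases.

```python
import os
import sqlite3
import tempfile
from abc import ABC, abstractmethod

class StorageDriver(ABC):
    """The single interface a client sees, whatever the backend is."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...

class FileSystemDriver(StorageDriver):
    def __init__(self, root):
        self.root = root

    def put(self, name, data):
        with open(os.path.join(self.root, name), "wb") as handle:
            handle.write(data)

    def get(self, name):
        with open(os.path.join(self.root, name), "rb") as handle:
            return handle.read()

class DatabaseDriver(StorageDriver):
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, data BLOB)")

    def put(self, name, data):
        self.conn.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (name, data))

    def get(self, name):
        row = self.conn.execute(
            "SELECT data FROM blobs WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return bytes(row[0])

# The same client code runs unchanged against both backends.
drivers = {"fs": FileSystemDriver(tempfile.mkdtemp()), "db": DatabaseDriver()}
for driver in drivers.values():
    driver.put("sample.dat", b"payload")
results = {key: driver.get("sample.dat") for key, driver in drivers.items()}
```

The client loop never branches on the backend type, which is the sense of protocol transparency the slide describes.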

15 Users need uniform access to diverse storage resources in a heterogeneous computing environment because:
- The data sets under consideration can be very large, making it appropriate to store them directly in archival tape systems.
- The data sets may be too numerous to store in a single file system.
- The number of data sets may grow, with many of them being sparsely used after some initial period of time.

16 The SDSC Storage Resource Broker (SRB) is the middleware that provides distributed clients with uniform access to diverse storage resources in a heterogeneous computing environment. The SRB presents clients with a logical view of the data sets stored in it. Similar to a file name in the file system paradigm, each data set stored in the SRB has a logical name, which may be used as a handle for data operations.
Figure from original paper
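The role of a logical name as a stable handle can be sketched with a small in-memory catalog. The class `LogicalCatalog`, its methods, and the resource names are hypothetical and only illustrate the mapping from logical name to physical location.

```python
class LogicalCatalog:
    """Maps a logical data set name to a (resource, physical path) pair."""

    def __init__(self):
        self._locations = {}  # logical name -> (resource name, physical path)
        self._resources = {}  # resource name -> {physical path: bytes}

    def add_resource(self, resource_name):
        self._resources[resource_name] = {}

    def register(self, logical_name, resource_name, physical_path, data):
        self._resources[resource_name][physical_path] = data
        self._locations[logical_name] = (resource_name, physical_path)

    def locate(self, logical_name):
        return self._locations[logical_name]

    def read(self, logical_name):
        # The client supplies only the logical name; the catalog resolves it.
        resource_name, physical_path = self._locations[logical_name]
        return self._resources[resource_name][physical_path]

    def move(self, logical_name, new_resource, new_path):
        # Relocating the bytes changes the mapping, not the logical name.
        old_resource, old_path = self._locations[logical_name]
        data = self._resources[old_resource].pop(old_path)
        self._resources[new_resource][new_path] = data
        self._locations[logical_name] = (new_resource, new_path)

catalog = LogicalCatalog()
catalog.add_resource("disk")
catalog.add_resource("tape")
catalog.register("survey/plate1", "disk", "/d0/a17.bin", b"pixels")
before = catalog.read("survey/plate1")
catalog.move("survey/plate1", "tape", "/vol3/000991")
after = catalog.read("survey/plate1")
```

The read before and after the move uses the same handle, which is the location transparency a logical name space is meant to give.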

17 Collections in SRB
Data sets in the SRB are grouped into a logical (hierarchical) structure called collections. A collection provides an abstraction for:
- placing similar objects (possibly physically distributed) under one collection (e.g., the image collections of a museum)
- placing dissimilar objects that have a common connection under one abstraction (e.g., all the text paragraphs, images, figures, and tables of a document)
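A hierarchical collection of this kind can be sketched as a small tree in which each data set records the resource that holds it. The `Collection` class and all names below are hypothetical, chosen to echo the museum and document examples.

```python
class Collection:
    """A node in a logical hierarchy of collections and data sets."""

    def __init__(self, name):
        self.name = name
        self.subcollections = {}
        self.data_sets = {}  # logical name -> resource that physically holds it

    def subcollection(self, name):
        if name not in self.subcollections:
            self.subcollections[name] = Collection(name)
        return self.subcollections[name]

    def add(self, logical_name, resource):
        self.data_sets[logical_name] = resource

    def walk(self, prefix=""):
        """Yield the full logical path of every data set below this node."""
        path = f"{prefix}/{self.name}"
        for logical_name in sorted(self.data_sets):
            yield f"{path}/{logical_name}"
        for name in sorted(self.subcollections):
            yield from self.subcollections[name].walk(path)

root = Collection("museum")

# Similar objects, physically distributed, under one collection.
images = root.subcollection("images")
images.add("vase.tif", "tape_archive")
images.add("coin.tif", "disk_cache")

# Dissimilar objects with a common connection under one abstraction.
document = root.subcollection("catalog_doc")
document.add("intro.txt", "disk_cache")
document.add("figure1.png", "tape_archive")

paths = list(root.walk())
```

The logical paths are independent of where each object is stored, so objects on tape and on disk sit side by side in the same collection.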

18 Data Replication in SRB
Two ways:
- replicate an object during object creation or modification, using logical storage resources
- use the off-line replication facility to replicate an existing data set
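The two replication paths can be sketched as follows, assuming that a logical storage resource is a named group of physical resources that all receive a write. The classes and the function `replicate_offline` are hypothetical illustrations, not SRB calls.

```python
class PhysicalResource:
    def __init__(self, name):
        self.name = name
        self.store = {}  # logical name -> bytes

class LogicalStorageResource:
    """Groups physical resources so that one write produces several replicas."""

    def __init__(self, members):
        self.members = members

    def write(self, logical_name, data):
        # Replication at creation or modification time: fan the write out.
        for resource in self.members:
            resource.store[logical_name] = data
        return [resource.name for resource in self.members]

def replicate_offline(logical_name, source, target):
    """Copy an already stored data set to another resource after the fact."""
    target.store[logical_name] = source.store[logical_name]

disk = PhysicalResource("disk")
tape = PhysicalResource("tape")
remote = PhysicalResource("remote")

group = LogicalStorageResource([disk, tape])
written_to = group.write("run42.dat", b"results")
replicate_offline("run42.dat", disk, remote)
```

After both steps the data set has three replicas: two made synchronously by the write and one added later by the off-line copy.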

19 SRB provides a facility for resource-side proxy operations. SRB also provides authentication and encryption facilities, access control lists and ticket-based access, and auditing capabilities.

20 SRB Process Model
Figure from original paper

21 Summary
- SRB-MCAT supports federation of data objects.
- It provides the infrastructure to support a collection-based persistent archive distributed across multiple sites.

22 THANK YOU
Some of the phrases and lines of text used in this presentation are direct excerpts from the original paper.

