Presentation is loading. Please wait.

Presentation is loading. Please wait.

National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids, Digital Libraries and Persistent Archives Reagan.

Similar presentations


Presentation on theme: "National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids, Digital Libraries and Persistent Archives Reagan."— Presentation transcript:

1 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids, Digital Libraries and Persistent Archives Reagan W. Moore San Diego Supercomputer Center

2 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data and Knowledge Systems Group Staff Reagan Moore Ilkai Altintas Chaitan Baru Sheau Yen Chen Charles Cowart Amarnath Gupta George Kremenek Bertram Ludäscher Richard Marciano XuFei Qian Roman Olshanowsky Arcot Rajasekar Abe Singer Michael Wan Ilya Zaslavsky Bing Zhu Graduate Students A. Bagchi S. Bansal A. Behere R. Bharath S. Bharath M. Kulrul L. Sui Undergraduate Interns N. Cotofana M. Shumaker J. Trang L. Yin +/- NN

3 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Information Management Projects Digital Libraries –California Digital Library - Art Museum Image COnsortium –DARPA/USPTO Patent digital library - SAIC, NCSA, U Virginia –NLM Visible Embryo digital library - George Mason University –NSF Digital Library Initiative, Phase II - UCSB, Stanford –NSF NPACI Digital Sky - Caltech 2MASS sky survey –NSF National Science Education Digital Library - UCAR, Cornell, UCSB, U Mass Data Grid Environments –DOE Data Visualization Corridor - LLNL –DOE Particle Physics Data Grid - Stanford, Caltech –NASA Information Power Grid - NASA Ames –NIH Biomedical Informatics Research Network - Duke, Harvard –NSF Grid Physics Network - U Florida, Caltech, U Wisc, USC/ISI, U Chicago –NSF National Virtual Observatory - Johns Hopkins University, Caltech –NSF Southern California Earthquake Center - USC/ISI Persistent Archives –NARA Persistent Archive - UCB, U Maryland –NHPRC Archivist workbench - U Minnesota –NSF NSDL Persistent archive for curricula modules - Cornell

4 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Challenges Integrate an existing data collection into a data intensive analysis environment Federate access across multiple data collections Enable knowledge generation across observational and simulation data

5 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Digital Library Technology Data integration –How are digital entities accessed from an analysis system? What application access APIs are available? –How are they named? How many levels of naming are needed? –What discovery mechanisms are possible? How is discovery automated?

6 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Unix Shell Java, NT Browsers Web WSDL GridFTP SDSC Storage Resource Broker & Meta-data Catalog Access APIs Archives HPSS, ADSM, UniTree, DMF Databases DB2, Oracle, Postgres File Systems Unix, NT, Mac OSX Application HRM Access APIs Servers Storage Abstraction Catalog Abstraction Databases DB2, Oracle, Sybase C, C++, Libraries Logical Name Space Latency Management Data Transport Metadata Transport Consistency Management / Authorization-Authentication Prime Server Linux I/O DLL / Python

7 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Digital Library Technology Data integration –How are digital entities accessed from an analysis system? What application access APIs are available? –How are they named? How many levels of naming are needed? –What discovery mechanisms are possible? How is discovery automated?

8 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Naming Conventions Concept spaceDiscipline concepts Data gridGlobal Identifier CollectionDiscipline attributes Archive / file systemsLocal file name Data modelAttributes that describe data structure

9 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Digital Library Technology Data integration –How are digital entities accessed from an analysis system? What application access APIs are available? –How are they named? How many levels of naming are needed? –What discovery mechanisms are possible? How is discovery automated?

10 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Digital Library Integrate hierarchical directory structure with collection-based attributes Collection Collection attributes Sub-Collection Sub-collection attributes Sub-Collection Sub-collection attributes Digital Entity Digital entity attributes

11 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Discovery Transparencies Naming transparency - find a data set without knowing its name –Map from attributes to a global file name Location transparency - access a data set without knowing where it is –Map from global file name to local file name Access transparency - access a data set without knowing the type of storage system –Federated client-server architecture

12 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Unix Shell Java, NT Browsers Web WSDL GridFTP SDSC Storage Resource Broker & Meta-data Catalog Transparencies Archives HPSS, ADSM, UniTree, DMF Databases DB2, Oracle, Postgres File Systems Unix, NT, Mac OSX Application HRM Access APIs Servers Storage Abstraction Catalog Abstraction Databases DB2, Oracle, Sybase C, C++, Libraries Logical Name Space Latency Management Data Transport Metadata Transport Consistency Management / Authorization-Authentication Prime Server Linux I/O DLL / Python

13 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grid Capabilities / Services Logical name space Attribute manipulation Data manipulation Data and metadata transport Latency management Fault tolerance

14 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grid Tasks How are collections federated? –What data handling capabilities are needed for access over wide area networks? –How is access latency minimized over wide area networks? –How can collections of distributed data be managed?

15 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grid Manipulations Cache data –Make a local copy Replicate data –Make two copies and keep track of their location –Manage names within a logical name space Shadow data –Register data from a collection into the logical name space

16 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Latency Management Data streaming –Overlap I/O access time with data movement Data caching –Create a local copy to minimize I/O access time Data replication –Choose between multiple sources for data access Data aggregation –Use containers to hold multiple small data sets I/O aggregation –Use remote proxies to do remote filtering/data subsetting Metadata aggregation –Use XML files to manage metadata in bulk

17 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Minimizing Latency in I/O Pipes Replication & Staging Parallel Streaming Caching Remote Proxies Aggregation Data, Metadata SourceDestination Pre-fetch Network

18 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Sharing Data Across Collections Discipline collection –Publish a link to data in your collection into a discipline collection –Shadow links support registration without data movement –Explicitly copy data - streaming environment Discipline catalog –Register metadata into discipline catalog –Provide handle for a service to access the data –Web services environment

19 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Management of Distributed Data Emergence of consensus on data management infrastructure Data grids - provide mechanisms to integrate distributed data resources –Logical name space Provide global, persistent, identifier –Extend name space to support grid attributes Replication, containers, reliable transport, community IDs –Collection management - extensible attributes SCEC - federate observational and simulation data

20 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Logical Name Space Supports a global name space used to reference data stored at multiple sites Organized as a directory/subdirectory or collection/sub-collection –Extended by adding attributes to support Replication Reliable file transport Containers Community ID for data ownership Access control lists Discipline specific attributes –Attributes managed as a collection

21 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Convergence of Technologies Data grids as basis for distributed data management –Federation of distributed resources –Creation of logical name space to automate discovery Distributed data collections –Discovery based on attributes –Distributed data storage systems Digital libraries –Development of services for manipulating, viewing data Persistent archives –Management of technology evolution

22 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Archives (Similar requirements to a data grid) Name transparency –Find a file by attributes (map from attributes to global name) Location transparency –Access a file by a global identifier (map from global to local file name) Access transparency –Use same API to access data in archive or file cache Authenticity –Disaster recovery, replicate data across storage systems –Audit and process management

23 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Unix Shell Java, NT Browsers Web WSDL GridFTP SDSC Storage Resource Broker & Meta-data Catalog Persistence Archives HPSS, ADSM, UniTree, DMF Databases DB2, Oracle, Postgres File Systems Unix, NT, Mac OSX Application HRM Access APIs Servers Storage Abstraction Catalog Abstraction Databases DB2, Oracle, Sybase C, C++, Libraries Logical Name Space Latency Management Data Transport Metadata Transport Consistency Management / Authorization-Authentication Prime Server Linux I/O DLL / Python

24 ERA Concept model

25 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Differentiating between Data, Information, and Knowledge Data –Digital object –Objects are streams of bits Information –Any tagged data, which is treated as an attribute. –Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object Knowledge –Relationships between attributes –Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional

26 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Knowledge Semantic / logical relationships - concept spaces –Digital library crosswalks Procedural / temporal relationships - process maps –Workflow systems Structural / spatial relationships - atlases –GIS systems

27 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Automation of Knowledge Generation Levels of abstraction for managing repositories: –Data handling system Storage Resource Broker - logical name space –Information repository Extensible Metadata CATalog - schema indirection –Knowledge repository Model Based Mediation - integration of concept spaces with information repositories

28 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Knowledge Creation Roadmap Knowledge syntax (consensus) –RDF, XMI, Topic Map Knowledge management (recursive operations) –Oracle parallel database Knowledge manipulation (spatial/procedural rules) –Generation of inference rules and mapping to data models Knowledge generation (scalable inference engine) –Application of inference rules in inference engine

29 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Teragrid as a Knowledge Generation Resource Analyze 10-TB collections –250 TBs of data cache - stage collection from archive –Access rate of 3 GB/sec - stream collection Support massively parallel relationship mining –500 node Linux cluster - parallel data mining –2 TBs of memory - aggregate knowledge relationships Archive collections –8 PB mass store - manage persistent collections

30 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Knowledge Based Data Grid Roadmap Attributes Semantics Knowledge Information Data Ingest Services ManagementAccess Services (Model-based Access) (Data Handling System - SRB) MCAT/HDF Grids XML DTD SDLIP XTM DTD Rules - KQL Information Repository Attribute- based Query Feature-based Query Knowledge or Topic-Based Query / Browse Knowledge Repository for Rules Relationships Between Concepts Fields Containers Folders Storage (Replicas, Persistent IDs)

31 National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Further Information


Download ppt "National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids, Digital Libraries and Persistent Archives Reagan."

Similar presentations


Ads by Google