Presentation is loading. Please wait.

Presentation is loading. Please wait.

Modern Data Management Overview Storage Resource Broker Reagan W. Moore

Similar presentations


Presentation on theme: "Modern Data Management Overview Storage Resource Broker Reagan W. Moore"— Presentation transcript:

1 Modern Data Management Overview Storage Resource Broker Reagan W. Moore moore@sdsc.edu http://www.sdsc.edu/srb

2 Topics Data management evolution Shared collections Digital Libraries Persistent Archives Building shared collections Project level / National level / International Demonstration of shared collections Access to collections at SDSC

3 Types of Data Management File system (AFS) Provides caching at remote sites, uses single authentication system Backup system (Veritas) Provides time-based copies of data, tools for re-loading backups Database system (Oracle 10g IFS) Can link metadata to files on an Internet File System Archive system (HPSS) Manages data stored on tape, supports parallel I/O streams Persistent object environment (Avaki) Provides vaults for storing objects Globus toolkit Provides differentiated services for building a data grid

4 Data Management Environments Data grids Manage shared collections Digital libraries Provide discovery, browsing, presentation services on top of collections Persistent archives Manage technology evolution while the authenticity and integrity of the assembled collection is preserved Real-time sensor networks Manage access to real-time data streams from thousands of sensors

5 Generic Infrastructure Can a single system provide all of the features needed to implement each type of data management system, while supporting access across administrative domains and managing data stored in multiple types of storage systems? Answer is data grid technology

6 Types of Data Management File system (AFS) Data grid manages replication, parallel I/O, containers Backup system (Veritas) Data grid supports replicas, versions, and snapshots of files and containers Database system (Oracle 10g IFS) Data grid virtualizes catalogs - schema extension, bulk metadata load Archive system (HPSS) Data grid integrates access across archives and file systems Persistent object environment (Avaki) Data grid manages user-defined metadata and collection hierarchy Globus toolkit - set of differentiated services Data grid manages consistent state information

7 Shared Collections Purpose of SRB data grid is to enable the creation of a collection that is shared between academic institutions Register digital entity into the shared collection Assign owner, access controls Assign descriptive, provenance metadata Manage state information Audit trails, versions, replicas, backups, locks Size, checksum, validation date, synchronization date, … Manage interactions with storage systems Unix file systems, Windows file systems, tape archives, … Manage interactions with preferred access mechanisms Web browser, Java, WSDL, C library, …

8 SRB server SRB agent SRB server Federated Server Architecture MCAT Read Application SRB agent 1 2 3 4 6 5 Logical Name Or Attribute Condition 1.Logical-to-Physical mapping 2.Identification of Replicas 3.Access & Audit Control Peer-to-peer Brokering Server(s) Spawning Data Access Parallel Data Access R1 R2 5/6

9 Generic Infrastructure Digital libraries now build upon data grids to manage distributed collections DSpace digital library - MIT and Hewlitt Packard Fedora digitial library - Cornell University and University of Virginia Persistent archives build upon data grids to manage technology evolution NARA research prototype persistent archive California Digital Library - Digital Preservation Repository NSF National Science Digital Library persistent archive

10 Southern California Earthquake Center SCEC Community Library Select Receiver (Lat/Lon) Output Time History Seismograms Select Scenario Fault Model Source Model Intuitive User Interface – –Pull-Down Query Menus – –Graphical Selection of Source Model – –Clickable LA Basin Map (Olsen) – –Seismogram/History extraction (Olsen) Access SCEC Digital Library – –Data stored in a data grid – –Annotated by modelers – –Standard naming convention – –Automated extraction of selected data and metadata – –Management of visualizations SCEC Digital Library

11 Terashake Data Handling Simulate 7.7 magnitude earthquake on San Andreas fault 50 Terabytes in a simulation Move 10 Terabytes per day Post-Processing of wave field Movies of seismic wave propagation Seismogram formatting for interactive on-line analysis Velocity magnitude Displacement vector field Cumulative peak maps Statistics used in visualizations Register derived data products into SCEC digital library

12 Humidity Climate Ecological Wireless Oceanography Wind Speed Climate Ecological Wireless Oceanography Seismic Geophysics ROADNet Sensor Network Data Integration Fire start Rain start Frank Vernon - UCSD/SIO

13 ChileJune 13, 2005 Mw 7.9 Frank Vernon - UCSD/SIO

14 National Science Digital Library URLs for educational material for all grade levels registered into repository at Cornell SDSC crawls the URLs, registers the web pages into a SRB data grid, builds a persistent archive 750,000 URLs 13 million web pages About 3 TBs of data

15

16

17 Worldwide University Network Data Grid SDSC Manchester Southampton White Rose NCSA U. Bergen A functioning, general purpose international Data Grid for academic collaborations Manchester-SDSC mirror

18 KEK Data Grid Japan Taiwan South Korea Australia Poland US A functioning, general purpose international Data Grid for high- energy physics Manchester-SDSC mirror

19 BaBar High-energy Physics Stanford Linear Accelerator Lyon, France Rome, Italy San Diego RAL, UK A functioning international Data Grid for high-energy physics Manchester-SDSC mirror Moved over 100 TBs of data

20 Astronomy Data Grid Chile Tucson, Arizona NCSA, Illinois A functioning international Data Grid for Astronomy Manchester-SDSC mirror Moved over 400,000 images

21 International Institutions (2005)

22

23 Unix Shell NT Browser, Kepler Actors OAI, WSDL, (WSRF) HTTP, DSpace, OpenDAP, GridFTP Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix File Systems Unix, NT, Mac OSX Application ORB Storage Repository Abstraction Database Abstraction Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix C Library, Java Logical Name Space Latency Management Data Transport Metadata Transport Consistency & Metadata Management / Authorization, Authentication, Audit Linux I/O C++ DLL / Python, Perl, Windows Federation Management Storage Resource Broker 3.3.1

24 SRB Objectives Automate all aspects of data discovery, access, management, analysis, preservation Security paramount Distributed data Provide distributed data support for Data sharing - data grids Data publication - digital libraries Data preservation - persistent archives Data collections - Real time sensor data

25 SRB Developers Reagan Moore - PI Michael Wan - SRB Architect Arcot Rajasekar - SRB Manager Wayne Schroeder - SRB Productization Charlie Cowart- inQ Lucas Gilbert - Jargon Bing Zhu - Perl, Python, Windows Antoine de Torcy - mySRB web browser Sheau-Yen Chen - SRB Administration George Kremenek- SRB Collections Arun Jagatheesan - Matrix workflow Marcio Faerman - SCEC Application Sifang Lu - ROADnet Application Richard Marciano - SALT persistent archives 75 FTE-years of support About 300,000 lines of C

26 History 1995 - DARPA Massive Data Analysis Systems 1997 - DARPA/USPTO Distributed Object Computation Testbed 1998 - NSF National Partnership for Advanced Computational Infrastructure 1998 - DOE Accelerated Strategic Computing Initiative data grid 1999 - NARA persistent archive 2000 - NASA Information Power Grid 2001 - NLM Digital Embryo digital library 2001 - DOE Particle Physics data grid 2001 - NSF Grid Physics Network data grid 2001 - NSF National Virtual Observatory data grid 2002 - NSF National Science Digital Library persistent archive 2003 - NSF Southern California Earthquake Center digital library 2003 - NIH Biomedical Informatics Research Network data grid 2003 - NSF Real-time Observatories, Applications, and Data management Network 2004 - NSF ITR, Constraint based data systems 2005 - LC Digital Preservation Lifecycle Management 2005 - LC National Digital Information Infrastructure and Preservation program

27 Development SRB 1.1.8 - December 15, 2000 Basic distributed data management system Metadata Catalog SRB 2.0 - February 18, 2003 Parallel I/O support Bulk operations SRB 3.0 - August 30, 2003 Federation of data grids SRB 3.3.1 - April 6, 2005 Feature requests (extensible schema)

28 SRB Latency Management Replication Server-initiated I/O Streaming Parallel I/O Caching Client-initiated I/O Remote Proxies, Staging Data Aggregation Containers Source Destination Prefetch Network Destination Network

29 Latency Management - Bulk Operations Bulk register Create a logical name for a file Load context (metadata) Bulk load Create a copy of the file on a data grid storage repository Bulk unload Provide containers to hold small files and pointers to each file location Bulk delete Trash can Sticky bits for access control, …

30 Logical Name Spaces Storage Repository Storage location User name File name File context (creation date,…) Access constraints Data Access Methods (C library, Unix, Web Browser) Data access directly between application and storage repository using names required by the local repository

31 Logical Name Spaces Storage Repository Storage location User name File name File context (creation date,…) Access constraints Data Grid Logical resource name space Logical user name space Logical file name space Logical context (metadata) Control/consistency constraints Data Collection Data Access Methods (C library, Unix, Web Browser) Data is organized as a shared collection

32 Federation Between Data Grids Data Grid Logical resource name space Logical user name space Logical file name space Logical context (metadata) Control/consistency constraints Data Collection B Data Access Methods (Web Browser, DSpace, OAI-PMH) Data Grid Logical resource name space Logical user name space Logical file name space Logical context (metadata) Control/consistency constraints Data Collection A Access controls and consistency constraints on cross registration of digital entities

33 Types of Risk Media failure Replicate data onto multiple media Vendor specific systemic errors Replicate data onto multiple vendor products Operational error Replicate data onto a second administrative domain Natural disaster Replicate data to a geographically remote site Malicious user Replicate data to a deep archive

34 How Many Replicas Three sites minimize risk Primary site Supports interactive user access to data Secondary site Supports interactive user access when first site is down Provides 2nd media copy, located at a remote site, uses different vendor product, independent administrative procedures Deep archive Provides 3rd media copy, staging environment for data ingestion, no user access

35 State of the Art Technology Grid - workflow virtualization Support execution of jobs (processes) across multiple compute servers Data grid - data virtualization Manage a shared collection that is distributed across multiple storage servers Semantic grid - information virtualization Create a common understanding of information (metadata) across multiple collections.

36 For More Information Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu http://www.sdsc.edu/srb/


Download ppt "Modern Data Management Overview Storage Resource Broker Reagan W. Moore"

Similar presentations


Ads by Google