National Data Infrastructure Projects EarthCube Layered Architecture (GEO) DataNet Federation Consortium (OCI) integrated Rule Oriented Data System (SDCI)

Slides:



Advertisements
Similar presentations
GRADD: Scientific Workflows. Scientific Workflow E. Science laboris Workflows are the new rock and roll of eScience Machinery for coordinating the execution.
Advertisements

National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids for Collection Federation Reagan W. Moore University.
GFS OGF-22 Global Resource Naming Developers: Reagan Moore Arcot Mike.
OGF-23 iRODS Metadata Grid File System Reagan Moore San Diego Supercomputer Center.
Data Management Systems Richard Marciano Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center
ASCR Data Science Centers Infrastructure Demonstration S. Canon, N. Desai, M. Ernst, K. Kleese-Van Dam, G. Shipman, B. Tierney.
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE SAN DIEGO SUPERCOMPUTER CENTER Particle Physics Data Grid PPDG Data Handling System Reagan.
San Diego Supercomputer Center NARA Research Prototype Persistent Archive Building Preservation Environments with Data Grid Technology (NARA Research Prototype.
Integrated Rule Oriented Data System (iRODS) Reagan W. Moore Arcot Rajasekar Mike Wan
Wayne Schroeder, Paul Tooby Data Intensive Cyber Environments Team (DICE) DICE Center, University of North Carolina at Chapel Hill; Institute for Neural.
1 Applied CyberInfrastructure Concepts ISTA 420/520 Fall Nirav Merchant Bio Computing & iPlant Collaborative Eric Lyons.
A Very Brief Introduction to iRODS
Sustainable Preservation Services for Archivists through Distributed Custody Caryn Wojcik State of Michigan Records Management Services.
Towards a Federated Infrastructure for the Preservation and Analysis Archival Data Chien-Yi HOU Richard MARCIANO {chienyi, School.
Extracting and Ingesting DDI Metadata and Digital Objects from a Data Archive into the iRODS extension of the NARA TPAP Using the OAI-PMH J. Ward, A. de.
iRODS: Interoperability in Data Management
EarthCube Layered Architecture Concept Award Interoperability Mechanisms.
Applying Data Grids to Support Distributed Data Management Storage Resource Broker Reagan W. Moore Ian Fisk Bing Zhu University of California, San Diego.
Robust Tools for Archiving and Preserving Digital Data Joseph JaJa, Mike Smorul, and Mike McGann Institute for Advanced Computer Studies Department of.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation
Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.
Mike Smorul Saurabh Channan Digital Preservation and Archiving at the Institute for Advanced Computer Studies University of Maryland, College Park.
National Science Foundation Cooperative Agreement: OCI
DCC Conference, Glasgow November, Digital Archive Policies and Trusted Digital Repositories MacKenzie Smith, MIT Libraries Reagan Moore, San Diego.
About CUAHSI The Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) is an organization representing 120+ universities.
National Partnership for Advanced Computational Infrastructure Digital Library Architecture Reagan Moore Chaitan Baru Amarnath Gupta George Kremenek Bertram.
San Diego Supercomputer CenterUniversity of California, San Diego Preservation Research Roadmap Reagan W. Moore San Diego Supercomputer Center
Information Management and Distributed Data Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
Working Group: Practical Policy Rainer Stotzka, Reagan Moore.
USING METADATA TO FACILITATE UNDERSTANDING AND CERTIFICATION ABOUT THE PRESERVATION PROPERTIES OF A PRESERVATION SYSTEM Jewel H. Ward, Hao Xu, Mike C.
Rule-Based Data Management Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar {moore, schroede, mwan, {moore, schroede, mwan,
Rule-Based Distributed Data Management iRODS Jan 23, Reagan W. Moore Mike Wan Arcot Rajasekar Wayne Schroeder San Diego.
1 integrated Rule Oriented Data System Tutorial: iRODS Capabilities.
Introduction to Apache OODT Yang Li Mar 9, What is OODT Object Oriented Data Technology Science data management Archiving Systems that span scientific.
Richard MarcianoChien-Yi Hou Caryn Wojcik University of University of State of Michigan North Carolina North Carolina Records Management ServicesSALT DCAPE.
Production Data Grids SRB - iRODS Storage Resource Broker Reagan W. Moore
1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.
Ocean Observatories Initiative Data Management (DM) Subsystem Overview Michael Meisinger September 29, 2009.
GEM Portal and SERVOGrid for Earthquake Science PTLIU Laboratory for Community Grids Geoffrey Fox, Marlon Pierce Computer Science, Informatics, Physics.
Working Group Practical Policy based on slides and latest documents from the PP WG chaired by Reagan Moore, Rainer Stotzka presented by Johannes Reetz.
Rule-Based Preservation Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Management of Distributed Data Reagan W. Moore.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Archive for the NSDL Reagan W. Moore Charlie Cowart.
Policy Based Data Management Data-Intensive Computing Distributed Collections Grid-Enabled Storage iRODS Reagan W. Moore 1.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
GRID Overview Internet2 Member Meeting Spring 2003 Sandra Redman Information Technology and Systems Center and Information Technology Research Center National.
Data Management Planning Session Kevin Gomes Michael Meisinger Arcot Rajasekar Michael Wan October 19, 2007.
Breakout # 1 – Data Collecting and Making It Available Data definition “ Any information that [environmental] researchers need to accomplish their tasks”
From SRB to IRODS: Policy Virtualization using Rule-Based Data Grids Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center.
National Science Foundation Cooperative Agreement: OCI
National Science Foundation Cooperative Agreement: OCI Reagan Moore, PI Mary Whitton, Project Manager.
©MIT LKTR Workshop, Digital Archive Policies and Trusted Digital Repositories MacKenzie Smith, MIT Libraries Reagan Moore, San Diego Supercomputer.
OOI Cyberinfrastructure and Semantics OOI CI Architecture & Design Team UCSD/Calit2 Ocean Observing Systems Semantic Interoperability Workshop, November.
National Archives and Records Administration1 Integrated Rules Ordered Data System (“IRODS”) Technology Research: Digital Preservation Technology in a.
Rights Management for Shared Collections Storage Resource Broker Reagan W. Moore
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
© Copyright IBM Corporation 2016 Diagram Template IBM Cloud Architecture Center Using the Diagram Template This template is for use in creating a visual.
Collection-Based Persistent Archives Arcot Rajasekar, Richard Marciano, Reagan Moore San Diego Supercomputer Center Presented by: Preetham A Gowda.
Policy Based Data Management Environments (iRODS) Reagan W. Moore Arcot Rajasekar Mike Wan Mike Conway Antoine de Torcy Richard Marciano Jewel Ward
Use of Policies to Enforce Collection Properties Richard Marciano Reagan Moore University of North Chapel Hill Data Intensive Cyber Environments.
Preservation Data Services Persistent Archive Research Group Reagan W. Moore October 1, 2003.
Working Group: Data Foundations and Terminology (Practical Policy Considerations) Reagan Moore.
Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data. Written By: R. Moore, A. Rajasekar,
DataNet Federation Consortium
INTAROS WP5 Data integration and management
Design and Manufacturing in a Distributed Computer Environment
Policy-Based Data Management integrated Rule Oriented Data System
Joseph JaJa, Mike Smorul, and Sangchul Song
Bird of Feather Session
Presentation transcript:

National Data Infrastructure Projects EarthCube Layered Architecture (GEO) DataNet Federation Consortium (OCI) integrated Rule Oriented Data System (SDCI) Reagan W. Moore (DICE-UNC) Arcot Rajasekar (DICE-UNC)

iRODS  Integrated Rule Oriented Data System  DICE – Reagan Moore  Concepts – Arcot Rajasekar  Architect – Mike Wan  Security / metadata / production – Wayne Schroeder  Rule engine – Hao Xu  User interface (Java) – Mike Conway  Applications – Antoine de Torcy  Administration – Sheau-Yen Chen  E-iRODS (enterprise version developed by RENCI)  Management – Charles Schmitt  Production version – Jason Coposky  Test environment – Terrell Russell  Tutorials – Leesa Brieger

Examples of “National” Infrastructure  Data Grids (data sharing)  National Optical Astronomy Observatory  Ocean Observatories Initiative  The iPlant Collaborative  Babar High Energy Physics  Broad Institute genomics data grid  WellCome Trust Sanger Institute genomics data grid  Digital Libraries (data publication)  French National Library  Texas Digital Library  UNC-CH SILS LifeTime Library  Repositories / Archives(data preservation)  NASA Center for Climate Simulation  Carolina Digital Repository

Community Resources Access- clients Semantics- catalogs Data- repositories Analyses- workflows Transformations- services Physics- models Community Resources Access- clients Semantics- catalogs Data- repositories Analyses- workflows Transformations- services Physics- models EarthCube Interoperability Environment Data Access - protocol brokers Vocabulary - semantic mappings Data Discovery - union catalogs Repository Discovery- registries Data Products- data caches Governance- policies EarthCube Interoperability Environment Data Access - protocol brokers Vocabulary - semantic mappings Data Discovery - union catalogs Repository Discovery- registries Data Products- data caches Governance- policies Collaboration Environment Data Sharing - joint research collection Analysis Sharing - jointly shared workflows Research Tracking - provenance capturing Policy sharing- research consensus Collaboration Environment Data Sharing - joint research collection Analysis Sharing - jointly shared workflows Research Tracking - provenance capturing Policy sharing- research consensus Research Environment Local Analyses- laptop Production Runs- institutional server Physics Modeling- HPC grids Sensor Data - observation networks Research Environment Local Analyses- laptop Production Runs- institutional server Physics Modeling- HPC grids Sensor Data - observation networks EarthCube Infrastructure Components

Approaches  National data grid  Single system supporting data sharing across institutions  Australian Research Collaboration Service  Top down approach  Federation environment  Establish trust mechanisms to enable data access between systems  Collaboration Environment  Support data sharing across community resources  Requires interoperability mechanisms to enable access to remote repositories  Register data into a logical collection  Bottom-up federation of existing data management systems

Policy-Based Data Environments  Purpose  Reason a collection is assembled  Properties  Attributes needed to ensure the purpose  Policies  Controls for enforcing desired properties,  mapped to computer actionable rules  Procedures  Functions that implement the policies  Mapped to computer executable workflows  Persistent state information  Results of applying the procedures  mapped to system metadata  Property verification  Validation that state information conforms to the desired purpose  mapped to periodically executed policies

Goals and Impact  Collaborative research  Sharable collections  Sharable workflows  Reproducible science  Automate data retrieval, transformation  Re-execution and provenance of workflows  Reference collections  Community knowledge resources (catalogs, repositories)  Manage data life cycle through evolution of policies as user community broadens  Student participation in research  Policy controlled research analyses

DFC Vision - Data Driven Science  Enable reproducible science through collaborative research on shared workflows and data collections  Researcher management of workflows and data  Policy-based management of entire scientific data life cycle from data analysis pipelines to long-term sustainability of reference collections  Implement NSF national scale data cyber-infrastructure  Federation of exemplar data management technologies from national research initiatives  Provision of interoperability mechanisms  Proven technology implemented in extant data grids  Integrate “live” research data collections into education initiatives  Student digital libraries accessing national data sets Project Shared Collection Processing Pipeline Digital Library Reference Collection Federation Community-based Collection Life Cycle 9/13/2015 8

Community-based Collection Life Cycle Project Collection Private Local Policy Data Grid Shared Distribution Policy Digital Library Published Description Policy Data Processing Pipeline Analyzed Service Policy Reference Collection Preserved Representation Policy Federation Sustained Re-purposing Policy Stages correspond to addition of new policies for a broader community Virtualize the stages of the collection life cycle through policy evolution The driving purpose changes at each stage of the data life cycle

Building Community Resources  Digital libraries use collections to define context  Provenance information  Descriptive information  Administrative information  Policy-based data management use procedures to encapsulate domain knowledge  Workflows for generation of data  Workflows for administration of data  Workflows for enforcement of management policies  Workflows for verifying collection properties

Computer Actionable Knowledge  Dataobjectsbits  Informationnamesmetadata  Knowledgerelationships between namesprocedures  Wisdomrelationships between relationships policy points  DatabitsPosix I/O  InformationmetadataRelational database  Knowledgeprocedures Workflows  Wisdompolicy pointsRule engine

Shared Collections – Data Grid File System Client 50 clients: web browser, unix shell command, … Data grid middleware provides global name, single sign-on, policy enforcement, metadata, replication Tape Archive Data Grid Multiple types of systems can be used to store data

Policy-based Data Management Client iRODS-server Rule-engine Rule base Workflows iRODS-server Rule-engine Rule base Workflows iRODS-server Rule Engine Rule base Workflows iRODS-server Rule Engine Rule base Workflows Storage Logical Collection (data grid) Logical Collection (data grid) Consensus on Policies and Procedures controls the Data Collection Virtualize collection Virtualize workflow

Data Workflow Virtualization Storage System Storage Protocol Access Interface Policy Enforcement Points Standard Micro-services Standard I/O Operations Data Grid Trap actions requested by the client at multiple policy enforcement points. Map from policy to standard micro-services. Map from micro-services to standard Posix I/O operations. Map standard I/O operations to the protocol supported by the storage system

iRODS Distributed Data Management

Rule to count metadata values myTestRule { #Input parameters are: # String with conditional query #Output parameter is: # Result string msiExecStrCondQuery(*Select,*QOut); foreach(*QOut) { msiPrintKeyValPair("stdout",*QOut) } INPUT *Select=$"SELECT count(META_DATA_ATTR_VALUE), order(META_DATA_ATTR_NAME), META_DATA_ATTR_NAME where COLL_NAME like '/lifelibZone/home/rwmoore%'" OUTPUT ruleExecOut

Eco-Hydrology Choose gauge or outlet (HIS) Extract drainage area (NHDPlus) Digital Elevation Model (DEM) Worldfile Flowtable RHESSys Slope Aspect Streams (NHD) Roads (DOT) Strata Hillslope Patch Basin Stream network Nested watershed structure Land Use Leaf Area Index Phenology Soil Data NLCD (EPA) Landsat TM MODIS USDA Soil and vegetation parameter files RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework, and full, initial system state. For each box, create a micro- service to automate task, and chain into a workflow

Event-Driven Real-Time Drought Analysis/Prediction Workflow Data Grid – Collaboration Environment RAPID (river routing model) RAPID (river routing model) NASA NLDAS-2 Other data sources Invoke Monitor Output Store Visualization

SILS LifeTime Library  Student digital libraries  Enable students to build collections of Photographs MP3 audio files Class documents Video Web site archive  Resources provided by School of Information and Library Science at UNC-CH  Student collections range from 2 GBytes to 150 Gbytes  Number of files from 2000 to 12,000

SILS LifeTime Library Policies  Library management  Replication  Checksums  Versioning  Strict access controls  Quotas  Metadata catalog replication  Installation environment archiving  Ingestion  Automated synchronization of student directory with LifeTime Library  Automated loading of MP3 metadata

Policy-Driven Repository Infrastructure project funded by the Institute for Museum and Library Services Carolina Digital Repository

Carolina Digital Repository Ingest Workflow

Capturing Workflow Provenance Workflow file Directory holding all input and output files associated with workflow file (mounted collection that is linked to the workflow file) Input parameter file, lists parameters and input and output file names Directory holding all output files generated for invocation of eCWkflow.run, the version number is incremented for each execution Automatically generated run file for Executing each input file Output file created for eCWKflow.mpf eCWkflow.mss /earthCube/eCWkflow eCWkflow.mpf /earthCube/eCWkflow/eCWkflow.runDir0 eCWkflow.run Outfile eCWkflow2.run eCWkflow2.mpf /earthCube/eCWkflow/eCWkflow2.runDir0 Newfile

Automating Time Series Data Access Client Requests time period Client Requests time period Logical Collection Time Index NetCDF file Data grid automatically generates a time index into all files deposited into the collection. Each access defines the desired time period, and the data grid retrieves data from the relevant files. Being developed for iRODS 3.3 for use by OOI

Publications  Rajasekar, R., M. Wan, R. Moore, W. Schroeder, S.-Y. Chen, L. Gilbert, C.-Y. Hou, C. Lee, R. Marciano, P. Tooby, A. de Torcy, B. Zhu, “iRODS Primer: Integrated Rule-Oriented Data System”, Morgan & Claypool,  Ward, R., M. Wan, W. Schroeder, A. Rajasekar, A. de Torcy, T. Russell, H. Xu, R. Moore, “The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook”, DICE Foundation, November 2011, ISBN: , Amazon.com

iRODS - Open Source Software   Distributed under BSD license  Current version is iRODS 3.2 Typically have three releases per year  Scale of capabilities: 338 system attributes (users, files, collections, resources, rules) 272 basic functions (micro-services) 80 policy enforcement points  Downloads 39 countries 62 US academic institutions

Reagan W. Moore NSF OCI “DataNet Federation Consortium” NSF OCI “Improvement of iRODS for Multi-Disciplinary Applications” NSF OCI “NARA Transcontinental Persistent Archives Prototype” NSF SDCI “Data Grids for Community Driven Applications” iRODS - Open Source Software