Towards a Provenance Architecture (Karen Schuchardt, PNNL)

1 Towards a Provenance Architecture
Karen Schuchardt, PNNL

2 Outline (Kepler Provenance Meeting, Jan 05)
- Past and Present Work
- Use Cases
- Thoughts on Workflow Provenance and Architectures

3 Past and Present Provenance Work
- Ecce Chemistry Environment
- Electronic Laboratory Notebooks
- Collaboratory for Multi-Scale Chemical Science (CMCS)
- Scientific Annotation Middleware
- Towards a Semantic Data Grid for Systems Science
Timeline (as labeled on the slide): mid 90s, late 90s, 2000, 2004-2006

4 Ecce Chemistry Environment
- Chemistry-based calculation workflow
- Provenance captured as the user performs actions
  - W’s (who, what, when)
  - Job submission/status info
  - Relationships (XLinks) between calculations, outputs, inputs, etc.
  - Linkbase for molecular dynamics multi-step processes
- WebDAV-based server captures all inputs, outputs, and metadata
- Provenance used to provide an at-a-glance summary of work performed, duplicate and rerun, search
- Bind rules based on types and relationships

5 Electronic Laboratory Notebooks
- Hierarchical, chronological chapters/pages/notes
- File upload, sketch, text, equations, forms, image capture, …
- Add/view/search notes
- Records functionality:
  - Non-repudiation: digital signatures and timestamps
  - Persistence/completeness: write-once/no deletions/audit trail
  - Standardized lifecycle: signing/witnessing policies, archiving, retention schedules, …
- Now based on WebDAV
- Provenance
- Structure of notebook
- Records data
- Mimetype-based functionality

6 Collaboratory for Multi-Scale Chemical Sciences (CMCS)
- Dublin Core for basic pedigree: title, creator, dates, publisher, is-referenced-by, references, replaces, is-replaced-by, has-version
- Dublin Core Element Set and Qualified Dublin Core
- Both XML and RDF to encode metadata values
- Use of XLink to express values of relationships
- CMCS properties for chemical science to enable searching: species name, CAS, chemical properties, and chemical formula.
- CMCS properties for defining scientific data: has-inputs, has-outputs, and is-part-of-project.
- CMCS properties for scientific publication and peer review annotations: is-sanctioned-by.
- Flexible infrastructure for addition of new metadata. As new metadata is added to the infrastructure, current apps will not break!
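
As a rough illustration of the metadata layering described on this slide, the sketch below stores one resource's metadata as subject/predicate/object triples that mix Dublin Core terms with CMCS-style properties. The two Dublin Core namespace URIs are the published ones; the `CMCS` namespace URI, the `urn:example:` identifiers, the sample values, and the `basic_pedigree` helper are all assumptions made for illustration and are not taken from the CMCS software.

```python
# Sketch only: the CMCS namespace, resource names, and values are hypothetical.
DC = "http://purl.org/dc/elements/1.1/"      # Dublin Core Element Set
DCTERMS = "http://purl.org/dc/terms/"        # qualified Dublin Core terms
CMCS = "http://example.org/cmcs#"            # assumed namespace, for illustration

resource = "urn:example:calc-42"

triples = [
    (resource, DC + "title", "Sample thermochemistry calculation"),
    (resource, DC + "creator", "A. Researcher"),
    (resource, DCTERMS + "replaces", "urn:example:calc-41"),
    (resource, CMCS + "has-inputs", "urn:example:basis-set-7"),
    (resource, CMCS + "has-outputs", "urn:example:enthalpy-3"),
    (resource, CMCS + "is-part-of-project", "urn:example:project-1"),
]


def basic_pedigree(all_triples, subject):
    """Return only the Dublin Core properties of a subject.

    Properties from namespaces this reader does not know are skipped,
    so metadata added later does not break it.
    """
    known = (DC, DCTERMS)
    return {
        predicate: value
        for s, predicate, value in all_triples
        if s == subject and predicate.startswith(known)
    }


pedigree = basic_pedigree(triples, resource)

# A newer application adds a property the older reader has never seen.
triples.append((resource, CMCS + "is-sanctioned-by", "urn:example:review-9"))
assert basic_pedigree(triples, resource) == pedigree
```

The final assertion is the point of the sketch: a reader that selects only the properties it understands keeps working when new metadata appears, which is one way to read the slide's claim that current apps will not break.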

7 Scientific Annotation Middleware
- Provides a node plus metadata/relationship view of underlying data sources
- Support put/get/search/access control of arbitrary data/metadata
- Configurable metadata extraction from binary/ASCII/XML files
- Configurable data translation
- Semantic/graph queries
- RDF export
- Notebook services (page display, signatures, timestamps, …)
- Pluggable security
- Direct connection between metadata and resource limits use as next generation provenance store

8 Towards a Semantic Data Grid
- Explore frameworks for advanced model-driven data integration capabilities
- Seamlessly integrate files, databases
- Automated scientific workflow mechanisms
- Capture, represent, and disseminate knowledge
- Identify changes via discovery mechanisms
- Internally funded 2 year project

9 Towards a Semantic Data Grid
What proteins in my organism(s) are both predicted and shown by experiment to interact with E. coli?
Resources required:
- Microarray spreadsheets
- NCBI data services
- BIND database
- DIP database
- Work-group specific databases
- Other data services
- Extraction
- Translation
- Merging
- HPC services
- Public Web services
- Discovery

10 Use Case - Personal Records
Capture and organized display of provenance simplify the job of keeping track of activities performed over the course of a long research process.
Example: A bioinformaticist performs data integration/analysis for many diverse projects. After 6 months, he/she can’t remember what a particular result pertained to or how it was generated.

11 Use Case - Verifiability
Data generated from instruments/experiments undergoes numerous automatic processes before becoming available to researcher(s).
Example: Data from high-throughput biology experiments runs through several automated and, in some cases, manual processes before it becomes available to the bioinformaticist. The bioinformaticist often does not trust the data. They want to know who created it, what was done to it, when it was generated, …

12 Use Case - Applicability
Increasingly, research problems span disciplines or scales. Though data needs to move across these boundaries, it is often a manual process involving personal communications.
Example: In the combustion multi-scale research environment, data generated at one scale (e.g. thermochemical data) serves as input to successive scales (e.g. mechanisms). But it’s not that simple: we must be able to determine the applicability of available data. Are the theoretical underpinnings under which it was generated consistent with the intended use?

13 Use Case - Best Practices
By capturing and providing access to provenance of prior work, best practices can be shared.
Example: This is a little bit hypothetical but… best practices can be shared by sharing workflow definitions or by viewing provenance (and inputs) from instances of workflows.

14 Types of Provenance in Workflow Environment
- Interaction Provenance: data that moves between services
- State Provenance: data known only to the actor itself
- Observable Provenance: start/completion times, error detection
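
To make the three categories concrete, here is a minimal sketch of a wrapper that runs one actor and records a provenance entry of each kind. Everything in it (`Kind`, `Record`, `run_actor`, and the `scale` actor) is a hypothetical name chosen for this example; it does not reflect Kepler's actual actor API.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    INTERACTION = "interaction"   # data that moves between services
    STATE = "state"               # data known only to the actor itself
    OBSERVABLE = "observable"     # start/completion times, errors


@dataclass
class Record:
    kind: Kind
    actor: str
    detail: dict
    timestamp: float = field(default_factory=time.time)


def run_actor(name, func, inputs, log):
    """Run one actor and append provenance records of each kind to log."""
    log.append(Record(Kind.INTERACTION, name, {"inputs": inputs}))
    log.append(Record(Kind.OBSERVABLE, name, {"event": "start"}))
    state = {}  # the actor fills this with whatever only it knows
    try:
        output = func(inputs, state)
    except Exception as exc:
        detail = {"event": "error", "message": str(exc)}
        log.append(Record(Kind.OBSERVABLE, name, detail))
        raise
    log.append(Record(Kind.STATE, name, dict(state)))
    log.append(Record(Kind.OBSERVABLE, name, {"event": "complete"}))
    log.append(Record(Kind.INTERACTION, name, {"output": output}))
    return output


def scale(inputs, state):
    """Hypothetical actor: multiplies its inputs by an internal factor."""
    state["factor"] = 2.5  # internal setting, invisible to other services
    return [x * state["factor"] for x in inputs]


log = []
result = run_actor("scale", scale, [1, 2], log)
kinds = [r.kind for r in log]
```

Note the asymmetry the slide implies: the wrapper can see inputs, outputs, timings, and errors from the outside, but state provenance only exists if the actor chooses to report it.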

15 Other Provenance
- Other applications will record data
  - Pedigree/Provenance
  - Experiment Metadata
  - Project Organization
  - Categorization
  - Detected Features
  - Instrument logs
  - Digital Signatures
  - Endorsements
  - Community Annotations
- Other workflow engines

16 Logical Architecture
[Diagram; component labels recovered from the slide]
- Provenance Store(s): Query Interface, Submission Interface
- User Recording Tools: Portlets, Annotator, Notebooks, Science Applications
- Client Query Library, Client Submission Library
- Experiment Services: Workflow engine, Domain specific services
- Presentation Services: Visualizer/Browser, Difference Visualizer, Workflow construction
- Processing Services: Difference Analyzer, Quality Analyzer
Extracted from escience Strawman - Moreau

17 Components of Physical Architecture
- One or more RDF triple stores
- Global naming service
- Arbitrary data stores for data referenced by the provenance
- Security services (pluggable for scalability)
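
The first three components can be sketched in a few lines, under heavy simplification. The `NamingService` and `TripleStore` classes below are hypothetical in-memory stand-ins, the predicates are bare strings rather than full RDF URIs, the file location is made up, and security services are left out entirely.

```python
import itertools


class NamingService:
    """Mints identifiers that are unique across stores (hypothetical scheme)."""

    def __init__(self, authority):
        self.authority = authority
        self._counter = itertools.count(1)

    def mint(self, kind):
        return f"urn:{self.authority}:{kind}:{next(self._counter)}"


class TripleStore:
    """In-memory stand-in for an RDF triple store."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self._triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


names = NamingService("example")
store = TripleStore()

dataset = names.mint("data")   # urn:example:data:1
run = names.mint("run")        # urn:example:run:2

# The provenance store holds only a reference; the data itself lives in
# whatever external data store the location points to.
store.add(dataset, "storedAt", "file:///archive/spectra.h5")
store.add(dataset, "generatedBy", run)
store.add(run, "startedBy", "some-user")

produced_by = store.match(subject=dataset, predicate="generatedBy")
```

Keeping the data out of the triple store, with only a globally named reference inside it, is what lets the provenance point at arbitrary data stores, as the third bullet describes.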

18 Workflow and Provenance
- Requires binding to provenance service
- Need mechanism to associate provenance from workflow instance
  - Id? Links?
- Requires communication of service information or other mechanism for actors to contribute state provenance
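
One possible answer to the "Id? Links?" question is sketched below, assuming the instance-id option: the engine mints an id per workflow run and hands each actor a small context object bound to that id and to the provenance service. `ProvenanceService`, `ActorContext`, `run_workflow`, and the two sample actors are hypothetical names for this example, not part of Kepler or any existing provenance service.

```python
import uuid


class ProvenanceService:
    """Hypothetical provenance service; here it just keeps records in memory."""

    def __init__(self):
        self.records = []

    def submit(self, instance_id, actor, payload):
        self.records.append({"instance": instance_id, "actor": actor, **payload})

    def for_instance(self, instance_id):
        return [r for r in self.records if r["instance"] == instance_id]


class ActorContext:
    """Handed to each actor so it can contribute state provenance."""

    def __init__(self, service, instance_id, actor):
        self._service = service
        self._instance_id = instance_id
        self._actor = actor

    def record_state(self, **state):
        self._service.submit(self._instance_id, self._actor, {"state": state})


def run_workflow(actors, service):
    """Run actors in order, tagging all provenance with one instance id."""
    instance_id = str(uuid.uuid4())
    data = None
    for name, func in actors:
        data = func(data, ActorContext(service, instance_id, name))
    return instance_id, data


def load(_, ctx):
    ctx.record_state(source="sample.csv")  # made-up input name
    return [3, 1, 2]


def sort_values(values, ctx):
    ctx.record_state(algorithm="sorted", reverse=False)
    return sorted(values)


service = ProvenanceService()
first_id, result = run_workflow([("load", load), ("sort", sort_values)], service)
second_id, _ = run_workflow([("load", load)], service)
```

Because actors only see the context object, they never need to know where the provenance service lives, which is one way of handling the "communication of service information" bullet.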

19 Summary
- We’ve done a lot of work on provenance but see value in moving to a more flexible architecture
- Workflow engines are just one component that can contribute to the provenance of research results. Provenance capture should be thought of as a cross-cutting technology
- Models for provenance need to be flexible, allowing arbitrary content
- Provenance services need to be scalable
  - Low-footprint usages for individual applications
  - Large experimental facilities
  - Virtual organizations

