Enabling and Supporting Provenance in e-Science Applications Luc Moreau University of Southampton



Contents: Provenance: problem definition; Use cases of provenance in e-Science; Architectural vision for provenance; First experimentation, current work; Research agenda; Conclusion

Provenance: definition. Main Entry: prov·e·nance. Pronunciation: 'präv-n&n(t)s, 'prä-v&-"nän(t)s. Function: noun. Etymology: French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come -- more at PRO-, COME. 1 : ORIGIN, SOURCE 2 : the history of ownership of a valued object or work of art or literature (Merriam-Webster Online)

The Grid and Virtual Organisations: The Grid problem is defined as coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01]. Effort is required to allow users to place their trust in the data produced by such virtual organisations.

Provenance and Virtual Organisations: Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result, how can we determine the process that generated the result, especially after the virtual organisation has been disbanded?

Provenance and Workflows: Workflow enactment has become popular in the Grid and Web Services communities. Workflow enactment can be seen as a scripted form of virtual organisation. The problem is similar: how can we determine the origin of enactment results?

Use cases: Bioinformatics; Aerospace engineering; Organ transplant management; Combechem; Physics

Provenance in Bioinformatics: myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments. Manchester, Southampton, Newcastle, Nottingham, EBI; IBM, SUN, GSK, AZ, Merck.

Graves' disease: Autoimmune disease of the thyroid. Symptoms: weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos.

Provenance in Bioinformatics [Figure: myGrid architecture diagram. Recoverable component labels: Notification Service; Knowledge Services; Registry (UDDI, UDDI-M, Registry View); Semantic registration and Structural registration services; Ontology Server; Reasoner; Matcher; DB2; Workflow templates; Workflow instances; Workflow enactment engine; Build/Edit Workflow; Service Discovery; Component Discovery; Test Data; WSFL; JMS; Distributed Query Processor; Information Extraction (PASTA); Job Execution; SoapLab; mInfo Repository (mIR); Provenance service; Data Provenance; Metadata; Concepts.]

Provenance: Execution Trail [Figure: graph of a process record. Labels: process; start time; end time; lsid:HGVBase_retrieve; input; by_service; urn: John; run_for; output; Gene:AC; SNP:]

Provenance: Domain Level Trail [Figure: the execution trail graph extended with a domain-level link. Labels: Gene:AC; SNP:; process; start time; end time; lsid:HGVBase_retrieve; input; by_service; urn: John; run_for; output; contains_single_nucleotide_polymorphism; as stated by]

Provenance: Annotation [Figure: the domain-level graph extended with an annotation. Labels: Gene:AC; SNP:; process; start time; end time; lsid:HGVBase_retrieve; input; by_service; urn: John; run_for; output; contains_single_nucleotide_polymorphism; as stated by; urn: Alice; disputed by]

myGrid Provenance Requirements: execution trail; knowledge-level representation of the execution, expressed in domain-specific terms; undisputed view of execution; capability of annotating and providing interpretations of results; interpretation of execution.
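The three kinds of record named on the preceding slides (execution trail, domain-level trail, annotation) can be pictured as statements kept in separate layers. The following is a minimal sketch under stated assumptions, not the myGrid data model: the `Statement` and `ProvenanceStore` classes, the layer names, and the `*:EXAMPLE` identifiers are all hypothetical, and the pairing of each predicate with a value is one reading of the flattened figure labels. The identifiers truncated on the slides are deliberately not reconstructed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Statement:
    """One provenance statement; the field names are illustrative only."""
    subject: str
    predicate: str
    obj: str
    layer: str                      # "execution", "domain" or "annotation"
    asserted_by: Optional[str] = None


@dataclass
class ProvenanceStore:
    statements: List[Statement] = field(default_factory=list)

    def record(self, statement: Statement) -> None:
        self.statements.append(statement)

    def layer(self, name: str) -> List[Statement]:
        """Return the statements belonging to a single layer."""
        return [s for s in self.statements if s.layer == name]


def build_example() -> ProvenanceStore:
    store = ProvenanceStore()
    # Execution trail: what ran, for whom, on which input, with which output.
    store.record(Statement("process-1", "by_service", "lsid:HGVBase_retrieve", "execution"))
    store.record(Statement("process-1", "run_for", "user:John", "execution"))
    store.record(Statement("process-1", "input", "gene:EXAMPLE", "execution"))
    store.record(Statement("process-1", "output", "snp:EXAMPLE", "execution"))
    # Domain-level trail: the same run expressed in domain terms.
    store.record(Statement("gene:EXAMPLE", "contains_single_nucleotide_polymorphism",
                           "snp:EXAMPLE", "domain", asserted_by="process-1"))
    # Annotation: a user's interpretation layered over the domain statement.
    store.record(Statement("domain-statement-1", "disputed_by", "user:Alice",
                           "annotation", asserted_by="user:Alice"))
    return store
```

Keeping the layers apart means the execution trail can stay an undisputed record while interpretations and disputes accumulate around it.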

Provenance in Bioinformatics: The myGrid focus is on scientists and their collaborations: provenance is a form of log book. There are other uses of provenance in bioinformatics. Provenance in the drug discovery process: drug companies are required to keep a record of the provenance of a drug's discovery for as long as the drug is in use (sometimes up to 50 years).

Provenance in Bioinformatics: The mRNA that is to be translated contains stretches of noncoding sequence that are removed before translation begins. Noncoding stretches are called introns (for INtervening sequences). Sequences that are translated are called exons (for EXpressed sequences). Klaus-Peter Zauner's study of the quantity of information (Kolmogorov complexity) contained in introns and exons involves bioinformatics and statistical processes, relying on brute force and guesswork.
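Kolmogorov complexity is not computable, so studies of this kind need a practical stand-in. One common proxy is the compressed size of a sequence, which gives only an upper bound. The sketch below assumes zlib compression as that proxy. The slides do not say which estimator the study used, and both the function names and the synthetic sequences are hypothetical, not real intron or exon data.

```python
import random
import zlib


def compressed_ratio(sequence: str) -> float:
    """Compressed size divided by raw size: a rough upper-bound proxy for
    information content (lower means more regular)."""
    raw = sequence.encode("ascii")
    if not raw:
        raise ValueError("sequence must not be empty")
    return len(zlib.compress(raw, 9)) / len(raw)


def synthetic_sequences(length: int = 4000, seed: int = 0):
    """Return a highly repetitive and a random-looking nucleotide string.
    Both are made-up test inputs, not biological data."""
    rng = random.Random(seed)
    repetitive = "ACGT" * (length // 4)
    random_like = "".join(rng.choice("ACGT") for _ in range(length))
    return repetitive, random_like
```

An experiment built on such estimates involves many runs with different compressors and parameters, which is why recording what was run, and with which inputs, matters.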

Provenance in Bioinformatics: Determining the difference in the system between two runs of an experiment; determining how best to run the experiment in future; historical record and proof of process; checks on validity of process; tracing the origin of data.

Provenance in Aerospace Engineering: Provenance requirement: to maintain a historical record of inputs/outputs from each sub-system involved in simulations. Aircraft provenance data need to be kept for up to 99 years when sold to some countries. Currently, little direct support is available for this.

Provenance in Organ Transplant Management: Decision support systems for organ and tissue transplant rely on a wide range of data sources, patient data, and doctors' and surgeons' knowledge. Heavily regulated domain: European, national, regional and site-specific rules govern how decisions are made. Application of these rules must be ensured and auditable, and the rules may change over time. Provenance allows tracking of previous decisions: crucial to maximise the efficiency of matching and the recovery rate of patients.

Provenance in Combechem: Mechanism by which a PhD student's supervisor may check that the student's experiment was performed properly, especially if the results are odd. If enough information is recorded about an experiment, the paper describing it can be automatically created. Protection of intellectual property rights. The signing chemist will use their expertise to determine whether the experiment was performed correctly, and the provenance should be complete enough that they could potentially re-run the experiment to check the results.

GriPhyN

Architectural Vision

What is the problem? Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments. Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance. Methods are generally ad hoc and do not interoperate.

Architectural Vision Typical workflow enactment in service oriented architecture …

Architectural Vision … with provenance support

A First Prototype

Sequence Diagram/Data Model: Must support recording of all information necessary to replay execution. Must support all complex forms of workflows (recursion, iteration, parallel execution).

PReP: Provenance Recording Protocol [Figure: sequence diagram. Labels: client; service; invocation; result; Provenance Service; invocation and result notify (shown twice); negotiate configuration]
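The message flow named in the PReP figure labels can be sketched as follows. This is an illustrative reading of those labels, not the protocol specification. The class and function names, the `"record-all"` configuration value, and the assumption that both client and service negotiate before notifying are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ProvenanceService:
    """Toy in-memory provenance service (hypothetical API)."""
    config: Dict[str, str] = field(default_factory=dict)
    records: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def negotiate_configuration(self, party: str, proposal: str) -> str:
        # Assumption: the service simply accepts whatever is proposed.
        self.config[party] = proposal
        return proposal

    def notify(self, party: str, interaction_id: str,
               invocation: str, result: str) -> None:
        """Record one party's view of an invocation and its result."""
        if party not in self.config:
            raise RuntimeError(f"{party} has not negotiated a configuration")
        self.records.append((party, interaction_id, invocation, result))

    def views(self, interaction_id: str) -> Dict[str, Tuple[str, str]]:
        """Map each party to the (invocation, result) pair it submitted."""
        return {party: (inv, res)
                for party, iid, inv, res in self.records
                if iid == interaction_id}


def run_interaction(store: ProvenanceService,
                    service_fn: Callable[[str], str],
                    invocation: str, interaction_id: str) -> str:
    store.negotiate_configuration("client", "record-all")
    store.negotiate_configuration("service", "record-all")
    result = service_fn(invocation)          # client invokes, service replies
    store.notify("service", interaction_id, invocation, result)
    store.notify("client", interaction_id, invocation, result)
    return result
```

Because each party submits its own view, the provenance service can compare the two submissions for the same interaction and detect a mismatch.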

PReP Formalisation: Abstract machines. Properties: termination, liveness, safety. Foundation for adding necessary cryptographic techniques.

PReP: Client Side State Space

Research Agenda (1): In order for provenance data to be useful, we expect such a protocol to support some classical properties of distributed algorithms. Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server and, vice versa, a provenance server can ensure that it receives data from a given service. With non-repudiation, we can retain evidence of the fact that a service has committed to executing a particular invocation and has produced a given result. We anticipate that cryptographic techniques will be useful to ensure such properties.
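As a small illustration of the kind of cryptographic technique anticipated here, the sketch below tags a submitted record with an HMAC so that a provenance server sharing a key with a service can check where the record came from. This is an assumption for illustration, not a mechanism described in the slides, and the function names are hypothetical. A shared-key tag gives message authentication only. It does not give non-repudiation, because either key holder could have produced the tag, so that property would need asymmetric signatures, which are not shown.

```python
import hashlib
import hmac


def tag_record(shared_key: bytes, record: bytes) -> str:
    """Compute an HMAC-SHA256 tag over a provenance record."""
    return hmac.new(shared_key, record, hashlib.sha256).hexdigest()


def verify_record(shared_key: bytes, record: bytes, tag: str) -> bool:
    """Check the tag in constant time; False if the record or key differs."""
    return hmac.compare_digest(tag_record(shared_key, record), tag)
```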

Research Agenda (2): Access control. Medical applications: organ transplant, IXI, e-Diamond. Scalability: DC2, 10^7 files; CERN envisions 10^12 files. How to infer domain-level provenance from execution-level provenance.

Research Agenda (3): Using the provenance of data, trust metrics for the data can be derived from the trust the user places in: the invoked services; the input data; the enacted workflow; the enactor; the provenance service.
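The slide lists the inputs to a trust metric but not how they combine. As one possible sketch, and purely an assumption, the five trust values can be treated as numbers in [0, 1] and multiplied, so that distrust in any single source lowers the trust in the data. The names `TRUST_SOURCES` and `derived_trust` are hypothetical.

```python
from typing import Mapping

# The five sources of trust named on the slide (the keys are illustrative).
TRUST_SOURCES = (
    "invoked_services",
    "input_data",
    "enacted_workflow",
    "enactor",
    "provenance_service",
)


def derived_trust(trust: Mapping[str, float]) -> float:
    """Combine per-source trust values in [0, 1] by multiplication.
    The product rule is an assumption, not taken from the slides."""
    score = 1.0
    for source in TRUST_SOURCES:
        value = trust[source]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"trust in {source} must lie in [0, 1]")
        score *= value
    return score
```

Other combination rules, such as taking the minimum, fit the same interface. The point is only that the provenance record identifies which services, inputs, workflow and enactor the metric should range over.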

The purpose of the PASOA project is to investigate provenance in Grid architectures. Funded by EPSRC under the fundamental computer science for e-Science call. In collaboration with Cardiff.

EU Provenance STREP: Enabling and Supporting Provenance in Grids for Complex Problems. IBM United Kingdom Ltd; University of Southampton; German Aerospace Centre; University of Wales, Cardiff; Universitat Politecnica de Catalunya; MTA SZTAKI. To design, conceive and implement an industrial-strength open provenance architecture for Grid computing, and to deploy and evaluate it in complex grid applications (aerospace engineering and organ transplant management).

Conclusion: Provenance is a rather unexplored domain. It is strategic for bringing trust to open environments. There is a need to design a secure, scalable and configurable architecture capable of supporting multiple requirements from very different application domains. There is also a need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.

Acknowledgements: myGrid: Simon Miles, Juri Papay, Ananth Krishna, Michael Luck, David De Roure, Terry Payne, Mark Greenwood, Carole Goble, Martin Szomszor. Combechem: Gareth Hughes, Hugo Mills, m. c. schraefel. PASOA: Omer Rana, Paul Groth, Simon Miles, Ben Caroll. EU-Provenance: Syd Chapman, John Ibbotson, Laszlo Varga, Steve Willmott, Ulises Cortes, Andreas Schreiber, Rolf Hempel.