Enabling and Supporting Provenance in e-Science Applications Luc Moreau University of Southampton L.Moreau@ecs.soton.ac.uk
Contents Provenance: problem definition Use cases of provenance in e-Science Architectural vision for provenance First experimentation, current work Research agenda Conclusion
Provenance: definition Main Entry: prov·e·nance Pronunciation: 'präv-n&n(t)s, 'prä-v&-"nän(t)s Function: noun Etymology: French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come Date: 1785 1 : origin, source 2 : the history of ownership of a valued object or work of art or literature (Merriam-Webster Online)
The Grid and Virtual Organisations The Grid problem is defined as coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01]. Effort is required to allow users to place their trust in the data produced by such virtual organisations.
Provenance and Virtual Organisations Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result, how can we determine the process that generated the result, especially after the virtual organisation has been disbanded?
Provenance and Workflows Workflow enactment has become popular in the Grid and Web Services communities. Workflow enactment can be seen as a scripted form of virtual organisation. The problem is similar: how can we determine the origin of enactment results?
Use cases Bioinformatics Aerospace Engineering Organ transplant management Combechem Physics
Provenance in Bioinformatics myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments. Manchester, Southampton, Newcastle, Nottingham, EBI; IBM, SUN, GSK, AZ, Merck
Graves' disease Autoimmune disease of the thyroid. Symptoms: weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos
Provenance in Bioinformatics [Architecture diagram: myGrid components, including the notification service, knowledge services (ontology server, reasoner, matcher), registries (UDDI, UDDI-M), the workflow enactment engine, the distributed query processor, SoapLab job execution, the information repository (mIR) and the provenance service]
Provenance: Execution Trail [Diagram: a process node with start time and end time; edge labels by_service, run_for, input, output; node labels lsid:HGVBase_retrieve, urn: John, Gene:AC005412.6, SNP:000010197]
Provenance: Domain Level Trail [Diagram: the execution trail, plus the domain-level relation contains_single_nucleotide_polymorphism with an "as stated by" link]
Provenance: Annotation [Diagram: the domain level trail, plus a "disputed by" link to urn: Alice]
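The three views above (execution trail, domain level trail, annotation) can be read as one growing set of subject-relation-object statements. The following is a minimal sketch of that reading, not myGrid's actual data model. The pairing of relations to nodes (for example, that Gene:AC005412.6 is the input and SNP:000010197 the output, or that the claim is stated by the process) is an assumption guessed from the slide labels, and the names `Statement`, `process:1`, `s1` to `s7` and `layer` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    sid: str       # hypothetical statement identifier
    subject: str   # a node, or the sid of another statement
    relation: str
    obj: str
    layer: str     # "execution", "domain" or "annotation" (hypothetical labels)


PROCESS = "process:1"  # hypothetical identifier for the process node

statements = [
    # Execution trail: which service ran, for whom, on which data (pairing assumed).
    Statement("s1", PROCESS, "by_service", "lsid:HGVBase_retrieve", "execution"),
    Statement("s2", PROCESS, "run_for", "urn: John", "execution"),
    Statement("s3", PROCESS, "input", "Gene:AC005412.6", "execution"),
    Statement("s4", PROCESS, "output", "SNP:000010197", "execution"),
    # Domain level trail: the same run expressed in domain-specific terms.
    Statement("s5", "Gene:AC005412.6", "contains_single_nucleotide_polymorphism",
              "SNP:000010197", "domain"),
    Statement("s6", "s5", "as_stated_by", PROCESS, "domain"),  # target assumed
    # Annotation: a third party disputes the domain-level claim s5.
    Statement("s7", "s5", "disputed_by", "urn: Alice", "annotation"),
]


def in_layer(items, name):
    """Return the statements belonging to one layer."""
    return [s for s in items if s.layer == name]


def about(items, subject):
    """Return the statements whose subject is the given node or statement id."""
    return [s for s in items if s.subject == subject]


# Statement ids of every claim that somebody has disputed.
disputed = {s.subject for s in statements if s.relation == "disputed_by"}
```

Keeping the annotation as a statement about another statement leaves the execution trail itself untouched, which matches the requirement for an undisputed view of execution alongside interpretations that may be contested.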
myGrid Provenance Requirements Execution trail Knowledge level representation of the execution, expressed in domain specific terms Undisputed view of execution Capability of annotating and providing interpretations to results Interpretation of execution
Provenance in Bioinformatics myGrid focus is on the scientist and their collaborations: provenance is a form of log book. There are other uses of provenance in bioinformatics. Provenance in the drug discovery process: requirement on drug companies to keep a record of provenance of drug discovery as long as the drug is in use (up to 50 years sometimes).
Provenance in Bioinformatics The mRNA that is to be translated contains stretches of noncoding sequence that are removed before translation begins. Noncoding stretches are called introns (for INtervening sequences). Sequences that are translated are called exons (for EXpressed sequences). Klaus-Peter Zauner's study of the quantity of information (Kolmogorov complexity) contained in introns and exons involves bioinformatics and statistical processes, relying on brute force and guesswork.
Provenance in Bioinformatics Determining the difference in the system during two runs of an experiment. Determining how best to run the experiment in future Historical record and proof of process Checks on validity of process Tracing the origin of data
Provenance in Aerospace Engineering Provenance requirement: to maintain a historical record of inputs/outputs from each sub-system involved in simulations. Aircraft provenance data needs to be kept for up to 99 years when sold to some countries. Currently, little direct support is available for this.
Provenance in Organ Transplant Management Decision support systems for organ and tissue transplant rely on a wide range of data sources, patient data, and doctors' and surgeons' knowledge. Heavily regulated domain: European, national, regional and site-specific rules govern how decisions are made. Application of these rules must be ensured and auditable, and the rules may change over time. Provenance allows tracking previous decisions: crucial to maximise the efficiency in matching and recovery rate of patients.
Provenance in Combechem Mechanism by which a PhD student's supervisor may check that the student's experiment was performed properly, especially if the results are odd. If enough information is recorded about an experiment, the paper describing it can be automatically created. Protection of intellectual property rights. The signing chemist will use their expertise to determine whether the experiment was performed correctly, and the provenance should be complete enough that they could potentially re-run the experiment to check the results.
What is the problem? Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments. Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance. Methods are generally ad hoc and do not interoperate.
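To make "part of the infrastructure, enabled on request" concrete, here is a minimal sketch in which recording is a wrapper around a service call rather than code inside the service. Everything in it is an assumption for illustration: `ProvenanceStore` is an in-memory stand-in for a provenance service, and `recorded`, `retrieve` and `untracked` are hypothetical names, not part of any Grid or Web Services stack.

```python
import functools
import time


class ProvenanceStore:
    """Hypothetical in-memory stand-in for a provenance service."""

    def __init__(self):
        self.records = []

    def record(self, entry):
        self.records.append(entry)


def recorded(store, enabled=True):
    """Wrap a service call so its inputs and output are recorded when enabled."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                # Recording switched off: call the service unchanged.
                return func(*args, **kwargs)
            start = time.time()
            result = func(*args, **kwargs)
            store.record({
                "service": func.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "start_time": start,
                "end_time": time.time(),
            })
            return result
        return wrapper
    return decorator


store = ProvenanceStore()


@recorded(store)
def retrieve(gene_id):
    # Placeholder body; a real service would query a database.
    return f"result-for-{gene_id}"


@recorded(store, enabled=False)
def untracked(x):
    return x * 2


retrieve("Gene:AC005412.6")
untracked(21)
```

After these two calls, only the `retrieve` invocation appears in `store.records`. The service bodies contain no recording logic, so the same mechanism could in principle be applied to any service without modifying it.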
Architectural Vision Typical workflow enactment in service oriented architecture …
Research Agenda (1) In order for provenance data to be useful, we expect such a protocol to support some classical properties of distributed algorithms. Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server, and vice-versa, a provenance server can ensure that it receives data from a given service. With non-repudiation, we can retain evidence of the fact that a service has committed to executing a particular invocation and has produced a given result. We anticipate that cryptographic techniques will be useful to ensure such properties
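As a hedged illustration of the kind of cryptographic technique anticipated above, the sketch below tags each submitted record with an HMAC computed under a key shared between a service and the provenance server, using Python's standard `hmac` and `hashlib` modules. It is not the protocol the slide has in mind. A shared-key tag lets the server check that a record came from a key holder and was not altered, but it does not give non-repudiation, since either party could have produced the tag; that property would need asymmetric digital signatures. The key, the record fields and the function names are assumptions.

```python
import hashlib
import hmac
import json

SHARED_KEY = b"demo-key-not-for-production"  # hypothetical pre-shared key


def canonical(record):
    """Serialise a record deterministically so both sides hash the same bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(record, key=SHARED_KEY):
    """Return a hex HMAC-SHA256 tag for the record."""
    return hmac.new(key, canonical(record), hashlib.sha256).hexdigest()


def verify(record, tag, key=SHARED_KEY):
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign(record, key), tag)


record = {"service": "lsid:HGVBase_retrieve", "output": "SNP:000010197"}
tag = sign(record)

# A record altered after tagging no longer verifies against the original tag.
tampered = dict(record, output="SNP:000000000")
```

Here `verify(record, tag)` succeeds, while `verify(tampered, tag)` and verification under a different key both fail.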
Research Agenda (2) Access control: medical applications (organ transplant, IXI, e-Diamond). Scalability: DC2 10^7 files, CERN envisions 10^12 files. From execution-level provenance, how to infer domain-level provenance?
Research Agenda (3) Using provenance of data, trust metrics of the data can be derived from: Trust the user places in invoked services Trust the user places in the input data Trust the user places in the enacted workflow Trust the user places in the enactor Trust the user places in the provenance service.
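The slide does not say how the five sources of trust are combined, so the following is only one possible sketch: it assumes each source is scored in [0, 1] and takes the weakest link as the trust in the data. The function name, the dictionary keys, the scoring scale, the `min` rule and the example values are all assumptions made for illustration.

```python
REQUIRED_SOURCES = (
    "services",            # trust the user places in invoked services
    "input_data",          # trust the user places in the input data
    "workflow",            # trust the user places in the enacted workflow
    "enactor",             # trust the user places in the enactor
    "provenance_service",  # trust the user places in the provenance service
)


def data_trust(trust):
    """Combine per-source trust scores in [0, 1] by taking the weakest link."""
    missing = [name for name in REQUIRED_SOURCES if name not in trust]
    if missing:
        raise KeyError(f"missing trust scores: {missing}")
    for name in REQUIRED_SOURCES:
        if not 0.0 <= trust[name] <= 1.0:
            raise ValueError(f"trust score for {name} must be in [0, 1]")
    return min(trust[name] for name in REQUIRED_SOURCES)


# Illustrative values only, not measurements.
example = {
    "services": 0.9,
    "input_data": 0.8,
    "workflow": 0.95,
    "enactor": 0.9,
    "provenance_service": 0.6,
}
score = data_trust(example)
```

With these illustrative values the result is 0.6: a provenance service that is only weakly trusted caps the trust placed in the data, however reliable the other components are. Other combination rules, such as a product or a weighted average, would be equally consistent with the slide.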
The purpose of the PASOA project is to investigate provenance in Grid architectures. Funded by EPSRC under the fundamental computer science for e-Science call. In collaboration with Cardiff. www.pasoa.org
EU Provenance STREP: Enabling and Supporting Provenance in Grids for Complex Problems IBM United Kingdom Ltd, University of Southampton, German Aerospace Centre, University of Wales, Cardiff, Universitat Politecnica de Catalunya, MTA SZTAKI To design, conceive and implement an industrial-strength open provenance architecture for Grid computing, and to deploy and evaluate it in complex grid applications (aerospace engineering and organ transplant management).
Conclusion Provenance is a rather unexplored domain. It is strategic for bringing trust to open environments. Necessity to design a secure, scalable and configurable architecture capable of supporting multiple requirements from very different application domains. Need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.
Acknowledgements myGrid: Simon Miles, Juri Papay, Ananth Krishna, Michael Luck, David De Roure, Terry Payne, Mark Greenwood, Carole Goble, Martin Szomszor. Combechem: Gareth Hughes, Hugo Mills, monica schraeffel. PASOA: Omer Rana, Paul Groth, Simon Miles, Ben Caroll. EU-Provenance: Syd Chapman, John Ibbotson, Laszlo Varga, Steve Willmott, Ulises Cortes, Andreas Schreiber, Rolf Hempel.