
1 Provenance in myGrid and beyond www.mygrid.org.uk Luc Moreau, University of Southampton, UK

2 … or the Provenance of my interest in Provenance Luc Moreau, University of Southampton, UK

3 Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions

5 Large amounts of data. EMBL (July 2001): 150 Gbytes. Microarray: 1 petabyte per annum. Sanger Centre: 20 terabytes of data. Genome sequences increase 4x per annum. http://www3.ebi.ac.uk/Services/DBStats/

6 Complexity, diversity, heterogeneity. Diagram labels: disease, drug, clinical trial, phenotype, protein, protein structure, protein sequence, P-P interactions, proteome, gene sequence, genome sequence, gene expression, homology. Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns & motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), …

7 Heterogeneity Data types & forms Community Autonomy Over 500 different databases Different formats, structure, schemas, coverage… Web interfaces, flat file distribution,…

8 Heterogeneous Data Multimedia Images & Video Text annotations & literature Descriptive as well as numeric Knowledge-based Text Extraction

9 Bioinformatics Analysis Different algorithms BLAST, FASTA, pSW Different implementations WU-BLAST, NCBI-BLAST Different service providers NCBI, EBI, DDBJ

10 Drug Discovery

11 In silico experimentation Discovery of resources and tools, staging of operations, sharing of results Process is as important as outcome Science is dynamic – change happens Scientific discovery is personal & global Provenance and history

12 Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions

13 myGrid. EPSRC-funded pilot project. Generic middleware within an application setting. 36 months within a 42-month performance period. Start: 1st October 2001. 16 full-time post docs altogether. 6 DTA studentships. 1 technical project manager.

14 myGrid consortium Scientific Team Biologists and Bioinformaticians GSK, AZ, Merck KGaA, Manchester, EBI Technical Team Manchester, Southampton, Newcastle, Sheffield, EBI, Nottingham IBM, SUN GeneticXchange Network Inference, Epistemics Ltd

15 myGrid outcomes. e-Scientists: bioinformatics demonstrator (Graves’ disease and Williams’ syndrome). Developers: myGrid-in-a-Box developer’s kit (currently myGrid 0.4); integrating some existing bioinformatics tools with myGrid (EBI services).

16 Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions

17 Graves’ disease. Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland, resulting in hyperthyroidism. Symptoms: weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos.

18 The Biology GD caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system. Why is the lymphocyte causing the antibodies that attack the thyroid cell?

19 Graves’ Disease Experimental Process What genes are associated with Graves’ Disease? Candidate Gene What is known about my candidate gene? Literature Previous Research Databases Experimental Annotation Pipeline What SNPs (single nucleotide polymorphisms) in my candidate gene might be relevant? Verify relevance of SNPs. How can I visualise annotations to my candidate gene? Genotype Assay Design System 3D Protein Structure & SNP Visualisation

20 Experiment life cycle Executing experiments Workflow enactment Distributed Query processing Job execution Provenance generation Single sign-on authorisation Event notification Resource & service discovery Repository creation Workflow creation Database query formation Discovering and reusing experiments and resources Workflow discovery & refinement Resource & service discovery Repository creation Provenance Managing experiments Information repository Metadata management Provenance management Workflow evolution Event notification Providing services & experiments Service registration Workflow deposition Metadata Annotation Third party registration Personalisation Personalised registries Personalised workflows Info repository views Personalised annotations Personalised metadata Security Forming experiments

21 A work bench for demonstrating services myView on the mIR Workflow Metadata about workflow note about workflow

22 Workflows. A workflow represents an experiment that can be run on the Grid. A workflow takes data as input. It performs ‘activities’, which are steps involved in analysing the data, including using tools and services, querying databases and running other workflows. A workflow can be run on the user’s local machine, or remotely, taking advantage of resources that are distributed. A data-intensive grid has to deal with heterogeneity of the data and processes.

23 myGrid schematic Graves disease scenario Workbench Workflow editor Event Notification Workflow Enactment Information repository Service Registry Knowledge management Text services Bio services Distributed query processing Services Core components Generic Applications Exemplars Talisman SoapLab Gateway

24 Service Oriented Architecture (architecture diagram). Components shown: Notification Service, Knowledge Services, DB2, Registry, Semantic registration, Service, Structural registration, Knowledge Service, Ontology Server, Reasoner, Matcher, Workflow templates, Data, Provenance, mInfo Repository (mIR), Workflow enactment engine, Workflow instances, Build/Edit Workflow, Service Discovery, Test Data, WSFL, JMS, Distributed Query Processor, Information Extraction, PASTA, Job Execution, SoapLab, Provenance service, Component Discovery, Metadata, Concepts, Registry View, UDDI, UDDI-M

25 myGrid Deployment

26 myGrid 0.4 (Nov 2003). Describer (MAN): a tool for attaching semantic descriptions to WS and workflows; Find Service (MAN): a component for classifying and discovering services and workflows via their semantic descriptions; Ontology Server (MAN): the DAML+OIL reasoner; Workbench (NOT): a NetBeans module for examining and updating the MIR and submitting workflows for enactment; e-Science Gateway (NOT): an API giving access to myGrid core services; MIR (myGrid information repository) (MAN/NEW): a Web Service accessing a repository that can hold data for an individual scientist or a team of scientists; Notification Service (IAM): a general-purpose Web Service that supports a publish/subscribe model of event notification, based on JMS; Registry View service (IAM): a Web Service supporting a registry of published Web Services and workflows annotated with metadata, including semantic descriptions; Freefluo (ITI): workflow enactment engine; Taverna (EBI): workflow editing environment.

27 Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions

28 Provenance: definition. Main Entry: prov·e·nance. Pronunciation: 'präv-n&n(t)s, 'prä-v&-"nän(t)s. Function: noun. Etymology: French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come (more at PRO-, COME). Date: 1785. 1 : ORIGIN, SOURCE 2 : the history of ownership of a valued object or work of art or literature

33 Provenance. Experiment is repeatable, if not reproducible, and explained by provenance records. Who, what, where, why, when, (w)how? The traceability of knowledge as it evolves and as it is derived. Immutable metadata. Migration: travels with its data but may not be stored with it. Private vs shared provenance records. Credit. Provenance is related to:

34 Early Provenance Capture. A full provenance record is linked with the results. It’s a log of execution.

35 Kinds of “Provenance”: Backward Derivation. An explanation of when, by whom and how something was produced. Linking items, usually in a directed graph. Execution- and process-centric. To be contrasted with forward derivation, which is a path like a workflow, script or query. (Figure: derivation graph whose nodes are parameter sets such as mass = 200, decay = WW, stability = 1, event = 8, plot = 1.)

36 Kinds of “Provenance”: Annotations. Attached to items or collections of items, in a structured, semi-structured or free text form. Annotations on one item or linking items. An explanation of why, when, where, who, what, how. Data-centric.

37 Kinds of “Provenance” in myGrid. Derivations: the Workflow Enactment Engine provides a detailed provenance record, stored in the myGrid Information Repository (mIR), describing what was done, with what services and when; an XML document, soon to be an RDF model. Annotations: every mIR object has Dublin Core provenance properties described in an attribute value model.
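As a rough illustration of the two kinds of provenance on this slide, the sketch below models a derivation record (what was done, with which service, and when) next to Dublin Core style annotations held as attribute-value pairs. The class names, field names and example values are assumptions made for illustration; they are not the mIR schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DerivationRecord:
    """One enacted workflow step: what ran, on what data, and when."""
    process: str
    service: str
    inputs: list
    outputs: list
    start_time: datetime
    end_time: datetime

    def duration_seconds(self) -> float:
        return (self.end_time - self.start_time).total_seconds()


@dataclass
class RepositoryObject:
    """A stored item carrying attribute-value annotations."""
    identifier: str
    annotations: dict = field(default_factory=dict)

    def annotate(self, attribute: str, value: str) -> None:
        self.annotations[attribute] = value


# All identifiers below are hypothetical placeholders.
record = DerivationRecord(
    process="snp_lookup",
    service="urn:example:snp-service",
    inputs=["gene:EXAMPLE1"],
    outputs=["snp:EXAMPLE2"],
    start_time=datetime(2003, 11, 1, 10, 0, 0, tzinfo=timezone.utc),
    end_time=datetime(2003, 11, 1, 10, 0, 42, tzinfo=timezone.utc),
)

result = RepositoryObject("snp:EXAMPLE2")
result.annotate("dc:creator", "example-scientist")
result.annotate("dc:date", record.end_time.date().isoformat())
```

The derivation record is written once by the enactment engine, while annotations can be added to a stored object at any later time, which mirrors the process-centric versus data-centric split made on the previous slides.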

38 Provenance of data: operational execution trail. Diagram labels: process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: Claire Jennings, run_for, output, Gene:AC005412.6, SNP:000010197

39 From Provenance to Knowledge: declarative semantic execution trail. Diagram labels: Gene:AC005412.6, SNP:000010197, process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: Claire Jennings, run_for, output, contains_single_nucleotide_polymorphism, as stated by

40 From Provenance to Knowledge: trust and attribution. Diagram labels: Gene:AC005412.6, SNP:000010197, process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: Claire Jennings, run_for, output, contains_single_nucleotide_polymorphism, as stated by, urn: Carole Goble, disputed by
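The three diagrams above can be read as a growing set of subject-predicate-object statements: first the execution trail, then a semantic statement derived from it, then attribution and dispute attached to that statement. The sketch below is one possible reading of the diagram labels; the identifiers `run:1` and `stmt:1`, the direction of each arrow, and the assignment of gene to input and SNP to output are assumptions, since the flattened slide text does not show them.

```python
# Each entry is a (subject, predicate, object) triple.
triples = {
    # Operational execution trail (slide 38).
    ("run:1", "process", "lsid:HGVBase_retrieve"),
    ("run:1", "input", "Gene:AC005412.6"),
    ("run:1", "output", "SNP:000010197"),
    ("run:1", "run_for", "Claire Jennings"),
    # Semantic statement derived from the run (slide 39).
    ("stmt:1", "subject", "Gene:AC005412.6"),
    ("stmt:1", "predicate", "contains_single_nucleotide_polymorphism"),
    ("stmt:1", "object", "SNP:000010197"),
    ("stmt:1", "as_stated_by", "run:1"),
    # Trust and attribution (slide 40).
    ("stmt:1", "disputed_by", "Carole Goble"),
}


def objects(subject, predicate):
    """All objects recorded for a given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def disputed_statements():
    """Statements that somebody has disputed."""
    return sorted(s for s, p, o in triples if p == "disputed_by")
```

Keeping the semantic statement as its own node is what lets "as stated by" and "disputed by" attach to the claim rather than to the gene or the SNP.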

41 Provenance vs … Provenance vs Annotation: provenance of an annotation; annotation of provenance. Provenance vs Workflow: provenance describes past execution; a workflow is a script for future execution.

42 What is Provenance? Annotations may be subject to interpretation (e.g. Alice believes annotation X, whereas Bob does not). Provenance should aim at recording an undisputed view of an execution.

43 What is Provenance? Provenance traces execution Provenance must be generated automatically Annotations can be either generated automatically or created by the user Annotations can contain semantic augmentation, which can be derived automatically or supplied manually.

44 Generating provenance RDF registry mIR FreeFluo WFEE Workflow execution Template OWL descriptions Identify workflow Execution Provenance log Data and metadata from the run Input data & parameters startTime, endTime, service instances invoked … Bind services Knowledge Provenance log Workflow knowledge template RDF+OWL Knowledge arising from workflow Scufl

45 Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions

46 Provenance in a Bioinformatics Grid. myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments. Provenance in the drug discovery process: FDA requirement on drug companies to keep a record of provenance of drug discovery as long as the drug is in use (up to 50 years sometimes).

47 Provenance in Aerospace Engineering. Provenance requirement: to maintain a historical record of outputs from each sub-system involved in simulations. Aircraft provenance data need to be kept for up to 99 years when sold to some countries. Currently, little direct support is available for this.

48 Provenance in Organ Transplant Management. Decision support systems for organ and tissue transplant rely on a wide range of data sources, patient data, and doctors’ and surgeons’ knowledge. Heavily regulated domain: European, national, regional and site-specific rules govern how decisions are made. Application of these rules must be ensured, be auditable and may change over time. Provenance allows tracking previous decisions: crucial to maximise the efficiency in matching and recovery rate of patients.

49 The Grid and Virtual Organisations. The Grid problem is defined as coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01]. Effort is required to allow users to place their trust in the data produced by such virtual organisations. Understanding how a given service is likely to modify data flowing into it, and how this data has been generated, is crucial.

50 Provenance and Virtual Organisations. Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result, how can we determine the process that generated the result, especially after the virtual organisation has been disbanded? The lack of information about the origin of results does not help users to trust such open environments.

51 Provenance and Workflows. Workflow enactment has become popular in the Grid and Web Services communities. Workflow enactment can be seen as a scripted form of virtual organisation. The problem is similar: how can we determine the origin of enactment results?

52 Provenance: Definition Provenance is some data able to explain how a particular result has been derived. In a service-oriented architecture, provenance identifies what data is passed between services, what services are available, and what results are generated for particular sets of input values, etc. Using provenance, a user can trace the “process” that led to the aggregation of services producing a particular output.
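To make the idea of tracing the "process" behind a result concrete, the sketch below walks backwards from an output through the services that produced it. The data items, service names and the `produced_by` mapping are invented for illustration; a real provenance store would hold far richer records.

```python
from collections import deque

# Hypothetical provenance: data item -> (producing service, inputs consumed).
produced_by = {
    "report": ("render_service", ["annotated_gene"]),
    "annotated_gene": ("annotation_service", ["gene", "snp_list"]),
    "snp_list": ("snp_service", ["gene"]),
}


def trace(result):
    """Return (service, inputs, output) steps behind `result`, nearest first.

    Items with no entry in `produced_by` are treated as original inputs.
    """
    steps, seen, pending = [], set(), deque([result])
    while pending:
        item = pending.popleft()
        if item in seen or item not in produced_by:
            continue
        seen.add(item)
        service, inputs = produced_by[item]
        steps.append((service, tuple(inputs), item))
        pending.extend(inputs)
    return steps


steps = trace("report")
```

The breadth-first order lists the steps closest to the result first, which is usually what a user asking "where did this come from?" wants to read.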

53 Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions

54 What is the problem? Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments. Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.

55 Architectural Vision

56 Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services. Provenance data will be submitted to one or more “provenance repositories” acting as storage for provenance data. Upon user request, some analysis, navigation and reasoning over provenance data can be undertaken.

57 Architectural Vision. Storage could be achieved by a provenance service. The provenance service would provide support for analysis, navigation or reasoning over provenance. Client-side support for submitting provenance data to the provenance service.

58 A First Prototype (Szomszor, Moreau 03). A service-oriented architecture for provenance support in Grid and Web Services environments, based on the idea of a provenance service; a client-side API for recording provenance data for Web Service invocation; a data model for storing provenance data; a server-side interface for querying provenance data; two components making use of provenance: provenance browsing and provenance validation.
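The pieces listed on this slide can be pictured with a small in-memory sketch: a store standing in for the provenance service, a client-side wrapper that records each invocation, a query method, and a validation check. Everything here (class, method and field names) is hypothetical and is not the API of the prototype; the reading of "provenance validation" as re-running an invocation and comparing outputs is also an assumption.

```python
import itertools


class ProvenanceService:
    """In-memory stand-in for a provenance repository."""

    def __init__(self):
        self._records = []
        self._ids = itertools.count(1)

    def record_invocation(self, session, service, inputs, output):
        record_id = next(self._ids)
        self._records.append({
            "id": record_id,
            "session": session,
            "service": service,
            "inputs": dict(inputs),
            "output": output,
        })
        return record_id

    def query(self, **criteria):
        """Return records whose fields equal every given criterion."""
        return [r for r in self._records
                if all(r.get(k) == v for k, v in criteria.items())]


def invoke_with_provenance(store, session, service_name, func, **inputs):
    """Client-side wrapper: call the service, then record what happened."""
    output = func(**inputs)
    store.record_invocation(session, service_name, inputs, output)
    return output


def validate(record, func):
    """Re-run with the recorded inputs and compare with the recorded output."""
    return func(**record["inputs"]) == record["output"]


def reverse(sequence):
    return sequence[::-1]


store = ProvenanceService()
answer = invoke_with_provenance(store, "session-1", "reverse", reverse,
                                sequence="ACGT")
matches = store.query(service="reverse")
```

A browser component would sit on top of `query`, while `validate` shows how stored inputs and outputs allow a later check that a service still behaves as it did when the record was made.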

59 Prototype Overview

60 Prototype Sequence Diagram

61 To identify the interactions between provenance service, client-side library and enactment engine. Creation of a session. Need to be able to support the most complex workflows, including conditional branching, iteration, recursion and parallel execution. Support asynchronous submission of provenance data so that provenance submission does not delay workflow execution.
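One common way to keep provenance submission off the workflow's critical path is a queue drained by a background thread. The sketch below shows that pattern only; the class name, the `submit` callback and the session identifier are assumptions, and the prototype may well do this differently.

```python
import queue
import threading


class AsyncProvenanceClient:
    """Queues provenance events and delivers them from a worker thread."""

    def __init__(self, submit):
        self._submit = submit          # callable(session_id, event)
        self.errors = []               # delivery failures, kept for inspection
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, session_id, event):
        """Return immediately; delivery happens in the background."""
        self._queue.put((session_id, event))

    def _drain(self):
        while True:
            item = self._queue.get()
            try:
                if item is not None:
                    self._submit(*item)
            except Exception as exc:   # keep draining after a failed delivery
                self.errors.append(exc)
            finally:
                self._queue.task_done()
            if item is None:           # sentinel: stop the worker
                break

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._queue.join()
        self._worker.join()


received = []
client = AsyncProvenanceClient(lambda s, e: received.append((s, e)))
client.record("session-1", "invocation started")
client.record("session-1", "invocation finished")
client.close()
```

A single worker reading from a FIFO queue preserves the order of events within a session, at the cost of losing queued events if the process dies before `close` is called.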

62 Prototype Provenance Data Model

63 Must support recording of all information necessary to replay execution. Must support all complex forms of workflows (recursion, iterations, parallel execution).

64 Prototype Provenance Browser

65 Discussion. In order for provenance data to be useful, we expect such a protocol to support some “classical” properties of distributed algorithms. Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server, and vice-versa, a provenance server can ensure that it receives data from a given service. With non-repudiation, we can retain evidence of the fact that a service has committed to executing a particular invocation and has produced a given result. We anticipate that cryptographic techniques will be useful to ensure such properties.

66 Towards Trust

67 Using the provenance of data, trust metrics of the data can be derived from: Trust the user places in invoked services Trust the user places in the input data Trust the user places in the enacted workflow Trust the user places in the enactor Trust the user places in the provenance service.
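The slide lists the inputs to a trust metric but not how to combine them. As one simple possibility, the sketch below multiplies five trust values in the range 0 to 1, so that distrust in any single component lowers the overall score. The combination rule and the key names are assumptions, not something stated in the presentation.

```python
# Hypothetical component names matching the five items on the slide.
COMPONENTS = ("services", "input_data", "workflow", "enactor",
              "provenance_service")


def derived_trust(trust):
    """Combine per-component trust values in [0, 1] by multiplication."""
    score = 1.0
    for name in COMPONENTS:
        value = trust[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"trust in {name} must be within [0, 1]")
        score *= value
    return score


full = dict.fromkeys(COMPONENTS, 1.0)
doubtful_input = dict(full, input_data=0.5)
```

Other rules, such as taking the minimum, would also respect the idea that a result is no more trustworthy than the weakest element of its provenance.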

68 The purpose of the PASOA project is to investigate provenance in Grid architectures. Funded by EPSRC under the “fundamental computer science for e-Science” call. In collaboration with Cardiff. www.pasoa.org

69 Conclusion. Provenance is a rather unexplored domain. Strategic to bring trust in open environments. Necessity to design a configurable architecture capable of supporting multiple requirements from very different application domains. Need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.

70 Publications. [SM03] Martin Szomszor and Luc Moreau. Recording and reasoning over data provenance in web and grid services. In International Conference on Ontologies, Databases and Applications of SEmantics (ODBASE'03), volume 2888 of Lecture Notes in Computer Science, pages 603-620, Catania, Sicily, Italy, November 2003. [MCS+03] Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Lazslo Varga, Ulises Cortes, and Steven Willmott. Provenance-based trust for grid computing - position paper. 2003. [GGS+03] Mark Greenwood, Carole Goble, Robert Stevens, Jun Zhao, Matthew Addis, Darren Marvin, Luc Moreau, and Tom Oinn. Provenance of e-science experiments - experience from bioinformatics. In Proceedings of the UK OST e-Science second All Hands Meeting 2003 (AHM'03), pages 223-226, Nottingham, UK, September 2003.

71 Acknowledgements. The myGrid Southampton Team: Simon Miles, Juri Papay, Ananth Krishna, Michael Luck, David De Roure, Terry Payne; Mark Greenwood, Carole Goble, Manchester; Martin Szomszor, Southampton; Syd Chapman, IBM; Omer Rana, Cardiff; Andreas Schreiber and Rolf Hempel, DLR; Lazslo Varga, SZTAKI; Ulises Cortes and Steven Willmott, UPC

72 www.mygrid.org.uk

