IPAW'08 – Salt Lake City, Utah, June 2008 Exploiting provenance to make sense of automated decisions in scientific workflows Paolo Missier, Suzanne Embury,


1 IPAW'08 – Salt Lake City, Utah, June 2008
Exploiting provenance to make sense of automated decisions in scientific workflows
Paolo Missier, Suzanne Embury, Richard Stapenhurst
Information Management Group, School of Computer Science, The University of Manchester, UK

2 Outline
Setting: the problem of quality control in scientific workflows
Quality control is an automated decision process
– accept/reject data based on user-defined criteria
– part of the workflow => quality workflow
Role of workflow provenance in explaining automated decisions
– why was data element X accepted/rejected?

3 Scope of provenance analysis
Model-driven quality workflows:
– automatically generated from a specification
– makes for a predictable workflow structure
Services in quality workflows are semantically annotated
The provenance data model exploits the semantics:
– provenance queries leverage the ontology
– provenance elements explained in ontology terms

4 Practical setting
Scientific workflows accelerate the rate at which results are produced
Quality control on the results becomes paramount
– automation / high throughput limit the options for systematic human inspection
– use of public resources (data, services) may introduce noise, e.g. dirty data
Risk of producing invalid results
But: quality metrics vary with data and application domain

5 Example: protein identification process
Diagram labels: “Wet lab” experiment; data output; protein identification algorithm; Protein Hitlist; protein function prediction; correct entry => true positive
Evidence:
– Mass coverage (MC) measures the amount of protein sequence matched
– Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
– ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
This evidence is independent of the algorithm / software package
It is readily available and inexpensive to obtain

6 Quality process components
Diagram labels: Protein identification; Protein Hitlist; Protein function prediction
Process components: collect evidence; compute assertions; evaluate conditions; execute actions
Evidence: mass coverage (MC), hit ratio (HR), ELDP
Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
Quality filtering actions, rules: if (score < x) then reject
The Qurator hypothesis [VLDB06]: quality controls have a common process representation
– regardless of their specific data and application domain
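
The scoring formula and the filtering rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the Qurator implementation: the ProteinHit class, the protein identifiers, the evidence values and the threshold of 50 are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ProteinHit:
    """One hitlist entry with its quality evidence (hypothetical structure)."""
    protein_id: str       # made-up identifier, for illustration only
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(hit: ProteinHit) -> float:
    """Quality assertion from the slide: (HR x 100) + MC + (ELDP x 10)."""
    return hit.hit_ratio * 100 + hit.mass_coverage + hit.eldp * 10


def filter_hits(hits, threshold):
    """Apply the rule 'if (score < x) then reject', with x = threshold."""
    accepted, rejected = [], []
    for hit in hits:
        if pmf_score(hit) < threshold:
            rejected.append(hit)
        else:
            accepted.append(hit)
    return accepted, rejected


# Made-up evidence values; the threshold is an assumption, not from the talk.
hits = [
    ProteinHit("P1", hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    ProteinHit("P2", hit_ratio=0.25, mass_coverage=10.0, eldp=0.0),
]
accepted, rejected = filter_hits(hits, threshold=50.0)
```

With these inputs P1 scores 100 and is accepted, while P2 scores 35 and is rejected. The four process components map onto the code loosely: the evidence fields are collected, pmf_score computes the assertion, the comparison evaluates the condition, and the two lists record the action taken.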

7 From quality processes to quality workflows
Approach in practice:
Users provide a declarative specification of an abstract quality process (a “Quality View”)
The abstract process is automatically translated into a quality workflow
– this makes arbitrary Taverna workflows “quality-aware”

8 Example: original proteomics workflow
Figure label: quality flow embedding point

9 Example: embedded quality workflow

10 Qurator provenance component
Specialised for quality workflows
Figure labels:
– scope: workflow run
– data being quality assessed
– quality metrics applied to the data
– value of metric on the data
– evidence used to compute metrics
– quality rules based on metrics values
– statistics

11 Semantics of quality processors
Figure labels:
– upper ontology for Information Quality
– extensions to the proteomics domain
– services and data

12 Provenance model
Provenance elements are individuals of ontology classes
– OWL ontology => RDF provenance data
Static model (RDF graph)
– workflow graph structure, services
– auto-generated along with the quality workflow itself
Dynamic model (RDF graph)
– populated during workflow execution
– RDF resources can be elements of the static model
– data values are literals
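
The split between the static and the dynamic model might look roughly like the sketch below, which uses plain Python tuples in place of a real RDF store. Every prefixed name (wf:, iq:, qp:, run:, data:) is invented for the example and is not taken from the Qurator ontology; the slides with the actual model fragments are figures that did not survive in this transcript.

```python
# Triples are plain (subject, predicate, object) tuples standing in for RDF.
# Every prefixed name below is hypothetical, not the Qurator vocabulary.

# Static model: generated once, together with the quality workflow.
static_model = {
    ("wf:CollectEvidence", "rdf:type", "iq:Annotator"),
    ("wf:ComputePMFScore", "rdf:type", "iq:QualityAssertion"),
    ("wf:CollectEvidence", "wf:feeds", "wf:ComputePMFScore"),
}

# Dynamic model: filled in while a workflow run executes.
dynamic_model = set()


def record_value(run, data_item, processor, value):
    """Record that `processor` produced `value` for `data_item` during `run`."""
    binding = f"{run}/{processor}/{data_item}"
    dynamic_model.update({
        (binding, "qp:inRun", run),
        (binding, "qp:onData", data_item),
        (binding, "qp:producedBy", processor),  # resource from the static model
        (binding, "qp:value", value),           # data value stored as a literal
    })
    return binding


binding = record_value("run:42", "data:hit1", "wf:ComputePMFScore", 100.0)

# Resources of the dynamic model that are also elements of the static model.
static_subjects = {s for s, _, _ in static_model}
shared = {o for _, _, o in dynamic_model if o in static_subjects}
```

Here `shared` contains only wf:ComputePMFScore, which illustrates the point that dynamic-model resources can refer back to elements of the static model, while the score itself is kept as a literal.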

13 Static model (fragment)

14 Dynamic model (fragment)

15 Provenance service interface
Java SPARQL API (Jena ARQ)
– the GUI shown earlier is an example
Queries are straightforward SPARQL
– 3-layer workflow pattern => no recursion
Examples
– all evidence for data elements of class ProteinHitEntry [for a given execution], e.g. ?x rdf:type ProteinHitEntry
– all action outcomes [for a given execution]
– values for all quality metrics [for a given execution and data element]
– ...
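
To show why such queries need no recursion, here is a minimal sketch of evaluating a fixed conjunction of triple patterns, such as ?x rdf:type ProteinHitEntry, over a small in-memory graph. It is written in plain Python rather than against Jena ARQ, and it is not a SPARQL engine. The predicates q:hasEvidence and q:value and all data values are assumptions made for the example; only the class name ProteinHitEntry comes from the slide.

```python
def match(pattern, triple, bindings):
    """Extend `bindings` so that `pattern` matches `triple`, or return None."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if isinstance(p, str) and p.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(graph, patterns):
    """Evaluate a conjunction of triple patterns with a flat nested-loop join."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


# Hypothetical provenance triples for one execution.
graph = [
    ("d:hit1", "rdf:type", "ProteinHitEntry"),
    ("d:hit2", "rdf:type", "ProteinHitEntry"),
    ("d:hit1", "q:hasEvidence", "e:hr1"),
    ("e:hr1", "q:value", 0.5),
    ("d:spectrum1", "rdf:type", "MassSpectrum"),
]

# All data elements of class ProteinHitEntry.
entries = sorted(b["?x"] for b in query(graph, [("?x", "rdf:type", "ProteinHitEntry")]))

# All evidence values attached to ProteinHitEntry elements.
rows = query(graph, [
    ("?x", "rdf:type", "ProteinHitEntry"),
    ("?x", "q:hasEvidence", "?e"),
    ("?e", "q:value", "?v"),
])
```

Because the path from a data element to its evidence has a fixed length, the second query is a join over three patterns and never has to follow links to an unknown depth.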

16 Conclusions
An experiment in “semantic provenance”
– restricted to quality workflows
Semantic service annotations => high-level provenance query / presentation
Key enabler: workflow is the result of a compilation step
– regular pattern facilitates analysis / presentation
Speculative conclusion:
– workflows are targets, not sources...
– model-driven generation of workflows has benefits and will happen more and more

