Published by Abril Legates. Modified over 9 years ago.
1
Provenance Challenge @ GGF18: Kepler/COW+RWS (Bowers, McPhillips et al.)
Provenance Management in a COllection-oriented Scientific Workflow Framework, aka Kepler/DAKS
(for Luc’s collection: before: “We do provenance!”; now: “… and it almost killed us!”)
Shawn Bowers, Timothy McPhillips, Bertram Ludaescher
in collaboration with Ilkay Altintas and Norbert Podhorszki
2
Goals for the Provenance Challenge
Implement an RWS-style provenance model for Collection-Oriented Scientific Workflows
Take advantage of Collection-Oriented SWFs to
–Automatically infer state-reset events
–Reduce the number of provenance-relevant events that need to be recorded (keep it minimal)
–Simplify association of traces and provenance into one self-contained “trace” file for input, output, and dependencies
Support science-oriented provenance and queries
–Emphasize data dependencies (lineage) as well as process details
Decouple provenance representation from particular scientific workflow technology (Kepler)
3
Collection-Oriented Workflows
Generic support for workflows that operate over nested data collections (trees)
Abstract Model
–Actors receive input trees, read contents of subtrees matching some criteria (scope), and optionally add or delete subtree nodes
–Each scope instance corresponds to one actor invocation
[Figure: AlignWarp actor with Scope = AnatomyImage; tree nodes labeled AnatomyImage, Image, ReferenceImage, and WarpParamSet]
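The abstract model above can be illustrated with a minimal sketch. The `Node` class, `find_scopes`, `run_actor`, and the `align_warp` stand-in are hypothetical names chosen for illustration, not the Kepler API. The sketch assumes a scope is matched purely by node label and that matched subtrees are not searched for further nested matches.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A node in a nested data collection (hypothetical representation)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def find_scopes(root: Node, scope_label: str) -> List[Node]:
    """Return every subtree whose root label matches the actor's scope."""
    if root.label == scope_label:
        return [root]  # assumption: do not search inside a matched scope
    matches: List[Node] = []
    for child in root.children:
        matches.extend(find_scopes(child, scope_label))
    return matches


def run_actor(root: Node, scope_label: str, actor: Callable[[Node], None]) -> int:
    """Invoke the actor once per scope instance; return the invocation count."""
    scopes = find_scopes(root, scope_label)
    for subtree in scopes:
        actor(subtree)
    return len(scopes)


def align_warp(anatomy_image: Node) -> None:
    """Stand-in for AlignWarp: adds a WarpParamSet node to its scope."""
    anatomy_image.children.append(Node("WarpParamSet"))


workflow_input = Node("ImageCollection", [
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
])
invocations = run_actor(workflow_input, "AnatomyImage", align_warp)
```

Here the actor runs once for each `AnatomyImage` subtree, so the number of invocations follows from the shape of the input tree rather than from explicit loop logic in the workflow.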
4
Collection-Oriented Workflows
Kepler Implementation
–Collections are serialized within heterogeneous token streams
–Actor execution is pipelined based on each actor’s scope
–Enables concurrent processing of nested data collections
–Collections can contain data, metadata, actor parameters, and other collections
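One way to picture serializing a nested collection into a flat token stream is sketched below. The token kinds (`"open"`, `"data"`, `"close"`) and the tuple-based collection encoding are assumptions made for this illustration; the slides do not specify the actual token types used in Kepler.

```python
from typing import Iterator, Tuple

# Assumed encoding: a collection is (label, [members]); a member is either
# a nested collection (a tuple) or a plain data value (anything else).
Token = Tuple[str, object]


def to_tokens(collection) -> Iterator[Token]:
    """Flatten a nested collection into a stream of delimiter and data tokens."""
    label, members = collection
    yield ("open", label)            # marks the start of the collection
    for member in members:
        if isinstance(member, tuple):
            yield from to_tokens(member)   # nested collection
        else:
            yield ("data", member)         # plain data item
    yield ("close", label)           # marks the end of the collection


nested = ("ImageCollection", [("AnatomyImage", ["image-1", "header-1"])])
tokens = list(to_tokens(nested))
```

Because the stream is produced lazily by a generator, a downstream consumer could begin work on the first `AnatomyImage` before later collections have been emitted, which is the intuition behind pipelined execution.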
5
Collection-Oriented Provenance Challenge SWF
Input data is read by collection reader
–Execution driven by number and size of anatomy image sets specified by XML file
Slicer configured on the fly via parameter tokens
–E.g., to create the 3 slices required for each image set
Output trace serialized into XML by collection writer
–Trace implicitly contains input data, output data, and lineage
6
Collection-Oriented Provenance
Embedded Provenance Tokens
–Data and invocation dependencies stored as tokens within the stream
–Actor API for declaring data dependencies
–Invocation dependencies added automatically
Data Dependencies
–Insertion and deletion events capture actor, invocation count, and direct data dependencies
Process Dependencies
–Invocation dependencies record which steps created data or modified collections used by another actor invocation
[Figure: insertion dependencies among AnatomyImage, Image, ReferenceImage, and WarpParamSet nodes]
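A minimal sketch of what an insertion event might carry is shown below. `InsertionEvent` and `ProvenanceRecorder` are hypothetical names, not the actor provenance API from the talk; the fields mirror the three items the slide lists (actor, invocation count, direct data dependencies). The node ids are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class InsertionEvent:
    """Records that one actor invocation inserted a node into the stream."""
    node_id: int
    actor: str
    invocation: int
    depends_on: Tuple[int, ...]  # ids of the nodes this node was derived from


class ProvenanceRecorder:
    """Hypothetical per-actor helper that stamps events with the invocation count."""

    def __init__(self, actor: str) -> None:
        self.actor = actor
        self.invocation = 0
        self.events: List[InsertionEvent] = []

    def begin_invocation(self) -> None:
        """Called once per scope instance, before the actor processes it."""
        self.invocation += 1

    def record_insertion(self, node_id: int, depends_on: Iterable[int]) -> InsertionEvent:
        """Declare that node_id was created from the nodes in depends_on."""
        event = InsertionEvent(node_id, self.actor, self.invocation, tuple(depends_on))
        self.events.append(event)
        return event


recorder = ProvenanceRecorder("Slicer")
recorder.begin_invocation()
event = recorder.record_insertion(337, depends_on=[311, 312])
```

In this sketch the actor only declares which inputs a new node depends on; the actor name and invocation count are filled in by the recorder, which loosely mirrors the split between declared data dependencies and automatically added invocation information.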
7
Minimal Provenance Information
[Figure: the same trace shown in two panels, “Without Provenance” and “With Provenance”]
8
Querying Collection-Oriented Provenance
Execution traces imply provenance graphs
–Graph edges encode data lineage and process relations
  Lineage(Trace, Node, DependentNode, Actor, InvocCount)
Provenance operations work over traces and graphs:
  Input(Trace, Node)
  Output(Trace, Node)
  Param(Trace, Name, Value, Actor, InvocCount)
  Metadata(Trace, Key, Value, Node)
  etc.
[Figure: two lineage graphs, “Data/Collection creation lineage” and “Collection ‘last version’ lineage”, over nodes AtlasImage (308), Image (311), Header (312), and AtlasSlice (337), with actor invocation Slicer : 1]
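The `Lineage(Trace, Node, DependentNode, Actor, InvocCount)` relation can be pictured as a table of tuples with simple lookups on top. The sketch below is an illustration only: the two sample rows are an assumed reading of the figure (AtlasSlice 337 derived from Image 311 and Header 312 by Slicer invocation 1), and the helper names `direct_sources` and `created_by` are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Lineage(NamedTuple):
    """One lineage edge: dependent_node was derived from node."""
    trace: int
    node: int
    dependent_node: int
    actor: str
    invoc_count: int


# Assumed sample facts, loosely based on the node ids shown in the figure.
LINEAGE: List[Lineage] = [
    Lineage(1, 311, 337, "Slicer", 1),
    Lineage(1, 312, 337, "Slicer", 1),
]


def direct_sources(facts: List[Lineage], trace: int, dependent: int) -> Set[int]:
    """Nodes that the given node directly depends on within one trace."""
    return {f.node for f in facts if f.trace == trace and f.dependent_node == dependent}


def created_by(facts: List[Lineage], trace: int, dependent: int) -> Set[Tuple[str, int]]:
    """(actor, invocation count) pairs responsible for the given node."""
    return {
        (f.actor, f.invoc_count)
        for f in facts
        if f.trace == trace and f.dependent_node == dependent
    }


sources = direct_sources(LINEAGE, 1, 337)
steps = created_by(LINEAGE, 1, 337)
```

The same table answers both a data question (which nodes fed into 337) and a process question (which actor invocation produced it), which is the dual use of graph edges the slide describes.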
9
Challenge Results
We used two different runs
–Each run has embedded metadata and parameter settings
–First run equivalent to challenge workflow
–Second run containing three sets of image collections, containing different numbers of images
[Figure: input to first run. A WorkflowInput tree with one ImageCollection holding four AnatomyImage collections; extracted node labels include ReferenceImage 1 to 4, Image, Header 1 to 4, and Header]
10
Challenge Results
[Figure: input to second run. A WorkflowInput tree with several ImageCollection nodes, each holding AnatomyImage collections; remaining contents are elided (…) in the source]
11
Challenge Results (Trace 1)
Full Data Dependencies
Query:
  ?- trace(1, T), nodeId(T, 341, N1), nodeId(T, 349, N2), nodeId(T, 357, N3),
     lineageEdges(T, [N1, N2, N3], Edges), drawEdges(Edges).
12
Challenge Results (Trace 1)
Question 1: Process that led to Atlas X Graphic
Query:
  ?- trace(1, T), nodeId(T, 341, N), lineageEdges(T, N, Edges), drawEdges(Edges).
Returns subset of lineage edges
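The slides do not define `lineageEdges`, but one plausible reading is that it collects every lineage edge reachable upstream from the given nodes. The sketch below implements that reading as a breadth-first traversal; the function name `lineage_edges`, the `(source, dependent)` edge encoding, and the small example graph are all assumptions made for illustration and do not come from the challenge traces.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]  # (source node, dependent node)


def lineage_edges(edges: Iterable[Edge], targets: Iterable[int]) -> Set[Edge]:
    """Collect all edges on any upstream path leading into the target nodes."""
    # Index edges by dependent node so each step looks up direct sources.
    incoming: Dict[int, List[int]] = {}
    for source, dependent in edges:
        incoming.setdefault(dependent, []).append(source)

    result: Set[Edge] = set()
    seen: Set[int] = set(targets)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for source in incoming.get(node, []):
            result.add((source, node))
            if source not in seen:   # visit each upstream node only once
                seen.add(source)
                queue.append(source)
    return result


# Hypothetical graph: 1 and 2 feed 3, 3 feeds 5, and 4 feeds 6 independently.
example_edges = [(1, 3), (2, 3), (3, 5), (4, 6)]
upstream_of_5 = lineage_edges(example_edges, [5])
upstream_of_6 = lineage_edges(example_edges, [6])
```

Asking for the lineage of a single output returns only the edges that contribute to it, which matches the slide's remark that the query returns a subset of the lineage edges; passing several targets returns the union.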
13
Challenge Results (Trace 2)
Question 1: Process that led to Atlas X Graphic
Query:
  ?- trace(2, T), nodeId(T, 973, N1), nodeId(T, 1093, N2), nodeId(T, 1193, N3),
     lineageEdges(T, [N1, N2, N3], Edges), drawEdges(Edges).
Single workflow run in which not all outputs depend on all inputs.
14
Summary
Benefits of our approach
–Provenance support for Collection-Oriented SWFs
–Minimal provenance information stored in self-contained trace file
–Provenance automatically embedded within data stream, simple actor provenance API
–Able to answer provenance challenge queries using simple operations (see wiki entry); note that we ignored question 7
Suggestions for Future Provenance Challenge
–More complex/realistic workflows (e.g., from bioinformatics): loops, nesting, partial dependencies, concurrency
–More “scientist-oriented” provenance queries: explicit queries for data dependencies (e.g., see wiki entry); assume user doesn’t know the structure of the trace (Query 5)
15
References
Timothy McPhillips and Shawn Bowers. An Approach for Pipelining Nested Collections in Scientific Workflows. SIGMOD Record 34, 12-17, 2005.
Shawn Bowers, Timothy McPhillips, Bertram Ludaescher, Shirley Cohen, Susan B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW'06), 2006.
Timothy McPhillips, Shawn Bowers, Bertram Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. 3rd International Workshop on Data Integration in the Life Sciences (DILS'06), 2006.