1
A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler
Manish Kumar Anand (maanand@ucdavis.edu)
Eighth Biennial Ptolemy Miniconference, Berkeley, California
2
2 Scientific Workflows
- Discoveries achieved via complex computations
- Workflows replacing traditional scripting approaches
- Enable automation, reproducibility, sharing, provenance
[Figure labels: Perl script; Scientific workflow system]
3
3 Provenance
- A record of processes, inputs/outputs, dependencies
- Supports reproducibility, interpretation, verification
[Figure labels: AlignWarp, Reslice, Softmean, Slicer, Convert; AXG, AYG, AZG]
4
4 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance
5
5 Conventional Provenance Models
Records
- Inputs / outputs of invocations
Infers
- Data dependency
- Invocation dependency
Example records: input(a, s1), output(a, s2), input(b, s2), input(c, s2), …
Assumptions:
- Data is atomic
- Invocations consume all inputs and produce new outputs
- Every output depends on all inputs
[Figure: workflow execution graph with data dependency and invocation dependency]
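The inference step of the conventional model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are read as (invocation, data item) pairs, the first four come from the slide, and the outputs of `b` and `c` (`s3`, `s4`) are made up here so that every invocation produces something. The function name `infer_dependencies` is hypothetical.

```python
from collections import defaultdict

# (invocation, data item) pairs. input(a,s1), output(a,s2), input(b,s2) and
# input(c,s2) mirror the slide; output(b,s3) and output(c,s4) are assumed.
INPUTS = [("a", "s1"), ("b", "s2"), ("c", "s2")]
OUTPUTS = [("a", "s2"), ("b", "s3"), ("c", "s4")]


def infer_dependencies(inputs, outputs):
    """Infer data and invocation dependencies from input/output records."""
    inputs_of = defaultdict(set)   # invocation -> items it read
    producer_of = {}               # item -> invocation that wrote it
    for invocation, item in inputs:
        inputs_of[invocation].add(item)
    for invocation, item in outputs:
        producer_of[item] = invocation

    # Assumption of the conventional model: every output depends on all
    # inputs of the invocation that produced it.
    data_deps = {
        (out_item, in_item)
        for out_item, invocation in producer_of.items()
        for in_item in inputs_of[invocation]
    }
    # An invocation depends on the producer of each item it read.
    invocation_deps = {
        (invocation, producer_of[item])
        for invocation, items in inputs_of.items()
        for item in items
        if item in producer_of
    }
    return data_deps, invocation_deps


data_deps, invocation_deps = infer_dependencies(INPUTS, OUTPUTS)
print(sorted(data_deps))        # pairs (item, item it depends on)
print(sorted(invocation_deps))  # pairs (invocation, invocation it depends on)
```

Because the model treats data as atomic and ties every output to every input, it cannot express pass-through data or partial dependencies, which is the limitation the next slide raises.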
6
6 Challenges in Modeling Provenance
Many scientific workflow systems also support:
a) Both data “transformers” and “pass-through”
b) Processes with different dependency patterns
c) Structured data (XML)
Models of provenance must consider these factors
[Figure: panels (a), (b), (c) showing actor a applied to structures s1-s5]
7
7 Unified Provenance Model
8
8 Efficient Provenance Representation
Instead of storing each version
- Only store a single combined version, along with a set of updates (Δ's)
- Updates and dependencies represented as annotations
Condensed: Δa = { ins(5,a), dep(5,2), del(3,a) }
Expanded: Δa = { ins(5,a), dep(5,2), del(3,a), ins(6,a), dep(5,3), dep(5,4), dep(6,2), dep(6,3), dep(6,4) }
[Figure: input version (nodes 1-4), output version (nodes 1, 2, 4, 5, 6), and the combined version (nodes 1-6) annotated with +a and -a, in expanded and condensed form]
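To make the "single combined version plus updates" idea concrete, the following Python sketch recovers the version before and after invocation a from the combined tree and its insertion/deletion annotations. The tree shape (1 with children 2 and 5, 2 with children 3 and 4, 5 with child 6) is an assumption read off the slide's figure, and the helper names are hypothetical. It handles a single invocation only; a trace with several invocations would also need their order.

```python
# Assumed shape of the combined version: child -> parent.
PARENT = {2: 1, 5: 1, 3: 2, 4: 2, 6: 5}
NODES = {1, 2, 3, 4, 5, 6}

# Condensed annotations for invocation "a", as listed on the slide.
DELTA_A = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}


def ancestors_or_self(node):
    """Yield the node, then its parent, grandparent, and so on."""
    while node is not None:
        yield node
        node = PARENT.get(node)


def affected(kind, invocation, delta):
    """Nodes inserted (or deleted) by the invocation, including the
    descendants of every annotated node."""
    roots = {n for k, n, x in delta if k == kind and x == invocation}
    return {n for n in NODES if any(a in roots for a in ancestors_or_self(n))}


def version_before(invocation, delta):
    # Everything in the combined version that the invocation did not insert.
    return NODES - affected("ins", invocation, delta)


def version_after(invocation, delta):
    # Everything in the combined version that the invocation did not delete.
    return NODES - affected("del", invocation, delta)


print(sorted(version_before("a", DELTA_A)))
print(sorted(version_after("a", DELTA_A)))
```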
9
9 Expanding and Condensing Traces
[Figure: the expanded and condensed forms of the example trace (nodes 1-6, annotations +a and -a)]
10
10 Trace Views
[Figure: trace views S1-S6 of the nested Images collection across the workflow's actor invocations; node labels include AnatomyImage, RefImage, Image, Header, WarpParamSet, ReslicedImage, AtlasImage, AtlasSlice, AtlasGraphic]
Condensed Trace
Using a postorder (i.e., bottom-up, left-to-right) traversal:
- Remove annotations from a node n
  (i) dep(n,c) if dep(n,p) and child(p,c)
  (ii) dep(n,d) if child(p,n) and dep(p,d)
  (iii) ins(n,x) if child(p,n) and ins(p,x)
  (iv) del(n,y) if child(p,n) and del(p,y)
- Remove invocation order annotations
  - Those implied according to rules in (3-8)
Expanded Trace
Uses three distinct preorder (i.e., top-down, left-to-right) traversals:
- Pass 1: rules (1-2) and rules (3-5)
  - Infers insertion and deletion annotations
  - Infers invocation order from nodes and parent-child relationships
- Pass 2: rules (6-8)
  - Infers remaining invocation precedence relationships
- Pass 3: rules (9-10)
  - Expands dependency sets and propagates dependencies to child nodes
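Removal rules (i)-(iv) above, and their inverse, can be tried out on the annotation set from the Efficient Provenance Representation slide. The Python sketch below is a set-based simplification, not the postorder/preorder traversal algorithms the slide describes: it repeatedly applies the rules until nothing changes, and it leaves out the invocation-order rules, which the slide only references by number. The tree shape and the function names `expand` and `condense` are assumptions made for illustration.

```python
# Assumed tree for the example: child -> parent.
PARENT = {2: 1, 5: 1, 3: 2, 4: 2, 6: 5}


def children(node):
    return [c for c, p in PARENT.items() if p == node]


def expand(annotations):
    """Add every annotation implied by rules (i)-(iv) until a fixpoint."""
    result = set(annotations)
    while True:
        new = set()
        for kind, n, x in result:
            if kind == "dep":
                # (i): depending on p implies depending on p's children.
                new.update(("dep", n, c) for c in children(x))
                # (ii): children of n inherit n's dependencies.
                new.update(("dep", c, x) for c in children(n))
            else:
                # (iii)/(iv): ins and del propagate to children.
                new.update((kind, c, x) for c in children(n))
        if new <= result:
            return result
        result |= new


def condense(annotations):
    """Drop every annotation that rules (i)-(iv) can re-derive."""
    ann = set(annotations)

    def implied(annotation):
        kind, n, x = annotation
        parent_of_n = PARENT.get(n)
        if kind == "dep":
            parent_of_x = PARENT.get(x)
            if parent_of_x is not None and ("dep", n, parent_of_x) in ann:
                return True  # rule (i)
            return parent_of_n is not None and ("dep", parent_of_n, x) in ann  # rule (ii)
        return parent_of_n is not None and (kind, parent_of_n, x) in ann  # rules (iii), (iv)

    return {a for a in ann if not implied(a)}


CONDENSED = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}
EXPANDED = expand(CONDENSED)
print(len(CONDENSED), len(EXPANDED))
```

On this example, expanding the three condensed annotations yields the nine annotations of the expanded set, and condensing those nine returns the original three.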
11
11 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance
12
12 Storage Strategies
- Use a standard relational DBMS and minimize storage size, update time, and query time
- Store immediate and transitive dependencies
  - Faster query execution
- Reduction techniques
  - Represent dependencies in reduced form
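The trade-off behind storing transitive dependencies can be illustrated with a small Python and SQLite sketch. Everything here is hypothetical: the table names `dep_immediate` and `dep_transitive`, the example edges, and the naive closure routine are illustrative assumptions, not the schema or reduction algorithms used in Kepler. The point is only that materializing the closure turns a lineage query into a single non-recursive lookup at the cost of more stored rows.

```python
import sqlite3

# Hypothetical immediate dependencies: (node, node it directly depends on).
IMMEDIATE = [(2, 1), (3, 2), (4, 3), (5, 3)]


def transitive_closure(edges):
    """Naive closure: keep joining pairs until no new pair appears."""
    closure = set(edges)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not extra:
            return closure
        closure |= extra


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dep_immediate (node INTEGER, dep_node INTEGER)")
conn.execute("CREATE TABLE dep_transitive (node INTEGER, dep_node INTEGER)")
conn.executemany("INSERT INTO dep_immediate VALUES (?, ?)", IMMEDIATE)
conn.executemany(
    "INSERT INTO dep_transitive VALUES (?, ?)",
    sorted(transitive_closure(IMMEDIATE)),
)


def lineage(node):
    """All dependencies of a node, answered by one non-recursive lookup."""
    rows = conn.execute(
        "SELECT dep_node FROM dep_transitive WHERE node = ? ORDER BY dep_node",
        (node,),
    )
    return [row[0] for row in rows]


immediate_rows = conn.execute("SELECT COUNT(*) FROM dep_immediate").fetchone()[0]
transitive_rows = conn.execute("SELECT COUNT(*) FROM dep_transitive").fetchone()[0]
print(lineage(5), immediate_rows, transitive_rows)
```

The growth from immediate to transitive rows is what the reduction techniques on this slide aim to contain.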
13
13 Storage Strategies
5 storage strategies
- NC (Naive Collapsed): collapsed trace
- NE (Naive Expanded): expanded trace
- SE (Simple Expanded): expanded trace, transitive dependencies
- RE (Reduced Expanded): reduced expanded trace, transitive dependencies
- RC (Reduced Collapsed): reduced collapsed trace, transitive dependencies
Compare:
- Storage size, update time, query time
[Figure label: Reduction Algorithms]
14
14 Analysis of Storage Strategies
- SE: Worst storage size and update time
- RC: Very expensive query time
- RE: Recommended storage strategy
Rankings:
- Storage size: RC < RE < NC < NE < SE
- Update time: RC < RE < NC < NE < SE
- Query time: SE < RE < NE < RC < NC
[Charts: Storage Size (axes: Traces, Cells (1000)); Update Time (axes: Traces, Time (s)); Query Time; series NC, NE, SE, RE, RC]
15
15 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance
16
16 Querying Provenance can be Expensive
Queries are often recursive
- Complex to formulate
- Expensive to evaluate
Standard querying approaches
- Tied to storage representation
- Query language expertise
Need to query across structures, lineage, or both
How to express provenance queries easily and execute them efficiently?
(Q) Select lineage path that derived all children of AtlasImage created by slicer
[Figure labels: Structures, Lineage]
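As a small illustration of why lineage queries are recursive, the Python sketch below computes all transitive dependencies of one node with a recursive common table expression in SQLite. The schema is a simplified assumption (one `dependency` table with `runId`, `nodeId`, `depNodeId`), and the rows are invented; it is not the storage layout used by the framework, and a given DBMS may or may not support recursive queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema: nodeId directly depends on depNodeId.
conn.execute("CREATE TABLE dependency (runId TEXT, nodeId INTEGER, depNodeId INTEGER)")
conn.executemany(
    "INSERT INTO dependency VALUES (?, ?, ?)",
    [("r1", 18, 15), ("r1", 15, 12), ("r1", 12, 2), ("r2", 18, 99)],
)

LINEAGE_SQL = """
WITH RECURSIVE lineage(nodeId) AS (
    -- base case: direct dependencies of the starting node
    SELECT depNodeId FROM dependency
    WHERE runId = :run AND nodeId = :node
    UNION
    -- recursive case: dependencies of anything already found
    SELECT d.depNodeId
    FROM dependency d JOIN lineage l ON d.nodeId = l.nodeId
    WHERE d.runId = :run
)
SELECT nodeId FROM lineage ORDER BY nodeId
"""


def transitive_dependencies(run_id, node_id):
    rows = conn.execute(LINEAGE_SQL, {"run": run_id, "node": node_id})
    return [row[0] for row in rows]


print(transitive_dependencies("r1", 18))
```

Even in this reduced form the query has to be written against the storage layout, which is the coupling the slide points out.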
17
17 To Express this Query …
- Hard for domain scientists (… and SQL experts)
- Optimization depends on SQL engine [He et al. SIGMOD 08]
- Need for higher-level provenance query language

SQL (e.g., transitive dependencies):

select t.runId, t2.nodeId, t.nodeId as depNodeId
from (
    select d1.runId, d1.pDep, d1.nodeId
    from dependency d1
    where runId=runId_in
    union
    select p1.runId, p1.fromPointer as pDep, d2.nodeId
    from dependency d2, depSubsetPointer p1
    where p1.runId=runId_in and d2.runId=runId_in and d2.pDep=p1.toPointer
) as t,
depMinMaxPointer p2,
(
    select t.runId, r1.nodeId, t.pDep
    from (
        select dc1.runId, dc1.pDepC, dc1.pDep
        from depCdepPointer dc1
        where runId=runId_in
        union
        select p1.runId, p1.fromPointer as pDepC, dc2.pDep
        from depCdepPointer dc2, depCSubsetPointer p1
        where p1.runId=runId_in and dc2.runId=runId_in and dc2.pDepC=p1.toPointer
    ) as t,
    depCMinMaxPointer p2, runCollData r1, runItemProv rp1
    where p2.runId = runId_in and r1.runId=runId_in and rp1.runId=runId_in
      and r1.nodeId=nodeId_in and r1.pointer=rp1.pointer
      and rp1.pDep = p2.fromPointer and t.pDepC=p2.toPointer
      and t.pDep BETWEEN p2.depMin AND p2.depMax
    union
    … …

SQL (stored procedures):

create procedure depc(in runId_in varchar(255), in nodeId_in Integer)
begin
    DECLARE finished integer default 0;
    …
    declare cur_1 cursor for
        select depNodeId from dependency
        where runId=runId_in and itemNodeId=nodeId_tmp;
    set nodeId_tmp = nodeId_in;
    set depCnt = (select count(*) from dependency
                  where runId=runId_in and itemNodeId=nodeId_tmp);
    if (depCnt is not null) then
        open cur_1;
        get_cur_1: loop
            fetch cur_1 into depNodeId_tmp;
            if finished then leave get_cur_1; end if;
            insert into depcT (nodeId) values(depNodeId_tmp);
        end LOOP get_cur_1;
        close cur_1;
        set cnt = 1;
        while (cnt <= depCnt) do
            set nodeId_tmp = (select nodeId from depcT where no=cnt);
            set row_limit = (select count(*) from dependency
                             where itemnodeId=nodeId_tmp and runId=runId_in);
            set row_cnt =0;
            open cur_1;
            get_cur_1: loop
                fetch cur_1 into depNodeId_tmp;
                set flag = (select 1 from depcT where nodeId = depNodeId_tmp);
                if (flag is null) then
                    insert into depcT (nodeId) values(depNodeId_tmp);
                end if;
                if (row_cnt > row_limit) then leave get_cur_1; end if;
                set row_cnt = row_cnt + 1;
                … …
18
18 QLP Constructs
First Provenance Challenge Queries Formulated in QLP
Query 1: *..//AtlasXGraphic
Query 2: #softmean..//AtlasXGraphic
Query 3: #softmean..#slicer..#convert..//AtlasXGraphic
Query 4: invocations(#align_warp[m="12", dateofExecution="Monday"])
Query 5: outputs(//AnatomyHeaders[maximum="4096"]..//AtlasGraphic)
Query 6: outputs(#align_warp[-m="12"]..#softmean)
Query 7: #convert..*, #pgmtoppm..*
Query 8: outputs(//AnatomyImages[center="Uchicago"].#align_warp)
Query 9: //AtlasGraphic[studyModality="speech" | "visual" | "audio"]/@*
19
19 Querying Multiple Dimensions
1. Obtain structures from @in and @out version operators
2. Apply XPath expressions to structure
3. Apply lineage queries to each resulting node
(Q) Select lineage path that derived all children of AtlasImage created by slicer
QLP: * derived //AtlasImage/* @out slicer
[Figure: Structures: //AtlasImage/* @out slicer selects node 18 (AtlasSlice) in S5 (Images 1, AtlasImage 15, AtlasSlice 18); Lineage: * derived 18]
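The three evaluation steps can be walked through with a short Python sketch. It is an illustration under stated assumptions, not the QLP implementation: the output structure of slicer is modeled as a tiny XML tree using node ids 1, 15, and 18 from the slide's figure, the dependency table is invented, and the names `VERSIONS`, `DEPENDS_ON`, `lineage`, and `derived_from` are hypothetical. The path expression uses the XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Step 1 input: the structure produced by slicer ("@out slicer"), assumed here.
S5 = ET.fromstring(
    '<Images id="1"><AtlasImage id="15"><AtlasSlice id="18"/></AtlasImage></Images>'
)
VERSIONS = {("out", "slicer"): S5}

# Invented immediate dependencies: node id -> ids it was derived from.
DEPENDS_ON = {18: [15], 15: [12], 12: [2]}


def lineage(node_id):
    """All node ids that the given node was (transitively) derived from."""
    found, stack = [], [node_id]
    while stack:
        for dep in DEPENDS_ON.get(stack.pop(), []):
            if dep not in found:
                found.append(dep)
                stack.append(dep)
    return found


def derived_from(path, version):
    structure = VERSIONS[version]        # 1. obtain the structure for the version
    nodes = structure.findall(path)      # 2. apply the path expression
    return {                             # 3. apply the lineage query per node
        int(node.get("id")): lineage(int(node.get("id"))) for node in nodes
    }


result = derived_from(".//AtlasImage/*", ("out", "slicer"))
print(result)
```

Separating the structural step from the lineage step is what lets a query mix both dimensions without the user writing the recursion by hand.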
20
20 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance
21
21 Provenance Browser
- Browse different views of a trace
  - Data dependencies, collection structure, actor invocations
- Move “forward” and “backward” through execution
22
22 Collection History
- Collection and invocation view
- Incrementally step through execution history
23
23 Conclusion
Capture
- Supports nested data collections, explicit data dependency, update semantics
Storage
- Reduce update time, storage size, and query time
Query
- A high-level provenance query language (QLP)
  - Query structures with lineage graphs
  - Formulate queries easily and concisely
Browse/Visualize
- Provenance Browser, a visualization tool to view and navigate across provenance views
24
24 References
1. M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Exploring Scientific Workflow Provenance using Hybrid Queries over Nested Data and Lineage Graphs. SSDBM 2009.
2. M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. EDBT 2009.
3. S. Bowers, T. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. IPAW 2008.