Optimizing User Views for Workflows
Sudeepa Roy (with Olivier Biton, Susan Davidson and Sanjeev Khanna)
ZOOM Project, Database Research Group, University of Pennsylvania
Workflow
A workflow is a graphical representation of a sequence of actions that perform a task (e.g. a biological experiment).
Vertex: a module (program). It takes a set of data items as input and produces a set of data items as output.
Edge: control (and data) flow. Data is typically a file.
The workflow has a start module (s) and an end module (t).
Run: an execution of the workflow. Actual data appears on the edges, and a module can be executed once the data on each of its incoming edges has been computed.
[Figure: example workflow with modules Start (s), Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees, end (t); the edges of a run carry sequence data.]
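The execution rule for a run (a module fires once every incoming edge carries data) can be sketched as follows. This is a minimal illustration, not the ZOOM implementation: the function name `run_workflow`, the toy wiring, and the simplification that a module puts the same item on each outgoing edge are all assumptions made for the example.

```python
from collections import deque


def run_workflow(edges, modules):
    """Run each module once all of its incoming edges carry data.

    edges:   list of (u, v) module-name pairs
    modules: dict name -> function(list of input items) -> output item
    Returns the firing order and the data item placed on every edge.
    """
    preds = {m: [] for m in modules}
    succs = {m: [] for m in modules}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    waiting = {m: len(preds[m]) for m in modules}  # inputs still missing
    ready = deque(m for m in modules if waiting[m] == 0)
    order, data_on_edge = [], {}
    while ready:
        m = ready.popleft()
        order.append(m)
        inputs = [data_on_edge[(p, m)] for p in preds[m]]
        output = modules[m](inputs)
        for nxt in succs[m]:
            data_on_edge[(m, nxt)] = output  # simplification: same item on each edge
            waiting[nxt] -= 1
            if waiting[nxt] == 0:
                ready.append(nxt)
    return order, data_on_edge


# Hypothetical wiring; the edges of the slide's figure are not recoverable.
names = ["s", "split", "align", "curate", "trees", "t"]
toy_modules = {n: (lambda inputs, n=n: f"{n}({', '.join(inputs)})") for n in names}
toy_edges = [("s", "split"), ("split", "align"), ("split", "curate"),
             ("align", "trees"), ("curate", "trees"), ("trees", "t")]
order, data = run_workflow(toy_edges, toy_modules)
print(order)
print(data[("trees", "t")])
```

In the toy run, `trees` cannot fire until both `align` and `curate` have produced their outputs.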
Data Provenance in Scientific Workflows
High-throughput technologies generate huge amounts of data, which must be analyzed in computational experiments.
The analysis may be complex and multi-step.
Scientific workflow systems are frequently used to help conceptualize and manage the analysis process as well as the intermediate and final data products.
There is an increasing need to record the provenance (i.e. the origin or history) of data products, defined as a depends-on relationship between module executions and other data products.
Many scientific workflow systems (e.g. VisTrails, Kepler, Taverna) now support provenance.
Need for Provenance
A biologist's workspace fills up with sequence files, alignments and trees produced by bioinformatics protocols (the slide names ClustalW, PAUPS, Phillips, Bootstrap, ...). Typical questions:
Which sequences have been used to produce this tree?
How was this tree generated?
Can I throw away some of these data? Which ones are really important to keep?
[Figure: the biologist's workspace next to the workflow s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees, t.]
Provenance Overload
[Figure: the workflow specification (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees, t) next to a workflow run whose edges carry the data items d1...d100, d201...d301, d302...d402, d403, d404...d454 and d455 to d460. The immediate provenance and the deep provenance of the output of Construct Trees are marked on the run.]
Can we reduce the amount of provenance shown to the user?
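The gap between immediate and deep provenance can be made concrete with a small sketch. The slide only names the two notions, so the definitions used here are assumptions: immediate provenance is taken to be the producing module execution plus its direct inputs, and deep provenance is the transitive closure of that relation. The tables `produced_by` and `inputs_of` and the data ids are hypothetical.

```python
def immediate_provenance(item, produced_by, inputs_of):
    """The module execution that produced `item` and the data it read."""
    module = produced_by[item]
    return module, set(inputs_of[module])


def deep_provenance(item, produced_by, inputs_of):
    """All module executions and data items that `item` transitively depends on."""
    modules, data = set(), set()
    stack = [item]
    while stack:
        current = stack.pop()
        module = produced_by.get(current)  # None for external inputs
        if module is None or module in modules:
            continue
        modules.add(module)
        for d in inputs_of[module]:
            if d not in data:
                data.add(d)
                stack.append(d)
    return modules, data


# Hypothetical four-step run (ids are illustrative, not the slide's d1...d460).
produced_by = {"d1": "s", "d2": "Split Entries",
               "d3": "Align Sequences", "d4": "Construct Trees"}
inputs_of = {"s": [], "Split Entries": ["d1"],
             "Align Sequences": ["d2"], "Construct Trees": ["d3"]}

print(immediate_provenance("d4", produced_by, inputs_of))
print(deep_provenance("d4", produced_by, inputs_of))
```

Even in this tiny chain the deep provenance returns every upstream module and data item, which is the overload the slide refers to.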
Relevant Modules and Composition
[BCD+08] shows how to focus the user's attention on the relevant portion of the provenance information:
The user specifies the relevant modules.
The system creates composite modules (clusters).
The result is called a user-view.
[Figure: the full workflow (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees, t) and a user-view showing s, Align Sequences, Construct Trees and t.]
User-view Reduces Provenance Information
[Figure: the user-view with composite modules M1, M2, M3; the data items shown are d201...d301, d456, d458, d459 and d460.]
What properties should a good user-view have?
Problem: can the number of clusters be minimized in a good user-view?
Outline
Model and Definitions: Workflow Specification, User-View, Good user-view, Series-Parallel Graphs
Results for Series-Parallel graphs: Algorithm SP-View, Correctness, Upper bound, Lower bound, Optimality
Results for General graphs
Workflow Specification
A workflow specification is a tuple (G, s, t, R):
G(V, E) is a directed graph.
s is the unique start module (source) and t is the unique finish module (sink).
R is the set of relevant modules (R-nodes); NR = V - R is the set of non-relevant modules (NR-nodes).
s, t ∈ R.
|V| = n, |E| = m, |R| = k.
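The tuple (G, s, t, R) maps naturally onto a small record type. The sketch below is one possible encoding, with the class name `WorkflowSpec` and its fields chosen for illustration; it checks only the conditions stated on the slide (unique source s, unique sink t, R a subset of V, and s, t in R).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowSpec:
    nodes: frozenset      # V
    edges: frozenset      # E, as (u, v) pairs
    s: str                # start module
    t: str                # finish module
    relevant: frozenset   # R

    @property
    def non_relevant(self):
        """NR = V - R."""
        return self.nodes - self.relevant

    def validate(self):
        """Check the slide's conditions and return (n, m, k)."""
        heads = {v for _, v in self.edges}   # nodes with an incoming edge
        tails = {u for u, _ in self.edges}   # nodes with an outgoing edge
        if not (heads | tails) <= self.nodes:
            raise ValueError("edge endpoint outside V")
        if self.nodes - heads != {self.s}:
            raise ValueError("s must be the unique source")
        if self.nodes - tails != {self.t}:
            raise ValueError("t must be the unique sink")
        if not self.relevant <= self.nodes:
            raise ValueError("R must be a subset of V")
        if not {self.s, self.t} <= self.relevant:
            raise ValueError("s and t must be relevant")
        return len(self.nodes), len(self.edges), len(self.relevant)


spec = WorkflowSpec(
    nodes=frozenset({"s", "a", "b", "t"}),
    edges=frozenset({("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")}),
    s="s", t="t",
    relevant=frozenset({"s", "a", "t"}),
)
print(spec.validate(), spec.non_relevant)
```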
User View
H is a user-view of (G, s, t, R) if:
H is a directed graph whose nodes are clusters (composite modules) of nodes in G.
The nodes of H form a partition of the nodes of G.
An edge e = (u, v) in G survives in H as e' if the endpoints u, v belong to different clusters in H. We say that e induces e' in H, or that e is an origin of e'.
R-cluster: a cluster that contains at least one R-node.
NR-cluster: a cluster that contains only NR-nodes.
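Constructing H from a partition follows directly from these definitions. The sketch below is illustrative: `build_user_view`, the cluster names M1 to M4 and the toy graph are hypothetical. It records, for every edge of H, the set of its origins in G, and labels each cluster as an R-cluster or an NR-cluster.

```python
def build_user_view(edges, relevant, clusters):
    """Return (origins, kind) for the user-view induced by a partition.

    edges:    iterable of (u, v) edges of G
    relevant: set of R-nodes
    clusters: dict cluster name -> set of nodes of G (must partition V)
    origins:  dict H-edge (C1, C2) -> set of G-edges that induce it
    kind:     dict cluster name -> "R-cluster" or "NR-cluster"
    """
    cluster_of = {}
    for name, members in clusters.items():
        for node in members:
            if node in cluster_of:
                raise ValueError(f"{node!r} appears in two clusters")
            cluster_of[node] = name

    origins = {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:  # edges inside a cluster do not survive in H
            origins.setdefault((cu, cv), set()).add((u, v))

    kind = {name: ("R-cluster" if members & relevant else "NR-cluster")
            for name, members in clusters.items()}
    return origins, kind


g_edges = [("s", "a"), ("a", "b"), ("b", "t"), ("s", "r"), ("r", "t")]
view = {"M1": {"s", "a"}, "M2": {"b"}, "M3": {"r"}, "M4": {"t"}}
origins, kind = build_user_view(g_edges, {"s", "r", "t"}, view)
print(origins)
print(kind)
```

Here the edge (s, a) lies inside M1 and disappears, while (a, b) survives as the H-edge (M1, M2).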
Good and Bad User Views
Direct dependencies between relevant clusters should be preserved. They are defined in terms of elementary paths: an elementary path is a path in which all the intermediate nodes are NR-nodes.
Each cluster contains at most one R-node, so that an R-cluster assumes the meaning of its R-node.
[Figure: a specification with R-nodes r1, r2, r3, r4, followed by Bad view-1, Bad view-2, Good view-1 and Good view-2.]
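A coarse test of these two requirements compares the pairs of R-nodes joined by an elementary path in G with the pairs of R-clusters joined by an elementary path in the view. The sketch below is an assumption-laden simplification: it works at the level of pairs, so passing it is necessary but not sufficient for the edge-level notion of a good user-view. The function names and the toy graph are hypothetical.

```python
def direct_dependencies(edges, relevant):
    """Pairs (r, r2) of relevant nodes joined by an elementary path."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    pairs = set()
    for r in relevant:
        seen, stack = set(), list(succ.get(r, ()))
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            if x in relevant:
                pairs.add((r, x))              # path stops at the first R-node
            else:
                stack.extend(succ.get(x, ()))  # keep walking through NR-nodes
    return pairs


def coarse_check(edges, relevant, clusters):
    """Well-formedness plus pair-level preservation of direct dependencies."""
    cluster_of = {v: c for c, members in clusters.items() for v in members}
    if any(len(members & relevant) > 1 for members in clusters.values()):
        return "not well-formed"
    view_edges = {(cluster_of[u], cluster_of[v])
                  for u, v in edges if cluster_of[u] != cluster_of[v]}
    r_clusters = {cluster_of[r] for r in relevant}
    in_spec = {(cluster_of[a], cluster_of[b])
               for a, b in direct_dependencies(edges, relevant)}
    in_view = direct_dependencies(view_edges, r_clusters)
    if in_view - in_spec:
        return "view adds a direct dependency"
    if in_spec - in_view:
        return "view loses a direct dependency"
    return "direct dependencies preserved"


G = [("s", "r1"), ("s", "r2"), ("r1", "a"), ("r2", "b"), ("a", "t"), ("b", "t")]
R = {"s", "r1", "r2", "t"}
print(coarse_check(G, R, {"S": {"s"}, "R1": {"r1", "a"}, "R2": {"r2", "b"}, "T": {"t"}}))
print(coarse_check(G, R, {"S": {"s"}, "R1": {"r1"}, "R2": {"r2", "a", "b"}, "T": {"t"}}))
print(coarse_check(G, R, {"S": {"s"}, "R12": {"r1", "r2", "a", "b"}, "T": {"t"}}))
```

In the second view, moving the NR-node a into the cluster of r2 makes r2 appear to depend directly on r1, which the specification never stated.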
Three Properties of a Good User-view
Property 1 (well-formed): each cluster in H should contain at most one R-node from G.
[Figure: G (specification) and H (user-view) with R-nodes r1, r2, r3, r4.]
Three Properties of a Good User-view
Property 2 (soundness): every edge on an elementary path between two R-clusters in H should have all its origins on an elementary path between the corresponding R-nodes in G.
[Figure: G (specification) and H (user-view) with R-nodes r1, r2, r3 and a data item d.]
Not sound! r2 was not dependent on d in G, but it is dependent on d in H.
Three Properties of a Good User-view
Property 3 (completeness): every edge on an elementary path between two R-nodes in G should induce an edge on an elementary path between the corresponding R-clusters in H.
[Figure: specification and user-view with R-nodes r1, r2, r3 and a data item d.]
Not complete! d, produced by r1, was directly consumed by r3 in G, but it is processed by r2 in H.
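The three properties can be checked edge by edge, as sketched below. This is one reading of the slide definitions rather than the authors' procedure, and it rests on two assumptions: G is acyclic, and an edge of G whose endpoints fall in the same cluster is skipped in the completeness test, since it induces no edge in H. All function names are hypothetical.

```python
def reach_through_nr(adj, start, relevant):
    """Nodes reachable from `start` when only NR-nodes may be passed through."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                if v not in relevant:
                    stack.append(v)
    return seen


def pairs_per_edge(edges, relevant):
    """Map each edge to the (r, r2) pairs whose elementary paths use it."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
    fwd = {r: reach_through_nr(succ, r, relevant) for r in relevant}
    bwd = {r: reach_through_nr(pred, r, relevant) for r in relevant}
    result = {}
    for u, v in edges:
        starts = {r for r in relevant
                  if u == r or (u not in relevant and u in fwd[r])}
        ends = {r for r in relevant
                if v == r or (v not in relevant and v in bwd[r])}
        result[(u, v)] = {(a, b) for a in starts for b in ends}
    return result


def check_properties(edges, relevant, clusters):
    cluster_of = {v: c for c, members in clusters.items() for v in members}
    if any(len(members & relevant) > 1 for members in clusters.values()):
        return {"well_formed": False, "sound": None, "complete": None}
    # G-edge -> induced H-edge (edges inside one cluster are skipped)
    induced = {e: (cluster_of[e[0]], cluster_of[e[1]])
               for e in edges if cluster_of[e[0]] != cluster_of[e[1]]}
    r_clusters = {cluster_of[r] for r in relevant}
    in_g = pairs_per_edge(edges, relevant)
    in_h = pairs_per_edge(set(induced.values()), r_clusters)

    def lifted(e):  # pairs supported by G-edge e, renamed to cluster pairs
        return {(cluster_of[a], cluster_of[b]) for a, b in in_g[e]}

    sound = all(in_h[h] <= lifted(e) for e, h in induced.items())
    complete = all(lifted(e) <= in_h[h] for e, h in induced.items())
    return {"well_formed": True, "sound": sound, "complete": complete}


G = [("s", "r1"), ("s", "r2"), ("r1", "a"), ("r2", "b"), ("a", "t"), ("b", "t")]
R = {"s", "r1", "r2", "t"}
print(check_properties(G, R, {"S": {"s"}, "R1": {"r1", "a"}, "R2": {"r2", "b"}, "T": {"t"}}))
print(check_properties(G, R, {"S": {"s"}, "R1": {"r1"}, "R2": {"r2"}, "AB": {"a", "b"}, "T": {"t"}}))
```

The second view merges the NR-nodes a and b into one cluster: it stays complete but is not sound, because the edge from that cluster to t now lies on an elementary path from r1 while one of its origins, (b, t), does not.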
Optimization Problem
Given a directed graph G(V, E), a source s, a sink t, and a set R of R-nodes (s, t ∈ R, |R| = k), find in polynomial time a good user-view H that minimizes the total number of clusters (an optimum user-view).
Questions
Can we find an optimum user-view in general directed graphs? Is this problem NP-complete?
Answer: unknown. [BCD+08] gives a poly-time algorithm that finds a minimal good user-view, which may not be of minimum size.
What about special directed graphs that capture many common workflows?
Answer: optimum clustering for series-parallel graphs.
Can we find matching upper and lower bounds on the #clusters in terms of k (= |R|) and not n (= |V|)? In general graphs? In some special graphs?
Answer: tight bounds for general and series-parallel graphs.
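Series-Parallel Graphs
Base case: a single edge is an SP graph.
Series composition of G1 and G2: the sink of G1 is identified with the source of G2.
Parallel composition of G1 and G2: the two sources are identified with each other, and so are the two sinks.
[Figure: an edge, a series composition and a parallel composition of G1 and G2.]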
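The recursive definition translates into three small constructors. This is a sketch using the standard two-terminal composition rules; the representation (a triple of edge list, source, sink over fresh integer node ids) and the function names are assumptions, and the two operands must have disjoint node sets.

```python
import itertools

_fresh = itertools.count()  # generator of unused node ids


def edge():
    """Base case: a single edge between two new nodes."""
    s, t = next(_fresh), next(_fresh)
    return ([(s, t)], s, t)


def series(g1, g2):
    """Identify the sink of g1 with the source of g2."""
    e1, s1, t1 = g1
    e2, s2, t2 = g2

    def rename(x):
        return t1 if x == s2 else x

    return (e1 + [(rename(u), rename(v)) for u, v in e2], s1, t2)


def parallel(g1, g2):
    """Identify the two sources with each other, and the two sinks."""
    e1, s1, t1 = g1
    e2, s2, t2 = g2

    def rename(x):
        if x == s2:
            return s1
        if x == t2:
            return t1
        return x

    return (e1 + [(rename(u), rename(v)) for u, v in e2], s1, t1)


# A diamond: two two-edge chains composed in parallel.
diamond_edges, ds, dt = parallel(series(edge(), edge()), series(edge(), edge()))
print(diamond_edges, ds, dt)
```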
Examples: (Non-)Series-Parallel Graphs
Characterization of two-terminal SP graphs [VTL79]: a two-terminal DAG is an SP graph if and only if it does not contain a subgraph homeomorphic to the forbidden subgraph shown on the slide.
[Figure: the forbidden subgraph, examples of SP graphs, and examples of non-SP graphs.]
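Membership in the class can also be tested constructively. The sketch below does not use the forbidden-subgraph characterization from the slide. It swaps in the equivalent, well-known reduction test: a two-terminal DAG is SP exactly when repeated series reductions (removing a node with one predecessor and one successor) and parallel reductions (merging duplicate edges) shrink it to the single edge (s, t). The function name and the example graphs are hypothetical, and the non-SP example is not claimed to be the slide's figure.

```python
def is_series_parallel(edges, s, t):
    """Reduction test for two-terminal SP graphs (input assumed to be a DAG)."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)   # sets merge parallel edges
        pred.setdefault(v, set()).add(u)
        succ.setdefault(v, set())
        pred.setdefault(u, set())

    work = [v for v in succ if v not in (s, t)]
    while work:
        v = work.pop()
        if v not in succ or len(pred[v]) != 1 or len(succ[v]) != 1:
            continue
        (u,) = pred[v]
        (w,) = succ[v]
        # Series reduction: replace u -> v -> w by u -> w.
        succ[u].discard(v)
        pred[w].discard(v)
        del succ[v], pred[v]
        succ[u].add(w)                     # parallel reduction is implicit
        pred[w].add(u)
        work.extend(x for x in (u, w) if x not in (s, t))

    return set(succ) == {s, t} and succ[s] == {t} and not succ[t]


diamond = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
crossed = diamond + [("a", "b")]   # the extra edge breaks the SP structure
print(is_series_parallel(diamond, "s", "t"))
print(is_series_parallel(crossed, "s", "t"))
```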
Series-Parallel Graphs (SP-graphs)
SP graphs are the workflow equivalent of structured programming (without iteration).
Many workflows encountered in practice are SP graphs and do not allow looping.
[Figure: the example workflow (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees, t) is an SP graph.]
Contributions
SP graphs. Optimum clustering: YES (by an O(n)-time algorithm). Upper and lower bound on #clusters: 2k - 3.
General graphs. Optimum clustering: ? Upper bound on #clusters: $(2^{k-1} - k)^2 + k$ (obtained by analyzing the #clusters output by [BCD+08]). Lower bound on #clusters: $(2^{k-1} - k)^2 + k$.
Moreover, we express the global conditions for a good user-view in terms of local conditions for each cluster, for general graphs (useful when k << n).
Algorithm SP-View: Forward pass
Process the vertices in a topological order.
If the vertex is an R-node: do nothing.
If the vertex is an NR-node:
  if single R-predecessor: merge
  if >= 1 NR-predecessor: merge with the last predecessor
  else: do nothing
The forward pass produces an intermediate clustering.
[Figure: example SP graph from s to t.]
Algorithm SP-View: Reverse pass
Take the intermediate clustering produced by the forward pass as input.
Produce a reverse topological order on the clusters.
Perform on the clusters a procedure symmetric to the one used in the forward pass.
In the example, 16 modules are reduced to 10 clusters. Cannot do better than 10 (k = 9)!
Running time: O(m + n) = O(n).
[Figure: the example graph from s to t with clusters labelled C1 to C13.]
Correctness
Proved by induction on each intermediate step of cluster formation.
Base case: any workflow specification is a good user-view.
In each step, we preserve two invariants: the SP property holds, and we have a good user-view.
The proof uses the equivalent local conditions for clusters and the forbidden-subgraph characterization of two-terminal SP graphs [VTL79].
Upper Bound
#clusters <= 2k - 3. Here we show a weaker bound: 2k - 1.
Each surviving NR-cluster has at least one unique R-predecessor as a witness.
t is no one's predecessor!
Hence #clusters <= k + (k - 1) = 2k - 1.
[Figure: example SP graph from s to t.]
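Written out, the counting argument for the weaker bound reads as follows. The symbols N_1 and N_2 (the numbers of R-clusters and NR-clusters) follow the notation of the Optimality slide, and the step N_1 = k assumes that the view is well-formed, so that every R-node sits in its own R-cluster.

```latex
\begin{align*}
N_1 &= k
  && \text{(one R-cluster per R-node)} \\
N_2 &\le |R \setminus \{t\}| = k - 1
  && \text{(distinct witnesses; $t$ is never a predecessor)} \\
\#\text{clusters} = N_1 + N_2 &\le k + (k - 1) = 2k - 1 .
\end{align*}
```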
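Lower Bound
[Figure: an SP graph from s to t with R-nodes r_0, r_1, r_2, ..., r_{k-3}, r_{k-2}, r_{k-1} (s and t are among them) and NR-nodes p_1, p_2, ..., p_{k-4}, p_{k-3}.]
#nodes = k + (k - 3) = 2k - 3.
No two nodes can be merged in any good user-view.
Optimum #clusters = 2k - 3.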
Optimality
Outline of the steps:
Suppose SP-View outputs N1 R-clusters and N2 NR-clusters, so the total #clusters = N1 + N2.
N1 = k, which cannot be reduced.
Each NR-cluster contains one essential NR-node that cannot be included in any R-cluster.
If two essential NR-nodes are put in different clusters by SP-View, no good user-view can put them in the same cluster.
Hence any good user-view has at least N2 NR-clusters.
Other Results (General Graphs)
Upper bound on the number of clusters: we show that the algorithm in [BCD+08] produces at most $(2^{k-1} - k)^2 + k$ clusters. This is independent of the total number of nodes n.
Tight lower bound: we show that there exists a graph that needs $(2^{k-1} - k)^2 + k$ clusters in any good user-view.
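To get a feel for how far apart the two regimes are, the bounds can be tabulated for small k. The slide's formula lost its superscripts in extraction; the sketch below assumes the reading (2^(k-1) - k)^2 + k for general graphs, alongside 2k - 3 for SP graphs, and starts at k = 3. The function names are hypothetical.

```python
def sp_bound(k):
    """Tight bound on #clusters for series-parallel graphs: 2k - 3."""
    return 2 * k - 3


def general_bound(k):
    """Assumed reading of the general-graph bound: (2^(k-1) - k)^2 + k."""
    return (2 ** (k - 1) - k) ** 2 + k


for k in (3, 4, 5, 10):
    print(k, sp_bound(k), general_bound(k))
```

Both quantities depend only on k, but the general-graph bound grows exponentially in k while the SP bound stays linear.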
Open Problems
Can we solve the optimization problem on general directed graphs? Is it NP-complete?
Can we get a constant-factor approximation to the optimum solution?
Can we extend our algorithm to handle a larger class of directed graphs?