
1 From Models of Computation (MoCs) to Models of Provenance (MoPs) Bertram Ludäscher Dept. of Computer Science & Genome Center University of California, Davis ludaesch@ucdavis.edu UC DAVIS Department of Computer Science

2 From MoCs to MoPs, B. Ludäscher eScience Theme 9, Oct 13-17, 2008, SLC
Pop Quiz Time!!
How does this execute? [Figure: actors A, B, C, D]
It depends …
  –DAG(man)
  –SDF
  –PN
  –DDF
  –COMAD
  –Petri-Net: actors = transitions, channels = places
Different MoCs
  –different programming languages
Different features:
  –Data-aware vs. -agnostic
  –Parallelism: task-parallel, pipeline-parallel, streaming pipeline-parallel, data-parallel
  –Loops? Data transport, control-flow, time, …

3 Objectives, Goals
Better understand different notions of provenance
  –… and workflow … and how they relate
Cross-fertilize between
  –CS subdisciplines (databases, workflows, BPM, PL, concurrency, …)
  –basic research and applications
    new research problems, informed by apps (but not only)
    impact apps by knowledge transfer from basic research
On our own behalf …
  –Curiosity/fun-driven research
    Petri-net vs Kahn, COMAD vs Taverna, XML streaming vs NRC, …
  –A subset (B union C); Hamming; partial data-structures, …
  –… in addition to (or thereby) making the world better

4 So many models, so little time
types of users, roles use cases queries what they want to do Wf models, MoCs, MoPs expressiveness, complexity

5 Notions, Terminology
(Scientific) Workflow
  –A program? A specification? A partial one?
  –cannot be properly defined!? (cf. family resemblance in classification)
Family resemblance (German Familienähnlichkeit [1]) is a philosophical idea proposed by Ludwig Wittgenstein, with the most well known exposition being given in the posthumously published book Philosophical Investigations (1953) [2]. The idea itself takes its name from Wittgenstein's metaphorical description of a type of relationship he argued was exhibited by language.[3] Wittgenstein's point was that things which may be thought to be connected by one essential common feature may in fact be connected by a series of overlapping similarities, where no one feature is common to all. Games, which Wittgenstein used to explain the notion, have become the paradigmatic example of a group that is related by family resemblances.
In classification theory: polythetic vs. monothetic

6 Notions, Terminology
Model of Computation (MoC)
  –Takes a workflow W and a domain / director / model of computation M
  –Then for any input x, defines what y = M(W, x) is
  –Implies a set of observables
Run
  –Representation of an execution in terms of basic observables, i.e., those implied by the MoC
Trace
  –Representation (approximation) of an execution in terms of relevant observables (for a use case, query)
Model of Provenance (MoP)
  –… make this precise (maybe for some MoCs)
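The relation y = M(W, x) and the notion of a run can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the workflow W is a plain linear chain of (name, function) pairs, the "director" `run_workflow` simply fires the actors in order, and the observables are hypothetical `read` and `write` events. No real workflow engine is modeled.

```python
def run_workflow(workflow, x):
    """Toy MoC M for a linear chain of actors: returns (y, run).

    The run is the list of basic observables this toy MoC implies:
    one read and one write event per actor firing.
    """
    run = []
    value = x
    for name, actor in workflow:
        run.append(("read", name, value))
        value = actor(value)
        run.append(("write", name, value))
    return value, run


# W: two hypothetical actors chained together.
W = [("increment", lambda v: v + 1), ("double", lambda v: v * 2)]
y, R = run_workflow(W, 3)
```

A different director over the same W (for example one that pipelines or reorders firings) could produce the same y with a different run, which is why the set of observables depends on the MoC.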

7 Different types of Provenance
Data provenance:
  –lineage, data dependencies
Execution provenance:
  –other runtime observables
Querying lineage vs querying the execution
Workflow evolution provenance
  –VisTrails
Provenance is more important than the results!
The Selfish P-Assertion / Selfish Provenance Graph! (cf. The Selfish Gene)

8 Provenance Uses for the Domain Scientists
Query the lineage of a data product
  –from what data was this computed?
  –real dependencies please!!!
Evaluate the results of a workflow
  –do I like how this result was computed?
Reuse data products of one workflow run in another
  –(re-)attach prior data products to a new workflow
Archive scientific results in a repository
Replicate the results reported by another researcher
Discover all results derived from a given dataset
  –… i.e. across all runs
Explain unexpected results
  –… via parameter-, dataset-, object-dependencies in the scientist's terms (yes, you may substitute ontology here …)

9 Provenance for the WF Engineer / Plumber
A Workflow Engineer's View
  –Monitor, benchmark, and optimize workflow performance
  –Record resource usage for a workflow execution
  –Smart re-run of (variants of) previous executions
  –Checkpointing & restart (e.g. for crash recovery, load balancing)
  –Debug or troubleshoot a workflow run
  –Explain when, where, why a workflow crashed

10 And the right level of modeling is …
Common approach: Let's record everything
  –What does that mean???
Say your workflow is implemented in Kepler:
  –Workflow invocation + input + output
  –Actor invocation + input + output
  –Everything that has been written to (read from) a port
  –Something else?
And what about a trace of the JVM instructions? … the assembly level instructions? … firmware code? … signals
What are the observables?

11 Summary: what people do with provenance
Result validation (different in: science vs workflow logic)
Result debugging (science vs wf logic)
Reproducibility
Repeatability
Explanation (derivations, traces, proof trees)
Runtime monitoring
  –Profiling, benchmarking
Performance Optimization (smart re-run)
Fault-tolerance, crash-recovery
Workflow design
QUICK DEMO

12 Kepler/pPOD
workflows new director data types, collections assembly-line processing provenance enabled actor library Cipres web services local applications format conversion GUI components workspace extension access to workflows access to run traces

13 Kepler/pPOD Provenance Browser
Reusable widgets for viewing different aspects of a trace
Move forward and backward through execution
Data dependencies, collection structure, actor invocations

14 Kepler/pPOD Provenance Browser
Collection and invocation view
Incrementally step through execution history
Actor invocation graph shows pipelining, implicit branches

15 Complex SDM/CPES workflow in Kepler
  –50+ composite actors (subworkflows)
  –4 levels of hierarchy
  –1000+ atomic (Java) actors
  –Model of Computation: Dataflow (~Kahn-PN)
    Task parallel
    pipeline parallel (streaming!)
[Figure: subworkflows labeled with actor counts and hierarchy levels: 43 actors, 3 levels; 196 actors, 4 levels; 30 actors; 206 actors, 4 levels; 137 actors; 33 actors; 150; 123 actors; 66 actors; 12 actors; 243 actors, 4 levels]
Source: Norbert Podhorszki (UC Davis --> ORNL)

16 Workflow Framework: another MoC
Provenance, Tracking & Meta-Data (DBs and Portals)
Control Plane (light data flows)
Execution Plane (Heavy Lifting Computations and flows)
Synchronous or Asynchronous
Kepler

17 Streaming tokens & dealing with failure
[Figure: tokens 1, 2, 3 streaming through the stages transfer, convert, arch; labels: transfer 1, convert 1, arch 1, failed 2, transfer 3, convert 3, arch 3]
Source: Norbert Podhorszki (UC Davis --> ORNL)

18 After a restart…
[Figure: tokens 1, 2, 3 streaming through the same stages; labels: skip 1, transfer 2, skip 1, convert 2, skip 1, arch 2, skip 3]
Source: Norbert Podhorszki (UC Davis --> ORNL)
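The failure and restart behavior on these two slides can be approximated by the sketch below. It is only an assumed mechanism, not the actual Kepler implementation: completed (stage, token) steps are remembered in a set named `done` (standing in for recorded provenance), a simulated failure is injected through the hypothetical `fail_on` parameter, and a restart skips every step already recorded.

```python
def run_pipeline(tokens, stages, done, fail_on=None):
    """Stream tokens through stages, skipping steps recorded in `done`.

    `done` is a set of (stage, token) pairs and is updated in place.
    `fail_on` is an optional (stage, token) pair that simulates a failure;
    the failed token is abandoned and the stream continues.
    """
    log = []
    for token in tokens:
        for stage in stages:
            if (stage, token) in done:
                log.append(f"skip {stage} {token}")
                continue
            if (stage, token) == fail_on:
                log.append(f"failed {stage} {token}")
                break
            done.add((stage, token))
            log.append(f"{stage} {token}")
    return log


stages = ["transfer", "convert", "arch"]
done = set()
first_run = run_pipeline([1, 2, 3], stages, done, fail_on=("transfer", 2))
restart = run_pipeline([1, 2, 3], stages, done)
```

In the restart only token 2 is actually processed; tokens 1 and 3 are skipped at every stage because their steps are already in `done`.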

19 Kepler + Process Central (Execution monitoring)
Faraaz Sareshwala (ECS-199 project)

20 Kepler + Weka (Data Mining Package)
Peter Reutemann, University of Waikato, NZ

21 Kepler Flex Client
Christopher Tuot, DFKI, Germany

22 Kepler on the Web
Tristan King, James Cook University, Australia

23 Taverna, MyExperiment

24 Types of Data Provenance
Black-box
  –know (next to) nothing at compile-time
  –at runtime: keep data lineage: R/W/fire observables
Grey-box
  1. can look inside (inside some black boxes)
  2. … or FP signatures: A :: t1, t2 → t3, t4
  3. … or semantic annotations (sem. types)
  4. … or dependency signatures!
     e.g. subworkflows, COMAD!
White-box
  –statically (compile-time) analyzable
  –v(P1*P2, X,Z) :- r(P1, X,_,Z), r(P2, _,_,Z).
  –most database work seems to fall here
[Figure: actor A with port types t1, t2, t3, t4 and tokens X1, X2, Y1, Y2]
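The white-box rule v(P1*P2, X,Z) :- r(P1, X,_,Z), r(P2, _,_,Z) can be read as a join that carries a provenance annotation along with each derived fact. The sketch below evaluates that single rule in Python; representing annotations as strings and the relation r as a list of (annotation, x, y, z) tuples is an assumption made for illustration.

```python
def derive_v(r):
    """Evaluate v(P1*P2, X, Z) :- r(P1, X, _, Z), r(P2, _, _, Z).

    Each r fact is (annotation, x, y, z). Every derivation of v is
    annotated with the product of the annotations of the two r facts used.
    """
    return [
        (f"{p1}*{p2}", x, z1)
        for (p1, x, _y1, z1) in r
        for (p2, _x2, _y2, z2) in r
        if z1 == z2
    ]


r = [("p", "a", "b", "c"), ("q", "d", "e", "c")]
v = derive_v(r)
```

Because the rule is known statically, the annotation of every v fact can be read off from the rule itself, which is what makes this case analyzable at compile-time.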

25 Different kinds of scenarios, problems
Given database D, output y = Q(D)
  –… find Q' such that Q'(y) yields the part of D on which y depends
Given a runtime recording / trace:
  –… query for lineage (scientist), performance (engineer), …
  –… and a modification, do a smart-rerun
  –… and a crash, do a recovery
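Querying a recorded trace for lineage amounts to a reachability query over dependency edges. A minimal sketch, assuming the trace has already been reduced to a hypothetical dictionary `deps` that maps each data item to the items it was directly derived from:

```python
def lineage(deps, item):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in deps.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical trace: y was computed from a and b, which came from x1 and x2.
deps = {"y": {"a", "b"}, "a": {"x1"}, "b": {"x1", "x2"}}
upstream = lineage(deps, "y")
```

Inverting the edges of `deps` and running the same traversal would answer the complementary question of which results were derived from a given dataset.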

26 From MoC to MoP via Observables
Model of Computation MoC M
  –specification/algorithm to compute o = M(W,P,i)
  –a director or scheduler implements M
  –gives rise to formal notions of computation (aka run) R; typically tree models
  –Formalisms to define M? Via a meta-interpreter?
Model of Provenance MoP M'
  –approximation M' of M
  –a trace T approximates a run R by inclusion/exclusion of observables
  –T = R – Ignored-observables (Ignorables) + Model-observables
Observables (of a MoC M)
  –functional observables (may influence output o)
    token rate, notions of firing, …
  –non-functional observables (not part of M, do not influence o)
    token timestamp, size, … (unless the MoC cares about those)
  –Actors should not be able to observe anything! Race conditions via arrival times…
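The equation T = R – Ignorables + Model-observables can be sketched directly on event lists. The event format (kind, actor, payload) and the example `dep` observable are assumptions chosen for illustration; the slide does not fix a concrete representation.

```python
def make_trace(run, ignorables, model_observables):
    """T = R - Ignorables + Model-observables, keeping the order of R."""
    kept = [event for event in run if event not in ignorables]
    added = [obs for obs in model_observables if obs not in kept]
    return kept + added


run = [("fire", "A", 1), ("read", "A", "x1"), ("write", "A", "y1")]
ignorables = {("fire", "A", 1)}               # not relevant for this use case
model_observables = [("dep", "y1", "x1")]     # asserted by the provenance model
T = make_trace(run, ignorables, model_observables)
```

Here the firing event is dropped and a dependency edge that was never directly observed at runtime is added, so the trace is an approximation of the run rather than a copy of it.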

27 Models of Computation (A WF Engineer's Issue)
Directors separate orchestration/scheduling concerns from conceptual design
  –Synchronous Dataflow (SDF)
    Statically analyzable: schedule, no deadlocks, fixed buffer requirements; executable as a single thread by the director.
  –Process Networks (PN)
    Generalizes SDF. Actors execute as separate threads/processes, with queues of unbounded size (Kahn/MacQueen networks).
  –Directed Acyclic Graph (DAG)
    Special case of SDF. No loops, no pipelining, no state (one invocation per actor)
  –Continuous Time (CT)
    Connections represent the value of a continuous time signal at some point in time... Often used to model physical processes.
  –Discrete Event (DE)
    Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems.
  –…
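What "statically analyzable" means for SDF can be made concrete with the standard balance equations: for every channel, firings of the producer times tokens produced must equal firings of the consumer times tokens consumed. The sketch below solves these equations for a connected graph and returns the smallest integer firing counts per schedule period. The function name and the edge format are assumptions for illustration, and deadlock analysis is not covered.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations for a connected graph.

    edges: list of (producer, tokens_produced, consumer, tokens_consumed).
    Returns the smallest positive integer firing count per actor, or raises
    ValueError if the rates are inconsistent or the graph is disconnected.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, produced, dst, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no periodic schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    divisor = reduce(gcd, counts.values())
    return {actor: c // divisor for actor, c in counts.items()}


# A produces 2 tokens per firing, B consumes 3 and produces 1, C consumes 2.
q = repetition_vector(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)])
```

With these counts every channel is balanced over one period, which is why buffer sizes are bounded and a schedule can be fixed before execution.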

28 Language & Abstractions; Modeling & Design
Vanilla Process Network
Functional Programming Dataflow Network
XML Transformation Network
Collection-oriented Modeling & Design (COMAD)
The limitations of my modeling / wf language are the limitations of my design world. – BL

29 Automatic Iteration in Kahn Networks
Given f: x → y and an input stream [x1, x2, x3, …], the Kahn process is F: [x1, x2, x3, …] → [f(x1), f(x2), f(x3), …]
  –i.e., big F is a kind of stream-map of some small f
  –Kahn doesn't talk about the little f
  –Dennis Dataflow does (via firing rules)
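The step from little f to big F can be sketched with Python generators, which behave like lazy token streams. The helper name `lift` is an assumption for illustration; it wraps a per-token function f into a stream process F that works on unbounded input.

```python
from itertools import count, islice


def lift(f):
    """Turn a per-token function f into a stream process F = map(f)."""
    def F(stream):
        for x in stream:
            yield f(x)
    return F


square_stream = lift(lambda x: x * x)
# count(1) is an infinite input stream 1, 2, 3, …; only a prefix is consumed.
first_four = list(islice(square_stream(count(1)), 4))
```

F emits an output prefix for every input prefix it has seen, which matches the prefix-monotone behavior expected of a Kahn process.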

30 Kahn Processes over streams: F monotone
Edward A. Lee and Eleftherios Matsikoudis, "The Semantics of Dataflow with Firing," Chapter in From Semantics to Computer Science: Essays in memory of Gilles Kahn. Gérard Huet, Gordon Plotkin, Jean-Jacques Lévy, Yves Bertot, editors, Preprint Version, March 07, 2008, Copyright (c) Cambridge University Press, 2008.

31 Kahn Processes over streams: F continuous

32 Dataflow WITH Firing

33 Dataflow processes: From little f to big F …

34 Dataflow with Firing: Dennis dataflow

35 Synchronous Dataflow (SDF)
Source: Edward Lee http://ptolemy.eecs.berkeley.edu/

36 Let's talk about observables …
DAG model: … → [A] → [B] → …
  –Observables: start(job)@time, finish(job)@time
  –Correctness criterion: start(B) > finish(A)
Variants of PN …
  –{x} → [A] → {y}
  –Observable: set of reads, set of writes
  –y_j may depend on a subset of the x's
  –but no x_j depends on any y_i
  –{x1@t1, x2@t2, …} → [A] → {y1@t1, y2@t2, …}
  –Can draw inference: write y_i@t may depend on all x_j read prior to t
More informed models:
  –Actor signatures (compile-time static analysis)
  –Actor assertions (runtime richer provenance graph)
Special case: RWS
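The inference for timestamped reads and writes can be written down in a few lines. This is a sketch under the assumption that each observable is a (token, time) pair for a single actor; it computes the conservative "may depend on" relation, not the real dependencies.

```python
def may_depend(reads, writes):
    """Map each written token to the tokens read strictly before it."""
    return {
        y: [x for x, read_time in reads if read_time < write_time]
        for y, write_time in writes
    }


reads = [("x1", 1), ("x2", 3)]
writes = [("y1", 2), ("y2", 4)]
candidates = may_depend(reads, writes)
```

The result over-approximates: y2 is reported as possibly depending on x1 even if the actor is stateless, which is where actor signatures or assertions can sharpen the graph.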

37 Real Data Dependencies over token streams
Stateless actors, firing on each token
  –[ x1, x2, x3, … ] → F → [ f(x1), f(x2), f(x3), … ]
  –Generates dependencies: x_i → f_i → y_i
  –Note: F = map(f)
Stateful actor, firing on each token
  –[ x1, x2, x3, … ] → F → [ f(x1), f(x1,x2), f(x1,x2,x3), … ]
  –Kahn-MacQueen Process Networks: prefix-monotonic, deterministic computations
  –Generates dependencies (here): x_j → f_i → y_i, for all j ≤ i
[Figure: actor F with input queue […|x3|x2|x1] and output queue […|y3|y2|y1]]
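The two dependency patterns can be generated alongside the outputs. The sketch below is illustrative only: the token labels `x1`, `y1`, … and both helper functions are assumptions, and the stateful actor is modeled as a function applied to the whole input prefix seen so far.

```python
def run_stateless(f, xs):
    """F = map(f): output y_i depends only on input x_i."""
    ys, deps = [], []
    for i, x in enumerate(xs, start=1):
        ys.append(f(x))
        deps.append((f"x{i}", f"y{i}"))
    return ys, deps


def run_stateful(f, xs):
    """y_i = f(x_1, …, x_i): output y_i depends on every x_j with j <= i."""
    ys, deps, prefix = [], [], []
    for i, x in enumerate(xs, start=1):
        prefix.append(x)
        ys.append(f(prefix))
        deps.extend((f"x{j}", f"y{i}") for j in range(1, i + 1))
    return ys, deps


scaled, scaled_deps = run_stateless(lambda x: x * 10, [1, 2, 3])
sums, sum_deps = run_stateful(sum, [1, 2, 3])
```

For n tokens the stateless actor yields n dependency edges, while the stateful one yields n(n+1)/2, which is the gap between real and inferred dependencies that finer provenance models aim to close.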

38 Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis

39 … Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis

40 COMAD: Virtual Assembly Lines
Actors select parts of token stream, forward rest
Special tokens denote collections, metadata, & parameters
Actors insert tokens into and remove tokens from stream
Some advantages of COMAD:
  –workflows with loops, branches, composition (subworkflows)
  –concurrency, pipelining (streaming)
  –resilient to change (data nesting, add/remove actors)
  –simpler workflow designs
[Figure: token stream with nested collections (Proj, Seqs, Aligns, Trees) and data tokens S1 … S10, A1, A2, T1 … T5, T6 passing a "Compute Consensus" actor]
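The assembly-line idea can be sketched as a generator that forwards every token it does not care about and inserts a new token into the collection it selects. The token encoding (`open`, `close`, `data` tuples), the scope name and the `join_labels` placeholder computation are assumptions made for illustration, not the actual COMAD implementation; nested collections with the same name are not handled.

```python
def join_labels(items):
    """Placeholder for a real consensus computation."""
    return "consensus(" + ",".join(items) + ")"


def consensus_actor(stream, scope="Trees", combine=join_labels):
    """Forward all tokens; add one combined token to each `scope` collection."""
    inside, collected = False, []
    for token in stream:
        if token == ("open", scope):
            inside, collected = True, []
            yield token
        elif token == ("close", scope):
            if collected:
                yield ("data", combine(collected))  # inserted token
            inside = False
            yield token
        else:
            if inside and token[0] == "data":
                collected.append(token[1])
            yield token  # forward unchanged


stream = [
    ("open", "Seqs"), ("data", "S1"), ("close", "Seqs"),
    ("open", "Trees"), ("data", "T1"), ("data", "T2"), ("close", "Trees"),
]
output = list(consensus_actor(stream))
```

Because the actor ignores everything outside its scope, adding a new collection type upstream or nesting the data one level deeper does not require rewiring, which is the change-resilience the slide refers to.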

41 Input Change-Resilience (nested data types)
S. Bowers, Daniel Zinn (UC Davis)

42 Optimizing COMAD: User- vs. System View
Daniel Zinn (UC Davis)

43 X-CSR (XML Scissor): Cut-Ship-Reassemble
Daniel Zinn, Shawn Bowers, Timothy McPhillips, Bertram Ludaescher (UC Davis), ICDE 2009

44 What we (will) get: Change-Resilience
[Figure: "Original" vs. "Automatic Configuration" workflows; labels: A, B, C, S, R, W, X, +X, ?]
Infer Configuration X of X
Daniel Zinn (UC Davis)

45 (Scientific) Workflow Modeling Paradigms & MoCs
Vanilla Process Network
Functional Programming Dataflow Network
XML Transformation Network
Collection-oriented Modeling & Design framework (COMAD)

46 Collection-Oriented Modeling & Design (COMAD)

47 Implicit iteration – compare Kahn vs COMAD

