Workflow Provenance Bill Howe
Bill Howe, eScience Institute 11/12/2018
Comparison (columns: Data Model; Prog. Model; Services)
GPL: *; typing (maybe)
Workflow: dataflow; typing, provenance, Pegasus-style resource mapping, task parallelism
Relational Algebra: Relations; Select, Project, Join, Aggregate, …; optimization, physical data independence, data parallelism
MapReduce: [(key,value)]; Map, Reduce; massive data parallelism, fault tolerance
MS Dryad: IQueryable, IEnumerable; RA + Apply + Partitioning; typing, massive data parallelism, fault tolerance
MPI: Arrays/Matrices; 70+ ops; data parallelism, full control
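The MapReduce row of the comparison can be made concrete with a minimal sketch. It assumes an in-memory, single-process stand-in for the model, so it shows only the [(key,value)] data model and the Map and Reduce operations, not the parallelism or fault tolerance. The names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    # Map phase: each input record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: each key's values are combined independently of other keys,
    # which is what lets a real system run the reducers in parallel.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in a line of text.
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    # Sum the counts collected for one word.
    return sum(values)


counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
print(counts)
```

In relational algebra terms, the same result is an Aggregate (a count grouped by word) over a one-column relation of words.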
What is Provenance? src: David Holland
Example src: David Holland
An Example
1. Agent messages are recorded as interactions, either by the agents or by the agent platform.
2. Agents record the internal relationships between inputs and outputs, plus extra meaningful information.
If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages.
[Figure: agents sending messages and recording them in the PROVENANCE Store. Actors: Transplant Unit Interface Agent; OTM Donor Data Collector Agent; Test Lab. Interface Agent; EHCR Hospital A; EHCR Hospital B. Messages: TU.1 Data Collection request; OTM.1 Donor Data request; HC.1 Patient Data request; HC.2 Patient Data; OTM.2 Donor Data; TU.2 Serology Test request; OTM.3 Serology test request; OTM.4 Serology test result + report; TU.3 Brain Death Notification + report; TU.4 Decision request; TU.5 Decision + report]
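The two kinds of assertion in this example can be sketched as a small in-memory store. This is a minimal sketch under stated assumptions, not the design of any particular provenance system: the class `ProvenanceStore` and its methods are hypothetical, and the sender, receiver, and "caused by" link used below are assumed for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    # message id -> (sender, receiver, content)
    interactions: dict = field(default_factory=dict)
    # list of (effect id, relation, cause id)
    relationships: list = field(default_factory=list)

    def record_interaction(self, msg_id, sender, receiver, content):
        """Step 1: record that a message was exchanged."""
        self.interactions[msg_id] = (sender, receiver, content)

    def record_relationship(self, effect, relation, cause):
        """Step 2: record how one message depends on another."""
        # Only relate messages that were actually recorded as interactions.
        for msg_id in (effect, cause):
            if msg_id not in self.interactions:
                raise KeyError(f"unknown message: {msg_id}")
        self.relationships.append((effect, relation, cause))

    def causes_of(self, msg_id):
        """Return the (relation, cause id) pairs asserted for one message."""
        return [(rel, cause) for effect, rel, cause in self.relationships
                if effect == msg_id]


store = ProvenanceStore()
# Sender and receiver below are assumptions made for this sketch.
store.record_interaction("TU.1", "Transplant Unit Interface Agent",
                         "OTM Donor Data Collector Agent", "Data Collection request")
store.record_interaction("OTM.1", "OTM Donor Data Collector Agent",
                         "EHCR Hospital A", "Donor Data request")
# Assumed relationship: the donor data request was caused by the collection request.
store.record_relationship("OTM.1", "caused by", "TU.1")

print(store.causes_of("OTM.1"))
print(store.causes_of("TU.1"))
```

If an agent only records interactions and never calls `record_relationship`, `causes_of` returns an empty list for every message, which is the black-box situation described above.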
Which is the basis for donation decision D?
[Figure: provenance graph linking the recorded messages. Edge labels: caused by; response to; based on; justified by; contains parts of; authored by; is logged in. Nodes: Patient Data Request HC.1; Hospital B HC.2; Data Collection Request TU.1; Donor OTM.1; Donor Data OTM.2; Serology Test Request TU.2; Serology Test Request OTM.3; Serology Test Result OTM.4; Serology report OTM.4; Brain Death Notification TU.3; Brain Death report TU.3; Decision Request TU.4; Donation Decision TU.5; Decision report TU.5; Users W, X, Y, Z; Authors A, B, C]
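A question such as "which is the basis for the donation decision?" amounts to a transitive traversal over recorded dependencies. The sketch below assumes the dependencies are available as (effect, relation, cause) triples. The edge list is an assumption chosen for illustration, not a transcription of the graph on the slide, and `basis_of` is a hypothetical helper.

```python
from collections import deque

# (effect, relation, cause) triples; these specific edges are illustrative assumptions.
EDGES = [
    ("TU.5", "response to", "TU.4"),
    ("TU.5", "based on", "TU.3"),
    ("TU.5", "based on", "OTM.2"),
    ("TU.5", "based on", "OTM.4"),
    ("OTM.4", "response to", "OTM.3"),
    ("OTM.3", "caused by", "TU.2"),
    ("OTM.2", "response to", "OTM.1"),
    ("OTM.1", "caused by", "TU.1"),
]


def basis_of(target, edges):
    """Return every message id that target depends on, directly or indirectly."""
    # Index the direct causes of each effect.
    direct = {}
    for effect, _relation, cause in edges:
        direct.setdefault(effect, []).append(cause)

    # Breadth-first walk from the target back through its causes.
    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for cause in direct.get(node, []):
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    return seen


print(sorted(basis_of("TU.5", EDGES)))
```

The traversal ignores the relation label. A query that follows only "based on" or "justified by" edges would filter `edges` before building the index.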
Use cases: Data Quality; Audit Trail; Replication Recipes; Attribution; Informational/Communication; What else?
Research Questions
Provenance Taxonomy
Types of Provenance, Redux
Data Provenance: metadata + history of a data object
Workflow Provenance: metadata + history of the workflow itself; source control
COMAD: Collection-oriented Modeling and Design (Susan Davidson, UPenn)
Workflows may exhibit assembly line semantics: open and close interleaved “read scopes” and “write scopes”
Provenance Aware Storage System (David Holland, Harvard)
PASS Architecture: Prov. and Storage Layer
VisTrails demo
Other Provenance Systems: Pegasus/Wings, ZOOM, ES3, SDG, Karma, JP, Mindswap, Redux, RWS, NCSCI, USC/ISI, OPA, VDL, MyGrid
Open Provenance Challenge
2006, First: compare expressiveness of provenance systems
2007, Second: interoperability and exchange
2008, Third: evaluation of the Open Provenance Model
2010, Fourth and last: apply the Open Provenance Model to a broad end-to-end scenario, and demonstrate novel functionality that can only be achieved by the presence of an interoperable solution for provenance
First Open Provenance Challenge
Challenge Workflow
Challenge Queries
Challenge Queries (2)
Categorization of Provenance Systems: Execution Environment; Representation Technology (SQL, RDF, etc.); Query Language; Research Emphasis (Execution, Recording, Storing, Querying)
Categorization (2): Includes WF Representation; Data Derivation vs. Causal Events (“Nouns” or “Verbs”); Annotations; Time; Naming; Tracked Data, Granularity (files, collections, bytes, tuples); Abstraction Mechanisms (functions, etc.)
Results
Results
Results