Using explicit control processes in distributed workflows to gather provenance Sergio M. S. Cruz Fernando Seabra Chirigati Rafael Dahis Maria Luiza M.

1 Using explicit control processes in distributed workflows to gather provenance
Sergio M. S. Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso
Federal University of Rio de Janeiro (UFRJ), Brazil

2 Agenda
- Introduction
- Motivation: control flow in data-centric workflows
- Objective: provenance gathering in distributed workflows with explicit control flows
- Use case: control flow on VisTrails
- Conclusion

3 Distribution & Heterogeneity in Workflows
Scientific workflows enable data-intensive analyses:
- Use of grids vs. remote parallel machines
- Use of different WfMS
  - Different provenance capture mechanisms
- Use of centralized vs. distributed WfMS
  - Often offer disjoint sets of capabilities
How to obtain a homogeneous provenance representation and capture mechanism?

4 Control flow matters in data-centric workflows
- Scientific workflows also need control structures to specify how the data flow should be directed
- Goderis et al. [6] stress the importance of combining different models of computation in one scientific workflow
- Bowers et al. [5] say that “modeling control-flow using only dataflow constructs can quickly lead to overly complex workflows that are hard to understand, reuse, reconfigure, maintain, and schedule”
- Tudruj et al. [7] state the importance of general dynamic control flow, but focus on synchronization of parallel execution
  - They presented a set of generic control structures and proposed the use of a monitoring middleware

5 A real example: OrthoSearch workflow
Detects distant homologies in five parasites associated with neglected tropical diseases

6 OrthoSearch specification in Kepler
[Figure: workflow diagram with the modules FormatDB, BLAST, Best Hits Finder, MAFFT/HMMER packages, InterPRO]
- Some lightweight tasks can run locally
- Time-consuming tasks: suppose we need to execute MAFFT/HMMER in a high-performance environment
- Just send it to a grid!

7 OrthoSearch - loops, choice, …
[Figure: workflow diagram with the modules FormatDB, BLAST, Best Hits Finder, MAFFT/HMMER packages, InterPRO]
How to map this to the grid language?

8 OrthoSearch - loops, choice, …
[Figure: workflow diagram labeled LOCAL, with the modules FormatDB, BLAST, Best Hits Finder, MAFFT/HMMER packages, InterPRO]
- Alternatively, send one job at a time to execute remotely
- Can be very inefficient!

9 OrthoSearch - loops, choice, …
- Rewrite this in the grid language, e.g. Triana, which supports loops!
- But how to bring provenance data back to Kepler?
- How to register loop iterations?

10 OrthoSearch - loops, choice: other issues
- What if my available grid does not have a WfMS?
- What if my available grid supports another WfMS?
- What if the grid WfMS does not support loops?
Generic control flow modules with remote provenance gathering!

11 Motivation
Workflow design
- Different WfMS present their own control structures, parallel execution models, etc.
  - Expose different modeling semantics to the users!
Provenance gathering
- WfMS register provenance in their own schema
- Often encompassing specific grid features
- Based on application domain attributes
A lot of mappings and conversions!
Many challenges in changing WfMS for the same workflow

12 Objective
Diminish the dependence of the workflow definition on the WfMS
- decoupling the provenance gathering system from the WfMS
- keeping some control flow of execution independent of the WfMS workflow specification language
Plug control flow and provenance gathering modules in alongside the original workflow tasks
- the workflow specification can be executed almost independently of the current WfMS
- provenance can be gathered uniformly

13 Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
Workflow Pattern                              Module
Structured Discriminator                      Mux
Exclusive Choice                              Demux
Deferred Choice                               String Control
Multiple Instances without synchronization    Number Control
Synchronization                               Number Compare
Exclusive Choice                              If

14 Scientific Workflow Control Flows
[Figure: OrthoSearch workflow annotated with "Implicit DECISION" and "Implicit LOOP". Elements shown: COGs DB, MAFFT, hmmbuild, fastacmd, formatdb, hmmsearch, hmmcalibrate, Ptn DB, Reciprocals Best Hits Finder, InterPRO, Reannotated genes, hmmpfam, HMMER, BLAST]

15 Scientific Workflow with Explicit Control Flows
[Figure: meta-workflow annotated with "Explicit LOOP" (initial condition, MUX, IF with T/F outputs) and "Explicit DECISION". Elements shown: MAFFT, hmmbuild, hmmsearch, hmmcalibrate, hmmpfam, HMMER]
- All these modules can be sent to execute in any HPC environment
- Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
- Meta-workflow: eases migration of a workflow from one WfMS to another!
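The explicit loop on this slide (initial condition, MUX, IF) can be illustrated with a minimal sketch. It is not the authors' implementation: `ProvenanceLog`, `explicit_loop`, and the record fields are hypothetical names chosen for illustration, and the log lives in memory only. The point is that once the loop is an explicit module, each iteration can be registered in one uniform record format, whichever environment runs the body.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory provenance store (a real one would persist records)."""
    records: List[dict] = field(default_factory=list)

    def record(self, module: str, iteration: int, inputs: Any, outputs: Any) -> None:
        self.records.append(
            {"module": module, "iteration": iteration, "inputs": inputs, "outputs": outputs}
        )


def explicit_loop(
    initial: Any,
    body: Callable[[Any], Any],
    keep_going: Callable[[Any], bool],
    log: ProvenanceLog,
    max_iterations: int = 100,
) -> Any:
    """Run `body` repeatedly, registering every iteration in `log`."""
    feedback = None
    result = initial
    for iteration in range(max_iterations):
        # MUX: take the initial condition on the first pass, the fed-back value afterwards.
        value = initial if iteration == 0 else feedback
        result = body(value)
        log.record("body", iteration, value, result)
        # IF: an explicit decision, also registered as provenance.
        decision = bool(keep_going(result))
        log.record("IF", iteration, result, decision)
        if not decision:
            break
        feedback = result
    return result
```

In this sketch, `body` stands in for the remotely executed sub-workflow; because the loop and the decision are ordinary modules, their records carry the iteration number that an implicit loop would hide.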

16 Control flow modules on VisTrails
All these control flow modules were made available on VisTrails
Advantages
- More explicit control is now available
- Remote execution can keep the specified control
- Remote execution can bring provenance data back to VisTrails with a compatible structure

17 OrthoSearch on VisTrails
[Figure annotations: "External LOOP (parameter exploration)", "Explicit DECISION"]
- All these inner modules (sub-workflow) can be sent to execute in a grid or HPC environment
- Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
- In VisTrails the loop could not be implemented because it is a DAG-based WfMS

18 Scientific Workflow - Heterogeneity
[Figure: OrthoSearch workflow annotated with "Time consuming". Elements shown: COGs DB, MAFFT, hmmbuild, fastacmd, formatdb, hmmsearch, hmmcalibrate, Ptn DB, Reciprocals Best Hits Finder, InterPRO, Reannotated genes, hmmpfam, HMMER, BLAST]

19 OrthoSearch on VisTrails
[Figure annotations: "REMOTE PARALLEL EXECUTION", "BLAST"]
- BLAST modules should be sent to execute on a PC cluster
- Provenance gathering mechanisms can be inserted in the control flow modules to be sent to the parallel environment
- In VisTrails this can be achieved using the MidMon modules

20 MidMon on VisTrails
- Monitoring tool that checks scientific processes running on distributed environments
- Message exchange-based tool
- Decoupled, with a modular infrastructure
- Support for legacy applications on distributed resources
[Figure: implementation, showing Data Modules, Control Modules, and BLAST]

21 Concluding
- We share the same motivation as Bowers et al., Goderis et al. and Tudruj et al.
- And the same as Groth et al.
[Callouts: "Using explicit control flow", "Provenance independent of a WfMS"]
We propose:
- A set of generic control-flow structures independent of the WfMS
Our implementation has shown that:
- Control-flow structures can allow generic sub-workflow remote execution
- Remote process provenance can be captured in the same representation as the workflow
- Workflow refactoring is facilitated
- Control-flow structures can be coupled to monitoring middleware

22 Conclusion
- Distribution & heterogeneity are inevitable in scientific workflows
- Adding control-flow modules to the scientific workflow specification can help execution by heterogeneous WfMS running on distributed environments
  - Acts as documentation of the workflow's execution control
  - Allows the activities of the workflow to be evaluated and monitored
  - Helps to gather provenance from heterogeneous and independent environments with low programming effort
- MidMon on top of VisTrails
  - Enables scientists to monitor the status of submitted jobs on their desktops
  - Preserves the workflows’ original features

23 Future work
- Use workflow views, e.g. ZOOM*
  - Our solution makes the workflow very verbose
- Use software component reuse and refactoring techniques to help the automatic incorporation of these modules
  - “Using Provenance to Improve Workflow Design”, Tosta et al.
- Work with other workflows from bioinformatics and the oil industry

24 Using explicit control processes in distributed workflows to gather provenance
Sergio M. S. da Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso
Federal University of Rio de Janeiro, Brazil
Thanks!

26 Scientific Workflow Control Flows: MUX (Structured Discriminator pattern)
Describes a convergence of two or more input ports into just one branch

27 Scientific Workflow Control Flows: DEMUX (Exclusive Choice pattern)
Represents an incoming branch that diverges into two or more outgoing branches. Just one of the outgoing branches is enabled, depending on an associated condition
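The Mux and Demux descriptions on slides 26 and 27 can be sketched as two small functions. This is an illustration under stated assumptions, not the VisTrails modules themselves: the names `mux` and `demux` are hypothetical, an idle port is modeled as `None`, the Demux condition is modeled as an integer selector, and "first port that carries a value" is only one possible reading of the convergence rule.

```python
from typing import Any, List, Optional, Sequence


def mux(inputs: Sequence[Optional[Any]]) -> Any:
    """Converge several input ports into one branch.

    Assumption: an idle port holds None, and the first active port wins.
    """
    for value in inputs:
        if value is not None:
            return value
    raise ValueError("no input port carries a value")


def demux(value: Any, selector: int, n_outputs: int) -> List[Optional[Any]]:
    """Diverge one incoming branch into n_outputs ports, enabling exactly one.

    Assumption: the condition has already been reduced to a port index.
    """
    if not 0 <= selector < n_outputs:
        raise ValueError("selector does not name an output port")
    outputs: List[Optional[Any]] = [None] * n_outputs
    outputs[selector] = value  # every other port stays disabled (None)
    return outputs
```

For example, `demux("hit.fasta", 1, 3)` enables only the second of three ports, and feeding that list to `mux` yields the original value again.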

28 Scientific Workflow Control Flows: STRING CONTROL (Deferred Choice pattern)
The workflow is divided into two or more branches, and just one of them can be enabled; the other outgoing branches are withdrawn

29 Scientific Workflow Control Flows: NUMBER CONTROL (Multiple Instances without synchronization pattern)
All output data are produced simultaneously

30 Scientific Workflow Control Flows: NUMBER COMPARE (Synchronization pattern)
Two or more incoming branches become one outgoing branch, which is enabled only after all the input data have been activated.
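The synchronization behavior described for Number Compare (the outgoing branch fires only once every incoming branch has delivered its data) can be sketched as follows. The class name `Synchronizer` and its `activate` method are hypothetical and are not taken from the presented modules; the sketch ignores concurrency and simply tracks which ports have arrived.

```python
from typing import Any, Dict, Optional, Tuple


class Synchronizer:
    """Enable the single outgoing branch only after all inputs are active."""

    def __init__(self, n_inputs: int) -> None:
        self._n_inputs = n_inputs
        self._arrived: Dict[int, Any] = {}

    def activate(self, port: int, value: Any) -> Optional[Tuple[Any, ...]]:
        """Register data on one input port.

        Returns None while inputs are still missing, and the tuple of all
        input values (in port order) once the last one has arrived.
        """
        if not 0 <= port < self._n_inputs:
            raise ValueError("unknown input port")
        self._arrived[port] = value
        if len(self._arrived) < self._n_inputs:
            return None
        return tuple(self._arrived[p] for p in range(self._n_inputs))
```

The order in which the branches arrive does not matter here; only the count of distinct active ports decides when the output is released.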

31 Scientific Workflow Control Flows: IF (Exclusive Choice pattern)
Same pattern as the Demux, with two differences: If has only two input ports, and it takes a logical expression in which scientists can create any condition they need.
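As a rough sketch of the If module described here, the function below takes two inputs and a user-supplied logical expression, and enables exactly one of two outputs. The name `if_module`, the modeling of the expression as a Python callable, and the T/F output pair (suggested by the T and F labels on slide 15) are assumptions for illustration only.

```python
from typing import Any, Callable, Optional, Tuple

Payload = Tuple[Any, Any]


def if_module(
    left: Any,
    right: Any,
    condition: Callable[[Any, Any], bool],
) -> Tuple[Optional[Payload], Optional[Payload]]:
    """Route the two inputs to the T output or the F output.

    Returns (true_output, false_output); the disabled output is None.
    """
    payload = (left, right)
    if condition(left, right):
        return payload, None
    return None, payload
```

A call such as `if_module(score, threshold, lambda s, t: s >= t)` shows the difference from the Demux sketch: the routing decision is computed from an arbitrary expression over the inputs rather than supplied as a port index.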

32 MidMon
Offers a generic and lightweight monitoring tool that checks scientific processes running on distributed environments
- Message exchange-based, two-layered modular infrastructure
- Decoupled and lightweight, crossing different network boundaries
- Easy to deploy and manage
- Support for legacy applications on distributed resources

33 MidMon Monitoring
Data that it may be possible to monitor:
- Task state data
- Data about the state of the environment
- Data about service availability

34 MidMon – State Data
List of task state data that it may be possible to monitor:
- Progress of a service: relies on checkpoints within the service, or the service may be able to provide an estimate of its progress
- Completion of a service: this could be a simple event indicating that a service has finished producing its output file
- Data consumption rate of a service: a measure of the rate at which the service consumes data from its input file
- Data production rate of a service: a measure of the rate at which the service generates data for its output file
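The state data listed on this slide can be made concrete with a small sketch. It does not reproduce MidMon's message format, which the slides do not show: `StateSample`, its fields, and the two helper functions are hypothetical. Progress is estimated from checkpoints, and the consumption and production rates are computed from two successive samples.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StateSample:
    """Hypothetical snapshot of one monitored service."""
    timestamp: float        # seconds since some reference point
    bytes_read: int         # cumulative bytes consumed from the input file
    bytes_written: int      # cumulative bytes produced in the output file
    checkpoints_done: int
    checkpoints_total: int


def progress(sample: StateSample) -> float:
    """Checkpoint-based progress estimate, between 0.0 and 1.0."""
    if sample.checkpoints_total <= 0:
        raise ValueError("service declares no checkpoints")
    return sample.checkpoints_done / sample.checkpoints_total


def rates(prev: StateSample, curr: StateSample) -> Tuple[float, float]:
    """Return (consumption rate, production rate) in bytes per second."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    consumed = (curr.bytes_read - prev.bytes_read) / elapsed
    produced = (curr.bytes_written - prev.bytes_written) / elapsed
    return consumed, produced
```

Completion of a service could then be signaled as a separate event, or inferred when `progress` reaches 1.0; which of the two a monitoring tool uses is a design choice the slides leave open.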

35 MidMon – State of the environment
A list of useful data that it may be possible to monitor about the state of the environment:
- Available execution nodes: this could be a list of changes in the available execution nodes in the environment
- Load on an execution node: a measure of the load on an execution node. It could be a single value, a tuple, or a composite, e.g., the CPU load, the number of processes, and the free resources of the execution node
- Load on a network link: a measure of the usage of a network link, in terms of the available bandwidth
- Memory usage on an execution node: a measure of the memory usage on an execution node

36 MidMon – Service availability
A list of useful data that it may be possible to monitor about service availability:
- Available services: this could be a list of the services available as mapping targets for tasks in a workflow. The data could also include, e.g., the status of services currently deployed
- Available data resources: this could be a list of the data resources available as mapping targets for inputs and outputs in a workflow

37 OrthoSearch – SSH version
Without control-flow modules

38 OrthoSearch on Kepler 1/3
[Figure: hmmPFam and hmmSearch]

39 OrthoSearch on Kepler 2/3
[Figure: FormatDB and FastaCmd]

40 OrthoSearch on Kepler 3/3
[Figure: InterPro]

