1 Detecting Duplicate Records in Scientific Workflow Results
Khalid Belhajjame (1), Paolo Missier (2), and Carole A. Goble (1)
(1) University of Manchester, (2) University of Newcastle

2 Scientific Workflows
Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments. They tend to be data intensive. The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows.
2 IPAW 2012

3 Duplicates in Workflow Results
The datasets obtained as a result of workflow execution often contain duplicates. As a result, the analysis and interpretation of workflow results may become tedious. The presence of duplicates also unnecessarily increases the size of workflow results.

4 Duplicate Record Detection
Research in duplicate record detection has been active for more than three decades; Elmagarmid et al. (2007) conducted a comprehensive survey of the topic. We do not aim to design yet another algorithm for comparing and matching records. Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.

5 Outline
Data-Driven Workflows and Provenance Trace
A method for guiding duplicate detection in workflow results based on provenance traces
Preliminary validation using real-world workflows

6 Preliminaries: Data-Driven Workflows
A data-driven workflow can be defined as a directed graph. A node represents an analysis operation, which has a set of input and output parameters. The edges are dataflow dependencies between the output parameters of one operation and the input parameters of another.

7 Preliminaries: Provenance Trace
The execution of a workflow gives rise to a provenance trace, which we capture using two relations.
Transformation: specifies that the execution of an operation took as input a given ordered set of records and generated another ordered set of records.
Transfer: specifies the transfer of records along the edges of the workflow.
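One way to picture the two relations is as plain record types. The sketch below is only an illustration: the class names, field names, and record identifiers (`r1` to `r4`) are assumptions, not the representation used by the authors. It also shows how the two relations together let us follow a record forward through a trace.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Transformation:
    """An operation execution that consumed and produced ordered sets of records."""
    operation: str
    inputs: Tuple[str, ...]   # ordered ids of the records consumed
    outputs: Tuple[str, ...]  # ordered ids of the records produced


@dataclass(frozen=True)
class Transfer:
    """A record moved along a workflow edge (output parameter to input parameter)."""
    source: str
    destination: str


def downstream(record_id: str,
               transformations: List[Transformation],
               transfers: List[Transfer]) -> Set[str]:
    """Return the ids of all records derived from record_id via either relation."""
    seen: Set[str] = set()
    frontier = [record_id]
    while frontier:
        current = frontier.pop()
        successors = [out for t in transformations if current in t.inputs
                      for out in t.outputs]
        successors += [x.destination for x in transfers if x.source == current]
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


# Hypothetical trace for a two-step workflow (identifiers are made up).
transformations = [
    Transformation("IdentifyProtein", ("r1",), ("r2",)),
    Transformation("GetGOTerm", ("r3",), ("r4",)),
]
transfers = [Transfer("r2", "r3")]
derived = downstream("r1", transformations, transfers)
```

Here `derived` contains every record that depends on `r1`, which is the information the duplicate detection method relies on.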

8 Outline
Data-Driven Workflows and Provenance Trace
A method for guiding duplicate detection in workflow results based on provenance traces
Preliminary validation using real-world workflows

9 Provenance-Guided Detection of Duplicates: Approach
To guide the detection of duplicates in workflow results, we exploit the following fact: an operation that is known to be deterministic produces identical output bindings given the same input binding.

10 Provenance-Guided Detection of Duplicates: Example
[Figure: workflow IdentifyProtein → GetGOTerm, with record sets R_i, R_o, R'_i and R'_o at parameters i, o, i' and o'.]
1. The set of records R_i bound to the input parameter of the starting operation are compared to identify duplicate records. The result of this phase is a partition of R_i into disjoint sets of identical records.

11 Provenance-Guided Detection of Duplicates: Example
2. The sets of records R_o, R'_i and R'_o are partitioned into sets of identical records based on the partitioning of R_i. For example:
[Figure: workflow IdentifyProtein → GetGOTerm, with record sets R_i, R_o, R'_i and R'_o at parameters i, o, i' and o'.]
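The two steps of the example can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: each operation has one input and one output, exact equality on hashable values stands in for a real record matcher, and the record ids, values, and the dictionaries describing the trace are all hypothetical.

```python
from collections import defaultdict


def partition_identical(records):
    """Step 1: group the ids of records whose values compare equal."""
    blocks = defaultdict(set)
    for record_id, value in records.items():
        blocks[value].add(record_id)
    return list(blocks.values())


def propagate(partition, derived_from):
    """Step 2: map each block of identical records to the block derived from it.

    derived_from maps a record id to the id of the record obtained from it,
    either by a deterministic operation or by a transfer along an edge.
    """
    return [{derived_from[record_id] for record_id in block} for block in partition]


# Hypothetical records bound to the input of the starting operation (R_i).
r_i = {"a1": "x", "a2": "x", "a3": "y"}

# Hypothetical provenance: IdentifyProtein, the transfer edge, then GetGOTerm.
identify_protein = {"a1": "b1", "a2": "b2", "a3": "b3"}  # R_i  -> R_o
transfer = {"b1": "c1", "b2": "c2", "b3": "c3"}          # R_o  -> R'_i
get_go_term = {"c1": "d1", "c2": "d2", "c3": "d3"}       # R'_i -> R'_o

p_ri = partition_identical(r_i)
p_ro = propagate(p_ri, identify_protein)
p_ri_prime = propagate(p_ro, transfer)
p_ro_prime = propagate(p_ri_prime, get_go_term)
```

Only the records in R_i are compared by value; the downstream partitions follow from the trace alone. Note that two different inputs may still yield identical outputs, so the propagated partitions capture the duplicates implied by determinism, not necessarily all duplicates.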

12 Provenance-Guided Detection of Duplicates: Example
In the example just described, the operations that compose the workflow have exactly one input and one output parameter. However, the algorithm presented in the paper supports operations with multiple input and output parameters. Notice that we assume that the analysis operations that compose the workflow are deterministic. This is not always the case, which raises the question of how to determine whether a given operation is deterministic.

13 Verifying the Determinism of Analysis Operations
To verify the determinism of operations, we use an approach whereby operations are probed.
1. Given an operation op, we select example values that can be used by the inputs of op, and invoke op using those values multiple times.
2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.
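The probing steps above can be sketched as a small helper. The function name, its signature, and the two sample operations are assumptions made for illustration; a real analysis operation would typically be a remote service call. A positive result only suggests determinism, since a finite number of runs cannot prove it.

```python
import itertools


def probe_determinism(op, example_inputs, runs=3):
    """Invoke op `runs` times per example input and compare the outputs.

    Returns False as soon as two invocations with identical inputs disagree,
    and True (meaning "likely deterministic") otherwise.
    """
    for args in example_inputs:
        first = op(*args)
        for _ in range(runs - 1):
            if op(*args) != first:
                return False
    return True


# Hypothetical operations used only to exercise the probe.
def to_upper(sequence):
    return sequence.upper()


_calls = itertools.count()


def unstable(value):
    # Output depends on how many times the operation has been called so far.
    return value + next(_calls)


likely_deterministic = probe_determinism(to_upper, [("acgt",), ("ttga",)])
not_deterministic = probe_determinism(unstable, [(1,)])
```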

14 Collection-Based Workflows
To support duplicate detection in collection-based workflows we need to be able to:
1. Identify when two collections are identical. Two collections R_i and R_j are identical if they are of the same size and there is a bijective mapping that maps each record r_i in R_i to a record r_j in R_j such that r_i and r_j are identical.
2. Identify duplicate records between two collections that are known to be identical, that is, identify a bijective mapping that maps every r_i in R_i to an identical r_j in R_j.
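Both requirements can be met by one routine that tries to build the bijective mapping and reports failure when none exists. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name and the `same` parameter are hypothetical, and the greedy pairing is only correct when `same` is an equivalence relation (such as exact equality). A fuzzy, non-transitive matcher would need a proper bipartite matching instead.

```python
def match_collections(left, right, same=lambda a, b: a == b):
    """Return index pairs (i, j) pairing identical records, or None.

    The result is a bijection between positions of `left` and `right` in which
    every paired couple of records is identical according to `same`. None is
    returned when the collections differ in size or no such mapping exists.
    """
    if len(left) != len(right):
        return None
    unused = list(range(len(right)))
    mapping = []
    for i, record in enumerate(left):
        for j in unused:
            if same(record, right[j]):
                mapping.append((i, j))
                unused.remove(j)  # each record on the right is used once
                break
        else:
            return None  # no identical partner left for this record
    return mapping


pairs = match_collections(["a", "b", "a"], ["a", "a", "b"])
no_match = match_collections(["a", "b"], ["a", "a"])
```

A non-None result answers the first question (the collections are identical), and the returned pairs answer the second (which records are duplicates of each other).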

15 Outline
Data-Driven Workflows and Provenance Trace
A method for guiding duplicate detection in workflow results based on provenance traces
Preliminary validation using real-world workflows

16 Validation
The method presented in this paper can be applied when the operations are deterministic. To gain insight into the degree to which the operations that compose workflows are deterministic, we ran an experiment.
Datasets: 15 bioinformatics workflows that cover a wide range of analyses, namely biological pathway analysis, sequence alignment, and molecular interaction analysis.
Process: to identify which of the operations in these workflows are deterministic, we ran each of them 3 times using example values found within either myExperiment or BioCatalogue.

17 Validation
After manual analysis of the results, it transpires that 5 of the 151 operations that compose the workflows (about 3%) are not deterministic. Note that many of the operations that we analyzed access and use underlying data sources in their computation. Updates to such sources may therefore break the determinism assumption (Chirigati and Freire, 2012). This suggests that determinism holds within a window of time during which the underlying sources remain the same, and that there is a need for monitoring techniques to identify such windows.
Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical Approach. IPAW, 2012.

18 Conclusions and Future Work
We described a method that can be used to guide duplicate detection in workflow results.
Future work: monitoring the determinism of analysis operations, and extending the method to support duplicate detection across the results of different workflows.


