Data Provenance and Data Quality Inference The University of Texas at Dallas Computer Science 11/13/2006 Ping Mao Jungin Kim.

Contents
- Data Quality
  - Overview
  - Quality Inference
- Data Provenance
  - Data Provenance Definitions
  - Taxonomy of Provenance Techniques

Data Quality Overview
What is data quality?
- Accuracy
- Timeliness
- Credibility (trustworthiness)
- Subjective: it depends on users and domains

Data Quality Overview
Example: a database collected over a period of time and by a variety of company departments
Company Name   Address       Number of Employees
A              20 Rode St.   3,000
B              50 Main Av.   500

Data Quality Overview
Questions:
- When was it created?
- Where did it come from?
- How and why was it obtained?
Company Name   Address       Number of Employees
A              20 Rode St.   3,000
B              50 Main Av.   500
Annotations on the values: Jan-12-00, by sales; Feb-5-00, by ABC; Oct-24-00, by acctg; Oct-10-00, by EFG

Data Quality Overview
How to store it?
- Annotations by tagging
- Provenance
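
Annotation by tagging can be pictured as attaching a creation date and a source to every stored value. The following is a minimal sketch, not a method from the slides: the class `Tagged`, the helper `columns_newest_first`, and the pairing of the slide's annotations with particular cells are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tagged:
    """A cell value tagged with when it was created and who supplied it."""
    value: object
    created: date
    source: str


# Hypothetical pairing of two of the slide's annotations with company A's cells.
company_a = {
    "address": Tagged("20 Rode St.", date(2000, 1, 12), "sales"),
    "employees": Tagged(3000, date(2000, 10, 24), "acctg"),
}


def columns_newest_first(row):
    """Return column names ordered from most to least recently created."""
    return sorted(row, key=lambda col: row[col].created, reverse=True)


print(columns_newest_first(company_a))
```

With tags like these, the questions "when was it created" and "where did it come from" become simple lookups on the stored value.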

Data Quality Inference
Next question: can we trust data sets or data sources?
Answer: rank the data sets generated from the data sources by quality.

Data Quality Inference
Motivation: data are
- Distributed
- Erroneous
- Shared and integrated

Data Quality Inference
Data source ranking
1. Rank the data sets or sources in order of their accuracies
2. Determine the top-k accurate data sets or sources
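
Once every source has an accuracy estimate, both ranking tasks reduce to a sort. A minimal sketch, assuming the estimates are held in a dictionary; the function names and the sample scores are hypothetical.

```python
def rank_sources(accuracy):
    """Return source names ordered from most to least accurate."""
    return sorted(accuracy, key=accuracy.get, reverse=True)


def top_k(accuracy, k):
    """Return the k most accurate source names."""
    return rank_sources(accuracy)[:k]


# Hypothetical accuracy estimates in [0, 1] for three sources.
estimates = {"D1": 0.62, "D2": 0.91, "D3": 0.47}
print(rank_sources(estimates))
print(top_k(estimates, 2))
```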

Data Quality Inference
Framework
- D: a set of data sources
- Ti(k, v): table for a query Q at time t, where k is the key and v is the value
- Ai ∈ [0, 1]: accuracy of data source Di
- Ai < Aj if Di is less accurate than Dj

Data Quality Inference
General framework
- h(t): historical function, 0 ≤ h(t) ≤ 1
- Weighted sum over all time indexes within the last w
- c(i,t): cohesion function
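
The slide's formula did not survive extraction, so the sketch below is only one possible reading: the accuracy estimate of a source is a weighted sum of its cohesion values over the last w time indexes, with the weights playing the role of h(t). The function name `windowed_accuracy` and the normalization by the sum of the weights are assumptions, added so the result stays in [0, 1].

```python
def windowed_accuracy(cohesion_history, weights):
    """Weighted average of the most recent cohesion values.

    cohesion_history: cohesion values c(i, t) for one source, oldest first.
    weights: historical weights in [0, 1], oldest first; len(weights) is w.
    """
    recent = cohesion_history[-len(weights):]
    if not recent:
        return 0.0
    # Align the weights with the values that are actually available.
    used = weights[-len(recent):]
    total = sum(used)
    if total == 0:
        return 0.0
    return sum(h * c for h, c in zip(used, recent)) / total


# Window of w = 2 that favors the most recent observation.
print(windowed_accuracy([0.2, 0.8, 1.0], [0.25, 0.75]))
```

Weighting recent time indexes more heavily lets the estimate follow a source whose quality changes, while the window still smooths out a single bad reading.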

Data Quality Inference
Cohesion function, c(i,t)
Determines:
- the new accuracy estimate
- how well the data sources agree with one another
Components:
- f(i,t): dampening factor function
- a(i,j,t): agreement function

Data Quality Inference
Dampening factor function, f(i,t)
- f(i,t) is a probability assigned to the data source
- Similar to Google's PageRank: high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search
- Prevents the solution from being zero for all data sources
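
The PageRank analogy suggests an iterative score in which each source is supported by the sources that agree with it, with a dampening term that keeps every score above zero. The update rule below is an illustrative PageRank-style assumption, not the formula from the slides; `cohesion_scores`, the constant dampening value, and the sample agreement matrix are hypothetical.

```python
def cohesion_scores(agreement, f=0.5, iterations=50):
    """Iterate score_i = (1 - f) + f * mean_j(agreement[i][j] * score_j), j != i.

    agreement: square matrix with entries in [0, 1].
    f: dampening factor in [0, 1]; (1 - f) is the floor of every score.
    """
    n = len(agreement)
    scores = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            support = [agreement[i][j] * scores[j] for j in range(n) if j != i]
            mean_support = sum(support) / len(support) if support else 0.0
            updated.append((1 - f) + f * mean_support)
        scores = updated
    return scores


# Sources 0 and 1 agree with each other; source 2 agrees with nobody.
matrix = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
scores = cohesion_scores(matrix)
print(scores)
```

Without the `(1 - f)` term, a set of sources with no mutual agreement would all collapse to a score of zero, which is the situation the dampening factor is said to prevent.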

Data Quality Inference
Agreement function, a(i,j,t)
- tupleOverlap(i,j,t): measures the proportion of tuples in approximate agreement
- cosineOverlap(i,j,t): measures the complement of the cosine distance between two sets of data over the same key values
- eOverlap(i,j,t): Euclidean-based function, using the Euclidean distance in n dimensions
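
The first two agreement functions can be sketched for tables stored as key-to-number dictionaries. The 5% relative tolerance used for "approximate agreement" and the function names are assumptions; the complement of the cosine distance is simply the cosine similarity, which lies in [0, 1] when all values are non-negative.

```python
import math


def tuple_overlap(t_i, t_j, tol=0.05):
    """Fraction of shared keys whose values agree within a relative tolerance."""
    shared = t_i.keys() & t_j.keys()
    if not shared:
        return 0.0
    agreeing = 0
    for key in shared:
        a, b = t_i[key], t_j[key]
        if abs(a - b) <= tol * max(abs(a), abs(b)):
            agreeing += 1
    return agreeing / len(shared)


def cosine_overlap(t_i, t_j):
    """Cosine similarity of the two value vectors over the shared keys."""
    shared = sorted(t_i.keys() & t_j.keys())
    u = [t_i[key] for key in shared]
    v = [t_j[key] for key in shared]
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Two sources reporting employee counts; they disagree on company B.
source_1 = {"A": 3000, "B": 500}
source_2 = {"A": 3000, "B": 800}
print(tuple_overlap(source_1, source_2))
print(cosine_overlap(source_1, source_2))
```

The example shows how the two measures differ: one large value in agreement dominates the cosine measure, while the tuple measure counts every key equally.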

Data Quality Inference
Agreement function, a(i,j,t)
Using the Euclidean distance:
eOverlap(i,j,t) = 1 - eDist(V(i,j,t), V(j,i,t))
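
For eOverlap to stay in [0, 1], eDist has to be a normalized distance. The sketch below assumes that V(i,j,t) is the vector of values from source i over the keys it shares with source j, that values are already scaled to [0, 1], and that the distance is divided by the square root of the number of dimensions; all three are assumptions rather than definitions from the slides.

```python
import math


def e_dist(u, v):
    """Euclidean distance between equal-length vectors with entries in [0, 1],
    divided by sqrt(n) so the result also lies in [0, 1]."""
    n = len(u)
    if n == 0:
        return 0.0
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v))) / math.sqrt(n)


def e_overlap(t_i, t_j):
    """1 - eDist over the keys both tables share; 0.0 if they share none."""
    shared = sorted(t_i.keys() & t_j.keys())
    if not shared:
        return 0.0
    v_ij = [t_i[key] for key in shared]
    v_ji = [t_j[key] for key in shared]
    return 1 - e_dist(v_ij, v_ji)


print(e_overlap({"a": 0.0, "b": 0.5}, {"a": 0.0, "b": 1.0}))
```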

Data Quality Inference
Experimental results
- 100 data sources
- 20 different tuples (key, value)
- Randomly assigned
- Dampening function f(i,t) = 0.5

Data Quality Inference
Experimental results (figure)

Data Provenance
- Data Provenance Definitions
- Taxonomy of Provenance Techniques
  - Application of Provenance
  - Subject of Provenance
  - Representation of Provenance
  - Provenance Storage
  - Provenance Dissemination
- Examples of Data Provenance Techniques

What is Data Provenance
In the database systems domain:
- Data provenance, a kind of metadata sometimes called "lineage" or "pedigree", is the description of the origins of a piece of data and the process by which it arrived in a database.
- Data provenance is information that helps determine the derivation history of a data product, starting from its original sources.
E-science:
- E-science is computationally intensive science. It is also science that is carried out in highly distributed network environments, or that uses immense data sets requiring grid computing. Examples include social simulations, particle physics, earth sciences, and bioinformatics.

Why Data Provenance Is Important
When you find some data on the Web, do you have any information about how it got there? It is quite possible that it was copied from somewhere else on the Web, which in turn may also have been copied, and in this process it may have been transformed and edited. If you are a scientist, or any kind of scholar, you would like to have confidence in the accuracy and timeliness of the data that you are working with. Medical research requires tight controls on the quality of data because mistakes can harm people's health. The consequences of poor data quality in bioinformatics may not be as immediate, but it is no less important. Among the sciences, the field of molecular biology is possibly one of the most sophisticated consumers of modern database technology and has generated a wealth of new database issues. A substantial fraction of research in genetics is conducted in "dry" laboratories using in silico experiments, that is, analysis of data in the available databases.

Taxonomy of Provenance Techniques
This paper categorizes provenance systems based on:
- Why they record provenance (application of provenance)
- What they describe (subject of provenance)
- How they represent provenance (provenance representation)
- How they store provenance (provenance storage)
- How they disseminate provenance (provenance dissemination)

Taxonomy of Provenance

Application of Provenance
Provenance systems can support a number of uses. Several applications of provenance information are as follows:
- Data Quality: Lineage can be used to estimate data quality and data reliability based on the source data and transformations. It can also provide proof statements on data derivation.
- Audit Trail: Provenance can be used to trace the audit trail of data, determine resource usage, and detect errors in data generation.
- Replication Recipes: Detailed provenance information can allow repetition of data derivation, help maintain its currency, and be a recipe for replication.
- Attribution: Pedigree can establish the copyright and ownership of data, enable its citation, and determine liability in case of erroneous data.
- Informational: A generic use of lineage is to query based on lineage metadata for data discovery. It can also be browsed to provide a context to interpret data.

Subject of Provenance
Provenance models:
- Data-oriented model: an explicit model in which lineage metadata is gathered specifically about the data product. One can delineate the provenance metadata about the data product from metadata concerning other resources.
- Process-oriented model: an indirect model in which the deriving processes are the primary entities for which provenance is collected, and the data provenance is determined by inspecting the input and output data products of these processes.
Provenance granularity (coarse-grained or fine-grained):
- The usefulness of provenance and the cost of collecting and storing it in a given domain are linked to the granularity at which it is collected.
- Granularity ranges from provenance on attributes and tuples in a database to provenance for collections of files, say, generated by an ensemble experiment run.
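
In the process-oriented model, nothing is stored about a data product directly; its provenance is read off the inputs and outputs of recorded process runs. A minimal sketch of that idea follows, where the class `ProcessRun`, the helper names, and the sample runs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRun:
    """One recorded execution of a deriving process."""
    name: str
    inputs: tuple
    outputs: tuple


def producers_of(data_id, runs):
    """Return the runs that list data_id among their outputs."""
    return [run for run in runs if data_id in run.outputs]


def direct_sources(data_id, runs):
    """Return the inputs of every run that produced data_id, sorted."""
    return sorted({src for run in producers_of(data_id, runs) for src in run.inputs})


# Hypothetical process log.
runs = [
    ProcessRun("clean", inputs=("raw_a", "raw_b"), outputs=("clean_table",)),
    ProcessRun("summarize", inputs=("clean_table",), outputs=("report",)),
]
print(direct_sources("clean_table", runs))
```

A data-oriented system would instead attach the same information to `clean_table` itself, which makes lookups direct at the cost of storing metadata per data product.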

Representation of Provenance
Two major approaches:
- Annotations: Metadata comprising the derivation history of a data product is collected as annotations and descriptions about source data and processes. Advantage: annotations are richer and, in addition to the derivation history, often include the parameters passed to the derivation processes, the versions of the workflows that will enable reproduction of the data, or even related publication references.
- Inversion: Uses the property by which some derivations can be inverted to find the input data supplied to them to derive the output data. Examples include queries and user-defined functions in databases that can be inverted automatically or by explicit functions. Advantage: more compact. Drawback: the information it provides is sparse and limited to the derivation history of the data.
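
The inversion approach can be illustrated with a derivation that is easy to invert: a per-company total. No annotations are stored; the contributing input rows are recomputed from the output on demand. The functions and the sample rows below are hypothetical and only sketch the idea.

```python
def total_by_company(rows):
    """Derivation: sum employee counts per company.

    rows: iterable of (company, department, employees) tuples.
    """
    totals = {}
    for company, _department, employees in rows:
        totals[company] = totals.get(company, 0) + employees
    return totals


def invert_total(company, rows):
    """Inverse: the input rows that contributed to one company's total."""
    return [row for row in rows if row[0] == company]


# Hypothetical department-level source rows.
rows = [
    ("A", "sales", 1200),
    ("A", "acctg", 1800),
    ("B", "sales", 500),
]
print(total_by_company(rows))
print(invert_total("A", rows))
```

The inverse recovers which rows were used, but nothing about who ran the derivation or with which parameters, which is the sparseness noted for this approach.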

Representation of Provenance (contd.)
Many current provenance systems that use annotations have adopted XML for representing the lineage information. Some also capture semantic information within provenance using domain ontologies in languages like RDF and OWL. Ontologies precisely express the concepts and relationships used in the provenance and provide good contextual information.

Provenance Storage
Scalability:
- Provenance information can grow to be larger than the data it describes if the data is fine-grained and the provenance information is rich, so the manner in which provenance metadata is stored is important to its scalability.
- The inversion method is arguably more scalable than using annotations. However, one can reduce storage needs in the annotation method by recording just the immediately preceding transformation step that creates the data and recursively inspecting the provenance information of those ancestors for the complete derivation history.
Overhead:
- Less frequently used provenance information can be archived to reduce storage overhead, or a demand-supply model based on usefulness can retain provenance only for data that is frequently used.
- If provenance depends on users manually adding annotations instead of being collected automatically, the burden on the user may prevent complete provenance from being recorded and made available in a machine-accessible form that has semantic value.
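
Recording only the immediately preceding step means each data product stores just its direct parents, and the complete derivation history is rebuilt by walking those links. A minimal sketch, assuming the parent links are kept in a dictionary; the function name and the sample lineage are hypothetical.

```python
def full_lineage(data_id, parents):
    """Collect every ancestor of data_id by following direct-parent links.

    parents: maps a data product to the list of products it was derived from.
    """
    ancestors = []
    visited = set()
    pending = [data_id]
    while pending:
        current = pending.pop()
        for parent in parents.get(current, ()):
            if parent not in visited:
                visited.add(parent)
                ancestors.append(parent)
                pending.append(parent)
    return ancestors


# Each product records only its immediate inputs.
parents = {
    "report": ["clean_table"],
    "clean_table": ["raw_a", "raw_b"],
}
print(full_lineage("report", parents))
```

Storage grows with the number of derivation steps rather than with the length of each history, at the price of a traversal whenever the full lineage is needed.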

Provenance Dissemination
- Visual graph: A common way of disseminating provenance data is through a derivation graph that users can browse and inspect.
- Queries: Users can also search for datasets based on their provenance metadata, for example to locate all datasets generated by executing a certain workflow. If semantic provenance information is available, these query results can automatically feed input datasets for a workflow at runtime. The derivation history of datasets can be used to replicate data at another site, or to update a dataset that is stale due to changes made to its ancestors.
- Service API: Provenance retrieval APIs can additionally allow users to implement their own usage mechanisms.
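
Two of the queries mentioned here, finding datasets by the workflow that generated them and detecting stale datasets, can be sketched over a small provenance catalog. The record layout (`workflow`, `inputs`, `built_at`), the function names, and the sample catalog are assumptions for illustration; the lineage is assumed to be acyclic.

```python
def generated_by(workflow, records):
    """Names of all datasets whose provenance names the given workflow."""
    return sorted(name for name, rec in records.items() if rec["workflow"] == workflow)


def is_stale(name, records):
    """A dataset is stale if any input was rebuilt after it, or is itself stale."""
    built_at = records[name]["built_at"]
    for parent in records[name]["inputs"]:
        if records[parent]["built_at"] > built_at or is_stale(parent, records):
            return True
    return False


# Hypothetical catalog; built_at is a logical timestamp.
records = {
    "raw": {"workflow": "ingest", "inputs": [], "built_at": 5},
    "clean": {"workflow": "cleaning", "inputs": ["raw"], "built_at": 3},
    "report": {"workflow": "reporting", "inputs": ["clean"], "built_at": 4},
}
print(generated_by("cleaning", records))
print([name for name in records if is_stale(name, records)])
```

In this catalog `raw` was rebuilt after `clean`, so `clean` is stale, and `report` is stale because it depends on `clean`.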

Survey of Data Provenance Techniques

Provenance in a Bioinformatics Grid (myGrid)
- myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments.
- It keeps the scientist informed as to the provenance of data relevant to their experiment space.

What is the problem?
- Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments.
- Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.

Architectural Vision
- Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services.
- Provenance data will be submitted to one or more "provenance repositories" acting as storage for provenance data.
- Upon a user's request, some analysis, navigation and reasoning over provenance data can be undertaken.

Architectural Vision
- Storage could be achieved by a provenance service.
- The provenance service would provide support for analysis, navigation or reasoning over provenance.
- Client-side support is needed for submitting provenance data to the provenance service.
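
The division of roles in this vision, several actors submitting records to a service that stores them and answers navigation requests, can be sketched with an in-memory stand-in. This is not myGrid's API: the class `ProvenanceService`, its methods, and the sample records are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceService:
    """In-memory stand-in for a provenance repository with simple navigation."""
    records: list = field(default_factory=list)

    def submit(self, actor, subject, statement):
        """Store one provenance statement made by an actor about a subject."""
        self.records.append({"actor": actor, "subject": subject, "statement": statement})

    def about(self, subject):
        """Return every stored statement about the given subject."""
        return [rec for rec in self.records if rec["subject"] == subject]

    def actors(self):
        """Return the sorted names of all actors that have submitted records."""
        return sorted({rec["actor"] for rec in self.records})


service = ProvenanceService()
# Different entities each contribute what they know.
service.submit("workflow enactment engine", "result-1", "produced by step align")
service.submit("invoked service", "result-1", "computed with default parameters")
service.submit("client", "run-1", "started by user")
print(service.about("result-1"))
```

Keeping submission and retrieval behind one service interface is what lets the engine, its client, and the invoked services contribute independently.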

Prototype Overview

Conclusion
- Provenance is a rather unexplored domain.
- It is necessary to design a configurable architecture capable of supporting multiple requirements from very different application domains.
- The algorithmic foundations of provenance need further investigation, which will lead to scalable and secure industrial solutions.

Future work
- Using heterogeneous data sources
- Large data sources
- Historical measurement
- Dynamic measurement
- Security and authorization of data provenance
- Managing provenance in diverse domains

References
1) Yogesh L. Simmhan, Beth Plale, and Dennis Gannon, "A Survey of Data Provenance in e-Science," in SIGMOD Record, Vol. 34, No. 3, Sept. 2005
2) "Using Semantic Web Technologies for Representing e-Science Provenance," http://theory.csail.mit.edu/~dquan/iswc2004-mygrid.pdf
3) Jan Brase, "Using digital library techniques - Registration of scientific primary data," in ECDL, 2004, http://www.kbs.uni-hannover.de/Arbeiten/Publikationen/2004/brase_TIB_hannover.pdf
4) Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan, "Why and Where: A Characterization of Data Provenance," in ICDT, 2001
5) Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan, "Data Provenance: Some Basic Issues," http://db.cis.upenn.edu/DL/fsttcs.pdf
6) Wang-Chiew Tan, "Research Problems in Data Provenance," http://www.soe.ucsc.edu/~wctan/papers/2004/ieee.pdf
7) Raymond K. Pon and Alfonso F. Cárdenas, "Data Quality Inference," http://www.cs.ucla.edu/~rpon/IQIS.pdf
8) Wang, R., Kon, H. & Madnick, S. (1993), "Data Quality Requirements Analysis and Modelling," Ninth International Conference of Data Engineering, Vienna, Austria.
9) Wand, Y. and Wang, R. (1996), "Anchoring Data Quality Dimensions in Ontological Foundations," Communications of the ACM, November 1996, pp. 86-95
