# ANHAI DOAN ALON HALEVY ZACHARY IVES CHAPTER 14: DATA PROVENANCE PRINCIPLES OF DATA INTEGRATION.

## Presentation transcript


“Where Did this Data Come from?”

Challenge: integrated data may come from many sources and mappings, of differing quality or trustworthiness!

- How did I get this particular result?
- What mappings produced it?
- How much should I trust (believe) it?

Data provenance (lineage) captures the relationships between tuples in a set of data instances.

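An Example: View Tuple Derivations

Source relations:

R:

| A | B |
|---|---|
| 1 | 2 |
| 1 | 4 |

S:

| B | C |
|---|---|
| 2 | 3 |
| 3 | 2 |
| 4 | 3 |

View V1 = (R ⋈ S) ∪ (S ⋈ S):

| A | C | directly derivable by |
|---|---|---|
| 1 | 3 | R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3) |
| 2 | 2 | S(2,3) ⋈ ρ_{B→A, C→B} S(3,2) |
| 3 | 3 | S(3,2) ⋈ ρ_{B→A, C→B} S(2,3) |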

Formulating a Provenance Model

Conceptually, provenance captures the operations and operands going into a result. There are many options to do this, and many levels of detail!

A “good” provenance model should:

- Have a formal semantics
- Have equivalence properties such that equivalent query plans produce equivalent provenance
- Connect to notions of value, quality or score

Outline  The two views of provenance  Applications of data provenance  Provenance semirings: one ring to rule them all  Storing provenance 5

Provenance as Annotations on Data

- Annotate each derivation with an “explanation” in terms of relational algebra and the tuple operands
- Lets us “look up” the derivation of a result

Source relations: R(A, B) = {(1,2), (1,4)} and S(B, C) = {(2,3), (3,2), (4,3)}

View V1 (in Datalog):

- V1(x,z) :- R(x,y), S(y,z)
- V1(x,x) :- S(x,y), S(y,x)

| A | C | provenance annotation |
|---|---|---|
| 1 | 3 | R(1,2) ⋈ S(2,3) ∪ R(1,4) ⋈ S(4,3) |
| 2 | 2 | S(2,3) ⋈ ρ_{B→A, C→B} S(3,2) |
| 3 | 3 | S(3,2) ⋈ ρ_{B→A, C→B} S(2,3) |
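The annotation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `view_with_annotations` and the dictionary layout are hypothetical, and the rename operator ρ is omitted from the generated strings for brevity. The relation contents and the two Datalog rules are the ones from the example.

```python
from collections import defaultdict

# Source relations from the example.
R = [(1, 2), (1, 4)]          # R(A, B)
S = [(2, 3), (3, 2), (4, 3)]  # S(B, C)

def view_with_annotations(r, s):
    """Evaluate V1 and record, for every result tuple, how it was derived."""
    derivations = defaultdict(list)
    # Rule 1: V1(x, z) :- R(x, y), S(y, z)
    for (x, y) in r:
        for (y2, z) in s:
            if y == y2:
                derivations[(x, z)].append(f"R({x},{y}) ⋈ S({y},{z})")
    # Rule 2: V1(x, x) :- S(x, y), S(y, x)
    for (x, y) in s:
        for (y2, x2) in s:
            if y == y2 and x == x2:
                derivations[(x, x)].append(f"S({x},{y}) ⋈ S({y},{x})")
    # Alternative derivations of the same tuple are combined with union.
    return {t: " ∪ ".join(ds) for t, ds in derivations.items()}

annotations = view_with_annotations(R, S)
for tup, ann in sorted(annotations.items()):
    print(tup, "<-", ann)
```

Looking up `annotations[(1, 3)]` then answers "how did I get this result?" for that tuple: it lists both joins that produce it.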

Provenance as a Graph of Relationships

- Bipartite graph: tuple nodes connected via “derivation nodes”
- Encodes a hypergraph (hyperedges = derivations)
- Makes direct derivation relationships more explicit

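Making the Two Interchangeable

- We can make these equivalent by introducing provenance tokens (equiv. node IDs) for each tuple
- Derived tuples’ annotations = expressions over tokens

R:

| A | B | ann |
|---|---|---|
| 1 | 2 | r1 |
| 1 | 4 | r2 |

S:

| B | C | ann |
|---|---|---|
| 2 | 3 | s1 |
| 3 | 2 | s2 |
| 4 | 3 | s3 |

V1:

| A | C | ann |
|---|---|---|
| 1 | 3 | v1 = r1 ⋈ s1 ∪ r2 ⋈ s3 |
| 2 | 2 | v2 = s1 ⋈ s2 |
| 3 | 3 | v3 = s2 ⋈ s1 |

[Figure: provenance graph in which tuple nodes r1, r2, s1, s2, s3 connect through V1 derivation nodes to v1, v2, v3]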
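To illustrate why the two views are interchangeable, here is a minimal sketch that stores the graph as a list of derivation nodes and rebuilds each token expression from it. The list-of-dictionaries layout and the helper name `annotation_for` are assumptions made for this example; only the token names and derivations come from the slide.

```python
# Each derivation node links its input tuple tokens to one output tuple token.
derivation_nodes = [
    {"mapping": "V1", "inputs": ("r1", "s1"), "output": "v1"},
    {"mapping": "V1", "inputs": ("r2", "s3"), "output": "v1"},
    {"mapping": "V1", "inputs": ("s1", "s2"), "output": "v2"},
    {"mapping": "V1", "inputs": ("s2", "s1"), "output": "v3"},
]

def annotation_for(token, nodes):
    """Rebuild the annotation of a derived tuple from the graph.

    Inputs of one derivation node are joined; separate derivation
    nodes with the same output are alternatives, hence a union.
    """
    joins = [" ⋈ ".join(n["inputs"]) for n in nodes if n["output"] == token]
    return " ∪ ".join(joins)

for v in ("v1", "v2", "v3"):
    print(v, "=", annotation_for(v, derivation_nodes))
```

Going the other way (from an expression to derivation nodes) is the same bookkeeping in reverse: one node per joined term.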


Where Can We Use Provenance?

- Explanations: help the user understand why an item exists
- Scoring: provide a ranked list of “most relevant” results
- Reasoning about interactions: help the user understand data relationships

Examples of Provenance’s Utility

- Schema mapping debugging: we may have a bad result. Determine why that result exists and what is faulty.
- Bioinformatics data integration: different sources have different levels of reliability or authoritativeness. Rank results by score!
- Probabilistic databases: we may need to know that results are correlated. Encode the relationships and use them to assign probabilities.


The Notion of Provenance as Annotations

- Many formalisms were defined for using query computations to produce annotations
- Each captured certain subtleties
- The key question: is there one “most powerful” model that captures the properties of the relational algebra (over multi-sets or bags, as used by “real” systems)?
- Equivalent queries should produce equivalent provenance

The Provenance Semiring Model

To represent provenance, use:

- A set of provenance tokens or tuple IDs, K
- Abstract operators representing combination of tuples:
  - Abstract sum operator, ⊕, for union or projection: has identity element 0 (a ⊕ 0 ≡ 0 ⊕ a ≡ a)
  - Abstract product operator, ⊗, for join: has identity element 1 (a ⊗ 1 ≡ 1 ⊗ a ≡ a), and 0 annihilates (a ⊗ 0 ≡ 0 ⊗ a ≡ 0)

This is formally a commutative semiring.
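The semiring laws can be made concrete with a small checker. The sketch below is an illustration, not part of the book's material: the `Semiring` class and `satisfies_laws` helper are hypothetical names, and the check only samples a few values rather than proving the laws.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # abstract sum, used for union / projection
    times: Callable[[Any, Any], Any]  # abstract product, used for join
    zero: Any                         # identity of plus, annihilator of times
    one: Any                          # identity of times

def satisfies_laws(k: Semiring, samples) -> bool:
    """Check the commutative-semiring laws on a finite sample of values."""
    for a, b, c in product(samples, repeat=3):
        ok = (
            k.plus(a, k.zero) == a                      # 0 is the identity of sum
            and k.times(a, k.one) == a                  # 1 is the identity of product
            and k.times(a, k.zero) == k.zero            # 0 annihilates
            and k.plus(a, b) == k.plus(b, a)            # commutativity
            and k.times(a, b) == k.times(b, a)
            and k.plus(k.plus(a, b), c) == k.plus(a, k.plus(b, c))      # associativity
            and k.times(k.times(a, b), c) == k.times(a, k.times(b, c))
            and k.times(a, k.plus(b, c))                # product distributes over sum
            == k.plus(k.times(a, b), k.times(a, c))
        )
        if not ok:
            return False
    return True

counting = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
boolean = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
# Deliberately invalid: subtraction does not have 1 as an identity.
broken = Semiring(lambda a, b: a + b, lambda a, b: a - b, 0, 1)

print(satisfies_laws(counting, range(4)))       # True
print(satisfies_laws(boolean, [False, True]))   # True
print(satisfies_laws(broken, range(4)))         # False
```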

The Provenance Semiring Model  We can re-express our example as below, using the semiring operators instead of the relational algebra ones BCann 23s1s1 32s2s2 43s3s3 AB 12r1r1 14r2r2 R S ACAnn 13 v 1 = r 1 ⊗ s 1 ⊕ r 2 ⊗ s 3 22 v 2 = s 1 ⊗ s 2 33 v 3 = s 2 ⊗ s 1 15 V1V1V1V1 r1r1 r2r2 s1s1 s2s2 s3s3 v1v1 v2v2 v3v3 V 1 V 1 V 1 V 1

Tokens for Mappings

Sometimes we would like to assign a token to the actual mapping or rule used, so we can assign it a value.

View V1 (in Datalog):

- m1: V1(x,z) :- R(x,y), S(y,z)
- m2: V1(x,x) :- S(x,y), S(y,x)

Base relations:

- R (A, B, ann): (1, 2, r1), (1, 4, r2)
- S (B, C, ann): (2, 3, s1), (3, 2, s2), (4, 3, s3)

V1:

| A | C | Ann |
|---|---|---|
| 1 | 3 | v1 = m1 ⊗ [r1 ⊗ s1] ⊕ m1 ⊗ [r2 ⊗ s3] |
| 2 | 2 | v2 = m2 ⊗ [s1 ⊗ s2] |
| 3 | 3 | v3 = m2 ⊗ [s2 ⊗ s1] |
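As a sketch of what "assigning a value to a mapping" can look like, the code below derives the annotated view from the two rules and then evaluates each annotation under a Boolean trust assignment. The function names and the trust values (here, rule m2 is distrusted) are hypothetical and chosen only for illustration.

```python
# Base tuples keyed by their provenance tokens.
R = {"r1": (1, 2), "r2": (1, 4)}
S = {"s1": (2, 3), "s2": (3, 2), "s3": (4, 3)}

def provenance_with_mappings(r, s):
    """Return {view tuple: [(mapping, token, token), ...]}, a sum of products."""
    prov = {}
    # m1: V1(x, z) :- R(x, y), S(y, z)
    for rt, (x, y) in r.items():
        for st, (y2, z) in s.items():
            if y == y2:
                prov.setdefault((x, z), []).append(("m1", rt, st))
    # m2: V1(x, x) :- S(x, y), S(y, x)
    for st, (x, y) in s.items():
        for st2, (y2, x2) in s.items():
            if y == y2 and x == x2:
                prov.setdefault((x, x), []).append(("m2", st, st2))
    return prov

def render(terms):
    """Print a sum of products using the semiring operators."""
    return " ⊕ ".join(f"{m} ⊗ [{a} ⊗ {b}]" for m, a, b in terms)

# Hypothetical trust assignment: every base tuple and m1 are trusted, m2 is not.
trust = {"m1": True, "m2": False,
         "r1": True, "r2": True, "s1": True, "s2": True, "s3": True}

def trusted(terms):
    """Trust semiring: ⊗ is logical AND, ⊕ is logical OR."""
    return any(all(trust[tok] for tok in term) for term in terms)

prov = provenance_with_mappings(R, S)
for tup, terms in sorted(prov.items()):
    print(tup, render(terms), "->", "trusted" if trusted(terms) else "not trusted")
```

Because the mapping token is a factor of every product it contributes to, distrusting m2 removes exactly the tuples that can only be derived through that rule.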

Example Application: Provenance Visualization

[Figure: screenshot of a provenance graph visualization. Labels: tuple nodes; base tuple derivation (token not shown); derivation by mapping M5]

Example Application: Tuple Scoring

For ranked query results, we may adopt the following model commonly used in ranking:

- Assign a score to each base tuple = -log2(probability)
- Use arithmetic sum as ⊗
- Use min as ⊕

Suppose prob(r1) = 0.5, prob(s1) = 0.5, and all others are 1.0, so r1 and s1 each score 1 and every other base tuple scores 0:

| A | C | Ann |
|---|---|---|
| 1 | 3 | v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3 = min((1+1), (0+0)) = 0 |
| 2 | 2 | v2 = s1 ⊗ s2 = 1+0 = 1 |
| 3 | 3 | v3 = s2 ⊗ s1 = 0+1 = 1 |
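The scoring model can be checked by computing it directly from the stated probabilities. The sketch below assumes the annotations are stored as lists of token pairs (a hypothetical layout) and uses the identity -log2(p) = log2(1/p) to compute base scores, so r1 and s1 get score 1 and the remaining base tuples get score 0.

```python
import math

# Probabilities as stated in the example.
prob = {"r1": 0.5, "r2": 1.0, "s1": 0.5, "s2": 1.0, "s3": 1.0}

# Base score = -log2(probability), written as log2(1/p).
score = {tok: math.log2(1 / p) for tok, p in prob.items()}

# Each annotation is a sum (outer list) of products (inner tuples) of tokens.
annotations = {
    "v1": [("r1", "s1"), ("r2", "s3")],
    "v2": [("s1", "s2")],
    "v3": [("s2", "s1")],
}

def evaluate(terms):
    """⊗ is arithmetic sum over a product term, ⊕ is min over alternatives."""
    return min(sum(score[tok] for tok in term) for term in terms)

results = {v: evaluate(terms) for v, terms in annotations.items()}

# A lower score corresponds to a more probable derivation, so rank ascending.
for v, s in sorted(results.items(), key=lambda item: item[1]):
    print(v, s)
```

Taking min over alternatives keeps only the best (most probable) derivation of each tuple, which is why v1 ranks first here: its second derivation uses only tuples with probability 1.0.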

Useful Semirings

| Use case | Base value | Product R ⊗ S | Sum R ⊕ S |
|---|---|---|---|
| Derivability | True | R ∧ S | R ∨ S |
| Trust | Trust condition result | R ∧ S | R ∨ S |
| Confidentiality level | Tuple confidentiality level | More_secure(R, S) | Less_secure(R, S) |
| Weight / cost | Base tuple weight | R + S | min(R, S) |
| Lineage | Tuple ID | R ∪ S | R ∪ S |
| Probabilistic event | Tuple probabilistic event | R ∧ S | R ∨ S |
| Number of derivations | 1 | R ⋅ S | R + S |
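The point of the table is that one stored provenance expression can be evaluated under different semirings. A minimal sketch, assuming a nested-tuple encoding of expressions (hypothetical) and made-up base values for the derivability and weight cases:

```python
def evaluate(expr, plus, times, base):
    """Recursively evaluate a provenance expression in a chosen semiring."""
    if isinstance(expr, str):              # a base token such as "r1"
        return base[expr]
    op, left, right = expr
    lhs = evaluate(left, plus, times, base)
    rhs = evaluate(right, plus, times, base)
    return plus(lhs, rhs) if op == "+" else times(lhs, rhs)

# v1 = r1 ⊗ s1 ⊕ r2 ⊗ s3, with "+" standing for ⊕ and "*" for ⊗.
V1_EXPR = ("+", ("*", "r1", "s1"), ("*", "r2", "s3"))
TOKENS = ["r1", "s1", "r2", "s3"]

# Derivability: is v1 still derivable if r2 is absent? (assumed scenario)
present = {tok: True for tok in TOKENS}
present["r2"] = False
derivable = evaluate(V1_EXPR, lambda a, b: a or b, lambda a, b: a and b, present)

# Number of derivations: every base tuple counts as 1.
count = evaluate(V1_EXPR, lambda a, b: a + b, lambda a, b: a * b,
                 {tok: 1 for tok in TOKENS})

# Weight / cost: hypothetical base weights, product is +, sum is min.
weights = {"r1": 3, "s1": 1, "r2": 1, "s3": 1}
cost = evaluate(V1_EXPR, min, lambda a, b: a + b, weights)

print(derivable, count, cost)  # True 2 2
```

Only the `plus`, `times`, and base-value arguments change between use cases; the expression itself is untouched.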


Storing Provenance  Use tuple keys as tokens  Encode provenance graph as relations BC 23 32 43 AB 12 14 R S AC 13 22 33 V1V1V1V1 View V 1 (in Datalog): V 1 (x,z) :- R(x,y), S(y,z) V 1 (x,x) :- S(x,y), S(y,x) Relate tuples with table P v1-1 Relate tuples with table P v1-2 R.AR.BS. BS.CV1.AV1.C 122313 144313 S.BS. C S.B ’ S. C’ V1. A V1. C 233222 322333 21 P v1-1 P v1-2

Storing Provenance  Use tuple keys as tokens  Encode provenance graph as relations BC 23 32 43 AB 12 14 R S AC 13 22 33 V1V1V1V1 View V 1 (in Datalog): V 1 (x,z) :- R(x,y), S(y,z) V 1 (x,x) :- S(x,y), S(y,x) R.AR.BS. BS.CV1.AV1.C 122313 144313 S.BS. C S.B ’ S. C’ V1. A V1. C 233222 322333 22 P v1-1 P v1-2 These are redundant if we know the Datalog

Storing Provenance  Use tuple keys as tokens  Encode provenance graph as relations BC 23 32 43 AB 12 14 R S AC 13 22 33 V1V1V1V1 View V 1 (in Datalog): V 1 (x,z) :- R(x,y), S(y,z) V 1 (x,x) :- S(x,y), S(y,x) ABC 123 143 BCC’ 232 323 23 P v1-1 P v1-2

Data Provenance Wrap-up

- Provenance is critical to understanding and assessing the believability of data, and in debugging
- Two equivalent representations: annotations vs. graph
- Provenance semiring model preserves the “expected” equivalences of the relational algebra
- We can take semiring provenance and evaluate it with different semirings to get useful scores
- We can store provenance using relations
- Recent work beyond the scope of the book:
  - Extending provenance to more complex queries, e.g., with aggregation
  - Languages for querying provenance (primarily as a graph)
