Approximate Lineage for Probabilistic Databases

Christopher Ré and Dan Suciu, University of Washington

Approximate Lineage in One Slide
Lineage (provenance) in probabilistic databases: used in query processing (QP) to track correlations, and to explain query/view results.
VLDBs have lots of lineage, especially with complex queries/views; in a view, lineage is all derivations of a tuple. This chokes QP and is hard for users to understand.
Obs: lineage contains a lot of redundancy!
This work: approximate the lineage by keeping only the most important correlations.

Overview
Motivation & Preliminaries
An approximate lineage approach: Sufficient Lineage
Experiments
Conclusions

A Protein Database
Inspired by the Geneontology (GO) database. Standard pDB, e.g. Mystiq, Trio.
Process (P), with lineage column l:
[Table: columns Protein, Process, l. Proteins: AGO2, Aac11. Processes: “Cell Death”, “Embryonic Devel.”, “Glands”. Lineage atoms: X1, X2, X3.]
Atoms:
id  Description       P
X1  “Dr. Z told me”   0.9
X2  “PubMed:123”      0.8
X3  “Lab Experiment”  0.3
Data are from somewhere: lineage (l) is important.
Lineage comes from a wide variety of sources, not all trusted the same: manually created, or machine inferred, some with confidence, too!

Review: Lineage tracking
PRA [Fuhr & Rölleke 97], Trio [Widom 05], Mystiq [Ré, Dalvi, Suciu 07]
Lineage propagates with queries/views.
Example view, “Proteins related to the same process as 'Aac11'”: V(x) :- P(x,y), P('Aac11',y), x ≠ 'Aac11'
How do we derive the lineage? Each derivation contributes a conjunction of atoms, e.g. (X1 ∧ X2), and lineage tracks all derivations.
Protein AGO2 has lineage l1 = (X1 ∧ X2) ∨ (X1 ∧ X3)
Prob QP: Pr[V('AGO2')] = Pr[l1]
Big DB = big lineage: in GO, 1 tuple, 10MB of lineage! Big lineage chokes the engine!
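The derivation of l1 can be sketched in Python as a self-join that records which atoms each derivation uses. The assignment of atoms to rows of P below is an assumption, chosen so that the join reproduces l1 = (X1 ∧ X2) ∨ (X1 ∧ X3); the function name view_lineage is hypothetical.

```python
# Process table P as (protein, process, lineage atom).
# ASSUMPTION: this row alignment is illustrative, picked to be consistent with l1.
P = [
    ("AGO2", "Cell Death", "X1"),
    ("AGO2", "Embryonic Devel.", "X1"),
    ("AGO2", "Glands", "X2"),
    ("Aac11", "Cell Death", "X2"),
    ("Aac11", "Embryonic Devel.", "X3"),
]


def view_lineage(table, target="Aac11"):
    """Lineage of "proteins sharing a process with target", as a DNF.

    Returns {protein: set of monomials}; each monomial is a frozenset of
    atoms, one monomial per derivation of the output tuple.
    """
    lineage = {}
    for protein, process, atom in table:
        if protein == target:
            continue  # the view excludes the target protein itself
        for other, other_process, other_atom in table:
            if other == target and other_process == process:
                monomial = frozenset({atom, other_atom})
                lineage.setdefault(protein, set()).add(monomial)
    return lineage


l1 = view_lineage(P)["AGO2"]
```

Each monomial is the conjunction of the atoms of the two joined tuples; the set of monomials is the disjunction over all derivations.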

Problems with Large Lineage in pDB
In this talk, lineage is used to:
Process queries. Large lineage chokes QP.
Give explanations to users. Large lineage gives many redundant explanations.
Find influential atoms. Large lineage makes this a needle in a haystack.
On VLDBs, it is helpful to shrink (approximate) the lineage.

Approximate Lineage Approach
Original VLDB. Level 1 database: big lineage, l. Level 2 database: small lineage, a, a smaller, approximate formula with error e.
Example for protein AGO2: l = (X1 ∧ X2) ∨ (X1 ∧ X3); a = 0.5*x, or a = (X1 ∧ X2).
All (most) querying is on the Level 2 database (using a instead of l).
Focus is on the Level 2 database.

Sufficient lineage (SL)
Example: AGO2 has l = (X1 ∧ X2) ∨ (X1 ∧ X3) and a = (X1 ∧ X2).
Represent a's? As DNF formulae that logically imply l, so a is a lower bound of l.
Use a's to answer queries, provide explanations, find influential tuples? Reuse existing systems! See paper.
Build good a, efficiently? The remainder of this talk.
Nugget: an algorithm that always finds small, good SL.

Formalizing “good a's”: E[l - a] ≤ e
Choosing an approximation a for a lineage function l.
Atoms:
id  Description       P
X1  “Dr. Z told me”   0.9
X2  “PubMed:123”      0.8
X3  “Lab Experiment”  0.3
Example: AGO2 has l = (X1 ∧ X2) ∨ (X1 ∧ X3).
An atom is a Boolean proposition. A world is a set of the true atoms.
Formalizing this: E[l - a] ≤ e. The expectation of the difference, over all worlds, should be small.
Intuition: a should agree with l on most worlds.
NB: this is really the standard ℓ2 distance.
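Written out, the criterion is a weighted sum over worlds. The block below is a sketch in LaTeX; the symbol W (a world) and the notation l(W), a(W) for the truth values of the formulas in W are introduced here for illustration. It also assumes that a logically implies l, which is what lets the squared (ℓ2) distance coincide with the plain difference.

```latex
\mathbb{E}[l - a] \;=\; \sum_{W} \Pr[W]\,\bigl(l(W) - a(W)\bigr) \;\le\; e

% If a implies l, then l(W) - a(W) \in \{0, 1\} in every world W, hence
\mathbb{E}\bigl[(l - a)^2\bigr] \;=\; \mathbb{E}[l - a]
```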

Illustrating Good Lineage
E[l - a] = E[l] - E[a] ≤ e
Atoms:
id  Description       P
X1  “Dr. Z told me”   0.9
X2  “PubMed:123”      0.8
X3  “Lab Experiment”  0.3
l = (X1 ∧ X2) ∨ (X1 ∧ X3): Pr[l] = 0.9 * (1 - (1 - 0.8)(1 - 0.3)) = 0.9 * 0.86 = 0.774
a = (X1 ∧ X2): Pr[a] = 0.9 * 0.8 = 0.72
e = 0.774 - 0.72 = 0.054
Intuition: Pr[a] high means good lineage.
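The arithmetic on this slide can be checked by brute force. The sketch below enumerates every world over X1, X2, X3 and sums the weights of the worlds where a formula holds; it assumes the atoms are independent, which is what the slide's own product formula implies. The name dnf_probability is hypothetical, and the enumeration is exponential in the number of atoms, so it only suits toy examples.

```python
from itertools import product

# Atom probabilities from the slide.
PROBS = {"X1": 0.9, "X2": 0.8, "X3": 0.3}


def dnf_probability(dnf, probs):
    """Pr[dnf] by enumerating all worlds (ASSUMPTION: independent atoms).

    dnf is a list of monomials; each monomial is a set of atom names.
    """
    atoms = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(atoms)):
        world = {atom for atom, is_true in zip(atoms, values) if is_true}
        weight = 1.0
        for atom, is_true in zip(atoms, values):
            weight *= probs[atom] if is_true else 1.0 - probs[atom]
        # The DNF holds in a world if some monomial has all its atoms true.
        if any(monomial <= world for monomial in dnf):
            total += weight
    return total


l = [{"X1", "X2"}, {"X1", "X3"}]
a = [{"X1", "X2"}]
pr_l = dnf_probability(l, PROBS)
pr_a = dnf_probability(a, PROBS)
error = pr_l - pr_a  # E[l - a] = E[l] - E[a]
```

Up to floating-point rounding, pr_l, pr_a and error match the slide's 0.774, 0.72 and 0.054.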

1st step: Lineage DNFs to “graphs”
We can think of DNFs as graphs (a k-DNF corresponds to a k-hypergraph): atoms = nodes, monomials = edges.
Example: (X1 ∧ Y1) ∨ (X2 ∧ Y1) has edges X1-Y1 and X2-Y1. [Figure: graph on nodes X1 … Xn and Y1 … Ym.]
Trick: a matching is an SL formula.
Goal: given error e, find a subset of edges with error smaller than e and small size, i.e. a best lower bound.
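One reason a matching is convenient can be sketched in a few lines of Python: its monomials share no atoms, so under independent atoms its probability is a simple product formula. The helper names is_matching and matching_probability are hypothetical, and atom independence is an assumption.

```python
def is_matching(monomials):
    """True if no atom appears in two monomials (edges are pairwise disjoint)."""
    seen = set()
    for monomial in monomials:
        if seen & monomial:
            return False
        seen |= monomial
    return True


def matching_probability(monomials, probs):
    """Pr of a disjunction of pairwise-disjoint monomials.

    ASSUMPTION: atoms are independent, so disjoint monomials are
    independent events and Pr = 1 - prod(1 - Pr[monomial]).
    """
    if not is_matching(monomials):
        raise ValueError("monomials share an atom; product formula does not apply")
    miss = 1.0
    for monomial in monomials:
        p = 1.0
        for atom in monomial:
            p *= probs[atom]
        miss *= 1.0 - p
    return 1.0 - miss


# (X1 ∧ Y1) ∨ (X2 ∧ Y1): the two edges share node Y1, so this is not a matching.
dnf = [frozenset({"X1", "Y1"}), frozenset({"X2", "Y1"})]
# Keeping a single edge gives a matching, and a lower bound on the DNF.
sub = [dnf[0]]
```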

How big a matching could we need?
Assume Pr[Xi] = Pr[Yj] = 0.5. [Figure: matching M between X1 … Xn and Y1 … Ym.]
Pr[M] = 1 - (1 - 0.25)^|M|
Size |M|  Pr[M] exceeds
9         0.9
17        0.99
25        0.999
33        0.9999
A matching of size 9 implies Pr[M] > 0.9, so for any e > 0.1 a matching of size at most 9 always suffices.
Subtle: the size bound depends on k, e and Pr[Xi], not on the # of tuples.
If l has a small good matching, take a to be that matching. Call this a “good enough matching”.
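The size table follows directly from Pr[M] = 1 - (1 - 0.25)^|M|. A minimal sketch, with a hypothetical helper name and the edge probability 0.25 = 0.5 * 0.5 taken from the slide's assumption:

```python
def matching_size_needed(target, p_edge=0.25):
    """Smallest |M| such that 1 - (1 - p_edge)**|M| >= target."""
    size = 0
    while 1.0 - (1.0 - p_edge) ** size < target:
        size += 1
    return size


sizes = [matching_size_needed(t) for t in (0.9, 0.99, 0.999, 0.9999)]
```

The loop never looks at the database, only at the target and the per-edge probability, which is the point of the "not # of tuples" remark.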

There is not always a good-enough matching
Example: (X1 ∧ (Y1 ∨ Y2 ∨ … ∨ Ym)) ∨ (X2 ∧ Z). [Figure: X1 joined to Y1 … Ym, and X2 joined to Z.]
The best matching has probability about 0.4, but the formula is very close to 0.625!
Obs: if there is no “good-enough matching”, then the cover (nodes in any maximal matching) must be small. Formally, {X1, X2} is a small cover.
Approximate as X1 ∧ APX(Y1 ∨ Y2 ∨ … ∨ Ym) ∨ (X2 ∧ Z), where (Y1 ∨ Y2 ∨ … ∨ Ym) is a (k-1)-DNF.
Must approximate the (k-1)-DNF with a smaller e to account for correlations.
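The two numbers on this slide can be reproduced with a short calculation. The sketch assumes every atom is independent with probability 0.5 (carried over from the previous slide) and reads the second term as the edge X2-Z, i.e. X2 ∧ Z, which is the reading under which 0.625 comes out. Function names are hypothetical.

```python
def star_formula_probability(m, p=0.5):
    """Pr[(X1 ∧ (Y1 ∨ ... ∨ Ym)) ∨ (X2 ∧ Z)] for independent atoms of probability p."""
    pr_some_y = 1.0 - (1.0 - p) ** m   # Pr[Y1 ∨ ... ∨ Ym]
    pr_left = p * pr_some_y            # Pr[X1 ∧ (Y1 ∨ ... ∨ Ym)]
    pr_right = p * p                   # Pr[X2 ∧ Z]
    # The two parts share no atoms, so they are independent events.
    return 1.0 - (1.0 - pr_left) * (1.0 - pr_right)


def best_matching_probability(p=0.5):
    """Any matching has at most two edges here, e.g. X1-Y1 and X2-Z."""
    return 1.0 - (1.0 - p * p) ** 2


gap = star_formula_probability(20) - best_matching_probability()
```

The matching stays at 0.4375 however large m grows, while the full formula approaches 0.625, so no matching can push the error below roughly 0.19 in this example.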

SL is always small
THM (SL is always small): the size of SL is constant in the data.
Two cases:
Small good matching: we’re done!
Small cover of important nodes: recurse on the (k-1)-DNF.
Requires “non-vanishing” probabilities; in datasets, usually Pr > 10^-3.
Exponential in the query; similar to data complexity.
Problem: maximum matching in general hypergraphs is NP-hard, and approximating it is NP-hard too! But we only need a maximal matching: pick greedily!
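The case split can be sketched as follows. This is not the authors' algorithm, only an illustration of its first step under stated assumptions: atoms are independent, edges are scanned in the given order, and 1 - Pr[M] is used as a loose upper bound on the error Pr[l] - Pr[M]. The recursion on the (k-1)-DNF is left out, and the function names are hypothetical.

```python
def greedy_maximal_matching(edges):
    """Scan edges once, keeping each edge disjoint from those kept so far.

    The result is maximal (no further edge can be added), not maximum.
    Every edge touches a covered node, so `covered` is a vertex cover.
    """
    matching, covered = [], set()
    for edge in edges:
        if not (covered & edge):
            matching.append(edge)
            covered |= edge
    return matching, covered


def matching_or_cover(edges, probs, e):
    """Return ("matching", M) if M alone is good enough, else ("cover", nodes)."""
    matching, covered = greedy_maximal_matching(edges)
    miss = 1.0  # Pr[no edge of the matching is true]
    for edge in matching:
        p = 1.0
        for atom in edge:
            p *= probs[atom]
        miss *= 1.0 - p
    if miss <= e:  # error <= 1 - Pr[M] = miss
        return "matching", matching
    return "cover", covered


# X1 joined to Y1..Y4, plus the edge X2-Z; all atoms have probability 0.5.
edges = [frozenset({"X1", f"Y{i}"}) for i in range(1, 5)] + [frozenset({"X2", "Z"})]
probs = {atom: 0.5 for edge in edges for atom in edge}
kind, nodes = matching_or_cover(edges, probs, e=0.1)
```

In this example the greedy matching has only two edges, so the sketch falls into the cover case; a full implementation would then recurse on the smaller DNF attached to each cover node.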

Summary of Constructing SL
For SL, good lineage = big lineage. Not true in general.
Gave an algorithm that always finds small SL: constant in the data, exponential in almost everything else.
Main trick: don’t try to find optimal solutions when sloppy is good enough!

Other fun results in the paper
Sufficient Lineage (SL): error bounds for QP; finding influential tuples.
Polynomial Lineage (PL): DNF to polynomial.
Use a Taylor/Fourier approximation of the polynomial.
Algorithms for QP, explanations and influential tuples.
Leverage extensive prior art!
PL is smaller than SL, but not usable in pDBs (Mystiq, Trio).

Experiments
Geneontology Database: publicly available; predefined views; atoms = “evidence codes”.
Discuss a single view, “All proteins associated with a single protein”: 6 tables, 2 sources of evidence, 1119 tuples, 141MB.
Similar results on IMDB data, not presented.

Compression Ratio v. Error
[Plot: compression ratio against error level e (smaller e is more conservative).]
30x compression: 141MB to 4MB.
Good compression ratio even for stringent error.

Effect on QP
Compute each tuple in the view.
[Plot: running time in seconds (log10 scale) against error level e (smaller e is more conservative), for Original Lineage and Sufficient Lineage.]

Which l's give the biggest gain?
Compressing a single view.
[Plot: # terms of Original Lineage and Sufficient Lineage for the top 500 formulas, in descending size (# is rank).]
Win: compressing big terms.

Conclusion
Discussed the approximate lineage approach. Goal: fast QP, explanations.
Sufficient Lineage: can be used by standard QPs; improves QP dramatically.
Approximate lineage is more general, e.g. Polynomial Lineage.
