Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington.

1 Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington

2 High Level Overview
Evaluating Complex SQL on PDBs, 12/8/2006
DBMS: precise answers over clean data
Data are often imprecise
  - Information Integration
  - Information Extraction
Probabilistic DBs (PDBs) handle imprecision
  - Many low-quality answers
  - Top-K ranked by probability
This talk: compute Top-K efficiently

3 Overview
  - Motivating Example
  - Query Processing Background
  - Multisimulation
  - Experimental Results

4 Overview
  - Motivating Example
  - Query Processing Background
  - Multisimulation
  - Experimental Results

5 Example Application
IMDB
  - Lots of interesting data about movies (e.g. actors, directors)
  - Well maintained and clean
  - But no reviews!
On the web there are lots of reviews
  - How will I know which movie they are about? Alice needs to do information extraction and object reconciliation.
  - Is a movie good or bad? Alice wants to do sentiment analysis.
A probabilistic database can help Alice store and query her uncertain data.
Example query: find all years where ‘Anthony Hopkins’ starred in a good movie

6 Imprecision is out there…
Object Reconciliation
Data extracted from reviews:
RID   Title
r124  12 Monkeys
r155  Twelve Monkeys
r175  2 Monkey
r194  Monk
Clean IMDB data:
MID   Title
m232  12 Monkeys
m143  Monkey Love
Output: (RID, MID) pairs
Fellegi-Sunter approach: score (s) each (RID, MID) pair
[Figure: score axis with thresholds t’ and t and regions labeled Match and No Match]
Our approach: convert scores to probabilities

7 Imprecision is out there…
Object Reconciliation
RID   Title
r124  12 Monkeys
r155  Twelve Monkeys
r175  2 Monkey
r194  Monk
MID   Title
m232  12 Monkeys
m143  Monkey Love
RID   MID   Prob
r175  m232  0.8
r175  m143  0.2
Fellegi-Sunter approach: score (s) each (RID, MID) pair
[Figure: score axis with thresholds t’ and t and regions labeled Match and No Match]

8 Overview
  - Motivating Example
  - Query Processing Background
  - Multisimulation
  - Experimental Results

9 Query Processing Background
RID   MID   Prob
r175  m232  0.8
r175  m143  0.2
Query processing builds an event expression
Intensional Query Processing [FR97]
  - Associate an event with each tuple
  - Probability that the event is satisfied = query value
Technical point: projection as the last operator implies the result is a DNF
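The intensional approach above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tables `match` and `good`, the event names `e1`, `e2`, `g1`, `g2`, the helper functions, and in particular the premise that all events are independent. Only the pair (r175, m232) with probability 0.8 comes from the slides. The sketch joins two tables, ANDs the events of joined tuples, and ORs the results under projection, which yields a DNF; it then computes the exact probability by enumerating all worlds, which is only feasible for tiny inputs.

```python
from itertools import product

# Hypothetical uncertain tables; the last field of each row is its event variable.
match = [("r175", "m232", "e1"), ("r176", "m232", "e2")]  # (RID, MID, event)
good = [("r175", "g1"), ("r176", "g2")]                   # (RID, event): review is positive

def lineage(match, good, mid):
    """Join on RID (AND of events), then project onto MID (OR of clauses).

    Returns a DNF as a list of clauses, each clause a frozenset of event names.
    """
    return [frozenset({em, eg})
            for rid, m, em in match if m == mid
            for rid2, eg in good if rid2 == rid]

def exact_probability(dnf, prob):
    """Sum the weights of all worlds that satisfy the DNF (exponential time).

    Assumes the event variables are independent.
    """
    variables = sorted(set().union(*dnf))
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        if any(all(world[v] for v in clause) for clause in dnf):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if world[v] else 1.0 - prob[v]
            total += weight
    return total

prob = {"e1": 0.8, "g1": 0.5, "e2": 0.5, "g2": 0.5}  # hypothetical except e1
dnf = lineage(match, good, "m232")
p = exact_probability(dnf, prob)
```

With these made-up numbers the two clauses share no variables, so `p` is approximately 1 - (1 - 0.4)(1 - 0.25) = 0.55. The enumeration is exponential in the number of events, which is why the talk turns to sampling next.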

10 DNF Sampling at a High Level
Estimate p(t), the probability that the DNF is satisfied
  - Do this for each output tuple t
  - #P-hard [Valiant79], even for conjunctive queries only [RDS06, DS04]
  - Randomized approximation [LK84]
Simulation reduces uncertainty
[Figure: interval on a 0.0 to 1.0 axis showing the uncertainty about p(t)]
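As a rough illustration of randomized approximation for DNF probability, here is a minimal sketch in the spirit of the Karp-Luby estimator. Whether it matches the exact variant cited as [LK84] is an assumption, as are the function name `karp_luby`, the restriction to positive (negation-free) clauses, and the independence of all variables. Each sample picks a clause in proportion to its probability, draws a world in which that clause holds, and contributes S / N, where S is the sum of clause probabilities and N is the number of clauses the world satisfies; the average is an unbiased estimate of p(t).

```python
import random
from math import prod

def karp_luby(clauses, prob, n_samples, rng):
    """Estimate P(c1 OR c2 OR ...) for positive clauses over independent variables.

    clauses: list of frozensets of variable names (each clause is an AND).
    prob:    dict mapping variable name -> marginal probability.
    rng:     a random.Random instance, passed in for reproducibility.
    """
    weights = [prod(prob[v] for v in c) for c in clauses]  # P(clause holds)
    total = sum(weights)                                   # S, an upper bound on p(t)
    variables = sorted(set().union(*clauses))
    acc = 0.0
    for _ in range(n_samples):
        # Pick clause i with probability weights[i] / S.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on clause i being true.
        world = {v: (v in clauses[i]) or (rng.random() < prob[v])
                 for v in variables}
        # Count satisfied clauses; at least clause i is satisfied, so n_sat >= 1.
        n_sat = sum(all(world[v] for v in c) for c in clauses)
        acc += total / n_sat
    return acc / n_samples

# Hypothetical example: P(a OR b) with P(a) = P(b) = 0.5 is 0.75.
estimate = karp_luby([frozenset({"a"}), frozenset({"b"})],
                     {"a": 0.5, "b": 0.5}, 20000, random.Random(0))
```

More samples shrink the confidence interval around the estimate, which is the sense in which simulation reduces uncertainty about p(t).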

11 Naïve Query Processing
Naïve algorithm (PTIME): simulate until all intervals are small
  - “Epsilon”-small
[Figure: confidence intervals on a 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, labeled 1, 3, 4, 2]
Can we do better?
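The naïve strategy can be sketched as follows. This is a toy model, not the system's implementation: the `Interval` class, its hidden `truth` field, and the `simulate` step that halves the interval toward the truth are all assumptions standing in for a real Monte Carlo simulation, and the tuple names and probabilities are made up. The point is only to show that the naïve approach spends the same effort on every tuple, whether or not it is near the Top-K boundary.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    name: str
    truth: float      # hidden probability; used only by the toy simulate()
    lo: float = 0.0
    hi: float = 1.0

    def simulate(self):
        # Toy stand-in for one simulation round: halve the interval,
        # keeping the true probability inside it.
        self.lo += (self.truth - self.lo) / 2
        self.hi -= (self.hi - self.truth) / 2

def naive_top_k(items, k, eps):
    """Shrink every interval below eps, then rank by interval midpoint."""
    steps = 0
    for it in items:
        while it.hi - it.lo > eps:
            it.simulate()
            steps += 1
    ranked = sorted(items, key=lambda it: (it.lo + it.hi) / 2, reverse=True)
    return [it.name for it in ranked[:k]], steps

# Hypothetical tuples with made-up probabilities.
items = [Interval("t1", 0.9), Interval("t2", 0.7),
         Interval("t3", 0.3), Interval("t4", 0.1)]
top, steps = naive_top_k(items, k=2, eps=0.01)
```

In this toy model each tuple needs seven halvings to get below width 0.01, so the naïve run costs 28 steps even though `t3` and `t4` could be ruled out much earlier.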

12 Overview
  - Motivating Example
  - Query Processing Background
  - Multisimulation
  - Experimental Results

13 A Better Method: Multisimulation
Separate the Top-K with few simulations
  - Concentrate on intervals in the Top-K
  - Asymptotically, confidence intervals are nested
Compare against OPT
  - OPT “knows” which intervals to simulate
[Figure: confidence intervals on a 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, labeled 1, 3, 4, 2]

14 The Critical Region
The critical region is the interval
  - (kth-highest min, (k+1)st-highest max)
  - Shown for k = 2
[Figure: intervals on a 0.0 to 1.0 axis with the critical region marked]
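The definition of the critical region can be written down directly. The sketch below is a minimal illustration with made-up intervals; the function names and the reading that the Top-K is separated once the region is empty (lower end at or above upper end) are assumptions based on the slide's definition.

```python
def critical_region(intervals, k):
    """Return (c, d): c is the k-th highest lower bound,
    d is the (k+1)-st highest upper bound. Requires len(intervals) > k."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]

def is_separated(intervals, k):
    """Assumed stopping test: the region is empty, so k intervals
    lie entirely above all the others."""
    c, d = critical_region(intervals, k)
    return c >= d

# Hypothetical intervals (lo, hi), with k = 2 as on the slide.
before = [(0.6, 0.9), (0.5, 0.8), (0.1, 0.55), (0.0, 0.3)]
after = [(0.6, 0.9), (0.5, 0.8), (0.1, 0.45), (0.0, 0.3)]
region_before = critical_region(before, 2)
region_after = critical_region(after, 2)
```

Here `region_before` is (0.5, 0.55), so the third interval still overlaps the second; after it shrinks to (0.1, 0.45) the region becomes empty and the top two are separated.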

15 Three Simple Rules: Rule 1
Pick a “Double Crosser”
  - OPT must pick this too
[Figure: intervals on a 0.0 to 1.0 axis]

16 Three Simple Rules: Rule 2
If all crossers are lower crossers, or all are upper crossers, pick a maximal one
  - OPT must pick this too
[Figure: intervals on a 0.0 to 1.0 axis]

17 Three Simple Rules: Rule 3
Pick an upper and a lower crosser
  - OPT may only pick 1 of these two
[Figure: intervals on a 0.0 to 1.0 axis]
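The three rules can be combined into one selection step, sketched below. The slides do not spell out the definitions, so the following are assumptions: with critical region (c, d), a double crosser has lo < c and hi > d; an upper crosser has c <= lo < d < hi; a lower crosser has lo < c < hi <= d; and a "maximal" crosser is the upper crosser with the largest upper bound or the lower crosser with the smallest lower bound. The function names are hypothetical and ties between bounds are not handled.

```python
def critical_region(intervals, k):
    """(k-th highest lower bound, (k+1)-st highest upper bound)."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]

def choose_intervals(intervals, k):
    """Return indices of the intervals to simulate next ([] if none)."""
    c, d = critical_region(intervals, k)
    if c >= d:
        return []                       # region empty: Top-K is separated
    idx = range(len(intervals))
    # Assumed crosser definitions (see lead-in).
    double = [i for i in idx if intervals[i][0] < c and intervals[i][1] > d]
    upper = [i for i in idx if c <= intervals[i][0] < d < intervals[i][1]]
    lower = [i for i in idx if intervals[i][0] < c < intervals[i][1] <= d]
    if double:                          # Rule 1: pick a double crosser
        return [double[0]]
    picks = []
    # Rule 2 (only one kind present): pick a maximal crosser of that kind.
    # Rule 3 (both kinds present): pick one upper and one lower crosser.
    if upper:
        picks.append(max(upper, key=lambda i: intervals[i][1]))
    if lower:
        picks.append(min(lower, key=lambda i: intervals[i][0]))
    return picks

# Hypothetical intervals (lo, hi).
rule1 = choose_intervals([(0.2, 0.9), (0.3, 0.6), (0.1, 0.5)], k=1)
rule2 = choose_intervals([(0.5, 0.9), (0.5, 0.8)], k=1)
rule3 = choose_intervals([(0.5, 0.55), (0.1, 0.9), (0.1, 0.8), (0.05, 0.7)], k=2)
```

A full multisimulation loop would call `choose_intervals`, run one simulation step on each returned interval, and repeat until the result is empty. In the Rule 3 case two intervals are simulated where OPT might need only one, which is where the slides' factor of 2 comes from.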

18 Multisimulation is a 2-Approximation
Theorem: Multisimulation performs at most twice as many simulations as OPT
  - And no deterministic algorithm can do better on every instance.
Extensions
  - Top-K Set (shown)
  - Anytime (produce from 1 to k)
  - Rank (produce top k ranked)
  - All (rank all intervals)

19 Overview
  - Motivating Example
  - Query Processing Background
  - Multisimulation
  - Experimental Results

20 Experiment Details: Uncertain tuples
Table          # Tuples
StringMatch    339k
ActorMatch     6,758k
DirectorMatch  18k
Table          # Tuples
Reviews        292k

21 Running Time

22 Running Time
“Find all years in which Anthony Hopkins was in a highly rated movie” (SS)
  - Small number of output tuples (33)
  - Small DNFs per output (Avg. 20.4, Max. 63)

23 Running Time
“Find all directors who have a highly rated drama but low rated comedy” (LL)
  - Large number of output tuples (1415)
  - Large DNFs per output (Avg. 234.8, Max. 9088)

24 Conclusions
MystiQ is a general-purpose probabilistic database
Multisimulation and logical optimization
  - Key to performance on large data sets
Advert: demo on my laptop

25 Running Time
“Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction” (SL)
  - Small number of output tuples (33)
  - Large DNFs per output (Avg. 117.7, Max. 685)

26 Running Time
“Find all directors in the 80s who had a highly rated movie” (LS)
  - Large number of output tuples (3259)
  - Small DNFs per output (Avg. 3.03, Max. 30)


