Top-K Query Evaluation on Probabilistic Data
Christopher Ré, Nilesh Dalvi and Dan Suciu
University of Washington

Evaluating Complex SQL on PDBs, 12/8/2006

High Level Overview
DBMS: precise answers over clean data
Data are often imprecise
  - Information Integration
  - Information Extraction
Probabilistic DBs (PDBs) handle imprecision
  - Many low-quality answers
  - Top-K ranked by probability
This talk: compute Top-K efficiently

Overview
  - Motivating Example
  - Query Processing Background
  - Multisimulation
  - Experimental Results

Example Application
IMDB
  - Lots of interesting data about movies (e.g. actors, directors)
  - Well maintained and clean
  - But no reviews!
On the web there are lots of reviews
  - How will I know which movie they are about? Alice needs to do information extraction and object reconciliation.
  - Is a movie good or bad? Alice wants to do sentiment analysis.
A probabilistic database can help Alice store and query her uncertain data.
Example query: find all years where 'Anthony Hopkins' starred in a good movie

Imprecision is out there… Object Reconciliation

Data extracted from reviews:
RID    Title
r124   12 Monkeys
r155   Twelve Monkeys
r175   2 Monkey
r194   Monk

Clean IMDB data:
MID    Title
m232   12 Monkeys
m143   Monkey Love

Output: (RID, MID) pairs
Fellegi-Sunter approach: score (s) each (RID, MID) pair
[Figure: score axis with thresholds t' and t, marking the Match and No Match regions]
Our approach: convert scores to probabilities

Imprecision is out there… Object Reconciliation (continued)
Fellegi-Sunter approach: score (s) each (RID, MID) pair, then convert the scores to probabilities
RID    MID    Prob
r175   m
r175   m

Query Processing Background
RID    MID    Prob
r175   m
r175   m
Query processing builds an event expression
Intensional Query Processing [FR97]
  - Associate an event with each tuple
  - Probability that the event is satisfied = query value
Technical point: projection as the last operator implies the result is a DNF
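To make the event-expression idea concrete, here is a minimal sketch, assuming independent tuple events and a lineage given as a DNF (a list of clauses, each a set of event names). The function name `dnf_probability` and the events `m1`, `g1`, `m2`, `g2` (a match event and a "good review" event per candidate movie) are hypothetical and do not come from the talk. The sketch enumerates every possible world, so it is only usable for tiny expressions.

```python
from itertools import product


def dnf_probability(clauses, prob):
    """Exact probability that a positive DNF is satisfied.

    clauses: list of sets of event names; each set is one conjunction.
    prob:    dict mapping each event name to its marginal probability.
    Assumes all events are independent (an assumption of this sketch).
    """
    variables = sorted(set().union(*clauses))
    total = 0.0
    # Enumerate every possible world: 2^n truth assignments.
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        weight = 1.0
        for v in variables:
            weight *= prob[v] if world[v] else 1.0 - prob[v]
        # The DNF holds if at least one conjunction is fully true.
        if any(all(world[v] for v in clause) for clause in clauses):
            total += weight
    return total


# Hypothetical lineage of one output year: (m1 AND g1) OR (m2 AND g2).
lineage = [{"m1", "g1"}, {"m2", "g2"}]
marginals = {"m1": 0.8, "g1": 0.5, "m2": 0.3, "g2": 0.9}
print(dnf_probability(lineage, marginals))
```

The exponential loop is the reason the following slides turn to sampling: the number of worlds doubles with every event in the expression.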

DNF Sampling at a High Level
Estimate p(t), the probability that the DNF for t is satisfied
  - Do this for each output tuple t
  - #P-hard [Valiant79], even for conjunctive queries only [RDS06, DS04]
  - Randomized approximation [LK84]
Simulation reduces uncertainty about p(t)
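One well-known randomized approximation for DNF probability is the Karp-Luby style coverage estimator. The sketch below illustrates that general technique under the assumption of independent events and positive clauses; it is not necessarily the exact procedure used in the talk, and the names `karp_luby_estimate`, `clauses`, and `prob` are hypothetical.

```python
import random
from math import prod


def karp_luby_estimate(clauses, prob, n_samples, rng):
    """Estimate P(DNF is satisfied) with a Karp-Luby style estimator.

    clauses: list of sets of event names (positive conjunctions).
    prob:    dict from event name to marginal probability (independent).
    rng:     a random.Random instance, passed in for reproducibility.
    """
    # Weight of a clause = probability that all of its events are true.
    weights = [prod(prob[v] for v in clause) for clause in clauses]
    total = sum(weights)
    if total == 0:
        return 0.0
    variables = sorted(set().union(*clauses))
    hits = 0
    for _ in range(n_samples):
        # Pick a clause with probability proportional to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on clause i being true.
        world = {v: (v in clauses[i]) or (rng.random() < prob[v])
                 for v in variables}
        # Count the sample only if i is the first satisfied clause,
        # so each satisfying world is counted exactly once.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[v] for v in c))
        hits += (first == i)
    return total * hits / n_samples


estimate = karp_luby_estimate(
    [{"x1", "x2"}, {"x1", "x3"}],
    {"x1": 0.5, "x2": 0.4, "x3": 0.6},
    n_samples=20000,
    rng=random.Random(0),
)
print(estimate)
```

Each extra sample narrows the confidence interval around p(t), which is the sense in which simulation reduces uncertainty.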

Naïve Query Processing
Naïve algorithm (PTIME): simulate until all intervals are small
  - "Epsilon"-small
[Figure: intervals for Christopher Walken, Harvey Keitel, Samuel L. Jackson, Bruce Willis]
Can we do better?

A Better Method: Multisimulation
Separate the Top-K with few simulations
  - Concentrate on intervals in the Top-K
  - Asymptotically, confidence intervals are nested
Compare against OPT
  - OPT "knows" which intervals to simulate
[Figure: intervals for Christopher Walken, Harvey Keitel, Samuel L. Jackson, Bruce Willis]

The Critical Region
The critical region is the interval
  - (kth-highest min, (k+1)st-highest max)
  - For k =
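The definition above can be sketched in a few lines, assuming each output tuple carries a confidence interval given as a `(min, max)` pair. The function name `critical_region` and the example numbers are hypothetical. When the returned lower end is at least the upper end, the region is empty and the Top-K set is separated from the rest.

```python
def critical_region(intervals, k):
    """Return (c, d): c is the kth-highest min, d the (k+1)st-highest max.

    intervals: list of (lo, hi) confidence intervals, one per tuple.
    Requires len(intervals) > k.
    """
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


# Four hypothetical intervals, asking for the top 2.
before = [(0.6, 0.9), (0.5, 0.8), (0.2, 0.7), (0.1, 0.3)]
print(critical_region(before, 2))   # non-empty region: not yet separated

# After more simulation the third interval has shrunk.
after = [(0.6, 0.9), (0.5, 0.8), (0.3, 0.45), (0.1, 0.3)]
c, d = critical_region(after, 2)
print(c >= d)                       # empty region: top 2 are separated
```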

Three Simple Rules: Rule 1
Pick a "Double Crosser"
  - OPT must pick this too

Three Simple Rules: Rule 2
All crossers are lower crossers (or all are upper crossers): pick the maximal one
  - OPT must pick this too

Three Simple Rules: Rule 3
Pick an upper and a lower crosser
  - OPT may only pick 1 of these two
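The three rules can be read as one selection step that decides which intervals to simulate next. The sketch below is one interpretation and rests on several assumptions that the slides do not spell out: a crosser is an interval overlapping the critical region (c, d); a lower crosser has its min below c; an upper crosser has its max above d; a double crosser is both; and "maximal" means widest. The names `critical_region` and `choose_intervals` are hypothetical.

```python
def critical_region(intervals, k):
    """(kth-highest min, (k+1)st-highest max) over (lo, hi) intervals."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


def choose_intervals(intervals, k):
    """Return the indices to simulate next, or [] once top-k is separated."""
    c, d = critical_region(intervals, k)
    if c >= d:
        return []  # empty critical region: nothing left to separate

    def width(i):
        return intervals[i][1] - intervals[i][0]

    # Assumed definitions (see lead-in): crossers overlap (c, d).
    crossers = [i for i, (lo, hi) in enumerate(intervals) if lo < d and hi > c]
    lower = [i for i in crossers if intervals[i][0] < c]
    upper = [i for i in crossers if intervals[i][1] > d]
    double = [i for i in lower if i in upper]

    if double:                      # Rule 1: pick a double crosser
        return [max(double, key=width)]
    if lower and upper:             # Rule 3: one upper and one lower crosser
        return [max(upper, key=width), max(lower, key=width)]
    if crossers:                    # Rule 2: only one kind, pick the maximal
        return [max(crossers, key=width)]
    return []


example = [(0.6, 0.95), (0.5, 0.8), (0.2, 0.7), (0.1, 0.3)]
print(choose_intervals(example, 2))
```

A full multisimulation loop would call this step repeatedly, run more simulation on the returned intervals so they shrink, and stop when it returns an empty list.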

Multisimulation is a 2-Approximation
Thm: Multisimulation performs at most twice as many simulations as OPT
  - And no deterministic algorithm can do better on every instance.
Extensions
  - Top-K Set (shown)
  - Anytime (produce from 1 to k)
  - Rank (produce top k ranked)
  - All (rank all intervals)

Experiment Details

Uncertain tuples:
Table           # Tuples
StringMatch     339k
ActorMatch      6,758k
DirectorMatch   18k

Table           # Tuples
Reviews         292k

Running Time (SS)
"Find all years in which Anthony Hopkins was in a highly rated movie"
  - Small number of tuples output (33)
  - Small DNFs per output (Avg. 20.4, Max 63)

Running Time (LL)
"Find all directors who have a highly rated drama but low rated comedy"
  - Large number of tuples output (1415)
  - Large DNFs per output (Avg , Max. 9088)

Conclusions
MystiQ is a general-purpose probabilistic database
Multisimulation and logical optimization
  - Key to performance on large data sets
Advert: demo on my laptop

Running Time (SL)
"Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction"
  - Small number of tuples output (33)
  - Large DNFs per output (Avg , Max 685)

Running Time (LS)
"Find all directors in the 80s who had a highly rated movie"
  - Large number of tuples output (3259)
  - Small DNFs per output (Avg 3.03, Max 30)
