Scrubbing Query Results from Probabilistic Databases Jianwen Chen, Ling Feng, Wenwei Xue.

A skeleton of scrubbing probabilistic database query results

Three probabilistic relation examples

Query 1: look for the year(s) where at least one movie was liked by people from northern regions.
The user gets the following answer from the probabilistic database:
User: Where is the probability derived from?
System: It is based on two assumptions: Pr(x4) = 0.9 and Pr(x5) = 0.2.
User: I think the movie with MovieID = 4 is not actually liked by people from northern regions. Pr(x4) should be 0.1, not 0.9!
System: The new probability is 0.28!
How to identify the top-k uncertain assumptions for user clarification?
How to recompute the probability?

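Before the correction:
Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) – Pr(x4) * Pr(x5) = 0.9 + 0.2 – 0.9 * 0.2 = 0.92
After the correction:
Pr(ee) = Pr(x4 ∨ x5) = Pr(x4) + Pr(x5) – Pr(x4) * Pr(x5) = 0.1 + 0.2 – 0.1 * 0.2 = 0.28
Top-k assumptions (table with columns EventID, Prob., Rate; cell values not preserved in the transcript)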
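The Query 1 arithmetic can be checked with a few lines of Python. This is a minimal sketch: the helper name `pr_or` is hypothetical, and it assumes x4 and x5 are independent, as the product term in the slide's formula implies.

```python
def pr_or(p, q):
    """Probability of the disjunction of two independent events."""
    return p + q - p * q

# Original assumptions: Pr(x4) = 0.9, Pr(x5) = 0.2
before = pr_or(0.9, 0.2)
# After the user corrects Pr(x4) to 0.1
after = pr_or(0.1, 0.2)

print(round(before, 2), round(after, 2))
```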

Basic algorithm to compute top-k assumptions
For an event expression ee, to compute its probability Pr(ee), one can first convert it into an equivalent disjunctive normal form, and then apply the inclusion-exclusion formula.
Disjunctive normal form: ee = C1 ∨ C2 ∨ … ∨ Cm, where
C1 = e_{1,1} ∧ e_{1,2} ∧ … ∧ e_{1,s1},
C2 = e_{2,1} ∧ e_{2,2} ∧ … ∧ e_{2,s2},
...,
Cm = e_{m,1} ∧ e_{m,2} ∧ … ∧ e_{m,sm},
with m ≥ 1 and s1, s2, …, sm ≥ 1.
Inclusion-exclusion formula:
Pr(C1 ∨ C2 ∨ … ∨ Cm) = Σ_{i} Pr(Ci) – Σ_{i<j} Pr(Ci ∧ Cj) + Σ_{i<j<k} Pr(Ci ∧ Cj ∧ Ck) – … + (–1)^(m+1) Pr(C1 ∧ C2 ∧ … ∧ Cm)
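The DNF plus inclusion-exclusion procedure can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the representation (each conjunct as a set of `(event, is_positive)` literals) and the function names are assumptions, and basic events are assumed to be independent.

```python
from itertools import combinations

def pr_conjunction(literals, prob):
    """Probability that all literals hold, assuming independent basic events.

    literals: iterable of (event, is_positive) pairs.
    """
    polarity = {}
    for event, positive in literals:
        # An event required both true and false makes the conjunction impossible.
        if polarity.setdefault(event, positive) != positive:
            return 0.0
    result = 1.0
    for event, positive in polarity.items():
        result *= prob[event] if positive else 1.0 - prob[event]
    return result

def pr_dnf(conjuncts, prob):
    """Inclusion-exclusion over all non-empty subsets of conjuncts (2^m - 1 terms)."""
    total = 0.0
    for k in range(1, len(conjuncts) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(conjuncts, k):
            merged = set().union(*subset)  # conjunction of the chosen conjuncts
            total += sign * pr_conjunction(merged, prob)
    return total

# Query 1: ee = x4 ∨ x5
p1 = pr_dnf([{("x4", True)}, {("x5", True)}], {"x4": 0.9, "x5": 0.2})

# Query 2: ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)
probs = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}
p2 = pr_dnf(
    [
        {("e1", True), ("e2", False)},
        {("e3", True), ("e4", False)},
        {("e5", True), ("e6", False)},
    ],
    probs,
)
print(p1, p2)
```

The nested loop visits every non-empty subset of the m conjuncts, which is where the exponential cost of this basic approach comes from.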

Basic algorithm to compute top-k assumptions
To compute the rate at which Pr(ee) changes with Pr(e_i), one can rewrite Pr(ee) as
Pr(ee) = α * Pr(e_i) + β
where α and β are two sub-expressions that do not depend on Pr(e_i); the rate is then α.
The time complexity is O(2^m), where m is the number of conjuncts in the disjunctive normal form of ee.
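Because Pr(ee) = α * Pr(e_i) + β is linear in Pr(e_i), the two coefficients can be recovered from two evaluations: β is the value at Pr(e_i) = 0, and α is the difference between the values at Pr(e_i) = 1 and Pr(e_i) = 0. The sketch below illustrates this on Query 1; the function name `linear_coefficients` and the callable `pr_query1` are hypothetical, and independence of the basic events is assumed.

```python
def linear_coefficients(pr_ee, prob, event):
    """Return (alpha, beta) such that Pr(ee) = alpha * Pr(event) + beta.

    pr_ee: function mapping a dict of event probabilities to Pr(ee).
    Relies on Pr(ee) being linear in each single event probability.
    """
    at_zero = dict(prob)
    at_zero[event] = 0.0
    at_one = dict(prob)
    at_one[event] = 1.0
    beta = pr_ee(at_zero)
    alpha = pr_ee(at_one) - beta
    return alpha, beta

# Query 1: ee = x4 ∨ x5 with independent events
def pr_query1(p):
    return p["x4"] + p["x5"] - p["x4"] * p["x5"]

alpha, beta = linear_coefficients(pr_query1, {"x4": 0.9, "x5": 0.2}, "x4")

# Re-evaluating after the user's correction only needs alpha and beta.
scrubbed = alpha * 0.1 + beta
print(alpha, beta, scrubbed)
```

Once α and β are known, a corrected value of Pr(e_i) gives the new Pr(ee) with one multiplication and one addition.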

Optimization
We restrict the event expression ee to the case where the basic events e1, e2, …, en are independent and each occurs at most once in ee. This form can be obtained for most queries (80% of the TPC-H queries) by using the well-researched optimization technique adopted in:
Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB Journal 16(4) (2007) 523–544

Three probabilistic relation examples

Query 2: look for the year(s) where at least one movie was liked by people from northern regions but not by people from southern regions.
The user gets the following answer from the uncertain database:

ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)
Pr(e1) = 0.2, Pr(e2) = 0.7, Pr(e3) = 0.1, Pr(e4) = 0.9, Pr(e5) = 0.7, Pr(e6) = 0.2
Pr(ee)?
Pr(~ee) = 1 – Pr(ee)
Pr(ee1 ∧ ee2) = Pr(ee1) * Pr(ee2)
Pr(ee1 ∨ ee2) = Pr(ee1) + Pr(ee2) – Pr(ee1) * Pr(ee2)
Pr(ee) = f(Pr(e1), Pr(e2), …, Pr(e6))
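The three rules above translate directly into a recursive evaluator over an expression tree. The following is a minimal sketch under stated assumptions: the nested-tuple encoding (`("and", l, r)`, `("or", l, r)`, `("not", c)`, or an event name) and the function name `pr` are hypothetical, and the rules are only valid when the basic events are independent and none occurs more than once.

```python
def pr(node, prob):
    """Evaluate Pr of an expression tree using the negation, AND, and OR rules."""
    if isinstance(node, str):          # leaf: a basic event
        return prob[node]
    op = node[0]
    if op == "not":
        return 1.0 - pr(node[1], prob)
    left = pr(node[1], prob)
    right = pr(node[2], prob)
    if op == "and":
        return left * right
    if op == "or":
        return left + right - left * right
    raise ValueError(f"unknown operator: {op}")

# ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6), grouped as a binary tree
ee = (
    "or",
    ("or", ("and", "e1", ("not", "e2")), ("and", "e3", ("not", "e4"))),
    ("and", "e5", ("not", "e6")),
)
probs = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}

result = pr(ee, probs)
print(round(result, 3))
```

Each node is visited once, so the cost is linear in the size of the expression, in contrast to the exponential inclusion-exclusion computation.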

ee = (e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)
Pr(e1) = 0.2, Pr(e2) = 0.7, Pr(e3) = 0.1, Pr(e4) = 0.9, Pr(e5) = 0.7, Pr(e6) = 0.2
Evaluation of the expression tree, node by node:
Node ~e2: Pr(ee(N)) = 1 – Pr(ee(leftChild(N))) = 1 – 0.7 = 0.3
Node e1 ∧ ~e2: Pr(ee(N)) = Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.2 * 0.3 = 0.06
Node ~e4: Pr(ee(N)) = 1 – Pr(ee(leftChild(N))) = 1 – 0.9 = 0.1
Node e3 ∧ ~e4: Pr(ee(N)) = Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.1 * 0.1 = 0.01
Node (e1 ∧ ~e2) ∨ (e3 ∧ ~e4): Pr(ee(N)) = Pr(ee(leftChild(N))) + Pr(ee(rightChild(N))) – Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.06 + 0.01 – 0.06 * 0.01 = 0.0694
Node ~e6: Pr(ee(N)) = 1 – Pr(ee(leftChild(N))) = 1 – 0.2 = 0.8
Node e5 ∧ ~e6: Pr(ee(N)) = Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.7 * 0.8 = 0.56
Root: Pr(ee(N)) = Pr(ee(leftChild(N))) + Pr(ee(rightChild(N))) – Pr(ee(leftChild(N))) * Pr(ee(rightChild(N))) = 0.0694 + 0.56 – 0.0694 * 0.56 ≈ 0.591

Second Optimization

(e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6): top-2 assumptions
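One way to obtain a rate for every basic event of this expression is to push the derivative of Pr(ee) down the tree from the root. The sketch below is an illustration, not necessarily the method shown on the slides: the tuple encoding, the names `pr`, `rates`, and `top_k`, and the choice to rank assumptions by the absolute value of the rate are all assumptions, so the ranking it produces may differ from the slide's. Independent, non-repeated basic events are assumed.

```python
def pr(node, prob):
    """Probability of an expression tree over independent, non-repeated events."""
    if isinstance(node, str):
        return prob[node]
    if node[0] == "not":
        return 1.0 - pr(node[1], prob)
    left, right = pr(node[1], prob), pr(node[2], prob)
    if node[0] == "and":
        return left * right
    if node[0] == "or":
        return left + right - left * right
    raise ValueError(f"unknown operator: {node[0]}")

def rates(node, prob, rate=1.0, out=None):
    """Collect d Pr(root) / d Pr(e) for every leaf e, via the chain rule.

    Subtree probabilities are recomputed on the way down for brevity;
    caching them would avoid the repeated work.
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        out[node] = rate
        return out
    if node[0] == "not":               # d(1 - x)/dx = -1
        return rates(node[1], prob, -rate, out)
    p_left, p_right = pr(node[1], prob), pr(node[2], prob)
    if node[0] == "and":               # d(x*y)/dx = y
        rates(node[1], prob, rate * p_right, out)
        rates(node[2], prob, rate * p_left, out)
    else:                              # "or": d(x + y - x*y)/dx = 1 - y
        rates(node[1], prob, rate * (1.0 - p_right), out)
        rates(node[2], prob, rate * (1.0 - p_left), out)
    return out

def top_k(node, prob, k):
    """Events with the k largest absolute rates (ranking criterion is an assumption)."""
    r = rates(node, prob)
    return sorted(r, key=lambda e: abs(r[e]), reverse=True)[:k]

ee = (
    "or",
    ("or", ("and", "e1", ("not", "e2")), ("and", "e3", ("not", "e4"))),
    ("and", "e5", ("not", "e6")),
)
probs = {"e1": 0.2, "e2": 0.7, "e3": 0.1, "e4": 0.9, "e5": 0.7, "e6": 0.2}

r = rates(ee, probs)
best = top_k(ee, probs, 2)
print(r, best)
```

A negative rate means that raising the event's probability lowers Pr(ee), which is the case for the negated events e2, e4, and e6.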

Scrub the query result
Recompute Pr((e1 ∧ ~e2) ∨ (e3 ∧ ~e4) ∨ (e5 ∧ ~e6)) with modified Pr(e2) and Pr(e5)

Performance Study

Conclusion