A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.

Slides:



Advertisements
Similar presentations
Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization Christopher Re and Dan Suciu University of Washington 1.
Advertisements

Representing and Querying Correlated Tuples in Probabilistic Databases
D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
A COURSE ON PROBABILISTIC DATABASES June, 2014Probabilistic Databases - Dan Suciu 1.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A Modified by Donghui Zhang.
INFS614, Fall 08 1 Relational Algebra Lecture 4. INFS614, Fall 08 2 Relational Query Languages v Query languages: Allow manipulation and retrieval of.
1 Relational Algebra & Calculus. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
Efficient Query Evaluation on Probabilistic Databases
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
FALL 2004CENG 351 File Structures and Data Managemnet1 Relational Algebra.
1 8. Safe Query Languages Safe program – its semantics can be at least partially computed on any valid database input. Safety is tied to program verification,
FALL 2004CENG 351 File Structures and Data Management1 SQL: Structured Query Language Chapter 5.
1 Relational Algebra. 2 Relational Query Languages Query languages: Allow manipulation and retrieval of data from a database. Relational model supports.
CSE 830: Design and Theory of Algorithms
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 SQL: Queries, Constraints, Triggers Chapter 5.
Introduction to Database Systems 1 Relational Algebra Relational Model: Topic 3.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
SQL Tutorial Introduction to Database. Learning Objectives  Read and write Data Definition grammar of SQL  Read and write data modification statements.
Rutgers University Relational Algebra 198:541 Rutgers University.
Relational Algebra Chapter 4 - part I. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
CSCD343- Introduction to databases- A. Vaisman1 Relational Algebra.
Relational Algebra, R. Ramakrishnan and J. Gehrke (with additions by Ch. Eick) 1 Relational Algebra.
1 Relational Algebra and Calculus Chapter 4. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.
The Relational Model: Relational Calculus
CSC 411/511: DBMS Design Dr. Nan WangCSC411_L6_SQL(1) 1 SQL: Queries, Constraints, Triggers Chapter 5 – Part 1.
1 Lecture 11: Bloom Filters, Final Review December 7, 2011 Dan Suciu -- CSEP544 Fall 2011.
CSE314 Database Systems The Relational Algebra and Relational Calculus Doç. Dr. Mehmet Göktürk src: Elmasri & Navanthe 6E Pearson Ed Slide Set.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
Tutorial 6 SQL Muhammad Sulayman
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
1 Relational Algebra and Calculas Chapter 4, Part A.
1.1 CAS CS 460/660 Introduction to Database Systems Relational Algebra.
Relational Algebra.
ICS 321 Fall 2011 The Relational Model of Data (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 8/29/20111Lipyeow.
1 Relational Algebra Chapter 4, Sections 4.1 – 4.2.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Database Management Systems Chapter 4 Relational Algebra.
CSCD34-Data Management Systems - A. Vaisman1 Relational Algebra.
Database Management Systems, R. Ramakrishnan1 Relational Algebra Module 3, Lecture 1.
1 CSCE Database Systems Anxiao (Andrew) Jiang The Database Language SQL.
CS4432: Database Systems II Query Processing- Part 2.
1 CS 430 Database Theory Winter 2005 Lecture 5: Relational Algebra.
Lectures 2&3: Introduction to SQL. Lecture 2: SQL Part I Lecture 2.
Query Processing – Implementing Set Operations and Joins Chap. 19.
1 SQL: The Query Language. 2 Example Instances R1 S1 S2 v We will use these instances of the Sailors and Reserves relations in our examples. v If the.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Modified by Veeranjaneyulu Sadhanala.
SQL (Structured Query Language)
Database Systems (資料庫系統)
Slides are reused by the approval of Jeffrey Ullman’s
A Course on Probabilistic Databases
Database Construction and Usage
Relational Algebra Chapter 4, Part A
Basic SQL Lecture 6 Fall
Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.
Database Systems 10/13/2010 Lecture #4.
Relational Algebra.
Lecture 16: Probabilistic Databases
Relational Algebra 1.
LECTURE 3: Relational Algebra
Relational Algebra Chapter 4 - part I.
Lecture 12: SQL Friday, October 20, 2000.
Probabilistic Databases
CENG 351 File Structures and Data Managemnet
Relational Algebra Chapter 4 - part I.
Presentation transcript:

A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1

Outline 1. Motivating Applications 2. The Probabilistic Data ModelChapter 2 3. Extensional Query PlansChapter The Complexity of Query EvaluationChapter 3 5. Extensional EvaluationChapter Intensional EvaluationChapter 5 7. Conclusions June, 2014Probabilistic Databases - Dan Suciu 2 Part 1 Part 2 Part 3 Part 4

Background: Relational Algebra 1. Join 2. Projection (w/ duplicate elimination) 3. Union 1. Selection 2. Difference: will not discuss June, 2014Probabilistic Databases - Dan Suciu 3 ⋈ Π σ ∪ -

Background: Query Plans June, 2014Probabilistic Databases - Dan Suciu 4 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 Q(z) = R(z,x), S(x,y),T(y,u)

Background: Query Plans June, 2014Probabilistic Databases - Dan Suciu 5 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) Q(z) = R(z,x), S(x,y),T(y,u)

Background: Query Plans June, 2014Probabilistic Databases - Dan Suciu 6 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) Q(z) = R(z,x), S(x,y),T(y,u)

Background: Query Plans June, 2014Probabilistic Databases - Dan Suciu 7 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) Q(z) = R(z,x), S(x,y),T(y,u)

Background: Query Plans June, 2014Probabilistic Databases - Dan Suciu 8 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 SELECT DISTINCT R.z FROM R, S, T WHERE R.x = S.x and S.y=T.y and T.u = 123 ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) ⋈y⋈y ΠzΠz ⋈x⋈x σ u=123 R(z,x)S(x,y)T(y,u) Q(z) = R(z,x), S(x,y),T(y,u) These plans are equivalent! The query optimizer will select a plan with minimal cost.

Extensional Plans Main idea: Modify each operator to compute output probabilities Must make some assumption: Probabilities are independent, or Probabilities are disjoint (exclusive) June, 2014Probabilistic Databases - Dan Suciu 9

ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 S(A,B) AP a1p1 a2p2 a3p3 R(A) ⋈ ABP a1b1p1*q1 a1b2p1*q2 a2b3p2*q3 a2b4p2*q4 a2b5p2*q5 Extensional Operators June, 2014Probabilistic Databases - Dan Suciu 10 i Independent join

ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 S(A,B) AP a1p1 a2p2 a3p3 R(A) ⋈ ABP a1b1p1*q1 a1b2p1*q2 a2b3p2*q3 a2b4p2*q4 a2b5p2*q5 Extensional Operators June, 2014Probabilistic Databases - Dan Suciu 11 i S(A,B) AP a11 - (1-q1)*(1-q2) a21 - (1-q3)*(1-q4)*(1-q5) ΠAΠA i ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 Independent join Independent project

ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 S(A,B) AP a1p1 a2p2 a3p3 R(A) ⋈ ABP a1b1p1*q1 a1b2p1*q2 a2b3p2*q3 a2b4p2*q4 a2b5p2*q5 Extensional Operators June, 2014Probabilistic Databases - Dan Suciu 12 i S(A,B) AP a11 - (1-q1)*(1-q2) a21 - (1-q3)*(1-q4)*(1-q5) ΠAΠA i ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 ABP a1b1q1 a1b1q2 a2b2q3 a2b3q4 a2b2q5 S(A,B) σ A=a2 ABP a2b2q3 a2b3q4 a2b2q5 Independent join Independent project Selection

Example June, 2014Probabilistic Databases - Dan Suciu 13 S R SELECT DISTINCT ‘true’ FROM R, S WHERE R.x = S.x SELECT DISTINCT ‘true’ FROM R, S WHERE R.x = S.x Q() = R(x), S(x,y) xP a1a1 p1p1 a2a2 p2p2 a3a3 p3p3 xyP a1a1 b1b1 q1q1 a1a1 b2b2 q2q2 a2a2 b3b3 q3q3 a2a2 b4b4 q4q4 a2a2 b5b5 q5q5 P(Q) = 1 – [1-p 1 *(1-(1-q 1 )*(1-q 2 ))] *[1- p 2 *(1-(1-q 3 )*(1-q 4 )*(1-q 5 ))]

xP a1a1 p1p1 a2a2 p2p2 a3a3 p3p3 ⋈ p 1 q 1 p 1 q 2 p 2 q 3 p 2 q 4 p 2 q 5 ΠΦΠΦ S(x,y)R(x) 1-(1-p 1 q 1 )(1-p 1 q 2 )(1-p 2 q 3 )(1-p 2 q 4 )(1-p 2 q 5 ) ⋈ ΠΦΠΦ S(x,y) R(x) ΠxΠx 1-(1-q 1 )(1-q 2 ) 1-(1-q 4 )(1-q 5 ) (1-q 6 ) 1-{1-p 1 [1-(1-q 1 )(1-q 2 )]}* {1-p 2 [1-(1-q 4 )(1-q 5 ) (1-q 6 )]} Wrong Right June, 2014Probabilistic Databases - Dan Suciu 14 P(Q) = 1 – [1-p 1 *(1-(1-q 1 )*(1-q 2 ))] *[1- p 2 *(1-(1-q 3 )*(1-q 4 )*(1-q 5 ))] SELECT DISTINCT ‘true’ FROM R, S WHERE R.x = S.x SELECT DISTINCT ‘true’ FROM R, S WHERE R.x = S.x Q() = R(x), S(x,y) xyP a1a1 b1b1 q1q1 a1a1 b2b2 q2q2 a2a2 b3b3 q3q3 a2a2 b4b4 q4q4 a2a2 b5b5 q5q5

Safe Plans Fix a schema for the probabilistic database E.g. all relations are tuple-independent, or BID with a given key Query optimization = find a safe plan June, 2014Probabilistic Databases - Dan Suciu 15 Definition: A plan is safe if it computes probabilities correctly

Lesson 1 Equivalent plans may become in-equivalent when interpreted as extensional plans A correct extensional plan is called a safe plan Goal: find a safe plan! Does every query have a safe plan? June, 2014Probabilistic Databases - Dan Suciu 16

Unsafe Queries June, 2014Probabilistic Databases - Dan Suciu XY x1y1 x1y2 x2y2 XP x1p1 x2p2 YP y1q1 y2q2 R T S ⋈  ⋈  RS T p1 p2 p1q1 (1-(1-p1)(1-p2))q2 p1 p2 Wrong #P-hard H 0 :- R(x),S(x,y),T(y) SELECT DISTINCT ‘yes’ FROM R, S, T WHERE R.x = S.x and S.y = T.y SELECT DISTINCT ‘yes’ FROM R, S, T WHERE R.x = S.x and S.y = T.y However, the plan is guaranteed to always return upper bound [Gatterbauer’13] 1-(1-p1q1)(1-(1-(1-p1)(1-p2))q2) 17

Discussion Safe queries admit a safe plan, and can be computed efficiently Unsafe queries do not admit safe plans, and we will prove that they cannot be computed efficiently Every extensional plan (safe or unsafe) can be written directly in SQL – will illustrate with postgres Every query (safe/unsafe) admits extensional plans that compute upper bounds, or lower bounds of its probability June, 2014Probabilistic Databases - Dan Suciu 18

ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 S(A,B) AP a1p1 a2p2 a3p3 R(A) ⋈ ABP a1b1p1*q1 a1b2p1*q2 a2b3p2*q3 a2b4p2*q4 a2b5p2*q5 Extensional Plans in Postgres June, 2014Probabilistic Databases - Dan Suciu SELECT R.A, S.B, R.P*S.P FROMR, S WHERE R.A=S.A 19 i

ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 S(A,B) AP a1p1 a2p2 a3p3 R(A) ⋈ ABP a1b1p1*q1 a1b2p1*q2 a2b3p2*q3 a2b4p2*q4 a2b5p2*q5 Extensional Plans in Postgres June, 2014Probabilistic Databases - Dan Suciu SELECT R.A, S.B, R.P*S.P FROMR, S WHERE R.A=S.A 20 i S(A,B) AP a11 - (1-q1)*(1-q2) a21 - (1-q3)*(1-q4)*(1-q5) ΠAΠA i ABP a1b1q1 a1b2q2 a2b3q3 a2b4q4 a2b5q5 SELECT S.A, 1.0-prod(1.0 - S.p) FROMS GROUP BY S.A create or replace function combine_prod(float, float) returns float as 'select $1 * $2' language SQL; create or replace function final_prod(float) returns float as 'select $1' language SQL; drop aggregate if exists prod (float); create aggregate prod(float) ( sfunc = combine_prod, stype = float, finalfunc = final_prod, initcond = '1.0' );

Extensional Plans in Postgres June, 2014Probabilistic Databases - Dan Suciu ⋈ ΠΦΠΦ S(x,y) R(x) ΠxΠx SELECT DISTINCT ‘true’ FROM R, S WHERE R.x = S.x SELECT DISTINCT ‘true’ FROM R, S WHERE R.x = S.x WITH Temp AS (SELECT S.x, 1.0-prod(1.0 - S.p) as p FROM S GROUP BY S.x) SELECT ‘true’ as z, 1.0-prod(1.0 – R.P * Temp.P) as p FROM R, Temp WHERE R.x = Temp.x 21 i i i

Try this in postgres: June, 2014Probabilistic Databases - Dan Suciu First step: download postgres from -- Second step: run the command "createdb pdb" -- Third step: run the command "psql pdb" then cut/paste commands below define an aggregate function to compute the product create or replace function combine_prod (float, float) returns float as 'select $1 * $2' language SQL; create or replace function final_prod (float) returns float as 'select $1' language SQL; drop aggregate if exists prod (float); create aggregate prod (float) ( sfunc = combine_prod, stype = float, finalfunc = final_prod, initcond = '1.0' ); simple tables, similar to those used in the tutorial create table R(z char(8), x char(8), p float); create table S(x char(8), y char(8), p float); insert into R values('c', 'a1', 0.5); insert into R values('c', 'a2', 0.5); insert into R values('c', 'a3', 0.5); insert into S values('a1', 'b1', 0.5); insert into S values('a1', 'b2', 0.5); insert into S values('a2', 'b2', 0.5); insert into S values('a2', 'b3', 0.5); insert into S values('a2', 'b4', 0.5); -- computing the query Q(z) = R(z,x),S(x,y) -- a safe plan: with Temp as (select S.x, 1.0-prod(1.0-p) as p from S group by S.x) select R.z, 1.0-prod(1-R.p*Temp.p) from R, Temp where R.x=Temp.x group by R.z; -- an unsafe plan; guaranteed to return an upper bound on the probability select R.z, 1.0-prod(1-R.p*S.p) from R, S where R.x=S.x group by R.z;

Extensional Plans in Postgres June, 2014Probabilistic Databases - Dan Suciu 23 SELECT DISTINCT ‘yes’ FROM R, S, T WHERE R.x = S.x and S.y = T.y SELECT DISTINCT ‘yes’ FROM R, S, T WHERE R.x = S.x and S.y = T.y ⋈ ΠΦΠΦ S(x,y) R(x) ΠxΠx i i i ⋈ i T(y) The plan is unsafe, but guaranteed to return a probability that is an upper bound. The same plan is guaranteed to return a lower bound, by modifying appropriately the probabilities in T Gatterbauer,S.’2013 This query is unsafe

Try this in postgres: June, 2014Probabilistic Databases - Dan Suciu The following approximation plans for unsafe queries are from -- Gatterbauer, Suciu: Oblivious Bounds on the Probability of Boolean Functions -- create a third table create table T(y char(8), p float); insert into T values('b1', 0.5); insert into T values('b2', 0.5); insert into T values('b3', 0.5); insert into T values('b4', 0.5); -- computing the query Q(z) = R(z,x),S(x,y),T(y) -- This query has no safe plans -- Next two unsafe plans compute upper bounds on the probability: -- Unsafe plan #1 with Temp as (select S.x, 1.0-prod(1.0-S.p*T.p) as p from S,T where S.y=T.y group by S.x) select R.z, 1.0-prod(1-R.p*Temp.p) from R, Temp where R.x=Temp.x group by R.z; -- Unsafe plan #2 with Temp as (select R.z,S.y,1.0-prod(1.0-R.p*S.p) as p from R,S where R.x=S.x group by R.z,S.y) select Temp.z, 1.0-prod(1-Temp.p*T.p) from Temp, T where Temp.y=T.y group by Temp.z; -- Next two unsafe plans compute lower bounds on the probability: with newT as (select T.y, 1-exp((ln(1-T.p))/count(*)) as p from S,T where S.y=T.y group by T.y, T.p), Temp as (select S.x, 1.0-prod(1.0-S.p*newT.p) as p from S,newT where S.y=newT.y group by S.x) select R.z, 1.0-prod(1-R.p*Temp.p) from R, Temp where R.x=Temp.x group by R.z; with newR as (select R.z, R.x, 1-exp((ln(1-R.p))/count(*)) as p from R,S where R.x=S.x group by R.z,R.x,R.p), Temp as (select newR.z, S.y, 1.0-prod(1.0-newR.p*S.p) as p from newR, S where newR.x=S.x group by newR.z, S.y) select Temp.z, 1.0-prod(1-Temp.p*T.p) from Temp, T where Temp.y=T.y group by Temp.z;

Lesson 2 You don’t need a probabilistic database system in order to use a probabilistic database! What you need is to know really well SQL and probability theory (You also need to read the book on probabilistic databases!) June, 2014Probabilistic Databases - Dan Suciu 25