Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng.

Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng

Outline  Introduction  Estimation for single tables  SRM: Statistical Relational Models  Selectivity Estimation using SRMs  Learning SRM

Introduction  Accurate estimates of the result size of queries are crucial to several query processing components of DBMS.  Cost-based query optimizers: choose the optimal query execution plan.  Query profilers: predicting resource consumption and distribution of query results.  Answer counting queries

Introduction  How to estimate the size of a selection query over multiple attributes for a single table? The result size is determined by the joint frequency distribution of the values of these attributes. Size(Q) = |R|  P D (a 1, …,a k ) Query:select * from R where R.A1 = a1 and … and R.Ak = ak

Introduction  But … exponential in # of attributes v n, representing all combination of attribute values is infeasible.  Attribute value independence assumption: joint distribution is product of single attribute distributions  Problem: overestimate or underestimate the query size. → Bayesian Network

Estimation for single table  Example: Given a simple relation R with three attributes: Education (high-school, college,degree)3 values Income (low, medium, high) 3 values Home-Owner (false, true) 2 values Joint distribution need 18 numbers.  Observation: Some of the correlations between attributes might be indirect ones, mediated by other attributes. → Conditional Independence

Estimation for single table Figure (b) can encode precisely the same joint distribution as in Figure (a)

Bayesian Networks X-Ray Lung Infiltrates Sputum Smear TuberculosisPneumonia Nodes = random variables Edges = direct probabilistic influence Network structure encodes independence assumptions: X-Ray conditionally independent of Pneumonia given Infiltrates

Bayesian Networks X-Ray Lung Infiltrates Sputum Smear TuberculosisPneumonia  Associated with each node X i there is a conditional probability distribution P(X i |Pa i :  ) — distribution over X i for each assignment to parents 0.8 t 0.2 p t p 0.6 0.4 0.010.99 0.2 0.8 tp t p T P P(I |P, T )

BN Semantics Compact & natural representation: nodes have  k parents  2 k n vs. 2 n params conditional independencies in BN structure + local probability models full joint distribution over domain = X I S TP

BNs for Query Estimation  Query:select * from R where R.A 1 = a 1 and … and R.A k = a k  P(a 1,a 2, … a n )=  P(a i |parents(a i )) Use Bayesian inference algorithm to compute P D (a 1, …,a k )  Algorithm complexity depends on BN connectivity; efficient in practice Size(Q) = |R|  P D (a 1, …,a k )

Join Selectivity Estimation Person Purchase Uniform Join Assumption Size(Purchase Person) = | Purchase | Assuming referential integrity Naïve Approach

Join Selectivity Estimation Example query Q: “ person.income=high and purchase.type=luxury ” p = P (person.income=high) q = P (purchase.type=luxury) Size Q = |Purchase|*p*q Problems: Joining Two Tables

Correlated Attributes Person Purchase Income = high Income = low Type = luxury Type = necessity The attributes of the two different tables are often correlated

Skewed Join Person Purchase Income = high Income = low Type = luxury Type = necessity The probability that two tuples join with each other can also be correlated with various attributes.

Join Indicator S R Query: select * from R, S where R.F = S.K and R.A = a and S.B = b P(J F ) = prob. randomly chosen tuple from R joins with a randomly chosen tuple from S size(Q) = | R | | S | P(J F, a, b)

Statistical Relational Models  Model distribution of attributes across multiple tables  Allow attribute values to depend on attributes in the same table (like a BN)  Allow attribute values to depend on attributes in other tables along a foreign key join  Can model the join probability of two tuples using join indicator variable

Statistical Relational Model  A SRM for a relational database is a pair(S,θ),which specifies a local probabilistic model for each of the following variables  A variable R.A for each table R and each attribute A  R.*  A boolean join indicator variable R.J F  For each variable R.X  S specifies a set of parents Pa(R.X)  Θ specifies a CPD P(R.X|Pa(R.X))

Example SRM Person Income Age School Prestige J school Purchase J person Type Attended Bought-by 0.9998, 0.0002 Type=necessity false true false true 0.999, 0.001 Income = high 0.99, 0.01

Universal Foreign Key Closure  Schema: R, S, T.R.F refers to S, S.F refers to T stratification: T < S < R r s t r.F 1 = s.K s.F 2 = t.K  Schema: R, S R.F 1 refers to S, R.F 2 refers to S stratification: S < R r r.F 1 = s 1.K s1s1 s2s2 r.F 2 = s 2.K

Universal Foreign Key Closure  Minimal extension Q + to a query Q: Let Q be a keyjoin query over r 1,r 2, … r k For each r, if there is an attribute R.A with parent R.F.B where R.F points to S, then there is a unique tuple variable s representing the join r.F=s.K  Proposition: Let Q be a query and let Q + be its minimal extension. Then size Q [D]=size Q+ [D]

Answering Queries Using SRMs  Construct Query Evaluation BN for Query: select * from Person, Purchase where Person.id = Purchase.buyer-id and Person.Income = high and Purchase.Type=luxury Person Purchase J person Income Type Age Prestige J school School Compute upward closure of query attributes by including all parents as well

SRM Learning  Learn parameters & qualitative dependency structure  Extend known techniques for learning Bayesian networks from data. Database Patient Strain Contact

Structure selection  Define scoring function: log-likelihood function l ( θ,S | D)=log P(D | S, θ) Finding the model that has maximum log-likelihood given data.  Do the greedy local structure search

Parameter Estimation  The model contains a parameter Θ a|x for each value a of A and each assignment of values x to X. Θ a|x = P(R.A=a |X=x) Θ a|x = F D (R.A=a,X=x)/F D (X=x)

System Architecture Model Constructor Database offline Selectivity Estimator execution time Query Q Size(Q)

Conclusions  SRM is unique in its ability to handle select and join operators  Estimates the high-dimensional joint distribution using a set of lower- dimensional conditional distributions To do:  Incremental maintenance of the SRM as the database changes  Joins over non-key attributes

Selected Publications o“Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D. Koller and B. Taskar, JMLR 2002. o“Probabilistic Models of Text and Link Structure for Hypertext Classification”, L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond Classification’, 2001. o“Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D. Koller, SIGMOD-01. o“Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N. Lavrac, 2001. osee also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI-99. o“Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman, D. Koller, and B. Taskar, ICML-01. o“From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D. Koller and N. Friedman, ICML Workshop on Attribute-Value and Relational Learning: Crossing the Boundaries, 2000. oNotes from AAAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2000. oNotes from IJCAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2003. See http://www.cs.umd.edu/~getoor

Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng.

Similar presentations

Presentation on theme: "Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng.

Similar presentations

Presentation on theme: "Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng."— Presentation transcript:

Similar presentations

About project

Feedback