General Database Statistics Using Maximum Entropy
Raghav Kaushik (Microsoft Research), Christopher Ré (University of Wisconsin--Madison), and Dan Suciu (University of Washington)

1 General Database Statistics Using Maximum Entropy
Raghav Kaushik (1), Christopher Ré (2), and Dan Suciu (3)
(1) Microsoft Research, (2) University of Wisconsin--Madison, (3) University of Washington

2 Study Cardinality Estimation
1. Model: the information that the optimizer knows
2. Prediction: use the model to estimate the cardinality of future queries
Contribution: a principled, declarative approach to cardinality estimation based on entropy maximization.
Propose a declarative language with statistical assertions, e.g., “We estimate that the distinct # of Employees is 10”.

3 Motivating Applications
1. Incorporate query feedback records
2. Optimizers for new domains (DB Kit 2.0): cloud computing, information extraction
3. Data generation and description
Underutilized: no general-purpose mechanism

4 Outline
Statistical programs and desiderata
Semantics of statistical programs
Two examples
Conclusions

5 Statistical Assertions
An assertion is a conjunctive query (CQ) view plus a sharp (#) statement:
V_1(x) :- R(x,-)          #V_1 = 20   “The number of values in the output of V_1 is 20”
V_2(y) :- R(-,y), S(y)    #V_2 = 50   “The number of values in the output of V_2 is 50”
A program is a set of assertions, e.g., V(x) :- R(x,y), ….   #V = 10^6
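To make the two views concrete, here is a minimal Python sketch that evaluates V_1 and V_2 on a single, fixed instance. The relations R and S and their contents are hypothetical toy data, not taken from the talk; on one deterministic instance, # is simply the number of distinct output values.

```python
# Hypothetical toy instance: R is a binary relation, S is a unary relation.
R = {(1, "a"), (1, "b"), (2, "b"), (3, "c")}
S = {"b", "c", "d"}

def v1(R):
    """V_1(x) :- R(x,-): project R onto its first column."""
    return {x for (x, _) in R}

def v2(R, S):
    """V_2(y) :- R(-,y), S(y): second column of R, kept only if also in S."""
    return {y for (_, y) in R if y in S}

count_v1 = len(v1(R))     # number of distinct x values
count_v2 = len(v2(R, S))  # number of distinct y values that also appear in S
print(count_v1, count_v2)
```

An assertion such as #V_1 = 20 would then constrain what this count should be, in expectation, over the instances the model considers.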

6 Model as a Probabilistic Database
Intuitively, # is “expected value”.
V_1(x) :- R(x,-)   #V_1 = 20   “The number of values in the output of V_1 is 20”
A model is a probabilistic database such that the expected number of tuples in V_1 is 20.
OK, but which probabilistic database?

7 Desiderata for Our Solution
Two desiderata for the distribution:
(D1): It should agree with the provided statistics.
(D2): It should assume nothing else.
Approach: maximize entropy subject to (D1).
Challenge: compute the parameters of the MaxEnt distribution.
Technical desideratum: we want the parameters analytically.

10 Notation for Probabilistic Databases
Consider a domain D of size n.
Fix a schema R = R_1, R_2, …
Let Inst(n) = all instances over R on D.
An element I of Inst(n) is called a world.
A probabilistic database is a pair (Inst(n), p): essentially, any discrete probability distribution on relations.
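The notation can be made tangible by brute force on a tiny case. The sketch below assumes a schema with a single binary relation R and a domain of size n = 2, so Inst(n) has 2^(n^2) = 16 worlds. The uniform distribution p is chosen purely for illustration (it is not the distribution the talk argues for), and expected_size is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import combinations, product

def all_worlds(n):
    """Enumerate Inst(n) for one binary relation R over the domain {0, ..., n-1}."""
    tuples = list(product(range(n), repeat=2))  # the n^2 possible tuples
    return [frozenset(c)
            for k in range(len(tuples) + 1)
            for c in combinations(tuples, k)]

worlds = all_worlds(2)

# A probabilistic database (Inst(n), p); uniform p is an illustrative assumption.
p = {I: Fraction(1, len(worlds)) for I in worlds}

def expected_size(p, view):
    """Expected number of output values of `view` under the distribution p."""
    return sum(prob * len(view(I)) for I, prob in p.items())

def v1(I):
    """V_1(x) :- R(x,-)."""
    return {x for (x, _) in I}

e_v1 = expected_size(p, v1)
print(len(worlds), e_v1)
```

Exact rational arithmetic (Fraction) is used so that the probabilities sum to exactly 1 and the expectation can be compared without rounding.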

11 The Semantics of #
V_1(x) :- R(x,-)   #V_1 = 20   “The number of values in the output of V_1 is 20”
# means “expected value”.
Achieving (D1): stats must agree.
NB: In truth, we let n tend to infinity and settle for asymptotic equality.
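The slide's own formula did not survive in this transcript. Reading # as an expectation over the worlds of a probabilistic database (Inst(n), p), the assertion #V_1 = 20 can plausibly be written as follows; the notation |V_1(I)| for the number of output values of V_1 on world I is an assumption of this sketch.

```latex
\mathbf{E}_p\big[\,|V_1|\,\big]
  \;=\; \sum_{I \in \mathrm{Inst}(n)} p(I)\,\big|V_1(I)\big|
  \;=\; 20
```

Under the asymptotic reading mentioned in the NB, the last equality would hold only in the limit as n tends to infinity.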

12 Multiple Views
Given V_1, V_2, … with #V_i = d_i for i = 1, …, t.
If p satisfies these equations, we have achieved (D1): it agrees with the provided statistics.
Many such distributions exist. How do we pick one?

14 Selecting the Best One
Achieving (D2): no ad hoc assumptions.
Maximize entropy subject to the constraints of (D1).
One can show that p has a closed form in which Z is a normalizing constant and α_i is a positive parameter for i = 1, …, t.
NB: p is a function only of the stats, and so we have achieved (D2).
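The formulas on this slide are missing from the transcript. As a hedged reconstruction, the optimization problem and the standard exponential-family shape of its solution can be written as below. The symbol \alpha_i for the positive parameters and the notation |V_i(I)| are assumptions of this sketch; the slide's exact notation may differ.

```latex
\max_{p}\; H(p) \;=\; -\sum_{I \in \mathrm{Inst}(n)} p(I)\,\log p(I)
\quad\text{subject to}\quad
\sum_{I \in \mathrm{Inst}(n)} p(I)\,\big|V_i(I)\big| \;=\; d_i \;\; (i = 1,\dots,t),
\qquad
\sum_{I \in \mathrm{Inst}(n)} p(I) \;=\; 1

p(I) \;=\; \frac{1}{Z}\,\prod_{i=1}^{t} \alpha_i^{\,|V_i(I)|},
\qquad
Z \;=\; \sum_{J \in \mathrm{Inst}(n)} \prod_{i=1}^{t} \alpha_i^{\,|V_i(J)|}
```

In this form each \alpha_i plays the role of an exponentiated Lagrange multiplier for the i-th constraint, which is why p depends on the statistics alone.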

15 Benefits of MaxEnt
Every (consistent) statistical program induces a well-defined distribution.
– Every query has a well-defined cardinality estimate.
Statistics are treated as a whole, not as individual stats.
We can add new statistics to our heart’s content.
Technical challenge: compute the α_i analytically.

17 Two Quick Examples
I: A material random graph
– Even simple EM solutions have interesting theory
II: Intersection models
– Generating function, and
– A different, analytic technique

20 Example I: Random Graphs are EM
V(x,y) :- R(x,y)   #V = d
Random graph: add edges independently at random, each with probability x.
By linearity, E[V] = x n^2 = d.
This is MaxEnt.
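The claim can be checked by brute force on a tiny domain. The sketch below assumes the MaxEnt distribution for the single assertion #V = d weights each world I in proportion to alpha^|I| for one positive parameter alpha; the value 1/3 and the domain size n = 2 are arbitrary illustrative choices. It confirms that this distribution behaves like a random graph with edge probability x = alpha / (1 + alpha) and that the expected number of edges is x n^2.

```python
from fractions import Fraction
from itertools import combinations, product

n = 2
alpha = Fraction(1, 3)  # hypothetical parameter value, chosen for illustration
tuples = list(product(range(n), repeat=2))  # the n^2 possible edges

# All worlds: every subset of the possible edges.
worlds = [frozenset(c)
          for k in range(len(tuples) + 1)
          for c in combinations(tuples, k)]

# Assumed MaxEnt shape: p(I) is proportional to alpha^|I|.
weights = {I: alpha ** len(I) for I in worlds}
Z = sum(weights.values())
p = {I: w / Z for I, w in weights.items()}

expected_edges = sum(prob * len(I) for I, prob in p.items())
x = alpha / (1 + alpha)  # implied per-edge probability

# Marginal probability that one fixed edge is present.
edge = tuples[0]
marginal = sum(prob for I, prob in p.items() if edge in I)

# Joint probability of two fixed edges, to check independence.
other = tuples[1]
joint = sum(prob for I, prob in p.items() if edge in I and other in I)
print(Z, expected_edges, marginal, joint)
```

Read in the other direction, a target d determines x = d / n^2 and hence alpha = x / (1 - x), which is the kind of analytic solution for the parameters that the talk is after.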

21 Example II: An Intersection Model
V(x) :- R_1(x), R_2(x)
#R_1 = d_1, #R_2 = d_2, #V = d_3
Read: each element is either in R_1, in R_2, or in all three.
E.g., a term with x_1^k corresponds to an instance with k distinct values in R_1.
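The generating function itself is not preserved in the transcript. The following is a hedged reconstruction, assuming R_1 and R_2 are unary relations over a domain of size n and writing x_1, x_2, x_3 for the parameters attached to #R_1, #R_2, and #V (only x_1 appears in the surviving text; x_2 and x_3 are names assumed here). Each element independently contributes 1 if it is in neither relation, x_1 if it is only in R_1, x_2 if it is only in R_2, and x_1 x_2 x_3 if it is in both and therefore in V.

```latex
Z(x_1, x_2, x_3) \;=\; \big(1 + x_1 + x_2 + x_1 x_2 x_3\big)^{n}

\mathbf{E}\big[\,|R_1|\,\big]
  \;=\; x_1\,\frac{\partial \log Z}{\partial x_1}
  \;=\; \frac{n\,(x_1 + x_1 x_2 x_3)}{1 + x_1 + x_2 + x_1 x_2 x_3}
  \;=\; d_1

\mathbf{E}\big[\,|R_2|\,\big]
  \;=\; \frac{n\,(x_2 + x_1 x_2 x_3)}{1 + x_1 + x_2 + x_1 x_2 x_3}
  \;=\; d_2,
\qquad
\mathbf{E}\big[\,|V|\,\big]
  \;=\; \frac{n\,x_1 x_2 x_3}{1 + x_1 + x_2 + x_1 x_2 x_3}
  \;=\; d_3
```

Expanding Z as a polynomial, the coefficient structure matches the slide's reading: a term containing x_1^k accounts for instances with k distinct values in R_1.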

27 Results in the Paper
Normal form for statistical programs
Syntactic classes that we can solve analytically
– “Project-semijoin” queries (previous slide)
A general technique, conditioning:
– Start with a tuple-independent prior, and condition
– Introduces inclusion constraints
Extensions to handle histograms

28 Conclusion
Showed a principled, general model for database statistics based on MaxEnt
Analytically solved syntactic classes of statistics
Applications: query feedback and the cloud

