Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.


2 Motivation
- Problem: determine the result cardinality of a complex relational query
  - Query optimization: cost factors of candidate plans depend on query selectivity
  - Data exploration: query selectivity provides timely feedback
- Solution: approximate selectivity over data synopses
[Figure: evaluating Count(Q) on the relational database yields the exact selectivity but is expensive; evaluating Count(Q) on the database synopsis yields a selectivity estimate and is efficient.]

3 Previous Work
- Table Level Synopses
  - Examples: Histograms [Poosala+96], Sketches [Dobra+02], Wavelets [Chakrabarti+00], Table samples [Lipton+93]
  - Weakness: do not summarize key values well
- Schema Level Synopses
  - Examples: Join Synopses [Acharya+99], PRMs [Getoor+01]
  - Weakness: restricted to certain types of schemata
[Figure: example schema over relations R, T, Z, W with table-level synopses S_R, S_Z, S_T, S_W and schema-level synopses S_RZT, S_RZW.]

4 Synopsis Desiderata
- Schema level → Capture key/foreign-key joins
- Applicable to general schemata and queries

5 Contributions
- Tuple Graph Synopsis (TuG)
  - Model: Semi-structured view of relational data
  - Schema level summary
  - Schemata with many-to-many relationships
  - Complex join queries
- TuG construction algorithm
  - Basis: Tuple clustering
  - Novel heuristics
  - Builds on existing clustering techniques
- Experimental study
  - TuGs are effective synopses for small space budgets
  - Better accuracy compared to previous techniques

6 Outline
- The TuG Synopsis
  - Synopsis model
  - Estimation framework
- TuG Construction
  - TuG Compression
  - Construction Algorithm
- Experimental Study
- Conclusions

7 TuG Model: Intuition #1
- Relational database ↔ Data graph
Movies:
  mid  year  genre
  1    2005  Action
  2    2004  Action
  3    2000  Drama
Actors:
  aid  sex
  1    Male
  2    Female
  3    Male
  4
Cast:
  mid  aid
  1    1
  1    2
  2    3
  3    3
  3    4
[Figure: the corresponding data graph, with tuple nodes m1-m3, c1-c5, a1-a4 and value nodes Action, Drama, Female, Male, 2005, 2004, 2000.]

8 TuG Model: Intuition #2
- Join query ↔ Sub-graph matching
- Selectivity ↔ Count of matching sub-graphs
Example query:
  SELECT *
  FROM M, C, A
  WHERE M.mid = C.mid AND C.aid = A.aid
    AND A.sex = 'Male' AND M.genre = 'Drama'
[Figure: the data graph from slide 7 next to the query sub-graph M - C - A, with value node Male attached to A and value node Drama attached to M.]
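The correspondence between a join query and sub-graph matching can be sketched in a few lines of Python. This is only an illustration over the slide-7 example tables, not the authors' code: the names `MOVIES`, `ACTORS`, `CAST` and `count_matches` are hypothetical, and the sex of actor 4 is left as `None` because it does not survive in the transcript.

```python
# Example tables as read from slide 7 (Movies, Actors, Cast).
# Assumption: the sex of actor 4 is missing in the transcript, so it stays None.
MOVIES = {1: (2005, "Action"), 2: (2004, "Action"), 3: (2000, "Drama")}
ACTORS = {1: "Male", 2: "Female", 3: "Male", 4: None}
CAST = [(1, 1), (1, 2), (2, 3), (3, 3), (3, 4)]  # (mid, aid)


def count_matches(genre=None, year=None, sex=None):
    """Count M-C-A join results, i.e. matching sub-graphs of the data graph.

    Each Cast tuple links exactly one movie node to one actor node, so every
    Cast tuple that satisfies the value predicates is one matching sub-graph.
    """
    total = 0
    for mid, aid in CAST:
        m_year, m_genre = MOVIES[mid]
        if genre is not None and m_genre != genre:
            continue
        if year is not None and m_year != year:
            continue
        if sex is not None and ACTORS[aid] != sex:
            continue
        total += 1
    return total


if __name__ == "__main__":
    # Query with predicates year = 2005 and sex = Male.
    print(count_matches(year=2005, sex="Male"))
```

The call with `year=2005` and `sex="Male"` does not touch actor 4, so its count does not depend on the missing value; the slide-8 query (`genre="Drama"`, `sex="Male"`) does.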

9 TuG Synopsis Model
- Tuple Node: Set of tuples from the same relation
  - Node count: number of tuples
- Edge: Join between tuple sets
  - Edge count: result size of join
[Figure: the data graph (left) and the resulting TuG (right), with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3) and edge counts on the join edges.]

10 TuG Synopsis Model
- Value node: A distinct value in the database
- Value edge: Appearance of value in the tuple set
  - Edge count: frequency of value in the tuple set
- In practice, distributions are compressed with summaries
[Figure: the data graph (left) and the TuG (right), now with value nodes Action, Drama, Female, Male, 2005, 2004 and counts on the value edges.]

11 TuG Semantics
- Assumption 1: Independence across edges
- Assumption 2: Uniformity along each edge
For each actor in aα:
- prob[cα-aα] = 1/3
- prob[cβ-aα] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3
[Figure: the TuG with nodes mα(2), mβ(1), cα(3), cβ(2), aα(3), their value nodes, and edge counts.]
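The per-actor probabilities on this slide follow from the uniformity assumption: a join-edge count is spread evenly over all pairs of tuples in the two nodes, and a value-edge count is spread evenly over the tuples of one node. A minimal sketch, assuming the edge counts implied by the slide's probabilities (3 and 2 for the join edges into aα, 1 and 2 for Female and Male); the identifiers are hypothetical:

```python
from fractions import Fraction

# Node and edge counts for the part of the TuG around a_alpha.
# Assumption: counts are the ones implied by the probabilities on the slide.
NODE_COUNT = {"c_alpha": 3, "c_beta": 2, "a_alpha": 3}
JOIN_EDGE = {("c_alpha", "a_alpha"): 3, ("c_beta", "a_alpha"): 2}
VALUE_EDGE = {("a_alpha", "Female"): 1, ("a_alpha", "Male"): 2}


def join_prob(u, v):
    """Probability that one given tuple of u joins one given tuple of v."""
    return Fraction(JOIN_EDGE[(u, v)], NODE_COUNT[u] * NODE_COUNT[v])


def value_prob(node, value):
    """Probability that one given tuple of `node` carries `value`."""
    return Fraction(VALUE_EDGE[(node, value)], NODE_COUNT[node])


if __name__ == "__main__":
    print(join_prob("c_alpha", "a_alpha"), join_prob("c_beta", "a_alpha"))
    print(value_prob("a_alpha", "Female"), value_prob("a_alpha", "Male"))
```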

12 Tuple Clustering
- Tuple node → Cluster
- Join and value probabilities → Centroid
- Validity of assumptions → Error of clustering
- Tight clusters → Valid assumptions → Accurate synopsis
The centroid of aα: (cα: 1/3, cβ: 1/3, Female: 1/3, Male: 2/3)
[Figure: the TuG from slide 11 with the per-actor probabilities prob[cα-aα] = 1/3, prob[cβ-aα] = 1/3, prob[sex=Female] = 1/3, prob[sex=Male] = 2/3.]

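13 Selectivity Estimation
- Single pass estimation algorithm
- Accuracy depends on the validity of our assumptions
- Tight clustering → Accurate estimates
Query sub-graph: M - C - A with value predicates 2005 (on M) and Male (on A), matched against mα(2), cα(3), aα(3):
$$
\begin{aligned}
\text{Sub-graph selectivity} &= (2 \cdot 3 \cdot 3) \cdot \mathrm{Prob}[\,2005 \wedge m_\alpha c_\alpha \wedge c_\alpha a_\alpha \wedge \text{Male}\,] \\
&= (2 \cdot 3 \cdot 3) \cdot \mathrm{Prob}[2005] \cdot \mathrm{Prob}[m_\alpha c_\alpha] \cdot \mathrm{Prob}[c_\alpha a_\alpha] \cdot \mathrm{Prob}[\text{Male}] \\
&= 1 \text{ tuple}
\end{aligned}
$$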
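The arithmetic behind the "1 tuple" estimate can be checked with a short sketch. It assumes edge counts derived from the slide-7 data and the slide-11 probabilities (mα-cα = 3, cα-aα = 3, mα-2005 = 1, aα-Male = 2) and handles only a single path of TuG nodes; the names are hypothetical and this is not the authors' estimation algorithm, which would combine all matching paths.

```python
from fractions import Fraction

# Assumption: counts derived from the example data, not printed on the slide.
NODE_COUNT = {"m_alpha": 2, "c_alpha": 3, "a_alpha": 3}
JOIN_EDGE = {("m_alpha", "c_alpha"): 3, ("c_alpha", "a_alpha"): 3}
VALUE_EDGE = {("m_alpha", 2005): 1, ("a_alpha", "Male"): 2}


def estimate_path(path, predicates):
    """Estimate the number of matches along one path of TuG nodes.

    Start from the number of candidate tuple combinations (product of node
    counts) and scale by the probability of every join edge and value edge,
    treating all edges as independent and uniform.
    """
    estimate = Fraction(1)
    for node in path:
        estimate *= NODE_COUNT[node]
    for u, v in zip(path, path[1:]):
        estimate *= Fraction(JOIN_EDGE[(u, v)], NODE_COUNT[u] * NODE_COUNT[v])
    for node, value in predicates:
        estimate *= Fraction(VALUE_EDGE[(node, value)], NODE_COUNT[node])
    return estimate


if __name__ == "__main__":
    path = ["m_alpha", "c_alpha", "a_alpha"]
    predicates = [("m_alpha", 2005), ("a_alpha", "Male")]
    print(estimate_path(path, predicates))
```

With these counts the factors are 18 · 1/2 · 1/3 · 1/2 · 2/3, which matches the single result tuple of the query on the example data.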

15 The Node-Merge Operation
- Collapse a set of nodes into one new node
- New node acquires aggregate characteristics
- New centroid represents the union of the tuple sets
Example: merge aα and aβ into aγ
- Before: aα(2) with centroid (cα: 1/3, Male: 1) and aβ(2) with centroid (cα: 1/6, Male: 1), both joined to cα(6)
- After: aγ(4) with centroid (cα: 1/4, Male: 1)
Questions:
- When is a merge lossless?
- How do we quantify lossy merges?
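The merge example can be reproduced with a small sketch, assuming the edge counts that the slide's centroids imply (aα-cα = 4, aβ-cα = 2, and 2 Male value edges each). The `merge` and `centroid` helpers are hypothetical, and value nodes are treated as neighbors of size 1 purely for convenience.

```python
from fractions import Fraction


def merge(nodes):
    """Collapse nodes into one: tuple counts and edge counts are summed."""
    merged = {"count": sum(n["count"] for n in nodes), "edges": {}}
    for n in nodes:
        for neighbor, edge_count in n["edges"].items():
            merged["edges"][neighbor] = merged["edges"].get(neighbor, 0) + edge_count
    return merged


def centroid(node, neighbor_sizes):
    """Per-tuple probabilities: edge count / (|node| * |neighbor|).

    Assumption: a neighbor missing from neighbor_sizes is a value node and
    counts as size 1, so its component is simply edge count / |node|.
    """
    return {
        neighbor: Fraction(edge_count, node["count"] * neighbor_sizes.get(neighbor, 1))
        for neighbor, edge_count in node["edges"].items()
    }


# Counts implied by the centroids on the slide (c_alpha has 6 tuples).
SIZES = {"c_alpha": 6}
a_alpha = {"count": 2, "edges": {"c_alpha": 4, "Male": 2}}
a_beta = {"count": 2, "edges": {"c_alpha": 2, "Male": 2}}
a_gamma = merge([a_alpha, a_beta])

if __name__ == "__main__":
    print(centroid(a_alpha, SIZES), centroid(a_beta, SIZES), centroid(a_gamma, SIZES))
```

Here the merged centroid (1/4) differs from both inputs (1/3 and 1/6), which is what makes this particular merge lossy.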

16 Full Similarity Merge
- Nodes are fully similar if they have the same centroids
Example: merge aα and aβ into aγ
- Before: aα(2) and aβ(4), both joined to cα(12) and Male, with centroid (cα: 1/6, Male: 1)
- After: aγ(6), joined to cα(12) and Male

17 All-but-one Similarity
- Nodes are all-but-one similar if they have the same centroids with respect to all schema neighbors but one
- Theorem: a merge of all-but-one similar nodes is lossless
- Order of merging can affect final compression
- Potential application in other domains (e.g., XML summarization)
Example: merge aα and aβ into aγ (lossless)
- Before: aα(2) with centroid (cα: 1/6, cβ: 1/8, Male: 1, Female: 0) and aβ(4) with centroid (cα: 1/6, cβ: 1/8, Male: 1/4, Female: 3/4), joined to cα(12) and cβ(8)
- After: aγ(6) with centroid (cα: 1/6, cβ: 1/8, Male: 1/2, Female: 1/2)
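One way to read the two similarity tests is as a comparison of centroids grouped by schema neighbor. The sketch below uses the centroids of the slide-17 example; the `SCHEMA_OF` mapping and function names are hypothetical, and treating fully similar nodes as a special case of all-but-one similarity is an assumption.

```python
from fractions import Fraction as F

# Assumption: each centroid component belongs to one schema neighbor
# (the Cast relation or the sex attribute of Actors).
SCHEMA_OF = {"c_alpha": "Cast", "c_beta": "Cast", "Male": "sex", "Female": "sex"}


def differing_schema_neighbors(c1, c2):
    """Schema neighbors on which two centroids disagree in some component."""
    components = set(c1) | set(c2)
    return {SCHEMA_OF[k] for k in components if c1.get(k, 0) != c2.get(k, 0)}


def fully_similar(c1, c2):
    return not differing_schema_neighbors(c1, c2)


def all_but_one_similar(c1, c2):
    return len(differing_schema_neighbors(c1, c2)) <= 1


# Centroids of a_alpha and a_beta from the slide-17 example.
a_alpha = {"c_alpha": F(1, 6), "c_beta": F(1, 8), "Male": F(1), "Female": F(0)}
a_beta = {"c_alpha": F(1, 6), "c_beta": F(1, 8), "Male": F(1, 4), "Female": F(3, 4)}

if __name__ == "__main__":
    print(fully_similar(a_alpha, a_beta), all_but_one_similar(a_alpha, a_beta))
```

The two nodes disagree only on the sex attribute, so they are all-but-one similar without being fully similar.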

18 Effect of All-but-one Similarity
Number of nodes in synopsis graph:
  Data set  Data graph   Full-Similarity Synopsis  Ab1-Similarity Synopsis
  TPCH      8 million    4.4 million               33K
  IMDB      4.7 million  4.5 million               65K

19 Lossy Merges
- Question: When is a lossy merge good?
- Intuition: Similar centroids → Good merge
- Measure merge quality by error of centroid clustering
  - Radius, Diameter, Manhattan Distance, …
Centroids (join prob. to aα, join prob. to cα):
  bα: (0.4, 0.3)
  bβ: (0.3, 0.4)
  bγ: (0.1, 0.1)
[Figure: nodes bα(3), bβ(2), bγ(5) joined to aα(10) and cα(8), with their centroids plotted as points.]
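As an illustration of ranking candidate merges by centroid distance, the sketch below computes Manhattan distances between the three centroids listed on this slide and picks the closest pair. It is a simple pairwise comparison with hypothetical names, not the clustering procedure used in the talk.

```python
from fractions import Fraction as F
from itertools import combinations

# Centroids from the slide: (join prob. to a_alpha, join prob. to c_alpha).
CENTROIDS = {
    "b_alpha": (F(4, 10), F(3, 10)),
    "b_beta": (F(3, 10), F(4, 10)),
    "b_gamma": (F(1, 10), F(1, 10)),
}


def manhattan(p, q):
    """Manhattan (L1) distance between two centroids."""
    return sum(abs(x - y) for x, y in zip(p, q))


def closest_pair(centroids):
    """Pair of node names whose centroids are nearest in Manhattan distance."""
    return min(
        combinations(sorted(centroids), 2),
        key=lambda pair: manhattan(centroids[pair[0]], centroids[pair[1]]),
    )


if __name__ == "__main__":
    for u, v in combinations(sorted(CENTROIDS), 2):
        print(u, v, manhattan(CENTROIDS[u], CENTROIDS[v]))
    print(closest_pair(CENTROIDS))
```

Under this measure bα and bβ are the closest pair (distance 0.2 versus 0.5 for either pair involving bγ), so merging them would be the least lossy choice among the three.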

20 Construction Algorithm
- Database → Reference synopsis
  - All-but-one similar (lossless) merges
  - Adaptive selection of merge operations
- Reference Synopsis → Join Compressed TuG
  - Lossy merges
  - Good merges are identified by adaptive clustering
  - Clustering algorithm: BIRCH [Zhang+96] + CM-Sketches [CM04]
- Join Compressed TuG → Value Compressed TuG
  - Value distributions → Histograms
  - Histograms are shared among nodes with similar distributions
[Figure: pipeline Database → (Lossless Merge Compression) → Reference Synopsis → (Lossy Merge Compression) → TuG → (Value Compression) → TuG.]

22 Techniques
- TuGs
- Join Synopses [Acharya+99]
- Multidimensional wavelets [Chakrabarti+00]
- Single dimensional histograms [Poosala+96]
  - Generated by commercial database System X

23 Data and Queries
Data Sets:
  Dataset  Size    Data Graph Nodes  Space Budget
  TPCH     1 GB    8 Million         30 KB
  IMDB     139 MB  4.7 Million       20 KB
Workload:
- ~200 randomly generated positive queries
- 4-8 join predicates
- 1-7 value predicates

24 Evaluation Metric
- Absolute relative error (ARE)
- Sanity bound = 10th percentile of the true selectivities of the workload
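The formula for the metric is not legible in the transcript. The sketch below assumes a commonly used form, ARE = |true - estimate| / max(true, sanity bound), together with a nearest-rank percentile; both choices, the function names, and the workload numbers are assumptions made for illustration only.

```python
def sanity_bound(true_selectivities, percentile=10):
    """Nearest-rank percentile of the workload's true selectivities.

    Assumption: the slide does not say how the percentile is computed;
    nearest-rank is used here for simplicity.
    """
    ordered = sorted(true_selectivities)
    # Integer ceiling of percentile * n / 100, clamped to at least 1.
    rank = max(1, -(-percentile * len(ordered) // 100))
    return ordered[rank - 1]


def absolute_relative_error(true_count, estimate, bound):
    """|true - estimate| / max(true, bound).

    Assumption: this is a common form of ARE with a sanity bound, which keeps
    queries with very small true counts from dominating the error.
    """
    return abs(true_count - estimate) / max(true_count, bound)


if __name__ == "__main__":
    # Hypothetical true result sizes for a ten-query workload.
    workload = [2, 4, 8, 10, 20, 40, 80, 100, 200, 400]
    bound = sanity_bound(workload)
    print(bound, absolute_relative_error(100, 70, bound))
```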

25 Estimation Error - TPCH
- TuG error is less than 30% for 56% of the queries in the workload
- Join Synopsis error is less than 30% for 40% of the queries in the workload
- Histogram error is less than 30% for 25% of the queries in the workload

26 Estimation Error - IMDB
- TuGs have significantly less estimation error for most queries in the workload
- Join Synopses are not applicable for this schema

27 Conclusions
- TuG Synopses
  - Schema-level relational summaries
  - Model: Semi-structured view of the relational data set
  - Selectivity estimates for complex join queries
  - Support for a large class of practical schemata
  - Effective construction algorithm
- Experimental Results
  - Accurate selectivity estimates given a small budget
  - Benefits over existing techniques

28 Questions?

29 Construction Times
- TuG Construction Times
  - TPCH: 55 minutes
  - IMDB: 85 minutes
- Histograms and Join Synopses can be constructed relatively quickly (e.g., < 10 minutes for our datasets)
- Multidimensional wavelets are prohibitively expensive to construct over key values
[Figure: pipeline Database → (Lossless Merge Compression) → Reference Synopsis → (Lossy Merge Compression) → TuG → (Value Compression) → TuG.]

30 Estimation Error - IMDB

31 A synopsis should be:
- Accurate
- Much smaller than the database
- Efficient to construct
- Applicable for any schema and query
  - Many-to-many relationships
  - Join graphs with cycles
[Figure: example schemata over Movies, Cast, Actors and over Customer, Region, Orders.]
