Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.

Presentation transcript:

Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz

2 Motivation
- Problem: determine the result cardinality of a complex relational query
  - Query optimization: cost factors of candidate plans depend on query selectivity
  - Data exploration: query selectivity provides timely feedback
- Solution: approximate selectivity over data synopses
[Figure: evaluating Count(Q) over the relational database gives the selectivity but is expensive; evaluating Count(Q) over a database synopsis gives a selectivity estimate efficiently.]

3 Previous Work
- Table-level synopses
  - Examples: Histograms [Poosala+96], Sketches [Dobra+02], Wavelets [Chakrabarti+00], Table samples [Lipton+93]
  - Weakness: do not summarize key values well
- Schema-level synopses
  - Examples: Join Synopses [Acharya+99], PRMs [Getoor+01]
  - Weakness: restricted to certain types of schemata
[Figure: relations R, T, Z, W with table-level synopses S_R, S_Z, S_T, S_W and schema-level synopses S_RZT, S_RZW.]

4 Synopsis Desiderata
- Schema level → Capture key/foreign-key joins
- Applicable to general schemata and queries

5 Contributions
- Tuple Graph Synopsis (TuG)
  - Model: semi-structured view of relational data
  - Schema-level summary
  - Schemata with many-to-many relationships
  - Complex join queries
- TuG construction algorithm
  - Basis: tuple clustering
  - Novel heuristics
  - Builds on existing clustering techniques
- Experimental study
  - TuGs are effective synopses for small space budgets
  - Better accuracy compared to previous techniques

6 Outline
- The TuG Synopsis
  - Synopsis model
  - Estimation framework
- TuG Construction
  - TuG Compression
  - Construction Algorithm
- Experimental Study
- Conclusions

7 TuG Model: Intuition #1
Relational database ↔ Data graph

Movies
mid  year  genre
1    2005  Action
2    2004  Action
3    2000  Drama

Actors
aid  sex
1    Male
2    Female
3    Male
4

Cast
mid  aid

[Figure: data graph with movie nodes m1–m3, cast nodes c1–c5, actor nodes a1–a4, and value nodes Action, Drama, Female, Male.]

8 TuG Model: Intuition #2
- Join query ↔ Sub-graph matching
- Selectivity ↔ Count of matching sub-graphs

    SELECT *
    FROM M, C, A
    WHERE M.mid = C.mid AND C.aid = A.aid
      AND A.sex = 'Male' AND M.genre = 'Drama'

[Figure: the query drawn as the path M – C – A with value predicates Drama on M and Male on A, matched against the data graph of movie nodes m1–m3, cast nodes c1–c5, and actor nodes a1–a4.]
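The correspondence between join selectivity and sub-graph counting can be sketched in a few lines of Python. The Movies rows and the first three Actors rows follow the slide; the Cast rows and the sex of actor 4 do not survive in the transcript, so the values used here are hypothetical, as is the helper name `count_matches`.

```python
# Toy instance of the Movies / Cast / Actors data graph.
# Movies and actors 1-3 follow the slide; the Cast rows and actor 4's sex
# are HYPOTHETICAL values made up for illustration.
movies = {
    1: {"year": 2005, "genre": "Action"},
    2: {"year": 2004, "genre": "Action"},
    3: {"year": 2000, "genre": "Drama"},
}
actors = {1: "Male", 2: "Female", 3: "Male", 4: "Male"}
cast = [(1, 1), (1, 2), (2, 3), (3, 3), (3, 4)]  # (mid, aid) pairs


def count_matches(genre, sex):
    """Count movie - cast - actor paths whose end nodes carry the given values.

    Each Cast tuple links exactly one movie and one actor, so every path
    M - C - A corresponds to one Cast tuple that satisfies both predicates.
    """
    return sum(
        1
        for mid, aid in cast
        if movies[mid]["genre"] == genre and actors[aid] == sex
    )


# Selectivity of the slide's query (genre = Drama, sex = Male) on the toy data.
print(count_matches("Drama", "Male"))
```

On this toy instance the count equals the number of rows the SQL join would return, which is the quantity the synopsis is meant to approximate.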

9 TuG Synopsis Model
- Tuple node: set of tuples from the same relation
  - Node count: number of tuples
- Edge: join between tuple sets
  - Edge count: result size of join
[Figure: the data graph (movie nodes m1–m3, cast nodes c1–c5, actor nodes a1–a3, value nodes Action, Drama, Female, Male) next to its TuG with tuple nodes m_α(2), m_β(1), c_α(3), c_β(2), a_α(3).]

10 TuG Synopsis Model
- Value node: a distinct value in the database
- Value edge: appearance of value in the tuple set
  - Edge count: frequency of value in the tuple set
- In practice, distributions are compressed with summaries
[Figure: the data graph next to its TuG with tuple nodes m_α(2), m_β(1), c_α(3), c_β(2), a_α(3) and value nodes Action, Drama, Female, Male.]

11 TuG Semantics
- Assumption 1: Independence across edges
- Assumption 2: Uniformity along each edge
For each actor:
- prob[c_α–a_α] = 1/3
- prob[c_β–a_α] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3
[Figure: TuG with tuple nodes m_α(2), m_β(1), c_α(3), c_β(2), a_α(3) and value nodes Action, Drama, Female, Male.]
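One reading that reproduces the probabilities on this slide is that, under the uniformity assumption, a join probability is the edge count divided by the product of the two node counts, and a value probability is the value frequency divided by the node count. The sketch below follows that reading. The node counts come from the slide; the edge counts (3 and 2) and value frequencies (1 and 2) are assumptions chosen to be consistent with the stated probabilities, and the function names are hypothetical.

```python
from fractions import Fraction


def edge_prob(edge_count, count_u, count_v):
    """Probability that a given tuple of u joins a given tuple of v,
    assuming the join pairs are spread uniformly along the edge."""
    return Fraction(edge_count, count_u * count_v)


def value_prob(frequency, node_count):
    """Probability that a tuple of the node carries the given value."""
    return Fraction(frequency, node_count)


# Node counts from the slide: c_alpha(3), c_beta(2), a_alpha(3).
# Edge counts and value frequencies are ASSUMED, not read off the slide.
print(edge_prob(3, 3, 3))   # c_alpha - a_alpha
print(edge_prob(2, 2, 3))   # c_beta - a_alpha
print(value_prob(1, 3))     # sex = Female
print(value_prob(2, 3))     # sex = Male
```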

12 Tuple Clustering
- Tuple node → Cluster
- Join and value probabilities → Centroid
- Validity of assumptions → Error of clustering
- Tight clusters → Valid assumptions → Accurate synopsis
The centroid of a_α: (c_α: 1/3, c_β: 1/3, Female: 1/3, Male: 2/3)
For each actor:
- prob[c_α–a_α] = 1/3
- prob[c_β–a_α] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3
[Figure: TuG with tuple nodes m_α(2), m_β(1), c_α(3), c_β(2), a_α(3) and value nodes Action, Drama, Female, Male.]

13 Selectivity Estimation
- Single-pass estimation algorithm
- Accuracy depends on the validity of our assumptions
- Tight clustering → Accurate estimates

Sub-graph selectivity
  = (2 · 3 · 3) · Prob[2005 ∧ m_α–c_α ∧ c_α–a_α ∧ Male]
  = (2 · 3 · 3) · Prob[2005] · Prob[m_α–c_α] · Prob[c_α–a_α] · Prob[Male]
  = 1 tuple

[Figure: the query path M – C – A with its value predicates, matched against the TuG with tuple nodes m_α(2), m_β(1), c_α(3), c_β(2), a_α(3) and value nodes Action, Drama, Female, Male.]
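The product form above can be checked numerically. The following is a minimal sketch for a single embedding of the query in the TuG, not the authors' estimation algorithm. The node counts and Prob[Male] = 2/3 come from the slides; the edge counts and Prob[2005] = 1/2 are assumptions chosen so that the product matches the stated result of 1 tuple. All names are hypothetical.

```python
from fractions import Fraction
from math import prod

# Node counts from the slide.
node_counts = {"m_alpha": 2, "c_alpha": 3, "a_alpha": 3}
# ASSUMED edge counts (join result sizes between the tuple sets).
edge_counts = {("m_alpha", "c_alpha"): 3, ("c_alpha", "a_alpha"): 3}
# Value probabilities: Male is from the slide, 2005 is ASSUMED.
value_probs = {
    ("m_alpha", "year=2005"): Fraction(1, 2),
    ("a_alpha", "sex=Male"): Fraction(2, 3),
}


def estimate(nodes, edges, predicates):
    """Estimate the match count for one embedding of a query in the TuG.

    Independence across edges lets the probabilities multiply; uniformity
    along an edge gives edge_count / (count_u * count_v) per join.
    """
    candidates = prod(node_counts[n] for n in nodes)
    probability = Fraction(1)
    for u, v in edges:
        probability *= Fraction(edge_counts[(u, v)], node_counts[u] * node_counts[v])
    for predicate in predicates:
        probability *= value_probs[predicate]
    return candidates * probability


result = estimate(
    nodes=["m_alpha", "c_alpha", "a_alpha"],
    edges=[("m_alpha", "c_alpha"), ("c_alpha", "a_alpha")],
    predicates=[("m_alpha", "year=2005"), ("a_alpha", "sex=Male")],
)
print(result)
```

A query that embeds into several TuG paths would need the per-embedding estimates combined; that step is left out of this sketch.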

14 Outline
- The TuG Synopsis
  - Synopsis model
  - Estimation framework
- TuG Construction
  - TuG Compression
  - Construction Algorithm
- Experimental Study
- Conclusions

15 The Node-Merge Operation
- Collapse a set of nodes into one new node
- New node acquires aggregate characteristics
  - New centroid represents the union of the tuple sets
- When is a merge lossless?
- How do we quantify lossy merges?
Example: merge a_α(2) and a_β(2) into a_γ(4); all join with c_α(6) and have value Male
- Centroid of a_α: (c_α: 1/3, Male: 1)
- Centroid of a_β: (c_α: 1/6, Male: 1)
- Centroid of a_γ: (c_α: 1/4, Male: 1)
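The centroids in this example are consistent with taking a tuple-count-weighted average of the merged nodes' centroids, which is what summing edge counts over the union of the tuple sets amounts to. A minimal sketch under that assumption follows; `merge_centroids` and the dictionary keys are hypothetical names.

```python
from fractions import Fraction


def merge_centroids(nodes):
    """Merge tuple nodes given as (tuple_count, centroid) pairs.

    The merged centroid is the count-weighted average of the inputs,
    so it describes the union of the underlying tuple sets.
    """
    total = sum(count for count, _ in nodes)
    keys = set().union(*(centroid.keys() for _, centroid in nodes))
    merged = {
        key: sum(count * centroid.get(key, Fraction(0)) for count, centroid in nodes) / total
        for key in keys
    }
    return total, merged


a_alpha = (2, {"c_alpha": Fraction(1, 3), "Male": Fraction(1)})
a_beta = (2, {"c_alpha": Fraction(1, 6), "Male": Fraction(1)})

count, centroid = merge_centroids([a_alpha, a_beta])
print(count, centroid["c_alpha"], centroid["Male"])
```

The merged join probability of 1/4 differs from both inputs (1/3 and 1/6), which is why this particular merge loses information about the individual nodes.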

16 Full Similarity Merge
- Nodes are fully similar if they have the same centroids
Example: merge a_α(2) and a_β(4) into a_γ(6); each of the three nodes has centroid (c_α: 1/6, Male: 1)

17 All-but-one Similarity
- Nodes are all-but-one similar if they have the same centroids with respect to all schema neighbors but one
- Theorem: a merge of all-but-one similar nodes is lossless
- Order of merging can affect final compression
- Potential application in other domains (e.g., XML summarization)
Example: merge a_α(2) and a_β(4) into a_γ(6) (lossless); neighbors are c_α(12), c_β(8) and the values Male, Female
- Centroid of a_α: (c_α: 1/6, c_β: 1/8, Male: 1, Female: 0)
- Centroid of a_β: (c_α: 1/6, c_β: 1/8, Male: 1/4, Female: 3/4)
- Centroid of a_γ: (c_α: 1/6, c_β: 1/8, Male: 1/2, Female: 1/2)
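The similarity test can be sketched by grouping centroid entries by the schema neighbor they belong to and counting the neighbors on which two centroids disagree. This is an illustrative check only, not the authors' procedure; the mapping `neighbor_of`, the neighbor labels, and the function names are hypothetical, and fully similar nodes are treated as trivially all-but-one similar.

```python
from fractions import Fraction as F


def differing_neighbors(centroid_a, centroid_b, neighbor_of):
    """Return the schema neighbors on which the two centroids disagree."""
    keys = centroid_a.keys() | centroid_b.keys()
    return {
        neighbor_of[key]
        for key in keys
        if centroid_a.get(key, 0) != centroid_b.get(key, 0)
    }


def all_but_one_similar(centroid_a, centroid_b, neighbor_of):
    """True if the centroids agree on every schema neighbor except at most one."""
    return len(differing_neighbors(centroid_a, centroid_b, neighbor_of)) <= 1


# Centroid entries grouped by the schema neighbor they describe (hypothetical labels).
neighbor_of = {"c_alpha": "Cast", "c_beta": "Cast", "Male": "sex", "Female": "sex"}

a_alpha = {"c_alpha": F(1, 6), "c_beta": F(1, 8), "Male": F(1), "Female": F(0)}
a_beta = {"c_alpha": F(1, 6), "c_beta": F(1, 8), "Male": F(1, 4), "Female": F(3, 4)}

print(differing_neighbors(a_alpha, a_beta, neighbor_of))
print(all_but_one_similar(a_alpha, a_beta, neighbor_of))
```

In the slide's example the two actor nodes agree on their join probabilities to Cast and differ only in the sex distribution, so they qualify for a lossless merge under the theorem.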

18 Effect of All-but-one Similarity
Number of nodes in synopsis graph:

Data set  Data graph   Full-Similarity Synopsis  Ab1-Similarity Synopsis
TPCH      8 million    4.4 million               33K
IMDB      4.7 million  4.5 million               65K

19 Lossy Merges
- Question: When is a lossy merge good?
- Intuition: Similar centroids → Good merge
- Measure merge quality by error of centroid clustering
  - Radius, Diameter, Manhattan Distance, …
Example: b_α(3), b_β(2), b_γ(5) join with a_α(10) and c_α(8)

Centroid  Join prob. to a_α  Join prob. to c_α
b_α       0.4                0.3
b_β       0.3                0.4
b_γ       0.1                0.1
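To make the intuition concrete, the sketch below computes pairwise Manhattan distances between the three example centroids and picks the closest pair as the candidate merge. It is a simplification: it ignores node counts and does not reproduce the clustering machinery used in the actual construction algorithm. The variable and function names are hypothetical.

```python
from itertools import combinations

# Centroids from the slide: (join prob. to a_alpha, join prob. to c_alpha).
centroids = {
    "b_alpha": (0.4, 0.3),
    "b_beta": (0.3, 0.4),
    "b_gamma": (0.1, 0.1),
}


def manhattan(p, q):
    """Sum of absolute coordinate differences between two centroids."""
    return sum(abs(x - y) for x, y in zip(p, q))


# Distance for every unordered pair of nodes.
distances = {
    (u, v): manhattan(centroids[u], centroids[v])
    for u, v in combinations(centroids, 2)
}

# The pair with the most similar centroids is the least lossy merge candidate.
best_pair = min(distances, key=distances.get)
print(best_pair, round(distances[best_pair], 3))
```

Here b_α and b_β are about 0.2 apart while b_γ is about 0.5 away from either, so merging b_α with b_β distorts the centroids least.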

20 Construction Algorithm
- Database → Reference synopsis
  - All-but-one similar (lossless) merges
  - Adaptive selection of merge operations
- Reference synopsis → Join-compressed TuG
  - Lossy merges
  - Good merges are identified by adaptive clustering
  - Clustering algorithm: BIRCH [Zhang+96] + CM-Sketches [CM04]
- Join-compressed TuG → Value-compressed TuG
  - Value distributions → Histograms
  - Histograms are shared among nodes with similar distributions
[Figure: pipeline from Database to Reference Synopsis (lossless merge compression), to TuG (lossy merge compression), to TuG (value compression).]

21 Outline
- The TuG Synopsis
  - Synopsis model
  - Estimation framework
- TuG Construction
  - TuG Compression
  - Construction Algorithm
- Experimental Study
- Conclusions

22 Techniques
- TuGs
- Join Synopses [Acharya+99]
- Multidimensional wavelets [Chakrabarti+00]
- Single-dimensional histograms [Poosala+96]
  - Generated by commercial database System X

23 Data and Queries
Data sets:

Dataset  Size    Data Graph Nodes  Space Budget
TPCH     1 GB    8 million         30 KB
IMDB     139 MB  4.7 million       20 KB

Workload:
- ~200 randomly generated positive queries
- 4-8 join predicates
- 1-7 value predicates

24 Evaluation Metric
- Absolute relative error (ARE)
- Sanity bound = 10th percentile of the true selectivities of the workload
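The formula itself does not survive in the transcript. A commonly used form of absolute relative error with a sanity bound divides the absolute error by the larger of the true selectivity and the bound, which keeps very small true counts from inflating the error. The sketch below assumes that form, uses a nearest-rank percentile (also an assumption), and runs on a made-up workload; the function names are hypothetical.

```python
import math


def sanity_bound(true_selectivities, percentile=10):
    """Nearest-rank percentile of the workload's true selectivities (ASSUMED method)."""
    ordered = sorted(true_selectivities)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def absolute_relative_error(true_selectivity, estimate, bound):
    """ASSUMED form: |true - estimate| / max(true, sanity bound)."""
    return abs(true_selectivity - estimate) / max(true_selectivity, bound)


# Made-up true selectivities for a ten-query workload (illustration only).
workload = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bound = sanity_bound(workload)

print(bound)
print(absolute_relative_error(100, 70, bound))  # large query: ordinary relative error
print(absolute_relative_error(2, 5, bound))     # tiny query: the bound caps the denominator
```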

25 Estimation Error - TPCH
- TuG error is less than 30% for 56% of the queries in the workload
- Join Synopsis error is less than 30% for 40% of the queries in the workload
- Histogram error is less than 30% for 25% of the queries in the workload

26 Estimation Error - IMDB
- TuGs have significantly less estimation error for most queries in the workload
- Join Synopses are not applicable for this schema

27 Conclusions
- TuG Synopses
  - Schema-level relational summaries
  - Model: semi-structured view of the relational data set
  - Selectivity estimates for complex join queries
  - Support for a large class of practical schemata
  - Effective construction algorithm
- Experimental Results
  - Accurate selectivity estimates given a small budget
  - Benefits over existing techniques

28 Questions?

29 Construction Times
- TuG construction times
  - TPCH: 55 minutes
  - IMDB: 85 minutes
- Histograms and Join Synopses can be constructed relatively quickly (e.g., < 10 minutes for our datasets)
- Multidimensional wavelets are prohibitively expensive to construct over key values
[Figure: pipeline from Database to Reference Synopsis (lossless merge compression), to TuG (lossy merge compression), to TuG (value compression).]

30 Estimation Error - IMDB

31 A synopsis should be:
- Accurate
- Much smaller than the database
- Efficient to construct
- Applicable for any schema and query
  - Many-to-many relationships
  - Join graphs with cycles
[Figure: example schemas over Movies, Cast, Actors and over Customer, Region, Orders.]