STHoles: A Multidimensional Workload-Aware Histogram Nicolas Bruno* Columbia University Luis Gravano* Columbia University Surajit Chaudhuri Microsoft Research.


STHoles: A Multidimensional Workload-Aware Histogram Nicolas Bruno* Columbia University Luis Gravano* Columbia University Surajit Chaudhuri Microsoft Research SIGMOD 2001 * Work done in part while the authors were visiting Microsoft Research.

2 Histograms as Succinct Data Set Summaries
• Used for selectivity estimation and approximate query processing.
• Data set partitioned into buckets, each approximated by aggregate statistics.

3 Histograms
• Each bucket consists of a bounding box and a tuple frequency value.
• Uniformity is assumed inside buckets.
  – Histograms should partition the data set into buckets with uniform tuple density.
• Multi-dimensional data makes partitioning even more challenging.
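
The uniformity assumption can be illustrated with a minimal sketch. It assumes a bucket is an axis-aligned box given as one (low, high) interval per dimension; the names `volume`, `intersect`, and `estimate_in_bucket` are hypothetical and not taken from the slides.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


def volume(box: Box) -> float:
    """Volume of an axis-aligned box; an empty or inverted interval counts as 0."""
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def intersect(a: Box, b: Box) -> Box:
    """Per-dimension intersection of two boxes (may be empty)."""
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


def estimate_in_bucket(bucket_box: Box, frequency: float, query: Box) -> float:
    """Estimated tuples of one bucket that fall in `query`, assuming the
    bucket's tuples are spread uniformly over its bounding box."""
    total = volume(bucket_box)
    if total == 0.0:
        return 0.0
    return frequency * volume(intersect(bucket_box, query)) / total
```

For a bucket covering [0, 10] x [0, 10] with 200 tuples, a query covering a quarter of the box is estimated at a quarter of the tuples.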

4 Outline
• Overview of existing multidimensional histogram techniques.
• Introduction to STHoles histograms.
• System architecture and STHoles construction algorithm.
• Experimental evaluation.

5 Histogram Techniques: EquiDepth [Muralikrishna and DeWitt 1988]
[Figure panels: Gaussian Data Set; EquiDepth Histogram]
• Correctly identifies core of densest clusters.
• Partitioning uses “equi-count” instead of “equi-density”.

6 Histogram Techniques: MHist [Poosala and Ioannidis 1997]
[Figure panels: Gaussian Data Set; MHist Histogram]
• Works well for highly skewed data distributions.
• Devotes too many buckets to the densest clusters.
• Bad initial “choices” are amplified in later steps.

7 Histogram Techniques: GenHist [Gunopulos et al. 2000]
[Figure panels: Gaussian Data Set; GenHist Histogram]
• More robust than previous techniques (based on multidimensional information).
• Difficult to choose the right values for its various parameters.
• Requires at least 5-10 passes over the data.

8 Histogram Techniques: STGrid [Aboulnaga and Chaudhuri 1999]
[Figure panels: Gaussian Data Set; STGrid Histogram]
• Incorporates feedback from query execution.
• Grid partitioning strategy is sometimes too rigid.
• Focuses on efficiency rather than accuracy.

9 Our New Histogram Technique: STHoles
• Flexible bucket partitioning.
• Exploits workload information to allocate buckets.
• Query feedback captures uniformly dense regions.
• Does not examine actual data set.

10 STHoles Histograms
• Tree structure among buckets.
• Buckets with holes: relax the rectangular-region restriction while still using rectangular bucket structures.
[Figure label: Non-rectangular region]
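
A tree of buckets with holes can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes each bucket's frequency counts only the tuples in its box that lie outside its children, and that children are disjoint boxes nested inside their parent. The names `Bucket`, `own_volume`, and `estimate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


def box_volume(box: Box) -> float:
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def clip(a: Box, b: Box) -> Box:
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


@dataclass
class Bucket:
    box: Box
    freq: float  # assumption: tuples in `box` that are NOT inside any child
    children: List["Bucket"] = field(default_factory=list)


def own_volume(bucket: Bucket, region: Box) -> float:
    """Volume of `region` inside the bucket's box but outside its holes."""
    v = box_volume(clip(bucket.box, region))
    for child in bucket.children:
        v -= box_volume(clip(child.box, region))
    return v


def estimate(bucket: Bucket, query: Box) -> float:
    """Uniformity-based estimate, summed over the bucket and its descendants."""
    total_own = own_volume(bucket, bucket.box)
    est = 0.0
    if total_own > 0.0:
        est = bucket.freq * own_volume(bucket, query) / total_own
    return est + sum(estimate(child, query) for child in bucket.children)
```

With a root over [0, 10] x [0, 10] holding 75 tuples outside a child over [0, 5] x [0, 5] with 100 tuples, a query over [0, 5] x [0, 10] is estimated as 25 tuples from the root's non-rectangular remainder plus 100 from the child.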

11 System Architecture for STHoles Range Query

12 STHoles Construction Algorithm
• Initialize histogram H as an empty histogram.
• For each query q in workload:
  1- Gather simple statistics from query results.
  2- Identify candidate holes and drill (add) them as new buckets in H.
  3- Merge superfluous buckets in H.
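
The three steps above can be sketched as a driver loop. This is only a skeleton under stated assumptions: the histogram is simplified to a flat list of buckets, and `run_query`, `drill_holes`, `merge_buckets`, and `budget` are hypothetical callables and parameters supplied by the caller, not names from the slides.

```python
def build_histogram(workload, run_query, drill_holes, merge_buckets, budget):
    """Skeleton of the feedback loop: drill new buckets per query, then merge
    until the histogram fits in `budget` buckets.

    All callables are hypothetical placeholders:
      run_query(q)               -> result stream for query q
      drill_holes(h, q, results) -> adds candidate buckets to histogram h
      merge_buckets(h)           -> removes one bucket from h by merging
    """
    histogram = []  # start from an empty histogram
    for q in workload:
        results = run_query(q)             # 1- gather statistics from results
        drill_holes(histogram, q, results)  # 2- drill candidate holes
        while len(histogram) > budget:      # 3- merge superfluous buckets
            merge_buckets(histogram)
    return histogram
```

Passing the steps in as functions keeps the loop independent of how drilling and merging are actually implemented.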

13 Drilling New Candidate Buckets
For each query q in workload and bucket b in histogram:
• Count how many tuples in result stream lie inside q ∩ b.
• Drill q ∩ b as a new bucket (child of b).
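
A simplified sketch of the drilling step follows. It assumes that q ∩ b is already rectangular and does not partially overlap existing children of b, that intervals are half-open, and that the parent's frequency is reduced by the tuples moved to the new child; the last point is an assumption of this sketch rather than something stated on the slide. Buckets are plain dictionaries and `drill` is a hypothetical name.

```python
def clip(a, b):
    """Per-dimension intersection of two boxes given as (low, high) pairs."""
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


def contains(box, point):
    """True if `point` lies in `box`, using half-open intervals [low, high)."""
    return all(lo <= x < hi for (lo, hi), x in zip(box, point))


def drill(bucket, query, result_stream):
    """Count result tuples inside q ∩ b and add q ∩ b as a child of b."""
    hole = clip(bucket["box"], query)
    if any(lo >= hi for lo, hi in hole):
        return None  # q and b do not overlap: nothing to drill
    count = sum(1 for t in result_stream if contains(hole, t))
    child = {"box": hole, "freq": count, "children": []}
    bucket["children"].append(child)
    # Assumption: tuples now attributed to the child leave the parent's count.
    bucket["freq"] = max(0, bucket["freq"] - count)
    return child
```

Only the query's result stream is inspected, which matches the slides' point that the data set itself is never scanned.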

14 Shrinking Candidate Buckets
• Partition constraint: Bounding boxes must be rectangular.
• Apply greedy technique to shrink a candidate hole to a rectangle.
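
One possible greedy rule is sketched below: while some existing child bucket partially overlaps the candidate, cut the candidate along the single dimension and side that excludes one such child while keeping the most volume. The slide does not spell out the selection criterion, so the "keep the most volume" rule and the names `shrink`, `overlaps`, and `inside` are assumptions.

```python
def volume(box):
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def overlaps(a, b):
    """True if the two boxes share a region of non-zero extent in every dimension."""
    return all(max(al, bl) < min(ah, bh) for (al, ah), (bl, bh) in zip(a, b))


def inside(inner, outer):
    return all(ol <= il and ih <= oh for (il, ih), (ol, oh) in zip(inner, outer))


def shrink(candidate, children):
    """Greedily shrink `candidate` until no child box partially overlaps it."""
    c = list(candidate)
    while True:
        partial = [p for p in children if overlaps(p, c) and not inside(p, c)]
        if not partial:
            return c
        best = None
        for p in partial:
            for d, ((clo, chi), (plo, phi)) in enumerate(zip(c, p)):
                # Keep the part below p, or the part above p, in dimension d.
                for interval in ((clo, max(clo, plo)), (min(chi, phi), chi)):
                    trial = c[:d] + [interval] + c[d + 1:]
                    if best is None or volume(trial) > volume(best):
                        best = trial
        c = best
```

Each cut removes at least one partially overlapping child for good, because the candidate only ever gets smaller, so the loop ends after at most one iteration per child.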

15 Merging Buckets
• To avoid exceeding available space.
• Merge most “similar” buckets in terms of tuple density.

16 Parent-Child Merges
• Eliminate buckets too similar to their parents.
• Example: The interesting region in bc is covered by its child b1.
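
Density similarity between a parent and a child can be quantified with a penalty such as the one sketched here: the change in estimated tuple counts for each of the two regions if they were merged into one uniform bucket. The slides do not give the formula, so treat this as an assumed density-difference penalty; see the full paper for the authors' definition. The function names are hypothetical.

```python
def parent_child_penalty(f_parent, v_parent, f_child, v_child):
    """Assumed penalty for folding a child into its parent.

    f_* are tuple frequencies and v_* are the (hole-excluding) volumes of
    the two regions. The merged bucket spreads f_parent + f_child uniformly
    over v_parent + v_child; the penalty sums how far each region's new
    estimate moves from its old frequency.
    """
    f_new = f_parent + f_child
    v_new = v_parent + v_child
    return (abs(f_parent - f_new * v_parent / v_new)
            + abs(f_child - f_new * v_child / v_new))


def cheapest_merge(candidates):
    """Pick the (label, penalty) pair with the smallest penalty."""
    return min(candidates, key=lambda pair: pair[1])
```

A child with the same density as its parent yields a penalty of zero, so it is the first to be merged away when space runs out.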

17 Sibling-Sibling Merges
• Consolidate buckets with similar densities that cover close regions.
• Extrapolate frequency distributions to yet unseen regions.

18 An Example STHoles Histogram
[Figure panels: Gaussian Data Set; STHoles Histogram]

19 Experimental Setting
• Data Sets:
  – Real (UCI Repository): Sample of Census data set (200K tuples); Cover data set (500K tuples).
  – Synthetic: Variations of Gaussian and Zipfian (Array) distributions; 200K to 500K tuples, 2 to 4 dimensions.
• Histograms:
  – 1024 available bytes per histogram.
  – EquiDepth, MHist, GenHist, STGrid, STHoles.

20 Experimental Setting (cont.)
• Workloads [Pagel et al. 1993]:
  – 1,000 queries.
  – Query centers follow different distributions: Uniform, Biased, Gaussian.
  – Query boundaries follow different constraints: area covered, tuples covered.
[Figure panels: Census data set; Biased (tuples) workload; Gaussian (area) workload]
• Accuracy Metric: Absolute Error (with some normalization; details in paper).

21 Comparison with Other Approaches: Biased Workload
[Figure caption: Biased workload; query boundaries cover around 1% of the data domain.]

22 Comparison with Other Approaches: Uniform Workload
[Figure caption: Uniform workload; query boundaries cover around 1% of the data set tuples.]

23 Convergence with Workload Biased workload

24 Handling Data Set Updates From Gaussian to Zipfian data distributions.

25 Other Experiments
• Varying:
  – data skew.
  – data dimensionality.
  – histogram size.
  – workload generation parameters.
  – number of attributes in queries.
• Overhead for intercepting query results in Microsoft SQL Server 2000 is less than 8%.
• STHoles histograms lead to robust selectivity estimates across data distributions and workloads.
• See full paper for details!

26 Summary: STHoles, a Multidimensional Workload-Aware Histogram
• Exploits query feedback.
• Built without examining data set.
• Allows bucket nesting to capture complex shapes using only rectangular bucket structures.
• Results in robust and accurate selectivity estimations.
• In many cases, outperforms the best techniques that access full data sets.

27 Related Work (Histograms)
• Unidimensional:
  – EquiDepth [Piatetsky-Shapiro and Connell 1984]
  – MaxDiff [Poosala et al. 1996]
  – V-Optimal [Jagadish et al. 1998]
  – Many more!
• Multidimensional:
  – EquiDepth [Muralikrishna and DeWitt 1988]
  – MHist [Poosala and Ioannidis 1997]
  – GenHist [Gunopulos et al. 2000]
  – STGrid [Aboulnaga and Chaudhuri 1999]

28 Related Work (Other Techniques)
• Sampling [Olken and Rotem 1990]
• Wavelets [Matias et al. 1997]
• Discrete transformations [Lee et al. 1999]
• Parametric Curve Fitting [Chen and Roussopoulos 1994]

29 Evaluation Metric
• Absolute Error:
• Normalized Absolute Error:
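
The formulas on this slide did not survive in the transcript. The sketch below shows one common reading, offered as an assumption: absolute error as the mean absolute difference between estimated and actual result sizes over a workload, and the normalized version as that error divided by the error of a baseline estimator. The choice of baseline and both function names are hypothetical; the paper gives the authors' exact definitions.

```python
def absolute_error(estimates, actuals):
    """Mean of |estimate - actual| over the queries of a workload."""
    if not estimates or len(estimates) != len(actuals):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - a) for e, a in zip(estimates, actuals)) / len(estimates)


def normalized_absolute_error(estimates, baseline_estimates, actuals):
    """Absolute error divided by the absolute error of a baseline estimator.

    Assumption: the baseline is some trivial estimator chosen by the caller.
    Raises ZeroDivisionError if the baseline happens to be exact.
    """
    return (absolute_error(estimates, actuals)
            / absolute_error(baseline_estimates, actuals))
```

Normalizing by a baseline makes errors comparable across data sets of different sizes, since values below 1 mean the histogram beats the baseline.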

30 Overhead Evaluation over Microsoft SQL Server 2000

31 Varying Histogram Size
[Figure panels: Gaussian Data Set; Zipfian Data Set; Census Data Set]

32 Varying Spatial Selectivity
[Figure panels: Gaussian Data Set; Zipfian Data Set; Census Data Set]