Histograms for Selectivity Estimation

Histograms for Selectivity Estimation Speaker: Ho Wai Shing

Contents
- Introduction: What is a histogram? How to use a histogram?
- A taxonomy of single-dimensional histograms
- Some experimental results
- Some approaches for multi-dimensional histograms
- Conclusions
- Future Work

Introduction
- Many modules of a DB require selectivity estimation (estimating the size of a query result), e.g.:
  - query optimizer: determine the nesting in an indexed nested-loop join
  - user interface: return a rough answer to the users

Introduction
- We need to store some statistics about the database to estimate selectivity.
- Histograms are among the most common statistics stored in practice.
- They are quite accurate and need reasonably little space.

What is a Histogram?
- Histograms approximate the frequency distribution of an attribute (or a set of attributes).
- They group attribute values into "buckets".
- They approximate the actual frequencies by the statistical information stored in each bucket.

Histogram: Example
- Consider the following distribution: [figure not preserved]
- This is an "equi-width" histogram: [figure not preserved]; its estimate for value 1 is approximately 3.

Histogram: the problem
- A pair of dual problems [1]:
  - Given a data distribution, a limit B on the length of H, and an error metric E(), find the histogram H that minimizes E(H).
  - Given the data distribution, a limit ε on the error, and an error metric E(), find the histogram H of smallest length for which E(H) is at most ε.
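
The two dual problems can be written compactly as constrained optimizations. This is only a restatement of the slide; the notation |H| for the length (number of buckets) of H and the symbol ε for the error limit are assumptions introduced here for illustration.

```latex
% Space-bounded problem: best accuracy within B buckets
\min_{H} \; E(H) \quad \text{subject to} \quad |H| \le B

% Error-bounded problem: fewest buckets within error \epsilon
\min_{H} \; |H| \quad \text{subject to} \quad E(H) \le \epsilon
```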

Two Goals for Histograms
- for selections (exact or range queries)
- for joins
- This talk focuses on histograms for selections.

Taxonomy of Histograms
- based on the paper by Poosala et al. [2] in SIGMOD'96 on single-dimensional histograms
- the paper proposed a generalized histogram-generating algorithm
- different decisions in each step result in different histograms

Generalized Histogram Generating Algorithm
1. Consider the data distribution as a two-column table T(value, frequency).
2. Create a third attribute a3 (the sort parameter) based on the first two attributes, and sort the table according to a3.
3. Specify a subclass of histograms.
4. Create a fourth attribute a4 (the source parameter).
5. Partition T into B buckets such that the partition satisfies some constraints on a4.
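
The steps of the generalized algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm from [2]: the function name `generalized_histogram`, the greedy equi-sum partitioning rule, and the sample table are all assumptions made for the example. With a3 = value and a4 = frequency, the sketch behaves like an equi-depth histogram.

```python
from typing import Callable, List, Tuple

Row = Tuple[float, float]  # (value, frequency)


def generalized_histogram(
    table: List[Row],
    sort_param: Callable[[Row], float],    # computes a3 for a row
    source_param: Callable[[Row], float],  # computes a4 for a row
    num_buckets: int,
) -> List[List[Row]]:
    """Sort T by a3, then cut it into contiguous buckets with roughly equal sums of a4."""
    rows = sorted(table, key=sort_param)       # sort the table according to a3
    a4 = [source_param(r) for r in rows]       # derive the source parameter a4
    total = sum(a4)

    buckets: List[List[Row]] = []
    current: List[Row] = []
    running = 0.0
    for row, s in zip(rows, a4):
        current.append(row)
        running += s
        # Greedy equi-sum rule (an assumption of this sketch): close the bucket
        # once the cumulative a4 reaches the next equal share of the total.
        next_share = total * (len(buckets) + 1) / num_buckets
        if len(buckets) < num_buckets - 1 and running >= next_share:
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return buckets


# Hypothetical data distribution T(value, frequency)
T = [(1, 3), (2, 1), (3, 2), (4, 2), (5, 4)]
equi_depth = generalized_histogram(
    T, sort_param=lambda r: r[0], source_param=lambda r: r[1], num_buckets=3
)
print(equi_depth)
```

Swapping in a different `sort_param`, `source_param`, or partitioning rule is what yields the other histogram types in the taxonomy.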

Example: Equi-Width Histograms
- a3 = value
- all histograms are possible
- a4 = value
- constraint: every bucket should contain the same number of data values
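
As a concrete illustration, the sketch below builds an equi-width histogram in the usual sense of equal-width value ranges. The function name `equi_width_histogram` and the sample data are hypothetical, and the sketch assumes numeric values with at least two distinct values.

```python
from typing import List, Sequence, Tuple


def equi_width_histogram(
    values: Sequence[float], num_buckets: int
) -> Tuple[List[float], List[int]]:
    """Return (bucket edges, tuple counts) for buckets of equal value width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value falls on the last edge, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts


# Hypothetical attribute values, one per tuple
data = [1, 1, 2, 3, 3, 3, 7, 8, 9, 9]
edges, counts = equi_width_histogram(data, num_buckets=4)
print(edges, counts)
```

Only the bucket edges and one count per bucket need to be stored, which is why this variant is cheap; the price is poor accuracy when the data are skewed.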

Example: End-Biased Equi-Depth Histograms
- a3 = value
- all but one bucket must be singletons
- a4 = frequency
- constraint: all buckets should have the same total frequency count

Taxonomy Dimensions
- partition classes: serial, end-biased
- a3: value (V), frequency (F), area (A)
- a4: spread (S), frequency (F), cumulative frequency (C), area (A)
- constraints: equi-sum, v-optimal, max-diff, compressed, spline-based

Constraints
- Equi-sum: each bucket should have the same sum of a4.
- V-Optimal: divide the buckets so that the variance of the overall frequency approximation is minimized.
- Spline-based: the cumulative frequency satisfies a piecewise-linear approximation.
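
For buckets made of contiguous values, the V-optimal constraint can be met with a standard dynamic program that minimizes the total within-bucket sum of squared errors. The sketch below is an illustrative O(n²·B) version under that assumption; the function name `v_optimal` and the sample frequencies are hypothetical, and it is not claimed to be the construction used in the cited papers.

```python
from typing import List, Sequence, Tuple


def v_optimal(freqs: Sequence[float], num_buckets: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (minimum total SSE, bucket ranges as half-open index pairs)."""
    n = len(freqs)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and len(freqs)")

    # Prefix sums let us compute the SSE of any contiguous bucket in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for f in freqs:
        prefix.append(prefix[-1] + f)
        prefix_sq.append(prefix_sq[-1] + f * f)

    def sse(i: int, j: int) -> float:
        """Squared error of approximating freqs[i:j] by its average."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimum error covering the first j values with b buckets
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):  # the last bucket is freqs[i:j]
                cost = best[b - 1][i] + sse(i, j)
                if cost < best[b][j]:
                    best[b][j] = cost
                    cut[b][j] = i

    # Walk the recorded cut points backwards to recover the buckets.
    bounds: List[Tuple[int, int]] = []
    j = n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return best[num_buckets][n], bounds


# Hypothetical frequencies, listed in value order
error, bounds = v_optimal([1, 1, 1, 10, 10, 2, 2], num_buckets=3)
print(error, bounds)
```

The resulting buckets group values with similar frequencies, which is exactly what keeps the per-bucket averaging error small.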

Constraints (cont.)
- Max-diff: bucket boundaries are placed at the B-1 largest differences between adjacent a4 values.
- Compressed (comp.): the n entries with the highest a4 values are stored exactly; the others are stored using equi-sum.
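
The Max-diff rule is simple enough to sketch directly: rank the gaps between adjacent source-parameter values and cut at the B-1 largest ones. The function name `max_diff_buckets` and the sample frequencies are hypothetical, and ties are broken by position, which is an arbitrary choice made for this sketch.

```python
from typing import List, Sequence, Tuple


def max_diff_buckets(source: Sequence[float], num_buckets: int) -> List[Tuple[int, int]]:
    """Return half-open index ranges, cutting at the largest adjacent differences."""
    # (difference, position) for every pair of neighbours in sort order
    diffs = [(abs(source[i + 1] - source[i]), i) for i in range(len(source) - 1)]
    # Take the B-1 largest differences; earlier positions win ties.
    top = sorted(diffs, key=lambda d: (-d[0], d[1]))[: num_buckets - 1]
    cuts = sorted(i + 1 for _, i in top)

    bounds: List[Tuple[int, int]] = []
    start = 0
    for c in cuts + [len(source)]:
        bounds.append((start, c))
        start = c
    return bounds


# Hypothetical frequencies in value order, i.e. a Max-Diff(V, F) setting
print(max_diff_buckets([1, 1, 1, 10, 10, 2, 2], num_buckets=3))
```

Unlike the V-optimal constraint, this needs only one sort of the differences, so construction is cheap.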

Taxonomy

Equi-Width Histograms
- discussed in Kooi's thesis (1980) [3]
- denoted by "Equi-Sum(V, S)" in the taxonomy
- mergeable buckets must have contiguous values
- the merge criterion is about the spread, based on equi-sum

Equi-Depth Histograms
- proposed by Piatetsky-Shapiro and Connell in SIGMOD'84 [4]
- denoted by "Equi-Sum(V, F)" in the taxonomy
- mergeable buckets must have contiguous values
- the merge criterion is about the frequencies, based on equi-sum

V-Optimal(F, F) Histograms
- proposed by Ioannidis and Christodoulakis in 1993 [5]
- mergeable buckets must have contiguous frequencies
- the merge criterion is to minimize the sum-squared error on frequencies within a bucket

V-Optimal(V, F) Histograms
- proposed by Poosala et al. in 1996 [2]
- mergeable buckets must have contiguous values
- the merge criterion is to minimize the sum-squared error on frequencies

Max-Diff(V, F) Histograms
- proposed by Poosala et al. in 1996 [2]
- mergeable buckets must have contiguous values
- the merge criterion is to place bucket boundaries between the adjacent values whose frequencies differ the most

Compressed(V, F) Histograms
- proposed by Poosala et al. in 1996 [2]
- mergeable buckets must have contiguous values
- the merge criterion is equi-depth, except for the n most frequent values

Summary
[Figure not preserved. Panels: Data Distribution, equi-width, equi-depth, V-optimal(F, F), V-optimal(V, F), Max-Diff(V, F), Compressed(V, F)]

Estimation Example
- actual value for 4: 2
- equi-width estimate: ≈ 4
- equi-depth estimate: ≈ 2.8
- V-optimal(F, F) estimate: ≈ 9(0) + 4(1) + 1(0) = 4
- V-optimal(V, F) estimate: ≈ 2.4(1) + 9(0) + 2.7(0) = 2.4
- Max-Diff(V, F) estimate: ≈ 2.4
- Compressed(V, F) estimate: ≈ 2.2
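
The arithmetic in this example follows one pattern: a value's frequency is estimated by the average frequency of the bucket it falls into. The sketch below shows that pattern on a small made-up distribution, because the data behind the slide's figure is not preserved. The names `make_buckets`, `estimate_point`, and `estimate_range` are hypothetical, and each bucket stores its values explicitly for clarity, whereas a real histogram would typically store only bucket boundaries.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Bucket = Tuple[FrozenSet[int], float]  # (values in the bucket, average frequency)


def make_buckets(freq: Dict[int, int], groups: Sequence[Sequence[int]]) -> List[Bucket]:
    """Summarize each group of values by its average frequency."""
    return [(frozenset(g), sum(freq[v] for v in g) / len(g)) for g in groups]


def estimate_point(buckets: List[Bucket], value: int) -> float:
    """Estimated frequency of one value: the average of the bucket that holds it."""
    for values, avg in buckets:
        if value in values:
            return avg
    return 0.0  # the value is not covered by any bucket


def estimate_range(buckets: List[Bucket], lo: int, hi: int) -> float:
    """Estimated result size of lo <= value <= hi: sum the per-value estimates."""
    return sum(avg * sum(1 for v in values if lo <= v <= hi) for values, avg in buckets)


# Made-up distribution and bucketing, not the one from the slide
freq = {1: 9, 2: 5, 3: 3, 4: 2, 5: 1}
buckets = make_buckets(freq, groups=[[1], [2, 3], [4, 5]])
print(estimate_point(buckets, 4))     # bucket {4, 5} has average (2 + 1) / 2
print(estimate_range(buckets, 2, 4))  # values 2 and 3 from one bucket, 4 from the next
```

The estimate is exact whenever all values in a bucket share the same frequency, which is why the choice of bucket boundaries drives accuracy.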

Experimental Results
- 100000 tuples
- 200 attribute values
- 2000 samples for construction

Experimental Results
- cusp_max value distribution
- random value & freq. relation
- frequencies fit to Zipf distribution

Other Experiment Parameters
- Skew of frequency
- Skew of data value distribution
- Sample size in construction
- Accuracy vs. Storage
- Data Distributions (freq., values, correlations)
- Queries

Conclusions
- Histograms are useful in estimating the selectivity of a query.
- Different techniques exist for approximating the data with a histogram.
- V-optimal or MaxDiff histograms can have good accuracy in the 1-D case.

Future Work
- The methods presented can't solve the n-D histogram problem completely.
- Try to apply the SF-Tree to store and retrieve the buckets of a multi-dimensional histogram efficiently.

References
[1] H.V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik, Torsten Suel, Optimal Histograms with Quality Guarantees, VLDB'98
[2] Viswanath Poosala, Yannis Ioannidis, Peter Haas, Eugene Shekita, Improved Histograms for Selectivity Estimation of Range Predicates, SIGMOD'96
[3] R. P. Kooi, The Optimization of Queries in Relational Databases, PhD Thesis, Case Western Reserve University, 1980
[4] M. Muralikrishna and D. DeWitt, Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries, SIGMOD'88