DATABASE HISTOGRAMS E0 261 Jayant Haritsa Computer Science and Automation Indian Institute of Science

MOTIVATION System R's assumption that values are uniformly distributed over the data domain rarely holds in practice. With a wrong input distribution, the optimizer's estimates are Garbage In, Garbage Out.

Solutions
- Model the data with a "classical" distribution (e.g. Gaussian, Exponential, Zipf). Problem: real-life data often does not follow these distributions either, and not all of them are easy to compute with.
- Approximate the distribution using histograms. Several flavors are available, but maintenance can be a pain.
- Use sampling for dynamic estimation. Space-efficient, but expensive since it is done at run-time.

HISTOGRAMS The domain of attribute A is partitioned into B buckets, and a uniform distribution is assumed within each bucket. That is, the frequency of a value in a bucket is approximated by the average of the frequencies of all domain values assigned to that bucket. Trivial histogram: a single bucket that assigns the same frequency (N / V) to all attribute values, where N is the number of tuples and V the number of distinct values; this is equivalent to the System R approach.
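As a minimal sketch (the function name and numbers are illustrative, not from the slides), the trivial histogram's estimate is just N / V:

```python
def trivial_estimate(n_tuples, n_distinct):
    """Trivial single-bucket histogram: every attribute value is assigned the
    same frequency N / V (the System R uniform-distribution assumption)."""
    return n_tuples / n_distinct

# 10,000 tuples over 50 distinct ages: every age is estimated at 200 tuples,
# no matter how skewed the real distribution is
est = trivial_estimate(10_000, 50)
```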

EQUI-WIDTH HISTOGRAMS [Figure: frequency versus ordered domain values (Age); all buckets span the same width of the domain.]
All buckets have the same width. MaxErr[Sel<X] = FreqFraction[X_PB], where PB is the partial bucket: the last bucket, the one inside which X falls. Easy to construct and maintain. Example: CRICKET!
If, instead of linear interpolation, we always estimated any value inside the partial bucket at its middle value, i.e. at half the bucket's frequency, the error would be bounded by 0.5 * FreqFraction[X_PB].
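A minimal sketch of equi-width construction and the interpolated Sel<X estimate (the data and function names here are illustrative, not from the slides):

```python
def build_equi_width(values, lo, hi, b):
    """Build an equi-width histogram: b buckets of equal width over [lo, hi]."""
    width = (hi - lo) / b
    counts = [0] * b
    for v in values:
        i = min(int((v - lo) / width), b - 1)  # clamp the top edge into the last bucket
        counts[i] += 1
    return counts, width, lo

def estimate_less_than(hist, x):
    """Estimate the selectivity of 'A < x', interpolating linearly inside the
    partial bucket (the bucket that x falls into)."""
    counts, width, lo = hist
    n = sum(counts)
    i = int((x - lo) / width)
    full = sum(counts[:i])                     # buckets entirely below x
    frac = ((x - lo) - i * width) / width      # covered fraction of the partial bucket
    partial = counts[i] * frac if 0 <= i < len(counts) else 0
    return (full + partial) / n

ages = [23, 25, 25, 31, 34, 34, 34, 45, 52, 67]
hist = build_equi_width(ages, 20, 70, 5)       # buckets [20,30), [30,40), ...
est = estimate_less_than(hist, 33)             # true answer is 4/10 = 0.4
```

The estimation error comes entirely from the partial bucket, which is exactly the FreqFraction[X_PB] bound on the slide.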

EQUI-DEPTH HISTOGRAMS [Figure: frequency versus ordered domain values (Age); all buckets hold the same number of tuples, ≈ N/B each.]
All buckets have the same height (or depth), i.e. each contains roughly N/B tuples. MaxErr[Sel<X] = 1/B.
Comparatively difficult to construct and maintain: the relation must first be sorted to find the B "quantiles", or an index used if one is available. A cheaper approximate technique based on sampling is typically used instead.
If, instead of linear interpolation, we always estimated any value inside the partial bucket at its middle value, i.e. at half the bucket's frequency, the error would be bounded by 0.5 / B.
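A corresponding sketch of equi-depth construction, by sorting and cutting at the B quantile positions (illustrative; a real system would use the sampling shortcut described next rather than a full sort):

```python
def build_equi_depth(values, b):
    """Sort the values and cut them into b buckets of (near-)equal tuple count.
    Each bucket is stored as (low_boundary, high_boundary, count)."""
    s = sorted(values)
    n = len(s)
    buckets = []
    for i in range(b):
        lo_idx = i * n // b            # quantile cut points
        hi_idx = (i + 1) * n // b
        if lo_idx < hi_idx:
            buckets.append((s[lo_idx], s[hi_idx - 1], hi_idx - lo_idx))
    return buckets

ages = [23, 25, 25, 31, 34, 34, 34, 45, 52, 67]
buckets = build_equi_depth(ages, 5)    # each bucket holds N/B = 2 tuples
```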

Sampling Technique Take a random sample of the database, sort the sampled tuples, and use the boundaries they establish to form the approximate buckets. How many samples to take? Based on the Kolmogorov statistic.

Kolmogorov Statistic Let f be the fraction of tuples in the database that satisfy a property (in our case, the property is falling within the query box). Let f' be the fraction of tuples in the sample that lie in the same box. The Kolmogorov statistic guarantees that |f − f'| ≤ d (precision) with probability ≥ p (confidence) if the sample size is ≥ n. Given p and d, n can be evaluated. For example, if d = 0.05 and p = 0.99, then n = 1064. Note that n does not depend on the number of tuples in the database!
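As a sketch, the required sample size follows from n ≥ (K_p / d)², where K_p is the Kolmogorov critical value for confidence level p. The constants below are standard Kolmogorov-Smirnov table values and should be treated as assumptions; rounding explains the small gap from the slide's 1064:

```python
import math

# Kolmogorov critical values K_p for common confidence levels
# (standard K-S table values; treat these constants as assumptions).
K_P = {0.90: 1.22, 0.95: 1.36, 0.99: 1.63}

def kolmogorov_sample_size(d, p):
    """Smallest sample size n guaranteeing |f - f'| <= d with probability >= p."""
    return math.ceil((K_P[p] / d) ** 2)

n = kolmogorov_sample_size(0.05, 0.99)   # about 1063, matching the slide up to rounding
```

Note that d and p are the only inputs: the database size never appears, which is exactly the slide's point.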

SERIAL HISTOGRAMS [Figure: value set of the Age attribute, with domain values plotted against ordered frequencies.]
Domain values are grouped by ordered frequencies, not by domain order! The frequency assigned to the domain values in a bucket is either the average frequency or the middle frequency of the bucket, depending on the information stored.
The optimal histogram is a serial histogram (paper covered in the TIDS (E0 361) course); in particular, the serial histogram that minimizes Σ n_i V_i, where n_i is the number of attribute values placed in bucket B_i and V_i is the variance of their frequencies.
Difficult to construct and maintain:
- Queries are usually on the value domain, not the frequency domain.
- All domain values must be represented explicitly; ranges are not possible.
- Identification of the optimal histogram is exponential in the number of buckets.
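A minimal sketch of the Σ n_i V_i cost metric, comparing two candidate serial partitions of the same frequency set (the function and data are illustrative, not from the paper):

```python
from statistics import pvariance

def serial_cost(freqs_sorted, bucket_sizes):
    """Compute sum_i n_i * V_i for a serial histogram: freqs_sorted is the
    list of per-value frequencies in sorted (frequency) order, and
    bucket_sizes says how many consecutive frequencies go into each bucket."""
    cost, start = 0.0, 0
    for n_i in bucket_sizes:
        bucket = freqs_sorted[start:start + n_i]
        cost += n_i * pvariance(bucket)   # V_i = population variance of the bucket
        start += n_i
    return cost

freqs = [10, 40, 60, 70, 100]            # five frequencies, already sorted
c1 = serial_cost(freqs, [2, 3])          # buckets {10,40} | {60,70,100}
c2 = serial_cost(freqs, [3, 2])          # buckets {10,40,60} | {70,100}
# the partition with the lower cost is the better serial histogram
```

Finding the globally optimal partition requires searching over all such groupings, which is where the exponential cost mentioned above comes from.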

END-BIASED HISTOGRAMS [Figure: value set of the Age attribute with its frequencies.]
A special case of serial histograms: the highest and lowest sets of frequencies are maintained exactly in separate individual buckets, while the remaining (middle) frequencies are approximated together in a single bucket. The optimal end-biased histogram minimizes Σ n_i V_i, and its performance is "close" to that of the optimal serial histogram.
Easy to construct and maintain:
- Attribute values in the "middle" bucket need not be stored explicitly: they can be identified by negation.
- Identifying the optimal histogram takes time linear in the number of buckets.
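A minimal sketch of end-biased construction (the value/frequency choices below are illustrative, loosely mirroring the slide's frequencies 5, 10, 60, 70, 75):

```python
from collections import Counter

def end_biased(values, k_high, k_low):
    """End-biased histogram sketch: store the k_high highest-frequency and
    k_low lowest-frequency values exactly; summarize all middle values by a
    single average frequency."""
    by_freq = sorted(Counter(values).items(), key=lambda kv: kv[1])
    low = dict(by_freq[:k_low])                            # exact low-end buckets
    high = dict(by_freq[len(by_freq) - k_high:])           # exact high-end buckets
    middle = [f for _, f in by_freq[k_low:len(by_freq) - k_high]]
    avg = sum(middle) / len(middle) if middle else 0.0
    return low, high, avg   # any value not in low/high is estimated at avg

data = [10] * 5 + [20] * 10 + [30] * 60 + [40] * 70 + [50] * 75
low, high, avg = end_biased(data, k_high=2, k_low=1)
```

Because the middle bucket is identified by negation (any value not found in the exact buckets), only the low and high dictionaries need to be stored.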

Serial Histogram Construction Data collection: select A, count(*) from R group by A — this requires sorting. When building high-end-biased serial histograms, approximate the counts by sampling instead.

In Practice Histograms are used in most commercial systems. Approximate equi-depth histograms are built on individual attributes, with about 10-20 buckets, using a one-pass sampling approach (usually 1000-2000 samples). This results in large errors for multi-attribute queries, but a technique for efficiently constructing and maintaining multi-dimensional histograms was proposed in SIGMOD 99 (covered in the TIDS course).

END DATABASE HISTOGRAMS