Histograms for Selectivity Estimation, Part II: Global Optimization of Histograms
Speaker: Ho Wai Shing

Contents
- Supplements to the previous talk
- Introduction to histograms for multi-dimensional data
- Global optimization of histograms
- Experiment results
- Conclusion

Summary
- Histograms approximate the frequency distribution of an attribute (or a set of attributes):
  - group attribute values into "buckets"
  - approximate the actual frequencies by the statistical information stored in each bucket
- A taxonomy of 1-D histograms was discussed
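As a minimal illustration of the bucketing idea (a Python sketch of my own, not code from the talk), an equi-width histogram can be built in one pass:

```python
def equi_width_histogram(values, num_buckets):
    """Group numeric values into equal-width buckets; each bucket
    stores (low, high, count), the statistics kept per bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0   # guard against all-equal values
    counts = [0] * num_buckets
    for v in values:
        # clamp the maximum value into the last bucket
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_buckets)]

# e.g. equi_width_histogram([1, 1, 2, 5, 9, 9, 9], 3)
# -> [(1.0, 3.67, 3), (3.67, 6.33, 1), (6.33, 9.0, 3)] (bounds rounded)
```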

Summary of the 1-D Histogram Taxonomy

(Shown as a diagram in the talk.) The taxonomy covers, among others, the following histograms of the data distribution:
- equi-width
- equi-depth
- V-optimal(F, F)
- V-optimal(V, F)
- Max-Diff(V, F)
- Compressed(V, F)

Estimation Procedures
- Find the histogram buckets that overlap the query range
- Estimate the counts from the overlapped portion of the query and the buckets, under one of:
  - continuous values assumption
  - point value assumption
  - uniform spread assumption

Uniform Frequency Assumptions
- continuous values assumption
- point value assumption
- uniform spread assumption
(The slide illustrated each assumption with a small diagram relating bucket area to frequency.)
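To make the three assumptions concrete, here is a hedged Python sketch (the bucket layout `(lo, hi, total, distinct)` and all names are mine, and boundary handling is simplified) of estimating one bucket's contribution to a range query a < X ≤ b:

```python
def estimate_in_bucket(bucket, a, b, assumption):
    """Estimate how many tuples in this bucket satisfy a < X <= b.
    bucket = (lo, hi, total, distinct): value range, total frequency,
    and number of distinct values stored for the bucket."""
    lo, hi, total, distinct = bucket
    left, right = max(a, lo), min(b, hi)   # overlap of query and bucket
    if right < left:
        return 0.0
    if assumption == "continuous":
        # all values in [lo, hi] assumed present; frequency spread evenly
        return total if hi == lo else total * (right - left) / (hi - lo)
    if assumption == "point":
        # bucket frequency assumed concentrated on a single value (lo here)
        return total if left <= lo <= right else 0.0
    if assumption == "uniform_spread":
        # `distinct` values at equal spreads, each with frequency total/distinct
        spread = (hi - lo) / max(distinct - 1, 1)
        hits = sum(1 for i in range(distinct)
                   if left <= lo + i * spread <= right)
        return hits * total / distinct
    raise ValueError(f"unknown assumption: {assumption}")
```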

Uniform Frequency Assumptions: Experiments
- Results show that the uniform spread assumption gives the best estimates
- Surprisingly, the uniform spread assumption also beats the continuous values assumption on "non-uniform spread" data
  - e.g., if the spreads are large, many range queries fall between actual values and should return 0, yet the continuous values assumption always predicts a non-zero count inside a bucket

Experiments
Datasets, given as sets of (value, frequency) pairs:
- distribution of data values: uniform, zipf_incr, zipf_dec, etc.
- frequencies: Zipf, with different skew factors
- different value-frequency correlations: positive, negative, random

Experiments: the data value distributions (shown as plots in the talk).

Experiments
Queries have the form a < X ≤ b, of 5 types:
1. a = -1, b in the range of values
2. a = -1, b = one of the values appearing in the data
3. random a and b, s.t. selectivity in [0, 0.2]
4. random a and b, s.t. selectivity in [0.8, 1]
5. random a and b, s.t. selectivity in [0, 0.2] ∪ [0.8, 1]

Experiments
Histograms:
- all histograms described in the taxonomy
- the histograms are of the same size in bytes (not the same number of buckets)
- built from 2000 samples (except trivial and equi-depth, which are built precisely, and P²)
- built by scanning the data once

Experiments (results plot): cusp_max value distribution, random value-frequency correlation, z = 1. A sort parameter of V usually means better accuracy.

Experiments (results plot): error relative to the V-optimal(V, A) histogram; increasing the sample size gives better results.

End of Part I

Part II: Global Optimization of Histograms (GOH)

Histograms for n-D Data
- The histograms discussed previously are on a single attribute (1-D data)
- Two main approaches for n-D data:
  - use n 1-D histograms
  - use an n-D histogram

Histograms for n-D Data: n 1-D histograms
- need the "attribute value independence assumption"
- can use all 1-D histogram techniques
- already give quite good accuracy
- representative: GOH [Jagadish et al. SIGMOD'01]

Histograms for n-D Data: n-D histograms
- do not need the attribute value independence assumption ("AVIA"), which is usually not true in practice
- but a "good" partition of the n-D space into buckets is difficult to compute, store, and maintain
- representatives: MHIST [Poosala & Ioannidis VLDB'97], H-Tree [Muralikrishna & DeWitt SIGMOD'88]

Global Optimization of Histograms (GOH)
- given a space budget of B buckets, find an optimal assignment of buckets among the dimensions that minimizes the total error
- i.e., give more buckets to attributes that are queried frequently or have skewed distributions
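In symbols (my notation, following the definitions on the later slides): with M attributes and error(x, k) the error of the best x-bucket histogram on attribute k, GOH solves

$$\min_{b_1,\dots,b_M} \sum_{k=1}^{M} \mathrm{error}(b_k, k) \quad \text{subject to} \quad \sum_{k=1}^{M} b_k = B,\; b_k \ge 1.$$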

GOH: Example
e.g., if A1 has a nearly flat frequency distribution while A2 is heavily skewed, and we have 4 buckets, an A1:1, A2:3 assignment is better than an A1:2, A2:2 assignment.

Computing GOH
- Exhaustive Algorithm (GOHEA)
- Based on Dynamic Programming (GOHDP)
- Greedy Approach (GOHGA)
- Greedy Approach with Remedy (GOHGR)

GOHEA
- for every possible bucket assignment, calculate the error metric and take the minimum
- clearly too inefficient

GOHDP
Define:
- E(b, k) = the minimum error of using b buckets to store the first k histograms
- error(b, k) = the error of using b buckets to store the k-th histogram
Observation: E(b, k) = min over x of ( E(b - x, k - 1) + error(x, k) )
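A minimal Python sketch of this dynamic program (names and indexing conventions are mine, not code from the talk; `error[k][x]` is assumed precomputed as the error of the best x-bucket histogram on attribute k, both indices 1-based):

```python
def gohdp(error, B, M):
    """Minimum total error of splitting B buckets among M attributes.
    error[k][x]: error of the best x-bucket histogram for attribute k,
    1 <= k <= M, 1 <= x <= B (index 0 unused)."""
    INF = float("inf")
    # E[b][k]: minimum error of storing the first k histograms in b buckets
    E = [[INF] * (M + 1) for _ in range(B + 1)]
    E[0][0] = 0.0
    for k in range(1, M + 1):
        for b in range(k, B + 1):            # each attribute needs >= 1 bucket
            # give x buckets to attribute k, b - x to the first k - 1
            E[b][k] = min(E[b - x][k - 1] + error[k][x]
                          for x in range(1, b - k + 2))
    return E[B][M]
```

Filling the O(BM) entries of E at O(B) candidates each matches the O(B²M) table-filling cost quoted on the next slide.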

GOHDP
- calculate all error(b, k): O(N²BM)
  - another algorithm based on DP can calculate all error(b, k) for a given k in O(N²B) [Jagadish et al. VLDB'98]
- fill in the E(b, k) table: O(B²M)
  - O(BM) entries to be filled, and O(B) computations for each entry

GOHDP
- note that if we knew the allocation beforehand, constructing the histograms would take only O(N²B)
- so GOHDP is still inefficient if M is large

GOHGA
- the greedy approach is O(N²B), i.e., nearly no penalty compared with direct construction (using the same number of buckets on all attributes)
- define the marginal gain m_k(i, j) = error(i, k) - error(j, k), i.e., the reduction in error if we use j buckets instead of i buckets

GOHGA
1. assign 1 bucket to each dimension
2. allocate the remaining buckets one by one, each to the dimension with the greatest marginal gain from the new bucket
3. repeat until all buckets are assigned
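A sketch of that loop in Python (same hypothetical `error[k][x]` oracle as above, assumed to cover x up to B):

```python
def gohga(error, B, M):
    """Greedy bucket allocation: start with one bucket per attribute, then
    repeatedly give the next bucket to the attribute whose marginal gain
    m_k(i, i+1) = error(i, k) - error(i+1, k) is largest."""
    alloc = [0] + [1] * M                    # alloc[k]: buckets for attribute k
    for _ in range(B - M):                   # hand out the remaining buckets
        gain, best_k = max(
            (error[k][alloc[k]] - error[k][alloc[k] + 1], k)
            for k in range(1, M + 1))
        alloc[best_k] += 1
    return alloc[1:]
```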

GOHGA
- O(B) steps, O(N²) per marginal-gain calculation, since error(b, k) can be computed incrementally from b = 1 in O(N²b)
- overall O(N²B)
- but sometimes GOHGA does not return the optimum

GOHGR
- greedy: looks ahead 1 step to see which allocation has the greatest marginal gain
- remedy: look ahead 2 steps to see if a better allocation can be found
- e.g., with m_1(3,4) = 30, m_1(4,5) = 130, m_2(3,4) = 40, m_2(4,5) = 40: one-step greedy gives the next bucket to dimension 2 (40 > 30), but over the next two buckets dimension 1 gains 30 + 130 = 160 while dimension 2 gains only 40 + 40 = 80
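A toy check of that example in Python (the gains are taken from the slide):

```python
# one-step marginal gains m_k(i, j) from the slide's example
m = {1: {(3, 4): 30, (4, 5): 130},
     2: {(3, 4): 40, (4, 5): 40}}

one_step = {k: g[(3, 4)] for k, g in m.items()}
two_step = {k: g[(3, 4)] + g[(4, 5)] for k, g in m.items()}
print(one_step)  # {1: 30, 2: 40}  -> plain greedy picks dimension 2
print(two_step)  # {1: 160, 2: 80} -> the remedy prefers dimension 1
```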

Experiments
Aims:
- to show that GOH really achieves smaller errors by allocating more buckets to more skewed data
- to show that GOH is efficient to compute

Experiments
Types:
- absolute/relative errors of different attributes
- absolute/relative errors for different bucket budgets
- absolute/relative errors for different distribution skews
- running time

Experiments
Dataset 1 (synthetic data):
- 5 attributes, 500 values per attribute
- frequencies follow a Zipf distribution, with random association between frequency and value
- the 5 attributes have skew z = 0, 0.01, 0.1, 1, …
- queries: X ≤ a

Experiments (results figure).

Experiments
- Dataset 2 (TPC-D data with skew), a more realistic dataset: gives results similar to the synthetic data
- Dataset 3 (2 attributes): evaluates the gain of GOH due to the skew difference between the attributes

Experiments (skew settings for the 2-attribute dataset):
- TG3: z = 0, 2
- TG4: z = 0.02, 1.8
- TG5: z = 1.8, 2

Experiments (results figure).

Conclusions
- GOH has smaller errors because it assigns more buckets to skewed or frequently used distributions
- nearly no time penalty for building GOH when using GOHGR

Future Work
- the methods presented can't solve the n-D histogram problem completely
- try to apply the SF-Tree to store and retrieve the buckets of multi-dimensional histograms efficiently

References
- [Jagadish et al. VLDB'98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik, Torsten Suel. Optimal Histograms with Quality Guarantees. VLDB 1998.
- [Poosala et al. SIGMOD'96] Viswanath Poosala, Yannis Ioannidis, Peter Haas, Eugene Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD 1996.
- [Poosala & Ioannidis VLDB'97] Viswanath Poosala, Yannis Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. VLDB 1997.
- [Muralikrishna & DeWitt SIGMOD'88] M. Muralikrishna, D. DeWitt. Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. SIGMOD 1988.