Fast Algorithms For Hierarchical Range Histogram Constructions

Fast Algorithms For Hierarchical Range Histogram Constructions
Authors: Sudipto Guha, Nick Koudas, Divesh Srivastava. ACM PODS 2002

Layout
- Introduction
- Related Works
- Problem Definition
- Problem Solution
  - A Sparse Interval Set System
  - The Dynamic Programming Algorithm
- Experimental Evaluation
- Conclusions

Introduction
- Data warehousing and OLAP applications (OLAP: online analytical processing)
- Data has multiple logical dimensions with natural hierarchies defined on them
- OLAP queries usually involve hierarchical selections on some of the dimensions and often aggregate measure attributes

Introduction – Cont.

Histograms
- Defined over a numeric attribute value domain
- Space-efficient
- Conditions on a given dimension correspond to hierarchical ranges
- Range estimation depends on a good solution to the histogram construction problem

The Main Idea
- Proposes fast, practical algorithms for the problem of constructing hierarchical range histograms

The Main Contributions
- A novel notion of sparse intervals
- A proposed algorithm that effectively trades space for construction time without compromising accuracy
- The first practical approach to the problem

Previous Works
- V-Optimal histograms: minimize the error for equality queries. But they are constructed by taking only equality queries into account.
- Koudas et al.: a polynomial-time algorithm for special and general cases. But the polynomial degree is high.
- Gilbert et al.: a pseudo-polynomial-time algorithm, optimal for arbitrary ranges. But the polynomial degree is high.
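For context on the V-Optimal baseline above, here is a compact sketch of the classical V-optimal dynamic program for equality-query error; this is the standard textbook formulation, not this paper's FH algorithm, and the function names are ours.

```python
# Classical V-optimal histogram DP (equality-query squared error).
# Included only as context for the baseline above; NOT the paper's FH algorithm.
def v_optimal(A, B):
    """Return bucket boundaries [0, b1, ..., n] minimizing the total squared error."""
    n = len(A)
    P = [0.0] * (n + 1)   # prefix sums of A
    Q = [0.0] * (n + 1)   # prefix sums of A^2
    for i, x in enumerate(A, 1):
        P[i], Q[i] = P[i - 1] + x, Q[i - 1] + x * x

    def sse(a, b):        # squared error of one bucket covering A[a..b] (1-based)
        s, q, m = P[b] - P[a - 1], Q[b] - Q[a - 1], b - a + 1
        return q - s * s / m

    INF = float("inf")
    opt = [[INF] * (B + 1) for _ in range(n + 1)]   # opt[i][k]: best error for A[1..i], k buckets
    cut = [[0] * (B + 1) for _ in range(n + 1)]
    opt[0][0] = 0.0
    for k in range(1, B + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cand = opt[j][k - 1] + sse(j + 1, i)
                if cand < opt[i][k]:
                    opt[i][k], cut[i][k] = cand, j
    # Recover the boundaries by walking the cut table backwards.
    bounds, i = [n], n
    for k in range(B, 0, -1):
        i = cut[i][k]
        bounds.append(i)
    return bounds[::-1]

print(v_optimal([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0], B=3))
```

The triple loop makes this O(n^2 B) time; it optimizes only equality-query error, which is the limitation the slides point out.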

Problem Definition
- An array A[1,n] of non-negative real numbers
- AVG(a,b) denotes the average of the items A[a],…,A[b]
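For reference, the two quantities the rest of the deck relies on, written out in standard form:

```latex
\mathrm{SUM}(a,b) = \sum_{i=a}^{b} A[i],
\qquad
\mathrm{AVG}(a,b) = \frac{\mathrm{SUM}(a,b)}{b - a + 1}
```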

Histogram Definition
- A histogram of array A[1,n] using B buckets is specified by B+1 integers 0 = b_0 < b_1 < … < b_B = n
- Each interval (b_{i-1}, b_i] is a bucket
- Each b_i is a bucket boundary

Histogram Definition – Cont.
- Stored as the series of bucket boundaries together with the average of the array values in each bucket
- The bucket sum can be obtained from the average and the bucket width
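A minimal sketch of this bucket-boundary representation; the class and field names are ours, chosen for illustration only.

```python
# Illustrative bucket-boundary histogram storing per-bucket averages.
from bisect import bisect_left

class Histogram:
    def __init__(self, data, boundaries):
        # boundaries = [b0 = 0, b1, ..., bB = n]; bucket i covers indices b_{i-1}+1 .. b_i (1-based).
        self.boundaries = boundaries
        self.avg = []
        for i in range(1, len(boundaries)):
            lo, hi = boundaries[i - 1] + 1, boundaries[i]
            vals = data[lo - 1:hi]                    # data is a 0-based list holding A[1..n]
            self.avg.append(sum(vals) / len(vals))

    def bucket_of(self, i):
        """Index of the bucket containing array position i (1-based)."""
        return bisect_left(self.boundaries, i, lo=1) - 1

A = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]          # A[1..8]
H = Histogram(A, [0, 3, 6, 8])                        # 3 buckets: [1..3], [4..6], [7..8]
print(H.avg, H.bucket_of(5))
```

Storing only the boundaries and the per-bucket averages is enough: the bucket sum is the average times the bucket width.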

Histograms – Cont.
- Traditionally support mostly equality queries: “give me A[i]”
- Here the focus is on hierarchical range queries

Hierarchical Range Queries
- Definition: a range query q_{i,j} asks for the sum A[i] + … + A[j]
- A set S of range queries is hierarchical if, for any two queries q_{i,j} and q_{k,l} in S, the ranges [i,j] and [k,l] are either disjoint or one is contained in the other

Hierarchical Range Queries – Cont.
- Generalize equality queries
- Can be displayed as a tree
- Each node u has an associated range range(u)
- Node v is a child of node u iff range(v) ⊂ range(u) and there is no w such that range(v) ⊂ range(w) ⊂ range(u)

Workload Definition
- A workload W consists of:
  - a set S of hierarchical range queries
  - a probability for each query in S (this probability can be obtained by monitoring and logging)
- A simple probabilistic model

How the Histogram Works
- A histogram H of array A[1,n] and a query q_{i,j} with an expected answer
- Left bucket: the bucket (b_{s-1}, b_s] such that b_{s-1} < i ≤ b_s
- Right bucket: the bucket (b_{e-1}, b_e] such that b_{e-1} < j ≤ b_e
- Calculate the precise total of the values in the buckets strictly between the left and right buckets
- Estimate the sums for the portions of the query that fall within the left and right buckets

How the Histogram Works – Cont.
- The sum of A in the interval [i, b_s] is estimated by (b_s − i + 1) times the average value of the left bucket (uniformity assumption)
- The portion falling in the right bucket is estimated likewise

The Total Estimate
- Total estimate = left-bucket estimate + right-bucket estimate + exact sum of the buckets in between
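A minimal sketch of this estimate under the uniformity assumption, using the (b_{i-1}, b_i] bucket convention from the histogram definition above; the helper names are ours.

```python
# Illustrative range-sum estimation from bucket boundaries and per-bucket averages.
from bisect import bisect_left

def estimate_range_sum(boundaries, bucket_avg, i, j):
    """Estimate A[i] + ... + A[j] (1-based, i <= j) under the uniformity assumption."""
    def bucket_of(pos):
        return bisect_left(boundaries, pos, lo=1) - 1

    s, e = bucket_of(i), bucket_of(j)
    if s == e:                                             # query falls inside one bucket
        return (j - i + 1) * bucket_avg[s]
    left = (boundaries[s + 1] - i + 1) * bucket_avg[s]     # partial left bucket
    right = (j - boundaries[e]) * bucket_avg[e]            # partial right bucket
    middle = sum((boundaries[k + 1] - boundaries[k]) * bucket_avg[k]
                 for k in range(s + 1, e))                 # full buckets in between (width * avg = exact sum)
    return left + middle + right

boundaries = [0, 3, 6, 8]                  # buckets [1..3], [4..6], [7..8]
bucket_avg = [8 / 3, 5.0, 4.0]             # averages of A = [3,1,4,1,5,9,2,6]
print(estimate_range_sum(boundaries, bucket_avg, 2, 7))    # estimate of A[2]+...+A[7] (exact value is 22)
```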

Determining the Average
- Construct a prefix-sum array P with P[i] = A[1] + … + A[i] for all i
- Given a and b, return the average (P[b] − P[a−1]) / (b − a + 1) in constant time
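A small self-contained sketch of the prefix-sum trick (function names are ours):

```python
# Prefix sums give any range sum or range average in O(1) time after O(n) preprocessing.
from itertools import accumulate

def build_prefix_sums(A):
    """P[0] = 0 and P[i] = A[1] + ... + A[i] for a 1-based array stored in a 0-based list."""
    return [0] + list(accumulate(A))

def range_average(P, a, b):
    return (P[b] - P[a - 1]) / (b - a + 1)

A = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
P = build_prefix_sums(A)
print(range_average(P, 2, 7))      # (1+4+1+5+9+2) / 6
```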

Optimal Histogram Definition
- The error of a range query measures how far the histogram estimate is from the exact answer
- Given a histogram H and a workload W, the total expected error for estimating W is the probability-weighted sum of these errors over all queries in W

Optimal Histogram Definition – Cont.
- Given W, an optimal histogram with B buckets of array A[1,n] is a histogram with at most B buckets that has the minimum total expected error for estimating workload W among all histograms with at most B buckets
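A plausible way to write these two quantities, assuming the squared-error metric suggested by the "sum squared error" reported later in the experimental slides; the exact formulas here are our reconstruction, not a quotation of the paper.

```latex
\mathrm{err}_H(q_{i,j}) = \bigl(\mathrm{est}_H(i,j) - \mathrm{SUM}(i,j)\bigr)^{2},
\qquad
\mathrm{ERR}(H, W) = \sum_{q_{i,j} \in S} p_{i,j}\,\mathrm{err}_H(q_{i,j})
```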

Fast Histogram (FH) Construction for Hierarchical Range Queries
- Given an array A[1,n], B buckets, and a workload W
- Let E denote the total expected error of the optimal histogram
- Goal: find algorithms that construct HR histograms with error at most E, trading space for construction time

Layout
- Introduction
- Related Works
- Problem Definition
- Problem Solution
  - A Sparse Interval Set System
  - The Dynamic Programming Algorithm
- Experimental Evaluation
- Conclusions

FH Construction
- Construct a set of “sparse intervals”
- Increase the number of buckets so that any arbitrary interval can be represented
- Run a dynamic programming algorithm over the sparse intervals

A Sparse Interval System
- Given an integer r, set l = n^(1/r)
- Level 1 points: all the integers 1, …, n
- Level 2 points: the multiples of l
- Level j+1 points: the multiples of l^j
- Last, (r+1)-level points: the multiples of l^r

A Sparse Interval System – Cont.
- The interval [0,n] is in the sparse system S
- Any pair of level-j points lying between consecutive level-(j+1) points defines an interval in S

A Sparse Interval System – Example
- n = 8, r = 3, l = 2
- Level 1 points: 1 2 3 4 5 6 7 8
- Level 2 points: 2 4 6 8
- Level 3 points: 4 8
- Level 4 points: 8
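A small sketch that generates this system, under our reading of the construction (0 is treated as a point of every level, and "between" includes the bounding level-(j+1) points); the function names are ours.

```python
# Illustrative construction of the sparse interval system described above.
def level_points(n, l, j):
    """Level-j points: 0 plus every multiple of l**(j-1) up to n."""
    step = l ** (j - 1)
    return [0] + list(range(step, n + 1, step))

def sparse_intervals(n, r, l):
    S = {(0, n)}                                    # the full interval is always in S
    for j in range(1, r + 1):
        pj, pj1 = level_points(n, l, j), level_points(n, l, j + 1)
        for lo, hi in zip(pj1, pj1[1:]):            # consecutive level-(j+1) points
            inside = [p for p in pj if lo <= p <= hi]
            S.update((a, b) for i, a in enumerate(inside) for b in inside[i + 1:])
    return S

print(sorted(sparse_intervals(n=8, r=3, l=2)))      # intervals (a, b] of the n=8, r=3, l=2 example
```

With n = 8, r = 3, l = 2, the level_points helper reproduces the point sets listed in the example above.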

Sparse Interval System Properties
- Any interval over [0,n] can be written as a disjoint union of at most 2r intervals in the sparse system

Claim
- Any interval [0,x] can be expressed as a partition of at most r intervals from the sparse system

Claim Proof
- By induction on j
- Induction hypothesis: any interval [0,x] with x < l^j can be expressed as at most j intervals
- Base case: true for j = 1

Claim Proof – Cont.
- Step j+1: consider an interval [0,x] with x < l^(j+1)
- Write it as [0, t·l^j] and [t·l^j, x], where t is maximal with t·l^j ≤ x
- [0, t·l^j] is a valid interval in the sparse system (its endpoints are level-(j+1) points lying between the adjacent level-(j+2) points 0 and l^(j+1))

Claim Proof – Cont.
- [t·l^j, x] is essentially similar to an interval of the form [0, x′] with x′ < l^j, since t is maximal
- Therefore, by induction, it can be expressed by j intervals; in total, j+1 intervals
- Since x ≤ n, any interval [0,x] can be expressed as a union of at most r intervals

Observation
- Any interval [x,y] can be expressed as at most 2r sparse intervals
- Cut it at an interior point of the form t·l^j with maximum j
- By symmetry, each of the two resulting pieces can be expressed as a disjoint union of at most r sparse intervals
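A brute-force check of this 2r bound on the small example system, built with the same reading of the construction as the earlier sketch; this is verification code of our own, not the authors' decomposition routine.

```python
# Verify that every interval over [0, n] decomposes into at most 2r sparse intervals
# (checked exhaustively on the n = 8, r = 3, l = 2 example).
from collections import deque

def level_points(n, l, j):
    step = l ** (j - 1)
    return [0] + list(range(step, n + 1, step))

def sparse_intervals(n, r, l):
    S = {(0, n)}
    for j in range(1, r + 1):
        pj, pj1 = level_points(n, l, j), level_points(n, l, j + 1)
        for lo, hi in zip(pj1, pj1[1:]):
            inside = [p for p in pj if lo <= p <= hi]
            S.update((a, b) for i, a in enumerate(inside) for b in inside[i + 1:])
    return S

def min_cover(S, a, b):
    """Fewest sparse intervals whose disjoint union is (a, b], via BFS over endpoints."""
    dist, queue = {a: 0}, deque([a])
    while queue:
        x = queue.popleft()
        if x == b:
            return dist[x]
        for lo, hi in S:
            if lo == x and hi <= b and hi not in dist:
                dist[hi] = dist[x] + 1
                queue.append(hi)
    return None

n, r, l = 8, 3, 2
S = sparse_intervals(n, r, l)
worst = max(min_cover(S, a, b) for a in range(n) for b in range(a + 1, n + 1))
print(worst, "<=", 2 * r)
```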

Lemma
- In a sparse set system with parameter r, the number of intervals containing a given point is at most O(r·l²)

Lemma Proof
- Consider the level-1 intervals: there are l points between adjacent level-2 points, and pairs of these points can create only a bounded number of intervals containing a specific point
- Level-j intervals behave on the level-j points the same way level-1 intervals behave on the original points
- Extend to r levels (the (r+1)-th level adds one more interval)

Layout
- Introduction
- Related Works
- Problem Definition
- Problem Solution
  - A Sparse Interval Set System
  - The Dynamic Programming Algorithm
- Experimental Evaluation
- Conclusions

Hierarchy Representation by a Tree
- The ranges define a hierarchy based on the inclusion relationship
- T is a tree representing this hierarchy
- Each node v of T is associated with a range
- The weight of v is the probability of the corresponding query
- The error of v is the estimation error of that query under the histogram

Representation by a Binary Tree
- We allow tree nodes with weight 0
- If a node has more than two children, transform it into a node with two children: one original child and a new node with weight 0 that takes over the remaining children
- The size of the tree increases only by a factor of 2
- So we may assume that the tree is binary
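A minimal sketch of this binarization step; the node fields and names are ours.

```python
# Rewrite any node with more than two children so that the extra children hang
# under new zero-weight nodes; the resulting tree is binary.
from dataclasses import dataclass, field

@dataclass
class Node:
    weight: float
    children: list = field(default_factory=list)

def binarize(v: Node) -> Node:
    v.children = [binarize(c) for c in v.children]
    while len(v.children) > 2:
        # Pull the last two children under a fresh zero-weight node.
        rest = Node(weight=0.0, children=v.children[-2:])
        v.children = v.children[:-2] + [rest]
    return v

root = Node(1.0, [Node(0.2), Node(0.3), Node(0.1), Node(0.4)])
binarize(root)
print(len(root.children), root.children[1].weight)   # 2 children; the new node has weight 0
```

Each extra child introduces one zero-weight node, which is why the tree grows by at most about a factor of 2.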

Dynamic Programming Algorithm: FH
- Best(v, left, right, p) denotes the smallest error of the range of v, where
  - v is a tree node associated with a range
  - left is the bucket interval overlapping that range on the left
  - right is the bucket interval overlapping that range on the right
  - the range of v completely contains p bucket intervals
- Formally, left contains the left endpoint of the range of v and right contains its right endpoint

FH Stages
- Let the children of v be y and z, with their respective ranges
- Cases (a) + (b)

Cases (a) + (b)
- For all possible intervals I that contain the right endpoint of the range of y and the left endpoint of the range of z, compute the resulting cost
- The case in which I finishes exactly at the end of the range of z is handled separately

Cases (a)+(b)

FH Stages – Cont.
- Return the minimum over all choices of I
- When the interval I is fixed, the left and right intervals of the children are automatically defined and the bucket counts can be determined in O(1) time

Time Complexity
- The time spent evaluating cost(I) is O(p)
- The running time depends on the number of choices of the interval I
- Let C(S) be the maximum number of intervals in an interval system S that contain a particular element
- If all intervals are allowed, then C(S) = Θ(n²)

Time Complexity – Cont.
- The running time of FH is determined by the number of table entries maintained for each tree node v
- There are C(S)+1 choices for left (all intervals that contain the left endpoint of the range of v), and similarly for right
- Bounding these choices bounds the work for every tree node

Time Complexity – Cont.
- The total work, including preprocessing, follows from the per-node bound
- When S is the set of all possible intervals, the result matches the time complexity of the previous algorithm (for arbitrary intervals)

Time Complexity for a Sparse Set System
- S: a sparse set system with parameter r
- Run FH with 2r(B−1) buckets
- The error is less than or equal to that of the original B-bucket histogram
- A histogram with B buckets can be expressed as a histogram with 2r(B−1) buckets in the sparse system

Time Complexity – Cont.
- With a suitable parameter setting, FH constructs a solution with 2r(B−1) buckets whose error is at most the error of any solution of the original problem with B buckets

Some Notes
- Alternate tradeoffs can be obtained by constructing different sparse set systems
- Example: a complete binary tree on [0,n], allowing intervals in which one endpoint is an ancestor of the other
- Any arbitrary interval can be expressed as a disjoint union of two intervals from this sparse set
- C(S) = O(n), yielding a solution with 2B buckets

Experimental Evaluation
- FH was implemented with r = 6
- It was compared to an algorithm A0 presented by Gilbert et al., which is optimized for arbitrary range queries
- For a data series of length n to be approximated with B buckets, A0 constructs a histogram consisting of 2B buckets
- A0 was the only previously known algorithm with reasonable complexity

Description of Data Sets
- A: a real data set of length 1000, extracted from an AT&T operational warehouse
- B: a synthetic data set of length 2000, Zipf-distributed with skew parameter 0.5
- C: a synthetic data set of varying length, consisting of samples from a Gaussian distribution with variance 250

Workload Description
- A normal distribution is used to assign probabilities to a full hierarchy, followed by normalization to obtain a probability distribution
- W1: generated by sampling N(10,10)
- W2: generated by sampling N(10,50)

Performance Evaluation
- Measures: accuracy and construction time
- Parameters: total space allowed for the histogram; total size of the data set

Computing Accuracy
- Ask 1000 queries
- Report the total expected sum-squared error of executing the workload on the histogram

Results for Data Set A

Results for Set A – Cont.
- The accuracy of FH is superior to A0
- FH is more accurate for the smaller-variance workload (W1); as the variance increases, the workload gets closer to uniform, which is what A0 is optimized for
- A0's construction time is linear in the space; FH's construction time is better over the same range of space

Results for Data Set B

Results for Set B – Cont.
- Similar trends to data set A
- Accuracy improves much faster with space, since the distribution is Zipf
- The savings in construction time for FH are dramatic, since data set B is twice the size of A

Results for Data Set C

Results for Set C – Cont.
- The data set size increases along the x-axis, with the total space fixed at 20
- A0 exhibits a plateau, due to the way the data is generated in the experiment (Gaussian tail)
- A0 shows a quadratic trend in construction time, while FH's construction time increases near-linearly

Conclusions
- The first practical approach to the problem of constructing hierarchical range histograms
- The dynamic programming algorithms effectively trade space for construction time without compromising histogram accuracy
- A novel notion of sparse intervals

Future Plans
- A formal study of the dynamic properties of hierarchical range histograms
- How should one modify these histograms under data or workload modifications?

The END Thanks for listening