# Fast Algorithms For Hierarchical Range Histogram Constructions

Authors: Sudipto Guha, Nick Koudas, Divesh Srivastava. ACM PODS 2002.

Layout: Introduction; Related Works; Problem Definition; Problem Solution; A Sparse Interval Set System; The Dynamic Programming Algorithm; Experimental Evaluation; Conclusions.

Introduction: Data Warehousing and OLAP applications.
OLAP stands for online analytical processing. Data has multiple logical dimensions with natural hierarchies defined on them. OLAP queries usually involve hierarchical selections on some of the dimensions and often aggregate measure attributes.

Introduction – Cont.

Histograms: numeric attribute value domain; space-efficient.
Conditions on a given dimension are hierarchical ranges. Range estimation depends on a good solution to the histogram construction problem.

The Main Idea: the paper proposes fast, practical algorithms for the problem of constructing hierarchical range histograms.

The Main Contributions
A novel notion of sparse intervals. A proposed algorithm that effectively trades space for construction time without compromising accuracy. The first practical approach to the problem.

Previous Works
V-Optimal histograms: minimize error for equality queries, but they are constructed by taking only equality queries into account. Koudas et al.: a polynomial-time algorithm for special and general cases, but the polynomial is of high degree. Gilbert et al.: a pseudo-polynomial-time algorithm, optimal for arbitrary ranges, but again of high polynomial degree.

Problem Definition: an array A[1,n] of non-negative real numbers.
The average of items A[a],…,A[b] is $\frac{1}{b-a+1}\sum_{i=a}^{b} A[i]$.

Histogram Definition: a histogram of array A[1,n] using B buckets is specified by B+1 integers, the bucket boundaries. Each interval between two consecutive boundaries is a bucket.

Histogram Definition – Cont.
A histogram is stored as a series of bucket boundaries together with the average of the array values in each bucket. The bucket sum can be obtained from the average and the bucket length.
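The stored representation can be illustrated with a minimal Python sketch. The function names `build_histogram` and `bucket_sum`, the tuple layout, and the boundary convention (boundaries 0 = b_0 < … < b_B = n, with bucket i covering positions b_{i-1}+1 to b_i) are assumptions made for illustration; the slides do not fix them.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, float]  # (first position, last position, average), 1-based


def build_histogram(A: List[float], boundaries: List[int]) -> List[Bucket]:
    """Store each bucket as its boundaries plus the average of its values.

    Assumed convention: boundaries = [b_0, ..., b_B] with b_0 = 0 and b_B = len(A);
    bucket i covers the 1-based positions b_{i-1}+1 .. b_i.
    """
    buckets = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        values = A[lo:hi]  # 0-based slice holding positions lo+1 .. hi
        buckets.append((lo + 1, hi, sum(values) / len(values)))
    return buckets


def bucket_sum(bucket: Bucket) -> float:
    """Recover the bucket sum from the stored average and the bucket length."""
    start, end, avg = bucket
    return avg * (end - start + 1)


A = [1, 3, 5, 7, 2, 2]
H = build_histogram(A, [0, 2, 4, 6])
print(H)
print(bucket_sum(H[1]))
```

Storing averages rather than sums is only one possible choice; either can be derived from the other given the boundaries.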

Histograms – Cont. Existing histograms mostly support equality queries
(“give me A[i]”). The focus here is on hierarchical range queries.

Hierarchical Range Queries Definition
A range query over [i,j] asks for the sum $\sum_{t=i}^{j} A[t]$. A set S of range queries is hierarchical if, for any two queries in S with ranges [i,j] and [k,l], the ranges are either disjoint or one is contained in the other.
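The definition translates directly into a pairwise check. The following Python sketch is illustrative only; the function name `is_hierarchical` and the representation of a range as an `(i, j)` tuple with inclusive endpoints are assumptions.

```python
from itertools import combinations
from typing import Iterable, Tuple


def is_hierarchical(ranges: Iterable[Tuple[int, int]]) -> bool:
    """Return True if every pair of ranges is disjoint or nested."""
    for (i, j), (k, l) in combinations(list(ranges), 2):
        disjoint = j < k or l < i
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        if not (disjoint or nested):
            return False  # the two ranges partially overlap
    return True


print(is_hierarchical([(1, 8), (1, 4), (5, 8), (2, 3)]))
print(is_hierarchical([(1, 4), (3, 6)]))
```

The check is quadratic in the number of queries, which is fine for a sanity test but not meant as an efficient validator.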

Hierarchical Range Queries – Cont.
They generalize equality queries and can be displayed as a tree. Each node u has an associated range. Node v is a child of node u iff the range of v is contained in the range of u and there is no node w whose range lies between the two.
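One way to build this tree from a hierarchical set of ranges is to sort the ranges and keep a stack of currently open ancestors. The sketch below is an illustration under assumptions: the name `build_hierarchy`, the tuple representation of ranges, and the parent-map output are not taken from the slides, and the input is assumed to be hierarchical already.

```python
from typing import Dict, Iterable, Optional, Tuple

Range = Tuple[int, int]


def build_hierarchy(ranges: Iterable[Range]) -> Dict[Range, Optional[Range]]:
    """Map each range to its parent, the smallest other range containing it."""
    parent: Dict[Range, Optional[Range]] = {}
    stack = []  # chain of ranges that contain the position being processed
    # Sort by start, widest first, so that a container precedes its contents.
    for rng in sorted(set(ranges), key=lambda r: (r[0], -r[1])):
        # Pop ranges that do not contain the current one.
        while stack and not (stack[-1][0] <= rng[0] and rng[1] <= stack[-1][1]):
            stack.pop()
        parent[rng] = stack[-1] if stack else None
        stack.append(rng)
    return parent


tree = build_hierarchy([(1, 8), (1, 4), (5, 8), (2, 3), (5, 5)])
print(tree)
```

The single-point range `(5, 5)` plays the role of an equality query, which is how hierarchical range queries generalize them.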

A workload consists of a set S of hierarchical range queries and a probability for each query in S. These probabilities can be obtained by monitoring and logging. This is a simple probability model.

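How The Histogram Works
Given a histogram H of array A[1,n] and a range query, an estimated answer is computed as follows. Find the left bucket, which contains the left endpoint of the query, and the right bucket, which contains the right endpoint. Calculate the precise total of the values in the buckets between the left and right buckets. Estimate the sums for the portions of the query within the left and right buckets.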

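How The Histogram Works – Cont.
The sum of A over the part of the query that falls in the left bucket is estimated under the uniformity assumption: the number of covered positions times the bucket average. The right bucket is handled likewise.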

The Total Estimate: total estimate = left bucket estimate +
right bucket estimate + exact sum for the buckets in between.
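The estimation procedure can be sketched in a few lines of Python. The function name `estimate_range_sum` and the bucket layout `(start, end, average)` with 1-based inclusive positions are assumptions made for this illustration. A fully covered bucket contributes its exact sum (length times average), while a partially covered bucket contributes the uniformity estimate, so a single loop handles the left, right, and middle buckets.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, float]  # (start, end, average), 1-based inclusive


def estimate_range_sum(buckets: List[Bucket], i: int, j: int) -> float:
    """Estimate sum(A[i..j]) from bucket averages under the uniformity assumption."""
    total = 0.0
    for start, end, avg in buckets:
        overlap = min(end, j) - max(start, i) + 1  # positions shared with [i, j]
        if overlap > 0:
            total += overlap * avg
    return total


# Histogram of A = [1, 3, 5, 7, 2, 2] with buckets [1,2], [3,4], [5,6].
H = [(1, 2, 2.0), (3, 4, 6.0), (5, 6, 2.0)]
print(estimate_range_sum(H, 2, 5))  # exact answer would be 3 + 5 + 7 + 2 = 17
```

A production version would locate the left and right buckets by binary search over the boundaries instead of scanning every bucket.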

Determining the average
Construct a prefix-sum array of A over all positions. Given a and b, the average of A[a],…,A[b] can then be returned in constant time.
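A minimal Python sketch of the prefix-sum idea follows. The names `prefix_sums` and `range_average` are illustrative, and positions are taken to be 1-based to match A[1,n].

```python
from itertools import accumulate
from typing import List


def prefix_sums(A: List[float]) -> List[float]:
    """P[k] = A[1] + ... + A[k], with P[0] = 0 (1-based positions)."""
    return [0] + list(accumulate(A))


def range_average(P: List[float], a: int, b: int) -> float:
    """Average of A[a..b] in O(1) time using the prefix-sum array P."""
    return (P[b] - P[a - 1]) / (b - a + 1)


A = [1, 3, 5, 7, 2, 2]
P = prefix_sums(A)
print(P)
print(range_average(P, 3, 4))
```

Building `P` takes linear time once; every later average query costs two lookups and one division.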

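Optimal histogram definition
The error of a range query is the squared difference between its exact answer and the estimate obtained from the histogram. Given a histogram H and workload W, the total expected error for estimating W is the probability-weighted sum of these errors over all queries in W.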
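The total expected error can be computed by brute force as a sanity check. The sketch below assumes a squared-error measure, in line with the "sum squared error" reported in the experiments, and assumes a workload given as a list of `((i, j), probability)` pairs; the name `expected_error` and both representations are illustrative.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, float]            # (start, end, average), 1-based inclusive
Query = Tuple[Tuple[int, int], float]      # ((i, j), probability)


def expected_error(A: List[float], buckets: List[Bucket], workload: List[Query]) -> float:
    """Probability-weighted squared error of the histogram estimates (assumed metric)."""
    total = 0.0
    for (i, j), prob in workload:
        exact = sum(A[i - 1:j])  # exact range sum over positions i..j
        estimate = sum(
            max(0, min(end, j) - max(start, i) + 1) * avg
            for start, end, avg in buckets
        )
        total += prob * (exact - estimate) ** 2
    return total


A = [1, 3, 5, 7, 2, 2]
H = [(1, 2, 2.0), (3, 4, 6.0), (5, 6, 2.0)]
W = [((2, 5), 0.5), ((1, 6), 0.5)]
print(expected_error(A, H, W))
```

In this example the query [1,6] is answered exactly because it covers whole buckets, and only [2,5] contributes error.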

Optimal Histogram Definition – Cont.
Given W, an optimal histogram with B buckets of array A[1,n] is the histogram with at most B buckets that has the minimum total expected error for estimating workload W among all histograms with at most B buckets

Fast Histogram (FH) Construction for Hierarchical Range Queries
Given an array A[1,n], B buckets, and a workload W, let E denote the total expected error of the optimal histogram. The goal is to find algorithms that construct hierarchical range (HR) histograms with an error of at most E, trading space against construction time.

FH construction: construct a set of “sparse intervals”.
This increases the number of buckets, but any arbitrary interval can still be represented. A dynamic programming algorithm is then run over the sparse set.

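A Sparse Interval System
Given the integer set {0, 1, …, n} with $n = l^r$. Level 1 points: all integers 0, 1, …, n. Level 2 points: the multiples of $l$. Level j+1 points: the multiples of $l^j$. Last (level r+1) points: 0 and n.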

A Sparse Interval System
The interval [0,n] is in the sparse system S. Any pair of level j points lying between two adjacent level j+1 points defines an interval in S.

A Sparse Interval System Example
n=8; r=3; l=2. Level 1 points: 0, 1, 2, …, 8. Level 2 points: 0, 2, 4, 6, 8. Level 3 points: 0, 4, 8. Level 4 points: 0, 8.
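The example can be reproduced with a short Python sketch. It assumes that n = l^r, that level j points are the multiples of l^(j-1), and that the last level consists of 0 and n only, which matches the n=8, r=3, l=2 example; the function names `level_points` and `sparse_intervals` are illustrative.

```python
from typing import List, Set, Tuple


def level_points(n: int, l: int, r: int, level: int) -> List[int]:
    """Points of the given level (1 .. r+1), assuming n == l**r."""
    if level == r + 1:
        return [0, n]
    return list(range(0, n + 1, l ** (level - 1)))


def sparse_intervals(n: int, l: int, r: int) -> Set[Tuple[int, int]]:
    """All intervals of the sparse system, as (left, right) endpoint pairs."""
    S = {(0, n)}
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        # Each block spans two adjacent level j+1 points.
        for c in range(0, n, block):
            pts = range(c, c + block + 1, step)  # level j points in the block
            S.update((a, b) for a in pts for b in pts if a < b)
    return S


S = sparse_intervals(8, 2, 3)
print(level_points(8, 2, 3, 2), level_points(8, 2, 3, 3))
print(len(S), "sparse intervals versus", 9 * 8 // 2, "arbitrary intervals")
```

Even at this tiny size the sparse system keeps fewer than half of all possible intervals, and the gap widens as n grows.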

Sparse Interval System Properties
Any interval over [0,n] can be written as a disjoint union of at most 2r intervals in the sparse system

Claim Any interval [0,x] can be expressed as a partition of at most r intervals from the sparse system

Claim Proof By induction Induction step
Any interval where can be expressed as j intervals. Base case true for j=1

Claim Proof – Cont. j+1 Consider We can write the interval as and
where t is maximal is a valid interval in the sparse system (in level j and are adjacent)

Claim Proof – Cont. is essentially similar to
since t is maximal. Therefore by induction can be expressed by j intervals Total j+1 Since any interval can be expressed as a union of r intervals

Observation Any interval can be expressed as intervals
By cutting it in a point of the form with maximum j By symmetry and can be expressed as a disjoint union
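The cutting argument can be turned into a constructive decomposition. The following Python sketch is one possible reading of it, assuming n = l^r and that level j points are the multiples of l^(j-1); the names `in_sparse` and `decompose` are illustrative. It cuts [a,b] at a point of the highest possible level and then walks outward from the cut, one level at a time.

```python
from typing import List, Tuple


def in_sparse(a: int, b: int, l: int, r: int) -> bool:
    """True if [a, b] (with a < b) is an interval of the assumed sparse system."""
    if (a, b) == (0, l ** r):
        return True
    return any(
        a % l ** (k - 1) == 0 and b % l ** (k - 1) == 0 and a // l ** k == (b - 1) // l ** k
        for k in range(1, r + 1)
    )


def decompose(a: int, b: int, l: int, r: int) -> List[Tuple[int, int]]:
    """Split [a, b] into consecutive sparse intervals by cutting at a top-level point."""
    assert 0 <= a < b <= l ** r
    j = r
    while True:  # largest j such that a multiple of l**j lies in [a, b]
        step = l ** j
        c = -(-a // step) * step  # smallest multiple of l**j that is >= a
        if c <= b:
            break
        j -= 1
    left, pos = [], c
    for k in range(j, 0, -1):  # walk from the cut point down to a
        step = l ** (k - 1)
        start = pos - ((pos - a) // step) * step
        if start < pos:
            left.append((start, pos))
            pos = start
    right, pos = [], c
    for k in range(j + 1, 0, -1):  # walk from the cut point up to b
        step = l ** (k - 1)
        end = pos + ((b - pos) // step) * step
        if end > pos:
            right.append((pos, end))
            pos = end
    return left[::-1] + right


print(decompose(1, 7, 2, 3))
```

Each walk uses at most one interval per level, which is where the bound of 2r pieces comes from.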

Lemma In a sparse set system with parameter r, the number of intervals containing a point is at most

Lemma Proof Consider the level 1 intervals
There are at most such intervals that contain a specific point There are l points between adjacent points of level 2 l points can create at most intervals Level j intervals behave on level j points the same as level 1 points on the original points Extend to r levels…(r+1’th level adds one more interval)

Hierarchy Representation By a Tree
Ranges define a hierarchy based on the inclusion relationship. T is the representation of this hierarchy as a tree. Each node v of T is associated with a range, a weight, and an error.

Representation By a Binary Tree
We allow nodes of weight 0. If a node has more than two children, transform it into a node with two children by introducing a new node with weight 0. The size of the tree increases by at most a factor of 2, so we may assume that the tree is binary.
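The transformation can be sketched as follows in Python. The `Node` class, the name `binarize`, and the choice to give each new zero-weight node the range spanned by the two children it groups are assumptions made for illustration; children are assumed to be listed from left to right.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    rng: Tuple[int, int]
    weight: float
    children: List["Node"] = field(default_factory=list)


def binarize(node: Node) -> Node:
    """Rewrite the tree in place so that every node has at most two children."""
    node.children = [binarize(child) for child in node.children]
    while len(node.children) > 2:
        # Group the two rightmost children under a new zero-weight node.
        a, b = node.children[-2], node.children[-1]
        dummy = Node((a.rng[0], b.rng[1]), 0.0, [a, b])
        node.children[-2:] = [dummy]
    return node


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children)


root = Node((1, 8), 1.0, [Node((1, 2), 0.1), Node((3, 4), 0.1),
                          Node((5, 6), 0.1), Node((7, 8), 0.1)])
binarize(root)
print(count_nodes(root))
```

A node with k children gains k-2 new nodes, so the tree at most doubles in size.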

Dynamic Programming Algorithm - FH
Best(v, left, right, p) denotes the smallest error of the range of v, where: v is a tree node with its associated range; left is the overlapping interval on the left; right is the overlapping interval on the right; and the range of v contains p intervals completely. Formally, left contains the left endpoint of the range of v and right contains its right endpoint.

FH stages: let the children of v be y and z, with their associated ranges.
Cases (a) + (b)

Cases (a)+(b) For all possible intervals I that contain and ,compute
In the case that I finishes on

Cases (a)+(b)

FH stages – Cont. Return When interval I is fixed, and are automatically defined and can be counted in O(1) time.

Time complexity: the time spent evaluating cost(I) is O(p).
The running time depends on the number of choices of interval I. Let C(S) be the maximum number of intervals in an interval system S that contain a particular element. If all intervals are allowed, then $C(S) = O(n^2)$.

Time complexity – Cont. The running time of the algorithm FH is
The number of entries for each tree node v is Since there are C(S)+1 intervals for choices of left (all intervals that contain and ). Similarly for right Work for every tree node

Time complexity – Cont. Total work including preprocessing is
When S is the set of all possible intervals, the result matches the time complexity of the previous algorithm (for arbitrary intervals).

Time Complexity For a Sparse Set System
Let S be a sparse set system with parameter r. Run FH with 2r(B-1) buckets. The resulting error is less than or equal to that of the original B-bucket histogram, because a histogram with B buckets can be expressed as a histogram with 2r(B-1) buckets in the sparse system.

Time Complexity – Cont. Set
In time we can construct a solution with buckets whose error is at most the error of any solution of the original problem with B buckets

Some Notes: alternative tradeoffs can be obtained by constructing a different sparse set system. Take a complete binary tree on [0,n] and allow intervals such that one endpoint is an ancestor of the other. Any arbitrary interval can then be expressed as a disjoint union of two intervals from the sparse set, and C(S) = O(n). Solution with 2B buckets in time

Experimental Evaluation
FH was implemented with r=6 and compared to an algorithm A0 presented by Gilbert et al. A0 is optimized for arbitrary range queries. It is the only known algorithm with reasonable complexity. For a data series of length n to be approximated with B buckets, A0 constructs a histogram consisting of 2B buckets in time

Description of Data Sets
A: a real data set of length 1000 extracted from an AT&T operational warehouse.
B: a synthetic data set of length 2000, Zipf-distributed with skew parameter 0.5.
C: a synthetic data set of varying length, representing samples from a Gaussian distribution with mean and variance 250.

Workload Description: a normal distribution is used to assign the probabilities to a full hierarchy, followed by normalization to obtain a probability distribution. W1 is generated by sampling N(10,10); W2 is generated by sampling N(10,50).
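The sample-then-normalize step might look like the following Python sketch. Several details are assumptions, since the slides do not state them: the second parameter of N(·,·) is treated here as a standard deviation, negative samples are clipped to zero, and the function name `sample_workload` and the seed handling are purely illustrative.

```python
import random
from typing import Dict, List, Tuple

Range = Tuple[int, int]


def sample_workload(ranges: List[Range], mu: float, sigma: float,
                    seed: int = 0) -> Dict[Range, float]:
    """Draw one normal sample per query range, then normalize to probabilities."""
    rng = random.Random(seed)
    # Assumption: negative draws are clipped to 0 so that weights stay valid.
    raw = [max(0.0, rng.gauss(mu, sigma)) for _ in ranges]
    total = sum(raw)
    if total == 0:
        raise ValueError("all sampled weights were clipped to zero")
    return {r: w / total for r, w in zip(ranges, raw)}


hierarchy = [(1, 8), (1, 4), (5, 8), (1, 2), (3, 4), (5, 6), (7, 8)]
W1 = sample_workload(hierarchy, 10, 10)
print(W1)
```

With a small spread relative to the mean, the normalized probabilities stay close to uniform; a larger spread makes individual queries dominate.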

Performance Evaluation
Measures: accuracy and construction time. Parameters: the total space allowed for the histogram and the total size of the data set.

Accuracy is reported as the total expected sum squared error of the workload execution on the histogram.

Results for Data Set A

Results for Set A – Cont. The accuracy of FH is superior to that of A0.
FH is more accurate for the smaller variance (W1). As the variance increases, the workload gets closer to uniform, which is what A0 is optimized for. A0 is linear in the space. FH is better in construction time for the same range of space.

Results for Data Set B

Results for Set B – Cont. Similar to A.
Accuracy improves much faster with space since the distribution is Zipf. The savings in construction time for FH are dramatic since data set B is twice the size of A.

Results for Data Set C

Results for Set C – Cont. The data set size increases (x axis) while the total space is fixed at 20. A0 has a plateau, due to the way the data is generated in the experiment (Gaussian tail). A0 shows a quadratic trend in construction time, while FH shows a near-linear increase in construction time.

Conclusions: the first practical approach to the problem of constructing hierarchical range histograms. The dynamic programming algorithms effectively trade space and construction time without compromising histogram accuracy. A novel notion of sparse intervals.

Future plans: a formal study of the dynamic properties of hierarchical range histograms. How should one modify these histograms under data or workload modifications?

The END Thanks for listening