
Published by Nichole Longway. Modified over 7 years ago.

1
**Fast Algorithms For Hierarchical Range Histogram Constructions**

Authors: Sudipto Guha, Nick Koudas, Divesh Srivastava. ACM PODS 2002.

2
**Layout**

Introduction; Related Works; Problem Definition; Problem Solution (A Sparse Interval Set System, The Dynamic Programming Algorithm); Experimental Evaluation; Conclusions

3
**Introduction**

Data Warehousing and OLAP applications. OLAP: Online analytical processing. Data has multiple logical dimensions with natural hierarchies defined on them. OLAP queries usually involve hierarchical selections on some of the dimensions and often aggregate measure attributes.

4
Introduction – Cont.

5
**Histograms**

Numeric attribute value domain; space-efficient. Conditions on a given dimension are hierarchical ranges. Range estimation depends on a good solution to the histogram construction problem.

6
The Main Idea: the paper proposes fast, practical algorithms for the problem of constructing hierarchical range histograms.

7
**The Main Contributions**

A novel notion of sparse intervals. A proposed algorithm that effectively trades space for construction time without compromising accuracy. The first practical approach to the problem.

8
**Previous Works**

V-Optimal histograms: minimize error for equality queries, but they are constructed by taking only equality queries into account. Koudas et al.: a polynomial-time algorithm for special and general cases, but the polynomial has a high degree. Gilbert et al.: a pseudo-polynomial-time algorithm, optimal for arbitrary ranges, but again the polynomial has a high degree.

9
**Problem Definition**

An array A[1,n] of non-negative real numbers. The average of items A[a],…,A[b] is $\frac{1}{b-a+1}\sum_{i=a}^{b} A[i]$.

10
Histogram Definition: a histogram of array A[1,n] using B buckets is specified by B+1 integers $0 = b_0 < b_1 < \dots < b_B = n$. Each interval $[b_{i-1}+1, b_i]$ is a bucket, and each $b_i$ is a bucket boundary.

11
**Histogram Definition – Cont.**

Stored as a series of bucket boundaries together with the average of the array values in each bucket. The bucket sum can be obtained from the average and the bucket width.

12
**Histograms – Cont.**

Histograms mostly support equality queries (“give me A[i]”). The target here is hierarchical range queries.

13
**Hierarchical Range Queries Definition**

A range query over [i,j] asks for the sum $\sum_{t=i}^{j} A[t]$. A set S of range queries is hierarchical if, for any two queries in S with ranges [i,j] and [k,l], the ranges are either disjoint or one is contained in the other.
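The definition above can be checked mechanically. The following is a minimal sketch, assuming each query is given as a closed pair `(i, j)`; the function name `is_hierarchical` is hypothetical and not taken from the slides.

```python
from itertools import combinations

def is_hierarchical(ranges):
    """Return True if every pair of closed ranges (i, j) is disjoint or nested."""
    for (i, j), (k, l) in combinations(ranges, 2):
        disjoint = j < k or l < i
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        if not (disjoint or nested):
            return False
    return True

# Nested and disjoint ranges only: a valid hierarchical set.
print(is_hierarchical([(1, 8), (1, 4), (5, 8), (2, 3)]))
# [1,5] and [4,8] overlap partially, so this set is not hierarchical.
print(is_hierarchical([(1, 5), (4, 8)]))
```

The pairwise test is quadratic in the number of queries, which is enough for illustration.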

14
**Hierarchical Range Queries – Cont.**

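Hierarchical range queries generalize equality queries and can be displayed as a tree. Each node u has an associated range. Node v is a child of node u iff the range of v is contained in the range of u and there is no node w whose range lies strictly between the two.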
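One way to obtain this tree from a hierarchical set of ranges is a single sorted sweep with a stack. This is only a sketch under the assumption that the ranges are distinct closed pairs `(i, j)`; `build_tree` is a hypothetical helper name.

```python
def build_tree(ranges):
    """Map each range to its parent range (None for a root).

    Assumes `ranges` is a hierarchical set of distinct closed ranges (i, j).
    """
    parent = {}
    stack = []  # chain of ranges that may still contain later ranges
    # Sort by start, widest first, so a parent is visited before its children.
    for rng in sorted(ranges, key=lambda r: (r[0], -r[1])):
        # Ranges that end before this one starts cannot contain it.
        while stack and stack[-1][1] < rng[0]:
            stack.pop()
        parent[rng] = stack[-1] if stack else None
        stack.append(rng)
    return parent

tree = build_tree([(1, 8), (1, 4), (5, 8), (2, 3), (6, 6)])
for rng, par in tree.items():
    print(rng, "-> parent", par)
```

The parent found this way is the smallest range in the set that contains the node, which matches the "no w in between" condition.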

15
**Workload Definition A workload W consists of**

A set S of hierarchical range queries. A probability for each query in S; this probability can be obtained by monitoring and logging. A simple probabilities model.

16
**How The Histogram Works**

A histogram H of array A[1,n] answers a query over [i,j] with an expected answer, computed as follows. Find the left bucket, the one that contains i, and the right bucket, the one that contains j. Calculate the precise total of the values in the buckets between the left and right buckets. Estimate the sums for the portions of the query within the left and right buckets.

17
**How The Histogram Works – Cont.**

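Under the uniformity assumption, the sum of A over the part of the query that falls inside the left bucket is estimated as the number of covered positions times the bucket average. The right bucket is handled likewise.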

18
**The Total Estimate**

The total estimate = left bucket estimation + right bucket estimation + exact sum for buckets in between.
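The estimate can be sketched in a few lines. The layout below is an assumption: boundaries `[b0, ..., bB]` with `b0 = 0` and `bB = n`, bucket k covering positions `b[k-1]+1 .. b[k]`, and one stored average per bucket. A bucket that lies fully inside the query contributes its width times its average, which is its exact sum, so one loop covers the left, right and in-between cases. The name `estimate_range_sum` is hypothetical.

```python
def estimate_range_sum(boundaries, averages, i, j):
    """Estimate sum(A[i..j]) (1-indexed, inclusive) from a histogram.

    boundaries: [b0, b1, ..., bB] with b0 = 0 and bB = n (assumed layout).
    averages:   average of A over each bucket, in bucket order.
    """
    total = 0.0
    for k in range(1, len(boundaries)):
        lo, hi = boundaries[k - 1] + 1, boundaries[k]
        # Number of query positions that fall inside bucket k.
        overlap = min(hi, j) - max(lo, i) + 1
        if overlap > 0:
            # Uniformity assumption: each position holds the bucket average.
            total += overlap * averages[k - 1]
    return total

# A = [2, 4, 6, 8, 1, 1, 1, 1] with buckets [1,4] and [5,8].
boundaries, averages = [0, 4, 8], [5.0, 1.0]
print(estimate_range_sum(boundaries, averages, 3, 6))  # exact sum is 16
print(estimate_range_sum(boundaries, averages, 1, 8))  # whole buckets: exact
```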

19
**Determining the average**

Construct a prefix-sum array that holds A[1] + … + A[i] for all i. Given a and b, the average of A[a],…,A[b] is then returned in constant time.
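A minimal sketch of this step, assuming the array is stored 0-indexed in Python while queries use the 1-indexed positions of the slides. The names `build_prefix` and `range_average` are illustrative only.

```python
from itertools import accumulate

def build_prefix(values):
    """prefix[i] = A[1] + ... + A[i], with prefix[0] = 0."""
    return [0] + list(accumulate(values))

def range_average(prefix, a, b):
    """Average of A[a..b] (1-indexed, inclusive) in O(1) time."""
    return (prefix[b] - prefix[a - 1]) / (b - a + 1)

prefix = build_prefix([2, 4, 6, 8, 1, 1, 1, 1])
print(range_average(prefix, 1, 4))
print(range_average(prefix, 3, 6))
```

Building the prefix array costs one pass over the data; every later average is two lookups and a division.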

20
**Optimal histogram definition**

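The error of a range query is the squared difference between its exact sum and the histogram estimate. Given a histogram H and workload W, the total expected error for estimating W is the probability-weighted sum of these errors over all queries in W.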

21
**Optimal Histogram Definition – Cont.**

Given W, an optimal histogram with B buckets of array A[1,n] is the histogram with at most B buckets that has the minimum total expected error for estimating workload W among all histograms with at most B buckets
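To make the objective concrete, here is a sketch that evaluates the total expected error of one given histogram. It assumes squared error per query (the evaluation slides report sum squared error), a workload given as `((i, j), probability)` pairs, and boundaries `[b0, ..., bB]` with bucket k covering `b[k-1]+1 .. b[k]`. The function name `expected_error` is hypothetical, and the sketch only scores a histogram; it does not search for the optimal one.

```python
from itertools import accumulate

def expected_error(values, boundaries, workload):
    """Probability-weighted squared error of a histogram over a workload.

    values:     the array A, stored 0-indexed (values[0] is A[1]).
    boundaries: [b0, ..., bB] with b0 = 0 and bB = len(values) (assumed layout).
    workload:   list of ((i, j), probability) pairs, 1-indexed inclusive.
    """
    prefix = [0] + list(accumulate(values))
    total = 0.0
    for (i, j), prob in workload:
        exact = prefix[j] - prefix[i - 1]
        estimate = 0.0
        for lo, hi in zip(boundaries, boundaries[1:]):
            # Bucket covers positions lo+1 .. hi.
            overlap = min(hi, j) - max(lo + 1, i) + 1
            if overlap > 0:
                avg = (prefix[hi] - prefix[lo]) / (hi - lo)
                estimate += overlap * avg
        total += prob * (exact - estimate) ** 2
    return total

A = [2, 4, 6, 8, 1, 1, 1, 1]
W = [((1, 8), 0.5), ((3, 6), 0.5)]
print(expected_error(A, [0, 4, 8], W))     # two buckets
print(expected_error(A, [0, 2, 4, 8], W))  # three buckets aligned with the queries
```

In this toy case the three-bucket histogram has a boundary at position 2, so the query over [3,6] no longer cuts through a non-uniform bucket.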

22
**Fast Histogram (FH) Construction for Hierarchical Range Queries**

Given an array A[1,n], B buckets and workload W, let E denote the total expected error of the optimal histogram. Find algorithms that construct hierarchical range (HR) histograms with an error of at most E, trading space for construction time.

23
**Layout**

Introduction; Related Works; Problem Definition; Problem Solution (A Sparse Interval Set System, The Dynamic Programming Algorithm); Experimental Evaluation; Conclusions

24
**FH construction**

Construct a set of “sparse intervals”. This increases the number of buckets, but any arbitrary interval can still be represented. Then run a dynamic programming algorithm over the sparse intervals.

25
**A Sparse Interval System**

Given an integer l, set $n = l^r$. Level 1 points: $0, 1, 2, \dots, n$. Level 2 points: $0, l, 2l, \dots, n$. Level j+1 points: $0, l^j, 2l^j, \dots, n$. Last (level r+1) points: $0$ and $n$.

26
**A Sparse Interval System**

The interval [0,n] is in the sparse system S. Any pair of level j points between adjacent level j+1 points defines an interval in S.

27
**A Sparse Interval System Example**

n = 8; r = 3; l = 2. Level 1 points: 0, 1, 2, 3, 4, 5, 6, 7, 8. Level 2 points: 0, 2, 4, 6, 8. Level 3 points: 0, 4, 8. Level 4 points: 0, 8.
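The example can be reproduced with a short enumeration. This sketch assumes that level j points are the multiples of l^(j-1) in [0, n] with n = l^r, which is how the example above reads; the function names are hypothetical.

```python
from itertools import combinations

def level_points(l, r, j):
    """Level j points: multiples of l**(j-1) in [0, l**r] (assumed reading)."""
    n = l ** r
    return list(range(0, n + 1, l ** (j - 1)))

def sparse_intervals(l, r):
    """All intervals (a, b) of the sparse system, as a set of point pairs."""
    n = l ** r
    intervals = {(0, n)}
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        # Each block spans two adjacent level j+1 points.
        for start in range(0, n, block):
            points = range(start, start + block + 1, step)
            intervals.update(combinations(points, 2))
    return intervals

for j in range(1, 5):
    print("level", j, level_points(2, 3, j))
S = sparse_intervals(2, 3)
print(len(S), "sparse intervals out of", 9 * 8 // 2, "possible intervals")
```

For n = 8 the sparse system keeps 15 of the 36 intervals over the points 0..8.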

28
**Sparse Interval System Properties**

Any interval over [0,n] can be written as a disjoint union of at most 2r intervals in the sparse system

29
Claim: any interval [0,x] can be expressed as a partition into at most r intervals from the sparse system.

30
**Claim Proof**

By induction. Induction hypothesis: any interval [0,x] where $x \le l^j$ can be expressed as at most j intervals. The base case holds for j = 1.

31
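**Claim Proof – Cont.**

Step j+1: consider [0,x] with $x \le l^{j+1}$. We can write the interval as $[0, t\,l^j]$ and $[t\,l^j, x]$, where t is maximal. $[0, t\,l^j]$ is a valid interval in the sparse system (its endpoints are level j+1 points lying between the adjacent level j+2 points $0$ and $l^{j+1}$).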

32
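**Claim Proof – Cont.**

$[t\,l^j, x]$ is essentially similar to $[0, x - t\,l^j]$, where $x - t\,l^j < l^j$ since t is maximal. Therefore, by induction, it can be expressed by at most j intervals, for a total of j+1. Since $n = l^r$, any interval [0,x] can be expressed as a union of at most r intervals.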

33
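**Observation**

Any interval [a,b] can be expressed as at most 2r intervals, by cutting it at a point of the form $t\,l^j$ with maximum j. By symmetry, the two parts on either side of the cut can each be expressed as a disjoint union of at most r sparse intervals.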
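The cutting argument can be turned into a small procedure. The sketch below assumes level j points are the multiples of l^(j-1) with n = l^r, as in the n = 8 example; `decompose` and `in_sparse` are hypothetical names. It finds the coarsest step with a multiple inside [a, b], takes the stretch between the first and last such multiple as one interval, and then peels one interval per finer level on each side.

```python
def in_sparse(a, b, l, r):
    """True if [a, b] joins two level j points inside one level j+1 block."""
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        if a % step == 0 and b % step == 0 and b <= (a // block + 1) * block:
            return True
    return False

def decompose(a, b, l, r):
    """Split [a, b], 0 <= a < b <= l**r, into adjacent sparse intervals."""
    # Coarsest step l**k that has a multiple inside [a, b].
    for k in range(r, -1, -1):
        s = l ** k
        c1 = -(-a // s) * s  # first multiple of s that is >= a
        c2 = (b // s) * s    # last multiple of s that is <= b
        if c1 <= c2:
            break
    pieces = [(c1, c2)] if c1 < c2 else []
    left_end, right_start = c1, c2
    for j in range(k - 1, -1, -1):
        s = l ** j
        lo = -(-a // s) * s
        if lo < left_end:        # peel one interval on the left side
            pieces.append((lo, left_end))
            left_end = lo
        hi = (b // s) * s
        if hi > right_start:     # peel one interval on the right side
            pieces.append((right_start, hi))
            right_start = hi
    return sorted(pieces)

print(decompose(1, 7, 2, 3))
print(decompose(0, 5, 2, 3))
```

Each level contributes at most one piece per side, which is where the bound of 2r pieces comes from.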

34
Lemma In a sparse set system with parameter r, the number of intervals containing a point is at most

35
**Lemma Proof Consider the level 1 intervals**

There are at most such intervals that contain a specific point There are l points between adjacent points of level 2 l points can create at most intervals Level j intervals behave on level j points the same as level 1 points on the original points Extend to r levels…(r+1’th level adds one more interval)

36
**Layout**

Introduction; Related Works; Problem Definition; Problem Solution (A Sparse Interval Set System, The Dynamic Programming Algorithm); Experimental Evaluation; Conclusions

37
**Hierarchy Representation By a Tree**

Ranges define a hierarchy based on the inclusion relationship. T is the tree that represents this hierarchy. Each node v of T is associated with a range, a weight (the workload probability of its query) and an error.

38
**Representation By a Binary Tree**

We allow nodes with weight 0. If a node has more than two children, transform it into a node with two children by introducing a new node with weight 0. The size of the tree increases only by a factor of 2, so assume that the tree is binary.
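A sketch of this transformation follows. The `Node` class and the choice to give each helper node the hull of the children it groups are assumptions made for illustration; the slides only state that the new node has weight 0.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    rng: tuple      # (start, end) of the associated range
    weight: float   # query probability; 0 for helper nodes
    children: list = field(default_factory=list)

def binarize(node):
    """Return an equivalent tree in which every node has at most two children."""
    kids = [binarize(c) for c in sorted(node.children, key=lambda c: c.rng)]
    while len(kids) > 2:
        # Group the last two children under a new weight-0 helper node.
        right = kids.pop()
        left = kids.pop()
        kids.append(Node((left.rng[0], right.rng[1]), 0.0, [left, right]))
    return Node(node.rng, node.weight, kids)

def count(node):
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(count(c) for c in node.children)

root = Node((1, 8), 0.2, [Node((1, 2), 0.2), Node((3, 4), 0.2),
                          Node((5, 6), 0.2), Node((7, 8), 0.2)])
binary = binarize(root)
print(count(root), "nodes before,", count(binary), "nodes after")
```

A node with k children gains k - 2 helper nodes, so the total number of added nodes never exceeds the original tree size.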

39
**Dynamic Programming Algorithm - FH**

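Best(v,left,right,p) denotes the smallest error for the range of v, where v is a tree node with its associated range, left is the bucket interval overlapping the range on the left, right is the bucket interval overlapping the range on the right, and the range of v contains p bucket intervals completely. Formally, left contains the left endpoint of the range and right contains its right endpoint.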

40
**FH stages**

Let the children of v be y and z, each with its own associated range. Cases (a) + (b)

41
**Cases (a)+(b) For all possible intervals I that contain and ,compute**

In the case that I finishes on

42
Cases (a)+(b)

43
FH stages – Cont. Return When interval I is fixed, and are automatically defined and can be counted in O(1) time.

44
**Time complexity**

Time spent evaluating cost(I) is O(p). The running time depends on the number of choices of interval I. Let C(S) be the maximum number of intervals in an interval system S that contain a particular element. If all intervals are allowed then $C(S) = O(n^2)$.

45
**Time complexity – Cont. The running time of the algorithm FH is**

The number of entries for each tree node v is Since there are C(S)+1 intervals for choices of left (all intervals that contain and ). Similarly for right Work for every tree node

46
**Time complexity – Cont. Total work including preprocessing is**

When S is a set of all possible intervals The result matches the time complexity of the previous algorithm (for arbitrary intervals)

47
**Time Complexity For a Sparse Set System**

S is a sparse set system with parameter r. Run FH with 2r(B-1) buckets. The error is less than or equal to that of the original B-bucket histogram, because a histogram with B buckets can be expressed as a histogram with 2r(B-1) buckets in the sparse system.

48
**Time Complexity – Cont. Set**

In time we can construct a solution with buckets whose error is at most the error of any solution of the original problem with B buckets

49
Some Notes: alternate tradeoffs are obtained by constructing different sparse set systems. For example, take a complete binary tree on [0,n] and allow intervals such that one endpoint is an ancestor of the other. Any arbitrary interval can then be expressed as a disjoint union of two intervals from the sparse set, with C(S) = O(n). Solution with 2B buckets in time

50
**Experimental Evaluation**

FH was implemented with r=6 and compared to an algorithm A0 presented by Gilbert et al., which is optimized for arbitrary range queries and is the only known algorithm with reasonable complexity. For a data series of length n to be approximated with B buckets, A0 constructs a histogram consisting of 2B buckets in time

51
**Description of Data Sets**

A: A real data set of length 1000 extracted from an AT&T operational warehouse.

B: A synthetic data set of length 2000, Zipf-distributed with skew parameter 0.5.

C: A synthetic data set of varying length, representing samples from a Gaussian distribution with mean and variance 250.

52
Workload Description: a normal distribution is used to assign the probabilities to a full hierarchy, followed by normalization to obtain a probability distribution. W1 is generated by sampling N(10,10); W2 is generated by sampling N(10,50).
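A workload of this kind could be generated along the following lines. Several details are assumptions rather than facts from the slides: the full hierarchy is taken to be a complete binary hierarchy over positions 1..n, the second parameter of N(10,10) is read as a variance, negative samples are clipped to zero, and the names `full_hierarchy` and `make_workload` are hypothetical.

```python
import random

def full_hierarchy(n):
    """Ranges of a complete binary hierarchy over 1..n (n a power of two)."""
    ranges, width = [], n
    while width >= 1:
        ranges += [(s, s + width - 1) for s in range(1, n + 1, width)]
        width //= 2
    return ranges

def make_workload(n, mean, variance, seed=0):
    """Sample one weight per range from a normal, then normalize to sum to 1."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    queries = full_hierarchy(n)
    # Clip at zero so every weight is a valid (unnormalized) probability.
    raw = [max(rng.gauss(mean, sigma), 0.0) for _ in queries]
    total = sum(raw)
    return {q: w / total for q, w in zip(queries, raw)}

w1 = make_workload(8, mean=10, variance=10)
w2 = make_workload(8, mean=10, variance=50)
print(len(w1), "queries; W1 spread:", max(w1.values()) - min(w1.values()))
print(len(w2), "queries; W2 spread:", max(w2.values()) - min(w2.values()))
```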

53
**Performance Evaluation**

Accuracy and construction time Parameters Total space allowed for histogram Total size of the data set

54
**Computing Accuracy Ask 1000 queries**

Report the total expected sum squared error of the workload execution on the histogram

55
Results for Data Set A

56
**Results for Set A – Cont.**

The accuracy of FH is superior to A0. FH is more accurate for smaller variance (W1). As the variance increases, the workload gets closer to uniform, which is the case A0 is optimized for. A0 is linear in the space. FH is better in construction time for the same range of space.

57
Results for Data Set B

58
**Results for Set B – Cont.**

Similar to A. Accuracy improves much faster with space since the distribution is Zipf. The savings in construction time for FH are dramatic since data set B is twice the size of A.

59
Results for Data Set C

60
Results for Set C – Cont. The data set size increases (x axis) and the total space is 20. A0 has a plateau, due to the way the data is generated in the experiment (Gaussian tail). A0 shows a quadratic trend in construction time; FH shows a near-linear increase in construction time.

61
Conclusions: the first practical approach to the problem of constructing hierarchical range histograms. The dynamic programming algorithms effectively trade space for construction time without compromising histogram accuracy. A novel notion of sparse intervals.

62
Future plans: a formal study of the dynamic properties of hierarchical range histograms. How should one modify these histograms under data or workload modifications?

63
The END Thanks for listening
