Published by Nichole Longway. Modified about 1 year ago

1
Fast Algorithms for Hierarchical Range Histogram Construction Authors: Sudipto Guha, Nick Koudas, Divesh Srivastava. ACM PODS 2002

2
Layout Introduction Related Works Problem Definition Problem Solution –A Sparse Interval Set System –The Dynamic Programming algorithm Experimental Evaluation Conclusions

3
Introduction Data Warehousing and OLAP applications –OLAP – Online analytical processing Data has multiple logical dimensions with natural hierarchies defined on it OLAP queries –usually involve hierarchical selections on some of the dimensions –often aggregate measure attributes

4
Introduction – Cont.

5
Histograms Numeric attribute value domain Space-efficient Conditions on a given dimension - hierarchical ranges Range estimation depends on a good solution to the histogram construction problem

6
The Main Idea Proposes fast, practical algorithms for the problem of constructing hierarchical range histograms

7
The Main Contributions A novel notion of sparse intervals A proposed algorithm effectively trades space for construction time without compromising the accuracy First practical approach to the problem

8
Previous Works V-Optimal histograms –Minimize error for equality queries –But … constructed by taking only equality queries into account Koudas et al. – a polynomial-time algorithm –For special and general cases –But … high-degree polynomial running time Gilbert et al. – pseudo-polynomial-time optimal algorithm for arbitrary ranges –But … high-degree polynomial running time

9
Problem Definition An array A[1,n] of non-negative real numbers The average of items A[a], …,A[b]

10
Histogram Definition A histogram of array A[1,n] using B buckets is specified by B+1 integers. Each interval between consecutive integers is a bucket. Each of the integers is a bucket boundary

11
Histogram Definition – Cont. Stored as –a series of bucket boundaries –the average of the array values in each bucket –bucket sum can be obtained

12
Histograms – Cont. Mostly support equality queries –“give me A[i]” Hierarchical range queries

13
Hierarchical Range Queries Definition A range query [i,j] asks for the sum A[i] + … + A[j] A set S of range queries is hierarchical if, for any two queries in S with ranges [i,j] and [k,l], the ranges are –disjoint –or contained one in the other

14
Hierarchical Range Queries – Cont. Generalize equality queries Can be displayed as a tree –Each node u has an associated range –Node v is a child of node u iff the range of v is contained in the range of u and there is no node w whose range lies strictly between the two
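
The definition above can be checked mechanically. The following is a minimal sketch, not code from the presentation: the function names `is_hierarchical` and `build_parent_map` are hypothetical, ranges are assumed to be distinct inclusive pairs (i, j), and the parent of a range is taken to be the smallest other range that contains it.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]  # inclusive range [i, j]


def is_hierarchical(ranges: List[Range]) -> bool:
    """True if every pair of ranges is either disjoint or nested."""
    for idx, (i, j) in enumerate(ranges):
        for (k, l) in ranges[idx + 1:]:
            disjoint = j < k or l < i
            nested = (i <= k and l <= j) or (k <= i and j <= l)
            if not (disjoint or nested):
                return False
    return True


def build_parent_map(ranges: List[Range]) -> Dict[Range, Optional[Range]]:
    """Map each range to its tree parent: the smallest other range containing it.

    Assumes the ranges are distinct and hierarchical; roots map to None.
    """
    parent: Dict[Range, Optional[Range]] = {}
    for r in ranges:
        containers = [s for s in ranges
                      if s != r and s[0] <= r[0] and r[1] <= s[1]]
        parent[r] = min(containers, key=lambda s: s[1] - s[0]) if containers else None
    return parent


queries = [(1, 8), (1, 4), (5, 8), (1, 2), (3, 3)]
tree = build_parent_map(queries)
```

In this example (1,2) and (3,3) become children of (1,4), which in turn is a child of (1,8). An equality query is the special case i = j.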

15
Workload Definition A workload W consists of –A set S of hierarchical range queries –A probability for each query in S (this probability can be obtained by monitoring and logging) Simple probabilities model

16
How The Histogram Works 1. A histogram H of array A[1,n] 2. Query 3. An expected answer 4. Left bucket such that 5. Right bucket such that 6. Calculate the precise total of the values in the buckets between the left and right buckets 7. Estimate the sums for the portions within the left and right buckets

17
How The Histogram Works – Cont. 8. The sum of A in the interval is estimated by –Uniformity assumption 9. The right bucket likewise

18
The Total Estimate The total estimate = left bucket estimation + right bucket estimation + exact sum for buckets in between
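
The estimation steps above can be sketched in a few lines. This is an illustrative sketch only: the names `bucket_averages` and `estimate_range_sum` are hypothetical, and the boundary convention (a list `[0, b1, ..., n]` where each bucket covers the 1-based positions lo+1..hi) is an assumption made here for concreteness. Multiplying the overlap length by the bucket average gives the exact sum for fully covered buckets and the uniformity-assumption estimate for the partially covered left and right buckets.

```python
from typing import List


def bucket_averages(A: List[float], boundaries: List[int]) -> List[float]:
    """Average of A over each bucket; A is a Python list holding A[1..n]."""
    return [sum(A[lo:hi]) / (hi - lo)
            for lo, hi in zip(boundaries, boundaries[1:])]


def estimate_range_sum(boundaries: List[int], avgs: List[float],
                       i: int, j: int) -> float:
    """Estimate A[i] + ... + A[j] (1-based, inclusive) from the histogram."""
    total = 0.0
    for lo, hi, avg in zip(boundaries, boundaries[1:], avgs):
        # The bucket covers positions lo+1..hi; count how many fall in [i, j].
        overlap = min(j, hi) - max(i, lo + 1) + 1
        if overlap > 0:
            total += overlap * avg  # uniformity assumption inside the bucket
    return total


A = [1, 3, 5, 7, 2, 2, 2, 2]
boundaries = [0, 4, 8]                  # two buckets: A[1..4] and A[5..8]
avgs = bucket_averages(A, boundaries)   # [4.0, 2.0]
approx = estimate_range_sum(boundaries, avgs, 3, 6)
```

Here the query [3,6] cuts both buckets, so the estimate is 2·4.0 + 2·2.0 = 12.0 while the true sum is 16. A query aligned with bucket boundaries, such as [1,8], is answered exactly.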

19
Determining the average Construct a prefix-sum array over A. Given a and b, return the average of A[a], …, A[b] in constant time
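
A minimal sketch of the prefix-sum idea, with hypothetical names `prefix_sums` and `range_average`. It assumes 1-based query indices and stores P[0] = 0 so that the sum of A[a..b] is P[b] - P[a-1].

```python
from itertools import accumulate
from typing import List


def prefix_sums(A: List[float]) -> List[float]:
    """P[k] = A[1] + ... + A[k], with P[0] = 0. Built once in O(n)."""
    return [0] + list(accumulate(A))


def range_average(P: List[float], a: int, b: int) -> float:
    """Average of A[a], ..., A[b] (1-based, inclusive) in O(1)."""
    return (P[b] - P[a - 1]) / (b - a + 1)


A = [1, 3, 5, 7, 2, 2, 2, 2]
P = prefix_sums(A)
avg = range_average(P, 3, 6)   # (5 + 7 + 2 + 2) / 4
```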

20
Optimal histogram definition The error of the range query is Given a histogram H and workload W the total expected error for estimating W is over all queries in W
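
The error formulas did not survive in this transcript. As a hedged illustration only, the sketch below assumes the per-query error is the squared difference between the exact and the estimated sum, weighted by the query probability; this matches the "total expected sum squared error" reported in the evaluation slides but is an assumption here. The name `total_expected_error` and the `estimate` callback are hypothetical.

```python
from typing import Callable, List, Tuple

Query = Tuple[int, int]  # 1-based inclusive range [i, j]


def total_expected_error(A: List[float],
                         workload: List[Tuple[Query, float]],
                         estimate: Callable[[int, int], float]) -> float:
    """Sum over the workload of probability * (exact - estimate)^2."""
    err = 0.0
    for (i, j), prob in workload:
        exact = sum(A[i - 1:j])
        err += prob * (exact - estimate(i, j)) ** 2
    return err


A = [1, 3, 5, 7]
overall_avg = sum(A) / len(A)  # a one-bucket histogram stores only this value


def one_bucket_estimate(i: int, j: int) -> float:
    # Uniformity assumption over a single bucket covering all of A.
    return (j - i + 1) * overall_avg


workload = [((1, 4), 0.5), ((1, 2), 0.25), ((3, 4), 0.25)]
error = total_expected_error(A, workload, one_bucket_estimate)
```

With one bucket the root query [1,4] is exact, while each half-range is off by 4, giving 0.25·16 + 0.25·16 = 8.0.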

21
Optimal Histogram Definition – Cont. Given W, an optimal histogram with B buckets of array A[1,n] is the histogram with at most B buckets that has the minimum total expected error for estimating workload W among all histograms with at most B buckets

22
Fast Histogram (FH) Construction for Hierarchical Range Queries Given an array A[1,n], B buckets and workload W E denotes the total expected error of the optimal histogram Find algorithms that construct HR histograms with an error at most E trading space and construction time

23
Layout Introduction Related Works Problem Definition Problem Solution –A Sparse Interval Set System –The Dynamic Programming algorithm Experimental Evaluation Conclusions

24
FH construction Constructing a set of “sparse intervals” –Increases the number of buckets –Any arbitrary interval can be represented Dynamic programming algorithm

25
A Sparse Interval System Given an integer l (with n = l^r) Level 1 points: 0, 1, 2, …, n Level 2 points: 0, l, 2l, …, n Level j+1 points: 0, l^j, 2·l^j, …, n Last (level r+1) points: 0, n

26
A Sparse Interval System The interval [0,n] is in the sparse system S Any pair of level j points lying between two adjacent level j+1 points defines an interval in S

27
A Sparse Interval System Example n=8 ; r=3 ; l=2 Level 1 points: 0 1 2 3 4 5 6 7 8 Level 2 points: 0 2 4 6 8 Level 3 points: 0 4 8 Level 4 points: 0 8

28
Sparse Interval System Properties Any interval over [0,n] can be written as a disjoint union of at most 2r intervals in the sparse system
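
The property can be exercised with a small sketch. It is not the authors' code: it assumes n = l^r and that the level j points are the multiples of l^(j-1), which agrees with the n=8, r=3, l=2 example. The names `in_sparse_system` and `decompose` are hypothetical. The decomposition finds the smallest level at which both endpoints share a block, emits the middle piece between the nearest points of that level, and recurses on the two leftover ends.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # interval between points a < b on [0, n]


def in_sparse_system(a: int, b: int, l: int, r: int) -> bool:
    """True if [a, b] joins two level-j points inside one level-(j+1) block."""
    if not a < b:
        return False
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        same_block = b <= (a // block + 1) * block
        if a % step == 0 and b % step == 0 and same_block:
            return True
    return False


def decompose(a: int, b: int, l: int, r: int) -> List[Interval]:
    """Split [a, b], with 0 <= a <= b <= l**r, into disjoint sparse intervals."""
    if a < 0:
        raise ValueError("a must be non-negative")
    if a >= b:
        return []
    for j in range(1, r + 1):
        block = l ** j
        if b <= (a // block + 1) * block:
            break  # smallest level whose block contains both endpoints
    else:
        raise ValueError("interval does not fit in [0, l**r]")
    step = l ** (j - 1)
    p = -(-a // step) * step   # smallest level-j point >= a
    q = (b // step) * step     # largest level-j point <= b
    pieces = decompose(a, p, l, r)
    if p < q:
        pieces.append((p, q))
    pieces += decompose(q, b, l, r)
    return pieces


pieces = decompose(1, 7, 2, 3)   # n = 8, l = 2, r = 3
```

For the interval [1,7] this yields [1,2], [2,4], [4,6], [6,7]: four sparse intervals, within the bound 2r = 6.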

29
Claim Any interval [0,x] can be expressed as a partition of at most r intervals from the sparse system

30
Claim Proof By induction on j. Induction hypothesis: any interval [0,x] where x ≤ l^j can be expressed as at most j intervals. Base case: true for j=1

31
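Claim Proof – Cont. Step j+1: consider [0,x] with x ≤ l^(j+1). We can write the interval as [0, t·l^j] and [t·l^j, x], where t is maximal. [0, t·l^j] is a valid interval in the sparse system (0 and t·l^j are level j+1 points lying between adjacent level j+2 points)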

32
Claim Proof – Cont. [t·l^j, x] is essentially similar to [0, x − t·l^j] since t is maximal. Therefore, by induction, it can be expressed by at most j intervals. Total: j+1. Since n = l^r, any interval [0,x] can be expressed as a union of at most r intervals

33
Observation Any interval [a,b] can be expressed as at most 2r intervals, by cutting it at a point of the form t·l^j with maximum j. By symmetry, each of the two pieces can be expressed as a disjoint union of at most r intervals from the sparse system

34
Lemma In a sparse set system with parameter r, the number of intervals containing a point is at most

35
Lemma Proof Consider the level 1 intervals There are at most such intervals that contain a specific point –There are l points between adjacent points of level 2 –l points can create at most intervals Level j intervals behave on level j points the same as level 1 points on the original points Extend to r levels … (the (r+1)-th level adds one more interval)
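
The bound in the lemma is missing from this transcript, so the sketch below only counts empirically. It assumes n = l^r, level j points at multiples of l^(j-1), and that an interval between points a < b contains array element x when a < x ≤ b; these conventions and the names `sparse_intervals` and `max_containment` are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]


def sparse_intervals(l: int, r: int) -> Set[Interval]:
    """Enumerate the sparse system on [0, l**r] as a set of (a, b) pairs."""
    n = l ** r
    system: Set[Interval] = set()
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        for start in range(0, n, block):
            # Level-j points inside one block bounded by level-(j+1) points.
            pts = range(start, start + block + 1, step)
            system.update((a, b) for a in pts for b in pts if a < b)
    return system


def max_containment(intervals: Iterable[Interval], n: int) -> int:
    """Largest number of intervals that contain any single element 1..n."""
    ivs = list(intervals)
    return max(sum(1 for a, b in ivs if a < x <= b) for x in range(1, n + 1))


n = 8
S = sparse_intervals(2, 3)
all_intervals = [(a, b) for a in range(n + 1) for b in range(a + 1, n + 1)]
sparse_max = max_containment(S, n)
dense_max = max_containment(all_intervals, n)
```

For n = 8 the sparse system has 15 intervals and no element lies in more than 4 of them, whereas with all 36 intervals allowed an element can lie in as many as 20.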

36
Layout Introduction Related Works Problem Definition Problem Solution –A Sparse Interval Set System –The Dynamic Programming algorithm Experimental Evaluation Conclusions

37
Hierarchy Representation By a Tree Ranges define a hierarchy based on the inclusion relationship T is a hierarchy representation by a tree –Each node v of T is associated with a range –The weight is –The error is

38
Representation By a Binary Tree We allow If a node had more than two children, transform it into a node with two children – – a new node with weight 0 The size of the tree increases only by a factor of 2 So assume that the tree is binary
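
One way to carry out such a transformation is sketched below. The details of the slide's construction are lost in this transcript, so this is an assumption-laden illustration: the `Node` class, the `binarize` function, and the choice to give each helper node the span from its left child's start to its right child's end are all hypothetical. Helper nodes carry weight 0, so the total workload weight is unchanged.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    lo: int
    hi: int
    weight: float
    children: List["Node"] = field(default_factory=list)


def binarize(node: Node) -> Node:
    """Rewrite the tree in place so every node has at most two children.

    Assumes children are listed in left-to-right order of their ranges.
    """
    kids = [binarize(c) for c in node.children]
    while len(kids) > 2:
        right = kids.pop()
        left = kids.pop()
        # Zero-weight helper node grouping the two rightmost children.
        kids.append(Node(left.lo, right.hi, 0.0, [left, right]))
    node.children = kids
    return node


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children)


def total_weight(node: Node) -> float:
    return node.weight + sum(total_weight(c) for c in node.children)


def max_fanout(node: Node) -> int:
    return max([len(node.children)] + [max_fanout(c) for c in node.children])


root = Node(1, 8, 0.25, [Node(1, 2, 0.25), Node(3, 4, 0.25),
                         Node(5, 6, 0.125), Node(7, 8, 0.125)])
before = count_nodes(root)
binarize(root)
after = count_nodes(root)
```

A node with k > 2 children gains k - 2 helpers, so the node count at most doubles; in this example it grows from 5 to 7.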

39
Dynamic Programming Algorithm - FH Best(v,left,right,p) denotes the smallest error of the range v – tree node associated with left – overlapping interval on the left right – overlapping interval on the right v contains p intervals completely Formally, left contains and right contains

40
FH stages Let the children of v be y and z with ranges and Cases (a) + (b)

41
For all possible intervals I that contain and,compute In the case that I finishes on

42
Cases (a)+(b)

43
FH stages – Cont. Return When interval I is fixed, and are automatically defined and can be counted in O(1) time.

44
Time complexity Time spent evaluating cost(I) is O(p) The running time depends on the number of choices of interval I Let C(S) be the maximum number of intervals in an interval system S that contain a particular element ( ) If all intervals are allowed then

45
Time complexity – Cont. The running time of the algorithm FH is The number of entries for each tree node v is –Since there are C(S)+1 intervals for choices of left (all intervals that contain and ). Similarly for right Work for every tree node

46
Time complexity – Cont. Total work including preprocessing is When S is a set of all possible intervals The result matches the time complexity of the previous algorithm (for arbitrary intervals)

47
Time Complexity For a Sparse Set System S – a sparse set system with parameter r Run FH with 2r(B-1) buckets Error – less than or equal to that of the original B-bucket histogram –A histogram with B buckets can be expressed as a histogram with 2r(B-1) buckets in the sparse system

48
Time Complexity – Cont. Set –In time we can construct a solution with buckets whose error is at most the error of any solution of the original problem with B buckets

49
Some Notes Get alternate tradeoffs by constructing different sparse set system –Complete binary tree on [0,n] –Allow intervals such that one end point is an ancestor of the other –Any arbitrary interval can be expressed as a disjoint union of two intervals from the sparse set –C(S) = O(n) –Solution with 2B buckets in time

50
Experimental Evaluation FH was implemented with r=6 Compared to an algorithm A0 presented by Gilbert et al. –Optimized for arbitrary range queries –For a data series of length n to be approximated with B buckets, constructs a histogram consisting of 2B buckets in time –The only known algorithm with reasonable complexity

51
Description of Data Sets A: A real data set of length 1000 extracted from an AT&T operational warehouse B: A synthetic data set of length 2000, Zipf-distributed with skew parameter 0.5 C: A synthetic data set of varying length, representing samples from a Gaussian distribution with mean and variance 250

52
Workload Description A normal distribution is used to assign the probabilities to a full hierarchy, followed by normalization to obtain a probability distribution W1 – generated by sampling N(10,10) W2 – generated by sampling N(10,50)

53
Performance Evaluation Accuracy and construction time Parameters –Total space allowed for histogram –Total size of the data set

54
Computing Accuracy Ask 1000 queries Report the total expected sum squared error of the workload execution on the histogram

55
Results for Data Set A

56
Results for Set A – Cont. The accuracy of FH is superior to A0 FH is more accurate for smaller variance (W1) As the variance increases, the workload gets closer to uniform (the case A0 is optimized for) A0 is linear in the space FH is better in construction time for the same range of space

57
Results for Data Set B

58
Results for Set B – Cont. Similar to A Accuracy improves much faster with space –since the distribution is Zipf The savings in construction time for FH are dramatic – since data set B is twice the size of A

59
Results for Data Set C

60
Results for Set C – Cont. Data set size increases (x axis) with total space fixed at 20 A0 has a plateau –Due to the way the data is generated in the experiment (Gaussian tail) Quadratic trend in construction time for A0 FH – near-linear increase in construction time

61
Conclusions The first practical approach to the problem of constructing hierarchical range histograms The dynamic programming algorithms effectively trade space and construction time without compromising histogram accuracy A novel notion of sparse intervals

62
Future plans A formal study of the dynamic properties of hierarchical range histograms How should one modify these histograms under data or workload modifications?

63
The END Thanks for listening
