Dense-Region Based Compact Data Cube


1 Dense-Region Based Compact Data Cube
Presented by: Kan Kin Fai

2 Outline
Background
Introduction to Compact Data Cube
Pros and cons of the Compact Data Cube method
Dense-Region Based Compact Data Cube

3 Background
What is a data cube? A set of pre-computed aggregates over the underlying data warehouse.
System constraints on materializing data cube(s): disk space, maintenance cost, etc.
Common approach: materialize only parts of a data cube.
Alternative: use an approximation technique. Reason: OLAP applications accept approximate answers in many scenarios.

4 Introduction to Compact Data Cube
The Compact Data Cube was proposed by Vitter and Wang in "Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets" (SIGMOD 99).
Main ideas:
Offline phase: perform a Haar wavelet transform on the underlying data (i.e. the base cuboid) and store the k most significant coefficients.
Online phase: process any given query using only the k most significant coefficients.

5 Introduction to Compact Data Cube
Basics of Haar wavelet transform
Building Compact Data Cube
Thresholding and Ranking
Answering On-Line Queries

6 Introduction to Compact Data Cube
Basics of Haar wavelet transform, e.g. S = [2, 2, 0, 2, 3, 5, 4, 4]
Resolution  Averages                  Detail Coefficients
8           [2, 2, 0, 2, 3, 5, 4, 4]
4           [2, 1, 4, 4]              [0, -1, -1, 0]
2           [1.5, 4]                  [0.5, 0]
1           [2.75]                    [-1.25]
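The decomposition in the example can be reproduced with a few lines of Python. This is a minimal sketch of the unnormalized transform only, assuming pairwise averages (a + b) / 2 and pairwise differences (a - b) / 2, which is what the numbers on the slide imply; the function name haar_decompose is a hypothetical helper, not something from the presentation.

```python
def haar_decompose(signal):
    """Return (overall_average, detail_levels) for a power-of-two length signal.

    detail_levels[0] holds the coarsest detail coefficients and the last
    entry holds the finest ones (hypothetical helper for illustration).
    """
    if len(signal) == 0 or len(signal) & (len(signal) - 1):
        raise ValueError("signal length must be a power of two")
    averages = list(signal)
    detail_levels = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs]   # half the pairwise difference
        averages = [(a + b) / 2 for a, b in pairs]  # pairwise average
        detail_levels.insert(0, details)            # coarser levels go in front
    return averages[0], detail_levels


avg, details = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
print(avg)      # 2.75
print(details)  # [[-1.25], [0.5, 0.0], [0.0, -1.0, -1.0, 0.0]]
```

The overall average plus the seven detail coefficients are enough to rebuild the original eight values, since each pair is recovered as average + detail and average - detail.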

7 Introduction to Compact Data Cube
Basics of Haar wavelet transform
For compression reasons, the detail coefficients are normalized: the coefficients at the lower resolutions are weighted more heavily.
The original signal is approximated by keeping only the most significant coefficients.
Requires only O(N) CPU time and O(N/B) I/Os to compute for a signal of N values.
Multidimensional wavelet transform: a series of one-dimensional wavelet transforms.
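To illustrate the last point, here is a small sketch of a two-dimensional transform built from one-dimensional ones. It assumes the full 1-D transform is applied to every row and then to every column, and it leaves the coefficients unnormalized; the slides do not say which multidimensional variant or normalization is used, so both are assumptions, and haar_1d and haar_2d are hypothetical names.

```python
def haar_1d(values):
    """Unnormalized 1-D Haar transform of a power-of-two length sequence.

    Output layout: [overall average, coarsest detail, ..., finest details].
    """
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        dets = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avgs + dets  # averages first, then this level's details
        n = half
    return out


def haar_2d(matrix):
    """Apply haar_1d along every row, then along every column (assumed variant)."""
    rows = [haar_1d(row) for row in matrix]
    cols = [haar_1d(col) for col in zip(*rows)]  # transpose, transform columns
    return [list(row) for row in zip(*cols)]     # transpose back


print(haar_1d([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(haar_2d([[1, 3], [5, 7]]))
# [[4.0, -1.0], [-2.0, 0.0]]
```

In the 2-D result the top-left entry, 4.0, is the average of all four cells, mirroring the role of the overall average in the 1-D case.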

8 Introduction to Compact Data Cube
Building the Compact Data Cube
Problem 1: the multidimensional array representing the underlying data is too large (assume the data are very sparse).
Solution: divide the wavelet transform process into multiple passes.

9 Introduction to Compact Data Cube
Building the Compact Data Cube
Problem 2: the density of the intermediate results increases from pass to pass.
Solution: truncate the intermediate multidimensional array by cutting off entries with small magnitude.
I/O complexity:

10 Introduction to Compact Data Cube
Thresholding and Ranking
Choice 1: keep the C largest (in absolute value) wavelet coefficients.
Choice 2: keep the C wavelet coefficients with the largest weights among the C’ largest coefficients (C < C’). The weight of a coefficient equals the number of its dimensions with value zero.
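The two choices can be sketched as follows. The sketch assumes coefficients are stored in a dictionary keyed by their multidimensional index, reads "dimensions with value zero" as index components equal to zero, and breaks weight ties by magnitude; the tie-breaking rule and all function names are assumptions made for illustration.

```python
def keep_largest(coeffs, c):
    """Choice 1: keep the c coefficients with the largest absolute value."""
    ranked = sorted(coeffs.items(), key=lambda item: abs(item[1]), reverse=True)
    return dict(ranked[:c])


def weight(index):
    """Weight of a coefficient: number of zero components in its index."""
    return sum(1 for component in index if component == 0)


def keep_heaviest(coeffs, c, c_prime):
    """Choice 2: among the c_prime largest coefficients, keep the c heaviest.

    Ties in weight are broken by magnitude (an assumption, not from the slides).
    """
    if not c < c_prime:
        raise ValueError("expected c < c_prime")
    candidates = keep_largest(coeffs, c_prime)
    ranked = sorted(
        candidates.items(),
        key=lambda item: (weight(item[0]), abs(item[1])),
        reverse=True,
    )
    return dict(ranked[:c])


coeffs = {(0, 0): 4.0, (0, 1): -1.0, (1, 0): -2.0, (1, 1): 3.0, (2, 3): 2.5}
print(keep_largest(coeffs, 2))      # {(0, 0): 4.0, (1, 1): 3.0}
print(keep_heaviest(coeffs, 2, 4))  # {(0, 0): 4.0, (1, 0): -2.0}
```

In this toy input the second choice prefers (1, 0) over the larger (1, 1) because its index has one zero component, so it ranks higher by weight.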

11 Introduction to Compact Data Cube
Answering On-Line Queries Space: ((d+1)k), CPU time:
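As a rough illustration of the online phase, the sketch below answers a one-dimensional range-sum query directly from a set of retained coefficients, without rebuilding the signal. It assumes the unnormalized Haar transform from the earlier example and a flat layout in which position 0 is the overall average and position 2**j + i is the i-th detail coefficient at level j; the d-dimensional algorithm and its cost bounds are not reproduced here, and overlap and range_sum are hypothetical names.

```python
def overlap(lo, hi, start, end):
    """Number of integers lying in both [lo, hi] and [start, end)."""
    return max(0, min(hi + 1, end) - max(lo, start))


def range_sum(n, kept, lo, hi):
    """Approximate the sum of cells lo..hi (inclusive) of a length-n signal.

    kept maps a flat coefficient position to its value; missing positions
    are treated as zero, which is what dropping a coefficient means.
    """
    total = 0.0
    for pos, value in kept.items():
        if pos == 0:
            # The overall average contributes once to every cell in the range.
            total += value * overlap(lo, hi, 0, n)
            continue
        level = pos.bit_length() - 1
        offset = pos - (1 << level)
        width = n >> level                 # number of cells this coefficient spans
        start = offset * width
        mid = start + width // 2
        # A detail coefficient adds to its left half and subtracts from its right.
        total += value * (overlap(lo, hi, start, mid)
                          - overlap(lo, hi, mid, start + width))
    return total


full = dict(enumerate([2.75, -1.25, 0.5, 0, 0, -1, -1, 0]))
print(range_sum(8, full, 2, 5))   # 10.0, the exact sum 0 + 2 + 3 + 5

top3 = {0: 2.75, 1: -1.25, 5: -1.0}
print(range_sum(8, top3, 2, 2))   # 0.5, an approximation of the true value 0
```

Each retained coefficient is visited once, so the work per query grows with the number of stored coefficients rather than with the size of the base cuboid.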

12 Pros and cons of the compact data cube method
Pros:
Requires little disk space (a small number of disk blocks).
Responds to on-line queries quickly.
Answers OLAP queries more accurately than other approximation techniques such as histograms and random sampling.
Can progressively refine the approximate answer with no added overhead.

13 Pros and cons of the compact data cube method
Cons:
Approximates a vast number of useless empty cells in the base cuboid together with the useful non-empty cells.
Needs to cut off entries with small magnitude at the end of each pass in order to keep the number of I/O operations constant from pass to pass.

14 Dense-Region Based Compact Data Cube
Aim: to enhance the ability of the compact data cube method to handle datasets with the dense-regions-in-sparse-cube property.
Main idea: exclude empty cells in the base cuboid from approximation.
Two-phase approach:
Compute the dense regions in the base cuboid.
Approximate each dense region independently.

15 Dense-Region Based Compact Data Cube
Question 1: how can we find the dense regions efficiently?
Use the Efficient DEnse region Mining (EDEM) algorithm proposed by Cheung et al. in "DROLAP -- A Dense-Region-Based Approach to On-line Analytical Processing" (DEXA 99).

16 Dense-Region Based Compact Data Cube
Basic ideas of EDEM:
Build a k-d tree to store the valid cells.
Grow dense region covers along boundaries.
Search for dense regions among the covers.
Complexity of EDEM: linear in the number of dimensions and sub-quadratic in the number of data points.

17 Dense-Region Based Compact Data Cube
Question 2: how should we allocate disk space in approximating the dense regions?
Choice 1: allocate disk space equally among the dense regions.
Choice 2: allocate disk space in proportion to the sizes of the dense regions.
Choice 3: order the wavelet coefficients of all the dense regions together and keep the most significant ones (in absolute value).
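The three allocation choices can be compared with a short sketch. It measures the disk budget as a number of stored coefficients and hands out rounding leftovers in a fixed order; both are assumptions, as are the function names, since the slides only state the choices.

```python
def allocate_equally(region_sizes, budget):
    """Choice 1: the same number of coefficients for every dense region."""
    share, extra = divmod(budget, len(region_sizes))
    # The first `extra` regions absorb the remainder (arbitrary assumption).
    return [share + (1 if i < extra else 0) for i in range(len(region_sizes))]


def allocate_by_size(region_sizes, budget):
    """Choice 2: coefficients in proportion to each region's cell count."""
    total = sum(region_sizes)
    shares = [budget * size // total for size in region_sizes]
    leftover = budget - sum(shares)
    # Rounding leftovers go to the largest regions first (assumption).
    by_size = sorted(range(len(region_sizes)),
                     key=lambda i: region_sizes[i], reverse=True)
    for i in by_size[:leftover]:
        shares[i] += 1
    return shares


def allocate_globally(region_coeffs, budget):
    """Choice 3: pool all coefficients and keep the `budget` largest overall."""
    pooled = [(abs(value), region, key)
              for region, coeffs in enumerate(region_coeffs)
              for key, value in coeffs.items()]
    pooled.sort(key=lambda entry: entry[0], reverse=True)
    kept = [dict() for _ in region_coeffs]
    for _, region, key in pooled[:budget]:
        kept[region][key] = region_coeffs[region][key]
    return kept


print(allocate_equally([100, 300, 600], 10))   # [4, 3, 3]
print(allocate_by_size([100, 300, 600], 10))   # [1, 3, 6]
print(allocate_globally([{0: 5.0, 1: -0.5}, {0: 2.0, 1: -3.0, 2: 0.1}], 3))
# [{0: 5.0}, {1: -3.0, 0: 2.0}]
```

Under the third choice a region can end up with very few coefficients if its values are small relative to those of the other regions, which is the trade-off against the first two choices.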

18 Dense-Region Based Compact Data Cube
Question 3: how should we treat the data points outside the dense regions? Keep all of them, or keep only the significant ones.
Question 4: how do we answer on-line queries using the dense-region based approach?
Check whether each dense region is covered by the given query.
For each such region, check whether the stored coefficients contribute to the range sum and, if so, compute their contribution.
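One way to picture the query procedure is the sketch below. It assumes each dense region is an axis-aligned box with inclusive bounds, reads "covered" as "overlaps the query", and treats the per-region wavelet estimate as a black-box function; the estimate callback, the outlier dictionary, and all names are hypothetical stand-ins rather than the presenter's design.

```python
def intersect(box_a, box_b):
    """Intersection of two boxes given as [(low, high), ...]; None if disjoint."""
    result = []
    for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b):
        lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
        if lo > hi:
            return None
        result.append((lo, hi))
    return result


def cell_count(box):
    """Number of cells inside a box with inclusive bounds."""
    count = 1
    for lo, hi in box:
        count *= hi - lo + 1
    return count


def answer_range_sum(query, regions, outliers):
    """Sum the estimates of overlapping dense regions plus stored outlier cells.

    regions  : list of (box, estimate) pairs, where estimate(sub_box) returns
               the approximate sum inside sub_box (hypothetical callback).
    outliers : dict mapping a cell outside every dense region to its value.
    """
    total = 0.0
    for box, estimate in regions:
        part = intersect(query, box)
        if part is not None:          # skip regions the query does not touch
            total += estimate(part)
    for cell, value in outliers.items():
        if all(lo <= x <= hi for x, (lo, hi) in zip(cell, query)):
            total += value
    return total


# Toy stand-in: pretend every cell of a dense region holds the value 2.0.
regions = [
    ([(0, 3), (0, 3)], lambda part: 2.0 * cell_count(part)),
    ([(10, 12), (10, 12)], lambda part: 2.0 * cell_count(part)),
]
outliers = {(5, 2): 7.0, (9, 9): 1.0}
print(answer_range_sum([(2, 5), (1, 2)], regions, outliers))  # 15.0
```

In a real implementation the callback would evaluate the stored wavelet coefficients of that region over the clipped sub-box instead of a constant density.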

19 Dense-Region Based Compact Data Cube
One favorable side effect: the construction of the compact data cube can be parallelized.
More questions:
How can we handle updates to the underlying data?
How can we approximate an iceberg cube? Can we apply the idea of the compact data cube to an iceberg cube?
Can the compact data cube be used to answer other types of OLAP queries besides range-sum?

