Dense-Region Based Compact Data Cube

Dense-Region Based Compact Data Cube
Presented by: Kan Kin Fai

Outline Background Introduction to Compact Data Cube
Pros and cons of the Compact Data Cube method Dense-Region Based Compact Data Cube

Background Why is a data cube?
Some pre-computed aggregates on the underlying data warehouse. System constraints on materializing data cube(s) Disk space, maintenance cost, etc. Common approach: materialize parts of a data cube. Alternative: use approximation technique Reason: OLAP applications accept approximate answers in many scenarios.

Introduction to Compact Data Cube
Compact Data Cube was proposed by Vitter and Wang in Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets (SIGMOD 99). Main Ideas Offline phase: perform Haar wavelet transform on the underlying data (i.e. the base cuboid) and store the k most significant coefficients. Online phase: process any given query based on the k most significant coefficients.

Basics of Haar wavelet transform Building Compact Data Cube Thresholding and Ranking Answering On-Line Queries

Basics of Haar wavelet transform e.g. S = [2, 2, 0, 2, 3, 5, 4, 4] Resolution Averages Detail Coefficients 8 [2, 2, 0, 2, 3, 5, 4, 4] 4 [2, 1, 4, 4] [0, -1, -1, 0] 2 [1.5, 4] [0.5, 0] 1 [2.75] [-1.25]

Basics of Haar wavelet transform For compression reasons, the detail coefficients are normalized. The coefficients at the lower resolutions are weighted more heavily. Approximates the original signal by keeping only the most significant coefficients. Requires only O(N) CPU time and O(N/B) I/Os to compute for a signal of N values. Multidimensional wavelet transform: a series of one-dimensional wavelet transforms.

Building the Compact Data Cube Problem 1: the size of the multidimensional array representing the underlying data is too large (assume the data are very sparse). Solution: Divide the wavelet transform process into multiple passes.

Building the Compact Data Cube Problem 2: The density of the intermediate results would increase from pass to pass. Solution: truncate the intermediate multidimensional array by cutting off entries with small magnitude. I/O complexity:

Thresholding and Ranking Choice 1: keep the C largest (in absolute value) wavelet coefficients. Choice 2: keep the C wavelet coefficients with the largest weights among the C’ largest coefficients (C < C’). The weight of a coefficient equals to the number of its dimensions with value zero.

Answering On-Line Queries Space: ((d+1)k), CPU time:

Pros and cons of the compact data cube method
Requires little disk spaces (a small number of disk blocks). Responds to on-line query fast. Answers OLAP queries more accurately than other approximation techniques like histogram and random sampling. Can progressively refine the approximate answer with no added overhead.

Pros and cons of the compact data cube method
Approximates a vast amount of useless empty cells in base cuboid together with useful non-empty cells in base cuboid. Needs to cut off entries with small magnitude at the end of each pass in order to maintain a constant amount of I/O operations from pass to pass.

Aim: To enhance the ability of the compact data cube method to handle datasets having dense-regions-in-sparse-cube property. Main Idea: To exclude empty cells in base cuboid from approximation. Two-phase approach Compute dense regions in base cuboid. Approximate each dense region independently.

Question 1: how can we find the dense regions efficiently? Efficient DEnse region Mining (EDEM) algorithm proposed by Cheung et al. in DROLAP -- A Dense-Region-Based Approach to On-line Analytical Processing (DEXA99)

Basic ideas of EDEM: Build a k-d tree to store the valid cells. Grow dense region covers along boundaries. Search dense regions among the covers. Complexity of EDEM: linear to the number of dimensions and sub-quadratic to the number of data points.

Question 2: how should we allocate disk space in approximating the dense regions? Choice 1: allocate disk space equally to each dense regions. Choice 2: allocate disk space according to the sizes of dense regions. Choice 3: order the wavelet coefficients of all the dense regions and keep the most significant ones (in absolute value).

Question 3: how should we treat the data points outside the dense regions? Keep all or keep only significant ones. Question 4: how do we answer on-line queries using the dense-region based approach? Check if a dense region covered by the given query. Check if the stored coefficients contribute to the range sum and compute the amount of contribution if needed.

One favorable side effect: we may parallelize the construction of compact data cube. More questions: How can we handle updates to the underlying data? How can we approximate iceberg cube? Can we apply the idea of compact data cube to iceberg cube? Can compact data cube be used to answer other types of OLAP queries besides range-sum?

Dense-Region Based Compact Data Cube

Similar presentations

Presentation on theme: "Dense-Region Based Compact Data Cube"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Dense-Region Based Compact Data Cube

Similar presentations

Presentation on theme: "Dense-Region Based Compact Data Cube"— Presentation transcript:

Similar presentations

About project

Feedback