Published by Malcolm Rench. Modified about 1 year ago.

1
Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc)
Presented at the 26th VLDB Conference, Cairo, Egypt
Presented by Supriya Sudheendra

2
Outline

3
Introduction
o Approximate query processing is a viable solution for:
    - Huge amounts of data
    - High query complexity
    - Stringent response-time requirements
o Decision Support Systems (DSS)
    - Support business and organizational decision-making activities
    - Help decision makers compile useful information from raw data, solve problems, and make decisions

4
Introduction…
o DSS users pose very complex queries to the DBMS
    - These require complex operations over gigabytes or terabytes of disk-resident data
    - Exact answers take a very long time to compute
    - In a number of scenarios, users prefer a fast, approximate answer

5
Prior Work
o Previous approximate query processing techniques
    - Focused on specific forms of aggregate queries
    - Data reduction mechanism: how to obtain the synopses of the data
o Sampling-based techniques
    - A join operator applied to 2 uniform random samples yields a non-uniform sample of the join result that contains very few tuples
    - For non-aggregate queries, sampling produces a small subset of the exact answer, which may be empty when joins are involved

6
Prior Work…
o Histogram-based techniques
    - Problematic for high-dimensional data
    - Storage overhead
    - High construction cost
o Wavelet-based techniques
    - Wavelets are a mathematical tool for the hierarchical decomposition of functions
    - Applying a wavelet decomposition to the input data collection yields a data synopsis
    - Avoids the high construction costs and storage overhead

7
Contribution of the Paper
o Viability and effectiveness of wavelets as a generic tool for high-dimensional DSS
o New, I/O-efficient wavelet decomposition algorithm for relational tables
o Novel query processing algebra for wavelet-coefficient data synopses
o Extensive experiments

8
Background
o Wavelets: a mathematical tool to hierarchically decompose functions
o A function is represented by a coarse overall approximation together with detail coefficients that influence the function at various scales
o Haar wavelets are conceptually simple and fast to compute
o Used in a variety of applications, such as image editing and querying

9
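One-Dimensional Haar Wavelets
o How to compute, given a data array:
    - Average the values together pairwise to get a “lower-resolution” representation of the data
    - Detail coefficients: the differences of the (second of the) averaged values from the computed pairwise average
    - Repeat on the averages until a single overall average remains
    - Reconstruction of the data array is possible
o Why detail coefficients? Averaging alone loses information; the detail coefficients are what make exact reconstruction possible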

10
One-dimensional Haar Wavelets
o Wavelet transform: the overall average followed by the detail coefficients in increasing order of resolution
    - Each entry is called a wavelet coefficient
o $W_A = [4, -2, 0, -1]$
o For vectors containing similar values, most detail coefficients have small values that can be eliminated
    - Doing so introduces only small errors
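The pairwise averaging and differencing described on the last two slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are made up for this example, and the input array [2, 2, 5, 7] is an assumption chosen because it reproduces the transform [4, -2, 0, -1] shown on the slide. The input length is assumed to be a power of two.

```python
def haar_decompose(data):
    """Return [overall average, detail coefficients from coarsest to finest]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = [float(x) for x in data]
    details = []  # details of coarser levels are prepended as we go
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # detail = pairwise average minus the second value = (a - b) / 2
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose by expanding each average with its detail."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level_details = coeffs[pos:pos + len(values)]
        pos += len(values)
        # an average m with detail c splits back into the pair (m + c, m - c)
        values = [x for m, c in zip(values, level_details) for x in (m + c, m - c)]
    return values


w = haar_decompose([2, 2, 5, 7])       # assumed example input
restored = haar_reconstruct(w)
approx = haar_reconstruct([4, -2, 0, 0])  # smallest detail dropped
```

Dropping the last detail coefficient in this example changes only the last two cells, which illustrates why eliminating small coefficients introduces only small errors.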

11
One-dimensional Haar Wavelets
o The overall average is more important than any detail coefficient
o To normalize the final entries of $W_A$, each wavelet coefficient is divided by $\sqrt{2^l}$
    - $l$: level of resolution at which the coefficient appears
    - $W_A = [4, -2, 0, -1/\sqrt{2}]$
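As a rough sketch of the normalization step, the snippet below divides each coefficient by the square root of 2 raised to its resolution level, assuming level 0 is the coarsest level and that the overall average and the first detail coefficient both sit at level 0. The helper names and the `budget` parameter are hypothetical; keeping the coefficients with the largest normalized magnitude is one common way to decide which small coefficients to eliminate.

```python
import math


def resolution_level(index):
    """Level of the coefficient at `index` in [average, details coarse-to-fine]."""
    # index 0 (overall average) and index 1 share the coarsest level 0
    return 0 if index == 0 else index.bit_length() - 1


def normalize(coeffs):
    """Divide each coefficient by sqrt(2^l), l being its resolution level."""
    return [c / math.sqrt(2 ** resolution_level(i)) for i, c in enumerate(coeffs)]


def keep_largest(coeffs, budget):
    """Zero all but the `budget` coefficients of largest normalized magnitude."""
    norm = normalize(coeffs)
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(norm[i]), reverse=True)
    kept = set(ranked[:budget])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


normalized = normalize([4, -2, 0, -1])
synopsis = keep_largest([4, -2, 0, -1], budget=2)
```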

12
Multi-dimensional Haar Wavelets
o Haar wavelets can be extended to multi-dimensional arrays
o Standard decomposition
    - Fix an ordering for the data dimensions (1, 2, …, d)
    - For each dimension k in that order, apply the complete 1-D wavelet transform to each 1-D row of array cells along dimension k
o Nonstandard decomposition
    - Alternates between dimensions during successive steps of pairwise averaging and differencing: one step is performed for each 1-D row of array cells along dimension k, for each k
    - The process is then repeated recursively on the quadrant containing the averages across all dimensions
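For two dimensions, the standard decomposition can be sketched as a full 1-D transform of every row followed by a full 1-D transform of every column of the result. The code below is an illustrative sketch only: the function names are invented, the choice to process the second index first is an arbitrary ordering, and both side lengths are assumed to be powers of two.

```python
def haar_1d(row):
    """Complete 1-D Haar transform: [average, details coarse-to-fine]."""
    n = len(row)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    avgs, details = [float(x) for x in row], []
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        avgs = [(a + b) / 2 for a, b in pairs]
    return avgs + details


def standard_decompose_2d(array):
    """Standard decomposition: full 1-D transform along each dimension in turn."""
    rows_done = [haar_1d(row) for row in array]            # along the second index
    cols_done = [haar_1d(col) for col in zip(*rows_done)]  # along the first index
    return [list(r) for r in zip(*cols_done)]              # transpose back


w2 = standard_decompose_2d([[1, 3], [5, 7]])
```

In this small example `w2[0][0]` is 4, the average of all four cells.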

13
Non-standard Decomposition
o Pairwise averaging and differencing for one positioning of the 2x2 box with root $[2i_1, 2i_2]$
o Distribution of the results in the wavelet transform array
o The process is recursed on the lower-left quadrant of $W_A$
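The 2x2-box step and the recursion on the quadrant of averages can be sketched as follows for a square two-dimensional array. This is an assumption-laden illustration rather than the paper's algorithm: the function name, the sign conventions of the three detail coefficients, and their placement in the output quadrants are choices made for this example. Index [0][0] plays the role of the lower-left corner of the slide's figure.

```python
def nonstandard_decompose_2d(array):
    """Nonstandard Haar decomposition of a square 2^m x 2^m list of lists."""
    n = len(array)
    if n == 0 or n & (n - 1) or any(len(row) != n for row in array):
        raise ValueError("expected a square array with power-of-two side")
    w = [[float(x) for x in row] for row in array]
    size = n
    while size > 1:
        half = size // 2
        out = [row[:] for row in w]  # keep details from earlier (finer) passes
        for i in range(half):
            for j in range(half):
                # the 2x2 box with root [2i, 2j]
                a, b = w[2 * i][2 * j], w[2 * i][2 * j + 1]
                c, d = w[2 * i + 1][2 * j], w[2 * i + 1][2 * j + 1]
                out[i][j] = (a + b + c + d) / 4                # average, recursed on
                out[i][half + j] = (a - b + c - d) / 4         # detail along 2nd index
                out[half + i][j] = (a + b - c - d) / 4         # detail along 1st index
                out[half + i][half + j] = (a - b - c + d) / 4  # diagonal detail
        w = out
        size = half  # next pass works only on the quadrant of averages
    return w


grid = [[4 * r + c for c in range(4)] for r in range(4)]
w_grid = nonstandard_decompose_2d(grid)
```

After the final pass, `w_grid[0][0]` holds the overall average of the array.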

14
Example Decomposition of a 4 X 4 Array

15
Multi-dimensional Haar Coefficients: Semantics and Representation
o The d-dimensional Haar basis function corresponding to a coefficient w is defined by:
    - A d-dimensional rectangular support region
    - Quadrant sign information

16
Support Regions for 16 Nonstandard 2-D Haar Basis Functions
o Blank areas: regions of A whose reconstruction is independent of the coefficient
o W_A[0,0]: overall average
o W_A[3,3]: contributes only to the upper-right quadrant

17
Haar Coefficients: Semantics and Representation
o Each coefficient is represented as a triple W = <R, S, v>
o W.R: d-dimensional support hyper-rectangle of W; encloses all cells in A to which W contributes
    - The hyper-rectangle is represented by its low and high boundaries across each dimension j, 1 <= j <= d: W.R.boundary[j].lo and W.R.boundary[j].hi
    - W contributes to each data cell A[i_1, …, i_d] where W.R.boundary[j].lo <= i_j <= W.R.boundary[j].hi for all j

18
o W.S: sign information for all d-dimensional quadrants of W.R
    - Stored as W.S.sign[j].lo and W.S.sign[j].hi, the signs for the lower and upper half of W.R’s extent along dimension j
    - The sign of a quadrant is computed as the product of the d sign-vector entries that map to that quadrant
o W.v: scalar magnitude of W
    - The quantity that W contributes (with the sign given by W.S) to all data array cells enclosed in W.R
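One way to picture the support rectangle, sign vectors, and magnitude is as a small record type with a method that returns the coefficient's signed contribution to a cell. The sketch below is hypothetical: the class and field names are invented to mirror W.R, W.S and W.v, and it assumes the lower half of an extent [lo, hi] ends at the midpoint (lo + hi) // 2. The four example coefficients encode the 1-D transform [4, -2, 0, -1] over an assumed four-cell array.

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple


@dataclass(frozen=True)
class HaarCoefficient:
    lo: Tuple[int, ...]       # W.R.boundary[j].lo for each dimension j
    hi: Tuple[int, ...]       # W.R.boundary[j].hi
    sign_lo: Tuple[int, ...]  # W.S.sign[j].lo, each +1 or -1
    sign_hi: Tuple[int, ...]  # W.S.sign[j].hi, each +1 or -1
    v: float                  # W.v

    def contribution(self, cell):
        """Signed amount this coefficient adds to A[cell]; 0 outside W.R."""
        if any(not (l <= i <= h) for i, l, h in zip(cell, self.lo, self.hi)):
            return 0.0
        signs = []
        for i, l, h, s_lo, s_hi in zip(cell, self.lo, self.hi,
                                       self.sign_lo, self.sign_hi):
            # assumption: the lower half of [l, h] ends at (l + h) // 2
            signs.append(s_lo if i <= (l + h) // 2 else s_hi)
        return prod(signs) * self.v  # quadrant sign = product of d entries


def render_cell(coefficients, cell):
    """Reconstruct one cell by summing all coefficient contributions."""
    return sum(w.contribution(cell) for w in coefficients)


coeffs = [
    HaarCoefficient((0,), (3,), (1,), (1,), 4.0),    # overall average
    HaarCoefficient((0,), (3,), (1,), (-1,), -2.0),  # coarsest detail
    HaarCoefficient((0,), (1,), (1,), (-1,), 0.0),   # detail for cells 0..1
    HaarCoefficient((2,), (3,), (1,), (-1,), -1.0),  # detail for cells 2..3
]
cells = [render_cell(coeffs, (i,)) for i in range(4)]
```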

19
Building Wavelet Coefficient Synopses
o Relation R with d attributes X_1, X_2, …, X_d
o R can be represented as a d-dimensional array A_R
o The j-th dimension is indexed by the values of attribute X_j
o Cells contain the count of tuples in R having the corresponding combination of attribute values
o A_R: joint frequency distribution of all attributes of R
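To make the joint frequency array concrete, here is a small sketch that counts tuples per combination of attribute values. The function names and the toy relation are made up for illustration, and attribute values are assumed to be already mapped to integers starting at 0 so that they can index the array directly.

```python
from collections import Counter


def joint_frequency(tuples):
    """Sparse A_R: maps each combination of attribute values to its count."""
    return Counter(tuple(t) for t in tuples)


def dense_2d(freq, size1, size2):
    """Materialize a two-attribute A_R as a size1 x size2 list of lists."""
    return [[freq.get((i, j), 0) for j in range(size2)] for i in range(size1)]


# toy relation R(X_1, X_2); the values are illustrative only
relation = [(0, 1), (0, 1), (1, 0), (1, 1), (0, 1)]
a_r = joint_frequency(relation)
dense = dense_2d(a_r, 2, 2)
```

The sparse form matters in practice because most cells of a high-dimensional joint frequency array are empty.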

20
Chunk-based organization of relational tables
o The joint frequency array A_R is split into d-dimensional chunks
o Tuples of R belonging to the same chunk are stored contiguously on disk
o If R is not chunked, one extra pre-processing step is needed to reorganize R on disk

21
ComputeWavelet Algorithm
o When a chunk is loaded for the first time, ComputeWavelet can perform the entire computation required for decomposing it
o Pairwise averaging and differencing is performed as soon as $2^d$ averages are accumulated
o Memory efficient: no more than one active sub-array at a time for each level of resolution
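The idea of combining averages as soon as enough of them have accumulated can be illustrated with a deliberately simplified one-dimensional, streaming analogue (d = 1, so two averages trigger a step). This is not the ComputeWavelet algorithm itself: it ignores chunks and disk layout, the function name is invented, and levels here are counted upward from the data rather than from the coarsest resolution. What it does show is the memory behaviour: at most one pending value is held per level.

```python
def streaming_haar(values):
    """One-pass 1-D Haar decomposition with one pending average per level."""
    pending = []  # pending[k]: average waiting for its partner, k levels above the data
    emitted = []  # emitted[k]: number of details produced so far at level k
    details = []  # (level above the data, position, detail coefficient)
    for x in values:
        carry, k = float(x), 0
        while True:
            if k == len(pending):
                pending.append(None)
                emitted.append(0)
            if pending[k] is None:
                pending[k] = carry  # first of a pair: wait for the second
                break
            left, pending[k] = pending[k], None
            details.append((k, emitted[k], (left - carry) / 2))
            emitted[k] += 1
            carry = (left + carry) / 2  # pass the new average one level up
            k += 1
    waiting = [a for a in pending if a is not None]
    if len(waiting) != 1:
        raise ValueError("number of values must be a power of two")
    return waiting[0], details


overall, found = streaming_haar([2, 2, 5, 7])
```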

22
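Processing Relational Queries in Wavelet Coefficient Domain
[Diagram: two equivalent paths from the synopses to the approximate result]
o Wavelet domain: wavelet-coefficient synopses W_T1, W_T2, …, W_Tk -> Op(W_T1, …, W_Tk) -> result set (RS) of wavelet coefficients W_S -> Render(W_S) -> approximate result relation S
o Relational domain: wavelet-coefficient synopses W_T1, W_T2, …, W_Tk -> Render(W_T1, …, W_Tk) -> approximate relations T1, T2, …, Tk -> Op(T1, T2, …, Tk) -> approximate result relation S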

23
Selection Operator
Our selection operator has the general form $\mathrm{select}_{pred}(W_T)$, where $pred$ represents a generic conjunctive predicate on a subset of the $d$ attributes in $T$; that is, $pred = (l_{i_1} \le X_{i_1} \le h_{i_1}) \wedge \cdots \wedge (l_{i_k} \le X_{i_k} \le h_{i_k})$, where $l_{i_j}$ and $h_{i_j}$ denote the low and high boundaries of the selected range along each selection dimension $D_{i_j}$, $j = 1, 2, \ldots, k$, $k \le d$.

24
Selection - Relational Domain
o In the relational domain, we are interested only in those cells inside the query range
o In the wavelet domain, we are interested only in the coefficients that contribute to those cells
[Figure: query range within the joint data distribution array of the relation; axis labels Dim. D1 and Dim. D]
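The core of selection in the wavelet domain, keeping only the coefficients whose support region overlaps the query range, can be sketched as a simple filter. The `Coef` type, the `select` function, and the dictionary encoding of the predicate are assumptions made for this example. Any further adjustment of the retained coefficients, such as clipping their support rectangles to the range, is left out of this sketch.

```python
from typing import Dict, List, NamedTuple, Tuple


class Coef(NamedTuple):
    lo: Tuple[int, ...]  # low boundary of the support rectangle per dimension
    hi: Tuple[int, ...]  # high boundary of the support rectangle per dimension
    v: float             # coefficient magnitude


def select(coefficients: List[Coef],
           ranges: Dict[int, Tuple[int, int]]) -> List[Coef]:
    """Keep coefficients whose support overlaps every selected range.

    `ranges` maps a dimension index to its (low, high) selection boundaries;
    dimensions that are not listed are unconstrained.
    """
    kept = []
    for w in coefficients:
        # two closed intervals overlap iff each starts before the other ends
        if all(w.lo[j] <= h and l <= w.hi[j] for j, (l, h) in ranges.items()):
            kept.append(w)
    return kept


# supports of the four coefficients of a 1-D array with 4 cells
synopsis = [
    Coef((0,), (3,), 4.0),
    Coef((0,), (3,), -2.0),
    Coef((0,), (1,), 0.0),
    Coef((2,), (3,), -1.0),
]
selected = select(synopsis, {0: (2, 3)})  # predicate: 2 <= X_1 <= 3
```

Here the coefficient supported only on cells 0..1 is discarded, since it cannot influence any cell inside the query range.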

25
Projection Operator

26
Projection - Wavelet Domain

27
Join Operator

28
Join Operator - Wavelet Domain

29
Experimental Study
o Improved answer quality
o Low synopsis construction costs
o Fast query execution

30
Query Execution Times

31
SELECT-JOIN-SUM

32
SELECT Query errors on real-life data

33
Conclusion
o Multidimensional wavelets are an effective tool for general-purpose approximate query processing in modern, high-dimensional applications
o The query processing algorithms operate directly on the wavelet-coefficient synopses of relational data, allowing very fast processing of arbitrarily complex queries entirely in the wavelet-coefficient domain
o An extensive experimental study with synthetic as well as real-life data sets verifies the effectiveness of the wavelet-based approach compared to both sampling and histograms

34
Thank you
