Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 C. Shahabi Mining Multidimensional Databases Mining Multidimensional Databases Cyrus Shahabi University of Southern California Dept. of Computer Science.

Similar presentations


Presentation on theme: "1 C. Shahabi Mining Multidimensional Databases Mining Multidimensional Databases Cyrus Shahabi University of Southern California Dept. of Computer Science."— Presentation transcript:

1 1 C. Shahabi Mining Multidimensional Databases Mining Multidimensional Databases Cyrus Shahabi University of Southern California Dept. of Computer Science Los Angeles, CA 90089-0781 shahabi@usc.edu http://infolab.usc.edu

2 2 C. Shahabi Outline Distributed Information Management Laboratory Multidimensional Data Sets & Applications (Examples) Focus Application: On-Line Analytical Processing (OLAP) Traditional Solution PROPOLYNE: Progressive Evaluation of Polynomial Range-Sum Query

3 3 C. Shahabi Location: PHE-306 (and 108) URL: http://infolab.usc.eduhttp://infolab.usc.edu Research Staff: 2 Admin Staff: 1 Ph.D. Students: 9 M.S. Students: 8 Undergraduates: 2 Ph.D. Alumni: 3 M.S. Alumni: Many! Sponsors:

4 4 C. Shahabi Multidimensional Data Sets & Applications (Examples) Similarity search, clustering, …  Stock prices  time-series  Images  color histograms  Shapes  angle sequences  Web navigation  feature vectors Spatial & temporal queries, mining queries, …  GeoSpatial data  latitude, longitude, altitude  Remote sensory data   Immersidata 

5 5 C. Shahabi f (S1) e.g., avg e.g., std Stock Prices S1 Sn day $price 1365 day $price 1365 A point in 365 dimensions (computationally complex) f (Sn) A point in 2 dimensions (not accurate enough) 33 11 22 44 55 g (Sn) g (S1) A point in 5 dimensions transformation-based: FFT, Wavelet [SSDBM’00, 01]

6 6 C. Shahabi More Similarity Search & Clustering 0 255... More accurate Images Red Green Blue 208 125 100 Color Histograms R G B Red Green Blue 80 100 210 C Angle Sequences = [  ]          Shapes [ICDE’99 … ICME’00] Web Navigations (Hit) Feature Vectors [RIDE’97 … WebKDD’01] P1 P2 P3 P4 P5 … 3 870

7 7 C. Shahabi Spatial & Temporal Data Complex Queries Data types: A point: or A line-segment: A line: sequence of line-segments A region: A closed set of lines Moving point: (e.g., car, train, …) Changing region: (e.g., changing temperature of a county) Queries: Rivers Countries Hospitals Cities Taxi 5km of Home 10 min Experiments BrainR [Visual’99] [ACM-GIS’01, VLDB’01]

8 8 C. Shahabi Immersidata and Mining Queries [CIKM’01, UACHI’01]

9 9 C. Shahabi … … Immersidata and Mining Queries … A dynamic sign, e.g., ASL colors 

10 10 C. Shahabi Focus Application: On-Line Analytical Processing (OLAP) Multidimensional data sets:  Dimension attributes (e.g., Store, Product, Data)  Measure attributes (e.g., Sale, Price) Range-sum queries  Average sale of shoes in CA in 2001  Number of jackets sold in Seattle in Sep. 2001 Tougher queries:  Covariance of sale and price of jackets in CA in 2001 (correlation)  Variance of price of jackets in 2001 in Seattle Store Location Product DateSale LA Shoes Jan. 01 $21,500 $85.99 NY Jacket June 01 $28,700 $45.99 Price.............................. Market-Relation  (p=shoe)  (s CA)  (d 2001) Avg (sale) Too Slow!

11 11 C. Shahabi Traditional Solution: Pre-computation Prefix-sum [Agrawal et. al 1997] Age Salary 25$50k 28$55k 30$58k 50$100k 55$130k 57 $120k Age Salary $40k $55k $65k $100k$120k $150k 0 25 40 50 60 80 Salary Age Query: Sum(salary) when (25 < age < 40) and (55k < salary < 150k) Disadvantages: Measure attribute should be pre-selected Aggregation function should be pre-selected Works only for limited # of aggregation functions Updates are expensive (need re-computation) Result: I – II – III + IV Query: Sum(salary) when (25 < age < 40) and (55k < salary < 150k)

12 12 C. Shahabi PROPOLYNE: Progressive Evaluation of Polynomial Range-Sum Query (w/ Rolfe Schmidt) Overview of PROPOLYNE Features of PROPOLYNE Polynomial Range-Sum Queries as Vector Queries Naive Evaluation of Vector Queries Using Wavelets Fast Evaluation of Vector Queries Using Wavelets Progressive/Approximate Evaluation of Vector Queries Using Wavelets Related Work Performance Results Conclusion

13 13 C. Shahabi Overview of PROPOLYNE Define range-sum query as vector product of query vector and data vector Offline: Multidimensional wavelet transform of data At the query time: “lazy” wavelet transform of query vector (very fast) Dot product of query and data vectors in the transformed domain  exact result in O(2 log N) d Choose high-energy query coefficients only  fast approximate result (90% accuracy by retrieving < 10% of data) Choose query coefficients in order of energy  progressive result

14 14 C. Shahabi PROPOLYNE Features All attributes can be treated as either “dimension” or “measure” attributes “Function” can be any polynomial on any combination of attributes, i.e., not only SUM, AVERAGE and COUNT but also COVARIANCE, VARIANCE and SUMSQUARE Independent from how well the data set can be compressed/approximated by wavelet  Because: We show “range-sum queries” can always be approximated well by wavelets (not always HAAR though!) Low update cost: O(log d N) Can be used for exact, approximate and progressive range-sum query evaluation

15 15 C. Shahabi Polynomial Range-Sum Queries Polynomial range-sum queries: Q(R,f,I)  I is a finite instance of schema F  R SubSetOf Dom( F ), is the range  f : Dom( F )  R is a polynomial of degree  Example: F = (Age, Salary) R : (25 < age < 40) & (55k < salary < 150k) Age Salary 25$50k 28$55k 30$58k 50$100k 55$130k 57 $120k I

16 16 C. Shahabi Polynomial Range-Sum Queries as “Vector Queries” The data frequency distribution of I is the function  I : Dom( F )  Z that maps a point x to the number of times it occurs in I To emphasize the fact that a query is an operator on the data frequency distribution, we write Example:  (25,50)=  (28,55)=…=  (57,120)=1 and  (x)=0 otherwise. Age Salary 25$50k 28$55k 30$58k 50$100k 55$130k 57 $120k I where: if Hence: Or: Vector Query querydata

17 17 C. Shahabi Ha[i]’sGa[i]’s a[i]’s H 2 a [i]’sGHa[i]’s H 3 a[i]’s GH 2 a[i]’s H operator: computes a local average of array a at every other point to produce an array of summary coefficients: Ha Example (Haar) h=[1/2,1/2] G operator: measures how much values in the array a vary inside each of the summarized blocks to compute an array of detail coefficients: Ga Example (Haar) g=[1/2,-12] Overview of Wavelets DWT of a Summary coefficients of a at level 2 Detail coefficients of a at level 2 aka wavelet coefficients of a

18 18 C. Shahabi Naive Evaluation of Vector Queries Using Wavelets Hence, vector queries can be computed in the wavelet- transformed space as: Algorithm:  Off-line transformation of data vector (or “data distribution function”, i.e., , to be exact) O (| I | l d log d N) for sparse data, O (| I |) = N d for dense data  Real-time transformation of the query vector at the query evaluation time O ( l d log d N)  Sum-up the products of the corresponding elements of data and query vectors Retrieving elements of data vector: O (N d )

19 19 C. Shahabi Fast Evaluation of Vector Queries Using Wavelets Main intuitions:  “query vector” can be transformed quickly because most of the coefficients are known in advance  “Transformed query vector” has a large number of negligible (e.g., zero) values (independent on how well data can be approximated by wavelet)  Example: Haar filter & COUNT function on R=[5,12] on the domain of integers from 0 to 15: Ga GHaGH 2 a GH 3 a H4aH4a At each step, you know the zeros

20 20 C. Shahabi The Lazy Wavelet Transform Computing Summary Coefficients (Haar Filter, COUNT function) Outside the range, summary coeffs are ½ *0 + ½ * 0 = 0. At boundary of range, summary coeff is ½ *0 + ½ * 1 = ½ Inside range, summary coeffs are ½ * 1 + ½ * 1 = 1 All summary coefficients computed in CONSTANT time! The only “interesting” activity happens on the boundary. Summary coefficient array looks almost exactly like original array.

21 21 C. Shahabi The Lazy Wavelet Transform All detail coefficients computed in CONSTANT time! The only “interesting” activity happens on the boundary. Computing Detail Coefficients (Haar Filter, COUNT function) Outside the range, detail coeffs are ½ *0 - ½ * 0 = 0. At lower boundary of range, detail coeff is ½ *0 - ½ * 1 = -½ Inside range, detail coeffs are ½ * 1 - ½ * 1 = 0 At upper boundary of range, detail coeff is ½ *1 - ½ * 0 = ½ All but 2 detail coefficients at each level are equal to zero!

22 22 C. Shahabi Fast Evaluation of Vector Queries Using Wavelets … Technical Requirements:  Wavelets must satisfy a “moment condition”  Wavelets should have small support (i.e., the shorter the filter, the better)  Supports any Polynomial Range-Sum up to a degree determined by the choice of wavelets E.g., Haar can only support degree 0 (e.g., COUNT), while db4 can support up to degree 1 (e.g., SUM), and db6 for degree 2 (e.g., VARIANCE) Standard DWT:  (N) Our lazy wavelet for transforming query function:  ( l log N) where l is the length of the filter

23 23 C. Shahabi Exact Evaluation of Vector Queries Query: SUM(salary) when (25 < age < 40) & (55k < salary < 150k) # of Wavelet Coefficients: 1250

24 24 C. Shahabi Approximate Evaluation of Vector Queries

25 25 C. Shahabi Progressive Evaluation of Vector Queries

26 26 C. Shahabi Name of TechnologyResearch Group Query Cost Update Cost Storage Cost Aggregate Function Support Query Evaluation Support Measure Known at Population? PROPOLYNE 2001 USC Schmidt & Shahabi lg d N(4  ) d lg d N(2  ) d N d Polynomial Range-Sums of degree  Exact, Approximate, Progressive No PROPOLYNE-FM 2001 USC Schmidt & Shahabi 2 d lg d-1 Nlg d-1 NN d-1 COUNT and SUM Exact, Approximate, Progressive Yes Space-Efficient Dynamic Data Cube 2000 UCSB El-Abbadi & Agrawal et. al 2 d lg d-1 Nlg d-1 N N d-1 COUNT and SUM ExactYes Relative Prefix-Sum 1999 UCSB4 d-1 N (d-1)/2 N d-1 COUNT and SUM ExactYes Prefix-Sum 1997 IBM Agrawal et. al 2 d-1 N d-1 COUNT and SUM ExactYes pCube/MRATree 2000/2001 UCSB and UC Irvine (Mehrota et. al) N d-1 lg N N d COUNT and SUM Exact, Approximate, Progressive Yes Compact Data Cube 1998-2000 Duke and IBM (Vitter et. al) small? COUNT and SUM ApproximateYes Optimal Histograms 2001 AT&T (Muthu et. al) small? COUNT and SUM ApproximateYes Kernel Density Estimators 1999 Microsoft (Fayyad et. al) small? All efficiently computable functions ApproximateNo

27 27 C. Shahabi Experimental Setup PETROL Data Set: Petroleum sales volume 56504 tuples Five dimensions:  Sparseness: 0.16%  Traditional data approximation works well 250 range queries generated  Randomly from all possible ranges with the uniform dist.  Ranges which select fewer than 100 tuples were discarded  Median Relative error, GPS Data Set: Sensor readings from GPS ground stations in CA 3358 tuples Four dimensions:   Velocity of upward movement of the station Sparseness: 0.01%  Data approximation works poorly

28 28 C. Shahabi Performance Results Compact Data Cube (as a representative of data approximation techniques): under 10% error after using less than 10% of wavelet coefficients (wavelet coefficients sorted in the order of energy) PETROL

29 29 C. Shahabi Performance Results CDC needs 5 times as many coefficients as there were tuples in the original table before providing a median relative error of 10% (because data cannot be compressed well) GPS

30 30 C. Shahabi

31 31 C. Shahabi Conclusion A novel MOLAP pre-aggregation strategy Supports conventional aggregates: COUNT, SUM and beyond: COVARIANCE First pre-aggregation technique that does not require measures be specified a priori  Measures treated as functions of the attributes at the query time Provides a data independent progressive and approximate query answering technique With provably poly-logarithmic worst-case query and update costs And storage cost comparable or better than other pre-aggregation methods

32 32 C. Shahabi Future PROPOLYNE future plans:  Use synopsis information about query workloads or data distribution for better sorting of coefficients  Improve random access behavior of PROPOLYNE to data by “clustering” related coeffiecents  More complex queries: OLAP drill-down, general relational algebra queries, … Multidimensional mining research directions:  Efficient ways of finding trends (e.g., correlation between dimensions/attributes)  Efficient ways of finding surprises/outliers  Mining sequence data sets (e.g., genome databases)


Download ppt "1 C. Shahabi Mining Multidimensional Databases Mining Multidimensional Databases Cyrus Shahabi University of Southern California Dept. of Computer Science."

Similar presentations


Ads by Google