1 C. Shahabi Mining Multidimensional Databases Mining Multidimensional Databases Cyrus Shahabi University of Southern California Dept. of Computer Science.

Presentation transcript:

1 C. Shahabi Mining Multidimensional Databases Mining Multidimensional Databases Cyrus Shahabi University of Southern California Dept. of Computer Science Los Angeles, CA

2 C. Shahabi Outline Distributed Information Management Laboratory Multidimensional Data Sets & Applications (Examples) Focus Application: On-Line Analytical Processing (OLAP) Traditional Solution PROPOLYNE: Progressive Evaluation of Polynomial Range-Sum Query

3 C. Shahabi Location: PHE-306 (and 108) URL: Research Staff: 2 Admin Staff: 1 Ph.D. Students: 9 M.S. Students: 8 Undergraduates: 2 Ph.D. Alumni: 3 M.S. Alumni: Many! Sponsors:

4 C. Shahabi Multidimensional Data Sets & Applications (Examples)
Similarity search, clustering, …
- Stock prices → time series
- Images → color histograms
- Shapes → angle sequences
- Web navigation → feature vectors
Spatial & temporal queries, mining queries, …
- GeoSpatial data → latitude, longitude, altitude
- Remote sensory data
- Immersidata

5 C. Shahabi Stock Prices
[Figure: price series S1 … Sn, each plotting $price against day, from day 1 to day 365]
- Raw series: a point in 365 dimensions (computationally complex)
- f(S1) … f(Sn), e.g., avg and std: a point in 2 dimensions (not accurate enough)
- g(S1) … g(Sn), transformation-based (FFT, wavelet): a point in 5 dimensions [SSDBM’00, 01]
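
The "not accurate enough" remark about the 2-dimensional (avg, std) representation can be illustrated with a minimal Python sketch. The two price series below are hypothetical, and `summary_features` and `euclidean` are illustrative helper names, not anything from the slides. The sketch shows two clearly different 365-day series that collapse to the same 2-D point.

```python
import math

def summary_features(series):
    """Map a series to the 2-D point (mean, population standard deviation)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    return (mean, math.sqrt(var))

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical data: a steadily rising and a steadily falling price series.
rising = [float(day) for day in range(1, 366)]
falling = list(reversed(rising))

# Far apart as points in 365 dimensions ...
print(euclidean(rising, falling) > 0)
# ... yet identical after reduction to (avg, std).
print(summary_features(rising) == summary_features(falling))
```

A transformation-based reduction such as g(S) keeps a few coefficients that also encode shape, which is why the slide positions it between the two extremes.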

6 C. Shahabi More Similarity Search & Clustering
- Images → color histograms over Red, Green, Blue
- Shapes → angle sequences [ICDE’99 … ICME’00]
- Web navigations (hits) → feature vectors [RIDE’97 … WebKDD’01]
[Figure labels: More accurate; R, G, B; C; P1, P2, P3, P4, P5, …]

7 C. Shahabi Spatial & Temporal Data, Complex Queries
Data types:
- A point
- A line segment
- A line: a sequence of line segments
- A region: a closed set of lines
- Moving point (e.g., car, train, …)
- Changing region (e.g., changing temperature of a county)
Queries: [Figure labels: Rivers, Countries, Hospitals, Cities, Taxi, 5km of Home, 10 min, Experiments, BrainR]
[Visual’99] [ACM-GIS’01, VLDB’01]

8 C. Shahabi Immersidata and Mining Queries [CIKM’01, UACHI’01]

9 C. Shahabi Immersidata and Mining Queries … A dynamic sign, e.g., ASL colors

10 C. Shahabi Focus Application: On-Line Analytical Processing (OLAP)
Multidimensional data sets:
- Dimension attributes (e.g., Store, Product, Date)
- Measure attributes (e.g., Sale, Price)
Range-sum queries:
- Average sale of shoes in CA in 2001
- Number of jackets sold in Seattle in Sep
Tougher queries:
- Covariance of sale and price of jackets in CA in 2001 (correlation)
- Variance of price of jackets in 2001 in Seattle
Market-Relation:
Store Location | Product | Date | Sale | Price
LA | Shoes | Jan. 01 | $21,500 | $85.99
NY | Jacket | June 01 | $28,700 | $45.99
Relational evaluation: apply the selections (p = shoe), (s in CA), (d in 2001) to Market-Relation, then compute Avg(sale). Too slow!

11 C. Shahabi Traditional Solution: Pre-computation
Prefix-sum [Agrawal et al. 1997]
Age | Salary
25 | $50k
28 | $55k
30 | $58k
50 | $100k
55 | $130k
57 | $120k
[Figure: Age vs. Salary grid with salary ticks $40k, $55k, $65k, $100k, $120k, $150k; the query rectangle is assembled from four prefix regions I, II, III, IV]
Query: Sum(salary) when (25 < age < 40) and (55k < salary < 150k)
Result: I - II - III + IV
Disadvantages:
- Measure attribute should be pre-selected
- Aggregation function should be pre-selected
- Works only for a limited number of aggregation functions
- Updates are expensive (need re-computation)
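
The "I - II - III + IV" result can be sketched in Python as a two-dimensional prefix-sum table with a four-lookup range query. This is a minimal illustration of the general prefix-sum idea, not the exact layout of Agrawal et al.; the 3×3 grid and the names `build_prefix_sums` and `range_sum` are assumptions made for the example.

```python
def build_prefix_sums(grid):
    """P[i][j] holds the sum of grid[0..i-1][0..j-1]; row and column 0 are padding."""
    rows, cols = len(grid), len(grid[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def range_sum(P, r0, r1, c0, c1):
    """Sum of grid[r0..r1][c0..c1] (inclusive) using four prefix lookups."""
    return P[r1 + 1][c1 + 1] - P[r0][c1 + 1] - P[r1 + 1][c0] + P[r0][c0]

# Hypothetical pre-aggregated measure, e.g. total salary per (age bucket, salary bucket).
grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
P = build_prefix_sums(grid)
print(range_sum(P, 1, 2, 1, 2))  # 5 + 6 + 8 + 9 = 28
```

The sketch also makes the listed disadvantages concrete: the table stores one fixed measure under one fixed aggregate, and changing a single cell of `grid` can force a rewrite of every prefix entry below and to the right of it.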

12 C. Shahabi PROPOLYNE: Progressive Evaluation of Polynomial Range-Sum Query (w/ Rolfe Schmidt) Overview of PROPOLYNE Features of PROPOLYNE Polynomial Range-Sum Queries as Vector Queries Naive Evaluation of Vector Queries Using Wavelets Fast Evaluation of Vector Queries Using Wavelets Progressive/Approximate Evaluation of Vector Queries Using Wavelets Related Work Performance Results Conclusion

13 C. Shahabi Overview of PROPOLYNE
- Define range-sum query as vector product of query vector and data vector
- Offline: multidimensional wavelet transform of data
- At query time: “lazy” wavelet transform of query vector (very fast)
- Dot product of query and data vectors in the transformed domain → exact result in $O((2 \log N)^d)$
- Choose high-energy query coefficients only → fast approximate result (90% accuracy by retrieving < 10% of data)
- Choose query coefficients in order of energy → progressive result
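
The progressive step can be sketched in one dimension as follows. This is a minimal Python illustration under stated assumptions: it uses an orthonormal Haar transform computed in full (not the lazy transform), a hypothetical data vector, and illustrative function names. It is not the PROPOLYNE implementation.

```python
import math

def haar_orthonormal(a):
    """Full orthonormal Haar DWT of a list whose length is a power of two."""
    cur, details = list(a), []
    while len(cur) > 1:
        s = [(cur[i] + cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        d = [(cur[i] - cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        details = d + details   # coarser levels end up in front
        cur = s
    return cur + details

def progressive_estimates(query, data):
    """Running dot product, visiting query coefficients by decreasing magnitude."""
    q_hat, d_hat = haar_orthonormal(query), haar_orthonormal(data)
    order = sorted(range(len(q_hat)), key=lambda k: -abs(q_hat[k]))
    total, estimates = 0.0, []
    for k in order:
        if q_hat[k] == 0.0:      # remaining coefficients contribute nothing
            break
        total += q_hat[k] * d_hat[k]
        estimates.append(total)
    return estimates

# COUNT-style query over R = [5, 12] on the domain 0..15; the data values are hypothetical.
query = [1.0 if 5 <= x <= 12 else 0.0 for x in range(16)]
data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0, 9.0, 7.0, 9.0, 3.0]
exact = sum(q * d for q, d in zip(query, data))
estimates = progressive_estimates(query, data)
print(exact, estimates)
```

Each entry of `estimates` is the answer after retrieving one more data coefficient, and the last entry agrees with `exact` up to floating-point rounding. Stopping early gives the approximate answer.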

14 C. Shahabi PROPOLYNE Features
- All attributes can be treated as either “dimension” or “measure” attributes
- “Function” can be any polynomial on any combination of attributes, i.e., not only SUM, AVERAGE and COUNT but also COVARIANCE, VARIANCE and SUMSQUARE
- Independent of how well the data set can be compressed/approximated by wavelets, because we show that “range-sum queries” can always be approximated well by wavelets (not always Haar though!)
- Low update cost: $O(\log^d N)$
- Can be used for exact, approximate and progressive range-sum query evaluation
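
The low update cost can be checked in one dimension with a small Python sketch: changing a single data value alters only one coefficient per level of a Haar transform, so about log N coefficients in total. The averaging Haar filter, the data values and the function names below are assumptions chosen for illustration.

```python
def haar_dwt(a):
    """Averaging Haar DWT (h = [1/2, 1/2], g = [1/2, -1/2]); length must be a power of two."""
    cur, details = list(a), []
    while len(cur) > 1:
        d = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        cur = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        details = d + details
    return cur + details

def changed_coefficients(data, index, delta):
    """Count the wavelet coefficients that differ after adding delta at one position."""
    before = haar_dwt(data)
    updated = list(data)
    updated[index] += delta
    after = haar_dwt(updated)
    return sum(1 for b, a in zip(before, after) if b != a)

# Hypothetical data vector of length N = 16.
data = [float(x) for x in range(16)]
print(changed_coefficients(data, 5, 1.0))  # one detail per level (4) plus the overall average = 5
```

In d dimensions the same argument applies along each axis, which is consistent with the $O(\log^d N)$ figure on the slide.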

15 C. Shahabi Polynomial Range-Sum Queries
Polynomial range-sum queries: $Q(R, f, I)$
- $I$ is a finite instance of schema $F$
- $R \subseteq \mathrm{Dom}(F)$ is the range
- $f : \mathrm{Dom}(F) \to \mathbb{R}$ is a polynomial of degree $\delta$
Example: F = (Age, Salary), R: (25 < age < 40) & (55k < salary < 150k)
Instance I:
Age | Salary
25 | $50k
28 | $55k
30 | $58k
50 | $100k
55 | $130k
57 | $120k
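
A direct, unoptimized reading of Q(R, f, I) can be sketched in Python over the example instance. The helper `range_sum`, the predicate names and the wider range `all_ages` are illustrative assumptions. The sketch shows how COUNT, SUM and SUMSQUARE are polynomial range-sums of degree 0, 1 and 2, and how VARIANCE follows from them.

```python
# Example instance from the slide: (age, salary in $k).
I = [(25, 50), (28, 55), (30, 58), (50, 100), (55, 130), (57, 120)]

def range_sum(instance, in_range, f):
    """Q(R, f, I): sum f over the tuples of the instance that fall inside the range."""
    return sum(f(t) for t in instance if in_range(t))

def slide_range(t):
    age, salary = t
    return 25 < age < 40 and 55 < salary < 150

def all_ages(t):
    # Hypothetical wider range, used only to make the variance non-trivial.
    return 25 <= t[0] <= 57

count = lambda t: 1            # degree 0
salary = lambda t: t[1]        # degree 1
salary_sq = lambda t: t[1] ** 2  # degree 2

print(range_sum(I, slide_range, salary))  # only (30, 58) satisfies the strict bounds

n = range_sum(I, all_ages, count)
s = range_sum(I, all_ages, salary)
ss = range_sum(I, all_ages, salary_sq)
variance = ss / n - (s / n) ** 2
print(n, s, ss, variance)
```

Because VARIANCE is assembled from three range-sums at query time, nothing about it needs to be fixed when the data is loaded.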

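16 C. Shahabi Polynomial Range-Sum Queries as “Vector Queries”
The data frequency distribution of $I$ is the function $\Delta_I : \mathrm{Dom}(F) \to \mathbb{Z}$ that maps a point $x$ to the number of times it occurs in $I$.
To emphasize the fact that a query is an operator on the data frequency distribution, we write
$$Q(R, f, \Delta_I) = \sum_{x \in R} f(x)\, \Delta_I(x)$$
Example: $\Delta_I(25,50) = \Delta_I(28,55) = \dots = \Delta_I(57,120) = 1$ and $\Delta_I(x) = 0$ otherwise, for the instance $I$:
Age | Salary
25 | $50k
28 | $55k
30 | $58k
50 | $100k
55 | $130k
57 | $120k
where the query function is $f_R(x) = f(x)$ if $x \in R$, and $f_R(x) = 0$ otherwise. Hence:
$$Q(R, f, \Delta_I) = \sum_{x \in \mathrm{Dom}(F)} f_R(x)\, \Delta_I(x)$$
Or, as a vector query (query vector times data vector):
$$Q(R, f, \Delta_I) = \langle f_R, \Delta_I \rangle$$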
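
The vector-query view can be sketched in Python by building the data frequency distribution explicitly and taking its dot product with a query function that is zero outside the range. The names `delta` and `f_R` are illustrative, and SUM(salary) is assumed as the polynomial f for this example.

```python
from collections import Counter

# Example instance from the slide: (age, salary in $k).
I = [(25, 50), (28, 55), (30, 58), (50, 100), (55, 130), (57, 120)]

# Data frequency distribution: maps each point to its number of occurrences in I.
delta = Counter(I)

def f_R(x):
    """Query function: f(x) = salary inside the range R, zero outside."""
    age, salary = x
    inside = 25 < age < 40 and 55 < salary < 150
    return salary if inside else 0

# The sum over Dom(F) reduces to the points where the distribution is non-zero.
answer = sum(f_R(x) * freq for x, freq in delta.items())
print(answer)
```

Points absent from the instance have frequency zero, so iterating over `delta` alone gives the same value as the sum over the whole domain.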

17 C. Shahabi Overview of Wavelets
- H operator: computes a local average of array a at every other point to produce an array of summary coefficients, Ha. Example (Haar): h = [1/2, 1/2]
- G operator: measures how much the values in array a vary inside each of the summarized blocks to compute an array of detail coefficients, Ga. Example (Haar): g = [1/2, -1/2]
[Figure: decomposition tree. a[i] splits into Ha[i] and Ga[i]; Ha splits into H^2 a[i] and GHa[i]; H^2 a splits into H^3 a[i] and GH^2 a[i]. H^2 a holds the summary coefficients of a at level 2; GHa holds the detail coefficients (aka wavelet coefficients) of a at level 2. The detail arrays together with the final summary form the DWT of a.]
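
The H and G operators for the Haar filters h = [1/2, 1/2] and g = [1/2, -1/2] can be sketched in Python as below. The function names and the four-element input are illustrative assumptions, and the input length is assumed to be a power of two.

```python
def H(a):
    """Summary coefficients: average of each adjacent pair."""
    return [(a[i] + a[i + 1]) / 2 for i in range(0, len(a), 2)]

def G(a):
    """Detail coefficients: half the difference within each adjacent pair."""
    return [(a[i] - a[i + 1]) / 2 for i in range(0, len(a), 2)]

def dwt(a):
    """Return the final summary and the detail arrays Ga, GHa, GH^2 a, ..."""
    details, cur = [], list(a)
    while len(cur) > 1:
        details.append(G(cur))
        cur = H(cur)
    return cur, details

summary, details = dwt([2.0, 4.0, 6.0, 8.0])
print(summary)   # [5.0]
print(details)   # [[-1.0, -1.0], [-2.0]]
```

Here `details[0]` is Ga and `details[1]` is GHa, the level-2 detail coefficients in the slide's terminology.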

18 C. Shahabi Naive Evaluation of Vector Queries Using Wavelets
Hence, vector queries can be computed in the wavelet-transformed space as:
$$\langle f_R, \Delta_I \rangle = \langle \hat{f}_R, \hat{\Delta}_I \rangle$$
where $f_R$ is the query vector, $\Delta_I$ is the data vector, and the hat denotes the wavelet transform.
Algorithm:
- Off-line transformation of the data vector (or “data distribution function”, i.e., $\Delta_I$, to be exact): $O(|I|\, l^d \log^d N)$ for sparse data, $O(|I|) = O(N^d)$ for dense data
- Real-time transformation of the query vector at query evaluation time: $O(l^d \log^d N)$
- Sum up the products of the corresponding elements of the data and query vectors. Retrieving the elements of the data vector: $O(N^d)$
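
The identity behind the naive algorithm can be checked numerically with a short Python sketch. It assumes an orthonormal Haar transform (filter taps 1/√2 rather than the 1/2 taps used on the overview slide), since the dot product is preserved exactly only under an orthonormal transform. The data values and function names are hypothetical.

```python
import math

def haar_orthonormal(a):
    """Full orthonormal Haar DWT of a list whose length is a power of two."""
    cur, details = list(a), []
    while len(cur) > 1:
        s = [(cur[i] + cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        d = [(cur[i] - cur[i + 1]) / math.sqrt(2) for i in range(0, len(cur), 2)]
        details = d + details
        cur = s
    return cur + details

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Query vector for COUNT over R = [2, 5] on the domain 0..7, and hypothetical data.
query = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
data = [4.0, 0.0, 2.0, 7.0, 1.0, 3.0, 0.0, 5.0]

direct = dot(query, data)
transformed = dot(haar_orthonormal(query), haar_orthonormal(data))
print(direct, transformed)
```

The two values agree up to floating-point rounding. The naive version still touches every data coefficient, which is the cost the fast evaluation avoids.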

19 C. Shahabi Fast Evaluation of Vector Queries Using Wavelets
Main intuitions:
- The “query vector” can be transformed quickly because most of the coefficients are known in advance
- The “transformed query vector” has a large number of negligible (e.g., zero) values (independent of how well the data can be approximated by wavelets)
- Example: Haar filter & COUNT function on R = [5,12] on the domain of integers from 0 to 15
[Figure: coefficient arrays Ga, GHa, GH^2 a, GH^3 a and H^4 a. At each step, you know the zeros.]

20 C. Shahabi The Lazy Wavelet Transform Computing Summary Coefficients (Haar Filter, COUNT function) Outside the range, summary coeffs are ½ *0 + ½ * 0 = 0. At boundary of range, summary coeff is ½ *0 + ½ * 1 = ½ Inside range, summary coeffs are ½ * 1 + ½ * 1 = 1 All summary coefficients computed in CONSTANT time! The only “interesting” activity happens on the boundary. Summary coefficient array looks almost exactly like original array.

21 C. Shahabi The Lazy Wavelet Transform All detail coefficients computed in CONSTANT time! The only “interesting” activity happens on the boundary. Computing Detail Coefficients (Haar Filter, COUNT function) Outside the range, detail coeffs are ½ *0 - ½ * 0 = 0. At lower boundary of range, detail coeff is ½ *0 - ½ * 1 = -½ Inside range, detail coeffs are ½ * 1 - ½ * 1 = 0 At upper boundary of range, detail coeff is ½ *1 - ½ * 0 = ½ All but 2 detail coefficients at each level are equal to zero!
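
The claim that all but two detail coefficients per level are zero can be checked for the running example (Haar filter, COUNT on R = [5, 12] over the integers 0 to 15). The Python sketch below computes the full transform and counts non-zero details per level. It is a verification of the sparsity that the lazy transform exploits, not an implementation of the lazy transform itself, and the function names are illustrative.

```python
def haar_step(a):
    """One Haar level with h = [1/2, 1/2] and g = [1/2, -1/2]."""
    summary = [(a[i] + a[i + 1]) / 2 for i in range(0, len(a), 2)]
    detail = [(a[i] - a[i + 1]) / 2 for i in range(0, len(a), 2)]
    return summary, detail

def nonzero_details_per_level(a):
    """Return the non-zero detail count at each level and the final summary array."""
    counts, cur = [], list(a)
    while len(cur) > 1:
        cur, detail = haar_step(cur)
        counts.append(sum(1 for c in detail if c != 0))
    return counts, cur

# Query vector for COUNT on R = [5, 12] over the domain 0..15.
count_query = [1.0 if 5 <= x <= 12 else 0.0 for x in range(16)]
counts, top = nonzero_details_per_level(count_query)
print(counts)  # [2, 2, 2, 1]
print(top)     # [0.5]
```

Of the 16 coefficients, only 7 details and 1 summary are non-zero, and all of them sit on the range boundary at their level.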

22 C. Shahabi Fast Evaluation of Vector Queries Using Wavelets …
Technical Requirements:
- Wavelets must satisfy a “moment condition”
- Wavelets should have small support (i.e., the shorter the filter, the better)
- Supports any polynomial range-sum up to a degree determined by the choice of wavelets. E.g., Haar can only support degree 0 (e.g., COUNT), while db4 can support up to degree 1 (e.g., SUM), and db6 up to degree 2 (e.g., VARIANCE)
Standard DWT: $O(N)$
Our lazy wavelet transform of the query function: $O(l \log N)$, where $l$ is the length of the filter
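
Why Haar stops at degree 0 can be seen with a small Python sketch: a constant (COUNT-style) query function has no non-zero Haar detail coefficients in its interior, whereas a linear (SUM-style) one has a non-zero detail in every block. The two query functions span the whole domain to isolate the interior behaviour, and the function name is an assumption for the example.

```python
def nonzero_details(a):
    """Total number of non-zero Haar detail coefficients (g = [1/2, -1/2]) over all levels."""
    total, cur = 0, list(a)
    while len(cur) > 1:
        detail = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        cur = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        total += sum(1 for c in detail if c != 0)
    return total

degree0 = [1.0] * 16                      # f(x) = 1, as in COUNT
degree1 = [float(x) for x in range(16)]   # f(x) = x, as in SUM

print(nonzero_details(degree0))  # 0: Haar annihilates constants
print(nonzero_details(degree1))  # 15: every detail coefficient is non-zero
```

Filters with more vanishing moments, such as the db4 and db6 filters named on the slide, annihilate higher-degree polynomials in the same way, at the price of a longer filter.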

23 C. Shahabi Exact Evaluation of Vector Queries Query: SUM(salary) when (25 < age < 40) & (55k < salary < 150k) # of Wavelet Coefficients: 1250

24 C. Shahabi Approximate Evaluation of Vector Queries

25 C. Shahabi Progressive Evaluation of Vector Queries

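26 C. Shahabi
Columns: Name of Technology | Research Group | Query Cost | Update Cost | Storage Cost | Aggregate Function Support | Query Evaluation Support | Measure Known at Population?
- PROPOLYNE (2001; USC; Schmidt & Shahabi): query cost $(4\delta)^d \lg^d N$; update cost $(2\delta)^d \lg^d N$; storage cost $N^d$; polynomial range-sums of degree $\delta$; exact, approximate, progressive; measure known at population: no
- PROPOLYNE-FM (2001; USC; Schmidt & Shahabi): query cost $2^d \lg^{d-1} N$; update cost $\lg^{d-1} N$; storage cost $N^{d-1}$; COUNT and SUM; exact, approximate, progressive; measure known at population: yes
- Space-Efficient Dynamic Data Cube (2000; UCSB; El-Abbadi & Agrawal et al.): query cost $2^d \lg^{d-1} N$; update cost $\lg^{d-1} N$; storage cost $N^{d-1}$; COUNT and SUM; exact; measure known at population: yes
- Relative Prefix-Sum (1999; UCSB): query cost $4^{d-1}$; update cost $N^{(d-1)/2}$; storage cost $N^{d-1}$; COUNT and SUM; exact; measure known at population: yes
- Prefix-Sum (1997; IBM; Agrawal et al.): costs as listed: $2^{d-1}$, $N^{d-1}$; COUNT and SUM; exact; measure known at population: yes
- pCube/MRATree (2000/2001; UCSB and UC Irvine; Mehrotra et al.): costs as listed: $N^{d-1}$, $\lg N$, $N^d$; COUNT and SUM; exact, approximate, progressive; measure known at population: yes
- Compact Data Cube (Duke and IBM; Vitter et al.): costs as listed: small?; COUNT and SUM; approximate; measure known at population: yes
- Optimal Histograms (2001; AT&T; Muthu et al.): costs as listed: small?; COUNT and SUM; approximate; measure known at population: yes
- Kernel Density Estimators (1999; Microsoft; Fayyad et al.): costs as listed: small?; all efficiently computable functions; approximate; measure known at population: no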

27 C. Shahabi Experimental Setup
PETROL Data Set: petroleum sales volume
- Five dimensions
- Sparseness: 0.16%
- Traditional data approximation works well
GPS Data Set: sensor readings from GPS ground stations in CA
- 3358 tuples
- Four dimensions, among them the velocity of upward movement of the station
- Sparseness: 0.01%
- Data approximation works poorly
250 range queries generated:
- Randomly from all possible ranges with the uniform distribution
- Ranges which select fewer than 100 tuples were discarded
- Error measure: median relative error

28 C. Shahabi Performance Results Compact Data Cube (as a representative of data approximation techniques): under 10% error after using less than 10% of wavelet coefficients (wavelet coefficients sorted in the order of energy) PETROL

29 C. Shahabi Performance Results CDC needs 5 times as many coefficients as there were tuples in the original table before providing a median relative error of 10% (because data cannot be compressed well) GPS

30 C. Shahabi

31 C. Shahabi Conclusion A novel MOLAP pre-aggregation strategy Supports conventional aggregates: COUNT, SUM and beyond: COVARIANCE First pre-aggregation technique that does not require measures be specified a priori  Measures treated as functions of the attributes at the query time Provides a data independent progressive and approximate query answering technique With provably poly-logarithmic worst-case query and update costs And storage cost comparable or better than other pre-aggregation methods

32 C. Shahabi Future
PROPOLYNE future plans:
- Use synopsis information about query workloads or data distribution for better sorting of coefficients
- Improve random access behavior of PROPOLYNE to data by “clustering” related coefficients
- More complex queries: OLAP drill-down, general relational algebra queries, …
Multidimensional mining research directions:
- Efficient ways of finding trends (e.g., correlation between dimensions/attributes)
- Efficient ways of finding surprises/outliers
- Mining sequence data sets (e.g., genome databases)