Optimal Dimension Order: A Generic Technique for the Similarity Join. Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2). (1) University for Health Informatics and Technology, Innsbruck; (2) University of Munich.

Presentation transcript:

Slide 1: Optimal Dimension Order: A Generic Technique for the Similarity Join
Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2)
(1) University for Health Informatics and Technology, Innsbruck
(2) University of Munich

Slide 2: Feature Based Similarity

Slide 3: Simple Similarity Queries
- Specify a query object q and
  - find all similar objects (range query), or
  - find the k most similar objects (nearest neighbor query)
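
The two query types on this slide can be illustrated with a brute-force sketch. It assumes feature vectors are tuples of numbers compared with the Euclidean distance; the function names `range_query` and `knn_query` are hypothetical and do not come from the slides.

```python
import heapq
import math


def range_query(data, q, eps):
    """Return all objects whose distance to the query object q is at most eps."""
    return [p for p in data if math.dist(p, q) <= eps]


def knn_query(data, q, k):
    """Return the k objects closest to q, nearest first (linear scan)."""
    return heapq.nsmallest(k, data, key=lambda p: math.dist(p, q))


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (3, 4), (10, 10)]
    print(range_query(data, (0, 0), 1.0))
    print(knn_query(data, (0, 0), 3))
```

A real system would answer both queries through an index structure instead of a linear scan; the scan only fixes the semantics.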

Slide 4: Join Applications: Catalogue Matching
- Catalogue matching, e.g. matching two astronomy catalogues R and S

Slide 5: Join Applications: Clustering
- Clustering (e.g. DBSCAN)
- Similarity self-join

Slide 6: R-Tree Similarity Join
- Depth-first traversal of two trees, one built on R and one on S [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

Slide 7: The ε-kdB-Tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
- Assumption: 2 adjacent ε-stripes fit in main memory
- Unrealistic for large data sets that are clustered, skewed, and high-dimensional

Slide 8: Epsilon Grid Order [Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

Slide 9: Common Properties
- Decomposition of data/space into regions
- Regions described by hyper-rectangles
    for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
        for each pair of points (p,q) on (P,Q)
            test dist(p,q) ≤ ε;
- Most CPU effort goes into the distance test between vectors
- Idea: speed up the distance test
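
The common scheme above can be sketched as a nested loop over partition pairs. This is a minimal illustration, not any of the cited algorithms: it assumes each partition is a plain list of points, describes it by its minimum bounding rectangle, and uses the hypothetical helper names `mbr`, `mindist`, and `similarity_join`.

```python
import math
from itertools import product


def mbr(points):
    """Minimum bounding hyper-rectangle of a partition as (lower, upper) corners."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    return lo, hi


def mindist(box_a, box_b):
    """Smallest possible distance between any point in box_a and any point in box_b."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    total = 0.0
    for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b):
        gap = max(lb - ha, la - hb, 0.0)  # 0 if the projections overlap
        total += gap * gap
    return math.sqrt(total)


def similarity_join(parts_r, parts_s, eps):
    """Report all point pairs (p, q) with dist(p, q) <= eps."""
    result = []
    for P, Q in product(parts_r, parts_s):
        if mindist(mbr(P), mbr(Q)) > eps:
            continue  # partition pair cannot contain a join mate
        for p, q in product(P, Q):
            if math.dist(p, q) <= eps:  # the expensive point-level test
                result.append((p, q))
    return result
```

The partition-level filter removes most pairs cheaply; the inner distance test is where the slides locate most of the CPU effort.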

Slide 10: Related Work: Plane Sweep for Polygons [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
- Observations:
  - It is more efficient to use the x-axis as sweep direction
  - The projections of the polygons onto the y-axis yield high overlap
  - Decide by the projections of the bounding boxes (integrate a pdf)

Slide 11: Feature Vectors in the Similarity Join
- Distance computation between feature vectors p, q:
    for (i = 0; i < d; i++) {
        dist = dist + (p[i] - q[i])^2;
        if (dist > ε^2) break;
    }
- Order dimensions by mating probability (increasing)
[Figure: axes d0 and d1]
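
The effect of the dimension order on the early-terminating distance test can be seen in a small sketch. It assumes squared Euclidean distance accumulated dimension by dimension; the function name `within_eps` and the returned step counter are illustrative additions, not part of the slides.

```python
def within_eps(p, q, eps, order=None):
    """Test dist(p, q) <= eps, stopping as soon as the partial sum exceeds eps^2.

    order is the sequence of dimension indices to visit; by default 0, 1, 2, ...
    Returns (is_within, number_of_dimensions_inspected).
    """
    if order is None:
        order = range(len(p))
    limit = eps * eps
    dist_sq = 0.0
    steps = 0
    for i in order:
        diff = p[i] - q[i]
        dist_sq += diff * diff
        steps += 1
        if dist_sq > limit:
            return False, steps  # early termination: pair cannot be a join mate
    return True, steps


if __name__ == "__main__":
    p, q = (0, 0, 0), (0.1, 0.1, 5.0)
    print(within_eps(p, q, 1.0))             # natural order
    print(within_eps(p, q, 1.0, [2, 0, 1]))  # most selective dimension first
```

Visiting the dimension in which the two vectors are least likely to be close first lets non-mating pairs be rejected after fewer additions.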

Slide 12: Computation of the Mating Probability
To determine the mating probability for dimension d_i:
- Project the bounding boxes onto the d_i-axis
[Figure: axes d0 and d1]

Slide 13: Computation of the Mating Probability
To determine the mating probability for dimension d_i:
- Project the bounding boxes onto the d_i-axis
- Consider the two projections in a 2-dimensional space
[Figure: projections onto d0]

Slide 14: Computation of the Mating Probability
To determine the mating probability for dimension d_i:
- Project the bounding boxes onto the d_i-axis
- Consider the two projections in a 2-dimensional space
- The d0-projection of each point pair is located in this event space
[Figure: event space with axes d0[P] and d0[Q]]

Slide 15: Computation of the Mating Probability
To determine the mating probability for dimension d_i:
- Project the bounding boxes onto the d_i-axis
- Consider the two projections in a 2-dimensional space
- The d0-projection of each point pair is located in this event space
- Mating point pairs lie on the ε-stripe: y ≥ x − ε and y ≤ x + ε
[Figure: event space with axes d0[P] and d0[Q] and the ε-stripe]

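Slide 16: Computation of the Mating Probability
To determine the mating probability for dimension d_i:
- Project the bounding boxes onto the d_i-axis
- Consider the two projections in a 2-dimensional space
[Figure: event space with axes d0[P] and d0[Q] and the ε-stripe of width ε on either side of the diagonal, labelled "Mating Probability for d0"]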
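
One way to turn the picture into a number is to take the share of the event space that is covered by the ε-stripe. The sketch below does this under an assumption the slides do not state explicitly: points are uniformly distributed within each projected interval. The function name `mating_probability` and its parameters are hypothetical.

```python
def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Share of the event space [p_lo, p_hi] x [q_lo, q_hi] with |x - y| <= eps.

    Assumes uniformly distributed projections and intervals of positive length.
    """
    if p_hi <= p_lo or q_hi <= q_lo:
        raise ValueError("intervals must have positive length")

    def covered(x):
        # Length of the part of [q_lo, q_hi] that lies within eps of x.
        return max(0.0, min(q_hi, x + eps) - max(q_lo, x - eps))

    # covered(x) is piecewise linear and can only bend at the four x values
    # below, so the trapezoid rule between these cut points is exact.
    cuts = {p_lo, p_hi}
    for c in (q_lo - eps, q_lo + eps, q_hi - eps, q_hi + eps):
        if p_lo < c < p_hi:
            cuts.add(c)
    xs = sorted(cuts)
    stripe_area = sum(
        (b - a) * (covered(a) + covered(b)) / 2.0 for a, b in zip(xs, xs[1:])
    )
    return stripe_area / ((p_hi - p_lo) * (q_hi - q_lo))


if __name__ == "__main__":
    # Identical unit intervals, eps = 0.5: stripe covers 1 - 0.5^2 of the square.
    print(mating_probability(0.0, 1.0, 0.0, 1.0, 0.5))
```

Because all boundaries are axis-parallel or diagonal, the area could equally be written as a closed-form case distinction; the piecewise integration avoids enumerating the cases.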

Slide 17: Optimal Dimension Order
- For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions sorted by increasing mating probability
- Algorithm:
    for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
        determine ODO;
        for each pair of points (p,q) on (P,Q)
            test dist(p,q) ≤ ε using ODO;
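
The algorithm on this slide might be sketched as follows for a single partition pair. The sketch assumes uniformly distributed points inside each bounding box, boxes with positive extent in every dimension, and boxes given as (lower, upper) corner tuples; all function names are hypothetical.

```python
def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Share of [p_lo, p_hi] x [q_lo, q_hi] with |x - y| <= eps (uniform assumption)."""
    def covered(x):
        return max(0.0, min(q_hi, x + eps) - max(q_lo, x - eps))

    cuts = {p_lo, p_hi}
    cuts.update(c for c in (q_lo - eps, q_lo + eps, q_hi - eps, q_hi + eps)
                if p_lo < c < p_hi)
    xs = sorted(cuts)
    area = sum((b - a) * (covered(a) + covered(b)) / 2.0 for a, b in zip(xs, xs[1:]))
    return area / ((p_hi - p_lo) * (q_hi - q_lo))


def optimal_dimension_order(box_p, box_q, eps):
    """Dimension indices sorted by increasing mating probability."""
    (lo_p, hi_p), (lo_q, hi_q) = box_p, box_q
    probs = [mating_probability(lo_p[i], hi_p[i], lo_q[i], hi_q[i], eps)
             for i in range(len(lo_p))]
    return sorted(range(len(probs)), key=probs.__getitem__)


def join_partition_pair(P, Q, box_p, box_q, eps):
    """Join two partitions, testing dimensions in the order determined once per pair."""
    order = optimal_dimension_order(box_p, box_q, eps)
    limit = eps * eps
    result = []
    for p in P:
        for q in Q:
            dist_sq = 0.0
            for i in order:
                dist_sq += (p[i] - q[i]) ** 2
                if dist_sq > limit:
                    break  # rejected early
            else:
                result.append((p, q))  # loop finished without exceeding eps^2
    return result
```

The order is computed once per partition pair and reused for every point pair in it, so its cost is amortised over the inner loop.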

Slide 18: Shape of the Intersection Area
- 20 different shapes are possible
- Easy proof of completeness and efficient case distinction by assigning codes to the corners:
  1: Corner is left of or above the ε-stripe
  2: Corner is on the ε-stripe
  3: Corner is right of or below the ε-stripe
- Easy formulas (only 45° and 90° angles)
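
The corner coding can be sketched directly. The sketch assumes the event space has d0[P] on the x-axis and d0[Q] on the y-axis, with the stripe given by x − ε ≤ y ≤ x + ε; the function names and the corner ordering are illustrative choices, not taken from the slides.

```python
def corner_code(x, y, eps):
    """1: left of/above the stripe, 2: on the stripe, 3: right of/below the stripe."""
    offset = y - x
    if offset > eps:
        return 1
    if offset < -eps:
        return 3
    return 2


def shape_code(p_lo, p_hi, q_lo, q_hi, eps):
    """Codes of the corners in the order upper-left, lower-left, upper-right, lower-right."""
    return (
        corner_code(p_lo, q_hi, eps),
        corner_code(p_lo, q_lo, eps),
        corner_code(p_hi, q_hi, eps),
        corner_code(p_hi, q_lo, eps),
    )


def count_shapes(max_coord=8, eps=1):
    """Enumerate integer rectangles and count the distinct corner-code tuples."""
    coords = range(max_coord + 1)
    seen = set()
    for p_lo in coords:
        for p_hi in range(p_lo + 1, max_coord + 1):
            for q_lo in coords:
                for q_hi in range(q_lo + 1, max_coord + 1):
                    seen.add(shape_code(p_lo, p_hi, q_lo, q_hi, eps))
    return len(seen)
```

Since y − x is largest at the upper-left corner and smallest at the lower-right corner, the codes cannot decrease from upper-left over either middle corner to lower-right, which leaves 3·1 + 2·4 + 1·9 = 20 admissible code tuples.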

Slide 19: Experimental Evaluation: R-tree Similarity Join
- 8-dimensional data, uniformly distributed

Slide 20: Experimental Evaluation: R-tree Similarity Join
- 16-dimensional data from CAD similarity search

Slide 21: Experimental Evaluation: Scalability
[Figure panels: MuX, uniform data; Z-RSJ, uniform data]

Slide 22: Experimental Evaluation: Scalability
[Figure panel: EGO, CAD data]

Slide 23: Conclusion
- Conclusion:
  - The similarity join is an important database primitive for knowledge discovery in databases
  - Many different basic algorithms exist
  - Most of them can be accelerated by our optimal dimension order
- Future Work:
  - New applications of the similarity join
  - Further (multi-parameter) optimization of the similarity join
  - Parallel and distributed environments