23 1 Optimal Dimension Order: A Generic Technique for the Similarity Join. Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2). (1) University for Health Informatics and Technology, Innsbruck; (2) University of Munich.
23 2 Feature-Based Similarity
23 3 Simple Similarity Queries. Specify a query object and either find all similar objects (range query) or find the k most similar objects (nearest-neighbor query).
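The two query types above can be illustrated with a minimal sketch. It assumes feature vectors are plain tuples compared by Euclidean distance and uses a linear scan; the function names `range_query` and `knn_query` are hypothetical and not taken from the slides.

```python
import heapq
import math


def dist(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_query(data, query, eps):
    """Return all objects whose distance to the query object is at most eps."""
    return [p for p in data if dist(p, query) <= eps]


def knn_query(data, query, k):
    """Return the k objects closest to the query object, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: dist(p, query))
```

A real system would answer both queries through an index structure rather than a scan; the sketch only pins down the semantics.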
23 4 Join Applications: Catalogue Matching. E.g. matching astronomy catalogues R and S.
23 5 Join Applications: Clustering. Clustering (e.g. DBSCAN) by means of a similarity self-join.
23 6 R-Tree Similarity Join. Depth-first traversal of the two trees over R and S [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993].
23 7 The ε-kdB-Tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]. Assumption: two adjacent ε-stripes fit in main memory. This is unrealistic for large data sets that are clustered, skewed, and high-dimensional.
23 8 Epsilon Grid Order [Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]
23 9 Common Properties. Decomposition of data/space into regions; regions described by hyper-rectangles. Generic algorithm: for each pair (P,Q) of partitions having dist(P,Q) ≤ ε: for each pair of points (p,q) in (P,Q): test dist(p,q) ≤ ε. Most CPU effort goes into the distance test between vectors. Idea: speed up the distance test.
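The generic scheme above might be sketched as follows. This is an illustration under stated assumptions, not any of the cited algorithms: partitions are given as in-memory lists of points, their hyper-rectangles are computed on the fly, and the names `bounding_box`, `box_mindist` and `partition_join` are hypothetical.

```python
import math


def bounding_box(points):
    """Minimum bounding hyper-rectangle of a partition as (lower, upper) lists."""
    dims = range(len(points[0]))
    lo = [min(p[i] for p in points) for i in dims]
    hi = [max(p[i] for p in points) for i in dims]
    return lo, hi


def box_mindist(box_p, box_q):
    """Smallest possible Euclidean distance between points of two boxes."""
    (lo_p, hi_p), (lo_q, hi_q) = box_p, box_q
    total = 0.0
    for i in range(len(lo_p)):
        # Gap between the two intervals in dimension i (0 if they overlap).
        gap = max(lo_q[i] - hi_p[i], lo_p[i] - hi_q[i], 0.0)
        total += gap * gap
    return math.sqrt(total)


def partition_join(parts_r, parts_s, eps):
    """Report all point pairs (p, q) with dist(p, q) <= eps."""
    result = []
    for P in parts_r:
        box_p = bounding_box(P)
        for Q in parts_s:
            if box_mindist(box_p, bounding_box(Q)) > eps:
                continue  # no pair of this partition pair can qualify
            for p in P:
                for q in Q:
                    if math.dist(p, q) <= eps:  # the CPU-intensive test
                        result.append((p, q))
    return result
```

The innermost test runs once per candidate point pair, which is why speeding it up pays off for every algorithm of this shape.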
23 10 Related Work: Plane Sweep for Polygons [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]. Observations: It is more efficient to use the x-axis as sweep direction when the projections of the polygons onto the y-axis yield high overlap. The sweep direction is decided by the projections of the bounding boxes (integrating a pdf).
23 11 Feature Vectors in the Similarity Join. Distance computation between feature vectors p, q: for (i = 0; i < d; i++) { dist = dist + (p[i] - q[i])²; if (dist > ε²) break; } Order dimensions by mating probability (increasing). [Figure: axes d0, d1]
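A minimal sketch of the early-terminating distance test, assuming squared Euclidean distance and an optional dimension order passed in by the caller. The function name `within_eps` and the returned counter of evaluated dimensions are illustrative additions, not part of the slides.

```python
def within_eps(p, q, eps, order=None):
    """Test dist(p, q) <= eps, stopping as soon as the partial sum exceeds eps^2.

    Returns (is_match, number_of_dimensions_evaluated).
    """
    if order is None:
        order = range(len(p))  # natural order d0, d1, ...
    eps_sq = eps * eps
    dist_sq = 0.0
    evaluated = 0
    for i in order:
        diff = p[i] - q[i]
        dist_sq += diff * diff
        evaluated += 1
        if dist_sq > eps_sq:
            return False, evaluated  # early termination
    return True, evaluated
```

For p = (0, 0, 0, 5) and q = (0, 0, 0, 0) with eps = 1, the natural order needs all four dimensions to reject the pair, whereas an order that starts with dimension 3 rejects it after one. Visiting dimensions with low mating probability first aims at exactly this effect.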
23 12 Computation of the Mating Probability. To determine the mating probability for dimension d_i: project the bounding boxes onto the d_i-axis. [Figure: axes d0, d1]
23 13 Computation of the Mating Probability. To determine the mating probability for dimension d_i: project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space. [Figure: axis d0]
23 14 Computation of the Mating Probability. To determine the mating probability for dimension d_i: project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space. The d_0-projection of each point pair is located in this event space. [Figure: axes d0[P], d0[Q]]
23 15 Computation of the Mating Probability. To determine the mating probability for dimension d_i: project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space. The d_0-projection of each point pair is located in this event space; mating point pairs lie on the ε-stripe. [Figure: axes d0[P], d0[Q]; ε-stripe between y = x - ε and y = x + ε]
23 16 Computation of the Mating Probability. To determine the mating probability for dimension d_i: project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space. Mating probability for d_0. [Figure: axes d0[P], d0[Q]]
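One way to make this computation concrete is sketched below. It assumes that points are uniformly distributed within the projected intervals, so that the mating probability equals the area of the event space covered by the ε-stripe divided by the total area of the event space. The uniformity assumption, the closed-form integration and the function names are illustrative choices; the slides do not spell out the formula.

```python
def _clamp_integral(u, lo, hi):
    """Continuous antiderivative of clamp(u, lo, hi) with respect to u."""
    if u <= lo:
        return lo * u
    if u <= hi:
        return 0.5 * u * u + 0.5 * lo * lo
    return hi * u + 0.5 * (lo * lo - hi * hi)


def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Probability that |x - y| <= eps for x uniform in [p_lo, p_hi]
    and y uniform in [q_lo, q_hi] (assumed uniform distributions)."""
    width, height = p_hi - p_lo, q_hi - q_lo
    if width <= 0 or height <= 0:
        raise ValueError("this sketch requires intervals of positive length")

    def area_below(c):
        # Area of the event space below the line y = x + c, up to a
        # constant that cancels when two such areas are subtracted.
        return (_clamp_integral(p_hi + c, q_lo, q_hi)
                - _clamp_integral(p_lo + c, q_lo, q_hi))

    stripe_area = area_below(eps) - area_below(-eps)
    return stripe_area / (width * height)
```

For two unit intervals [0, 1] and ε = 0.5 this gives 0.75, which matches the elementary result 1 - (1 - ε)² for the unit square.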
23 17 Optimal Dimension Order. For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability. Algorithm: for each pair (P,Q) of partitions having dist(P,Q) ≤ ε: determine ODO; for each pair of points (p,q) in (P,Q): test dist(p,q) ≤ ε using ODO.
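The inner part of this algorithm, for a single partition pair, might look like the sketch below. It assumes the per-dimension mating probabilities have already been computed for the pair (P,Q) and are passed in as a list; the names `optimal_dimension_order` and `join_partition_pair` are hypothetical.

```python
def optimal_dimension_order(mating_probs):
    """Dimension indices sorted by increasing mating probability."""
    return sorted(range(len(mating_probs)), key=lambda i: mating_probs[i])


def join_partition_pair(P, Q, eps, mating_probs):
    """Join two partitions, evaluating dimensions in the order determined
    once for the whole partition pair."""
    order = optimal_dimension_order(mating_probs)
    eps_sq = eps * eps
    result = []
    for p in P:
        for q in Q:
            dist_sq = 0.0
            for i in order:
                diff = p[i] - q[i]
                dist_sq += diff * diff
                if dist_sq > eps_sq:
                    break  # pair rejected early
            else:
                # Loop finished without break: dist(p, q) <= eps.
                result.append((p, q))
    return result
```

The order is determined once per partition pair and reused for every point pair in it, so the sorting cost is amortized over all distance tests of that pair.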
23 18 Shape of the Intersection Area. 20 different shapes are possible. Easy proof of completeness and efficient case distinction by assigning codes to the corners: 1: corner is left of or above the ε-stripe; 2: corner is on the ε-stripe; 3: corner is right of or below the ε-stripe. Easy formulas (only 45° and 90° angles).
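The corner coding can be sketched as follows. The sketch assumes the event space has d0[P] on the x-axis and d0[Q] on the y-axis, with the ε-stripe lying around the diagonal y = x; this axis assignment, the corner ordering and the function names are assumptions made for illustration.

```python
from itertools import product


def corner_code(x, y, eps):
    """1: left of or above the stripe, 2: on the stripe, 3: right of or below it."""
    if y - x > eps:
        return 1
    if x - y > eps:
        return 3
    return 2


def shape_code(p_lo, p_hi, q_lo, q_hi, eps):
    """Codes of the corners in the order
    (lower-left, lower-right, upper-left, upper-right)."""
    corners = [(p_lo, q_lo), (p_hi, q_lo), (p_lo, q_hi), (p_hi, q_hi)]
    return tuple(corner_code(x, y, eps) for x, y in corners)


def consistent_codes():
    """All code 4-tuples compatible with the ordering of y - x at the corners:
    upper-left has the largest y - x, lower-right the smallest."""
    return [
        (ll, lr, ul, ur)
        for ll, lr, ul, ur in product((1, 2, 3), repeat=4)
        if ul <= ll <= lr and ul <= ur <= lr
    ]
```

Under this encoding the ordering constraints leave 20 code combinations, which agrees with the number of shapes stated on the slide. A case distinction on the 4-tuple can then select the area formula for each shape.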
23 19 Experimental Evaluation: R-tree Sim. Join 8-dimensional data, uniformly distributed
23 20 Experimental Evaluation: R-tree Sim. Join 16-dimensional data, from CAD-similarity search
23 21 Experimental Evaluation: Scalability. MuX, uniform data; Z-RSJ, uniform data
23 22 Experimental Evaluation: Scalability EGO, CAD data
23 Conclusion: The similarity join is an important database primitive for knowledge discovery in databases. There are many different basic algorithms, and most of them can be accelerated by our optimal dimension order. Future Work: New applications of the similarity join. Further (multi-parameter) optimization of the similarity join. Parallel and distributed environments.