On Fast Non-Metric Similarity Search by Metric Access Methods Tomáš Skopal Charles University in Prague Faculty of Mathematics and Physics.


1 On Fast Non-Metric Similarity Search by Metric Access Methods Tomáš Skopal (tomas@skopal.net) Charles University in Prague Faculty of Mathematics and Physics Department of Software Engineering Prague, Czech Republic

2 Presentation Outline (EDBT 2006, Munich)
- introduction
  - motivation of non-metric similarity search
  - metric access methods, intrinsic dimensionality
- our objective: fast non-metric search
  - turning non-metric into metric
  - the TriGen algorithm
  - experimental results
  - conclusions and future work

3 Similarity Search in Multimedia Databases
- non-structured data instances
  - multimedia objects, texts, sequences, time series, etc.
- distance function $d: U \times U \to \mathbb{R}$
  - $d(O_1, O_2)$ is interpreted as a dissimilarity score of two objects
  - metric properties ($\forall O_i, O_j, O_k \in U$):
    reflexivity: $d(O_i, O_j) = 0 \Leftrightarrow O_i = O_j$
    positivity: $d(O_i, O_j) > 0 \Leftrightarrow O_i \neq O_j$
    symmetry: $d(O_i, O_j) = d(O_j, O_i)$
    triangular inequality: $d(O_i, O_j) + d(O_j, O_k) \geq d(O_i, O_k)$
  - triangular triplet: a triplet $(a, b, c)$ such that $a + b \geq c \wedge a + c \geq b \wedge b + c \geq a$
  - when the triangular inequality is satisfied by d, then for every $O_1, O_2, O_3 \in U$ the triplet $(d(O_1, O_2), d(O_2, O_3), d(O_1, O_3))$ is a triangular triplet
- a semi-metric satisfies reflexivity, positivity and symmetry; a metric additionally satisfies the triangular inequality
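
The triangular-triplet condition above can be checked mechanically. The following is a minimal sketch, not taken from the slides: the helper names `is_triangular` and `first_violation`, the toy dataset and the two toy measures are all assumptions made for illustration.

```python
from itertools import combinations

def is_triangular(a, b, c):
    """True if (a, b, c) is a triangular triplet."""
    return a + b >= c and a + c >= b and b + c >= a

def first_violation(objects, d):
    """Return the first triple of objects whose distances are not triangular, else None."""
    for x, y, z in combinations(objects, 3):
        if not is_triangular(d(x, y), d(y, z), d(x, z)):
            return (x, y, z)
    return None

# Toy data (assumed): three points on a line.
points = [0.0, 1.0, 2.0]
abs_diff = lambda x, y: abs(x - y)    # a metric
sq_diff = lambda x, y: (x - y) ** 2   # a semi-metric: distances 1, 1, 4

print(first_violation(points, abs_diff))  # no violating triple
print(first_violation(points, sq_diff))   # the triple (0.0, 1.0, 2.0)
```

The squared difference is reflexive, positive and symmetric, yet the triple with distances (1, 1, 4) is not triangular, so it is only a semi-metric.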

4 Metric Access Methods
- given a metric d and a dataset $S \subseteq U$, metric access methods (MAMs) can be used to organize the objects of S
  - Reason: fast query processing (range and k-nearest neighbor queries)
  - Principle of MAMs: structured decomposition of objects into equivalence classes, such that only some "candidate" classes have to be searched when querying; the filtering of non-relevant classes is possible due to the metric properties (esp. the triangular inequality)
  - Examples: M-tree, PM-tree, D-index, gh-tree, vp-tree, LAESA, etc.

5 Metric Access Methods, cont.
- intrinsic dimensionality
  - definition (as proposed in [4]): $\rho(S, d) = \frac{\mu^2}{2\sigma^2}$, where $\mu$ is the mean and $\sigma^2$ is the variance of the distance distribution in S
  - indicates how efficiently (quickly) a dataset S can be queried using a metric d
    low $\rho$ (e.g. below 10) means the dataset is well-structured, i.e. there exist tight clusters of objects
    high $\rho$ means the dataset is poorly structured, i.e. objects are almost equally distant
    in consequence, intrinsically high-dimensional datasets are hard to organize, so that querying becomes inefficient (it approaches a sequential scan)
  - example: an M-tree hierarchy built on a high-dimensional dataset
[figure: two panels, "low intrinsic dimensionality" and "high intrinsic dimensionality"]
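
As an illustration of the formula above, here is a minimal sketch that estimates the intrinsic dimensionality from all pairwise distances. The function name `intrinsic_dim` and the two uniformly random toy datasets are assumptions, not part of the presentation.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance

def intrinsic_dim(objects, d):
    """rho(S, d) = mu^2 / (2 * sigma^2) over all pairwise distances in S."""
    dists = [d(x, y) for x, y in combinations(objects, 2)]
    mu = mean(dists)
    return mu * mu / (2.0 * pvariance(dists, mu))

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed toy datasets: uniform random vectors in 1 and in 20 dimensions.
low_dim = [(rng.random(),) for _ in range(200)]
high_dim = [tuple(rng.random() for _ in range(20)) for _ in range(200)]

rho_low = intrinsic_dim(low_dim, math.dist)    # Euclidean distance
rho_high = intrinsic_dim(high_dim, math.dist)
print(rho_low, rho_high)
```

In the 20-dimensional set the distances concentrate around their mean, so the variance shrinks and the estimate grows, which matches the "almost equally distant" case described on the slide.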

6 Metric vs. non-metric measures
- non-metric measures are often robust (resistant to outliers, errors in objects, etc.)
  - the symmetry and mainly the triangular inequality are often violated
- they cannot be directly used with MAMs
[figure: a triangle with sides a, b, c violating the triangular inequality ($a > b + c$), and a pair of objects with asymmetric distances ($a \neq b$)]

7 Examples of Non-metric measures
- various k-median distances
  - measure the distance between the two (k-th) most similar portions in objects
- COSIMIR
  - back-propagation network with a single output neuron serving as a distance, allows training
- Dynamic Time Warping distance
  - sequence alignment technique
  - minimizes the sum of distances between sequence elements
- fractional $L_p$ distances
  - generalization of Minkowski distances ($p < 1$)
  - more robust to extreme differences in coordinates
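
To make the last item concrete, the sketch below implements a fractional $L_p$ distance in its usual Minkowski form and shows one triple of points for which the triangular inequality fails. The function name `fractional_lp` and the three points are assumptions chosen for illustration.

```python
def fractional_lp(x, y, p=0.5):
    """Minkowski-style distance with 0 < p < 1 (a semi-metric, not a metric)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Assumed toy points: three corners of the unit square.
x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

d_xy = fractional_lp(x, y)  # 1.0
d_yz = fractional_lp(y, z)  # 1.0
d_xz = fractional_lp(x, z)  # (1 + 1) ** 2 = 4.0

print(d_xz > d_xy + d_yz)   # the direct distance exceeds the detour
```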

8 Turning Non-metric into Metric
- the reflexivity and positivity
  - by setting a minimum distance lower bound $d^- < 0$, i.e. for $O_1 \neq O_2$: $d_{rp}(O_1, O_2) = d(O_1, O_2) + |d^-| + \text{some small value}$, otherwise $d_{rp}(O_1, O_2) = 0$
- the symmetry
  - e.g. $d_s(O_1, O_2) = \min(d(O_1, O_2), d(O_2, O_1))$
  - the query is processed using $d_s$, and the query result is re-filtered using d
- how to satisfy the triangular inequality?
  - we apply a modifying function f on d, turning the semi-metric into a metric
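
The two adjustments above can be written as small wrappers around a raw measure. This is a sketch under stated assumptions: the wrapper names, the `eps` parameter standing in for "some small value", and the toy measure `raw` with its lower bound of -0.5 are all invented for illustration.

```python
def make_symmetric(d):
    """d_s(O1, O2) = min(d(O1, O2), d(O2, O1))."""
    def d_s(o1, o2):
        return min(d(o1, o2), d(o2, o1))
    return d_s

def make_reflexive_positive(d, d_minus, eps=1e-9):
    """d_rp: 0 for equal objects, otherwise d shifted by |d_minus| + eps."""
    def d_rp(o1, o2):
        if o1 == o2:
            return 0.0
        return d(o1, o2) + abs(d_minus) + eps
    return d_rp

# Assumed toy measure on numbers: asymmetric and possibly negative,
# but never below -0.5.
def raw(a, b):
    return abs(a - b) - 0.5 + (0.1 if a < b else 0.0)

semi = make_reflexive_positive(make_symmetric(raw), d_minus=-0.5)

print(semi(1.0, 1.0), semi(1.0, 1.2), semi(1.2, 1.0))
```

Because $d_s$ never exceeds d in either direction, a range query evaluated with $d_s$ returns a superset of the true answer, which is why the result can be re-filtered with the original d afterwards.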

9 SP-modifiers
Let f be a function $f: \mathbb{R} \to \mathbb{R}$ such that $f(0) = 0$ and f is increasing (i.e. $f(x) > f(y) \Leftrightarrow x > y$). For similarity search purposes, $f(d(\cdot, \cdot))$, further denoted as $d_f$, can be safely used instead of just d. (In the case of a range query $(Q, r_Q)$, the query radius $r_Q$ is modified to $f(r_Q)$.)
Proof: All similarity orderings are preserved.
1) Consider the set of all pairs of objects from U.
2) Order the pairs with respect to the distance between the two objects in each pair.
3) The ordering does not change after the application of any f on the distances, because f is increasing.
We call such a function f a similarity-preserving modifier (SP-modifier for short).
[figure: plots of example SP-modifiers $f_1$, $f_2$, $f_3$, $f_4$]
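
The proof argument can be replayed on a small example. In this sketch the choice of the square root as SP-modifier, the toy objects and the squared-difference measure are assumptions. It checks that the ordering of all pairs is unchanged and that a range query with radius $r_Q$ under d returns the same objects as a query with radius $f(r_Q)$ under $d_f$.

```python
import math
from itertools import combinations

def f(x):
    """An assumed SP-modifier: increasing, with f(0) = 0."""
    return math.sqrt(x)

objects = [0, 1, 3, 7, 15]          # toy data, all pairwise gaps distinct
d = lambda a, b: (a - b) ** 2       # toy semi-metric
d_f = lambda a, b: f(d(a, b))       # modified measure

pairs = list(combinations(objects, 2))
order_d = sorted(pairs, key=lambda p: d(*p))
order_df = sorted(pairs, key=lambda p: d_f(*p))

# Range query (Q, r_Q) evaluated by a plain linear scan.
Q, r_Q = 0, 10
range_d = [o for o in objects if d(Q, o) <= r_Q]
range_df = [o for o in objects if d_f(Q, o) <= f(r_Q)]

print(order_d == order_df, range_d, range_df)
```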

10 TG-modifiers
We want to find an SP-modifier that forces d to satisfy the triangular inequality:
- any concave SP-modifier f is metric-preserving (proof in [3])
  - when applied on any metric $d(\cdot, \cdot)$, $d_f$ is a metric as well
  - when applied on a triangular triplet $(a, b, c)$, $(f(a), f(b), f(c))$ is a triangular triplet as well
- any concave SP-modifier is triangle-generating (a TG-modifier)
  - when applied on all possible triplets, some of them become triangular (theory of concave functions)
- the more concave f is, the more triplets become triangular
  - once a triplet becomes triangular, it remains triangular after the application of any other TG-modifier
Theorem: Every semi-metric can be turned by a single TG-modifier into a metric.
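
The metric-preserving claim rests on a standard property of concave functions; the short derivation below is a sketch of that reasoning step and is not reproduced from the slides or from [3]. It assumes f is concave and increasing on $[0, \infty)$ with $f(0) = 0$, and that $(a, b, c)$ is a triangular triplet with $a, b > 0$.

```latex
% Concavity with f(0) = 0: write a as a convex combination of (a + b) and 0.
f(a) \;\geq\; \frac{a}{a+b}\, f(a+b) + \frac{b}{a+b}\, f(0) \;=\; \frac{a}{a+b}\, f(a+b),
\qquad
f(b) \;\geq\; \frac{b}{a+b}\, f(a+b).

% Adding both inequalities gives subadditivity:
f(a) + f(b) \;\geq\; f(a+b).

% Since f is increasing and c \leq a + b:
f(c) \;\leq\; f(a+b) \;\leq\; f(a) + f(b).
```

The same argument applies to the other two inequalities of the triplet, so $(f(a), f(b), f(c))$ is again triangular.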

11 Proof: Incremental Triplet Stretching
[figure: a non-triangular triplet with $a > b + c$ and its images under three increasingly concave modifiers:
 low-concave $f_1$: $f_1(a) > f_1(b) + f_1(c)$;
 high-concave $f_2$: $f_2(a) \leq f_2(b) + f_2(c)$;
 extremely high-concave $f_3$: $f_3(a) \approx f_3(b) \approx f_3(c)$]
We repeatedly apply TG-modifiers on all triplets (generated by $d(\cdot, \cdot)$ on S), starting with a less concave TG-modifier and proceeding with more concave ones. We continue applying more and more concave TG-modifiers (e.g. by nesting them) until all the triplets have been turned into triangular ones.
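
The stretching idea can be sketched on a single triplet. The family $f(x, w) = x^{1/(1+w)}$ used here is an assumption standing in for the modifiers $f_1, f_2, f_3$ of the slide (the transcript does not preserve their formulas), and the triplet (4, 1, 1) is an invented example.

```python
def is_triangular(a, b, c):
    return a + b >= c and a + c >= b and b + c >= a

def power_modifier(x, w):
    """Assumed concave family: identity for w = 0, more concave as w grows."""
    return x ** (1.0 / (1.0 + w))

triplet = (4.0, 1.0, 1.0)  # a > b + c, so not triangular under the raw measure

results = {}
for w in (0.0, 0.5, 3.0, 50.0):
    stretched = tuple(power_modifier(v, w) for v in triplet)
    results[w] = is_triangular(*stretched)
    print(w, stretched, results[w])
```

With the largest weight all three values are close to 1, which corresponds to the "extremely high-concave" case where the triplet is nearly equilateral.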

12 Optimal TG-modifier
There exist infinitely many TG-modifiers that turn a given semi-metric into a metric. However, not all are suitable for fast similarity search. The optimal TG-modifier should:
- turn every non-triangular triplet generated by d (considering the objects from S) into a triangular one (i.e. enforce the triangular inequality)
- keep the intrinsic dimensionality of S with respect to $d_f$ as low as possible
[figure: an example TG-modifier]
Example: This TG-modifier turns every semi-metric into a metric, but is useless for searching by MAMs. All classes maintained by a MAM overlap every query, so the search deteriorates to a sequential scan.
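
The trade-off between the two requirements can be observed numerically. In this sketch the concave family $x^{1/(1+w)}$, the function names and the random 2-D dataset are assumptions; the intrinsic dimensionality is computed as $\mu^2 / (2\sigma^2)$ over all pairwise distances.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance

def power_modifier(x, w):
    """Assumed concave family: identity for w = 0, more concave as w grows."""
    return x ** (1.0 / (1.0 + w))

def intrinsic_dim(dists):
    mu = mean(dists)
    return mu * mu / (2.0 * pvariance(dists, mu))

rng = random.Random(1)  # fixed seed so the run is repeatable
points = [(rng.random(), rng.random()) for _ in range(100)]
dists = [math.dist(p, q) for p, q in combinations(points, 2)]

weights = (0.0, 1.0, 4.0, 16.0)
rhos = [intrinsic_dim([power_modifier(x, w) for x in dists]) for w in weights]
for w, rho in zip(weights, rhos):
    print(w, round(rho, 1))
```

As the modifier becomes more concave, all modified distances drift toward the same value, the variance collapses and the intrinsic dimensionality grows, which is the effect that makes an overly concave modifier useless for MAMs.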

13 Scaling the concavity
How to find an optimal TG-modifier for a given d (and S)? We make use of some predefined TG-bases:
- a TG-base is an extended TG-modifier that uses a concavity weight $w \geq 0$ as its second parameter, i.e. $f: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$
  - for $w = 0$, the TG-base turns into the identity, i.e. $f(x, 0) = x$
  - with increasing w, the TG-modifier $f(x, w)$ becomes more concave
- the greater w (and thus the more concave f),
  - the more triplets become triangular
  - the higher the intrinsic dimensionality is
- we can relax the strict condition that all triplets must become triangular by introducing a TG-error tolerance $\theta$ (the tolerated ratio of non-triangular triplets to all examined triplets)
  - this gives a choice between exact and approximate search ($\theta = 0$ or $\theta > 0$)
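
The following sketch measures, for several concavity weights, the fraction of non-triangular triplets on a small dataset. The function `tg_error`, the weighted family $x^{1/(1+w)}$ used as the TG-base, the squared Euclidean toy measure and the random points are assumptions for illustration; the exact definition of the error used by the author may differ.

```python
import math
import random
from itertools import combinations

def is_triangular(a, b, c):
    return a + b >= c and a + c >= b and b + c >= a

def tg_base(x, w):
    """Assumed TG-base: identity for w = 0, more concave as w grows."""
    return x ** (1.0 / (1.0 + w))

def tg_error(triplets, w):
    """Fraction of distance triplets that stay non-triangular under tg_base(., w)."""
    bad = sum(
        1 for a, b, c in triplets
        if not is_triangular(tg_base(a, w), tg_base(b, w), tg_base(c, w))
    )
    return bad / len(triplets)

rng = random.Random(2)  # fixed seed so the run is repeatable
points = [(rng.random(), rng.random()) for _ in range(30)]
sq_euclid = lambda p, q: math.dist(p, q) ** 2   # toy semi-metric

triplets = [
    (sq_euclid(x, y), sq_euclid(y, z), sq_euclid(x, z))
    for x, y, z in combinations(points, 3)
]

errors = {w: tg_error(triplets, w) for w in (0.0, 0.25, 1.0, 3.0)}
print(errors)
```

Accepting a small positive tolerance would allow a smaller weight to be chosen, at the price of approximate query results.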

14 Proposed TG-bases
- general-purpose TG-bases
  - Fractional Power TG-base (FP-base)
  - Rational Bézier Quadric TG-bases (RBQ-bases)
    each such TG-base is additionally parameterized by the second Bézier point $(a, b)$
    choosing a different $(a, b)$ allows predefining the "place of maximum concavity" in the TG-base
- we need to find an optimal w for a TG-base f, such that $d_f$ becomes a metric but w is as low as possible: this is the task of the TriGen algorithm

15 The TriGen algorithm
The algorithm finds a TG-modifier (formed by a TG-base and the appropriate concavity weight w) which turns a given semi-metric d into an (approximated) metric, while the intrinsic dimensionality is kept as low as possible. When searching for the optimal concavity weight, the algorithm makes use of halving the concavity interval.
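
A rough sketch of the interval-halving search follows. It is not the author's published pseudocode: the function names, the starting weight of 1, the doubling phase used until an upper bound is found, the fixed iteration count and the TG-base $x^{1/(1+w)}$ are all assumptions made so that the example is runnable.

```python
def is_triangular(a, b, c):
    return a + b >= c and a + c >= b and b + c >= a

def tg_base(x, w):
    """Assumed TG-base: identity for w = 0, more concave as w grows."""
    return x ** (1.0 / (1.0 + w))

def tg_error(triplets, base, w):
    """Fraction of triplets that stay non-triangular under base(., w)."""
    bad = sum(
        1 for a, b, c in triplets
        if not is_triangular(base(a, w), base(b, w), base(c, w))
    )
    return bad / len(triplets)

def find_weight(triplets, base, tolerance=0.0, iterations=24):
    """Search for a low weight whose TG-error is within the tolerance."""
    if tg_error(triplets, base, 0.0) <= tolerance:
        return 0.0                      # d already behaves like a metric
    lo, hi, best = 0.0, None, None
    w = 1.0
    for _ in range(iterations):
        if tg_error(triplets, base, w) <= tolerance:
            hi = best = w               # good enough: try a less concave one
        else:
            lo = w                      # too many violations: go more concave
        # Double until an upper bound exists, then halve the interval.
        w = (lo + hi) / 2.0 if hi is not None else 2.0 * w
    return best

# Assumed toy distance triplets: one violating, one already triangular.
triplets = [(9.0, 1.0, 1.0), (3.0, 4.0, 5.0)]
w_best = find_weight(triplets, tg_base)
print(w_best)
```

For the violating triplet the modified values become triangular once $9^{1/(1+w)} \leq 2$, i.e. for $w \geq \log_2 9 - 1$, so the search settles just above that value. In practice the triplets would be sampled from the dataset S, and the search could be repeated for several TG-bases, keeping the one with the lowest intrinsic dimensionality.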

16 Experimental Results
The testbed:
- two datasets (1 real: images (histograms), 1 synthetic: polygons)
- 10 non-metric measures (6 for images, 4 for polygons)
  - TriGen was used to modify each semi-metric into a metric
- 2 MAMs: M-tree and PM-tree
Testing of:
1) intrinsic dimensionalities of the datasets (with respect to $d_f$, where f is the TG-modifier found by TriGen)
2) k-NN queries: performance and retrieval error (when the TG-error tolerance $\theta > 0$)

17 Experiments – intrinsic dimensionalities

18 Experiments – k-NN queries

19 Experiments – k-NN queries

20 Conclusions and Future Work
We have presented:
- a way of fast searching in non-metric datasets by metric access methods
  - in particular, the TriGen algorithm for turning any semi-metric into a metric
Future work:
  - a generalized framework for fast exact and approximate similarity search (either metric or non-metric), in combination with previous work [2]

21 References
[1] T. Skopal, J. Pokorný, V. Snášel. Nearest Neighbours Search using the PM-tree. In DASFAA 2005, Beijing, China, pages 803–815. LNCS 3453, Springer.
[2] T. Skopal, P. Moravec, J. Pokorný, V. Snášel. Metric Indexing for the Vector Model in Text Retrieval. In SPIRE 2004, Padova, Italy, pages 183–195. LNCS 3246, Springer.
[3] P. Corazza. Introduction to metric-preserving functions. American Mathematical Monthly 104(4), 1999.
[4] E. Chávez, G. Navarro. A Probabilistic Spell for the Curse of Dimensionality. In ALENEX 2001, LNCS 2153, Springer.

