Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen.


1 Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen

2 Metric space
• A metric space is a tuple (M, d), where M is the domain of objects and d is a distance function that defines the similarity between the objects in M.
• The function d has four properties:
1. symmetry: d(q, o) = d(o, q)
2. non-negativity: d(q, o) ≥ 0
3. identity: d(q, o) = 0 if and only if q = o
4. triangle inequality: d(q, o) ≤ d(q, p) + d(o, p)
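The four properties can be checked empirically on a finite sample. The sketch below is only an illustration: the helper name `check_metric` and the tolerance `tol` are assumptions, not part of the slides, and passing the check on a sample does not prove that a function is a metric on the whole domain.

```python
from itertools import product

def check_metric(sample, dist, tol=1e-12):
    """Return True if dist satisfies the four metric properties on sample."""
    for q, o in product(sample, repeat=2):
        d = dist(q, o)
        if d < 0:                          # non-negativity
            return False
        if abs(d - dist(o, q)) > tol:      # symmetry
            return False
        if (d == 0) != (q == o):           # identity
            return False
    for q, o, p in product(sample, repeat=3):
        # triangle inequality: d(q, o) <= d(q, p) + d(o, p)
        if dist(q, o) > dist(q, p) + dist(o, p) + tol:
            return False
    return True
```

For example, the absolute difference on integers passes the check, while the squared difference fails the triangle inequality.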

3 Metric space
• For example, with edit distance as the distance function, any set of English words forms a metric space.
• The edit distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other.
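The edit distance described above can be computed with the standard dynamic program. This is a minimal sketch; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

Because this distance is symmetric, non-negative, zero only for identical words, and obeys the triangle inequality, it can serve as d in the definitions of the previous slide.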

4 Problem formulation

5 Metric access methods (MAMs)
• Compact partitioning methods divide the space into compact regions and try to discard unqualified regions during search.
• Pivot-based methods store pre-computed distances from each object in the database to a set of pivots.
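To illustrate how pre-computed pivot distances avoid distance computations, here is a hedged sketch of a range query with pivot filtering. It relies on the triangle inequality: |d(q, p) - d(o, p)| is a lower bound on d(q, o), so an object whose lower bound exceeds the radius can be discarded without computing d(q, o). The function names and the returned counter are assumptions made for this example, not the method from the slides.

```python
def build_pivot_table(objects, pivots, dist):
    """Pre-compute the distance from every object to every pivot."""
    return [[dist(o, p) for p in pivots] for o in objects]

def range_query(q, r, objects, pivots, table, dist):
    """Return (objects within distance r of q, number of distance computations)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    computed = len(pivots)  # distances from the query to the pivots
    results = []
    for o, row in zip(objects, table):
        # Lower bound on d(q, o) derived from the triangle inequality.
        lower = max(abs(dq - do) for dq, do in zip(q_to_pivots, row))
        if lower > r:
            continue  # pruned without computing d(q, o)
        computed += 1
        if dist(q, o) <= r:
            results.append(o)
    return results, computed
```

The result is always the same as a linear scan; only the number of calls to `dist` changes.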

6 SPB-tree: Space-filling curve and Pivot-based B+-tree
• Generic: it does not rely on the detailed representation of objects, and it supports any distance notion that satisfies the triangle inequality.
• SPB-tree: integrates compact partitioning with a pivot-based approach by using a space-filling curve and a B+-tree.
• Efficient similarity search algorithms: effective pivots significantly reduce the number of distance computations during search.

7 Construction framework
• Pivot mapping
• Space-filling curve mapping

8 Pivot mapping and triangle inequality

9 Space-filling curve mapping
• If the distance range in the metric space consists of discrete integers, the SFC can map the pivot-distance vector directly to an integer.
• Because the range of d() in a metric space may be continuous real numbers, δ-approximation is used to partition the real range into discrete integers. Each pivot distance is approximated by an integer, so that the whole vector space is partitioned into cells (the formulas are not preserved in this transcript).
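The two steps on this slide can be sketched as follows. Several details are assumptions made for illustration: the step size is called `delta`, each distance is discretized as floor(d / delta) (the rounding rule used in the original work is not shown here), and a Z-order (Morton) curve is used as the space-filling curve. Every coordinate is assumed to fit in `bits` bits.

```python
import math

def discretize(distances, delta):
    """Map real-valued pivot distances to integer cell coordinates.

    Assumption: cell coordinate = floor(d / delta).
    """
    return [int(math.floor(d / delta)) for d in distances]

def z_order(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one integer key."""
    n = len(coords)
    key = 0
    for bit in range(bits):
        for i, c in enumerate(coords):
            # Bit `bit` of coordinate i goes to position bit * n + i of the key.
            key |= ((c >> bit) & 1) << (bit * n + i)
    return key
```

An object with pivot distances (0.4, 2.6) and delta = 0.5 falls into cell (0, 5), which `z_order` then turns into a single integer that can be stored in a one-dimensional index.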

10

11 Pivot selection
• Number of pivots: the appropriate number of pivots is related to the intrinsic dimensionality of the dataset.
• An HF-based incremental pivot selection algorithm (HFI) is used to find outliers as pivots.
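The slide does not spell out HFI, so the sketch below does not implement it. It shows a simpler and widely used stand-in, farthest-first (max-min) selection, only to illustrate the idea of choosing outlying objects as pivots. The function name, the arbitrary seed, and the assumption that `k` does not exceed the number of objects are all choices made for this example.

```python
def farthest_first_pivots(objects, k, dist):
    """Pick k pivots greedily; each new pivot is farthest from those chosen so far.

    This is a stand-in heuristic, not the HFI algorithm.
    """
    pivots = [objects[0]]  # arbitrary seed (assumption)
    # Distance from every object to its nearest chosen pivot.
    min_d = [dist(o, pivots[0]) for o in objects]
    while len(pivots) < k:
        idx = max(range(len(objects)), key=lambda i: min_d[i])
        pivots.append(objects[idx])
        min_d = [min(m, dist(o, objects[idx])) for m, o in zip(min_d, objects)]
    return pivots
```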

12 Indexing structure
• A pivot table stores the selected objects (e.g., o1 and o6) that map the metric space into a vector space.
• A B+-tree indexes the SFC values of the objects after pivot mapping.
• A random access file (RAF) keeps the objects separately and supports both random access and sequential scan.
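The three components can be mimicked in memory to see how they fit together. This is a toy sketch under stated assumptions, not the SPB-tree itself: a sorted Python list with `bisect` stands in for the B+-tree, a plain list stands in for the RAF, the key is a Z-order value of floor(d / delta) coordinates, and the class name `ToyIndex` is hypothetical.

```python
import bisect
import math

class ToyIndex:
    def __init__(self, objects, pivots, dist, delta=1.0, bits=8):
        self.pivots = list(pivots)  # pivot table
        self.dist = dist
        self.delta = delta
        self.bits = bits
        self.raf = list(objects)    # stand-in for the RAF
        entries = sorted((self._key(o), i) for i, o in enumerate(self.raf))
        self.keys = [k for k, _ in entries]      # stand-in for B+-tree leaf keys
        self.pointers = [i for _, i in entries]  # positions in the RAF

    def _key(self, obj):
        """Pivot mapping, discretization, then Z-order interleaving."""
        coords = [int(math.floor(self.dist(obj, p) / self.delta))
                  for p in self.pivots]
        key = 0
        for bit in range(self.bits):
            for i, c in enumerate(coords):
                key |= ((c >> bit) & 1) << (bit * len(coords) + i)
        return key

    def lookup(self, obj):
        """Return the stored objects that share obj's SFC key."""
        k = self._key(obj)
        lo = bisect.bisect_left(self.keys, k)
        hi = bisect.bisect_right(self.keys, k)
        return [self.raf[self.pointers[j]] for j in range(lo, hi)]
```

Keeping the objects in a separate store means the key index holds only integers and pointers, which is the separation the slide describes.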

13

14 Bulk-loading Operation

15 Similarity search

16 kNN search

17 Cost models: estimated number of distance computations (EDC)
• The overall distribution of distances from objects in O to a pivot p_i is defined as:
F_{p_i}(r) = Pr{d(o, p_i) ≤ r}
F(r_1, r_2, ..., r_{|P|}) = Pr{d(o, p_1) ≤ r_1, d(o, p_2) ≤ r_2, ..., d(o, p_{|P|}) ≤ r_{|P|}}
• EDC = |P| + |O| · Pr(d(q, o) needs to be computed)
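The EDC formula can be evaluated empirically on a small dataset. The sketch below makes one assumption that the slide does not state: for a range query with radius r, d(q, o) needs to be computed exactly when |d(o, p_i) - d(q, p_i)| ≤ r holds for every pivot p_i, i.e., when pivot filtering cannot discard o. The function names are hypothetical.

```python
def empirical_cdf(objects, pivot, dist, r):
    """Empirical estimate of F_{p_i}(r) = Pr{d(o, p_i) <= r}."""
    return sum(dist(o, pivot) <= r for o in objects) / len(objects)

def estimate_edc(q, r, objects, pivots, dist):
    """EDC = |P| + |O| * Pr(d(q, o) needs to be computed), measured on objects."""
    q_to_pivots = [dist(q, p) for p in pivots]
    # Assumption: o survives filtering if no pivot's lower bound exceeds r.
    survivors = sum(
        all(abs(dist(o, p) - dq) <= r for p, dq in zip(pivots, q_to_pivots))
        for o in objects
    )
    probability = survivors / len(objects)
    return len(pivots) + len(objects) * probability
```

A real cost model would estimate the probability from the distance distributions F rather than by scanning the data; the scan here only makes the terms of the formula concrete.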

18 Cost models: expected number of page accesses (EPA)
• The expected number of page accesses (EPA) of a similarity query can be calculated as (formula not preserved in this transcript).

19 Experiments: effect of parameters
• Number of pivots

