Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen.



Metric space
- A tuple (M, d), where M is the domain of objects and d is a distance function that defines the similarity between the objects in M
- The function d has four properties:
  1. symmetry: d(q, o) = d(o, q)
  2. non-negativity: d(q, o) ≥ 0
  3. identity: d(q, o) = 0 if and only if q = o
  4. triangle inequality: d(q, o) ≤ d(q, p) + d(o, p)

Metric space
- For example, using edit distance as the distance function, any set of English words forms a metric space.
- The edit distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
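The edit distance described above can be sketched with the standard dynamic-programming recurrence. This is an illustrative sketch, not code from the slides; the function name `edit_distance` is an assumption, and unit cost is assumed for every edit.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between two words."""
    # Keep the shorter word in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


words = ["metric", "matrix", "index", "tree"]
# Spot-check the four metric properties on a small word set.
for q in words:
    for o in words:
        assert edit_distance(q, o) == edit_distance(o, q)
        assert (edit_distance(q, o) == 0) == (q == o)
        for p in words:
            assert edit_distance(q, o) <= edit_distance(q, p) + edit_distance(o, p)
```

The closing loop only spot-checks the four properties on a handful of words; it is a sanity check, not a proof.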

Problem formulation

Metric access methods (MAMs)
- Compact partitioning methods divide the space into compact regions and try to discard unqualified regions during search.
- Pivot-based methods store pre-computed distances from each object in the database to a set of pivots.

SPB-tree: Space-filling curve and Pivot-based B+-tree
- Generic: it does not rely on the detailed representation of objects, and it supports any distance notion that satisfies the triangle inequality.
- SPB-tree: integrates compact partitioning with a pivot-based approach by using a space-filling curve and a B+-tree.
- Efficient similarity search algorithms: effective pivots significantly reduce the number of distance computations during search.

Construction framework
- Pivot mapping
- Space-filling curve mapping

Pivot mapping and triangle inequality
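A minimal sketch of the idea, under stated assumptions: each object o is mapped to the vector of its distances to the pivots, and the triangle inequality gives the lower bound |d(q, p) - d(o, p)| ≤ d(q, o) for every pivot p. An object whose lower bound already exceeds the query radius can be discarded without computing d(q, o). The names `pivot_map`, `lower_bound`, and `range_query` are hypothetical, and the linear scan over the pivot table stands in for the actual index.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pivot_map(o: T, pivots: Sequence[T], d: Callable[[T, T], float]) -> List[float]:
    """Map an object to the vector of its distances to each pivot."""
    return [d(o, p) for p in pivots]


def lower_bound(q_vec: Sequence[float], o_vec: Sequence[float]) -> float:
    """Largest |d(q,p) - d(o,p)| over all pivots; never exceeds d(q,o)."""
    return max(abs(a - b) for a, b in zip(q_vec, o_vec))


def range_query(
    q: T,
    r: float,
    objects: Sequence[T],
    table: Sequence[Sequence[float]],  # pivot vectors, precomputed at build time
    pivots: Sequence[T],
    d: Callable[[T, T], float],
) -> Tuple[List[T], int]:
    """Return objects within distance r of q and the number of d(q, o) calls made."""
    q_vec = pivot_map(q, pivots, d)  # costs one distance computation per pivot
    result, computed = [], 0
    for o, o_vec in zip(objects, table):
        if lower_bound(q_vec, o_vec) > r:
            continue  # pruned by the triangle inequality, no distance needed
        computed += 1
        if d(q, o) <= r:
            result.append(o)
    return result, computed


# Toy metric space: numbers under absolute difference (assumed for illustration).
dist = lambda x, y: abs(x - y)
objects = [1, 5, 9, 20, 21]
pivots = [0, 25]
table = [pivot_map(o, pivots, dist) for o in objects]
hits, computed = range_query(6, 2, objects, table, pivots, dist)
```

The sketch assumes at least one pivot; with an empty pivot set the lower bound is undefined and every object would have to be verified directly.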

Space-filling curve mapping
- If the range of d() in the metric space consists of discrete integers, the SFC can map the pivot-mapped vector directly to an integer.
- Because the range of d() in a metric space may be continuous real numbers, an approximation is used to partition the real range into discrete integers, so that the whole vector space is partitioned into cells. (The approximation symbol and the formulas on this slide were not preserved in the transcript.)
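The two steps above can be sketched as follows. Everything specific here is an assumption for illustration: the slide does not preserve the discretization formula or name the curve, so the sketch uses a floor with a hypothetical cell width `step` and a Z-order (Morton) curve as one common space-filling curve. The function names `discretize` and `z_order` are likewise hypothetical.

```python
import math
from typing import List, Sequence


def discretize(vec: Sequence[float], step: float) -> List[int]:
    """Map real-valued pivot distances to integer cell coordinates (floor assumed)."""
    return [math.floor(x / step) for x in vec]


def z_order(cell: Sequence[int], bits: int) -> int:
    """Interleave the low `bits` bits of each coordinate into one integer key."""
    n = len(cell)
    key = 0
    for c in cell:
        if c < 0 or c >= (1 << bits):
            raise ValueError("coordinate does not fit in the given number of bits")
    for bit in range(bits):
        for dim, c in enumerate(cell):
            # Bit `bit` of dimension `dim` lands at position bit * n + dim.
            key |= ((c >> bit) & 1) << (bit * n + dim)
    return key


cell = discretize([1.7, 2.6], step=0.5)  # two pivots, cell width 0.5
key = z_order(cell, bits=3)
```

The resulting integer key is what a one-dimensional index such as a B+-tree can store; nearby cells tend to receive nearby keys, although no space-filling curve preserves locality perfectly.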

Pivot selection
- Number of pivots: the appropriate number of pivots is related to the intrinsic dimensionality of the dataset.
- The HF-based Incremental pivot selection algorithm (HFI) is used to find outliers.

Indexing structure
- A pivot table stores the selected objects (e.g., o1 and o6) used to map the metric space into a vector space.
- A B+-tree indexes the SFC values of the objects after pivot mapping.
- A random access file (RAF) keeps the objects separately and supports both random access and sequential scan.

Bulk-loading Operation

Similarity search

kNN search

Cost models: estimated number of distance computations (EDC)
- The overall distribution of distances from objects in O to a pivot p_i is defined as:
  F_{p_i}(r) = Pr{d(o, p_i) ≤ r}
- The joint distribution over all pivots is:
  F(r_1, r_2, ..., r_|P|) = Pr{d(o, p_1) ≤ r_1, d(o, p_2) ≤ r_2, ..., d(o, p_|P|) ≤ r_|P|}
- EDC = |P| + |O| * Pr(d(q, o) needs to be computed)
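As a rough illustration of the EDC formula, the probability term can be estimated empirically from a pivot table. The sketch below rests on a simplifying assumption that is not stated on the slide: d(q, o) needs to be computed exactly when the object survives the pivot filter of a range query with radius r, that is, when |d(q, p_i) - d(o, p_i)| ≤ r for every pivot. The names `empirical_cdf` and `estimate_edc` are hypothetical.

```python
from typing import Sequence


def empirical_cdf(dists: Sequence[float], r: float) -> float:
    """Empirical F_p(r): fraction of objects whose distance to pivot p is at most r."""
    return sum(1 for x in dists if x <= r) / len(dists)


def estimate_edc(table: Sequence[Sequence[float]], q_vec: Sequence[float], r: float) -> float:
    """EDC = |P| + |O| * Pr(d(q, o) needs to be computed), with Pr estimated from the table."""
    survivors = sum(
        1
        for o_vec in table
        if all(abs(a - b) <= r for a, b in zip(q_vec, o_vec))
    )
    prob = survivors / len(table)
    # |P| distances map the query to its pivot vector; survivors must be verified.
    return len(q_vec) + len(table) * prob


# Pivot table for objects 1, 5, 9, 20, 21 with pivots 0 and 25 under |x - y|.
table = [[1, 24], [5, 20], [9, 16], [20, 5], [21, 4]]
edc = estimate_edc(table, q_vec=[6, 19], r=2)
```

A real cost model would use the distance distributions F instead of scanning the table, since scanning already costs as much as the filter itself; the scan is used here only to make the formula concrete.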

Cost models: expected number of page accesses (EPA)
- The expected number of page accesses (EPA) of a similarity query can be calculated as

Experiments: effect of parameters
- Pivot number