M- tree: an efficient access method for similarity search in metric spaces Reporter : Ximeng Liu Supervisor: Rongxing Lu School of EEE, NTU

Slides:



Advertisements
Similar presentations
1 DATA STRUCTURES USED IN SPATIAL DATA MINING. 2 What is Spatial data ? broadly be defined as data which covers multidimensional points, lines, rectangles,
Advertisements

Hierarchical Cellular Tree: An Efficient Indexing Scheme for Content-Based Retrieval on Multimedia Databases Serkan Kiranyaz and Moncef Gabbouj.
B-Trees. Motivation When data is too large to fit in the main memory, then the number of disk accesses becomes important. A disk access is unbelievably.
On Reinsertions in M-tree Jakub Lokoč Tomáš Skopal Charles University in Prague Department of Software Engineering Czech Republic.
The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa.
Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
Multimedia Database Systems
 Definition of B+ tree  How to create B+ tree  How to search for record  How to delete and insert a data.
Multidimensional Indexing
Searching on Multi-Dimensional Data
Improving the Performance of M-tree Family by Nearest-Neighbor Graphs Tomáš Skopal, David Hoksza Charles University in Prague Department of Software Engineering.
Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Tomáš Skopal Department of Computer Science, VŠB-Technical.
ADBIS 2003 Revisiting M-tree Building Principles Tomáš Skopal 1, Jaroslav Pokorný 2, Michal Krátký 1, Václav Snášel 1 1 Department of Computer Science.
Image Indexing and Retrieval using Moment Invariants Imran Ahmad School of Computer Science University of Windsor – Canada.
1 Lecture 8: Data structures for databases II Jose M. Peña
Answering Metric Skyline Queries by PM-tree Tomáš Skopal, Jakub Lokoč Department of Software Engineering, FMP, Charles University in Prague.
2-dimensional indexing structure
Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.
Spatial Access Methods Chapter 26 of book Read only 26.1, 26.2, 26.6 Dr Eamonn Keogh Computer Science & Engineering Department University of California.
CPSC 231 B-Trees (D.H.)1 LEARNING OBJECTIVES Problems with simple indexing. Multilevel indexing: B-Tree. –B-Tree creation: insertion and deletion of nodes.
Data Indexing Herbert A. Evans. Purposes of Data Indexing What is Data Indexing? Why is it important?
Scalable and Distributed Similarity Search in Metric Spaces Michal Batko Claudio Gennaro Pavel Zezula.
Chapter 3: Data Storage and Access Methods
Spatial Indexing I Point Access Methods.
Techniques and Data Structures for Efficient Multimedia Similarity Search.
Spatio-Temporal Databases. Introduction Spatiotemporal Databases: manage spatial data whose geometry changes over time Geometry: position and/or extent.
R-Trees 2-dimensional indexing structure. R-trees 2-dimensional version of the B-tree: B-tree of maximum degree 8; degree between 3 and 8 Internal nodes.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.
R-Trees Extension of B+-trees.  Collection of d-dimensional rectangles.  A point in d-dimensions is a trivial rectangle.
Roger ZimmermannCOMPSAC 2004, September 30 Spatial Data Query Support in Peer-to-Peer Systems Roger Zimmermann, Wei-Shinn Ku, and Haojun Wang Computer.
Spatial Data Management Chapter 28. Types of Spatial Data Point Data –Points in a multidimensional space E.g., Raster data such as satellite imagery,
Database management Systems, 3ed, R. Ramakrishnan and J. Gehrke1 Spatial Data Management Chapter 28.
1 SD-Rtree: A Scalable Distributed Rtree Witold Litwin & Cédric du Mouza & Philippe Rigaux.
The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.
Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing 
Introduction to The NSP-Tree: A Space-Partitioning Based Indexing Method Gang Qian University of Central Oklahoma November 2006.
B-Trees And B+-Trees Jay Yim CS 157B Dr. Lee.
Multidimensional Indexes Applications: geographical databases, data cubes. Types of queries: –partial match (give only a subset of the dimensions) –range.
COSC 2007 Data Structures II Chapter 15 External Methods.
Parallel dynamic batch loading in the M-tree Jakub Lokoč Department of Software Engineering Charles University in Prague, FMP.
Antonin Guttman In Proceedings of the 1984 ACM SIGMOD international conference on Management of data (SIGMOD '84). ACM, New York, NY, USA.
Nearest Neighbor Queries Chris Buzzerd, Dave Boerner, and Kevin Stewart.
Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.
1.1 CS220 Database Systems Indexing: Hashing Slides courtesy G. Kollios Boston University via UC Berkeley.
Tomáš Skopal 1, Benjamin Bustos 2 1 Charles University in Prague, Czech Republic 2 University of Chile, Santiago, Chile On Index-free Similarity Search.
Spatial Indexing Techniques Introduction to Spatial Computing CSE 5ISC Some slides adapted from Spatial Databases: A Tour by Shashi Shekhar Prentice Hall.
R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree : An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.
DASFAA 2005, Beijing 1 Nearest Neighbours Search using the PM-tree Tomáš Skopal 1 Jaroslav Pokorný 1 Václav Snášel 2 1 Charles University in Prague Department.
Query by Image and Video Content: The QBIC System M. Flickner et al. IEEE Computer Special Issue on Content-Based Retrieval Vol. 28, No. 9, September 1995.
Presenters: Amool Gupta Amit Sharma. MOTIVATION Basic problem that it addresses?(Why) Other techniques to solve same problem and how this one is step.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
Database Applications (15-415) DBMS Internals- Part III Lecture 13, March 06, 2016 Mohammad Hammoud.
Similarity Search without Tears: the OMNI- Family of All-Purpose Access Methods Michael Kelleher Kiyotaka Iwataki The Department of Computer and Information.
Spatial Data Management
Mehdi Kargar Department of Computer Science and Engineering
Tree-Structured Indexes
Chapter 25: Advanced Data Types and New Applications
Spatial Indexing I Point Access Methods.
B+ Tree.
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
Spatio-Temporal Databases
15-826: Multimedia Databases and Data Mining
Multidimensional Indexes
Spatial Indexing I R-trees
Multidimensional Search Structures
Presentation transcript:

M- tree: an efficient access method for similarity search in metric spaces Reporter : Ximeng Liu Supervisor: Rongxing Lu School of EEE, NTU

References 1.Paolo Ciaccia and Marco Patella and Pavel Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB Skopal T, Pokorný J, Snasel V. PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases[C]//ADBIS (Local Proceedings)

Introduction Need to manage various types of data stored in large computer repositories has drastically increased and resulted in the development of multimedia database systems aiming at a uniform management of voice, video, image, text, and numerical data. It is of vital importance to effectively and efficiently support the retrieval process devised to determine which portions of the database are relevant to users’ requests.

Introduction Since multimedia applications typically require complex distance functions to quantify similarities of multi-dimensional features, such as shape, texture, color [FEF+94], image patterns [VM95], sound [WBKW96], text, fuzzy values, set values [HNP95], sequence data [AFS93, FRM94], etc., multi- dimensional (spatial) access methods (SAMs), such as R-tree [Gut84] and its variants [SRF87, BKSS90], have been considered to index such data.

Introduction No attempt in the design of these structures has been done to reduce the number of distance computations

Preliminaries A metric space means to provide an efficient support for answering similarity queries, i.e. queries whose purpose is to retrieve DB objects which are “similar” to a reference (query) object, and where the (dis)similarity between objects is measured by a specific metric distance function d.

M-tree M-tree, is proposed to organize and search large data sets from a generic “metric space”.

Metric space A metric space is a pair, M = (D, d), where D is a domain of feature values – the indexing keys – and d is a total (distance) function with the following properties: 1.d(Ox,Oy) = d(Oy,Ox) (symmetry) 2. d(Ox,Oy) > 0 (Ox = Oy) and d(Ox,Ox) = 0 (non negativity) 3. d(Ox,Oy) ≤ d(Ox,Oz) + d(Oz,Oy) (triangle inequality)

The Structure of M-tree Nodes Leaf nodes of any M-tree store all indexed (database) objects, represented by their keys or features, whereas internal nodes store the so-called routing objects. A routing object is a database object to which a routing role is assigned by a specific promotion algorithm.

Routing object For each routing object Or there is an associated pointer, denoted ptr(T(Or)), which references the root of a sub-tree, T(Or), called the covering tree of Or. All objects in the covering tree of Or are within the distance r(Or) from Or, r(Or) > 0, which is called the covering radius of Or and forms a part of the Or entry in a particular M-tree node. Finally, a routing object Or is associated with a distance to P(Or), its parent object, that is the routing object which references the node where the Or entry is stored.

Routing object

Database object Database object Oj in a leaf is quite similar to that of a routing object, but no covering radius is needed, and the pointer field stores the actual object identifier (oid), which is used to provide access to the whole object possibly resident on a separate data file.

M-tree

M-tree

Building the M-tree The Insert algorithm recursively descends the M-tree to locate the most suitable leaf node for accommodating a new object, On, possibly triggering a split if the leaf is full. The basic rationale used to determine the “most suitable” leaf node is to descend, at each level of the tree, along a sub-tree, T(Or), for which no enlargement of the covering radius is needed, i.e. d(Or,On) ≤ r(Or).

Building the M-tree If no routing object for which d(Or,On) ≤ r(Or) exists, the choice is to minimize the increase of the covering radius, d(Or,On) − r(Or).

Insert algorithm

Split Management As any other dynamic balanced tree, M-tree grows in a bottom-up fashion. The overflow of a node N is managed by allocating a new node, N’, at the same level of N, partitioning the entries among these two nodes, and posting (promoting) to the parent node, Np, two routing objects to reference the two nodes. When the root splits, a new root is created and the M-tree grows by one level up.

Split Management A specific implementation of the Promote and Partition methods defines what we call a split policy.

Split Management

Split Policies The “ideal” split policy should promote Op1 and Op2 objects, and partition other objects so that the two soobtained regions would have minimum “volume” and minimum “overlap”.

Choosing the Routing Objects M_RAD, mM_RAD, M_LB_DIST, RANDOM, SAMPLING. M_LB_DIST: The acronym stands for “Maximum Lower Bound on DISTance”. This policy differs from previous ones in that it only uses the precomputed stored distances. In the confirmed version, where Op1≡ Op, the algorithm determines Op2 as the farthest object from Op, that is d(Op2,Op) = max_j {d(Oj,Op)}

Distribution of the Entries Given a set of entries N and two routing objects Op1 and Op2, the problem is how to efficiently partition N into two subsets, N1 and N2. Generalized Hyperplane: Assign each object Oj ∈ N to the nearest routing object: if d(Oj,Op1 ) ≤ d(Oj,Op2 ) then assign Oj to N1, else assign Oj to N2. Balanced: Compute d(Oj,Op1) and d(Oj,Op2 ) for all Oj ∈ N. Repeat until N is empty: Assign to N1 the nearest neighbor of Op1 in N and remove it from N; Assign to N2 the nearest neighbor of Op2 in N and remove it from N.

Thank you Rongxing’s Homepage: PPT Ximeng’s Homepage: