MUFIN: Multi Feature Indexing Network. Similarity Search Platform for Many Applications. Pavel Zezula, Faculty of Informatics, Masaryk University, Brno.
Outline of the talk: why similarity; principles of metric similarity searching; the MUFIN approach; demo applications; future directions.
Real-Life Motivation. The social psychology view: Any event in the history of an organism is, in a sense, unique. Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity. Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.
Contemporary Networked Media. The digital data view: Almost everything that we see, read, hear, write, measure, or observe can be digital. Users autonomously contribute to the production of global media, and the growth is exponential. Sites like Flickr, YouTube, and Facebook host user-contributed content for a variety of events. The elements of networked media are related by numerous multi-faceted links of similarity.
Examples with Similarity. Does the computer disk of a suspected criminal contain illegal multimedia material? What are the stocks with similar price histories? Which companies advertise their logos in the live TV broadcast of a football match? Is the situation on the web getting close to any of the network attacks that resulted in significant damage in the past?
Challenge. Networked media is getting close to the human “fact-bases”: the gap between physical and digital has blurred. Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections. WHY? It is the similarity which is in the world revealing.
Limitations: Data Types. We have: attributes (numbers, strings, etc.); text, text-based (documents, annotations). We need: multimedia (image, video, audio); security (biometrics); medicine (EKG, EEG, EMG, EMR, CT, etc.); scientific data (biology, chemistry, physics, life sciences, economics); others (motion, emotion, events, etc.).
Limitations: Models of Similarity. We have: simple geometric models, typically vector spaces. We need: more complex models; non-metric models; asymmetric similarity; subjective similarity; context-aware similarity; complex similarity; etc.
Limitations: Queries. We have: simple queries (nearest neighbor, range). We need: more query types (reverse NN, distinct NN, similarity join); other similarity-based operations (filtering, classification, event detection, clustering, etc.); a similarity algebra, which may become the basis of a “Similarity Data Management System”.
Limitations: Implementation Strategies. We have: centralized or parallel processing. We need: scalable and distributed architectures; MapReduce-like approaches; P2P architectures; cloud computing; self-organized architectures; etc.
Search Strategy Evolution. Scalability: data volume (exponential); number of users (queries); variety of data types; multi-lingual, multi-feature, multi-modal queries. Determinism: exact match ► similarity; precise ► approximate; same answer ► good answer, recommendation; fixed query ► personalized, context aware; fixed infrastructure ► dynamic mapping, mobile devices. [Figure labels: grade (high, low); well established, cutting-edge, research; centralized, parallel, distributed, peer-to-peer, self-organized]
Similarity Data Management System. [Figure labels: similarity; effectiveness; efficiency; stimuli; matching; extraction; evaluation; execution; algebra]
Metric Search Grows in Popularity. Hanan Samet: Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann, 2006. P. Zezula, G. Amato, V. Dohnal, and M. Batko: Similarity Search: The Metric Space Approach, Springer.
The MUFIN Approach. MUFIN: MUlti-Feature Indexing Network. Scalability: P2P structure. Extensibility: metric space. Independence: infrastructure as a service. [Figure labels: search; data & queries; infrastructure; index structure]
Extensibility: Metric Abstraction of Similarity. Metric space: $\mathcal{M} = (\mathcal{D}, d)$, with domain $\mathcal{D}$ and distance function $d(x,y)$. For all $x, y, z \in \mathcal{D}$: $d(x,y) \ge 0$ (non-negativity); $d(x,y) = 0 \iff x = y$ (identity); $d(x,y) = d(y,x)$ (symmetry); $d(x,y) \le d(x,z) + d(z,y)$ (triangle inequality).
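The four postulates can be checked mechanically on a finite sample of objects. The following is a minimal sketch, not part of MUFIN: `is_metric_on` and the two example distances are illustrative names, and passing the check on a sample does not prove that a function is a metric on the whole domain.

```python
from itertools import product

def is_metric_on(sample, d, tol=1e-12):
    """Check the four metric postulates for d on every triple of a finite sample."""
    for x, y, z in product(sample, repeat=3):
        if d(x, y) < 0:                          # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):           # identity
            return False
        if abs(d(x, y) - d(y, x)) > tol:         # symmetry
            return False
        if d(x, y) > d(x, z) + d(z, y) + tol:    # triangle inequality
            return False
    return True

def city_block(x, y):
    """L1 distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_euclidean(x, y):
    """Squared L2 distance; it violates the triangle inequality."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0, 0), (1, 0), (2, 0), (0, 3)]
city_block_ok = is_metric_on(points, city_block)
squared_ok = is_metric_on(points, squared_euclidean)
```

On this sample the city-block distance passes, while the squared Euclidean distance fails because d((0,0),(2,0)) = 4 exceeds d((0,0),(1,0)) + d((1,0),(2,0)) = 2.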
Examples of Distance Functions. $L_p$ Minkowski distance (for vectors): $L_p(x,y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$, where $L_1$ is the city-block distance, $L_2$ the Euclidean distance, and $L_\infty$ the infinity (maximum) distance. Edit distance (for strings): the minimal number of insertions, deletions, and substitutions, e.g. d(‘application’, ‘applet’) = 6. Jaccard’s coefficient (for sets A, B): $d(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$.
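A minimal Python sketch of the three distance functions named on this slide. The function names are illustrative, and the Jaccard distance is written in its usual form 1 - |A ∩ B| / |A ∪ B|, which is an assumption about the formula the slide showed.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equally long vectors; p = math.inf gives L_infinity."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))               # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                  # delete cs
                            curr[j - 1] + 1,              # insert ct
                            prev[j - 1] + (cs != ct)))    # substitute or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A n B| / |A u B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

d_edit = edit_distance("application", "applet")   # the slide's example pair
```

For the slide's example pair, `d_edit` is 6: five deletions are forced by the length difference, and one substitution is needed because ‘applet’ contains an ‘e’ that ‘application’ lacks.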
Examples of Distance Functions. Mahalanobis distance: for vectors with correlated dimensions. Hausdorff distance: for sets with elements related by another distance. Earth mover's distance: primarily for histograms (sets of weighted features). And many others.
Similarity Search Problem. For $X \subseteq \mathcal{D}$ in metric space $\mathcal{M}$, pre-process $X$ so that the similarity queries are executed efficiently. No total ordering exists!
Similarity Queries: range query; nearest neighbor query; similarity join; combined queries; complex queries.
Similarity Range Query. Range query: $R(q,r) = \{ x \in X \mid d(q,x) \le r \}$. Example: all museums up to 2 km from my hotel.
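Without any index, a range query can be answered by a sequential scan that evaluates the distance to every object. The sketch below follows the definition of R(q,r) directly; the coordinates (in km) and the name `range_query` are made up for illustration.

```python
import math

def range_query(X, d, q, r):
    """Return all objects of X within distance r of the query object q."""
    return [x for x in X if d(q, x) <= r]

# Hypothetical museum positions in km, relative to the hotel at the origin.
hotel = (0.0, 0.0)
museums = [(0.5, 0.5), (1.0, 1.5), (2.0, 2.0), (-1.2, 0.3)]

# math.dist is the Euclidean (L2) distance between two points.
within_2km = range_query(museums, math.dist, hotel, 2.0)
```

The scan costs one distance computation per object, which is the baseline that metric index structures try to beat.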
Nearest Neighbor Query. The nearest neighbor query: $NN(q) = x$, where $x \in X$ and $\forall y \in X: d(q,x) \le d(q,y)$. The k-nearest neighbor query: $k\text{-}NN(q,k) = A$, where $A \subseteq X$, $|A| = k$, and $\forall x \in A, y \in X - A: d(q,x) \le d(q,y)$. Example: five closest museums to my hotel ($k = 5$).
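The k-NN definition can likewise be evaluated by a brute-force scan. This sketch uses the standard library's `heapq.nsmallest`; the data and the name `knn_query` are illustrative, and ties are broken arbitrarily by input order.

```python
import heapq
import math

def knn_query(X, d, q, k):
    """Return the k objects of X closest to q, nearest first."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

# Hypothetical museum positions in km, relative to the hotel at the origin.
hotel = (0.0, 0.0)
museums = [(0.5, 0.5), (1.0, 1.5), (2.0, 2.0), (-1.2, 0.3)]

two_closest = knn_query(museums, math.dist, hotel, 2)
nearest = knn_query(museums, math.dist, hotel, 1)[0]   # NN(q) is the k = 1 case
```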
Similarity Join Queries. Similarity join of two data sets; similarity self join when X = Y. Example: pairs of hotels and museums which are a five-minute walk apart.
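The slide's formula did not survive extraction, so the sketch below assumes the usual threshold-based definition: the join returns all pairs (x, y) from X × Y with d(x, y) ≤ μ. The names, the one-dimensional data, and the choice to drop trivial and mirrored pairs in the self join are assumptions made for illustration.

```python
from itertools import combinations

def similarity_join(X, Y, d, mu):
    """Nested-loop join: all pairs (x, y) with x in X, y in Y and d(x, y) <= mu."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]

def similarity_self_join(X, d, mu):
    """Self join (X = Y), skipping the pairs (x, x) and mirrored duplicates."""
    return [(x, y) for x, y in combinations(X, 2) if d(x, y) <= mu]

def abs_diff(a, b):
    """Distance between two positions along a street, in walking minutes."""
    return abs(a - b)

hotels = [0.0, 5.0]
museums = [0.3, 4.0, 9.0]
close_pairs = similarity_join(hotels, museums, abs_diff, 1.0)
```

The nested loop needs |X| · |Y| distance computations; it serves only as a reference against which indexed join strategies can be compared.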
Combined Queries. Range + nearest neighbors. Nearest neighbor + similarity joins: by analogy.
Complex Queries. Find the best matches of circular shape objects with red color. The best match for circular shape or red color need not be the best match combined. Algorithms: the $A_0$ algorithm; the threshold algorithm.
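A compact sketch of the threshold algorithm for such multi-feature queries. It assumes that each feature supplies a similarity score per object (higher is better), that every object has a score for every feature, and that the aggregation function is monotone (here `min`). The scores and object names are invented for illustration.

```python
import heapq

def threshold_algorithm(scores, aggregate, k):
    """Top-k objects under a monotone aggregate of per-feature scores.

    scores: one dict per feature, mapping object -> score.
    """
    # Sorted access: one list per feature, best score first.
    ranked = [sorted(s.items(), key=lambda kv: kv[1], reverse=True) for s in scores]
    best = {}                                      # object -> aggregated score
    for depth in range(len(ranked[0])):
        last_seen = []
        for lst in ranked:
            obj, score = lst[depth]
            last_seen.append(score)
            if obj not in best:
                # Random access: fetch the object's score for every feature.
                best[obj] = aggregate([s[obj] for s in scores])
        threshold = aggregate(last_seen)           # bound for any unseen object
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                             # no unseen object can do better
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])

circular = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
red = {"c": 0.95, "b": 0.7, "a": 0.2, "d": 0.1}
best_match = threshold_algorithm([circular, red], min, 1)
```

In this example object "a" is the best circular match and "c" the best red match, yet the combined winner is "b", and the algorithm stops after reading only two positions of each list.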
Partitioning Principles. Given a set $X \subseteq \mathcal{D}$ in $\mathcal{M} = (\mathcal{D}, d)$, basic partitioning principles have been defined: ball partitioning; generalized hyper-plane partitioning; excluded middle partitioning; clustering.
Ball Partitioning. Inner set: $\{ x \in X \mid d(p,x) \le d_m \}$. Outer set: $\{ x \in X \mid d(p,x) > d_m \}$.
Generalized Hyper-plane. $\{ x \in X \mid d(p_1,x) \le d(p_2,x) \}$ and $\{ x \in X \mid d(p_1,x) > d(p_2,x) \}$.
Excluded Middle Partitioning. Inner set: $\{ x \in X \mid d(p,x) \le d_m - \rho \}$. Outer set: $\{ x \in X \mid d(p,x) > d_m + \rho \}$. Excluded set: otherwise (a band of width $2\rho$ around $d_m$).
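The three pivot-based partitioning principles translate directly into code. In this sketch the function names are illustrative, and `rho` stands for the half-width of the excluded band (an assumed symbol, since the slide's own symbol did not survive extraction).

```python
def ball_partition(X, d, p, dm):
    """Split X by the ball of radius dm around pivot p."""
    inner = [x for x in X if d(p, x) <= dm]
    outer = [x for x in X if d(p, x) > dm]
    return inner, outer

def hyperplane_partition(X, d, p1, p2):
    """Split X by which of the two pivots is closer (ties go to p1)."""
    left = [x for x in X if d(p1, x) <= d(p2, x)]
    right = [x for x in X if d(p1, x) > d(p2, x)]
    return left, right

def excluded_middle_partition(X, d, p, dm, rho):
    """Ball partitioning with a band of width 2 * rho kept apart."""
    inner = [x for x in X if d(p, x) <= dm - rho]
    outer = [x for x in X if d(p, x) > dm + rho]
    excluded = [x for x in X if dm - rho < d(p, x) <= dm + rho]
    return inner, outer, excluded

def abs_diff(a, b):
    return abs(a - b)

X = [0, 1, 2, 3, 4, 5, 6]
ball = ball_partition(X, abs_diff, p=0, dm=3)
plane = hyperplane_partition(X, abs_diff, p1=0, p2=6)
middle = excluded_middle_partition(X, abs_diff, p=0, dm=3, rho=1)
```

With the excluded band in place, any range query of radius at most `rho` can intersect the inner set or the outer set, but never both, which is what makes the two sets separable.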
Clustering. Cluster data into sets bounded by a ball region: $\{ x \in X \mid d(p_i,x) \le r_i^c \}$.
Scalability: Peer-to-Peer Indexing. Local search: M-tree, D-Index, M-Index. Native metric techniques: GHT*, VPT*. Transformation techniques: M-CAN, M-Chord.
The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]: 1) paged organization; 2) dynamic; 3) suitable for arbitrary metric spaces; 4) I/O and CPU optimization, since computing d can be time-consuming.
The M-tree Idea. Depending on the metric, the “shape” of index regions changes. Metrics: $L_2$ (Euclidean), $L_1$ (city-block), $L_\infty$ (max-metric), weighted Euclidean, quadratic form. [Figure: index regions over objects A to F]
M-tree: Example. [Figure: M-tree over objects $o_1$ to $o_{11}$; labels: covering radius, distance to parent, leaf entries]
M-tree family: bulk loading; Slim-tree; multi-way insertion; PM-tree; $M^2$-tree; etc.
D-Index [Dohnal, Gennaro, Zezula, MTA 2002]. 4 separable buckets at the first level; 2 separable buckets at the second level; exclusion bucket of the whole structure.
D-index: Insertion.
D-index: Range Search. [Figure: query regions with center q and radius r]
Implementation Postulates of Distributed Indexes. Dynamism: nodes can be added and removed. No hot-spots: no centralized nodes, no flooding by messages (transactions). Update independence: a network update at one site does not require an immediate change propagation to all the other sites.
Distributed Similarity Search Structures. Native metric structures: GHT* (Generalized Hyperplane Tree), VPT* (Vantage Point Tree). Transformation approaches: M-CAN (Metric Content Addressable Network), M-Chord (Metric Chord).
GHT* Address Search Tree. Based on the Generalized Hyperplane Tree [Uhl91]: two pivots for binary partitioning. [Figure: space partitioned by the pivot pairs $(p_1, p_2)$, $(p_3, p_4)$, $(p_5, p_6)$]
GHT* Address Search Tree. Inner node: two pivots (reference objects). Leaf node: a BID pointer to a bucket if the data is stored on the current peer, or an NNID pointer to a peer if the data is stored on a different peer.
GHT* Address Search Tree. [Figure]
GHT* Range Query. Range query R(q,r): traverse the peer’s own AST; search buckets for all BIDs found; forward the query to all NNIDs found.
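The traversal can be sketched on a single peer as follows. The pruning rule used here is the standard one for generalized hyperplane trees and follows from the triangle inequality; the slide itself does not spell it out. The class names are illustrative, and forwarding is only simulated by collecting the peer identifiers that would receive the query.

```python
from dataclasses import dataclass

@dataclass
class Bucket:              # leaf with a BID: data stored on this peer
    objects: list

@dataclass
class Remote:              # leaf with an NNID: data stored on another peer
    peer_id: int

@dataclass
class Inner:               # inner node with two pivots
    p1: object
    p2: object
    left: object           # subtree for d(p1, x) <= d(p2, x)
    right: object          # subtree for d(p1, x) >  d(p2, x)

def ast_range_query(node, d, q, r, hits, forward_to):
    """Collect local answers in hits and the peers to contact in forward_to."""
    if isinstance(node, Bucket):
        hits.extend(x for x in node.objects if d(q, x) <= r)
    elif isinstance(node, Remote):
        forward_to.add(node.peer_id)
    else:
        d1, d2 = d(q, node.p1), d(q, node.p2)
        if d1 - r <= d2 + r:       # the query ball may reach the left part
            ast_range_query(node.left, d, q, r, hits, forward_to)
        if d1 + r >= d2 - r:       # the query ball may reach the right part
            ast_range_query(node.right, d, q, r, hits, forward_to)

def abs_diff(a, b):
    return abs(a - b)

# One-dimensional toy AST: objects 0..5 are local, the rest lives on peer 2.
ast = Inner(0, 10,
            Inner(0, 4, Bucket([0, 1, 2]), Bucket([3, 4, 5])),
            Remote(2))

hits, peers = [], set()
ast_range_query(ast, abs_diff, 4.5, 1.0, hits, peers)
```

For the query R(4.5, 1.0) the local bucket [0, 1, 2] is pruned, the bucket [3, 4, 5] is scanned, and peer 2 must also be contacted because the query ball crosses the root's hyperplane.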
AST: Logarithmic Replication. A full AST on every peer is space consuming: replication of pivots grows in a linear way. Store only a part of the AST: all paths to local buckets. Deleted sub-trees are replaced by the NNID of the leftmost peer.
AST: Logarithmic Replication (cont.). Resulting tree: replication of pivots grows in a logarithmic way.
VPT* Structure. Similar to the GHT*, but ball partitioning is used for the AST. Based on the Vantage Point Tree [Yia93]: inner nodes have one pivot and a radius; different traversing conditions.
M-Chord: The Metric Chord. Transform the metric space to a one-dimensional domain: use M-Index, a generalized version of iDistance. Divide the domain into intervals and assign each interval to a peer. Use the Chord P2P protocol for navigation; alternatively, the Skip graphs distributed protocol can be used.
M-Chord: Indexing the Distance. iDistance is an indexing technique for vector domains: cluster analysis yields centers, which serve as reference points $p_i$; iDistance keys are assigned to objects; for a range query R(q,r), the intervals of interest are identified. Generalization to metric spaces: select pivots, then partition Voronoi-style.
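A minimal sketch of the iDistance mapping, assuming the commonly used key formula key(x) = i · c + d(p_i, x), where p_i is the pivot closest to x and c is a constant larger than any distance in the data. The formula, the constant `c`, and the function names are assumptions, since the slide shows the idea only in a figure.

```python
import math

def idistance_key(x, pivots, d, c):
    """Map x to a one-dimensional key inside the interval of its closest pivot."""
    i, dist = min(((i, d(p, x)) for i, p in enumerate(pivots)),
                  key=lambda t: t[1])
    return i * c + dist

def query_intervals(q, r, pivots, d, c):
    """Key intervals that can contain answers of the range query R(q, r)."""
    intervals = []
    for i, p in enumerate(pivots):
        dq = d(p, q)
        lo = i * c + max(dq - r, 0.0)
        hi = i * c + min(dq + r, c)
        if lo <= hi:                     # skip clusters the query cannot reach
            intervals.append((lo, hi))
    return intervals

pivots = [(0.0, 0.0), (10.0, 0.0)]
c = 100.0
objects = [(1.0, 0.0), (3.0, 0.0), (4.0, 0.0), (9.0, 0.0)]
keys = [idistance_key(x, pivots, math.dist, c) for x in objects]
intervals = query_intervals((2.0, 0.0), 1.5, pivots, math.dist, c)
```

Objects whose keys fall into one of the intervals are only candidates; each one still has to be checked with the real distance d(q, x) ≤ r. By the triangle inequality no true answer can fall outside the intervals.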
M-Chord: Chord Protocol. Chord is a peer-to-peer navigation protocol: peers are responsible for intervals of keys; $O(\log n)$ hops (for n peers) are needed to localize a node storing a key. M-Chord: set the iDistance domain; make it uniform using a function h; use Chord on this domain.
M-Chord: Range Query. Node $N_q$ initiates the search. Determine the intervals using the generalized iDistance. Forward requests to the peers on those intervals. Search in the nodes using the local organization. Merge the received partial answers.
M-CAN: The Metric CAN. Based on the Content-Addressable Network (CAN), a DHT navigating in an N-dimensional vector space. The idea: 1. Map the metric space to a vector space: given N pivots $p_1, p_2, \ldots, p_N$, transform every object o into a vector F(o). 2. Use CAN to distribute the vector space zones among the nodes and to navigate in the network.
CAN: Principles & Navigation. Principles: the space is divided into zones; each node “owns” a zone; nodes know their neighbors. Navigation: greedy routing; in every step, move to the neighbor closer to the target location. [Figure: 2-dimensional vector space (x, y)]
M-CAN: Contractiveness & Filtering. Use $L_\infty$ as the distance measure: the mapping F is contractive. More pivots give better filtering, but CAN routing works better with fewer dimensions. Additional filtering: some pivots are only used for filtering data (inside the explored nodes); they are not used for mapping into the CAN vector space.
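A small sketch of the pivot mapping and of the filtering it enables, assuming the usual definition F(o) = (d(o, p_1), ..., d(o, p_N)). Contractiveness means that the L∞ distance between two mapped vectors never exceeds the original distance, so any object with L∞(F(x), F(q)) > r can be discarded without computing d(q, x). Names and data are illustrative, and the CAN routing itself is not modeled.

```python
import math

def pivot_map(o, pivots, d):
    """F(o): the vector of distances from o to each pivot."""
    return tuple(d(o, p) for p in pivots)

def l_inf(u, v):
    """Maximum coordinate-wise difference of two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

def filtered_range_query(X, pivots, d, q, r):
    """Range query that first filters in the mapped space, then verifies with d."""
    fq = pivot_map(q, pivots, d)
    candidates = [x for x in X if l_inf(pivot_map(x, pivots, d), fq) <= r]
    answer = [x for x in candidates if d(q, x) <= r]
    return answer, len(candidates)

X = [(0, 0), (1, 1), (5, 5), (6, 1), (2, 7)]
pivots = [(0, 0), (6, 6)]
answer, n_candidates = filtered_range_query(X, pivots, math.dist, (1, 0), 1.5)
```

In a real deployment the vectors F(x) would be computed once at insertion time and stored, so the filtering step costs no extra distance computations at query time.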
Infrastructure Independence: MESSIF (Metric Similarity Search Implementation Framework). Metric space (D,d): vectors ($L_p$ and quadratic form); strings ((weighted) edit and protein sequence). Operations: insert, delete, range query, k-NN query, incremental k-NN. Storage: volatile memory, persistent memory. Further components: centralized index structures; distributed index structures; communication (Net); performance statistics.
Metric index structures. [Figure labels: object; bucket; index structure; distributed index structure; sequential scan; M-Tree, D-Index, M-Index; GHT*, VPT*, M-Chord, M-CAN; insert, delete, queries; MUFIN; overlays]
MUFIN Overview. [Figure labels: external index; feature extraction; peer-to-peer networks; multi-overlay structure; forms (range, k-nearest, complex); strategies (precise, approximate, social); insert, delete; features; web service; universal (batch, telnet, GUI); specialized image web interface]
Applications: a Word Cloud.
Concepts of the Image Search. [Figure labels: image base; similar?]
Images and their Descriptors. [Figure labels: image level (R, G, B); descriptor level]
CoPhIR: Content-based Photo Image Retrieval. Largest publicly available collection of high-quality image metadata: 106 million images. Each image contains: five MPEG-7 visual descriptors (VDs): Scalable Color, Color Structure, Color Layout, Edge Histogram, Homogeneous Texture; other textual information: title, tags, comments, etc. Photos have been crawled from the Flickr photo-sharing site. [Figure labels: images + metadata + MPEG-7 VDs]
Image Search Demo. MUFIN search engine over CoPhIR. Scalability: M-Chord + M-Index. Extensibility: edge histogram, color structure, scalable color, homogeneous texture, color layout. Infrastructure: 6 x IBM server x3400, 2 servers used. [Figure labels: data & queries; infrastructure; index structure]
MUFIN demos.
MUFIN Future Research Directions. MUFIN is a universal similarity search technology. Research directions in: core technology; applications; a style of computing. [Figure labels: MUFIN search engine; data & queries; infrastructure; index structure; scalability: P2P structures; extensibility: metric space; performance tuning]
MUFIN Future Research Directions (October 28, 2011). More scalable, reliable, robust; multi-layer architectures; self-organizing architectures; new query types; flexible sub-sequence matching; efficient multi-feature processing; new style of computing; cloud computing; similarity search as a service. [Figure labels: MUFIN search engine; data & queries; infrastructure; index structure]
Major Applications. Images: sub-image retrieval, ranking, annotation, categorization, benchmarking. Biometrics: face recognition, fingerprint recognition, gait recognition. Signals: audio recognition, time series similarity. Videos: event detection.
A New Style of Computing. From the project-oriented approach towards a similarity cloud: multimedia findability through similarity searching. Advantages: the cloud makes similarity search accessible to common users; computational resources are shared, so users don’t need to maintain any hardware infrastructure; users don’t need to care for the OS, security, software platform, etc.