Presentation is loading. Please wait.

Presentation is loading. Please wait.

Multi Feature Indexing Network MUFIN Similarity Search Platform for many Applications Pavel Zezula Faculty of Informatics Masaryk University, Brno 23.1.20121MUFIN:

Similar presentations


Presentation on theme: "Multi Feature Indexing Network MUFIN Similarity Search Platform for many Applications Pavel Zezula Faculty of Informatics Masaryk University, Brno 23.1.20121MUFIN:"— Presentation transcript:

1 Multi Feature Indexing Network MUFIN Similarity Search Platform for many Applications Pavel Zezula Faculty of Informatics Masaryk University, Brno 23.1.20121MUFIN: Multi Feature Indexing Network

2 Outline of the talk Why similarity Principles of metric similarity searching The MUFIN approach Demo applications Future directions 23.1.2012MUFIN: Multi Feature Indexing Network2

3 Real-Life Motivation The social psychology view Any event in the history of organism is, in a sense, unique. Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity. Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc. 23.1.20123MUFIN: Multi Feature Indexing Network

4 Contemporary Networked Media The digital data view Almost everything that we see, read, hear, write, measure, or observe can be digital. Users autonomously contribute to production of global media and the growth is exponential. Sites like Flickr, YouTube, Facebook host user contributed content for a variety of events. The elements of networked media are related by numerous multi-facet links of similarity. 23.1.20124MUFIN: Multi Feature Indexing Network

5 Examples with Similarity Does the computer disk of a suspected criminal contain illegal multimedia material? What are the stocks with similar price histories? Which companies advertise their logos in the direct TV transmission of football match? Is it the situation on the web getting close to any of the network attacks which resulted in significant damage in the past? 23.1.20125MUFIN: Multi Feature Indexing Network

6 Challenge Networked media is getting close to the human “fact- bases” – the gap between physical and digital has blurred Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections. WHY? It is the similarity which is in the world revealing. 23.1.20126MUFIN: Multi Feature Indexing Network

7 Limitations: Data Types We have Attributes – Numbers, strings, etc. Text (text-based) – Documents, annotations We need Multimedia – Image, video, audio Security – Biometrics Medicine – EKG, EEG, EMG, EMR, CT, etc. Scientific data – Biology, chemistry, physics, life sciences, economics Others – Motion, emotion, events, etc. 23.1.20127MUFIN: Multi Feature Indexing Network

8 Limitations: Models of Similarity We have Simple geometric models, typically vector spaces We need More complex model Non metric models Asymmetric similarity Subjective similarity Context aware similarity Complex similarity Etc. 23.1.20128MUFIN: Multi Feature Indexing Network

9 Limitations: Queries We have Simple query – Nearest neighbor – Range We need More query types – Reverse NN, distinct NN, similarity join Other similarity-based operations – Filtering, classification, event detection, clustering, etc. Similarity algebra – May become the basis of a “Similarity Data Management System” 23.1.20129MUFIN: Multi Feature Indexing Network

10 Limitations: Implementation Strategies We have Centralized or parallel processing We need Scalable and distributed architectures MapReduce like approaches P2P architectures Cloud computing Self-organized architectures Etc. 23.1.201210MUFIN: Multi Feature Indexing Network

11 Search Strategy Evolution Scalability ● data volume - exponential ● number of users (queries) ● variety of data types ● multi-lingual, -feature –modal queries Determinism exact match ► similarity precise ► approximate same answer ► good answer; recommendation fixed query ► personalized; context aware fixed infrastr. ► dynamic mapping; mobile dev. grade high low well establishedcutting-edgeresearch peer-to-peer centralized parallel distributed self-organized 23.1.201211MUFIN: Multi Feature Indexing Network

12 similarity effectiveness efficiency stimuli matching extraction evaluation execution algebra Similarity Data Management System Similarity Data Management System 23.1.201212MUFIN: Multi Feature Indexing Network

13 Metric Search Grows in Popularity Hanan Samet Foundation of Multidimensional and Metric Data Structures Morgan Kaufmann, 2006 P. Zezula, G. Amato, V. Dohnal, and M. Batko Similarity Search: The Metric Space Approach Springer, 2006 23.1.201213MUFIN: Multi Feature Indexing Network

14 The MUFIN Approach MUFINMUFIN: MUlti-Feature Indexing Network SEARCH data & queries infrastructure index structure Scalability P2P structure Extensibility metric space Independence Infrastructure as a service 23.1.201214MUFIN: Multi Feature Indexing Network

15 Extensibility: Metric Abstraction of Similarity Metric space: M = ( D,d) – D – domain – distance function d(x,y)  x,y,z  D d(x,y) > 0- non-negativity d(x,y) = 0  x = y- identity d(x,y) = d(y,x)- symmetry d(x,y) ≤ d(x,z) + d(z,y)- triangle inequality 23.1.201215MUFIN: Multi Feature Indexing Network

16 Examples of Distance Functions L p Minkovski distance (for vectors) L 1 – city-block distance L 2 – Euclidean distance L  – infinity Edit distance (for strings) minimal number of insertions, deletions and substitutions d(‘application’, ‘applet’) = 6 Jaccard’s coefficient (for sets A,B) 23.1.201216MUFIN: Multi Feature Indexing Network

17 Examples of Distance Functions Mahalanobis distance – for vectors with correlated dimensions Hausdorff distance – for sets with elements related by another distance Earth movers distance – primarily for histograms (sets of weighted features) and many others 23.1.201217MUFIN: Multi Feature Indexing Network

18 Similarity Search Problem For X  D in metric space M, pre-process X so that the similarity queries are executed efficiently. No total ordering exists! 23.1.201218MUFIN: Multi Feature Indexing Network

19 23.1.2012MUFIN: Multi Feature Indexing Network19 Similarity Queries Range query Nearest neighbor query Similarity join Combined queries Complex queries

20 23.1.2012MUFIN: Multi Feature Indexing Network20 Similarity Range Query range query – R(q,r) = { x  X | d(q,x) ≤ r } … all museums up to 2km from my hotel … r q

21 23.1.2012MUFIN: Multi Feature Indexing Network21 Nearest Neighbor Query the nearest neighbor query – NN(q) = x – x  X,  y  X, d(q,x) ≤ d(q,y) k-nearest neighbor query – k-NN(q,k) = A – A  X, |A| = k –  x  A, y  X – A, d(q,x) ≤ d(q,y) … five closest museums to my hotel … q k=5

22 23.1.2012 MUFIN: Multi Feature Indexing Network22 Similarity Join Queries similarity join of two data sets similarity self join  X = Y …pairs of hotels and museums which are five minutes walk apart … 

23 23.1.2012MUFIN: Multi Feature Indexing Network23 Combined Queries Range + Nearest neighbors Nearest neighbor + similarity joins – by analogy

24 23.1.2012MUFIN: Multi Feature Indexing Network24 Complex Queries Find the best matches of circular shape objects with red color The best match for circular shape or red color needs not be the best match combined A 0 algorithm Threshold algorithm

25 23.1.2012MUFIN: Multi Feature Indexing Network25 Partitioning Principles Given a set X  D in M =( D,d), basic partitioning principles have been defined: – Ball partitioning – Generalized hyper-plane partitioning – Excluded middle partitioning – Clustering

26 23.1.2012MUFIN: Multi Feature Indexing Network26 Ball Partitioning Inner set: { x  X | d(p,x) ≤ d m } Outer set: { x  X | d(p,x) > d m } p dmdm

27 23.1.2012MUFIN: Multi Feature Indexing Network27 Generalized Hyper-plane { x  X | d(p 1,x) ≤ d(p 2,x) } { x  X | d(p 1,x) > d(p 2,x) } p2p2 p1p1

28 23.1.2012MUFIN: Multi Feature Indexing Network28 Excluded Middle Partitioning Inner set: { x  X | d(p,x) ≤ d m -  } Outer set: { x  X | d(p,x) > d m +  } Excluded set: otherwise p dmdm 22 p dmdm

29 23.1.2012MUFIN: Multi Feature Indexing Network29 Clustering Cluster data into sets – bounded by a ball region – { x  X | d(p i,x) ≤ r i c }

30 Scalability: Peer-to-Peer Indexing Local search: M-tree, D-Index, M-Index Native metric techniques: GHT*, VPT* Transformation techniques: M-CAN, M-Chord 23.1.201230MUFIN: Multi Feature Indexing Network

31 The M-tree [Ciaccia, Patella, Zezula, VLDB 1997] 1)Paged organization 2)Dynamic 3) Suitable for arbitrary metric spaces 4) I/O and CPU optimization - computing d can be time-consuming 23.1.201231MUFIN: Multi Feature Indexing Network

32 The M-tree Idea Depending on the metric, the “shape” of index regions changes C D E F A B B F D E A C Metric: L 2 (Euclidean) L 1 (city-block) L  (max-metric) weighted-Euclidean quadratic form 23.1.201232MUFIN: Multi Feature Indexing Network

33 23.1.2012MUFIN: Multi Feature Indexing Network33 o7o7 M-tree: Example o1o1 o6o6 o 10 o3o3 o2o2 o5o5 o4o4 o9o9 o8o8 o 11 o1o1 4.5-.-o2o2 6.9-.-o1o1 1.40.0o 10 1.23.3o7o7 1.33.8o2o2 2.90.0o4o4 1.65.3 o2o2 0.0o8o8 2.9o1o1 0.0o6o6 1.4o 10 0.0o3o3 1.2o7o7 0.0o5o5 1.3o 11 1.0o4o4 0.0o9o9 1.6 Covering radius Distance to parent Leaf entries

34 M-tree family Bulk loading Slim-tree Multi-way insertion PM-tree M 2 -tree etc. 23.1.201234MUFIN: Multi Feature Indexing Network

35 D-Index [Dohnal, Gennaro, Zezula, MTA 2002] 4 separable buckets at the first level 2 separable buckets at the second level exclusion bucket of the whole structure 23.1.201235MUFIN: Multi Feature Indexing Network

36 D-index: Insertion 23.1.201236MUFIN: Multi Feature Indexing Network

37 D-index: Range Search q r q r q r q r q r q r 23.1.201237MUFIN: Multi Feature Indexing Network

38 Implementation Postulates of Distributed Indexes dynamism – nodes can be added and removed no hot-spots – no centralized nodes, no flooding by messages (transactions) update independence – network update at one site does not require an immediate change propagation to all the other sites 23.1.201238MUFIN: Multi Feature Indexing Network

39 Distributed Similarity Search Structures Native metric structures: – GHT* (Generalized Hyperplane Tree) – VPT* (Vantage Point Tree) Transformation approaches: – M-CAN (Metric Content Addressable Network) – M-Chord (Metric Chord) 23.1.201239MUFIN: Multi Feature Indexing Network

40 23.1.2012MUFIN: Multi Feature Indexing Network40 GHT* Address Search Tree Based on the Generalized Hyperplane Tree [Uhl91] – two pivots for binary partitioning p6p6 p5p5 p3p3 p4p4 p1p1 p2p2 p1p1 p2p2 p5p5 p6p6 p3p3 p4p4

41 23.1.2012MUFIN: Multi Feature Indexing Network41 GHT* Address Search Tree Inner node – two pivots (reference objects) Leaf node – BID pointer to a bucket if data stored on the current peer – NNID pointer to a peer if data stored on a different peer p1p1 p2p2 p5p5 p6p6 p3p3 p4p4 BID 1 BID 2 BID 3 NNID 2 Peer 2

42 23.1.2012MUFIN: Multi Feature Indexing Network42 GHT* Address Search Tree

43 23.1.2012MUFIN: Multi Feature Indexing Network43 BID 1 BID 2 BID 3 NNID 2 Peer 2 p1p1 p2p2 p5p5 p6p6 p3p3 p4p4 BID 3 NNID 2 p5p5 p6p6 p1p1 p2p2 GHT* Range Query Range query R(q,r) – traverse peer’s own AST – search buckets for all BIDs found – forward query to all NNIDs found p6p6 p5p5 p3p3 p4p4 r q p1p1 p2p2

44 23.1.2012MUFIN: Multi Feature Indexing Network44 AST: Logarithmic replication Full AST on every peer is space consuming – replication of pivots grows in a linear way Store only a part of the AST: – all paths to local buckets Deleted sub-trees: – replaced by NNID of the leftmost peer p 13 p 14 p 11 p 12 p5p5 p6p6 p1p1 p2p2 p3p3 p4p4 p7p7 p8p8 p9p9 p 10 NNID 2 NNID 3 BID 1 NNID 4 NNID 5 NNID 6 NNID 7 NNID 8 p1p1 p2p2 p3p3 p4p4 p7p7 p8p8 BID 1 NNID 3 NNID 5

45 23.1.2012MUFIN: Multi Feature Indexing Network45 AST: Logarithmic Replication (cont.) Resulting tree – replication of pivots grows in a logarithmic way p1p1 p2p2 p3p3 p4p4 p7p7 p8p8 NNID 2 NNID 3 BID 1 NNID 5 p1p1 p2p2 p3p3 p4p4 p7p7 p8p8 BID 1

46 23.1.2012MUFIN: Multi Feature Indexing Network46 p1p1 r1r1 p3p3 r3r3 VPT* Structure Similar to the GHT* - ball partitioning is used for AST Based on the Vantage Point Tree [Yia93] inner nodes have one pivot and a radius different traversing conditions p2p2 r2r2 p 1 (r 1 ) p 2 (r 2 )p 3 (r 3 )

47 M-Chord: The Metric Chord Transform metric space to one-dimensional domain – Use M-Index - a generalized version of the iDistance Divide the domain into intervals – assign each interval to a peer Use the Chord P2P protocol for navigation The Skip graphs distributed protocol can be used, alternatively 23.1.201247MUFIN: Multi Feature Indexing Network

48 –range query R(q,r): identify intervals of interest Generalization to metric spaces –select pivots –then partition: Voronoi-style M-Chord: Indexing the Distance iDistance – indexing technique for vector domains –cluster analysis = centers = reference points p i –assign iDistance keys to objects 23.1.2012 48MUFIN: Multi Feature Indexing Network

49 M-Chord: Chord Protocol Peer-to-Peer navigation protocol Peers are responsible for intervals of keys hops to localize a node storing a key M-Chord set the iDistance domain make it uniform: function h Use Chord on this domain 23.1.201249MUFIN: Multi Feature Indexing Network

50 M-Chord: Range Query Node N q initiates the search Determine intervals –generalized iDistance Forward requests to peers on intervals Search in the nodes –using local organization Merge the received partial answers 23.1.2012 50MUFIN: Multi Feature Indexing Network

51 23.1.2012MUFIN: Multi Feature Indexing Network51 M-CAN: The Metric CAN Based on the Content-Addressable Network (CAN) – a DHT navigating in an N-dimensional vector space The Idea: 1.Map the metric space to a vector space – given N pivots: p 1, p 2, …, p N, transform every o into vector F(o) 2.Use CAN to – distribute the vector space zones among the nodes – navigate in the network

52 23.1.2012MUFIN: Multi Feature Indexing Network52 CAN: Principles & Navigation CAN – the principles – the space is divided in zones – each node “owns” a zone – nodes know their neighbors CAN – the navigation – greedy routing – in every step, move to the neighbor closer to the target location 2-dimensional vector space 1 62 53 4 x,y

53 23.1.2012MUFIN: Multi Feature Indexing Network53 M-CAN: Contractiveness & Filtering Use the L ∞ as a distance measure – the mapping F is contractive More pivots  better filtering – but, CAN routing is better for less dimensions Additional filtering – some pivots are only used for filtering data (inside the explored nodes) – they are not used for mapping into CAN vector space

54 Infrastructure Independence: MESSIF Metric Similarity Search Implementation Framework Metric space (D,d) OperationsStorage Centralized index structures Distributed index structures Communication Net Vectors L p and quadratic form Strings (weighted) edit and protein sequence Insert, delete, range query, k-NN query, Incremental k-NN Volatile memory Persistent memory Performance statistics 23.1.201254MUFIN: Multi Feature Indexing Network

55 Metric index structures Object Bucket Index structure Distributed index structure Sequential scan M-Tree, D-Index, M-Index GHT*, VPT*, M-Chord, MCAN Insert Delete Queries MUFIN Overlays 23.1.201255MUFIN: Multi Feature Indexing Network

56 External index Feature extraction MUFIN Overview Peer-to-Peer Networks Multi-overlay structure Forms range k-nearest complex Strategies precise approximate social insert delete features Web service Universal batch, telnet, GUI Specialized image web interface 23.1.201256MUFIN: Multi Feature Indexing Network

57 Applications: a Word Cloud 23.1.201257MUFIN: Multi Feature Indexing Network

58 Concepts of the Image search Image base similar? 23.1.201258MUFIN: Multi Feature Indexing Network

59 Images and their Descriptors Image level R B G Descriptor level 23.1.201259MUFIN: Multi Feature Indexing Network

60 Largest publicly available collection of high-quality images metadata: 106 million images Each image contains: Five MPEG-7 VDs: Scalable Color, Color Structure, Color Layout, Edge Histogram, Homogeneous Texture Other textual information: title, tags, comments, etc. Photos have been crawled from the Flickr photo-sharing site. http://cophir.isti.cnr.it/100M images + metadata + MPEG-7 VDs CoPhIR: Content-based Photo Image Retrieval 23.1.201260MUFIN: Multi Feature Indexing Network

61 MUFIN SEARCH ENGINE data & queries infrastructure index structure Scalability M-Chord + M-Index Extensibility COPHIR edge histogram color structure scalable color homogeneous texture color layout 6 x IBM server x3400 – 2 servers used Image Search Demo http://mufin.fi.muni.cz/imgsearch/ http://mufin.fi.muni.cz/imgsearch/ 23.1.201261MUFIN: Multi Feature Indexing Network

62 MUFIN demos http://mufin.fi.muni.cz/imgsearch/similar http://www.pixmac.com/ http://www.pixmac.com/ http://mufin.fi.muni.cz/twenga/random http://mufin.fi.muni.cz/fingerprints/random http://mufin.fi.muni.cz/subseq/random http://mufin.fi.muni.cz/plugins/annotation 23.1.201262MUFIN: Multi Feature Indexing Network

63 MUFIN Future Research Directions MUFIN - a universal similarity search technology Research directions in: – Core technology – Applications – A style of computing MUFIN Search Engine data & queries infrastructure index structure Scalability P2P structures Extensibility metric space Performance Tuning 23.1.201263MUFIN: Multi Feature Indexing Network

64 MUFIN Future Research Directions October 28, 2011 MUFIN Search Engine data & queries infrastructure index structure More scalable, reliable, robust Multi-layer architectures Self-organizing architectures New query types Flexible sub-sequence matching Efficient multi-feature processing New style of computing Cloud Computing Similarity Search as Service 23.1.201264MUFIN: Multi Feature Indexing Network

65 Major Applications – Images: Sub-image retrieval Ranking Annotation Categorization Benchmarking – Biometrics: Face recognition Fingerprint recognition Gait recognition – Signals: Audio recognition Time series similarity – Videos: Event detection 23.1.201265MUFIN: Multi Feature Indexing Network

66 A New Style of Computing From the project-oriented approach towards similarity cloud for multimedia findability through similarity searching Advantages: – Cloud makes similarity search accessible to common users – Computational resources are shared – users don’t need to maintain any hardware infrastructure – Users don’t need to care for the OS, security, software platform, etc. 23.1.201266MUFIN: Multi Feature Indexing Network


Download ppt "Multi Feature Indexing Network MUFIN Similarity Search Platform for many Applications Pavel Zezula Faculty of Informatics Masaryk University, Brno 23.1.20121MUFIN:"

Similar presentations


Ads by Google