Near Duplicate Image Detection: min-Hash and tf-idf weighting


1 Near Duplicate Image Detection: min-Hash and tf-idf weighting
Ondřej Chum, Center for Machine Perception, Czech Technical University in Prague. Co-authors: James Philbin and Andrew Zisserman.

2 Outline
Near duplicate detection and large databases (find all groups of near duplicate images in a database)
min-Hash review
Novel similarity measures
Results on TrecVid 2006
Results on the University of Kentucky database (Nister & Stewenius)
Beyond near duplicates

3 Scalable Near Duplicate Image Detection
Images perceptually (almost) identical but not identical (noise, compression level, small motion, small occlusion)
Similar images of the same object / scene
Large databases
Fast – linear in the number of duplicates
Store a small constant amount of data per image

4 Image Representation
Feature detector, SIFT descriptor [Lowe'04]
Vector quantization against a visual vocabulary
Bag of words (word frequencies) / set of words representation
[Figure: detected local features are quantized into visual words; the image is summarized by its visual word counts.]
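A minimal sketch of the bag-of-words step described above, assuming descriptors have already been extracted and the visual vocabulary is given as a matrix of cluster centres (both arrays here are stand-ins, not the authors' data): each descriptor is assigned to its nearest visual word, and the image is summarized either by word counts (bag of words) or by the set of distinct words.

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary.

    descriptors: (n, d) array of SIFT-like descriptors from one image
    vocabulary:  (V, d) array of cluster centres (visual words)
    Returns (counts, word_set): per-word frequencies and the set of words.
    """
    # Squared Euclidean distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                       # nearest word per descriptor
    counts = np.bincount(words, minlength=len(vocabulary))
    return counts, set(words.tolist())

# Toy usage with random data standing in for real features and a real vocabulary
rng = np.random.default_rng(0)
descs = rng.normal(size=(50, 128))                  # 50 fake SIFT descriptors
vocab = rng.normal(size=(1000, 128))                # 1000 fake visual words
tf, word_set = bag_of_words(descs, vocab)
```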

5 min-Hash
min-Hash is a locality sensitive hashing (LSH) function m that selects an element m(A1) from set A1 and m(A2) from set A2 so that P{m(A1) == m(A2)} = sim(A1, A2).
Image similarity is measured as set overlap, sim(A1, A2) = |A1 ∩ A2| / |A1 ∪ A2|, estimated with the min-Hash algorithm.
Spatially related images share visual words.
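A minimal Python illustration of the min-Hash property stated above (my own sketch, not the authors' code): each hash function assigns every visual word a pseudo-random value, the min-Hash of a set is its smallest-valued element, and the fraction of hash functions on which two sets agree estimates their set overlap.

```python
import random

def word_hash(word, seed):
    """Pseudo-random value in [0, 1) for a (hash function, word) pair."""
    return random.Random(f"{seed}:{word}").random()

def min_hash(word_set, seed):
    """min-Hash of a set under the hash function identified by `seed`."""
    return min(word_set, key=lambda w: word_hash(w, seed))

def estimated_overlap(A1, A2, n_hashes=2000):
    """Estimate sim(A1, A2) = |A1 ∩ A2| / |A1 ∪ A2| from min-Hash agreement."""
    agree = sum(min_hash(A1, s) == min_hash(A2, s) for s in range(n_hashes))
    return agree / n_hashes

A1 = {"tower", "window", "door", "arch"}
A2 = {"tower", "window", "roof", "arch", "column"}
print(estimated_overlap(A1, A2))   # ≈ 3/6 = 0.5, the true set overlap
```

The property holds because the word with the smallest hash value over A1 ∪ A2 is uniformly distributed over the union, and the two min-Hashes coincide exactly when that word lies in the intersection.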

6 min-Hash
Example: vocabulary {A, B, C, D, E, F}; Set A = {A, B, C}, Set B = {B, C, D}, Set C = {A, E, F}.
Four independent hash functions f1–f4 assign each word a pseudo-random value ~ Un(0,1) (equivalently, a random ordering of the vocabulary); the min-Hash of a set is its lowest-valued word.
Overlap estimated from the four min-Hashes (true set overlap in brackets):
overlap(A, B) = 3/4 (1/2), overlap(A, C) = 1/4 (1/5), overlap(B, C) = 0 (0)

7 min-Hash Retrieval
A sketch is an s-tuple of min-Hashes; s is the size of the sketch and k is the number of hash tables (one independent sketch per table).
Probability of a sketch collision: sim(A, B)^s
Probability of retrieval (at least one sketch collision over the k hash tables): 1 – (1 – sim(A, B)^s)^k
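A hedged sketch of this retrieval scheme under my own simplifications (in-memory Python dicts as the k hash tables, string visual words): each image contributes one s-tuple of min-Hashes per table, and two images become candidates if they collide in at least one table. The closed-form retrieval probability from the slide is also evaluated for s = 3, k = 512, the setting used on the next slide.

```python
import random
from collections import defaultdict

def word_hash(word, seed):
    return random.Random(f"{seed}:{word}").random()

def sketch(word_set, table, s):
    """One s-tuple of min-Hashes for hash table `table`."""
    return tuple(min(word_set, key=lambda w: word_hash(w, (table, j)))
                 for j in range(s))

def index_images(images, s=3, k=512):
    """images: dict image_id -> set of visual words. Builds k hash tables keyed by sketches."""
    tables = [defaultdict(list) for _ in range(k)]
    for img_id, words in images.items():
        for t in range(k):
            tables[t][sketch(words, t, s)].append(img_id)
    return tables

def retrieval_probability(sim, s=3, k=512):
    """P(at least one sketch collision) = 1 - (1 - sim^s)^k."""
    return 1 - (1 - sim ** s) ** k

print(retrieval_probability(0.35, s=3, k=512))  # ≈ 1.0: a pair with 35% overlap is almost surely retrieved
print(retrieval_probability(0.05, s=3, k=512))  # ≈ 0.06: unrelated pairs are rarely retrieved
```

In practice only the sketches (a small constant amount of data per image) need to be stored, not the full word sets.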

8 Probability of Retrieving an Image Pair
[Plot: probability of retrieval vs. similarity (set overlap) for s = 3, k = 512; unrelated images fall in the low-similarity region, images of the same object in the middle, near duplicate images at high similarity.]

9 More Complex Similarity Measures

10 Document / Image / Object Retrieval
Term Frequency – Inverse Document Frequency (tf-idf) weighting scheme [1-4].
Words common to many documents are less informative:
idf_w = log( #documents / #documents containing X_w )
The frequency of the words is recorded (good for repeated structures, textures, etc.).
[1] Baeza-Yates, Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
[2] Sivic, Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV'03.
[3] Nister, Stewenius. Scalable recognition with a vocabulary tree. CVPR'06.
[4] Philbin, Chum, Isard, Sivic, Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR'07.
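A small illustration of the idf weight above, with a toy corpus of word sets standing in for the image database (assumed data, not from the talk); the word importance d_w used by the weighted similarity measures later can be taken as this idf value.

```python
import math
from collections import Counter

def idf_weights(documents):
    """documents: list of sets of visual words. Returns idf_w for every word."""
    n_docs = len(documents)
    doc_freq = Counter(w for doc in documents for w in doc)   # documents containing X_w
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

docs = [{"A", "B", "C"}, {"B", "C", "D"}, {"A", "E", "F"}, {"B", "C"}]
weights = idf_weights(docs)
# "B" and "C" appear in 3 of 4 documents -> low idf; "D", "E", "F" are rarer -> high idf
print(weights)
```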

11 More Complex Similarity Measures
Set of words representation with different importance of visual words: importance d_w of word X_w (weighted set overlap).
Bag of words representation (word frequency is recorded): histogram intersection similarity measure.
Combining both, importance d_w of word X_w gives the weighted histogram intersection.
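A sketch of the similarity measures as I read them from the slides; the exact normalizations in the paper may differ. Here set overlap is the Jaccard similarity, weighted set overlap weights each word by d_w, and (weighted) histogram intersection takes the minimum over word frequencies, normalized by the corresponding maximum/union term.

```python
def set_overlap(A, B):
    return len(A & B) / len(A | B)

def weighted_set_overlap(A, B, d):
    """d: dict word -> importance d_w (e.g. idf)."""
    return sum(d[w] for w in A & B) / sum(d[w] for w in A | B)

def weighted_histogram_intersection(ta, tb, d):
    """ta, tb: dicts word -> frequency; d: word -> importance d_w."""
    words = set(ta) | set(tb)
    num = sum(d[w] * min(ta.get(w, 0), tb.get(w, 0)) for w in words)
    den = sum(d[w] * max(ta.get(w, 0), tb.get(w, 0)) for w in words)
    return num / den

d = {"A": 2.0, "B": 0.5, "C": 1.0, "D": 3.0}
print(set_overlap({"A", "B"}, {"B", "C"}))              # 1/3
print(weighted_set_overlap({"A", "B"}, {"B", "C"}, d))  # 0.5 / 3.5
print(weighted_histogram_intersection({"A": 2, "B": 1}, {"B": 3, "C": 1}, d))
```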

12 Word Weighting for min-Hash
For the unweighted hash function (set overlap similarity), all words X_w have the same chance of becoming the min-Hash.
For the weighted hash function, the probability of X_w being the min-Hash is proportional to its importance d_w.
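The hash function formulas on this slide were images and did not survive extraction. A standard construction with exactly the stated property is to draw u_w ~ Un(0,1) per word and hash it as -ln(u_w)/d_w, which is exponentially distributed with rate d_w, so the minimum falls on word w with probability proportional to d_w. The paper's exact hash function may differ; the sketch below only illustrates the property.

```python
import math
import random
from collections import Counter

def weighted_min_hash(word_set, d, seed):
    """Word minimizing -ln(u_w) / d_w, with u_w ~ Un(0,1] fixed per (seed, word).

    -ln(u)/d is exponentially distributed with rate d, so the minimum is
    attained by word w with probability d_w / sum of d over the set,
    i.e. proportional to the word importance d_w.
    """
    def key(w):
        u = 1.0 - random.Random(f"{seed}:{w}").random()   # value in (0, 1], avoids log(0)
        return -math.log(u) / d[w]
    return min(word_set, key=key)

# Empirical check of the stated property on a toy set
d = {"A": 1.0, "B": 2.0, "C": 7.0}
hits = Counter(weighted_min_hash({"A", "B", "C"}, d, s) for s in range(20000))
print({w: round(c / 20000, 2) for w, c in hits.items()})  # ≈ {'A': 0.1, 'B': 0.2, 'C': 0.7}
```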

13 Histogram Intersection Using min-Hash
Idea: represent a histogram as a set and reuse the min-Hash set machinery.
Visual words: A, B, C, D. Bag of words A with t_A = (2, 1, 3, 0) becomes set A' = {A1, A2, B1, C1, C2, C3}; bag of words B with t_B = (0, 2, 3, 1) becomes set B' = {B1, B2, C1, C2, C3, D1}, over the extended min-Hash vocabulary A1, A2, B1, B2, C1, C2, C3, D1.
The set overlap of A' and B' is the histogram intersection of A and B.
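A sketch of the unary encoding described on this slide (the helper name `to_set` is mine, not the paper's): the histogram is unfolded into a set of (word, copy-index) elements, after which the plain and weighted min-Hash machinery from the previous slides applies unchanged.

```python
def to_set(histogram):
    """Unfold a bag of words: word w with count t_w becomes elements w1 ... w{t_w}."""
    return {f"{w}{i}" for w, t in histogram.items() for i in range(1, t + 1)}

tA = {"A": 2, "B": 1, "C": 3, "D": 0}
tB = {"A": 0, "B": 2, "C": 3, "D": 1}
A_set, B_set = to_set(tA), to_set(tB)

# |A' ∩ B'| = sum_w min(tA_w, tB_w) = 4, the histogram intersection of A and B
print(len(A_set & B_set))                        # 4
print(len(A_set & B_set) / len(A_set | B_set))   # 0.5, the normalized set overlap
```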

14 Results
Quality of the retrieval
Speed – the number of documents considered as near-duplicates

15 TRECVid Challenge
165 hours of news footage, different channels, different countries
146,588 key-frames, 352×240 pixels
No ground truth on near duplicates

16 min-Hash on TrecVid
DoG features, vocabulary of 64,635 visual words
192 min-Hashes, 3 min-Hashes per sketch, 64 sketches
Similarity threshold 35%
Examples of images with 24 – 45 near duplicates: # common results / set overlap only / weighted set overlap only
The quality of the retrieval appears to be similar for both measures.

17 Comparison of Similarity Measures
Images only sharing uninformative visual words do not generate sketch collisions for the proposed similarity measures.
[Plot: number of sketch collisions vs. image pair similarity for set overlap, weighted set overlap, and weighted histogram.]

18 University of Kentucky Dataset
10,200 images in groups of four
Querying by each image in turn
The average number of correct retrievals in the top 4 is measured
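A small helper illustrating the evaluation protocol just described (hypothetical function and data layout, assuming the retrieval system returns a ranked list per query): for each query, count how many of its top 4 results belong to the query's group of four, then average over all queries; a perfect system scores 4.

```python
def kentucky_score(rankings, group_of):
    """rankings: dict query_id -> ranked list of retrieved image ids.
    group_of:  dict image_id -> group id (each group contains 4 images).
    Returns the average number of correct images among the top 4 results."""
    total = 0
    for query, ranked in rankings.items():
        total += sum(group_of[r] == group_of[query] for r in ranked[:4])
    return total / len(rankings)
```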

19 Evaluation
Vocabulary sizes 30k and 100k
Number of min-Hashes: 512, 640, 768, and 896; 2 min-Hashes per sketch
Number of sketches: 0.5, 1, 2, and 3 times the number of min-Hashes
Score on average:
  weighted histogram intersection 4.6% better than weighted set overlap
  weighted set overlap 1.5% better than set overlap
Number of considered documents on average:
  weighted histogram intersection 1.7 times fewer than weighted set overlap
  weighted set overlap 1.5 times fewer than set overlap
Absolute numbers for weighted histogram intersection:
          min-Hashes  sketches  score 30k  score 100k  docs 30k  docs 100k
  Usable  640                   2.928      2.889       488.2     117.6
  Best    896         2688      3.090      3.166       1790.8    452.8
Retrieval with tf-idf flat scoring [Nister & Stewenius]: score 3.16; number of considered documents (non-zero tf-idf): 10,089.9 (30k) and 9,659.4 (100k)

20 Query Examples
[Figure: query images and the results retrieved under set overlap, weighted set overlap, and weighted histogram intersection.]

21 Beyond Near Duplicate Detection

22 Discovery of Spatially Related Images
Find and match ALL groups (clusters) of spatially related images in a large database, using only visual information, i.e. not using (Flickr) tags, EXIF info, GPS, ...
Chum, Matas: Large Scale Discovery of Spatially Related Images, TR, May 2008.

23 Probability of Retrieving an Image Pair
[Plot: probability of retrieval vs. similarity (set overlap); regions for near duplicate images and images of the same object are marked.]

24 Image Clusters as Connected Components
Randomized clustering method (see the sketch below):
1. Seed Generation – hashing (fast, low recall): characterize images by pseudo-random numbers stored in a hash table; time complexity equal to the sum of second moments of a Poisson random variable – linear up to database size D ≈ 2^40
2. Seed Growing – retrieval (thorough, high recall): complete the clusters; applied only to cluster members c << D, complexity O(cD)
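A hedged, much-simplified sketch of the two-stage idea above (my own structure, not the Chum–Matas implementation): hash-bucket collisions provide seed pairs, seeds are grown by retrieval around their members, and the image clusters are read off as connected components via union-find. The `make_sketches` and `retrieve_similar` callbacks are assumptions standing in for the min-Hash indexing and the spatially verified retrieval stages.

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster_images(images, make_sketches, retrieve_similar):
    """images: dict image_id -> set of visual words.
    make_sketches(words): iterable of hashable sketches (seed generation).
    retrieve_similar(img_id): ids of verified matches (seed growing)."""
    uf = UnionFind()
    # 1. Seed generation: images falling into the same hash bucket become seed pairs
    buckets = defaultdict(list)
    for img_id, words in images.items():
        for sk in make_sketches(words):
            buckets[sk].append(img_id)
    seeds = {tuple(sorted(b[:2])) for b in buckets.values() if len(b) >= 2}
    # 2. Seed growing: complete clusters by retrieval around seed members only
    for a, b in seeds:
        uf.union(a, b)
        for member in (a, b):
            for match in retrieve_similar(member):
                uf.union(member, match)
    # Connected components of the match graph are the image clusters
    clusters = defaultdict(set)
    for img_id in images:
        clusters[uf.find(img_id)].add(img_id)
    return list(clusters.values())
```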

25 Clustering of 100k Images
Images downloaded from Flickr
Includes 11 Oxford Landmarks with manually labelled ground truth: All Souls, Ashmolean, Balliol, Bodleian, Christ Church, Cornmarket, Hertford, Keble, Magdalen, Pitt Rivers, Radcliffe Camera

26 Results on 100k Images
Number of images: 104,844
Timing: 17 min + 16 min = sec / image
[Table: Good / OK / Unrelated image counts and Component Recall (CR) per landmark. CR (%): All Souls 97.44, Ashmolean 68.00, Balliol 33.33, Bodleian 95.83, Christ Church 89.74, Cornmarket 66.67, Hertford 96.30, Keble 85.71, Magdalen 5.56, Pitt Rivers 100, Radcliffe Camera 98.64]
Chum, Matas TR, May 2008

27 Results on 100k Images – Comparison with Philbin, Sivic, Zisserman
[Table: the same per-landmark results as the previous slide, with a second Component Recall column from Philbin, Sivic, Zisserman, BMVC 2008: 96, 60, 33, 71, 67, 65, 57, 20, 100, 98]
Chum, Matas TR, May 2008; Philbin, Sivic, Zisserman BMVC 2008

28 Conclusions
New similarity measures were derived for the min-Hash framework: weighted set overlap, histogram intersection, weighted histogram intersection.
Experiments show that the similarity measures are superior to the state of the art in the quality of the retrieval (up to 7% on the University of Kentucky dataset) and in the speed of the retrieval (up to 2.5 times).
min-Hash is a very useful tool for randomized image clustering.

29 Thank you!

