ALADDIN A Locality Aligned Deep Model for Instance Search


1 ALADDIN A Locality Aligned Deep Model for Instance Search
2018/5/18 Wenhui Jiang, Zhicheng Zhao, Fei Su, Anni Cai. Beijing University of Posts and Telecommunications. Good afternoon everyone, and thank you for coming. Today let me introduce our work ALADDIN, a locality aligned deep model for instance search. At the mention of ALADDIN, you may first think of the magic lamp in "One Thousand and One Nights". We chose the name because our system also shows some surprising results on the instance search task. Coincidentally, ALADDIN is formed from key letters of the system's full name.

2 Introduction
First, let me explain what instance search is. Instance search retrieves all the images that contain an object or scene similar to a given query region. For example, given a query showing a no-smoking logo, we try to return all the images that contain the same logo.

3 Challenges: Asymmetrical Similarity
This task is challenging mainly because of two problems. The first is the asymmetrical similarity between the query and the dataset images. As shown on the left (instance search), the query object may cover only a small part of a dataset image, while the rest of the image is background. In contrast, as shown on the right, similar image search (content-based image search) aims at retrieving globally similar images. Instance search is therefore much harder than similar image search.

4 Challenges: Robust feature representation
The second problem is robust visual representation. The same object may look quite different because of changes in viewpoint, illumination, occlusion and so on. This is also a major problem in most computer vision tasks.

5 Related Work: local feature-based systems
Bag of visual words (Sivic et al., ICCV 2003); Vocabulary tree (Nister et al., CVPR 2006); Hamming Embedding (Jegou et al., ECCV 2008); Bag of boundaries (Arandjelovic et al., ICCV 2011); Randomized Visual Phrases (Jiang et al., CVPR 2012, TIP 2015); Point-indexing (Tao et al., CVPR 2014).
Over the past ten years, many methods have been proposed to deal with these two problems. Most recent instance search systems rely on local features such as SIFT to handle appearance variations, and then design image similarity functions to discount the background clutter. Although these local feature-based methods are successful in several applications, their performance is limited by the representation ability of hand-crafted features. In addition, local features cannot describe small or smooth objects, which are common in natural images.

6 Related Work: combining different features (local and global)
Graph Fusion (Zhang et al., ECCV 2012); Co-indexing (Zhang et al., ICCV 2013, PAMI 2015); Query-Adaptive Fusion (Zheng et al., CVPR 2015); Generic Attributes and Categories (Tao et al., CVPR 2015).
Recognizing that local features are not sufficient for object representation, some recent works combine multiple complementary features, including CNN features such as DeCAF. The DeCAF feature, however, captures only category-level distinction. For the instance search task, we want the CNN feature to capture instance-level distinction as well. So far, very few works have been proposed for this purpose.

7 Deep only systems should solve …
Deep-only systems should solve three issues. Similarity measurement: discount for background clutter. CNN architecture: capture instance-level distinction. Collecting large-scale training data: there is no labelled training data for instance search benchmarks. In this paper, we propose a deep-only system for instance search. Our system attempts to address these three issues: first, how to deal with the asymmetrical search; second, how to design a deep model that captures instance-level discriminative information; finally, how to collect relevant training data. Training data is critical for learning CNNs, but for instance search there are no labelled examples.

8 Architecture overview
Towards these goals, we propose ALADDIN, A Locality Aligned Deep moDel for INstance search. A brief illustration of ALADDIN is shown in this figure. In a nutshell, we search for objects at the scale of candidate object regions instead of the entire image, so the asymmetrical object retrieval is converted into symmetrical object matching. That is why we call our model "locality aligned". Each object region is described by specifically designed CNN features, so the matching is more precise. We divide the whole system into an offline part and an online part.

9 Architecture overview
Step 1: Decompose a dataset image D into a small set of candidate regions (locality aligned). In the offline part, we first use EdgeBox to decompose a dataset image D into a small set of regions, such that each object appearing in D is approximately aligned to one of the regions.

10 Architecture overview
Step 2: Given a pre-trained CNN model, extract feature vectors for the regions (instance-level distinction). Second, we extract discriminative CNN features for these candidate regions. The deep network is learned to project aligned object regions into a new feature subspace in which patches depicting the same object are closer to each other than to patches of other objects.

11 Architecture overview
Step 3: Encode the feature vectors using product quantization (PQ), organized with an inverted file system (highly efficient). Finally, to avoid exhaustive region search, we encode each feature vector using product quantization. The similarity between the query feature and an encoded feature can be computed very quickly using a lookup table. We further incorporate these encoded features into an inverted file system, which makes our retrieval system very efficient.
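The lookup-table idea in Step 3 can be sketched as follows. This is a minimal, standard-library illustration of product quantization with asymmetric distance computation, not the authors' implementation: the function names (`train_pq`, `encode_pq`, `build_table`, `adc_distance`), the plain k-means training, and the use of squared Euclidean distance rather than a similarity score are all assumptions made for the example, and the inverted file system is left out.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, n_sub):
    """Cut a vector into n_sub contiguous sub-vectors of equal length."""
    d_sub = len(vec) // n_sub
    return [vec[m * d_sub:(m + 1) * d_sub] for m in range(n_sub)]

def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd iterations; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def train_pq(vectors, n_sub, k):
    """Learn one codebook of k centroids per sub-vector position."""
    return [kmeans([split(v, n_sub)[m] for v in vectors], k, seed=m)
            for m in range(n_sub)]

def encode_pq(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda j: sq_dist(s, cb[j]))
            for s, cb in zip(subs, codebooks)]

def build_table(query, codebooks):
    """table[m][j] = squared distance from query sub-vector m to centroid j."""
    subs = split(query, len(codebooks))
    return [[sq_dist(s, c) for c in cb] for s, cb in zip(subs, codebooks)]

def adc_distance(code, table):
    """Approximate query-to-vector distance using table lookups only."""
    return sum(table[m][j] for m, j in enumerate(code))
```

The table is built once per query, after which each encoded region costs only one lookup and one addition per sub-vector, which is why scanning many regions stays cheap.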

12 Architecture overview
Online part. Step 1: Given the pre-trained CNN model, extract the feature vector for the query region. Step 2: Search and rank in the inverted file system (IFS). In the online part, given a query region, we also extract its CNN feature. The similarity between the query image Q and a dataset image D is then determined by the maximum-scoring object region in D.
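The max-over-regions scoring rule can be sketched as follows. The talk does not say which similarity function is used between two region features, so cosine similarity is an assumption here, and the names `image_score` and `rank_images` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def image_score(query_feat, region_feats):
    """Score of a dataset image = score of its best-matching region."""
    return max((cosine(query_feat, r) for r in region_feats),
               default=float("-inf"))

def rank_images(query_feat, dataset):
    """dataset maps an image id to the list of its region feature vectors."""
    scores = {img: image_score(query_feat, regions)
              for img, regions in dataset.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because only the best region counts, an image with a small but well-matched object is not penalized for the rest of its content.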

13 CNN Network: triplets as inputs (Query, Positive, Negative)
To guarantee accurate object matching, the design of the deep network is crucial. We want the learned model to keep patches of the same object closer to each other than to patches of different objects by a large margin. Towards this goal, we construct a set of triplets (shown in red). In a triplet, Query and Positive are a pair of image regions depicting the same object, while Query and Negative are two patches of different objects. We can then define a ranking-based loss function. The deep model takes a set of triplets as input, and the ranking loss objective guides the learning of the desired representation.
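A ranking loss of this kind is commonly written as a hinge on distances. The exact form used in the paper is not given in the talk, so the sketch below assumes the usual triplet hinge loss with squared Euclidean distance and a hypothetical default margin of 1.0.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(query, positive, negative, margin=1.0):
    """Zero once the negative is farther from the query than the positive
    by at least `margin`; otherwise grows linearly with the violation."""
    return max(0.0, margin + sq_dist(query, positive) - sq_dist(query, negative))

def batch_loss(triplets, margin=1.0):
    """Mean triplet loss over a list of (query, positive, negative) features."""
    return sum(triplet_loss(q, p, n, margin) for q, p, n in triplets) / len(triplets)
```

Triplets that already satisfy the margin contribute nothing, so training is driven by the triplets that are still ranked wrongly or too closely.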

14 CNN Network
At test time, the test image is fed directly into the trained CNN for feature extraction.

15 Collect training data: hard negative mining
To train the network, we need a large set of triplets. However, as explained before, there is no labelled training data for the instance search task. We therefore propose a "hard negative mining" approach to collect training data. We define "hard negatives" as the triplets in which DeCAF fails to tell the positive element from the negative one (shown in the red boxes). The method proceeds in the following steps.

16 Collect training data
First, we collect the K most likely object regions according to their objectness score and use them as seeds. For each seed region, we search for the best matching regions based on the DeCAF feature, and the top 50 returned regions are taken as candidate positive regions. Next, we verify the correctness of the candidates using BoF + RANSAC, which returns a verified positive set and a negative set. Finally, we divide the positive set into two halves, one used as queries and the other as positives. From these sets we generate the triplets.
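The collection steps above can be sketched as follows. The DeCAF search and the BoF + RANSAC verification are passed in as hypothetical callables (`search`, `verify`), since the talk does not describe their interfaces. Treating the candidates that fail verification as the negative set, and pairing every query with every positive and every negative, are assumptions about details the slides leave open.

```python
from itertools import product

def build_triplets(positives, negatives):
    """Split verified positives into a query half and a positive half,
    then pair every query with every positive and every negative."""
    half = len(positives) // 2
    queries, pos = positives[:half], positives[half:]
    return list(product(queries, pos, negatives))

def collect_triplets(seeds, search, verify, top_k=50):
    """seeds: seed regions chosen by objectness score.
    search(seed) -> regions ranked by feature similarity (assumed interface).
    verify(seed, region) -> True if geometric verification accepts the match."""
    triplets = []
    for seed in seeds:
        positives, negatives = [], []
        for region in search(seed)[:top_k]:
            # Rejected candidates looked similar to the baseline feature but
            # are wrong, which is what makes them hard negatives.
            (positives if verify(seed, region) else negatives).append(region)
        triplets.extend(build_triplets(positives, negatives))
    return triplets
```

With four verified positives and two rejected candidates for a seed, this yields 2 x 2 x 2 = 8 triplets.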

17 Experiments
We test on Oxford5k, Paris6k, and a more challenging dataset, Sculpture6k, which features smooth and texture-less objects. We evaluate our method against two groups of baselines. The first group includes four bag-of-features (BoF) based models. The other group includes state-of-the-art methods based on deep learning. Performance is measured by mean average precision.
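For reference, mean average precision can be computed as in the sketch below. It uses the standard non-interpolated definition; the official evaluation scripts of the benchmarks may differ in details such as how ambiguous images are handled, so this is an illustration rather than the exact protocol used in the experiments.

```python
def average_precision(ranked, relevant):
    """Non-interpolated AP: mean of precision@rank over the relevant hits."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked result list, relevant item set), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking `["a", "b", "c", "d"]` with relevant items `{"a", "c"}`, the precisions at the two hits are 1/1 and 2/3, giving an average precision of 5/6.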

18 Experiments: deep features are generic
Let's first compare deep models with bag-of-features based models, which use SIFT-like features. BoW works well on Oxford5k and Paris6k, but shows limited performance on Sculpture6k, because SIFT-like features cannot describe smooth and texture-less objects. BoB is not tested on Oxford5k and Paris6k since it is designed specifically for smooth objects. In contrast, deep models can describe more generic objects.

19 Experiments: the locality aligned scheme is important
Now let's compare different deep models. The table shows that DeCAF computed on the entire image cannot deal with the asymmetrical similarity inherent in instance search. In contrast, when the entire image is decomposed into a set of candidate object regions (locality aligned + DeCAF), performance improves significantly on all three datasets.

20 Experiments: instance-level distinction is important
Furthermore, by leveraging instance-level distinction, retrieval performance improves by 6.74% on average across the three datasets.

21 Experiments Learning instance-level distinction on patches is easier.
We also compare learning a ranking model on entire images (ReDSL, ranking-based loss trained on entire images) with learning it on aligned object patches (ALADDIN, ranking-based loss trained on patches). Learning the ranking model on aligned object patches is clearly better. Learning CNNs from scratch is insufficient to capture the asymmetrical similarity, but by decomposing the image into aligned object regions we make the subsequent training of the convolutional network drastically easier.

22 Experiments The best performing deep-only instance search system.
In summary, our proposed model outperforms state-of-the-art deep models on the instance search task. It is the best performing deep-only instance search system.

23 Experiments
In addition, our model is very efficient. On Oxford5k and Paris6k, each query takes only 1 millisecond.

24 Conclusions
A deep ONLY system for instance retrieval: region proposals for problem decomposition; capturing both category-level and instance-level distinction; automatic training data collection. In conclusion, we proposed a deep-only system for instance search, which systematically addresses three key issues of instance search: asymmetrical similarity, instance-level discriminative representation and automatic training data collection. Our model significantly outperforms the best CNN-based method in both accuracy and efficiency.

25 Thank you for your attention !
That is all of our work. Thank you for your attention. Wenhui Jiang, Zhicheng Zhao, Fei Su, Anni Cai. "ALADDIN: A Locality Aligned Deep Model for Instance Search." ICASSP 2016.

