
1 Presenter: Chu-Song Chen
Supervised Learning of Semantics-Preserving Hash via Deep Convolutional Neural Networks. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Presenter: Chu-Song Chen. May 6, 2017

2 Given a query image, retrieve a similar image in the database
Image retrieval: given a query image, retrieve similar images from the database. Applications: e-commerce, home robots, visual surveillance.

3 To search a database, we convert an image into features
Search approach: to search a database, we convert each image into a feature vector. Nearest-neighbor search: compare the features of the query image with those of the database images, and return the closest one. A better feature representation helps retrieve more relevant images.
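As a toy illustration of this nearest-neighbor step (the feature dimensions and data below are placeholders, not from the paper):

import numpy as np

def nearest_neighbor(query_feat, db_feats):
    """Return the index of the database image whose feature is closest
    to the query feature (Euclidean distance)."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    return int(np.argmin(dists))

# Hypothetical usage: 1000 database images with 4096-D features.
db_feats = np.random.rand(1000, 4096).astype(np.float32)
query_feat = np.random.rand(4096).astype(np.float32)
print(nearest_neighbor(query_feat, db_feats))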

4 Deeply learned features
A deep network can learn both feature representations and classifiers for classification. Deep network: a long feed-forward architecture containing many layers. End-to-end learning achieves joint feature-representation and classifier learning, so there is no clear boundary between the features and the classifiers. Usually, the layer output right before the final classification layer is used as the feature, referred to as the neural codes.
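A minimal sketch of extracting such neural codes, written with PyTorch/torchvision as an assumed toolkit (the paper's own implementation uses Caffe):

import torch
import torchvision.models as models

# Pre-trained AlexNet from torchvision (assumption: torchvision >= 0.13).
net = models.alexnet(weights="DEFAULT")
net.eval()

# Everything up to, but not including, the final classification layer:
# the 4096-D activations of F7 serve as the "neural codes".
feature_net = torch.nn.Sequential(
    net.features,
    net.avgpool,
    torch.nn.Flatten(),
    *list(net.classifier.children())[:-1],
)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
    neural_code = feature_net(x)      # shape: (1, 4096)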

5 Deeply learned features (cont.)
The features are, however, suited to classification rather than retrieval. Problems of the deeply learned features: they are inefficient for retrieval, since they are floating-point vectors and slow to compare; and when applied to retrieval, the performance is still not as good as when they are used for classification. Our solution: deep learning of binary hash codes for fast retrieval.

6 Binary Hash Codes Binary hash codes: for fast image retrieval.
Fast image search can be achieved via binary pattern matching with the Hamming distance metric. We propose a supervised learning approach for hash-code learning with deep CNNs.

7 Binary Hash Codes for fast matching
Instead of floating-point features, learn binary (i.e., 0/1) codes as the features. [Slide figure: the query image and the database images are compared via their binary features.] Advantage of binary codes: fast search is achievable via Hamming-distance comparison (XOR operations, as illustrated below) or hashing techniques.
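As a small illustration of why this is fast: the Hamming distance between two codes packed into integers is a single XOR followed by a bit count.

def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two binary codes packed into ints."""
    return bin(code_a ^ code_b).count("1")

# Example: the 8-bit codes 10110010 and 10011010 differ in 2 bits.
print(hamming_distance(0b10110010, 0b10011010))  # -> 2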

8 Recent advances on learning hash codes with deep learning
Previous methods [44], [47], [43] employ pairwise or triplet similarity to learn hash functions; this is demanding when dealing with large-scale datasets. Deep networks are also used in supervised DH (SDH) [44] for learning compact binary codes. CNNH and CNNH+ [47] employ a two-stage learning approach: first, a pairwise similarity matrix is decomposed into approximate hash codes; second, a CNN is trained to learn the hash functions. The method in [43] and deep semantic ranking based hashing (DSRH) [48] adopt a ranking loss defined on a set of triplets for code construction.

9 Fine-tuning for hash codes learning
Our idea: fine-tuning for hash-code learning. Fine-tuning: the success of deep CNNs on classification and detection tasks is encouraging. It reveals that fine-tuning a CNN pre-trained on a large-scale, diverse-category dataset (such as ImageNet) provides a promising way for domain adaptation and transfer learning. Fine-tuning for hash-function learning? For retrieval, a question worth studying thus arises: beyond classification, is the "pre-train + fine-tune" scheme also capable of learning binary hash codes for efficient retrieval? If so, how should the architecture of a pre-trained CNN be modified to this end?

10 Our Approach Supervised semantics-preserving deep hashing (SSDH)
A point-wise approach is proposed, built on existing deep architectures for classification.

11 How to learn binary codes via Deep Net?
We provide a simple but effective approach. We assume the classification outputs rely on a set of h hidden attributes, with each attribute on or off (i.e., 1 or 0, respectively). So, we add a new fully connected layer, called the latent layer H, between the neural-codes layer and the classification layer. The neurons in the latent layer H are activated by sigmoid functions (with some regularization), so that the activations are pushed toward {0, 1}; see the sketch below.
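A minimal sketch of this latent-layer idea, written in PyTorch for illustration (the paper's implementation is in Caffe; layer sizes and names below are illustrative):

import torch
import torch.nn as nn

class LatentHashHead(nn.Module):
    """Latent layer H (h sigmoid units) followed by a classification layer."""
    def __init__(self, feat_dim=4096, h=48, num_classes=10):
        super().__init__()
        self.latent = nn.Linear(feat_dim, h)       # latent layer H
        self.classify = nn.Linear(h, num_classes)  # classification layer

    def forward(self, feats):
        a = torch.sigmoid(self.latent(feats))      # activations pushed toward {0, 1}
        return a, self.classify(a)

head = LatentHashHead()
feats = torch.randn(4, 4096)                       # dummy neural codes
activations, logits = head(feats)
binary_codes = (activations >= 0.5).int()          # threshold at 0.5 for retrieval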

12 SSDH

13 learn binary codes via Deep Net
The neurons in H are activated by sigmoid functions. Overall learning objective: classification error + a term forcing each output to be close to 0 or 1 + a term giving each bit an equal chance of being 0 or 1 (a sketch of the combination follows).
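A hedged sketch of how these three terms could be combined in code, following the slide's description rather than the paper's exact equations (the weights alpha and beta are placeholders):

import torch
import torch.nn.functional as F

def ssdh_style_loss(logits, labels, activations, alpha=1.0, beta=1.0):
    """Classification error + push activations toward 0/1 + ~50% activation per bit."""
    cls_loss = F.cross_entropy(logits, labels)
    # Encourage each activation to be near 0 or 1: maximize its distance to 0.5.
    binarize_loss = -torch.mean((activations - 0.5) ** 2)
    # Encourage each bit to fire about half the time over the batch.
    balance_loss = torch.mean((activations.mean(dim=0) - 0.5) ** 2)
    return cls_loss + alpha * binarize_loss + beta * balance_loss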

14 Explained with AlexNet
Yet our approach can be used with any classification network, e.g., VGG. We assume the classification relies on a set of h hidden attributes, each on (1) or off (0). We add a latent layer H right between the layers F7 and F8 to learn the binary codes.

15 Connection to traditional approaches Relation to “AlexNet feature + LSH”
The relationship between our approach and a naive combination, AlexNet features + locality-sensitive hashing (LSH), is worth mentioning. As random weights are used to initialize the latent layer, our network can be regarded as initialized with LSH (i.e., random weights) to map the deep features learned on ImageNet (on which the AlexNet features are pre-trained) to binary codes. Through back-propagation, the weights of the pre-trained, latent, and classification layers simultaneously evolve into a multi-layer function more suitable for the new domain.
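For reference, the naive baseline mentioned here, random-hyperplane LSH on fixed deep features, amounts to a sketch like the following (dimensions are illustrative):

import numpy as np

rng = np.random.default_rng(0)
h, feat_dim = 48, 4096
W = rng.standard_normal((feat_dim, h))    # random hyperplanes, as in LSH

def lsh_code(feature):
    """Map a deep feature vector to an h-bit binary code with random projections."""
    return (feature @ W > 0).astype(np.uint8)

code = lsh_code(rng.standard_normal(feat_dim))

In SSDH these random weights are only the starting point; back-propagation then adapts them together with all other layers.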

16 Supervised Hash Goal Functions
Summary of recent studies. Pairwise training: ||h(x1) - h(x2)|| is small if x1 and x2 have the same label, and large otherwise. Triplet training: ||h(x1) - h(x2)|| is small and ||h(x1) - h(x3)|| is large for a triplet (x1, x2, x3), where x1 and x2 have the same label and x1 and x3 have different labels. Latent concept: the classification results depend on the on/off hidden concepts of the hash function h(x) -- our approach (the first to take advantage of deep features that are binarized).
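To make the first two formulations concrete, here are generic pairwise (contrastive) and triplet losses over code distances, in the spirit of the cited methods rather than our SSDH objective (margins are illustrative):

import torch
import torch.nn.functional as F

def pairwise_loss(h1, h2, same_label, margin=2.0):
    """Pull same-label pairs together, push different-label pairs apart."""
    d = torch.norm(h1 - h2, dim=1)
    return torch.mean(torch.where(same_label, d ** 2, F.relu(margin - d) ** 2))

def triplet_loss(h_anchor, h_pos, h_neg, margin=1.0):
    """Anchor should be closer to the positive than to the negative by a margin."""
    d_pos = torch.norm(h_anchor - h_pos, dim=1)
    d_neg = torch.norm(h_anchor - h_neg, dim=1)
    return torch.mean(F.relu(d_pos - d_neg + margin))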

17 Loss Function: multi-class classification (single-labeled data). Like most deep CNN approaches, we use the softmax output and the cross-entropy loss on single-labeled datasets for supervised learning (y_nm and ŷ_nm denote the desired output and the prediction of the m-th output unit of the n-th sample).
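In standard notation this is the usual softmax cross-entropy (the symbols follow the slide; the name E(W) is only an illustrative label and the paper's exact equation may differ slightly):

E(W) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} y_{nm}\,\ln \hat{y}_{nm}

where \hat{y}_{nm} is the softmax output of unit m for sample n.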

18 Loss Function: multi-class classification (multi-labeled data). The softmax cross-entropy loss can only deal with single-labeled data. To handle multi-labeled data, we introduce a maximum-margin loss function acting like an L2-norm SVM:
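A standard squared-hinge (L2-SVM) form consistent with this description, assuming labels y_{nm} ∈ {−1, +1} for the m-th label of the n-th sample (the paper's exact formulation may differ):

\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} \max\bigl(0,\; 1 - y_{nm}\,\hat{y}_{nm}\bigr)^{2}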

19 Multi-labeled data: derivation of the gradients
The derivatives are derived so that back-propagation can be employed for the optimization. In our implementation, Caffe is used, and we must provide these derivatives to Caffe.
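For instance, for the squared-hinge term sketched above, the derivative with respect to the prediction is (a standard result, shown only to illustrate what must be supplied to the solver):

\frac{\partial}{\partial \hat{y}_{nm}} \max\bigl(0,\, 1 - y_{nm}\hat{y}_{nm}\bigr)^{2} = -2\, y_{nm}\, \max\bigl(0,\, 1 - y_{nm}\hat{y}_{nm}\bigr)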

20 Connection to Linear SVM
Note that to train a large-scale linear SVM, the state-of-the-art methods [51], [52] employ coordinate descent optimization in the dual domain (DCD), which is proven to be equivalent to stochastic gradient descent (SGD) in the primal domain [51]. As SGD is a standard procedure for training neural networks, when our network is trained only for the SVM layer and the weights of the other layers are fixed, it is equivalent to solving the convex quadratic program of the SVM (with SGD's learning rate corresponding to the SVM model parameter C). When training the entire network, the parameters evolve into more favorable feature representations (in the deep CNN architecture), latent binary representations (in the hidden layer), and binary classifiers (in the SVM layer) simultaneously.

21 Advantage of our approach
Why is it useful? Lightweight implementation: it can be easily implemented with existing networks. Scalable to large data: it does not use any pairwise-comparison objective function and is scalable to large-scale datasets. Unifying retrieval and classification: the learned features serve fast retrieval without affecting the classification performance.

22 Retrieval results Evaluation Protocols
Precision at k: the percentage of true neighbors among the top k retrieved samples. Mean average precision (mAP): the precision among the top k retrieved samples, averaged over the ranks k at which correct samples are retrieved, is the average precision (AP); mAP is the mean of the APs over all queries. Precision within Hamming radius r: the precision computed over the images whose codes fall within Hamming radius r of the query.
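A small sketch of the first two metrics over one ranked retrieval list (the relevance flags are placeholders; mAP then averages the AP over all queries):

import numpy as np

def precision_at_k(relevant, k):
    """Fraction of true neighbors among the top-k retrieved samples."""
    return float(np.mean(relevant[:k]))

def average_precision(relevant):
    """Average of precision@k over the ranks k where a correct item appears."""
    relevant = np.asarray(relevant, dtype=float)
    hits = np.where(relevant == 1)[0]
    if len(hits) == 0:
        return 0.0
    return float(np.mean([relevant[:k + 1].mean() for k in hits]))

ranked = [1, 0, 1, 1, 0]                 # hypothetical retrieval result
print(precision_at_k(ranked, 3))         # 0.666...
print(average_precision(ranked))         # mean of precision at ranks 1, 3, 4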

23 Datasets: CIFAR-10, MNIST, SUN397, Yahoo-1M

24 Retrieval performance
Far better than existing approaches on CIFAR-10

25 Retrieval performance
MNIST

26 Retrieval example on CIFAR-10

27 Retrieval performance
Larger datasets SUN397

28 Retrieval performance
Larger datasets: Yahoo-1M. Comparison to AlexNet fine-tuned features + a traditional non-deep hashing method (ITQ)

29 Retrieval performance
Larger datasets: ILSVRC 2012. Comparison to AlexNet fine-tuned features + a traditional non-deep hashing method (ITQ)

30 Multi-label Dataset: NUS-WIDE. NUS-WIDE is a multi-label dataset. A retrieval is considered correct if any label of the retrieved image matches a label of the query.

31 Multi-label Dataset: UT-ZAP50K
A more rigorous criterion: a retrieval is considered correct only if all labels are correct. When searching shopping items, one may want the retrieved images to be not only in the same category but also for the same gender as the query. A multi-label dataset with no previously reported results. Comparison to AlexNet fine-tuned features + a traditional non-deep hashing method (ITQ)

32 Classification Results
On single-labeled datasets: the binary feature representations are learned without sacrificing the classification performance

33 Classification Results
on single-labeled dataset

34 Computational Speed

35 Mobile Clothing Search
Applications: mobile clothing search, a collaboration with Yahoo Taiwan

36 Code available
Thank you

