
1 Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment
Xinyang Jiang, Fei Wu, Xi Li, Zhou Zhao, Weiming Lu, Siliang Tang, Yueting Zhuang Zhejiang University

2 Table of Contents Introduction Algorithm Experiments Conclusion

3 Introduction - Cross-modal Learning to Rank
Learn a model to sort the retrieved documents according to their relevance to a given query, where the documents and the queries come from different modalities. Three stages of cross-modal learning to rank: 1. Learn a common representation for multi-modal data. 2. Evaluate the relevance between documents and queries in the learned common space. 3. Devise a ranking function that generates a ranking list and preserves the order of relevance between multi-modal data.
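The last two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: the query and document embeddings are assumed to already live in a learned common space, and cosine similarity is used as the relevance measure.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_documents(query_emb, doc_embs):
    """Stages 2-3: score each document against the query in the learned
    common space, then return document indices by decreasing relevance."""
    scores = [cosine(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: -scores[i])
```

Stage 1, learning the common space itself, is what the rest of the method addresses.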

4 Introduction - Two Types of Representations
Two types of representations It is attractive to exploit both the local structure and the global structure in pairs of images and text to learn a better common space for multi-modal data. This paper proposes to learn two types of common spaces that embed the collaboratively grounded semantics of the local structure and the global structure, respectively. Local alignment The semantics of visual objects and textual words in the local structure are embedded into a local common space, where visual objects and their relevant textual words are aligned with each other. Global alignment The global alignment embeds the image-level and text-level compositional semantics into a global common space, where each image and its relevant text are aligned with each other.

5 Introduction - The proposed Methods (C2MLR)
Common representations 1. Jointly uses both local and global alignment. 2. Generates the compositional semantics embeddings of an image and a text from the isolated semantics embeddings of the visual objects in the image and the textual words in the text. 3. Predicts the ranking list by evaluating the relevance between an image and a text based on both local and global alignment. Evaluate relevance The relevance between an image and a text is evaluated by computing their embedding similarity in both the local and the global common space. Ranking function The two types of common representations are jointly learned under a large-margin pair-wise learning-to-rank framework.

6 Introduction - The proposed Methods (C2MLR)

7 Algorithm Local alignment for objects and words
Given a feature vector r extracted from an object region, the visual object inside the region is mapped into the local common space by a non-linear projection, where r is a dr-dimensional feature vector, WI is a d×dr matrix that maps the visual object into the d-dimensional local common space via local alignment, and bI is the bias. For a given grammatical class p of words (e.g., a part of speech), a mapping matrix is learned that embeds all words belonging to class p into the local common space, where Wp is a d×dw matrix that maps each dw-dimensional word vector w into a d-dimensional vector, and bp is the bias.
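The two projections can be sketched like this. The slide does not name the nonlinearity, so tanh is an assumption, and the dimensions and the "NOUN" class key are toy illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_r, d_w = 8, 16, 10   # common / object / word dims; toy sizes

# Visual-object projection into the local common space: tanh(W_I r + b_I).
W_I = rng.normal(scale=0.1, size=(d, d_r))
b_I = np.zeros(d)

def embed_object(r):
    return np.tanh(W_I @ r + b_I)

# One projection per grammatical class p (here only nouns): tanh(W_p w + b_p).
W_p = {"NOUN": rng.normal(scale=0.1, size=(d, d_w))}
b_p = {"NOUN": np.zeros(d)}

def embed_word(w, pos):
    return np.tanh(W_p[pos] @ w + b_p[pos])
```

Learning a separate Wp per grammatical class lets, e.g., nouns and adjectives be grounded in the common space through different mappings.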

8 Algorithm Global alignment for image and text
The compositional semantics of the image I is encoded by a compositional semantics embedding matrix Wc, where Wc is a dc×d matrix that maps the image I into the dc-dimensional global common space, and the operator | · | denotes the cardinality of a set. The compositional semantics of the text S is encoded in the same way.
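Since the slide's formula images are not reproduced here, the following is only a plausible reading of the composition step: average the local embeddings of the objects (or words), using the set's cardinality |·|, then map the result into the global space with Wc.

```python
import numpy as np

def compose(Wc, local_embs):
    """Compositional semantics embedding: map the average of the local
    embeddings (sum divided by the set's cardinality |.|) into the
    dc-dimensional global common space via Wc."""
    local_embs = np.asarray(local_embs)
    return Wc @ (local_embs.sum(axis=0) / len(local_embs))
```

The same `compose` would be applied to the word embeddings of a text S to obtain its global embedding.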

9 Algorithm The relevance score based on local alignment
Given a pair of image I and text S, the relevance score in terms of their local structure is obtained by: 1. aligning each visual object r in I with the textual word w whose semantic embedding has the highest cosine similarity with r; 2. summing up the cosine similarities of all aligned visual objects and textual words as the overall relevance score. The relevance score based on global alignment is the embedding similarity of I and S in the global common space, and the overall relevance combines the local and global scores.
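The scoring steps above can be sketched as follows; the equal-weight parameter `alpha` in the overall score is hypothetical, since the slides do not give the exact combination of the two scores.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def local_relevance(obj_embs, word_embs):
    """Align each visual object with its best-matching word (highest
    cosine similarity) and sum the similarities of the aligned pairs."""
    return sum(max(cos_sim(r, w) for w in word_embs) for r in obj_embs)

def global_relevance(img_emb, txt_emb):
    """Similarity of the image and text embeddings in the global space."""
    return cos_sim(img_emb, txt_emb)

def overall_relevance(obj_embs, word_embs, img_emb, txt_emb, alpha=0.5):
    # alpha is a hypothetical mixing weight between the two alignments.
    return alpha * local_relevance(obj_embs, word_embs) \
        + (1.0 - alpha) * global_relevance(img_emb, txt_emb)
```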

10 Algorithm Parameter Estimation
The parameters of the model are learned under a max-margin learning-to-rank framework, where W denotes the two sets of model parameters for local alignment and global alignment.
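A max-margin pair-wise ranking objective typically takes a hinge form like the sketch below; the margin value is an assumption, and in training the parameters W would be updated by gradient descent on this loss over sampled (relevant, irrelevant) document pairs.

```python
def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Max-margin pair-wise ranking: a query's relevant document should
    outscore an irrelevant one by at least `margin`; any violation is
    penalized linearly."""
    return max(0.0, margin - score_pos + score_neg)
```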

11 Experiments Datasets Tags in Pascal07 are more likely to express high-level concepts, rather than directly describing specific visual objects in the images as the captions in Flickr8K do. To adapt to the different natures of different datasets, an algorithm that uses both local and global alignment is needed.

12 Experiments Performance Comparison

13 Experiments Observations on Pascal07 and Flickr8K
The observation on Pascal07 Our proposed method (C2MLR) outperforms the other methods in both search directions in terms of most of the performance metrics. The methods utilizing global alignment (C2MLR, DeViSE) achieve better performance on Pascal07, which verifies that global alignment is better able to find the relevance between image and text based on high-level concepts. The observation on Flickr8K C2MLR outperforms the other methods in both search directions in terms of all the performance metrics. The ranking methods adopting local alignment (DeepFE) achieve good performance on Flickr8K, which verifies that local alignment is better able to find the relevance between image and text based on an explicit relevance between objects and words. The observation on both datasets By combining global and local alignment, C2MLR achieves good performance on both datasets, which validates that local alignment or global alignment alone does not suffice. C2MLR- outperforms the other global alignment ranking methods (PAMIR and DeepRank) most of the time, which verifies that a compositional semantics embedding built from visual objects and textual words outperforms a traditional embedding.

14 Experiments Examples of ranking textual documents with an image query using C2MLR, along with the discovered relevant object-word pairs.

15 Experiments Examples of two types of object-word alignments (nouns and adjectives) discovered by C2MLR

16 Conclusions This paper proposes a new method for cross-modal ranking called C2MLR. The proposed method uses local alignment to embed visual objects and textual words into a local common space, and employs global alignment to map images and text into a global common space. The global common space is learned by obtaining the image-level and sentence-level compositional semantics embeddings of multi-modal data from the visual objects and textual words. The local common space, the compositional semantics and the global common space are learned jointly in a max-margin learning-to-rank manner. Experiments show that using both local alignment and global alignment for cross-modal ranking boosts the ranking performance.

17 Thank you! Q & A

