Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences


Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences. Zhizhong Han (1,2), Mingyang Shang (1), Xiyang Wang (1), Yu-Shen Liu (1), Matthias Zwicker (2). (1) School of Software, Tsinghua University, Beijing, China; (2) Department of Computer Science, University of Maryland, College Park, USA.

Content: Background, Motivation, Current solution, The key idea of Y2Seq2Seq, Problem statement, Technical details, Results, Contributions.

Background: With the development of 3D modeling and scanning techniques, more and more 3D shapes are becoming available on the Internet with detailed physical properties such as texture, color, and material. With large 3D datasets, however, shape class labels are too coarse a tool to help people efficiently find what they want, and visually browsing through shape classes is cumbersome.

Motivation: To alleviate this issue, an intuitive approach is to allow users to describe the desired 3D object using a text description. Jointly understanding 3D shape and text by learning a cross-modal representation, however, is still a challenge, because it requires an efficient 3D shape representation that can capture highly detailed 3D shape structures.

Current solution: A 3D-Text cross-modal dataset was recently released, along with a combined multimodal association model that captures the many-to-many relations between 3D voxels and text descriptions. However, this strategy is limited to learning from low-resolution voxel representations due to the computational cost caused by the cubic complexity of 3D voxels. This leads to low discriminability of the learned cross-modal representations due to a lack of detailed geometry information.

The key idea of Y2Seq2Seq: We resolve this issue by learning cross-modal representations of 3D shape and text from view sequences, where each 3D shape is represented by a sequence of rendered views, and word sequences, where each sentence is represented by a sequence of words. Our deep learning model captures the correlation between 3D shape and text by simultaneously reconstructing each modality itself and predicting one modality from the other. [Figure: a view sequence of a table and the sentence "A two layer table with four legs" are each reconstructed within their own modality and predicted from the other modality.]

Problem statement: Jointly learn the feature Fs of a 3D shape s and the feature Ft of a sentence t describing s. [Figure: Y2Seq2Seq maps the view sequence of s and the sentence t ("A two layer table with four legs") to the features Fs and Ft.]

Technical details: Y2Seq2Seq is formed by two branches: a 3D shape branch S and a text branch T. Each branch is a "Y"-like seq2seq model formed by one RNN encoder and two RNN decoders. The encoder output of each branch is the learned feature of the corresponding 3D shape or text.
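To make the "Y" structure concrete, here is a minimal sketch of one branch in PyTorch. The GRU cells, teacher-forced decoder inputs (recon_in, pred_in), feature dimensions, and module names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class YBranch(nn.Module):
    """One "Y"-like branch: a single RNN encoder feeding two RNN decoders."""

    def __init__(self, in_dim, feat_dim, recon_dim, pred_dim):
        super().__init__()
        self.encoder = nn.GRU(in_dim, feat_dim, batch_first=True)
        # One decoder reconstructs the input modality, the other predicts the other modality.
        self.reconstructor = nn.GRU(recon_dim, feat_dim, batch_first=True)
        self.predictor = nn.GRU(pred_dim, feat_dim, batch_first=True)
        self.recon_out = nn.Linear(feat_dim, recon_dim)
        self.pred_out = nn.Linear(feat_dim, pred_dim)

    def forward(self, seq, recon_in, pred_in):
        # seq: (B, T, in_dim) -- per-view features for the shape branch,
        # word embeddings for the text branch.
        _, h = self.encoder(seq)                           # h: (1, B, feat_dim), the learned feature
        recon_steps, _ = self.reconstructor(recon_in, h)   # both decoders start from the same feature
        pred_steps, _ = self.predictor(pred_in, h)
        return h.squeeze(0), self.recon_out(recon_steps), self.pred_out(pred_steps)
```

For the shape branch, seq would be a sequence of per-view CNN features and the predictor would emit word logits; for the text branch, the roles of the two output modalities are swapped.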

Technical details: Hierarchical constraints are further proposed to increase the discriminability of the learned features: a classification constraint at the class level, a triplet constraint at the instance pair level, and a joint embedding constraint at the instance level. The two "Y"-like seq2seq models are coupled: the 3D reconstructor and the 3D predictor share parameters, and likewise the text reconstructor and the text predictor.
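One way to express this coupling is to let the two branches literally reuse the same decoder modules, so the 3D reconstructor and the 3D predictor (and likewise the two text decoders) share parameters. A minimal sketch, assuming GRU decoders and illustrative dimensions:

```python
import torch.nn as nn

feat_dim, view_dim, word_dim = 512, 512, 300   # illustrative sizes, not the paper's

# Encoders: one per modality.
shape_encoder = nn.GRU(view_dim, feat_dim, batch_first=True)
text_encoder = nn.GRU(word_dim, feat_dim, batch_first=True)

# Decoders: one per output modality, each used by BOTH branches.
view_decoder = nn.GRU(view_dim, feat_dim, batch_first=True)  # 3D reconstructor in branch S, 3D predictor in branch T
word_decoder = nn.GRU(word_dim, feat_dim, batch_first=True)  # text predictor in branch S, text reconstructor in branch T

# Reusing the same decoder objects in both branches is what shares their parameters
# and couples the two "Y"-like seq2seq models.
```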

Technical details: 3D shape branch S, trained with a 3D-to-3D reconstruction loss and a 3D-to-text prediction loss (the view sequences are compared in feature space); the total loss combines the two, where α and β control the balance between the two losses.
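The slide omits the loss formulas, so the following is only a hedged sketch of how the two terms might be combined: an MSE term on view features for 3D-to-3D reconstruction and a cross-entropy term on words for 3D-to-text prediction, balanced by α and β. The text branch on the next slide would be symmetric, with γ and δ.

```python
import torch
import torch.nn.functional as F

def shape_branch_loss(recon_views, target_views, word_logits, target_words,
                      alpha=1.0, beta=1.0):
    # recon_views, target_views: (B, T_v, D) view feature sequences.
    # word_logits: (B, T_w, V) predicted word distributions; target_words: (B, T_w) token ids.
    l_recon = F.mse_loss(recon_views, target_views)                  # 3D-to-3D reconstruction
    l_pred = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                             target_words.reshape(-1))               # 3D-to-text prediction
    return alpha * l_recon + beta * l_pred
```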

Technical details: Text branch T, trained with a text-to-text reconstruction loss and a text-to-3D prediction loss; the total loss combines the two, where γ and δ control the balance between the two losses.

Technical details: Hierarchical constraints, applied at the class level, the instance pair level, and the instance level; the total loss combines the three, where Φ, ϕ and ψ control the balance among the three terms.
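The slides do not give the exact constraint formulations, so the sketch below only illustrates one plausible instantiation: a softmax classification loss at the class level, a triplet margin loss over matched and mismatched shape-text pairs at the instance pair level, and an L2 joint embedding loss at the instance level, weighted by Φ, ϕ, ψ (written phi, varphi, psi). The classifier head, margin, and distance choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalConstraints(nn.Module):
    def __init__(self, feat_dim, num_classes, margin=1.0):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)   # class-level head (assumed)
        self.margin = margin

    def forward(self, fs, ft, ft_neg, labels, phi=1.0, varphi=1.0, psi=1.0):
        # Class level: both modalities' features should predict the shape's class label.
        l_cls = F.cross_entropy(self.classifier(fs), labels) + \
                F.cross_entropy(self.classifier(ft), labels)
        # Instance pair level: a matching text ft should be closer to the shape fs
        # than a non-matching text ft_neg, by a margin.
        l_tri = F.triplet_margin_loss(fs, ft, ft_neg, margin=self.margin)
        # Instance level: embed matching shape and text features close together.
        l_emb = F.mse_loss(fs, ft)
        return phi * l_cls + varphi * l_tri + psi * l_emb
```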

Technical details: Objective function. The overall objective combines the losses of the 3D shape branch, the text branch, and the hierarchical constraints.
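The objective itself is not reproduced on the slide; assuming it simply sums the three groups of losses described above, with the trade-off weights already folded into each group, a trivial sketch would be:

```python
def total_objective(loss_branch_s, loss_branch_t, loss_constraints):
    # loss_branch_s: alpha/beta-weighted reconstruction + prediction of the 3D shape branch S
    # loss_branch_t: gamma/delta-weighted reconstruction + prediction of the text branch T
    # loss_constraints: Phi/phi/psi-weighted hierarchical constraints
    return loss_branch_s + loss_branch_t + loss_constraints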

Results: Experimental evaluation on the 3D-Text cross-modal dataset, covering cross-modal retrieval (from 3D to text and from text to 3D) and 3D shape captioning. The dataset consists of the Primitive subset, 7560 shapes and 191850 descriptions (Primitive class), and the ShapeNet subset, 15038 shapes and 75344 descriptions (Chair class and Table class).
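As an illustration of how the learned features Fs and Ft can be used in the retrieval experiments, the sketch below ranks items of one modality for each query of the other by cosine similarity; the actual evaluation protocol and similarity measure used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def retrieve(queries, gallery, k=5):
    # queries: (Nq, D) features of one modality; gallery: (Ng, D) features of the other.
    q = F.normalize(queries, dim=1)
    g = F.normalize(gallery, dim=1)
    sims = q @ g.t()                        # (Nq, Ng) cosine similarities
    return sims.topk(k, dim=1).indices      # indices of the k nearest cross-modal neighbors

# shape_to_text = retrieve(Fs, Ft)   # 3D-to-text retrieval
# text_to_shape = retrieve(Ft, Fs)   # text-to-3D retrieval
```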

Results: Effect of the coupled “Y” structures on the Primitive subset.

Results: Effect of the coupled “Y” structures on the ShapeNet subset.

Results: Effect of the hierarchical constraints on the Primitive subset.

Results: Effect of the hierarchical constraints on the ShapeNet subset.

Results: Effect of voxel resolution on the ShapeNet subset.

Results: Cross-modal retrieval on the Primitive subset.

Results: Cross-modal retrieval on the ShapeNet subset.

Results: Cross-modal retrieval visualization.

Results: 3D shape captioning on the Primitive subset.

Results: 3D shape captioning on the ShapeNet subset.

Results: 3D shape captioning visualization.

Contributions: We propose a deep learning model called Y2Seq2Seq, which learns cross-modal representations of 3D shape and text from view sequences and word sequences. Our novel coupled "Y"-like seq2seq structures have a powerful capability to bridge the semantic meaning of the two sequence-represented modalities by joint reconstruction and prediction. Our results demonstrate that our novel hierarchical constraints can further increase the discriminability of the learned cross-modal representations by employing more detailed discriminative information.

Thank you!