Deep Structured Scene Parsing by Learning with Image Descriptions

Presentation transcript:

Deep Structured Scene Parsing by Learning with Image Descriptions Liang Lin, Sun Yat-sen University, China. Joint work with Guangrun Wang, Rui Zhang, Ruimao Zhang, Xiaodan Liang, Wangmeng Zuo. Good afternoon. I am Liang Lin from Sun Yat-sen University, China. Here I will present our work "Deep Structured Scene Parsing by Learning with Image Descriptions". This is joint work with Guangrun Wang, Rui Zhang, Ruimao Zhang, Xiaodan Liang, and Wangmeng Zuo.

Scene Understanding: the gap between algorithms and humans Current algorithms vs. human perception. Example description: "A man holding a red bottle sits on a chair standing by a monitor on a table" (entities and relations). First, there is a huge gap between existing computer vision algorithms and human scene understanding. A state-of-the-art algorithm can predict pixel-wise semantic labels for a scene with high accuracy, but people usually understand a scene as a structured configuration. This configuration includes the main entities of the scene and the relations among them, as in the sentence above: "man" and "chair" are entities, while "holding" and "standing by" are relations. So one long-standing question is whether we can design a structured scene parsing system whose outputs are consistent with human perception.

Research on Scene Understanding Scene classification (e.g., bicycles vs. boats, indoor vs. outdoor), semantic labeling, and scene parsing. Beyond traditional scene understanding tasks such as scene classification and semantic labeling, our work investigates structured scene parsing: predicting the pixel-wise segmentation masks for each category as well as their informative structural relations.

Research on Scene Parsing Attributed grammar models: generate meaningful scene representations with probabilistic rules and layered components; top-down / bottom-up inference for discovering scene structures; no semantics. Structured scene parsing has also been studied; representative works include the stochastic attributed grammar model and its extensions. Such a model can use a small number of grammar rules to generate meaningful, hierarchical scene representations, and integrates bottom-up and top-down inference to discover scene structures. F. Han and S. C. Zhu. Bottom-up/top-down image parsing with attribute grammar. IEEE Trans. Pattern Anal. Mach. Intell., 31(1):59–73, 2009. X. Liu, Y. Mu, and L. Lin. A Stochastic Grammar for Fine-grained 3D Scene Reconstruction. IJCAI, 2016.

Related Work on Scene Parsing Recursive neural network: learnable bottom-up binary tree construction; no prediction of inter-object relations. The recursive neural network has also been applied to scene structure prediction, as proposed at ICML 2011. It generates the scene configuration by recursively learning the tree structure from the bottom up and producing the features of merged entities for classification. R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning. Parsing natural scenes and natural language with recursive neural networks. ICML, 2011.

Open Problems Scene labeling: limited pixel-wise annotation; outputs inconsistent with human perception. Structured scene parsing: lacks ground-truth structure; ambiguity exists (which of two candidate structures is better?). (Figure: the complex procedure and user interface used to annotate the MSCOCO dataset.) Although much progress has been made, some interesting open problems remain. For scene labeling, extensive pixel-wise annotation is usually required for supervised model training; on the MSCOCO dataset, for example, multiple complex procedures and user interfaces were developed for annotators to label images. For structured scene parsing, the ground-truth annotation is even more difficult to collect, and semantic ambiguity in the annotation often exists.

Motivation Parse the input scene into a configuration of hierarchical semantic objects with their interaction relations, consistent with human perception. Supervise with sentence-based image descriptions: avoid pixel-level labels and explicit relation annotation; cost-effective. In this work, we aim to design a framework that not only labels the pixels of the scene with semantic categories but also parses the scene into a hierarchical and relation-aware configuration. Moreover, we propose a cost-effective learning method that avoids the expensive annotation otherwise needed for scene parsing: the only supervision we need is conveniently obtained image descriptions. The object instances and relations described in each sentence are used to guide model training.

Motivation Extract knowledge from the image description as supervision for training: entities, structure, and relations. More specifically, we propose to extract knowledge from sentence-based image descriptions as supervision, so no extra pixel-wise annotation is required. As the example here shows, a sentence description carries rich information: nouns indicate entities, verbs indicate relations, and the organization of phrases helps discover the structured semantics.

Our Framework Jointly optimize three sub-tasks: entity labeling, hierarchical structure, and relations of entities. A CNN-RNN hybrid model; weakly supervised learning that leverages the sentence descriptions, trained in an EM style. (Figure: an illustration of our structured scene parsing, with entities and relations.) Given a scene image, our model jointly addresses three tasks: entity labeling, hierarchical scene structure generation, and inter-object relation prediction. A parsing result generated by our method is shown here. Technically, the model is built by integrating a convolutional neural network and a recursive neural network into a unified framework, and it can be trained in a weakly supervised way by leveraging the image descriptions.

CNN-RNN Model A fully convolutional neural network and a recursive neural network, tightly combined. This is the architecture of our CNN-RNN framework. We adopt a fully convolutional neural network to generate the feature representations and segmentation masks of the semantic categories, and then use a recursive neural network to predict a parsing tree of the scene. The CNN and RNN are tightly combined for end-to-end training.
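
As a rough skeleton (our own illustration; the class and module names are assumptions, not the authors' released code), the hybrid model can be pictured as two sub-networks chained inside one module so that gradients flow through both:

```python
import torch.nn as nn

class CNNRNN(nn.Module):
    """Hypothetical skeleton of the CNN-RNN hybrid model."""
    def __init__(self, fcn, rnn_parser):
        super().__init__()
        self.fcn = fcn         # fully convolutional net: image -> score maps + pixel features
        self.rnn = rnn_parser  # recursive net: object features -> parsing tree + relations

    def forward(self, image):
        score_maps, features = self.fcn(image)
        tree = self.rnn(score_maps, features)
        return score_maps, tree  # both outputs enter the joint loss, end to end
```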

Labeling Objects with CNN Model A simple forward procedure generates the object feature representations. The CNN model feeds the input image through the convolutional layers, and the semantic objects are then obtained by grouping the pixels that share the same predicted label.
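
As a minimal sketch (our illustration, assuming per-pixel score maps and a pixel-feature layer; not the authors' code), the labeling step amounts to a per-pixel argmax followed by masked average pooling, giving one feature vector per entity:

```python
import numpy as np

def label_and_pool(score_maps, feature_maps):
    """score_maps: (C, H, W) category scores; feature_maps: (D, H, W) pixel features.
    Returns {category: D-dim mean feature} for every predicted category."""
    labels = score_maps.argmax(axis=0)                         # per-pixel semantic label
    entities = {}
    for c in np.unique(labels):
        mask = labels == c                                     # group pixels of the same label
        entities[int(c)] = feature_maps[:, mask].mean(axis=1)  # pooled object feature
    return entities

# toy usage with random maps (21 categories as on PASCAL VOC, an assumption)
objects = label_and_pool(np.random.randn(21, 64, 64), np.random.randn(256, 64, 64))
```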

Producing Structures with RNN Model Greedy tree construction. (Figure: the RNN components Categorizer, Scorer, and Combiner, with semantic mapping and candidate merging scores, e.g., 1.5, 0.6, 2.3.) The RNN model is used to produce the scene structure. Feeding the object features into the RNN, a bottom-up greedy algorithm recursively constructs the parsing tree. The procedure begins with an initial set of leaf nodes. In each iteration, the algorithm enumerates all possible merging pairs, computes their merging scores, and merges the pair with the highest score; the relation between the two merged nodes is also predicted in this step. The algorithm iterates until only one root node is left.
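
A minimal sketch of the greedy construction loop, assuming simple stand-ins for the learned Scorer and Combiner (the real modules are small neural networks; every name here is illustrative):

```python
import numpy as np
from itertools import combinations

def merge_score(a, b, w):
    """Stand-in Scorer: a linear score on the concatenated node features."""
    return float(w @ np.concatenate([a, b]))

def combine(a, b):
    """Stand-in Combiner: feature of the merged node (here, a simple mean)."""
    return (a + b) / 2.0

def greedy_parse(leaf_feats, w):
    """Greedily build a binary parsing tree over the entity features."""
    nodes = [(i, f) for i, f in enumerate(leaf_feats)]    # (subtree, feature) pairs
    while len(nodes) > 1:
        # enumerate all candidate pairs and pick the highest-scoring merge
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: merge_score(nodes[p[0]][1], nodes[p[1]][1], w))
        (ta, fa), (tb, fb) = nodes[i], nodes[j]
        merged = ((ta, tb), combine(fa, fb))  # a relation classifier would label this merge
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]                        # nested tuples encode the tree

tree = greedy_parse([np.random.randn(4) for _ in range(4)], np.random.randn(8))
```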

Training CNN-RNN with Sentence Descriptions A weakly supervised manner, leveraging the descriptive sentences of the training images. Two parts: semantic object labeling via the CNN, and structure prediction via the RNN. Our CNN-RNN training includes two parts: semantic object labeling by the CNN and structured prediction by the RNN. In the initial stage, we convert each sentence into a normalized tree of entities and relations. The semantic masks for the entities are then inferred using the image descriptions and the parsing outputs of the CNN, and the recursive neural network produces the tree structures with interaction relations. The semantic label loss and the structure and relation loss are jointly optimized during training. We train the model with an EM-type algorithm that alternates between predicting the latent scene configurations by transferring knowledge from the semantic trees, and optimizing the network parameters.
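
In outline, the alternation looks like the sketch below (a schematic only; `infer_configuration` and `update_parameters` are hypothetical names for the two steps described above, not an actual API):

```python
def train_em(model, images, sentence_trees, num_epochs=30):
    """Schematic EM-style loop for the weakly supervised training."""
    for epoch in range(num_epochs):
        for image, sem_tree in zip(images, sentence_trees):
            # E-step: estimate the latent scene configuration (entity masks,
            # tree structure, relations) by transferring knowledge from the
            # semantic tree parsed out of the image description.
            latent_config = model.infer_configuration(image, sem_tree)
            # M-step: treat the estimated configuration as the target and
            # update the CNN-RNN parameters against the joint loss.
            model.update_parameters(image, latent_config)
```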

Sentence Preprocessing Parse trees from a language parser are not perfect: irrelevant words (e.g., adjectives and adverbs), different words for the same entity, irrelevant entities, and relations that are not recognized. (Figure: an example of a language parse tree produced by the Stanford parser.) First, we preprocess the sentence parses generated by the language parser. The raw parses are often imperfect: they include irrelevant words such as adjectives and adverbs, use different words for the same entity, and miss some relations.

Sentence Preprocessing Step 1: POS tag filtering. (Figure: the parse tree of "a man holding a red bottle sits on a chair standing by a monitor on the table", with constituents such as NP, VP, PP and POS tags such as DT, NN, JJ, VBG, VBZ, IN.) We first use POS-tag filtering to remove the irrelevant words.
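
For illustration, a crude version of this filter can be written with NLTK's off-the-shelf tagger (the slides use the Stanford parser; the exact tag whitelist below is our assumption):

```python
import nltk
# one-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

KEEP = {'NN', 'NNS', 'VBZ', 'VBG', 'IN'}  # nouns, verbs, prepositions (assumed set)

def filter_by_pos(sentence):
    """Drop words with irrelevant POS tags (adjectives, adverbs, determiners, ...)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [(word, tag) for word, tag in tagged if tag in KEEP]

print(filter_by_pos("A man holding a red bottle sits on a chair "
                    "standing by a monitor on a table"))
# e.g. [('man', 'NN'), ('holding', 'VBG'), ('bottle', 'NN'), ('sits', 'VBZ'), ...]
```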

Sentence Preprocessing Step 2: Entity and relation recognition. (Figure: the filtered tree regularized into entities (man, bottle, chair, monitor, table) and relations (hold, sit on, stand by, on).) We then summarize the important entities and relations, which yields a regularized tree that is compatible with our model.
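
This step can be approximated by mapping the surviving verbs and prepositions onto a small canonical relation vocabulary; the dictionary below is our own guess at what such a mapping might contain:

```python
# Hypothetical mapping from verb/preposition phrases to canonical relations.
RELATION_MAP = {
    ('sits', 'on'): 'sit-on',
    ('holding',): 'hold',
    ('standing', 'by'): 'stand-by',
    ('on',): 'on',
}

def normalize_relation(word, next_word=None):
    """Return the canonical relation for a (verb, preposition) pair, if any."""
    return RELATION_MAP.get((word, next_word)) or RELATION_MAP.get((word,))

print(normalize_relation('sits', 'on'))  # 'sit-on'
print(normalize_relation('holding'))     # 'hold'
```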

Learning Algorithm (Figure: the image passes through the CNN; the semantic label loss and the structure and relation loss are summed.) During model learning, the image first goes through the CNN, which produces score maps for all categories. The score maps and the parsed sentences are then jointly used to produce the structural configuration of the objects. Finally, the semantic label loss and the structure and relation loss are summed, and the gradients are back-propagated through the network for end-to-end training.
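
In a modern framework, the summed objective would look roughly like the PyTorch sketch below (our illustration; the tensor shapes and the use of cross-entropy for every term are assumptions):

```python
import torch.nn.functional as F

def joint_loss(score_maps, pixel_targets, merge_logits, merge_targets,
               relation_logits, relation_targets):
    """Sketch: semantic label loss + structure and relation loss, summed."""
    # per-pixel semantic label loss over the CNN score maps, (N, C, H, W) vs (N, H, W)
    label_loss = F.cross_entropy(score_maps, pixel_targets)
    # structure loss over the RNN's merge decisions, relation loss over its labels
    structure_loss = F.cross_entropy(merge_logits, merge_targets)
    relation_loss = F.cross_entropy(relation_logits, relation_targets)
    total = label_loss + structure_loss + relation_loss
    return total  # total.backward() propagates through both the CNN and the RNN
```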

Experiments Datasets: the PASCAL VOC 2012 segmentation dataset, for which we provide one descriptive sentence for each image in the training and validation sets, and a new SYSU-Scene dataset (more than 5,000 images). Two tasks: structured scene parsing and weakly supervised scene labeling. Our experiments are conducted on the PASCAL VOC 2012 segmentation dataset, where we provide one descriptive sentence for each image in the training and validation sets.

Structured Scene Parsing Performance measures: relation and structure accuracy. Relation accuracy measures how consistent a predicted structure with relation labels is with the structure derived from the sentence; structure accuracy is relation accuracy ignoring the relation labels. Result: end-to-end learning achieves the best performance. The first experiment is on structured scene parsing, where we report both relation and structure accuracy. We achieve 67.4% structure accuracy and 62.8% relation accuracy. Note that end-to-end learning achieves the best performance.
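
One simple way to realize these two measures (our interpretation of the definitions above, not the paper's evaluation code) is to compare, node by node, the bracketing and relation labels of the predicted tree against the sentence-derived tree:

```python
def leaves(tree):
    """Leaf entities under a nested (left, right, relation) tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    left, right, _ = tree
    return leaves(left) | leaves(right)

def triples(tree, with_relations=True):
    """One (left-leaves, right-leaves, relation) key per internal node."""
    if not isinstance(tree, tuple):
        return set()
    left, right, rel = tree
    key = (leaves(left), leaves(right), rel if with_relations else None)
    return {key} | triples(left, with_relations) | triples(right, with_relations)

def accuracy(pred, gold, with_relations=True):
    p, g = triples(pred, with_relations), triples(gold, with_relations)
    return len(p & g) / max(len(g), 1)

gold = ((('man', 'bottle', 'hold'), 'chair', 'sit-on'), 'monitor', 'stand-by')
pred = ((('man', 'bottle', 'hold'), 'chair', 'on'), 'monitor', 'stand-by')
print(accuracy(pred, gold, with_relations=False))  # structure accuracy: 1.0
print(accuracy(pred, gold, with_relations=True))   # relation accuracy: ~0.67
```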

Visualized Scene Structure Results Here are some visualized results of scene structure parsing; the third one includes some failed relation predictions.

Semantic Labeling Weakly supervised setting; performance measured in pixel-wise IoU (intersection over union). (Table: comparison with weakly supervised methods from CVPR 2015 and ICCV 2015.) The second experiment is weakly supervised semantic labeling. Our model achieves 35.2% IoU, better than the other methods. When the RNN parameters are not updated, the performance drops to 33.5%, which demonstrates the effectiveness of our jointly optimized framework.

Semantic Labeling Results From left to right: the input images; the ground-truth labelings; our proposed method (weakly supervised); DeepLab (weakly supervised); MIL-ILP (weakly supervised). We also show some visualized semantic labeling results from our weakly supervised framework. Our method obtains more accurate and meaningful segmentations.

Conclusion A deep learning framework for generating meaningful and hierarchical scene configurations: a deeper understanding of scenes than semantic labeling, with a CNN-RNN model for relations and structure. Leveraging sentence-based image descriptions for model training is cost-effective and accordant with human perception. In conclusion, we presented a deep learning framework that generates meaningful and hierarchical scene configurations, offering a deeper understanding of scenes than semantic labeling alone. The CNN-RNN model predicts both the relations and the structure, and training is supervised only by sentence-based image descriptions, which is cost-effective and accordant with human perception.