Deep Structured Scene Parsing by Learning with Image Descriptions


1 Deep Structured Scene Parsing by Learning with Image Descriptions
Liang Lin, Sun Yat-sen University, China. Joint work with Guangrun Wang, Rui Zhang, Ruimao Zhang, Xiaodan Liang, and Wangmeng Zuo. Good afternoon. I am Liang Lin from Sun Yat-sen University, China. Here, I will present our work, "Deep Structured Scene Parsing by Learning with Image Descriptions." This is joint work with Guangrun Wang, Rui Zhang, Ruimao Zhang, Xiaodan Liang, and Wangmeng Zuo.

2 Scene Understanding: the gap between algorithms and humans
A man holding a red bottle sits on a chair standing by a monitor on a table. Entity / Relation. First, there is a huge gap between existing computer vision algorithms and human understanding of a scene. A state-of-the-art algorithm can predict a pixel-wise semantic labeling of the scene with high accuracy, but people usually understand a scene as a structured configuration: the main entities of the scene and the relations among them, as in this sentence. For example, the man and the chair are entities, while holding and standing are relations. So one long-standing problem is: can we design a structured scene parsing system whose structured outputs are consistent with human perception? Current Algorithms vs. Human

3 Research on Scene Understanding
Scene Classification / Semantic Labeling / Scene Parsing. Beyond traditional scene understanding tasks such as scene classification and semantic labeling, our work investigates structured scene parsing: predicting the pixel-wise segmentation masks for each category as well as their informative structural relations. Bicycles Boats Indoor Outdoor

4 Research on Scene Parsing
Attributed grammar model: generates meaningful scene representations with probabilistic rules and layered components. Top-down / bottom-up inference for discovering scene structures. No semantics. Structured scene parsing has also been studied; representative works are the stochastic attributed grammar model and its extensions. This model can use a small number of grammar rules to generate meaningful, hierarchical scene representations, and it integrates bottom-up and top-down inference to discover scene structures. F. Han and S. C. Zhu. Bottom-up/top-down image parsing with attribute grammar. IEEE Trans. Pattern Anal. Mach. Intell., 31(1):59–73, 2009. X. Liu, Y. Mu, and L. Lin. A Stochastic Grammar for Fine-grained 3D Scene Reconstruction. IJCAI, 2016.

5 Related Work on Scene Parsing
Recursive Neural Network: learnable bottom-up binary tree construction, but no prediction of inter-object relations. Recursive neural networks have also been applied to scene structure prediction, as proposed at ICML 2011. They generate the scene configuration by recursively learning the tree structure from the bottom up and producing features of merged entities for classification. R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning. Parsing natural scenes and natural language with recursive neural networks. ICML, 2011.

6 Open problems
Scene labeling: limited pixel-wise annotation; inconsistent with human perception. Structured scene parsing: lack of ground-truth structure; ambiguity exists. Although much progress has been made, some interesting open problems remain. For scene labeling, extensive pixel-wise annotation is usually required for supervised model training; for example, multiple complex procedures and user interfaces were developed for annotators to label the MSCOCO dataset. For structured scene parsing, ground-truth annotation is even more difficult to collect, and semantic ambiguity in the annotations often exists: which parse is better? Complex procedure and user interface used to annotate the MSCOCO dataset.

7 Motivation Parse the input scene into a configuration of hierarchical semantic objects and their interaction relations, consistent with human perception. Supervision with sentence-based image descriptions: avoids pixel-level labels, avoids explicit annotation for relations, cost-effective. So, in this work, we aim to design a framework that not only labels the pixels of the scene with semantic categories but also parses the scene into a hierarchical, relation-aware configuration. In addition, we propose a cost-effective learning method that avoids expensive annotation for scene parsing: the only supervision we need is conveniently obtained image descriptions. The object instances and relations described in each sentence are used to guide model training.

8 Motivation
Extract knowledge from the image description as supervision for training. More specifically, we propose to extract knowledge from sentence-based image descriptions as supervision, so no extra pixel-wise annotation is required. As the example here shows, a sentence description carries rich information: the nouns indicate entities, the verbs indicate relations, and the organization of phrases helps discover the structured semantics. Entity / Structure / Relation

9 Our Framework
Jointly optimize three sub-tasks: entity labeling, hierarchical structure, and relations of entities. A CNN-RNN hybrid model. Weakly supervised learning that leverages sentence descriptions, trained in an EM style. An illustration of our structured scene parsing. Given a scene image, our model jointly addresses three tasks: entity labeling, hierarchical scene structure generation, and inter-object relation prediction. Here is a parsing result generated by our method. Technically, our model integrates a convolutional neural network and a recursive neural network into a unified framework, and it can be trained in a weakly supervised way by leveraging image descriptions. Entity / Relation

10 CNN-RNN Model
Fully convolutional neural network + recursive neural network, tightly combined. This is the architecture of our CNN-RNN framework. We adopt a fully convolutional neural network to generate the feature representations and segmentation masks of the semantic categories, and then use a recursive neural network to predict a parsing tree of the scene. The CNN and RNN are tightly combined for end-to-end training.

11 Labeling Objects with CNN Model
A simple forward procedure that generates object feature representations. The input image is fed through the convolutional neural network, which predicts a label for each pixel; semantic objects are then formed by grouping pixels that share the same label, and a feature representation is pooled for each object (a minimal sketch follows below).
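As a concrete illustration of this grouping step, here is a hedged Python/PyTorch sketch that pools one feature vector per predicted class by averaging CNN activations over that label's pixels; the function name, tensor shapes, and the choice of average pooling are our assumptions, not the authors' released code.

```python
# Hypothetical sketch: pool per-object features from CNN outputs by
# averaging activations over pixels that share a predicted label.
import torch

def pool_object_features(feature_map, label_map, num_classes):
    """feature_map: (C, H, W) CNN activations; label_map: (H, W) predicted labels.
    Returns {class_id: (C,) feature vector} for every class present."""
    features = {}
    for cls in range(num_classes):
        mask = (label_map == cls)          # pixels predicted as this class
        if mask.any():
            features[cls] = feature_map[:, mask].mean(dim=1)
    return features
```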

12 Producing Structures with RNN Model
Greedy tree construction with three small networks (Scorer, Combiner, Categorizer; the figure shows candidate merges with example scores such as 1.5, 0.6, and 2.3). The RNN produces the scene structure: feeding the object features into the RNN, a bottom-up greedy algorithm recursively constructs the parsing tree. The procedure begins with an initial set of leaf nodes. In each iteration, the algorithm enumerates all possible merging pairs, computes their merging scores, and merges the pair with the highest score; the relation between the two merged nodes is also predicted in this step. The algorithm iterates until only one root node is left (see the sketch below). Semantic mapping
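The following is a minimal sketch of that greedy loop, assuming the Scorer, Combiner, and Categorizer networks are passed in as callables; the tuple-based tree representation is an illustrative choice, not the paper's data structure.

```python
# Greedy bottom-up parsing-tree construction, as described on the slide.
import itertools

def build_tree(leaves, scorer, combiner, categorizer):
    """leaves: list of (feature, name) pairs, one per detected object."""
    nodes = list(leaves)
    while len(nodes) > 1:
        # Enumerate every candidate pair and score the potential merge.
        pairs = list(itertools.combinations(range(len(nodes)), 2))
        scores = [scorer(nodes[i][0], nodes[j][0]) for i, j in pairs]
        i, j = pairs[max(range(len(pairs)), key=scores.__getitem__)]
        feat = combiner(nodes[i][0], nodes[j][0])   # feature of the merged node
        relation = categorizer(feat)                # predicted inter-object relation
        merged = (feat, (relation, nodes[i][1], nodes[j][1]))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][1]   # root subtree: nested (relation, left, right) tuples
```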

13 Training CNN-RNN with Sentence Description
A weakly-supervised manner that leverages the descriptive sentences of the training images. Two parts: semantic object labeling via the CNN, and structure prediction via the RNN. In the initial stage, we convert each sentence into a normalized tree of entities and relations. The semantic masks for the entities are then inferred from the image descriptions and the CNN's parsing outputs, and the recursive neural network produces the tree structures with interaction relations. The semantic label loss and the structure-relation loss are jointly optimized during training. We train the model with an EM-style algorithm that alternates between predicting the latent scene configurations, by transferring knowledge from the semantic trees, and optimizing the network parameters (sketched below).
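A hedged sketch of that alternation is shown below; infer_latent_config, label_loss, and structure_loss are hypothetical placeholders standing in for the paper's E-step inference and its two losses, and the PyTorch-style optimizer calls are an assumption.

```python
# EM-style weakly supervised training: alternate between inferring the
# latent scene configuration and updating the network parameters.
def train_em(model, data, optimizer, num_epochs):
    for _ in range(num_epochs):
        for image, semantic_tree in data:   # tree derived from the sentence
            # E-step: predict the latent configuration by transferring
            # knowledge from the sentence-derived semantic tree.
            config = infer_latent_config(model, image, semantic_tree)  # hypothetical
            # M-step: optimize both losses against that configuration.
            loss = label_loss(model, image, config) \
                 + structure_loss(model, image, config)                # hypothetical
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```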

14 Sentence Preprocessing
Parse trees from the language parser are not perfect: irrelevant words (e.g., adjectives and adverbs); entity names (different words for the same entity, irrelevant entities); relations that are not recognized. Example of a language parse tree produced by the Stanford parser. First, we preprocess the sentence parsing results generated by the language parser. These results are often imperfect, containing irrelevant words such as adjectives and adverbs, different words for the same entity, and missed relations.

15 Sentence Preprocessing
Step 1: POS tag filtering. [Figure: Stanford parse tree of "a man holding a red bottle sits on a chair standing by a monitor on the table", with POS tags such as DT, NN, JJ, VBG, VBZ, and IN at the leaves.] We first use POS tag filtering to remove irrelevant words (a sketch follows below).
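As an illustration, here is a minimal sketch of such a filter over Penn Treebank tags; the exact set of kept tags is our assumption (the pipeline may also retain prepositions and other tags).

```python
# POS-tag filtering: keep nouns and verbs, drop determiners, adjectives, etc.
KEEP = {"NN", "NNS", "NNP", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def filter_tokens(tagged):
    """tagged: list of (word, tag) pairs from the language parser."""
    return [(word, tag) for word, tag in tagged if tag in KEEP]

tagged = [("a", "DT"), ("man", "NN"), ("holding", "VBG"), ("a", "DT"),
          ("red", "JJ"), ("bottle", "NN"), ("sits", "VBZ"), ("on", "IN"),
          ("a", "DT"), ("chair", "NN")]
print(filter_tokens(tagged))
# [('man', 'NN'), ('holding', 'VBG'), ('bottle', 'NN'), ('sits', 'VBZ'), ('chair', 'NN')]
```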

16 Sentence Preprocessing
Step 2: Entity and relation recognition. [Figure: the regularized tree, with entities man, bottle, chair, monitor, and table, and relations hold, sit, and stand derived from the parse tree.] We then summarize the important entities and relations, which yields a regularized tree compatible with our model (a sketch follows below).
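Below is a hedged sketch of this recognition step on the filtered tokens: nouns become entities and verbs become lemmatized relation labels. The tiny lemma table and the flat output (rather than the nested tree shown on the slide) are simplifying assumptions for illustration.

```python
# Entity and relation recognition on POS-filtered tokens (illustrative).
LEMMA = {"sits": "sit", "holding": "hold", "standing": "stand"}  # toy lemmatizer

def recognize(tokens):
    """tokens: (word, tag) pairs; nouns -> entities, verbs -> relations."""
    entities = [w for w, t in tokens if t.startswith("NN")]
    relations = [LEMMA.get(w, w) for w, t in tokens if t.startswith("VB")]
    return entities, relations

tokens = [("man", "NN"), ("holding", "VBG"), ("bottle", "NN"),
          ("sits", "VBZ"), ("chair", "NN")]
print(recognize(tokens))   # (['man', 'bottle', 'chair'], ['hold', 'sit'])
```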

17 Structure & Relation Loss
[Figure: the learning algorithm; CNN inference produces score maps, which feed the RNN, and the Semantic Label Loss and the Structure & Relation Loss are summed.] During model learning, the image first goes through the CNN, which produces score maps for all categories. The score maps and the parsed sentences are jointly used to produce the structured configuration of the objects. Finally, the semantic label loss and the structure-and-relation loss are summed, and gradients are back-propagated through the network for end-to-end training (a sketch follows below).
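A minimal sketch of summing the two losses is given below; the cross-entropy form of the label loss and the structure_relation_loss placeholder are our assumptions, since the slide only states that the two losses are summed.

```python
# Joint objective: semantic label loss + structure & relation loss.
import torch.nn.functional as F

def joint_loss(score_maps, target_labels, rnn_outputs, target_tree):
    """score_maps: (N, C, H, W) CNN class scores; target_labels: (N, H, W)."""
    l_label = F.cross_entropy(score_maps, target_labels)           # CNN side
    l_struct = structure_relation_loss(rnn_outputs, target_tree)   # hypothetical RNN-side loss
    return l_label + l_struct   # one scalar; backprop reaches both networks
```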

18 Experiments
Datasets: the PASCAL VOC 2012 segmentation dataset (we provide one descriptive sentence for each image in the training and validation sets) and a new SYSU-Scene dataset (more than 5,000 images). Two tasks: structured scene parsing and weakly-supervised scene labeling. Our experiments are conducted on the PASCAL VOC 2012 segmentation dataset, for which we provide one descriptive sentence per image in the training and validation sets, and on our new SYSU-Scene dataset.

19 Structured Scene Parsing
Performance measures: relation accuracy and structure accuracy. Relation accuracy measures how much a predicted structure with relation labels is consistent with the one derived from the sentence; structure accuracy is relation accuracy ignoring the relation labels (a sketch of both metrics follows below). Result: end-to-end learning achieves the best performance. In the structured scene parsing experiment, we achieve 67.4% structure accuracy and 62.8% relation accuracy; note that end-to-end learning achieves the best performance.
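To make the two metrics concrete, here is a hedged sketch that compares the merges of a predicted tree against the sentence-derived tree, once with and once without relation labels; representing each merge by its set of leaf entities is our assumption about how consistency is counted.

```python
# Structure / relation accuracy over (relation, left, right) tuple trees
# with entity names at the leaves (illustrative representation).
def merges(tree, with_relation):
    """Return (leaf span, set of merges) for a subtree."""
    if not isinstance(tree, tuple):             # a leaf entity
        return frozenset([tree]), set()
    rel, left, right = tree
    span_l, m_l = merges(left, with_relation)
    span_r, m_r = merges(right, with_relation)
    span = span_l | span_r
    key = (span, rel) if with_relation else span
    return span, m_l | m_r | {key}

def accuracy(pred_tree, gt_tree, with_relation):
    _, pred = merges(pred_tree, with_relation)
    _, gt = merges(gt_tree, with_relation)
    return len(pred & gt) / len(gt)   # fraction of ground-truth merges recovered
```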

20 Visualized Scene Structure Results
Here are some visualized results of scene structure parsing. The third example includes some failed relation predictions.

21 Visualized Scene Structure Results
Here are more visualized results of scene structure parsing; again, the third example includes some failed relation predictions.

22 Semantic Labeling
Weakly supervised setting; performance measured in pixel-wise IoU (Intersection over Union), compared against CVPR 2015 and ICCV 2015 methods. In the weakly-supervised semantic labeling experiment, our model achieves 35.2% IoU, better than the other methods. When the RNN's parameters are not updated, the performance drops to 33.5%, which demonstrates the effectiveness of our jointly optimized framework.

23 Semantic Labeling Results
From left to right: the input images; the ground-truth labeling; our proposed method (weakly supervised); DeepLab (weakly supervised); MIL-ILP (weakly supervised). We show some visualized semantic labeling results from our weakly supervised framework; our method obtains more accurate and meaningful segmentations.

24 Conclusion A deep learning framework for generating meaningful and hierarchical scene configurations: a deeper understanding of scenes compared to semantic labeling. CNN-RNN for relations and structure. Leverages sentence-based image descriptions for model training: cost-effective and consistent with human perception. To conclude, we have presented a CNN-RNN framework that parses a scene into a meaningful, hierarchical configuration with inter-object relations, offering a deeper understanding of scenes than semantic labeling alone. The model is trained cost-effectively from sentence-based image descriptions, and its outputs accord with human perception. Thank you.

