
1 ICCV 2019

2 Spatial semantics between objects:
location, pose, shape, frame of reference, and object-specific common-sense knowledge. The goal is to find spatial relations that are difficult to predict using simple cues such as 2D spatial configuration or language priors.

3 Issues in current visual relation datasets:
VG and Open Images do not provide negative examples.
Annotations are not exhaustive: many valid relations are left unannotated, so the Recall@K metric cannot distinguish a good system producing valid but unannotated predictions from a bad system producing false positives.
Language bias: relations can be guessed well without looking at the images (among relations involving a table, 89.37% of them place an object 'on' the table).
Contributions:
Adversarial crowdsourcing: a human annotator is asked to come up with adversarial examples to confuse a robot.
17,498 relations on 11,569 images: 10,180 RGB images and 1,389 RGB-D images.
Given two object names and bounding boxes, the task is to classify whether a particular spatial relation holds, decoupling recognition from object detection (see the feature sketch below).
3,579 unique object classes, 2,139 of them appearing only once: a long-tail distribution.
Each predicate has an equal number of positive and negative relations, effectively reducing dataset bias.
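To make the task format concrete, here is a minimal Python sketch (not the authors' code) of the normalized 2D-configuration features such a classifier could consume; the exact feature set is an illustrative assumption:

```python
import numpy as np

def bbox_features(subj_box, obj_box, img_w, img_h):
    """Scale-invariant 2D features for a (subject, object) box pair.

    Boxes are (x1, y1, x2, y2) in pixels.
    """
    sx1, sy1, sx2, sy2 = subj_box
    ox1, oy1, ox2, oy2 = obj_box
    feats = [
        (sx1 + sx2 - ox1 - ox2) / (2 * img_w),        # horizontal center offset
        (sy1 + sy2 - oy1 - oy2) / (2 * img_h),        # vertical center offset
        (sx2 - sx1) * (sy2 - sy1) / (img_w * img_h),  # subject area
        (ox2 - ox1) * (oy2 - oy1) / (img_w * img_h),  # object area
    ]
    # Intersection-over-union captures overlap, relevant for 'on' / 'in'.
    ix = max(0.0, min(sx2, ox2) - max(sx1, ox1))
    iy = max(0.0, min(sy2, oy2) - max(sy1, oy1))
    inter = ix * iy
    union = (sx2 - sx1) * (sy2 - sy1) + (ox2 - ox1) * (oy2 - oy1) - inter
    feats.append(inter / union if union > 0 else 0.0)
    return np.array(feats, dtype=np.float32)
```

A logistic regression over these features plus one-hot object and predicate names would be exactly the kind of "simple cues" baseline the dataset is designed to defeat.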

4 Adversarial crowdsourcing protocol
The robot is first trained on a dataset of 7,850 relations collected without adversarial crowdsourcing, and is occasionally re-trained as adversarial annotations accumulate (sketched below).
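A hedged sketch of this loop; `robot.train`, `robot.predict`, and `annotator.propose_adversarial` are hypothetical interfaces, and the rule of keeping only relations that fool the robot is an assumption based on the slide:

```python
def adversarial_crowdsourcing(robot, seed_relations, annotators,
                              rounds, retrain_every=1000):
    # Seed: the 7,850 relations collected without adversarial crowdsourcing.
    dataset = list(seed_relations)
    robot.train(dataset)
    collected = 0
    for _ in range(rounds):
        for annotator in annotators:
            # The annotator proposes a relation (with its true label)
            # that they believe the current robot will misclassify.
            relation, label = annotator.propose_adversarial(robot)
            if robot.predict(relation) != label:  # robot fooled: keep it
                dataset.append((relation, label))
                collected += 1
                if collected % retrain_every == 0:  # occasional re-training
                    robot.train(dataset)
    return dataset
```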

5 Analysis of the Dataset
Predicate distribution. Setup: SpatialSense → SpatialSense-Positive, VRD → VRD-Spatial, VG → VG-Spatial. 2D spatial distribution.

6 SpatialSense is much more difficult to tackle using simple language and 2D cues than prior datasets.
Models trained on SpatialSense generalize impressively well to other datasets.

7 Baselines for Spatial Relation Recognition
VTransE, DRNet

8 Testing accuracy on spatial relation recognition
Correlation matrix between errors
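One way to compute such an error-correlation matrix between baselines, as a sketch (model names and predictions are placeholders):

```python
import numpy as np

def error_correlation(preds, labels):
    """preds: dict mapping model name -> (N,) array of 0/1 predictions."""
    names = sorted(preds)
    # Binary error indicator per model: 1 where the prediction is wrong.
    errors = np.stack([(preds[m] != labels).astype(float) for m in names])
    return names, np.corrcoef(errors)  # Pearson correlation between error patterns
```

Low correlation between two models' errors suggests they fail on different examples, i.e. they exploit different cues.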

9 (figure-only slide; no transcript text)

10 ICCV 2019 Poster

11 Image Encoder, Question Encoder, Multimodal Fusion, Answer Predictor: the components of a typical VQA system (skeleton sketched below).
Problems to solve: objects and their surrounding environment ("stuff"), semantics about actions, and locations (relative geometric positions), e.g. "Is the zebra at the far right a baby zebra?" and "Are all the zebras eating grass?"
Relation Encoder: learns relation-aware, question-adaptive, region-level representations from the image.
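A high-level skeleton of that modular pipeline, assuming PyTorch; every submodule is a placeholder standing in for the components named on the slide, not the released implementation:

```python
import torch.nn as nn

class RelationAwareVQA(nn.Module):
    def __init__(self, image_enc, question_enc, relation_enc, fusion, predictor):
        super().__init__()
        self.image_enc = image_enc        # e.g. region features from a detector
        self.question_enc = question_enc  # e.g. an RNN over word embeddings
        self.relation_enc = relation_enc  # graph attention over regions (slides 13-14)
        self.fusion = fusion              # BUTD / MUTAN / BAN (slide 15)
        self.predictor = predictor        # classifier over the answer vocabulary

    def forward(self, image, question):
        v = self.image_enc(image)        # (B, K, v_dim) region features
        q = self.question_enc(question)  # (B, q_dim) question embedding
        v_rel = self.relation_enc(v, q)  # relation-aware, question-adaptive regions
        return self.predictor(self.fusion(v_rel, q))
```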

12 Pipeline

13 Graph Construction
Fully-connected relation graph → implicit relations; graph pruned with prior knowledge → explicit relations (construction sketched below).
Spatial Graph: 11 categories + "no relation"; symmetric (an edge i→j implies a corresponding edge j→i).
Semantic Graph: top-14 semantic relations + "no relation"; the relation classifier is pretrained on VG.
T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In CVPR, 2018.
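A sketch of labeling one spatial edge from a pair of bounding boxes, loosely following the 11-class scheme of Yao et al. (ECCV 2018); the thresholds, class ordering, and distance cutoff for "no relation" are assumptions, not the paper's exact values:

```python
import math

def spatial_relation(box_i, box_j, img_diag, far_ratio=0.5, iou_thresh=0.5):
    """Classify the edge box_i -> box_j into classes 1..11, or 0 ("no relation")."""
    xi1, yi1, xi2, yi2 = box_i
    xj1, yj1, xj2, yj2 = box_j
    # Classes 1-2: full containment in either direction.
    if xj1 >= xi1 and yj1 >= yi1 and xj2 <= xi2 and yj2 <= yi2:
        return 1  # j is inside i
    if xi1 >= xj1 and yi1 >= yj1 and xi2 <= xj2 and yi2 <= yj2:
        return 2  # j covers i
    # Class 3: significant overlap (IoU above a threshold).
    ix = max(0.0, min(xi2, xj2) - max(xi1, xj1))
    iy = max(0.0, min(yi2, yj2) - max(yi1, yj1))
    inter = ix * iy
    union = (xi2 - xi1) * (yi2 - yi1) + (xj2 - xj1) * (yj2 - yj1) - inter
    if union > 0 and inter / union >= iou_thresh:
        return 3
    cxi, cyi = (xi1 + xi2) / 2, (yi1 + yi2) / 2
    cxj, cyj = (xj1 + xj2) / 2, (yj1 + yj2) / 2
    # Class 0: boxes too far apart relative to the image diagonal.
    if math.hypot(cxj - cxi, cyj - cyi) > far_ratio * img_diag:
        return 0
    # Classes 4-11: eight 45-degree directional bins between box centers.
    angle = math.degrees(math.atan2(cyj - cyi, cxj - cxi)) % 360
    return 4 + int(angle // 45)
```

Symmetry holds by construction: swapping the boxes maps "inside" to "cover" and flips a directional bin by 180 degrees.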

14 Question-adaptive Graph Attention
Implicit Relation; Explicit Relation (a simplified implicit-attention sketch follows below).
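A simplified single-head PyTorch sketch of question-adaptive attention for the implicit relation encoder; the paper's multi-head, relation-bias-augmented version is more involved, so treat this as an assumption-laden illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAdaptiveAttention(nn.Module):
    def __init__(self, v_dim, q_dim, hid_dim):
        super().__init__()
        # Keys/queries come from [region feature; question embedding],
        # so the attention weights adapt to the question.
        self.proj = nn.Linear(v_dim + q_dim, hid_dim)
        self.out = nn.Linear(v_dim, hid_dim)

    def forward(self, v, q):
        """v: (B, K, v_dim) region features; q: (B, q_dim) question embedding."""
        B, K, _ = v.shape
        vq = torch.cat([v, q.unsqueeze(1).expand(B, K, q.size(-1))], dim=-1)
        h = self.proj(vq)                                  # (B, K, hid_dim)
        logits = torch.bmm(h, h.transpose(1, 2))           # (B, K, K) pairwise scores
        alpha = F.softmax(logits / h.size(-1) ** 0.5, dim=-1)
        return torch.bmm(alpha, self.out(v))               # relation-aware regions
```

For the explicit encoders, the attention would additionally be masked and biased by the pruned graph's edge labels.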

15 Implementation Details
Fusing the question feature with the visual representation: Bottom-Up Top-Down attention (BUTD), Multimodal Tucker Fusion (MUTAN), or Bilinear Attention Network (BAN).
The different relation encoders are trained independently; a weighted sum of their predictions is used at inference (sketched below).
16 heads for all three graph attention networks.
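The inference-time combination could look like this sketch; the weights are illustrative placeholders, not values from the paper:

```python
import torch

def ensemble_answer(logits_sem, logits_spa, logits_imp, weights=(0.4, 0.3, 0.3)):
    """Blend the semantic, spatial, and implicit encoders' answer distributions."""
    probs = [torch.softmax(l, dim=-1)
             for l in (logits_sem, logits_spa, logits_imp)]
    mixed = sum(w * p for w, p in zip(weights, probs))
    return mixed.argmax(dim=-1)  # index of the predicted answer
```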

16 (figure-only slide; no transcript text)

