Human-object interaction

Presentation transcript:

Human-object interaction 2019.3.15

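HOI problem definition. HOI: Human-Object Interaction

HOI-Det problem definition. HOI: Human-Object Interaction. Subject -> Human; Object -> Object; Predicate -> Action. Detect the Human and the Object, then predict the action produced by the interaction between the Human and the Object.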
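The triplet formulation above can be made concrete with a small data structure. This is only a minimal sketch: the class name `HOIDetection`, its fields, and the (x1, y1, x2, y2) box convention are illustrative assumptions and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) convention


@dataclass(frozen=True)
class HOIDetection:
    """One <human, action, object> triplet (hypothetical container)."""
    human_box: Box      # subject: where the person is
    object_box: Box     # object: where the interacted object is
    object_label: str   # object category, e.g. "ball"
    action: str         # predicate, e.g. "kick"
    score: float        # confidence of the whole triplet

    def as_triplet(self) -> Tuple[str, str, str]:
        """Return the (subject, predicate, object) reading of the detection."""
        return ("human", self.action, self.object_label)


det = HOIDetection((10, 20, 110, 320), (90, 280, 130, 320), "ball", "kick", 0.87)
print(det.as_triplet())
```

An HOI detector would then return a list of such records per image, one for each human-object pair it considers interactive.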

Development of HOI
Traditional methods
Origin: Observing human-object interactions using spatial and functional compatibility for recognition. TPAMI 2009.
Pioneer of pose + HOI: Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses. TPAMI 2012.
Deep learning era
A dataset opens a new era: Learning to Detect Human-Object Interactions. WACV 2018.
Locating the relevant object from the action: Detecting and Recognizing Human-Object Interactions. CVPR 2018.
Refining to interactions between body parts and objects:
Attention: Pairwise Body-Part Attention for Recognizing Human-Object Interactions. ECCV 2018.
No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques. arXiv 2018.
Graph convolution:
Zero-shot: Compositional Learning for Human Object Interaction. ECCV 2018.
Origin: Learning Human-Object Interactions by Graph Parsing Neural Networks. ECCV 2018.
Two-stage: Transferable Interactiveness Prior for Human-Object Interaction Detection. CVPR 2019.

Commonly used in HOI: pose features (derived from Action), location information, external linguistic knowledge.

Learning to Detect Human-Object Interactions

Contributions Propose HICO-DET dataset: the first large benchmark for HOI detection. Propose HO-RCNN: Human-Object Region-based Convolutional Neural Networks.

HICO-DET Dataset Statistics: 600 HOI classes of interest

Method HO-RCNN

HO-RCNN Human-Object Proposals: First, detect bounding boxes for humans and for the object categories of interest. Then Figure 2.

HO-RCNN Human and Object Stream: Given a human-object proposal, the human stream extracts local features from the human bounding box and generates confidence scores for each HOI class. The object stream does the same for the object bounding box.
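The proposal step can be sketched as follows, assuming that proposals are formed by pairing each confidently detected human with each confidently detected object. The slide text is truncated at this point, so the pairing rule, the function name `make_proposals`, and the threshold `min_score` are all assumptions made for illustration.

```python
from itertools import product


def make_proposals(humans, objects, min_score=0.5):
    """Pair each confident human box with each confident object box.

    `humans` and `objects` are lists of (box, score) tuples. The score
    threshold is an illustrative assumption, not a value from the slides.
    """
    keep_h = [h for h in humans if h[1] >= min_score]
    keep_o = [o for o in objects if o[1] >= min_score]
    return [(h[0], o[0]) for h, o in product(keep_h, keep_o)]


humans = [((0, 0, 50, 100), 0.9), ((60, 0, 110, 100), 0.3)]
objects = [((40, 60, 70, 90), 0.8), ((5, 5, 15, 15), 0.7)]
proposals = make_proposals(humans, objects)
print(len(proposals))  # one confident human paired with two confident objects
```

The number of proposals grows as the product of the two detection counts, which is why some score filtering before pairing is usually worthwhile.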

HO-RCNN Pairwise Stream
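Each of the three streams produces one confidence score per HOI class for a given proposal. The sketch below assumes the streams are combined by summing their per-class scores; the slides do not state the fusion rule, so the summation and the name `fuse_streams` are assumptions.

```python
def fuse_streams(*stream_scores):
    """Sum per-class scores across streams (assumed late-fusion rule)."""
    n = len(stream_scores[0])
    if any(len(s) != n for s in stream_scores):
        raise ValueError("all streams must score the same HOI classes")
    return [sum(vals) for vals in zip(*stream_scores)]


# Toy scores for three HOI classes from each stream.
human = [0.2, 1.5, -0.3]
obj = [0.1, 0.9, 0.4]
pairwise = [-0.5, 1.1, 0.2]

fused = fuse_streams(human, obj, pairwise)
best = max(range(len(fused)), key=fused.__getitem__)
print(best)  # index of the highest-scoring HOI class
```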

Detecting and Recognizing Human-Object Interactions

Motivation: A person's action can, to some extent, determine the location of the object the person interacts with. For example, for <person, hit, ball> the ball is very likely to be near the person's hand, whereas for <person, kick, ball> the ball is more likely to appear next to the foot.

Method: Model Architecture and Model Components
Object Detection: Image -> Faster R-CNN -> human and object boxes with associated scores.
Human-centric Branch: input: human Conv5 feature; action output: action score (sigmoid); target output: Gaussian map.
Interaction Branch: input: human and object Conv5 features; output: HOI score.

Method: We then write our target localization term as: Decompose the triplet score into four terms:
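The equations for this slide were not preserved in the transcript. The sketch below shows one way the decomposition could look, assuming the triplet score is the product of a human detection score, an object detection score, an action score, and a Gaussian target localization term that compares the actual object location with the location predicted from the human. The relative box encoding, the value `sigma=0.3`, and all function names are assumptions for illustration, not a transcription of the slide.

```python
import math


def relative_box(human, obj):
    """Encode the object box relative to the human box.

    Boxes are (x_center, y_center, width, height). The encoding is the
    usual R-CNN style box parameterization (assumed here).
    """
    xh, yh, wh, hh = human
    xo, yo, wo, ho = obj
    return ((xo - xh) / wh, (yo - yh) / hh, math.log(wo / wh), math.log(ho / hh))


def target_localization(human, obj, mu, sigma=0.3):
    """Gaussian compatibility between the actual and predicted object location."""
    b = relative_box(human, obj)
    sq_dist = sum((bi - mi) ** 2 for bi, mi in zip(b, mu))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def triplet_score(s_h, s_o, s_action, g):
    """Product of the four terms: human, object, action, localization."""
    return s_h * s_o * s_action * g


human = (100.0, 200.0, 80.0, 240.0)
ball = (120.0, 310.0, 40.0, 40.0)
mu_kick = relative_box(human, ball)  # a prediction that matches the ball exactly
g_match = target_localization(human, ball, mu_kick)
g_far = target_localization(human, (300.0, 50.0, 40.0, 40.0), mu_kick)
score = triplet_score(0.9, 0.8, 0.7, g_match)
print(g_match, g_far < g_match, score)
```

The localization term equals 1 when the object sits exactly where the action predicts it and decays toward 0 as the object moves away, so distant objects are suppressed even when their detection scores are high.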

Transferable Interactiveness Prior for Human-Object Interaction Detection

Motivation: Implicitly predict whether a human-object pair is interactive or not. How can interactiveness be utilized to improve HOI detection learning?

Contribution: Propose a general and transferable Interactiveness Prior learning method. The interactiveness prior can be learned across many datasets and applied to any specific dataset. Outperforms state-of-the-art HOI detection results by a great margin.

Method Framework

Method: Representation and Classification Networks
Human and Object Detection: Detectron with ResNet-50-FPN.
Representation Network: a ResNet-50-based Faster R-CNN, denoted R here.
HOI Classification Network: multi-stream architecture with a late fusion strategy.

Method: Interactiveness Network
Human and Object streams: ROI pooling features from the representation network R.
Spatial-Pose Stream

Method Confidence Function

Method Interactiveness Prior Transfer Training

Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities

Difficulties HOI: the relevant object tends to be small or only partially visible. Pose: the human body parts are often self-occluded

Contributions: Propose a new random field model to encode the mutual context of objects and human poses in human-object interaction activities. Significantly outperforms the state of the art in detecting very difficult objects and human poses.

Modeling mutual context of object and pose Goal: To estimate the human pose and to detect the object that the human interacts with. The model

Model The overall model can be computed as Co-occurrence context

Model Spatial Context

Model Modeling objects

Model Modeling human pose. Modeling activities

Properties of the model: Co-occurrence context for the activity class, object, and human pose. Multiple types of human poses for each activity. Spatial context between object and body parts. Relations with the other models.

Pairwise Body-Part Attention for Recognizing Human-Object Interactions

Motivation: A human interacts with an object using only some parts of the body. Different body parts should receive different amounts of attention in HOI recognition. The correlations between different body parts should also be considered.

Contributions: Propose a new pairwise body-part attention model that learns to focus on crucial parts and their correlations for HOI recognition. A novel attention-based feature selection method and a feature representation scheme that captures pairwise correlations between body parts. The proposed approach achieves a 10% relative improvement over the SOTA results in HOI recognition on the HICO dataset.

Method Framework

Method: Global Appearance Features
Scene and Human Features: An ROI pooling layer extracts ROI features for each person and for the scene, given their bounding boxes. Concatenate the human features and the scene features.
Incorporating Object Features: Set the ROI as the union box of the detected human and object. Sample multiple union boxes of different objects and the person.

Method: Local Pairwise Body-part Features. Given a pair of body parts, extract their joint feature maps while preserving their relative spatial relationships.
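The union box used as the ROI for object features is simply the smallest box that covers both the human box and the object box. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corners; the function names are illustrative.

```python
def union_box(a, b):
    """Smallest (x1, y1, x2, y2) box that covers both box a and box b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def sample_union_boxes(human, objects):
    """One union box per detected object, all sharing the same person."""
    return [union_box(human, obj) for obj in objects]


person = (10, 20, 110, 320)
detected_objects = [(90, 280, 130, 330), (0, 100, 30, 140)]
rois = sample_union_boxes(person, detected_objects)
print(rois)
```

Each resulting box can then be passed to the ROI pooling layer in the same way as a person or scene box.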

Compositional Learning for Human Object Interaction

Motivation

Contribution: Propose a novel method that uses an external knowledge graph and graph convolutional networks to learn how to compose classifiers for verb-noun pairs. Provide benchmarks on several datasets for zero-shot learning, covering both images and videos.

Method Framework

Method: A Graphical Representation of Knowledge. Graph Construction
Nodes: verbs, nouns, and actions.
Node features: word embeddings for verb and noun nodes (action nodes are zero-initialized).
Edges: a verb node can only connect to a noun node via a valid action node.
Adjacency matrix normalization ->
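The graph construction above can be sketched in a few lines. The normalization formula was not preserved in the transcript, so this example assumes the symmetric normalization with self-loops that is standard for graph convolutional networks, D^(-1/2) (A + I) D^(-1/2). The function names and the toy vocabulary are illustrative assumptions.

```python
import math


def build_graph(verbs, nouns, actions):
    """Build an adjacency matrix over verb, noun, and action nodes.

    `actions` lists the valid (verb, noun) pairs. Verbs and nouns are never
    linked directly; each action node links to its verb and to its noun.
    """
    keys = ([("verb", v) for v in verbs] + [("noun", n) for n in nouns]
            + [("action", v, n) for v, n in actions])
    index = {key: i for i, key in enumerate(keys)}
    size = len(keys)
    adj = [[0.0] * size for _ in range(size)]
    for v, n in actions:
        a = index[("action", v, n)]
        for other in (index[("verb", v)], index[("noun", n)]):
            adj[a][other] = adj[other][a] = 1.0
    return index, adj


def normalize(adj):
    """Symmetric normalization with self-loops (assumed GCN-style rule)."""
    size = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
             for i in range(size)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(size)]
            for i in range(size)]


index, adj = build_graph(
    ["ride", "feed"], ["horse", "bike"],
    [("ride", "horse"), ("ride", "bike"), ("feed", "horse")],
)
norm = normalize(adj)
print(adj[index[("verb", "ride")]][index[("noun", "horse")]])  # no direct edge
```

Because unseen verb-noun pairs can be added as extra action nodes connected to existing verb and noun nodes, the same construction supports the zero-shot setting described in the contribution.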