Two-Stream Convolutional Networks for Action Recognition in Videos

1 Two-Stream Convolutional Networks for Action Recognition in Videos
Presenter: Mooyeol Baek
Karen Simonyan and Andrew Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos." NIPS 2014.

2 Previous works
A large family of previous methods uses local spatio-temporal features (shallow & high-dimensional):
- Histogram of Oriented Gradients (HOG)
- Histogram of Optical Flow (HOF)
State-of-the-art shallow video representations make use of dense point trajectories:
- Motion Boundary Histogram (MBH)
There have also been a number of attempts to develop deep architectures for video recognition:
- A stack of consecutive video frames as input
- Operating on individual video frames performs similarly => FAILS to capture motion

3 Contributions
- Proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks.
- Demonstrates that a ConvNet trained on multi-frame dense optical flow achieves very good performance in spite of limited training data.
- Shows that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve performance on both.

4 Two-Stream Video Classification

5 Spatial stream ConvNet
- Uses an individual video frame as input => compatible with networks pre-trained on ImageNet
- Uses ReLU non-linearities
Temporal stream ConvNet
- Uses multi-frame optical flow as input
- Uses the same structure as the spatial ConvNet
A sketch of the two streams follows.
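The following is a minimal PyTorch sketch of the two-stream design (my own illustration, not the authors' exact network): the layer sizes only loosely follow the AlexNet-style streams, and the 101-class output and equal fusion weights are assumptions.

```python
import torch
import torch.nn as nn

def make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
    """One AlexNet-like stream. The spatial stream takes 3 RGB channels;
    the temporal stream takes 2L stacked flow channels (L=10 -> 20)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Flatten(),
        nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 2048), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(2048, num_classes),
    )

spatial_net = make_stream(in_channels=3, num_classes=101)    # single RGB frame
temporal_net = make_stream(in_channels=20, num_classes=101)  # 10 stacked flow fields

rgb = torch.randn(8, 3, 224, 224)    # batch of frames
flow = torch.randn(8, 20, 224, 224)  # batch of stacked flow volumes
# Late fusion: combine the class probabilities of the two streams.
probs = (spatial_net(rgb).softmax(1) + temporal_net(flow).softmax(1)) / 2
```

The key design choice is that the two streams never share weights; they are trained separately and only their class scores are fused.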

6 Temporal stream ConvNet (Cont.)
Variants of the optical flow input (see the sketch below):
- Optical flow stacking
- Trajectory stacking
- Bi-directional optical flow
- Mean flow subtraction
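To illustrate the first and last variants, here is a minimal NumPy sketch (my own; the function name is illustrative) of optical flow stacking with mean flow subtraction. It assumes the dense flow fields have already been computed by an off-the-shelf optical flow method.

```python
import numpy as np

def stack_flows(flows: np.ndarray) -> np.ndarray:
    """flows: (L, H, W, 2) array of dense displacement fields.
    Returns a (2L, H, W) input volume for the temporal ConvNet, with the
    mean of each field subtracted to compensate for global camera motion."""
    L, H, W, _ = flows.shape
    vol = flows.transpose(0, 3, 1, 2).reshape(2 * L, H, W)  # x1, y1, x2, y2, ...
    vol -= vol.mean(axis=(1, 2), keepdims=True)             # mean flow subtraction
    return vol

flows = np.random.randn(10, 224, 224, 2).astype(np.float32)  # L = 10 fields
x = stack_flows(flows)  # shape (20, 224, 224), one temporal-stream input
```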

7 Multi-task Learning
- The temporal ConvNet has no pre-trained model, so training data is scarce.
- Train one network on both the UCF-101 and HMDB-51 datasets => two different softmax layers, each with its own loss function
- Overall training loss = sum of the losses for the two datasets (see the sketch below)
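A minimal PyTorch sketch of this two-head setup (my own illustration): the trunk below is a small stand-in for the shared ConvNet layers, and the class names and batch shapes are assumptions.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk with one classifier per dataset."""
    def __init__(self, trunk: nn.Module, feat_dim: int):
        super().__init__()
        self.trunk = trunk
        self.head_ucf = nn.Linear(feat_dim, 101)   # UCF-101 head
        self.head_hmdb = nn.Linear(feat_dim, 51)   # HMDB-51 head

    def forward(self, x, dataset: str):
        feats = self.trunk(x)
        return self.head_ucf(feats) if dataset == "ucf" else self.head_hmdb(feats)

trunk = nn.Sequential(nn.Linear(512, 2048), nn.ReLU())  # stand-in for the ConvNet trunk
model = TwoHeadNet(trunk, feat_dim=2048)
ce = nn.CrossEntropyLoss()  # cross-entropy on top of each softmax layer

x_ucf, y_ucf = torch.randn(4, 512), torch.randint(0, 101, (4,))
x_hmdb, y_hmdb = torch.randn(4, 512), torch.randint(0, 51, (4,))
# Overall training loss = sum of the per-dataset losses.
loss = ce(model(x_ucf, "ucf"), y_ucf) + ce(model(x_hmdb, "hmdb"), y_hmdb)
loss.backward()
```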

8 UCF-101 Dataset
- 101 actions, 13,320 clips, 100-175 clips per action
- Action categories: 1) Human-Object Interaction 2) Body-Motion Only 3) Human-Human Interaction 4) Playing Musical Instruments 5) Sports

9 HMDB-51 Dataset
- 51 actions, 6,849 clips, at least 101 clips per action
- Action categories: 1) Facial actions with object manipulation 2) General body movements 3) Body movements with object interaction 4) Body movements for human interaction

10 Training environment
- Adaptation of the AlexNet training procedure
- Mini-batch (256) stochastic gradient descent with momentum 0.9; scheduled decrease of the learning rate (see the sketch below)
- Optical flow is pre-computed before training
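A minimal PyTorch sketch of this recipe (my own): the initial learning rate, the decay milestones, and the feature dimensions are assumptions, not values taken from the slide.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 101)  # stand-in for the full ConvNet
# Mini-batch SGD with momentum 0.9, as on the slide; lr values are assumed.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50_000, 70_000], gamma=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(80_000):  # each step draws a mini-batch of 256 samples
    x = torch.randn(256, 512)          # stand-in for pre-computed flow inputs
    y = torch.randint(0, 101, (256,))
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    sched.step()  # scheduled decrease of the learning rate
```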

11 Test settings
- Given an input video, sample a fixed number of frames (25).
- Obtain 10 ConvNet inputs per frame by cropping the four corners and the center and flipping horizontally.
- The output is a weighted fusion of the softmax predictions of the two ConvNets (see the sketch below).
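A minimal sketch of the fusion step (my own): `spatial_net` and `temporal_net` are the streams from the earlier sketch, and the 0.5 fusion weight is an assumption; the paper also considers fusing with an SVM on the stacked softmax scores.

```python
import torch

@torch.no_grad()
def stream_probs(net, inputs):
    """inputs: (25 frames x 10 crops, C, 224, 224) for one video.
    Returns class probabilities averaged over all sampled inputs."""
    return net(inputs).softmax(dim=1).mean(dim=0)

def two_stream_predict(spatial_net, temporal_net, rgb_crops, flow_crops, w_temporal=0.5):
    p_spatial = stream_probs(spatial_net, rgb_crops)
    p_temporal = stream_probs(temporal_net, flow_crops)
    fused = (1 - w_temporal) * p_spatial + w_temporal * p_temporal  # weighted fusion
    return int(fused.argmax())  # predicted action class
```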

12 Evaluation

13 Evaluation (Cont.)

14 Thank you!

