Two-Stream Convolutional Networks for Action Recognition in Videos

1 Two-Stream Convolutional Networks for Action Recognition in Videos
Presenter: Mooyeol Baek
Karen Simonyan and Andrew Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos." NIPS 2014.

2 Previous works
A large family of previous methods uses local spatio-temporal features (shallow & high-dimensional):
- Histogram of Oriented Gradients (HOG)
- Histogram of Optical Flow (HOF)
State-of-the-art shallow video representations make use of dense point trajectories:
- Motion Boundary Histogram (MBH)
There have also been a number of attempts to develop deep architectures for video recognition:
- A stack of consecutive video frames as input
- Operating on individual video frames performs similarly => FAILS to capture motion

3 Contributions
- Proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks.
- Demonstrates that a ConvNet trained on multi-frame dense optical flow achieves very good performance in spite of limited training data.
- Shows that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve performance on both.

4 Two-Stream Video Classification

5 Spatial stream ConvNet
- Uses an individual video frame as input => compatible with networks pre-trained on ImageNet
- Uses ReLU non-linearities
Temporal stream ConvNet
- Uses multi-frame optical flow as input
- Uses the same structure as the spatial ConvNet
A sketch of the two streams follows.
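The following is a minimal PyTorch sketch of the two-stream design (my own illustration, not the authors' exact network): the layer sizes only loosely follow the AlexNet-style streams, and the 101-class output and equal fusion weights are assumptions.

```python
import torch
import torch.nn as nn

def make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
    """One AlexNet-like stream. The spatial stream takes 3 RGB channels;
    the temporal stream takes 2L stacked flow channels (L=10 -> 20)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Flatten(),
        nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 2048), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(2048, num_classes),
    )

spatial_net = make_stream(in_channels=3, num_classes=101)    # single RGB frame
temporal_net = make_stream(in_channels=20, num_classes=101)  # 10 stacked flow fields

rgb = torch.randn(8, 3, 224, 224)    # batch of frames
flow = torch.randn(8, 20, 224, 224)  # batch of stacked flow volumes
# Late fusion: combine the class probabilities of the two streams.
probs = (spatial_net(rgb).softmax(1) + temporal_net(flow).softmax(1)) / 2
```

The key design choice is that the two streams never share weights; they are trained separately and only their class scores are fused.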

6 Temporal stream ConvNet (Cont.)
Variants of the optical flow input (see the sketch below):
- Optical flow stacking
- Trajectory stacking
- Bi-directional optical flow
- Mean flow subtraction
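To illustrate the first and last variants, here is a minimal NumPy sketch (my own; the function name is illustrative) of optical flow stacking with mean flow subtraction. It assumes the dense flow fields have already been computed by an off-the-shelf optical flow method.

```python
import numpy as np

def stack_flows(flows: np.ndarray) -> np.ndarray:
    """flows: (L, H, W, 2) array of dense displacement fields.
    Returns a (2L, H, W) input volume for the temporal ConvNet, with the
    mean of each field subtracted to compensate for global camera motion."""
    L, H, W, _ = flows.shape
    vol = flows.transpose(0, 3, 1, 2).reshape(2 * L, H, W)  # x1, y1, x2, y2, ...
    vol -= vol.mean(axis=(1, 2), keepdims=True)             # mean flow subtraction
    return vol

flows = np.random.randn(10, 224, 224, 2).astype(np.float32)  # L = 10 fields
x = stack_flows(flows)  # shape (20, 224, 224), one temporal-stream input
```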

7 Multi-task Learning
- The temporal ConvNet has no pre-trained model, so training data is scarce.
- Train one network on both the UCF-101 and HMDB-51 datasets => two different softmax layers, each with its own loss function
- Overall training loss = sum of the losses for the two datasets (see the sketch below)
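A minimal PyTorch sketch of this two-head setup (my own illustration): the trunk below is a small stand-in for the shared ConvNet layers, and the class names and batch shapes are assumptions.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk with one classifier per dataset."""
    def __init__(self, trunk: nn.Module, feat_dim: int):
        super().__init__()
        self.trunk = trunk
        self.head_ucf = nn.Linear(feat_dim, 101)   # UCF-101 head
        self.head_hmdb = nn.Linear(feat_dim, 51)   # HMDB-51 head

    def forward(self, x, dataset: str):
        feats = self.trunk(x)
        return self.head_ucf(feats) if dataset == "ucf" else self.head_hmdb(feats)

trunk = nn.Sequential(nn.Linear(512, 2048), nn.ReLU())  # stand-in for the ConvNet trunk
model = TwoHeadNet(trunk, feat_dim=2048)
ce = nn.CrossEntropyLoss()  # cross-entropy on top of each softmax layer

x_ucf, y_ucf = torch.randn(4, 512), torch.randint(0, 101, (4,))
x_hmdb, y_hmdb = torch.randn(4, 512), torch.randint(0, 51, (4,))
# Overall training loss = sum of the per-dataset losses.
loss = ce(model(x_ucf, "ucf"), y_ucf) + ce(model(x_hmdb, "hmdb"), y_hmdb)
loss.backward()
```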

8 UCF-101 Dataset
- 101 actions, 13,320 clips, 100-175 clips per action
- Action categories: 1) Human-Object Interaction 2) Body-Motion Only 3) Human-Human Interaction 4) Playing Musical Instruments 5) Sports

9 HMDB-51 Dataset
- 51 actions, 6,849 clips, at least 101 clips per action
- Action categories: 1) Facial actions with object manipulation 2) General body movements 3) Body movements with object interaction 4) Body movements for human interaction

10 Training environment
- Adaptation of the AlexNet training procedure
- Mini-batch (256) stochastic gradient descent with momentum 0.9; scheduled decrease of the learning rate (see the sketch below)
- Optical flow is pre-computed before training
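A minimal PyTorch sketch of this recipe (my own): the initial learning rate, the decay milestones, and the feature dimensions are assumptions, not values taken from the slide.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 101)  # stand-in for the full ConvNet
# Mini-batch SGD with momentum 0.9, as on the slide; lr values are assumed.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50_000, 70_000], gamma=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(80_000):  # each step draws a mini-batch of 256 samples
    x = torch.randn(256, 512)          # stand-in for pre-computed flow inputs
    y = torch.randint(0, 101, (256,))
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    sched.step()  # scheduled decrease of the learning rate
```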

11 Test settings
- Given an input video, sample a fixed number of frames (25).
- Obtain 10 ConvNet inputs per frame by cropping the four corners and the center and flipping horizontally.
- The output is a weighted fusion of the softmax predictions of the two ConvNets (see the sketch below).
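A minimal sketch of the fusion step (my own): `spatial_net` and `temporal_net` are the streams from the earlier sketch, and the 0.5 fusion weight is an assumption; the paper also considers fusing with an SVM on the stacked softmax scores.

```python
import torch

@torch.no_grad()
def stream_probs(net, inputs):
    """inputs: (25 frames x 10 crops, C, 224, 224) for one video.
    Returns class probabilities averaged over all sampled inputs."""
    return net(inputs).softmax(dim=1).mean(dim=0)

def two_stream_predict(spatial_net, temporal_net, rgb_crops, flow_crops, w_temporal=0.5):
    p_spatial = stream_probs(spatial_net, rgb_crops)
    p_temporal = stream_probs(temporal_net, flow_crops)
    fused = (1 - w_temporal) * p_spatial + w_temporal * p_temporal  # weighted fusion
    return int(fused.argmax())  # predicted action class
```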

12 Evaluation

13 Evaluation (Cont.)

14 Thank you!

