Published by Monica Underwood. Modified over 8 years ago.
1
A Hierarchical Deep Temporal Model for Group Activity Recognition
MSc Thesis Defence
Srikanth Muralidharan
12 April 2016
Good afternoon. Welcome to my thesis talk. I am going to present our work on group activity recognition using a hierarchical deep temporal model.
2
Outline
Part I: Introduction to Group Activity
Part II: Description of the Model
Part III: Experimental Results and Conclusion
3
Part I: Introduction to Group Activity
Part II: Description of the Model
Part III: Experimental Results and Conclusion
4
Preview – Action Recognition
Walking
5
Action Recognition Datasets : A brief overview
2004 KTH dataset: 6 classes
2010 Olympic sports dataset: 16 classes
2014 YouTube 1M dataset: 480+ classes
6
Summary – Action Recognition
Task: Predict what a single person is doing
Difficulty – intraclass variations
Difficulty – unconstrained nature of videos
7
Example : A surveillance scene
We consider two types of scenarios. The first is a surveillance scene. In this example, most of the people are walking on a sidewalk, so this video could be labelled as a walking scene.
8
It’s a walking scene.
Person labels: Walking, Walking, Walking, Walking, Walking, Standing
9
Example: Rally in a Volleyball Scene
The second example is a rally in a volleyball scene. Here, the high-level activity is determined by the main activity taking place, i.e. a player on the left side spiking. Therefore, we could label this scene as left_spike.
10
Group activity: Left Spike
Person labels: Spiking, Waiting, Waiting, Standing, Waiting, Waiting, Moving
11
Challenge 1 – Context Dependency
Group Activity = Majority’s Activity
Group Activity = Key Player’s Activity
Group Activity – Right spike
Challenge 2 – High-level description
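The two rules on this slide can be contrasted with a small sketch. It applies a plain majority vote to the person labels shown on the two example slides; the helper name `majority_label` is hypothetical and is not part of the presented model. The vote recovers the surveillance label but not the volleyball one, which is the context dependency the slide points at.

```python
from collections import Counter


def majority_label(person_actions):
    """Return the most common individual action in the scene."""
    return Counter(person_actions).most_common(1)[0][0]


# Person labels as listed on the two example slides.
surveillance = ["walking"] * 5 + ["standing"]
volleyball = ["spiking", "waiting", "waiting", "standing",
              "waiting", "waiting", "moving"]

# Majority vote matches the surveillance scene label (walking) ...
print(majority_label(surveillance))
# ... but yields "waiting" for the rally, whose group label is left_spike
# because the key player's action decides it.
print(majority_label(volleyball))
```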
12
Group Activity Recognition vs Action Recognition
Walking
13
It’s hard!
Diagram: Image -> Classifier -> Group activity label
Be careful with the description!
14
Intuitive fix: Use only the foreground features
Therefore, the intuitive fix is to use only the features obtained from the foreground.
15
Group Activity – ????
Person classifier outputs: Waiting, Digging, Waiting, Spiking, Waiting
We cut out all the people and extract their feature representations.
16
Possible Solution - Hierarchical model
Stage 1 – Person feature extractor
Person labels: Digging, Waiting, Waiting, Spiking, Waiting
Pool person features
17
Possible Solution - Hierarchical model
Pooled person features -> Stage 2: Frame Classifier -> Output: Group Activity
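The pooling step between the two stages might look like the sketch below. The slides do not say which pooling operator is used, so element-wise max pooling is an assumption here, and the feature values and the helper name `max_pool` are made up for illustration.

```python
def max_pool(person_features):
    """Element-wise maximum over a list of equal-length feature vectors.

    Assumption: max pooling. The slides only say "pool person features".
    """
    if not person_features:
        raise ValueError("need at least one person feature vector")
    return [max(values) for values in zip(*person_features)]


# Hypothetical 3-dimensional features for three people in one frame.
features = [
    [0.1, 0.9, 0.0],
    [0.4, 0.2, 0.3],
    [0.0, 0.5, 0.8],
]

# One fixed-size vector per frame, whatever the number of people.
pooled = max_pool(features)
print(pooled)
```

Pooling gives the frame classifier a fixed-size input even though the number of people varies from frame to frame.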
18
Part I: Introduction to Group Activity
Part II: Description of the Model
Part III: Experimental Results and Conclusion
19
Pipeline Overview
1. Learn People Representations
2. Aggregate People Representations
3. Learn Group Representations
20
From images to video clips
Given the person-level annotations, we track each person and assign the same label to every frame of the track.
21
LSTM – An Introduction
Stands for Long Short-Term Memory
A sequential neural network that learns from inputs of arbitrary length
22
LSTM – An Introduction
Diagram: a chain of LSTM units unrolled over time, each producing an output; the last input is x(t=T)
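To make the unrolled diagram concrete, here is a minimal sketch of the standard LSTM cell update, run step by step over a sequence. It uses scalar inputs and hypothetical weights for readability; real models, including the one presented here, use vector-valued states and learned weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much old memory to keep
    o = sigmoid(pre("output"))       # how much memory to expose
    g = math.tanh(pre("candidate"))  # candidate memory content
    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state (the output)
    return h, c


def run_lstm(xs, w):
    """Unroll the cell over a sequence of any length; one output per step."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


# Hypothetical weights, identical for every gate.
weights = {gate: (0.5, 0.5, 0.0)
           for gate in ("input", "forget", "output", "candidate")}
outputs = run_lstm([1.0, -1.0, 0.5], weights)
print(len(outputs))  # one output per time step
```

The same `lstm_step` is reused at every time step, which is why the network can process inputs of arbitrary length.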
23
We use LSTMs to build a person classification model and to extract person features
We construct an LSTM-based frame classifier on top of the pooled LSTM features
24
Stage 1: Learning Individual Activity Features
Diagram: at each time step, AlexNet -> LSTM -> Softmax
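The softmax layer at the top of each column turns per-class scores into a probability distribution over person actions. The sketch below shows that last step in isolation; the scores are hypothetical and the class names are the person labels listed on the volleyball dataset slide.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Person action classes from the volleyball dataset slide.
actions = ["waiting", "digging", "setting", "spiking", "falling", "blocking"]
# Hypothetical scores for one person at one time step.
scores = [0.2, 0.1, 0.3, 2.5, -1.0, 0.4]

probs = softmax(scores)
predicted = actions[probs.index(max(probs))]
print(predicted)  # the class with the highest score
```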
25
Stage 1: Learning Individual Activity Features
Person 1 -> LSTM -> Person 1 feature representation
Person 2 -> LSTM -> Person 2 feature representation
Person 3 -> LSTM -> Person 3 feature representation
. . .
Person n -> LSTM -> Person n feature representation
26
Stage 2: Learning Frame Representations
27
Part I: Introduction to Group Activity
Part II: Description of the Model
Part III: Experimental Results and Conclusion
28
Tracker details
We obtain 10-frame video clips: 5 frames before and 4 frames after an annotated frame, plus the annotated frame itself
We use LSTMs with a batch size of 10 video clips
No annotations for the tracked frames – use of unlabelled data
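The clip window described above works out to exactly 10 frames: 5 before, the annotated frame, and 4 after. A minimal sketch of that index arithmetic, with a hypothetical helper name and frame number:

```python
def clip_frame_indices(annotated_frame, before=5, after=4):
    """Frame indices of the clip centred on an annotated frame."""
    return list(range(annotated_frame - before, annotated_frame + after + 1))


# Hypothetical annotated frame number.
clip = clip_frame_indices(100)
print(clip)
print(len(clip))  # 5 before + 1 annotated + 4 after
```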
29
Collective Activity Dataset
Same label set for people and group activities
1925 video clips for training, 638 video clips for testing
1. Crossing
2. Queueing
3. Talking
4. Waiting
5. Walking
30
Experimental results on Collective Activity Dataset
Method                    Accuracy
Image Classification      63.0
Person Classification     61.8
Person - Fine tuned       66.3
Temp Model - Person       62.2
Temp Model - Image        64.2
Our Model                 81.5
31
Experimental results on Collective Activity Dataset
Method                                        Accuracy
Contextual Model [Lan NIPS’10]                79.1
Deep Structured Model [Deng BMVC’15]          80.6
Our Model                                     81.5
Cardinality Kernel [Hajimirsadeghi CVPR’15]   83.4
32
Volleyball Dataset – Frame Labels
1047 images for training, 478 images for testing
1. Spiking
2. Setting
3. Passing
33
Volleyball Dataset – People Labels
1047 images for training, 478 images for testing
1. Waiting
2. Digging
3. Setting
4. Spiking
5. Falling
6. Blocking
34
Experimental results on Volleyball Dataset
Method                    Accuracy
Image Classification      46.7
Person Classification     33.1
Person - Fine tuned       35.2
Temp Model - Person       45.9
Temp Model - Image        37.4
Our Model                 51.1
36
Visualization of results
Example labels: Left set, Right pass, Right spike, Left pass, Left spike (Left pass), Right spike (Left spike)
37
Conclusion
A two-stage hierarchical model for group activity recognition
LSTMs as a highly effective temporal model and temporal feature source
Decent people-relation modeling with simple pooling
38
Future Work
Semi-supervised approaches to diversify the new datasets
Experiments under a weakly supervised setting
39
THANK YOU