A Hierarchical Deep Temporal Model for Group Activity Recognition

1 A Hierarchical Deep Temporal Model for Group Activity Recognition
MSc Thesis Defence
Srikanth Muralidharan
12 April 2016
Good afternoon, and welcome to my thesis talk. I am going to present our work on group activity recognition using a hierarchical deep temporal model.

2 Outline
Part I : Introduction to Group Activity
Part II : Description of the Model
Part III : Experimental Results and Conclusion

3 Part I : Introduction to Group Activity
Part II : Description of the Model
Part III : Experimental Results and Conclusion

4 Preview – Action Recognition
Walking

5 Action Recognition Datasets : A brief overview
2004 KTH dataset: 6 classes
2010 Olympic sports dataset: 16 classes
2014 Youtube 1M dataset: 480+ classes

6 Summary-Action Recognition
Task: predict what a single person is doing
Difficulty: intraclass variations
Difficulty: unconstrained nature of videos

7 Example : A surveillance scene
We consider two types of scenarios. The first is a surveillance scene. In this example, most of the people are walking on a sidewalk, so this video could be labelled as a walking scene.

8 It’s a walking scene.
[Figure labels: Walking, Walking, Walking, Walking, Walking, Standing]

9 Example: Rally in a Volleyball Scene
The second example is a rally in a volleyball scene. Here, the high-level activity is determined by the main activity taking place, i.e. a player on the left side is spiking. Therefore, we could label this scene as left_spike.

10 Left Spike
[Figure labels: Spiking, Waiting, Waiting, Standing, Waiting, Waiting, Moving]

11 Challenge 1 – Context Dependency
Group Activity = Majority’s Activity
Group Activity = Key Player’s Activity
Group Activity – Right spike
Challenge 2 - high level description
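
The two labelling rules on this slide can be sketched in a few lines of Python. This is only an illustration of the context dependency, not the model from the talk: the function names, the `KEY_ACTIONS` mapping, and the per-person `sides` list are hypothetical.

```python
from collections import Counter

# Hypothetical mapping from a key person action to a scene-label suffix.
KEY_ACTIONS = {"spiking": "spike"}

def majority_label(person_actions):
    """Surveillance-style rule: the group activity is the most common action."""
    return Counter(person_actions).most_common(1)[0][0]

def key_player_label(person_actions, person_sides):
    """Volleyball-style rule: the group activity follows the key player's action.

    Returns None when no person performs a key action.
    """
    for action, side in zip(person_actions, person_sides):
        if action in KEY_ACTIONS:
            return f"{side}_{KEY_ACTIONS[action]}"
    return None

# Surveillance example: five people walking, one standing.
walk_scene = ["walking"] * 5 + ["standing"]
print(majority_label(walk_scene))

# Volleyball example: one spiker on the left; sides are assumed inputs.
actions = ["spiking", "waiting", "waiting", "standing", "waiting", "waiting", "moving"]
sides = ["left", "left", "left", "left", "right", "right", "right"]
print(key_player_label(actions, sides))
```

A majority vote over the volleyball actions would return "waiting", which is why a single fixed rule does not work across both kinds of scene.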

12 Group Activity Recognition vs Action Recognition
Walking

13 It’s hard!
[Figure labels: Image, Classifier, Group activity label]
Be careful with the description!

14 Intuitive fix: Use only the foreground features
Therefore, the intuitive fix is to use only the features obtained from the foreground.

15 Group Activity – ????
[Figure labels: Person classifier outputs Digging, Waiting, Waiting, Spiking, Waiting]
We cut out all the people and extract their feature representations.

16 Possible Solution - Hierarchical model
Stage 1 - Person feature extractor
[Figure labels: Digging, Waiting, Waiting, Spiking, Waiting]
Pool person features

17 Possible Solution - Hierarchical model
Stage 2: Frame Classifier
Input: pooled person features
Output: group activity
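
A minimal sketch of the two stages, assuming element-wise max pooling over people and a toy linear frame classifier. The slides only say "pool person features", so the pooling operator, the feature values, and the class weights below are all illustrative assumptions.

```python
def max_pool(person_features):
    """Element-wise max over a list of equal-length person feature vectors."""
    return [max(dim) for dim in zip(*person_features)]

def classify_frame(pooled, class_weights):
    """Toy linear frame classifier: pick the class with the highest dot product."""
    scores = {
        label: sum(w * x for w, x in zip(weights, pooled))
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get)

# Stage 1 output: one (made-up) feature vector per detected person.
person_features = [
    [0.1, 0.9, 0.0],
    [0.7, 0.2, 0.3],
]

# Stage 2: pool over people, then classify the frame.
pooled = max_pool(person_features)
weights = {"left_spike": [1.0, 0.0, 0.0], "right_spike": [0.0, 0.0, 1.0]}  # hypothetical
print(pooled, classify_frame(pooled, weights))
```

Pooling makes the frame representation independent of how many people are in the scene and of the order in which they are listed.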

18 Part I : Introduction to Group Activity
Part II : Description of the Model
Part III : Experimental Results and Conclusion

19 Pipeline Overview
Learn People Representations
Aggregate People Representations
Learn Group Representations

20 From images to video clips
Given the person-level annotations, we track each person and assign the same label to every frame of the track.
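
The label propagation described here can be sketched as follows. The function name and the representation of a track as a list of frame indices are assumptions made for illustration.

```python
def propagate_label(track_frames, annotated_frame, label):
    """Assign the annotated frame's label to every frame of one person's track."""
    if annotated_frame not in track_frames:
        raise ValueError("the annotated frame must be part of the track")
    return {frame: label for frame in track_frames}

# A person annotated as "spiking" in frame 11, tracked over frames 10 to 12.
print(propagate_label([10, 11, 12], 11, "spiking"))
```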

21 LSTM – An Introduction
Stands for Long Short-Term Memory
A sequential (recurrent) neural network that learns from inputs of arbitrary length
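
To make the "arbitrary length inputs" point concrete, here is a minimal scalar LSTM cell in plain Python using the standard gate equations. The parameter names and the constant weights in `PARAMS` are placeholders, not trained values, and the talk's models use vector-valued states rather than scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One step of a scalar LSTM cell: returns the new hidden and cell state."""
    i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate cell value
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

def run_lstm(xs, p):
    """Run the cell over a sequence of any length; one output per time step."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs

# Placeholder parameters: every weight and bias set to 0.5.
PARAMS = {name: 0.5 for name in (
    "wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")}

print(len(run_lstm([1.0, 2.0, 3.0], PARAMS)))  # three inputs, three outputs
print(len(run_lstm([0.1] * 7, PARAMS)))        # same cell, longer sequence
```

The same parameters are reused at every step, which is what lets one cell process sequences of different lengths.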

22 LSTM – An Introduction
[Figure: an LSTM unrolled over time, with one output per step; the last input is labelled x(t=T)]

23 We use LSTMs to build the person classification model and to extract person features
We construct an LSTM-based frame classifier on top of the pooled LSTM features

24 Stage1 : Learning Individual Activity Features
[Figure: at each time step, AlexNet -> LSTM -> Softmax]

25 Stage1 : Learning Individual Activity Features
[Figure: for each person 1 to n, an LSTM produces that person's feature representation]

26 Stage 2: Learning Frame Representations

27 Part I : Introduction to Group Activity
Part II : Description of the Model
Part III : Experimental Results and Conclusion

28 Tracker details
We obtain 10-frame video clips: 5 frames before and 4 frames after an annotated frame
We use LSTMs with a batch size of 10 video clips
No annotations for the tracked frames - use of unlabelled data
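
The clip window works out as 5 + 1 + 4 = 10 frames. A small sketch, assuming frames are addressed by integer index and ignoring video boundaries (the function name is hypothetical):

```python
def clip_frame_indices(annotated_frame, before=5, after=4):
    """Indices of the clip around an annotated frame, annotated frame included."""
    return list(range(annotated_frame - before, annotated_frame + after + 1))

clip = clip_frame_indices(100)
print(clip)       # frames 95 through 104
print(len(clip))  # 10
```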

29 Collective Activity Dataset
Same label set for people and group activities
1925 video clips for training, 638 video clips for testing
Labels: 1. Crossing 2. Queueing 3. Talking 4. Waiting 5. Walking

30 Experimental results on Collective Activity Dataset
Method                  Accuracy
Image Classification    63.0
Person Classification   61.8
Person - Fine tuned     66.3
Temp Model - Person     62.2
Temp Model - Image      64.2
Our Model               81.5

31 Experimental results on Collective Activity Dataset
Method                                      Accuracy
Contextual Model [Lan NIPS’10]              79.1
Deep Structured Model [Deng BMVC’15]        80.6
Our Model                                   81.5
Cardinality Kernel [Hajimirsadeghi CVPR’15] 83.4

32 Volleyball Dataset – Frame Labels
1047 images for training, 478 images for testing
Labels: 1. Spiking 2. Setting 3. Passing

33 Volleyball Dataset – People Labels
1047 images for training, 478 images for testing
Labels: 1. Waiting 2. Digging 3. Setting 4. Spiking 5. Falling 6. Blocking

34 Experimental results on Volleyball Dataset
Method                  Accuracy
Image Classification    46.7
Person Classification   33.1
Person - Fine tuned     35.2
Temp Model - Person     45.9
Temp Model - Image      37.4
Our Model               51.1

36 Visualization of results
[Figure labels: Left set; Right pass; Right spike; Left pass; Left spike (Left pass); Right spike (Left spike)]

37 Conclusion
A two-stage hierarchical model for group activity recognition
LSTMs as a highly effective temporal model and temporal feature source
Decent people-relation modeling with simple pooling

38 Future Work
Semi-supervised approaches to diversify the new datasets
Experiments under a weakly supervised setting

39 THANK YOU

