1
Learning realistic human actions from movies
By Abhinandh Palicherla, Divya Akuthota, Samish Chandra Kolli
2
Introduction
- Addresses recognition of natural human actions in diverse and realistic video settings
- Addresses the limitations of current work (lack of realistic, annotated video datasets)
3
Visual recognition has progressed from classifying toy objects towards recognizing classes of objects and scenes in natural images. Existing datasets for human action recognition, however, provide samples for only a few action classes.
4
To address these limitations, we implement:
- Automatic annotation of human actions (manual annotation is difficult)
- Video classification for action recognition
5
Automatic annotation of human actions
- Alignment of actions in scripts and videos
- Text retrieval of human actions
- Video datasets for human actions
6
Alignment of actions in scripts and videos

Subtitles (with time information):
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me? Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard. Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends knew about our marriage.

Movie script (no time information):
RICK: Why weren't you honest with me? Why did you keep your marriage a secret?
(Rick sits down with Ilsa.)
ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

- Scripts are available for >500 movies, with no time synchronization (www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, ...)
- Subtitles (with time information) are available for most movies
- Timestamps can be transferred to scripts by text alignment (see the sketch below)
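A minimal sketch of that time-transfer idea, not the authors' exact method: the paper aligns script and subtitle text by matching words, while here a simpler per-line fuzzy match (difflib) illustrates how timestamps move from subtitles onto script dialogue.

```python
import re
import difflib

def parse_srt(srt_text):
    """Parse SRT subtitle text into (start, end, text) tuples."""
    entries = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3:
            m = re.match(r"(\S+) --> (\S+)", lines[1])  # timing line of an SRT block
            if m:
                entries.append((m.group(1), m.group(2), " ".join(lines[2:])))
    return entries

def align_script_lines(script_lines, subtitles, min_ratio=0.5):
    """Give each script dialogue line the time span of its best-matching subtitle."""
    aligned = []
    for line in script_lines:
        ratios = [difflib.SequenceMatcher(None, line.lower(), s[2].lower()).ratio()
                  for s in subtitles]
        best = max(range(len(subtitles)), key=ratios.__getitem__)
        if ratios[best] >= min_ratio:          # keep only confident matches
            start, end, _ = subtitles[best]
            aligned.append((start, end, line))
    return aligned
```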
7
Script alignment: Evaluation
- Annotate action samples in the script text
- Do automatic script-to-video alignment
- Check the correspondence of actions in scripts and movies
- a: quality of subtitle-script matching
Example of a "visual false positive": "A black car pulls up, two army officers get out."
8
Text Retrieval of human actions
Large variation of action expressions in text, e.g., for the GetOutCar action:
"… Will gets out of the Chevrolet. …"
"… Erin exits her new truck …"
Potential false positives:
"… About to sit down, he freezes …"
=> Supervised text classification approach (sketched below)
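A minimal sketch of that supervised step, using scikit-learn's TF-IDF features and a linear SVM as a stand-in for the paper's text classifier; the example sentences and labels are illustrative, not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "Will gets out of the Chevrolet.",   # positive: GetOutCar
    "Erin exits her new truck.",         # positive: GetOutCar
    "About to sit down, he freezes.",    # negative: potential false positive
    "She pours a cup of coffee.",        # negative
]
labels = [1, 1, 0, 0]                    # 1 = GetOutCar action described, 0 = not

# Bag-of-words features (word uni/bigrams) + linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)

# Classify a new script sentence
print(clf.predict(["He climbs out of the car."]))
```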
9
Video Datasets for Human actions
- Training action samples drawn from 12 movies; test set from 20 different movies
- Learn a vision-based classifier from the automatic training set
- Compare its performance to the manual training set
10
Video Classification for Action Recognition
11
Space-Time Features
- Good performance for action recognition
- Compact, and tolerant to background clutter, occlusions, and scale changes
12
Interest Point Detection
- Harris operator with a space-time extension
- We use multiple levels of spatio-temporal scales: σ = 2^((1+i)/2), i = 1, …, 6 and τ = 2^(j/2), j = 1, 2

I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.
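A rough sketch of the space-time Harris idea, not the reference implementation: smooth the video, form the 3x3 second-moment matrix of spatio-temporal gradients, and keep local maxima of the Harris response. The (T, H, W) video layout, the integration-scale factor s, and the constant k and threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

# Detection scales from the slide
sigmas = [2 ** ((1 + i) / 2) for i in range(1, 7)]  # spatial
taus = [2 ** (j / 2) for j in (1, 2)]               # temporal

def harris3d(video, sigma, tau, s=2.0, k=0.005, thresh=1e-8):
    """Space-time Harris detector on a (T, H, W) float video (sketch)."""
    L = gaussian_filter(video, sigma=(tau, sigma, sigma))  # smooth in t, y, x
    Lt, Ly, Lx = np.gradient(L)                            # spatio-temporal gradients
    g = lambda a: gaussian_filter(a, sigma=(s * tau, s * sigma, s * sigma))
    # Entries of the 3x3 second-moment matrix, integrated at scale s
    xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt)
    xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
    det = (xx * (yy * tt - yt ** 2)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    H = det - k * (xx + yy + tt) ** 3                      # Harris response
    peaks = (H == maximum_filter(H, size=5)) & (H > thresh)
    return np.argwhere(peaks)                              # (t, y, x) interest points
```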
14
Descriptors
- Compute histogram descriptors of the space-time volume around each interest point
- The volume size (Δx, Δy, Δt) is related to the detection scales by Δx, Δy = 2kσ and Δt = 2kτ
- Each volume is divided into an (nx, ny, nt) grid of cuboids
- We use k = 9, nx = ny = 3, nt = 2
15
… contd.
- For each cuboid, we compute HoG (histograms of oriented gradients) and HoF (histograms of optic flow) descriptors, as sketched below
- Very similar to SIFT descriptors, adapted to the third (temporal) dimension
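A minimal sketch of the HoG half of the descriptor, assuming a (T, Y, X) volume already cut out around an interest point; the (nt, ny, nx) = (2, 3, 3) grid follows the slide, while the 4 orientation bins and normalization are illustrative. HoF is analogous, with optic-flow vectors in place of spatial gradients.

```python
import numpy as np

def hog_volume(volume, n_grid=(2, 3, 3), n_bins=4):
    """HoG descriptor for one (T, Y, X) space-time volume (sketch)."""
    _, gy, gx = np.gradient(volume.astype(float))       # spatial gradients per frame
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)         # gradient orientation
    nt, ny, nx = n_grid
    T, Y, X = volume.shape
    hists = []
    for it in range(nt):                                # one histogram per cuboid
        for iy in range(ny):
            for ix in range(nx):
                cub = (slice(it * T // nt, (it + 1) * T // nt),
                       slice(iy * Y // ny, (iy + 1) * Y // ny),
                       slice(ix * X // nx, (ix + 1) * X // nx))
                h, _ = np.histogram(ang[cub], bins=n_bins,
                                    range=(0, 2 * np.pi), weights=mag[cub])
                hists.append(h)
    v = np.concatenate(hists)                           # 2*3*3*4 = 72 dimensions
    return v / (np.linalg.norm(v) + 1e-8)
```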
16
Spatio-Temporal BoF
- Construct a visual vocabulary using k-means, with k = 4000 (just like what we do in hw3)
- Assign each feature to one (nearest) visual word
- Compute a frequency histogram over the entire video, or over a subsequence defined by a spatio-temporal grid
- If divided into grids, concatenate the per-cell histograms and normalize (see the sketch below)
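A minimal sketch of the vocabulary and histogram steps, with scikit-learn's KMeans; the random `descriptors` array is a stand-in for real HoG/HoF features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.random((10000, 72))          # stand-in for real 72-dim HoG features

K = 4000                                       # vocabulary size from the slide
vocab = KMeans(n_clusters=K, n_init=1, random_state=0).fit(descriptors)

def bof_histogram(feats):
    """Normalized visual-word frequency histogram for one video (or grid cell)."""
    words = vocab.predict(feats)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / (hist.sum() + 1e-8)

video_hist = bof_histogram(descriptors[:500])  # features from one "video"
```

For a spatio-temporal grid, the same histogram is computed over the features falling in each cell, and the cell histograms are concatenated and normalized.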
18
Grids
- We subdivide both the spatial and the temporal dimension
- Spatial grids: 1x1, 2x2, 3x3, v1x3 (vertical), h3x1 (horizontal), o2x2 (overlapping)
- Temporal grids: t1, t2, t3, ot2 (overlapping)
- 6 × 4 = 24 possible grid combinations
- Descriptor + grid = channel (enumerated below)
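With the two descriptor types (HoG, HoF), this yields 48 candidate channels; a trivial enumeration:

```python
from itertools import product

descriptor_types = ["hog", "hof"]
spatial_grids = ["1x1", "2x2", "3x3", "v1x3", "h3x1", "o2x2"]
temporal_grids = ["t1", "t2", "t3", "ot2"]

channels = list(product(descriptor_types, spatial_grids, temporal_grids))
print(len(channels))  # 2 descriptors x 24 grid combinations = 48 channels
```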
19
Non-Linear SVM
- Classification using a non-linear SVM with a multi-channel Gaussian kernel: K(H_i, H_j) = exp(−Σ_c (1/A_c) D_c(H_i, H_j)), where D_c is the χ² distance between channel-c histograms, D(H_i, H_j) = (1/2) Σ_{n=1}^{V} (h_in − h_jn)² / (h_in + h_jn)
- V = vocabulary size; A_c = mean distance between all training samples for channel c
- The best set of channels for a training set is found by a greedy approach
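A sketch of that kernel, assuming per-channel histograms are stored in dicts mapping channel name to an (N, V) array; the commented usage shows how a precomputed kernel plugs into scikit-learn's SVC.

```python
import numpy as np

def chi2(P, Q):
    """Pairwise chi-square distances between histogram sets P (N, V) and Q (M, V)."""
    P, Q = P[:, None, :], Q[None, :, :]
    return 0.5 * np.sum((P - Q) ** 2 / (P + Q + 1e-10), axis=-1)

def multichannel_kernel(train, other, channels):
    """K(Hi, Hj) = exp(-sum_c D_c(Hi, Hj) / A_c), with A_c from the training set."""
    total = 0.0
    for c in channels:
        A = chi2(train[c], train[c]).mean()   # mean train-train distance, channel c
        total = total + chi2(other[c], train[c]) / A
    return np.exp(-total)

# Usage with a precomputed-kernel SVM (assuming `labels` for the training videos):
# from sklearn.svm import SVC
# svm = SVC(kernel="precomputed").fit(multichannel_kernel(train, train, chs), labels)
# scores = svm.decision_function(multichannel_kernel(train, test, chs))
```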
20
What Channels to Use?
- Channels may complement each other
- A greedy approach picks the best combination
- Combining channels is more advantageous than using any single channel alone
Table: Classification performance of different channels and their combinations
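One plausible form of that greedy search (forward selection on validation accuracy); `evaluate` is a hypothetical callback that trains the SVM on a channel subset and returns validation accuracy, and the paper's exact add/remove schedule may differ.

```python
def greedy_select(all_channels, evaluate):
    """Greedily add the channel that most improves validation accuracy."""
    selected, best = [], 0.0
    improved = True
    while improved and len(selected) < len(all_channels):
        improved = False
        for c in (c for c in all_channels if c not in selected):
            acc = evaluate(selected + [c])    # train + validate on this subset
            if acc > best:
                best, best_c, improved = acc, c, True
        if improved:
            selected.append(best_c)
    return selected, best
```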
21
Evaluation of Spatio-Temporal Grids
Figure: Number of occurrences of each channel component within the optimized channel combinations, for the KTH action dataset and our manually labeled movie dataset
22
Results with the KTH Dataset
Figure: Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are shown
23
Results with the KTH Dataset
- 2391 sequences, divided into a training/validation set (8+8 people) and a test set (9 people)
- 10-fold cross-validation
Table: Confusion matrix for the KTH actions
24
Robustness to Noise in the Training Data
- Up to p = 0.2, performance decreases insignificantly
- At p = 0.4, performance decreases by around 10%
Figure: Performance of our video classification approach in the presence of wrong labels
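The noise experiment can be simulated by flipping training labels with probability p before training; a minimal sketch for binary labels (a multi-class version would draw a random wrong class instead).

```python
import numpy as np

def corrupt_labels(y, p, seed=0):
    """Flip each binary label in y with probability p (label-noise simulation)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)

# for p in (0.0, 0.1, 0.2, 0.3, 0.4):
#     y_noisy = corrupt_labels(y_train, p)
#     ...train on (X_train, y_noisy), evaluate on the clean test set...
```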
25
Action Recognition in Real-World Videos
Table: Average precision (AP) for each action class of our test set. Results are compared for clean (manually annotated) and automatic training data, and for a random classifier (chance)
26
Action Recognition in Real-World Videos
Figure: Example results for action classification trained on the automatically annotated data. We show key frames from test movies with the highest confidence values for true/false positives/negatives. (Annotations from the figure: the rapid getting up is typical for "GetOutCar"; the false negatives are very difficult to recognize, e.g. an occluded handshake or a hardly visible person getting out of a car.)
27
Conclusions
Summary:
- Automatic generation of realistic action samples
- Transfer of recent bag-of-features experience to videos
- Improved performance on the KTH benchmark
- Decent results for actions in real-world videos
Future directions:
- Improving the script-video alignment
- Experimenting with space-time low-level features
- Internet-scale video search
28
Thank You