1
Learning realistic human actions from movies
By Abhinandh Palicherla, Divya Akuthota, Samish Chandra Kolli
2
Introduction
- Addresses recognition of natural human actions in diverse and realistic video settings
- Addresses the limitations of current work (lack of realistic, annotated video datasets)
3
Visual recognition has progressed from classifying toy objects towards recognizing classes of objects and scenes in natural images. Existing datasets for human action recognition, however, provide samples for only a few action classes.
4
To address these limitations, we implement:
- Automatic annotation of human actions (manual annotation is difficult)
- Video classification for action recognition
5
Automatic annotation of human actions
- Alignment of actions in scripts and videos
- Text retrieval of human actions
- Video datasets for human actions
6
Alignment of actions in scripts and videos

Subtitles (with time information):
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me? Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard. Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends knew about our marriage.

Movie script (no time information):
RICK: Why weren't you honest with me? Why did you keep your marriage a secret?
(Rick sits down with Ilsa.)
ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

- Scripts are available for >500 movies, with no time synchronization (www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, ...)
- Subtitles (with time information) are available for most movies
- Timestamps can be transferred to scripts by text alignment (see the sketch below)
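A minimal sketch of that time-transfer idea, not the authors' exact method: the paper aligns script and subtitle text by matching words, while here a simpler per-line fuzzy match (difflib) illustrates how timestamps move from subtitles onto script dialogue.

```python
import re
import difflib

def parse_srt(srt_text):
    """Parse SRT subtitle text into (start, end, text) tuples."""
    entries = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3:
            m = re.match(r"(\S+) --> (\S+)", lines[1])  # timing line of an SRT block
            if m:
                entries.append((m.group(1), m.group(2), " ".join(lines[2:])))
    return entries

def align_script_lines(script_lines, subtitles, min_ratio=0.5):
    """Give each script dialogue line the time span of its best-matching subtitle."""
    aligned = []
    for line in script_lines:
        ratios = [difflib.SequenceMatcher(None, line.lower(), s[2].lower()).ratio()
                  for s in subtitles]
        best = max(range(len(subtitles)), key=ratios.__getitem__)
        if ratios[best] >= min_ratio:          # keep only confident matches
            start, end, _ = subtitles[best]
            aligned.append((start, end, line))
    return aligned
```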
7
Script alignment: Evaluation
- Annotate action samples in the script text
- Do automatic script-to-video alignment
- Check the correspondence of actions in scripts and movies
- a: quality of subtitle-script matching
Example of a "visual false positive": "A black car pulls up, two army officers get out."
8
Text Retrieval of human actions
Large variation of action expressions in text, e.g., for the GetOutCar action:
"… Will gets out of the Chevrolet. …"
"… Erin exits her new truck …"
Potential false positives:
"… About to sit down, he freezes …"
=> Supervised text classification approach (sketched below)
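A minimal sketch of that supervised step, using scikit-learn's TF-IDF features and a linear SVM as a stand-in for the paper's text classifier; the example sentences and labels are illustrative, not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "Will gets out of the Chevrolet.",   # positive: GetOutCar
    "Erin exits her new truck.",         # positive: GetOutCar
    "About to sit down, he freezes.",    # negative: potential false positive
    "She pours a cup of coffee.",        # negative
]
labels = [1, 1, 0, 0]                    # 1 = GetOutCar action described, 0 = not

# Bag-of-words features (word uni/bigrams) + linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)

# Classify a new script sentence
print(clf.predict(["He climbs out of the car."]))
```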
9
Video Datasets for Human actions
- Training action samples drawn from 12 movies; test set from 20 different movies
- Learn a vision-based classifier from the automatic training set
- Compare its performance to the manual training set
10
Video Classification for Action Recognition
11
Space-Time Features
- Good performance for action recognition
- Compact, and tolerant to background clutter, occlusions, and scale changes
12
Interest Point Detection
- Harris operator with a space-time extension
- We use multiple levels of spatio-temporal scales: σ = 2^((1+i)/2), i = 1, …, 6 and τ = 2^(j/2), j = 1, 2

I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.
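A rough sketch of the space-time Harris idea, not the reference implementation: smooth the video, form the 3x3 second-moment matrix of spatio-temporal gradients, and keep local maxima of the Harris response. The (T, H, W) video layout, the integration-scale factor s, and the constant k and threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

# Detection scales from the slide
sigmas = [2 ** ((1 + i) / 2) for i in range(1, 7)]  # spatial
taus = [2 ** (j / 2) for j in (1, 2)]               # temporal

def harris3d(video, sigma, tau, s=2.0, k=0.005, thresh=1e-8):
    """Space-time Harris detector on a (T, H, W) float video (sketch)."""
    L = gaussian_filter(video, sigma=(tau, sigma, sigma))  # smooth in t, y, x
    Lt, Ly, Lx = np.gradient(L)                            # spatio-temporal gradients
    g = lambda a: gaussian_filter(a, sigma=(s * tau, s * sigma, s * sigma))
    # Entries of the 3x3 second-moment matrix, integrated at scale s
    xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt)
    xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
    det = (xx * (yy * tt - yt ** 2)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    H = det - k * (xx + yy + tt) ** 3                      # Harris response
    peaks = (H == maximum_filter(H, size=5)) & (H > thresh)
    return np.argwhere(peaks)                              # (t, y, x) interest points
```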
14
Descriptors
- Compute histogram descriptors of the space-time volume around each interest point
- The volume size (Δx, Δy, Δt) is related to the detection scales by Δx, Δy = 2kσ and Δt = 2kτ
- Each volume is divided into an (nx, ny, nt) grid of cuboids
- We use k = 9, nx = ny = 3, nt = 2
15
… contd.
- For each cuboid, we compute HoG (histograms of oriented gradients) and HoF (histograms of optic flow) descriptors, as sketched below
- Very similar to SIFT descriptors, adapted to the third (temporal) dimension
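A minimal sketch of the HoG half of the descriptor, assuming a (T, Y, X) volume already cut out around an interest point; the (nt, ny, nx) = (2, 3, 3) grid follows the slide, while the 4 orientation bins and normalization are illustrative. HoF is analogous, with optic-flow vectors in place of spatial gradients.

```python
import numpy as np

def hog_volume(volume, n_grid=(2, 3, 3), n_bins=4):
    """HoG descriptor for one (T, Y, X) space-time volume (sketch)."""
    _, gy, gx = np.gradient(volume.astype(float))       # spatial gradients per frame
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)         # gradient orientation
    nt, ny, nx = n_grid
    T, Y, X = volume.shape
    hists = []
    for it in range(nt):                                # one histogram per cuboid
        for iy in range(ny):
            for ix in range(nx):
                cub = (slice(it * T // nt, (it + 1) * T // nt),
                       slice(iy * Y // ny, (iy + 1) * Y // ny),
                       slice(ix * X // nx, (ix + 1) * X // nx))
                h, _ = np.histogram(ang[cub], bins=n_bins,
                                    range=(0, 2 * np.pi), weights=mag[cub])
                hists.append(h)
    v = np.concatenate(hists)                           # 2*3*3*4 = 72 dimensions
    return v / (np.linalg.norm(v) + 1e-8)
```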
16
Spatio-Temporal BoF
- Construct a visual vocabulary using k-means, with k = 4000 (just like what we do in hw3)
- Assign each feature to one (nearest) visual word
- Compute a frequency histogram over the entire video, or over a subsequence defined by a spatio-temporal grid
- If divided into grids, concatenate the per-cell histograms and normalize (see the sketch below)
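A minimal sketch of the vocabulary and histogram steps, with scikit-learn's KMeans; the random `descriptors` array is a stand-in for real HoG/HoF features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.random((10000, 72))          # stand-in for real 72-dim HoG features

K = 4000                                       # vocabulary size from the slide
vocab = KMeans(n_clusters=K, n_init=1, random_state=0).fit(descriptors)

def bof_histogram(feats):
    """Normalized visual-word frequency histogram for one video (or grid cell)."""
    words = vocab.predict(feats)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / (hist.sum() + 1e-8)

video_hist = bof_histogram(descriptors[:500])  # features from one "video"
```

For a spatio-temporal grid, the same histogram is computed over the features falling in each cell, and the cell histograms are concatenated and normalized.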
18
Grids
- We subdivide both the spatial and the temporal dimension
- Spatial grids: 1x1, 2x2, 3x3, v1x3 (vertical), h3x1 (horizontal), o2x2 (overlapping)
- Temporal grids: t1, t2, t3, ot2 (overlapping)
- 6 × 4 = 24 possible grid combinations
- Descriptor + grid = channel (enumerated below)
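With the two descriptor types (HoG, HoF), this yields 48 candidate channels; a trivial enumeration:

```python
from itertools import product

descriptor_types = ["hog", "hof"]
spatial_grids = ["1x1", "2x2", "3x3", "v1x3", "h3x1", "o2x2"]
temporal_grids = ["t1", "t2", "t3", "ot2"]

channels = list(product(descriptor_types, spatial_grids, temporal_grids))
print(len(channels))  # 2 descriptors x 24 grid combinations = 48 channels
```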
19
Non-Linear SVM
- Classification using a non-linear SVM with a multi-channel Gaussian kernel: K(H_i, H_j) = exp(−Σ_c (1/A_c) D_c(H_i, H_j)), where D_c is the χ² distance between channel-c histograms, D(H_i, H_j) = (1/2) Σ_{n=1}^{V} (h_in − h_jn)² / (h_in + h_jn)
- V = vocabulary size; A_c = mean distance between all training samples for channel c
- The best set of channels for a training set is found by a greedy approach
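A sketch of that kernel, assuming per-channel histograms are stored in dicts mapping channel name to an (N, V) array; the commented usage shows how a precomputed kernel plugs into scikit-learn's SVC.

```python
import numpy as np

def chi2(P, Q):
    """Pairwise chi-square distances between histogram sets P (N, V) and Q (M, V)."""
    P, Q = P[:, None, :], Q[None, :, :]
    return 0.5 * np.sum((P - Q) ** 2 / (P + Q + 1e-10), axis=-1)

def multichannel_kernel(train, other, channels):
    """K(Hi, Hj) = exp(-sum_c D_c(Hi, Hj) / A_c), with A_c from the training set."""
    total = 0.0
    for c in channels:
        A = chi2(train[c], train[c]).mean()   # mean train-train distance, channel c
        total = total + chi2(other[c], train[c]) / A
    return np.exp(-total)

# Usage with a precomputed-kernel SVM (assuming `labels` for the training videos):
# from sklearn.svm import SVC
# svm = SVC(kernel="precomputed").fit(multichannel_kernel(train, train, chs), labels)
# scores = svm.decision_function(multichannel_kernel(train, test, chs))
```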
20
What Channels to Use?
- Channels may complement each other
- A greedy approach picks the best combination
- Combining channels is more advantageous than using any single channel alone
Table: Classification performance of different channels and their combinations
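One plausible form of that greedy search (forward selection on validation accuracy); `evaluate` is a hypothetical callback that trains the SVM on a channel subset and returns validation accuracy, and the paper's exact add/remove schedule may differ.

```python
def greedy_select(all_channels, evaluate):
    """Greedily add the channel that most improves validation accuracy."""
    selected, best = [], 0.0
    improved = True
    while improved and len(selected) < len(all_channels):
        improved = False
        for c in (c for c in all_channels if c not in selected):
            acc = evaluate(selected + [c])    # train + validate on this subset
            if acc > best:
                best, best_c, improved = acc, c, True
        if improved:
            selected.append(best_c)
    return selected, best
```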
21
Evaluation of Spatio-Temporal Grids
Figure: Number of occurrences of each channel component within the optimized channel combinations, for the KTH action dataset and our manually labeled movie dataset
22
Results with the KTH Dataset
Figure: Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are shown
23
Results with the KTH Dataset
- 2391 sequences, divided into a training/validation set (8+8 people) and a test set (9 people)
- 10-fold cross-validation
Table: Confusion matrix for the KTH actions
24
Robustness to Noise in the Training Data
- Up to p = 0.2, performance decreases insignificantly
- At p = 0.4, performance decreases by around 10%
Figure: Performance of our video classification approach in the presence of wrong labels
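The noise experiment can be simulated by flipping training labels with probability p before training; a minimal sketch for binary labels (a multi-class version would draw a random wrong class instead).

```python
import numpy as np

def corrupt_labels(y, p, seed=0):
    """Flip each binary label in y with probability p (label-noise simulation)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)

# for p in (0.0, 0.1, 0.2, 0.3, 0.4):
#     y_noisy = corrupt_labels(y_train, p)
#     ...train on (X_train, y_noisy), evaluate on the clean test set...
```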
25
Action Recognition in Real-World Videos
Table: Average precision (AP) for each action class of our test set. Results are compared for clean (manually annotated) and automatic training data, and for a random classifier (chance)
26
Action Recognition in Real-World Videos
Figure: Example results for action classification trained on the automatically annotated data. We show key frames from test movies with the highest confidence values for true/false positives/negatives. (Annotations from the figure: the rapid getting up is typical for "GetOutCar"; the false negatives are very difficult to recognize, e.g. an occluded handshake or a hardly visible person getting out of a car.)
27
Conclusions
Summary:
- Automatic generation of realistic action samples
- Transfer of recent bag-of-features experience to videos
- Improved performance on the KTH benchmark
- Decent results for actions in real-world videos
Future directions:
- Improving the script-video alignment
- Experimenting with space-time low-level features
- Internet-scale video search
28
Thank You