Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

1 Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli

2 Introduction. The work addresses recognition of natural human actions in diverse and realistic video settings. It also addresses a limitation of existing work: the lack of realistic, annotated video datasets.

3 Visual recognition has progressed from classifying toy objects towards recognizing classes of objects and scenes in natural images. Existing datasets for human action recognition provide samples for only a few action classes.

4 To address these limitations, we implement: automatic annotation of human actions (manual annotation is difficult); video classification for action recognition.

5 Automatic annotation of human actions: alignment of actions in scripts and videos; text retrieval of human actions; video datasets for human actions.

6 Alignment of actions in scripts and videos
Subtitles:
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me? Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard. Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends knew about our marriage.
…
Movie script:
…
RICK
Why weren't you honest with me? Why did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
…
Time stamps shown between subtitles and script: 01:20:17, 01:20:23
Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com …
Subtitles (with time information) are available for most movies.
Time can be transferred to scripts by text alignment.
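The time transfer on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the word-level similarity from `difflib` stands in for whatever matching the authors used, and the helper names (`to_seconds`, `best_subtitle`) are hypothetical. A scene description is given the interval from the start of the subtitle matched to the speech before it to the end of the subtitle matched to the speech after it.

```python
import difflib
import re

def to_seconds(stamp):
    """Convert an SRT time stamp such as '01:20:17,240' to seconds."""
    hours, minutes, rest = stamp.split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0

def words(text):
    """Lower-cased word list used for matching."""
    return re.findall(r"[a-z']+", text.lower())

def best_subtitle(speech, subtitles):
    """Return (similarity, subtitle) for the single best-matching subtitle."""
    scored = [
        (difflib.SequenceMatcher(None, words(speech), words(sub["text"])).ratio(), sub)
        for sub in subtitles
    ]
    return max(scored, key=lambda pair: pair[0])

# Subtitles and script speeches copied from the slide.
subtitles = [
    {"start": "01:20:17,240", "end": "01:20:20,437",
     "text": "Why weren't you honest with me? Why'd you keep your marriage a secret?"},
    {"start": "01:20:20,640", "end": "01:20:23,598",
     "text": "It wasn't my secret, Richard. Victor wanted it that way."},
    {"start": "01:20:23,800", "end": "01:20:26,189",
     "text": "Not even our closest friends knew about our marriage."},
]
speech_before = "Why weren't you honest with me? Why did you keep your marriage a secret?"
speech_after = ("Oh, it wasn't my secret, Richard. Victor wanted it that way. "
                "Not even our closest friends knew about our marriage.")

# The scene description "Rick sits down with Ilsa." sits between the two speeches.
score_before, sub_before = best_subtitle(speech_before, subtitles)
score_after, sub_after = best_subtitle(speech_after, subtitles)
interval = (to_seconds(sub_before["start"]), to_seconds(sub_after["end"]))
```

With this toy matcher the scene description lands on the interval starting at 01:20:17 and ending at 01:20:23, the two times shown on the slide. The similarity scores could also serve as a rough confidence for the alignment.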

7 Script alignment: evaluation. Annotate action samples in text. Do automatic script-to-video alignment. Check the correspondence of actions in scripts and movies. a: quality of subtitle-script matching. Example of a “visual false positive”: “A black car pulls up, two army officers get out.”

8 Text retrieval of human actions. Large variation of action expressions in text. GetOutCar action: “… Will gets out of the Chevrolet. …”, “… Erin exits her new truck…”. Potential false positives: “…About to sit down, he freezes…”. => Supervised text classification approach.
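A supervised text classifier for this task can be sketched as a bag-of-words perceptron. The slide only says "supervised text classification", so the plain perceptron, the feature choice, and the function names below are assumptions made for illustration. The four training sentences are the ones quoted on slides 6 and 8.

```python
import re

def tokens(sentence):
    """Bag-of-words features: the set of lower-cased words in the sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def predict(sentence, weights, bias):
    """Return +1 (GetOutCar) or -1 (other) for a script sentence."""
    score = bias + sum(weights.get(word, 0) for word in tokens(sentence))
    return 1 if score > 0 else -1

def train_perceptron(samples, epochs=20):
    """Plain perceptron: update weights only on misclassified samples."""
    weights, bias = {}, 0
    for _ in range(epochs):
        mistakes = 0
        for sentence, label in samples:
            if predict(sentence, weights, bias) != label:
                for word in tokens(sentence):
                    weights[word] = weights.get(word, 0) + label
                bias += label
                mistakes += 1
        if mistakes == 0:  # converged on the training set
            break
    return weights, bias

# Tiny illustrative training set built from sentences quoted in the slides.
training = [
    ("Will gets out of the Chevrolet.", 1),
    ("Erin exits her new truck.", 1),
    ("About to sit down, he freezes.", -1),
    ("Rick sits down with Ilsa.", -1),
]
weights, bias = train_perceptron(training)
```

A real system would need far more annotated sentences; this only shows why a learned classifier can cope with varied phrasings better than a keyword match on "gets out".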

9 Video datasets for human actions. 12 movies; 20 different movies. Learn vision-based classifier from automatic training set. Compare performance to the manual training set.

10 Video Classification for action recognition

11 Space-time features. Good performance for action recognition. Compact, and provide tolerance to background clutter, occlusions and scale changes.

12 Interest point detection. Harris operator with a space-time extension. We use multiple levels of spatio-temporal scales: $\sigma = 2^{(1+i)/2}$, $i = 1, \dots, 6$; $\tau = 2^{j/2}$, $j = 1, 2$. Reference: I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.
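As a quick check of the scale levels, the values can be enumerated directly. The variable names are illustrative; only the two formulas and index ranges come from the slide.

```python
# Spatial scales: sigma = 2^((1+i)/2) for i = 1..6
spatial_scales = [2 ** ((1 + i) / 2) for i in range(1, 7)]

# Temporal scales: tau = 2^(j/2) for j = 1, 2
temporal_scales = [2 ** (j / 2) for j in range(1, 3)]

# Every spatial scale is paired with every temporal scale.
scale_pairs = [(sigma, tau) for sigma in spatial_scales for tau in temporal_scales]

for sigma, tau in scale_pairs:
    print(f"sigma = {sigma:.3f}, tau = {tau:.3f}")
```

This gives 6 x 2 = 12 spatio-temporal scale combinations, with sigma ranging from 2 to about 11.3.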

13

14 Descriptors. Compute histogram descriptors of the volume around each interest point. The volume size $(\Delta x, \Delta y, \Delta t)$ is related to the detection scales by $\Delta x, \Delta y = 2k\sigma$, $\Delta t = 2k\tau$. Each volume is divided into an $(n_x, n_y, n_t)$ grid of cuboids. We use $k = 9$, $n_x, n_y = 3$, $n_t = 2$.

15 Descriptors (contd.). For each cuboid, we calculate HoG (histogram of oriented gradients) and HoF (histogram of optical flow) descriptors. Very similar to SIFT descriptors, adapted to the third dimension.
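The descriptor geometry can be worked through numerically. The volume formula and the values of k, nx, ny, nt come from the slides; the bin counts per cuboid (4 for HoG, 5 for HoF) are not stated in the slides and are an assumption here, so treat the resulting lengths as illustrative.

```python
K = 9                    # scale factor from the slide
NX, NY, NT = 3, 3, 2     # grid of cuboids per volume, from the slide

def volume_size(sigma, tau, k=K):
    """(dx, dy, dt) of the volume around a point detected at scales (sigma, tau)."""
    return (2 * k * sigma, 2 * k * sigma, 2 * k * tau)

def descriptor_length(bins_per_cuboid, nx=NX, ny=NY, nt=NT):
    """Length of the concatenated per-cuboid histograms."""
    return nx * ny * nt * bins_per_cuboid

# Assumed bin counts (not given in the slides): 4-bin HoG, 5-bin HoF.
hog_length = descriptor_length(4)
hof_length = descriptor_length(5)

print(volume_size(2.0, 2.0))   # volume at sigma = 2, tau = 2
print(hog_length, hof_length)
```

Under these assumptions each volume has 3 x 3 x 2 = 18 cuboids, so the descriptor length is 18 times the number of histogram bins.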

16 Spatio-temporal BoF. Construct a visual vocabulary using k-means, with k = 4000 clusters (just like what we do in hw3). Assign each feature to one word. Compute a frequency histogram for the entire video, or for a subsequence defined by a spatio-temporal grid. If divided into grids, concatenate and normalize.
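The assignment and histogram steps can be sketched as follows. This is a toy illustration: the vocabulary has two 2-D words instead of 4000 k-means centres, only a temporal grid is shown, and the function names and L1 normalization are assumptions.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary centre closest to the descriptor (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda w: sq_dist(descriptor, vocabulary[w]))

def bof_counts(descriptors, vocabulary):
    """Word-frequency counts for a set of descriptors."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    return counts

def gridded_histogram(features, vocabulary, n_temporal_cells, duration):
    """features: list of (time, descriptor). Concatenate per-cell counts, then L1-normalize."""
    cells = [[] for _ in range(n_temporal_cells)]
    for t, descriptor in features:
        cell = min(int(t / duration * n_temporal_cells), n_temporal_cells - 1)
        cells[cell].append(descriptor)
    concatenated = []
    for cell in cells:
        concatenated.extend(bof_counts(cell, vocabulary))
    total = sum(concatenated)
    return [c / total for c in concatenated] if total else concatenated

# Toy data: two visual words, four features in a 4-second clip.
vocabulary = [(0.0, 0.0), (1.0, 1.0)]
features = [(0.5, (0.1, 0.0)), (1.0, (0.9, 1.0)), (3.5, (1.0, 0.8)), (3.9, (0.8, 0.9))]

whole_video = gridded_histogram(features, vocabulary, 1, 4.0)   # one temporal cell
two_halves = gridded_histogram(features, vocabulary, 2, 4.0)    # two temporal cells
```

The gridded histogram is twice as long as the whole-video one and records when in the clip each word occurred, which is what lets a grid separate actions with different temporal structure.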

18 Grids. We divide both spatial and temporal dimensions. Spatial: 1x1, 2x2, 3x3, v1x3, h3x1, o2x2. Temporal: t1, t2, t3, ot2. 6 * 4 = 24 possible grid combinations! Descriptor + grid = channel.
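The channel count follows from a Cartesian product. The grid names are taken from the slide; the descriptor labels "hog" and "hof" are illustrative names for the two descriptors mentioned earlier.

```python
from itertools import product

spatial_grids = ["1x1", "2x2", "3x3", "v1x3", "h3x1", "o2x2"]
temporal_grids = ["t1", "t2", "t3", "ot2"]
descriptors = ["hog", "hof"]

# Every spatial grid can be combined with every temporal grid.
grids = list(product(spatial_grids, temporal_grids))

# A channel is one descriptor computed on one grid.
channels = [(d, s, t) for d in descriptors for s, t in grids]

print(len(grids), len(channels))
```

With two descriptor types, the 24 grid combinations yield 48 candidate channels.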

19 Non-linear SVM. Classification using a non-linear SVM with a multi-channel Gaussian kernel. V = vocabulary size, A = mean distance between training samples. The best set of channels for a training set is found by a greedy approach.
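The kernel formula itself did not survive in the transcript. A common form for a multi-channel Gaussian kernel over histograms, assumed here, is K(Hi, Hj) = exp(-sum over channels c of Dc(Hi, Hj) / Ac), where Dc is the chi-square distance summed over the V histogram bins and Ac is the mean distance between training samples for channel c. The sketch below implements that assumed form; the function names and toy histograms are hypothetical.

```python
import math

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms of equal length V."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:            # skip empty bins to avoid division by zero
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def multichannel_kernel(x, y, mean_distances):
    """x, y: dicts channel -> histogram; mean_distances: dict channel -> A_c."""
    exponent = sum(
        chi2_distance(x[c], y[c]) / mean_distances[c] for c in mean_distances
    )
    return math.exp(-exponent)

# Toy samples with two channels (names are illustrative).
sample_a = {"hog 1x1 t1": [0.5, 0.5], "hof 1x1 t1": [1.0, 0.0]}
sample_b = {"hog 1x1 t1": [0.5, 0.5], "hof 1x1 t1": [0.0, 1.0]}
mean_distances = {"hog 1x1 t1": 1.0, "hof 1x1 t1": 2.0}

k_same = multichannel_kernel(sample_a, sample_a, mean_distances)
k_diff = multichannel_kernel(sample_a, sample_b, mean_distances)
```

Dividing each channel's distance by its mean value A puts channels with different histogram lengths on a comparable scale before they are summed.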

20 What channels to use? Channels may complement each other. Greedy approach to pick the best combination. Combining channels is more advantageous. Table: Classification performance of different channels and their combinations.
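A greedy search over channels can be sketched as forward selection. This is a simplification under assumptions: the slides do not say whether channels are also removed during the search, and `score` is a hypothetical callable (for example, validation accuracy of an SVM trained on the given channel subset). The toy utilities below are made up purely to exercise the code.

```python
def greedy_select(channels, score):
    """Forward selection: repeatedly add the channel that most improves the score."""
    selected, best = [], float("-inf")
    while True:
        candidates = [c for c in channels if c not in selected]
        if not candidates:
            break
        scored = [(score(selected + [c]), c) for c in candidates]
        top_score, top_channel = max(scored, key=lambda pair: pair[0])
        if top_score <= best:    # no candidate improves the current combination
            break
        selected.append(top_channel)
        best = top_score
    return selected, best

# Made-up additive utilities, only to demonstrate the search.
toy_utility = {"hog 1x1 t1": 5, "hof 1x1 t2": 3, "hog 3x3 t1": -1}

def toy_score(subset):
    return sum(toy_utility[c] for c in subset)

chosen, chosen_score = greedy_select(list(toy_utility), toy_score)
```

Here the search keeps the two helpful channels and stops before adding the one that lowers the score. With a real score function, each evaluation involves training and validating a classifier, which is why an exhaustive search over all subsets is avoided.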

21 Evaluation of spatio-temporal grids. Figure: Number of occurrences for each channel component within the optimized channel combinations for the KTH action dataset and our manually labeled movie dataset.

22 Results with the KTH dataset. Figure: Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are presented.

23 Results with the KTH dataset. 2391 sequences divided into the training/validation set (8+8 people) and test set (9 people). 10-fold cross-validation. Table: Confusion matrix for the KTH actions.

24 Robustness to noise in the training data. Up to p = 0.2 the performance decreases insignificantly. At p = 0.4 the performance decreases by around 10%. Figure: Performance of our video classification approach in the presence of wrong labels.
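An experiment of this kind needs training labels corrupted with a given probability p. A minimal sketch, assuming binary labels in {-1, +1} and independent flips (the slides do not describe how the wrong labels were generated):

```python
import random

def corrupt_labels(labels, p, rng):
    """Flip each +1/-1 label independently with probability p."""
    return [-y if rng.random() < p else y for y in labels]

rng = random.Random(0)              # fixed seed so the run is repeatable
clean = [1] * 500 + [-1] * 500      # toy label vector
noisy = corrupt_labels(clean, 0.2, rng)

flipped = sum(1 for a, b in zip(clean, noisy) if a != b)
print(f"{flipped} of {len(clean)} labels flipped")
```

Training on `noisy` while testing on clean labels, for several values of p, would produce a curve like the one in the figure.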

25 Action recognition in real-world videos. Table: Average precision (AP) for each action class of our test set. Comparison results for clean (annotated) and automatic training data, and also results for a random classifier (chance).
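For reference, average precision can be computed from classifier scores as below. This uses the common definition (mean of the precision values at the rank of each positive sample); the slides do not say which AP variant was used, so this is an assumption.

```python
def average_precision(scores, labels):
    """Mean of precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy example: positives are ranked 1st and 3rd out of four samples.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_ap = average_precision(toy_scores, toy_labels)
```

In the toy example the precisions at the two positives are 1/1 and 2/3, so the AP is their mean, 5/6.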

26 Action recognition in real-world videos. Figure: Example results for action classification trained on the automatically annotated data. We show the key frames for test movies with the highest confidence values for true/false positives/negatives. Annotations: the rapid getting up is typical for “GetOutCar”; the false negatives are very difficult to recognize; occluded handshake; hardly visible person getting out of the car.

27 Conclusions. Summary: automatic generation of realistic action samples; transfer of recent bag-of-features experience to videos; improved performance on the KTH benchmark; decent results for actions in real videos. Future directions: improving the script-video alignment; experimenting with space-time low-level features; Internet-scale video search.

28 Thank you

