Ivan Laptev*, Marcin Marszałek**, Cordelia Schmid**, Benjamin Rozenfeld*** * IRISA/INRIA Rennes, France ** INRIA Rhône-Alpes, France *** Bar-Ilan University,


1 Learning realistic human actions from movies
Ivan Laptev*, Marcin Marszałek**, Cordelia Schmid**, Benjamin Rozenfeld***
* IRISA/INRIA Rennes, France ** INRIA Rhône-Alpes, France *** Bar-Ilan University, Israel

2 Human actions: Motivation
Action recognition useful for:
- Content-based browsing, e.g. fast-forward to the next goal scoring scene
- Video recycling, e.g. find “Bush shaking hands with Putin”
- Human sciences, e.g. influence of smoking in movies on adolescent smoking
Huge amount of video is available and growing
Human actions are major events in movies, TV news, personal video, …
150,000 uploads every day

3 What are human actions?
Actions in current datasets: KTH action dataset
Actions in “the Wild”: (example frames shown on the slide)

4 Access to realistic human actions
Web video search (Google Video, YouTube, MyspaceTV, …)
- Useful for some action classes: kissing, hand shaking
- Very noisy or not useful for the majority of other action classes
- Implies manual selection work
- Examples are frequently non-representative
Web image search
- Useful for learning action context: static scenes and objects
- See also [Li-Jia & Fei-Fei ICCV07]

5 Actions in movies Realistic variation of human actions Many classes and many examples per class Problems: Typically only a few class-samples per movie Manual annotation is very time consuming

6 Script alignment
Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
Subtitles (with time info) are available for most movies
Can transfer time to scripts by text alignment [Everingham et al. BMVC06]
Subtitles:
…
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me? Why'd you keep your marriage a secret?
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard. Victor wanted it that way
01:20:23,800 --> 01:20:26,189
Not even our closest friends knew about our marriage.
…
Movie script (time stamps shown on the slide: 01:20:17, 01:20:23):
…
RICK
Why weren't you honest with me? Why did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
…
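A minimal sketch of the time-transfer idea, not the authors' implementation: subtitle words and script words are aligned as two word sequences, and each script line inherits the time span of the subtitles it matches. The use of Python's difflib for the alignment, the tokenization, and the function names are assumptions made for illustration only.

```python
import difflib
import re


def words(text):
    """Lowercase word tokens; punctuation other than apostrophes is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def transfer_times(subtitles, script_lines):
    """Give each script line the time span of the subtitle words it matches.

    subtitles: list of (start, end, text) with fixed-width HH:MM:SS,mmm strings,
    so plain string comparison orders the time stamps correctly.
    Returns one (start, end) tuple per script line, or None if nothing matched.
    """
    sub_words, sub_index = [], []
    for i, (_, _, text) in enumerate(subtitles):
        for w in words(text):
            sub_words.append(w)
            sub_index.append(i)
    scr_words, scr_index = [], []
    for j, line in enumerate(script_lines):
        for w in words(line):
            scr_words.append(w)
            scr_index.append(j)

    matcher = difflib.SequenceMatcher(None, sub_words, scr_words, autojunk=False)
    spans = [None] * len(script_lines)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            start, end, _ = subtitles[sub_index[block.a + k]]
            j = scr_index[block.b + k]
            if spans[j] is None:
                spans[j] = (start, end)
            else:
                spans[j] = (min(spans[j][0], start), max(spans[j][1], end))
    return spans


subtitles = [
    ("01:20:17,240", "01:20:20,437",
     "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20,640", "01:20:23,598",
     "It wasn't my secret, Richard. Victor wanted it that way"),
    ("01:20:23,800", "01:20:26,189",
     "Not even our closest friends knew about our marriage."),
]
script_lines = [
    "Why weren't you honest with me? Why did you keep your marriage a secret?",
    "Rick sits down with Ilsa.",
    "Oh, it wasn't my secret, Richard. Victor wanted it that way. "
    "Not even our closest friends knew about our marriage.",
]
spans = transfer_times(subtitles, script_lines)
for line, span in zip(script_lines, spans):
    print(span, line[:40])
```

In this sketch a scene description with no matched dialogue (here "Rick sits down with Ilsa.") gets no time span of its own; in practice it could be bounded by the spans of the neighbouring dialogue lines.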

7 Script alignment
Time stamps shown on the slide: 01:21:50, 01:21:59, 01:22:00, 01:22:03, 01:22:15, 01:22:17
RICK
All right, I will. Here's looking at you, kid.
ILSA
I wish I didn't love you so much.
She snuggles closer to Rick.
CUT TO:
EXT. RICK'S CAFE - NIGHT
Laszlo and Carl make their way through the darkness toward a side entrance of Rick's. They run inside the entryway. The headlights of a speeding police car sweep toward them. They flatten themselves against a wall to avoid detection. The lights move past them.
CARL
I think we lost them.
…

8 Script-based action annotation
On the good side:
- Realistic variation of actions: subjects, views, etc.
- Many examples per class, many classes
- No extra overhead for new classes
- Actions, objects, scenes and their combinations
- Character names may be used to resolve “who is doing what?”
Problems:
- No spatial localization
- Temporal localization may be poor
- Missing actions: e.g. scripts do not always follow the movie
- Annotation is incomplete, not suitable as ground truth for testing action detection
- Large within-class variability of action classes in text

9 Script alignment: Evaluation
- Annotate action samples in text
- Do automatic script-to-video alignment
- Check the correspondence of actions in scripts and movies
a: quality of subtitle-script matching
Example of a “visual false positive”: “A black car pulls up, two army officers get out.”
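The slide names the matching quality a without defining it. One plausible reading, stated here as an assumption, is the fraction of script words that are matched in order by subtitle words. The sketch below computes that ratio; the function name and the choice of the script word count as denominator are illustrative, not taken from the slides.

```python
import difflib
import re


def matching_quality(subtitle_text, script_text):
    """Fraction of script words matched, in order, by subtitle words (assumed definition)."""
    sub = re.findall(r"[a-z']+", subtitle_text.lower())
    scr = re.findall(r"[a-z']+", script_text.lower())
    if not scr:
        return 0.0
    matcher = difflib.SequenceMatcher(None, sub, scr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(scr)


a = matching_quality(
    "Why'd you keep your marriage a secret?",
    "Why did you keep your marriage a secret?",
)
print(a)
```

A threshold on a could then be used to keep only well-aligned script passages as training samples.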

10 Text-based action retrieval
Large variation of action expressions in text:
- GetOutCar action: “… Will gets out of the Chevrolet. …”, “… Erin exits her new truck…”
- Potential false positives: “…About to sit down, he freezes…”
=> Supervised text classification approach
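The slide does not say which text classifier is used. As a hedged illustration of supervised text classification for one action class, the sketch below trains a plain perceptron on bag-of-words features. The perceptron, the features, the tiny training set, and all function names are assumptions chosen to keep the example self-contained.

```python
import re
from collections import defaultdict


def features(sentence):
    """Bag-of-words features: the set of lowercase word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def train_perceptron(samples, epochs=20):
    """samples: list of (sentence, label) with label +1 (action) or -1 (other)."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for sentence, label in samples:
            feats = features(sentence)
            score = bias + sum(weights[w] for w in feats)
            if label * score <= 0:  # misclassified or undecided: update
                for w in feats:
                    weights[w] += label
                bias += label
    return weights, bias


def predict(model, sentence):
    weights, bias = model
    score = bias + sum(weights.get(w, 0.0) for w in features(sentence))
    return 1 if score > 0 else -1


# Hypothetical training sentences for a GetOutCar-vs-rest classifier.
samples = [
    ("Will gets out of the Chevrolet", +1),
    ("About to sit down, he freezes", -1),
    ("Erin exits her new truck", +1),
    ("He gets up from the table", -1),
]
model = train_perceptron(samples)
print(predict(model, "Will exits the truck"))
print(predict(model, "He freezes"))
```

A learned classifier of this kind can pick up varied phrasings ("gets out of", "exits") that a single keyword query would miss, which is the motivation the slide gives for the supervised approach.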

11 Movie actions dataset
Training sets from 12 movies; test set from 20 different movies
- Learn vision-based classifier from automatic training set
- Compare performance to the manual training set

12 Action Classification: Overview
Bag of space-time features + multi-channel SVM [Schuldt’04, Niebles’06, Zhang’07]
Pipeline: collection of space-time patches -> HOG & HOF patch descriptors -> visual vocabulary -> histogram of visual words -> multi-channel SVM classifier

13 Space-Time Features: Detector
- Space-time corner detector [Laptev, IJCV 2005]
- Dense scale sampling (no explicit scale selection)

14 Space-Time Features: Descriptor
Multi-scale space-time patches from corner detector
- Histogram of oriented spatial gradients (HOG): 3x3x2x4 bins HOG descriptor
- Histogram of optical flow (HOF): 3x3x2x5 bins HOF descriptor
Public code available soon!

15 Spatio-temporal bag-of-features
We use global spatio-temporal grids
In the spatial domain:
- 1x1 (standard BoF)
- 2x2, o2x2 (50% overlap)
- h3x1 (horizontal), v1x3 (vertical)
- 3x3
In the temporal domain:
- t1 (standard BoF), t2, t3
Figure: Examples of a few spatio-temporal grids
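To make the grid idea concrete, here is a small sketch under stated assumptions: each feature is a tuple (x, y, t, word) with coordinates normalized to [0, 1) and word an already-quantized visual-word index; one histogram is accumulated per grid cell and the cell histograms are concatenated. The function name, the L1 normalization, and the input layout are assumptions; the overlapping o2x2 grid is not covered.

```python
def grid_histogram(features, vocab_size, nx=1, ny=1, nt=1):
    """Concatenate one visual-word histogram per spatio-temporal grid cell.

    features: iterable of (x, y, t, word), x/y/t normalized to [0, 1),
    word an integer in [0, vocab_size). nx, ny, nt give the number of
    cells along the horizontal, vertical and temporal axes.
    """
    hist = [0.0] * (nx * ny * nt * vocab_size)
    for x, y, t, word in features:
        cx = min(int(x * nx), nx - 1)
        cy = min(int(y * ny), ny - 1)
        ct = min(int(t * nt), nt - 1)
        cell = (ct * ny + cy) * nx + cx
        hist[cell * vocab_size + word] += 1.0
    total = sum(hist)
    # L1-normalize so videos with different feature counts are comparable.
    return [h / total for h in hist] if total else hist


# Four hypothetical features with a two-word vocabulary.
feats = [(0.1, 0.1, 0.2, 0), (0.9, 0.1, 0.2, 1), (0.1, 0.9, 0.8, 0), (0.9, 0.9, 0.8, 0)]
bof_1x1_t1 = grid_histogram(feats, vocab_size=2)              # standard BoF
bof_2x2_t1 = grid_histogram(feats, vocab_size=2, nx=2, ny=2)  # 2x2 spatial grid
bof_1x1_t2 = grid_histogram(feats, vocab_size=2, nt=2)        # two temporal bins
print(bof_1x1_t1, bof_2x2_t1, bof_1x1_t2)
```

With nx = ny = nt = 1 the function reduces to the standard bag-of-features histogram; finer grids keep coarse information about where and when each visual word occurs.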

16 Multi-channel chi-square kernel
We use SVMs with a multi-channel chi-square kernel for classification
- Channel c is a combination of a detector, descriptor and a grid
- D_c(H_i, H_j) is the chi-square distance between histograms
- A_c is the mean value of the distances between all training samples
- The best set of channels C for a given training set is found based on a greedy approach
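The kernel formula itself did not survive in the transcript. Given the definitions of D_c and A_c, the usual form of a multi-channel chi-square kernel is K(H_i, H_j) = exp(-sum over c in C of D_c(H_i, H_j) / A_c), with D_c(H_i, H_j) = 1/2 * sum over bins n of (h_in - h_jn)^2 / (h_in + h_jn). The sketch below implements that form; the channel names, the pairwise averaging used for A_c, and the function names are assumptions.

```python
import math


def chi2_distance(h1, h2):
    """Chi-square distance 0.5 * sum((a - b)^2 / (a + b)); empty bins are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def mean_distance(histograms):
    """A_c: mean chi-square distance over all pairs of distinct training samples."""
    n = len(histograms)
    dists = [chi2_distance(histograms[i], histograms[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)


def multichannel_kernel(sample_i, sample_j, mean_dists):
    """K = exp(-sum_c D_c(H_i, H_j) / A_c); a sample maps channel name -> histogram."""
    total = sum(chi2_distance(sample_i[c], sample_j[c]) / mean_dists[c]
                for c in mean_dists)
    return math.exp(-total)


# Three hypothetical training samples with two channels each.
train = [
    {"hog": [0.5, 0.5], "hof": [1.0, 0.0]},
    {"hog": [1.0, 0.0], "hof": [0.0, 1.0]},
    {"hog": [0.0, 1.0], "hof": [0.5, 0.5]},
]
mean_dists = {c: mean_distance([s[c] for s in train]) for c in ("hog", "hof")}
k01 = multichannel_kernel(train[0], train[1], mean_dists)
print(k01)
```

Dividing each channel's distance by A_c puts channels with different histogram sizes on a comparable scale before they are summed. The resulting kernel matrix could be passed to any SVM implementation that accepts precomputed kernels.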

17 Combining channels
Table: Classification performance of different channels and their combinations
- It is worth trying different grids
- It is beneficial to combine channels
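The slides say only that the best channel set C is found with a greedy approach. One simple variant, shown here as an assumption rather than the authors' procedure, is forward selection: repeatedly add the channel that improves a validation score the most and stop when no channel helps. The evaluate callback, the channel names, and the toy scores are hypothetical.

```python
def greedy_channel_selection(channels, evaluate):
    """Forward greedy search over channels.

    evaluate(list_of_channels) -> float is a user-supplied score,
    e.g. cross-validated accuracy of an SVM using those channels.
    """
    selected, best_score = [], float("-inf")
    remaining = list(channels)
    while remaining:
        score, channel = max(((evaluate(selected + [c]), c) for c in remaining),
                             key=lambda pair: pair[0])
        if score <= best_score:
            break  # no remaining channel improves the score
        selected.append(channel)
        remaining.remove(channel)
        best_score = score
    return selected, best_score


def toy_evaluate(chosen):
    """Hypothetical score: per-channel gains minus a penalty for each extra channel."""
    gains = {"hog_1x1_t1": 0.30, "hof_1x1_t1": 0.35, "hof_h3x1_t1": 0.10}
    return sum(gains[c] for c in chosen) - 0.12 * (len(chosen) - 1)


selected, score = greedy_channel_selection(
    ["hog_1x1_t1", "hof_1x1_t1", "hof_h3x1_t1"], toy_evaluate)
print(selected, score)
```

A greedy search evaluates far fewer combinations than an exhaustive search over all channel subsets, at the cost of possibly missing the best subset.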

18 Evaluation of spatio-temporal grids Figure: Number of occurrences for each channel component within the optimized channel combinations for the KTH action dataset and our manually labeled movie dataset

19 Comparison to the state-of-the-art Figure: Sample frames from the KTH actions sequences, all six classes (columns) and scenarios (rows) are presented

20 Comparison to the state-of-the-art Table: Average class accuracy on the KTH actions dataset Table: Confusion matrix for the KTH actions

21 Training noise robustness
Figure: Performance of our video classification approach in the presence of wrong labels
- Up to p=0.2 the performance decreases insignificantly
- At p=0.4 the performance decreases by around 10%
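A noise experiment of this kind needs training labels that are wrong with probability p. The sketch below shows one way to generate them, assuming each corrupted label is replaced by a different class chosen uniformly at random; the function name, the seed parameter, and the class names in the example are illustrative.

```python
import random


def corrupt_labels(labels, p, classes, seed=0):
    """Replace each label, with probability p, by a different class chosen at random."""
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if rng.random() < p:
            noisy.append(rng.choice([c for c in classes if c != label]))
        else:
            noisy.append(label)
    return noisy


classes = ["HandShake", "Kiss", "SitDown"]
clean = ["Kiss", "SitDown", "HandShake", "Kiss"] * 25
noisy = corrupt_labels(clean, p=0.2, classes=classes)
wrong = sum(1 for a, b in zip(clean, noisy) if a != b)
print(wrong / len(clean))  # empirical fraction of wrong labels
```

Training the same classifier on labels corrupted at several values of p and testing on clean labels would produce a robustness curve like the one the figure describes.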

22 Action recognition in real-world videos
Figure: Example results for action classification trained on the automatically annotated data. We show the key frames for test movies with the highest confidence values for true/false positives/negatives

23 Action recognition in real-world videos
- Note the suggestive FP: hugging or answering the phone
- Note the difficult FN: getting out of car or handshaking

24 Action recognition in real-world videos
Table: Average precision (AP) for each action class of our test set. We compare results for clean (annotated) and automatic training data. We also show results for a random classifier (chance)
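For readers unfamiliar with the metric, the sketch below computes average precision in one common form: rank the test samples by classifier score and average the precision measured at the rank of each positive sample. The exact AP variant used for the table is not stated in the slides, so this definition and the function name are assumptions.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical scores for four test clips; 1 marks clips containing the action.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(ap)
```

AP depends only on the ranking of the scores, so it can compare classifiers trained on clean and on automatic labels without choosing a decision threshold.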

25 Action classification (CVPR 2008) Test episodes from movies “The Graduate”, “It’s a wonderful life”, “Indiana Jones and the Last Crusade”

26 Going large-scale
“Buffy the Vampire Slayer” TV series: 7 seasons = 39 DVDs = 144 episodes ≈ 100 hours of video ≈ 10M frames
Figure: Key frames of sample “Buffy” clips.

27 What actions are available?
“Buffy the Vampire Slayer” TV series: video samples with captions
- captions are parsed with Link Grammar
- verbs are stemmed using Morphy
- semantics can be obtained with WordNet

28 Text-based retrieval
Textual retrieval results in:
- 10-35% false positives
- 5-10% “borderline” samples
Sources of errors as for Hollywood movies:
- errors in script synchronization
- errors in script parsing
- out-of-view actions

29 Visual re-ranking
Vision can be used to re-rank the samples and prune the false positives:
- often no FP at 40% recall
- half of the FP at 80% recall
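The recall figures can be read as follows: sort the text-retrieved samples by visual classifier score, keep samples from the top until the desired fraction of true positives is retrieved, and count the false positives kept along the way. The sketch below does that counting; the function name, scores and labels are hypothetical.

```python
import math


def false_positives_at_recall(scores, labels, recall):
    """False positives kept when retrieving top-ranked samples up to the given recall.

    labels: 1 for a true sample of the action, 0 for a text-retrieval false positive.
    """
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    n_pos = sum(ranked)
    needed = math.ceil(recall * n_pos - 1e-9)  # true positives to retrieve
    found = false_pos = 0
    for is_positive in ranked:
        if found >= needed:
            break
        if is_positive:
            found += 1
        else:
            false_pos += 1
    return false_pos


# Ten hypothetical text-retrieved samples, five of them false positives.
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
fp_40 = false_positives_at_recall(scores, labels, 0.4)
fp_80 = false_positives_at_recall(scores, labels, 0.8)
print(fp_40, fp_80)
```

Comparing these counts with the total number of false positives in the text-retrieved set shows how much of the label noise the visual re-ranking prunes at each recall level.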

30 Going large-scale
Figure: Example results for our visual modelling procedure. We show the key frames of true/false positives/negatives with the highest confidence values.
Axis labels: low score, high score

31 Conclusions
Summary
- Automatic generation of realistic action samples
- New action dataset available
- Transfer of recent bag-of-features experience to videos
- Improved performance on KTH benchmark
- Promising results for actions in the wild
Future directions
- Automatic action class discovery
- Internet-scale video search
- Video+Text+Sound+…

33 Going large-scale: Punch, Kick, Fall, Walk

