Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval Sonal Gupta and Raymond Mooney University of Texas at Austin.


1 Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval Sonal Gupta and Raymond Mooney University of Texas at Austin

2 Outline
Introduction
Approach
Experiments
Results
Conclusion and Future Directions

3 Introduction
Human activity recognition in real-world videos is very hard
  – High variation in camera angle, zoom, etc.
  – Clutter, camera motion
Good activity classification systems require a large amount of labeled data
  – Human-labeled: expensive
  – Automatically collected labeled data, e.g. Laptev et al. (CVPR) use scripts and captions
Use an activity classifier trained on auto-labeled data to rank the retrieved video clips

4 Captioned Videos
Many videos have associated closed captions (CC)
CC contains both relevant and irrelevant information
  – “Beautiful pull-back.” (relevant)
  – “They scored in the last kick of the game against the Czech Republic.” (irrelevant)
  – “That is a fairly good tackle.” (relevant)
  – “Turkey can be well-pleased with the way they started.” (irrelevant)
Use a novel caption classifier to rank the retrieved video clips

5 Examples
Kick
  – “I do not think there is any real intent, just trying to make sure he gets his body across, but it was a free kick.”
  – “Lovely kick.”
  – “Goal kick.”
Save
  – “Good save as well.”
  – “I think brown made a wonderful fingertip save there.”
  – “And it is a really chopped save”

6 Examples (continued)
Throw
  – “If you are defending a lead, your throw back takes it that far up the pitch and gets a throw-in.”
  – “And Carlos Tevez has won the throw.”
  – “Another shot for a throw.”
Touch
  – “When they are going to pass it in the back, it is a really pure touch.”
  – “Look at that, Henry, again, he had time on the ball to take another touch and prepare that ball properly.”
  – “All it needed was a touch.”

7 Overview of the System
[Diagram with a Training phase and a Testing phase. Components: Manually Labeled Captions, Caption Classifier, Captioned Training Videos, Caption Based Video Retriever, Automatically Labeled Video Clips, Video Classifier, Query, Captioned Video, Retrieved Clips, Video Ranker, Ranked List of Video Clips]

8 [System overview diagram, repeated]

9 Retrieving and Labeling Data
  – Identify all closed caption sentences that contain exactly one of the set of activity keywords: kick, save, throw, touch
  – Extract clips of 8 sec around the corresponding time
  – Label the clips with the corresponding classes
Example: “…What a nice kick!…” is labeled kick (out of kick, save, touch)
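The labeling steps above can be sketched roughly as follows. This is only an illustration, not the authors' code: the `(time, sentence)` caption format, the function name `label_captions`, exact whole-word matching (so inflected forms such as "touched" are not matched), and centering the 8-second window on the caption time are all assumptions.

```python
import re

KEYWORDS = ("kick", "save", "throw", "touch")  # activity keywords from the slide
CLIP_SECONDS = 8.0                             # clip length from the slide


def label_captions(captions):
    """Turn captions into automatically labeled clip intervals.

    captions: iterable of (time_in_seconds, sentence) pairs (assumed format).
    Returns a list of (start, end, label) for every sentence that mentions
    exactly one of the activity keywords.
    """
    clips = []
    for time_s, sentence in captions:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        hits = [k for k in KEYWORDS if k in words]
        if len(hits) != 1:
            continue  # no keyword, or more than one: skip as ambiguous
        # Assumption: the clip is centered on the caption time.
        start = max(0.0, time_s - CLIP_SECONDS / 2)
        clips.append((start, start + CLIP_SECONDS, hits[0]))
    return clips
```

A sentence such as "Good save after the kick." contains two keywords and is therefore skipped, which is what "exactly one" on the slide implies.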

10 [System overview diagram, repeated]

11 Video Classifier
Extract visual features from clips
  – Histogram of oriented gradients and optical flow in space-time volume (Laptev et al., ICCV 07; CVPR 08)
  – Represent as ‘bag of visual words’
Use automatically labeled video clips to train the activity classifier
Use the DECORATE classifier (Melville and Mooney, IJCAI 03)
  – An ensemble-based classifier
  – Works well with noisy and limited data
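To make the ‘bag of visual words’ step concrete, here is a minimal sketch. It assumes a codebook of visual words has already been built (the slides do not say how; clustering the descriptors is a common choice) and that each clip yields a list of numeric descriptor vectors. The function names are hypothetical, and the feature extraction and the DECORATE classifier are not shown.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook vector closest to descriptor (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))


def bag_of_visual_words(descriptors, codebook):
    """Normalized histogram of codeword assignments for one clip."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    # A clip with no descriptors gets an all-zero histogram.
    return [c / total for c in counts] if total else [0.0] * len(codebook)
```

The resulting fixed-length histogram is what a clip-level classifier would take as input, regardless of how many local descriptors the clip produced.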

12 [System overview diagram, repeated]

13 Caption Classifier
Sportscasters talk about events on the field and otherwise
  – ~69% of the captions in the caption dataset are ‘irrelevant’ to the current events
Classifies relevant vs. irrelevant captions
  – Independent of the query classes
Use an SVM string classifier
  – Uses a subsequence kernel, which measures how many subsequences are shared by two strings
  – A subsequence is any ordered sequence of tokens occurring either contiguously or noncontiguously in a string (Lodhi et al. 02, Bunescu and Mooney 05)
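The idea of "how many subsequences are shared by two strings" can be illustrated with a simplified count. Note the assumptions: this sketch counts matching token subsequences of length 1 up to `max_len` with no penalty for gaps, whereas the kernels in the cited papers down-weight noncontiguous matches; the function name and the `max_len` parameter are hypothetical.

```python
def common_subsequences(s, t, max_len):
    """Count pairs of matching token subsequences of length 1..max_len.

    s and t are token lists. A subsequence keeps token order but may skip
    tokens. Simplification: gaps are not penalized.
    """
    n, m = len(s), len(t)
    # prev[i][j]: number of matching subsequence pairs of length k-1 within
    # s[:i] and t[:j]. For length 0 there is exactly one: the empty pair.
    prev = [[1] * (m + 1) for _ in range(n + 1)]
    total = 0
    for _ in range(max_len):
        cur = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                # Pairs that do not end at both s[i-1] and t[j-1] ...
                cur[i][j] = cur[i - 1][j] + cur[i][j - 1] - cur[i - 1][j - 1]
                # ... plus pairs that end exactly at this matching token.
                if s[i - 1] == t[j - 1]:
                    cur[i][j] += prev[i - 1][j - 1]
        total += cur[n][m]
        prev = cur
    return total
```

For example, "lovely touch" and "a lovely touch" share the single tokens "lovely" and "touch" and the pair "lovely touch", so the count with `max_len=2` is 3. An SVM would use such a similarity in place of a dot product between feature vectors.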

14 [System overview diagram, repeated]

15 Retrieving and Ranking Videos
Videos retrieved using captions, the same way as before
Two ways of ranking
  – Probabilities given by the video classifier (VIDEO)
  – Probabilities given by the caption classifier (CAPTION)
Aggregating the rankings
  – Weighted late fusion of rankings by VIDEO and CAPTION
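One plausible reading of "weighted late fusion" is a weighted sum of the two classifiers' probabilities per clip, followed by sorting. The sketch below assumes exactly that; the slides do not give the fusion formula or the weight, so `fuse_and_rank` and its `weight` parameter are hypothetical.

```python
def fuse_and_rank(video_probs, caption_probs, weight=0.5):
    """Rank clip ids by a weighted sum of the two classifiers' probabilities.

    video_probs, caption_probs: dicts mapping clip id -> probability that the
    clip is a correct result. Both dicts are assumed to cover the same clips.
    weight: share given to the video classifier (assumed parameter).
    """
    fused = {
        clip: weight * video_probs[clip] + (1.0 - weight) * caption_probs[clip]
        for clip in video_probs
    }
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)
```

Setting `weight=1.0` reproduces the VIDEO ranking and `weight=0.0` the CAPTION ranking, so the two single-classifier rankings are special cases of the fused one.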

16 Experiment
Dataset
  – 23 soccer games recorded from TV broadcast
  – Avg. length: 1 hr 50 min
  – Avg. number of captions: 1,246
  – Caption classifier trained on 4 separate hand-labeled games
Metrics
  – MAP score: Mean Average Precision
Baseline: ranking clips randomly
Gold-VIDEO: video classifier trained on manually “cleaned” data
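For reference, MAP can be computed as below. This is a generic sketch of the standard metric, not the authors' evaluation script: it assumes each query's ranked list is given as booleans marking which clips are correct, and it normalizes by the number of correct clips in that list.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list.

    ranked_relevance: list of booleans, True where the clip at that rank
    is a correct result for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

Because the metric rewards placing correct clips near the top, a ranker can raise MAP without changing which clips were retrieved, which is the effect the VIDEO and CAPTION rankers aim for.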

17 Dataset
[Table. Columns: Query, # Total, # Correct, % Noise. Rows: Kick, Save, Throw, Touch. The cell values are not preserved in this transcript.]

18 Results - Ranking by VIDEO
[Table. Columns (classifier): DECORATE, Bagging, SVM. Rows: Baseline, VIDEO, Gold-VIDEO. Only the Baseline value, 65.68, is preserved in this transcript.]

19 Results: Aggregating the Rankings

20 [Figure: two rank orderings of the same five captions]
First ordering
  – Lovely touch.
  – Just trying to touch it on.
  – Just touched on by Nani
  – If he had not touched it.
  – I do not think it was touched
Second ordering
  – Just trying to touch it on.
  – Lovely touch.
  – Just touched on by Nani
  – If he had not touched it.
  – I do not think it was touched
VIDEO MAP = [value missing]; CAPTION MAP = [value missing]; VIDEO+CAPTION MAP = 80.41

21 Future Directions
Improve the activity classifier
  – Currently low classification accuracy
  – Pre-process video to remove clutter
Removing false positives in automatically labeled data
Improving recall
Exploiting temporal relations between activities
  – E.g. ‘save’ is mostly preceded by ‘kick’

22 Conclusion
Captioned videos are a good source for automatically collecting labeled data
  – Help increase the precision of the retrieved ranked list
A novel caption classifier further improves the MAP score
No human labeling of videos is required

23 Thank You!

