Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval – Sonal Gupta and Raymond Mooney, University of Texas at Austin

Presentation transcript:

Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval Sonal Gupta and Raymond Mooney University of Texas at Austin

2 Outline
– Introduction
– Approach
– Experiments
– Results
– Conclusion and Future Directions

3 Introduction
Human activity recognition in real-world videos is very hard
– High variation in camera angle, zoom, etc.
– Clutter and camera motion
Good activity classification systems require large amounts of labeled data
– Human-labeled data is expensive
– Automatically collected labeled data, e.g., Laptev et al. (CVPR) use scripts and captions
Use an activity classifier trained on auto-labeled data to rank the retrieved video clips

4 Captioned Videos
Many videos have associated closed captions (CC)
CC contains both relevant and irrelevant information
– “Beautiful pull-back.” – relevant
– “They scored in the last kick of the game against the Czech Republic.” – irrelevant
– “That is a fairly good tackle.” – relevant
– “Turkey can be well-pleased with the way they started.” – irrelevant
Use a novel caption classifier to rank the retrieved video clips

5 Examples
Kick:
– “I do not think there is any real intent, just trying to make sure he gets his body across, but it was a free kick.”
– “Lovely kick.”
– “Goal kick.”
Save:
– “Good save as well.”
– “I think Brown made a wonderful fingertip save there.”
– “And it is a really chopped save.”

6 Examples (continued)
Throw:
– “If you are defending a lead, your throw back takes it that far up the pitch and gets a throw-in.”
– “And Carlos Tevez has won the throw.”
– “Another shot for a throw.”
Touch:
– “When they are going to pass it in the back, it is a really pure touch.”
– “Look at that, Henry, again, he had time on the ball to take another touch and prepare that ball properly.”
– “All it needed was a touch.”

7 Overview of the System (diagram)
Training: captioned training videos → caption-based video retriever → automatically labeled video clips → video classifier; manually labeled captions → caption classifier
Testing: query + captioned video → caption-based video retriever → retrieved clips → video ranker (video classifier + caption classifier) → ranked list of video clips

8 Overview of the System (diagram repeated)

9 Retrieving and Labeling Data
– Identify all closed-caption sentences that contain exactly one of the activity keywords: kick, save, throw, touch
– Extract a clip of 8 seconds around the corresponding time
– Label the clip with the corresponding class
– Example: “…What a nice kick!…” yields a clip labeled kick
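To make the labeling rule concrete, here is a minimal Python sketch of the idea, assuming captions arrive as (timestamp in seconds, sentence) pairs; the caption format, tokenization, and exact keyword matching are assumptions for illustration, not the authors' implementation.

```python
from typing import Iterable, List, Tuple

KEYWORDS = {"kick", "save", "throw", "touch"}
CLIP_SECONDS = 8.0  # length of the extracted clip

def auto_label(captions: Iterable[Tuple[float, str]]) -> List[Tuple[Tuple[float, float], str]]:
    """Turn time-stamped caption sentences into (clip window, activity label) pairs.

    A caption is used only if it mentions exactly one activity keyword,
    which keeps the automatically generated labels relatively unambiguous.
    """
    labeled_clips = []
    for timestamp, sentence in captions:
        tokens = {tok.strip(".,!?").lower() for tok in sentence.split()}
        hits = KEYWORDS & tokens
        if len(hits) == 1:  # exactly one activity keyword
            label = hits.pop()
            window = (timestamp - CLIP_SECONDS / 2, timestamp + CLIP_SECONDS / 2)
            labeled_clips.append((window, label))
    return labeled_clips

# Example with made-up captions: the second sentence mentions two keywords and is skipped.
print(auto_label([(312.4, "What a nice kick!"),
                  (455.0, "A great save after that kick.")]))
# -> one clip around t = 312.4 s labeled 'kick'
```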

10 Overview of the System (diagram repeated)

11 Video Classifier
Extract visual features from clips
– Histograms of oriented gradients and optical flow in space-time volumes (Laptev et al., ICCV 07; CVPR 08)
– Represent each clip as a ‘bag of visual words’
Use the automatically labeled video clips to train the activity classifier
Use the DECORATE classifier (Melville and Mooney, IJCAI 03)
– An ensemble-based classifier
– Works well with noisy and limited data
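For readers unfamiliar with the bag-of-visual-words step, the sketch below shows one common way to build it with k-means quantization and to train an ensemble on the resulting histograms. The descriptors are random placeholders standing in for the HoG/HoF space-time features, and since DECORATE is not available in standard libraries, a bagged decision-tree ensemble is used as a stand-in; none of this is the authors' actual code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import BaggingClassifier

VOCAB_SIZE = 200  # number of "visual words"

def build_vocabulary(all_descriptors: np.ndarray) -> KMeans:
    """Quantize local space-time descriptors (e.g. HoG/HoF) into a visual vocabulary."""
    return KMeans(n_clusters=VOCAB_SIZE, n_init=4, random_state=0).fit(all_descriptors)

def bag_of_words(clip_descriptors: np.ndarray, vocab: KMeans) -> np.ndarray:
    """Histogram of visual-word assignments for one clip, normalized to sum to 1."""
    words = vocab.predict(clip_descriptors)
    hist = np.bincount(words, minlength=VOCAB_SIZE).astype(float)
    return hist / max(hist.sum(), 1.0)

# Training on the automatically labeled clips (descriptors here are random placeholders):
rng = np.random.default_rng(0)
train_descriptors = [rng.normal(size=(100, 162)) for _ in range(40)]  # HoG/HoF-like vectors
train_labels = rng.choice(["kick", "save", "throw", "touch"], size=40)

vocab = build_vocabulary(np.vstack(train_descriptors))
X_train = np.array([bag_of_words(d, vocab) for d in train_descriptors])

# Stand-in for DECORATE: a bagged decision-tree ensemble, which also tolerates noisy labels.
clf = BaggingClassifier(n_estimators=15, random_state=0).fit(X_train, train_labels)
probs = clf.predict_proba(X_train[:3])  # class probabilities, later used for ranking
```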

12 Overview of the System (diagram repeated)

13 Caption Classifier
Sportscasters talk about events on the field and much else besides
– ~69% of the captions in the caption dataset are ‘irrelevant’ to the current events
Classifies captions as relevant vs. irrelevant
– Independent of the query classes
Uses an SVM string classifier
– Uses a subsequence kernel, which measures how many subsequences are shared by two strings
– A subsequence is any ordered sequence of tokens occurring either contiguously or non-contiguously in a string (Lodhi et al. 02; Bunescu and Mooney 05)
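The subsequence kernel can be sketched directly from the recursion in Lodhi et al. (2002). Below is a minimal, unoptimized Python version over token lists; the subsequence length p and decay factor lam are illustrative defaults, and the resulting Gram matrix could be passed to an SVM with a precomputed kernel. This is a sketch of the general technique, not the authors' implementation.

```python
def subseq_kernel(s, t, p=3, lam=0.5):
    """Gap-weighted subsequence kernel K_p(s, t) over token lists.

    Counts common (possibly non-contiguous) subsequences of length p,
    down-weighting gappy matches by lam raised to the matched span length.
    """
    m, n = len(s), len(t)
    # Kp[i][a][b] = K'_i(s[:a], t[:b]) in the Lodhi et al. recursion.
    Kp = [[[1.0 if i == 0 else 0.0 for _ in range(n + 1)] for _ in range(m + 1)]
          for i in range(p)]
    for i in range(1, p):
        for a in range(1, m + 1):
            for b in range(1, n + 1):
                total = lam * Kp[i][a - 1][b]
                for j in range(1, b + 1):
                    if t[j - 1] == s[a - 1]:
                        total += Kp[i - 1][a - 1][j - 1] * lam ** (b - j + 2)
                Kp[i][a][b] = total
    value = 0.0
    for a in range(1, m + 1):
        for j in range(1, n + 1):
            if t[j - 1] == s[a - 1]:
                value += Kp[p - 1][a - 1][j - 1] * lam ** 2
    return value

def normalized(s, t, p=3, lam=0.5):
    """Length-normalized kernel, so scores are comparable across caption lengths."""
    return subseq_kernel(s, t, p, lam) / (
        (subseq_kernel(s, s, p, lam) * subseq_kernel(t, t, p, lam)) ** 0.5)

print(normalized("good save as well".split(), "what a good save".split(), p=2))
```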

14 Overview of the System (diagram repeated)

15 Retrieving and Ranking Videos
Videos are retrieved using captions, the same way as before
Two ways of ranking
– Probabilities given by the video classifier (VIDEO)
– Probabilities given by the caption classifier (CAPTION)
Aggregating the rankings
– Weighted late fusion of the rankings given by VIDEO and CAPTION
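One simple way to realize weighted late fusion is a convex combination of the two classifiers' probabilities, sketched below; the weight w and the scores are hypothetical, and the paper's exact fusion scheme may differ.

```python
def fuse_and_rank(clips, p_video, p_caption, w=0.5):
    """Rank retrieved clips by a weighted combination of the two classifiers' scores.

    p_video[c]   -- probability from the video classifier that clip c shows the queried activity
    p_caption[c] -- probability from the caption classifier that clip c's caption is relevant
    """
    fused = {c: w * p_video[c] + (1.0 - w) * p_caption[c] for c in clips}
    return sorted(clips, key=lambda c: fused[c], reverse=True)

# Hypothetical scores for three retrieved clips:
clips = ["clip_a", "clip_b", "clip_c"]
p_video = {"clip_a": 0.40, "clip_b": 0.75, "clip_c": 0.55}
p_caption = {"clip_a": 0.90, "clip_b": 0.20, "clip_c": 0.60}
print(fuse_and_rank(clips, p_video, p_caption, w=0.6))  # ['clip_a', 'clip_c', 'clip_b']
```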

16 Experiment
Dataset
– 23 soccer games recorded from TV broadcast
– Average length: 1 hr 50 min
– Average number of captions: 1,246
– Caption classifier trained on 4 separate hand-labeled games
Metrics
– MAP score: Mean Average Precision
Baselines
– Baseline: ranking clips randomly
– Gold-VIDEO: video classifier trained on manually “cleaned” data
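For reference, Mean Average Precision over a set of queries can be computed as in the minimal sketch below, assuming each retrieved clip in a ranked list carries a binary relevance judgment; this is not the authors' evaluation code.

```python
def average_precision(is_relevant):
    """Average precision for one ranked list; is_relevant lists booleans in rank order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """MAP (in percent) over one ranked list per query."""
    return 100.0 * sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# e.g. two queries, each with retrieved clips marked correct/incorrect in rank order:
print(mean_average_precision([[True, False, True, True],
                              [False, True, True, False]]))  # ~69.4
```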

17 Dataset (table): for each query (Kick, Save, Throw, Touch), the number of total clips, number of correct clips, and % noise in the automatically labeled data

18 Results – Ranking by VIDEO (table): MAP of the random-ranking Baseline (65.68) compared with the VIDEO and Gold-VIDEO rankings using the DECORATE, Bagging, and SVM classifiers

19 Results: Aggregating the Rankings

20 Example rankings for the query ‘touch’ (slide figure)
One ranking: “Lovely touch.” / “Just trying to touch it on.” / “Just touched on by Nani” / “If he had not touched it.” / “I do not think it was touched”
Another ranking: “Just trying to touch it on.” / “Lovely touch.” / “Just touched on by Nani” / “If he had not touched it.” / “I do not think it was touched”
MAP values shown for VIDEO, CAPTION, and VIDEO+CAPTION (VIDEO+CAPTION MAP = 80.41)

21 Future Directions
Improve the activity classifier
– Classification accuracy is currently low
– Pre-process videos to remove clutter
Remove false positives from the automatically labeled data
Improve recall
Exploit temporal relations between activities
– E.g. ‘save’ is mostly preceded by ‘kick’

22 Conclusion
Captioned videos are a good source for automatically collecting labeled data
– They help increase the precision of the retrieved ranked list
A novel caption classifier further improves the MAP score
No human labeling of videos is required

23 Thank You!