1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at Austin Joint work with David Chen, Sonal Gupta, Joohyun Kim, Rohit Kate, Kristen Grauman

Learning for Language and Vision
Natural Language Processing (NLP) and Computer Vision (CV) are both very challenging problems. Machine Learning (ML) is now extensively used to automate the construction of effective NLP and CV systems. This generally relies on supervised ML, which requires difficult and expensive human annotation of large text or image/video corpora for training.

Cross-Supervision of Language and Vision
Use naturally co-occurring perceptual input to supervise language learning, and naturally co-occurring linguistic input to supervise visual learning.
[Diagram: a scene paired with the caption "Blue cylinder on top of a red cube"; for the language learner the caption is the input and the perceived scene the supervision, while for the vision learner the scene is the input and the caption the supervision.]

Using Perception to Supervise Language: Learning to Sportscast (Chen & Mooney, ICML-08)

5 Semantic Parsing
A semantic parser maps a natural-language sentence to a complete, detailed semantic representation: a logical form or meaning representation (MR). For many applications, the desired output is immediately executable by another program. Sample test application: CLang, the RoboCup Coach Language.

6 CLang: RoboCup Coach Language
In the RoboCup Coach competition, teams compete to coach simulated soccer players. The coaching instructions are given in a formal language called CLang. Semantic parsing maps the English instruction to CLang:
English: "If the ball is in our penalty area, then all our players except player 4 should stay in our half."
CLang: ((bpos (penalty-area our)) (do (player-except our{4}) (pos (half our))))

7 Learning Semantic Parsers
Manually programming robust semantic parsers is difficult due to the complexity of the task. Semantic parsers can be learned automatically from sentences paired with their logical form.
[Diagram: NL→MR training examples feed a semantic-parser learner, which outputs a semantic parser that maps natural language to meaning representations.]

8 Our Semantic-Parser Learners
CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003)
– Separates parser learning from semantic-lexicon learning.
– Learns a deterministic parser using ILP techniques.
COCKTAIL (Tang & Mooney, 2001)
– Improved ILP algorithm for CHILL.
SILT (Kate, Wong & Mooney, 2005)
– Learns symbolic transformation rules for mapping directly from NL to MR.
SCISSOR (Ge & Mooney, 2005)
– Integrates semantic interpretation into Collins' statistical syntactic parser.
WASP (Wong & Mooney, 2006; 2007)
– Uses syntax-based statistical machine translation methods.
KRISP (Kate & Mooney, 2006)
– Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.

9 WASP: A Machine Translation Approach to Semantic Parsing
Uses the latest statistical machine translation techniques:
– Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005)
– Statistical word alignment (Brown et al., 1993; Och & Ney, 2003)
An SCFG supports both directions (a toy sketch follows below):
– Semantic parsing: NL → MR
– Tactical generation: MR → NL
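The key property is that one synchronous grammar drives both parsing and generation. Here is a minimal toy sketch of the idea in Python; the grammar, rules, and derive function are invented for illustration, whereas WASP learns weighted rules automatically from NL/MR data:

# Toy synchronous CFG: each rule pairs an NL template with an MR template
# over the same nonterminals, so a single derivation yields both strings.
RULES = {
    "ACTION":  [("PLAYER1 passes to PLAYER2", "pass(PLAYER1, PLAYER2)")],
    "PLAYER1": [("pink8", "Pink8"), ("pink11", "Pink11")],
    "PLAYER2": [("pink11", "Pink11"), ("pink8", "Pink8")],
}

def derive(symbol, choice):
    """Expand `symbol`, picking rule choice[nt] for each nonterminal nt;
    returns the paired (sentence, meaning representation)."""
    nl, mr = RULES[symbol][choice[symbol]]
    for nt in RULES:
        if nt != symbol and nt in nl:
            sub_nl, sub_mr = derive(nt, choice)
            nl = nl.replace(nt, sub_nl)
            mr = mr.replace(nt, sub_mr)
    return nl, mr

print(derive("ACTION", {"ACTION": 0, "PLAYER1": 0, "PLAYER2": 0}))
# ('pink8 passes to pink11', 'pass(Pink8, Pink11)')

Because every rule rewrites the NL side and the MR side in lockstep, choosing a derivation fixes the sentence and its meaning representation simultaneously.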

KRISP: A String Kernel/SVM Approach to Semantic Parsing
Productions in the formal grammar defining the MR are treated as semantic concepts. An SVM classifier is trained for each production using a string subsequence kernel (Lodhi et al., 2002) to recognize phrases that refer to that concept. The resulting set of string classifiers is used with a version of Earley's CFG parser to compositionally build the most probable MR for a sentence.
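For intuition, here is a deliberately naive version of a gapped string-subsequence kernel: every length-n subsequence shared by two strings contributes, discounted by a decay factor per unit of span it occupies. This enumerates subsequences explicitly and is exponential; KRISP uses the efficient dynamic-programming formulation of Lodhi et al. (2002).

from itertools import combinations
from collections import defaultdict

def subseq_weights(s, n, lam=0.5):
    """Weight of each length-n subsequence of s: an occurrence spanning
    indices i_1 < ... < i_n contributes lam ** (i_n - i_1 + 1)."""
    w = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        w[u] += lam ** (idx[-1] - idx[0] + 1)
    return w

def string_kernel(s, t, n=3, lam=0.5):
    """K(s, t) = sum over shared subsequences u of phi_u(s) * phi_u(t)."""
    ws, wt = subseq_weights(s, n, lam), subseq_weights(t, n, lam)
    return sum(ws[u] * wt[u] for u in ws if u in wt)

print(string_kernel("nice kick", "lovely kick"))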

11 Learning Language from Perceptual Context
Children do not learn language from annotated corpora. Neither do they learn language just from reading the newspaper, surfing the web, or listening to the radio (cf. unsupervised language learning and the DARPA Learning by Reading program). The natural way to learn language is to perceive it in the context of its use in the physical and social world. This requires inferring the meaning of utterances from their perceptual context.

Ambiguous Supervision for Learning Semantic Parsers
A computer system simultaneously exposed to perceptual contexts and natural language utterances should be able to learn the underlying language semantics. We consider ambiguous training data of sentences associated with multiple potential MRs. Siskind (1996) uses this type of "referentially uncertain" training data to learn the meanings of words. Extracting meaning representations from perceptual data is a difficult unsolved problem, so our system works directly with symbolic MRs.

13 Tractable Challenge Problem: Learning to Be a Sportscaster
Goal: learn from realistic data of natural language used in a representative context while avoiding difficult issues in computer perception (i.e., speech and vision). Solution: learn from textually annotated traces of activity in a simulated environment. Example: traces of games in the RoboCup simulator paired with textual sportscaster commentary.

14 Grounded Language Learning in RoboCup
[Architecture diagram: the RoboCup simulator produces perceived facts via simulated perception; these, paired with the sportscaster's commentary, feed a grounded language learner, which produces an SCFG used by both a semantic parser and a language generator.]

15 RoboCup Sportscaster Trace
Natural language commentary is paired with the stream of meaning representations extracted from the simulator; each sentence is ambiguous among the events that occurred around it.
Commentary:
– Purple goalie turns the ball over to Pink8
– Pink11 looks around for a teammate
– Pink8 passes the ball to Pink11
– Purple team is very sloppy today
– Pink11 makes a long pass to Pink8
– Pink8 passes back to Pink11
Extracted events: badPass(Purple1, Pink8); turnover(Purple1, Pink8); pass(Pink11, Pink8); pass(Pink8, Pink11); ballstopped; pass(Pink8, Pink11); kick(Pink11); kick(Pink8); kick(Pink11); kick(Pink8)

18 RoboCup Sportscaster Trace
To the learner, predicates and constants are initially just anonymous symbols. The same commentary is paired with:
P6(C1, C19); P5(C1, C19); P2(C22, C19); P2(C19, C22); P0; P2(C19, C22); P1(C22); P1(C19); P1(C22); P1(C19)

19 Sportscasting Data
Collected human textual commentary for 4 RoboCup championship games.
– Avg. # events/game = 2,613
– Avg. # sentences/game = 509
Each sentence was matched to all events within the previous 5 seconds (a sketch of this pairing step follows below).
– Avg. # MRs/sentence = 2.5 (min 1, max 12)
Manually annotated with the correct matching of sentences to MRs (for evaluation purposes only).
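A sketch of how such ambiguous pairs can be assembled from timestamped traces; the tuple formats are illustrative, not the released dataset's format:

def build_ambiguous_pairs(comments, events, window=5.0):
    """comments: list of (time_sec, sentence); events: list of (time_sec, mr).
    Pair each sentence with all MRs for events in the previous `window` seconds."""
    pairs = []
    for t_c, sentence in comments:
        candidates = [mr for t_e, mr in events if t_c - window <= t_e <= t_c]
        if candidates:
            pairs.append((sentence, candidates))
    return pairs

comments = [(12.0, "Pink8 passes the ball to Pink11")]
events = [(9.5, "kick(Pink8)"), (10.0, "pass(Pink8, Pink11)"), (2.0, "ballstopped")]
print(build_ambiguous_pairs(comments, events))
# [('Pink8 passes the ball to Pink11', ['kick(Pink8)', 'pass(Pink8, Pink11)'])]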

KRISPER: KRISP with EM-like Retraining
An extension of KRISP that learns from ambiguous supervision (Kate & Mooney, AAAI-07). Uses an iterative EM-like self-training method to gradually converge on a correct meaning for each sentence.

21-27 KRISPER's Training Algorithm
Illustrated on ambiguous child-directed data, where each sentence is paired with several candidate meanings:
Sentences: "Daisy gave the clock to the mouse." / "Mommy saw that Mary gave the hammer to the dog." / "The dog broke the box." / "John gave the bag to the mouse." / "The dog threw the ball."
Candidate MRs: ate(mouse, orange); gave(daisy, clock, mouse); ate(dog, apple); saw(mother, gave(mary, dog, hammer)); saw(john, walks(man, dog)); broke(dog, box); gave(woman, toy, mouse); gave(john, bag, mouse); threw(dog, ball); runs(dog)
1. Assume every possible meaning for a sentence is correct.
2. Weight each resulting NL-MR pair inversely to the sentence's number of candidate MRs (the slides show weights such as 1/2, 1/4, 1/5, 1/3) and give the weighted pairs to KRISP.
3. Estimate the confidence of each NL-MR pair using the resulting trained parser.
4. Use maximum-weight matching on a bipartite graph to find the best NL-MR pairs [Munkres, 1957].
5. Give the best pairs to KRISP in the next iteration, and repeat until convergence.
(A code sketch of this loop follows below.)
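A schematic of the loop in Python; train_parser and score_pair are hypothetical stand-ins for KRISP training and the trained parser's confidence on an (NL, MR) pair, and SciPy's Hungarian-algorithm solver replaces the original Munkres formulation:

import numpy as np
from scipy.optimize import linear_sum_assignment

def krisper_style_training(pairs, train_parser, score_pair, iterations=5):
    """pairs: list of (sentence, [candidate MRs])."""
    # Steps 1-2: every candidate meaning is assumed correct, weighted
    # inversely to the sentence's number of candidates.
    data = [(s, mr, 1.0 / len(cands)) for s, cands in pairs for mr in cands]
    parser = train_parser(data)
    for _ in range(iterations):
        sentences = [s for s, _ in pairs]
        mrs = sorted({mr for _, cands in pairs for mr in cands})
        # Step 3: confidence matrix; disallowed pairs get a huge negative
        # score so the matching never selects them.
        conf = np.full((len(sentences), len(mrs)), -1e9)
        for i, (s, cands) in enumerate(pairs):
            for mr in cands:
                conf[i, mrs.index(mr)] = score_pair(parser, s, mr)
        # Step 4: maximum-weight bipartite matching (negate to minimize).
        rows, cols = linear_sum_assignment(-conf)
        best = [(sentences[i], mrs[j], 1.0)
                for i, j in zip(rows, cols) if conf[i, j] > -1e8]
        # Step 5: retrain on the best pairs.
        parser = train_parser(best)
    return parser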

28 WASPER
WASP with EM-like retraining to handle ambiguous training data: the same augmentation that turned KRISP into KRISPER.

29 KRISPER-WASP
The first iteration of EM-like training produces very noisy training data (> 50% errors), and KRISP is better than WASP at handling noisy training data:
– The SVM prevents overfitting.
– The string kernel allows partial matching.
But KRISP does not support language generation. So first train KRISPER just to determine the best NL→MR matchings, then train WASP on the resulting unambiguously supervised data.

30 WASPER-GEN
In KRISPER and WASPER, the correct MR for each sentence is chosen by maximizing the confidence of semantic parsing (NL→MR). WASPER-GEN instead determines the best matching based on generation (MR→NL): score each potential NL/MR pair using the currently trained WASP⁻¹ generator, computing the NIST MT score between the generated sentence and the potential matching sentence.
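A sketch of the generation-based matching criterion; generate is a hypothetical wrapper around the trained WASP⁻¹ generator, and NLTK's sentence-level NIST score stands in for the NIST MT metric:

from nltk.translate.nist_score import sentence_nist

def best_mr_by_generation(sentence, candidate_mrs, generate):
    """Pick the candidate MR whose generated sentence scores highest
    against the observed sentence under NIST."""
    reference = [sentence.split()]        # tokenized reference(s)
    def nist(mr):
        hypothesis = generate(mr).split()
        try:
            return sentence_nist(reference, hypothesis, n=2)
        except ZeroDivisionError:         # degenerate case: no overlap
            return 0.0
    return max(candidate_mrs, key=nist)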

Strategic Generation Generation requires not only knowing how to say something (tactical generation) but also what to say (strategic generation). For automated sportscasting, one must be able to effectively choose which events to describe. 31

32-33 Example of Strategic Generation
A stream of extracted events, only a subset of which the human sportscaster actually commented on:
pass(purple7, purple6); ballstopped; kick(purple6); pass(purple6, purple2); ballstopped; kick(purple2); pass(purple2, purple3); kick(purple3); badPass(purple3, pink9); turnover(purple3, pink9)

34 Learning for Strategic Generation
For each event type (e.g. pass, kick), estimate the probability that the sportscaster describes it. This requires an NL/MR matching indicating which events were described, which the ambiguous training data does not provide. Two options:
– Use the estimated matching computed by KRISPER, WASPER, or WASPER-GEN.
– Use a version of EM to determine the probability of mentioning each event type based on strategic information alone.

Iterative Generation Strategy Learning (IGSL) Directly estimates the likelihood of commenting on each event type from the ambiguous training data. Uses self-training iterations to improve estimates (à la EM).
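The core estimate is simple once some matching says which events were described; here is a sketch (the event-type extraction and the hard 0/1 matching are simplifications, and IGSL's self-training iterations are not shown):

from collections import Counter

def event_type(mr):
    return mr.split("(")[0]          # e.g. "pass(Pink8, Pink11)" -> "pass"

def describe_probs(all_events, described_events):
    """P(commented | event type) given a matching that tells us which
    events in the trace were actually described by the sportscaster."""
    total = Counter(event_type(mr) for mr in all_events)
    hit = Counter(event_type(mr) for mr in described_events)
    return {t: hit[t] / total[t] for t in total}

probs = describe_probs(
    ["pass(A,B)", "kick(B)", "pass(B,C)", "ballstopped"],
    ["pass(A,B)", "pass(B,C)"])
print(probs)   # {'pass': 1.0, 'kick': 0.0, 'ballstopped': 0.0}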

Demo
A game clip commentated using WASPER-GEN with EM-based strategic generation, since this combination gave the best generation results. FreeTTS was used to synthesize speech from the textual output. The system was also trained on Korean to illustrate language independence.

Experimental Evaluation
Generated learning curves by training on all combinations of 1 to 3 games and testing on all games not used for training.
Baselines:
– Random Matching: WASP trained on a random choice of possible MR for each comment.
– Gold Matching: WASP trained on the correct matching of MR for each comment.
Metrics:
– Precision: % of the system's annotations that are correct.
– Recall: % of gold-standard annotations correctly produced.
– F-measure: harmonic mean of precision and recall.

Evaluating Semantic Parsing
Measure how accurately the learned parser maps sentences to their correct meanings in the test games. Use the gold-standard matches to determine the correct MR for each sentence that has one. A generated MR must exactly match the gold standard to count as correct.

Results on Semantic Parsing

Evaluating Tactical Generation
Measure how accurately the NL generator produces English sentences for chosen MRs in the test games. Use the gold-standard matches to determine the correct sentence for each MR that has one, and use the NIST score to compare the generated sentence to the gold-standard one.

Results on Tactical Generation

Evaluating Strategic Generation In the test games, measure how accurately the system determines which perceived events to comment on. Compare the subset of events chosen by the system to the subset chosen by the human annotator (as given by the gold-standard matching).

Results on Strategic Generation

46 Human Evaluation (Quasi Turing Test)
Asked 4 fluent English speakers to evaluate the overall quality of sportscasts. Randomly picked a 2-minute segment from each of the 4 games. Each judge evaluated 8 commented game clips: each of the 4 segments commented once by a human and once by the machine when tested on that game (and trained on the 3 other games). The 8 clips were shown to each judge in random, counterbalanced order, and judges were not told which were human- or machine-generated.

47 Human Evaluation Metrics
Score | English Fluency | Semantic Correctness | Sportscasting Ability
5     | Flawless        | Always               | Excellent
4     | Good            | Usually              | Good
3     | Non-native      | Sometimes            | Average
2     | Disfluent       | Rarely               | Bad
1     | Gibberish       | Never                | Terrible

Results on Human Evaluation
[Table: Human vs. Machine commentators, and their difference, on English Fluency, Semantic Correctness, and Sportscasting Ability; the numeric scores are missing from the transcript.]

Co-Training with Visual and Textual Views (Gupta, Kim, Grauman & Mooney, ECML-08) 49

50 Semi-Supervised Multi-Modal Image Classification
Use both images or videos and their textual captions for classification. Use semi-supervised learning to exploit unlabeled training data in addition to labeled data. How? Co-training (Blum and Mitchell, 1998) over visual and textual views. This illustrates both language supervising vision and vision supervising language.

Sample Classified Captioned Images
Example captions: "Cultivating farming at Nabataean Ruins of the Ancient Avdat"; "Bedouin Leads His Donkey That Carries Load Of Straw"; "Ibex Eating In The Nature"; "Entrance To Mikveh Israel Agricultural School". Class labels: Desert, Trees.

52 Co-training
A semi-supervised learning paradigm that exploits two mutually independent and sufficient views of the data. The features divide into two sets:
– The instance space: X = X₁ × X₂
– Each example: x = (x₁, x₂)
Proven effective in several domains:
– Web page classification (content and hyperlink)
– E-mail classification (header and body)

53-59 Co-training Procedure
1. Start with a small set of initially labeled instances, each with a text view and a visual view.
2. Train a text classifier and a visual classifier on the labeled data (supervised learning).
3. Apply both classifiers to the unlabeled instances.
4. Each classifier labels the unlabeled instances it is most confident about.
5. Add the newly labeled instances (both views) to the labeled set.
6. Retrain both classifiers and repeat.
7. To label a new instance, combine the predictions of the text and visual classifiers.
(A code sketch of this loop follows below.)
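A minimal sketch of the loop with scikit-learn SVMs; the per-round counts, the confidence-based selection, and the feature matrices are illustrative choices rather than the paper's exact settings:

import numpy as np
from sklearn.svm import SVC

def co_train(X_text, X_vis, y, labeled_idx, rounds=10, per_round=5):
    """Blum & Mitchell-style co-training over text and visual views.
    y: label array; entries outside labeled_idx are placeholders that
    get overwritten with pseudo-labels as training proceeds."""
    labeled = set(labeled_idx)
    pool = set(range(len(y))) - labeled
    text_clf = vis_clf = None
    for _ in range(rounds):
        idx = sorted(labeled)
        text_clf = SVC(kernel="rbf", probability=True).fit(X_text[idx], y[idx])
        vis_clf = SVC(kernel="rbf", probability=True).fit(X_vis[idx], y[idx])
        if not pool:
            break
        # Each classifier pseudo-labels the unlabeled examples it is most
        # confident about; both views of those examples become labeled.
        for clf, X in ((text_clf, X_text), (vis_clf, X_vis)):
            ranked = sorted(pool,
                            key=lambda i: clf.predict_proba(X[i:i+1]).max(),
                            reverse=True)
            for i in ranked[:per_round]:
                y[i] = clf.predict(X[i:i+1])[0]
                labeled.add(i)
                pool.discard(i)
    return text_clf, vis_clf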

60 Baseline - Individual Views
– Image/Video view: only image/video features are used.
– Text view: only textual features are used.

61 Baseline - Early Fusion
Concatenate the visual and textual features of each instance into a single vector and train one classifier; at test time, classify the concatenated views.

62 Baseline - Late Fusion
Train separate visual and text classifiers; to label a new instance, combine their individual predictions (both fusion baselines are sketched below).
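The two fusion baselines reduce to a few lines; this sketch assumes scikit-learn SVMs trained with probability estimates and an illustrative fusion weight:

import numpy as np
from sklearn.svm import SVC

# Early fusion: one classifier over concatenated text + visual features.
def early_fusion_fit(X_text, X_vis, y):
    return SVC(probability=True).fit(np.hstack([X_text, X_vis]), y)

# Late fusion: separate per-view classifiers; average their probabilities.
def late_fusion_predict(text_clf, vis_clf, x_text, x_vis, w=0.5):
    p = (w * text_clf.predict_proba(x_text)
         + (1 - w) * vis_clf.predict_proba(x_vis))
    return p.argmax(axis=1)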

63 Image Dataset
Our captioned image data is taken from Bekkerman & Jeon (CVPR '07) and consists of images with short text captions. We used two classes, Desert and Trees: a total of 362 instances.

Text and Visual Features Text view: standard bag of words. Image view: standard bag of visual words that capture texture and color information. 64
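A sketch of the "bag of visual words" representation: cluster local descriptors into a codebook with k-means, then describe each image by a histogram of its descriptors' cluster assignments. Descriptor extraction (e.g. texture and color patches) is assumed to happen elsewhere, and k is illustrative:

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, k=200):
    """all_descriptors: (N, d) array of local texture/color descriptors
    pooled from training images. Returns the fitted k-means 'codebook'."""
    return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

def bag_of_visual_words(codebook, image_descriptors):
    """Histogram of codeword assignments for one image, L1-normalized."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)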

65 Experimental Methodology
The test set is disjoint from both the labeled and unlabeled training sets. For plotting learning curves, we vary the percentage of training examples that are labeled; the rest are used as unlabeled data for co-training. An SVM with an RBF kernel is the base classifier for both the visual and text views. All experiments are evaluated with 10 iterations of 10-fold cross-validation.

Learning Curves for Israel Images 66

Using Closed Captions to Supervise Activity Recognition in Videos (Gupta & Mooney, VCL-09) 67

Activity Recognition in Video
Recognizing activities in video generally uses supervised learning trained on human-labeled video clips. Linguistic information in closed captions (CCs) can instead be used as "weak supervision" for training activity recognizers, and the automatically trained recognizers can be used to improve the precision of video retrieval.

Sample Soccer Videos
Kick: "I do not think there is any real intent, just trying to make sure he gets his body across, but it was a free kick." / "Lovely kick." / "Goal kick."
Save: "Good save as well." / "I think Brown made a wonderful fingertip save there." / "And it is a really chopped save."

Throw: "If you are defending a lead, your throw back takes it that far up the pitch and gets a throw-in." / "And Carlos Tevez has won the throw." / "Another shot for a throw."
Touch: "When they are going to pass it in the back, it is a really pure touch." / "Look at that, Henry, again, he had time on the ball to take another touch and prepare that ball properly." / "All it needed was a touch."

71 Using Video Closed-Captions
CCs contain both relevant and irrelevant information:
– "Beautiful pull-back." (relevant)
– "They scored in the last kick of the game against the Czech Republic." (irrelevant)
– "That is a fairly good tackle." (relevant)
– "Turkey can be well-pleased with the way they started." (irrelevant)
We use a novel caption classifier to rank the retrieved video clips by relevance.

72 System Overview
[Diagram: During training, a caption-based video retriever extracts automatically labeled video clips from captioned training videos, which train a video classifier; manually labeled captions train a caption classifier. During testing, a query is run through the caption-based retriever over captioned video, and a video ranker scores the retrieved clips with the video and caption classifiers to produce a ranked list of video clips.]

74 Retrieving and Labeling Data
– Identify all closed-caption sentences that contain exactly one of the activity keywords: kick, save, throw, touch.
– Extract a clip of 8 seconds around the corresponding time.
– Label the clip with the corresponding class, e.g. "...What a nice kick!..." yields a clip labeled kick (a sketch of this step follows below).
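A sketch of the keyword-based clip labeling; the caption format and the window handling are assumptions:

import re

KEYWORDS = {"kick", "save", "throw", "touch"}

def label_clips(captions, half_window=4.0):
    """captions: list of (time_sec, sentence). Returns (start, end, label)
    clip specs for sentences mentioning exactly one activity keyword."""
    clips = []
    for t, sentence in captions:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        hits = KEYWORDS & words
        if len(hits) == 1:                 # unambiguous keyword mention
            clips.append((t - half_window, t + half_window, hits.pop()))
    return clips

print(label_clips([(63.0, "What a nice kick!"), (80.0, "Kick it and save it")]))
# [(59.0, 67.0, 'kick')]  -- the second caption has two keywords, so it is skipped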

76 Video Classifier
Extract visual features from clips:
– Histograms of oriented gradients and optical flow in a space-time volume (Laptev et al., ICCV 07; CVPR 08).
– Represent each clip as a 'bag of visual words'.
Use the automatically labeled video clips to train the activity classifier. We use DECORATE (Melville and Mooney, IJCAI 03):
– An ensemble-based classifier.
– Works well with noisy and limited training data.

78 Caption Classifier
Sportscasters talk about events on the field as well as other information; 69% of the captions in our dataset are 'irrelevant' to the current events. The caption classifier separates relevant from irrelevant captions, independent of the query classes. We use an SVM string classifier:
– Uses a subsequence kernel that measures how many subsequences two strings share (Lodhi et al. 02; Bunescu and Mooney 05).
– More accurate than a "bag of words" classifier, since it takes word order into account.

79 Retrieving and Ranking Videos
Videos are retrieved using captions, the same way as before. Two ways of ranking:
– Probabilities given by the video classifier (VIDEO).
– Probabilities given by the caption classifier (CAPTION).
The rankings are aggregated by weighted late fusion of VIDEO and CAPTION (sketched below).
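A sketch of the weighted late fusion, assuming both classifiers output comparable relevance probabilities per clip and an illustrative weight:

def fuse_rankings(video_scores, caption_scores, w=0.5):
    """video_scores, caption_scores: dicts mapping clip id -> relevance
    probability from the VIDEO and CAPTION classifiers. Returns clip ids
    sorted by the weighted combination."""
    fused = {c: w * video_scores[c] + (1 - w) * caption_scores[c]
             for c in video_scores}
    return sorted(fused, key=fused.get, reverse=True)

print(fuse_rankings({"clip1": 0.9, "clip2": 0.4},
                    {"clip1": 0.2, "clip2": 0.8}, w=0.6))
# ['clip1', 'clip2']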

80 Experiment
Dataset:
– 23 soccer games recorded from TV broadcast.
– Avg. length: 1 hr 50 min.
– Avg. number of captions: 1,246.
– Caption classifier trained on 4 separate hand-labeled games.
Metric: MAP (Mean Average Precision).
Methodology: leave-one-game-out cross-validation.
Baseline: ranking clips randomly.

81 Dataset Statistics
[Table: for each query (Kick, Save, Throw, Touch), the number of retrieved clips (# Total), the number correctly labeled (# Correct), and the % noise; the numeric values are missing from the transcript.]

82 Retrieval Results
[Chart: Mean Average Precision (MAP) for each ranking method; values not preserved in the transcript.]

Future Work Use real (not simulated) visual context to supervise language learning. Use more sophisticated linguistic analysis to supervise visual learning. 83

84 Conclusions
Current language and visual learning uses expensive, unrealistic training data. Naturally occurring perceptual context can be used to supervise language learning:
– Learning to sportscast simulated RoboCup games.
Naturally occurring linguistic context can be used to supervise learning for computer vision:
– Using multi-modal co-training to improve classification of captioned images and videos.
– Using closed captions to automatically train activity recognizers and improve video retrieval.

85 Questions?
Relevant papers at: