
Watch, Listen & Learn: Co-training on Captioned Images and Videos
Sonal Gupta, Joohyun Kim, Kristen Grauman, Raymond Mooney
The University of Texas at Austin, U.S.A.

Outline
- Introduction
- Motivation
- Approach
- How does co-training work?
- Experimental evaluation
- Conclusions

Introduction
(Demo: the same video clip shown without sound or text, with sound or text, and with only sound or text.)

Motivation
- Image recognition and human activity recognition in videos
- Hard to classify: ambiguous visual cues
- Expensive to manually label instances
- Images and videos often have text captions
- Leverage this multi-modal data
- Use readily available unlabeled data to improve accuracy

Goals
- Classify images and videos using visual information together with the associated text captions
- Use unlabeled image and video examples

Image Examples
(Captioned images from the two classes, Desert and Trees, e.g. "Cultivating farming at Nabataean Ruins of the Ancient Avdat", "Bedouin Leads His Donkey That Carries Load Of Straw", "Ibex Eating In The Nature", "Entrance To Mikveh Israel Agricultural School".)

Video Examples
(Captioned clips from the four classes Kicking, Dribbling, Dancing, and Spinning, e.g. "Her last spin is going to make her win", "He runs in and hits ball with the inside of his shoes to reach the target", "God, that jump was very tricky", "Using the sole to tap the ball she keeps it in check.")

Related Work
- Images + text
  - Barnard et al. (JMLR '03) and Duygulu et al. (ECCV '02) built models to annotate image regions with words.
  - Bekkerman and Jeon (CVPR '07) exploited multi-modal information to cluster images with captions.
  - Quattoni et al. (CVPR '07) used unlabeled images with captions to improve learning in future image classification problems with no associated captions.
- Videos + text
  - Wang et al. (MIR '07) used co-training to combine visual and textual 'concepts' to categorize TV ads; they retrieved text using OCR and expanded the textual features with external sources.
  - Everingham et al. (BMVC '06) used visual information, closed-caption text, and movie scripts to annotate faces.
  - Fleischman and Roy (NAACL '07) used text commentary and motion descriptions in baseball games to retrieve relevant video clips given a text query.


Approach
- Combine two views of images and videos using the co-training learning algorithm (Blum and Mitchell '98)
- Views: text and visual
- Text view: caption of the image or video; readily available
- Visual view: color, texture, and temporal information in the image/video


Co-training
- A semi-supervised learning paradigm that exploits two mutually independent and sufficient views
- The features of the dataset can be divided into two sets:
  - The instance space: X = X1 × X2
  - Each example: x = (x1, x2)
- Proven effective in several domains:
  - Web page classification (content and hyperlinks)
  - E-mail classification (header and body)

Co-training (illustrated step by step)
1. Start with a small set of initially labeled instances, each with a text view and a visual view.
2. Train a text classifier and a visual classifier on their respective views (supervised learning).
3. Apply both classifiers to the pool of unlabeled instances.
4. Each classifier labels the unlabeled instances it is most confident about.
5. The newly labeled instances are added to the labeled set, with the label attached to both views.
6. Retrain both classifiers on the enlarged labeled set and repeat.
7. To label a new instance, combine the predictions of the text and visual classifiers.
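A minimal sketch of this loop, assuming scikit-learn SVMs with probability outputs as the per-view classifiers; the function names, the batch size default, and the probability-averaging rule in `predict` are illustrative choices, not the exact settings from the slides (the batch size and confidence thresholds actually used appear on the methodology-details slide at the end).

```python
import numpy as np
from sklearn.svm import SVC

def co_train(Xt_l, Xv_l, y_l, Xt_u, Xv_u, rounds=20, batch=5):
    """Co-training over a text view (Xt_*) and a visual view (Xv_*)."""
    text_clf = SVC(kernel="rbf", gamma=0.01, probability=True)
    vis_clf  = SVC(kernel="rbf", gamma=0.01, probability=True)

    for _ in range(rounds):
        if len(Xt_u) == 0:
            break
        # Train one classifier per view on the current labeled set.
        text_clf.fit(Xt_l, y_l)
        vis_clf.fit(Xv_l, y_l)

        # Each classifier proposes labels for the unlabeled examples it is
        # most confident about; both views of those examples get the label.
        picked, labels = [], []
        for clf, X_view in ((text_clf, Xt_u), (vis_clf, Xv_u)):
            proba = clf.predict_proba(X_view)
            for i in np.argsort(-proba.max(axis=1))[:batch]:
                if i not in picked:
                    picked.append(int(i))
                    labels.append(clf.classes_[proba[i].argmax()])

        # Move the newly pseudo-labeled examples into the labeled pool.
        Xt_l = np.vstack([Xt_l, Xt_u[picked]])
        Xv_l = np.vstack([Xv_l, Xv_u[picked]])
        y_l  = np.concatenate([y_l, labels])
        keep = np.setdiff1d(np.arange(len(Xt_u)), picked)
        Xt_u, Xv_u = Xt_u[keep], Xv_u[keep]

    return text_clf, vis_clf

def predict(text_clf, vis_clf, xt, xv):
    """Label a new instance by summing the two views' class probabilities."""
    p = text_clf.predict_proba([xt]) + vis_clf.predict_proba([xv])
    return text_clf.classes_[p.argmax()]
```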

Features
- Visual features
  - Image features
  - Video features
- Textual features

Image Features [Fei-Fei et al. '05, Bekkerman & Jeon '07]
- Divide each image into a 4 × 6 grid
- Capture the texture and color distribution of each cell in a 30-dimensional vector
- Cluster the vectors with k-means to quantize the features into a dictionary of visual words
- Represent each image as a histogram of visual words (N cell descriptors × 30 dimensions before quantization)
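A minimal sketch of the quantization step described above, using scikit-learn's KMeans; it assumes the 30-dimensional per-cell descriptors have already been computed (texture/color details are on the feature-details slide near the end) and uses k = 25 as noted in the dataset details.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_cell_descriptors, k=25):
    """Cluster all 30-dim grid-cell descriptors (from every image) into k visual words."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(all_cell_descriptors))

def image_histogram(vocab, cells):
    """Represent one image (its 4 x 6 = 24 cell descriptors) as a normalized
    histogram of visual words."""
    words = vocab.predict(np.asarray(cells))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()
```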

Video Features [Laptev, IJCV '05]
- Detect interest points: a Harris-Förstner corner detector applied over both the spatial and temporal dimensions
- Describe interest points: Histogram of Oriented Gradients (HoG)
- Create a spatio-temporal vocabulary: quantize the interest-point descriptors into a 200-word visual dictionary
- Represent each video as a histogram of visual words (N interest-point descriptors × 72 dimensions before quantization)
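A rough NumPy sketch of a 72-dimensional descriptor of the kind described above (a 4-bin gradient-orientation histogram for each cell of a 3 × 3 × 2 spatio-temporal grid, per the feature-details slide); this is a generic illustration, not Laptev's reference implementation, and the interest-point detection step is omitted.

```python
import numpy as np

def st_hog(patch, grid=(3, 3, 2), bins=4):
    """4-bin gradient-orientation histogram over a (y, x, t) patch,
    pooled over a 3x3x2 spatio-temporal grid -> 3*3*2*4 = 72 dims."""
    patch = patch.astype(float)
    gy = np.gradient(patch, axis=0)
    gx = np.gradient(patch, axis=1)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation in [0, pi)

    blocks_y = np.array_split(np.arange(patch.shape[0]), grid[0])
    blocks_x = np.array_split(np.arange(patch.shape[1]), grid[1])
    blocks_t = np.array_split(np.arange(patch.shape[2]), grid[2])

    descriptor = []
    for by in blocks_y:
        for bx in blocks_x:
            for bt in blocks_t:
                m = mag[np.ix_(by, bx, bt)].ravel()
                a = ang[np.ix_(by, bx, bt)].ravel()
                hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
                descriptor.append(hist)
    d = np.concatenate(descriptor)
    return d / (np.linalg.norm(d) + 1e-8)          # L2-normalized 72-dim descriptor
```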

Textual Features
- Raw text commentary (e.g. "That was a very nice forward camel.", "He runs in to chip the ball with his right foot.", "The small kid pushes the ball ahead with his tiny kicks.")
- Pipeline: raw text commentary → Porter stemmer → stop-word removal → standard bag-of-words representation
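A minimal sketch of the textual pipeline above, assuming NLTK's PorterStemmer and scikit-learn's CountVectorizer for stop-word removal and the bag-of-words representation.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
base = CountVectorizer(stop_words="english")   # tokenize, lowercase, drop stop words
base_analyzer = base.build_analyzer()

def stemmed_analyzer(doc):
    # Stem each surviving token with the Porter stemmer.
    return [stemmer.stem(tok) for tok in base_analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)

captions = [
    "He runs in to chip the ball with his right foot.",
    "That was a very nice forward camel.",
]
X_text = vectorizer.fit_transform(captions)    # bag-of-words feature matrix
```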


Experimental Methodology
- The test set is disjoint from both the labeled and the unlabeled training sets
- Learning curves are plotted by varying the percentage of training examples that are labeled
- SVM is the base classifier for both the visual and the text views
  - SMO implementation in WEKA (Witten & Frank '05)
  - RBF kernel (gamma = 0.01)
- All experiments are evaluated with 10 iterations of 10-fold cross-validation
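For concreteness, the supervised setup expressed in scikit-learn terms (the slides actually use WEKA's SMO); the RBF kernel with gamma = 0.01 and the 10 × 10-fold cross-validation mirror the settings above, but this is a sketch rather than the original experimental code.

```python
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = SVC(kernel="rbf", gamma=0.01)

def evaluate(X, y, seed=0):
    """One run of 10-fold cross-validation; the slides average 10 such runs."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv).mean()

# accuracies = [evaluate(X, y, seed=s) for s in range(10)]
```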

Baselines - Overview
- Uni-modal: visual view only, textual view only
- Multi-modal (Snoek et al., ICMI '05): early fusion, late fusion
- Supervised SVM: uni-modal and multi-modal
- Other semi-supervised methods:
  - Semi-supervised EM (uni-modal and multi-modal)
  - Transductive SVM (uni-modal and multi-modal)

Baseline - Individual Views
- Image/video view: only image/video features are used
- Text view: only textual features are used

Baseline - Early Fusion
- Concatenate the visual and textual features into a single feature vector
- Train one classifier on the concatenated representation and apply it at test time

Baseline - Late Fusion
- Train a separate classifier on each view (visual and textual)
- To label a new instance, combine the predictions of the two classifiers
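Hedged sketches of the two fusion baselines, assuming the same RBF-kernel SVM as the base classifier; the late-fusion combination rule here (summing class probabilities) is one simple choice and may differ from the exact rule used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

# Early fusion: concatenate the visual and textual feature vectors, train one classifier.
def early_fusion_fit(Xv, Xt, y):
    return SVC(kernel="rbf", gamma=0.01).fit(np.hstack([Xv, Xt]), y)

# Late fusion: train one classifier per view, combine their class probabilities.
def late_fusion_fit(Xv, Xt, y):
    vis = SVC(kernel="rbf", gamma=0.01, probability=True).fit(Xv, y)
    txt = SVC(kernel="rbf", gamma=0.01, probability=True).fit(Xt, y)
    return vis, txt

def late_fusion_predict(vis, txt, Xv, Xt):
    p = vis.predict_proba(Xv) + txt.predict_proba(Xt)
    return vis.classes_[p.argmax(axis=1)]
```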

Baseline - Other Semi-Supervised Methods
- Semi-supervised Expectation Maximization (SemiSup EM)
  - Introduced by Nigam et al. (CIKM '00)
  - Uses Naive Bayes as the base classifier
- Transductive SVM in a semi-supervised setting
  - Introduced by Joachims (ICML '99) and Bennett & Demiriz (NIPS '99)

Image Dataset
- Taken from the Israel dataset (Bekkerman & Jeon, CVPR '07)
- Consists of images with short text captions
- Two classes used: Desert and Trees
- 362 instances in total


Results: Co-training vs. Supervised SVM
(Learning curves on the image dataset comparing co-training with SVM late fusion, SVM early fusion, SVM text view, and SVM image view.)

Results: Co-training vs. Supervised SVM
(Plot annotations mark differences of roughly 5%, 7%, and 12% in favor of co-training.)

Results: Co-training vs. Semi-Supervised EM
(Learning curves comparing co-training with SemiSup-EM late fusion, early fusion, text view, and image view.)

Results: Co-training vs. Semi-Supervised EM
(Plot annotation marks a difference of roughly 7%.)

Results: Co-training vs. Transductive SVM
(Plot annotation marks a difference of roughly 4%.)

Video Dataset
- Manually collected video clips:
  - kicking and dribbling from soccer game DVDs
  - dancing and spinning from figure skating DVDs
- Manually added text commentary to the clips
- Significant variation in the size of the person across clips
- Number of clips: dancing 59, spinning 47, dribbling 55, kicking 60
- Clips resized to 240 × 360 resolution; length varies from 20 to 120 frames


Results: Co-training vs. Supervised SVM (video dataset)
(Learning curves comparing co-training with SVM late fusion, SVM early fusion, SVM text view, and SVM video view.)

Results: Co-training vs. Supervised SVM (video dataset)
(Plot annotation marks a difference of roughly 3%.)

What if test videos have no captions?
- During training: each video has an associated text caption
- During testing: videos have no text captions (a realistic scenario)
- Co-training can exploit the text captions during training to improve the video classifier

Results: Co-training (tested on the video view only) vs. SVM
(Plot annotation marks a difference of roughly 2%.)

Conclusion
- Combining textual and visual features can improve accuracy
- Co-training is a useful way to combine textual and visual features to classify images and videos
- Co-training reduces the amount of labeling needed for images and videos

Questions?

References
- Bekkerman et al., Multi-way Distributional Clustering, ICML 2005
- Blum and Mitchell, Combining Labeled and Unlabeled Data with Co-training, COLT 1998
- Laptev, On Space-Time Interest Points, IJCV 2005
- Witten and Frank, Weka Data Mining Tool

Dataset Details
- Image dataset:
  - k = 25 for k-means
  - Number of textual features:
- Video dataset:
  - Most clips are 20 to 40 frames long
  - k = 200 for k-means
  - Number of textual features: 381

Feature Details
- Image features:
  - Texture: Gabor filters with 3 scales and 4 orientations
  - Color: mean, standard deviation, and skewness of the per-channel RGB and Lab pixel values
- Video features:
  - Interest points maximize a normalized spatio-temporal Laplacian over both spatial and temporal scales
  - HoG: 3 × 3 × 2 spatio-temporal blocks with a 4-bin HoG descriptor per block, giving a 72-element descriptor
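A sketch of how the 30-dimensional cell descriptor above could be computed with scikit-image and SciPy: 12 Gabor responses (3 scales × 4 orientations) plus mean, standard deviation, and skewness for each of the six RGB and Lab channels (18 values). The specific Gabor frequencies and the mean-magnitude pooling are assumptions, not values from the slides.

```python
import numpy as np
from scipy.stats import skew
from skimage.color import rgb2lab
from skimage.filters import gabor

def cell_descriptor(cell_rgb, frequencies=(0.1, 0.2, 0.4)):
    """30-dim descriptor for one grid cell: 12 Gabor + 18 color statistics."""
    gray = cell_rgb.mean(axis=2)
    gabor_feats = []
    for f in frequencies:                                      # 3 scales (assumed)
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4): # 4 orientations
            real, _ = gabor(gray, frequency=f, theta=theta)
            gabor_feats.append(np.abs(real).mean())            # pooled response

    channels = np.concatenate([cell_rgb, rgb2lab(cell_rgb)], axis=2)
    color_feats = []
    for c in range(channels.shape[2]):                         # R, G, B, L, a, b
        vals = channels[..., c].ravel()
        color_feats += [vals.mean(), vals.std(), skew(vals)]

    return np.array(gabor_feats + color_feats)                 # 12 + 18 = 30 dims
```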

Methodology Details
- Batch size = 5 in co-training
- Confidence thresholds for the image experiments: visual view = 0.65, text view = 0.98
- Confidence thresholds for the video experiments: visual view = 0.6, text view = 0.9
- Results are compared using a two-tailed paired t-test at the 95% confidence level