Watch, Listen and Learn Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney -Pratiksha Shah.


1 Watch, Listen and Learn Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney -Pratiksha Shah

2 What’s the problem
We want to automatically annotate and index images, given their ever-growing number.
Visual cues alone can be ambiguous: appearance depends on lighting, and even objects of the same kind vary widely.

3 Previous methods
Previous work focused on learning the association between visual and textual information.
Many researchers have worked on activity recognition in videos using only visual cues.
Some used co-training, but in a different flavour (co-SVM, two visual views, two text views).

4 How we propose to solve it
The two major factors are the features and the approach.
We want to use {visual information + linguistic information + unlabeled multi-modal data} for the learning process, using a co-training approach.
Visual and linguistic information are used as separate cues, and we expect them to complement each other during training.

5 What is co-training
Training using two different views of the data, each sufficient for classification and conditionally independent of the other given the class.
First, learn a separate classifier for each view.
The most confident predictions of these classifiers are then used to iteratively construct labeled training data.
Change made: an unlabeled example is labeled only if a pre-specified confidence threshold for that view is exceeded.
Co-training has been used for classifying web pages (content and hyperlink views) and email (header and body views), and for object detection models.

6 Co-training Algorithm
Train a classifier for each view using the labeled data with just the features for that view.
Loop over the following steps until there are no more unused unlabeled instances:
1. Compute predictions and confidences of both classifiers for all of the unlabeled instances.
2. For each view, choose the m unlabeled instances for which its classifier has the highest confidence. For each such instance, if the confidence value is less than the threshold for this view, ignore the instance and stop labeling instances with this view; otherwise label the instance and add it to the supervised training set.
3. Retrain the classifiers for both views using the augmented labeled data.
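
The loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which base classifier runs inside the loop, so a nearest-centroid classifier stands in here, and the names `train_centroids`, `predict`, `co_train`, the confidence formula and the default thresholds are all hypothetical. The sketch also stops early when no instance passes either threshold, which the slide does not spell out.

```python
import math

def train_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(centroids, x):
    """Return (label, confidence); confidence is a normalized exp(-distance) score."""
    scores = {c: math.exp(-math.dist(x, mu)) for c, mu in centroids.items()}
    label = max(scores, key=scores.get)
    return label, scores[label] / sum(scores.values())

def co_train(views_labeled, y, views_unlabeled, m=2, thresholds=(0.6, 0.6)):
    """Each views_* argument holds two lists of feature vectors, one per view."""
    labeled = [list(v) for v in views_labeled]
    labels = list(y)
    unused = list(range(len(views_unlabeled[0])))
    while unused:
        # (Re)train one classifier per view on the current labeled set.
        models = [train_centroids(labeled[v], labels) for v in range(2)]
        newly = {}
        for v in range(2):
            # Step 1: predictions and confidences for all unused instances.
            preds = [(predict(models[v], views_unlabeled[v][i]), i) for i in unused]
            preds.sort(key=lambda p: p[0][1], reverse=True)
            # Step 2: take the m most confident, stopping at the threshold.
            for (label, conf), i in preds[:m]:
                if conf < thresholds[v]:
                    break
                newly.setdefault(i, label)  # if both views pick i, view 0 wins
        if not newly:  # assumption: stop when nothing is confident enough
            break
        for i, label in newly.items():
            for v in range(2):
                labeled[v].append(views_unlabeled[v][i])
            labels.append(label)
            unused.remove(i)
    # Step 3 for the last round: final retraining on the augmented data.
    return [train_centroids(labeled[v], labels) for v in range(2)], labels
```

Calling `co_train` with the visual features as view 0 and the text features as view 1 would return one model per view plus the grown label list.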

7 Text Features
Preprocess the text by removing stop words.
Stem the remaining words using Porter stemming.
The frequencies of the resulting word stems make up the final textual features (“bag of words” representation).
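
A minimal sketch of this text pipeline follows. The stop-word list is a tiny illustrative one, and `toy_stem` is a crude suffix stripper that only stands in for Porter stemming; in practice a real Porter implementation (for example the one shipped with NLTK) would replace it. The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the work).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "with"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter stemming, not the real algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(caption):
    """Lower-case, tokenize, drop stop words, stem, and count the stems."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)
```

For a caption such as "The player kicks the ball and the goalie kicked it", both "kicks" and "kicked" map to the same stem and are counted together.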

8 Captioned images
Features used: we want to capture the overall texture and color distributions in local regions.
Texture: Gabor filters with 3 scales and 4 orientations.
Color: mean, standard deviation and skewness of the per-channel RGB and Lab color pixel values.

9 Method (for captioned images)
Divide each image into a 4 by 6 grid of cells.
Compute the texture feature for each cell using the Gabor filters.
The resulting feature vectors, one per region, are then clustered using k-means.
Each region is then assigned to one of the k clusters based on its closeness to the cluster centroids.
The final “bag of visual words” represents every image as a vector of k values, each counting the regions of the image assigned to that cluster.
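
The grid division and per-cell description can be sketched as below. To stay short, the example computes only the color moments (mean, standard deviation, skewness) on the three RGB channels; the Gabor texture responses and the Lab channels are omitted. Skewness is taken to be the third standardized moment, which is an assumption, and the function names are hypothetical.

```python
def moments(values):
    """Mean, standard deviation and skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    skew = 0.0 if std == 0 else sum((v - mean) ** 3 for v in values) / n / std ** 3
    return [mean, std, skew]

def grid_color_features(image, rows=4, cols=6):
    """image: list of pixel rows, each pixel an (r, g, b) tuple.

    Returns rows * cols feature vectors, each with 3 moments x 3 channels = 9 values.
    """
    h, w = len(image), len(image[0])
    features = []
    for gr in range(rows):
        for gc in range(cols):
            # Pixel bounds of this grid cell.
            r0, r1 = gr * h // rows, (gr + 1) * h // rows
            c0, c1 = gc * w // cols, (gc + 1) * w // cols
            pixels = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            vec = []
            for ch in range(3):
                vec += moments([p[ch] for p in pixels])
            features.append(vec)
    return features
```

Each image then yields 24 cell vectors, which are the inputs to the clustering step.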

10 Image Features (The University of Texas at Austin)
Divide images into a 4 × 6 grid.
Capture the texture and color distributions of each cell in a 30-dim vector.
Cluster the vectors using k-means to quantize the features into a dictionary of visual words.
Represent each image as a histogram of visual words [Fei-Fei et al. ‘05, Bekkerman & Jeon ‘07].
[Figure: N × 30 matrix of cell feature vectors]
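
The quantization and histogram steps can be illustrated with a small, dependency-free k-means. This is a sketch under assumptions: the initial centroids are simply the first k points (chosen for determinism, not taken from the slides), the iteration count is arbitrary, and `kmeans`, `nearest` and `visual_word_histogram` are hypothetical names.

```python
import math

def nearest(centroids, p):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def visual_word_histogram(cell_features, centroids):
    """Count how many cells of one image fall into each visual word."""
    hist = [0] * len(centroids)
    for f in cell_features:
        hist[nearest(centroids, f)] += 1
    return hist
```

The centroids would be learned once from the cell vectors of all training images; each image is then described by its own histogram over that shared dictionary.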

11 Example dataset

12 Results for captioned images
Compare co-training to supervised SVM.
Compare co-training to semi-supervised EM.
Compare co-training to transductive SVM.

13 Commented videos
Features used: we use features that describe both salient spatial changes and interesting movements.
Maximize a normalized spatio-temporal Laplacian operator over spatial and temporal scales.
HOG: 3x3x2 spatio-temporal blocks, with a 4-bin HOG descriptor for every block => 72-element descriptor.

14 Method (for commented videos)
Use the spatio-temporal motion descriptor (Laptev).
To detect events, use significant local changes in image values in both space and time.
Estimate the spatio-temporal extent of the detected events by maximizing a normalized spatio-temporal Laplacian operator over both spatial and temporal scales.
A HOG (histogram of oriented gradients) is calculated at each interest point.
The patch is partitioned into a grid of 3x3x2 spatio-temporal blocks; four-bin HOG descriptors are then computed for all blocks and concatenated into a 72-element descriptor.
These descriptors are clustered to form a vocabulary.
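
The arithmetic behind the 72-element descriptor (3 x 3 x 2 = 18 blocks, 4 bins each) can be made concrete with a simplified sketch. It is not Laptev's descriptor: the gradients here are plain central differences within each frame, orientations are signed and split into four equal bins, and votes are weighted by gradient magnitude. All of these choices, and the name `hog_descriptor`, are assumptions made for illustration.

```python
import math

def hog_descriptor(patch, blocks=(3, 3, 2), bins=4):
    """patch[t][y][x] holds grey values of a spatio-temporal patch.

    Returns a flat list of blocks_x * blocks_y * blocks_t * bins values.
    """
    T, H, W = len(patch), len(patch[0]), len(patch[0][0])
    bx, by, bt = blocks
    hist = [0.0] * (bx * by * bt * bins)
    for t in range(T):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                # Central-difference spatial gradient inside frame t.
                gx = patch[t][y][x + 1] - patch[t][y][x - 1]
                gy = patch[t][y + 1][x] - patch[t][y - 1][x]
                mag = math.hypot(gx, gy)
                if mag == 0:
                    continue
                # Signed orientation in [0, 2*pi), quantized into `bins` bins.
                angle = math.atan2(gy, gx) % (2 * math.pi)
                b = min(int(angle / (2 * math.pi) * bins), bins - 1)
                # Which of the bx * by * bt blocks this pixel belongs to.
                block = ((t * bt // T) * by + (y * by // H)) * bx + (x * bx // W)
                hist[block * bins + b] += mag
    return hist
```

With the default block and bin counts the returned list always has 72 entries, one 4-bin histogram per block, concatenated in block order.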

15 Video Features
Detect interest points: Harris-Förstner corner detector for both spatial and temporal space.
Describe interest points: Histogram of Oriented Gradients (HoG).
Create spatio-temporal vocabulary: quantize interest points to create a dictionary of 200 visual words.
Represent each video as a histogram of visual words [Laptev, IJCV ‘05].
[Figure: N × 72 matrix of interest-point descriptors]

16 Example dataset

17 Results for commented video
Compare co-training with supervised SVM on the commented video dataset.
Compare co-training with supervised SVM when commentary is not available during testing.

18 What does the future look like?
Larger dataset and more categories for testing.
Labeled data versus associated text.
The results already show that co-training outperforms the supervised and semi-supervised methods it was compared against.

19 Questions?

