Actions in Context. Marcin Marszałek, Ivan Laptev, Cordelia Schmid. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.


Outline: Introduction; Script mining for visual learning; Visual learning of actions and scenes; Classification with context; Experimental results.

Introduction Video understanding involves the interpretation of objects, scenes and actions. Previous work has mostly dealt with these concepts separately.

Introduction We automatically discover relevant scene classes and their correlation with human actions. We show how to learn selected scene classes from video without manual supervision. We develop a joint framework for action and scene recognition and demonstrate improved recognition of both in natural video. We consider a large and diverse set of video samples from movies. Script-to-video alignment and text search are used to automatically retrieve video samples and corresponding labels for scenes and actions in movies. Within a bag-of-features framework, we learn separate visual models for action and scene classification and combine them in a joint scene-action SVM-based classifier.

Script mining for visual learning: Script-to-video alignment; Action retrieval; Scene retrieval.

Script-to-video alignment We match script text with subtitles using dynamic programming and obtain a set of short video clips annotated with the corresponding script text.
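The matching step can be sketched as a classic edit-distance dynamic program over word sequences. This is a minimal illustration, not the authors' exact procedure; the function name `align`, the unit gap/mismatch costs, and the word-level granularity are all illustrative assumptions.

```python
# Hypothetical sketch: align script words against subtitle words with
# an edit-distance style dynamic program (not the paper's exact method).
def align(script_words, subtitle_words, gap=1, mismatch=1):
    n, m = len(script_words), len(subtitle_words)
    # cost[i][j] = minimal cost of aligning the first i script words
    # against the first j subtitle words
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if script_words[i - 1] == subtitle_words[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitute
                             cost[i - 1][j] + gap,      # skip a script word
                             cost[i][j - 1] + gap)      # skip a subtitle word
    return cost[n][m]
```

A low alignment cost between a script passage and a subtitle segment indicates they describe the same moment, which is what anchors the script text to a time interval in the video.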

Action retrieval We choose twelve frequent action classes and manually assign corresponding action labels to scene descriptions in a few movie scripts. We then train an off-the-shelf bag-of-words text classifier and automatically retrieve scene descriptions with action labels. We split all retrieved samples into the test and training subsets. We manually validate and correct labels for the test samples in order to properly evaluate the performance of our approach.
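The text-classification step could look like the following minimal bag-of-words perceptron. This is only a stand-in: the slide does not specify which off-the-shelf classifier was used, and the feature extraction and training loop here are illustrative assumptions.

```python
# Minimal bag-of-words text classifier (mistake-driven perceptron),
# a hypothetical stand-in for the off-the-shelf classifier in the paper.
from collections import Counter

def features(text):
    # bag-of-words counts as the feature vector
    return Counter(text.lower().split())

def train(samples, labels, epochs=10):
    w = Counter()                              # word -> learned weight
    for _ in range(epochs):
        for text, y in zip(samples, labels):   # y in {+1, -1}
            f = features(text)
            score = sum(w[t] * c for t, c in f.items())
            if y * score <= 0:                 # update only on mistakes
                for t, c in f.items():
                    w[t] += y * c
    return w

def predict(w, text):
    score = sum(w[t] * c for t, c in features(text).items())
    return 1 if score > 0 else -1
```

In the paper's setting, the positive class would be one of the twelve action labels and the samples would be scene descriptions from the scripts.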

Scene retrieval Objective: exploit re-occurring relations between actions and their scene context to improve classification. This requires identification of relevant scene types and estimation of their co-occurrence relation with actions. We use scene captions, which are annotated in most movie scripts. We use WordNet [7] to select expressions corresponding to instances of "physical entity", and its hierarchy to generalize concepts to their hypernyms, such as taxi → car, café → restaurant. [7] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
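The generalization step amounts to climbing the hypernym hierarchy until a canonical scene concept is reached. The tiny hand-made taxonomy below is purely illustrative (only the taxi → car and café → restaurant mappings come from the slide); a real implementation would query WordNet's hypernym relation instead.

```python
# Toy hypernym map standing in for WordNet's hierarchy; the entries for
# "cab" and "diner" are invented examples, not from the paper.
HYPERNYM = {"taxi": "car", "cab": "car",
            "café": "restaurant", "diner": "restaurant"}

def generalize(term):
    # climb the hierarchy until we hit a canonical scene concept
    while term in HYPERNYM:
        term = HYPERNYM[term]
    return term
```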

Scene retrieval We validate how well script-estimated co-occurrences transfer to the video domain, and how well they generalize to the test set.

Scene retrieval

Visual learning of actions and scenes: Bag-of-features video representation; Support vector classifier; Evaluation of the action and scene classifiers.

Bag-of-features video representation: We view videos as spatio-temporal volumes. Given a video sample, salient local regions are extracted using an interest point detector, and the video content in each detected region is described with local descriptors. We also view the video as a collection of frames, which allows us to employ the bag-of-features components developed for static scene and object recognition [28], giving a hybrid representation in a unified framework. We use three different types of descriptors, capturing motion patterns (HoF), dynamic appearance (HoG) and static appearance (SIFT). [28] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 73(2):213–238, 2007.
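The quantization step common to all three descriptor channels can be sketched as follows: each local descriptor is assigned to its nearest visual word, and the video is summarized by a normalized word histogram. The codebook would normally come from k-means clustering of training descriptors; everything here is a generic sketch, not the paper's exact pipeline.

```python
import numpy as np

# Sketch of turning a set of local descriptors into a bag-of-features
# histogram over a fixed visual vocabulary (codebook).
def bag_of_features(descriptors, codebook):
    # descriptors: (n, d) array; codebook: (k, d) array of visual words
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                  # nearest visual word index
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                   # L1-normalized histogram
```

One such histogram per descriptor type (HoF, HoG, SIFT) gives the multi-channel representation fed to the classifier.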

3D-Harris detector: Salient motion is detected using a multi-scale 3D generalization of the Harris operator [11]. Since the operator responds to 3D corners, it can detect (in space and time) characteristic points of moving salient parts in the video. [11] I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.

2D-Harris detector: The 2D scale-invariant Harris operator [17] is used to describe static content. It responds to corner-like structures in images, allowing us to detect salient areas in individual frames extracted from the video. [17] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63–86, 2004.
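The underlying corner response can be sketched at a single scale as below: windowed second-moment sums of image gradients, then det(M) − k·trace(M)². This omits the Gaussian scale-space and scale selection of the actual scale-invariant detector, and the 3×3 box window is an illustrative simplification.

```python
import numpy as np

def box3(a):
    # 3x3 box-filter sum via shifted slices (stand-in for Gaussian weighting)
    p = np.pad(a, 1)
    return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
               for i in range(3) for j in range(3))

# Single-scale Harris corner response on a grayscale image (sketch only;
# the paper uses the scale-invariant multi-scale version).
def harris_response(img, k=0.04):
    iy, ix = np.gradient(img.astype(float))    # gradients along rows, cols
    sxx = box3(ix * ix)                        # windowed second moments
    syy = box3(iy * iy)
    sxy = box3(ix * iy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2                # large where both eigenvalues are large
```

The response is large only where the structure tensor has two large eigenvalues, i.e. at corner-like structures, which is why the detector fires on salient image areas rather than edges.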

HoF and HoG descriptors [12]: Computed for the regions obtained with the 3D-Harris detector. HoF is based on local histograms of optical flow and describes the motion in a local region. HoG is a 3D histogram of 2D (spatial) gradient orientations and describes the static appearance over space and time. [12] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
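Both descriptors share the same building block: a magnitude-weighted histogram of vector orientations, computed over gradients for HoG and over optical-flow vectors for HoF. The sketch below shows that building block only; bin count, normalization, and the spatio-temporal grid layout of the full descriptors are simplified away.

```python
import numpy as np

# Magnitude-weighted orientation histogram: the shared core of HoG
# (dx, dy = image gradients) and HoF (dx, dy = optical flow components).
def orientation_histogram(dx, dy, bins=8):
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2 * np.pi)            # orientation in [0, 2*pi)
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.zeros(bins)
    np.add.at(hist, idx.ravel(), mag.ravel())          # vote weighted by magnitude
    return hist / max(hist.sum(), 1e-12)               # L1-normalize
```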

SIFT [15]: Computed for the regions obtained with the 2D-Harris detector. It is a weighted histogram of gradient orientations. Unlike the HoG descriptor, it is a 2D (spatial) histogram and does not take the temporal domain into account. [15] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

Support Vector Classifier We use Support Vector Machines (SVM) [23] to learn actions and scene context. The decision function of a binary C-SVM classifier takes the following form: f(x) = Σ_i α_i y_i K(x_i, x) + b, where K(x_i, x) is the value of a kernel function for the training sample x_i and the test sample x, y_i is the class label (positive/negative) of x_i, α_i is a learned weight of the training sample x_i, and b is a learned threshold. [23] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA, 2002.
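The decision function above is a direct kernel expansion over the support vectors and can be written down in a few lines. For concreteness the sketch plugs in a Gaussian RBF kernel; the paper itself uses the multi-channel χ² kernel described on the next slide, and the support vectors, labels and weights here are whatever training produced.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian RBF kernel, used here only as a concrete example kernel
    return np.exp(-gamma * ((a - b) ** 2).sum())

# f(x) = sum_i alpha_i * y_i * K(x_i, x) + b  -- the C-SVM decision function
def decision(x, support, y, alpha, b, kernel=rbf):
    return sum(a_i * y_i * kernel(x_i, x)
               for x_i, y_i, a_i in zip(support, y, alpha)) + b
```

A positive value of `decision(x, ...)` assigns x to the positive class; the real-valued score itself is what the average-precision evaluation ranks.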

Support Vector Classifier For the kernel we use the multi-channel Gaussian kernel of [28]: K(x_i, x_j) = exp(−Σ_c D_c(x_i, x_j) / Ω_c), where D_c is the χ² distance between the channel-c feature histograms of x_i and x_j, and Ω_c is a normalization factor for channel c. To compare the overall system performance, we compute an average AP (AAP) over a set of binary classification problems. [28] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 73(2):213–238, 2007.
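This kernel can be sketched directly from its definition. In the sketch, samples are dicts mapping a channel name (e.g. HoG, HoF, SIFT histograms) to a histogram, and the normalization factors Ω_c are passed in; in [28] they are typically set to the mean χ² distance over training pairs for each channel, which is an assumption here rather than something the slide states.

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    # chi-squared distance between two histograms
    return 0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()

# K(x_i, x_j) = exp(-sum_c D_c(x_i, x_j) / Omega_c)
def multichannel_kernel(xi, xj, omegas):
    # xi, xj: dicts channel name -> histogram; omegas: channel -> Omega_c
    return np.exp(-sum(chi2(xi[c], xj[c]) / omegas[c] for c in omegas))
```

Identical samples give a kernel value of 1, and the value decays toward 0 as the per-channel histogram distances grow, so each descriptor channel contributes independently to the similarity.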

Evaluation of the action and scene classifiers We compare HoG, HoF and SIFT features for action and scene recognition on the datasets described before. Both static and dynamic features are important for recognition of scenes and actions.

Evaluation of the action and scene classifiers A combination of static and dynamic features is a good choice for a generic video recognition framework.

Classification with context The goal is to improve action classification based on scene context obtained with the scene classifiers. Action scores are updated with weighted scene scores, where τ is a global context weight (set to 3) and w_as are weights linking action concept a with scene concept s. The weights are obtained in two ways: by text mining, and by learning them from visual data.

Mining contextual links from text In vector form, a' = a + τ W s, where a is a vector with the original scores for all basic (action) classes, s is a vector with scores for all context (scene) classes, and W is a weight matrix. We define W from the scripts of the training set as a conditional probability matrix.
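The text-mined context model then has two small pieces, sketched below: W is estimated as conditional probabilities from action/scene co-occurrence counts in the training scripts, and the action scores are updated as a' = a + τ·W·s. The co-occurrence counts in the test are made up for illustration; only the formula and the value τ = 3 come from the slides.

```python
import numpy as np

# W[a, s] = P(scene s | action a), estimated from co-occurrence counts
# in training scripts (row-normalized counts).
def conditional_matrix(counts):
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# a' = a + tau * W s : action scores boosted by weighted scene scores
def contextual_scores(a, s, W, tau=3.0):
    return a + tau * W.dot(s)
```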

Learning contextual links from visual data We rewrite (4), replacing the linear combination with a linear C-SVM classifier, so that the weights of our context model are determined by a second-layer SVM classifier. This combines action and context classes in one learning task: the classifier takes a vector of scene scores as input and produces a contextual prediction for the action class.

Experimental results We evaluate the gain for action recognition when using scene context. We show the average precision gains for context weights learned from text (red) and with a linear SVM (yellow).

Experimental results We evaluate the gain for scene recognition when using action context.

Conclusion We can use automatically extracted context information based on video scripts to improve action recognition. Given completely automatic action and scene training sets, we can learn individual classifiers that integrate static and dynamic information. Experimental results show that both types of information are important for actions and scenes, and that in some cases the context information dominates. Combining the individual classifiers improves the results for action as well as scene classification.