Local Descriptors for Spatio-Temporal Recognition

Presentation transcript:

Local Descriptors for Spatio-Temporal Recognition
Ivan Laptev and Tony Lindeberg
Computational Vision and Active Perception Laboratory (CVAP)
Dept. of Numerical Analysis and Computer Science
KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden

Motivation
Area: interpretation of non-rigid motion. Non-rigid motion results in visual events such as:
- occlusions and disocclusions
- appearances and disappearances
- unifications and splits
- velocity discontinuities
Such events are often characterized by non-constant motion and complex spatio-temporal appearance, and they provide a compact way to capture important aspects of spatio-temporal structure.

Local Motion Events
Idea: look for spatio-temporal neighborhoods that maximize the local variation of image values over space and time.

Interest points
Spatial domain (Harris and Stephens, 1988): form the second-moment matrix
  μ = g(·; σ²) ∗ (∇L (∇L)ᵀ), where ∇L = (Lx, Ly)ᵀ,
and select maxima over (x, y) of the corner function H = det(μ) − k·trace²(μ).
Analogy in space-time (Laptev and Lindeberg, ICCV’03): extend the gradient to ∇L = (Lx, Ly, Lt)ᵀ, so that μ becomes a 3×3 matrix, and select space-time maxima of H = det(μ) − k·trace³(μ), i.e. points with high variation of image values over space and time (see the sketch below).
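As an illustration, here is a minimal NumPy/SciPy sketch of this operator for a grayscale video volume `V` of shape (T, H, W); the constant k ≈ 0.005 and the integration-scale factor `s` are common choices, not values fixed by the slides:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def space_time_harris(V, sigma=2.0, tau=2.0, s=2.0, k=0.005):
    """Extended Harris measure H over a video volume V of shape (T, H, W)."""
    V = V.astype(np.float64)
    # Spatio-temporal Gaussian derivatives L_t, L_y, L_x at scales (tau, sigma)
    Lt = gaussian_filter(V, (tau, sigma, sigma), order=(1, 0, 0))
    Ly = gaussian_filter(V, (tau, sigma, sigma), order=(0, 1, 0))
    Lx = gaussian_filter(V, (tau, sigma, sigma), order=(0, 0, 1))
    grads = (Lt, Ly, Lx)
    # Second-moment matrix mu: smoothed outer products of the gradient,
    # integrated at scales s times the differentiation scales
    mu = np.empty(V.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            mu[..., i, j] = gaussian_filter(grads[i] * grads[j],
                                            (s * tau, s * sigma, s * sigma))
    # H = det(mu) - k * trace(mu)^3; interest points are its space-time maxima
    return np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3
```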

Synthetic examples
(Figure: velocity discontinuity, a spatio-temporal "corner"; unification and split.)

Image transformations
- Spatial scale: p ↦ p′ = (s·x, s·y, t)ᵀ
- Temporal scale: p ↦ p′ = (x, y, τ·t)ᵀ
- Galilean transformation: p ↦ p′ = G·p, i.e. x′ = x + vx·t, y′ = y + vy·t
Estimate these transformations locally to obtain invariance to them (Laptev and Lindeberg, ICCV’03, ICPR’04). A toy example follows below.
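For concreteness, the Galilean transformation can be written as a 3×3 matrix acting on space-time points p = (x, y, t)ᵀ; the velocity values in this toy example are arbitrary:

```python
import numpy as np

def galilean(vx, vy):
    """Galilean transformation as a matrix acting on points p = (x, y, t)."""
    return np.array([[1.0, 0.0, vx],
                     [0.0, 1.0, vy],
                     [0.0, 0.0, 1.0]])

p = np.array([10.0, 5.0, 3.0])   # point at (x, y) = (10, 5), time t = 3
print(galilean(2.0, 0.0) @ p)    # [16. 5. 3.]: x' = x + vx * t, t unchanged
```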

Feature detection: selection of spatial scale
Scale selection provides invariance with respect to size changes.

Feature detection: velocity adaptation
(Figure: features detected with a stabilized camera vs. a stationary camera.)

Feature detection: selection of temporal scale
Selection of temporal scales captures the temporal extent of events.

Features from human actions

Why local features in space-time?
- Make a sparse and informative representation of complex motion patterns.
- Obtain robustness w.r.t. missing data (occlusions) and outliers (complex dynamic backgrounds, multiple motions).
- Match similar events in image sequences.
- Recognize image patterns of non-rigid motion.
- Do not rely on tracking or spatial segmentation prior to motion recognition.

Space-time neighborhoods
(Figure: detected neighborhoods for boxing, walking and hand waving.)

Local space-time descriptors
Describe image structures in the neighborhoods of detected features, defined by the positions pᵢ = (xᵢ, yᵢ, tᵢ)ᵀ and covariance matrices Σᵢ of the adapted interest points. A well-founded choice of local descriptor is the local jet (Koenderink and van Doorn, 1987), computed from spatio-temporal Gaussian derivatives at the interest points pᵢ, e.g. J(pᵢ) = (Lx, Ly, Lt, Lxx, Lxy, …, Ltttt)ᵀ.
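A minimal sketch of such a jet, computed with SciPy's Gaussian-derivative filters; uniform scales (sigma, tau) stand in for the per-feature adapted covariances, and the function name is illustrative:

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy.ndimage import gaussian_filter

def local_jet(V, p, sigma=2.0, tau=2.0, max_order=2):
    """Gaussian-derivative jet at a point p = (t, y, x) of video volume V."""
    V = V.astype(np.float64)
    jet = []
    for order in range(1, max_order + 1):
        # All derivative multi-indices of this total order, e.g. for
        # order 2: (0, 0) -> L_tt, (0, 1) -> L_ty, ..., (2, 2) -> L_xx
        for axes in combinations_with_replacement(range(3), order):
            o = [axes.count(a) for a in range(3)]  # orders per (t, y, x) axis
            jet.append(gaussian_filter(V, (tau, sigma, sigma), order=o)[p])
    return np.array(jet)  # 9-dimensional for max_order=2, 34 for max_order=4
```

Note that `p` must be a tuple of integer indices, and that the jet dimensions (9 for order 2, 34 for order 4) match the filter counts quoted later in the slides.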

Use of descriptors: clustering
- Group similar points in the space of image descriptors using K-means clustering.
- Select significant clusters (c1, …, c4 in the figure).
- Use the selected clusters for classification.
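A hedged sketch of this step with scikit-learn's KMeans; the descriptor array, cluster count and selection criterion below are placeholders, not the authors' settings:

```python
import numpy as np
from sklearn.cluster import KMeans

descriptors = np.random.rand(500, 9)          # placeholder jet descriptors
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(descriptors)
labels, centers = km.labels_, km.cluster_centers_
# One plausible notion of "significant": compact clusters, i.e. those whose
# members lie close to their center on average
spread = np.array([np.linalg.norm(descriptors[labels == c] - centers[c],
                                  axis=1).mean() for c in range(4)])
significant = np.argsort(spread)[:2]          # keep the two tightest clusters
```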

Use of descriptors: Clustering

Use of descriptors: Matching Find similar events in pairs of video sequences
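One simple way to realize such matching is greedy nearest-neighbor association between the descriptor sets of two sequences; the Euclidean metric and the distance threshold below are illustrative choices, not the method's stated parameters:

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_events(desc_a, desc_b, max_dist=1.0):
    """Greedily pair each event in A with its nearest unmatched event in B."""
    D = cdist(desc_a, desc_b)                # pairwise descriptor distances
    matches = []
    for i in np.argsort(D.min(axis=1)):      # most confident events first
        j = int(np.argmin(D[i]))
        if D[i, j] < max_dist:
            matches.append((i, j))
            D[:, j] = np.inf                 # each event matched at most once
    return matches
```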

Are other descriptors better? Consider the following choices, each computed over the adapted spatio-temporal neighborhood of a feature:
- multi-scale spatio-temporal derivatives
- projections to orthogonal bases obtained with PCA
- histogram-based descriptors

Multi-scale derivative filters
Spatio-temporal derivatives up to order 2 (9 filters) or up to order 4 (34 filters), each computed at 3 spatial and 3 temporal scales, yield 9 × 3 × 3 = 81 or 34 × 3 × 3 = 306 dimensional descriptors.

PCA descriptors
- Compute normal flow or optic flow in locally adapted spatio-temporal neighborhoods of features.
- Subsample the flow fields to a resolution of 9×9×9 pixels.
- Learn PCA basis vectors (separately for each type of flow) from features in training sequences.
- Project the flow fields of new features onto the 100 most significant eigen-flow-vectors.
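A compact sketch of this descriptor pipeline with scikit-learn's PCA; the flow arrays here are random placeholders standing in for real subsampled flow fields:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder training data: flow fields subsampled to 9x9x9, two components
flows_train = np.random.rand(1000, 9 * 9 * 9 * 2)
pca = PCA(n_components=100).fit(flows_train)   # learn eigen-flow-vectors

flows_new = np.random.rand(5, 9 * 9 * 9 * 2)   # flows of new features
desc = pca.transform(flows_new)                # 100-dimensional descriptors
```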

Position-dependent histograms
- Divide the neighborhood of each point pᵢ into M³ subneighborhoods, here with M = 1, 2, 3.
- Compute space-time gradients (Lx, Ly, Lt)ᵀ or optic flow (vx, vy)ᵀ at combinations of 3 temporal and 3 spatial scales, taken relative to the locally adapted detection scales.
- Compute separable histograms over all subneighborhoods, derivative/velocity components and scales (see the sketch below).
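A hedged sketch of one such descriptor for the gradient case at a single scale; the cell count M, the bin count and the value range are illustrative assumptions:

```python
import numpy as np

def position_dependent_hist(Lx, Ly, Lt, M=2, bins=8, rng=(-1.0, 1.0)):
    """Separable per-cell histograms of gradient components over an
    (N, N, N) neighborhood split into M^3 subneighborhoods."""
    G = np.stack([Lx, Ly, Lt], axis=-1)
    hists = []
    for block in np.array_split(G, M, axis=0):
        for row in np.array_split(block, M, axis=1):
            for cell in np.array_split(row, M, axis=2):
                for c in range(3):  # separable: one histogram per component
                    h, _ = np.histogram(cell[..., c], bins=bins, range=rng)
                    hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)    # M^3 * 3 * bins dimensional
```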

Evaluation: action recognition
Database: walking, running, jogging, handwaving, handclapping, boxing.
Initially, recognition with a nearest-neighbor classifier (NNC):
- Take the sequences of X subjects for training (S_train).
- For each test sequence s_test, find the closest training sequence s_train,i by minimizing the distance between their descriptors.
- The action of s_test is regarded as recognized if class(s_test) = class(s_train,i).
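This protocol reduces to a few lines given any sequence-level distance function, which is left abstract in the sketch below:

```python
import numpy as np

def nnc_accuracy(test_seqs, test_labels, train_seqs, train_labels, dist):
    """Fraction of test sequences whose nearest training sequence (under
    the supplied distance) has the same action label."""
    correct = 0
    for s, y in zip(test_seqs, test_labels):
        i = int(np.argmin([dist(s, t) for t in train_seqs]))
        correct += (train_labels[i] == y)
    return correct / len(test_seqs)
```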

Results: recognition rates (all descriptors)
(Figure: scale- and velocity-adapted features vs. scale-adapted features.)

Results: recognition rates (histogram descriptors)
(Figure: scale- and velocity-adapted features vs. scale-adapted features.)

Results: recognition rates (jet descriptors)
(Figure: scale- and velocity-adapted features vs. scale-adapted features.)

Results: comparison with
- Global-STG-HIST: Zelnik-Manor and Irani, CVPR’01
- Spatial-4Jets: local jets at spatial interest points (Harris and Stephens, 1988)

Confusion matrices
(Figure: position-dependent histograms at space-time interest points vs. local jets at spatial interest points.)

Confusion matrices
(Figure: STG-PCA and STG-PD2HIST descriptors, Euclidean distance.)

Related work
- Mikolajczyk and Schmid, CVPR’03, ECCV’02
- Lowe, ICCV’99
- Zelnik-Manor and Irani, CVPR’01
- Fablet, Bouthemy and Pérez, PAMI 2002
- Laptev and Lindeberg, ICCV’03, IVC 2004, ICPR’04
- Efros et al., ICCV’03
- Harris and Stephens, Alvey’88
- Koenderink and van Doorn, PAMI 1992
- Lindeberg, IJCV 1998

Summary
- Descriptors of local spatio-temporal features enable classification and matching of motion events in video.
- Position-dependent histograms of space-time gradients and optic flow give high recognition performance.
- The results are consistent with findings for the SIFT descriptor (Lowe, 1999) in the spatial domain.
Future work:
- include spatial and temporal consistency of local features
- handle multiple actions in the scene
- use information in between events

(Figure: example frames for walking, running, jogging, handwaving, handclapping and boxing.)

Results: recognition rates
(Figure: scalar-product distance vs. Euclidean distance.)
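The two dissimilarity measures compared here, in a minimal form; representing a sequence by a single descriptor vector is a simplification for illustration:

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def scalar_product_distance(a, b):
    # 1 minus the normalized scalar product (i.e. cosine distance)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```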

Walking model
- Represent the gait pattern using classified spatio-temporal points corresponding to one gait cycle.
- Define the state X of the model at the moment t0 by the position, the size, the phase and the velocity of a person.
- Associate each phase φ with a silhouette of a person extracted from the original sequence.
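An illustrative container for the state X; the field names are assumptions for readability, not the authors' notation:

```python
from dataclasses import dataclass

@dataclass
class WalkingModelState:      # the state X at the current moment t0
    x: float                  # image position of the person
    y: float
    size: float               # spatial extent
    phase: float              # phase within the gait cycle
    vx: float                 # image velocity
    vy: float
```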

Sequence alignment
- Given a data sequence and the current moment t0, detect and classify interest points in the time window (t0 − tw, t0) of length tw.
- Transform the model features according to X, and for each model feature f_m,i = (x_m,i, y_m,i, t_m,i, σ_m,i, τ_m,i, c_m,i) compute its distance d_i to the closest data feature f_d,j of the same class (c_d,j = c_m,i).
- Define the "fit function" D of the model configuration X as the sum of the distances of all features, weighted w.r.t. their "age" (t0 − t_m,i) so that recent features get more influence on the matching (see the sketch below).
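A hedged sketch of the fit function; the exponential age weighting is an assumed choice, since the slides only state that recent features should get more influence:

```python
import numpy as np

def fit_function(model_feats, data_feats, t0, decay=0.5):
    """Age-weighted sum of model-to-data feature distances.

    model_feats, data_feats: (n, d) arrays whose last column is the
    feature time; the class-matching constraint c_d = c_m is omitted here.
    """
    D = 0.0
    for f in model_feats:
        d_i = np.min(np.linalg.norm(data_feats[:, :-1] - f[:-1], axis=1))
        age = t0 - f[-1]
        D += np.exp(-decay * age) * d_i ** 2   # recent features weigh more
    return D
```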

Sequence alignment
At each moment t0, minimize D with respect to X using the standard Gauss-Newton minimization method. (Figure: data features and model features.)
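For reference, a generic Gauss-Newton step for a sum-of-squares objective D(X) = Σᵢ rᵢ(X)²; the residual and Jacobian functions would be supplied by the model:

```python
import numpy as np

def gauss_newton_step(X, residuals, jacobian):
    """One Gauss-Newton update for minimizing sum_i r_i(X)^2."""
    r = residuals(X)                              # (n,) residual vector
    J = jacobian(X)                               # (n, m) Jacobian dr/dX
    dX, *_ = np.linalg.lstsq(J, -r, rcond=None)   # solve J dX = -r
    return X + dX
```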

Experiments

Experiments