Bag of Video-Words Video Representation

Bag of Video-Words Video Representation (UCF VIRAT Efforts)

Outline
- Bag of video-words approach for video representation
  - Feature detection
  - Feature quantization
  - Histogram-based video descriptor generation
- Preliminary experimental results on aerial videos
- Discussion of ways to improve the performance

Bag of video-words approach (I): motion feature detection with a spatiotemporal interest point detector.

Bag of video-words approach (II): feature quantization (codebook generation). Detected features are clustered, and each cluster center becomes a video-word (e.g., video-word A, B, C).
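
A minimal sketch of the quantization step, assuming the detected motion descriptors have been stacked into one (N, D) NumPy array (the array here is random stand-in data); it uses scikit-learn's MiniBatchKMeans:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in (N, D) array of local motion descriptors pooled from all
# training videos (e.g., cuboids around detected interest points).
train_features = np.random.rand(10000, 64)

codebook_size = 200
codebook = MiniBatchKMeans(n_clusters=codebook_size, random_state=0)
codebook.fit(train_features)

# Each cluster center is one "video-word".
video_words = codebook.cluster_centers_   # shape: (codebook_size, D)
```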

Bag of video-words approach (III): histogram-based video descriptor generation. Each video is described by the histogram of its quantized features over the codebook.
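
Continuing the sketch above, a per-video descriptor is the normalized histogram of nearest-word assignments (the function name is an assumption):

```python
import numpy as np

def video_descriptor(features, codebook):
    """Quantize one video's local features against the trained codebook
    and return an L1-normalized histogram of video-word counts."""
    words = codebook.predict(features)   # nearest video-word per feature
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```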

Similarity metrics
- Histogram intersection
- Chi-square distance
- Used to create a similarity (confusion) table over all the examples (clustering)
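
Both metrics compare two L1-normalized histograms; a minimal sketch:

```python
import numpy as np

def histogram_intersection(h1, h2):
    # Similarity: 1.0 for identical L1-normalized histograms, 0.0 for disjoint.
    return np.minimum(h1, h2).sum()

def chi_square_distance(h1, h2, eps=1e-10):
    # Distance: 0.0 for identical histograms; eps avoids division by zero.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```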

Classifiers
- Bayesian classifier
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM) with histogram intersection, chi-square, and RBF (radial basis function) kernels
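
Histogram intersection and chi-square are not built-in SVM kernels, so one common route is a precomputed Gram matrix; a sketch with scikit-learn, in which the random data stands in for real bag-of-video-words histograms:

```python
import numpy as np
from sklearn.svm import SVC

def intersection_gram(A, B):
    # Gram matrix of histogram-intersection values between rows of A and B.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

# Stand-in histograms and binary action labels ("kicking" vs. "non-kicking").
X_train = np.random.rand(44, 200); X_train /= X_train.sum(1, keepdims=True)
y_train = np.random.randint(0, 2, 44)
X_test = np.random.rand(6, 200);  X_test /= X_test.sum(1, keepdims=True)

svm = SVC(kernel="precomputed")
svm.fit(intersection_gram(X_train, X_train), y_train)
pred = svm.predict(intersection_gram(X_test, X_train))  # (n_test, n_train)
```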

Experiments on aerial videos. Dataset: a blimp with an HD camera on a gimbal. 11 actions: digging, gesturing, picking up, throwing, kicking, carrying an object, walking, standing, running, entering a vehicle, exiting a vehicle.

Clipping & cropping actions: an optimal bounding box is created so that the object of interest stays in view across all the frames (start frame to end frame). Compare the video clips demo with global…

Feature detection for video clips (200 features per clip), illustrated on digging, kicking, throwing, and walking.

Classification Results (I): "kicking" (22 clips) vs. "non-kicking" (22 clips). With 50 features per video, accuracy was 65.91%, 79.55%, and 75.00% for codebook sizes 50, 100, and 200, respectively; with 100 features per video it was 77.27%, and with 200 features per video, 81.82%.

Classification Results (II)

Classification Results (III): "digging", "kicking", "walking", "throwing" (25 clips × 4). Similarity matrix (histogram intersection) over the four actions.

Classification Results (V): average accuracy with different codebook sizes, plus the confusion table for codebook size 300.

Number of features per video: 200
- Codebook 100: 84.6%
- Codebook 200: 85.0%
- Codebook 300: 86.7%

Misclassified examples (I): "walking" misclassified as "kicking".

Misclassified examples (II): "digging" misclassified as "walking".

Misclassified examples (III): "walking" misclassified as "throwing".

How to improve the performance?
- Low-level features
  - Stable motion features
  - Different motion features
  - Different motion feature sampling
  - Hybrid of motion and static features
- Video-word generation
  - Unsupervised: hierarchical k-means (David Nister et al., CVPR 2006)
  - Supervised: random forests (Bill Triggs et al., NIPS 2007) and "visual bits" (Rong Jin et al., CVPR 2008)
- Classifiers
  - SVM kernels: histogram intersection vs. chi-square distance
  - Multiple kernels

Pipeline: low-level features → video-words → activity histogram descriptors.

Stable motion features: motion compensation; video clipping and cropping (start frame to end frame).

Different low-level features
- Flattened gradient vector (magnitude)
- Histogram of gradient (direction)
- Histogram of optical flow
- Combination of all feature types
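
A hedged sketch of two of these feature types on a single grayscale patch, using OpenCV; the bin count and Farneback parameters are assumptions, not taken from the slides:

```python
import cv2
import numpy as np

def gradient_direction_histogram(gray, bins=8):
    # Histogram of gradient directions, weighted by gradient magnitude.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-10)

def flow_direction_histogram(prev_gray, gray, bins=8):
    # Histogram of optical-flow directions between two consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-10)

# Combining feature types amounts to concatenating the per-patch histograms:
# descriptor = np.concatenate([grad_hist, flow_hist])
```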

Feature sampling
- Feature detection: Gabor filter or 3D Harris corner detection
- Random sampling
- Grid-based sampling
(Bill Triggs et al., Sampling Strategies for Bag-of-Features Image Classification, ECCV 2006)
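
The two dense alternatives to detector-driven sampling are easy to state precisely; a sketch of patch-center selection (patch-size handling is omitted):

```python
import numpy as np

def random_locations(height, width, n, rng=np.random.default_rng(0)):
    # Random sampling: n patch centers drawn uniformly over the frame.
    return list(zip(rng.integers(0, height, n), rng.integers(0, width, n)))

def grid_locations(height, width, step):
    # Grid-based sampling: one patch center every `step` pixels.
    return [(y, x) for y in range(0, height, step)
                   for x in range(0, width, step)]
```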

Hybrid of Motion and Static Features (I)
- Multiple-frame features (spatiotemporal, motion): 3D Harris; capture the local spatiotemporal information around the interest points
- Single-frame features (spatial, static): 2D Harris detector and MSER (Maximally Stable Extremal Regions) detector; perform action recognition from a sequence of instantaneous postures or poses; overcome the shortcoming of multiple-frame features, which require relatively stable camera motion
- Hybrid of motion and static features: represent a video by the combination of multiple-frame and single-frame features (see the sketch below)
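
A minimal sketch of one way to combine the two representations, assuming one histogram per codebook has already been computed; the weighting scheme is an assumption, not taken from the slides:

```python
import numpy as np

# Stand-in per-video histograms over two separate codebooks.
motion_hist = np.random.rand(200); motion_hist /= motion_hist.sum()
static_hist = np.random.rand(100); static_hist /= static_hist.sum()

alpha = 0.5  # assumed relative weight of motion vs. static cues
hybrid = np.concatenate([alpha * motion_hist, (1 - alpha) * static_hist])
```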

Hybrid of Motion and Static Features (II): examples of 2D Harris and MSER features.

Hybrid of Motion and Static Features (III): experiments on three action datasets
- KTH: 6 action categories, 600 videos
- UCF Sports: 10 action categories, about 200 videos
- YouTube videos: 11 action categories, about 1,100 videos

KTH dataset: boxing, clapping, waving, walking, jogging, running.

Experimental results on the KTH dataset: recognition using motion features (left, average accuracy 87.65%), static features (middle, 82.96%), and hybrid features (right, 92.66%).

Results on the UCF Sports dataset: the average accuracy for the static, motion, and static+motion strategies is 74.5%, 79.6%, and 84.5%, respectively.

YouTube Video Dataset (I): cycling, diving, golf swinging, riding, juggling.

YouTube Video Dataset (II): basketball shooting, swinging, tennis swinging, trampoline jumping, volleyball spiking.

Results on the YouTube dataset: the average accuracy for motion, static, and hybrid features is 65.4%, 63.1%, and 71.2%, respectively.

Hierarchical K-Means (I)
- Traditional k-means: slow when generating a large codebook, and less discriminative when the codebook is large
- Hierarchical k-means: build a tree on the training features; child nodes are the clusters obtained by applying k-means to the parent node; treat each node as a "word", so the tree is a hierarchical codebook
(D. Nister, Scalable Recognition with a Vocabulary Tree, CVPR 2006)
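
A hedged sketch of the tree construction; the branching factor and depth are assumptions, and node ids are assigned in build order for use in the traversal sketch further below:

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(features, branch=4, depth=3, ids=None):
    """Recursively cluster features with k-means; every node is a 'word'."""
    ids = itertools.count() if ids is None else ids
    node = {"id": next(ids), "center": features.mean(axis=0), "children": []}
    if depth == 0 or len(features) < branch:
        return node
    km = KMeans(n_clusters=branch, n_init=4, random_state=0).fit(features)
    for k in range(branch):
        subset = features[km.labels_ == k]
        if len(subset) > 0:
            node["children"].append(
                build_vocab_tree(subset, branch, depth - 1, ids))
    return node

root = build_vocab_tree(np.random.rand(2000, 64))  # stand-in descriptors
```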

Hierarchical K-Means (II): advantages
- The tree also defines the quantization of the features, so it integrates indexing and quantization in one structure
- Much more efficient when generating a large codebook
- The word (node) frequency can be weighted by the inverse document frequency
- Can generate more discriminative words than flat k-means, and a large codebook generally gives better performance

Hierarchical K-Means (III): simple illustration.

Hierarchical K-Means (IV)
- Treat each node as a visual word
- Track each feature as it is indexed by the tree
- Count the frequency n_i with which the video's or image's features traverse each tree node
- Weight the frequency by w_i (see the sketch below)
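
Continuing the vocabulary-tree sketch above, each feature descends from the root and increments a count n_i at every node it visits; the counts are then scaled by per-node weights w_i (inverse document frequency is one choice). Here `n_nodes` is the total node count after building:

```python
import numpy as np

def traverse(feature, node, visited):
    # Record every node id on the root-to-leaf path of this feature.
    visited.append(node["id"])
    if not node["children"]:
        return
    dists = [np.linalg.norm(feature - c["center"]) for c in node["children"]]
    traverse(feature, node["children"][int(np.argmin(dists))], visited)

def weighted_descriptor(features, root, n_nodes, weights):
    counts = np.zeros(n_nodes)      # n_i: visits to node i by this video
    for f in features:
        visited = []
        traverse(f, root, visited)
        counts[visited] += 1        # node ids on one path are distinct
    return counts * weights         # w_i, e.g. inverse document frequency
```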

Random Forests (I)
- K-means based quantization methods: unsupervised, and they suffer from the high dimensionality of the features
- Single-tree based methods: each path through the tree typically accesses only a few of the feature dimensions, so they fail to cope with the variance of the feature dimensions; fast, but the performance is not even as good as k-means
- Random forests: build an ensemble of trees; each tree node is split by checking a randomly selected subset of feature dimensions; all trees are built using the video or image labels (supervised method). Instead of treating the trees as an ensemble of classifiers, we treat all the leaves of all the trees as "words". The generated words are more meaningful and discriminative, since they carry class-category information.

Random Forests (II): basic steps to build a C4.5 decision tree from a set of training examples with M attributes
- For each attribute a, find the information gain from splitting on a
- Let a_best be the attribute with the highest gain
- Create a decision node that splits on a_best
- Recurse on the sublists obtained by splitting on a_best and add those nodes as children of the current node
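
The core computation is the information gain of a candidate split; a self-contained sketch for one numeric attribute:

```python
import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(values, labels, threshold):
    # Gain from splitting one attribute's values at `threshold`.
    left, right = labels[values <= threshold], labels[values > threshold]
    weighted = (len(left) * entropy(left) +
                len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted
```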

Random Forests (III): building an ensemble of trees. To split a node in a single tree:
- Select K attributes at random as candidate attributes
- Generate K splits and pick the split s* with the maximal information gain (maximum score value)
- Split the current data set into S_l and S_r
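
A sketch of the randomized node split, reusing `information_gain` from the previous sketch; the value of K and the data-driven threshold choice are assumptions:

```python
import numpy as np

def best_random_split(X, y, K=10, rng=np.random.default_rng(0)):
    """Try K random (attribute, threshold) candidates; keep the best one."""
    best_gain, best_attr, best_thr = -np.inf, None, None
    for _ in range(K):
        a = int(rng.integers(X.shape[1]))   # randomly selected attribute
        t = float(rng.choice(X[:, a]))      # threshold drawn from the data
        g = information_gain(X[:, a], y, t)
        if g > best_gain:
            best_gain, best_attr, best_thr = g, a, t
    # Split S into S_l = {x : x[a] <= t} and S_r = {x : x[a] > t}.
    return best_attr, best_thr
```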

Random Forests (IV): instead of using random forests for classification, we can treat them as codebook generation. Treat each leaf as a visual word; an image or video can then be represented by the histogram of visual words.
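
scikit-learn's forests expose the leaf index of every sample via `apply`, which makes the leaves-as-words construction a few lines; the data sizes below are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Local descriptors and, per descriptor, the class label of its source video.
X = np.random.rand(1000, 64)
y = np.random.randint(0, 4, 1000)

forest = RandomForestClassifier(n_estimators=5, max_leaf_nodes=64,
                                random_state=0).fit(X, y)

def forest_histogram(features, forest):
    """Treat every leaf of every tree as a visual word."""
    leaves = forest.apply(features)          # (n_features, n_trees) leaf ids
    hists = [np.bincount(leaves[:, t], minlength=est.tree_.node_count)
             for t, est in enumerate(forest.estimators_)]
    return np.concatenate(hists).astype(float)

descriptor = forest_histogram(np.random.rand(200, 64), forest)
```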

Random Forests (V)

Random Forests (VI)

Random Forests (VII): advantages
- The trained visual words are more meaningful than those of k-means, since class labels guide the codebook generation
- More efficient than k-means when the codebook size is large
- Able to overcome the instability of k-means caused by high-dimensional features

"Visual Bits" (I)
- Both k-means and random forests treat all features equally when generating the codebook, and use hard assignment (each feature can only be assigned to one "word")
- "Visual Bits" (Rong Jin et al., Unifying Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008) trains a visual codebook for each category, so it can overcome the shortcomings of hard assignment
- It integrates classification and codebook generation, so it is able to select the relevant features by weighting them (see the sketch below)
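
The full Visual Bits method learns the codebook and the classifier jointly; the sketch below shows only the per-category-codebook idea under that caveat, with cluster counts as assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def per_category_codebooks(features_by_class, words_per_class=32):
    # One small codebook per action category (not the full joint training).
    return [KMeans(n_clusters=words_per_class, n_init=4, random_state=0)
            .fit(f).cluster_centers_ for f in features_by_class]

def describe(features, codebooks):
    # Histogram over the concatenation of all per-category codebooks.
    centers = np.vstack(codebooks)
    d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)
```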

"Visual Bits" (II): Rong Jin et al., Unifying Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition.

"Visual Bits" (III)

Classifiers
- Kernel SVM: histogram intersection kernel, chi-square kernel
- Multiple kernels: fuse different types of features, fuse different distance metrics (see the sketch below)
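
A hedged sketch of kernel fusion for a precomputed-kernel SVM; the convex-combination weight beta is an assumption, since the slides do not fix a fusion rule:

```python
import numpy as np

def hik_gram(A, B):
    # Histogram-intersection kernel between rows of A and B.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def chi2_gram(A, B, gamma=1.0, eps=1e-10):
    # exp(-gamma * chi-square distance), a common kernelization.
    d = 0.5 * (((A[:, None, :] - B[None, :, :]) ** 2)
               / (A[:, None, :] + B[None, :, :] + eps)).sum(axis=2)
    return np.exp(-gamma * d)

def fused_gram(A, B, beta=0.5):
    # Multiple-kernel fusion as a convex combination of the two Gram matrices.
    return beta * hik_gram(A, B) + (1 - beta) * chi2_gram(A, B)

# Usage with scikit-learn: SVC(kernel="precomputed").fit(fused_gram(X, X), y)
```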

The end… Thank you!