Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots Chao-Yeh Chen and Kristen Grauman University of Texas at Austin.



Outline Introduction Approach Experimental Results Conclusions

Introduction

We expand the training set without requiring additional manually labeled examples, and explore two ways to implement our idea. – Example-based representation: we match the labeled training images to unlabeled video frames based on their pose similarity, then augment the training set with the poses appearing before and after the matched frames. – Manifold-based representation: we learn a nonlinear manifold over body poses, relying on the temporal nearness of the video frames to establish which poses should remain close. We then map the static training instances onto the manifold and explore their neighborhoods on the manifold to augment the training set.

Approach Our approach augments a small set of static images labeled by their action category by introducing synthetic body pose examples. The synthetic examples extend the real ones locally in time, so that we can train action classifiers on a wider set of poses that are (likely) relevant for the actions of interest.

Representing Body Pose We use a part-based representation of pose called a poselet activation vector (PAV), adopted from [17]. A poselet [4] is an SVM classifier trained to fire on image patches that look like some consistent fragment of human body pose.
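A minimal sketch of how a PAV-style descriptor could be assembled; the function name and the (poselet_id, score) input format are assumptions for illustration, not the authors' implementation. Each entry of the vector accumulates the confident detections of one poselet inside a person window:

```python
# Hypothetical sketch of building a poselet activation vector (PAV):
# each entry accumulates the positive SVM scores of one poselet
# detector firing inside the person's bounding box.

def poselet_activation_vector(detections, num_poselets):
    """detections: list of (poselet_id, score) pairs for one person window.
    Returns a length-num_poselets vector of summed positive scores."""
    pav = [0.0] * num_poselets
    for poselet_id, score in detections:
        if score > 0:  # keep only confident activations
            pav[poselet_id] += score
    return pav
```

With this representation, two people in similar body configurations fire similar subsets of poselets, so Euclidean distance between PAVs serves as a rough pose similarity.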

Unlabeled Video Data In our current implementation, we use video from the Hollywood dataset to form the unlabeled pool. To pre-process the unlabeled video, we: – 1) detect people and extract person tracks – 2) compute a PAV pose descriptor for each person window found – 3) either simply index those examples for our exemplar-based method, or compute a pose manifold for our manifold-based method

Generating Synthetic Pose Examples Let {(x_n^i, y_n)}_{n=1}^N denote the N training snapshots our system receives as input, where the superscript i denotes image, and each x_n^i is a PAV descriptor with an associated action class label y_n. Let {t_1, ..., t_K} denote the K person tracks from the unlabeled video, and let each track t_k be represented by a sequence of PAV descriptors t_k = (x_{k,1}^v, ..., x_{k,F_k}^v), where the superscript v denotes video, and F_k is the number of frames in the k-th track.

Example-based strategy For each training snapshot pose x_n^i, we find its nearest neighbor pose in any of the video tracks, according to Euclidean distance in PAV space. Denote that neighbor x_{k,q}^v. We simply sample poses temporally adjacent to x_{k,q}^v to form synthetic examples that "pad" the training set for class y_n. Specifically, we take x_{k,q-T}^v and x_{k,q+T}^v, the poses T frames before and T frames after the match, yielding an expanded training set that also contains (x_{k,q-T}^v, y_n) and (x_{k,q+T}^v, y_n).
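The example-based strategy can be sketched as follows; this is a simplified stand-in (linear nearest-neighbor scan, synthetic poses clamped at track boundaries) rather than the paper's exact implementation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pad_with_temporal_neighbors(snapshot_pav, label, tracks, T):
    """tracks: list of person tracks, each a list of PAV descriptors
    over consecutive frames.  Finds the video pose nearest to
    snapshot_pav and returns the poses T frames before and after it,
    labeled with the snapshot's action class."""
    best = None  # (distance, track_index, frame_index)
    for k, track in enumerate(tracks):
        for q, pav in enumerate(track):
            d = euclidean(snapshot_pav, pav)
            if best is None or d < best[0]:
                best = (d, k, q)
    _, k, q = best
    track = tracks[k]
    synthetic = []
    for q2 in (q - T, q + T):
        if 0 <= q2 < len(track):  # skip neighbors that fall off the track
            synthetic.append((track[q2], label))
    return synthetic
```

The synthetic examples inherit the label of the matched snapshot, under the assumption that the pose a moment before or after a matched frame still depicts the same action.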

Manifold-based strategy To construct the manifold, we use the locally linear embedding (LLE) algorithm [25], building it from the PAVs in the unlabeled video. We determine neighbors for LLE using a similarity function capturing both temporal nearness and pose similarity, where x_{k,q}^v denotes the PAV for the q-th frame within the k-th track of the unlabeled video. [25] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 2000.

Training with a Mix of Real and Synthetic Poses We employ domain adaptation to account for the potential mismatch in feature distributions between the labeled snapshots and the unrelated video. Domain adaptation (DA) techniques are useful when there is a shift between the data distributions in a "source" and a "target" domain.

Domain adaptation

Training with a Mix of Real and Synthetic Poses We use the "frustratingly simple" DA approach of [6]. It maps original data in R^d to a new feature space of dimension 3d. Every synthetic (source) pose example x is mapped to (x, x, 0), and every real (target) pose example x is mapped to (x, 0, x), where 0 = [0, ..., 0] is the d-dimensional zero vector. [6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
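Daumé's feature augmentation is simple enough to state directly in code. The first copy of x lives in a shared subspace used by both domains, while the second and third copies are domain-specific, letting the classifier learn which feature dimensions transfer:

```python
def augment_source(x):
    """Synthetic video pose (source domain): x -> (x, x, 0)."""
    zeros = [0.0] * len(x)
    return list(x) + list(x) + zeros

def augment_target(x):
    """Real labeled snapshot (target domain): x -> (x, 0, x)."""
    zeros = [0.0] * len(x)
    return list(x) + zeros + list(x)
```

A linear classifier trained on the augmented features effectively keeps one shared weight vector plus one correction per domain.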

Datasets For the unlabeled video data, we use the training and testing clips from the Hollywood Human Actions dataset. For the recognition task with static test images, we test on both the 9 actions in the PASCAL VOC 2010 dataset and 10 selected verbs from the Stanford 40 Actions dataset. For the video recognition task, we gather 78 test videos from the HMDB51 [14], Action Similarity Labeling Challenge, and UCF Sports datasets that contain activities also appearing in PASCAL.

Recognizing Activity in Novel Images

Recognizing Activity in Novel Video

Conclusions We proposed a framework to augment training data for activity recognition without additional labeling cost. We explore simple but effective example-based and manifold-based representations of pose dynamics, and combine them with a domain adaptation feature mapping that connects the real and generated examples. Our results classifying activities in three datasets show that the synthetic poses have significant impact when labeled training examples are sparse.