Transferable Dictionary Pair based Cross-view Action Recognition Lin Hong.

Outline: Research Background; Method; Experiment

Research Background Cross-view action recognition: –Automatically analyze ongoing activities in an unknown video; –Recognize actions across different views, i.e., be robust to viewpoint variation; –Essential for human-computer interaction and video retrieval, especially for activity monitoring in surveillance scenarios. Challenges: –The same action looks quite different from different viewpoints; –Action models learned in one view become less discriminative for recognition in a substantially different view.

Research Background Transfer learning approaches for cross-view action recognition: 1. Split-based features method [1]; 2. Bilingual-words based method [2]; 3. Transferable dictionary pair (TDP) based method [3]. [1] A. Farhadi and M. K. Tabrizi. Learning to recognize activities from the wrong view point. In ECCV, 2008. [2] J. Liu, M. Shah, B. Kuipers, and S. Savarese. Cross-view action recognition via view knowledge transfer. In CVPR, 2011. [3] J. Zheng, Z. Jiang, J. Phillips, and R. Chellappa. Cross-view action recognition via a transferable dictionary pair. In BMVC, 2012.

Research Background Motivation: 1. Split-based features method: exploits the frame-to-frame correspondence between pairs of videos taken from two views of the same action by transferring the split-based features of video frames in the source view to the corresponding video frames in the target view. Defect: establishing the frame-to-frame correspondence is computationally expensive. 2. Bilingual-words based method: exploits the correspondence between view-dependent codebooks constructed by k-means clustering on the videos of each view. Defect: the codebook-to-codebook correspondence is not accurate enough to guarantee that a pair of videos observed in the source and target views will have similar feature representations.

Research Background Motivation 3. Transferable dictionary pair (TDP) based method: currently the most encouraging method. It learns dictionaries for the source and target views simultaneously, ensuring that the same action has the same representation in both views. Defect: although this transfer learning algorithm achieves good performance, it remains hard to transfer action models across views when the top view is involved.

Method: TDP based method Objective: force two sets of videos of shared actions in the two views to have the same sparse representations, so that the action model learned in the source view can be directly applied to classify test videos in the target view. [Figure: flowchart of the cross-view action recognition framework]

Method Spatio-temporal interest point (STIP) based features:  Advantages: capture local salient characteristics of appearance and motion; robust to spatio-temporal shifts and scales, background clutter, and multiple motions.  Local space–time feature extraction: Detector: selects spatio-temporal interest points in a video by maximizing specific saliency functions; Descriptor: captures shape and motion in the neighborhoods of the selected points using image measurements.
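
As a concrete illustration (not the authors' code), here is a minimal NumPy/SciPy sketch of the periodic-motion saliency function of the cuboid detector by Dollár et al., the local detector used in the experiments later; the filter parameters below are illustrative choices, and interest points would be taken as local maxima of the returned response volume.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def cuboid_response(video, sigma=2.0, tau=1.5, omega=0.6):
    """Saliency function of the cuboid detector: spatial Gaussian
    smoothing combined with a quadrature pair of 1D temporal Gabor
    filters. `video` is a (T, H, W) float array; sigma, tau, omega
    are illustrative values, not the ones from the paper."""
    # Smooth each frame spatially (no smoothing along time, axis 0).
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))
    # Quadrature pair of temporal Gabor filters.
    t = np.arange(-int(4 * tau), int(4 * tau) + 1)
    h_ev = -np.cos(2 * np.pi * omega * t) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * omega * t) * np.exp(-t**2 / tau**2)
    # Convolve along the temporal axis and combine squared responses;
    # interest points are local maxima of this volume.
    r_ev = convolve1d(smoothed, h_ev, axis=0)
    r_od = convolve1d(smoothed, h_od, axis=0)
    return r_ev**2 + r_od**2
```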

Bag of Words (BoW) feature: STIP features are first quantized into visual words, and a video is then represented as a frequency histogram over the visual words. [Figure: STIP features from view1 and view2 are clustered with k-means into a codebook, from which the BoW features are computed]
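
A minimal sketch of this quantization step using scikit-learn's k-means; the function names are hypothetical, and the codebook size matches the 1000-word local codebook used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=1000, seed=0):
    """Cluster a pool of STIP descriptors (n x d) into k visual words."""
    return KMeans(n_clusters=k, random_state=seed).fit(descriptors)

def bow_histogram(codebook, video_descriptors):
    """Represent one video as a normalized frequency histogram
    over the visual words of `codebook`."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```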

Method Sparse coding and dictionary learning:  K-SVD is well known for efficiently learning a dictionary from a set of training signals. It solves the following optimization problem: $\min_{D, X} \|Y - DX\|_F^2 \ \text{s.t.}\ \forall i,\ \|x_i\|_0 \le T_0$, where $D$ is the learned dictionary, $X = [x_1, \dots, x_N]$ are the sparse representations of the input signals $Y$, and $T_0$ is the sparsity constraint.
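
For concreteness, here is a compact, unoptimized Python sketch of K-SVD under these definitions: orthogonal matching pursuit for the sparse coding stage and a rank-1 SVD for each atom update. A didactic sketch, not the reference implementation.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(Y, n_atoms, sparsity, n_iter=30, seed=0):
    """Minimal K-SVD: min_{D,X} ||Y - DX||_F^2 s.t. ||x_i||_0 <= T0.
    Y is (d, n) with one signal per column; defaults are illustrative."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
    for _ in range(n_iter):
        # Sparse coding stage: OMP, at most `sparsity` nonzeros per signal.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=sparsity)
        # Dictionary update stage: refine one atom at a time.
        for j in range(n_atoms):
            used = np.flatnonzero(X[j])
            if used.size == 0:
                continue
            # Residual restricted to signals that use atom j.
            E = Y[:, used] - D @ X[:, used] + np.outer(D[:, j], X[j, used])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]               # best rank-1 approximation
            X[j, used] = s[0] * Vt[0]
    return D, X
```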

Method View-invariant action recognition. Objective: recognize an unknown action in an unseen (target) view using training data taken from other (source) views. Approach: simultaneously learn the source and target dictionaries by forcing the shared videos taken from the two views to have the same sparse representations. [Figure: source videos Ys and target videos Yt are encoded so that their sparse codes coincide, Xs = Xt]

Method Ds and Dt are learned by forcing two sets of videos of shared actions in the two views to have the same sparse representations. With such view-invariant sparse representations, we can learn an action model for orphan actions in the source view and test it in the target view. Objective function: $\min_{D_s, D_t, X} \|Y_s - D_s X\|_F^2 + \|Y_t - D_t X\|_F^2 \ \text{s.t.}\ \forall i,\ \|x_i\|_0 \le T_0$, where $Y_s, Y_t$ are the shared videos in the two views. Stacking $Y = \begin{bmatrix} Y_s \\ Y_t \end{bmatrix}$ and $D = \begin{bmatrix} D_s \\ D_t \end{bmatrix}$ turns this into the standard problem $\min_{D,X} \|Y - DX\|_F^2$, so $\{D_s, D_t\}$ can be efficiently learned using the K-SVD algorithm.
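
The stacking identity suggests a direct implementation. Below is a minimal Python sketch reusing the `ksvd` function from the previous sketch; the name `learn_transferable_pair` and the default atom/sparsity values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def learn_transferable_pair(Ys, Yt, n_atoms=256, sparsity=10):
    """Learn {Ds, Dt} with shared sparse codes X by stacking the
    paired source/target BoW matrices and running a single K-SVD:
        min ||[Ys; Yt] - [Ds; Dt] X||_F^2  s.t.  ||x_i||_0 <= T0.
    Columns of Ys and Yt must correspond to the same action instances."""
    Y = np.vstack([Ys, Yt])            # stack the two views
    D, X = ksvd(Y, n_atoms, sparsity)  # `ksvd` from the sketch above
    Ds, Dt = D[:Ys.shape[0]], D[Ys.shape[0]:]
    return Ds, Dt, X
```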

Experiment Protocol: leave-one-action-class-out. Each time, we consider only one action class for testing in the target view; this action class is not used to construct the transferable dictionary pair. Dataset: we test the approach on the IXMAS multi-view dataset; Website:

Experiment: dataset The most popular multi-view dataset: Five views: four side views and one top view; 11 actions, each performed 3 times by 10 actors; Each view contains 330 action videos; Each action class contains 30 samples under each view; The actors freely choose their position and orientation.

Experiment: dataset [Figure: camera setup for cam0–cam3; the four arrows indicate the directions actors may face]

Experiment: dataset [Figure: exemplar frames from the IXMAS dataset, showing cam0–cam4 at three time instants]

Experiment Each time, we select one action class (30 samples) for testing in the target view. Excluding this action class, the remaining samples (300 per view) in both the source and target views are used for dictionary pair learning. The classification accuracy is averaged over all possible choices of the test action class.
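
A hedged Python sketch of this protocol, building on the `learn_transferable_pair` and `ksvd` sketches above: for each held-out class, learn the dictionary pair on the remaining classes, encode source videos with Ds and held-out target videos with Dt, and classify the sparse codes with a 1-NN classifier. Helper names and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp
from sklearn.neighbors import KNeighborsClassifier

def evaluate_loaco(Ys, Yt, labels, n_atoms=256, sparsity=10):
    """Leave-one-action-class-out evaluation. Ys, Yt are (d, n) BoW
    matrices of the same n videos seen in source/target views; `labels`
    gives the action class of each column. Returns mean accuracy."""
    accuracies = []
    for c in np.unique(labels):
        train = labels != c  # classes used for dictionary learning
        Ds, Dt, _ = learn_transferable_pair(Ys[:, train], Yt[:, train],
                                            n_atoms, sparsity)
        # View-invariant sparse codes: all source videos for training,
        # target videos of the held-out class for testing.
        Xs = orthogonal_mp(Ds, Ys, n_nonzero_coefs=sparsity)
        Xt = orthogonal_mp(Dt, Yt[:, ~train], n_nonzero_coefs=sparsity)
        knn = KNeighborsClassifier(n_neighbors=1).fit(Xs.T, labels)
        accuracies.append(np.mean(knn.predict(Xt.T) == c))
    return float(np.mean(accuracies))
```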

Experiment: STIP feature  Local feature: cuboid detector & descriptor [4].  Global feature: shape-flow descriptor [5].  BoW feature: a 1000-dimensional local BoW feature and a 500-dimensional global BoW feature; finally, each action video is represented by a 1500-dimensional BoW feature. [4] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005. [5] D. Tran and A. Sorokin. Human activity recognition with metric learning. In ECCV, 2008.

Experiment: result [Table: cross-view recognition accuracies, Cuboid + KNN] The red numbers are the recognition results reported in [1]; the black bold numbers are our results with the same method. The experimental setting follows [1], using the same features and classifier. The comparable results confirm that our experiment is correct and that we have successfully reproduced the method of [1]. [1] J. Zheng, Z. Jiang, J. Phillips, and R. Chellappa. Cross-view action recognition via a transferable dictionary pair. In BMVC, 2012.