Activity Recognition
Ram Nevatia
Presents work of F. Lv, P. Natarajan and V. Singh
Institute of Robotics and Intelligent Systems, Computer Science Department
Viterbi School of Engineering, University of Southern California

Activity Recognition: Motivation
- Activity is the key content of a video (along with scene description)
- Useful for:
  - Monitoring (alerts)
  - Indexing (forensic, deep analysis, entertainment, …)
  - HCI
  - …

Issues in Activity Recognition
- Inherent ambiguities of 2-D videos
- Variations in image/video appearance due to changes in viewpoint, illumination, clothing (texture), …
- Variations in style: different actors, or even the same actor at different times
- Reliable detection and tracking of objects, especially those directly involved in activities
- Temporal segmentation: most work assumes a single activity in a given clip
- "Recognition" of novel events

Possible Approaches
- Match video signals directly
  - Dynamic time warping (a minimal sketch follows below)
- Extract spatio-temporal features and classify based on them
  - Bag of words, histograms, "clouds", … (work of Laptev et al.)
  - Most earlier work assumes action segmentation (detection vs. classification)
  - Andrew's talk on the use of localization and tracking
- Structural approach
  - Based on detection of objects, their tracks and relationships
  - Requires the ability to perform the above operations
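
To make the direct-matching idea concrete, here is a minimal dynamic time warping sketch (not from the talk; it assumes each frame is summarized by a fixed-length feature vector, e.g. a silhouette or flow descriptor):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two feature sequences.

    x, y: arrays of shape (T1, D) and (T2, D) of per-frame features.
    Returns the cost of the best monotonic alignment of the sequences.
    """
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            # Extend the cheapest of: diagonal match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T1, T2]
```

A test clip can then be labeled with the action whose exemplar sequence has the smallest warped distance.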

Event Hierarchy
- Composite events: compositions of other, simpler events; the composition is usually, but not necessarily, a sequence operation, e.g. getting out of a car, opening a door and entering a building. These form a natural hierarchy (or lattice).
- Primitive events: those we choose not to decompose, e.g. walking. Recognized directly from observations.
- Graphical models, such as HMMs and CRFs, are natural tools for recognition of composite events.

Key Ideas
- Only a few primitive actions are needed in any domain:
  - Sign language: Moves and Holds
  - Human pose articulation: Rotate, Flex and Pause
  - Rigid objects (cars, people): Translate, Rotate, Scale
- These can be represented symbolically using formal rules
- Composite actions can be represented as combinations of the primitive actions
- Handle uncertainty and error in video by mapping the rules to graphical models: HMM, DBN, CRF

Graphical Models
- A network, normally used to represent the temporal evolution of a state
- The next state depends only on the previous state; the observation depends only on the current state (a single state variable)
- Typical task: estimate the most likely state sequence given an observation sequence (Viterbi algorithm)
(Figures: an HMM; a CRF)
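
For reference, a minimal Viterbi decoder for a discrete HMM (a standard textbook sketch, not the talk's implementation; the systems described later use much richer state spaces):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state sequence for a discrete HMM.

    log_pi: (S,) log initial state probabilities
    log_A:  (S, S) log transition probabilities, log_A[i, j] = log p(j | i)
    log_B:  (S, O) log emission probabilities
    obs:    (T,) observation indices
    """
    T, S = len(obs), len(log_pi)
    delta = np.zeros((T, S))             # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)   # argmax predecessor pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (prev state) x (next state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```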

Mid vs. Near Range
- Mid-range
  - Limbs of the human body, particularly the arms, are not distinguishable
  - Common approach: detect and track moving objects and make inferences based on their trajectories
- Near-range
  - Hands/arms are visible; activities are defined by pose transitions, not just position transitions
  - Pose tracking is difficult; top-down methods are commonly used

Mid-Range Example
- Example of abandoned-luggage detection
- Based on trajectory analysis and simple object detection/recognition
- Uses a simple Bayesian classifier and logical reasoning about the order of sub-events
- Tested on PETS, ETISEO and TRECVID data

Top-Down Approaches
- Bottom-up methods remain slow and are not robust; many methods are based on the use of multiple video streams
- An alternative is top-down approaches, where processing is driven by event models
- Simultaneous Tracking and Action Recognition (STAR)
  - In analogy with SLAM in robotics
  - Provides action segmentation, in addition to recognition
  - Closed-world assumption
  - Current work is limited to single-actor actions

Activity Recognition w/o Tracking
- Input sequence → action segments (e.g. check watch, punch, kick, pick up, throw) + 3D body pose
- Objective: given a single video sequence, segment it into parts, each containing one instance of some human action, and recognize the action in each part
- Basic human actions of interest: walking, sitting down, punching, kicking; the dataset contains 15 different actions
- 3D human body poses are not estimated explicitly, but the result provides such poses as a by-product

Difficulties
- Viewpoint change and pose ambiguity (with a single camera view)
- Spatial and temporal variations (style, speed)

Key Poses and Action Nets
- Key poses are determined from MoCap data by an automatic method that computes large changes in energy
- Key poses may be shared among different actions
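
A plausible sketch of such key-pose selection (the exact criterion is not given in the slides; this assumes key poses sit at local minima of joint-angle motion energy, flanked by large energy changes):

```python
import numpy as np

def select_key_poses(joint_angles, threshold):
    """Pick candidate key poses from MoCap joint-angle trajectories.

    joint_angles: (T, D) array of joint angles over time.
    Returns frame indices where motion energy has a pronounced local minimum.
    """
    velocity = np.diff(joint_angles, axis=0)      # per-frame joint-angle change
    energy = (velocity ** 2).sum(axis=1)          # motion energy per frame
    keys = []
    for t in range(1, len(energy) - 1):
        # A near-stationary pose flanked by large energy change marks a
        # transition between distinct poses.
        if (energy[t] < energy[t - 1] and energy[t] < energy[t + 1]
                and max(energy[t - 1], energy[t + 1]) - energy[t] > threshold):
            keys.append(t + 1)                    # +1: np.diff shifts indices
    return keys
```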

Experiments: Training Set
- 15 action models, including stand, check watch, cross arms, scratch head, sit down, stand up, turn around, walk in a circle, wave hand, punch, kick, point, pick up and throw
- Variations are included for some action classes, e.g. wave left hand vs. wave right hand, kick gently vs. kick vigorously
- 177 key poses in total
- Given the tilt angle, each key pose is rendered from 36 viewpoints with different pan angles, giving 6372 nodes in the Action Net

Action Net: Apply Constraints
- Because key poses are rendered from many viewpoints, the same action net is repeated for each viewpoint (spaced 10° apart in pan)
- Links connect action models rendered from adjacent viewpoints; these allow smooth changes in the actor's orientation to be modeled
- Unlike other graph models, which encode such information in parameters, the Action Net explicitly models camera viewpoint and action connectivity

Experiments: Test Set
- 50 video clips, average length 1165 frames
- In each clip, one actor performs all 15 actions; actors freely choose orientation and position, and the order and number of action instances vary across clips
- Shot from 5 different viewpoints
- 10 actors (5 men, 5 women)
- The large number of action classes and the large variation in viewpoints and actors make recognition on this dataset challenging

A Video Result
(Video frames: original frame; extracted blob and ground truth; results without and with the action net)

Working with Natural Environments
- Reduce reliance on good foreground segmentation
  - Key poses may not be discriminative enough without accurate segmentation; include models for the motion between key poses
- More general graphical models that include:
  - Hierarchy
  - Transition probabilities that may depend on observations
  - Observations that may depend on multiple states
  - Duration models (HMMs imply an exponential decay)
- Remove the need for MoCap data to acquire models

Composite Event Representation
- CE: Sequence(P1, P2)
- P1: Rotate(Right, Arm, 90°, z-axis)
- P2: Rotate(Right, Arm, 90°, -z-axis)
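
One way this symbolic representation might look in code (a sketch; the class and field names are illustrative, not the authors' actual definitions):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rotate:
    """Primitive event: rotate a body part about an axis."""
    side: str                         # "Right" or "Left"
    part: str                         # e.g. "Arm"
    angle_deg: float
    axis: Tuple[float, float, float]

@dataclass
class Sequence:
    """Composite event: an ordered sequence of simpler events."""
    steps: List[object]

# The example from the slide: rotate the right arm 90 degrees about +z,
# then 90 degrees back about -z.
P1 = Rotate("Right", "Arm", 90.0, (0.0, 0.0, 1.0))
P2 = Rotate("Right", "Arm", 90.0, (0.0, 0.0, -1.0))
CE = Sequence([P1, P2])
```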

Learning Event Models
(Figure: primitive events P1 and P2; composite event = Sequence(P1, P2))

Dynamic Bayesian Action Network
- Map action models to a Dynamic Bayesian Network
- Decompose a composite action into a sequence of primitive actions
- Each primitive is expressed in a functional form f_pe(s, s', N): it maps the current state s to the next state s' given parameters N
- Assume a known, finite set of functions f for the primitives
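
A hedged sketch of what one such primitive transition function might look like (the slides only give the signature f_pe(s, s', N); the pose representation and parameterization here are assumptions):

```python
import numpy as np

def rotate_primitive(state, params):
    """Deterministic next-state function for a Rotate primitive.

    state:  dict mapping joint names to axis-angle rotation vectors (assumed).
    params: N = (joint, axis, step_deg) -- which joint rotates, about which
            axis, and by how many degrees per frame.
    """
    joint, axis, step_deg = params
    next_state = dict(state)  # shallow copy of the pose
    # Advance the joint's rotation about `axis` by one per-frame step.
    next_state[joint] = state[joint] + np.asarray(axis) * np.radians(step_deg)
    return next_state
```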

Inference Overview
- Given a video, obtain the initial state distribution from the start key pose of every composite action
- For each current state:
  - Predict the primitive based on the current duration
  - Predict a 3D pose given the primitive and the current duration
  - Collect the observation potential of the pose using foreground overlap and the difference image
- Obtain the best state sequence using dynamic programming (Viterbi algorithm); a skeleton of the scoring loop follows below
- Features used to match models with observations: if the foreground can be extracted reliably, blob shape properties can be used; otherwise, edge and motion-flow matching
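
A skeleton of the per-frame scoring loop (illustrative only; `predict_pose` and `potential` stand in for the model components described on the surrounding slides, and are passed in so the sketch stays self-contained):

```python
import numpy as np

def dban_forward_scores(frames, states, predict_pose, potential):
    """Score every (frame, state) pair for the Viterbi pass.

    frames: list of images; states: list of (action, primitive, duration)
    tuples. Returns a (T, S) array of log observation potentials that is
    later combined with transition scores in the dynamic program.
    """
    T, S = len(frames), len(states)
    scores = np.full((T, S), -np.inf)
    for t, frame in enumerate(frames):
        for s, state in enumerate(states):
            pose = predict_pose(state)   # 3D pose implied by the state
            scores[t, s] = np.log(potential(frame, pose) + 1e-12)
    return scores
```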

Pose Tracking & Action Recognition
Inference algorithm:
- Obtain state distributions by matching poses sampled from the action models
- Infer the action by finding the maximum-likelihood state sequence

Observations
- Foreground overlap with the full-body model; difference-image overlap with the body parts in action
- Grid of centroids to match the foreground blob with the pose
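
An illustrative combination of the two cues (the weights and the exact overlap measure, intersection-over-union here, are assumptions, not the paper's definition):

```python
import numpy as np

def observation_potential(fg_mask, pose_mask, diff_img, part_mask,
                          w_fg=1.0, w_diff=1.0):
    """Combine foreground and difference-image cues for one candidate pose.

    fg_mask:   boolean foreground mask from background subtraction
    pose_mask: boolean silhouette of the rendered full-body pose
    diff_img:  frame-difference magnitude image
    part_mask: boolean mask of the body parts moving in this action
    """
    # Foreground cue: how well the rendered pose explains the blob.
    inter = np.logical_and(fg_mask, pose_mask).sum()
    union = np.logical_or(fg_mask, pose_mask).sum()
    fg_score = inter / union if union else 0.0
    # Motion cue: fraction of difference energy under the moving parts.
    total = diff_img.sum()
    diff_score = diff_img[part_mask].sum() / total if total else 0.0
    return w_fg * fg_score + w_diff * diff_score
```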

Results
(From the CVPR 2008 paper)

Action Learning
Involves two problems:
- Model learning: learning the parameters N in the primitive event definition f_pe(s, s', N)
  - Key pose annotation and lifting
  - Pose interpolation
- Feature weight learning: learning the weights w_k of the different potentials

Key Pose Annotation and 3D Lifting

Pose Interpolation
- All limb motions can be expressed in terms of Rotate(part, axis, θ)
- We need to learn the axis and θ; this is simple given the start and end joints of the part
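
The axis and angle can be recovered from the limb's direction vectors at the two key poses; a minimal sketch (taking the limb direction as the vector from the part's start joint to its end joint):

```python
import numpy as np

def rotation_between(v_start, v_end):
    """Axis and angle of Rotate(part, axis, theta) from limb directions.

    v_start, v_end: 3D limb direction vectors at the start and end key poses.
    Returns (unit axis, theta); degenerate parallel case returns (None, 0).
    """
    a = v_start / np.linalg.norm(v_start)
    b = v_end / np.linalg.norm(v_end)
    axis = np.cross(a, b)                    # perpendicular to both directions
    norm = np.linalg.norm(axis)
    if norm < 1e-8:                          # parallel vectors: no unique axis
        return None, 0.0
    theta = np.arctan2(norm, np.dot(a, b))   # angle between the two directions
    return axis / norm, theta
```

Intermediate poses then follow by rotating the part by θ·t/T at frame t of a T-frame primitive.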

Feature Weight Learning
- Feature weight estimation as minimization of a log-likelihood error function
- Learn the weights using the Voted Perceptron algorithm
  - Requires fully labeled training data, which is not available
  - We propose an extension, the Latent-State Voted Perceptron, to deal with partial annotations
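
For orientation, a sketch of the standard, fully supervised structured voted (averaged) perceptron that the latent-state extension builds on; all names are illustrative, and `decode` would be the Viterbi-style inference described earlier:

```python
import numpy as np

def voted_perceptron(examples, feature_fn, decode, n_iters=10):
    """Structured voted-perceptron training (standard algorithm).

    examples:   list of (observations, gold state sequence) pairs
    feature_fn: maps (observations, state sequence) -> feature vector
    decode:     returns the highest-scoring state sequence under weights w
    """
    dim = len(feature_fn(*examples[0]))
    w = np.zeros(dim)
    w_sum = np.zeros(dim)          # running sum implements the averaging
    for _ in range(n_iters):
        for obs, gold in examples:
            pred = decode(obs, w)
            if pred != gold:
                # Move weights toward gold features, away from the prediction.
                w += feature_fn(obs, gold) - feature_fn(obs, pred)
            w_sum += w
    return w_sum / (n_iters * len(examples))
```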

Experiments
Tested the method on 3 datasets:
- Weizmann dataset
- Gesture set with arm gestures
- Grocery Store set with full-body actions

Dataset         Train:Test   Recognition (% accuracy)   2D Tracking (% error)   Speed (fps)
Weizmann        3:6          99.5                       --                      --
Gesture         3:5          90.18                      5.25                    8
Grocery Store   1:7          100.0                      11.88                   1.6

Weizmann Dataset
- Popular dataset for action recognition
- 10 full-body actions from 9 actors
- Each video has multiple instances of one action

Method                  Train:Test   Recognition Accuracy (%)
Jhuang et al [9]        6:3          98.8
Space-Time Shapes [6]   8:1          100.0
Fathi et al [5]         --           --
Sun et al [20]          3:6          87.3
DBAN                    1:8          96.7
DBAN                    3:6          99.5

Gesture Dataset
- 5 instances of 12 gestures from 8 actors, in an indoor lab setting
- 500 instances of all actions in total
- 852x480 pixel resolution; person height 200-250 pixels

Grocery Store Dataset
- Videos of 3 actions collected from a static camera
- 16 videos from 8 actors, performed at varying pan angles
- Actor height varies from 200-375 pixels, in 852x480 resolution videos

Incorporating Better Descriptors
- Previous work is based on weak lower-level analysis
- We can also evaluate 2D part models: a Dynamic Bayesian Action Network with a part model

Experiments
- Hand gesture dataset in an indoor lab
- 5 instances of 12 gestures from 8 actors, a total of 500 action segments
- Evaluation metrics:
  - Recognition rate over all action segments
  - 2D pose tracking, as average 2D part accuracy over 48 randomly selected instances

Method       Train:Test   Recognition (% accuracy)   2D Tracking (%)
DBAN-FGM     1:7          78.6                       75.67 (89.94)
DBAN-Parts   1:7          84.52                      91.76 (92.66)

Summary and Conclusions
- The structural approach to activity recognition offers many attractions and challenges
- Results are descriptive, but detecting and tracking objects is challenging
- Hierarchical representation is natural and can be used to reduce complexity
- Good bottom-up analysis remains a key to improved robustness
- The concept of "novel" or "anomalous" events remains difficult to formalize