1 Context-Aware Modeling and Recognition of Activities in Video
Y. Zhu, N. M. Nayak, A. Roy-Chowdhury University of California, Riverside

2 Background - Single Person Action Recognition Strategies
- Time Series of Histograms of Optical Flow
- Bag-of-Features Approaches
- Space-time Shapes

3 Background – Feature Level Relationships
- Spatio-Temporal Relationship Match
- String of Feature Graphs
- Streakline Representation of Activities

4 Background – Activity Level Relationships (Context)
- Collective Activity Recognition
- And-Or Graphs
- Generative: Markov Random Field Model
- Discriminative: Structured SVM Model

5 “Strings of Feature Graphs” Modeling of Activity
A novel generalized framework for activity recognition based on "strings of feature graphs" is proposed:
- Strings of 3D Feature Graphs
- Feature Graph Matching
- Strings of Feature Graphs Matching

6 SFG Construction
[Figure: local features are computed from the video and assembled into feature graphs (Feature Graph 1, Feature Graph 2).]

7 SFG Matching

8 Subsequence Matching
[Figure: cost matrix and accumulated cost matrix for matching the query string of feature graphs against subsequences (Y1, Y2, Y3, ...) of a longer video.]
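The matching step resembles subsequence dynamic time warping. Below is a minimal, illustrative sketch (not the authors' implementation), assuming a precomputed cost matrix whose entries are pairwise feature-graph distances; the accumulated cost matrix lets the query start and end anywhere in the longer sequence.

```python
import numpy as np

def subsequence_dtw(cost):
    """Subsequence DTW over a precomputed cost matrix.

    cost[i, j] is the distance between query graph i and video graph j.
    The query (rows) may match any contiguous stretch of the video
    (columns), so the first row of the accumulated matrix is 'free'."""
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = cost[0, :]                     # match may start at any column
    for i in range(1, n):
        acc[i, 0] = cost[i, 0] + acc[i - 1, 0]
        for j in range(1, m):
            acc[i, j] = cost[i, j] + min(acc[i - 1, j],      # stay on column j
                                         acc[i, j - 1],      # skip column j-1
                                         acc[i - 1, j - 1])  # diagonal step
    end = int(np.argmin(acc[-1, :]))           # best end point in the video
    return acc[-1, end], end

# toy usage: random distances stand in for feature-graph distances
rng = np.random.default_rng(0)
score, end = subsequence_dtw(rng.random((5, 40)))
```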

9 Multi-Object Behavior Modeling using Streaklines
Streaklines are a representation of the motion in a video obtained by integrating the optical flow in a time-varying flow field. An activity is modeled as a spatio-temporal arrangement of motion patterns identified using streaklines.
[Figure: streaklines for people opening a trunk in two videos; the circled region shows the similarity in the activity captured by the streaklines.]
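As a rough illustration of the definition above, the sketch below advects particles through a time-varying optical-flow field, releasing a new particle from each seed point at every frame; the trail of particles from one seed is its streakline. The flow-field format and seeding scheme are assumptions, not the paper's exact procedure.

```python
import numpy as np

def streaklines(flows, seeds):
    """Compute streaklines by integrating a time-varying flow field.

    flows : list of (H, W, 2) arrays; flows[t][y, x] = (dx, dy) at time t
    seeds : (S, 2) array of (x, y) release points
    Returns one (num_particles, 2) polyline per seed."""
    H, W, _ = flows[0].shape
    particles = [[] for _ in seeds]            # live particles per seed
    for flow in flows:
        for s, seed in enumerate(seeds):
            particles[s].append(np.asarray(seed, dtype=float))  # release one
            for p in particles[s]:             # advect every live particle
                x, y = np.clip(p, [0, 0], [W - 1, H - 1]).astype(int)
                p += flow[y, x]
    return [np.stack(ps) for ps in particles]
```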

10 Framework for Activity Recognition using Streaklines

11 Our Goal Given a video and a set of activity classes of interest, to localize and label all the activities of interest jointly in the video, considering the inter-relationships between them.

12 Motivation
The relationships between the participating objects provide important cues about the ongoing activities (intra-activity context). The spatial layout of activities and their sequential patterns provide useful cues for their understanding (inter-activity context).
[Figure: (a) person getting out of vehicle; (b) person opening vehicle trunk; (c) person closing vehicle trunk.]

13 Challenges
Traditional challenges in video analysis:
- Illumination variation and occlusion
- Spatial and temporal extent of an activity
Modeling context in unconstrained video:
- The object interaction patterns within activities
- The spatial and temporal patterns between activities

14 Related Work
- Features for activity recognition, e.g. STIP, cuboids, HOOF, space-time shapes
- Action recognition, e.g. bag-of-words and its extensions, dynamical systems
- Multi-object interactions, e.g. graphical models and logical approaches
- Activity recognition in continuous videos, e.g. AND-OR graphs

15 Markov Random Field Model
[Figure: activities in the video form the nodes of an MRF; activity labels are obtained by inference on the MRF.]

16 Markov Random Field Approach
Activities in a scene are modeled as a Markov random field: nodes are modeled using the low-level features, and edges are modeled using the spatio-temporal contextual potential between pairs of activities. The overall MRF model for a set of nodes $X = \{x_1, x_2, x_3, \ldots, x_N\}$ is
$$P(X) = \frac{1}{Z} \prod_{i,j \in X} \psi_{st}(x_i, x_j) \prod_{i \in X} \psi_o(x_i, y_i).$$

17 Potential Functions
Observation potential (baseline classifier, BoF/SFG): $\psi_o(x_i, y_i) = P(x_i \mid y_i, C)$ for baseline classifiers $C$ and observations $y_i$.
Spatial potential: $\psi_s(x_i, x_j) = \mathcal{N}(\|s_i - s_j\|^2;\ \mu_s(s_i, s_j),\ \sigma_s(s_i, s_j))$.
Temporal potential: $\psi_t(x_i, x_j) = \mathcal{N}(\|t_i - t_j\|^2;\ \mu_t(t_i, t_j),\ \sigma_t(t_i, t_j))$.
Spatio-temporal potential: $\psi_{st}(x_i, x_j) = f_{ij}\, \psi_s(x_i, x_j)\, \psi_t(x_i, x_j)$, where $f_{ij}$ is the frequency of occurrence of an activity in the vicinity of another.
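A small sketch of how these potentials combine, assuming the Gaussian parameters for each pair of activity classes have already been learned from training data; the parameter container `params` is a hypothetical stand-in.

```python
import numpy as np

def gaussian(d2, mu, sigma):
    """Gaussian density evaluated at the squared separation d2."""
    return np.exp(-((d2 - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def spatiotemporal_potential(s_i, s_j, t_i, t_j, f_ij, params):
    """psi_st(x_i, x_j) = f_ij * psi_s(x_i, x_j) * psi_t(x_i, x_j)."""
    d2_s = float(np.sum((np.asarray(s_i) - np.asarray(s_j)) ** 2))
    d2_t = float((t_i - t_j) ** 2)
    psi_s = gaussian(d2_s, params["mu_s"], params["sigma_s"])
    psi_t = gaussian(d2_t, params["mu_t"], params["sigma_t"])
    return f_ij * psi_s * psi_t
```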

18 Learning and Inference
The observation potential $\psi_o(x_i, y_i)$ is learnt by training a baseline classifier on the training instances. The pairwise spatio-temporal potentials are learned independently for all pairs of activities: the parameters for any two categories $c_i, c_j$ are learnt by maximizing $D = \sum_k \log \psi_{st}(x_i, x_j)$ over the training pairs $k$.
Inference: loopy belief propagation. The activity label with the highest marginal distribution is assigned to the region.
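For concreteness, a compact sum-product loopy belief propagation sketch on a pairwise MRF, with node and edge potentials passed in as arrays. This is a generic textbook version, not the authors' code.

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, iters=20):
    """Sum-product loopy BP on a pairwise MRF.

    unary    : (N, L) array, unary[i, l] = psi_o for node i, label l
    pairwise : dict mapping edge (i, j) -> (L, L) array of psi_st values
    edges    : list of undirected edges (i, j)
    Returns (N, L) per-node marginals."""
    N, L = unary.shape
    msgs, nbrs = {}, {n: [] for n in range(N)}
    for i, j in edges:
        msgs[(i, j)] = np.ones(L) / L
        msgs[(j, i)] = np.ones(L) / L
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # edge potential oriented (labels of i, labels of j)
            pot = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            b = unary[i].copy()              # belief at i, excluding j's message
            for k in nbrs[i]:
                if k != j:
                    b = b * msgs[(k, i)]
            m = b @ pot                      # marginalize over labels of i
            new[(i, j)] = m / m.sum()
        msgs = new
    marg = unary.copy()
    for i, j in edges:
        marg[j] = marg[j] * msgs[(i, j)]
        marg[i] = marg[i] * msgs[(j, i)]
    return marg / marg.sum(axis=1, keepdims=True)

# usage: labels = loopy_bp(unary, pairwise, edges).argmax(axis=1)
```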

19 Overview
The overall MRF model for context-based activity modeling of a set of nodes $X = \{x_1, x_2, x_3, \ldots, x_N\}$ is
$$P(X) = \frac{1}{Z} \prod_{i,j \in X} \psi_{st}(x_i, x_j) \prod_{i \in X} \psi_o(x_i, y_i),$$
where $\psi_o(x_i, y_i) = P(x_i \mid y_i, C)$ is the observation potential obtained from baseline classifiers $C$ and observations $y_i$, and $\psi_{st}(x_i, x_j) = f_{ij}\, \psi_s(x_i, x_j)\, \psi_t(x_i, x_j)$ is the spatio-temporal context potential, with $f_{ij}$ the frequency of occurrence of an activity in the vicinity of another. The factors
$$\psi_s(x_i, x_j) = \mathcal{N}(\|s_i - s_j\|^2;\ \mu_s(s_i, s_j),\ \sigma_s(s_i, s_j)), \qquad \psi_t(x_i, x_j) = \mathcal{N}(\|t_i - t_j\|^2;\ \mu_t(t_i, t_j),\ \sigma_t(t_i, t_j))$$
are functions of the spatial and temporal separation of activities $x_i$ and $x_j$. The graph structure is iteratively learnt to maximize the system entropy on the posteriors
$$E(X) = -\sum_{i=1}^{N} \sum_{j=1}^{N_c} P(c_j \mid x_i) \log_2 P(c_j \mid x_i),$$
which in turn is seen to increase the accuracy of activity recognition.

20 Structure Improvisation
Rather than having a fixed graph structure, we propose to construct the graphical model on the test sequence. The structure is modified iteratively in a manner that is likely to improve the recognition scores: the graph structure is modified to maximize the system entropy on the posteriors
$$E(X) = -\sum_{i=1}^{N} \sum_{j=1}^{N_c} P(c_j \mid x_i) \log_2 P(c_j \mid x_i),$$
which in turn is seen to increase the accuracy of activity recognition. The graph edges are learnt iteratively to increase the accuracy of recognition as well as the confidence of recognition, given by the reduction in the overall entropy of the system.
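The structure score is simply the Shannon entropy of the per-node class posteriors; a minimal sketch:

```python
import numpy as np

def posterior_entropy(posteriors):
    """E(X) = -sum_i sum_j P(c_j | x_i) log2 P(c_j | x_i),
    with posteriors given as an (N, Nc) array of class probabilities."""
    p = np.clip(posteriors, 1e-12, 1.0)   # guard against log(0)
    return float(-(p * np.log2(p)).sum())
```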

21 Results
Precision and recall values obtained using our approach versus the bag-of-words baseline classifier on the VIRAT Release 2 dataset.

22 Framework - Overall
[Figure: overall framework — input video, motion detection and segmentation, structural model; output: locations of activities of interest and their activity labels.]

24 Structural Model The left graph shows the video representation of an activity set. The right graph shows the graphical representation of our model.

26 Potential Functions
- Activity-duration potentials

27 Activity Duration Potential
Activity duration potential measures the compatibility between the duration of an activity and its activity label.
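One plausible instantiation (an assumption for illustration, not the paper's exact definition): score a segment's log-duration under per-class Gaussian statistics gathered from training.

```python
import numpy as np

def duration_potential(duration, label, stats):
    """Compatibility of a segment's duration with an activity label.
    stats is a hypothetical dict: label -> (mean, std) of log-durations
    observed in training."""
    mu, sigma = stats[label]
    z = (np.log(duration) - mu) / sigma
    return float(np.exp(-0.5 * z * z))
```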

28 Potential Functions
- Activity-duration potentials
- Intra-activity motion potentials

29 Intra-activity Motion Potential
Intra-activity motion potential measures the compatibility between the motion features of an activity and its activity label.

30 Potential Functions
- Activity-duration potentials
- Intra-activity motion potentials
- Intra-activity context potentials

31 Intra-activity Context Potential
Intra-activity context potential measures the compatibility between the intra-activity context feature of an activity and its activity label.
Candidate discriminative attributes: an opening door; a person appears at the door.
Person entering a facility? Not likely. Person exiting a facility? Highly likely.

32 Potential Functions
- Activity-duration potentials
- Intra-activity motion potentials
- Intra-activity context potentials
- Inter-activity context potentials

33 Inter-activity Context Potential
Inter-activity context potential measures the compatibility between the spatial and temporal relationships between two activities and their labels.
[Figure: relative location in the image plane and relative ordering in time, illustrated with 'opening a trunk' and 'closing a trunk'.]
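A toy sketch of the kind of spatio-temporal relationship such a potential can consume: relative displacement in the image plane plus temporal gap and overlap between two detected activities. The feature layout here is an illustrative assumption.

```python
import numpy as np

def inter_activity_context(box_i, span_i, box_j, span_j):
    """Toy inter-activity context feature for two detected activities.

    box_*  : (x, y, w, h) bounding box of the activity region
    span_* : (t_start, t_end) temporal extent in frames"""
    ci = np.array([box_i[0] + box_i[2] / 2, box_i[1] + box_i[3] / 2])
    cj = np.array([box_j[0] + box_j[2] / 2, box_j[1] + box_j[3] / 2])
    rel_xy = cj - ci                     # displacement in the image plane
    gap = span_j[0] - span_i[1]          # j starts `gap` frames after i ends
    overlap = min(span_i[1], span_j[1]) - max(span_i[0], span_j[0])
    return np.array([rel_xy[0], rel_xy[1], gap, max(overlap, 0)])
```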

40 Learning
Methodology: regularized risk minimization in a structural SVM, which can be solved efficiently by a modified bundle method (C. H. Teo, 2007).
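The deck cites the bundle method of Teo et al. (2007); as a simpler stand-in with the same objective, here is a subgradient sketch of regularized structured risk minimization. The joint feature map `psi` and the loss-augmented inference routine `loss_aug_inference` (which maximizes Δ(y_i, y) + w·ψ(x, y) over y) are assumed to be supplied.

```python
import numpy as np

def ssvm_subgradient(samples, psi, loss_aug_inference, dim, C=1.0,
                     epochs=50, lr=0.01):
    """Minimize (1/2C)||w||^2 + (1/n) sum_i [Delta(y_i, y_hat_i)
    + w.psi(x_i, y_hat_i) - w.psi(x_i, y_i)] by subgradient descent.
    Labels are assumed comparable with != (e.g. tuples)."""
    w = np.zeros(dim)
    n = len(samples)
    for _ in range(epochs):
        g = w / C                                    # regularizer gradient
        for x, y_true in samples:
            y_hat = loss_aug_inference(w, x, y_true) # most violating labeling
            if y_hat != y_true:
                g += (psi(x, y_hat) - psi(x, y_true)) / n
        w -= lr * g
    return w
```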

41 Inference with Greedy Search
Optimum activity labels for the action segments are found by a greedy search over candidate label assignments.
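The slide's equations are images and are lost in this transcript; as a stand-in, a hedged sketch of greedy inference over segment labels. It repeatedly applies the single-label change that most increases the structural score; the scoring function `score` (the weighted sum of the potentials above) is assumed.

```python
import itertools

def greedy_inference(n_segments, n_labels, score):
    """Greedy hill climbing over segment labels: start from label 0
    everywhere, then repeatedly apply the single-label change that most
    increases score(labels); stop when no change helps."""
    labels = [0] * n_segments
    best = score(labels)
    improved = True
    while improved:
        improved = False
        for i, l in itertools.product(range(n_segments), range(n_labels)):
            if l == labels[i]:
                continue
            cand = labels.copy()
            cand[i] = l
            s = score(cand)
            if s > best:
                best, labels, improved = s, cand, True
    return labels, best
```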

57 Anomaly Detection
Test instances whose patterns deviate from the learned instances are considered anomalies. Types of anomaly:
- Point anomaly: anomalous without any contextual information.
- Contextual anomaly: normal behavioral attributes but abnormal contextual attributes.
- Collective anomaly: the sequence of events deviates from the sequences in the training set.

58 Point Anomalies
An activity is considered a point anomaly if the probability of its motion/feature histogram belonging to any of the "normal" classes is lower than a pre-defined threshold.
[Figure: probability density function of a motion score histogram for three activities (to which the anomalous activity does not belong); the point anomaly is marked in red.]

59 Contextual Anomalies
An activity is considered a contextual anomaly if its contextual attributes, such as the location of the activity with respect to its surroundings, are unlikely.
[Figure: probability density function of the contextual attributes for three activities (to which the anomalous activity does not belong); the contextual anomaly is marked in red.]
First row: person getting into a vehicle usually occurs in the parking area (a); an anomaly is detected when it happens in an area not meant for parking (b). Second row: person exiting a facility happens at a normal exit (c), whereas an anomaly is detected when the person exits from a door that is rarely used (d). Third row: a person gesturing far from a vehicle is normal in our dataset (e), whereas in (f) the gesturing occurs near the trunk of the vehicle, which is identified as a contextual anomaly.

60 Collective Anomalies
A collective anomaly occurs if the ordering of activities differs from what is usually found in the training instances, or when the spatial or temporal location of one or more activities in the sequence differs from those in similar sequences in the training data.
[Figure: (a) probability density function for a collective spatial anomaly and (b) for a collective temporal anomaly, for three activities (to which the anomalous activity does not belong); the anomaly is marked in red.]
(a, b): For most training examples, 'person getting into a vehicle' occurs at the front of the car while 'person unloading an object from a vehicle' occurs at the trunk; a collective spatial anomaly is detected when the unloading and entering happen near the same part of the vehicle. (c, d): 'Person getting out of a vehicle' usually occurs before 'person getting into a vehicle'; in the detected anomaly, it occurs after.

61 Experiments
VIRAT Ground Dataset Release 1: 1 - loading an object; 2 - unloading an object; 3 - opening vehicle trunk; 4 - closing vehicle trunk; 5 - getting into vehicle; 6 - getting out of vehicle.
VIRAT Ground Dataset Release 2: 7 - person gesturing; 8 - person carrying an object; 9 - person running; 10 - person entering a facility; 11 - person exiting a facility.

62 Results on VIRAT Release 1
Baseline - NDM+SVM; Our Method (1) - structural model with intra-activity context only; Our Method (2) - structural model with both intra- and inter-activity context.

63 Results on VIRAT Release 2
Baseline - NDM+SVM; Our Method - structural model with both intra- and inter-activity context.

64 Example Results
Examples of activities whose labels were corrected using context.

65 Conclusion & Future Work
- We build a high-level model that integrates various features within and between activities.
- Experiments show the benefits of modeling the context within and between activities.
- The proposed approach jointly models a variable number of activities in continuous videos.
Future Work:
- Explicitly model the hierarchy of actions and activities.
- Anomaly detection with context.

66 Acknowledgement
DARPA STTR award W31P4Q-11-C0042 through Mayachitra Inc.
NSF IIS
THANK YOU!

