Knowledge-based event recognition from salient regions of activity. Nicolas Moënne-Loccoz, Viper group, Computer vision & multimedia laboratory, University of Geneva. M4 – Meeting – January 2004 / January 23 2003 / Nicolas.Moenne-Loccoz@cui.unige.ch
Outline: Context; Salient Regions of Activity (SRA); Learning the semantics of SRA; Visual Event Query language; Conclusion. NML - CVML - UniGe
Context. Retrieval of visual events based on user queries: abstract representation of the visual content; query language to express visual events. Approach: region-based description of the content; classification of the regions; events queried as spatio-temporal constraints on the regions.
Overview (diagram). Components: Videos database; Region extraction; Salient regions of activity; Classification; Labelled regions; Domain Knowledge; User queries.
Salient regions of activity. Regions of the image space that are moving in the scene and have a homogeneous colour distribution: moving objects or meaningful parts of moving objects. Extraction: from moving salient points, by an adaptive mean-shift algorithm.
Salient points extraction. Scale-invariant interest points (Mikolajczyk, Schmid 2001), extracted in the linear scale-space: local maxima of the scale-normalized Harris function (image space); local maxima of the scale-normalized Laplacian (scale space).
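The slide names the two measures without writing them out. As a hedged reminder, this is how they are usually written for a Harris-Laplace style detector; the symbols ($\sigma_I$ integration scale, $\sigma_D$ derivation scale, constant $\alpha$, Gaussian $g$, Gaussian-smoothed derivatives $L_x$, $L_y$, $L_{xx}$, $L_{yy}$) are notation assumed here, not taken from the slides.

```latex
% Scale-adapted second moment matrix
\mu(\mathbf{x}, \sigma_I, \sigma_D) = \sigma_D^2 \, g(\sigma_I) *
\begin{bmatrix}
  L_x^2(\mathbf{x}, \sigma_D) & L_x L_y(\mathbf{x}, \sigma_D) \\
  L_x L_y(\mathbf{x}, \sigma_D) & L_y^2(\mathbf{x}, \sigma_D)
\end{bmatrix}

% Scale-normalized Harris function, maximized over the image space
R(\mathbf{x}) = \det(\mu) - \alpha \, \operatorname{trace}^2(\mu)

% Scale-normalized Laplacian, maximized over the scale space
\left| \sigma^2 \left( L_{xx}(\mathbf{x}, \sigma) + L_{yy}(\mathbf{x}, \sigma) \right) \right|
```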
Salient points extraction. Example: scale (figure).
Salient points trajectories. Trajectories are used to: find salient points moving in the scene; track salient points over time. Points are matched using local grayvalue invariants (Schmid).
Salient points trajectories. Mahalanobis distance between point descriptors. The set of matching points minimizes this distance, using a greedy winner-takes-all algorithm. Result: a set of point trajectories, from which the moving salient points are identified.
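The matching step can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes a diagonal covariance for the Mahalanobis distance, a rejection threshold `max_dist`, and two-dimensional toy descriptors, none of which are specified in the slides.

```python
import math

def mahalanobis(a, b, inv_var):
    """Mahalanobis distance under an assumed diagonal covariance.

    inv_var holds the inverse variance of each descriptor component.
    """
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, inv_var)))

def greedy_match(prev_desc, next_desc, inv_var, max_dist):
    """Greedy winner-takes-all: repeatedly accept the closest unmatched pair."""
    pairs = sorted(
        (mahalanobis(p, q, inv_var), i, j)
        for i, p in enumerate(prev_desc)
        for j, q in enumerate(next_desc)
    )
    used_prev, used_next, matches = set(), set(), {}
    for dist, i, j in pairs:
        if dist > max_dist:          # remaining pairs are even farther apart
            break
        if i in used_prev or j in used_next:
            continue                 # one of the two points is already taken
        matches[i] = j
        used_prev.add(i)
        used_next.add(j)
    return matches

# Toy descriptors for two consecutive frames (hypothetical values).
prev_desc = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
next_desc = [(5.2, 4.9), (0.1, -0.1), (20.0, 20.0)]
matches = greedy_match(prev_desc, next_desc, inv_var=(1.0, 1.0), max_dist=1.0)
print(matches)  # maps index in the previous frame to index in the next frame
```

Chaining such frame-to-frame matches gives one trajectory per point; a point left unmatched simply ends its trajectory.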
Salient regions estimation. Estimate the characteristic regions of the moving salient points. Mean-shift algorithm: estimates the position, from the likelihood of pixels (RGB colour distribution), with an ellipsoidal Epanechnikov kernel.
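With an Epanechnikov kernel, each mean-shift step reduces to the likelihood-weighted mean of the pixel positions inside the kernel support. The sketch below illustrates only that position estimate. It assumes the per-pixel likelihoods are already computed and uses an axis-aligned ellipse; the function names and the toy weight map are hypothetical.

```python
def mean_shift_step(weights, center, axes):
    """One mean-shift step: weighted mean of pixels inside the ellipse.

    weights[y][x] is the (assumed precomputed) likelihood of pixel (x, y);
    axes are the semi-axes of an axis-aligned ellipse (a simplification).
    """
    cx, cy = center
    ax, ay = axes
    total = sx = sy = 0.0
    for y, row in enumerate(weights):
        for x, w in enumerate(row):
            if ((x - cx) / ax) ** 2 + ((y - cy) / ay) ** 2 <= 1.0:
                total += w
                sx += w * x
                sy += w * y
    if total == 0.0:
        return center                # no support: stay in place
    return (sx / total, sy / total)

def mean_shift(weights, center, axes, tol=1e-3, max_iter=50):
    """Iterate mean-shift steps until the shift falls below tol."""
    for _ in range(max_iter):
        new = mean_shift_step(weights, center, axes)
        if abs(new[0] - center[0]) + abs(new[1] - center[1]) < tol:
            return new
        center = new
    return center

# Toy likelihood map: a 3x3 blob of likely pixels centred on (6, 6).
weights = [[1.0 if 5 <= x <= 7 and 5 <= y <= 7 else 0.0 for x in range(10)]
           for y in range(10)]
mode = mean_shift(weights, center=(4.0, 4.0), axes=(3.0, 3.0))
print(mode)
```

Starting from a salient point near the blob, the estimate moves to the blob centre in a few iterations.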
Salient regions estimation. Kernel adaptation step: estimates shape and size. Algorithm:
Salient regions representation. A set of salient regions of activity, each represented by: position; ellipsoid; colour distribution; set of salient points. Salient regions tracking: regions are matched by a majority vote of their salient points.
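The majority vote can be sketched as below. This is an illustration under assumptions: salient points carry identifiers that persist along their trajectories, and `prev_region_of_point` (a hypothetical name) records the region each point belonged to in the previous frame.

```python
from collections import Counter

def match_region(point_ids, prev_region_of_point):
    """Return the previous-frame region that most of the points vote for.

    Points without a known previous region (new points) do not vote.
    Returns None when no point votes.
    """
    votes = Counter(
        prev_region_of_point[p] for p in point_ids if p in prev_region_of_point
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Toy example: points 1-2 were in region "A", points 3-5 in region "B".
prev_region_of_point = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
winner = match_region([1, 2, 3, 9], prev_region_of_point)
print(winner)
```

Here two points vote for "A", one for "B", and point 9 is new, so the current region is matched to "A". Ties are resolved by whichever region was counted first, which a real system would probably handle more carefully.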
Salient regions of activity (figure).
Regions classification. To obtain an abstract description, map regions to a domain-specific basic vocabulary. Meetings: {Arm, Head, Body, Noise}. SVM classifier: set of 500 annotated salient regions of activity (~200 frames).
Regions classification. Confusion matrix over {Arm, Head, Body, Noise}, entries as listed: 1.000, 0.909, 0.091, 0.052, 0.946. Discussion: the Noise class is ill-defined; the good results are explained by the limited number of classes.
Visual event language. Purpose: to express visual event queries as spatio-temporal constraints on labelled regions (LR); to integrate domain knowledge, as a specification of the layout (L) and as a set of basic events. A formula of the language is a conjunctive form of: temporal relations {after, just-after} between 2 LR; spatial relations {above, left} between 2 LR, and {in} between a LR and a L; identity relations {is} between 2 LR, and {is-a} between a LR and a label.
Knowledge - Meetings. Scene layout: L = {SEATS, DOOR, BOARD}
Knowledge - Meetings. Basic events: {Meeting-participant, sitting, standing}. Meeting-participant: actors: LR; constraints: is-a(head, LR). Sitting: actors: LR; constraints: Meeting-participant(LR), in(SEATS, LR). Standing: actors: LR; constraints: ~in(SEATS, LR).
Events queries. Examples of user queries. Sitting-down: actors: LR1, LR2; constraints: is(LR1, LR2), sitting(LR1), standing(LR2), just-after(LR1, LR2). Go-to-board: actors: LR1, LR2; constraints: standing(LR1), ~in(BOARD, LR1), in(BOARD, LR2), just-after(LR2, LR1).
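A conjunctive query such as Sitting-down can be evaluated by testing every ordered pair of labelled regions against the constraints. The sketch below is one possible reading, not the system's actual evaluator. The `LR` record (track identity, label, frame interval, centre), the rectangular layout zones, and the `max_gap` tolerance of just-after are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class LR:
    """A labelled region observed over a frame interval (assumed representation)."""
    track: int    # identity of the tracked region
    label: str    # one of Arm, Head, Body, Noise
    start: int    # first frame
    end: int      # last frame
    x: float      # centre position
    y: float

# Hypothetical layout: zones as (x0, y0, x1, y1) rectangles.
LAYOUT = {"SEATS": (0, 50, 100, 100), "BOARD": (0, 0, 30, 40)}

def is_a(label, r):
    return r.label == label

def inside(zone, r):                  # the {in} relation
    x0, y0, x1, y1 = LAYOUT[zone]
    return x0 <= r.x <= x1 and y0 <= r.y <= y1

def same(a, b):                       # the {is} relation
    return a.track == b.track

def just_after(a, b, max_gap=5):      # a starts shortly after b ends
    return 0 <= a.start - b.end <= max_gap

# Basic events from the domain knowledge.
def meeting_participant(r):
    return is_a("Head", r)

def sitting(r):
    return meeting_participant(r) and inside("SEATS", r)

def standing(r):
    return not inside("SEATS", r)

# User query: Sitting-down.
def sitting_down(a, b):
    return same(a, b) and sitting(a) and standing(b) and just_after(a, b)

def query(event, regions):
    """Return every ordered pair of regions satisfying a two-actor event."""
    return [(a, b) for a, b in permutations(regions, 2) if event(a, b)]

regions = [
    LR(track=1, label="Head", start=0, end=40, x=20, y=30),   # outside SEATS
    LR(track=1, label="Head", start=42, end=90, x=20, y=70),  # inside SEATS
    LR(track=2, label="Arm", start=0, end=90, x=60, y=80),
]
hits = query(sitting_down, regions)
print(len(hits))
```

Every predicate returns a hard true or false, so one misclassified or misplaced region is enough to create or suppress a match.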
Events queries - Results. Precision and Recall, values as listed: Sit-down 0.43 1.00; Stand-up 0.50; Go-to-board; Enter 0.20; Leave 0.25. Discussion: recall validates the retrieval capability; false alarms occur because of the hard decision.
Conclusion. Contributions: well-suited framework for constrained domains; generic representation of the visual content; paradigm to retrieve visual events from videos. Limitations: cannot retrieve all visual events (e.g. emotion). Ongoing work: uncertainty handling and fuzziness; integration of other modalities (e.g. transcripts).