Presentation on theme: "Activity Recognition Ram Nevatia Presents work of"— Presentation transcript:
1 Activity Recognition
Ram Nevatia, presenting work of F. Lv, P. Natarajan and V. Singh
Institute of Robotics and Intelligent Systems, Computer Science Department, Viterbi School of Engineering, University of Southern California
2 Activity Recognition: Motivation
Activity is the key content of a video (along with scene description)
Useful for:
Monitoring (alerts)
Indexing (forensic, deep analysis, entertainment, ...)
Human-computer interaction (HCI)
3 Issues in Activity Recognition
Inherent ambiguities of 2-D videos
Variations in image/video appearance due to changes in viewpoint, illumination, clothing (texture), ...
Variations in style: different actors, or even the same actor at different times
Reliable detection and tracking of objects, especially those directly involved in activities
Temporal segmentation: most work assumes a single activity in a given clip
"Recognition" of novel events
4 Possible Approaches
Match video signals directly: dynamic time warping
Extract spatio-temporal features and classify based on them: bags of words, histograms, "clouds", ... (work of Laptev et al.)
Most earlier work assumes action segmentation (detection vs. classification); Andrew's talk covers the use of localization and tracking
Structural approach: based on detection of objects, their tracks and their relationships; requires the ability to perform these operations
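The first approach on this slide, matching video signals directly with dynamic time warping, can be sketched in a few lines. This is a minimal illustration on 1-D signals; a real system would compare multi-dimensional per-frame feature vectors, and the function name and scalar cost are my own choices, not from the slides.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D signals.

    Allows non-linear temporal alignment, so the same action performed
    at different speeds can still match closely.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: best cost aligning a[:i], b[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # stretch a
                                 D[i, j - 1],      # stretch b
                                 D[i - 1, j - 1])  # match both
    return D[n, m]
```

Because of the warping, a signal and a time-stretched copy of it have distance zero, which is exactly what makes DTW attractive for speed-varying actions.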
5 Event Hierarchy
Composite events: compositions of other, simpler events. Composition is usually, but not necessarily, a sequence operation, e.g. getting out of a car, opening a door and entering a building. These form a natural hierarchy (or lattice).
Primitive events: those we choose not to decompose, e.g. walking; recognized directly from observations.
Graphical models, such as HMMs and CRFs, are natural tools for recognition of composite events.
6 Key Ideas
Only a few primitive actions are needed in any domain:
Sign language: moves and holds
Human pose articulation: rotate, flex and pause
Rigid objects (cars, people): translate, rotate, scale
These can be represented symbolically using formal rules.
Composite actions can be represented as combinations of the primitive actions.
Uncertainty and error in video are handled by mapping the rules to graphical models: HMM, DBN, CRF.
7 Graphical Models
A network, normally used to represent the temporal evolution of a state
The next state depends only on the previous state; the observation depends only on the current state (single state variable)
A typical task is to estimate the most likely state sequence given an observation sequence (the Viterbi algorithm)
(Slide figures: an HMM and a CRF)
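The Viterbi algorithm named on this slide can be sketched concretely. This is the standard textbook dynamic program in the log domain, not the specific implementation used in the talk; the array shapes and argument names are my own.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state sequence of an HMM (log domain).

    log_pi: (S,)   initial log-probabilities
    log_A:  (S, S) transition log-probs, log_A[i, j] = log P(j | i)
    log_B:  (S, O) emission log-probs,   log_B[s, o] = log P(o | s)
    obs:    sequence of observation indices
    """
    S, T = len(log_pi), len(obs)
    delta = np.zeros((T, S))          # best log-score ending in each state
    back = np.zeros((T, S), int)      # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # scores[i, j]: via i to j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]                # trace back best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With "sticky" transitions and reasonably reliable emissions, the decoded path switches state only when the observations do, which is the behavior the later segmentation slides rely on.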
8 Mid vs. Near Range
Mid-range: limbs of the human body, particularly the arms, are not distinguishable. The common approach is to detect and track moving objects and make inferences based on trajectories.
Near-range: hands/arms are visible; activities are defined by pose transitions, not just position transitions. Pose tracking is difficult; top-down methods are commonly used.
9 Mid-Range Example
Example of abandoned-luggage detection
Based on trajectory analysis and simple object detection/recognition
Uses a simple Bayesian classifier and logical reasoning about the order of sub-events
Tested on PETS, ETISEO and TRECVID data
10 Top-Down Approaches
Bottom-up methods remain slow and are not robust; many methods are based on the use of multiple video streams
An alternative is top-down approaches, where processing is driven by event models
Simultaneous Tracking and Action Recognition (STAR), in analogy with SLAM in robotics
Provides action segmentation in addition to recognition
Closed-world assumption; current work is limited to single-actor actions
11 Activity Recognition w/o Tracking
Input sequence → action segments: check watch, punch, kick, pick up, throw
Our objective: given a single video sequence, segment the whole sequence into parts, where each part contains only one instance of some human action, and recognize the action in each part.
We are interested in basic human actions such as walking, sitting down, punching and kicking; our dataset has 15 different actions.
Although we do not explicitly estimate 3-D human body poses, our result provides such poses as a by-product.
12 Difficulties
Viewpoint change and pose ambiguity (with a single camera view)
Spatial and temporal variations (style, speed)
13 Key Poses and Action Nets
Key poses are determined from MoCap data by an automatic method that detects large changes in energy; key poses may be shared among different actions
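One plausible reading of "large changes in energy" is to track a motion-energy signal over the MoCap joint trajectories and take frames where it bottoms out (the body briefly pauses at a key pose). The sketch below is my own interpretation, not the paper's exact criterion; the energy definition and the `min_gap` spacing parameter are assumptions.

```python
import numpy as np

def key_pose_frames(joints, min_gap=5):
    """Candidate key-pose frames from MoCap joint trajectories.

    joints: (T, J, 3) array of J joint positions over T frames.
    Uses summed squared joint velocity as a motion-energy proxy and
    keeps well-separated local minima of that signal.
    """
    vel = np.diff(joints, axis=0)             # (T-1, J, 3) per-frame velocity
    energy = (vel ** 2).sum(axis=(1, 2))      # motion energy per frame
    keys = []
    for t in range(1, len(energy) - 1):
        if energy[t] < energy[t - 1] and energy[t] <= energy[t + 1]:
            if not keys or t - keys[-1] >= min_gap:
                keys.append(t)
    return keys
```

On a trajectory that moves, holds still, then moves again, this picks a frame inside the hold, which matches the intuition that key poses are the momentarily stable configurations of an action.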
14 Experiments: Training Set
We have 15 action models, including stand, check watch, cross arms, scratch head, sit down, stand up, turn around, walk in a circle, wave hand, punch, kick, point, pick up and throw. We also include variations of some action classes, e.g. wave left hand vs. wave right hand, kick gently vs. kick vigorously.
In total there are 177 key poses. Given the tilt angle, we render them from 36 viewpoints with different pan angles, resulting in more than six thousand rendered key poses in the action net.
15 action models; 177 key poses; 6372 nodes in the action net
15 Action Net: Apply Constraints
Because key poses are rendered from many viewpoints, the same action net is repeated for each viewpoint (10° apart in pan angle).
Magenta links connect action models rendered from adjacent viewpoints; these links model smooth changes in the actor's orientation.
The major difference between the action net and other graph models is that information such as camera viewpoint and action connectivity is explicitly modeled in the action net, while other graph models use parameters to encode such information.
16 Experiments: Test Set
Our test set contains 50 video clips, average length 1165 frames. In each clip, one actor performs all 15 actions; actors freely choose orientation and position, and the order and number of action instances in each clip also vary.
The clips are shot from 5 different viewpoints, with 10 actors (5 men, 5 women).
The large number of action classes and the large variation in viewpoints and actors make action recognition on this dataset a challenging task.
17 A Video Result
(Slide shows: original frame; extracted blob and ground truth; results without vs. with the action net)
18 Working with Natural Environments
Reduce reliance on good foreground segmentation: key poses may not be discriminative enough without accurate segmentation, so include models for the motion between key poses
More general graphical models that include:
Hierarchy
Transition probabilities that may depend on observations
Observations that may depend on multiple states
Duration models (HMMs imply an exponential decay)
Remove the need for MoCap data to acquire models
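The duration remark above is easy to verify numerically: a state with self-transition probability p stays exactly d frames with probability p^(d-1)(1-p), a geometric (discretized exponential) distribution, which is why explicit duration models are listed as an extension. A quick check of that identity (the function name is mine):

```python
def hmm_duration_pmf(p_stay, d):
    """Probability that an HMM state lasts exactly d frames.

    The state must self-transition d-1 times, then leave once, so the
    duration is geometrically distributed -- it can only decay, never
    peak at a typical action length.
    """
    return (p_stay ** (d - 1)) * (1.0 - p_stay)
```

For p_stay = 0.9 the most likely duration is 1 frame and the probability halves roughly every 7 frames, a poor fit for actions with a characteristic length.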
21 Dynamic Bayesian Action Network
Map action models to a dynamic Bayesian network
Decompose a composite action into a sequence of primitive actions
Each primitive is expressed in functional form f_pe(s, s', N), which maps the current state s to the next state s' given parameters N
Assume a known, finite set of primitive functions f
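The f_pe(s, s', N) idea can be sketched as a small, fixed dictionary of parameterized state-transition functions, using the rotate/pause primitives named earlier for pose articulation. This is an illustrative skeleton, not the paper's representation: the dict-of-joint-angles state and all function names here are my own simplifications.

```python
import math

def rotate(state, part, axis, dq):
    """Primitive: advance one joint angle by dq radians about a named axis."""
    nxt = dict(state)
    nxt[(part, axis)] = state.get((part, axis), 0.0) + dq
    return nxt

def pause(state):
    """Primitive: hold the current pose unchanged."""
    return dict(state)

# The known, finite set of primitive transition functions.
PRIMITIVES = {"rotate": rotate, "pause": pause}

def apply_primitives(state, steps):
    """Run a composite action: a sequence of (name, args) primitive steps."""
    for name, args in steps:
        state = PRIMITIVES[name](state, *args)
    return state
```

A composite action is then just data, a list of primitive invocations, which is what lets the same small primitive set cover many actions.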
22 Inference Overview
Given a video, obtain the initial state distribution with the start key pose for all composite actions
For each current state:
Predict the primitive based on the current duration
Predict a 3-D pose given the primitive and current duration
Collect the observation potential of the pose using foreground overlap and the difference image
Obtain the best state sequence using dynamic programming (Viterbi algorithm)
Features used to match models with observations: if the "foreground" can be extracted reliably, use blob shape properties; otherwise, use edge and motion-flow matching
23 Pose Tracking & Action Recognition
Obtain state distributions by matching poses sampled from action models
Infer the action by finding the maximum-likelihood state sequence (inference algorithm)
24 Observations
Foreground overlap with the full-body model
Difference-image overlap with the body parts in action
Grid of centroids to match the foreground blob with the pose
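The first observation potential, foreground overlap with the full-body model, can be sketched as a mask-overlap score. I use intersection-over-union here as a plausible stand-in; the slides do not specify the exact overlap measure, and the difference-image and grid-of-centroids terms are not reproduced.

```python
import numpy as np

def foreground_overlap(pose_mask, fg_mask):
    """Overlap between a rendered full-body pose mask and the extracted
    foreground blob, scored as intersection over union in [0, 1]."""
    pose = pose_mask.astype(bool)
    fg = fg_mask.astype(bool)
    union = np.logical_or(pose, fg).sum()
    if union == 0:                       # both masks empty: no evidence
        return 0.0
    return np.logical_and(pose, fg).sum() / union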
26 Action Learning
Involves two problems:
Model learning: learning the parameters N in the primitive event definition f_pe(s, s', N), via key-pose annotation and lifting, and pose interpolation
Feature weight learning: learning the weights w_k of the different potentials
28 Pose Interpolation
All limb motions can be expressed in terms of Rotate(part, axis, q)
We need to learn the axis and the angle q; this is simple to do given the start and end joints of the part
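Recovering the axis and angle from the start and end joints is standard vector geometry: with the limb direction before and after the motion, the cross product gives the rotation axis and atan2 of its norm against the dot product gives q. A sketch under that reading (function name and degenerate-case handling are my own):

```python
import math
import numpy as np

def limb_rotation(start_vec, end_vec):
    """Axis and angle q such that rotating start_vec about axis by q
    aligns it with end_vec, as in Rotate(part, axis, q).

    Each vector is a limb direction (child joint minus parent joint) at
    a key pose. For (anti)parallel vectors the axis is ill-defined, so
    an arbitrary fixed axis is returned.
    """
    u = np.asarray(start_vec, float)
    v = np.asarray(end_vec, float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    axis = np.cross(u, v)
    s = np.linalg.norm(axis)            # |axis| = sin(q)
    c = float(np.dot(u, v))             # u . v  = cos(q)
    q = math.atan2(s, c)                # robust for q near 0 or pi
    if s < 1e-12:                       # parallel case: axis undefined
        return np.array([0.0, 0.0, 1.0]), q
    return axis / s, q
```

Intermediate poses can then be interpolated by applying Rotate with fractions of q along the recovered axis.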
29 Feature Weight Learning
Feature weight estimation as minimization of a log-likelihood error function
Learn the weights using the voted perceptron algorithm; this requires fully labeled training data, which is not available
We propose an extension to deal with partial annotations: the latent-state voted perceptron
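The base learner named here, the voted perceptron, keeps every intermediate weight vector together with how long it survived, and classifies by their weighted vote. Below is a sketch of that base algorithm for a plain binary task; the latent-state extension for partially annotated sequences, and the structured (sequence) version actually needed here, are not reproduced.

```python
import numpy as np

def voted_perceptron(X, y, epochs=10):
    """Binary voted perceptron.

    X: (N, D) features; y: (N,) labels in {-1, +1}.
    Returns a list of (weight_vector, vote_count) pairs, where the vote
    count is how many examples that weight vector classified correctly
    before its next mistake.
    """
    w = np.zeros(X.shape[1])
    voters = []
    c = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * np.dot(w, x) <= 0:     # mistake: retire w with its votes
                voters.append((w.copy(), c))
                w = w + label * x             # perceptron update
                c = 1
            else:
                c += 1                        # w survives another example
    voters.append((w.copy(), c))
    return voters

def predict(voters, x):
    """Sign of the survival-weighted vote of all intermediate perceptrons."""
    score = sum(c * np.sign(np.dot(w, x)) for w, c in voters)
    return 1 if score >= 0 else -1
```

Averaging over long-surviving weight vectors is what gives the voted perceptron better generalization than the final weights alone.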
30 Experiments
Tested the method on 3 datasets:
Weizmann dataset
Gesture set with arm gestures
Grocery Store set with full-body actions

Dataset        Train:Test  Action Recognition  2D Tracking  Speed
               Ratio       (% accuracy)        (% error)    (fps)
Weizmann       3:6         99.5                -            -
Gesture        3:5         90.1                85.2         58
Grocery Store  1:7         100.0               11.8         81.6
31 Weizmann Dataset
Popular dataset for action recognition: 10 full-body actions from 9 actors; each video has multiple instances of one action

Method             Train:Test  Recognition Accuracy
Jhuang et al.      6:3         98.8
Space-Time Shapes  8:1         100.0
Fathi et al.
Sun et al.         3:6         87.3
DBAN               1:8         96.7
DBAN               3:6         99.5
32 Gesture Dataset
5 instances of 12 gestures from 8 actors, in an indoor lab setting
500 instances of all actions
852x480 pixel resolution; person height: pix.
33 Grocery Store Dataset
Videos of 3 actions collected from a static camera
16 videos from 8 actors, performed at pan angles
Actor height varies from pixels, in 852x480-resolution videos
34 Incorporating Better Descriptors
Previous work was based on weak lower-level analysis
We can also evaluate 2-D part models: a dynamic Bayesian action network with a part model
35 Experiments
Hand gesture dataset in an indoor lab: 5 instances of 12 gestures from 8 actors, a total of 500 action segments
Evaluation metrics:
Recognition rate over all action segments
2-D pose tracking as average 2-D part accuracy over 48 randomly selected instances

Method      Train:Test Ratio  Recognition (% accuracy)  2D Tracking
DBAN-FGM    1:7               78.6                      75.67 (89.94)
DBAN-Parts                    84.52                     91.76 (92.66)
36 Summary and Conclusions
The structural approach to activity recognition offers many attractions and challenges
Results are descriptive, but detecting and tracking objects is challenging
Hierarchical representation is natural and can be used to reduce complexity
Good bottom-up analysis remains a key to improved robustness
The concept of "novel" or "anomalous" events remains difficult to formalize