Exploiting video information for Meeting Structuring ….


1 Exploiting video information for Meeting Structuring ….

2 Agenda
– Introduction
– Feature set extension
– Video features processing
– Video features integration
– Preliminary results
– Conclusions

3 Meeting Structuring (1)
Goal: recognise events that involve one or more communicative modalities: Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
Working environment: the "IDIAP framework":
– 69 five-minute meetings with 4 participants
– 30 transcribed meetings
– Scripted meeting structure

4 Meeting Structuring (2)
Three audio-derived feature families:
– Speaker turns: from the microphone array, via beam-forming
– Prosodic features: rate of speech, pitch baseline and energy, from the lapel microphones
– Lexical features: a monologue/dialogue (M/D) discriminator, from the ASR transcription

5 Meeting Structuring (3)
Dynamic Bayesian Network based models (built with GMTK, Bilmes et al.):
– Multi-stream processing (parallel stream processing)
– "Counter structure" (state duration modelling)
Three feature families: Prosodic features (S1), Speaker turns (S2), Lexical features (S3)
Leave-one-out cross-validation over the 30 annotated meetings
[Figure: multi-stream DBN with per-stream state nodes S_t and observations Y_t, a shared action node A_t, and counter/end nodes C_t, E_t]

Results (%):
               Corr  Sub  Del  Ins  AER
W/o counter    91.7  4.5  3.8  2.6  10.9
With counter   92.9  5.1  1.9        9.0
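The "counter structure" idea can be sketched as follows. This is a toy illustration, not the actual GMTK model; the duration curve values are made up:

```python
# Toy sketch of a duration counter: each meeting-action state is paired
# with a counter C recording how long the state has been active, so the
# probability of staying in the state can depend on the elapsed duration.
def stay_probability(duration_curve, c):
    """P(stay in current state | counter = c); clamps at the last value."""
    return duration_curve[min(c, len(duration_curve) - 1)]

# Hypothetical curve: leaving the state becomes more likely over time.
curve = [0.99, 0.95, 0.80, 0.50]
assert stay_probability(curve, 0) > stay_probability(curve, 10)
```

Without the counter, a plain HMM state implies a geometric duration distribution; the counter lets the model capture more realistic action durations.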

6 Feature set extension (1)
Multi-party meetings are multi-modal communicative processes. Our features cover only two modalities: audio (prosodic features & speaker turns) and lexical content (the lexical monologue/dialogue discriminator). Exploiting video content is the next step!

7 Feature set extension (2)
Goal: improve the recognition of "Note taking", "Presentation" and "Whiteboard":
– the three most confused symbols
– three meeting actions that heavily involve body/hand movements
Approach: extract low-level video features and leave their interpretation to high-level specialised models

8 Feature set extension (3)
We need motion features for the hands and head/torso regions.
Constraints:
– The system must be simple
– Reliable against "environmental" changes (lighting, backgrounds, …)
– Open to further extensions / modifications
Initial assumptions:
– Meeting video content is quite "static"
– Participants occupy only a few spatial regions and tend to stay there
– The meeting room configuration (camera positions, seats, furniture, …) is fixed

9 Video feature extraction (1)
Motion analysis is performed using:
– Kanade-Lucas-Tomasi (KLT) feature tracking…
– …and partitioning the resulting trajectories according to their relative position in the scene
Four spatial regions for each scene: Head 1 / 2, Hands 1 / 2

10 KLT (1)
Assumption: the brightness of every point of a (slowly) moving or static object does not change between images taken at nearby time instants. A Taylor expansion truncated at the 1st derivative gives the optical flow constraint equation:

  Ix·u + Iy·v + It = 0

where (Ix, Iy) is the brightness gradient, It represents how fast the intensity changes with time, and (u, v) is the moving object's speed. This is one equation in two unknowns; hence more than one solution.
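A minimal numeric sketch of the constraint (the gradient values below are made up for illustration):

```python
# Spatial and temporal intensity gradients at one pixel (illustrative values).
Ix, Iy, It = 0.5, 0.2, -0.35

# Optical flow constraint: Ix*u + Iy*v + It = 0.
# One equation, two unknowns (u, v): infinitely many solutions.
u = 0.5                        # pick u freely...
v = -(It + Ix * u) / Iy        # ...and v follows from the constraint
assert abs(Ix * u + Iy * v + It) < 1e-12
```

Any choice of u yields a valid v, which is why a single pixel cannot determine the motion on its own.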

11 KLT (2)
For neighbouring points x_i of x, assumed to share the same constant velocity d = (u, v), minimise the weighted least-squares error

  E = Σ_i w_i [ Ix(x_i)·u + Iy(x_i)·v + It(x_i) ]²

In two dimensions the system has the normal-equation form G d = e, with

  G = Σ_i w_i ∇I(x_i) ∇I(x_i)ᵀ,   e = −Σ_i w_i It(x_i) ∇I(x_i)

If G is invertible, the solution is d = G⁻¹ e.
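A NumPy sketch of this least-squares solve on synthetic, noise-free gradients (uniform weights; the ground-truth velocity is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_uv = np.array([1.5, -0.5])      # ground-truth window velocity (illustrative)

# Gradients [Ix, Iy] at the neighbouring pixels of a 7x7 window,
# with temporal gradients consistent with the shared velocity.
grads = rng.normal(size=(49, 2))
It = -grads @ true_uv                # Ix*u + Iy*v + It = 0 for every pixel

# Normal equations G d = e (uniform weights w_i = 1).
G = grads.T @ grads                  # sum of outer products grad gradT
e = -grads.T @ It                    # -sum of It * grad
d = np.linalg.solve(G, e)            # recovered velocity (u, v)
```

With no noise the recovered d matches true_uv exactly; in real images the residual error reflects noise and the brightness-constancy approximation.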

12 KLT (3)
A good feature is:
1. one that can be tracked well… (Tomasi et al.): if λ1 and λ2 are the eigenvalues of G, the system is well-conditioned if both eigenvalues are large (high texture content) but in the same range
2. …and even better if it is part of a human body: pixels with a higher probability of being skin are preferred
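The conditioning test can be sketched like this; the thresholds are hypothetical, chosen only for illustration:

```python
import numpy as np

def well_conditioned(G, lam_min=1.0, ratio_max=10.0):
    """Accept a feature if both eigenvalues of G are large and of similar magnitude."""
    lam = np.linalg.eigvalsh(G)          # eigenvalues in ascending order
    return bool(lam[0] > lam_min and lam[1] / lam[0] < ratio_max)

# High texture in both directions: a trackable corner-like patch.
corner = np.array([[50.0, 0.0], [0.0, 40.0]])
# Gradient only along one direction: an edge, poorly conditioned.
edge = np.array([[50.0, 0.0], [0.0, 0.1]])
```

For the corner both eigenvalues are large and comparable, so it is accepted; the edge has one near-zero eigenvalue and is rejected.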

13 KLT (4)
KLT feature tracking consists of 3 steps:
1. Select n good features
2. Track the selected n features
3. Replace lost features
We decided to track n = 100 features in a square (7x7) window.
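The three-step loop can be sketched with a toy tracker; the feature selector is a stand-in (random positions instead of real Shi-Tomasi corners), and the motion update and loss rate are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100                                  # number of features to maintain

def select_good_features(k):
    """Stand-in for good-feature selection: random positions in a 352x288 frame."""
    return rng.uniform(0, 288, size=(k, 2))

features = select_good_features(N)       # step 1: select n good features
for frame in range(5):
    shift = rng.normal(0, 0.5, size=features.shape)
    features = features + shift          # step 2: track the selected features
    lost = rng.random(N) < 0.05          # some features are lost each frame
    features[lost] = select_good_features(int(lost.sum()))  # step 3: replace lost
```

Replacing lost features keeps the pool at a constant n = 100 over the whole video.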

14 Skin modelling
Colour-based approach in the (Cr, Cb) chromatic subspace:
– Initial experiments used a single Gaussian
– Now: a 3-component Gaussian Mixture Model
– Skin samples taken from unused meetings
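A sketch of the single-Gaussian variant; the mean and covariance below are made-up placeholders for parameters that would be estimated from labelled skin samples:

```python
import numpy as np

# Illustrative skin model in (Cr, Cb); real parameters come from training data.
mu = np.array([150.0, 105.0])
cov = np.array([[60.0, 10.0], [10.0, 40.0]])
cov_inv = np.linalg.inv(cov)
norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

def skin_likelihood(cr_cb):
    """Gaussian density of a pixel's (Cr, Cb) value under the skin model."""
    d = cr_cb - mu
    return norm * np.exp(-0.5 * d @ cov_inv @ d)
```

The 3-component GMM replaces this density with a weighted sum of three such Gaussians, which fits the multi-modal spread of skin tones better than a single one.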

15 Video feature extraction (2)
Structure of the implemented system:
Video → KLT (100 features / frame) → Skin Detection (using the skin model) → Trajectory Structure (100 trajectories / frame)

16 Video feature extraction (3)
Trajectory classification:
– Define 4 partitions (regions): 2 x heads (H1, H2), 2 x hands (Ha1, Ha2)
– Evaluate the average motion of each trajectory
– Remove long and quite static trajectories
– Define 2 additional fixed regions (L and R), giving 4 + 2 regions
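The region assignment can be sketched as follows; the region boxes are hypothetical coordinates for one camera view, not the actual configuration:

```python
import numpy as np

# Hypothetical fixed region boxes (x0, y0, x1, y1) for a 352x288 frame.
REGIONS = {
    "Head1":  (0, 0, 176, 120),   "Head2":  (176, 0, 352, 120),
    "Hands1": (0, 120, 176, 288), "Hands2": (176, 120, 352, 288),
}

def classify_trajectory(traj):
    """Assign a trajectory (array of (x, y) points) to the region holding its mean position."""
    x, y = traj.mean(axis=0)
    for name, (x0, y0, x1, y1) in REGIONS.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None

traj = np.array([[30.0, 200.0], [32.0, 205.0], [31.0, 203.0]])
```

Since the room configuration is assumed fixed, these static boxes suffice; a static-trajectory filter would drop tracks whose average motion stays below a threshold.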

17 Video feature extraction (4)
[Four-panel figure illustrating the trajectory classification steps]

18 Video feature extraction (5)
For each scene, 4 motion vectors are estimated, one for each region (H1, H2, Ha1, Ha2); they will soon be enhanced with 2 more regions/vectors (L and R), in order to detect whether someone is entering or leaving the scene. Taking motion vectors averaged over many trajectories helps reduce noise.
Open issues:
– Loss of tracking for fast-moving objects → account for it during tracking
– Assumption of a fixed scene structure → delayed/offline processing
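The per-region averaging can be sketched like this (the two toy trajectories are made up, both drifting about +1 px/frame in x with opposite jitter):

```python
import numpy as np

def region_motion_vector(trajs):
    """Average per-frame displacement over all trajectories in one region.

    Averaging across many KLT tracks damps the jitter of any individual
    track, giving a less noisy motion estimate for the region.
    """
    steps = [np.diff(t, axis=0).mean(axis=0) for t in trajs]
    return np.mean(steps, axis=0)

t1 = np.array([[0.0, 0.0], [1.1, 0.1], [2.0, -0.1]])
t2 = np.array([[5.0, 2.0], [5.9, 1.9], [7.0, 2.1]])
mv = region_motion_vector([t1, t2])
```

The individual tracks wobble in y, but the averaged vector recovers the common drift of roughly (1.0, 0.0) px/frame.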

19 Integration
Goal: extend the multi-stream model with a new video stream. The extended model may prove intractable due to the increased state space. In that case:
– State-space reduction through a multi-time-scale approach will be attempted
– Early integration of Speaker turns + Lexical features will be investigated

20 Preliminary results
Before proceeding with the proposed integration we need to:
– compare video performance against the other feature families
– validate the extracted video features

Single-stream accuracy (%):
  Speaker Turns 85.9 | Prosodic Features 69.9 | Lexical Features 52.6 | Video Features 48.1

Two-stream models:
(A) (Speaker Turns) + (Prosody + Lexical Features)
(B) (Speaker Turns) + (Video Features)

                       Corr  Sub  Del  Ins  AER
(A) Two-stream model   87.8  4.5  7.7  3.2  15.4
(B) Two-stream model   90.4  3.2  6.4  4.5  14.1

Video features alone perform quite poorly, but they seem helpful when evaluated together with Speaker Turns.

21 Summary
– Extraction of video features through:
  – a skin-detector-enhanced KLT feature tracker
  – segmentation of trajectories into 4/6 spatial regions
  (a simple and fast approach, but with some open problems)
– Validation of motion vectors as a video feature
– Integration into the existing framework (work in progress)

