1. Multimodal Group Action Clustering in Meetings
Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud
IDIAP Research Institute, Switzerland

2. Outline
- Meetings: Sequences of Actions
- Why Clustering?
- Layered HMM Framework
- Experiments
- Conclusion and Future Work

3. Meetings: Sequences of Actions
Meetings are commonly understood as sequences of events or actions:
- meeting agenda: a prior sequence of discussion points, presentations, decisions to be made, etc.
- meeting minutes: a posterior sequence of key phases of the meeting, summarised discussions, decisions made, etc.
We aim to investigate the automatic structuring of meetings as sequences of meeting actions. The actions are multimodal in nature: speech, gestures, expressions, gaze, written text, use of devices, laughter, etc. In general, these action sequences are due to the group as a whole rather than to a particular individual.

4. Structuring of Meetings
A meeting is modelled as a continuous sequence of group actions taken from a mutually exclusive and exhaustive set: V = {V1, V2, V3, ..., VN}.
(Figure: an example timeline over t, e.g. V1, V4, V5, V1, V6, V2, V3, V3.)
Candidate group action sets:
- Based on interest level: Engaged, Neutral, Disengaged
- Based on tasks: Brainstorming, Decision Making, Information Sharing
- Based on turn-taking: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking
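The segmentation idea on this slide can be sketched in a few lines: a meeting becomes a list of labelled segments drawn from the exhaustive set V. This is only an illustration; the action names and times below are made up, not taken from the corpus.

```python
# A meeting as a continuous sequence of group actions drawn from a
# mutually exclusive and exhaustive set V (illustrative labels and times).
ACTIONS = {"discussion", "monologue", "presentation", "whiteboard"}

# Each segment: (start_sec, end_sec, action); segments must tile the meeting.
segmentation = [
    (0, 60, "monologue"),
    (60, 180, "discussion"),
    (180, 240, "presentation"),
    (240, 300, "whiteboard"),
]

def is_valid(segments, duration):
    """Check that segments are contiguous, cover [0, duration],
    and only use actions from the known set."""
    t = 0
    for start, end, action in segments:
        if start != t or end <= start or action not in ACTIONS:
            return False
        t = end
    return t == duration
```

The "mutually exclusive and exhaustive" property corresponds to the check that exactly one known action covers each instant.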

5. Previous Work
Recognition of meeting actions:
- Supervised, single-layer approaches.
- Investigated different multi-stream HMM variants, with streams modelling modalities or individuals.
Layered HMM approach:
- First-layer HMM models Individual Actions (I-HMM); second-layer HMM models Group Actions (G-HMM).
- Showed improvement over the single-layer approach.
Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003.
I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.

6. Why Clustering?
Unsupervised action clustering instead of supervised action recognition. High-level semantic group actions are difficult to:
- define: what action lexica are appropriate?
- annotate: in general, temporal boundaries are not precise.
Clustering allows us to find the natural structure of a meeting, and may help us better understand the data.
(Figure: the candidate group action sets from slide 4, repeated.)

7. Outline
- Meetings: Sequences of Actions
- Why Clustering?
- Layered HMM Framework
- Experiments
- Conclusion and Future Work

8. Single-layer HMM Framework
Single-layer HMM: a large vector of audio-visual features from each participant, together with group-level features, is concatenated to define the observation space.
Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003.
I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.

9. Two-layer HMM Framework
Two-layer HMM: by defining a proper set of individual actions, we decompose the group action recognition problem into two layers, from individual to group. Both layers use ergodic HMMs or extensions.
Please refer to:
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.

10. Advantages
1. Compared with the single-layer HMM, smaller observation spaces.
2. The individual-layer HMM (I-HMM) is person-independent, so a well-estimated model can be trained with much more data.
3. The group-layer HMM (G-HMM) is less sensitive to variations in the low-level audio-visual features.
4. Makes it easy to explore different combination schemes.
(Figure: audio-visual features feeding a supervised individual layer whose outputs feed an unsupervised group layer.)

11. Models for I-HMM
- Early Integration (Early Int.): a standard HMM is trained on combined audio-visual features.
- Multi-stream HMM (MS-HMM): combines audio-only and visual-only streams. Each stream is trained independently; the final classification fuses the outputs of both modalities by estimating their joint occurrence.
- Asynchronous HMM (A-HMM): allows the audio and visual streams to be asynchronous with respect to each other.
Please refer to:
S. Dupont et al., "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, 141-151, Sep. 2000.
S. Bengio, "An asynchronous hidden Markov model for audio-visual speech recognition", NIPS 2003.
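Early integration is the simplest of the three: the per-frame audio and visual feature vectors are concatenated before a single HMM is trained. A minimal sketch of that feature-level fusion (the feature values are placeholders):

```python
def early_integration(audio_frames, visual_frames):
    """Concatenate frame-synchronized audio and visual feature vectors
    into a single combined observation sequence (early integration)."""
    if len(audio_frames) != len(visual_frames):
        raise ValueError("streams must be frame-synchronized")
    return [a + v for a, v in zip(audio_frames, visual_frames)]

# Two frames of toy features: audio (e.g. energy, pitch) and visual
# (e.g. head vertical centroid).
audio = [[0.1, 0.5], [0.2, 0.4]]
visual = [[3.0], [3.1]]
fused = early_integration(audio, visual)
```

The multi-stream and asynchronous variants instead keep the two streams separate inside the model, which is why they can tolerate stream weighting (MS-HMM) or temporal misalignment (A-HMM) that a concatenated observation cannot express.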

12. Models for G-HMM: Clustering
Assume unknown segmentation and unknown number of clusters.
(Figure: candidate segmentations for 1 to 6 clusters, scored by likelihood.)
Please refer to:
J. Ajmera et al., "A robust speaker clustering algorithm", IEEE ASRU Workshop 2003.
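The flavour of likelihood-driven bottom-up clustering can be illustrated with a drastically simplified stand-in: single 1-D Gaussians per cluster and a fixed merge threshold, rather than the HMM-based criterion of Ajmera et al. that the slide actually refers to. The threshold and data below are illustrative assumptions.

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of samples under their own maximum-likelihood
    1-D Gaussian (variance floored for numerical safety)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def cluster(segments, threshold=2.0):
    """Greedy bottom-up merging: repeatedly merge the pair of clusters
    whose joint Gaussian loses the least likelihood, stopping when the
    best merge would lose more than the threshold."""
    clusters = [list(s) for s in segments]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                loss = (gauss_loglik(clusters[i]) + gauss_loglik(clusters[j])
                        - gauss_loglik(clusters[i] + clusters[j]))
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        loss, i, j = best
        if loss > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Four toy "segments": two drawn from a low-valued action, two from a
# high-valued one; merging should recover two clusters.
segments = [[0.0, 0.1, 0.2], [0.1, 0.2, 0.0], [5.0, 5.1], [5.2, 5.0]]
merged = cluster(segments)
```

The key property this shares with the slide's approach is that both the segmentation into clusters and their number fall out of the likelihood comparison rather than being fixed in advance.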

13. Linking Two Layers (1)

14. Linking Two Layers (2): Normalization
Please refer to:
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.

15. Outline
- Meetings: Sequences of Actions
- Why Clustering?
- Layered HMM Framework
- Experiments
- Conclusion and Future Work

16. Data Collection
Scripted meeting corpus:
- 30 meetings for training, 29 for testing.
- Each meeting is 5 minutes long, with 4 participants.
- http://mmm.idiap.ch/
Recorded in the IDIAP meeting room with 3 cameras and 12 microphones. Each meeting was 'scripted' as a sequence of actions.
Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003.

17. Audio-Visual Feature Extraction
Person-specific audio-visual features:
- Audio: seat-region audio activity, speech pitch, speech energy, speech rate.
- Visual: head vertical centroid, head eccentricity, right hand centroid, right hand angle, right hand eccentricity, head and hand motion.
Group-level audio-visual features:
- Audio: audio activity from the whiteboard region, audio activity from the screen region.
- Visual: mean difference from the whiteboard, mean difference from the projector screen.
(Figure: views from cameras 1, 2 and 3.)
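Per frame, the person-specific features above form one observation vector per participant. A minimal sketch of that assembly; the dictionary keys and values are hypothetical names chosen here for illustration, not identifiers from the original system.

```python
def person_features(audio, visual):
    """Concatenate the per-person audio and visual features listed on the
    slide into one observation vector (key names are illustrative)."""
    audio_keys = ["seat_activity", "pitch", "energy", "speech_rate"]
    visual_keys = ["head_centroid_y", "head_eccentricity", "rhand_centroid_x",
                   "rhand_angle", "rhand_eccentricity", "head_hand_motion"]
    return [audio[k] for k in audio_keys] + [visual[k] for k in visual_keys]

# Placeholder measurements for one participant in one frame.
audio = {"seat_activity": 1.0, "pitch": 120.0, "energy": 0.5, "speech_rate": 3.2}
visual = {"head_centroid_y": 0.4, "head_eccentricity": 0.9,
          "rhand_centroid_x": 0.6, "rhand_angle": 30.0,
          "rhand_eccentricity": 0.7, "head_hand_motion": 0.1}
vec = person_features(audio, visual)
```

In the two-layer framework each I-HMM sees only one such 10-dimensional per-person vector, instead of the concatenation over all participants that the single-layer HMM requires.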

18. Action Lexicon
Group Actions = Individual Actions + Group Devices (group actions can be treated as a combination of individual actions plus the states of group devices).
- Individual actions: Idle, Writing, Speaking.
- Group devices: Projector screen, Whiteboard.
- Group actions: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking.
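The decomposition "group action = individual actions + group devices" can be made concrete with a hand-written rule table. This is only an illustrative stand-in: in the actual framework the G-HMM learns this mapping from data rather than applying fixed rules.

```python
def group_action(individual_actions, projector_used, whiteboard_used):
    """Toy rule-based mapping from per-person actions plus device states
    to a group action label (a stand-in for what the G-HMM learns)."""
    speakers = sum(a == "speaking" for a in individual_actions)
    writers = sum(a == "writing" for a in individual_actions)
    note_suffix = " + note-taking" if writers > 0 else ""
    if projector_used:
        return "presentation" + note_suffix
    if whiteboard_used:
        return "whiteboard" + note_suffix
    if speakers >= 2:
        return "discussion"
    if speakers == 1:
        return "monologue" + note_suffix
    return "note-taking" if writers > 0 else "idle"
```

For instance, one person speaking at the projector while another takes notes yields "presentation + note-taking", matching the lexicon above.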

19. Example
Group actions over time: Discussion; Monologue1 + Note-taking; Presentation + Note-taking; Whiteboard + Note-taking.
Individual action sequences (S = Speaking, W = Writing, blank = Idle):
- Person 1: S S W
- Person 2: W S W
- Person 3: W S S W
- Person 4: S W S
Group devices: the projector is used during the presentation phase; the whiteboard is used during the whiteboard phase.

20. Performance Measures
We use the "purity" concept to evaluate results:
- Average action purity (aap): how well is one action limited to only one cluster?
- Average cluster purity (acp): how well is one cluster limited to only one action?
- "aap" and "acp" are combined into one measure, K.
Please refer to:
J. Ajmera et al., "A robust speaker clustering algorithm", IEEE ASRU Workshop 2003.
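These measures can be computed directly from frame-level labels. The sketch below follows the speaker-purity definitions of Ajmera et al., with K taken as the geometric mean of aap and acp; treat the exact formulas as an assumption about the paper's evaluation rather than a quote from it.

```python
from collections import Counter

def purity_measures(true_actions, cluster_ids):
    """Average action purity (aap), average cluster purity (acp), and
    K = sqrt(aap * acp), from frame-level label sequences."""
    n = len(true_actions)
    joint = Counter(zip(true_actions, cluster_ids))      # co-occurrence counts
    action_tot = Counter(true_actions)                   # frames per action
    cluster_tot = Counter(cluster_ids)                   # frames per cluster
    aap = sum(c * c / action_tot[a] for (a, _), c in joint.items()) / n
    acp = sum(c * c / cluster_tot[k] for (_, k), c in joint.items()) / n
    return aap, acp, (aap * acp) ** 0.5
```

A perfect one-to-one clustering gives aap = acp = K = 1; collapsing everything into a single cluster keeps aap at 1 but drives acp (and therefore K) down, which is why the two are combined.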

21. Results
Clustering individual meetings (true number of clusters: 3.93):
  Single-layer HMM:  Visual      N = 8.72   K = 50.6
                     Audio       N = 3.03   K = 58.6
                     AV          N = 4.10   K = 65.7
  Two-layer HMM:     Visual      N = 6.20   K = 56.8
                     Audio       N = 3.10   K = 63.7
                     Early Int.  N = 3.59   K = 70.1
                     MS-HMM      N = 4.17   K = 71.8
                     A-HMM       N = 3.51   K = 73.8
Clustering meeting collections (true number of clusters: 8):
  Single-layer HMM:  Visual      N = 16.33  K = 30.6
                     Audio       N = 3.16   K = 50.6
                     AV          N = 6.73   K = 62.1
  Two-layer HMM:     Visual      N = 11.67  K = 38.2
                     Audio       N = 3.50   K = 56.7
                     Early Int.  N = 10.60  K = 68.3
                     MS-HMM     N = 7.28   K = 69.8
                     A-HMM      N = 7.10   K = 72.2

22. Results
(Figures: clustering individual meetings; clustering the entire meeting collection.)

23. Conclusions
Structuring of meetings as a sequence of group actions.
We proposed a layered HMM framework for group action clustering: a supervised individual layer and an unsupervised group layer.
Experiments showed:
- the advantage of using both audio and visual modalities;
- better performance using the layered HMM;
- clustering gives a meaningful segmentation into group actions;
- clustering yields consistent labels when done across multiple meetings.

24. Future Work
Clustering:
- Investigating different sets of individual actions.
- Handling variable numbers of participants across or within meetings.
Related:
- Joint training of the layers in the supervised two-layer HMM.
- Defining new sets of group actions, e.g. based on interest level.
Data collection:
- Within the scope of the AMI project (www.amiproject.org), we are currently collecting a 100-hour corpus of natural meetings to facilitate further research.

25. Results

26. Results

27. Linking Two Layers (1)
Hard decision: the individual action model with the highest probability outputs a value of 1 while all other models output 0.
Soft decision: the probability of each individual action model is output and used as input features to the G-HMM.
Example: audio-visual features yield soft decision (0.7, 0.1, 0.2), hard decision (1, 0, 0).
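The two linking schemes are easy to state in code: a sketch of turning per-model scores into either a normalized probability vector (soft) or a one-hot vector (hard), reproducing the slide's (0.7, 0.1, 0.2) vs. (1, 0, 0) example.

```python
def soft_decision(scores):
    """Normalize per-model scores into a probability vector that is
    passed to the G-HMM as its observation (soft decision)."""
    total = sum(scores)
    return [s / total for s in scores]

def hard_decision(scores):
    """One-hot vector: 1 for the highest-scoring individual action
    model, 0 for all others (hard decision)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]
```

The soft decision preserves the I-HMM's uncertainty for the group layer, which is what the normalization step on slide 14 prepares.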

28. Results
Two clustering cases:
- Clustering individual meetings
- Clustering the entire meeting collection
Baseline system: single-layer HMM.
Clustering individual meetings:
  Two-layer HMM     K = 0.738
  Single-layer HMM  K = 0.657
Clustering the entire meeting collection:
  Two-layer HMM     K = 0.722
  Single-layer HMM  K = 0.621

