1 Multimodal Group Action Clustering in Meetings Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud IDIAP Research Institute.

Presentation transcript:

1 Multimodal Group Action Clustering in Meetings. Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud. IDIAP Research Institute, Switzerland.

2 Outline Meetings: Sequences of Actions. Why Clustering? Layered HMM Framework. Experiments. Conclusion and Future Work.

3 Meetings: Sequences of Actions

Meetings are commonly understood as sequences of events or actions:
- meeting agenda: a prior sequence of discussion points, presentations, decisions to be made, etc.
- meeting minutes: a posterior sequence of the key phases of the meeting, summarised discussions, decisions made, etc.

We aim to investigate the automatic structuring of meetings as sequences of meeting actions. These actions are multimodal in nature: speech, gestures, expressions, gaze, written text, use of devices, laughter, etc. In general, the action sequences are due to the group as a whole, rather than a particular individual.

4 Structuring of Meetings

A meeting is modelled as a continuous sequence of group actions taken from a mutually exclusive and exhaustive set V = {V1, V2, V3, …, VN}.

(Figure: a meeting timeline segmented into group actions, e.g. V1, V4, V5, V1, V6, V2, V3, V3.)

Example action lexica:
- Group actions based on interest level: Engaged, Neutral, Disengaged.
- Group actions based on tasks: Brainstorming, Decision Making, Information Sharing.
- Group actions based on turn-taking: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking.

5 Previous Work

Recognition of meeting actions:
- supervised, single-layer approaches.
- investigated different multi-stream HMM variants, with streams modelling modalities or individuals.

Layered HMM approach:
- a first-layer HMM models Individual Actions (I-HMM); a second-layer HMM models Group Actions (G-HMM).
- showed improvement over the single-layer approach.

Please refer to:
- I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003.
- I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
- D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.

6 Why Clustering?

Unsupervised action clustering instead of supervised action recognition. High-level semantic group actions are difficult to:
- define: what action lexica are appropriate?
- annotate: in general, temporal boundaries are not precise.

Clustering allows us to find the natural structure of a meeting, and may help us better understand the data.

(Figure: the example action lexica from slide 4: interest level, tasks, and turn-taking.)

7 Outline Meetings: Sequences of Actions. Why Clustering? Layered HMM Framework. Experiments. Conclusion and Future Work.

8 Single-layer HMM Framework

Single-layer HMM: a large vector of audio-visual features from each participant, together with group-level features, is concatenated to define the observation space.

Please refer to:
- I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003.
- I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
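As a rough illustration of this concatenated observation space (the feature counts are assumptions based on slide 17, not the talk's exact setup):

```python
import numpy as np

def single_layer_observation(person_feats, group_feats):
    """Build one frame of the single-layer HMM observation by concatenating
    every participant's audio-visual features with the group-level features."""
    return np.concatenate(person_feats + [group_feats])

# With 4 participants x ~10 person-specific features plus ~4 group-level
# features (slide 17), each frame is a ~44-dimensional vector; the two-layer
# framework below instead models each person's ~10 dimensions separately.
```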

9 Two-layer HMM Framework

Two-layer HMM: by defining a proper set of individual actions, we decompose the group action recognition problem into two layers, from individual to group. Both layers use ergodic HMMs or extensions.

Please refer to:
- D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.
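A minimal sketch of the two-layer idea, assuming the hmmlearn Python library purely for illustration (the talk does not specify the authors' implementation): one Gaussian HMM is trained per individual action, and the normalized vector of model likelihoods over a window of one person's features becomes an observation for the group layer.

```python
import numpy as np
from hmmlearn import hmm

INDIVIDUAL_ACTIONS = ["speaking", "writing", "idle"]   # lexicon from slide 18

def train_i_hmms(segments_by_action, n_states=3):
    """Train one I-HMM per individual action from labelled feature segments;
    segments_by_action maps an action name to a list of (T, D) arrays."""
    models = {}
    for action, segs in segments_by_action.items():
        X, lengths = np.concatenate(segs), [len(s) for s in segs]
        models[action] = hmm.GaussianHMM(
            n_components=n_states, covariance_type="diag").fit(X, lengths)
    return models

def g_hmm_observation(models, window):
    """Score a window of one person's AV features with every I-HMM and
    normalize the likelihoods (the 'soft decision' of slide 27)."""
    log_liks = np.array([models[a].score(window) for a in INDIVIDUAL_ACTIONS])
    p = np.exp(log_liks - log_liks.max())   # numerically stable normalization
    return p / p.sum()
```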

10 Advantages

1. Compared with the single-layer HMM, smaller observation spaces.
2. The individual-layer HMM (I-HMM) is person-independent, so a well-estimated model can be trained with much more data.
3. The group-layer HMM (G-HMM) is less sensitive to variations in the low-level audio-visual features.
4. Different combination systems can easily be explored.

(Figure: audio-visual features feed the supervised I-HMM layer, whose outputs feed the unsupervised G-HMM layer.)

11 Models for I-HMM

- Early Integration (Early Int.): a standard HMM is trained on the combined audio-visual features.
- Multi-stream HMM (MS-HMM): combines audio-only and visual-only streams. Each stream is trained independently; the final classification fuses the outputs of both modalities by estimating their joint occurrence.
- Asynchronous HMM (A-HMM): models the joint likelihood of the audio and visual streams while allowing the two streams to be asynchronous.

Please refer to:
- S. Dupont et al., "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, Sept. 2000.
- S. Bengio, "An asynchronous hidden Markov model for audio-visual speech recognition", NIPS 2003.
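For the multi-stream variant, a hedged sketch of the usual log-linear fusion of independently trained per-modality HMMs (the stream weight is illustrative; the talk does not give its fusion parameters):

```python
def multistream_log_likelihood(audio_hmm, visual_hmm,
                               audio_obs, visual_obs, w_audio=0.5):
    """Each modality's HMM is trained on its own stream; classification picks
    the action whose pair of models maximizes this weighted sum."""
    return (w_audio * audio_hmm.score(audio_obs)
            + (1.0 - w_audio) * visual_hmm.score(visual_obs))
```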

12 Models for G-HMM: Clustering

The group layer performs clustering, assuming both the segmentation and the number of clusters are unknown.

(Figure: the procedure searches jointly over the number of clusters and the segmentation, guided by the data likelihood.)

Please refer to:
- J. Ajmera et al., "A robust speaker clustering algorithm", IEEE ASRU Workshop, 2003.
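A rough sketch of the threshold-free merging idea from Ajmera et al., again using hmmlearn for illustration: two clusters merge only if a single HMM with their combined number of states (so the total parameter count stays fixed) explains their pooled data better than the two separate models. The real algorithm also re-segments the data with Viterbi decoding between merges, which this sketch omits.

```python
import numpy as np
from hmmlearn import hmm

def fit_cluster(segments, n_states):
    """Fit one Gaussian HMM to a cluster's segments; return model and likelihood."""
    X, lengths = np.concatenate(segments), [len(s) for s in segments]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    model.fit(X, lengths)
    return model, model.score(X, lengths)

def agglomerative_clustering(segments, init_states=2):
    """Start with one cluster per segment; greedily merge the pair whose
    combined model raises the total likelihood, stopping when none does."""
    clusters = [([s], init_states) for s in segments]
    scores = [fit_cluster(segs, n)[1] for segs, n in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                segs = clusters[i][0] + clusters[j][0]
                n = clusters[i][1] + clusters[j][1]   # parameter count fixed
                _, ll = fit_cluster(segs, n)
                gain = ll - (scores[i] + scores[j])
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, j, segs, n, ll)
        if best is None:
            break                                     # no merge helps: stop
        _, i, j, segs, n, ll = best
        for k in (j, i):                              # j > i, so delete j first
            del clusters[k], scores[k]
        clusters.append((segs, n))
        scores.append(ll)
    return clusters                                   # number of clusters found
```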

13 Linking Two Layers (1) (figure-only slide; the hard and soft decision rules are spelled out on slide 27.)

14 Linking Two Layers (2): Normalization

(The normalization equation was lost in transcription; in essence, the I-HMM likelihoods are normalized, e.g. rescaled to sum to one, before being passed to the G-HMM as features; see slide 27.)

Please refer to:
- D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.

15 Outline Meetings: Sequences of Actions. Why Clustering? Layered HMM Framework. Experiments. Conclusion and Future Work.

16 Data Collection

Scripted meeting corpus:
- 30 meetings for training, 29 for testing.
- each meeting lasts 5 minutes.
- 4 participants per meeting.
- 3 cameras, 12 microphones.

Each meeting was 'scripted' as a sequence of actions and recorded in the IDIAP meeting room.

Please refer to:
- I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003.

17 Audio-Visual Feature Extraction

Person-specific audio-visual features:
- Audio: seat-region audio activity, speech pitch, speech energy, speech rate.
- Visual: head vertical centroid, head eccentricity, right-hand centroid, right-hand angle, right-hand eccentricity, head and hand motion.

Group-level audio-visual features:
- Audio: audio activity from the whiteboard region; audio activity from the screen region.
- Visual: mean difference from the whiteboard; mean difference from the projector screen.

(Figure: views from the three cameras.)

18 Action Lexicon

Group Actions = Individual Actions + Group Devices (group actions can be treated as a combination of individual actions plus the states of group devices; a sketch of this decomposition follows the lists below).

- Individual actions: Idle, Writing, Speaking.
- Group devices: Projector screen, Whiteboard.
- Group actions: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking.
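To make the decomposition concrete, a hypothetical rule-based mapping, loosely consistent with the example on slide 19 (the rules and names are my illustration; the talk learns this relationship with the G-HMM rather than hand-coding it):

```python
def group_action(individual_actions, projector_used, whiteboard_used):
    """Map four participants' individual actions plus device states to a
    group-action label. Purely illustrative rules, not the talk's model."""
    speaking = sum(a == "speaking" for a in individual_actions)
    writing = sum(a == "writing" for a in individual_actions)
    if projector_used:
        base = "presentation"
    elif whiteboard_used:
        base = "whiteboard"
    elif speaking == 1:
        base = "monologue"
    elif speaking > 1:
        base = "discussion"
    else:
        return "note-taking"
    return base + " + note-taking" if writing and base != "discussion" else base

# e.g. one speaker and two note-takers, no devices in use:
# group_action(["speaking", "writing", "writing", "idle"], False, False)
# -> "monologue + note-taking"
```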

19 Example

(Timeline figure: each participant's sequence of individual actions, with S = Speaking, W = Writing, and blanks = Idle, shown alongside projector and whiteboard usage; together these yield the group-action sequence Discussion, Monologue1 + Note-taking, Presentation + Note-taking, Whiteboard + Note-taking.)

20 Performance Measures

We use the "purity" concept to evaluate results:
- Average action purity (aap): how well is one action limited to only one cluster?
- Average cluster purity (acp): how well is one cluster limited to only one action?
- "aap" and "acp" are combined into a single measure K (a sketch follows below).

Please refer to:
- J. Ajmera et al., "A robust speaker clustering algorithm", IEEE ASRU Workshop, 2003.
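A small sketch of the purity computation, following the acp/aap definitions used in the speaker-clustering literature (assuming, as the citation suggests, that this talk uses the same formulas, with K the geometric mean of the two purities):

```python
import numpy as np

def purity_measures(counts):
    """counts[i, j]: number of frames of true action j placed in cluster i."""
    N = counts.sum()
    acp = ((counts ** 2).sum(axis=1) / counts.sum(axis=1)).sum() / N
    aap = ((counts ** 2).sum(axis=0) / counts.sum(axis=0)).sum() / N
    return acp, aap, np.sqrt(acp * aap)   # K: geometric mean of both purities

# A perfect clustering (each cluster contains exactly one action and vice
# versa) gives acp = aap = K = 1.0; mixing actions lowers both purities.
```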

21 Results

Two tables report N (estimated number of clusters) and K (purity) for the single-layer HMM (Visual, Audio, AV) and the two-layer HMM (Visual, Audio, Early Int., MS-HMM, A-HMM):
- Clustering individual meetings (true number of clusters: 3.93 on average).
- Clustering meeting collections (true number of clusters: 8).

(The numeric table entries were lost in transcription; slide 28 preserves the headline K values.)

22 Results

(Figures: clustering results for individual meetings and for the entire meeting collection.)

23 Conclusions

Structuring of meetings as a sequence of group actions. We proposed a layered HMM framework for group action clustering, with a supervised individual layer and an unsupervised group layer.

Experiments showed:
- the advantage of using both audio and visual modalities.
- better performance with the layered HMM.
- clustering gives a meaningful segmentation into group actions.
- clustering yields consistent labels when done across multiple meetings.

24 Future Work

Clustering:
- investigating different sets of individual actions.
- handling variable numbers of participants across or within meetings.

Related:
- joint training of the layers in the supervised two-layer HMM.
- defining new sets of group actions, e.g. based on interest level.

Data collection:
- within the scope of the AMI project, we are currently collecting a 100-hour corpus of natural meetings to facilitate further research.

25 Results (figure-only slide.)

26 Results (figure-only slide.)

27 Linking Two Layers (1)

- Hard decision: the individual action model with the highest probability outputs a value of 1, while all other models output 0.
- Soft decision: outputs the probability of each individual action model as input features to the G-HMM.

Example: the audio-visual features yield the soft decision (0.7, 0.1, 0.2), which the hard decision maps to (1, 0, 0). (A sketch follows.)
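The hard decision is simply a one-hot of the soft decision; a two-line sketch:

```python
import numpy as np

def hard_decision(soft):
    """One-hot the most likely action: (0.7, 0.1, 0.2) -> (1.0, 0.0, 0.0)."""
    out = np.zeros_like(soft, dtype=float)
    out[np.argmax(soft)] = 1.0
    return out
```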

28 Results

Two clustering cases:
- clustering individual meetings.
- clustering the entire meeting collection.

The baseline system is the single-layer HMM.

Clustering individual meetings:
  Method            K
  Two-layer HMM     0.738
  Single-layer HMM  0.657

Clustering the entire meeting collection:
  Method            K
  Two-layer HMM     0.722
  Single-layer HMM  0.621