Spatio-Temporal Analysis of Multimodal Speaker Activity
Guillaume Lathoud, IDIAP
Supervised by Dr Iain McCowan, IDIAP

Context
Spontaneous multi-party speech.
Goal: extract salient information:
– Who? What? When? Where?
– Automatic meeting annotation/transcription.
– Speaker tracking, speech acquisition.
– Surveillance.
Approach: based on speaker location.
Multisource problem (overlaps, noise).

How?
Audio location: microphone array.
Audio content: speaker identification.
Video: one or several cameras.
Combination.

Audio: Global Picture
Multiple waveforms. Resolution s
Task 1: data acquisition and annotation (complete).
Task 2: develop robust multisource strategies (in progress).
Active speakers’ locations. Resolution s
Link locations across space and time. Resolution s. Links up to 0.25 s
Task 3: joint segmentation and tracking (complete).

Task 1: AV16.3 corpus
At IDIAP: 16 microphones, 3 cameras.
40 short recordings, about 1h30 overall.
– ‘meeting’: seated.
– ‘surveillance’: standing.
– pathological test cases (A, V, AV).
3D mouth annotation.
Used in the AMI project.

AV16.3 corpus

Task 2: Multisource Localization
Problem:
– Detect: how many speakers?
– Localize: where?
Sectors (coarse-to-fine).
Tested on real data: AV16.3 corpus.
To do:
– Finalize (optimization, multi-level).
– Compare with existing.

Sector-based Approach
Question: is there at least one active source in a given sector?

Task 2: Single Speaker Example

Task 2: Multiple Loudspeakers
Metric                 Ideal    Result
>=1 detected           100%
Average nb detected
(loudspeakers simultaneously active)
>=1 detected           100%     99.8%
Average nb detected
(loudspeakers simultaneously active)

Real data: Humans
Metric                 Ideal     Result
>=1 detected           ~89.4%    90.8%
Average nb detected    ~
(speakers simultaneously active; includes short silences)
>=1 detected           ~96.5%    95.1%
Average nb detected    ~2.0      1.6

Task 3: Segmentation/Tracking
Speech:
– Short and sporadic utterances.
– Overlaps.
– Filtering is difficult (Kalman filter, particle filter).
Alternative: short-term clustering.
Short-term = 0.25 s.
Threshold-free, online, unsupervised.
Unknown number of objects.

Example (figures): iterations 1 to 4, each showing a partition step followed by a merge step, then the final result.

Task 3: Application
Link locations across space and time. Resolution s. Links up to 0.25 s
Task 3: joint segmentation and tracking (complete).
Annotated IDIAP corpus of short meetings (total 1h45).
Single source localization.

Application (2)

Task 3: Metrics
Precision (PRC):
– An active speaker is detected in the result.
– PRC = probability that this speaker is truly active.
Recall (RCL):
– A speaker is truly active.
– RCL = probability of detecting this speaker in the result.
F-measure: $F = \frac{2 \cdot PRC \cdot RCL}{PRC + RCL}$
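The three metrics above can be sketched in a few lines of Python. This is only an illustration, not the evaluation code behind the slides: the function names and the idea of scoring one boolean per (speaker, time frame) pair are assumptions, since the slides do not say how detections are counted.

```python
def f_measure(prc, rcl):
    """Harmonic mean of precision and recall: 2 * PRC * RCL / (PRC + RCL)."""
    if prc + rcl == 0:
        return 0.0
    return 2 * prc * rcl / (prc + rcl)


def precision_recall_f(detected, truth):
    """Compute (PRC, RCL, F) from two equal-length boolean sequences.

    Assumption (not stated in the slides): each position is one
    (speaker, time frame) pair; detected[i] is the system output and
    truth[i] is the ground-truth activity for that pair.
    """
    if len(detected) != len(truth):
        raise ValueError("detected and truth must have the same length")
    true_pos = sum(1 for d, t in zip(detected, truth) if d and t)
    n_detected = sum(1 for d in detected if d)
    n_active = sum(1 for t in truth if t)
    # PRC: among detected speakers, the fraction that is truly active.
    prc = true_pos / n_detected if n_detected else 0.0
    # RCL: among truly active speakers, the fraction that is detected.
    rcl = true_pos / n_active if n_active else 0.0
    return prc, rcl, f_measure(prc, rcl)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    detected = [True, True, True, False, False, True]
    truth = [True, True, False, False, True, True]
    print(precision_recall_f(detected, truth))
```

As a sanity check, `f_measure(0.797, 0.946)` gives about 0.865, which matches the F value of 86.5% reported for the proposed method on the entire data.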

Task 3: Results
Entire data:
        Proposed    Lapel baseline
PRC     79.7%       84.3%
RCL     94.6%       93.3%
F       86.5%       88.6%
Overlaps only:
        Proposed    Lapel baseline
PRC     55.4%       46.6%
RCL     84.8%       66.4%
F       67.0%       54.7%
$F = \frac{2 \cdot PRC \cdot RCL}{PRC + RCL}$

Conclusion
Spontaneous speech = multisource problem.
AV16.3 corpus recorded, annotated.
Approach: detect, localize, track, segment.
Location is not identity!
– Fusion with monochannel analysis.
– Fusion with video.

Thank you!

Detection: Energy and Localization

Task 2: Delay-sum vs Proposed (1/3)
Figure panels: with optimized centroids (this work); with delay-sum centroids (this work).

Task 2: Delay-sum vs Proposed (2/3)
Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.9%        100%
Average nb detected
(loudspeakers simultaneously active)
>=1 detected           100%     99.2%        99.8%
Average nb detected
(loudspeakers simultaneously active)

Task 2: Delay-sum vs Proposed (3/3)
Metric                 Ideal     Delay-sum    Proposed
>=1 detected           ~89.4%    80.0%        90.8%
Average nb detected    ~
(humans simultaneously active)
>=1 detected           ~96.5%    86.7%        95.1%
Average nb detected    ~
(humans simultaneously active)