Spatio-Temporal Analysis of Multimodal Speaker Activity
Guillaume Lathoud, IDIAP
Supervised by Dr Iain McCowan, IDIAP

Context
Spontaneous multi-party speech.
Goal: extract salient information:
–Who? What? When? Where?
–Automatic meeting annotation/transcription.
–Speaker tracking, speech acquisition.
–Surveillance.
Approach: based on speaker location.
Multisource problem (overlaps, noise).

How?
Audio location: microphone array.
Audio content: speaker identification.
Video: one or several cameras.
Combination.

Audio: Global Picture
Multiple waveforms
Resolution s
Task 1: data acquisition and annotation (complete)
Task 2: develop robust multisource strategies (in progress)
Active speakers’ locations
Resolution s
Link locations across space and time
Resolution s
Links up to 0.25 s
Task 3: joint segmentation and tracking (complete)

Task 1: AV16.3 corpus
At IDIAP: 16 microphones, 3 cameras.
40 short recordings, about 1h30 overall.
–‘meeting’: seated.
–‘surveillance’: standing.
–pathological test cases (A, V, AV).
3D mouth annotation.
Used in the AMI project.

AV16.3 corpus
Task 2: Multisource Localization
Problem:
–Detect: how many speakers?
–Localize: where?

Sector-based Approach
Question: is there at least one active source in a given sector?

Task 2: Multisource Localization
Problem:
–Detect: how many speakers?
–Localize: where?
Sectors (coarse-to-fine).
Tested on real data: AV16.3 corpus.
To do:
–Finalize (optimization, multi-level).
–Compare with existing methods.

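The coarse-to-fine sector idea can be illustrated with a small recursive search over azimuth sectors. This is only a sketch under stated assumptions, not the method used in this work: the activity test (`sector_is_active`) is a hypothetical oracle built from known source azimuths, whereas a real system would derive a sector activity measure from the microphone array signals. The function names, the binary split and the 5-degree stopping width are likewise assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Sector = Tuple[float, float]  # (start_deg, end_deg), half-open interval [start, end)


def coarse_to_fine(
    sector: Sector,
    is_active: Callable[[Sector], bool],
    min_width_deg: float = 5.0,
) -> List[Sector]:
    """Return the narrowest active sectors found inside `sector`."""
    if not is_active(sector):
        return []  # no active source here: prune the whole sector
    start, end = sector
    if end - start <= min_width_deg:
        return [sector]  # fine enough: report this sector as a detection
    mid = (start + end) / 2.0
    # Binary split (an assumed scheme); each half is refined independently.
    return (coarse_to_fine((start, mid), is_active, min_width_deg)
            + coarse_to_fine((mid, end), is_active, min_width_deg))


def make_toy_activity_test(true_azimuths_deg: List[float]) -> Callable[[Sector], bool]:
    """Hypothetical oracle: a sector is active if it contains a known source."""
    def sector_is_active(sector: Sector) -> bool:
        start, end = sector
        return any(start <= az < end for az in true_azimuths_deg)
    return sector_is_active


# Toy usage: two sources at 30 and 200 degrees azimuth.
toy_test = make_toy_activity_test([30.0, 200.0])
detections = coarse_to_fine((0.0, 360.0), toy_test)
print(len(detections), detections)
```

In this toy setting the number of returned sectors answers "how many?" and their bounds answer "where?", which mirrors the detect/localize split of the problem statement.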
Task 2: Single Speaker Example
Task 2: Multiple Loudspeakers
Metric                 Ideal    Result
>=1 detected           100%
Average nb detected
>=1 detected           100%     99.8%
Average nb detected
loudspeakers simultaneously active

Real data: Humans
Metric                 Ideal     Result
>=1 detected           ~89.4%    90.8%
Average nb detected    ~
speakers simultaneously active (includes short silences)
>=1 detected           ~96.5%    95.1%
Average nb detected    ~2.0      1.6

Task 3: Segmentation/Tracking
Speech:
–Short and sporadic utterances.
–Overlaps.
–Filtering is difficult (Kalman filter, particle filter).
Alternative: short-term clustering.
Short-term = 0.25 s.
Threshold-free, online, unsupervised.
Unknown number of objects.

Example: iteration 1 (partition)
Example: iteration 1 (merge)
Example: iteration 2 (partition)
Example: iteration 2 (merge)
Example: iteration 3 (partition)
Example: iteration 3 (merge)
Example: iteration 4 (partition)
Example: iteration 4 (merge)
Example: result
Task 3: Application
Link locations across space and time
Resolution s
Links up to 0.25 s
Task 3: joint segmentation and tracking (complete)
Annotated IDIAP corpus of short meetings (total 1h45)
Single source localization

Application (2)
Task 3: Metrics
Precision (PRC):
–An active speaker is detected in the result.
–PRC = probability that this speaker is truly active.
Recall (RCL):
–A speaker is truly active.
–RCL = probability that this speaker is detected in the result.
F-measure: F = (2 * PRC * RCL) / (PRC + RCL)

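The three metrics can be illustrated on a toy example. The sketch below assumes that speaker activity is represented as a set of (frame index, speaker label) events, which is an illustrative choice and not necessarily how the evaluation in this work was implemented; the function name `prc_rcl_f` and the toy data are hypothetical.

```python
from typing import Set, Tuple

Event = Tuple[int, str]  # (frame index, speaker label): an assumed representation


def prc_rcl_f(detected: Set[Event], truth: Set[Event]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over (frame, speaker) events."""
    correct = len(detected & truth)  # events that are both detected and truly active
    prc = correct / len(detected) if detected else 0.0
    rcl = correct / len(truth) if truth else 0.0
    f = 2.0 * prc * rcl / (prc + rcl) if (prc + rcl) > 0 else 0.0
    return prc, rcl, f


# Toy data: speaker B is missed at frame 1, and frame 3 contains two false alarms.
truth = {(0, "A"), (1, "A"), (1, "B"), (2, "B")}
detected = {(0, "A"), (1, "A"), (2, "B"), (3, "A"), (3, "B")}

prc, rcl, f = prc_rcl_f(detected, truth)
print(f"PRC={prc:.3f} RCL={rcl:.3f} F={f:.3f}")  # PRC=0.600 RCL=0.750 F=0.667
```

Here 3 of the 5 detected events are correct (PRC = 0.6) and 3 of the 4 true events are found (RCL = 0.75), so F sits between the two, closer to the lower value.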
Task 3: Results
Entire data:
        Proposed    Lapel baseline
PRC     79.7%       84.3%
RCL     94.6%       93.3%
F       86.5%       88.6%
Overlaps only:
        Proposed    Lapel baseline
PRC     55.4%       46.6%
RCL     84.8%       66.4%
F       67.0%       54.7%
F = (2 * PRC * RCL) / (PRC + RCL)

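As a quick sanity check, the reported F values can be recomputed from the reported PRC and RCL. The snippet below is a small arithmetic sketch; the dictionary keys are labels chosen here for illustration. Small differences in the last digit are to be expected, presumably because the reported PRC and RCL are themselves rounded.

```python
def f_measure(prc: float, rcl: float) -> float:
    """F = 2 * PRC * RCL / (PRC + RCL); works on fractions or percentages alike."""
    return 2.0 * prc * rcl / (prc + rcl)


# (PRC, RCL, reported F), all in percent, copied from the results slide.
reported = {
    "entire data / proposed": (79.7, 94.6, 86.5),
    "entire data / lapel baseline": (84.3, 93.3, 88.6),
    "overlaps only / proposed": (55.4, 84.8, 67.0),
    "overlaps only / lapel baseline": (46.6, 66.4, 54.7),
}

for name, (prc, rcl, f_reported) in reported.items():
    f = f_measure(prc, rcl)
    print(f"{name}: recomputed F = {f:.2f}%, reported F = {f_reported}%")
```

All four recomputed values agree with the reported ones to within 0.1 percentage point.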
Conclusion
Spontaneous speech = multisource problem.
AV16.3 corpus recorded, annotated.
Approach: detect, localize, track, segment.
Location is not identity!
–Fusion with monochannel analysis.
–Fusion with video.

Thank you!
Detection: Energy and Localization
Task 2: Delay-sum vs Proposed (1/3)
With optimized centroids (this work)
With delay-sum centroids (this work)

Task 2: Delay-sum vs Proposed (2/3)
Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.9%        100%
Average nb detected
loudspeakers simultaneously active
>=1 detected           100%     99.2%        99.8%
Average nb detected
loudspeakers simultaneously active

Task 2: Delay-sum vs Proposed (3/3)
Metric                 Ideal     Delay-sum    Proposed
>=1 detected           ~89.4%    80.0%        90.8%
Average nb detected    ~
humans simultaneously active
>=1 detected           ~96.5%    86.7%        95.1%
Average nb detected    ~
humans simultaneously active