Spatio-Temporal Analysis of Multimodal Speaker Activity Guillaume Lathoud, IDIAP Supervised by Dr Iain McCowan, IDIAP.


1 Spatio-Temporal Analysis of Multimodal Speaker Activity
Guillaume Lathoud, IDIAP
Supervised by Dr Iain McCowan, IDIAP

4 Context
Spontaneous multi-party speech.
Goal: extract salient information:
–Who? What? When? Where?
–Automatic meeting annotation/transcription.
–Speaker tracking, speech acquisition.
–Surveillance.
Approach: based on speaker location.
Multisource problem (overlaps, noise).

5

6 How?
Audio location: microphone array.
Audio content: speaker identification.
Video: one or several cameras.
Combination.

10 Audio: Global Picture
Task 1: data acquisition and annotation (complete)
–Multiple waveforms. Resolution 0.0000625 s.
Task 2: develop robust multisource strategies (in progress)
–Active speakers’ locations. Resolution 0.016 s.
Task 3: joint segmentation and tracking (complete)
–Link locations across space and time. Resolution 0.016 s. Links up to 0.25 s.
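The three resolutions on this slide can be related by simple arithmetic. The following is a minimal sketch, assuming that 0.0000625 s is the audio sampling period and that 0.016 s is the spacing between consecutive location estimates (the slide does not say so explicitly); all variable names are illustrative.

```python
# Resolutions quoted on the slide, in seconds.
SAMPLE_PERIOD_S = 0.0000625   # waveform resolution
FRAME_PERIOD_S = 0.016        # location-estimate resolution
LINK_SPAN_S = 0.25            # maximum span of a link between locations

# Assumption: the waveform resolution is the sampling period.
sample_rate_hz = round(1.0 / SAMPLE_PERIOD_S)

# Assumption: one location estimate is produced per 0.016 s step.
samples_per_frame = round(FRAME_PERIOD_S * sample_rate_hz)

# Number of whole 0.016 s steps that fit inside one 0.25 s link.
frames_per_link = int(LINK_SPAN_S / FRAME_PERIOD_S)

print(sample_rate_hz, samples_per_frame, frames_per_link)
```

Under these assumptions the audio is sampled at 16 kHz, each location estimate covers a step of 256 samples, and a link can bridge at most 15 whole steps.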

13 Task 1: AV16.3 corpus
At IDIAP: 16 microphones, 3 cameras.
40 short recordings, about 1h30 overall.
–‘meeting’: seated.
–‘surveillance’: standing.
–pathological test cases (A, V, AV).
3D mouth annotation.
Used in the AMI project.
http://mmm.idiap.ch/Lathoud/av16.3_v6

14 AV16.3 corpus

15

17 Task 2: Multisource Localization
Problem:
–Detect: how many speakers?
–Localize: where?

18 Sector-based Approach Question: is there at least one active source in a given sector?

20 Task 2: Multisource Localization
Problem:
–Detect: how many speakers?
–Localize: where?
Sectors (coarse-to-fine).
Tested on real data: AV16.3 corpus.
To do:
–Finalize (optimization, multi-level).
–Compare with existing.

21 Task 2: Single Speaker Example

24 Task 2: Multiple Loudspeakers
2 loudspeakers simultaneously active:
Metric                 Ideal    Result
>=1 detected           100%     100%
Average nb detected    2.0      1.9
3 loudspeakers simultaneously active:
Metric                 Ideal    Result
>=1 detected           100%     99.8%
Average nb detected    3.0      2.5
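The two metrics in these tables can be illustrated with a short sketch. It assumes, since the slides do not define them, that ">=1 detected" is the fraction of time frames in which at least one source is detected and that "Average nb detected" is the mean number of detected sources per frame. The function name and the toy data are hypothetical.

```python
def detection_metrics(counts_per_frame):
    """Return (fraction of frames with >=1 detection, mean detections per frame).

    counts_per_frame: list of non-negative integers, one per time frame,
    giving how many sources were detected in that frame.
    """
    n_frames = len(counts_per_frame)
    at_least_one = sum(1 for c in counts_per_frame if c >= 1) / n_frames
    average_nb = sum(counts_per_frame) / n_frames
    return at_least_one, average_nb

# Toy example: 10 frames with 2 sources truly active (made-up numbers).
toy_counts = [2, 2, 1, 2, 0, 2, 2, 1, 2, 2]
frac_detected, avg_detected = detection_metrics(toy_counts)
print(f"{frac_detected:.1%} of frames with >=1 detected, "
      f"{avg_detected:.1f} detected on average")
```

With this reading, the "Ideal" column is what a perfect detector would score on the same frames, which is why it can fall below 100% when the ground truth contains silences.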

26 Real data: Humans
2 speakers simultaneously active (includes short silences):
Metric                 Ideal     Result
>=1 detected           ~89.4%    90.8%
Average nb detected    ~1.3      1.3
3 speakers simultaneously active (includes short silences):
Metric                 Ideal     Result
>=1 detected           ~96.5%    95.1%
Average nb detected    ~2.0      1.6

30 Task 3: Segmentation/Tracking
Speech:
–Short and sporadic utterances.
–Overlaps.
–Filtering is difficult (Kalman filter, particle filter).
Alternative: short-term clustering.
Short-term = 0.25 s.
Threshold-free, online, unsupervised.
Unknown number of objects.

31 Example: iteration 1 (partition)

32 Example: iteration 1 (merge)

33 Example: iteration 2 (partition)

34 Example: iteration 2 (merge)

35 Example: iteration 3 (partition)

36 Example: iteration 3 (merge)

37 Example: iteration 4 (partition)

38 Example: iteration 4 (merge)

39 Example: result

40

41

42 Task 3: Application
Annotated IDIAP corpus of short meetings (total 1h45). http://mmm.idiap.ch
Single source localization.
Task 3: joint segmentation and tracking (complete)
–Link locations across space and time. Resolution 0.016 s. Links up to 0.25 s.

43 Application (2)

46 Task 3: Metrics
Precision (PRC):
–An active speaker is detected in the result.
–PRC = probability that he is truly active.
Recall (RCL):
–A speaker is truly active.
–RCL = probability that he is detected in the result.
F-measure: F = (2 * PRC * RCL) / (PRC + RCL)
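These definitions can be made concrete with a small frame-level sketch. It assumes that both the result and the ground truth are given as one set of active speaker identifiers per time frame; that representation, the function name and the toy data are illustrative and not taken from the slides.

```python
def precision_recall_f(detected, truth):
    """Frame-level precision, recall and F-measure.

    detected, truth: lists of sets of speaker ids, one set per time frame.
    """
    # Speaker-frames that are both detected and truly active.
    hits = sum(len(d & t) for d, t in zip(detected, truth))
    n_detected = sum(len(d) for d in detected)
    n_truth = sum(len(t) for t in truth)
    prc = hits / n_detected   # detected -> probability of being truly active
    rcl = hits / n_truth      # truly active -> probability of being detected
    f = 2 * prc * rcl / (prc + rcl)
    return prc, rcl, f

# Toy example over 4 frames (made-up data).
truth = [{1}, {1, 2}, {2}, set()]
detected = [{1}, {1}, {2, 3}, {3}]
prc, rcl, f = precision_recall_f(detected, truth)
print(f"PRC={prc:.2f} RCL={rcl:.2f} F={f:.2f}")
```

In the toy data, 3 of the 5 detected speaker-frames are correct and 3 of the 4 truly active speaker-frames are found, so precision is 0.60 and recall is 0.75.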

48 Task 3: Results
Entire data:
Metric    Proposed    Lapel baseline
PRC       79.7%       84.3%
RCL       94.6%       93.3%
F         86.5%       88.6%
Overlaps only:
Metric    Proposed    Lapel baseline
PRC       55.4%       46.6%
RCL       84.8%       66.4%
F         67.0%       54.7%
F = (2 * PRC * RCL) / (PRC + RCL)
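As a consistency check, the F values in these tables can be recomputed from the reported precision and recall. The sketch below only uses the numbers shown on the slide; the dictionary layout and names are illustrative.

```python
def f_measure(prc, rcl):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * prc * rcl / (prc + rcl)

# (PRC, RCL, F) as reported on the slide, in percent.
REPORTED = {
    ("entire data", "proposed"): (79.7, 94.6, 86.5),
    ("entire data", "lapel baseline"): (84.3, 93.3, 88.6),
    ("overlaps only", "proposed"): (55.4, 84.8, 67.0),
    ("overlaps only", "lapel baseline"): (46.6, 66.4, 54.7),
}

recomputed = {key: f_measure(p, r) for key, (p, r, _) in REPORTED.items()}

for key, (_, _, f_slide) in REPORTED.items():
    print(key, f"recomputed F = {recomputed[key]:.2f}%", f"slide F = {f_slide}%")
```

The recomputed values agree with the slide to within 0.1 percentage point; small differences are expected because the precision and recall shown are themselves rounded.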

49 Conclusion
Spontaneous speech = multisource problem.
AV16.3 corpus recorded, annotated.
Approach: detect, localize, track, segment.
Location is not identity!
–Fusion with monochannel analysis.
–Fusion with video.

50 Thank you!

51 Detection: Energy and Localization

52 Task 2: Delay-sum vs Proposed (1/3)
With optimized centroids (this work).
With delay-sum centroids (this work).

53 Task 2: Delay-sum vs Proposed (2/3)
2 loudspeakers simultaneously active:
Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.9%        100%
Average nb detected    2.0      1.8          1.9
3 loudspeakers simultaneously active:
Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.2%        99.8%
Average nb detected    3.0      1.9          2.5

54 Task 2: Delay-sum vs Proposed (3/3)
2 humans simultaneously active:
Metric                 Ideal     Delay-sum    Proposed
>=1 detected           ~89.4%    80.0%        90.8%
Average nb detected    ~1.3      1.0          1.3
3 humans simultaneously active:
Metric                 Ideal     Delay-sum    Proposed
>=1 detected           ~96.5%    86.7%        95.1%
Average nb detected    ~2.0      1.4          1.6
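For readers unfamiliar with the baseline named in this comparison, delay-sum is commonly understood as time-aligning the microphone channels for a candidate direction, averaging them, and measuring the output energy. The following is a minimal pure-Python sketch of that generic idea only. It assumes a far-field source, a hypothetical 2-microphone array (the corpus uses 16 microphones), integer-sample delays, a speed of sound of 343 m/s and 16 kHz sampling. It does not reproduce the sector-based method or the optimized centroids of this work.

```python
import math
import random

SPEED_OF_SOUND = 343.0             # m/s (assumed)
SAMPLE_RATE = 16000                # Hz (assumed)
MICS = [(-0.1, 0.0), (0.1, 0.0)]   # hypothetical 2-mic array, in metres

def steering_delays(azimuth_deg):
    """Integer arrival delays (in samples) of a far-field wavefront at each mic."""
    a = math.radians(azimuth_deg)
    ux, uy = math.cos(a), math.sin(a)
    # A mic lying further along the source direction hears the wavefront earlier.
    raw = [-(x * ux + y * uy) / SPEED_OF_SOUND * SAMPLE_RATE for x, y in MICS]
    earliest = min(raw)
    return [round(r - earliest) for r in raw]

def delay_sum_energy(signals, delays):
    """Mean energy of the average of the channels after compensating the delays."""
    n = min(len(s) - d for s, d in zip(signals, delays))
    total = 0.0
    for t in range(n):
        y = sum(s[t + d] for s, d in zip(signals, delays)) / len(signals)
        total += y * y
    return total / n

# Simulate one white-noise source at 60 degrees (made-up test signal).
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
signals = [[0.0] * d + source for d in steering_delays(60)]

# Scan a coarse grid of candidate directions and keep the most energetic one.
energies = {az: delay_sum_energy(signals, steering_delays(az))
            for az in range(0, 181, 30)}
best_azimuth = max(energies, key=energies.get)
print(best_azimuth)
```

When the steering delays match the true arrival delays, the channels add coherently and the output energy peaks; at other directions the noise samples are misaligned and partly cancel.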

