Multiple Audio Sources Detection and Localization Guillaume Lathoud, IDIAP Supervised by Dr Iain McCowan, IDIAP.


1 Multiple Audio Sources Detection and Localization Guillaume Lathoud, IDIAP Supervised by Dr Iain McCowan, IDIAP

2 Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.

3 Context Automatic analysis of recordings: –Meeting annotation. –Speaker tracking for speech acquisition. –Surveillance applications.

4 Context Automatic analysis of recordings: –Meeting annotation. –Speaker tracking for speech acquisition. –Surveillance applications. Questions to answer: –Who? What? Where? When? Location can be used for very precise segmentation.

5 Microphone Array

6

7 Why Multiple Sources? Spontaneous multi-party speech: –Short. –Sporadic. –Overlaps.

8 Why Multiple Sources? Spontaneous multi-party speech: –Short. –Sporadic. –Overlaps. Problem: frame-level multisource localization and detection. One frame = 16 ms.

9 Why Multiple Sources? Spontaneous multi-party speech: –Short. –Sporadic. –Overlaps. Problem: frame-level multisource localization and detection. One frame = 16 ms. Many localization methods exist… But: –Speech is wideband. –Detection issue: how many sources?

10 Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.

11 Sector-based Approach Question: is there at least one active source in a given sector?

12 Sector-based Approach Question: is there at least one active source in a given sector? Answer it for each frequency bin separately.

13 Frame-level Analysis One time frame every 16 ms. Discretize both space and frequency. [Figure: grid of sectors of space (s) versus frequency bins (f).]

14 Frame-level Analysis One time frame every 16 ms. Discretize both space and frequency. Sparsity assumption [Roweis 03]. [Figure: grid of sectors of space (s) versus frequency bins (f).]

15 Frame-level Analysis One time frame every 16 ms. Discretize both space and frequency. Sparsity assumption [Roweis 03]. [Figure: grid of sectors of space (s) versus frequency bins (f), annotated with the values 0 9 2 0 10 0 1.]

16 Frame-level Analysis One time frame every 16 ms. Discretize both space and frequency. Sparsity assumption [Roweis 03]. [Figure: grid of sectors of space (s) versus frequency bins (f), annotated with the values 0 9 2 0 10 0 1.]
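A minimal sketch of the frame-level discretization, assuming that each frequency bin of a frame has already been attributed to at most one sector (the sparsity assumption). Reading the values on the slide as "number of frequency bins per sector" is an interpretation, and the names `bins_per_sector`, `active_sectors` and the threshold `min_bins` are hypothetical, not taken from the presentation.

```python
from collections import Counter


def bins_per_sector(winners, n_sectors):
    """Count, for one time frame, how many frequency bins fall in each sector.

    winners: one entry per frequency bin, either a sector index or None
             when the bin is attributed to no sector.
    """
    counts = Counter(w for w in winners if w is not None)
    return [counts.get(s, 0) for s in range(n_sectors)]


def active_sectors(counts, min_bins):
    """Hypothetical decision rule: a sector is active if enough bins chose it."""
    return [s for s, c in enumerate(counts) if c >= min_bins]


# Toy frame with 22 frequency bins and 7 sectors (values chosen for illustration).
winners = [1] * 9 + [2] * 2 + [4] * 10 + [6]
counts = bins_per_sector(winners, n_sectors=7)
print(counts)
print(active_sectors(counts, min_bins=5))
```

Counting bins rather than summing energy keeps the per-sector measure independent of signal level, which is one plausible reason to discretize frequency in this way.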

17 Frequency Bin Analysis Compute phase between 2 microphones: $\theta(f)$ in $[-\pi, \pi]$. Repeat for all $P$ microphone pairs: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$, with $P = M(M-1)/2$ for $M$ microphones.

18 Frequency Bin Analysis Compute phase between 2 microphones: $\theta(f)$ in $[-\pi, \pi]$. Repeat for all $P$ microphone pairs: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$, with $P = M(M-1)/2$ for $M$ microphones. For each sector $s$, compare the measured phases $\Theta(f)$ with the centroid $\Theta_s$: pseudo-distance $d(\Theta(f), \Theta_s)$. [Figure: for frequency bin $f$, one pseudo-distance per sector: $d(\Theta(f), \Theta_1)$, $d(\Theta(f), \Theta_2)$, $d(\Theta(f), \Theta_3)$, …, $d(\Theta(f), \Theta_7)$.]

19 Frequency Bin Analysis Compute phase between 2 microphones: $\theta(f)$ in $[-\pi, \pi]$. Repeat for all $P$ microphone pairs: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$, with $P = M(M-1)/2$ for $M$ microphones. For each sector $s$, compare the measured phases $\Theta(f)$ with the centroid $\Theta_s$: pseudo-distance $d(\Theta(f), \Theta_s)$. Apply sparsity assumption: –Only the best sector (smallest pseudo-distance) is active.
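The per-bin steps above (pair phases, pseudo-distance to each sector centroid, keep only the best sector) can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the function names, the sign convention of the phase difference, and the toy centroids are all hypothetical. The pseudo-distance is the sum over pairs of sin² of half the phase mismatch, as given on the backup "Pseudo-distance" slide.

```python
import cmath
import math
from itertools import combinations


def pair_phases(spectra):
    """Phase difference for every microphone pair at one frequency bin.

    spectra: list of M complex DFT coefficients, one per microphone.
    Returns P = M(M-1)/2 phases, each in [-pi, pi].
    """
    return [cmath.phase(a * b.conjugate()) for a, b in combinations(spectra, 2)]


def pseudo_distance(phases, centroid):
    """Sum over pairs of sin^2((measured phase - centroid phase) / 2)."""
    return sum(math.sin((t - c) / 2.0) ** 2 for t, c in zip(phases, centroid))


def best_sector(phases, centroids):
    """Sparsity assumption: only the closest sector is active in this bin."""
    distances = [pseudo_distance(phases, c) for c in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)


# Toy example: M = 3 microphones, hence P = 3 pairs, and 3 made-up centroids.
spectra = [cmath.rect(1.0, 0.0), cmath.rect(1.0, -0.5), cmath.rect(1.0, -1.0)]
centroids = [[0.0, 0.0, 0.0], [0.5, 1.0, 0.5], [-0.5, -1.0, -0.5]]
print(best_sector(pair_phases(spectra), centroids))
```

Because sin² of the half-angle is periodic in 2π, the pseudo-distance is insensitive to phase wrapping, so no explicit unwrapping is needed before the comparison.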

20 Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.

21 Real Data: Single Speaker Without sparsity assumption [SAPA 04] similar to [ICASSP 01]

22 Real Data: Single Speaker With sparsity assumption (this work) Without sparsity assumption [SAPA 04] similar to [ICASSP 01]

23 Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.

24 Real Data: Multiple Loudspeakers

25 Task 2: Multiple Loudspeakers
2 loudspeakers simultaneously active:
Metric | Ideal | Result
>=1 detected | 100% |
Average number detected | 2.0 |

26 Real Data: Multiple Loudspeakers
2 loudspeakers simultaneously active:
Metric | Ideal | Result
>=1 detected | 100% | 100%
Average number detected | 2.0 | 1.9

27 Real Data: Multiple Loudspeakers
2 loudspeakers simultaneously active:
Metric | Ideal | Result
>=1 detected | 100% | 100%
Average number detected | 2.0 | 1.9
3 loudspeakers simultaneously active:
Metric | Ideal | Result
>=1 detected | 100% | 99.8%
Average number detected | 3.0 | 2.5
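The two metrics in these tables can be computed along the following lines. The exact definitions are not given on the slides, so this sketch assumes frame-level definitions: ">=1 detected" is the fraction of frames in which at least one truly active source is detected, and "average number detected" is the mean number of truly active sources detected per frame. The function name and the data layout are hypothetical.

```python
def detection_metrics(detected, truth):
    """Frame-level detection metrics (assumed definitions, see lead-in).

    detected, truth: lists with one set of source labels per time frame.
    Returns (fraction of frames with >=1 correct detection,
             average number of correct detections per frame).
    """
    hits = [len(d & t) for d, t in zip(detected, truth)]
    n_frames = len(hits)
    at_least_one = sum(1 for h in hits if h >= 1) / n_frames
    average_nb = sum(hits) / n_frames
    return at_least_one, average_nb


# Toy example: 4 frames, sources "A" and "B" active in every frame.
truth = [{"A", "B"}] * 4
detected = [{"A", "B"}, {"A"}, {"B", "C"}, set()]
print(detection_metrics(detected, truth))
```

Under these definitions, the "Ideal" column is what a perfect detector would score on the ground truth, which is why it drops below 100% when the ground truth itself contains silent frames.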

28 Outline Context and problem. Approach. –Discretize: ( sector, time frame, frequency bin ). –Example. Experiments. –Multiple loudspeakers. –Multiple humans. Conclusion.

29 Real data: Humans

30
2 speakers simultaneously active (includes short silences):
Metric | Ideal | Result
>=1 detected | ~89.4% | 90.8%
Average number detected | ~1.3 | 1.3

31 Real data: Humans
2 speakers simultaneously active (includes short silences):
Metric | Ideal | Result
>=1 detected | ~89.4% | 90.8%
Average number detected | ~1.3 | 1.3
3 speakers simultaneously active (includes short silences):
Metric | Ideal | Result
>=1 detected | ~96.5% | 95.1%
Average number detected | ~2.0 | 1.6

32 Conclusion Sector-based approach. Localization and detection. Effective on real multispeaker data.

33 Conclusion Sector-based approach. Localization and detection. Effective on real multispeaker data. Current work: –Optimize centroids. –Multi-level implementation. –Compare multilevel with existing methods.

34 Conclusion Sector-based approach. Localization and detection. Effective on real multispeaker data. Current work: –Optimize centroids. –Multi-level implementation. –Compare multilevel with existing methods. Possible integration with Daimler.

35 Thank you!

36 Pseudo-distance Measured phases: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)] \in \mathbb{R}^P$. For each sector $s$, a centroid $\Theta_s = [\theta_{s,1} \ldots \theta_{s,P}]$. Pseudo-distance: $d(\Theta(f), \Theta_s) = \sum_p \sin^2\left( (\theta_p(f) - \theta_{s,p}) / 2 \right)$. Since $\cos(x) = 1 - 2 \sin^2(x/2)$, argmax beamformed energy = argmin $d$.
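The step from the cosine identity to "argmax beamformed energy = argmin d" can be spelled out as follows. This is a sketch under an assumption the slide does not state: the beamformer is a delay-sum applied to unit-magnitude (phase-only) microphone spectra. The symbols $\varphi_m(f)$ (measured phase at microphone $m$) and $\varphi_{s,m}$ (steering phase of sector $s$ at microphone $m$) are introduced here for illustration, so that for a pair $p = (m, n)$ the measured phase is $\theta_p(f) = \varphi_m(f) - \varphi_n(f)$ and the centroid phase is $\theta_{s,p} = \varphi_{s,m} - \varphi_{s,n}$.

\[
E_s(f) = \Bigl| \sum_{m=1}^{M} e^{j(\varphi_m(f) - \varphi_{s,m})} \Bigr|^2
       = M + 2 \sum_{p=1}^{P} \cos\bigl(\theta_p(f) - \theta_{s,p}\bigr)
\]

\[
E_s(f) = M + 2 \sum_{p=1}^{P} \Bigl[ 1 - 2 \sin^2\Bigl( \frac{\theta_p(f) - \theta_{s,p}}{2} \Bigr) \Bigr]
       = M + 2P - 4\, d\bigl(\Theta(f), \Theta_s\bigr)
\]

Since $M$ and $P$ do not depend on the sector, the sector that maximizes $E_s(f)$ is the one that minimizes $d$.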

37 Delay-sum vs Proposed (1/3) With optimized centroids (this work) With delay-sum centroids (this work)
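The slides do not define "delay-sum centroids". One plausible reading, shown here only as a sketch, is that each centroid holds the theoretical pair phases of a reference point in the sector, derived from propagation delays. The function name, the speed of sound value, the sign convention, and the geometry below are all assumptions.

```python
import math
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def theoretical_phases(source, mics, freq_hz, c=SPEED_OF_SOUND):
    """Theoretical phase difference of each microphone pair for a point source.

    source: (x, y, z) position in metres; mics: list of (x, y, z) positions.
    Returns P = M(M-1)/2 phases wrapped to [-pi, pi].
    """
    dists = [math.dist(source, m) for m in mics]
    phases = []
    for d_i, d_j in combinations(dists, 2):
        raw = 2.0 * math.pi * freq_hz * (d_j - d_i) / c
        # Wrap to [-pi, pi] via atan2 of the sine and cosine.
        phases.append(math.atan2(math.sin(raw), math.cos(raw)))
    return phases


# Toy geometry: 2 microphones 20 cm apart, source 1 m away on the array axis.
mics = [(-0.1, 0.0, 0.0), (0.1, 0.0, 0.0)]
print(theoretical_phases((1.0, 0.0, 0.0), mics, freq_hz=343.0))
```

"Optimized centroids", by contrast, would replace these geometry-derived values with values tuned for each sector; the slides do not say how that optimization is done.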

38 Delay-sum vs Proposed (2/3)
2 loudspeakers simultaneously active:
Metric | Ideal | Delay-sum | Proposed
>=1 detected | 100% | 99.9% | 100%
Average number detected | 2.0 | 1.8 | 1.9
3 loudspeakers simultaneously active:
Metric | Ideal | Delay-sum | Proposed
>=1 detected | 100% | 99.2% | 99.8%
Average number detected | 3.0 | 1.9 | 2.5

39 Delay-sum vs Proposed (3/3)
2 humans simultaneously active:
Metric | Ideal | Delay-sum | Proposed
>=1 detected | ~89.4% | 80.0% | 90.8%
Average number detected | ~1.3 | 1.0 | 1.3
3 humans simultaneously active:
Metric | Ideal | Delay-sum | Proposed
>=1 detected | ~96.5% | 86.7% | 95.1%
Average number detected | ~2.0 | 1.4 | 1.6

40 Energy and Localization

