Speaker Localization: introduction to system evaluation


1 Speaker Localization: introduction to system evaluation
ITC-irst, Center for Scientific and Technological Research, Via Sommarive, Povo, Trento, Italy
Speaker Localization: introduction to system evaluation
Maurizio Omologo, with contributions by Alessio Brutti, Luca Cristoforetti, Piergiorgio Svaizer
NIST Rich Transcription’05 Evaluation Workshop, Edinburgh, July 13th, 2005

2 Outline
Acoustic source/speaker localization (SLOC) and tracking: general issues
The localization problem in the CHIL lecture scenario
Evaluation criteria
Software developed at IRST
Experimental results
Examples
Description of the IRST systems being evaluated

3 Speaker localization: general issues
Problem definition: locate and track, in 2D or 3D, the active speakers (one or more) in a given multi-speaker scenario.
2D localization: the source is assumed to lie in the plane where the acoustic sensors are placed.
Forced assumption: the speaker, and any other acoustic source, is treated as a point source, in general slowly moving and emitting wide-band, non-stationary signals; radiation effects are neglected.
Key technical aspect: the acoustic sensor signals are very different from each other; their characteristics depend on the speaker position, the room acoustics (reflections, reverberation), background noise, etc.
Most common approach:
0) detect an acoustic event (see the Speech Activity Detection problem);
1) compute the Time Difference of Arrival (TDOA) at different microphone pairs;
2) derive a source position estimate from geometry;
3) apply possible constraints (e.g. ignore locations outside the room).

4 Example of very near-field propagation
Distance between the microphones of a pair = 12 cm
Speed of sound = 340 m/s
In general, the TDOA is computed on the basis of the coherence in the direct wavefront.
For the far field, plane-wave propagation can be assumed.
Animation courtesy of Dr. Dan Russell, Kettering University
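To make the geometry concrete, here is a minimal sketch (not part of the original slides) of the far-field relation between TDOA and direction of arrival for the 12 cm pair and 340 m/s speed of sound quoted above; the function names are illustrative.

```python
import numpy as np

C_SOUND = 340.0   # speed of sound [m/s], as assumed on the slide
D_PAIR = 0.12     # spacing of one microphone pair [m]

def max_tdoa(d=D_PAIR, c=C_SOUND):
    """Largest possible |TDOA| for a pair: source on the pair's axis."""
    return d / c  # about 0.35 ms for a 12 cm pair

def doa_from_tdoa(tau, d=D_PAIR, c=C_SOUND):
    """Far-field (plane-wave) direction of arrival from a TDOA estimate.

    Returns the angle (radians) between the arrival direction and the
    broadside of the pair; valid only when |c*tau/d| <= 1.
    """
    return np.arcsin(np.clip(c * tau / d, -1.0, 1.0))

if __name__ == "__main__":
    print(f"max |TDOA| = {max_tdoa()*1e3:.3f} ms")
    # a TDOA of 0.1 ms maps to roughly 16.5 degrees off broadside
    print(f"DOA for tau = 0.1 ms: {np.degrees(doa_from_tdoa(1e-4)):.1f} deg")
```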

5 T-shaped arrays in the CHIL room at IRST
[Figure: the CHIL room at IRST, with the T-shaped microphone arrays and the speaker area marked.]
Animation courtesy of Dr. Dan Russell, Kettering University

6 Speaker localization: general issues
Sensory system characteristics:
Number of microphones
Sensitivity, spatial and spectral response of the microphones
Position of each microphone in the room (this requires a calibration step)
Information required for the evaluation:
Reference time stamps: sample-level synchronous recordings of all the microphones, or the offset information needed to time-align signals recorded by different acquisition platforms
Offset information to time-align audio and video recordings
Ground-truth 3D labels for each active speaker: in general they are derived from a set of calibrated video cameras, a mouth-tracking algorithm, and a final manual check
In CHIL, the 3D labels of the lecturer are updated every 667 ms; the ground truth is a sequence of “time stamp + 3D label” entries.

7 CHIL: Speaker localization in lecture scenarios
Common sensor set-up in the CHIL consortium: 3 T-shaped microphone arrays, 2 close-talk microphones, 1 MarkIII array; optional use of all the microphones available at a site.
UKA set-up: 4 T-shaped arrays, one/two Countryman close-talk microphones, one NIST MarkIII (IRST-light version), table-top microphones.

8 Evaluation Criteria
Accurate localization (lecturer) vs rough localization (audience)
Types of localization error:
Fine (error < 50 cm for the lecturer, < 100 cm for the audience)
Gross (otherwise)

9 Evaluation Criteria: fine and gross errors
[Figure: room layout illustrating fine and gross localization errors for the lecturer and the audience; shown are a fixed camera, a pan-tilt-zoom camera, the screen, the meeting table, and the NIST MarkIII (IRST-light) microphone array.]

10 Evaluation Criteria
Accurate localization (lecturer) vs rough localization (audience)
Types of localization error:
Fine (error < 50 cm for the lecturer, < 100 cm for the audience)
Gross (otherwise)
Speech Activity Detection (external vs internal to the localization system):
False alarm rate
Deletion rate
Average frame rate (/s)
Fine+gross represents the most relevant cue for the evaluation of SLOC accuracy.

11 SLOC error computation in a time interval
[Figure: SLOC error computation within a 667 ms time interval.]
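As a rough illustration of this per-interval comparison (a sketch, not the CHIL evaluation code; the data layout is assumed), each localization output can be assigned to the 667 ms interval it falls in and compared against that interval's reference 3D label:

```python
import numpy as np

TIME_STEP_MS = 667  # reference 3D labels are updated every 667 ms

def interval_errors(ref, hyp, time_step_ms=TIME_STEP_MS):
    """Per-interval localization errors.

    ref: dict mapping interval index -> reference (x, y, z) in mm
         (one entry per 667 ms interval in which the speaker is active)
    hyp: list of (time_ms, x, y, z) localization outputs
    Returns a list of (interval_index, error_mm) for outputs that fall
    inside an interval that has a reference label.
    """
    errors = []
    for t_ms, x, y, z in hyp:
        idx = int(t_ms // time_step_ms)       # which 667 ms interval
        if idx in ref:                        # speaker active in this interval
            err = np.linalg.norm(np.array([x, y, z]) - np.array(ref[idx]))
            errors.append((idx, err))
    return errors
```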

12 Evaluation Criteria
Accurate localization (lecturer) vs rough localization (audience)
Types of localization error:
Fine (error < 50 cm for the lecturer, < 100 cm for the audience)
Gross (otherwise)
Speech Activity Detection (external vs internal to the localization system):
False alarm rate
Deletion rate
Average frame rate (/s)
Bias (fine and fine+gross)
Localization precision: Pcor = N_fine_errors / N_localizations
When is this evaluation meaningful? For each analysis segment we need to know whether one or more acoustic sources (persons or noise) are active and, only in that case, an accurate set of x-y-z coordinates!
Related evaluation software was developed at ITC-irst and is available on the NIST and CHIL web sites.
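The following is a minimal sketch, under the same 500 mm / 1000 mm thresholds, of how per-frame decisions could be aggregated into Pcor, bias, RMSE, deletions, and false alarms; it is an illustration of the metrics above, not the ITC-irst evaluation tool, and the frame data layout is assumed.

```python
import numpy as np

FINE_THRESHOLD_MM = {"lecturer": 500, "audience": 1000}

def summarize(frames):
    """Aggregate per-frame results into SLOC metrics.

    frames: list of dicts with keys
      'role' : 'lecturer' or 'audience'
      'ref'  : reference (x, y, z) in mm, or None if no speaker is active
      'hyp'  : hypothesized (x, y, z) in mm, or None if no output was produced
    """
    fine, gross, deletions, false_alarms = [], [], 0, 0
    for f in frames:
        if f["ref"] is None:
            if f["hyp"] is not None:
                false_alarms += 1          # output while nobody is speaking
            continue
        if f["hyp"] is None:
            deletions += 1                 # active speaker but no output
            continue
        diff = np.array(f["hyp"]) - np.array(f["ref"])
        err = np.linalg.norm(diff)
        (fine if err < FINE_THRESHOLD_MM[f["role"]] else gross).append(diff)

    n_loc = len(fine) + len(gross)
    pcor = len(fine) / n_loc if n_loc else 0.0
    bias_fine = np.mean(fine, axis=0) if fine else np.zeros(3)
    rmse_fine = np.sqrt(np.mean(np.sum(np.square(fine), axis=1))) if fine else 0.0
    all_err = fine + gross
    rmse_all = np.sqrt(np.mean(np.sum(np.square(all_err), axis=1))) if all_err else 0.0
    return {"Pcor": pcor, "bias_fine_mm": bias_fine,
            "RMSE_fine_mm": rmse_fine, "RMSE_fine+gross_mm": rmse_all,
            "deletions": deletions, "false_alarms": false_alarms}
```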

13 Evaluation software
It consists of two steps:
1) an XML converter from the manual transcription file and the 3D label file to the reference file
2) C code to derive the results
First step: transcriptions -> reference
[Example: excerpt of a Transcriber XML transcription (Turn, Sync, and Event tags, including noise and lexical events) and the corresponding sequence of reference labels (lecturer / audience).]
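As an illustration of what such a converter might do (this is not the ITC-irst tool; the element names follow the Transcriber format shown in the excerpt, while the 667 ms framing, the assumption of complete start/end times, and the output format are mine):

```python
import xml.etree.ElementTree as ET

TIME_STEP_S = 0.667  # reference frame length used in CHIL

def speaker_per_frame(trs_path, total_duration_s):
    """Derive a per-frame speaker label from a Transcriber-style XML file.

    Returns one label per 667 ms frame: the 'speaker' attribute of the
    Turn covering the frame start, or None outside any Turn.
    """
    root = ET.parse(trs_path).getroot()
    turns = []
    for turn in root.iter("Turn"):
        start = float(turn.get("startTime"))
        end = float(turn.get("endTime"))
        turns.append((start, end, turn.get("speaker")))

    n_frames = int(total_duration_s / TIME_STEP_S)
    labels = []
    for i in range(n_frames):
        t = i * TIME_STEP_S
        label = None
        for start, end, spk in turns:
            if start <= t < end:
                label = spk
                break
        labels.append(label)
    return labels
```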

14 Evaluation software: second step
Second step: running the evaluation tool, e.g.
Evaluation software -reference seminar.ref -inputFile seminar.loc -evalOutput seminar.out -evalSummary seminar.sum -thresholdLecturer 500 -thresholdAudience 1000 -timestep 667
Each localization output frame is labelled in the evaluation output, e.g. Fine Error, Deletion, False Alarm, No Speaker, Ignored (Multiple Speakers).
The summary reports, for Lecturer, Audience, and Overall: Pcor, bias fine (x,y,z) [mm], bias fine+gross (x,y,z) [mm], RMSE fine [mm], RMSE fine+gross [mm], deletion rate, false alarm rate, and the number of localization frames used for the error statistics.
Example summary values: bias fine (x,y,z) [mm] = (79,-3,-1) for the lecturer, (106,-241,-22) for the audience, (80,-7,-2) overall; bias fine+gross (x,y,z) [mm] = (115,35,-3), (177,-34,-12), (116,34,-3); N. of output loc. frames = 2242; Reference Duration = ; Average Frames/sec = 2.41; N. of reference frames = 1283.

15 NIST evaluation ’05 of SLOC systems
Participants: IRST, TU, UKA
Seminar segments: 13 seminars recorded on November 23rd, 2004, and in January and February 2005, at Karlsruhe University
In this NIST evaluation, performance was measured only on lecturers.
Evaluation software parameters: thresholds for fine and gross errors of 50 cm (lecturer) and 100 cm (audience); time step = 667 ms
Evaluation summary metrics: average frame rate, number of localization frames used for the lecturer statistics, false alarm rate, deletion rate, localization rate (Pcor), RMSE fine, RMSE fine+gross

16 Experimental Results
13 seminars, E1 segments; N. of reference frames = 5788 (4014 s); time step = 667 ms

System | Av. frame rate (/s) | N. of loc. frames | False alarm rate | Del. rate | Loc. rate (Pcor) | RMSE fine (mm) | RMSE fine+gross (mm)
IRST   | 2.25  | 2539 | 0.42 | 0.41 | 0.95 | 203 | 309
TU     | 2.68  | 4221 | 0.98 | 0.   | 0.57 | 307 | 851
UKA    | 53.79 | 3863 | 0.84 | 0.09 |      | 263 | 569
IRST’  | 1.94  | 2273 | 0.39 | 0.48 | 0.92 | 198 | 327

17 x-coordinate output examples
[Figure: x-coordinate output example for one seminar.]

18 x-coordinate output examples
[Figure: x-coordinate output example for one seminar.]

19 x-coordinate output examples
[Figure: x-coordinate output example for one seminar.]

20 IRST speaker localization and tracking systems
ITC-irst, Center for Scientific and Technological Research, Via Sommarive, Povo, Trento, Italy
IRST speaker localization and tracking systems
Maurizio Omologo, with contributions by Alessio Brutti, Luca Cristoforetti, Piergiorgio Svaizer
NIST Rich Transcription’05 Evaluation Workshop, Edinburgh, July 13th, 2005

21 System description
Two techniques:
1a) Use of two T-shaped arrays (B and D), with two pairs for the 2D (x-y) location
1b) Use of two pairs for the z-coordinate; the directions are derived by CSP (GCC-PHAT) TDOA analysis
2) Use of three T-shaped arrays and of the Global Coherence Field (GCF)
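A minimal sketch of how a 2D (x-y) location might be obtained from two direction estimates, one per array, by intersecting the two bearing lines; the array positions and angles below are hypothetical, and the actual IRST system may combine the pairs differently.

```python
import numpy as np

def intersect_bearings(p1, theta1, p2, theta2):
    """Intersect two bearing lines in the x-y plane.

    p1, p2   : (x, y) positions of the two arrays [m]
    theta1/2 : estimated direction of arrival from each array [rad],
               measured in the same global frame
    Returns the (x, y) intersection point, i.e. the 2D source estimate.
    """
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    # Solve p1 + t1*d1 = p2 + t2*d2 for (t1, t2)
    A = np.column_stack((d1, -d2))
    t = np.linalg.solve(A, np.array(p2) - np.array(p1))
    return np.array(p1) + t[0] * d1

# Example with made-up array positions (meters) and bearings:
# intersect_bearings((0.0, 0.0), np.radians(45), (4.0, 0.0), np.radians(135))
# -> array([2., 2.])
```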

22 TDOA estimate based on microphone pairs and CSP (GCC-PHAT) analysis
(see Knapp-Carter 1976; Omologo-Svaizer, ICASSP, Trans. on SAP 1997; and U.S. Patent 5,465,302, October 1992)
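A minimal sketch of a GCC-PHAT (CSP) TDOA estimate for one microphone pair, assuming sample-synchronous input signals; this illustrates the technique in general, not the IRST implementation.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """TDOA estimate for one microphone pair via GCC-PHAT (CSP analysis).

    x1, x2  : sample-synchronous signals of the two microphones
    fs      : sampling rate [Hz]
    max_tau : optional physical limit on |TDOA| [s], e.g. spacing / 340
    Returns the estimated delay [s] of x2 relative to x1
    (positive when x2 is delayed).
    """
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = np.conj(X1) * X2
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(max_shift, int(max_tau * fs))
    # Rearrange so that zero delay sits in the middle of the window
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = np.argmax(np.abs(cc)) - max_shift
    return delay / fs
```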

23 IRST T-shaped microphone array
Technique based on CSP analysis
Use of four microphones (3 pairs)
Accurate 3D speaker location using few microphones
Since 1999 it has been a product (AETHRA, Italy)

24 Global Coherence Field
For a hypothesized sound source position (x, y, z):
GCF(x, y, z) = (1 / P) Σ_{(i,k)} C_{i,k}( τ_{i,k}(x, y, z) ), with P = Q(Q-1)/2 microphone pairs,
where Q = number of sensors, C_{i,k} = coherence at microphone pair (i, k), and τ_{i,k}(x, y, z) = time delay at pair (i, k) assuming that the source is in (x, y, z).
3D location is based on the TDOA of the vertical microphone pairs, once a 2D location has been derived by maximizing the GCF over all x-y coordinates.
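A minimal sketch of a GCF-based 2D search, reusing the idea of the GCC-PHAT function above; the grid, the array geometry, and the coherence callback are illustrative assumptions, not the IRST implementation.

```python
import numpy as np

C_SOUND = 340.0  # speed of sound [m/s]

def expected_delay(p_src, p_i, p_k, c=C_SOUND):
    """Time delay expected at pair (i, k) if the source is at p_src."""
    return (np.linalg.norm(p_src - p_i) - np.linalg.norm(p_src - p_k)) / c

def gcf_search(pairs, coherence, grid_xy, z_fixed=1.5):
    """Global Coherence Field maximization over a 2D grid of positions.

    pairs     : list of (p_i, p_k) microphone positions, each a 3D np.array [m]
    coherence : function (pair_index, delay_s) -> coherence value, e.g. the
                GCC-PHAT function of that pair evaluated at the given delay
    grid_xy   : iterable of (x, y) candidate positions [m]
    z_fixed   : assumed source height for the 2D search [m]
    Returns the (x, y) with the highest average coherence and its GCF value.
    """
    best_xy, best_gcf = None, -np.inf
    for x, y in grid_xy:
        p_src = np.array([x, y, z_fixed])
        gcf = np.mean([coherence(idx, expected_delay(p_src, p_i, p_k))
                       for idx, (p_i, p_k) in enumerate(pairs)])
        if gcf > best_gcf:
            best_xy, best_gcf = (x, y), gcf
    return best_xy, best_gcf
```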

25 Recent results on UKA lectures
System / data     | Av. frame rate (/s) | False alarm rate | Del. rate | Loc. rate (Pcor) | RMSE fine (mm) | RMSE fine+gross (mm)
NIST eval – pairs | 2.25 | 0.42 | 0.41 | 0.95 | 203 | 309
NIST eval – GCF   | 1.94 | 0.39 | 0.48 | 0.92 | 198 | 327
Jan.’05 – pairs   | 1.8  | 0.45 | 0.49 | 0.82 | 186 | 641

26 CSP analysis of a segment with two speakers
[Figure: CSP (GCC-PHAT) function between channel 0 and channel 7, from Seminar_ _A_4, as a function of time [s] and delay [samples].]

27 Conclusions
This NIST evaluation has been very important for establishing the evaluation approach introduced in CHIL during the last year.
To better understand the potential of the SLOC technologies under study:
Need to further improve the reference transcriptions
Need to reduce the number of metrics: for instance, by combining false alarm rate and deletion rate into a single figure, by imposing the same external SAD, etc.
Need to address a real multi-speaker lecture scenario: it is much more challenging, and new annotation tools are needed
For meetings, different evaluation criteria may be necessary.
Person tracking based on audio-video fusion will also require other evaluation criteria.

