The NIST Speaker Recognition Evaluations Alvin F Martin Odyssey Singapore 27 June 2012.

1 The NIST Speaker Recognition Evaluations Alvin F Martin Odyssey Singapore 27 June 2012

2 Outline
Some Early History
Evaluation Organization
Performance Factors
Metrics
Progress
Future

3 Some Early History
Success of speech recognition evaluation
  – Showed benefits of independent evaluation on common data sets
Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard
  – Multi-purpose corpus collected (~1991) with speaker recognition in mind
  – Followed by Switchboard-2 and similar collections
Linguistic Data Consortium created in 1992 to support further speech (and text) collections in the US
The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
Earlier NIST speaker evaluations in ’92, ’95
  – ’92 evaluation had several sites as part of DARPA program
  – ’95 evaluation with 6 sites used some Switchboard-1 data
  – Emphasis was on speaker id rather than open set recognition

5 Martigny 1994
Varying corpora and performance measures made meaningful comparisons difficult

6 Avignon 1998
WORKSHOP RLA2C - Speaker Recognition
la Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques
Speaker Recognition and its Commercial and Forensic Applications
AVIGNON, avril/april 1998
Soutenu / Sponsored by GFCP - SFA - ESCA - IEEE
TIMIT was preferred corpus
Sometimes bitter debate over forensic capabilities

7 Avignon Papers

8 Crete 2001
2001: A Speaker Odyssey - The Speaker Recognition Workshop
June 18-22, 2001, Crete, Greece
First official “Odyssey”
More emphasis on evaluation

11 Toledo 2004
ISCA Archive: ODYSSEY - The Speaker and Language Recognition Workshop
May 31 - June 3, 2004, Toledo, Spain
First Odyssey with NIST SRE Workshop held in conjunction at same location
First to include language recognition
Two notable keynotes on forensic recognition
Well attended
Odyssey held biennially since 2004

13 Etc. – Odyssey 2006, 2008, 2010, 2012, …
Odyssey 2008: The Speaker and Language Recognition Workshop, Stellenbosch, South Africa, January 21-24, 2008
Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, 28 June - 1 July 2010

14 Organizing Evaluations
Which task(s)?
Key principles
Milestones
Participants

15 Which Speaker Recognition Problem?
Access Control?
  – Text independent or dependent?
  – Prior probability of target high
Forensic?
  – Prior not clear
Person Spotting?
  – Prior probability of target low
  – Text independent
NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of the performance curve

16 Some Basic Evaluation Principles
Speaker spotting primary task
Research system oriented
Pooling across target speakers
Emphasis on low false alarm rate operating point with scores and decisions (calibration matters)

17 Organization Basics
Open to all willing participants
Research-oriented
  – Commercialized competition discouraged
Written evaluation plans
  – Specified rules of participation
Workshops limited to participants
  – Each site/team must be represented
Evaluation data sets subsequently published by the LDC

19 1996 Evaluation Plan (cont’d)

20 1996 Evaluation Plan (cont’d)
1. PROC plots are ROCs plotted on normal probability error (miss versus false alarm) plots
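
The footnote above describes plotting miss against false alarm probability on normal probability scales. A minimal sketch of that axis transform follows, assuming raw trial scores where higher means "more likely the target"; the function names `probit` and `det_points` are illustrative and do not come from the evaluation plan.

```python
from statistics import NormalDist


def probit(p: float) -> float:
    """Map an error probability in (0, 1) to a standard normal deviate."""
    return NormalDist().inv_cdf(p)


def det_points(target_scores, nontarget_scores):
    """Return (probit(P_FA), probit(P_Miss)) pairs, one per usable threshold.

    A trial is accepted when its score is >= the threshold. Thresholds that
    give an error rate of exactly 0 or 1 are skipped, because the normal
    deviate is unbounded there.
    """
    points = []
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
        p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        if 0.0 < p_miss < 1.0 and 0.0 < p_fa < 1.0:
            points.append((probit(p_fa), probit(p_miss)))
    return points
```

On these axes, score distributions that are roughly normal produce curves that are close to straight lines, which is the usual motivation for the warped scale.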

21 DET Curve Paper – Eurospeech ’97

22 Wikipedia DET Page

23 Some Milestones
1992 – DARPA program limited speaker identification evaluation
1995 – Small identification evaluation
1996 – First SRE in current series
2000 – AHUMADA Spanish data, first non-English speech
2001 – Cellular data
2001 – ASR transcripts provided
2002 – FBI “forensic” database
2002 – SuperSid Workshop following SRE
2005 – Multiple languages with bilingual speakers

24 Some Milestones (cont’d)
2005 – Room mic recordings, cross-channel trials
2008 – Interview data
2010 – New decision cost function metric stressing even lower FA rate region
2010 – High and low vocal effort, aging
2010 – HASR (Human-Assisted Speaker Recognition) Evaluation
2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb
2012 – Target speakers defined beforehand

25 Participation
Grew from fewer than a dozen to 58 sites in 2010
MIT (Doug) provided workshop notebook covers listing participants
Big increase in participants after 2001
Handling scores of participating sites becomes a management problem

26 NIST 2004 Speaker Recognition Workshop
Taller de Reconocimiento de Locutor

28 Participating Sites
* Not in SRE series
# Incomplete

29 This slide is from 2001: A Speaker Odyssey in Crete

30 NIST Evaluation Data Set (cont’d)
Year | Common Condition(s) | Evaluation Features
2002 | One-session training on conv. phone data | Cellular data, alternative tests of extended training, speaker segmentation, and a limited corpus of simulated forensic data
2003 | One-session training on conv. phone data | Cellular data, extended training
2004 | Handheld landline conv. phone speech, English only | Multi-language data with bilingual speakers
2005 | English only with handheld tel. set | Included cross-channel trials with mic. test, both sides of 2-channel convs. provided
2006 | English only trials (including mic. test trials) | Included cross-channel trials with mic. test

31 NIST Evaluation Data Set (cont’d)
Year | Common Condition(s) | Evaluation Features
[missing] | – contrasting English and bilingual speakers, interview and conv. phone speech along with cross-condition trials | Interview speech recorded over multiple mic channels and conv. phone speech recorded over mic and tel channels, multiple languages
[missing] | – contrasting tel and mic channels, interview and conversational phone speech, and high, low and normal vocal effort | Multiple microphones, phone calls with high, low, and normal vocal effort, aging data (Greybeard), HASR
[missing] | – interview test without noise, conv. phone test without noise, interview test with added noise, conv. phone test with added noise, conv. phone test collected in noisy environment | Target speakers specified in advance (from previous evals) with large amounts of training, some test calls collected in noisy environments, phone test data with added noise

32 Performance Factors
Intrinsic
Extrinsic
Parametric

33 Intrinsic Factors
Relate to the speaker
  – Demographic factors
      Sex
      Age
      Education
  – Mean pitch
  – Speaking style
      Conversational telephone
      Interview
      Read text
  – Vocal effort
      Some questions about definition and how to collect
  – Aging
      Hard to collect sizable amounts of data with years of time separation

34 Extrinsic Factors
Relate to the collection environment
  – Microphone or telephone channel
  – Telephone channel type
      Landline, cellular, VOIP
      In earlier times, carbon vs. electret
  – Telephone handset type
      Handheld, headset, earbud, speakerphone
  – Microphone type – matched, mismatched
  – Placement of microphone relative to speaker
  – Background noise
  – Room reverberation

35 “Parametric” Factors
Train/test speech duration
  – Have tested 10 s up to ~half hour, more in ’12
Number of training sessions
  – Have tested 1 to 8, more in ’12
Language
  – English has been predominant, but a variety of others included in some evaluations
  – Is better performance for English due to familiarity and quantity of development data?
  – Cross-language trials a separate challenge

36 Metrics
Equal Error Rate
  – Easy to understand
  – Not operating point of interest
  – Calibration matters
Decision Cost Function
CLLR
FA rate at fixed miss rate
  – E.g. 10% (lower for some conditions)

37 Decision Cost Function C_Det
$$C_{Det} = C_{Miss} \times P_{Miss|Target} \times P_{Target} + C_{FalseAlarm} \times P_{FalseAlarm|NonTarget} \times (1 - P_{Target})$$
Weighted sum of miss and false alarm error probabilities
Parameters are the relative costs of detection errors, C_Miss and C_FalseAlarm, and the a priori probability of the specified target speaker, P_Target
Normalize by best possible cost of system doing no processing (minimum of cost of always deciding “yes” or always deciding “no”)
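
The cost function and its normalization can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the default parameter values (C_Miss = C_FalseAlarm = 1, P_Target = 0.001) are the SRE10 settings quoted later in this deck, used here as an assumption.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.001):
    """C_Det: weighted sum of the miss and false alarm error probabilities."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.001):
    """C_Det divided by the cost of the better trivial system.

    Always deciding "no" misses every target (cost c_miss * p_target);
    always deciding "yes" false-alarms on every non-target
    (cost c_fa * (1 - p_target)). The smaller of the two is the baseline.
    """
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / default_cost
```

With this normalization a score of 1.0 means the system does no better than a trivial one that ignores the audio, and values below 1.0 indicate useful processing.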

38 Decision Cost Function C_Det (cont’d)
Two parameter tables (values not recovered), one for the earlier evaluations and one for 2010, each with columns C_Miss | C_FalseAlarm | P_Target
Change in 2010 (for core and extended tests) met with some skepticism, but outcome appeared satisfactory

39 CLLR
$$C_{llr} = \frac{1}{2 \log 2} \left( \frac{\sum_{\text{target}} \log(1 + 1/s)}{N_{TT}} + \frac{\sum_{\text{non-target}} \log(1 + s)}{N_{NT}} \right)$$
where the first summation is over target trials, the second is over non-target trials, N_TT and N_NT are the numbers of target and non-target trials, respectively, and s represents a trial’s likelihood ratio
Information theoretic measure made popular in this community by Niko
Covers broad range of performance operating points
George has suggested limiting range to low FA rates
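
A direct transcription of the C_llr formula, as a sketch. It assumes the inputs are likelihood ratios (not log-likelihood ratios) and that both lists are non-empty; the function name `cllr` is illustrative.

```python
import math


def cllr(target_lrs, nontarget_lrs):
    """C_llr from per-trial likelihood ratios s (plain ratios, all > 0).

    The 1 / (2 * log 2) factor converts the natural logs to base 2 and
    averages the target and non-target terms.
    """
    target_term = sum(math.log(1.0 + 1.0 / s) for s in target_lrs) / len(target_lrs)
    nontarget_term = sum(math.log(1.0 + s) for s in nontarget_lrs) / len(nontarget_lrs)
    return (target_term + nontarget_term) / (2.0 * math.log(2.0))
```

A system that outputs s = 1 for every trial carries no information and scores exactly 1.0; well-calibrated, discriminative likelihood ratios push the value toward 0.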

40 Fixed Miss Rate
Suggested in ’96, was primary metric in BEST 2012: FA rate corresponding to 10% miss rate
Easy to understand
Practical for applications of interest
May be viewed as cost of listening to false alarms
For easier conditions, a 1% miss rate now more appropriate
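
One way to compute the false alarm rate at a fixed miss rate from raw scores is sketched below. The thresholding convention (accept when score >= threshold) and the function name are assumptions, not taken from the evaluation plans.

```python
def fa_rate_at_miss_rate(target_scores, nontarget_scores, max_miss_rate=0.10):
    """Return (threshold, false alarm rate) at the chosen miss rate.

    The threshold is the highest target score that keeps the fraction of
    missed targets at or below max_miss_rate. A trial is accepted when its
    score is >= the threshold. Flooring the allowed miss count means the
    realized miss rate can be slightly below the requested one.
    """
    targets = sorted(target_scores)
    allowed_misses = int(max_miss_rate * len(targets))
    threshold = targets[allowed_misses]
    false_alarms = sum(s >= threshold for s in nontarget_scores)
    return threshold, false_alarms / len(nontarget_scores)
```

For the easier conditions mentioned above, the same function can be called with `max_miss_rate=0.01`.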

41 Recording Progress
Difficult to assure test set comparability
  – Participants encouraged to run prior systems on new data
Technology changes
  – In ’96 landline phones predominated, with carbon button or electret microphones
  – Need to explore VOIP
With progress, want to make the test harder
  – Always want to add new evaluation conditions, new bells and whistles
      More channel types, more speaking styles, languages, etc.
  – Externally added noise and reverb explored in 2011 with BEST
Doug’s history slide - updated

42 History Slide

43 Future
SRE12
Beyond

44 SRE12 Plans
Target speakers specified in advance
  – Speakers in recent past evaluations (in the thousands)
  – All prior speech data available for training
  – Some new targets with training provided at evaluation time
  – Test segments will include non-target speakers
New interview speech provided in 16-bit linear PCM
Some test calls collected in noisy environments
Artificial noise added to some test segment data
Will this be an effectively easier id task?
  – Will the provided set of known targets change system approaches?
  – Optional conditions include
      Assume test speaker is one of the known targets
      Use no information about targets other than that of the trial

45 SRE12 Metric
Log-likelihood ratios will now be required
  – Therefore, no hard decisions are asked for
Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other using a target prior an order of magnitude greater (details on next slide)
  – Adds to stability of cost measure
  – Emphasizes need for good score calibration over wide range of log likelihoods
Alternative metrics will be C_llr and C_llr-M10, where the latter is C_llr limited to trials for which P_Miss > 10%

46 SRE12 Primary Cost Function
Niko noted that estimated llr’s making good decisions at a single operating point may not be effective at other operating points; therefore an average of two points is used
Writing DCF as
$$DCF = P_{Miss} + \beta \times P_{FA}, \quad \beta = \frac{C_{FA}}{C_{Miss}} \times \frac{1 - P_{Target}}{P_{Target}}$$
we take as cost function
$$\frac{DCF_1 + DCF_2}{2}$$
where P_Target-1 = 0.01 and P_Target-2 = 0.001, with C_Miss = C_FA = 1 in both cases
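
The averaged cost can be worked through numerically. In this sketch the function names are hypothetical, and the error rates at each operating point are assumed to have been measured separately, since each target prior implies its own decision threshold.

```python
def beta(p_target, c_miss=1.0, c_fa=1.0):
    """Weight on the false alarm rate for a given target prior."""
    return (c_fa / c_miss) * (1.0 - p_target) / p_target


def dcf(p_miss, p_fa, p_target):
    """DCF = P_Miss + beta * P_FA, with C_Miss = C_FA = 1."""
    return p_miss + beta(p_target) * p_fa


def sre12_primary_cost(p_miss_1, p_fa_1, p_miss_2, p_fa_2):
    """Average of the DCFs at P_Target = 0.01 and P_Target = 0.001.

    (p_miss_1, p_fa_1) are the error rates at the first operating point,
    (p_miss_2, p_fa_2) those at the second.
    """
    return (dcf(p_miss_1, p_fa_1, 0.01) + dcf(p_miss_2, p_fa_2, 0.001)) / 2.0
```

With these priors β is about 99 at the first point and about 999 at the second, so a false alarm is weighted roughly ten times more heavily at the second operating point.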

47 Future Possibilities
SRE12 outcome will determine whether pre-specified targets will be further explored
  – Does this make the problem too easy?
Artificially added noise and reverb may continue
HASR12 will indicate whether human-in-the-loop evaluation gains traction
SRE’s have become bigger undertakings
  – Fifty or more participating sites
  – Data volume approaching terabytes (as in BEST)
  – Tens or hundreds of millions of trials
  – Schedule could move to every three years
