Presentation on theme: "Odyssey Singapore 27 June 2012"— Presentation transcript:

1 The NIST Speaker Recognition Evaluations
Alvin F. Martin
Odyssey 2012 @ Singapore, 27 June 2012

2 Outline
Some Early History
Evaluation Organization
Performance Factors
Metrics
Progress
Future

3 Some Early History
Success of speech recognition evaluation showed the benefits of independent evaluation on common data sets
Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard, a multi-purpose corpus collected (~1991) with speaker recognition in mind, followed by Switchboard-2 and similar collections
Linguistic Data Consortium created in 1992 to support further speech (and text) collections in the US
The first "Odyssey" – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
Earlier NIST speaker evaluations in '92 and '95: the '92 evaluation had several sites as part of a DARPA program; the '95 evaluation, with 6 sites, used some Switchboard-1 data; emphasis was on speaker identification rather than open-set recognition

4 (slide image, no transcript text)

5 Martigny 1994
Varying corpora and performance measures made meaningful comparisons difficult

6 Avignon 1998
WORKSHOP RLA2C: la Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques / Speaker Recognition and its Commercial and Forensic Applications, Avignon, avril/april 1998. Soutenu / Sponsored by GFCP - SFA - ESCA - IEEE
TIMIT was the preferred corpus
Sometimes bitter debate over forensic capabilities

7 Avignon Papers

8 Crete 2001
2001: A Speaker Odyssey - The Speaker Recognition Workshop, June 18-22, 2001, Crete, Greece
First official "Odyssey"
More emphasis on evaluation

9 (slide image, no transcript text)

10 (slide image, no transcript text)

11 Toledo 2004
ODYSSEY 2004 - The Speaker and Language Recognition Workshop, May 31 - June 3, 2004, Toledo, Spain (ISCA Archive)
First Odyssey with the NIST SRE workshop held in conjunction at the same location
First to include language recognition
Two notable keynotes on forensic recognition
Well attended
Odyssey held every two years since 2004

12 (slide image, no transcript text)

13 Etc. – Odyssey 2006, 2008, 2010, 2012, …
Odyssey 2008: The Speaker and Language Recognition Workshop, Stellenbosch, South Africa, January 21-24, 2008
Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, 28 June – 1 July 2010

14 Organizing Evaluations
Which task(s)?
Key principles
Milestones
Participants

15 Which Speaker Recognition Problem?
Access Control? Text independent or dependent; prior probability of target high
Forensic? Prior not clear
Person Spotting? Prior probability of target low; text independent
NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of the performance curve

16 Some Basic Evaluation Principles
Speaker spotting is the primary task
Research system oriented
Pooling across target speakers
Emphasis on a low false alarm rate operating point, with both scores and decisions required (calibration matters)

17 Organization Basics
Open to all willing participants
Research-oriented; commercialized competition discouraged
Written evaluation plans
Specified rules of participation
Workshops limited to participants; each site/team must be represented
Evaluation data sets subsequently published by the LDC

18 (slide image, no transcript text)

19 1996 Evaluation Plan (cont’d)

20 1996 Evaluation Plan (cont'd)
1. PROC plots are ROCs plotted on normal probability scales for the miss and false alarm error rates
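The footnote above describes plotting error rates on normal probability scales, which is how DET curves are produced. As an illustration only (not part of the 1996 plan or NIST's tools), a minimal Python sketch of that transform, assuming higher scores indicate the target speaker:

```python
# Illustration only: map miss and false-alarm rates through the inverse
# normal CDF (probit), giving the normal-deviate axes used for DET plots.
import numpy as np
from scipy.stats import norm

def det_coordinates(target_scores, nontarget_scores):
    """Return (false-alarm, miss) rates at every threshold, on normal-deviate axes."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Clip away exact 0 and 1 so the probit transform stays finite
    p_miss = np.clip(p_miss, 1e-6, 1 - 1e-6)
    p_fa = np.clip(p_fa, 1e-6, 1 - 1e-6)
    return norm.ppf(p_fa), norm.ppf(p_miss)
```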

21 DET Curve Paper – Eurospeech ‘97

22 Wikipedia DET Page

23 Some Milestones
1992 – DARPA program limited speaker identification evaluation
1995 – Small identification evaluation
1996 – First SRE in the current series
2000 – AHUMADA Spanish data, first non-English speech
2001 – Cellular data
2001 – ASR transcripts provided
2002 – FBI "forensic" database
2002 – SuperSid Workshop following the SRE
2005 – Multiple languages with bilingual speakers

24 Some Milestones (cont'd)
2005 – Room mic recordings, cross-channel trials
2008 – Interview data
2010 – New decision cost function metric stressing an even lower FA rate region
2010 – High and low vocal effort, aging
2010 – HASR (Human-Assisted Speaker Recognition) evaluation
2011 – BEST evaluation, broad range of test conditions, including added noise and reverb
2012 – Target speakers defined beforehand

25 Participation
Grew from fewer than a dozen sites to 58 in 2010
MIT (Doug) provided workshop notebook covers listing participants
Big increase in participants after 2001
Handling scores of participating sites becomes a management problem

26 NIST 2004 Speaker Recognition Workshop
Taller de Reconocimiento de Locutor, 3-4 June 2004, Toledo, Spain

27 NIST 2006 Speaker Recognition Workshop
San Juan, Puerto Rico, 26-27 June 2006

28 Participating Sites
* Not in SRE series
# Incomplete

29 This slide is from 2001: A Speaker Odyssey in Crete

30 NIST Evaluation Data Set (cont'd)
Year / Common Condition(s) / Evaluation Features:
2002 / One-session training on conv. phone data / Cellular data; alternative tests of extended training, speaker segmentation, and a limited corpus of simulated forensic data
2003 / Cellular data, extended training
2004 / Handheld landline conv. phone speech, English only / Multi-language data with bilingual speakers
2005 / English only with handheld tel. set / Included cross-channel trials with mic. test; both sides of 2-channel convs. provided
2006 / English only trials (including mic. test trials) / Included cross-channel trials with mic. test

31 NIST Evaluation Data Set (cont'd)
Year / Common Condition(s) / Evaluation Features:
2008 / 8 – contrasting English and bilingual speakers, interview and conv. phone speech, along with cross-condition trials / Interview speech recorded over multiple mic channels and conv. phone speech recorded over mic and tel channels; multiple languages
2010 / 9 – contrasting tel and mic channels, interview and conversational phone speech, and high, low, and normal vocal effort / Multiple microphones; phone calls with high, low, and normal vocal effort; aging data (Greybeard); HASR
2012 / 5 – interview test without noise, conv. phone test without noise, interview test with added noise, conv. phone test with added noise, conv. phone test collected in noisy environment / Target speakers specified in advance (from previous evals) with large amounts of training; some test calls collected in noisy environments; phone test data with added noise

32 Performance Factors
Intrinsic
Extrinsic
Parametric

33 Intrinsic Factors
Relate to the speaker
Demographic factors: sex, age, education
Mean pitch
Speaking style: conversational telephone, interview, read text
Vocal effort: some questions about definition and how to collect
Aging: hard to collect sizable amounts of data with years of time separation

34 Extrinsic Factors
Relate to the collection environment
Microphone or telephone channel
Telephone channel type: landline, cellular, VOIP; in earlier times, carbon vs. electret
Telephone handset type: handheld, headset, earbud, speakerphone
Microphone type: matched, mismatched
Placement of microphone relative to speaker
Background noise
Room reverberation

35 “Parametric” Factors
Train/test speech duration: have tested 10 s up to ~half an hour, more in '12
Number of training sessions: have tested 1 to 8, more in '12
Language: English has been predominant, but a variety of others included in some evaluations
Is better performance for English due to familiarity and the quantity of development data?
Cross-language trials are a separate challenge

36 Metrics
Equal Error Rate: easy to understand, but not the operating point of interest
Decision Cost Function: calibration matters
CLLR
FA rate at a fixed miss rate, e.g. 10% (lower for some conditions)
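The equal error rate listed above is the point where the miss and false alarm rates coincide. A minimal illustrative sketch (not from the slides), assuming higher scores indicate the target speaker:

```python
# Illustration only: estimate the equal error rate by scanning all score
# thresholds and finding where miss and false-alarm rates are closest.
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))  # threshold where the two rates cross
    return (p_miss[idx] + p_fa[idx]) / 2.0
```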

37 Decision Cost Function CDet
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)
Weighted sum of the miss and false alarm error probabilities
Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget
Normalized by the best possible cost of a system doing no processing (the minimum of the cost of always deciding "yes" and of always deciding "no")
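As an illustration of the normalized cost described on this slide (a sketch, not NIST's scoring software), with parameter names following the slide and default values taken from the pre-2010 settings shown on the next slide:

```python
# Illustration only: normalized detection cost as defined above.
def normalized_dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    # Weighted sum of the miss and false-alarm error probabilities
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Best cost achievable with no processing: always "yes" or always "no"
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return c_det / c_default

# Example with the pre-2010 parameters (CMiss=10, CFA=1, PTarget=0.01)
print(normalized_dcf(p_miss=0.10, p_fa=0.02))  # -> 0.298
```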

38 Decision Cost Function CDet (cont'd)
Parameters (pre-2010): CMiss = 10, CFalseAlarm = 1, PTarget = 0.01
Parameters (2010): CMiss = 1, CFalseAlarm = 1, PTarget = 0.001
The change in 2010 (for the core and extended tests) met with some skepticism, but the outcome appeared satisfactory

39 CLLR
Cllr = 1/(2·log 2) × [ (Σtarget log(1 + 1/s)) / NTT + (Σnon-target log(1 + s)) / NNT ]
where the first summation is over target trials, the second over non-target trials, NTT and NNT are the numbers of target and non-target trials, respectively, and s represents a trial's likelihood ratio
Information-theoretic measure made popular in this community by Niko
Covers a broad range of performance operating points
George has suggested limiting the range to low FA rates
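A minimal sketch of the Cllr formula above (illustration only, not the official scoring tool), taking likelihood ratios s for target and non-target trials:

```python
# Illustration only: Cllr from trial likelihood ratios.
import numpy as np

def cllr(target_lrs, nontarget_lrs):
    target_lrs = np.asarray(target_lrs, dtype=float)
    nontarget_lrs = np.asarray(nontarget_lrs, dtype=float)
    miss_term = np.mean(np.log(1.0 + 1.0 / target_lrs))   # penalizes small LRs on target trials
    fa_term = np.mean(np.log(1.0 + nontarget_lrs))         # penalizes large LRs on non-target trials
    return (miss_term + fa_term) / (2.0 * np.log(2.0))     # natural logs, scaled to bits

# A system that always outputs LR = 1 (no information) scores Cllr = 1 bit
print(cllr([1.0, 1.0], [1.0, 1.0]))
```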

40 Fixed Miss Rate
Suggested in '96; was the primary metric in BEST 2012
FA rate corresponding to a 10% miss rate
Easy to understand
Practical for applications of interest
May be viewed as the cost of listening to false alarms
For easier conditions, a 1% miss rate is now more appropriate
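As a sketch of how this metric can be computed from detection scores (illustration only; assumes higher scores mean more target-like):

```python
# Illustration only: false-alarm rate at the threshold where the miss rate
# reaches a fixed value (e.g. 10%).
import numpy as np

def fa_at_fixed_miss(target_scores, nontarget_scores, miss_rate=0.10):
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Threshold chosen so roughly the requested fraction of target trials is missed
    threshold = np.quantile(target_scores, miss_rate)
    return float((nontarget_scores >= threshold).mean())
```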

41 Recording Progress
Difficult to assure test set comparability
Participants encouraged to run prior systems on new data
Technology changes: in '96 landline phones predominated, with carbon button or electret microphones; need to explore VOIP
With progress, want to make the test harder
Always want to add new evaluation conditions, new bells and whistles: more channel types, more speaking styles, languages, etc.
Externally added noise and reverb explored in 2011 with BEST
Doug's history slide, updated

42 History Slide

43 Future
SRE12
Beyond

44 SRE12 Plans
Target speakers specified in advance
Speakers in recent past evaluations (in the thousands)
All prior speech data available for training
Some new targets with training provided at evaluation time
Test segments will include non-target speakers
New interview speech provided in 16-bit linear PCM
Some test calls collected in noisy environments
Artificial noise added to some test segment data
Will this be an effectively easier id task?
Will the provided set of known targets change system approaches?
Optional conditions: assume the test speaker is one of the known targets; use no information about targets other than that of the trial

45 SRE12 Metric
Log-likelihood ratios will now be required; therefore, no hard decisions are asked for
Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on the next slide)
Adds to the stability of the cost measure
Emphasizes the need for good score calibration over a wide range of log likelihoods
Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which PMiss > 10%

46 SRE12 Primary Cost Function
Niko noted that estimated LLRs making good decisions at a single operating point may not be effective at other operating points; therefore an average of two points is used
Writing DCF as PMiss + β × PFA, where β = (CFA/CMiss) × (1 − PTarget) / PTarget
We take as the cost function (DCF1 + DCF2) / 2
PTarget-1 = 0.01, PTarget-2 = 0.001, with CMiss = CFA = 1 throughout
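A minimal sketch of this averaged cost (illustration only, not the official SRE12 scoring code), applying the Bayes decision threshold log β implied by each prior to submitted log-likelihood ratios:

```python
# Illustration only: average of two DCFs, each evaluated at the Bayes
# threshold for its target prior, with CMiss = CFA = 1 as on the slide.
import numpy as np

def dcf_from_llrs(target_llrs, nontarget_llrs, p_target):
    beta = (1.0 - p_target) / p_target           # CMiss = CFA = 1
    threshold = np.log(beta)                      # Bayes decision threshold on the llr scale
    p_miss = float((np.asarray(target_llrs) < threshold).mean())
    p_fa = float((np.asarray(nontarget_llrs) >= threshold).mean())
    return p_miss + beta * p_fa

def sre12_primary_cost(target_llrs, nontarget_llrs):
    # Average of the two operating points: PTarget = 0.01 and PTarget = 0.001
    return 0.5 * (dcf_from_llrs(target_llrs, nontarget_llrs, 0.01) +
                  dcf_from_llrs(target_llrs, nontarget_llrs, 0.001))
```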

47 Future Possibilities
SRE12 outcome will determine whether pre-specified targets will be further explored; does this make the problem too easy?
Artificially added noise and reverb may continue
HASR12 will indicate whether human-in-the-loop evaluation gains traction
SREs have become bigger undertakings: fifty or more participating sites, data volume approaching terabytes (as in BEST), tens or hundreds of millions of trials
Schedule could move to every three years

