1 Spoken Term Detection Evaluation Overview
Jonathan Fiscus, Jérôme Ajot, George Doddington
December 14-15, 2006
2006 Spoken Term Detection Workshop
http://www.nist.gov/speech/tests/std

2 Outline
Spoken Term Detection Task
STD definition of a "Search Term"
Expected STD System Architecture
Evaluation process
Evaluation corpus
Future activities

3 The Spoken Term Detection Task
System goal: to find all occurrences of a search term as fast as possible in heterogeneous audio sources
Applications
– Audio mining: indexing and searching huge data stores
Goals of the evaluation
– Understand speed/accuracy tradeoffs
– Understand technology solution tradeoffs: e.g., word vs. phone recognition
– Understand technology issues for the three STD languages: Arabic, English, and Mandarin

4 The Typical Two-Stage STD System Architecture
[Diagram: within the system developer's domain, an Indexer processes the audio and system input files one time, writing pointers into a Metadata Store; a Searcher then runs many times, taking a term list and the Metadata Store and producing detected terms and an excerpt list.]
– The indexer never gets access to the terms
– The searcher never gets access to the audio

5 STD Definition of a Search Term
A search term is a sequence of one or more words represented orthographically
– A narrow, pragmatic definition because of limited corpora, time, and funding resources
Specific term attributes
– Case is ignored
– Homophones are not distinguished: "wind" (air movement) vs. "wind" (to twist); distinguishing them would require phonetically annotated search terms
– Orthographic variants of the same sense are distinguished: "catalogs" is NOT a target for the search term "catalog"; a meta-level application would perform term expansion

6 Outline
Spoken Term Detection Task
STD definition of a "Search Term"
Expected STD System Architecture
Evaluation process
– STD system output
– Defining reference occurrences
– Assessing performance: term occurrence alignment, STD Value, Detection Error Tradeoff curves
– Computation resource profiling
Evaluation corpus
Future enhancements

7 Evaluation Process: STD System Output
For each hypothesized term occurrence, the system reports (an illustrative record structure is sketched below):
– The location in the audio stream of the term
– A "Decision Score" indicating how likely it is that the term exists, with more positive values indicating more likely occurrences
– An "Actual" decision as to whether the detected term is a term: "YES", the putative term is a likely true occurrence, or "NO", it is not
System operation measurements
– Indexing time
– Indexing memory consumption (high water mark)
– Index size (on disk)
– Search time for each term
– Searching memory consumption (high water mark)
– Computing platform benchmarks
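To make the per-occurrence output concrete, here is a minimal sketch of what one hypothesized term occurrence could carry; the field names and types are illustrative assumptions, not the evaluation's actual file format.

```python
# A minimal sketch of one hypothesized term occurrence; field names and
# types are illustrative, not the official STD submission format.
from dataclasses import dataclass

@dataclass
class DetectedTerm:
    term: str            # the search term, as given in the term list
    file_id: str         # which audio file the putative occurrence is in
    begin_time: float    # location in the audio stream (seconds)
    end_time: float
    score: float         # "Decision Score": higher means more likely
    decision: bool       # "Actual" decision: True for YES, False for NO
```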

8 Evaluation Process: Defining Reference Occurrences
Reference occurrences are found by searching human-generated transcripts
– Uses Speech-To-Text evaluation transcriptions
– Each word has been time marked: forced alignments for English transcripts; STT-system-mediated time alignment for Mandarin and Arabic
Rules for locating reference occurrences
– The occurrence must be a contiguous sequence of LEXEMEs; terms cannot contain disfluencies (the utterance "fire m- maker" is not an occurrence of the term "fire maker")
– Terms can't cross speaker and channel boundaries
– The gap between adjacent words in a term must be <= 0.5 second; the 0.5 sec. limit captures all but 0.04% of the reference occurrences found with a 1.0 sec. limit (see the sketch below)
– NON-LEX events are treated as silence
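A minimal sketch of the word-gap rule, assuming transcript words arrive as (token, begin, end) tuples from a single speaker and channel; the representation and matching are simplified for illustration.

```python
# A minimal sketch of the adjacent-word gap rule for locating reference
# occurrences; transcript representation and matching are simplified.
MAX_GAP = 0.5  # seconds allowed between adjacent words of a term

def find_occurrences(words, term_tokens):
    """words: list of (token, begin_sec, end_sec) from one speaker/channel,
    in time order. Returns (begin, end) spans where the term occurs."""
    spans = []
    n = len(term_tokens)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if [w[0].lower() for w in window] != [t.lower() for t in term_tokens]:
            continue  # tokens must match contiguously (case is ignored)
        gaps_ok = all(window[k + 1][1] - window[k][2] <= MAX_GAP
                      for k in range(n - 1))
        if gaps_ok:
            spans.append((window[0][1], window[-1][2]))
    return spans
```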

9 Evaluation Process: Assessing Performance
Challenges: STD isn't a typical detection task
– STD systems operate on continuous audio, so there is no analog to a discrete "trial" (a "trial" is an answer to the question "Is this exemplar an example of object X?")
– System and reference occurrences need to be ALIGNED
Two methods of assessing performance
– Value function: an application model that assigns value to correct output and negative value to incorrect output; a weighted linear combination of Missed Detection and False Alarm probabilities
– Detection Error Tradeoff (DET) curve: a graphical representation of the tradeoff between Missed Detections and False Alarms

10 Evaluation Process: Reference/System Term Alignments
Solved using the Hungarian Algorithm for bipartite graph matching
– Each reference and system term occurrence is a node in the graph
– A minimum-cost one-to-one mapping is found (see the sketch below)
Optimization function
– A linear combination of time congruence and detection score
– Infinite if the gap between the reference and system occurrences is greater than 0.5 seconds
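A sketch of the alignment using SciPy's Hungarian-algorithm solver; the occurrence tuples and the score weighting are illustrative, not NIST's actual cost constants.

```python
# A sketch of the one-to-one reference/system alignment via the Hungarian
# algorithm (SciPy's linear_sum_assignment). Cost terms are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

MAX_GAP = 0.5  # pairings farther apart than this get "infinite" cost

def midpoint(occ):  # occ = (begin_sec, end_sec, ...)
    return 0.5 * (occ[0] + occ[1])

def align(ref_occs, sys_occs, score_weight=0.1):
    big = 1e9  # stand-in for infinity; disallows the pairing
    cost = np.full((len(ref_occs), len(sys_occs)), big)
    for i, r in enumerate(ref_occs):
        for j, s in enumerate(sys_occs):  # s = (begin, end, score)
            gap = abs(midpoint(r) - midpoint(s))
            if gap <= MAX_GAP:
                # linear combination of time congruence and detection score
                cost[i, j] = gap - score_weight * s[2]
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < big]
```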

11 Error Counting Illustration
[Figure: a 10-second timeline showing reference occurrences above system occurrences with their actual decisions (YES 0.7, YES 0.6, NO 0.5, NO 0.4). Counting errors for actual decisions: an aligned "YES" occurrence counts as correct, an unaligned "YES" as a false alarm, and an unaligned reference as a miss; "NO" decisions are ignored.]

12 Evaluation Process: Estimates for Error Rates
Given the alignments, compute the following for each term (see the sketch below):
– Probability of Missed Detections: P_Miss(term, θ) = 1 − N_correct(term, θ) / N_true(term, θ); not defined when N_true(term, θ) = 0
– Probability of False Alarms: P_FA(term, θ) = N_spurious(term, θ) / N_NT(term, θ), where N_NT is the number of non-target trials
– But N_NT is NOT countable, so we used an estimate instead: N_NT(term, θ) = TrialsPerSecond × SpeechDuration − N_true(term, θ), with TrialsPerSecond arbitrarily set to 1 for all languages
– In hindsight, this is an unsatisfactory solution; for the next STD eval, we'd like to use a rate to measure false alarms (e.g., false alarms per hour)
θ is a decision criterion used to differentiate likely vs. unlikely putative term occurrences
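The two estimates translate directly into code; this sketch assumes the counts have already been derived from the alignment.

```python
# A minimal sketch of the per-term error-rate estimates defined above;
# names are illustrative.
def p_miss(n_correct, n_true):
    """P_Miss(term, theta) = 1 - N_correct / N_true."""
    if n_true == 0:
        return None  # not defined when the term never occurs
    return 1.0 - n_correct / n_true

def p_fa(n_spurious, speech_duration_sec, n_true, trials_per_second=1.0):
    """P_FA(term, theta) = N_spurious / N_NT, with the non-target trial
    count estimated as TrialsPerSecond * SpeechDuration - N_true."""
    n_nt = trials_per_second * speech_duration_sec - n_true
    return n_spurious / n_nt
```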

13 Evaluation Process: STD Value Function
Value: an application model that assigns value to correct output and negative value to incorrect output
– Value = 1 is perfect output
– Value = 0 is the score for no output
– Value < 0 is possible
Occurrence-Weighted Value (OWV)
– V = 1.0 = value (benefit) of a correct response
– C = 0.1 = cost of an incorrect response
– Con: frequently occurring terms will dominate; equal weighting by terms would be better

14 Evaluation Process: STD Value Function (cont.)
Term-Weighted Value (TWV)
– Derived from the OWV (won't go into it here); C/V weights the error types and adjusts for the priors
– Restricted to terms with reference occurrences
Choosing θ
– Developers choose a decision threshold for their "Actual Decisions" (all the "YES" system occurrences) to optimize their term-weighted value: called the "Actual Term Weighted Value" (ATWV)
– The evaluation code searches for the system's optimum decision score threshold: called the "Maximum Term Weighted Value" (MTWV); see the sketch below
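The slides skip the TWV formula itself, so the sketch below assumes the commonly published NIST form, with beta derived from C/V and an assumed term prior; treat the constants as placeholders rather than this evaluation's exact derivation.

```python
# Assumed form: TWV(theta) = 1 - mean_over_terms[P_Miss + beta * P_FA],
# with beta = (C/V) * (1/P_term - 1); the prior P_term is an assumption.
def twv(per_term_rates, c_over_v=0.1, p_term=1e-4):
    """per_term_rates: list of (P_Miss, P_FA) pairs, one per term."""
    beta = c_over_v * (1.0 / p_term - 1.0)
    losses = [pm + beta * pf for pm, pf in per_term_rates]
    return 1.0 - sum(losses) / len(losses)

def mtwv(thresholds, rates_at):
    """Sweep decision-score thresholds for the best TWV (the MTWV search).
    rates_at(theta) is a hypothetical callback returning per-term rates."""
    return max((twv(rates_at(t)), t) for t in thresholds)
```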

15 Error Counting Illustration (cont.)
[Figure: the same 10-second timeline as slide 11 (system decisions YES 0.7, YES 0.6, NO 0.5, NO 0.4). Counting errors for actual decisions yields correct, false alarm, and miss, with "NO" decisions ignored. Counting errors for θ = 0.5, as used for DET curves, uses the "NO" decisions as well, so occurrences with scores at or above the threshold count as correct or false alarm regardless of the actual decision.]

16 Evaluation Process: Detection Error Tradeoff (DET) Curves
Plot of P_Miss vs. P_FA
– Axes are warped to the normal deviate scale (see the plotting sketch below)
Term-Weighted DET Curve
– Created by computing term-averaged P_Miss and P_FA over the full range of a system's decision score space
– Confidence estimates are easily used for confidence curves (+/- 2 standard error curves)
– In the example plot, the Maximum Term-Weighted Value (MTWV) occurs at a decision score of -0.595
– The curves stop because systems don't generate complete output
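A minimal sketch of the normal-deviate warping behind a term-weighted DET plot, assuming the term-averaged rates have already been swept over the decision-score range and fall strictly between 0 and 1.

```python
# DET axis warping: plot the probit (inverse normal CDF) of each rate.
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_det(p_fa, p_miss):
    """p_fa, p_miss: parallel lists of term-averaged rates in (0, 1)."""
    x = [norm.ppf(p) for p in p_fa]    # normal deviate (probit) warping
    y = [norm.ppf(p) for p in p_miss]
    plt.plot(x, y)
    plt.xlabel("P_FA (normal deviate scale)")
    plt.ylabel("P_Miss (normal deviate scale)")
    plt.show()
```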

17 Processing Resource Profiling
Since speed is a major concern, participants quantified the processing resources consumed
Five measurements defined for the evaluation
– Indexing time, reported as a rate: processing hours / hour of speech
– Search time, reported as a rate: processing seconds / hour of speech
– Index size (on disk), reported as a rate: megabytes / hour of speech
– Indexing memory consumption (high water mark), reported in megabytes; NIST provided a tool, ProcGraph.pl, to track memory usage
– Searching memory consumption (high water mark), reported in megabytes (see the sketch below)
These measurements are useless without computing platform calibration
– Sites report the output of NBench, a public domain computing speed benchmark
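As an illustration of the high-water-mark quantity being reported, a Unix process can read its own peak resident set size as below; this is only a same-process approximation and does not reproduce ProcGraph.pl.

```python
# A same-process approximation of a memory high-water mark on Unix.
import resource

def peak_rss_megabytes():
    # ru_maxrss is kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
```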

18 Term Selection Procedure
The "Candidate Lists" are made from a list of high-frequency n-grams that are comprised only of low-frequency word constituents; selection is uniform over n-gram tokens (see the sketch below). Pr(occ) of high-frequency items > 0.1% > Pr(occ) of low-frequency items.
Trigrams (target of 100): 90 in-corpus (InC), 10 out-of-corpus (OoC)
– Step 1: Select 90 InC from the "Candidate List" (English selection, as an example: 79 trigrams, 11 4-grams)
– Step 2: Select 10 OoC from current news (10 trigrams)
Bigrams (target of 400): 390 InC, 10 OoC
– Step 1: Expand the selected trigrams to bigram sub-terms to start the InC set (201 bigrams)
– Step 2: Add InC selections from the "Candidate List" to meet the 390 goal (189 bigrams)
– Step 3: Select 10 OoC from current news (20 bigrams, from the OoC trigrams)
Unigrams (target of 500): 465 InC, 25 high frequency, 10 OoC
– Step 1: Expand the selected bigrams to unigram sub-terms to start the InC set (573 unigrams)
– Step 2: Add InC selections from the "Candidate List" to meet the 465 goal (none needed)
– Step 3: Select 10 OoC from current news (12 unigrams, from the OoC bigrams)
– Step 4: Select 25 high-frequency terms that are not stop/function words (27 unigrams)
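A rough sketch of the candidate-list criterion: keep n-grams whose occurrence probability exceeds 0.1% while every constituent word falls below it. The corpus handling here is illustrative.

```python
# Candidate list sketch: high-frequency n-grams made only of low-frequency
# words, using the 0.1% Pr(occ) boundary from the slide.
from collections import Counter

THRESHOLD = 0.001  # the 0.1% boundary between high and low frequency

def candidate_ngrams(tokens, n):
    words = Counter(tokens)
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    n_words, n_grams = sum(words.values()), sum(grams.values())
    return [g for g, c in grams.items()
            if c / n_grams > THRESHOLD                              # frequent n-gram
            and all(words[w] / n_words < THRESHOLD for w in g)]     # rare words
```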

19 2006 STD Corpora
Speech types (with dialects) and term counts by language:
– Arabic: BNews: MSA, ~1 hour; CTS: Levantine, ~1 hour; CONFMTG: none. Terms/reference occurrences: 1101/2433 (diacritized), 937/2807 (non-diacritized)
– English: BNews: American, ~3 hours; CTS: American, ~3 hours; CONFMTG: American, ~2 hours. Terms/reference occurrences: 1100/14421
– Chinese: BNews: Mandarin, ~1 hour; CTS: Mandarin, ~1 hour; CONFMTG: none. Terms/reference occurrences: 1120/3684
Why such a small data set? Answer: that's all we had with high-quality transcripts

20 Future Activities
Additional results analysis
– Transcript accuracy requirements
– Minimum number of terms
– Text normalization
– Methods for finding transcription errors given system output
Modifications to the evaluation infrastructure
– Improvements/clarifications to file definitions
– Modify term-weighted value to include terms without reference occurrences
– Score systems over all domains (not split by domain)
– Add a composite speed/accuracy measure
– Develop methods to enable bigger test sets: statistical analysis

