A Phonetic Search Approach to the 2006 NIST Spoken Term Detection Evaluation Roy Wallace, Robbie Vogt and Sridha Sridharan Speech and Audio Research Laboratory, Queensland University of Technology (QUT), Brisbane, Australia Presented by Patty Liu

2 Outline
Abstract
Introduction
Spoken term detection system: (i) Indexing, (ii) Search
Evaluation procedure: (i) Performance metrics, (ii) Evaluation data
Results and discussion
Conclusions

3 Abstract
The QUT system uses phonetic decoding and Dynamic Match Lattice Spotting to rapidly locate search terms, combined with a neural network-based verification stage. The use of phonetic search means the system is open vocabulary and performs usefully (Actual Term-Weighted Value of 0.23) whilst avoiding the cost of a large vocabulary speech recognition engine.

4 Introduction (1/4)
I. STD Task
In 2006, the National Institute of Standards and Technology (NIST) established a new initiative, the Spoken Term Detection (STD) Evaluation. The STD task (also known as keyword spotting) involves the detection of all occurrences of a specified search term, which may be a single word or a multi-word sequence.

5 Introduction (2/4)
II. Approaches
(i) Word-level transcription or lattice: An LVCSR engine is used to generate a word-level transcription or lattice, which is then indexed in a searchable form. These systems are critically restricted: the terms that can be located are limited to the recognizer vocabulary used at decoding time, so occurrences of out-of-vocabulary (OOV) terms cannot be detected.

6 Introduction (3/4)
(ii) Phonetic search: Searching requires the translation of the term into a phonetic sequence, which is then used to detect exact or closely matching phonetic sequences in the index. This approach also shows promise for languages with limited training resources. The system developed in QUT's Speech and Audio Research Laboratory uses Dynamic Match Lattice Spotting to detect occurrences of phonetic sequences which closely match the target term.

7 Introduction (4/4)
(iii) Fusion: Fusing word-level and phonetic search has been consistently shown to improve performance and also allows for open-vocabulary search. However, this approach does not avoid the costly training, development and runtime requirements associated with LVCSR engines.

8 Spoken term detection system
The system consists of two distinct stages:
I. Indexing: phonetic decoding is used to generate lattices, which are then compiled into a searchable database.
II. Search: a dynamic matching procedure is used to locate phonetic sequences which closely match the target sequence.

9 Spoken term detection system—Indexing (1/2)
I. Indexing
Feature extraction: Perceptual Linear Prediction.
Decoding: (i) a Viterbi phone recognizer is used to generate a recognition phone lattice; (ii) tri-phone HMMs and a bi-gram phone language model are used.
Rescoring: a 4-gram phone language model is used during rescoring.

10 Spoken term detection system—Indexing (2/2)
A modified Viterbi traversal is used to emit all phone sequences of a fixed length, N, which terminate at each node in the lattice. A value of N = 11 was found to provide a suitable trade-off between index size and search efficiency. The resulting collection of phone sequences is then compiled into a sequence database.
A mapping from phones to their corresponding phonetic classes (vowels, nasals, etc.) is used to generate a hyper-sequence database, which is a constrained-domain representation of the sequence database.
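As a rough illustration of the hyper-sequence idea, the minimal sketch below maps phone sequences to phonetic-class sequences; the phone inventory and class labels are invented for the example and are not QUT's actual tables.

```python
# Illustrative phone-to-class table (assumed, not the system's real mapping).
PHONE_CLASS = {
    "aa": "V", "iy": "V", "eh": "V",   # vowels
    "m": "N", "n": "N", "ng": "N",     # nasals
    "s": "F", "z": "F", "f": "F",      # fricatives
    "p": "P", "t": "P", "k": "P",      # plosives
}

def hyper_sequence(phones):
    """Map a phone sequence to its phonetic-class (hyper) sequence."""
    return tuple(PHONE_CLASS[p] for p in phones)

# Sequences differing only within a class collide in the hyper-sequence
# database, so one hyper-sequence lookup prunes candidates before the more
# expensive phone-level dynamic matching.
assert hyper_sequence(["s", "iy", "t"]) == hyper_sequence(["z", "eh", "p"])
```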

11 Spoken term detection system—Search (1/7)
II. Search
(i) Pronunciation Generation
When a search term is presented to the system, the term is first translated into its phonetic representation using a pronunciation dictionary. If any of the words in the term are not found in the dictionary, letter-to-sound rules are used to estimate the corresponding phonetic pronunciations.
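A minimal sketch of this lookup-with-fallback step follows; the dictionary entries and the letter_to_sound stand-in are placeholders, not the actual CMU dictionary or QUT's rules.

```python
# Toy lexicon for illustration only.
LEXICON = {"spoken": ["s", "p", "ow", "k", "ax", "n"],
           "term": ["t", "er", "m"]}

def letter_to_sound(word):
    # Stand-in for real letter-to-sound rules; here, naively one phone per letter.
    return list(word)

def pronounce(term):
    """Translate a (possibly multi-word) search term into a phone sequence,
    falling back to letter-to-sound rules for out-of-dictionary words."""
    phones = []
    for word in term.lower().split():
        phones.extend(LEXICON.get(word) or letter_to_sound(word))
    return phones

# e.g. pronounce("spoken term") uses the lexicon; pronounce("qut") falls back.
```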

12 Spoken term detection system—Search (2/7)
(ii) Dynamic Matching
The task is to compare the target and indexed sequences and emit putative occurrences where a match or near-match is detected. To allow for phone recognition errors, the Minimum Edit Distance (MED) is used to measure inter-sequence distance by calculating the minimum cost of transforming an indexed sequence into the target sequence.

13 Spoken term detection system—Search (3/7)
The MED score of transforming the indexed sequence X = (x_1, ..., x_N) into the target sequence P = (p_1, ..., p_N) is defined as the sum of the cost of each necessary substitution:

MED(X, P) = sum over i = 1..N of c(x_i, p_i)

x_i : indexed phone
p_i : target phone
c(x_i, p_i) : variable substitution costs

14 Spoken term detection system—Search (4/7)
The substitution cost c(x, p) = -log Pr(p | x) represents the information associated with the event that phone p was actually uttered given that phone x was recognised. Pr(p | x) is estimated via Bayes' rule:

Pr(p | x) = Pr(x | p) Pr(p) / Pr(x)

Pr(x | p) : recognition likelihoods
Pr(p) : phone prior probabilities
Pr(x) : emission probabilities
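The sketch below instantiates the cost and score as reconstructed above; the confusion, prior and emission probabilities are illustrative placeholders, not the statistics trained in the paper.

```python
import math

P_CONF = {("t", "t"): 0.8, ("t", "d"): 0.2}   # Pr(recognised x | uttered p), keyed (p, x)
P_PRIOR = {"t": 0.05, "d": 0.04}              # phone priors Pr(p)
P_EMIT = {"t": 0.05, "d": 0.04}               # emission probabilities Pr(x)

def sub_cost(x, p, floor=1e-6):
    """c(x, p) = -log Pr(p uttered | x recognised), via Bayes' rule."""
    if x == p:
        return 0.0
    prob = P_CONF.get((p, x), floor) * P_PRIOR.get(p, floor) / P_EMIT.get(x, floor)
    return -math.log(min(prob, 1.0))

def med_score(indexed, target):
    """MED(X, P): sum of position-wise substitution costs."""
    return sum(sub_cost(x, p) for x, p in zip(indexed, target))

# e.g. med_score(["d", "iy"], ["t", "iy"]) penalises only the d/t confusion.
```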

15 Spoken term detection system—Search (5/7)
The incorporation of an acoustic log likelihood ratio allows for differentiation between occurrences with equal MED scores, and promotes occurrences with higher acoustic probability.
For each indexed phone sequence X with a MED score below a specified threshold, let O(X) represent the set of individual occurrences of X, as stored in the index. For each occurrence Y in O(X), the score Score(P, Y) is formed by linearly fusing the MED score with an estimated acoustic log likelihood ratio score.
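A minimal sketch of that linear fusion, assuming hypothetical weights (the paper does not publish them here):

```python
def fused_score(med, acoustic_llr, w_med=-1.0, w_llr=1.0):
    """Score(P, Y) sketch: a lower MED and a higher acoustic log likelihood
    ratio both raise the fused score; weights are assumed tuning parameters."""
    return w_med * med + w_llr * acoustic_llr
```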

16 Spoken term detection system—Search (6/7)
Because the index database contains sequences of a fixed length, when searching for a term longer than N = 11 phones, the term must first be split (at syllable boundaries) into several smaller, overlapping sub-sequences. The score for each complete occurrence is approximated by a linear combination of scores from the sub-sequence occurrences.
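One plausible splitting procedure is sketched below; the one-syllable overlap policy is an assumption for illustration, not necessarily the exact scheme used.

```python
N = 11  # fixed length of indexed phone sequences

def split_term(syllables, max_len=N):
    """Split a term, given as a list of syllables (each a list of phones),
    into overlapping sub-sequences of at most max_len phones, breaking only
    at syllable boundaries."""
    subseqs, start = [], 0
    while start < len(syllables):
        phones, end = [], start
        # Greedily pack whole syllables; always take at least one.
        while end < len(syllables) and (
                not phones or len(phones) + len(syllables[end]) <= max_len):
            phones.extend(syllables[end])
            end += 1
        subseqs.append(phones)
        if end >= len(syllables):
            break
        start = max(end - 1, start + 1)  # overlap windows by one syllable
    return subseqs
```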

17 Spoken term detection system—Search (7/7)
(iii) Verification
Because the MED score is not directly comparable between terms of different phone lengths, a final verification stage is required. Longer terms have a higher expected MED score, as they contain more phones which may have been misrecognised.
Using a neural network (single hidden layer, four hidden nodes), Score(P, Y) is fused with the number of phones, Phones(Y), and the number of vowels, Vowels(Y), in the term, to produce a final detection confidence score for each putative term occurrence.
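A forward-pass sketch of such a network follows, assuming sigmoid units; the random weights are stand-ins for the ones the paper trains on development terms.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden nodes
W2, b2 = rng.normal(size=4), 0.0               # 4 hidden nodes -> 1 output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence(score, n_phones, n_vowels):
    """Fuse Score(P, Y), Phones(Y) and Vowels(Y) into a confidence in [0, 1]."""
    x = np.array([score, n_phones, n_vowels], dtype=float)
    h = sigmoid(W1 @ x + b1)
    return float(sigmoid(W2 @ h + b2))
```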

18 Evaluation procedure (1/3)
I. Performance metrics: Term-Weighted Value (TWV)

TWV(theta) = 1 - average over all terms of { P_miss(term, theta) + beta * P_FA(term, theta) }

P_miss(term, theta) = 1 - N_correct(term, theta) / N_true(term)
P_FA(term, theta) = N_spurious(term, theta) / N_NT(term)

N_NT(term) = T_speech - N_true(term) : the number of non-target trials
T_speech : the number of seconds of speech in the test data
N_true(term) : the number of true occurrences of term
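A minimal sketch of this metric at a fixed threshold, assuming the NIST STD 2006 convention of one non-target trial per second of speech and the commonly cited weighting beta = 999.9:

```python
def twv(results, t_speech, beta=999.9):
    """results: {term: (n_true, n_correct, n_spurious)} at one threshold;
    t_speech: seconds of speech in the test data."""
    penalties = []
    for n_true, n_correct, n_spurious in results.values():
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_spurious / (t_speech - n_true)  # non-target trials
        penalties.append(p_miss + beta * p_fa)
    return 1.0 - sum(penalties) / len(penalties)

# e.g. twv({"greenhouse": (10, 8, 2)}, t_speech=3 * 3600)
```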

19 Evaluation procedure (2/3)
Selecting an operating point using the confidence score threshold, theta, allowed for the creation of Detection Error Trade-off (DET) plots and calculation of a Maximum TWV. In addition, a binary "Yes/No" decision output by the system for each putative occurrence was used to calculate Actual TWV.
Possible TWV values:
1 : a perfect system
0 : no output
negative values : systems which output many false alarms

20 Evaluation procedure (3/3)
II. Evaluation data
English evaluation data:
3 hours of American English broadcast news (BNews)
3 hours of conversational telephone speech (CTS)
2 hours of conference room meetings (these results will not be discussed, as training resources were not available for that particular domain)
The evaluation term list included 898 terms for the BNews data and 411 for the CTS data. Each term consisted of between one and four words, with a varying number of syllables.
BNews : 2.5 occurrences per term per hour
CTS : 4.8 occurrences per term per hour

21 Results and discussion (1/3)
I. Training data
Two sets of tied-state, 16-mixture tri-phone HMMs (one for BNews and one for CTS) were trained for speech recognition using the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus and the CSR-II (WSJ1) corpus, then adapted using 1997 English Broadcast News (HUB4) for the BNews models and Switchboard-1 Release 2 for the CTS models.
Phone bi-gram and 4-gram language models, and phonetic confusion statistics, were trained using the same data. Overall, around 120 hours of speech were used for the BNews models and around 160 hours for the CTS models.
Letter-to-sound rules were generated from the Carnegie Mellon University Pronouncing Dictionary.
Neural network training examples were generated by searching for 1965 development terms and using the resulting (Score(P, Y), Phones(Y), Vowels(Y), y(P, Y)) tuples, where the class label y was set to 1 for true occurrences and 0 for false alarms.

22 Results and discussion (2/3)
II. Overall results
Table 1 lists the Actual and Maximum TWV for each source type, along with the 1-best phone error rate (PER) of the phonetic decoder on development data similar to that used in the evaluation.
III. Effect of term length
Short phonetic sequences are difficult to detect, as they are more likely to occur as parts of other words, and detections must be made based on limited information.

23 Results and discussion (3/3)
IV. Processing efficiency
V. Use of letter-to-sound rules
The system described uses very little word-level information to perform decoding, indexing and search. In fact, in the absence of the pronunciation dictionary, no word-level information is directly used at all.

24 Conclusions
The phonetic search system presented demonstrates that phonetic search can lead to useful spoken term detection performance. However, further improvements to verification and confidence scoring, particularly for short search terms, are required to compete with systems which incorporate an LVCSR engine.
The system allows for completely open-vocabulary search, avoiding the critical out-of-vocabulary problem associated with word-level approaches.
The feasibility of using phonetic search for languages with limited training data, or for large-scale data mining applications, is also a promising area for further research.