Detection of Recognition Errors and Out of the Spelling Dictionary Names in a Spelled Name Recognizer for Spanish


R. San-Segundo, J. Macías-Guarasa, J. Ferreiros, P. Martín, J.M. Pardo
Grupo de Tecnología del Habla, Universidad Politécnica de Madrid, Spain
{lapiz, macias, jfl, ppajaro,

SYSTEM ARCHITECTURE

Previous work, presented at ICSLP'00:
- Description of the spelling task for Spanish.
- Recognition of continuously spelled names over the telephone line.
- Comparison of different recognition architectures: two-level, integrated, and hypothesis-verification.
- Integration of noise models in all recognition architectures.

In this paper:
- New adjustments in the hypothesis-verification architecture.
- Confidence measures for detecting recognition errors and out-of-dictionary names.
- Neural networks for combining confidence features into a single confidence measure.

EXPERIMENTAL SETUP

Recognition experiments (dictionaries of 1,000, 5,000 and 10,000 names):
- 2,100 utterances in the training set.
- 300 utterances in evaluation set 1: calculating penalty values.
- 300 utterances in evaluation set 2: development.
- 300 utterances in the testing set.
- 6-round-robin training; the results presented are the average over all rounds.

Confidence annotation experiments (dictionary of 10,000 names):
- 1,200 utterances in the training set, used to train the neural networks.
- 300 utterances in the evaluation set.
- 300 utterances in the testing set.
- 6-round-robin training; the results presented are the average over all rounds.

SUMMARY

- New version of the Spanish spelled name recognizer over the telephone.
- More than 90.0% recognition rate for a 10,000-name dictionary.
- Proposed features for confidence annotation in hypothesis-verification systems.
- 57.9% of incorrectly recognized names and 68.3% of names out of the spelling dictionary are detected at a 5% false rejection rate.
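The 6-round-robin protocol described above can be sketched as follows; this is an illustrative partitioning scheme under the assumption of a 1,800-utterance pool, not the authors' exact split:

```python
def round_robin_splits(utterances, n_rounds=6):
    """Partition the utterance list into n_rounds blocks and rotate
    which block is held out; metrics are then averaged over rounds,
    as in the 6-round-robin training described above."""
    blocks = [utterances[i::n_rounds] for i in range(n_rounds)]
    for r in range(n_rounds):
        held_out = blocks[r]                                  # this round's test block
        train = [u for i, b in enumerate(blocks) if i != r for u in b]
        yield train, held_out

# Illustrative use: collect held-out block sizes over the six rounds.
data = list(range(1800))              # stand-in utterance IDs
sizes = [len(te) for _, te in round_robin_splits(data)]
```

With 1,800 utterances and six rounds, each round holds out 300 utterances and trains on the remaining 1,500, matching the set sizes quoted on this slide.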
CONCLUSIONS

- Best feature for detecting recognition errors: CSD-3.
- Best feature for detecting names out of the dictionary: SR-3.
- Discriminating between recognition errors and out-of-dictionary names is a difficult task.

RECOGNITION RESULTS

[Table: recognition results for the 1,000-, 5,000- and 10,000-name dictionaries. The average confusion (in parentheses) is the average number of pairs from the dictionary that differ by only one letter substitution; M is the number of candidates passed from the hypothesis stage to the verification stage.]

FEATURES

Hypothesis stage, from the HMM recognizer (F-1):
- Best Score (BS-1): acoustic score of the 1st letter sequence divided by the number of frames.
- Score Difference (SD-1): acoustic score difference between the 1st and 2nd letter sequences, divided by the number of frames.

Hypothesis stage, from the DP alignment (F-2):
- Best Cost (BC-2): lowest alignment cost between the N-best letter sequences and the names of the dictionary, divided by the length of the 1st letter sequence.
- Cost Difference (CD-2): difference between the two best alignment costs, divided by the length of the 1st letter sequence.
- Cost Mean (CM-2): average cost over the 50 best alignment costs, divided by the length of the 1st letter sequence.
- Cost Variance (CV-2): cost variance over the 50 best alignment costs, divided by the length of the 1st letter sequence.

Verification stage (F-3):
- Candidate Score (CS-3): acoustic score of the best candidate name obtained after the verification stage, divided by the number of frames.
- Candidate Score Difference (CSD-3): acoustic score difference between the two best candidates obtained in the verification stage, divided by the number of frames.
- Candidate Score Mean (CSM-3): average score over the 50 best candidate names, divided by the number of frames.
- Candidate Score Variance (CSV-3): score variance over the 50 best candidate names, divided by the number of frames.
- Score Ratio (SR-3): difference between the score of the 1st letter sequence (hypothesis stage) and the score of the best candidate name (verification stage), divided by the number of frames.

[Block diagram: speech analysis feeds the HYPOTHESIS STAGE, where an HMM recogniser (letter models, letter n-gram, letter graph) produces the N-best letter sequences, and a DP alignment against the dictionary, using trained penalty values, produces the M-best names; the VERIFICATION STAGE then applies a constrained grammar over these candidates and outputs the recognised name.]

- RASTA-PLP parameterisation.
- 40 continuous letter models: 30 standard pronunciations, 6 second pronunciations and 4 noise models.
- The penalty values in the DP alignment have been trained on an evaluation set.
- Verification stage: constrained grammar considering possible noise models between letters.

RECOGNITION ERRORS DETECTION

Baseline classification error: 9.7%. Correct detection rates at 2.5% and 5.0% false rejection were measured for each feature of the F-2 and F-3 sets, ranging from 7.1%/12.9% for the weakest feature up to 46.7%/57.4% for the best single feature. Combining the F-2 and F-3 sets gives 44.7% detection at 2.5% and 57.9% at 5.0% false rejection, and reduces the classification error to 7.5%.

OUT OF DICTIONARY DETECTION

Baseline classification error: 21.5%. Individual features range from 2.9%/5.7% up to 53.5%/67.9% correct detection at 2.5%/5.0% false rejection. With the combined F-2 and F-3 sets, 56.2% of out-of-dictionary names are detected at 2.5% false rejection and 68.3% at 5.0%, and the classification error drops to 10.9%.

RECOGNITION ERRORS AND OUT OF DICTIONARY DETECTION

Baseline classification error: 29.2%. The combined F-2 and F-3 sets reach 54.8% correct detection at 2.5% false rejection and 65.8% at 5.0%, with a 13.1% classification error.

Confusion matrix for name classification as Correctly Recognized Name (CRN), Incorrectly Recognized Name (IRN) or Out of Dictionary Name (ODN); rows are the true class, columns the assigned class, with utterance counts in parentheses:

         CRN            IRN           ODN
CRN      94.9% (1213)   0.9% (13)     4.2% (52)
IRN      18.0% (25)     49.7% (68)    32.3% (44)
ODN      24.0% (92)     3.6% (14)     72.4% (279)
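The duration-normalized score features defined in the FEATURES section can be sketched as below; the helper names are illustrative, and `scores` is assumed to be a best-first list of acoustic scores:

```python
def best_score(scores, n_frames):
    """BS-1 / CS-3 style: score of the best hypothesis per frame."""
    return scores[0] / n_frames

def score_difference(scores, n_frames):
    """SD-1 / CSD-3 style: difference between the two best scores,
    normalized by the number of frames."""
    return (scores[0] - scores[1]) / n_frames

def score_mean_var(scores, n_frames, k=50):
    """CSM-3 / CSV-3 style: mean and variance over the k best scores,
    each normalized by the number of frames."""
    top = [s / n_frames for s in scores[:k]]
    mean = sum(top) / len(top)
    var = sum((s - mean) ** 2 for s in top) / len(top)
    return mean, var
```

The F-2 cost features follow the same pattern, with alignment costs in place of acoustic scores and the length of the 1st letter sequence as the normalizer.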
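The neural network that merges these features into a single confidence measure can be caricatured by one sigmoid unit; the weights below are made up for illustration, not the trained ones:

```python
import math

def combined_confidence(features, weights, bias=0.0):
    """Map a feature vector (BS-1, SD-1, BC-2, ...) to a single
    confidence value in [0, 1] via a logistic unit."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A single utterance is then accepted or flagged (as a recognition error or out-of-dictionary name) by comparing this value against a threshold.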
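The DP alignment in the hypothesis stage is essentially a weighted edit distance between the recognized letter sequence and each dictionary name; a minimal sketch, with uniform placeholder penalties standing in for the penalty values trained on the evaluation set:

```python
def dp_alignment_cost(letters, name, ins_pen=1.0, del_pen=1.0, sub_pen=1.0):
    """Weighted edit distance between a recognized letter sequence and a
    dictionary name; the penalty values here are placeholders."""
    n, m = len(letters), len(name)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * del_pen          # delete all remaining letters
    for j in range(1, m + 1):
        cost[0][j] = j * ins_pen          # insert all remaining letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if letters[i - 1] == name[j - 1] else sub_pen
            cost[i][j] = min(cost[i - 1][j] + del_pen,
                             cost[i][j - 1] + ins_pen,
                             cost[i - 1][j - 1] + sub)
    return cost[n][m]

# Rank dictionary names by alignment cost to get the M-best candidates.
dictionary = ["garcia", "garzia", "gracia"]   # toy dictionary
hyp = "garcia"                                # hypothetical 1-best letters
m_best = sorted(dictionary, key=lambda w: dp_alignment_cost(hyp, w))
```

In the real system the per-letter penalties would differ per operation and letter pair, which is exactly what the evaluation-set training tunes.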
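Operating points like "57.9% detection at 5% false rejection" come from thresholding the confidence measure; a sketch of how such a threshold can be chosen on held-out data (all scores below are invented):

```python
def threshold_at_false_rejection(correct_scores, max_fr=0.05):
    """Pick the highest threshold that rejects at most max_fr of the
    correctly recognized names (false rejections); utterances scoring
    below it are flagged as errors or out-of-dictionary names."""
    ranked = sorted(correct_scores)
    k = int(max_fr * len(ranked))     # number of correct names we may lose
    return ranked[k]

def detection_rate(error_scores, thr):
    """Fraction of true errors whose confidence falls below thr."""
    return sum(s < thr for s in error_scores) / len(error_scores)
```

Sweeping `max_fr` and re-measuring `detection_rate` traces out the detection/false-rejection trade-off reported in the tables above.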