Acoustic Model Adaptation Based on Pronunciation Variability Analysis for Non-Native Speech Recognition. Yoo Rhee Oh, Jae Sam Yoon, and Hong Kook Kim, Dept. of Information and Communications, Gwangju Institute of Science and Technology (GIST), Gwangju 500-712, Korea. ICASSP 2006, SLP-L6: Advances in LVCSR Algorithms. Presenter: Ting-Wei Hsu
2 Reference: John J. Morgan, "Making a Speech Recognizer Tolerate Non-native Speech through Gaussian Mixture Merging," Proceedings of InSTIL/ICALL2004 – NLP and Speech Technologies in Advanced Language Learning Systems, Venice, 17-19 June 2004
3 Outline 1. Introduction 2. Effect of non-native speech 3. Acoustic model adaptation for non-native speech recognition 4. Experiments 5. Conclusion
4 1. Introduction With internationalization, people increasingly have occasion to use a language other than their mother tongue. There is also a growing demand for automatic systems that use speech recognition. –Ex: Computer assisted language learning (CALL) … However, the performance of an automatic speech recognition (ASR) system degrades significantly when it is tested on non-native speech rather than native speech. The main reason: –the target language, on which the speech recognition system has already been trained, and –the mother tongue of the non-native speaker have different pronunciation spaces for vowel and consonant sounds.
5 1. Introduction (cont.) An ASR system for non-native speech requires some kind of adaptation to compensate for this mismatch. –Pronunciation modeling: making the non-native speech recognition system include, for each word, the pronunciation variants produced by non-native speakers –Acoustic modeling: adapting the acoustic models with an adaptation method such as MLLR or MAP adaptation –Language modeling: adapting the language model. These approaches can be combined for further improvement.
6 1. Introduction (cont.) In this paper, the pronunciation variability is first investigated, and then acoustic model adaptation is performed for the phonetic units. Pronunciation variability –Modeled by a phoneme confusion matrix for pronunciation from native to non-native speech. –Clustering the states of the acoustic models of the target language. Acoustic model adaptation –Tying the states of the variant units. –Increasing the number of mixtures in each acoustic model.
7 3. Acoustic model adaptation for non-native speech recognition [Figure: overview of the adaptation procedure. Surviving labels, in order: (AM0), English, Single mixture, (AM1), English/Korean, (AM2)]
8 2. Effect of non-native speech 2.1. English baseline ASR –Training set: Wall Street Journal database (WSJ0), 7,138 utterances, recorded with a microphone and sampled at 16 kHz –Recognition feature: 12 mel-frequency cepstral coefficients (MFCC) with a logarithmic energy for every 10 ms analysis frame, forming a 39-dimensional feature vector. Cepstral mean normalization and energy normalization are applied to the feature vectors. –Acoustic models: 3-state, left-to-right, context-dependent, 4-mixture, cross-word triphone models, trained with the HTK version 3.2 toolkit
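The slide gives 12 MFCCs plus log energy (13 static values) and a 39-dimensional vector but does not say how 13 becomes 39. A common front end appends first and second time derivatives (13 × 3 = 39); the sketch below assumes that is what was done here. The function names, the regression window of 2 frames, and the random stand-in data are all illustrative assumptions, and energy normalization is left out.

```python
import numpy as np

def delta(feat, window=2):
    """Regression-based time derivative over +/- `window` frames (edges replicated)."""
    num_frames = len(feat)
    padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(feat)
    for k in range(1, window + 1):
        ahead = padded[window + k : window + k + num_frames]
        behind = padded[window - k : window - k + num_frames]
        out += k * (ahead - behind)
    return out / denom

def cepstral_mean_normalize(feat):
    """Subtract the per-utterance mean of every coefficient."""
    return feat - feat.mean(axis=0, keepdims=True)

def make_features(static):
    """Stack static, delta, and delta-delta coefficients (assumed layout)."""
    static = cepstral_mean_normalize(static)
    d1 = delta(static)
    d2 = delta(d1)
    return np.hstack([static, d1, d2])

# Stand-in for real data: 100 frames (10 ms each) of 12 MFCCs + log energy.
rng = np.random.default_rng(0)
static = rng.normal(size=(100, 13))
features = make_features(static)
print(features.shape)  # (100, 39)
```

Mean subtraction is applied before the derivatives are taken; since the derivative of a constant offset is zero, the delta streams come out the same either way.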
9 2. Effect of non-native speech (cont.) 2.2. Speech database for English spoken by Koreans –Korean-Spoken English Corpus (K-SEC): development set (Korean), evaluation set 1 (Korean), evaluation set 2 (English) 2.3. Effect of native and non-native speech on the baseline ASR –The lexicon is taken only from the text of the test set (CMU pronunciation dictionary). –A back-off bigram is used as the language model.
10 3. Acoustic model adaptation for non-native speech recognition 3.1 Analysis of the pronunciation variability –Data-driven approach: obtain the relationship between the target pronunciation and the incorrectly recognized pronunciation, then generate a phoneme confusion matrix. –Pronunciation variants are selected from the matrix with a threshold: 0.16 for consonants, 0.13 for vowels. [Figure: recognition step producing a confusion matrix over phonemes A, B, C, with entries marked p(i) and p(j); labels "pronunciation variants" and "Native speech"]
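The selection step on this slide can be sketched in a few lines. The code assumes the target and recognized phoneme sequences have already been aligned into (target, recognized) pairs, that matrix entries are row-normalized relative frequencies, and that a confusion counts as a variant when its entry is at or above the threshold; none of these details is stated on the slide. Only the two threshold values come from the source. The vowel set and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict

VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "uw", "uh"}  # illustrative subset
THRESHOLDS = {"vowel": 0.13, "consonant": 0.16}  # values from the slide

def confusion_matrix(pairs):
    """Row-normalized confusion matrix from (target, recognized) phoneme pairs."""
    counts = defaultdict(Counter)
    for target, recognized in pairs:
        counts[target][recognized] += 1
    return {
        target: {rec: n / sum(row.values()) for rec, n in row.items()}
        for target, row in counts.items()
    }

def select_variants(matrix):
    """Keep off-diagonal entries at or above the class-specific threshold."""
    variants = []
    for target, row in matrix.items():
        kind = "vowel" if target in VOWELS else "consonant"
        for recognized, prob in row.items():
            if recognized != target and prob >= THRESHOLDS[kind]:
                variants.append((target, recognized, prob))
    return sorted(variants, key=lambda v: -v[2])

# Toy alignment: /iy/ is heard as /ih/ 4 times in 10, /p/ as /f/ once in 10.
pairs = [("iy", "iy")] * 6 + [("iy", "ih")] * 4 + [("p", "p")] * 9 + [("p", "f")]
variants = select_variants(confusion_matrix(pairs))
print(variants)  # [('iy', 'ih', 0.4)]
```

In the toy data the /p/→/f/ confusion (0.1) falls below the consonant threshold and is dropped, while /iy/→/ih/ (0.4) clears the vowel threshold.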
11 3. Acoustic model adaptation for non-native speech recognition (cont.) 3.2. Proposed acoustic model adaptation for non- native speech recognition –state tying /iy/→/ih/ /p/
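As a rough illustration of what state tying means for a pair such as /iy/→/ih/, the toy sketch below makes two models refer to the same state objects, so that any later re-estimation of a shared state affects both. The `State` class, the dictionary layout, and the choice to keep the target model's parameters as the shared ones are assumptions for illustration; the slide does not say how the tied parameters are initialized, and this is not HTK's interface.

```python
from dataclasses import dataclass

@dataclass
class State:
    """One HMM emitting state with a single diagonal Gaussian (toy version)."""
    mean: list
    var: list

def tie_states(models, target, variant):
    """Make each state of `variant` the same object as the matching state of `target`."""
    if len(models[target]) != len(models[variant]):
        raise ValueError("models must have the same number of states")
    models[variant] = list(models[target])  # new list, shared State objects

# Two hypothetical 3-state models with one-dimensional features.
models = {
    "iy": [State([0.0], [1.0]) for _ in range(3)],
    "ih": [State([0.5], [1.0]) for _ in range(3)],
}
tie_states(models, "iy", "ih")

models["iy"][1].mean[0] = 0.2    # an update made through one model...
print(models["ih"][1].mean[0])   # ...is visible through the other: 0.2
```

Because the tied models share parameters, adaptation data for either phoneme contributes to the same state, which is the usual motivation for tying.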
12 4. Experiments (cont.) As a result, six informative pronunciation variants were obtained from the confusion matrix.
14 5. Conclusion We proposed an acoustic model adaptation method for non-native speech recognition. The proposed method, a data-driven approach, first ranks the phonetic units that show the most informative pronunciation variability by recognizing non-native speech with acoustic models trained on native speech.
15 Another way… Koreans speaking English –English model → English model for Koreans: the phoneme confusion matrix is used to do state tying. Americans speaking Arabic –Arabic model → Arabic model for Americans: Hidden Markov Model (HMM) phone sets are trained for English and Arabic, and then the English phones are merged into the Arabic phones to make a new Arabic system. The phone-level transcriptions of the training data had to be relabeled with English phones. The phoneme confusion matrix is used to do model merging (MM) adaptation.
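For comparison with state tying, one simple way to merge two Gaussian mixtures is to pool their components and rescale the weights so they still sum to one. The sketch below shows only that generic operation. Whether the referenced model merging method uses exactly this weighting is an assumption, and the function name, the `native_weight` parameter, and the toy numbers are hypothetical.

```python
def merge_mixtures(native, foreign, native_weight=0.5):
    """Pool two GMMs given as lists of (weight, mean, variance) tuples.

    Native components are scaled by `native_weight` and foreign components by
    1 - `native_weight`, so the merged weights sum to one when the inputs do.
    """
    if not 0.0 <= native_weight <= 1.0:
        raise ValueError("native_weight must lie in [0, 1]")
    merged = [(w * native_weight, m, v) for w, m, v in native]
    merged += [(w * (1.0 - native_weight), m, v) for w, m, v in foreign]
    return merged

# Toy one-dimensional mixtures for an Arabic phone and a confusable English phone.
arabic_phone = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
english_phone = [(1.0, 0.3, 0.8)]
merged = merge_mixtures(arabic_phone, english_phone, native_weight=0.5)
print([w for w, _, _ in merged])  # [0.25, 0.25, 0.5]
```

Unlike tying, merging keeps both sets of Gaussians, so the resulting model has more components per state.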