Towards improved proper name recognition

Bert Réveil and Jean-Pierre Martens
DSSP group, Ghent University, Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
{breveil,martens}@elis.ugent.be

Topic description
-----------------
Automatic proper name recognition is a key component of many speech-based applications (e.g. voice-driven navigation systems). Recognition is challenged by the mismatch between the way the names are represented in the recognizer and the way they are actually pronounced:
- Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters cannot cope with archaic spellings and foreign name parts (e.g. Ugchelsegrensweg, Haînautlaan), and manual transcriptions are too costly.
- Multiple plausible name pronunciations, within or across languages (e.g. Roger).
- Cross-lingual pronunciation variation: foreign names, foreign application users.
To improve the phonemic transcriptions and to capture the pronunciation variation, we adopt acoustic and lexical modeling approaches. Acoustic modeling targets a better modeling of the expected utterance sounds. Lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon.

Construction of phoneme-to-phoneme converters
---------------------------------------------
P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV, defined under Experimental set-up below) of a sufficiently large collection of name utterances. These 3-tuples are supplied to a 4-step training procedure:
1. Two-fold alignment: orthography ↔ initial transcription ↔ target transcription.
2. Transformation retrieval.
3. Generation of training examples that describe the linguistic context:
   - previous and next phonemes and graphemes
   - lexical context (part of speech)
   - prosodic context (stressed syllable or not)
   - morphological context (word prefix/suffix)
   - external features, e.g. name type, name source, speaker tongue
4. Rule induction: a decision tree is learned per input pattern, with stochastic rules in its leaf nodes. Rule formalism: if context → leaf node, then [input pattern] → [output pattern] with probability P.
In generation mode, the rules are applied to the initial G2P transcription of an unseen name, yielding pronunciation variants with probabilities (illustrated in the sketch below).
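To make the rule formalism concrete, the following minimal Python sketch shows how stochastic [input pattern] → [output pattern] rules could be applied in generation mode. It is illustrative only: the rule set and probabilities are invented, the real system selects the applicable leaf-node rules by descending a per-pattern decision tree over the context features listed above, and the phoneme strings merely reuse the Austin transcriptions that appear in the recognition-system figure further down.

    # Illustrative only: stochastic P2P rewrite rules applied in generation mode.
    # Rule patterns and probabilities are invented; in the real system the rules
    # sit in decision-tree leaves reached via the context features listed above.
    from itertools import product

    # input pattern -> list of (output pattern, probability)
    RULES = {
        "'O.": [("'O.", 0.7), ("'A&u.", 0.3)],  # hypothetical rule for 'Austin'
    }

    def p2p_variants(initial, rules, max_variants=4):
        """Expand an initial G2P transcription (a list of phoneme patterns)
        into its most probable variant transcriptions."""
        candidates = [([], 1.0)]
        for unit in initial:
            options = rules.get(unit, [(unit, 1.0)])  # identity if no rule fires
            candidates = [(seq + [out], p * q)
                          for (seq, p), (out, q) in product(candidates, options)]
        candidates.sort(key=lambda c: -c[1])
        return [("".join(seq), p) for seq, p in candidates[:max_variants]]

    print(p2p_variants(["'O.", "s", "t", "I", "n"], RULES))
    # [("'O.stIn", 0.7), ("'A&u.stIn", 0.3)]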
Experimental set-up
-------------------
Database: Autonomata Spoken Name Corpus (ASNC)
- 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers.
- Every speaker reads 181 names of Dutch, English, French, Moroccan or Turkish origin.
- Non-overlapping train and test sets (disjoint names and speakers).
- Human expert transcriptions:
  - TY: typical Dutch transcription (one for each name, from TeleAtlas)
  - AV: auditorily verified Dutch transcription (one for each name utterance)
- This work: only Dutch native utterances + non-native utterances of Dutch names.

Speech recognizer: the state-of-the-art VoCon 3200 from Nuance.
Grammar: a name loop with 21K different names (3.5K ASNC names + 17.5K others).

[Figure: recognition system — a GPS application passes the spoken query ("Please guide me towards ['A&u.stIn]") to the recognizer, which combines HMM acoustic models with a lexicon mapping each name to phonemic transcriptions, e.g. Austin → 'O.stIn.]

Table 1: Number of utterances for all (speaker, name) pairs in the train and test sets

  (spkr, name)   Set     DU     EN     FR     MO     TU
  (DU,*)         train   9960   1909   966    1245   943
                 test    4440   851    414    555    437
  (*,DU)         train   9960   3000   1680   3360   1560
                 test    4440   1800   720    1440   840
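The speaker counts above and the Dutch-name columns of Table 1 are mutually consistent: every speaker reads 120 of their 181 names as Dutch names. The small check below is not part of the poster; it only redoes that arithmetic with the numbers quoted above.

    # Sanity check: (train + test) utterances of Dutch names per speaker group
    # in Table 1, divided by the number of speakers in that group (from the
    # set-up above), gives the same number of Dutch names per speaker.
    speakers = {"DU": 120, "EN": 40, "FR": 20, "MO": 40, "TU": 20}
    dutch_name_utts = {              # (*,DU) rows of Table 1: train + test
        "DU": 9960 + 4440,
        "EN": 3000 + 1800,
        "FR": 1680 + 720,
        "MO": 3360 + 1440,
        "TU": 1560 + 840,
    }
    for tongue, n_utts in dutch_name_utts.items():
        print(tongue, n_utts / speakers[tongue])   # 120.0 for every group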
Acoustic and lexical modeling strategies
----------------------------------------
The modeling approaches are first conceived for the primary target users, also called the native (NAT) users (in our case Dutch natives). With respect to these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1) and other foreign languages (NN2).

Strategy 1: Incorporating NN1 language knowledge
- Acoustic modeling: two model sets
  - AC-MONO: standard NAT Dutch model (trained on Dutch speech alone)
  - AC-MULTI: trained on Dutch (20%) and NN1 data (English, French and German)
- Lexical modeling:
  - G2P transcribers for the NAT and NN1 languages (Nuance RealSpeak TTS); foreign transcriptions are nativized when combined with AC-MONO
  - Data-driven selection of one extra G2P converter per name origin

Strategy 2: Creating pronunciation variants (lexical modeling)
- Computed per (speaker, name) combination
- Created from the initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters

[Figure: P2P converter training pipeline for the example name Dirk Van Den Bossche (transcriptions rendered in the source as 'dIrK_fAn_dEn_'bO.s$ and 'dirk_vAn_d$m_~bO.s$): orthography, initial transcription and target transcription pass through letter-to-sound and sound-to-sound alignment, transformation learning, example generation with high-level features (including learned morphological classes) and stochastic rule induction.]

Experimental assessment
-----------------------
Incorporating NN1 language knowledge
- Including extra G2P transcriptions (acoustic model = AC-MONO):
  - Boost for (DU, non-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names.
  - Degradation for (DU, DU), which is reduced by selecting only one extra G2P converter.
- Decoding with the multilingual acoustic model:
  - NAT speakers: loss for NAT names, boost for English names only. Dutch sounds are not modeled as well as before; is English better known than French, or do the English and Dutch sound inventories differ more than the French and Dutch ones?
  - Foreign speakers: boost for both NN1 name origins, as the mother-tongue sounds are better modeled.
- Plain multilingual G2P transcriptions bring no improvement.

Creating pronunciation variants
- Baseline P2Ps: Dutch G2P transcriptions as initial transcriptions, AV transcriptions as targets.
- Alternative P2Ps for the (DU, NN1) and (NN1, DU) cells: create an additional P2P that starts from the NN1 G2P transcriptions and combine the most probable variants generated by both P2P converters (see the sketch after this section).
- P2P variants lead to significant improvements for all (speaker, name) cells: 10 to 25% relative for the native speakers (native and foreign names), 5 to 17% relative for the foreign speakers (see the worked example after Table 3).
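One way to read the "combine the most probable variants generated by both P2P converters" step is as a merge of two ranked variant lists that keeps the overall most probable distinct transcriptions. The sketch below illustrates that reading only; all transcriptions and probabilities are invented, the 4-variant cut-off simply mirrors the "+ 4 P2P variants" systems in the tables, and this is not claimed to be the authors' exact selection procedure.

    # Illustrative merge of variants from the baseline P2P (initialised with
    # the Dutch G2P transcription) and the alternative P2P (initialised with
    # the NN1 G2P transcription). Keeps the 4 most probable distinct variants.
    def combine_variants(baseline, alternative, max_variants=4):
        merged = {}
        for transcription, prob in baseline + alternative:
            # keep the highest probability seen for a duplicate transcription
            merged[transcription] = max(prob, merged.get(transcription, 0.0))
        ranked = sorted(merged.items(), key=lambda item: -item[1])
        return [t for t, _ in ranked[:max_variants]]

    baseline_p2p = [("'O.stIn", 0.6), ("'o.stin", 0.2)]       # invented
    alternative_p2p = [("'A&u.stIn", 0.5), ("'O.stIn", 0.3)]  # invented
    print(combine_variants(baseline_p2p, alternative_p2p))
    # ["'O.stIn", "'A&u.stIn", "'o.stin"]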
Table 2: Name Error Rate (%) for systems with G2P lexicons

  (spkr, name)  System                                  DU    EN    FR    MO    TU
  (DU,*)        AC-MONO + DUN G2P                       6.5   38.5  21.3  14.6  28.4
                AC-MONO + 4 G2P (nativized)             7.2   22.7   9.9   9.5  17.2
                AC-MONO + G2P-selection (nativized)     6.5   20.8   7.2   9.0  18.1
                AC-MULTI + G2P-selection (nativized)    8.5   14.9   7.2   8.3  16.2
                AC-MULTI + G2P-selection (plain)        8.5   14.0   7.7   8.6  18.1
  (*,DU)        AC-MONO + DUN G2P                       6.5   25.1  33.2  26.9  40.8
                AC-MONO + 4 G2P (nativized)             7.2   22.8  32.2  27.0  40.6
                AC-MONO + G2P-selection (nativized)     6.5   22.8  31.1  25.3  38.5
                AC-MULTI + G2P-selection (nativized)    8.5   17.6  22.6  25.2  38.6
                AC-MULTI + G2P-selection (plain)        8.5   18.2  22.6  25.8  40.4

Table 3: Name Error Rate (%) for systems with P2P transcription variants

  (spkr, name)  System                                  DU    EN    FR    MO    TU
  (DU,*)        AC-MULTI + G2P-selection (nativized)    8.5   14.9   7.2   8.3  16.2
                + 4 P2P variants (baseline)             7.7   13.2   6.3   7.0  11.9
                + 4 P2P variants (alternative)          7.7   12.2   6.3   7.0  11.9
  (*,DU)        AC-MULTI + G2P-selection (nativized)    8.5   17.6  22.6  25.2  38.6
                + 4 P2P variants (baseline)             7.7   17.2  19.9  24.0  35.2
                + 4 P2P variants (alternative)          7.7   16.4  18.8  24.0  35.2
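The relative figures quoted in the assessment can be recomputed directly from Table 3 by comparing the "+ 4 P2P variants (alternative)" rows with the "AC-MULTI + G2P-selection (nativized)" rows. The helper below is not part of the poster; it only redoes that arithmetic.

    # Relative NER reduction (in %) of the "+ 4 P2P variants (alternative)"
    # systems w.r.t. the "AC-MULTI + G2P-selection (nativized)" baseline,
    # using the Table 3 figures. E.g. native speakers, Turkish names:
    # (16.2 - 11.9) / 16.2 * 100 ~= 26.5% relative.
    def relative_reduction(baseline_ner, variant_ner):
        return 100.0 * (baseline_ner - variant_ner) / baseline_ner

    native = {"DU": (8.5, 7.7), "EN": (14.9, 12.2), "FR": (7.2, 6.3),
              "MO": (8.3, 7.0), "TU": (16.2, 11.9)}          # (DU,*) rows
    foreign = {"DU": (8.5, 7.7), "EN": (17.6, 16.4), "FR": (22.6, 18.8),
               "MO": (25.2, 24.0), "TU": (38.6, 35.2)}       # (*,DU) rows

    for label, cells in (("(DU,*)", native), ("(*,DU)", foreign)):
        print(label, {k: round(relative_reduction(b, v), 1)
                      for k, (b, v) in cells.items()})
    # (DU,*): roughly 9-27% relative; (*,DU): roughly 5-17% relative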


Acknowledgments
---------------
The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas.
