Methodologies for improving the g2p conversion of Dutch names
Henk van den Heuvel, Nanneke Konings (CLST, Radboud Universiteit Nijmegen)
Jean-Pierre Martens (ELIS-DSSP, Universiteit Gent)
NVFW, 9 June

AUTONOMATA project (STEVIN, call 1)
Deliverables
– Name transcription tools described in this presentation
– Spoken name corpus to support research in ANR
Partners
Phonemic transcription of names
A good name transcription is imperative for the success of many speech-based services: directory assistance, car navigation, etc.
Problem: general-purpose g2ps often perform poorly on names
Possible solutions
– train dedicated g2ps from name-type-specific pronunciation dictionaries
– train dedicated p2ps that aim to correct the mistakes of one common general-purpose g2p
Advantages of the p2p approach
– the p2p must only model phenomena that are specific to the envisaged name type
– compact p2p converter
– small training lexicon suffices to reach asymptotic performance
– the p2p has access to suprasegmental knowledge (syllabification, stress assignment), since its input is the output of the g2p
Two approaches
1. Inductive: data-driven, MBL
2. Deductive: knowledge-driven, rules
Our data

Training lexicon sizes:
Name type     | NL     | FL
First names   | 19,252 |  7,253
Second names  | 71,987 | 30,025
Toponyms      | 38,121 | 94,727

Sizes of the test sets:
First names: 3,000; Second names: 8,380; Toponyms: 12,480
System architecture
orthography → general-purpose g2p converter → initial phonemic transcription → p2p converter (automatically learned stochastic correction rules) → final phonemic transcription
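The two-stage architecture can be sketched as below. The toy lexicon lookup and the two correction rules are illustrative stand-ins, not the project's actual converters or rules; the transcription notation follows the slides' examples.

```python
# Sketch of the two-stage architecture: a general-purpose g2p produces an
# initial transcription, then a p2p converter applies learned correction
# rules to it. Both components here are hypothetical toys.

def general_purpose_g2p(orthography: str) -> str:
    """Stand-in for a general-purpose g2p converter (lookup-based toy)."""
    toy_lexicon = {"Rodenbachlaan": "'ro.d$.bA.xlan"}
    return toy_lexicon.get(orthography, "")

def p2p_correct(transcription: str, rules) -> str:
    """Apply (pattern, replacement) correction rules left to right."""
    for pattern, replacement in rules:
        transcription = transcription.replace(pattern, replacement)
    return transcription

# Two toy correction rules in the spirit of the later slides:
# /$./ -> /$n./ and /.x/ -> /x./ (a resyllabification).
rules = [("$.", "$n."), (".x", "x.")]

initial = general_purpose_g2p("Rodenbachlaan")
final = p2p_correct(initial, rules)
```

In a real system the p2p rules carry probabilities and context conditions, as described on the next slide; plain string replacement is only the simplest possible realisation.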
Stochastic correction rules
Rule format
– IF (input = F) & (context = C) THEN (output = F’) with probability P
– F, F’ = phonemic patterns (sequences of phonemes)
– C = features describing the phonemic + orthographic context
Rule types
– SS: stress substitution rules (no stress = stress level 0)
– PSD: phonemic pattern substitution & deletion rules
– PI: phonemic pattern insertion rules
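One way to represent such a rule in code is sketched below; the context test, the probability threshold, and the example rule are all illustrative assumptions, not the project's actual rule inventory.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CorrectionRule:
    focus: str                            # input pattern F
    output: str                           # output pattern F'
    context: Callable[[str, int], bool]   # context test C at the match position
    prob: float                           # rule probability P

def apply_rule(transcription: str, rule: CorrectionRule,
               threshold: float = 0.5) -> str:
    """Apply the rule at the first matching position whose context test
    fires, but only if the rule is probable enough."""
    if rule.prob < threshold:
        return transcription
    i = transcription.find(rule.focus)
    if i >= 0 and rule.context(transcription, i):
        return (transcription[:i] + rule.output
                + transcription[i + len(rule.focus):])
    return transcription

# Toy PSD rule: substitute /$/ by /$n/ when a syllable boundary follows.
rule = CorrectionRule(
    focus="$",
    output="$n",
    context=lambda t, i: t[i + 1: i + 2] == ".",
    prob=0.9,
)
```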
Rule learning process (4 steps)
Inputs: orthography, initial transcription, correct transcription
(1) Aligner → (2) Transformation Generator → (3) Example Generator → (4) Rule Learner
(1) Align transcriptions
Two types of alignments
– p-to-p: align initial (I) to correct (C) transcription
– p-to-g: align initial (I) to orthographic (O) transcription
Unified approach
– work with correspondence models: per symbol of I, an associated set of 2nd-transcription units
– an I-phoneme can be lined up with any unit of the other transcription; the cost is low if the unit belongs to the associated set of this phoneme
– an I-prosodic mark can only be lined up with a unit of its associated set, e.g. for p-to-p: the associated set of prosodic marks
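A minimal dynamic-programming aligner in this spirit is sketched below. The cost values (0 for an associated pairing, 2 otherwise, 1 per gap) and the associated sets are illustrative assumptions, not the project's correspondence models.

```python
# Align two transcriptions symbol by symbol. Each symbol of the initial
# transcription I has an associated set of units it may be lined up with
# cheaply; any other pairing costs more, and gaps (empty cells, '~')
# cost 1. Standard edit-distance DP with backtracking.

def align(I, C, assoc, gap="~"):
    """Align sequence I to sequence C; return the two padded sequences."""
    n, m = len(I), len(C)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # line up I[i] with C[j]
                c = 0 if C[j] in assoc.get(I[i], {I[i]}) else 2
                if cost[i][j] + c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = cost[i][j] + c
                    back[i + 1][j + 1] = "pair"
            if i < n and cost[i][j] + 1 < cost[i + 1][j]:  # gap in C
                cost[i + 1][j] = cost[i][j] + 1
                back[i + 1][j] = "del"
            if j < m and cost[i][j] + 1 < cost[i][j + 1]:  # gap in I
                cost[i][j + 1] = cost[i][j] + 1
                back[i][j + 1] = "ins"
    aligned_I, aligned_C, i, j = [], [], n, m
    while i or j:  # trace the cheapest path back to (0, 0)
        op = back[i][j]
        if op == "pair":
            aligned_I.append(I[i - 1]); aligned_C.append(C[j - 1]); i -= 1; j -= 1
        elif op == "del":
            aligned_I.append(I[i - 1]); aligned_C.append(gap); i -= 1
        else:
            aligned_I.append(gap); aligned_C.append(C[j - 1]); j -= 1
    return aligned_I[::-1], aligned_C[::-1]
```

By default a symbol is only associated with itself; passing e.g. `assoc={"$": {"$", "e"}}` makes the /$/-to-"e" pairing free, which is how a p-to-g alignment can be obtained with the same routine.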
(2) Generate transformations from I-to-C
Step 1: stress mark substitutions
I: ’ r o. d $ ~. b A ~. ’2 x l a n
C: ’ r o. d $ n. b A x. ~ ~ l a n
Step 2: remove stress marks
I: r o. d $ ~. b A ~. x l a n
C: r o. d $ n. b A x. ~ l a n
Step 3: phonemic pattern transformations
– left-to-right longest mismatch (prosodic marks ignored)
– omit empty cells (~)
/$/ → /$n/ and /.x/ → /x./
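Step 3 can be sketched as below: scan the aligned, stress-free sequences left to right, collect maximal mismatching runs (not broken by prosodic marks), and drop the empty cells. Attaching a pure insertion to its preceding phoneme, so that /~/→/n/ becomes /$/→/$n/, is my assumption to reproduce the slide's output; it is not spelled out on the slide.

```python
def extract_transformations(ai, ac, gap="~", prosodic=(".",)):
    """From aligned sequences ai (initial) and ac (correct), collect
    left-to-right longest-mismatch transformations, ignoring prosodic
    marks as run breakers and omitting empty cells."""
    def mismatch(k):
        return ai[k] != ac[k]

    transforms, i = [], 0
    while i < len(ai):
        if not mismatch(i):
            i += 1
            continue
        j = i  # extend the run over mismatches and prosodic marks
        while j < len(ai) and (mismatch(j) or ai[j] in prosodic):
            j += 1
        while j > i and not mismatch(j - 1):  # trim trailing matches
            j -= 1
        src = [s for s in ai[i:j] if s != gap]
        dst = [s for s in ac[i:j] if s != gap]
        if not src:  # pure insertion: attach to the preceding phoneme
            src = [ai[i - 1]] + src
            dst = [ai[i - 1]] + dst
        transforms.append(("".join(src), "".join(dst)))
        i = j
    return transforms
```

On the slide's example this yields exactly the two transformations /$/→/$n/ and /.x/→/x./.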
(3) Generate training examples (PSD only)
Segment I into rule inputs & phonemic units
– transformation list = /.x/ → /x./, /$/ → /$n/
– segmentation result (rule inputs highlighted on the slide):
I: r o. d $. b A. x l a n
– rule outputs follow from the I-to-C alignment
Determine the graphemes lined up with each rule input
I: r o. d $ ~. b A. x ~ l a n
O: r o ~ d e n ~ b a ~ c h l aa n
Extract linguistic features describing the context
– left/right phonemic symbol, stress level, 1st grapheme, …
(4) Learning rewrite rules
Goal
– train rules in a hierarchy (decision tree)
– one decision tree per focus
(4) Learning rewrite rules: example tree for focus /s/
R1 = /t/?
– Y: /s./ : 1.0
– N: L1 = SVWL? (SVWL: short vowel)
  – Y: /s./ : 0.8, /.s/ : 0.2
  – N: /s./ : 0.14, /.s/ : 0.86
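The slide's tree for the focus /s/ can be encoded and traversed as follows; the yes/no branch assignment is my reading of the (garbled) slide figure, and the question strings are only labels.

```python
# The decision tree from the slide, as nested dicts: each internal node
# asks a yes/no question about the context (R1 = right neighbour,
# L1 = left neighbour); each leaf holds output probabilities.
tree = {
    "question": "R1 == /t/",
    "yes": {"leaf": {"/s./": 1.0}},
    "no": {
        "question": "L1 is short vowel (SVWL)",
        "yes": {"leaf": {"/s./": 0.8, "/.s/": 0.2}},
        "no": {"leaf": {"/s./": 0.14, "/.s/": 0.86}},
    },
}

def classify(node, answers):
    """Walk the tree using `answers`, a dict mapping question -> bool,
    and return the output distribution at the reached leaf."""
    while "leaf" not in node:
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node["leaf"]
```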
Deductive approach
Use human expert knowledge to qualify and correct transcription errors
– Knowledge sources: morphology, phonology, etymology, language origin
Method:
– Compare g2p and correct transcriptions (train set)
– Quantify the errors (relative rule application rate)
– Qualify the underlying causes of the errors
– Formulate corresponding generic rules (no probabilities)
– Implement these in FONPARS
– Evaluate the result (test set)
Implemented deductive rules
– Removal of superfluous stresses, e.g. ‘van’, ‘de’
– Expansion of contractions, e.g. Ciprianussteeg, Trigoniaerf
– Frisian names, e.g. ‘-dyk’, ‘-wyk’
– Syllabification, e.g. diminutives (k.j$ → .kj$)
– French names
– /n/-deletion, e.g. Van Brienenoordbrug
– Degemination, e.g. Holland, Wellekens
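As an illustration of what one such context-free rule might look like, here is a toy degemination rule; the regex, the consonant inventory, and the example transcription `hOl.lAnt` for ‘Holland’ are my assumptions, not the FONPARS implementation.

```python
import re

def degeminate(transcription: str) -> str:
    """Toy degemination rule: collapse a doubled consonant, possibly
    separated by a syllable boundary '.', into a single consonant,
    keeping the boundary (as in 'Holland')."""
    return re.sub(r"([bcdfghjklmnpqrstvwxz])(\.?)\1", r"\2\1", transcription)
```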
Results on Dutch toponyms (test set)
– Inductive approach somewhat better than deductive approach
– At the cost of more rules
– Room for further improvement

Method          | #Rules | WER (%) | PER (%)
g2p only        |        |         |
g2p + inductive |        |         |
g2p + deductive |        |         |
Future work
– More name types (first names, family names)
– Deductive methodology in synergy with the inductive approach
  – deductive approach AFTER inductive approach
  – define features to expose causes to the learning tools, e.g. syllabification errors caused by not respecting the morphological integrity of entities such as ‘kamp’, ‘dijk’, …
– Compare to plain data-driven approaches
  – TiMBL on geographical names: same WER, lower WIR, huge size compared to p2p
  – more tests needed