1 Methodologies for improving the g2p conversion of Dutch names Henk van den Heuvel, Nanneke Konings (CLST, Radboud Universiteit Nijmegen) Jean-Pierre Martens (ELIS-DSSP, Universiteit Gent)

2 AUTONOMATA project (STEVIN, call 1) [NVFW, 9 June 2006]
Deliverables
– Name transcription tools (described in this presentation)
– Spoken name corpus to support research in ANR
Partners

3 Phonemic transcription of names
Good name transcription is imperative for the success of many speech-based services (directory assistance, car navigation, etc.)
Problem: general-purpose grapheme-to-phoneme converters (g2ps) often perform poorly on names
Possible solutions
– train dedicated g2ps from name-type-specific pronunciation dictionaries
– train dedicated phoneme-to-phoneme converters (p2ps) that aim to correct the mistakes of one common general-purpose g2p

4 Advantages
The p2p only has to model phenomena that are specific to the envisaged name type
→ compact p2p converter
→ small training lexicon suffices to reach asymptotic performance
The p2p has access to the suprasegmental knowledge present in the g2p output (syllabification, stress assignment)

5 Two approaches
1. Inductive: data-driven, MBL
2. Deductive: knowledge-driven, rules

6 Our data
                   NL        FL
First names      19,252     7,253
Second names     71,987    30,025
Toponyms         38,121    94,727
Sizes of the test sets
– First names: 3,000
– Second names: 8,380
– Toponyms: 12,480

7 System architecture
orthography → general-purpose g2p converter → initial phonemic transcription → p2p converter → final phonemic transcription
The p2p converter applies automatically learned stochastic correction rules

8 Stochastic correction rules
Rule format
– IF (input = F) & (context = C) → (output = F’) with probability = P
– F, F’ = phonemic patterns (sequences of phonemes)
– C = features describing the phonemic + orthographic context
Rule types
– SS: stress substitution rules (no stress = stress level 0)
– PSD: phonemic pattern substitution & deletion rules
– PI: phonemic pattern insertion rules
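
The rule format above can be illustrated with a minimal sketch. The class name `CorrectionRule`, the dictionary-based context, the feature name `left_phoneme` and the two example rules (including their probabilities) are hypothetical and chosen only for illustration; the slides do not describe the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionRule:
    """IF input = F and context = C THEN output = F' with probability P."""
    pattern: tuple       # F: input phonemic pattern
    context: dict        # C: feature name -> required value
    output: tuple        # F': output phonemic pattern
    probability: float   # P

    def matches(self, focus, features):
        # The rule applies only if the focus equals F and every context
        # feature named by the rule has the required value.
        return focus == self.pattern and all(
            features.get(name) == value for name, value in self.context.items()
        )


def best_output(rules, focus, features):
    """Return the most probable output among matching rules, else the focus itself."""
    candidates = [r for r in rules if r.matches(focus, features)]
    if not candidates:
        return focus
    return max(candidates, key=lambda r: r.probability).output


# Hypothetical PSD rules for the input pattern /. x/ (values invented).
rules = [
    CorrectionRule((".", "x"), {"left_phoneme": "A"}, ("x", "."), 0.86),
    CorrectionRule((".", "x"), {"left_phoneme": "A"}, (".", "x"), 0.14),
]
print(best_output(rules, (".", "x"), {"left_phoneme": "A"}))  # ('x', '.')
```

Picking the single most probable output is one possible decision strategy; keeping all outputs with their probabilities would yield pronunciation variants instead.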

9 Rule learning process (4 steps)
Inputs: orthography, initial transcription, correct transcription
(1) Aligner (two instances)
(2) Transformation Generator
(3) Example Generator
(4) Rule Learner

10 (1) Align transcriptions
Two types of alignment
– p-to-p: align the initial (I) to the correct (C) transcription
– p-to-g: align the initial (I) to the orthographic (O) transcription
Unified approach
– work with correspondence models: each symbol of I has an associated set of units of the 2nd transcription
– an I-phoneme can be lined up with any unit of the other transcription; the cost is low if the unit belongs to the associated set of this phoneme
– an I-prosodic mark can only be lined up with a unit of its associated set, e.g. for p-to-p: the set of prosodic marks
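
A minimal sketch of such an alignment, assuming a standard dynamic-programming (edit-distance) formulation. The function name `align`, the unit costs (0 for a unit in the associated set, 1 otherwise, 1 for a gap) and the toy associated sets are assumptions; the hard constraint on prosodic marks mentioned above is not modelled here.

```python
def align(initial, other, associated, gap="~"):
    """Align two symbol sequences; returns a list of (I-symbol, other-unit) pairs."""
    def sub_cost(a, b):
        # Lining up is cheap when b belongs to the associated set of a.
        return 0 if b in associated.get(a, ()) else 1

    n, m = len(initial), len(other)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(initial[i - 1], other[j - 1]),
                cost[i - 1][j] + 1,   # I-symbol lined up with an empty cell
                cost[i][j - 1] + 1,   # other unit lined up with an empty cell
            )

    # Trace back from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == (
            cost[i - 1][j - 1] + sub_cost(initial[i - 1], other[j - 1])
        ):
            pairs.append((initial[i - 1], other[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((initial[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, other[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Toy p-to-p example: each phoneme is associated with itself only.
same = {"d": {"d"}, "$": {"$"}}
print(align(["d", "$"], ["d", "$", "n"], same))
# [('d', 'd'), ('$', '$'), ('~', 'n')]
```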

11 (2) Generate transformations from I to C
Step 1: stress mark substitutions
I: ’  r  o  .  d  $  ~  .  b  A  ~  .  ’2 x  l  a  n
C: ’  r  o  .  d  $  n  .  b  A  x  .  ~  ~  l  a  n
Step 2: remove stress marks
I: r  o  .  d  $  ~  .  b  A  ~  .  x  l  a  n
C: r  o  .  d  $  n  .  b  A  x  .  ~  l  a  n
Step 3: phonemic pattern transformations
– left-to-right longest mismatch (prosodic marks ignored)
– omit empty cells (~)
Result: /$/ → /$ n/ and /. x/ → /x ./
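
Step 3 can be sketched as follows. This is one reading of "left-to-right longest mismatch" that reproduces the two transformations of the example; it is not necessarily the authors' procedure. In particular, anchoring a pure insertion on the symbol to its left (so that the inserted /n/ becomes /$/ → /$ n/) is an assumption inferred from the example, and the function name `transformations` is hypothetical.

```python
PROSODIC = {"."}   # syllable boundary; stress marks were removed in step 2
GAP = "~"          # empty cell in the alignment


def transformations(aligned_i, aligned_c):
    """Collect (input, output) pattern pairs from two aligned sequences."""
    cols = list(zip(aligned_i, aligned_c))
    found = []
    k = 0
    while k < len(cols):
        if cols[k][0] == cols[k][1]:
            k += 1
            continue
        start = end = k
        j = k + 1
        # Extend the run: a matching prosodic mark does not end a mismatch,
        # any other matching column does.
        while j < len(cols):
            a, b = cols[j]
            if a != b:
                end = j
            elif a not in PROSODIC:
                break
            j += 1
        span = cols[start:end + 1]
        src = [a for a, _ in span if a != GAP]
        dst = [b for _, b in span if b != GAP]
        if not src and start > 0:
            # Assumption: a pure insertion is anchored on its left neighbour.
            src.insert(0, cols[start - 1][0])
            dst.insert(0, cols[start - 1][1])
        found.append((tuple(src), tuple(dst)))
        k = end + 1
    return found


I = "r o . d $ ~ . b A ~ . x l a n".split()
C = "r o . d $ n . b A x . ~ l a n".split()
print(transformations(I, C))
# [(('$',), ('$', 'n')), (('.', 'x'), ('x', '.'))]
```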

12 (3) Generate training examples (PSD only)
Segment I into rule inputs & phonemic units
– transformation list = /. x/ → /x ./, /$/ → /$ n/
– segmentation result (the rule inputs /$/ and /. x/ were marked in red on the slide)
  I: r o . d $ . b A . x l a n
– outputs from the I-to-C alignment
Determine the graphemes lined up with each rule input
I: r  o  .  d  $  ~  .  b  A  .  x  ~  l  a  n
O: r  o  ~  d  e  n  ~  b  a  ~  c  h  l  aa n
Extract linguistic features describing the context
– L/R phonemic symbol, stress level, 1st grapheme, …
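
A minimal sketch of how one training example could be assembled from the I-to-O alignment. The function name `extract_example`, the feature names, the `#` boundary symbol and the choice of passing the focus as a column span are all assumptions; stress level and any further features are omitted.

```python
GAP = "~"


def extract_example(aligned_i, aligned_o, start, end):
    """Describe the rule input in columns start..end-1 of the I-to-O alignment."""
    focus = tuple(s for s in aligned_i[start:end] if s != GAP)
    graphemes = "".join(g for g in aligned_o[start:end] if g != GAP)
    # Nearest non-empty symbols to the left and right ("#" = word boundary).
    left = next((s for s in reversed(aligned_i[:start]) if s != GAP), "#")
    right = next((s for s in aligned_i[end:] if s != GAP), "#")
    return {
        "focus": focus,
        "graphemes": graphemes,
        "first_grapheme": graphemes[:1],
        "left_symbol": left,
        "right_symbol": right,
    }


I = "r o . d $ ~ . b A . x ~ l a n".split()
O = "r o ~ d e n ~ b a ~ c h l aa n".split()

# The rule input /$/ occupies column 4, where it is lined up with the grapheme "e".
example = extract_example(I, O, 4, 5)
print(example)
```

Here the focus /$/ gets the features left symbol /d/, right symbol /./ and first grapheme "e".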

13 (4) Learning rewrite rules
Goal
– train rules in a hierarchy (decision tree)
– one decision tree per focus

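14 (4) Learning rewrite rules (example decision tree; the figure is not preserved in the transcript)
Labels recovered from the figure:
– tree label: / s. /
– node questions, each with Y/N branches: R1 = / t /? and L1 = SVWL? (SVWL: short vowel)
– leaf distributions: { / s. /: 1.0 }, { / s. /: 0.8, /. s /: 0.2 }, { / s. /: 0.14, /. s /: 0.86 }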

15 Deductive approach
Use human expert knowledge to qualify and correct transcription errors
– Knowledge sources: morphology, phonology, etymology, language origin
Method
– Compare g2p and correct transcriptions (train set)
– Quantify errors (relative rule application rate)
– Qualify the underlying causes of the errors
– Formulate corresponding generic rules (no probabilities)
– Implement these in FONPARS
– Evaluate the result (test set)

16 Implemented deductive rules
Removal of superfluous stresses
– E.g. ‘van’, ‘de’
Expansion of contractions
– E.g. Ciprianussteeg, Trigoniaerf
Frisian names
– E.g. ‘-dyk’, ‘-wyk’
Syllabification
– E.g. diminutives (k.j$ -> .kj$)
French names
/n/-deletion
– E.g. Van Brienenoordbrug
Degemination
– E.g. Holland, Wellekens
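
A deterministic rule of this kind can be sketched as an ordered list of rewrite patterns. The sketch below encodes only the syllabification example given above (k.j$ -> .kj$) and applies it unconditionally, which is a simplification: the actual rules are implemented in FONPARS and are restricted to diminutives. The transcription string used in the example is invented.

```python
import re

# Ordered (pattern, replacement) pairs; deterministic, no probabilities.
RULES = [
    (re.compile(r"k\.j\$"), ".kj$"),   # move the syllable boundary before /k/
]


def apply_rules(transcription, rules=RULES):
    """Apply every rewrite rule in order to a compact transcription string."""
    for pattern, replacement in rules:
        transcription = pattern.sub(replacement, transcription)
    return transcription


print(apply_rules("bek.j$"))  # be.kj$
```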

17 Results on Dutch toponyms (test set)
– Inductive approach somewhat better than deductive approach
– At the cost of more rules
– Room for further improvement
Method             #Rules   WER (%)   PER (%)
g2p only                     51.2       7.1
g2p + inductive    > 300     40.8       4.9
g2p + deductive      40      42.1       5.5
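
To put the table in perspective, the relative error reductions with respect to the g2p-only baseline can be derived from the reported figures. The helper name `relative_reduction` is hypothetical; the input numbers are those of the table.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent, rounded to one decimal."""
    return round((baseline - improved) / baseline * 100, 1)


# (WER, PER) in percent, as reported in the table
results = {
    "g2p only": (51.2, 7.1),
    "g2p + inductive": (40.8, 4.9),
    "g2p + deductive": (42.1, 5.5),
}

base_wer, base_per = results["g2p only"]
for method, (wer, per) in results.items():
    if method == "g2p only":
        continue
    print(method,
          "WER reduction:", relative_reduction(base_wer, wer), "%,",
          "PER reduction:", relative_reduction(base_per, per), "%")
```

On these figures the inductive p2p removes about 20.3% of the word errors and 31.0% of the phoneme errors of the baseline, against about 17.8% and 22.5% for the deductive rules.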

18 Future work
More name types (first names, family names)
Deductive methodology in synergy with inductive approach
→ deductive approach AFTER inductive approach
→ define features to expose causes to learning tools, e.g. syllabification errors caused by not respecting the morphological integrity of entities such as ‘kamp’, ‘dijk’, …
Compare to plain data-driven approaches
– TiMBL on geographical names → same WER, lower WIR, huge size compared to p2p
– more tests needed

