SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages St. Petersburg, Russia www.kit.edu KIT – University of the State.

1 SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia. KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association. Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: A Case Study on our German IT Corpus. Sebastian Leidig, Tim Schlippe, Tanja Schultz. German example text on the slide: “Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an”; “Internet Explorer nutzt Hardwarebeschleunigung”; “Websites werden schneller geladen, damit Sie noch reibungsloser surfen können”; “Nimm deine Lieblingsmusik überallhin mit”; “kommt der iPod shuffle mit Speicher genug für hunderte von Songs”; “alle wichtigen Songs fürs Training”; “Wiedergabelisten, Genius Mixes, Podcasts und Hörbücher”.

2 15-May-2014 Motivation Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios From Microsoft's German website: “Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” (“Show other apps next to the browser for easy multitasking.”) “Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.” (“Internet Explorer uses hardware acceleration. Websites load faster so that you can surf even more smoothly.”)

3 Motivation With globalization, words from other languages enter a language without being assimilated to its phonetic system. To build up lexical resources economically with automatic or semi-automatic methods: detect these words and treat them separately.

4 Overview Input: word list (word1, word2, word3, word4, word5, word6). Features: grapheme perplexity, g2p confidence, Hunspell lookup (native), Hunspell lookup (English), Wiktionary lookup, Google hit count. Combination: voting, decision tree, SVM. Output: classification.
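The voting combination named on the overview slide could be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each feature has already been reduced to a binary vote (True = "looks English"), uses a simple majority, and resolves ties in favour of the native class. The function names and the example votes are hypothetical.

```python
from typing import Dict


def majority_vote(feature_votes: Dict[str, bool]) -> bool:
    """Return True (English) if strictly more than half of the features vote English.

    Assumption: every feature contributes one equally weighted binary vote;
    a tie is resolved as "not English".
    """
    votes = list(feature_votes.values())
    return 2 * sum(votes) > len(votes)


def classify_word_list(votes_per_word: Dict[str, Dict[str, bool]]) -> Dict[str, bool]:
    """Apply the majority vote to every word in the input word list."""
    return {word: majority_vote(votes) for word, votes in votes_per_word.items()}


# Hypothetical votes, made up for illustration only (not results from the talk).
example_votes = {
    "Browser": {"grapheme_perplexity": True, "hunspell_english": True, "wiktionary": False},
    "Speicher": {"grapheme_perplexity": False, "hunspell_english": False, "wiktionary": False},
}
labels = classify_word_list(example_votes)
print(labels)
```

A decision tree or SVM, the other two combinations listed on the slide, would take the same per-feature values as input but learn the weighting from annotated data instead of counting votes.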

5 Outline 1. Motivation and Overview 2. Test Sets 3. Single Features 4. Combinations 5. Summary and Future Work

6 Test Sets - Domains German IT website: 4.6k unique words. German general news: 6.6k unique words. Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013): 9.4k unique words.

7 Test Sets - Domains Different word categories: Tag for “English”: e.g. Software, Brain, … Abbreviations: e.g. UV, CIA, … Other foreign words: e.g. Français, Niveau, … Foreign hybrids: compound words, e.g. Schadsoftware, …; grammatically adapted words, e.g. downloaden, … Decisions based on: agreement of annotators; duden.de.

8 Foreign words in different test sets

9 Single Features – Design Criteria Features trained on commonly available resources: word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google. Thresholds without supervised training. Comparison between English and native models. New approaches.

10 Grapheme Perplexity

11 Grapheme Perplexity
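The content of the grapheme perplexity slides is not preserved in this transcript. As a rough sketch of the general idea, following the "comparison between English and native models" criterion above, one can train a character n-gram model per language and mark a word as English when the English model assigns it the lower perplexity. Everything below is an assumption made for illustration: the trigram order, add-one smoothing, the padding symbols, the class and function names, and the toy training lists.

```python
import math
from collections import Counter
from typing import Iterable


class GraphemeNgramModel:
    """Character n-gram model with add-one smoothing (illustrative sketch only)."""

    def __init__(self, words: Iterable[str], n: int = 3) -> None:
        self.n = n
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.alphabet = set()
        for word in words:
            padded = self._pad(word)
            self.alphabet.update(padded)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def _pad(self, word: str) -> str:
        # "^" marks the word start, "$" the word end.
        return "^" * (self.n - 1) + word.lower() + "$"

    def perplexity(self, word: str) -> float:
        padded = self._pad(word)
        vocab = len(self.alphabet) + 1  # one extra slot for unseen graphemes
        log_prob = 0.0
        count = 0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            prob = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + vocab)
            log_prob += math.log(prob)
            count += 1
        return math.exp(-log_prob / count)


def looks_english(word: str, english: GraphemeNgramModel, native: GraphemeNgramModel) -> bool:
    """Vote English when the English model is less 'surprised' than the native one."""
    return english.perplexity(word) < native.perplexity(word)


# Toy training data, far too small for real use.
english_model = GraphemeNgramModel(["software", "download", "browser", "update", "computer"])
german_model = GraphemeNgramModel(["schaden", "geschwindigkeit", "überall", "wiedergabe", "bücher"])
print(looks_english("downloads", english_model, german_model))
print(looks_english("schaden", english_model, german_model))
```

Comparing the two perplexities directly needs no tuned threshold, which is one way to satisfy the "thresholds without supervised training" criterion; whether the talk used exactly this decision rule is not stated in the transcript.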

12 Grapheme-to-Phoneme Confidence Phonetisaurus confidence scores (costs)

13 Grapheme-to-Phoneme Confidence

14 Hunspell Lookup Two lookup features, each taking the word list (word1, word2, word3, word4) and producing a classification by deriving word forms and looking them up in a Hunspell dictionary. English: spellchecker dictionary Hunspell-en. German: spellchecker dictionary Hunspell-de. 2 features performed best.

15 Hunspell Lookup Two lookup features, each taking the word list (word1, word2, word3, word4) and producing a classification by deriving word forms and looking them up in a Hunspell dictionary. English: spellchecker dictionary Hunspell-en. German: spellchecker dictionary Hunspell-de.
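The two lookup features on the Hunspell slides can be illustrated with plain Python sets. This sketch does not call Hunspell: it assumes the word forms have already been derived from the Hunspell-en and Hunspell-de dictionaries and stored in two sets, and the function name, feature names, and tiny word sets are hypothetical stand-ins.

```python
from typing import Dict, Set


def lookup_features(word: str, english_forms: Set[str], native_forms: Set[str]) -> Dict[str, bool]:
    """Return two binary hints for one word.

    found_in_english:  the word is a known English word form.
    missing_in_native: the word is not a known native (here: German) word form.
    Both sets are assumed to hold lower-cased, fully derived word forms.
    """
    key = word.lower()
    return {
        "found_in_english": key in english_forms,
        "missing_in_native": key not in native_forms,
    }


# Tiny stand-in word-form sets, for illustration only.
english_forms = {"browser", "download", "downloads", "software"}
native_forms = {"speicher", "schaden", "websites", "software"}

for candidate in ["Browser", "Speicher", "Software"]:
    print(candidate, lookup_features(candidate, english_forms, native_forms))
```

A word such as "Software" appears in both sets, which shows why a single lookup is ambiguous and why the overview slide combines several features.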

16 Wiktionary Lookup Check crowdsourced information from matrix language Wiktionary

17 Google Hit Count Based on Alex, B. (2008), “Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh

18 Google Hit Count Based on Alex, B. (2008), “Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh

19 Result: Single Features

20 Grapheme-to-Phoneme Confidence

21 Result: Single Features On the Spiegel-de test set, a higher ratio of the words classified as English are wrong.

22 Result: Combination

23 Challenges Performance after filtering difficult words (oracle)

24 Conclusion and Future Work Features based on available sources. New approaches: G2P confidence, Wiktionary. Further features: part-of-speech (POS); context, trigger words; capitalization; translate and compare.

25 Thank you for your attention!

26 References

27 References

