
1 Linguistic Enrichment of Statistical Transliteration
लिंगुइस्टिक एनरिच्मेंट ऑफ़ स्टटिस्टिकल ट्रांसलिटरेशन
MTP Final Stage Presentation
Guided by: Prof. Pushpak Bhattacharyya
Presented by: Abhijeet Padhye ( )
Department of Computer Science & Engineering, IIT Bombay

2 Presentation Pathway
Problem Statement
Motivation
What is Transliteration?
Syllables and their Structure
Sonority Theory
Concept of Schwa
Proposed Transliteration Model
Experiments and Results
Discussions
Conclusion and Future Work
References

3 Problem Statement
To exploit the phonological similarities of the Roman and Devanagari scripts in order to linguistically aid the process of statistical transliteration.

4 Motivation
An important component of Machine Translation
When you cannot translate, transliterate
Critical in tackling the problem of OOV words and proper nouns
Proves acute in translating named entities for CLIR
Transliteration is a phonetic translation process, apt to exploit phonetic and phonological properties

5 What is Transliteration?
A process of phonetically translating words such as named entities or technical terms from the source-language alphabet to the target-language alphabet.
Examples:
Gandhiji – गाँधीजी
OOV words like नमस्कार – Namaskar

6 Humans translate/transliterate frequently for different reasons
An example of how transliteration comes to the rescue when no translation exists

7 Overview of Transliteration
Fig : Source Word → Transliteration Units → Transliteration Units → Target Word
Candidate transliteration units: character n-grams, syllables

8 Basics of syllables
“Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.”
Vowels are the heart of a syllable (the most sonorous element).
Consonants act as sounds attached to vowels.

9 Syllable Structure
Simple syllables – Baba (Ba + ba), दादा (दा + दा)
Complex syllables – Andrew (An + drew)
Alert! The basic structure doesn’t suffice: An + drew, VC? CVC?

10 Possible syllable structures
The Nucleus is always present
Onset and Coda may be absent
Possible structures: V, CV, VC, CVC
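The four structures above can be illustrated with a small sketch. It assumes, as a simplification not taken from the slides, that the five letters a, e, i, o, u stand in for vowel sounds and that every other letter is a consonant; the function name `cv_shape` is hypothetical.

```python
import re

VOWELS = "aeiou"  # assumption: orthographic vowels approximate vowel sounds


def cv_shape(syllable: str) -> str:
    """Collapse a romanized syllable to its onset/nucleus/coda shape."""
    pattern = "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())
    # Collapse repeated symbols so a consonant cluster or long vowel counts once.
    return re.sub(r"(.)\1+", r"\1", pattern)


for syl in ["a", "ba", "an", "drew"]:
    print(syl, cv_shape(syl))
```

Under these assumptions, the cluster "dr" in "drew" collapses to a single onset, so the syllable is reported as CVC.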

11 Introduction to sonority theory
“The Sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.”
Some sounds are more sonorous than others.
Words in a language can be divided into syllables.
Sonority theory distinguishes syllables on the basis of sounds.

12 Sonority Hierarchy
Obstruents can be further classified into:
Fricatives
Affricates
Stops

13 Sonority sequencing principle
“The Sonority Profile of a syllable must rise until its Peak (Nucleus), and then fall.”
Fig : Sonority profile of a syllable, with Onset, Peak (Nucleus) and Coda
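As a rough illustration of the principle, the sketch below checks whether a sequence of numeric sonority values rises to a single peak and then falls. The numeric scale and the function name `obeys_ssp` are assumptions made for this example, and the check is strict (plateaus are rejected), which is only one possible reading of the principle.

```python
def obeys_ssp(profile):
    """Return True if the values rise strictly to one peak, then fall strictly."""
    if not profile:
        return False
    peak = profile.index(max(profile))
    # Every step up to the peak must increase.
    rising = all(a < b for a, b in zip(profile[:peak], profile[1:peak + 1]))
    # Every step after the peak must decrease.
    falling = all(a > b for a, b in zip(profile[peak:], profile[peak + 1:]))
    return rising and falling


# Hypothetical values: higher means more sonorous.
print(obeys_ssp([1, 4, 6, 3]))  # onset rises, coda falls
print(obeys_ssp([4, 1, 6]))     # onset falls before the peak
```

A candidate syllable whose profile fails this check would be a signal to move the syllable boundary.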

14 Example
Fig : Sonority Profile 1 and Sonority Profile 2 for the word ABHIJEET (letters A, I, E, E, H, J, B, T)

15 The concept of schwa
First letter of IAL – {a}
Unstressed and toneless neutral vowel
Some schwas are deleted and some are not
Schwa deletion is an important issue for grapheme-to-phoneme conversion
Handled using a well-established schwa deletion algorithm
Example: Priyatama – the last “a” changes the gender: प्रियतम vs. प्रियतमा

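16 Proposed Transliteration Model
Fig : Proposed model.
Training: Source Language Words and Target Language Words pass through the Syllabification Modules to give Source Language Syllables and Target Language Syllables; Moses Training on these yields the Phrase translation tables, and SRILM builds the Target Language Model.
Decoding: Source Language Words are passed to the Moses Decoder, which produces the Transliterated output.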

17 Transliteration system workflow
Syllabification of a parallel list of names in Roman and Devanagari
Using these parallel lists for alignment of syllables and for training the Moses translation toolkit
Language model generation using SRILM
Decoding using the trained phrase-translation tables and language model
Comparing results to analyze performance

18 Experiments and Results
Syllabification of Roman and Devanagari words Fig : Syllabification Algorithm

19 Syllabification results
A few examples (columns: Language, Correct, Incorrect)
English: Soloman, Akbarkhan, Venkatachalam
Hindi: सोहनमल, सोमेश्वर, शिरोमणि, सिराजउद्दीन

20 Transliteration Process
Syllabification of the list of parallel names written in Roman and Devanagari, and preparation of a parallel aligned list of syllables.
Training language models for the target language using the SRILM toolkit.
Training Moses with the aligned corpus of 7500 names and the target language model as input.
Testing with a list of 2500 proper names, using the trained model for transliteration.
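The data preparation step can be sketched as follows: each name becomes one line of space-separated syllables, so that a phrase-based toolkit treats syllables as its "words". This is a minimal sketch, not the authors' tooling. The function names, file paths and the sample syllable pairs are all illustrative assumptions; only the 7500-name training size comes from the slide.

```python
from pathlib import Path


def to_parallel_lines(pairs):
    """pairs: list of (source_syllables, target_syllables) for aligned names.

    Returns two equally long lists of lines, one name per line, with
    syllables separated by single spaces.
    """
    src_lines = [" ".join(src) for src, _ in pairs]
    tgt_lines = [" ".join(tgt) for _, tgt in pairs]
    return src_lines, tgt_lines


def split_train_test(pairs, n_train=7500):
    """Take the first n_train pairs for training, the rest for testing."""
    return pairs[:n_train], pairs[n_train:]


def write_parallel(pairs, src_path, tgt_path):
    """Write the two sides to line-aligned UTF-8 files (paths are hypothetical)."""
    src_lines, tgt_lines = to_parallel_lines(pairs)
    Path(src_path).write_text("\n".join(src_lines) + "\n", encoding="utf-8")
    Path(tgt_path).write_text("\n".join(tgt_lines) + "\n", encoding="utf-8")


# Invented sample data, for illustration only.
sample = [(["ba", "ba"], ["बा", "बा"]), (["da", "da"], ["दा", "दा"])]
print(to_parallel_lines(sample))
```

The two files written by `write_parallel` would then serve as the source and target sides of the training corpus, with the target side also usable as language-model training text.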

21 Roman to Devanagari Transliteration
Fig : Result for Roman to Devanagari Transliteration Fig : Top-n Inclusion results

22 Devanagari to Roman Transliteration
Fig : Result for Devanagari to Roman Transliteration Fig : Top-n translation results
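The figures themselves are not reproduced here. As a sketch of how a top-n figure could be computed, assume that "top-n inclusion" means the fraction of test words whose reference transliteration appears among the system's first n candidates. That definition, the function name and the toy data below are assumptions, not values from the presentation.

```python
def top_n_inclusion(candidate_lists, references, n):
    """Fraction of references found within the first n candidates."""
    if not references:
        raise ValueError("references must not be empty")
    hits = sum(
        1
        for candidates, reference in zip(candidate_lists, references)
        if reference in candidates[:n]
    )
    return hits / len(references)


# Toy placeholder data: ranked candidate lists and the expected output per word.
toy_candidates = [["a", "x"], ["y", "b"], ["z", "w", "c"]]
toy_references = ["a", "b", "c"]
for n in (1, 2, 3):
    print(n, top_n_inclusion(toy_candidates, toy_references, n))
```

By construction the value can only stay the same or grow as n increases, which is why such results are usually reported as a curve over n.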

23 Comparison with Character n-gram based model
Same experimental setup; transliteration units changed to n-grams
Bigrams (Sandeep → Sa, an, nd, de, ee, ep)
Trigrams (Sandeep → San, and, nde, dee, eep)
Quadrigrams (Sandeep → Sand, ande, ndee, deep)
Observations suggest a performance improvement when syllables are used as transliteration units
n-gram based models prove to be ignorant of phonological properties like unstressed vowels
Fig : Comparison with N-gram based model
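The n-gram units listed for "Sandeep" are overlapping character windows. A minimal sketch of that segmentation, with a hypothetical helper name, is:

```python
def char_ngrams(word, n):
    """Return all overlapping character n-grams of word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in (2, 3, 4):
    print(n, char_ngrams("Sandeep", n))
# 2 ['Sa', 'an', 'nd', 'de', 'ee', 'ep']
# 3 ['San', 'and', 'nde', 'dee', 'eep']
# 4 ['Sand', 'ande', 'ndee', 'deep']
```

The windows are cut purely by position, so they can split a vowel from its consonants, which is the phonological blindness the slide refers to.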

24 Comparison with State-of-the-art Systems
Google transliteration engine and Quillpad used as benchmarks for comparison
A list of 1000 words written in the Roman alphabet used as test input
Our system outperforms Quillpad and falls just short of Google’s results.
More intensive training with a larger training set might improve system performance.
Fig : Comparison with State-of-the-art transliteration systems

25 Discussions
Accents: थोड़ा – Thoda or Thora?
Mapping of sounds: Mahaan – महान, Kahaan – कहाँ
Silent letters: Psychiatrist – सायकेट्रिस्ट

26 Discussions (contd.)
Improper schwa deletion: Venkatachalam – वेंकटचलम (वें + कट + च + लम or वेंक + टच + लम)
Improper placement (onset or coda): सिराजउद्दीन – सि राज उद् दिन or सि रा जउद् दिन
Similar phonological structure but different pronunciation: सोमलता (सोम लता) and कोमलता (को मल ता)

27 Conclusion and Future work
Transliteration can prove critical in supporting Machine Translation.
Phonologically aware transliteration units like syllables show strong signs of performance improvement.
Syllable-based transliteration performs comparably to state-of-the-art systems.
Syllabification algorithms should be improved further.
The developed system should be supplied with a larger and more accurate training set.
Some of the linguistic issues discussed above are very challenging cases for future work on transliteration.

28 References
Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K. Fuzzy Translation of Cross-Lingual Spelling Variants. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Gao W., Lam W., and Wang K. Phoneme-based Transliteration of Foreign Names for OOV Problem. International Joint Conference on Natural Language Processing.
Osamu F. Syllable as a unit of Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.
Philipp Koehn et al. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Demonstration Session, Prague, Czech Republic.
Laver J. Principles of Phonetics. Cambridge University Publications. Pg. 114.
Knight K. and Graehl J. Machine Transliteration. Proceedings of ACL.
Stolcke A. SRILM – An Extensible Language Modeling Toolkit. In: Proceedings of the International Conference on Spoken Language Processing.
Choudhury M. and Bose A. A Rule Based Schwa Deletion Algorithm for Hindi. Technical Report. Dept. of Comp. Sci. & Engg., Indian Institute of Technology, Kharagpur.

29 Background Theory

30 Approaches towards Transliteration

31 Complex Syllable structure
Fig : Detailed syllable structure Fig : Complex syllables fitting in above structure

32 Sonority theory & syllables
“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”
Represented as waves of sonority, or the Sonority Profile of that syllable
Fig : Sonority profile with Onset, Nucleus and Coda

33 Sonority Hierarchy for English and Hindi

Fig : Sonority hierarchy for English
Sound Segment | Letters contained
Vowels | {"a", "e", "i", "o", "u"}
Liquids | {"y", "r", "l", "v", "w"}
Nasals | {"n", "m"}
Fricatives | {"s", "z", "f", "th", "h", "sh", "x"}
Affricates | {"ch", "j"}
Stops | {"b", "d", "g", "p", "t", "k", "q", "c"}

Fig : Sonority hierarchy for Hindi
Sound Segment | Letters contained
Vowels | {"अ","आ","इ","ई","उ","ऊ","ऋ","ए","ऐ","ओ","औ","अं","अः"}
Matras | {"ा","ि","ी","ु","ू","े","ै","ो","ौ","ं"}
Liquids | {"य","र","ल","व"}
Nasals | {"न", "म", "ण", "ञ", "ङ", "ज्ञ"}
Fricatives | {"स", "झ", "फ", "थ", "ह", "श", "ष", "क्ष", "त्र"}
Affricates | {"च", "ज", "छ"}
Stops | {"ब", "ड", "ग", "प", "ट", "त", "क", "भ", "ध", "ढ", "घ", "फ", "थ", "ठ", "ख", "द"}

34 Maximal Onset Principle
“The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.”
When a word has two valid syllabifications, the one with the longer onset is preferred.
Example: Diploma – Di + plo + ma is preferred over Dip + lo + ma
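A minimal sketch of maximal-onset syllabification for romanized words follows. It is not the presentation's syllabification algorithm. The vowel set, the tiny list of legal two-letter onsets and the function name `syllabify` are assumptions chosen only to reproduce the Diploma example.

```python
import re

VOWELS = "aeiou"  # assumption: orthographic vowels act as nuclei
# Assumption: a small illustrative set of legal consonant-cluster onsets.
LEGAL_ONSETS = {"pl", "pr", "bl", "br", "dr", "tr", "kr", "gr", "fl", "sl", "st"}


def syllabify(word):
    """Split a word so intervocalic clusters give the longest legal onset."""
    runs = re.findall(r"[aeiou]+|[^aeiou]+", word.lower())
    syllables, current = [], ""
    for idx, run in enumerate(runs):
        if run[0] in VOWELS:
            current += run  # nucleus joins the open syllable
        elif not syllables and not current:
            current = run  # word-initial consonants form the first onset
        elif idx == len(runs) - 1:
            current += run  # word-final consonants form the last coda
        else:
            # Find the longest suffix of the cluster that is a legal onset;
            # a single consonant is always accepted as an onset.
            split = next(
                k for k in range(len(run))
                if run[k:] in LEGAL_ONSETS or len(run) - k == 1
            )
            syllables.append(current + run[:split])
            current = run[split:]
    if current:
        syllables.append(current)
    return syllables


print(syllabify("diploma"))  # ['di', 'plo', 'ma']
print(syllabify("andrew"))   # ['an', 'drew']
```

Because "pl" is in the assumed onset list, the cluster goes wholly to the second syllable, giving Di + plo + ma rather than Dip + lo + ma.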

35 Schwa deletion algorithm
Procedure delete_schwa (DS)
Input: word (string of alphabets)
Output: input word with some schwas deleted
1. Mark all the full vowels and consonants followed by vowels other than the inherent schwas (i.e. consonants with Matras) and all the hs in the word as F, unless it is explicitly marked as half by use of halant.
2. Mark all the consonants immediately followed by consonants or halants (i.e. consonants of conjugate syllables) as H.
3. Mark all the remaining consonants, which are followed by implicit schwas, as U.
4. If in the word, y is marked as U and preceded by i, I, ri, u or U, mark it F.
5. If y, r, l or v are marked U and preceded by consonants marked H, then mark them F.
6. If a consonant marked U is followed by a full vowel, then mark that consonant as F.
7. While traversing the word from left to right, if a consonant marked U is encountered before any consonant or vowel marked F, then mark that consonant as F.
8. If the last consonant is marked U, mark it H.
9. If any consonant marked U is immediately followed by a consonant marked H, mark it F.
10. While traversing the word from left to right, for every consonant marked U, mark it H if it is preceded by F and followed by F or U, otherwise mark it F.
11. For all consonants marked H, if it is followed by a schwa in the original word, then delete the schwa from the word. The resulting new word is the required output.
End procedure delete_schwa

36 Example of Schwa deletion
Fig : Application of Schwa deletion Algorithm

37 Examples

Correct Transliterations
Source Language | Target Language | Examples
English | Hindi | REYMOND → रेमंड, MUKUNDAN → मुकुंदन
Hindi | English | सलीमाबेगम → SALEEMABEGAM, राजगुरू → RAJGURU

Incorrect Transliterations
Source Language | Target Language | Examples
English | Hindi | VENKATACHALAM → वेंकाटाचलं, DHRUVA → UNKव
Hindi | English | कोमलता → COMLATA

