Linguistic Enrichment of Statistical Transliteration

Pushpak Bhattacharyya CSE Dept., IIT Bombay 31st March, 2011
Presentation transcript:

Linguistic Enrichment of Statistical Transliteration
लिंगुइस्टिक एनरिच्मेंट ऑफ़ स्टटिस्टिकल ट्रांसलिटरेशन
MTP Final Stage Presentation
Presented by: Abhijeet Padhye (06305902)
Guided by: Prof. Pushpak Bhattacharyya
Department of Computer Science & Engineering, IIT Bombay

Presentation Pathway
- Problem Statement
- Motivation
- What is Transliteration?
- Syllables and their Structure
- Sonority Theory
- Concept of Schwa
- Proposed Transliteration Model
- Experiments and Results
- Discussions
- Conclusion and Future Work
- References

Problem Statement
To exploit the phonological similarities of the Roman and Devanagari scripts in order to linguistically aid the process of statistical transliteration.

Motivation
- Transliteration is an important component of Machine Translation.
- When you cannot translate, transliterate.
- Critical in tackling the problem of OOV (out-of-vocabulary) words and proper nouns.
- Especially important when translating named entities for CLIR.
- Transliteration is a phonetic translation process, so it is apt to exploit phonetic and phonological properties.

What is Transliteration?
A process of phonetically translating words such as named entities or technical terms from the source-language alphabet to the target-language alphabet.
Examples:
- Gandhiji : गाँधीजी
- OOV words like नमस्कार : Namaskar

Humans translate/transliterate frequently for different reasons
[Figure: an example of how transliteration comes to the rescue when no translation exists]

Overview of Transliteration
[Figure: Source Word -> Transliteration Units -> Transliteration Units -> Target Word. Candidate transliteration units: character n-grams, syllables]

Basics of syllables
“Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.”
- Vowels are the heart of a syllable (the most sonorous element).
- Consonants act as sounds attached to vowels.

Syllable Structure
- Simple syllables: Baba (Ba + ba), दादा (दा + दा)
- Complex syllables: Andrew (An + drew)
- Alert! The basic structure doesn't suffice: An + drew, VC? CVC?

Possible syllable structures
- The Nucleus is always present.
- Onset and Coda may be absent.
- Possible structures: V, CV, VC, CVC
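As a rough illustration of the four basic templates, the sketch below maps a romanized syllable to its consonant/vowel pattern and checks it against V, CV, VC and CVC. It assumes that only the letters a, e, i, o, u count as vowels; the function names are hypothetical and not part of the presented system.

```python
# Minimal sketch: consonant/vowel pattern of a romanized syllable.
# Assumption: only a, e, i, o, u are treated as vowels.
BASIC_TEMPLATES = {"V", "CV", "VC", "CVC"}


def cv_pattern(syllable):
    """Return the C/V pattern of a syllable, one symbol per letter."""
    return "".join("V" if ch in "aeiou" else "C" for ch in syllable.lower())


def fits_basic_structure(syllable):
    """True if the syllable matches one of V, CV, VC, CVC."""
    return cv_pattern(syllable) in BASIC_TEMPLATES


for syl in ["Ba", "ba", "An", "drew"]:
    print(syl, cv_pattern(syl), fits_basic_structure(syl))
```

Under these assumptions "drew" comes out as CCVC, which is outside the four templates and matches the point that the basic structure does not cover complex syllables.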

Introduction to sonority theory
“The Sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.”
- Some sounds are more sonorous than others.
- Words in a language can be divided into syllables.
- Sonority theory distinguishes syllables on the basis of sounds.

Sonority Hierarchy
Obstruents can be further classified into:
- Fricatives
- Affricates
- Stops

Sonority sequencing principle
“The Sonority Profile of a syllable must rise until its Peak (Nucleus), and then fall.”
[Figure labels: Onset, Peak (Nucleus), Coda]
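The rise-then-fall condition can be checked mechanically once each segment has a numeric sonority level. The sketch below is one possible check, not the presented implementation. It assumes higher numbers mean more sonorous and that repeated levels (for example a long vowel written with two letters) are collapsed before checking; both choices are assumptions, and the function name is hypothetical.

```python
# Minimal sketch of a sonority sequencing check on numeric sonority levels.
# Assumptions: higher number = more sonorous; plateaus are collapsed first.
def follows_ssp(levels):
    """True if the levels rise strictly to a single peak and then fall."""
    if not levels:
        return False
    collapsed = [levels[0]]
    for value in levels[1:]:
        if value != collapsed[-1]:
            collapsed.append(value)
    peak = collapsed.index(max(collapsed))
    rising = all(a < b for a, b in zip(collapsed[:peak + 1], collapsed[1:peak + 1]))
    falling = all(a > b for a, b in zip(collapsed[peak:], collapsed[peak + 1:]))
    return rising and falling


print(follows_ssp([1, 5, 5, 0]))  # onset, two-letter nucleus, coda
print(follows_ssp([5, 0, 2, 5]))  # two peaks, so not a single syllable
```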

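Example: ABHIJEET
[Figure: Sonority Profile 1 and Sonority Profile 2 for the word, with segments placed by sonority level: A, I, E, E (highest), then H, then J, then B, T (lowest)]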

The concept of schwa
- First letter of the IAL alphabet: {a}
- An unstressed and toneless neutral vowel
- Some schwas are deleted and some are not
- Schwa deletion is an important issue for grapheme-to-phoneme conversion
- Handled using a well-established schwa deletion algorithm
- Example: Priyatama, where the last “a” changes the gender: प्रियतम vs. प्रियतमा

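Proposed Transliteration Model
[Figure: block diagram]
- Training: Source Language Words and Target Language Words -> Syllabification Modules -> Source Language Syllables and Target Language Syllables -> Moses Training -> Phrase translation tables
- Language model: Target Language Syllables -> SRILM -> Target Language Model
- Decoding: Source Language Words -> Moses Decoder (with phrase translation tables and target language model) -> Transliterated output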

Transliteration system workflow
- Syllabification of parallel lists of names in Roman and Devanagari
- Using these parallel lists for:
  - Alignment of syllables
  - Training the Moses translation toolkit
- Language model generation using SRILM
- Decoding using the trained phrase-translation tables and language model
- Comparing results to analyze performance

Experiments and Results
Syllabification of Roman and Devanagari words
Fig: Syllabification Algorithm

Syllabification results
A few examples:
- English, correct: Soloman, Akbarkhan
- English, incorrect: Venkatachalam
- Hindi, correct: सोहनमल, सोमेश्वर, शिरोमणि
- Hindi, incorrect: सिराजउद्दीन

Transliteration Process
1. Syllabification of a list of 10000 parallel names written in Roman and Devanagari, and preparation of a parallel aligned list of syllables.
2. Training language models for the target language using the SRILM toolkit.
3. Training Moses with an aligned corpus of 7500 names and the target language model as input.
4. Testing with a list of 2500 proper names, using the trained model for transliteration.
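The data preparation step can be sketched as follows: each name becomes one line of space-separated syllables, so that every syllable is treated as a token, and the aligned list is cut into a training part and a test part. This is only a sketch under assumptions: the slides do not say how the 7500/2500 split was drawn (a simple head/tail split is used here), the helper names and file paths are hypothetical, and the two name pairs are toy data.

```python
# Minimal sketch: prepare syllable-level parallel data for a phrase-based toolkit.
# Assumptions: head/tail split; one name per line; syllables separated by spaces.
def to_token_line(syllables):
    """Join syllables with spaces so each syllable is one token."""
    return " ".join(syllables)


def split_parallel(pairs, train_size):
    """Split aligned (source, target) syllable lists into train and test parts."""
    return pairs[:train_size], pairs[train_size:]


def write_parallel(pairs, src_path, tgt_path):
    """Write source and target syllable lines to two line-aligned files."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for source_syllables, target_syllables in pairs:
            src.write(to_token_line(source_syllables) + "\n")
            tgt.write(to_token_line(target_syllables) + "\n")


# Toy aligned data; the real list would hold 10000 names (7500 train, 2500 test).
pairs = [(["ba", "ba"], ["बा", "बा"]), (["da", "da"], ["दा", "दा"])]
train, test = split_parallel(pairs, 1)
print(to_token_line(train[0][0]), "|", to_token_line(train[0][1]))
```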

Roman to Devanagari Transliteration
Fig: Result for Roman to Devanagari Transliteration
Fig: Top-n Inclusion results
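One plausible reading of "top-n inclusion" is the fraction of test names whose reference transliteration appears among the first n candidates returned by the decoder. The sketch below computes that quantity under this assumption; the function name is hypothetical and the candidate lists are made-up toy data, not results from the experiments.

```python
# Minimal sketch of a top-n inclusion score.
# Assumption: a test item counts as a hit if its reference transliteration
# is among the first n ranked candidates.
def top_n_inclusion(references, candidate_lists, n):
    """Fraction of references found in the first n candidates."""
    if not references:
        raise ValueError("references must not be empty")
    hits = sum(
        1
        for reference, candidates in zip(references, candidate_lists)
        if reference in candidates[:n]
    )
    return hits / len(references)


# Toy data only: the ranked candidates are invented for illustration.
references = ["रेमंड", "मुकुंदन"]
candidate_lists = [["रेमंड", "रेमंद"], ["मुकुन्दन", "मुकुंदन"]]
print(top_n_inclusion(references, candidate_lists, 1))
print(top_n_inclusion(references, candidate_lists, 2))
```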

Devanagari to Roman Transliteration
Fig: Result for Devanagari to Roman Transliteration
Fig: Top-n translation results

Comparison with Character n-gram based model
- Same experimental setup; transliteration units changed to n-grams:
  - Bigrams (Sandeep -> Sa, an, nd, de, ee, ep)
  - Trigrams (Sandeep -> San, and, nde, dee, eep)
  - Quadrigrams (Sandeep -> Sand, ande, ndee, deep)
- Observations suggest a performance improvement when syllables are used as transliteration units.
- n-gram based models are blind to phonological properties such as unstressed vowels.
Fig: Comparison with N-gram based model
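The n-gram units listed above are overlapping character windows of length n. A minimal sketch of how such units can be produced is given below; the function name is hypothetical.

```python
# Minimal sketch: overlapping character n-grams of a word.
def char_ngrams(word, n):
    """Return all contiguous substrings of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("Sandeep", 2))  # ['Sa', 'an', 'nd', 'de', 'ee', 'ep']
print(char_ngrams("Sandeep", 3))  # ['San', 'and', 'nde', 'dee', 'eep']
print(char_ngrams("Sandeep", 4))  # ['Sand', 'ande', 'ndee', 'deep']
```

Because the windows are cut at fixed lengths, they ignore syllable boundaries, which is the contrast with syllable units drawn in this comparison.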

Comparison with State-of-the-art Systems
- Google transliteration engine and Quillpad used as benchmarks for comparison.
- A list of 1000 words written in the Roman alphabet used as test input.
- Our system outperforms Quillpad and falls just short of Google's results.
- More intensive training with a larger training set might improve system performance.
Fig: Comparison with state-of-the-art transliteration systems

Discussions
- Accents: थोड़ा : Thoda or Thora?
- Mapping of sounds: Mahaan - महान, Kahaan - कहाँ
- Silent letters: Psychiatrist - सायकेट्रिस्ट

Discussions (contd.)
- Improper schwa deletion: Venkatachalam - वेंकटचलम (वें + कट + च + लम or वेंक + टच + लम)
- Improper placement (Onset or Coda): सिराजउद्दीन - सि राज उद् दिन or सि रा जउद् दिन
- Similar phonological structure but different pronunciation: सोमलता (सोम लता) and कोमलता (को मल ता)

Conclusion and Future work
- Transliteration can prove critical in supporting Machine Translation.
- Phonologically aware transliteration units such as syllables show strong signs of performance improvement.
- Syllable-based transliteration performs comparably to state-of-the-art systems (better than Quillpad, slightly below Google).
- Syllabification algorithms should be improved further.
- The developed system should be supplied with a larger and more accurate training set.
- Some of the linguistic issues discussed above are very challenging cases for future work on transliteration.

References
- Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K. 2003. Fuzzy Translation of Cross-Lingual Spelling Variants. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Gao W., Lam W., and Wang K. 2004. Phoneme-based Transliteration of Foreign Names for OOV Problem. International Joint Conference on Natural Language Processing.
- Fujimura O. 1975. Syllable as a Unit of Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.
- Koehn P. et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Demonstration Session, Prague, Czech Republic.
- Laver J. 1994. Principles of Phonetics. Cambridge University Press. P. 114.
- Knight K. and Graehl J. 1997. Machine Transliteration. In Proceedings of ACL 1997. Pp. 128-135.
- Stolcke A. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing.
- Choudhury M. and Basu A. 2002. A Rule Based Schwa Deletion Algorithm for Hindi. Technical Report, Dept. of Comp. Sci. & Engg., Indian Institute of Technology, Kharagpur.

Background Theory

Approaches towards Transliteration

Complex Syllable structure
Fig: Detailed syllable structure
Fig: Complex syllables fitting in above structure

Sonority theory & syllables
“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”
- Represented as waves of sonority, i.e. the Sonority Profile of that syllable
[Figure labels: Onset, Nucleus, Coda]

Sonority Hierarchy for English and Hindi

Fig: Sonority hierarchy for English (Sound Segment: Letters contained)
- Vowels: {"a", "e", "i", "o", "u"}
- Liquids: {"y", "r", "l", "v", "w"}
- Nasals: {"n", "m"}
- Fricatives: {"s", "z", "f", "th", "h", "sh", "x"}
- Affricates: {"ch", "j"}
- Stops: {"b", "d", "g", "p", "t", "k", "q", "c"}

Fig: Sonority hierarchy for Hindi (Sound Segment: Letters contained)
- Vowels: {"अ","आ","इ","ई","उ","ऊ","ऋ","ए","ऐ","ओ","औ","अं","अः"}
- Matras: {"ा","ि","ी","ु","ू","े","ै","ो","ौ","ं"}
- Liquids: {"य","र","ल","व"}
- Nasals: {"न", "म", "ण", "ञ", "ङ", "ज्ञ"}
- Fricatives: {"स", "झ", "फ", "थ", "ह", "श", "ष", "क्ष", "त्र"}
- Affricates: {"च", "ज", "छ"}
- Stops: {"ब", "ड", "ग", "प", "ट", "त", "क", "भ", "ध", "ढ", "घ", "फ", "थ", "ठ", "ख", "द"}
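The English table can be turned into a lookup that assigns each letter (or two-letter unit such as "th") a sonority rank. The sketch below does this for romanized input. It assumes the table is ordered from most sonorous (vowels) to least sonorous (stops), that two-letter units are matched before single letters, and that the input contains only the letters a to z; the names and the numeric ranks are hypothetical.

```python
# Minimal sketch: sonority profile of a romanized word from the English table.
# Assumptions: classes ordered least to most sonorous below; digraphs first.
SONORITY_CLASSES = [
    ("stop", ["b", "d", "g", "p", "t", "k", "q", "c"]),
    ("affricate", ["ch", "j"]),
    ("fricative", ["s", "z", "f", "th", "h", "sh", "x"]),
    ("nasal", ["n", "m"]),
    ("liquid", ["y", "r", "l", "v", "w"]),
    ("vowel", ["a", "e", "i", "o", "u"]),
]
RANK = {
    segment: rank
    for rank, (_, segments) in enumerate(SONORITY_CLASSES)
    for segment in segments
}


def sonority_profile(word):
    """Return (segment, rank) pairs, preferring two-letter units like 'th'."""
    word = word.lower()
    profile = []
    i = 0
    while i < len(word):
        segment = word[i:i + 2] if word[i:i + 2] in RANK else word[i]
        profile.append((segment, RANK[segment]))  # KeyError on non-letters
        i += len(segment)
    return profile


print(sonority_profile("abhijeet"))
```

For ABHIJEET the ranks come out as 5, 0, 2, 5, 1, 5, 5, 0, so the vowels form the peaks around which syllable boundaries can be sought.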

Maximal Onset Principle
“The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.”
- When a word has two valid syllabifications, the one with the longest onsets is preferred.
- Example: Diploma. Di + plo + ma is preferred over Dip + lo + ma.
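A small sketch of the principle on romanized words follows. For every consonant cluster between two vowel groups, the longest suffix that is an allowed onset goes to the following syllable and the rest stays behind as coda. The set of allowed clusters is a short illustrative list assumed here, not the language-specific conditions used in the project; only a, e, i, o, u are treated as vowels, and the function name is hypothetical.

```python
import re

# Minimal sketch of maximal-onset syllabification for romanized words.
# Assumption: ILLUSTRATIVE_ONSETS is a small made-up list of legal clusters;
# any single consonant is also accepted as an onset.
ILLUSTRATIVE_ONSETS = {
    "pl", "pr", "bl", "br", "tr", "dr", "kl", "kr", "gl", "gr",
    "fl", "fr", "sl", "sm", "sn", "sp", "st", "sk", "sh", "ch", "th",
}


def syllabify(word):
    """Split a word so intervocalic consonants go maximally to the onset."""
    runs = re.findall(r"[aeiou]+|[^aeiou]+", word.lower())
    syllables = []
    current = ""
    for index, run in enumerate(runs):
        if run[0] in "aeiou":
            current += run                      # nucleus
        elif index == 0:
            current = run                       # word-initial onset
        elif index == len(runs) - 1:
            current += run                      # word-final coda
        else:
            # Longest legal suffix of the cluster becomes the next onset.
            split = next(
                k for k in range(len(run))
                if run[k:] in ILLUSTRATIVE_ONSETS or len(run) - k <= 1
            )
            syllables.append(current + run[:split])
            current = run[split:]
    if current:
        syllables.append(current)
    return syllables


print(syllabify("Diploma"))  # ['di', 'plo', 'ma']
print(syllabify("Andrew"))   # ['an', 'drew']
```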

Schwa deletion algorithm
Procedure delete_schwa (DS)
Input: word (string of alphabets)
Output: input word with some schwas deleted
1. Mark all the full vowels and consonants followed by vowels other than the inherent schwas (i.e. consonants with Matras) and all the hs in the word as F, unless it is explicitly marked as half by use of halant.
2. Mark all the consonants immediately followed by consonants or halants (i.e. consonants of conjugate syllables) as H.
3. Mark all the remaining consonants, which are followed by implicit schwas, as U.
4. If in the word, y is marked as U and preceded by i, I, ri, u or U, mark it F.
5. If y, r, l or v are marked U and preceded by consonants marked H, then mark them F.
6. If a consonant marked U is followed by a full vowel, then mark that consonant as F.
7. While traversing the word from left to right, if a consonant marked U is encountered before any consonant or vowel marked F, then mark that consonant as F.
8. If the last consonant is marked U, mark it H.
9. If any consonant marked U is immediately followed by a consonant marked H, mark it F.
10. While traversing the word from left to right, for every consonant marked U, mark it H if it is preceded by F and followed by F or U, otherwise mark it F.
11. For all consonants marked H, if it is followed by a schwa in the original word, then delete the schwa from the word.
The resulting new word is the required output.
End procedure delete_schwa
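A reduced sketch of the marking idea is given below. It is not the full procedure: the special cases for h, for y after i or u, for y/r/l/v after a half consonant, and for a following full vowel are left out. It also assumes a simplified input format, a list of (consonant, vowel) pairs in romanized form where the vowel is "a" for an inherent schwa, "" for a consonant with halant, and anything else for a matra. The function name and the input format are assumptions made for illustration.

```python
# Reduced sketch of F/H/U marking for schwa deletion (several rules omitted).
# Input: list of (consonant, vowel); vowel "a" = inherent schwa,
# "" = halant (half consonant), anything else = matra.
def delete_schwa_sketch(aksharas):
    marks = []
    for _, vowel in aksharas:
        if vowel == "":
            marks.append("H")          # half consonant
        elif vowel == "a":
            marks.append("U")          # undecided: carries an inherent schwa
        else:
            marks.append("F")          # full: carries a matra
    # A U consonant met before any F keeps its schwa.
    for i, mark in enumerate(marks):
        if mark == "F":
            break
        if mark == "U":
            marks[i] = "F"
            break
    # The last consonant loses its schwa if still undecided.
    if marks and marks[-1] == "U":
        marks[-1] = "H"
    # A U consonant immediately followed by H keeps its schwa.
    for i in range(len(marks) - 1):
        if marks[i] == "U" and marks[i + 1] == "H":
            marks[i] = "F"
    # Remaining U: H if preceded by F and followed by F or U, otherwise F.
    for i, mark in enumerate(marks):
        if mark == "U":
            after_full = i > 0 and marks[i - 1] == "F"
            before_ok = i + 1 < len(marks) and marks[i + 1] in ("F", "U")
            marks[i] = "H" if after_full and before_ok else "F"
    # Drop the vowel of every consonant marked H.
    return "".join(
        consonant + ("" if mark == "H" else vowel)
        for (consonant, vowel), mark in zip(aksharas, marks)
    )


namaskar = [("n", "a"), ("m", "a"), ("s", ""), ("k", "aa"), ("r", "a")]
somlata = [("s", "o"), ("m", "a"), ("l", "a"), ("t", "aa")]
print(delete_schwa_sketch(namaskar))  # namaskaar
print(delete_schwa_sketch(somlata))   # somlataa
```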

Example of Schwa deletion
Fig: Application of Schwa deletion Algorithm

Examples

Correct Transliterations
- English to Hindi: REYMOND -> रेमंड, MUKUNDAN -> मुकुंदन
- Hindi to English: सलीमाबेगम -> SALEEMABEGAM, राजगुरू -> RAJGURU

Incorrect Transliterations
- English to Hindi: VENKATACHALAM -> वेंकाटाचलं, DHRUVA -> UNKव
- Hindi to English: कोमलता -> COMLATA