Transliteration: CS 626 course seminar by Purva Joshi (08305907), Mugdha Bapat (07305916), Aditya Joshi (08305908), Manasi Bapat (08305906).


1 Transliteration: CS 626 course seminar by Purva Joshi (08305907), Mugdha Bapat (07305916), Aditya Joshi (08305908), Manasi Bapat (08305906)

2 Humans transliterate frequently, for different reasons. Can a machine do this? (Why would a machine have to do this?) If yes, how? Picture courtesy: snapshot of Yahoo! Messenger.

3 Motivation: Transliteration is an important component of machine translation: when you cannot translate, transliterate. It is generally used for named entities, technical terms and out-of-vocabulary (OOV) words. It raises issues specific to sounds, scripts and accents. Can a machine do this? If yes, how?

4 What is transliteration? The task of converting a word from one alphabetic script to another. Used for named entities (e.g. Gandhiji) and out-of-vocabulary words (e.g. Bank).

5 Linguistic issues: Accents: thoda or thora? Mapping of sounds: Mahaan, Kahaan. Back-transliteration.

6 Linguistic issues: mapping of sounds.
Arabic: Arabic b -> English p or b. The English word Paul transliterates to the Arabic word Baul (an issue in back-transliteration).
Chinese: symbols are ideographic, and the origin of the proper noun determines the symbol used.
Hindi / Japanese: several English symbols do not map to any Japanese symbols, so they are often mapped to the closest-sounding symbol (ice cream -> aisukuriimu). Symbols map to different symbols based on their position (America). Difference in origin (Restaurant, constant).

7 Overview (diagram labels): Source String, Transliteration Units, Target String, Transliteration Units

8 Contents (diagram labels): Source String, Transliteration Units, Target String, Transliteration Units. Approaches: Phoneme-based.

9 Phoneme-based approach: Word in source language -> pronunciation in source language, with P(p_s | w_s) -> pronunciation in target language, with P(p_t | p_s) -> word in target language, with P(w_t | p_t), weighted by the target word probability P(w_t). Note: a phoneme is the smallest linguistically distinctive unit of sound.
$$ w_t^{*} = \arg\max_{w_t} \; P(w_t)\, P(w_t \mid p_t)\, P(p_t \mid p_s)\, P(p_s \mid w_s) $$

10 Phoneme-based approach: transliterating 'BAPAT'.
Step I: Consider each character of the word: B, A, P, A, T.
Step II: Convert to a phoneme sequence (source word to phonemes): B, /ə/ or /a:/, P, /ə/ or /a:/, T.
Step III: Convert to the target phoneme sequence (source phonemes to target phonemes): B, /ə/ or /a:/, P, /ə/ or /a:/, t.

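11 Phoneme-based approach, Step IV: Phoneme sequence to target string. Each target phoneme (B, /ə/, /a:/, P, T, t) is mapped to a target-script symbol, and the symbols are joined to form the output. (The target-script symbols and the output string are not preserved in this transcript.)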
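The four steps and the argmax on slide 9 can be sketched as a small table-driven search. This is a minimal illustration, not the authors' implementation: every probability table, the ASCII phoneme labels and the Devanagari candidates are invented for the example, and the code takes the single best pronunciation path for each candidate rather than summing over paths.

```python
# Toy probability tables. All values and entries are hypothetical,
# chosen only to illustrate the product on slide 9.
P_PS_GIVEN_WS = {            # P(p_s | w_s): source word -> source phonemes
    "bapat": {"b aa p a t": 0.8, "b a p a t": 0.2},
}
P_PT_GIVEN_PS = {            # P(p_t | p_s): source phonemes -> target phonemes
    "b aa p a t": {"b aa p a T": 0.6, "b aa p a t": 0.4},
    "b a p a t": {"b a p a T": 1.0},
}
P_WT_GIVEN_PT = {            # P(w_t | p_t), written as on the slide
    "b aa p a T": {"बापट": 1.0},
    "b aa p a t": {"बापत": 1.0},
    "b a p a T": {"बपट": 1.0},
}
P_WT = {"बापट": 0.6, "बापत": 0.1, "बपट": 0.3}   # P(w_t): target word probability


def best_transliteration(source_word):
    """Return (w_t, score) maximizing P(w_t) P(w_t|p_t) P(p_t|p_s) P(p_s|w_s)."""
    best_word, best_score = None, 0.0
    for p_s, prob_ps in P_PS_GIVEN_WS.get(source_word, {}).items():
        for p_t, prob_pt in P_PT_GIVEN_PS.get(p_s, {}).items():
            for w_t, prob_wt in P_WT_GIVEN_PT.get(p_t, {}).items():
                score = P_WT.get(w_t, 0.0) * prob_wt * prob_pt * prob_ps
                if score > best_score:
                    best_word, best_score = w_t, score
    return best_word, best_score


word, score = best_transliteration("bapat")
print(word, score)
```

A source word missing from the first table returns `(None, 0.0)`, which mirrors the "unknown pronunciations" concern raised for the phonetic model.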

12 Concerns, along the pipeline word in source language -> pronunciation in source language -> pronunciation in target language -> word in target language: check if the environment is noise-free; check if the word is valid in the target language.

13 Issues in phonetic model: Unknown pronunciations. Back-transliteration can be a problem: Johnson -> Jonson. sanhita vs. samhita.

14 Contents (diagram labels): Source String, Transliteration Units, Target String, Transliteration Units. Approaches: Phoneme-based, Spelling-based.

15 Spelling-based model: Maps source word sequences to target word sequences (i.e. direct word to word). The transliteration score: P(w). A letter trigram model is included; thus, we can accommodate words not included in the dictionary. Diagram labels: Word in source language, Pronunciation in source language, Pronunciation in target language, Word in target language.
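To make the letter trigram idea concrete, here is a minimal sketch of a character trigram model with add-one smoothing. It is only one plausible reading of the slide: the training words, the padding symbols `^` and `$`, and the smoothing scheme are assumptions for the example. It shows why such a model can still score a word that is absent from the dictionary.

```python
import math
from collections import Counter


def train_letter_trigrams(words):
    """Count letter trigrams and their two-letter contexts over padded words."""
    trigrams, contexts, alphabet = Counter(), Counter(), set()
    for word in words:
        padded = "^^" + word + "$"          # assumed start/end padding symbols
        alphabet.update(padded)
        for i in range(len(padded) - 2):
            trigrams[padded[i:i + 3]] += 1
            contexts[padded[i:i + 2]] += 1
    return trigrams, contexts, len(alphabet)


def log_prob(word, model):
    """Log P(word) under the trigram model, with add-one smoothing."""
    trigrams, contexts, vocab_size = model
    padded = "^^" + word + "$"
    total = 0.0
    for i in range(len(padded) - 2):
        numerator = trigrams[padded[i:i + 3]] + 1
        denominator = contexts[padded[i:i + 2]] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical training list; "gandh" is not in it but still gets a score.
model = train_letter_trigrams(["gandhi", "ganesh", "gandhiji", "mahatma"])
print(log_prob("gandh", model), log_prob("xqzvk", model))
```

A string whose letter sequences resemble the training words scores higher than an arbitrary string, even though neither appears in the training list.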

16 Comparison of the two methods

17 Contents (diagram labels): Source String, Transliteration Units, Target String, Transliteration Units. Approaches: Phoneme-based, Spelling-based, Joint Source Channel.

18 The Third Method - Why? Particularly developed for Chinese. Chinese: highly ideographic. Example: (image courtesy: wikimedia-commons). Two main steps: Modeling and Decoding.

19 Modeling step: A bilingual dictionary in the source and target language is used. From this dictionary, the character mapping between the source and target language is learnt. The word “Geo” has two possible mappings; the “context” in which it occurs is important. Examples: John, Georgia, Geology, Geo.

20 Modeling step (continued): N-gram mapping: (formula not preserved in this transcript). This concludes the modeling step.
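The formula on this slide did not survive extraction. For orientation only, the n-gram form of the joint source-channel model in Li et al. (2004), the paper cited in the references, is shown below. Whether the slide used exactly this notation is an assumption. Here $\langle e, c \rangle_k$ is the $k$-th aligned transliteration unit pair and $K$ is the number of pairs in the alignment.

```latex
P(E, C) = P(\langle e, c \rangle_1, \ldots, \langle e, c \rangle_K)
        \approx \prod_{k=1}^{K} P\!\left(\langle e, c \rangle_k \,\middle|\, \langle e, c \rangle_{k-n+1}^{k-1}\right)
```

Each unit pair is conditioned on the previous n-1 pairs, which is how the "context" from slide 19 enters the model.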

21 Decoding step: Consider the transliteration of the word “George”. Alignments of George: Geo | rge and G | eo | rge.
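The candidate alignments above can be enumerated mechanically once a set of source-side transliteration units is known. The sketch below assumes a hypothetical unit inventory; in the approach described in this seminar, the units would come from the aligned bilingual dictionary.

```python
def segmentations(word, units):
    """Return every way to split `word` into consecutive pieces drawn from `units`."""
    if not word:
        return [[]]                      # one way to segment the empty string
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in units:
            for rest in segmentations(word[end:], units):
                results.append([prefix] + rest)
    return results


# Hypothetical source-side unit inventory, for illustration only.
UNITS = {"g", "eo", "geo", "rge"}
for candidate in segmentations("george", UNITS):
    print(" | ".join(candidate))
```

The decoder then has to choose among these candidates, which is where the n-gram statistics from the modeling step come in.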

22 Decoding step (continued): Decision to be made between… The context mapping is present in the map-dictionary. Using…

23 Transliteration Alignment: Where do the n-gram statistics come from? Ans.: Automatic analysis of the bilingual dictionary. How to align this dictionary? Ans.: Using the EM algorithm.

24 EM Algorithm:
Bootstrap: initial random alignment.
Expectation: update n-gram statistics to estimate the probability distribution.
Maximization: apply the n-gram TM to obtain a new alignment.
Transliteration units: derive a list of transliteration units from the final alignment.
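The loop above can be sketched in a few dozen lines. This is a simplified stand-in rather than the method from the slides: it uses hard (Viterbi-style) re-alignment and unigram counts of unit pairs instead of a full n-gram TM. The word pairs, `MAX_UNIT`, and the floor probability for unseen pairs are all assumptions for the example.

```python
import random
from collections import Counter
from math import log

MAX_UNIT = 3  # assumed cap on the length of a transliteration unit


def random_alignment(src, tgt, rng):
    """Bootstrap: cut src and tgt into the same number of random chunks."""
    k = rng.randint(1, min(len(src), len(tgt)))

    def split(s):
        cuts = sorted(rng.sample(range(1, len(s)), k - 1))
        bounds = [0] + cuts + [len(s)]
        return [s[a:b] for a, b in zip(bounds, bounds[1:])]

    return list(zip(split(src), split(tgt)))


def best_alignment(src, tgt, counts, total):
    """Re-align one pair by dynamic programming over (i, j) positions."""
    floor = log(0.5 / total)             # assumed score for unseen unit pairs
    best = {(0, 0): (0.0, [])}
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if (i, j) not in best:
                continue
            score, path = best[(i, j)]
            for di in range(1, MAX_UNIT + 1):
                for dj in range(1, MAX_UNIT + 1):
                    ni, nj = i + di, j + dj
                    if ni > len(src) or nj > len(tgt):
                        continue
                    pair = (src[i:ni], tgt[j:nj])
                    c = counts.get(pair, 0)
                    cand = score + (log(c / total) if c else floor)
                    if (ni, nj) not in best or cand > best[(ni, nj)][0]:
                        best[(ni, nj)] = (cand, path + [pair])
    end = best.get((len(src), len(tgt)))
    return end[1] if end else [(src, tgt)]   # fall back to one whole-word unit


def em_align(pairs, iterations=5, seed=0):
    """Alternate between counting unit pairs and re-aligning the dictionary."""
    rng = random.Random(seed)
    alignments = [random_alignment(s, t, rng) for s, t in pairs]
    for _ in range(iterations):
        counts = Counter(p for al in alignments for p in al)   # update statistics
        total = sum(counts.values())
        alignments = [best_alignment(s, t, counts, total) for s, t in pairs]
    units = Counter(p for al in alignments for p in al)        # final unit list
    return alignments, units


# Hypothetical romanized word pairs standing in for a bilingual dictionary.
PAIRS = [("george", "jorj"), ("georgia", "jorjia"), ("john", "jon")]
final_alignments, unit_list = em_align(PAIRS)
for alignment in final_alignments:
    print(alignment)
```

With a unigram score, longer units tend to win because they need fewer factors. Conditioning each pair on its predecessors, as the n-gram TM on the slide does, is one way to bring context into that decision.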

25 Evaluation: E2C error rates for n-gram tests; E2C v/s C2E for TM tests.

26 Conclusion: Transliteration can make use of phonemes as an intermediate layer to move from one script to another. The spelling-based approach connects the word sequences of the two languages. The joint source channel method integrates optimization of alignment and transliteration: no pre-alignment is needed, which reduces development effort.

27 ( the end )

28 References
For all Devanagari transliterations: www.quillpad.in/hindi/
www.wikipedia.org
H. Li, M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166.
Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.
K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.
N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.
Topics covered by these references: joint source-channel model; phoneme and spelling-based models.

