Transliteration. CS 626 course seminar by Purva Joshi (08305907), Mugdha Bapat (07305916), Aditya Joshi (08305908), Manasi Bapat (08305906).



Humans transliterate frequently, for different reasons. Can a machine do this? (Why would a machine have to do this?) If yes, how? (Picture courtesy: snapshot of Yahoo! Messenger.)

Motivation: Transliteration is an important component of machine translation: when you cannot translate, transliterate. It is generally used for named entities, technical terms and out-of-vocabulary (OOV) words. The issues are specific to sounds, scripts and accents. Can a machine do this? If yes, how?

What is transliteration? The task of converting a word from one alphabetic script to another. Used for named entities (e.g. Gandhiji) and out-of-vocabulary words (e.g. Bank).

Linguistic issues: Accents (Thoda or thora?). Mapping of sounds (Mahaan / Kahaan). Back-transliteration.

Linguistic issues: mapping of sounds (Arabic, Chinese, Hindi / Japanese)
Arabic: Arabic b -> English p or b. The English word Paul transliterates to the Arabic word Baul (an issue in back-transliteration).
Chinese: Ideographic symbols in Chinese. The origin of the proper noun determines the symbol in the Chinese language.
Japanese: Several English symbols do not map to any Japanese symbol, so they are often mapped to the closest-sounding symbol: ice cream -> aisukuriimu.
Position: Symbols map to different symbols based on their position, e.g. America.
Difference in origin: Restaurant, constant.

Overview (diagram labels): Source String, Transliteration Units; Target String, Transliteration Units

Contents (diagram labels): Source String, Transliteration Units; Target String, Transliteration Units; Phoneme-based

Phoneme-based approach
Word in source language -> pronunciation in source language -> pronunciation in target language -> word in target language. The three steps are scored by P(p_s | w_s), P(p_t | p_s) and P(w_t | p_t), together with P(w_t).
Note: A phoneme is the smallest linguistically distinctive unit of sound.
$$W_t^{*} = \arg\max \left( P(w_t) \cdot P(w_t \mid p_t) \cdot P(p_t \mid p_s) \cdot P(p_s \mid w_s) \right)$$
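The product in the formula above can be sketched in a few lines of Python. This is only an illustration: the candidate names and every probability value below are made-up placeholders, not figures from the seminar, and a real system would obtain them from trained models.

```python
def chain_score(candidate):
    """Product P(w_t) * P(w_t|p_t) * P(p_t|p_s) * P(p_s|w_s) for one candidate."""
    return (candidate["p_wt"]
            * candidate["p_wt_given_pt"]
            * candidate["p_pt_given_ps"]
            * candidate["p_ps_given_ws"])


def best_target(candidates):
    """Return the name of the candidate with the highest chain score (the argmax)."""
    return max(candidates, key=lambda name: chain_score(candidates[name]))


# Hypothetical toy values, chosen only so the arithmetic is easy to follow.
CANDIDATES = {
    "target_A": {"p_wt": 0.02, "p_wt_given_pt": 0.9,
                 "p_pt_given_ps": 0.5, "p_ps_given_ws": 0.6},
    "target_B": {"p_wt": 0.01, "p_wt_given_pt": 0.8,
                 "p_pt_given_ps": 0.7, "p_ps_given_ws": 0.6},
}

if __name__ == "__main__":
    for name in CANDIDATES:
        print(name, chain_score(CANDIDATES[name]))
    print("best:", best_target(CANDIDATES))
```

In practice the product is usually computed as a sum of log-probabilities to avoid underflow on long words.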

Phoneme-based approach: transliterating 'BAPAT'
Step I: Consider each character of the word: B, A, P, A, T.
Step II: Converting to phoneme sequence (source word to phonemes). Diagram labels: B, /ə/, /a:/, P, /ə/, /a:/, T.
Step III: Converting to target phoneme sequence (source phonemes to target phonemes). Diagram labels: B, /ə/, /a:/, P, /ə/, /a:/, T, t.

Phoneme-based approach, Step IV: Phoneme sequence to target string. Each target phoneme (B, /ə/, /a:/, P, /ə/, /a:/, T, t) is mapped to a target character. Output:
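Steps I to III for 'BAPAT' can be sketched as a candidate enumeration. The two tables below are assumptions read off the slide labels (each A may be /ə/ or /a:/, and T may stay T or become t); they are not a real pronunciation lexicon, and Step IV is left out because the target characters are not available here.

```python
from itertools import product

# Assumed letter-to-phoneme options for the source word (Step II).
LETTER_TO_PHONEMES = {"B": ["B"], "A": ["ə", "a:"], "P": ["P"], "T": ["T"]}

# Assumed source-phoneme to target-phoneme options (Step III).
SOURCE_TO_TARGET_PHONEMES = {
    "B": ["B"], "ə": ["ə"], "a:": ["a:"], "P": ["P"], "T": ["T", "t"],
}


def expand(symbols, table):
    """All sequences obtained by choosing one table entry per input symbol."""
    return [list(choice) for choice in product(*(table[s] for s in symbols))]


def candidate_target_phonemes(word):
    """Enumerate every target phoneme sequence reachable from `word`."""
    letters = list(word)                                   # Step I
    results = []
    for source_seq in expand(letters, LETTER_TO_PHONEMES):  # Step II
        results.extend(expand(source_seq, SOURCE_TO_TARGET_PHONEMES))  # Step III
    return results


if __name__ == "__main__":
    for seq in candidate_target_phonemes("BAPAT"):
        print(" ".join(seq))
```

A scoring model would then rank these candidates; the enumeration itself only shows where the ambiguity comes from.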

Concerns: Along the chain (word in source language -> pronunciation in source language -> pronunciation in target language -> word in target language), check whether the word is valid in the target language and whether the environment is noise-free.

Issues in phonetic model: Unknown pronunciations. Back-transliteration can be a problem: Johnson -> Jonson. sanhita / samhita.

Contents (diagram labels): Source String, Transliteration Units; Target String, Transliteration Units; Phoneme-based; Spelling-based

Spelling-based model: Maps source word sequences to target word sequences (i.e. direct word to word). The transliteration score: P(w). A letter trigram model is included; thus, we can accommodate words not included in the dictionary. (Diagram labels: word in source language, pronunciation in source language, pronunciation in target language, word in target language.)
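A letter trigram model of the kind mentioned above can be sketched as follows. The padding symbols, the add-one smoothing and the tiny training list are all assumptions made for this illustration; the slide only states that a letter trigram model is included.

```python
import math
from collections import Counter


def train_letter_trigrams(words):
    """Count letter trigrams and their two-letter contexts over a word list."""
    tri, bi = Counter(), Counter()
    alphabet = {"$"}                      # "$" marks end of word
    for word in words:
        word = word.lower()
        alphabet.update(word)
        padded = "^^" + word + "$"        # "^^" pads the start of the word
        for i in range(2, len(padded)):
            tri[padded[i - 2:i + 1]] += 1
            bi[padded[i - 2:i]] += 1
    return tri, bi, len(alphabet)


def log_prob(word, model):
    """Add-one smoothed log P(w) as a sum of letter trigram log-probabilities."""
    tri, bi, vocab_size = model
    padded = "^^" + word.lower() + "$"
    total = 0.0
    for i in range(2, len(padded)):
        trigram, context = padded[i - 2:i + 1], padded[i - 2:i]
        total += math.log((tri[trigram] + 1) / (bi[context] + vocab_size))
    return total


if __name__ == "__main__":
    model = train_letter_trigrams(["george", "georgia", "geology", "john"])
    for w in ["geo", "georg", "xqz"]:
        print(w, round(log_prob(w, model), 3))
```

Because of the smoothing, a word that never appeared in training still receives a finite score, which is the property the slide points to when it says unseen words can be accommodated.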

Comparison of the two methods

Contents (diagram labels): Source String, Transliteration Units; Target String, Transliteration Units; Phoneme-based; Spelling-based; Joint Source Channel

The Third Method - Why? Particularly developed for Chinese. Chinese: highly ideographic. Example: (image courtesy: wikimedia-commons). Two main steps: Modeling and Decoding.

Modeling step: A bilingual dictionary in the source and target language is used. From this dictionary, the character mapping between the source and target language is learnt. The word "Geo" has two possible mappings, so the "context" in which it occurs is important. Examples: John, Georgia, Geology.

Modeling step (continued): N-gram mapping. This concludes the modeling step.
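For reference, an n-gram mapping of this kind is commonly written as below. This is a sketch based on the joint source-channel paper by Li, Zhang and Su listed in the references, not a copy of the slide's own formula. The notation is assumed here: the source word E and target word C are split into K aligned transliteration units, written as pairs, and each pair is conditioned on the previous n-1 pairs.

```latex
P(E, C) = \prod_{k=1}^{K} P\left( \langle e, c \rangle_k \mid \langle e, c \rangle_{k-n+1}^{k-1} \right)
```

With n = 2 this reduces to a bigram model over unit pairs, which is how the "context" of a unit such as "Geo" enters the score.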

Decoding step: Consider the transliteration of the word "George". Alignments of George: Geo | rge and G | eo | rge.
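The candidate alignments of "George" can be enumerated with a short recursive split. The unit inventory below is an assumption made for the example; in the seminar's setting it would come from the transliteration units learnt in the modeling step.

```python
def segmentations(word, units):
    """All ways to split `word` into consecutive pieces that are all in `units`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in units:
            # Keep this piece and split the remainder recursively.
            for rest in segmentations(word[end:], units):
                results.append([head] + rest)
    return results


# Hypothetical inventory of source-side transliteration units.
UNITS = {"Geo", "G", "eo", "rge"}

if __name__ == "__main__":
    for pieces in segmentations("George", UNITS):
        print(" | ".join(pieces))
```

The recursion is exponential in the worst case; a dynamic-programming lattice would be the usual choice for longer inputs.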

Decoding step (continued): Decision to be made between ... The context mapping is present in the map-dictionary. Using ...
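One way to make the decision between two alignments is to score each with a bigram model over unit pairs and keep the best. This is a minimal sketch under stated assumptions: the target symbols (T1, T2, ...) are placeholders, the probabilities are invented, and the floor value for unseen bigrams is an arbitrary choice.

```python
import math


def alignment_log_prob(pairs, bigram_prob, floor=1e-6):
    """Sum of log P(pair_k | pair_{k-1}), starting from a "<s>" context."""
    prev = "<s>"
    total = 0.0
    for pair in pairs:
        total += math.log(bigram_prob.get((prev, pair), floor))
        prev = pair
    return total


def decode(candidates, bigram_prob):
    """Return the candidate alignment with the highest bigram score."""
    return max(candidates, key=lambda pairs: alignment_log_prob(pairs, bigram_prob))


# Two hypothetical alignments of "George"; target symbols are placeholders.
ALIGN_A = [("Geo", "T1"), ("rge", "T2")]
ALIGN_B = [("G", "T3"), ("eo", "T4"), ("rge", "T2")]

# Invented context-dependent probabilities (the "map-dictionary" in this sketch).
BIGRAMS = {
    ("<s>", ("Geo", "T1")): 0.4,
    (("Geo", "T1"), ("rge", "T2")): 0.5,
    ("<s>", ("G", "T3")): 0.3,
    (("G", "T3"), ("eo", "T4")): 0.2,
    (("eo", "T4"), ("rge", "T2")): 0.5,
}

if __name__ == "__main__":
    print(decode([ALIGN_A, ALIGN_B], BIGRAMS))
```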

Transliteration alignment: Where do the n-gram statistics come from? Ans.: Automatic analysis of the bilingual dictionary. How to align this dictionary? Ans.: Using the EM algorithm.

EM algorithm: Bootstrap -> Expectation -> Maximization -> Transliteration Units
Bootstrap: initial random alignment.
Expectation: update n-gram statistics to estimate the probability distribution.
Maximization: apply the n-gram TM to obtain a new alignment.
Transliteration units: derive a list of transliteration units from the final alignment.
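The loop above can be sketched in Python under several simplifying assumptions, all of them mine rather than the seminar's: unigram statistics stand in for the n-gram TM, each target symbol is paired with one non-empty source piece, the update is a hard (best-alignment) one, and the target symbols T1, T2, ... are placeholders. Every source word must be at least as long as its target sequence.

```python
import random
from collections import Counter
from itertools import combinations


def splits(source, k):
    """All ways to cut `source` into k non-empty consecutive pieces."""
    for cuts in combinations(range(1, len(source)), k - 1):
        bounds = (0, *cuts, len(source))
        yield [source[a:b] for a, b in zip(bounds, bounds[1:])]


def em_align(pairs, iterations=10, seed=0):
    """Align (source string, target symbol tuple) pairs; return alignment and units."""
    rng = random.Random(seed)
    # Bootstrap: initial random alignment.
    alignment = [list(zip(rng.choice(list(splits(s, len(t)))), t)) for s, t in pairs]
    for _ in range(iterations):
        # Expectation: update unit-pair statistics from the current alignment.
        counts = Counter(unit for aligned in alignment for unit in aligned)
        total = sum(counts.values())

        def score(aligned):
            # Lightly smoothed relative frequencies (not a normalized distribution).
            p = 1.0
            for unit in aligned:
                p *= (counts[unit] + 0.1) / (total + 0.1)
            return p

        # Maximization: re-align every pair with the updated statistics.
        new_alignment = [
            max((list(zip(pieces, t)) for pieces in splits(s, len(t))), key=score)
            for s, t in pairs
        ]
        if new_alignment == alignment:   # stop once the alignment is stable
            break
        alignment = new_alignment
    # Derive the list of transliteration units from the final alignment.
    units = sorted({unit for aligned in alignment for unit in aligned})
    return alignment, units


if __name__ == "__main__":
    data = [("geo", ("T1", "T2")), ("george", ("T1", "T2", "T3")), ("john", ("T4",))]
    final_alignment, unit_list = em_align(data)
    print(final_alignment)
    print(unit_list)
```

The result depends on the random bootstrap, so different seeds can settle on different unit inventories for a dictionary this small.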

Evaluation: E2C error rates for n-gram tests; E2C v/s C2E for TM tests.

Conclusion: Transliteration can make use of phonemes as an intermediate layer to move from one script to another. The spelling-based approach connects the word sequences of the two languages. The joint source channel method integrates optimization of alignment and transliteration: no pre-alignment is needed, which reduces development effort.

( the end )

References
For all Devnagari transliterations,
Joint source-channel model:
H. Li, M. Zhang, and J. Su. A joint source-channel model for machine transliteration. In ACL, pages 159–
Phoneme and spelling-based models:
Y. Al-Onaizan and K. Knight. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.
K. Knight and J. Graehl. Machine transliteration. Computational Linguistics, 24(4):599–612.
N. AbdulJaleel and L. S. Larkey. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.