English-Hindi Translation in 21 Days Ondřej Bojar, Pavel Straňák, Daniel Zeman ÚFAL MFF, Univerzita Karlova, Praha.

Slides:



Advertisements
Similar presentations
Dependency tree projection across parallel texts David Mareček Charles University in Prague Institute of Formal and Applied Linguistics.
Advertisements

Raivis SKADIŅŠ a,b, Kārlis GOBA a and Valters ŠICS a a Tilde SIA, Latvia b University of Latvia, Latvia Machine Translation and Morphologically-Rich Languages.
1 Minimally Supervised Morphological Analysis by Multimodal Alignment David Yarowsky and Richard Wicentowski.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
TURKALATOR A Suite of Tools for English to Turkish MT Siddharth Jonathan Gorkem Ozbek CS224n Final Project June 14, 2006.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Discriminative Learning of Extraction Sets for Machine Translation John DeNero and Dan Klein UC Berkeley TexPoint fonts used in EMF. Read the TexPoint.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
Learning Bit by Bit Class 3 – Stemming and Tokenization.
“Applying Morphology Generation Models to Machine Translation” By Kristina Toutanova, Hisami Suzuki, Achim Ruopp (Microsoft Research). UW Machine Translation.
ACL 2005 WORKSHOP ON BUILDING AND USING PARALLEL TEXTS (WPT-05), Ann Arbor, MI. June Competitive Grouping in Integrated Segmentation and Alignment.
Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.
TectoMT two goals of TectoMT –to allow experimenting with MT based on deep- syntactic (tectogrammatical) transfer –to create a software framework into.
Czech-to-English Translation: MT Marathon 2009 Session Preview Jonathan Clark Greg Hanneman Language Technologies Institute Carnegie Mellon University.
Evidence from Content INST 734 Module 2 Doug Oard.
Kalyani Patel K.S.School of Business Management,Gujarat University.
Alignment by Bilingual Generation and Monolingual Derivation Toshiaki Nakazawa and Sadao Kurohashi Kyoto University.
The CMU-UKA Statistical Machine Translation Systems for IWSLT 2007 Ian Lane, Andreas Zollmann, Thuy Linh Nguyen, Nguyen Bach, Ashish Venugopal, Stephan.
Paradigm based Morphological Analyzers Dr. Radhika Mamidi.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
1/21 Introduction to TectoMT Zdeněk Žabokrtský, Martin Popel Institute of Formal and Applied Linguistics Charles University in Prague CLARA Course on Treebank.
Technical Report of NEUNLPLab System for CWMT08 Xiao Tong, Chen Rushan, Li Tianning, Ren Feiliang, Zhang Zhuyu, Zhu Jingbo, Wang Huizhen
Achieving Domain Specificity in SMT without Over Siloing William Lewis, Chris Wendt, David Bullock Microsoft Research Machine Translation.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
Profile The METIS Approach Future Work Evaluation METIS II Architecture METIS II, the continuation of the successful assessment project METIS I, is an.
Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.
2012: Monolingual and Crosslingual SMS-based FAQ Retrieval Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland.
Lecture 6 Hidden Markov Models Topics Smoothing again: Readings: Chapters January 16, 2013 CSCE 771 Natural Language Processing.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Better Punctuation Prediction with Dynamic Conditional Random Fields Wei Lu and Hwee Tou Ng National University of Singapore.
Morphology & Machine Translation Eric Davis MT Seminar 02/06/08 Professor Alon Lavie Professor Stephan Vogel.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Morpho Challenge competition Evaluations and results Authors Mikko Kurimo Sami Virpioja Ville Turunen Krista Lagus.
Czech-English Word Alignment Ondřej Bojar Magdalena Prokopová
ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong
Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong.
Phrase Reordering for Statistical Machine Translation Based on Predicate-Argument Structure Mamoru Komachi, Yuji Matsumoto Nara Institute of Science and.
Coşkun Mermer, Hamza Kaya, Mehmet Uğur Doğan National Research Institute of Electronics and Cryptology (UEKAE) The Scientific and Technological Research.
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
Automatic Post-editing (pilot) Task Rajen Chatterjee, Matteo Negri and Marco Turchi Fondazione Bruno Kessler [ chatterjee | negri | turchi
02/19/13English-Indian Language MT (Phase-II)1 English – Indian Language Machine Translation Anuvadaksh Phase – II - The SMT Team, CDAC Mumbai.
Cs target cs target en source Subject-PastParticiple agreement Czech subject and past participle must agree in number and gender. Two-step translation.
Approximating a Deep-Syntactic Metric for MT Evaluation and Tuning Matouš Macháček, Ondřej Bojar; {machacek, Charles University.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Ibrahim Badr, Rabih Zbib, James Glass. Introduction Experiment on English-to-Arabic SMT. Two domains: text news,spoken travel conv. Explore the effect.
Tokenization & POS-Tagging
NRC Report Conclusion Tu Zhaopeng NIST06  The Portage System  For Chinese large-track entry, used simple, but carefully- tuned, phrase-based.
LREC 2004, 26 May 2004, Lisbon 1 Multimodal Multilingual Resources in the Subtitling Process S.Piperidis, I.Demiros, P.Prokopidis, P.Vanroose, A. Hoethker,
Natural Language Processing Group Computer Sc. & Engg. Department JADAVPUR UNIVERSITY KOLKATA – , INDIA. Professor Sivaji Bandyopadhyay
Large Vocabulary Data Driven MT: New Developments in the CMU SMT System Stephan Vogel, Alex Waibel Work done in collaboration with: Ying Zhang, Alicia.
NSF PARTNERSHIP FOR RESEARCH AND EDUCATION : M EANING R EPRESENTATION FOR S TATISTICAL L ANGUAGE P ROCESSING 1 TectoMT TectoMT = highly modular software.
Text Summarization using Lexical Chains. Summarization using Lexical Chains Summarization? What is Summarization? Advantages… Challenges…
Brno, TSD, Using TectoMT as a Preprocessing Tool for Phrase-Based SMT Daniel Zeman ÚFAL MFF Univerzita Karlova v Praze Charles University in.
1/16 TectoMT Zdeněk Žabokrtský ÚFAL MFF UK Software framework for developing MT systems (and other NLP applications)
Mountain... पहाड mountain... पर्वत Mr... श्री Aligarh District... अलीगढ जिला Allahabad University... इलाहाबाद विश्वविद्यालय Amit Vilasrao Deshmukh... अमित.
Build MT systems with Moses MT Marathon Americas 2016 Hieu Hoang.
A CASE STUDY OF GERMAN INTO ENGLISH BY MACHINE TRANSLATION: MOSES EVALUATED USING MOSES FOR MERE MORTALS. Roger Haycock 
Language Identification and Part-of-Speech Tagging
Ankit Srivastava CNGL, DCU Sergio Penkale CNGL, DCU
Translation of Unknown Words in Low Resource Languages
Suggestions for Class Projects
Issues in Arabic MT Alex Fraser USC/ISI 9/22/2018 Issues in Arabic MT.
Build MT systems with Moses
Complex sentences.
Improving Parsing Accuracy by Combining Diverse Dependency Parsers
Improved Word Alignments Using the Web as a Corpus
Statistical Machine Translation Papers from COLING 2004
Statistical NLP Spring 2011
Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8 Ethan Phelps-Goodman.
SANSKRIT ANALYZING SYSTEM
Presentation transcript:

English-Hindi Translation in 21 Days Ondřej Bojar, Pavel Straňák, Daniel Zeman ÚFAL MFF, Univerzita Karlova, Praha

2 ICON 2008, पुणे, Data Parallel (en-hi) –TIDES (50k training sentences, 1.2M hi words) –EILMT (7k training sentences, 181k hi words) –EMILLE (200k en words) –Daniel Pipes (322 texts) –Agriculture (17k en ~ 13k hi words) Monolingual (hi) –Hindi news web sites (18M sentences, 309M words)

3 ICON 2008, पुणे, Impact of additional data Larger parallel data helps –Test data: EILMT –Training & dev data: EILMT18.88 ± 2.05 EILMT+TIDES19.27 ± 2.22 EILMT+TIDES+20k web sents20.07 ± 2.21

4 ICON 2008, पुणे, Impact of additional data Larger Hindi LM data does not help –Test data: EILMT –Parallel training data: EILMT + TIDES + 20k web sentences –LM training data: EILMT + web (>300M words):18.82 ± 2.13 EILMT (181k words):20.07 ± 2.21 –Out of domain –Incompatible tokenization?

5 ICON 2008, पुणे, Moses setup Alignment heuristics: grow-diag-final-and (GDFA) –4 times more extracted phrases than GDF –BLEU + 5 points (table) 5

6 ICON 2008, पुणे, Alignment heuristics EILMTall grow-diag-final13.82 ± ± 1.46 grow-diag-final- and ± ± 2.21

7 ICON 2008, पुणे, Alignment heuristics: CS-EN CS to ENEN to CS grow-diag-final17.37 ± ± 0.88 grow-diag-final- and ± ± 0.87

8 ICON 2008, पुणे, Moses settings Alignment using first four characters (“light stemming”) –helps with GDF (not significantly) –does not help with GDFA (not significantly) MERT tuning of feature weights –(not included in official baseline)

9 ICON 2008, पुणे, Rule-based reordering Move finite verb forms to the end of the sentence (not crossing punctuation, “that”, WH-words). Transform prepositions to postpositions TectoMT, Morče tagger (perceptron), McDonald’s MST parser 9

10 ICON 2008, पुणे, Reordering example Technology is the most obvious part : the telecommunications revolution is far more pervasive and spreading more rapidly than the telegraph or telephone did in their time. Technology the most obvious part is : the telecommunications revolution far more pervasive is and spreading more rapidly than the telegraph or telephone their time in did. 10

11 ICON 2008, पुणे, Unsupervised stem-suffix segmentation Factors in Moses –Lemma + tag: but we do not have a tagger –Stem + suffix: unsupervised learning is language independent –A tool by Dan Zeman (Morpho Challenge 2007, 2008)

12 ICON 2008, पुणे, Core Idea Assumption: 2 morphemes: stem+suffix –Suffix can be empty All splits of all words –(into a stem and a suffix) Set of suffixes seen with the same stem is a paradigm –In a wider sense, paradigm = set of suffixes + set of stems seen with the suffixes

13 ICON 2008, पुणे, Paradigms get filtered Remove the paradigm if: –There are more suffixes than stems –All suffixes begin with the same letter –There is only one suffix Merge paradigms A and B if: –B is subset of A –A is the only superset of B

14 ICON 2008, पुणे, Paradigm Examples (en) Suffixes: e, ed, es, ing, ion, ions, or Stems: calibrat, decimat, equivocat, … Suffixes: e, ed, es, ing, ion, or, ors Stems: aerat, authenticat, disseminat, … Suffixes: 0, d, r, r’s, rs, s Stems: analyze, chain-smoke, collide, …

15 ICON 2008, पुणे, Paradigm Examples (hi) Suffixes: 0, ा, े, ों Stems: अहात, खांच, घुटन, चढ़ाव, … Suffixes: 0, ं, ंगे, गा Stems: कराए, दर्शाए, फेंके, बदले, … Suffixes: 0, ि, ियां, ियों Stems: अनुभूत, अभिव़्यक्त, …

16 ICON 2008, पुणे, Learning Phase Outcomes List of paradigms List of known stems List of known suffixes List of stem-suffix pairs seen together How can we use that to segment a word?

17 ICON 2008, पुणे, Morphemic Segmentation Consider all possible splits of the word 1. Stem & suffix known and allowed together 2. Stem & suffix known but not together 3. Stem is known 4. Suffix is known 5. Both unknown We use 4 (longest known suffix)

18 ICON 2008, पुणे, Impact of our preprocessing