Domain Adaptation for Statistical Machine Translation

Presentation transcript:

Domain Adaptation for Statistical Machine Translation. Master Defense, University of Macau. By Longyue WANG (Vincent), MT Group, NLP2CT Lab, FST, UM. Supervised by Prof. Lidia S. Chao and Prof. Derek F. Wong. 20/08/2014. Good morning everyone, and thank you for coming. I am very happy to be here today to share the research I carried out during my master's degree. My name is Longyue Wang, and I major in software engineering. My supervisors are Prof. Chao and Prof. Wong. The topic of my presentation is Domain Adaptation for Statistical Machine Translation.

Research Scope. Let us have a look at where we are. This tree shows our research scope: Computational Linguistics → Machine Translation → Text Translation (vs. Speech Translation) → Statistical MT (vs. Rule-based MT and Hybrid MT) → Domain-Specific Statistical MT, where we mainly explore domain adaptation approaches for statistical machine translation systems. Figure 1: Our Research Scope [1] [2]. [1] Daniel Jurafsky and James Martin (2008) An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition. Prentice Hall. [2] Wikipedia, http://en.wikipedia.org/wiki/Machine_translation.

Agenda: Introduction; Proposed Method I: New Criterion; Proposed Method II: Combination; Proposed Method III: Linguistics; Domain-Specific Online Translator; Conclusion. My presentation today will follow this agenda. We begin with a brief introduction covering background and problems. In Parts 2-4 we go straight into our proposed methods and experiments. Part 5 shows how to apply the domain adaptation techniques to a real-life application, and we finish with a conclusion.

Part I: Introduction. We will go through three questions.

The First Question: What is Statistical Machine Translation?

Statistical Machine Translation. I drew this framework to show you how it works. Figure 2: Phrase-based SMT Framework. SMT translations are generated on the basis of statistical models whose parameters are derived from the analysis of text corpora [3]. Currently, the most successful SMT approach is phrase-based SMT, where the smallest translation unit is an n-gram of consecutive words. [3] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics. 19:263–311.

Statistical Machine Translation: Parallel Corpus, Monolingual Corpus. Firstly, the statistical models are learnt from a corpus, which is a collection of texts. For example, this is a Chinese-English parallel corpus; if we only consider one side, it is a monolingual corpus. Figure 2: Phrase-based SMT Framework. A corpus is a collection of texts, e.g., the IWSLT2012 official corpus. A bilingual corpus is a collection of texts paired with their translations into another language; a monolingual corpus is in one language (mostly the target side). Corpora may come from different genres [ˈʒɒnrə], topics, etc.

Statistical Machine Translation: Word Alignment, Translation Table, Reordering Model. Then word-alignment information is mined from the bitext using the EM (expectation maximization) algorithm, and phrase pairs are extracted to build the translation table and the reordering table. Figure 2: Phrase-based SMT Framework. Word alignments can be mined with the help of the EM algorithm; phrase pairs are then extracted from the word alignments to generate the translation table, which ensures that foreign phrases match target ones. The distance-based reordering model is a penalty on changing the position of translated phrases.
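
To make the EM idea concrete, here is a minimal sketch of IBM Model 1 expectation-maximization on a toy set of sentence pairs. It is an illustrative simplification, not the alignment model used in the thesis (the experiments rely on GIZA++, see the appendix):

    from collections import defaultdict

    # Toy parallel corpus: (source tokens, target tokens)
    bitext = [(["das", "haus"], ["the", "house"]),
              (["das", "buch"], ["the", "book"])]

    # Initialize t(e|f) uniformly over the target vocabulary
    tgt_vocab = {e for _, es in bitext for e in es}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))

    for _ in range(50):                      # EM iterations
        count = defaultdict(float)           # expected counts c(e, f)
        total = defaultdict(float)           # marginal counts c(f)
        for fs, es in bitext:
            for e in es:                     # E-step: collect expected counts
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        for (e, f), c in count.items():      # M-step: re-estimate t(e|f)
            t[(e, f)] = c / total[f]

    print(round(t[("house", "haus")], 3))    # grows toward 1.0 as EM iterates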

Statistical Machine Translation: Language Model, which ensures that the output is fluent. Figure 2: Phrase-based SMT Framework. The language model assigns a probability to a sequence of words; with an n-gram model [4] this is approximated as

$P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$    (1)

[4] F. Song and W. B. Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280.
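
As a toy illustration of equation (1), here is a hypothetical bigram model scoring a short sentence; the probabilities are made up for the example:

    import math

    # Hypothetical bigram probabilities P(w_i | w_{i-1})
    bigram = {("<s>", "the"): 0.4, ("the", "house"): 0.2, ("house", "</s>"): 0.5}

    def sentence_logprob(words, lm):
        """Sum of log P(w_i | w_{i-1}) over the padded sentence."""
        padded = ["<s>"] + words + ["</s>"]
        return sum(math.log(lm[(h, w)]) for h, w in zip(padded, padded[1:]))

    print(math.exp(sentence_logprob(["the", "house"], bigram)))  # 0.4 * 0.2 * 0.5 = 0.04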

Statistical Machine Translation: Source Text → Searching Translation Candidates (Decoding) → Target Text. After the training part, all models are generated, and the decoding function tries to retrieve the best translation candidate. Figure 2: Phrase-based SMT Framework. The decoding function consists of three components: the phrase translation table, which ensures that foreign phrases match target ones; the reordering model, which reorders the phrases appropriately; and the language model, which ensures that the output is fluent:

$\hat{e} = \arg\max_{e} \; p_{LM}(e) \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(start_i - end_{i-1} - 1)$    (2)
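
As a sketch of how equation (2) combines the three components for one candidate segmentation, assume we already have the phrase probabilities, a distortion penalty and an LM score; all names and values here are hypothetical, and a real decoder searches over many such candidates:

    import math

    def candidate_score(phrase_pairs, lm_logprob, distortion_weight=0.6):
        """Score one segmentation: phrase probabilities, distance-based
        reordering penalties and the LM probability, combined in log space."""
        score = lm_logprob
        prev_end = -1
        for src_start, src_end, phrase_prob in phrase_pairs:
            score += math.log(phrase_prob)
            score += abs(src_start - prev_end - 1) * math.log(distortion_weight)
            prev_end = src_end
        return score

    # One hypothetical candidate: two phrases, translated in source order
    print(candidate_score([(0, 1, 0.3), (2, 3, 0.5)], lm_logprob=math.log(0.01)))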

The Second Question: What is a Domain-Specific SMT System?

Typical SMT vs. Domain-Specific SMT. Typical SMT systems are trained on a large and broad corpus (i.e., general-domain) and translate texts while ignoring their domain. Performance depends heavily upon the quality and quantity of the training data. Outputs preserve the semantics of the source side but lack morphological and syntactic correctness: understandable translation quality. BBC News example [5]. Input: Hollywood actor Jackie Chan has apologised over his son's arrest on drug-related charges, saying he feels "ashamed" and "sad". Google output: 好萊塢影星成龍已經道歉了他兒子的被捕與毒品有關的指控，說他感覺“羞恥”和“悲傷”。 Let us first have a look at the typical case and its performance. As native speakers, we know the sentence is wrong, but we still get the general meaning. [5] Available at http://www.bbc.com/news/world-asia-china-28871698. (BBC News, 20 August 2014.)

Typical SMT vs. Domain-Specific SMT. Domain-specific SMT systems are trained on a small but relevant corpus (i.e., in-domain) and deal with texts from one specific domain. They consider the relevance between the training data and what we want to translate (the test data). Outputs preserve the semantics of the source side as well as morphological and syntactic correctness: publishable quality. Patent document example [6]. Input: 本发明涉及新的tetramic酸型化合物，它从CCR-5活性复合物中分离出来，在控制条件下通过将生物纯的微生物培养液（球毛壳霉Kunze SCH 1705 ATCC 74489）发酵来制备复合物。 ICONIC Translator output: Novel tetramic acid-type compounds isolated from a CCR-5 active complex produced by fermentation under controlled conditions of a biologically pure culture of the microorganism, Chaetomium globosum Kunze SCH 1705, ATCC 74489, pharmaceutical compositions containing the compounds. Good lexical choices (new → novel, separated → isolated); the chemical terminology, numbers and hyphenated words are handled correctly. [6] Chinese Patent WO01/74772《受体拮抗剂趋化因子》.

The Third Question: What are the Domain-Specific Translation Challenges? To achieve this goal, what are the main challenges?

Challenge 1 – Ambiguity. The senses of an ambiguous word may not coincide across languages. The English word mouse refers to both the animal and the electronic device, while on the Chinese side these are two different words. Choosing the wrong translation variant is a potential cause of miscomprehension, and different translation systems may prefer different translation candidates, because they have different knowledge. Figure 3: Translation ambiguity example

Challenge 2 – Language Style. News domain: tries to deliver rich information with very economical language; short, simply-structured sentences make it easy to understand; many abbreviations, dates and named entities. Example: China's Li Duihong won the women's 25-meter sport pistol Olympic gold with a total of 687.9 points early this morning Beijing time. (Guangming Daily, 1996/07/02) 我国女子运动员李对红今天在女子运动手枪决赛中，以687.9环战胜所有对手，并创造新的奥运记录。(《光明日报》 1996年7月2日) I take texts from two domains as examples; by observing the sentences, we can see these properties.

Challenge 2 – Language Style. Law domain: very rigorous, even with duplicated terms; uses fewer pronouns, abbreviations etc. to avoid any ambiguity; high frequency of shall, may, must, be to; long sentences with long subordinate clauses [sə'bɔːdɪnət]. Example: When an international treaty that relates to a contract and which the People's Republic of China has concluded or participated in contains provisions that differ from the law of the People's Republic of China, the provisions of the said treaty shall be applied, with the exception of clauses on which the People's Republic of China has declared reservations. 中华人民共和国缔结或者参加的与合同有关的国际条约同中华人民共和国法律有不同规定的，适用该国际条约的规定。但是，中华人民共和国声明保留的条款除外。

Challenge 3 – Out-Of-Vocabulary. Terminology: words or phrases that mainly occur in specific contexts with specific meanings; their variants, constant growth and combinations make them hard to cover. Figure 4: Out-of-Vocabulary example (91.64% vs. 8.36%; BHT → 2,6-二叔丁基-4-甲基苯酚).

Domain Adaptation. As SMT is corpus-driven, the domain-specificity of the training data with respect to the test data is a significant factor that we cannot ignore. There is often a mismatch between the domain of the available training data and the target domain, and unfortunately the training resources in specific domains are usually relatively scarce. In such scenarios, various domain adaptation techniques are employed to improve domain-specific translation quality by leveraging general-domain data. Although large general-domain corpora are easy to access, SMT systems trained on relevant in-domain data can perform much better than systems trained on a larger amount of irrelevant data. Actually, there is no uniform definition of "domain" in the natural language processing (NLP) or SMT research community yet; different domains may vary by topic or text style.

Domain Adaptation for SMT – Model. Domain adaptation can be applied to different SMT components: the word-alignment model, the language model, the translation model and the reordering model [6] [7]. Figure 5: Domain Adaptation Approaches. [6] Wu, Hua, Haifeng Wang, and Zhanyi Liu. "Alignment model adaptation for domain-specific word alignment." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005. [7] Koehn, Philipp, and Josh Schroeder. "Experiments in domain adaptation for statistical machine translation." Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2007.

Domain Adaptation for SMT – Resources. Various resources can be used for domain adaptation: monolingual corpora, parallel corpora, comparable corpora and dictionaries [8]. Figure 5: Domain Adaptation Approaches. [8] Wu, Hua, Haifeng Wang, and Chengqing Zong. "Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora." Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, 2008.

Domain Adaptation for SMT – Supervision. Considering supervision, domain adaptation approaches can be divided into supervised, semi-supervised and unsupervised [9]. Figure 5: Domain Adaptation Approaches. [9] Snover, Matthew, Bonnie Dorr, and Richard Schwartz. "Language and translation model adaptation using comparable corpora." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.

My Thesis. Data Selection: addresses the ambiguity and language-style problems by moving the data distribution of the training corpora towards the target domain. Domain Focused Web-Crawling: reduces OOVs by mining in-domain dictionaries, parallel and monolingual sentences from comparable corpora (the web). The overall aim is to move the probability distribution towards the target-domain translations in cases of ambiguity (Sennrich, 2013) and OOVs. Figure 6: My Domain Adaptation Approaches

Part II: Data Selection

Definition. Data selection means selecting data suitable for the domain at hand from a large general-domain corpus, under the assumption that the general corpus is broad enough to contain sentences similar to those that occur in the target domain. It is one of the most dominant approaches for finding such appropriate data in large general-domain corpora. Figure 7: Data Selection Definition

Framework – TM Adaptation. Two kinds of corpora are available, an in-domain corpus and a general-domain corpus, each with a source-language and a target-language side. We define the set {<Si>, <Ti>, <Si,Ti>} as Vi, and MR is an abstract model representing the target domain. We use a score function (domain estimation) to measure the relevance between the two corpora. Figure 8: My Data Selection Framework

Framework – TM Adaptation. The sentence pairs of the general-domain corpus are ranked according to their scores, and the top K% of the general-domain data is selected as a pseudo in-domain corpus, where K is a tunable threshold. Figure 8: My Data Selection Framework
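
The core operation, independent of the particular scoring criterion, can be sketched as follows; domain_score is a placeholder for any of the criteria discussed later (tf-idf cosine, cross-entropy difference, edit distance):

    def select_pseudo_in_domain(general_pairs, domain_score, k_percent):
        """Rank general-domain sentence pairs by a domain relevance score
        and keep the top K% as the pseudo in-domain corpus."""
        ranked = sorted(general_pairs, key=domain_score, reverse=True)
        cutoff = int(len(ranked) * k_percent / 100)
        return ranked[:cutoff]

    # Usage sketch: pairs are (source, target) tuples; the scoring function here
    # is a dummy placeholder, not one of the real criteria
    pseudo = select_pseudo_in_domain(
        [("你好 世界", "hello world"), ("合同 法", "contract law")],
        domain_score=lambda pair: len(pair[1]),
        k_percent=50)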

Framework – TM Adaptation. A translation model is trained on the in-domain corpus (TM-IN) and another on the pseudo in-domain corpus (TM-Pseudo); they are combined by log-linear or linear interpolation into the final translation model. Figure 8: My Data Selection Framework

Framework – LM Adaptation. Similarly, a language model is trained on the in-domain target-language data (LM-IN) and another on the pseudo in-domain target-language data (LM-Pseudo), and the two are combined by log-linear or linear interpolation into the final language model. Figure 8: My Data Selection Framework

Framework – LM Adaptation. Finally, we use these adapted models to perform the translation task and evaluate the translation quality. Figure 8: My Data Selection Framework

Related Work. Vector space model (VSM): each sentence is converted into a term-weighted vector, and a vector similarity function measures its domain relevance. Each sentence Si is represented as a vector over the vocabulary of size n:

$S_i = (w_{i1}, w_{i2}, \ldots, w_{in})$    (3)

Standard tf-idf weight: the weight of the j-th vocabulary word in sentence Si is

$w_{ij} = tf_{ij} \times idf_j$    (4)

in which tf_ij is the term frequency (TF) of the j-th word of the vocabulary in sentence Si, and idf_j is its inverse document frequency (IDF). Cosine measure: the similarity between two sentences is then defined as the cosine of the angle between their vectors,

$\mathrm{sim}(S_i, S_j) = \dfrac{S_i \cdot S_j}{\lVert S_i \rVert \, \lVert S_j \rVert}$    (5)

where the numerator is the dot product of the two vectors and the denominator is the product of their norms. Several existing methods are built on this model.
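
A minimal sketch of tf-idf weighting and cosine similarity over sparse vectors, corresponding to equations (4) and (5); the idf weights below are made up for the example:

    import math
    from collections import Counter

    def tfidf_vector(tokens, idf):
        """Term-frequency vector weighted by inverse document frequency (eq. 4)."""
        tf = Counter(tokens)
        return {w: c * idf.get(w, 0.0) for w, c in tf.items()}

    def cosine(u, v):
        """Cosine similarity between two sparse vectors (eq. 5)."""
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    # Usage sketch with hypothetical idf weights
    idf = {"flight": 2.0, "ticket": 1.5, "the": 0.1}
    in_dom = tfidf_vector("i need a flight ticket".split(), idf)
    cand = tfidf_vector("book the flight".split(), idf)
    print(round(cosine(in_dom, cand), 3))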

Related Work. Perplexity-based models employ an n-gram in-domain language model to score each sentence in the general-domain corpus. Cross-entropy is the average negative logarithm of the word probabilities,

$H(p, q) = -\sum_{x} p(x) \log q(x)$    (6)

where p denotes the empirical distribution of the test sample (p(x) = n/N if x appeared n times in a test sample of size N) and q(w_i) is the probability of word w_i estimated from the training set. Perplexity pp is obtained from the cross-entropy through the base b with respect to which the cross-entropy is measured (e.g., bits or nats),

$pp = b^{H(p, q)}$    (7)

so perplexity and cross-entropy are monotonically related.
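
A minimal sketch of equations (6) and (7) for a single sentence, assuming we already have an in-domain LM that returns per-word probabilities (the values below are hypothetical):

    import math

    def cross_entropy(word_probs):
        """Average negative log2 probability per word (eq. 6)."""
        return -sum(math.log2(p) for p in word_probs) / len(word_probs)

    def perplexity(word_probs):
        """Perplexity is the base raised to the cross-entropy (eq. 7)."""
        return 2 ** cross_entropy(word_probs)

    probs = [0.25, 0.5, 0.125]           # hypothetical P(w_i | history) values
    print(cross_entropy(probs))           # 2.0 bits per word
    print(perplexity(probs))              # 4.0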

Related Work. To date, there are three perplexity-based variants. Let H_I(x) and H_O(x) be the cross-entropy of string x according to language models LM_I and LM_O, trained respectively on the in-domain dataset I and the general-domain dataset G; src and tgt denote the source and target sides of the training data. The first, basic variant [13] scores a sentence by its in-domain cross-entropy alone:

$\mathrm{score}(x) = H_I(x)$    (8)

The second is called Moore-Lewis [14]; it tries to select sentences that are similar to the in-domain data but different from the out-of-domain data:

$\mathrm{score}(x) = H_I(x) - H_O(x)$    (9)

The third is modified Moore-Lewis [15], which considers both the source and target languages:

$\mathrm{score}(x) = [H_{I,src}(x_{src}) - H_{O,src}(x_{src})] + [H_{I,tgt}(x_{tgt}) - H_{O,tgt}(x_{tgt})]$    (10)

[13] Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing (TALIP). 1:3–33. [14] Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. Proceedings of ACL: Short Papers. pp. 220–224. [15] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In: Proceedings of EMNLP. pp. 355–362.
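
A sketch of modified Moore-Lewis scoring (equation 10) using the kenlm Python bindings; the ARPA model paths are placeholders, and sentence pairs with lower scores are ranked as more in-domain:

    import kenlm  # assumes the kenlm Python module is installed

    # Placeholder paths to the four language models (in/out-of-domain, source/target)
    lm_in_src  = kenlm.Model("in.src.arpa")
    lm_out_src = kenlm.Model("out.src.arpa")
    lm_in_tgt  = kenlm.Model("in.tgt.arpa")
    lm_out_tgt = kenlm.Model("out.tgt.arpa")

    def cross_ent(lm, sentence):
        """Per-word negative log10 probability of the sentence under the given LM."""
        return -lm.score(sentence, bos=True, eos=True) / (len(sentence.split()) + 1)

    def mml_score(src, tgt):
        """Modified Moore-Lewis: lower means more in-domain on both sides (eq. 10)."""
        return (cross_ent(lm_in_src, src) - cross_ent(lm_out_src, src)) \
             + (cross_ent(lm_in_tgt, tgt) - cross_ent(lm_out_tgt, tgt))

    # Rank general-domain sentence pairs and keep the lowest-scoring ones:
    # ranked = sorted(general_pairs, key=lambda p: mml_score(p[0], p[1]))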

Discussion: Grain Level. Reviewing this work, I found that VSM-based methods obtain about 1 BLEU point of improvement using 60% of the general-domain data [10, 11, 12], while perplexity-based approaches allow discarding 50% - 99% of the general corpus while gaining 1.0 - 1.8 BLEU points [13, 14, 15, 16, 17]. Why do they perform differently? [10] Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, Geneva, Switzerland. [11] Almut Silja Hildebrand, Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Adaptation of the translation model for statistical machine translation based on information retrieval. In 10th Annual Conference of the European Association for Machine Translation (EAMT 2005). Budapest, Hungary. [12] Yajuan Lü, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. Proceedings of EMNLP-CoNLL. pp. 343–350. [15] Keiji Yasuda and Eiichiro Sumita. 2008. Method for building sentence-aligned corpus from wikipedia. In 2008 AAAI Workshop on Wikipedia and Artificial Intelligence (WikiAI08). [16] George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451–459. Association for Computational Linguistics, Cambridge, Massachusetts.

Discussion: Grain Level. VSM-based similarity is a simple co-occurrence-based match that only weights single overlapping words. Perplexity-based similarity considers not only the distribution of terms but also n-gram word collocations. String difference can comprehensively consider word overlap, n-gram collocations and word position. So I wondered whether a higher-grained method could achieve better results: good results can be expected when a similarity function with a higher constraint level is used to select the data, trying to find the most similar sentences. Figure 9: Data Selection Pyramid

The First Proposed Method Edit Distance: A New Data Selection Criterion for SMT Domain Adaptation

New Criterion. A string-difference metric is a similarity function with a higher grain level [21], so edit distance is proposed as a new selection criterion. Given a sentence sG from the general-domain corpus and a sentence sI from the in-domain corpus, the edit distance between the two sequences is defined as the minimum number of edits, i.e. symbol insertions, deletions and substitutions, needed to transform sG into sI. The normalized similarity score (fuzzy matching score, FMS) follows Koehn and Senellart [22] from their translation memory work:

$\mathrm{FMS}(s_G, s_I) = 1 - \dfrac{ED(s_G, s_I)}{\max(|s_G|, |s_I|)}$    (11)

in which ED(sG, sI) is the edit-distance function and |s| is the number of tokens of sentence s. In this study, we employ a word-based Levenshtein edit distance function (LED_word). [21] Wang, Longyue, et al. "Edit Distance: A New Data Selection Criterion for Domain Adaptation in SMT." RANLP. 2013. [22] Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry (JEC 2010).

New Criterion. For each sentence in the general-domain corpus, we traverse all in-domain sentences, calculate the FMS score against each of them, and then average the scores:

$\mathrm{score}(s_G) = \dfrac{1}{|I|} \sum_{s_I \in I} \mathrm{FMS}(s_G, s_I)$    (12)

Figure 10: Edit-distance based data selection (general-domain corpus vs. in-domain corpus)
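
A minimal sketch of the word-level Levenshtein distance, the FMS of equation (11) and the averaged score of equation (12); with millions of general-domain sentences this naive loop would be far too slow, which is why the thesis implementation runs on GPU (see the appendix):

    def levenshtein(a, b):
        """Word-level edit distance between token lists a and b."""
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            cur = [i]
            for j, wb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (wa != wb)))   # substitution
            prev = cur
        return prev[-1]

    def fms(s_g, s_i):
        """Fuzzy matching score (eq. 11)."""
        return 1.0 - levenshtein(s_g, s_i) / max(len(s_g), len(s_i))

    def avg_fms(s_g, in_domain):
        """Average FMS of a general-domain sentence against the in-domain corpus (eq. 12)."""
        return sum(fms(s_g, s_i) for s_i in in_domain) / len(in_domain)

    print(fms("i need a ticket".split(), "i need a flight ticket".split()))  # 0.8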

Experiment: Corpora (Chinese-English). The general-domain parallel corpus (in-house) includes sentences covering various genres such as movie subtitles, law literature, news and novels. The in-domain parallel corpus, dev set and test set are randomly selected from the IWSLT2010 Dialog task [37], consisting of transcriptions of conversational speech in the travel domain. We use the parallel corpora for TM training and the target side for LM training. To conduct our experiment, we use the following corpora.

Table 1: Corpora Statistics (Chinese-English)
Data Set         Sentences   Ave. Len.
Test Set         3,500       9.60
Dev Set          3,000       9.46
In-domain        17,975      9.45
General-domain   5,211,281   12.93

[37] Available at http://iwslt2010.fbk.eu/node/33.

Experiment: System Setting. Baseline: SMT trained on the entire general-domain corpus. VSM-based system (VSM): SMT trained on the top K% of the general-domain corpus ranked by the cosine tf-idf metric. Perplexity-based system (PL): SMT trained on the top K% ranked by the basic cross-entropy metric. String-difference system (SD): SMT trained on the top K% ranked by the edit-distance metric. We build these systems and investigate K = {20, 40, 60, 80}% of the ranked general-domain data as the pseudo in-domain corpus for SMT training, where K% means that K percent of the general corpus is selected as a subset.

Experiment: Results. All three adaptation methods do better than the baseline. VSM improves by nearly 1 BLEU, but needs 80% of the data (more). PL is a simple but effective method, which increases BLEU by 1.1 using 60% of the data (less). SD performs best: it achieves a higher BLEU score than the other two methods with less data. The table shows translation quality and the relative improvement.

Table 2: Translation Quality of Adapted Models
System    20%            40%            60%            80%
Baseline  29.34
VSM       29.00 (-0.34)  29.50 (+0.16)  30.02 (+0.68)  30.31 (+0.97)
PL        29.45 (+0.11)  29.65 (+0.31)  30.44 (+1.10)  29.78 (+0.44)
SD        29.25 (-0.09)  30.22 (+0.88)  30.97 (+1.63)  30.21 (+0.87)

Discussion. SD > PL > VSM > Baseline: higher-grained similarity metrics perform better than lower-grained ones. However, methods at different grain levels each have their own strengths: VSM-based methods capture keywords, perplexity-based methods capture word collocations, and string difference finds the closest, most similar sentences. So how about combining the individual models? Figure 9: Data Selection Pyramid

The Second Proposed Method: A Hybrid Data Selection Model for SMT Domain Adaptation

Combination. We investigate the combination of the above three individual models at two levels [23]. Corpus level: weight the pseudo in-domain sub-corpora selected by the different methods (VSM, perplexity, edit distance) and then join them together into one combined corpus. Figure 11: Combination Approach. [23] Wang, Longyue, et al. "iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer Berlin Heidelberg, 2013. 280-290.

Combination. Model level: perform linear interpolation on the translation models trained on the different sub-corpora,

$\phi(\bar{f} \mid \bar{e}) = \sum_{i=1}^{3} \alpha_i \, \phi_i(\bar{f} \mid \bar{e})$    (13)

$p_w(\bar{f} \mid \bar{e}) = \sum_{i=1}^{3} \beta_i \, p_{w,i}(\bar{f} \mid \bar{e})$    (14)

where i = 1, 2, 3 indexes the phrase translation probabilities and lexical weights trained on the VSM, perplexity and edit-distance subsets, and αi and βi are the tunable interpolation parameters, subject to $\sum_i \alpha_i = 1$ and $\sum_i \beta_i = 1$.
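
A sketch of model-level linear interpolation over three phrase tables, represented here simply as dictionaries mapping (source phrase, target phrase) to a probability; the weights and entries are hypothetical, and entries missing from a table contribute zero probability:

    def interpolate_tables(tables, weights):
        """Linearly interpolate phrase translation probabilities (eq. 13)."""
        assert abs(sum(weights) - 1.0) < 1e-9
        merged = {}
        for table, w in zip(tables, weights):
            for pair, prob in table.items():
                merged[pair] = merged.get(pair, 0.0) + w * prob
        return merged

    t_vsm = {("合同", "contract"): 0.6}
    t_ppl = {("合同", "contract"): 0.8}
    t_ed  = {("合同", "contract"): 0.7, ("法", "law"): 0.9}
    print(interpolate_tables([t_vsm, t_ppl, t_ed], [0.3, 0.4, 0.3]))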

Experiment: Corpora (Chinese-English). The general-domain parallel corpus again includes sentences covering various genres such as movie subtitles, law literature, news and novels. The in-domain parallel corpus, dev set and test set are disjoint, randomly selected subsets of the LDC corpus [38] (Hong Kong law domain). We still use a large and broad general-domain corpus.

Table 3: Domain Distribution of the General-domain Corpus
Domain  Sent. No.  %
News    279,962    24.60%
Novel   304,932    26.79%
Law     48,754     4.28%
Others  504,396    44.33%
Total   1,138,044  100.00%

[38] LDC2004T08, https://catalog.ldc.upenn.edu/LDC2004T08.

Experiment: Corpora (Chinese-English).

Table 4: Corpora Statistics
Data Set      Lang.  Sentences   Tokens      Av. Len.
Test Set      EN     2,050       60,399      29.46
              ZH                 59,628      29.09
Dev Set       EN     2,000       59,924      29.92
              ZH                 59,054      29.53
In-domain     EN     45,621      1,330,464   29.16
              ZH                 1,321,655   28.97
Training Set  EN     1,138,044   28,626,367  25.15
              ZH                 28,239,747  24.81

The corpus size, data-type distribution and in-/general-domain ratio differ from the first experiment, so data selection performance may differ as well. We use the parallel corpora for TM training and the target side for LM training.

Experiment: System Setting. Baseline: the general-domain baseline (GC-Baseline) is trained on the entire general corpus. Individual models: cosine tf-idf (Cos), the proposed edit-distance method (ED) and three perplexity-based variants: cross-entropy (CE), Moore-Lewis (ML) and modified Moore-Lewis (MML). Combined models: Cos, ED and the best perplexity-based model combined at the corpus level (iCPE-C) and at the model level (iCPE-M). We set the tunable threshold in steps of 2x, starting from 3.75% of the general corpus: K = {3.75, 7.5, 15, 30, 60}%.

Experiment: Individual Model Results. The perplexity-based variants are all effective. MML performs best: it yields the highest improvement (nearly 2 BLEU) with the least data (15%). Overall: MML > ED > CE > ML > Cos > Baseline. The quality on the two sides is not consistent; selection on the source side (English) works better.

Table 5: Translation Quality of Adapted Models
System       3.75%      7.5%           15%            30%            60%
GC-Baseline  39.15
CE           37.10 (-)  39.82 (+0.67)  40.39 (+1.24)  40.79 (+1.64)  39.43 (+0.28)
ML           38.07 (-)  40.33 (+1.18)  40.08 (+0.93)  40.46 (+1.31)  40.27 (+1.12)
MML          38.26 (-)  40.91 (+1.76)  41.12 (+1.97)  40.02 (+0.87)
Cos          37.87 (-)  38.44 (-)      39.45 (+0.30)  40.17 (+1.02)  39.88 (+0.73)
ED           37.70 (-)  39.00 (-)      40.88 (+1.73)  40.24 (+1.09)  40.00 (+0.85)

Experiment: Results. Good performance is obtained at K = {7.5, 15, 30}%, so we conduct the combination experiments at these tuning points. Considering their different natures, we further combine MML (the best perplexity-based model), Cos and ED. Figure 12: Combination Approach

Experiment: Combination Model Results. Both combination methods perform (slightly) better than the best individual model, and the model-level combination is better than the corpus-level one (+0.23 BLEU). Overall: combination models > individual models > baseline; both combinations do better than the best individual model and the baseline.

Table 6: Translation Quality of Adapted Models
System       7.5%           15%            30%
GC-Baseline  39.15
MML          40.91 (+1.76)  41.12 (+1.97)  40.02 (+0.87)
iCPE-C       41.01 (+1.86)  41.95 (+2.80)  41.98 (+2.83)
iCPE-M       41.13 (+1.98)  42.21 (+3.06)  41.84 (+2.69)

Discussion. We have compared many data selection methods: VSM-based (cosine tf-idf); perplexity-based (basic cross-entropy, Moore-Lewis and modified Moore-Lewis); string-difference (edit distance); and their combinations at the corpus and model levels. However, all of the above methods only consider the word itself (surface information). Languages with large sets of different word forms lead to sparsity problems, and surface-based methods are weak at capturing language style, sentence structure and semantic information.

The Third Proposed Method: Linguistically-augmented Data Selection for SMT Domain Adaptation

Linguistic DS. We explore additional types of linguistic information for the data selection approach [25]. Surface form (f): the word itself, which carries rich lexical information. Named Entity categories (n): group together proper nouns that belong to the same semantic class (person, location, organization) [26]. Part-Of-Speech tags (t): group together words that share the same grammatical function (e.g. adjectives, nouns, verbs) [27]. We anticipate that this type of information can be useful for data selection as well. Lemmas group together word forms that share the same root, which can effectively reduce sparsity, especially for highly inflected languages. [25] Antonio Toral, Pavel Pecina, Longyue Wang, Josef van Genabith. (2014). "Linguistically-augmented Perplexity-based Data Selection for Language Models." Computer Speech and Language (accepted, in minor revisions). [26] E. W. D. Whittaker, P. C. Woodland. Comparison of language modelling techniques for Russian and English, in: ICSLP, ISCA, 1998. [27] P. A. Heeman. POS tags and decision trees for language modeling, in: 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999, pp. 129–137.

Linguistic DS. In practice, we convert the original corpus (f) into linguistically annotated formats (fn, ft and t) and use them for LM training and sentence scoring; the core metric is modified Moore-Lewis. According to the resulting scores, we select data from the original (surface) corpus to train the adapted SMT models. Four LMs are needed: (1) the in-domain corpus in the source language, (2) the in-domain corpus in the target language, (3) the out-of-domain corpus in the source language, and (4) the out-of-domain corpus in the target language. Figure 13: Linguistically-based Data Selection Method
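
A sketch of producing an ft-style representation, assuming each token already carries a POS tag (e.g., from Stanford CoreNLP, as used in the experimental setup) and following the idea mentioned later of keeping noun and verb surface forms while replacing other words by their POS tags; the exact tag set and replacement rules in the thesis may differ:

    def to_ft(tagged_tokens, keep_prefixes=("NN", "VV", "VB")):
        """Replace tokens by their POS tag unless the tag marks a noun or verb."""
        out = []
        for word, tag in tagged_tokens:
            keep = any(tag.startswith(p) for p in keep_prefixes)
            out.append(word if keep else tag)
        return " ".join(out)

    sent = [("I", "PRP"), ("booked", "VBD"), ("a", "DT"), ("flight", "NN")]
    print(to_ft(sent))   # "PRP booked DT flight"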

Linguistic-based DS. Based on the individual models, we further combine the different types of linguistic knowledge. Corpus level: given the sentences selected by all the individual models for a given threshold, we traverse the first-ranked sentence of each method, then proceed to the set of second-best ranked sentences, and so forth, keeping all distinct sentences. Model level: similar, but the traversed sentences are kept in separate sets; we build an LM on each set and then interpolate them. The interpolation settings are the same as in the second experiment.
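
A sketch of the corpus-level round-robin combination over several ranked selections, keeping only distinct sentences; the lists and limit are toy values:

    from itertools import zip_longest

    def round_robin_merge(ranked_lists, limit):
        """Interleave several ranked selections, best-first, dropping duplicates."""
        seen, merged = set(), []
        for rank_group in zip_longest(*ranked_lists):
            for sent in rank_group:
                if sent is not None and sent not in seen:
                    seen.add(sent)
                    merged.append(sent)
                    if len(merged) == limit:
                        return merged
        return merged

    ranked_f  = ["s3", "s1", "s7"]
    ranked_ft = ["s1", "s5", "s3"]
    ranked_fn = ["s2", "s3", "s9"]
    print(round_robin_merge([ranked_f, ranked_ft, ranked_fn], limit=5))
    # ['s3', 's1', 's2', 's5', 's7']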

Experiment: Corpora (Chinese-English). The general-domain parallel corpus is a combination of general-domain corpora: CWMT2013 [39], UM-Corpus [40], News Magazine [41], etc. The in-domain parallel corpus, dev set and test set come from the IWSLT2014 TED Talk task (talk domain) [42].

Table 7: Corpora Statistics
Data Set (EN/ZH)  Sentences   Ave. Len.
Test Set          1,570       26.54/23.41
Dev Set           887         26.47/23.24
In-domain         177,477     26.47/23.58
General-domain    10,021,162  23.02/21.36

[39] http://www.liip.cn/cwmt2013/. [40] http://nlp2ct.cis.umac.mo/um-corpus/. [41] LDC2005T10. https://catalog.ldc.upenn.edu/LDC2005T10. [42] http://workshop2014.iwslt.org/.

Experiment: System Setting. All adapted systems are log-linearly interpolated with the in-domain model to further improve performance. Baseline: GI-Baseline is trained on the entire in-domain corpus plus the general corpus. Individual models: surface form (f), POS only (t), surface + named entity (fn), surface + POS (ft). Combined models: corpus level (Comb-C) and model level (Comb-M). We investigate K = {25, 50, 75}% of the ranked general-domain data as the pseudo in-domain corpus for SMT training.

Experiment: Individual Model Results. After adding more linguistic information, fn and ft improve over the baseline by about 1 BLEU (at the 75% selection point), while t (POS only) performs poorly due to the lack of lexical information. ft does slightly better than f (by 0.44 points), which indicates that replacing some non-NN and non-VV words by their POS tags reduces sparsity while keeping the language style of in-domain sentences. fn performs no better than f: although NER tags reduce surface variants, these name words (location, person, organization) are usually very important for defining the domain. Considering their performance, we further combine f, fn and ft.

Table 8: Translation Quality of Adapted Models
System       25%             50%             75%
GI-Baseline  40.20
f            31.91 (-8.29)   38.83 (-1.37)   41.37 (+1.17)
t            21.20 (-19.00)  27.90 (-12.30)
fn           31.93 (-8.27)   37.86 (-2.34)   40.93 (+0.73)
ft           30.00 (-10.20)  38.74 (-1.46)   41.81 (+1.61)

Experiment: Combination Model Results. Both combination methods are better than the best individual model (from +0.11 to +0.64 BLEU). Combination appears to inherit the advantages of each linguistic-based method (lexical information, sparsity reduction, language style). Highly inflected language pairs such as English-German may benefit even more from additional linguistic information.

Table 9: Translation Quality of Adapted Models
System       25%             50%            75%
GI-Baseline  40.20
f            31.91 (-8.29)   38.83 (-1.37)  41.37 (+1.17)
ft           30.00 (-10.20)  38.74 (-1.46)  41.81 (+1.61)
Comb-C       33.01 (-7.19)   39.07 (-1.13)  41.92 (+1.72)
Comb-M       32.74 (-7.46)   38.95 (-1.25)  42.01 (+1.81)

Part III: Real-Life System

Real-life Environment. To prove the robustness and language-independence of our domain adaptation approaches, we evaluate them in a real-life setting. WMT (running since 2005) is the best-known workshop series with high-quality shared tasks on machine translation. We participated in the WMT2014 medical translation task [43]: Czech-English, French-English and German-English (6 translation directions). Very large resources were available: up to 36 million general-domain parallel sentences and 4 million in-domain parallel sentences. Medical texts are more complex, e.g. chemical formulae such as "-CH2-(OCH2CH2)n-". This is why we call it a real-life environment. [43] http://www.statmt.org/wmt14/.

WMT2014 Medical Translation Task. Based on observations of the medical text, we applied a number of detailed domain adaptation techniques and approaches: task-oriented pre-processing, language model adaptation, translation model adaptation, numeric adaptation, hyphenated word adaptation, and a combination of all of these methods. Finally, we ranked 1st on three language directions and 2nd on the others. Figure 14: Results and Rankings of Our System
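
As an illustration of the kind of pre-processing that numeric adaptation involves, here is a generic sketch that masks numbers with a placeholder and restores them after decoding; this only illustrates the idea and is not necessarily the exact scheme used in our system (see [29] in the appendix for the actual method):

    import re

    NUM = re.compile(r"\d+(?:[.,]\d+)*")

    def mask_numbers(sentence):
        """Replace each number by a placeholder and remember the originals in order."""
        numbers = NUM.findall(sentence)
        return NUM.sub("<num>", sentence), numbers

    def restore_numbers(translation, numbers):
        """Put the original numbers back into the translated output, in order."""
        for n in numbers:
            translation = translation.replace("<num>", n, 1)
        return translation

    masked, nums = mask_numbers("a total of 687.9 points")
    print(masked)                                # "a total of <num> points"
    print(restore_numbers("共 <num> 分", nums))   # "共 687.9 分"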

BenTu System. Based on these medical-domain models, we developed my first online translator, BenTu, a domain-specific multi-tier SMT system [44]. It has three layers (pre-processing, decoder and post-processing) and makes it easy to add new language pairs and domains. Figure 15: Framework of the BenTu System. [44] The architecture is designed with reference to the PLuTO project: Tinsley, John, Andy Way, and Paraic Sheridan. "PLuTO: MT for online patent translation." Association for Machine Translation in the Americas, 2010.

BenTu System Figure 16: User Interface of BenTu System

Part V: Conclusion

Thesis Contribution. To solve the problems in domain-specific SMT, we proposed the data selection methods described above: a new data selection criterion, a combination model, and linguistically-augmented data selection. We also worked on domain-focused web-crawling: integrated models for cross-language document alignment, and combining a topic classifier with perplexity for filtering. Finally, a real-life domain-specific SMT system based on a number of adapted models was developed. Actually, we did more than what is covered in this presentation.

Total Contribution. We did more beyond my thesis; I drew this tree to show all my work on artificial intelligence over the past three years. Figure 17: My work in the past three years

Future Work. Data Selection: graphical models and label propagation; neural language models. Domain Focused Web-Crawling: improve performance by mining the in-domain dictionary. Real-life domain-specific SMT: extend to more language pairs (Chinese, Japanese, etc.) and more domains (science and technology, law, news).

My Publications Journal Papers 1, Antonio Toral, Pavel Pecina, Longyue Wang, Josef van Genabith. 2014. Linguistically-augmented Perplexity-based Data Selection for Language Models. Computer Speech and Language (accepted). (IF=1.463) 2, Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, and Junwen Xing. 2013. A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation. The Scientific World Journal, vol. 2014, Article ID 745485, 10 pages. (IF=1.730) 3, Long-Yue WANG, Derek F. WONG, Lidia S. CHAO. 2012. TQDL: Integrated Models for Cross-Language Document Retrieval. International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP), pages 15-32. (THCI Core) Conference Papers 4, Longyue Wang, Yi Lu, Derek F. Wong, Lidia S. Chao, Yiming Wang, Francisco Oliveira. 2014. Combining Domain Adaptation Approaches for Medical Text Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation. (ACL Anthology and EI)

My Publications 5, Yi Lu, Longyue Wang, Derek F. Wong, Lidia S. Chao, Yiming Wang, Francisco Oliveira. (2014) "Domain Adaptation for Medical Text Translation using Web Resources". In Proceedings of the Ninth Workshop on Statistical Machine Translation. (ACL Anthology and EI) 6, Yiming Wang, Longyue Wang, Xiaodong Zeng, Derek F. Wong, Lidia S.Chao, Yi Lu. 2014. Factored Statistical Machine Translation for Grammatical Error Correction”, In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL 2014), pages 83-90. (ACL Anthology and EI) 7, Longyue Wang, Derek F. Wong, Lidia S. Chao, Junwen Xing, Yi Lu, Isabel Trancoso. 2013. Edit Distance: A New Data Selection Criterion for SMT Domain Adaptation. In Proceedings of Recent Advances in Natural Language Processing, pages 727-732. (ACL Anthology and EI) 8, Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, Junwen Xing. 2013. iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation. In Proceedings of the 12th China National Conference on Computational Linguistics (12th CCL), Lecture Notes in Artificial Intelligence (LNAI) Springer series, pages 280-290. (EI)

My Publications 9, Junwen Xing, Longyue Wang, Derek F. Wong, Lidia S. Chao, Xiaodong Zeng. 2013. UMChecker: A Hybrid System for English Grammatical Error Correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL 2013), pages 34-42. (ACL Anthology and EI) 10, Longyue WANG, Shuo Li, Derek F. WONG, Lidia S. CHAO. 2012. A Joint Chinese Named Entity Recognition and Disambiguation System. In Proceeding of the 2th CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2012), pages 146-151. (ACL Anthology) 11, Longyue WANG, Derek F. WONG, Lidia S. CHAO, Junwen Xing. 2012. CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data. In Proceedings of the Second CIPSSIGHAN Joint Conference on Chinese Language Processing (CLP2012), pages 51-57. (ACL Anthology) 12, Long-Yue Wang, Derek F. WONG, Lidia S. CHAO. 2012. An Experimental Platform for Cross-Language Document Retrieval. The 2012 International Conference on Applied Science and Engineering (ICASE2012), pages 3325-3329. (EI)

My Publications 13, Longyue Wang, Derek F. WONG, Lidia S. CHAO. 2012. An Improvement in Cross-Language Document Retrieval Based on Statistical Models. The Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012), pages 144-155. (ACL Anthology and EI) 14, Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, Yi Lu, Shuo Li, Yiming Wang, Longyue Wang. 2014. UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation. In Proceedings of the 9th Edition of its Language Resources and Evaluation Conference (LREC2014), pages 1837-1842. (EI)

Thank You! 謝謝! Finally, I want to say a deep thank-you to all of you in my life. Obrigado!

Appendix

Related Work. Zhao et al. [10] first used information retrieval techniques to retrieve sentences from a monolingual corpus to build a LM, which is then interpolated with the general background LM. Hildebrand et al. [11] extended this to sentence pairs, which are used to train a domain-specific TM. Lü et al. [12] further proposed re-sampling and re-weighting methods for online and offline TM optimization. [10] Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, Geneva, Switzerland. [11] Almut Silja Hildebrand, Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Adaptation of the translation model for statistical machine translation based on information retrieval. In 10th Annual Conference of the European Association for Machine Translation (EAMT 2005). Budapest, Hungary. [12] Yajuan Lü, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. Proceedings of EMNLP-CoNLL. pp. 343–350.

Related Work. In language modeling, Gao et al. [13] and Moore and Lewis [14] used perplexity-based scores to adapt LMs. This was first applied to SMT adaptation by Yasuda et al. [15] and Foster et al. [16], and Axelrod et al. [17] further improved TM adaptation by considering bilingual information. Various domain adaptation approaches have been proposed, which can be divided into different categories from different perspectives; in this thesis, we organise domain adaptation work according to the methods used (Sennrich, 2013; Chen et al., 2013). [13] Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing (TALIP). 1:3–33. [14] Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. Proceedings of ACL: Short Papers. pp. 220–224. [15] Keiji Yasuda and Eiichiro Sumita. 2008. Method for building sentence-aligned corpus from wikipedia. In 2008 AAAI Workshop on Wikipedia and Artificial Intelligence (WikiAI08). [16] George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451–459. Association for Computational Linguistics, Cambridge, Massachusetts. [17] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In: Proceedings of EMNLP. pp. 355–362.

Related Work. After selection, we obtain a pseudo in-domain sub-corpus, and an in-domain corpus is also available; mixture modeling then integrates the different language models or translation models. Foster and Kuhn [18] investigated linear and log-linear interpolation of individual language models trained on different corpora, and linear interpolation has been widely used for SMT [19]. Alternatively, the translation models can be added to the global log-linear SMT model as features, with weights optimized through minimum error rate training (MERT) [20]. [18] George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 128–135. Association for Computational Linguistics, Prague, Czech Republic. [19] Graeme Blackwood, Adrià de Gispert, Jamie Brunning, and William Byrne. 2008. European language translation with weighted finite state transducers: The CUED MT system for the 2008 ACL workshop on SMT. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 131–134. Association for Computational Linguistics, Columbus, Ohio. [20] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran et al. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of ACL. pp. 177–180.

Experimental Setup. Overall running time: the environment is the HPC Cluster Pearl; each computing node has an Intel Xeon X5675 CPU, 24 cores and 180 GB of memory.

Data Selection:
Method        2.5 million  5 million  7.5 million  10 million
VSM (GPU)     8 hr         15 hr      29 hr        41 hr
Perplexity    20 min       25 min     30 min       40 min
String-Diff.  22 hr        40 hr      62 hr        70 hr

SMT:
Task      2.5 million  5 million  7.5 million  10 million
Training  4 hr         13 hr      23 hr        32 hr
Tuning    1 hr         2 hr       6 hr

Experimental Setup. Corpus processing: we propose better data processing steps [29] for the domain adaptation task. For Chinese segmentation, we use an in-house system [30]; for the other languages, we use the European tokenizer [31]. Linguistic information is extracted with the Stanford CoreNLP toolkit [32]. For the rest, such as case processing (truecasing) and length-based cleaning (1-80), we use the Moses scripts. [29] Longyue Wang, Yi Lu, Derek F. Wong, Lidia S. Chao, Yiming Wang, Francisco Oliveira. (2014) "Combining Domain Adaptation Approaches for Medical Text Translation". In Proceedings of the Ninth Workshop on Statistical Machine Translation. [30] Longyue WANG, Derek F. WONG, Lidia S. CHAO, Junwen Xing. (2012). "CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data." Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2012), pages 51–57. [31] Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit. Vol. 5. pp. 79–86. [32] Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Experimental Setup. SMT: we use the Moses decoder [33], a state-of-the-art open-source phrase-based SMT system. The translation and reordering models rely on "grow-diag-final" symmetrized word-to-word alignments built using GIZA++ [34]. A 5-gram language model was trained using the IRSTLM toolkit [35], exploiting improved modified Kneser-Ney smoothing and quantizing both probabilities and back-off weights. [33] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran et al. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of ACL. pp. 177–180. [34] Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics. 29:19–51. [35] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. Proceedings of Interspeech. pp. 1618–1621.

Experimental Setup. Data selection: the cosine tf-idf and edit-distance scorers are implemented on GPU; for the perplexity-based methods, we use the SRILM toolkit [36] to build 5-gram LMs with interpolated modified Kneser-Ney discounting. We use an end-to-end evaluation, with BLEU [37] as the metric reflecting domain-specific translation quality. [36] Andreas Stolcke and others. 2002. SRILM - an extensible language modeling toolkit. Proceedings of the International Conference on Spoken Language Processing. pp. 901–904. [37] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Proceedings of ACL. pp. 311–318.