Cache-based Document-level Statistical Machine Translation. Prepared for I2R Reading Group, Gongzhengxian, 10 Oct 2011.

Outline
–Introduction
–Cache-based document-level SMT
–Experiments
–Discussion

Introduction Bond (2002) suggested nine ways to improve machine translation by imitating the best practices of human translators (Nida, 1964), with parsing the entire document before translation as the first priority. However, most SMT systems still treat parallel corpora as lists of independent sentence pairs and ignore document-level information.

Introduction(2) Document-level translation has drawn little attention from the SMT research community. The reasons are manifold.
–First of all, most parallel corpora lack document boundary annotations (Tam, 2007).
–Secondly, although it is easy to incorporate a new feature into the classical log-linear model (Och, 2003), it is difficult to capture document-level information and model it with a few simple features.
–Thirdly, reference translations of a test document, written by human translators, tend to use varied expressions in order to avoid producing monotonous text. This makes the evaluation of document-level SMT systems extremely difficult.

Introduction(3) Tiedemann (2010) showed that repetition and consistency are very important when modeling natural language and translation. He proposed to employ cache-based language and translation models in a phrase-based SMT system for domain adaptation.

Introduction(4) This paper proposes a cache-based approach to document-level SMT:
–The dynamic cache is similar to Tiedemann's;
–The static cache is employed to store relevant bilingual phrase pairs extracted from similar bilingual document pairs;
–However, such a cache-based approach may introduce many noisy/unnecessary bilingual phrase pairs into both the static and dynamic caches. To resolve this problem, this paper employs a topic model to weaken those noisy/unnecessary phrase pairs by recommending to the decoder the phrase pairs that are consistent with the topic of the test document.

The Workflow of Cache-based SMT Given a test document, our system works as follows:
–clears the static, topic and dynamic caches when switching to a new test document d_x;
–retrieves a set of the most similar bilingual document pairs dd_s for d_x from the training parallel corpus, using cosine similarity with tf-idf weighting;
–fills the static cache with bilingual phrase pairs extracted from dd_s;
–fills the topic cache with topic words extracted from the target-side documents of dd_s;
–for each sentence in the test document, translates it using cache-based SMT and continuously expands the dynamic cache with bilingual phrase pairs obtained from the best translation hypotheses of the previous sentences.
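To make this workflow concrete, here is a minimal Python sketch of the document-level decoding loop. It is not the authors' implementation: the helper names (retrieve_similar_documents, extract_phrase_pairs, extract_topic_words, decode) and the hypothesis attributes are hypothetical placeholders for the components named on this slide; retrieve_similar_documents is sketched further below under the Static Cache slide.

```python
def translate_document(test_doc, training_doc_pairs, k=5):
    """Hypothetical sketch of the cache-based document-level decoding loop.
    test_doc: list of source sentences; training_doc_pairs: list of
    (source_document, target_document) pairs."""
    # 1. Clear the static, topic and dynamic caches for the new document d_x.
    static_cache, topic_cache, dynamic_cache = set(), set(), []

    # 2. Retrieve the k most similar bilingual document pairs dd_s
    #    (cosine similarity over tf-idf vectors).
    similar_pairs = retrieve_similar_documents(test_doc, training_doc_pairs, k)

    # 3. Fill the static cache with phrase pairs extracted from dd_s.
    for src_doc, tgt_doc in similar_pairs:
        static_cache.update(extract_phrase_pairs(src_doc, tgt_doc))

    # 4. Fill the topic cache with topic words from the target side of dd_s.
    for _, tgt_doc in similar_pairs:
        topic_cache.update(extract_topic_words(tgt_doc))

    # 5. Translate sentence by sentence; grow the dynamic cache with phrase
    #    pairs from the best hypothesis of each previously translated sentence.
    translations = []
    for sentence in test_doc:
        best_hyp = decode(sentence, static_cache, topic_cache, dynamic_cache)
        translations.append(best_hyp.target)              # hypothetical attribute
        dynamic_cache.extend(best_hyp.used_phrase_pairs)  # hypothetical attribute
    return translations
```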

Dynamic Cache Our dynamic cache is mostly inspired by Tiedemann (2010), which adopts a dynamic cache to store relevant bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document.
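The feature equation on the original slide did not survive extraction. Purely as an illustration, the following sketches an exponentially decaying cache score in the spirit of Tiedemann (2010); the functional form and the decay parameter are assumptions, not the feature actually used in this paper.

```python
import math

def dynamic_cache_score(phrase_pair, dynamic_cache, decay=0.005):
    """Assumed age-sensitive cache feature: recently cached occurrences of a
    phrase pair contribute the most, and uncached pairs score zero. The
    exponential decay mirrors Tiedemann (2010)'s decaying cache models."""
    score = 0.0
    # dynamic_cache is an ordered list; its last element is the newest entry.
    for age, cached_pair in enumerate(reversed(dynamic_cache)):
        if cached_pair == phrase_pair:
            score += math.exp(-decay * age)
    return score
```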

Static Cache All the document pairs in the training parallel corpus are aligned at the phrase level using 2-fold cross-validation. Given a test document, we first find a set of similar source documents by computing cosine similarity with tf-idf weighting, and then take their corresponding target documents to form similar bilingual document pairs. The static cache is filled with phrase pairs extracted from these document pairs, for example:
出口 ||| exports
放慢 ||| slowdown
股市 ||| stock market
现行 ||| leading
汇率 ||| exchange
活力 ||| vitality
加快 ||| speed up the
经济学家 ||| economists
出口 增幅 ||| export growth
多种 原因 ||| various reasons
国家 著名 ||| a well-known international
议会 委员会 ||| congressional committee
不 乐观 的 预期 ||| pessimistic predictions
保持 一定 的 增长 ||| maintain a certain growth
美元 汇率 下跌 ||| a drop in the dollar exchange rate
Table 1: Phrase pairs extracted from a document pair with an economic topic
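The retrieval step can be illustrated with a short sketch. The slide only specifies cosine similarity with tf-idf weighting; the use of scikit-learn's TfidfVectorizer and cosine_similarity, and the assumption that documents are plain strings, are choices made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_similar_documents(test_doc, doc_pairs, k=5):
    """Return the k training document pairs whose source side is most
    similar to the test document under tf-idf cosine similarity.
    test_doc: source document as a string; doc_pairs: (source, target) strings."""
    source_docs = [src for src, _ in doc_pairs]
    vectorizer = TfidfVectorizer()
    source_matrix = vectorizer.fit_transform(source_docs)
    test_vector = vectorizer.transform([test_doc])
    scores = cosine_similarity(test_vector, source_matrix)[0]
    top = sorted(range(len(doc_pairs)), key=lambda i: scores[i], reverse=True)[:k]
    return [doc_pairs[i] for i in top]
```

Only the source sides of the training document pairs are vectorized, matching the slide's description of comparing the test document against source documents and then taking their corresponding target documents.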

Topic Cache A topic cache feature is defined over the topic words stored in the topic cache. We use an LDA tool to build a topic model from the target-side documents in the training parallel corpus; this yields distributions such as P(w|z) and P(z|d). The topic of a target document is determined by its major topic, i.e. the topic z with the maximum P(z|d).
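As one plausible reading of how the topic cache is filled, here is a sketch that assumes the LDA distributions P(z|d) and P(w|z) are available as nested dictionaries; how many words per topic enter the cache is not specified on the slide and is a free parameter here.

```python
def build_topic_cache(target_doc_ids, p_z_given_d, p_w_given_z, top_n=100):
    """Collect topic words for the topic cache.
    p_z_given_d: {doc_id: {topic_id: prob}}; p_w_given_z: {topic_id: {word: prob}}.
    For each retrieved target document, take its major topic (the z maximizing
    P(z|d)) and add the top_n highest-probability words of that topic."""
    topic_cache = set()
    for d in target_doc_ids:
        # Major topic of document d: argmax_z P(z|d).
        major_topic = max(p_z_given_d[d], key=p_z_given_d[d].get)
        # Highest-probability words of that topic under P(w|z).
        ranked = sorted(p_w_given_z[major_topic],
                        key=p_w_given_z[major_topic].get, reverse=True)
        topic_cache.update(ranked[:top_n])
    return topic_cache
```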

Topic Cache(2)

Experiments