

An integrated platform for high-accuracy word alignment
Dan Tufis, Alexandru Ceausu, Radu Ion, Dan Stefanescu
RACAI – Research Institute for Artificial Intelligence, Bucharest

Arona, Exploiting parallel corpora in up to 20 languages

Slide 2: COWAL
- The main task of COWAL is to combine the output of two or more comparable word aligners.
- To achieve this, COWAL is also an integrated platform with modules for tokenization, POS tagging, lemmatization, collocation detection, dependency annotation, chunking and word sense disambiguation.

Slide 3: Word alignment algorithms (YAWA)
- YAWA starts with all plausible links (those with a log-likelihood score higher than 11).
- Then, using a competitive linking strategy, it retains the links that maximize the sentence translation equivalence score while minimizing the number of crossing links.
- In this way, it generates only 1-1 alignments; N-M alignments are possible only when chunking and/or dependency linking is available.
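The competitive linking strategy above can be sketched as a greedy loop; this is a minimal illustration, not the actual YAWA code, and the crossing-link minimization is omitted here:

```python
def competitive_linking(candidate_scores):
    """Greedy competitive linking: repeatedly accept the highest-scoring
    candidate link whose source and target tokens are both still free,
    so every token ends up in at most one link (1-1 alignments only)."""
    links, used_src, used_tgt = [], set(), set()
    for (i, j), _ in sorted(candidate_scores.items(), key=lambda kv: -kv[1]):
        if i not in used_src and j not in used_tgt:
            links.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(links)

# (0, 1) loses the competition for source token 0 to the stronger (0, 0)
scores = {(0, 0): 20.0, (0, 1): 15.0, (1, 1): 18.0, (2, 2): 12.0}
print(competitive_linking(scores))  # [(0, 0), (1, 1), (2, 2)]
```

Because each token can participate in only one accepted link, the result is a 1-1 alignment by construction.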

Slide 4: Word alignment algorithms (MEBA)
- MEBA iterates several times over each pair of aligned sentences, at each iteration adding only the highest-scoring links.
- The links established in previous iterations lend support to, or impose restrictions on, the links to be added in a subsequent iteration.
- MEBA uses different weights and different significance thresholds for each feature and iteration step.

Slide 5: Features characterizing a link
- A link is characterized by a set of features whose values are real numbers in the [0,1] interval:
- context-independent features (CIF), which refer only to the tokens of the current link;
- context-dependent features (CDF), which refer to the properties of the current link with respect to the rest of the links in a bi-text.

Slide 6: Context-independent features
- Translation equivalents (lemma and/or wordform)
- Translation equivalents entropy (lemma)
- Part-of-speech affinity
- Cognates

Slide 7: Translation equivalents (TE)
- YAWA and TREQ-AL use competitive linking based on log-likelihood scores, plus the Ro-En aligned wordnets.
- MEBA uses GIZA++-generated candidates filtered with a log-likelihood threshold (11).
- The TE candidate search space is limited by lemmatization and POS meta-classes (e.g. meta-class 1 includes only N, V, Aj and Adv; meta-class 8 includes only proper names).
- For a pair of languages, translation equivalents are computed in both directions. The value of the TE feature of a candidate link is 1/2 * (PTR(TOKEN1, TOKEN2) + PTR(TOKEN2, TOKEN1)).

Slide 8: Entropy score (ES)
- The entropy of a word's translation equivalents distribution proved to be an important hint for identifying highly reliable links (anchoring links).
- Skewed distributions are favored over uniform ones.
- For a link, the feature value is 0.5 * (ES(A) + ES(B)).
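The slide does not give the exact ES formula; a minimal sketch, assuming ES is 1 minus the normalized Shannon entropy (so skewed distributions score higher, matching the stated preference), could look like this:

```python
import math

def entropy_score(translation_probs):
    """One plausible reading of ES: 1 minus the normalized Shannon entropy
    of a word's translation-equivalent distribution, so that a skewed
    (reliable) distribution scores near 1 and a uniform one scores 0.
    The exact normalization used by the authors is not given on the slide."""
    n = len(translation_probs)
    if n <= 1:
        return 1.0
    h = -sum(p * math.log(p) for p in translation_probs if p > 0)
    return 1.0 - h / math.log(n)

print(entropy_score([0.97, 0.01, 0.01, 0.01]))  # skewed: close to 1
print(entropy_score([0.25, 0.25, 0.25, 0.25]))  # uniform: near 0
```

The link-level feature would then be `0.5 * (entropy_score(dist_a) + entropy_score(dist_b))`, as stated on the slide.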

Slide 9: Part-of-speech affinity (PA)
- An important clue in word alignment is that translated words tend to keep their part of speech, and when they do have different POSes, this is not arbitrary.
- We tried to use GIZA++ (replacing tokens with their respective POSes), but there was too much noise.
- The information was computed from a gold standard (the revised NAACL 2003 data), in both directions (source-target and target-source).
- For a link, PA = 0.5 * (P(cat(A)|cat(B)) + P(cat(B)|cat(A))).
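Estimating the POS affinity tables from gold-standard links and averaging the two conditional probabilities can be sketched as follows (the toy gold standard is invented for illustration):

```python
from collections import Counter

def pos_affinity(gold_pairs):
    """Estimate P(target_cat | source_cat) from gold-standard links
    given as (source_POS, target_POS) pairs."""
    pair_counts = Counter(gold_pairs)
    src_counts = Counter(src for src, _ in gold_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

def pa_feature(p_src_tgt, p_tgt_src, cat_a, cat_b):
    """PA = 0.5 * (P(cat(A)|cat(B)) + P(cat(B)|cat(A)))."""
    return 0.5 * (p_tgt_src.get((cat_b, cat_a), 0.0)
                  + p_src_tgt.get((cat_a, cat_b), 0.0))

# Toy gold standard: nouns usually stay nouns, occasionally become verbs
gold = [("N", "N"), ("N", "N"), ("N", "N"), ("N", "V")]
p_st = pos_affinity(gold)
p_ts = pos_affinity([(t, s) for s, t in gold])
print(pa_feature(p_st, p_ts, "N", "N"))  # 0.875
```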

Slide 10: Cognates (COG)
- The cognates feature assigns a string similarity (based on Levenshtein distance) to the tokens of a candidate link.
- We estimated the probability that a pair of orthographically similar words appearing in aligned sentences are cognates, at different string similarity thresholds. For the threshold 0.6 we found no exceptions. Therefore, the value of this feature is either 1 (if the similarity score is above the threshold) or 0 (otherwise).
- Before computing the string similarity score, the words are normalized (duplicate letters are removed, diacritics are removed, some suffixes are discarded).
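A sketch of the binary cognate test, assuming normalized Levenshtein similarity with the 0.6 threshold from the slide; the normalization here is approximate (suffix stripping is omitted):

```python
import re
import unicodedata

def normalize(word):
    """Strip diacritics and collapse duplicate letters, a rough stand-in
    for the normalization described on the slide (suffix stripping omitted)."""
    word = unicodedata.normalize("NFD", word.lower())
    word = "".join(c for c in word if unicodedata.category(c) != "Mn")
    return re.sub(r"(.)\1+", r"\1", word)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (Wagner-Fischer)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cog_feature(w1, w2, threshold=0.6):
    """Binary cognate feature: 1 if normalized string similarity
    exceeds the threshold, else 0."""
    a, b = normalize(w1), normalize(w2)
    sim = 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
    return 1 if sim > threshold else 0

print(cog_feature("probabilité", "probability"))  # 1
print(cog_feature("casa", "house"))               # 0
```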

Slide 11: Context-dependent features
- Locality
- Crossed links
- Relative position/distortion
- Collocation/fertility
- Coherence

Slide 12: Collocation
- Bigram lists (content words only) were built from each monolingual part of the training corpus, using the log-likelihood score (threshold 10) and a minimal occurrence frequency (3) for candidate filtering. Collocation probabilities are estimated for each surviving bigram.
- If neither token of a candidate link has a relevant collocation score with the tokens in its neighborhood, the link's value for this feature is 0. Otherwise, the value is the maximum of the collocation probabilities of the link's tokens. Competing links (starting or ending in the same token) are licensed if and only if at least one of them has a non-null collocation score.
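The bigram filtering step can be sketched with Dunning's log-likelihood ratio, a standard choice for this kind of test; the slide does not show the exact formula used, so treat this as an assumption:

```python
import math

def ll_score(c12, c1, c2, n):
    """Dunning's log-likelihood ratio for a bigram (w1, w2):
    c12 = bigram count, c1 and c2 = unigram counts, n = total bigrams."""
    def logl(k, m, p):
        p = min(max(p, 1e-12), 1 - 1e-12)   # guard against log(0)
        return k * math.log(p) + (m - k) * math.log(1 - p)
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    return 2 * (logl(c12, c1, p1) + logl(c2 - c12, n - c1, p2)
                - logl(c12, c1, p) - logl(c2 - c12, n - c1, p))

def collocation_probs(bigram_counts, unigram_counts, min_freq=3, threshold=10.0):
    """Keep bigrams that occur at least min_freq times and pass the
    log-likelihood threshold; estimate a probability for each survivor."""
    n = sum(bigram_counts.values())
    return {bg: c / n for bg, c in bigram_counts.items()
            if c >= min_freq
            and ll_score(c, unigram_counts[bg[0]], unigram_counts[bg[1]], n) > threshold}
```

A bigram observed far more often than chance predicts (e.g. 30 co-occurrences where independence would expect about 1.4) scores far above the threshold of 10, while an independent pair scores near 0.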

Slide 13: Distortion/relative position
- Each token on both sides of a bi-text is characterized by a position index, computed as the ratio between its relative position in the sentence and the length of the sentence. The absolute value of the difference between the tokens' position indexes gives the link's "obliqueness".
- The distortion feature of a link is its obliqueness: D(link) = OBL(SWi, TWj).
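The obliqueness computation is a one-liner; whether positions are 1-based is an assumption here, as the slide does not say:

```python
def obliqueness(src_pos, src_len, tgt_pos, tgt_len):
    """Obliqueness of a link: the absolute difference of the two tokens'
    position indexes, each index being position / sentence length
    (1-based positions are assumed)."""
    return abs(src_pos / src_len - tgt_pos / tgt_len)

# A link between the 1st of 4 source tokens and the 3rd of 4 target
# tokens is fairly oblique; a link on the diagonal scores 0
print(obliqueness(1, 4, 3, 4))     # 0.5
print(obliqueness(5, 10, 5, 10))   # 0.0
```

Links near the diagonal of the sentence pair get low distortion values, which makes them cheaper to accept.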

Slide 14: Localization
- This feature is relevant with or without chunking or dependency parsing modules; it accounts for the degree of cohesion of the links.
- When the chunking module is available and the chunks are aligned via the linking of their respective heads, the links starting in one chunk should end in the aligned chunk.
- When chunking information is not available, link localization is judged against a window whose span depends on the length of the aligned sentences.
- Maximum localization (1) is reached when all the tokens in the source window are linked to all the tokens in the target window.

Slide 15: Crossed links
- The crossed-links feature counts, within a window whose size depends on the categories of the candidate tokens and the sentence lengths, the links that are crossed.
- The normalization factor (the maximum number of crossable links) is set empirically, based on the categories of the link's tokens.

Slide 16: Evaluation, official ranking
- U.RACAI.Combined
- L.ISI.Run5.vocab.grow


Slide 20: Word alignment combiners
- The COWAL (ACL 2005) combiner is rule-based and fine-tuned for the language pair concerned.
- The SVM filter is a language-independent combiner, trainable on positive and negative examples.
- The choice between them is a trade-off between human introspection and performance.

Slide 21: SVM filter
- Combining word alignments requires the ability to distinguish between the correct and incorrect links of the two or more merged alignments. SVM technology is particularly well suited for this task.
- The SVM combiner is a classifier trained on both positive and negative examples.
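A minimal sketch of such a filter with scikit-learn; the feature vectors, their values, and the linear kernel are invented for illustration, not taken from the actual COWAL training setup:

```python
# Requires scikit-learn (pip install scikit-learn)
from sklearn.svm import SVC

# Each candidate link is described by feature values like those on the
# previous slides (TE, ES, PA, COG, obliqueness), labelled 1 for a
# correct link and 0 for a wrong one.
X_train = [
    [0.90, 0.85, 0.70, 1, 0.05],  # correct links
    [0.80, 0.90, 0.60, 1, 0.10],
    [0.70, 0.75, 0.80, 0, 0.02],
    [0.10, 0.20, 0.30, 0, 0.80],  # wrong links
    [0.05, 0.10, 0.20, 0, 0.90],
    [0.20, 0.30, 0.10, 0, 0.70],
]
y_train = [1, 1, 1, 0, 0, 0]

clf = SVC(kernel="linear").fit(X_train, y_train)

# A merged link is kept only if the classifier accepts it
candidate = [0.85, 0.80, 0.75, 1, 0.05]
print(clf.predict([candidate])[0])  # 1
```

Training on both positive and negative examples is what lets the filter reject individual links coming out of the merge, rather than whole alignments.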

Slide 22: SVM filter evaluation
Precision, recall and F-measure were reported for MEBA, COWAL, MEBA filtered, and YAWA & MEBA filtered; the numeric values did not survive the transcript. The SVM model was trained on the NAACL 2003 gold standard.

Slide 23: Romanian Acquis
- The available Romanian documents were downloaded from CCVISTA (over … Microsoft Word documents).
- We kept only … files (some of them were different versions of the same document).
- The remaining documents were converted into the same XML format as the ACQUIS corpus.
- Of the Romanian files, only 6256 are available for English in the JRC distribution.

Slide 24: Romanian Acquis
- Tokenization
- Sentence splitting
- POS tagging
- Lemmatization
- Chunking
- Sentence alignment
