Michael Mohler, Rada Mihalcea Department of Computer Science University of North Texas BABYLON Parallel Text Builder:

BABYLON Parallel Text Builder: Gathering Parallel Texts for Low-Density Languages
Michael Mohler, Rada Mihalcea
Department of Computer Science, University of North Texas

Three Categories of Languages
- High-density
  - used globally (especially on the Web)
  - well integrated with technology
  - e.g., English, Spanish, Chinese, Arabic
- Medium-density
  - fewer resources globally
  - dominant language in certain regions or fields
- Low-density
  - majority of all languages
  - regional media (e.g., radio, newspapers) often in higher-density languages

The Web as a Parallel Text Repository
PROS
- Data is free, plentiful, and omnilingual
- NLP tools have achieved good results with little supervision
- Many websites are multilingual with translated content
CONS
- Data on the Web is not formatted consistently
- Some languages are poorly represented
- The quality of translations is questionable

The Questions
- Can existing techniques to build parallel texts using the Web be successfully applied in a low-density language context?
- To what extent do parallel texts discovered from the Web enhance the quality (or coverage) of existing parallel texts?

Goals of the Babylon Project
- Apply existing parallel text gathering techniques to low-density languages paired with higher-density languages
- Remain as language- and resource-independent as possible
- Discover pages that contain “on-page” translations
  - Existing systems would typically miss these translations
- Analyze the usability of Web-gathered parallel texts in a machine translation environment
Note: The language pair used in our experiments is Quechua-Spanish

Babylon System Overview
- Stage 1: Discover seed URLs for Web crawl
- Stage 2: Find pages with minor-language content through a Web crawl
- Stage 3: Categorize pages
- Stage 4: Find major-language pages near minor-language pages
- Stage 5: Filter out non-parallel texts
- Stage 6: Align remaining texts
- Stage 7: Evaluate the texts in a machine translation environment

System Flow

Stage 1: Where to start?
- Find data in the minor language somewhere on the Web
- Starting from a monolingual text, up to 1,000 words are selected automatically
  - Try to find a balance between frequently occurring words and less common words
- Use these words to query Google using the SOAP API
- Use the pages returned by these queries as starting points

Stage 2: Find Minor-Language Pages
- Perform a modified BFS (Somboonviwat et al. 2006) starting from the seed pages from Stage 1
  - Outlinks from a page in the target language are preferred
  - The search is limited to the first one million pages downloaded
  - Pages are analysed if they are in any of the following formats: html, pdf, txt, doc, rtf
- Perform language identification using the text_cat tool
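A minimal sketch of how a language-preferring breadth-first crawl like the one in Stage 2 might be organised, assuming a priority queue in which outlinks found on minor-language pages are visited before all others. The names prioritized_crawl, get_outlinks, and is_minor_language are hypothetical stand-ins for the downloader and the text_cat check; this is an illustration only, not the Babylon code or the exact method of Somboonviwat et al.

```python
import heapq
from itertools import count


def prioritized_crawl(seeds, get_outlinks, is_minor_language, max_pages=1_000_000):
    """Crawl from the seeds, preferring outlinks of minor-language pages.

    get_outlinks(url) and is_minor_language(url) are caller-supplied
    (hypothetical) callbacks standing in for download and language ID.
    """
    tie = count()  # keeps discovery (BFS) order within one priority level
    frontier = [(0, next(tie), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    minor_pages = []
    downloaded = 0
    while frontier and downloaded < max_pages:
        _, _, url = heapq.heappop(frontier)
        downloaded += 1
        in_minor = is_minor_language(url)
        if in_minor:
            minor_pages.append(url)
        # Links found on a minor-language page jump ahead of the rest.
        priority = 0 if in_minor else 1
        for link in get_outlinks(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority, next(tie), link))
    return minor_pages
```

With a small page budget, the preference matters: links discovered on a minor-language page are fetched before links discovered earlier on other pages.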

Stage 3: Categorization
Categorize all the minor-language pages into one of two categories: “weak” or “strong”
- “Weak” pages: primarily written with major-language content and suggest an “on-page” translation
- “Strong” pages: primarily written in the minor language

Stage 4: Find Major-Language Pages
There are two categories of major-language pages that are considered:
- First: Pages that contain a translation “on-page”
  - The major-language translation has already been stored
  - These pages will not be revisited until Stage 6.
- Second: Pages that are near the “strong” minor-language page
  - Webmasters design sites so that one translation is easily accessible from another.
  - Download all the pages within two hyperlinks (undirected) from each “strong” minor-language page and keep all major-language pages for comparison
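The "within two hyperlinks (undirected)" neighbourhood can be sketched as a depth-limited breadth-first search over a link graph whose edges are followed in both directions. The function name pages_within and the (source, destination) pair representation of links are assumptions made for this illustration.

```python
from collections import defaultdict, deque


def pages_within(links, start, max_hops=2):
    """Return pages reachable from start in at most max_hops links,
    treating every hyperlink as undirected."""
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)  # follow the link backwards as well
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if dist[page] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours[page]:
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return {page for page in dist if page != start}
```

In a real crawl the returned candidates would still need language identification so that only major-language pages are kept for comparison.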

Stage 5: Find Possible Translations
Determine if the minor and major language pairs are translations of one another:
- URL matching: Webmasters frequently follow naming conventions with translation pages (e.g. index_es.html & index_qu.html)
- Structure matching: The HTML tags for translation pages are often similar; only the content changes.
- Content matching (without dictionary): Uses vectorial model to find overlap among proper nouns, numbers, some punctuation, etc.
- Content matching (with dictionary): Same as above but with dictionary entries as well.
Any pair that fails all four tests is discarded

Stage 5: URL Matching
- Previous work used a list of string pairs that webmasters use to indicate the language of a page
  - “spanish” vs “english”, “_en” vs “_de”, etc.
  - requires specific knowledge about how webmasters describe languages (e.g. “big5” for Chinese)
- Circumvent the need for a general-purpose list by using an edit distance based approach
- Two URL strings match if the number of additions, substitutions, and deletions required to change one string into another is below a threshold
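The edit-distance test can be sketched with the standard Levenshtein dynamic program. The threshold value of 3 and the function names are assumptions chosen for the example; the slides do not state the threshold actually used.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def urls_match(url_a, url_b, threshold=3):
    """Two URLs match if their edit distance is below the threshold
    (threshold=3 is an illustrative assumption)."""
    return edit_distance(url_a, url_b) < threshold
```

For the pair from the slides, index_es.html and index_qu.html differ by two substitutions, so they match under this threshold without any list of language markers.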

Stage 5: Structure Matching
- Following STRAND (Resnik 2003), convert each page to a tag-chunk representation for comparison
- Find the edit distance between each pair assuming that text chunks with similar length are equivalent
- If the edit distance is below a threshold, the pair is considered a match
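A rough sketch of a tag-chunk comparison in the spirit of the description above, using Python's standard html.parser module. The token format, the 50% length tolerance for "similar length", and all function names are assumptions for illustration; STRAND's actual linearisation and alignment differ in detail.

```python
from html.parser import HTMLParser


class TagChunker(HTMLParser):
    """Linearise a page into start tags, end tags, and text-chunk lengths."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.tokens.append(("chunk", len(text)))


def tag_chunks(html):
    parser = TagChunker()
    parser.feed(html)
    parser.close()
    return parser.tokens


def equivalent(x, y, tolerance=0.5):
    """Text chunks are equivalent if their lengths are similar
    (the tolerance is an assumed value); tags must match exactly."""
    if x[0] == "chunk" and y[0] == "chunk":
        return abs(x[1] - y[1]) <= tolerance * max(x[1], y[1])
    return x == y


def structure_distance(html_a, html_b):
    """Edit distance between the two tag-chunk sequences."""
    a, b = tag_chunks(html_a), tag_chunks(html_b)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if equivalent(x, y) else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

Two pages with the same markup skeleton and text chunks of comparable length get a distance of zero, even though none of the words are shared.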

Stage 5: Content Matching
- Following the PTI System (Chen, Chau, and Yeh 2004), generate the term frequency (tf) vector
  - If a dictionary is used, each word in language B is mapped to its corresponding language A word
  - Additionally, all language B words are mapped to themselves to account for numbers, proper nouns, punctuation, etc.
- The process is repeated after performing light stemming
  - Reduce each word in the text and in the dictionary to its first four letters (“apple” -> “manzana” becomes “appl” -> “manz”)
- Jaccard coefficients are found for the vectors for both mappings
  - Scores are recombined by weighting the non-stemmed score at 75% of the final score
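The content-matching steps can be sketched as follows. The slides do not say which form of the Jaccard coefficient is applied to term-frequency vectors, so this example assumes the weighted (min/max) variant; the function names and the dictionary layout (language B word as key, language A word as value) are likewise assumptions.

```python
from collections import Counter


def light_stem(word):
    """Light stemming as described: keep only the first four letters."""
    return word[:4]


def tf_vector(words, dictionary=None, stem=False):
    """Term-frequency vector. With a dictionary, each language-B word is
    counted as itself and as its language-A translation."""
    dictionary = dictionary or {}
    if stem:
        words = [light_stem(w) for w in words]
        dictionary = {light_stem(b): light_stem(a) for b, a in dictionary.items()}
    vector = Counter()
    for w in words:
        vector[w] += 1  # identity mapping: numbers, proper nouns, punctuation
        if w in dictionary:
            vector[dictionary[w]] += 1
    return vector


def jaccard(u, v):
    """Weighted Jaccard coefficient (assumed variant): sum of minima
    divided by sum of maxima over all terms."""
    keys = set(u) | set(v)
    total_max = sum(max(u[k], v[k]) for k in keys)
    if total_max == 0:
        return 0.0
    return sum(min(u[k], v[k]) for k in keys) / total_max


def content_score(words_a, words_b, dictionary=None, plain_weight=0.75):
    """Combine non-stemmed (75%) and stemmed (25%) Jaccard scores."""
    plain = jaccard(tf_vector(words_a), tf_vector(words_b, dictionary))
    stemmed = jaccard(tf_vector(words_a, stem=True),
                      tf_vector(words_b, dictionary, stem=True))
    return plain_weight * plain + (1 - plain_weight) * stemmed
```

Even without a dictionary, shared numbers and proper nouns produce a non-zero score, which is what makes the dictionary-free variant of the test usable for low-density languages.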

Stage 6: Alignment
- The final phase uses the alignment tool champollion
  - attempts to align the paragraphs of two files considering sentence length, numbers, cognates, and (optionally) dictionary entries.
- From this output, a final alignment score is computed: (one_to_one * one_to_many)/num_paragraphs
- The score favours alignments with many one-to-one matchings and disfavours alignments with many dropped paragraphs.
- For each minor-language text, the major-language text that has the highest alignment score above a given threshold is kept as its match.
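The final selection step (keep, for each minor-language text, the best-scoring major-language text above a threshold) might look like the sketch below. It takes already computed alignment scores as input; the function name best_matches and the dictionary-of-pairs input format are assumptions for illustration.

```python
def best_matches(scores, threshold):
    """scores maps (minor_text, major_text) pairs to alignment scores.
    Returns, for each minor text, the major text with the highest score
    strictly above the threshold; minor texts with no such candidate
    are left out."""
    best = {}
    for (minor, major), score in scores.items():
        if score > threshold and (minor not in best or score > best[minor][1]):
            best[minor] = (major, score)
    return {minor: major for minor, (major, _) in best.items()}
```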

Stage 7: Machine Translation Evaluation - Experiment Setup
- Use the Moses machine translation toolkit with the crawled parallel texts, alone and in conjunction with other parallel texts, to translate a set of texts
- Training data
  - Crawled parallel texts AND/OR
  - Machine-readable verse-aligned Bibles in both languages
  - Four Bible translations available in Spanish and one in Quechua

Stage 7 (cont)
- Test data (removed from training)
  - Three complete books (Exodus, Proverbs, and Hebrews)
  - A subset of the crawled parallel text
    - To determine the effect of domain transfer on translation needs
- Translation models
  - Six translation models are created
  - A cross product parallel text composed of all Spanish Bibles (4) matched against all Quechua Bibles (1) is also used
  - For each quantity of Biblical data (“none”, “Bible”, and “4 Bibles”), two translation models are created by including the crawled texts or not

Evaluation
- Translation models are evaluated using BLEU
  - measures the N-gram overlap between the translated text and a reference gold-standard translation
  - Each translation model is tested against both evaluation sets: “Bible” and “Crawled”
Note: an expert-quality translation receives a BLEU score of around 30
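To make "N-gram overlap" concrete, the sketch below computes the clipped n-gram precision that BLEU is built from. It is deliberately not a full BLEU implementation: the brevity penalty and the geometric mean over n-gram orders are omitted, and the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference,
    with each n-gram credited at most as often as the reference has it."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, ref[gram]) for gram, c in cand.items())
    return overlap / total
```

Clipping is what stops a degenerate output such as "the the the" from scoring perfectly against a reference that contains "the" only once.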

Results
[Two result panels, contents not preserved in the transcript: Spanish to Quechua; Quechua to Spanish]

Conclusions
- The crawled texts do not contaminate the translation models
  - Little improvement for the Bible test set
  - Do not seem to degrade the translation quality
- Crawled texts are necessary for improving coverage
  - The Bible training set alone is insufficient for translating the crawled test set
- The crawled training set evaluated against the crawled test set outperforms all other training-test combinations

References
- Jiang Chen and Jian-Yun Nie, “Parallel Web Text Mining for Cross-Language IR,” Proceedings of RIAO-2000: Content-Based Multimedia Information Access.
- Jisong Chen, Rowena Chau, and Chung-Hsing Yeh, “Discovering Parallel Text from the World Wide Web,” ACSW Frontiers ‘04: Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalization.
- Xiaoyi Ma and Mark Y. Liberman, “BITS: A Method for Bilingual Text Search over the Web.”
- Philip Resnik, “Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text,” AMTA ‘98: Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and Information Soup.
- Philip Resnik and Noah A. Smith, “The Web as a Parallel Corpus,” Computational Linguistics 29 (2003).
- Kulwadee Somboonviwat, Takayuki Tamura, and Masaru Kitsuregawa, “Finding Thai Web Pages in Foreign Web Spaces,” ICDEW ‘06: Proceedings of the 22nd International Conference on Data Engineering Workshops.
- J. Tomás, E. Sánchez-Villamil, L. Lloret, and F. Casacuberta, “WebMining: An Unsupervised Parallel Corpora Web Retrieval System,” Proceedings from the Corpus Linguistics Conference, 2005.