The Web as a Parallel Corpus. Philip Resnik, Noah A. Smith. Computational Linguistics, 29(3), pp. 349–380, MIT Press, 2003. University of Maryland, Johns Hopkins University.


The Web as a Parallel Corpus
Philip Resnik, Noah A. Smith
Computational Linguistics, 29(3), pp. 349–380, MIT Press, 2003
University of Maryland, Johns Hopkins University

Abstract
STRAND system for mining parallel text on the World Wide Web
– reviewing the original algorithm and results
– presenting a set of significant enhancements
  – use of supervised learning based on structural features of documents to improve classification performance
  – a new content-based measure of translational equivalence
  – adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale
  – construction of a significant parallel corpus for a low-density language pair

Introduction
Parallel corpora (bitexts):
– are used for automatic lexical acquisition (Gale and Church 1991; Melamed 1997)
– provide indispensable training data for statistical translation models (Brown et al. 1990; Melamed 2000; Och and Ney 2002)
– provide the connection between vocabularies in cross-language information retrieval (Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997)

Recent work at UM and JHU exploits parallel corpora to develop monolingual resources and tools, using a process of annotation, projection, and training:
– given a parallel corpus in English and a less resource-rich language, project English annotations across the parallel corpus to the second language, using word-level alignments as the bridge
– then use robust statistical techniques to learn from the resulting noisy annotations
(Cabezas, Dorr, and Resnik 2001; Diab and Resnik 2002; Hwa et al. 2002; Lopez et al. 2002; Yarowsky, Ngai, and Wicentowski 2001; Yarowsky and Ngai 2001; Riloff, Schafer, and Yarowsky 2002)

Parallel corpora are a critical resource, but they are not readily available in the necessary quantities.
Earlier work relied heavily on French-English
– because the Canadian parliamentary proceedings (Hansards) in English and French were the only large bitext available
Other sources:
– United Nations proceedings (LDC)
– religious texts (Resnik, Olsen, and Diab 1999)
– software manuals (Resnik and Melamed 1997; Menezes and Richardson 2001)
Existing parallel corpora tend to be unbalanced, representing primarily governmental or newswire-style texts.

World Wide Web
People tend to see the Web as a reflection of their own way of viewing the world:
– a huge semantic network
– an enormous historical archive
– a grand social experiment
A great big body of text waiting to be mined
A huge fabric of linguistic data, often interwoven with parallel threads

STRAND (Resnik 1998, 1999)
– Structural Translation Recognition, Acquiring Natural Data
– identifies pairs of Web pages that are mutual translations
Incorporating new work on content-based detection of translations (Smith 2001, 2002)
Efficient exploitation of the Internet Archive

Finding parallel text on the Web
Location of pages that might have parallel translations
Generation of candidate pairs that might be translations
Structural filtering out of nontranslation candidate pairs

Locating Pages
AltaVista search engine's advanced search is used to find two types of Web pages: parents and siblings.
– Query AltaVista with Boolean expressions over anchor text: (anchor:"english" OR anchor:"anglais") (anchor:"french" OR anchor:"français")
Spider component
– The results reported here do not make use of the spider.

Generating Candidate Pairs
With URLs
– for example en.html, on which one combination of substitutions might produce the URL ch.html
Another possible criterion for matching is the use of document lengths.
– for text E in language 1 and text F in language 2, length(E) ≈ C * length(F), where C is a constant tuned for the language pair
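
The two criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the substitution table, the example.com URL, the function names, and the tolerance parameter are all hypothetical and are not taken from STRAND itself.

```python
from itertools import product

# Hypothetical substitution table (an assumption for this sketch):
# English-side URL substring -> possible substrings on the other-language side.
SUBSTITUTIONS = {
    "english": ["chinese", "big5"],
    "_en": ["_ch", "_zh"],
}


def candidate_urls(url, substitutions=SUBSTITUTIONS):
    """Return every URL produced by one combination of substitutions."""
    present = [src for src in substitutions if src in url]
    results = set()
    # One replacement choice per substring that occurs in the URL.
    for choice in product(*(substitutions[src] for src in present)):
        new_url = url
        for src, dst in zip(present, choice):
            new_url = new_url.replace(src, dst)
        results.add(new_url)
    results.discard(url)  # a URL is never its own translation candidate
    return results


def plausible_lengths(len_e, len_f, c=1.0, tolerance=0.3):
    """Check length(E) ~ C * length(F) within a relative tolerance (assumed)."""
    expected = c * len_f
    return abs(len_e - expected) <= tolerance * expected


if __name__ == "__main__":
    for u in sorted(candidate_urls("http://example.com/english/home_en.html")):
        print(u)
```

A generated URL only becomes a candidate pair if that page actually exists; the length check can then discard pairs whose sizes are implausibly different.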

Structural Filtering
Linearize the HTML structure and ignore the actual linguistic content of the documents
Align the linearized sequences using a standard dynamic programming technique (Hunt and McIlroy 1975)
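
A minimal sketch of the linearize-and-align idea, using only the Python standard library. Two things here are assumptions of the sketch rather than details from the slides: the token names (START:tag, END:tag, CHUNK), and the use of difflib.SequenceMatcher, which is a different sequence-matching algorithm from the Hunt and McIlroy technique cited above. The difference percentage computed at the end is likewise one plausible definition, not necessarily the one STRAND uses.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Turn an HTML document into a flat sequence of structural tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("START:" + tag)

    def handle_endtag(self, tag):
        self.tokens.append("END:" + tag)

    def handle_data(self, data):
        # The linguistic content is ignored; only the presence of a
        # non-empty text chunk is recorded.
        if data.strip():
            self.tokens.append("CHUNK")


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def difference_percentage(html_a, html_b):
    """Percentage of tokens (over both sequences) left unaligned."""
    a, b = linearize(html_a), linearize(html_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 0.0 if total == 0 else 100.0 * (total - 2 * matched) / total
```

Two pages with identical markup but different languages give a difference percentage of 0, since their text content never enters the comparison.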

STRAND Results (1)
Recall in this setting is measured relative to the set of candidate pairs that was generated.
Precision is judged by asking:
– "Was this pair of pages intended to provide the same content in the two different languages?"
– Asking the question in this way leads to high rates of interjudge agreement, as measured using Cohen's κ measure.
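
For readers unfamiliar with Cohen's κ, the following sketch shows how it is computed for two judges: observed agreement is corrected for the agreement expected by chance from each judge's label frequencies. The label values and function name are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each judge's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    judge_1 = ["yes", "yes", "no", "no"]
    judge_2 = ["yes", "yes", "no", "yes"]
    print(cohens_kappa(judge_1, judge_2))
```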

STRAND Results (2)
Using manually set thresholds for dp and n, STRAND obtained 100% precision and 68.6% recall in an experiment to find English-French Web pages (Resnik 1999).
When STRAND was used to obtain English-Chinese pairs, a similar formal evaluation found that the resulting set had 98% precision and 61% recall.

Assessing the STRAND Data
A translation lexicon automatically extracted from the French-English STRAND data could be combined productively with a bilingual French-English dictionary in order to improve retrieval results using a standard cross-language IR test collection (English queries against the CLEF-2000 French collection)
– backing off from the dictionary to the STRAND translation lexicon accounted for over 8% of the lexicon matches (by token)
– reducing the number of untranslatable terms by a third and producing a statistically significant 12% relative improvement in mean average precision as compared to using the dictionary alone

– 30 human-translated sentence pairs from the FBIS (Release 1) English-Chinese parallel corpus, sampled at random
– 30 Chinese sentences from the FBIS corpus, sampled at random, paired with their English machine translation output from AltaVista's Babelfish
– paired items from Chinese-English Web data, sampled at random from "sentence-like" aligned chunks as identified using the HTML-based chunk alignment process of STRAND

Optimizing Parameters Using Machine Learning
Using the English-French data, we constructed a ninefold cross-validation experiment using decision tree induction to predict the class assigned by the human judges
– The decision tree software was the widely used C5.0

Content-Based Matching
– As Ma and Liberman (1999) point out, not all translators create translated pages that look like the original page
– structure-based matching is applicable only in corpora that include markup
– other applications for translation detection exist
A generic score of translational similarity that is based upon any word-to-word translation lexicon
– Define a cross-language similarity score, tsim

Quantifying Translational Similarity
(Melamed's [2000] Method A)
– let a link be a pair (x, y)
– maximum-weighted bipartite matching problem, O(max(|X|,|Y|)^3)
The original method used only the structure of the documents; now word-to-word similarity is used, so a lexicon is required.
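
The matching view can be sketched as follows, under several loudly stated assumptions. The lexicon is taken to be unweighted, so the maximum-weighted matching reduces to a maximum-cardinality bipartite matching, computed here with a simple augmenting-path search rather than a cubic weighted algorithm. The score is assumed to be the fraction of links that join two words, where every unmatched word counts as a link to nothing. The function names and the toy lexicon are hypothetical.

```python
def max_matching_size(tokens_x, tokens_y, lexicon):
    """Size of a maximum bipartite matching between two token lists.

    lexicon is a set of (x_word, y_word) pairs that may be linked.
    """
    # For each token position in X, the positions in Y it may link to.
    adjacency = [
        [j for j, y in enumerate(tokens_y) if (x, y) in lexicon]
        for x in tokens_x
    ]
    match_of_y = [None] * len(tokens_y)

    def try_augment(i, visited):
        # Find a free Y position for X position i, re-routing earlier
        # matches along an augmenting path when necessary.
        for j in adjacency[i]:
            if j in visited:
                continue
            visited.add(j)
            if match_of_y[j] is None or try_augment(match_of_y[j], visited):
                match_of_y[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(tokens_x)))


def tsim(tokens_x, tokens_y, lexicon):
    """Two-word links divided by all links (assumed definition)."""
    two_word = max_matching_size(tokens_x, tokens_y, lexicon)
    # Each unmatched token on either side contributes a one-word link.
    total = len(tokens_x) + len(tokens_y) - two_word
    return two_word / total if total else 0.0


if __name__ == "__main__":
    lex = {("the", "la"), ("house", "maison"), ("is", "est"), ("red", "rouge")}
    print(tsim("the house is red".split(), "la maison est rouge".split(), lex))
```

With a weighted lexicon the same structure applies, but the matching step would need a weighted assignment algorithm instead of the augmenting-path search used here.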

Exploiting the Internet Archive
The Internet Archive (120 TB, growing by 8 TB per month) is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public.
Wayback Machine
– Temporal database, but not stored in temporal order
– Data need decompression
– Almost all data are stored in compressed plain-text files

STRAND on the Archive
1. Extracting URLs from index files using simple pattern matching
2. Combining the results from step 1 into a single huge list
3. Grouping URLs into buckets by handles
4. Generating candidate pairs from buckets

URL similarity
We arrived at an algorithmically simple solution that avoids this problem but is still based on the idea of language-specific substrings (LSSs).
– The idea is to identify a set of language-specific URL substrings that pertain to the two languages of interest (e.g., based on language names, countries, character codeset labels, abbreviations, etc.)
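
Grouping URLs into buckets by handle and pairing within buckets might look like the sketch below. The LSS lists, the rule that a substring only counts when delimited by non-alphanumeric characters, and all function names are assumptions made for illustration; the slides do not specify them.

```python
import re
from collections import defaultdict
from itertools import product

# Hypothetical language-specific substrings (LSSs) for the two languages.
LSS = {
    "en": ["english", "eng", "en"],
    "ar": ["arabic", "ara", "ar"],
}


def make_pattern(substrings):
    # Longest alternatives first; match only when the substring is not
    # embedded in a longer alphanumeric run (e.g. "en" inside "content").
    alternatives = "|".join(
        sorted(map(re.escape, substrings), key=len, reverse=True)
    )
    return re.compile(r"(?<![a-z0-9])(?:%s)(?![a-z0-9])" % alternatives)


PATTERNS = {lang: make_pattern(subs) for lang, subs in LSS.items()}


def bucket_by_handle(urls):
    """Map each handle to the URLs, per language, that reduce to it."""
    buckets = defaultdict(lambda: defaultdict(set))
    for url in urls:
        for lang, pattern in PATTERNS.items():
            # The handle is the URL with that language's LSSs removed.
            handle, removed = pattern.subn("", url.lower())
            if removed:
                buckets[handle][lang].add(url)
    return buckets


def candidate_pairs(urls, lang_a="en", lang_b="ar"):
    """Pair URLs of the two languages that share a handle."""
    pairs = []
    for by_lang in bucket_by_handle(urls).values():
        pairs.extend(product(sorted(by_lang[lang_a]), sorted(by_lang[lang_b])))
    return sorted(pairs)
```

Because each URL is visited once and pairs are formed only inside a bucket, the work stays close to linear in the number of URLs as long as buckets remain small.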

Building an English-Arabic Corpus
Finding English-Arabic candidate pairs on the Internet Archive
– 24 top-level national domains: Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw) …
– 21 .com domains: emiratesbank.com, checkpoint.com …

Future Work
Weights in the dictionary can be exploited
– IR IDF scores for discerning
Bootstrapping
– Seed to form high-precision initial classifiers

Search Engine Estimate
According to (Dec. 31, 2002)
– Google: 3,033 million
– AltaVista: 1,689 million (ratio to Google 1:1.795)
Estimates are based on exact counts obtained from AlltheWeb, multiplied by the percentage from the Relative Size Showdown as compared to the number found by AlltheWeb.
Our estimate for Chinese page count
– From ASBC, after removing punctuation, Latin letters, Zhuyin symbols (ㄅㄆㄇ), unrecognized characters, and similar symbols, 5915 terms (single-character words) are used, occurring 7,844,439 times in total.
– Using the middle 2000 of these 5919 terms, the relative counts against Google and AltaVista are 5021 times and 2181 times; Google is 2.3 times AltaVista.