Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.

Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington

An early parallel text

Uses for Parallel Corpora Parallel corpora are valuable resources for natural language processing (NLP) – Machine translation – Cross-lingual information retrieval (IR) E.g. PanImages from the University of Washington Cross-lingual image search system http://www.panimages.org/ – Computer Aided Human Translation – Monolingual NLP via information projection – …

What is a Parallel Text or Parallel Corpus? Translated text/documents in two languages Ideally sentence-aligned (e.g. using method from Gale & Church 1993)

Examples for Parallel Corpora EUROPARL - European parliament proceedings – 10 language pairs – About 44 million words/language Canadian parliament proceedings (Hansard) – English – French Software documentation in multiple languages …

Motivation Problem: Parallel corpora exist only for a limited set of language pairs Problem: Available parallel corpora are often very domain-specific Problem: Available parallel corpora are often small Task: Finding parallel texts on the Web

Example & Walk-Through Previous work: Ma and Liberman (1999) Chen and Nie (2000) Resnik and Smith (2003) 7

Main Steps in Identifying Parallel Text on the Web (Resnik and Smith, 2003) 1.Locating pages that might have parallel translations 2.Generating candidate page pairs that might be translations 3.Filtering out of non-translation candidate pairs 8

Our approach 1.Locating pages that might have parallel translations:  Sampling by sending queries 2. Generating candidate page pairs that might be translations:  Comparing URLs with different matching methods 3.Filtering out of non-translation candidate pairs:  Combining structural and content-based filtering

System Overview

Outline System description (1a) Sampling the source language L1 (1b) Checking pages in the target language L2 (2) Matching pages in L1 and L2 (3) Filtering page pairs Experiments Conclusion and future work

(1a) One-term Sample Sample – Search engine query of one term – Limited to pages in source language – Optional parameter: inurl: – Submitted to search engine API – Search engine does automatic stemming – 100 pages in result set

(1a) Choosing Terms Dictionary – Built using Giza++ word alignment tool – Trained on years 2001-2003 of the Europarl corpus – Contains IBM Model 1 translation probabilities Sampling term – Selected from source language vocabulary – Mid-frequency term Selected at random using a normal distribution Goal: Avoid domain-specificity

(1a) Source Language Expansion From one-term to n-term queries Common IR query expansion technique Based on page summaries returned by the one-term sampling query Summary terms ranked by frequency Leads to semantically related terms because of relevancy ranking of search engine results “shannon” → “information claude” “inconveniences” → “security travelers” Original term expanded with one or more expansion terms re-submitted to search engine

(1b) Checking Query Sampling query terms translated using the Giza++ dictionary “inconvenience security travelers” → “unannehmlichkeit sicherheit” m-best translations of n sampling terms lead to m n checking queries Optional parameter: inurl:

(1b) Target language expansion Alternative to translating a complete n-term sampling query Only translate original one-term sample Expand on target language side equivalently as on source language side m checking queries instead of m n Efficiency vs. source language expansion evaluated in experiments

(1b) “site:” Parameter Optional: site parameter – Allows sites retrieved in checking query to be restricted to sites returned in sampling query – Search engine limits to sites of first 30 sampling query page results

(2) Matching URLs with Fixed Language List URLs from corresponding sampling and checking result sets Considered a match if they only differ in a in a fixed list of language IDs ende en-usde-de enge enudeu enuger englishgerman englischdeutsch

(2) Matching URLs with Levenshtein Distance Levenshtein distance – Also known as “edit distance” – URLs from corresponding sampling and checking result sets – Considered a match when URLs have a Levenshtein distance less or equal than 4, but larger than 0 http://ec.europa.eu/education/policies/rec_qual/ recognition/diploma_en.html http://ec.europa.eu/education/policies/rec_qual/ recognition/diploma_de.html

(2) URL part substitution Sampling L1  source URLs Replacing L1 names/ids in each source URL with L2 names/ids  target URLs Checking whether the target URLs exist Does not require checking queries!

(3) Filtering page pairs Structural filtering (Resnik and Smith, 2003) Content translation metric (Ma and Liberman, 1999) Linear combination

(3) Linearization Linearized File [START:HTML] [START:HEAD] [START:META] [Chunk: 12] [END:META] [START:TITLE] [Chunk:25] [END:TITLE] [END:HTML] … ….. HTML file

(3) Alignment Linearized Source File [START:HTML] [START:HEAD] [START:META] [END:META] [START:TITLE] [Chunk:58] [END:TITLE] [START:META] … Linearized Target File [START:HTML] [START:HEAD] [START:META] [END:META] [START:LINK] [END:LINK] [START:TITLE] [Chunk:68] [END:TITLE] [START:META] …

(3) Structural Metrics Difference percentage (dp) – Measures how different markup in linearized files is – Based on longest common subsequence algorithm (Hunt & McIlroy 1976) – Implemented using diff tool Length correlation of aligned non-markup chunks (r) – Pearson correlation coefficient over all aligned chunks in a file pair – Length of content in characters

(3) Content Translation Metric Calculated on first 500 content words on page Using the Giza++ translation dictionary

(3) Combining Two Kinds of Metrics Structural metrics: dp and r Content-based metric: c Linear combination:

Different settings for experiments (1a) Sampling the source language L1 – Source expansion – The “inurl:” parameter (1b) Checking pages in the target language L2 – Target expansion – The “inurl:” and “site:” parameter (2)Matching pages in L1 and L2 – Using a fixed list – Edit distance – URL part substitution 27

Experiment Results – Matches

Observations – Sampling and Checking Query expansion increases the number of page pairs – Source and target query expansion lead to similar results – Difference between n=2 and n=3 is not significant Possible explanation: Larger semantic divergence of queries on the source and target language sides Using site: and inurl: search parameters increases the number of discovered page pairs – But: structural parameters might miss candidate pairs that don’t follow pattern

Observations – Matching Number of page pair candidates – URL part substitution >> Levensthein distance – Levenshtein distance > Fixed language list Matching methods that use checking queries are heavily impacted by relevancy rankings Levenshtein distance matching method – Allows learning of URL patterns used for parallel pages

Experiment Results – Filtering

Observations – Filtering Combined filter – Evaluated in comparison to human judge – Precision “How many did we get right?” 88.9% Encouraging on noisy test set – Recall “How many did we miss?” 36.4% Low recall can be compensated for by submitting more queries

Conclusions It is possible reliably gather parallel pages using commercial search engines – Even though there are no standard features identifying these pages – Despite the relevance ranking of commercial search engine results

Future work To improve the precision and recall of the filtering step. To address the relevancy ranking and the page limit problem To study whether some queries are more productive than others To test the usefulness of the collected page pairs on applications such as MT. 34

Additional Slides

Experiments & Results Experiment ID Expansion type Query length (n) inurl: Param site: Param Number of page pairs (before filtering) Number of page pairs (after filtering) ListLevenshteinSubstitutionListLevenshteinSubstitution 1none15 1311081197 2Source28 28188934157 3Source310 421975110124 4none1en/de58 8450831722285 5Source2en/de72 13292792731433 6Source3en/de100 16092002531347 7none16 1810991392 8Target24 24177123143 9Target34 12176100149 10none1en/de56 9350412434281 11Target2en/de107 16191312733426 12Target3en/de45 7283951215335 13none13010 258 n/a 69 14Source23022 743 n/a 932 n/a 15Source33046 1074 n/a 1241 n/a 16none1en/de3059 164 n/a 1315 n/a 17Source2en/de30118 442 n/a 2850 n/a 18Source3en/de30171 693 n/a 4649 n/a

Languages on the Web Source: http://www.glreach.com/globstats/index.php3

Questions Asked 1.How do we find parallel pages in the sea of mostly monolingual pages? 2.What is the share of parallel pages for a given language pair?

Estimating the Percentage of Parallel Pages for a Language Pair E E ∩ F F P(D E |E)=0.03% P(E D |D)=0.27%

References IJCNLP 2008 paper and presentation – http://search.iiit.ac.in/CLIA2008/accepted_papers.php http://search.iiit.ac.in/CLIA2008/accepted_papers.php Email – mailto:achim@digitalsilkroad.net mailto:achim@digitalsilkroad.net MSR internship information – http://research.microsoft.com/aboutmsr/jobs/int ernships/default.aspx http://research.microsoft.com/aboutmsr/jobs/int ernships/default.aspx

Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.

Similar presentations

Presentation on theme: "Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington.

Similar presentations

Presentation on theme: "Finding parallel texts on the web using cross-language information retrieval Achim Ruopp Joint work with Fei Xia University of Washington."— Presentation transcript:

Similar presentations

About project

Feedback