“Searching to Translate” and “Translating to Search”: When Information Retrieval Meets Machine Translation. Ferhan Ture. Dissertation defense, May 24th, 2013.

Presentation transcript:

“Searching to Translate” and “Translating to Search”: When Information Retrieval Meets Machine Translation. Ferhan Ture. Dissertation defense, May 24th, 2013. Department of Computer Science, University of Maryland at College Park.

Motivation
Fact 1: People want to access information, e.g., web pages, videos, restaurants, products, …
Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages.
Goal: Find ways to efficiently and effectively (1) search complex, noisy data, and (2) deliver content in appropriate form (diagram: multi-lingual text → user’s native language; forum posts → clustered summaries).

Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., “the”, “an”, “my”) may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored, relative to the query, usually by scoring each query term independently and aggregating these term-document scores.
The slide illustrates this preprocessing on the paragraph above. After stopword removal and stemming, the passage becomes: retriev ir find materi usual document unstructur natur usual text satisfi need larg collect usual store comput work assum materi collect document written natur languag need form queri rang word entir document typic approach ir repres document vector weight term term mean word stem pre-determin list word may remov set term found creat nois search process document score relat queri score queri term independ aggreg term-docu score.
The resulting weighted term vector: queri:11.69, ir:11.39, vector:7.93, document:7.09, nois:6.92, stem:6.56, score:5.68, weight:5.67, word:5.46, materi:5.42, search:5.06, term:5.03, text:4.87, comput:4.73, need:4.61, collect:4.48, natur:4.36, languag:4.12, find:3.92, repres:3.58
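To make the scoring concrete, here is a minimal sketch of term-at-a-time scoring over such a vector; the tf-idf weighting (log N/df) and the collection statistics are illustrative assumptions, not the exact formula used in the dissertation.

```python
import math
from collections import Counter

def score(query_terms, doc_terms, df, N):
    """Term-at-a-time scoring sketch: score each query term independently
    against the document and aggregate the term-document scores."""
    tf = Counter(doc_terms)                        # term frequencies in the doc
    total = 0.0
    for t in query_terms:
        if t in tf and t in df:
            total += tf[t] * math.log(N / df[t])   # tf * idf contribution
    return total

# toy, pre-stemmed collection statistics (hypothetical)
df = {"queri": 3, "ir": 5, "stem": 2}
print(score(["queri", "ir"], ["ir", "ir", "queri", "stem"], df, N=100))
```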

Cross-Language Information Retrieval
Example document, the opening of the German Wikipedia article on Information Retrieval: “Information Retrieval (IR) bzw. Informationsrückgewinnung, gelegentlich ungenau Informationsbeschaffung, ist ein Fachgebiet, das sich mit computergestütztem Suchen nach komplexen Inhalten (also z. B. keine Einzelwörter) beschäftigt …” (In English: “Information retrieval (IR), occasionally imprecisely called information acquisition, is a field concerned with computer-supported search for complex content (i.e., not single words), falling within information science, computer science, and computational linguistics. As the meaning of retrieval suggests, complex texts or image data stored in large databases are at first not accessible or retrievable to outsiders. Information retrieval is about finding existing information, not discovering new structures (as in knowledge discovery in databases, which includes data mining and text mining).”)

Machine Translation
German: “Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen.”
English: “Machine translation (MT) is the task of translating text written in a source language into corresponding text in a target language.”

Motivation (revisited)
Fact 1: People want to access information, e.g., web pages, videos, restaurants, products, …
Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages.
Goal: Find ways to efficiently and effectively (1) search complex, noisy data, and (2) deliver content in appropriate form: MT and cross-language IR take multi-lingual text to the user’s native language.

Outline
Introduction
Searching to Translate (IR → MT): Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR’11); Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL’12)
Translating to Search (MT → IR): Context-Sensitive Query Translation (Ture et al., SIGIR’12; Ture et al., COLING’12; Ture and Lin, SIGIR’13)
Conclusions

Extracting Parallel Text from the Web
Phase 1 (cross-lingual document pairs): preprocess the source collection F and target collection E into doc vectors, generate signatures for each, and run the sliding window algorithm for candidate generation.
Phase 2 (aligned bilingual sentence pairs): from the cross-lingual document pairs, generate candidate sentence pairs and run the 2-step parallel text classifier to produce F-E parallel text.

Pairwise Similarity
Pairwise similarity: finding similar pairs of documents in a large collection.
Challenges: quadratic search space; measuring similarity effectively and efficiently.
Focus on recall and scalability.

Locality-Sensitive Hashing (pipeline): Ne English articles → preprocess → Ne English document vectors → signature generation → Ne signatures → sliding window algorithm → similar article pairs.

Locality-Sensitive Hashing (Ravichandran et al., 2005)
LSH(vector) = signature, for faster similarity computation, such that similarity(vector pair) ≈ similarity(signature pair). E.g., ~20 times faster than computing (cosine) similarity from the vectors, with similarity error ≈ 0.03.
Sliding window algorithm: approximate similarity search based on LSH, with linear run-time.
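As a concrete illustration, here is a minimal sketch of the random-projection variant (one of the schemes listed in the backup slides); the dimensions and vectors are made up, and this is not the Ivory implementation:

```python
import numpy as np

def lsh_signature(vec, hyperplanes):
    """One bit per random hyperplane: 1 iff the vector lies on its
    positive side (random-projection LSH)."""
    return (hyperplanes @ vec) >= 0

def approx_cosine(sig_a, sig_b):
    """Estimate cosine similarity from signatures: for random hyperplanes,
    P(bits agree) = 1 - theta/pi, so theta ~ pi * hamming / D."""
    hamming = np.count_nonzero(sig_a != sig_b)
    return np.cos(np.pi * hamming / len(sig_a))

rng = np.random.default_rng(0)
dim, D = 500, 1000                      # D = 1000 bits, as in the evaluation
planes = rng.standard_normal((D, dim))
v1 = rng.standard_normal(dim)
v2 = 0.7 * v1 + 0.3 * rng.standard_normal(dim)   # a somewhat similar pair
true = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
est = approx_cosine(lsh_signature(v1, planes), lsh_signature(v2, planes))
print(f"true cosine {true:.3f}, LSH estimate {est:.3f}")
```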

Sliding window algorithm: generating tables. Sort and permute the signatures: for each of Q permutations p1 … pQ, the permuted signatures are sorted into a list, yielding tables 1 … Q (implemented in MapReduce).

Sliding window algorithm: detecting similar pairs. In the Map phase, each of the Q tables is scanned with a fixed-size window, and only signatures falling in the same window are compared.

Sliding window algorithm: example with # tables = 2, window size = 2, # bits = 11. Signatures 1, 2, and 3 are permuted under p1 and p2 and sorted into list 1 and list 2, forming tables 1 and 2 via MapReduce. Within each window, Hamming distances are computed: Distance(2,1) = 5 ✓, Distance(3,1) = 6 ✓, Distance(3,2) = 7 ✗, Distance(2,3) = 7 ✗.
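A minimal sketch of the sliding window search, using the example’s parameters (Q = 2 tables, window size B = 2, 11 bits) and toy signatures constructed so the pairwise distances match the slide; the permutation and threshold handling are simplified assumptions:

```python
import random

def sliding_window_pairs(signatures, Q=2, B=2, max_dist=6, seed=0):
    """For each of Q random bit permutations, sort the permuted signatures
    and compare each one only with the next B-1 neighbors in sorted order,
    keeping pairs whose (full-signature) Hamming distance <= max_dist."""
    rng = random.Random(seed)
    nbits = len(next(iter(signatures.values())))
    pairs = set()
    for _ in range(Q):
        perm = list(range(nbits))
        rng.shuffle(perm)
        order = sorted(signatures,
                       key=lambda d: [signatures[d][i] for i in perm])
        for i, a in enumerate(order):
            for b in order[i + 1:i + B]:
                dist = sum(x != y
                           for x, y in zip(signatures[a], signatures[b]))
                if dist <= max_dist:
                    pairs.add((min(a, b), max(a, b)))
    return pairs

# toy signatures with the slide's distances: d(1,2)=5, d(1,3)=6, d(2,3)=7
sigs = {1: [0] * 11,
        2: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        3: [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]}
print(sliding_window_pairs(sigs))   # a subset of {(1, 2), (1, 3)}
```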

Cross-lingual Pairwise Similarity: two ways to compare a German Doc A with an English Doc B.
MT: translate Doc A into English with MT, build its doc vector vA, and compare with the English doc vector vB.
CLIR: build the German doc vector vA, translate the vector itself into English (CLIR translation), and compare with vB.

MT vs. CLIR for Pairwise Similarity (distributions of similarity scores for clir-neg, clir-pos, mt-neg, and mt-pos pairs): similarity values are low overall, but positive and negative pairs are clearly separated. MT is slightly better than CLIR, but 600 times slower!

Locality-Sensitive Hashing for Pairwise Similarity (recap): Ne English articles → preprocess → Ne English document vectors → signature generation → Ne signatures → sliding window algorithm → similar article pairs.

Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity: Nf German articles are CLIR-translated into English document vectors and merged with the Ne English document vectors (Ne + Nf in total); signature generation and the sliding window algorithm then proceed as in the monolingual case, producing similar article pairs.
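The CLIR translation step above can be sketched as a probabilistic projection of the German document vector into English term space; the translation table below is hypothetical, and the exact weighting in the dissertation may differ:

```python
def clir_translate(doc_vec_f, ttable):
    """Project a source-language doc vector into the target language:
    w(e) = sum over f of Pr(e|f) * w(f)."""
    vec_e = {}
    for f, w in doc_vec_f.items():
        for e, p in ttable.get(f, {}).items():
            vec_e[e] = vec_e.get(e, 0.0) + p * w
    return vec_e

# hypothetical German-to-English translation probabilities and tf-idf weights
ttable = {"urlaub": {"leave": 0.6, "vacation": 0.4},
          "mutter": {"mother": 0.9, "maternal": 0.1}}
print(clir_translate({"urlaub": 2.0, "mutter": 1.5}, ttable))
```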

Evaluation
Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia.
Collection: 3.44m En, … m De Wikipedia articles.
Task: For each German Wikipedia article, find {all English articles s.t. cosine similarity > 0.30}.
Parameters: # bits (D) = 1000; # tables (Q) = …; window size (B) = ….

Scalability

Evaluation: two sources of error.
Ground truth: brute-force approach over document vectors → similar article pairs.
Upper bound: brute-force approach over signatures → similar article pairs (isolates the error introduced by LSH signatures).
Algorithm output: signature generation + sliding window algorithm over the document vectors → similar article pairs (adds the error of the approximate search).

Evaluation: recall vs. cost. Sample operating points: 95% recall at 39% of brute-force cost and 99% recall at 70% cost; 95% recall at 40% cost and 99% recall at 62% cost; 100% recall yields no savings: no free lunch!

Outline
Introduction
Searching to Translate (IR → MT): Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR’11); Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL’12)
Translating to Search (MT → IR): Context-Sensitive Query Translation (Ture et al., SIGIR’12; Ture et al., COLING’12; Ture and Lin, SIGIR’13)
Conclusions

Phase 2: Extracting Parallel Text
Approach: 1. generate candidate sentence pairs from each document pair; 2. classify each candidate as ‘parallel’ or ‘not parallel’.
Challenge: tens of millions of doc pairs ≈ hundreds of billions of sentence pairs.
Solution: 2-step classification approach: 1. a simple classifier efficiently filters out irrelevant pairs; 2. a complex classifier effectively classifies the remaining pairs.

Parallel Text (Bitext) Classifier. Features:
cosine similarity of the two sentences;
sentence length ratio: the ratio of the lengths of the two sentences;
word translation ratio: the fraction of words in the source (target) sentence with a translation in the target (source) sentence.
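A minimal sketch of how these three features could be computed for one candidate sentence pair; the lexicon format and vector representation are assumptions, not the classifier’s actual implementation:

```python
from math import sqrt

def bitext_features(src_tokens, tgt_tokens, src_vec, tgt_vec, lexicon):
    """Compute the slide's three features for one candidate sentence pair.
    src_vec/tgt_vec: tf-idf dicts in a shared (translated) term space;
    lexicon: source word -> set of plausible target-word translations."""
    dot = sum(w * tgt_vec.get(t, 0.0) for t, w in src_vec.items())
    norm = sqrt(sum(w * w for w in src_vec.values())) * \
           sqrt(sum(w * w for w in tgt_vec.values()))
    cosine = dot / norm if norm else 0.0
    length_ratio = len(src_tokens) / max(len(tgt_tokens), 1)
    tgt_set = set(tgt_tokens)
    translated = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set)
    translation_ratio = translated / max(len(src_tokens), 1)
    return cosine, length_ratio, translation_ratio
```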

Bitext Extraction Algorithm (MapReduce)
Map: for each cross-lingual document pair, run sentence detection and compute tf-idf sentence vectors for the source and target documents; the Cartesian product of their sentences yields candidate sentence pairs.
Reduce: simple classification filters the candidates, and complex classification labels the survivors, producing bitext S1 and bitext S2.
Running times: candidate generation 2.4 hours, shuffle & sort 1.3 hours, simple classification 4.1 hours, complex classification 0.5 hours; the number of pairs shrinks across stages from 400 billion to 214 billion to 132 billion.

Extracting Bitext from Wikipedia

Size                    English    German     Spanish    Chinese     Arabic    Czech      Turkish
Documents               4.0m       1.42m      0.99m      0.59m       0.25m     0.26m      0.23m
Similar doc pairs       -          35.9m      51.5m      14.8m       5.4m      9.1m       17.1m
Sentences               ~90m       42.3m      19.9m      5.5m        2.6m      5.1m       3.5m
Candidate sent. pairs   -          530b       356b       62b         48b       101b       142b
S1                      -          292m       178m       63m         7m        203m       69m
S2                      -          …          …          …           …         …          …
Baseline training data  -          2.1m       …          303k        3.4m      0.78m      53k
Dev/Test set            -          WMT-11/12  …          NIST-06/08  …         WMT-11/12  held-out
Baseline BLEU           -          …          …          …           …         …          …

Evaluation on MT

Conclusions (Part I)
Summary: a scalable approach to extract parallel text from a comparable corpus; improvements over a state-of-the-art MT baseline; a general algorithm applicable to any data format.
Future work: domain adaptation; experimenting with larger web collections.

Outline
Introduction
Searching to Translate (IR → MT): Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR’11); Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL’12)
Translating to Search (MT → IR): Context-Sensitive Query Translation (Ture et al., SIGIR’12; Ture et al., COLING’12; Ture and Lin, SIGIR’13)
Conclusions

Cross-Language Information Retrieval
Information Retrieval (IR): given an information need (a query), find relevant material (a ranked list of documents).
Cross-language IR (CLIR): query and documents are in different languages.
“Why does China want to import technology to build Maglev Railway?” ➡ relevant information in Chinese documents.
“Maternal Leave in Europe” ➡ relevant information in French, Spanish, German, etc.

Machine Translation for CLIR
Inside a statistical MT system: a token aligner computes token alignments from a sentence-aligned parallel corpus, which yield token translation probabilities; a grammar extractor builds the translation grammar; the decoder, guided by a language model, translates the query “maternal leave in Europe” into n-best translations and the 1-best translation “congé de maternité en Europe”.

Token-based CLIR
Token translation probabilities are estimated from token alignments in parallel text, e.g.: “… most leave their children in …” / “… la plupart laisse leurs enfants …” and “… aim of extending maternity leave to …” / “… l’objectif de l’extension des congé de maternité à …”. The token translation formula aggregates such alignment links into token-based probabilities, e.g., Pr(laisse | leave) and Pr(congé | leave) in proportion to how often each pair is aligned.

Token-based CLIR
Query: “Maternal leave in Europe”. The token “leave” translates, independently of context, to:
1. laisser (Eng. forget) 49%
2. congé (Eng. time off) 17%
3. quitter (Eng. quit) 9%
4. partir (Eng. disappear) 7%
…

Document Retrieval
How to score a document, given a query? Query q1: “maternal leave in Europe”, with “maternal” translating to [maternité : 0.74, maternel : 0.26]. Scoring document d1 then uses tf(maternité), tf(maternel), df(maternité), df(maternel), …
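The projected statistics on the slide follow the pattern of probabilistic structured queries (Darwish and Oard, 2003): the tf and df of an English query term are computed from the tf and df of its translations, weighted by the translation probabilities. A reconstruction, under that assumption:

```latex
% Probabilistic structured queries (Darwish & Oard, 2003):
tf(e; d) = \sum_{f} \Pr(f \mid e)\, tf(f; d)
\qquad
df(e) = \sum_{f} \Pr(f \mid e)\, df(f)
% e.g., tf(maternal; d_1) = 0.74 \cdot tf(maternité; d_1)
%                         + 0.26 \cdot tf(maternel; d_1)
```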

Context-Sensitive CLIR
This talk: MT for context-sensitive CLIR. In the context of the query “Maternal leave in Europe”, the distribution for “leave” shifts: laisser 49% → 12%, congé 17% → 70%, quitter 9% → 6%, partir 7% → 5%.

Previous approaches: token-based CLIR, or MT as a black box (keeping only the 1-best translation).
Our approach: looking inside the box, i.e., using the token alignments, the translation grammar, and the n-best derivations that the statistical MT system produces along the way.

MT for Context-Sensitive CLIR
The MT pipeline, revisited: sentence-aligned parallel corpus → token aligner → token alignments → token translation probabilities; grammar extractor → translation grammar; decoder (with language model) → n-best derivations and the 1-best translation “congé de maternité en Europe” for the query “maternal leave in Europe”. Each of these intermediate products is a source of translation evidence.

CLIR from translation grammar
A synchronous context-free grammar (SCFG) [Chiang, 2007] contains rules such as:
S → [X : X], 1.0
X → [X1 leave in europe : congé de X1 en europe], 0.9
X → [maternal : maternité], 0.9
X → [X1 leave : congé de X1], 0.74
X → [leave : congé], 0.17
X → [leave : laisser], …
The query is translated by a synchronous hierarchical derivation (S1 → X1 X2; “maternal” + “leave in Europe” ↔ “maternité” + “congé de … en Europe”), and the token translation formula collects grammar-based probabilities from the rules used.
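A sketch of how grammar-based token probabilities might be collected from the rules used to translate a query: sum rule scores over word-aligned pairs inside each rule, then normalize per English token. The rule representation and scores here are illustrative assumptions, not the dissertation’s exact estimator:

```python
from collections import defaultdict

def grammar_probs(rules):
    """rules: (src_words, tgt_words, score, alignment) tuples, where
    alignment lists (src_index, tgt_index) word links inside the rule.
    Returns Pr_SCFG(f|e) estimates per source-language token e."""
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt, score, align in rules:
        for i, j in align:
            counts[src[i]][tgt[j]] += score
    return {e: {f: c / sum(fs.values()) for f, c in fs.items()}
            for e, fs in counts.items()}

# hypothetical rule scores (the slide's grammar, simplified)
rules = [(["leave"], ["congé"], 0.17, [(0, 0)]),
         (["leave"], ["laisser"], 0.04, [(0, 0)]),
         (["maternal"], ["maternité"], 0.9, [(0, 0)])]
print(grammar_probs(rules))
```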


CLIR from n-best derivations
The decoder’s k-th best derivation t(k) carries a score, score(t(k) | s); e.g., t(1) with score 0.8 and t(2) with score 0.11 (the slide shows the two derivation trees for “maternal leave in Europe”). The token translation formula computes translation-based probabilities by aggregating token translations across the n-best derivations, weighted by these scores.
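A hedged reconstruction of the translation-based estimate, normalizing decoder scores over the n-best list (the exact normalization in the dissertation may differ):

```latex
\Pr_{\mathrm{nbest}}(f \mid e) =
\frac{\sum_{k=1}^{n} \mathrm{score}(t^{(k)} \mid s)\,
      \mathbf{1}\!\left[\,e \text{ aligned to } f \text{ in } t^{(k)}\right]}
     {\sum_{k=1}^{n} \mathrm{score}(t^{(k)} \mid s)}
```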

MT for Context-Sensitive CLIR: summary of evidence along the MT pipeline. From the sentence-aligned bitext, token alignments give Pr_token (token-based: ambiguity preserved, but no context sensitivity); the translation grammar gives Pr_SCFG (grammar-based); the n-best derivations give Pr_nbest (translation-based); and the 1-best translation is the standard MT output (context-sensitive, but ambiguity is discarded).

Combining Evidence
For best results, we compute an interpolated probability distribution. For “leave”:
Pr_token: laisser 0.72, congé 0.10, quitter 0.09, …
Pr_SCFG: laisser 0.14, congé 0.70, quitter 0.06, …
Pr_nbest: laisser 0.09, congé 0.90, quitter 0.11, …
With weights 35% / 40% / 25%, Pr_interp: laisser 0.33, congé 0.54, quitter 0.08, …
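The interpolation itself is a per-term linear mixture; this sketch reproduces the slide’s numbers (weights 35%/40%/25%):

```python
def interpolate(dists, weights):
    """Pr_interp(f|e) = sum_i lambda_i * Pr_i(f|e), per target term f."""
    interp = {}
    for dist, lam in zip(dists, weights):
        for f, p in dist.items():
            interp[f] = interp.get(f, 0.0) + lam * p
    return interp

pr_token = {"laisser": 0.72, "congé": 0.10, "quitter": 0.09}
pr_scfg  = {"laisser": 0.14, "congé": 0.70, "quitter": 0.06}
pr_nbest = {"laisser": 0.09, "congé": 0.90, "quitter": 0.11}
print(interpolate([pr_token, pr_scfg, pr_nbest], [0.35, 0.40, 0.25]))
# ~ {'laisser': 0.33, 'congé': 0.54, 'quitter': 0.08}, as on the slide
```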

Combining Evidence
With λ_token = 100% and the other weights at 0%, Pr_interp reduces to Pr_token: laisser 0.72, congé 0.10, quitter 0.09, …

Combining Evidence
For best results, we compute an interpolated probability distribution: Pr_interp(f | e) = λ1 Pr_token(f | e) + λ2 Pr_SCFG(f | e) + λ3 Pr_nbest(f | e), with the λi summing to 1.

Experiments
Three tasks:
1. TREC 2002 English-Arabic CLIR task: 50 English queries and 383,872 Arabic documents.
2. NTCIR-8 English-Chinese ACLIA task: 73 English queries and 388,859 Chinese documents.
3. CLEF 2006 English-French CLIR task: 50 English queries and 177,452 French documents.
Implementation: the cdec MT system [Dyer et al., 2010], using Hiero-style grammars and GIZA++ for token alignments.

Comparison of Models (charts for English-Arabic TREC 2002, English-Chinese NTCIR-8, and English-French CLEF 2006, each comparing token-based, grammar-based, translation-based (10-best), 1-best MT, and the best interpolation).

Comparison of Models: Overview

Comparison of Models
Interpolated is significantly better than token-based and 1-best in all three cases.

Conclusions (Part II)
Summary: a novel framework for context-sensitive and ambiguity-preserving CLIR; interpolation of the proposed models works best; significant improvements in MAP on three tasks.
Future work: robust parameter optimization; document vs. query translation with MT.

Contributions
Bitext extraction from comparable corpora, driven by a token-based CLIR translation model, adds extracted bitext to the baseline bitext of the MT pipeline, giving higher BLEU for 5 language pairs. Bringing the MT translation model into the CLIR translation model gives context-sensitive CLIR, with higher MAP for 3 language pairs. Closing the loop, the improved CLIR translation model extracts more bitext, giving higher BLEU after an additional iteration.

Contributions
LSH-based MapReduce approach to pairwise similarity
Exploration of the parameter space for the sliding window algorithm
MapReduce algorithm to generate candidate sentence pairs
2-step classification approach to bitext extraction: bitext from Wikipedia improves over a state-of-the-art MT baseline
Set of techniques for context-sensitive CLIR using MT: combination of evidence works best
Framework for better integration of MT and IR
Bootstrapping approach to show feasibility
All code and data released as part of the Ivory project

Thank you!

Cross-lingual Pairwise Similarity
In a comparable corpus, find similar document pairs that are in different languages.
Challenge: loss of information during translation.
Goals: a first step for parallel sentence extraction; contributing to multi-lingual collections such as Wikipedia.

MapReduce
Map: (k, v) ↦ [(k′, v′)]; Reduce: (k′, [v′]) ↦ [(k″, v″)].
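A self-contained word-count sketch of the abstraction (not Hadoop code), with the shuffle-and-sort stage written out explicitly:

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc_id, text):
    """Map: (k, v) -> [(k', v')]"""
    return [(term, 1) for term in text.split()]

def reduce_fn(term, counts):
    """Reduce: (k', [v']) -> [(k'', v'')]"""
    return [(term, sum(counts))]

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)              # shuffle & sort: group by key
    for k, v in chain.from_iterable(map_fn(*r) for r in records):
        groups[k].append(v)
    return list(chain.from_iterable(
        reduce_fn(k, vs) for k, vs in sorted(groups.items())))

print(run_mapreduce([(1, "to be or not to be")], map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```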

Selecting Sentence Pairs (chart: BLEU vs. training data size in millions, from the WMT10 training data alone up to WMT10 plus sampled bitext, comparing random sampling with thresholds on the simple and complex classifier scores, e.g., simple > 0.98).
Conclusions: with the same amount of data, S2 outperforms S1; more data sampled => higher BLEU, up to a certain point.

Bitext Extraction Approaches (chart): S1 yields noisy data; the 2-step approach beats the 1-step approach consistently; more data yields higher BLEU.

Evaluation on MT

Analytical Model (Ture et al., SIGIR’11)
We derived an analytical model of our algorithm: it is based on a deterministic approximation, gives a formula to estimate recall given the parameters, and supports trade-off analysis without running any experiments.


Contributing to Wikipedia (Ture et al., SIGIR’11)
Identify links between German and English Wikipedia articles:
“Metadaten” → “Metadata”, “Semantic Web”, “File Format”
“Pierre Curie” → “Marie Curie”, “Pierre Curie”, “Helene Langevin-Joliot”
“Kirgisistan” → “Kyrgyzstan”, “Tulip Revolution”, “2010 Kyrgyzstani uprising”, “2010 South Kyrgyzstan riots”, “Uzbekistan”
Worse performance when: there is a significant difference in length (e.g., articles specific to Germany); articles are highly technical (e.g., chemical elements).

Cross-Language IR: Overview (document translation)
The English collection is produced from the French collection by CLIR translation and then indexed; an English query goes to the retrieval engine over the index of the English collection, producing a ranked list of documents, which is evaluated against relevance judgments.

Cross-Language IR: Overview (query translation: this talk)
The English query is translated into a French query by CLIR translation; the French collection is indexed directly, and the retrieval engine over the index of the French collection returns a ranked list of documents, evaluated against relevance judgments.

Locality-Sensitive Hashing for Pairwise Similarity: signature schemes.
Simhash: each bit is determined by an average of term hash values.
MinHash: order terms by hash and pick the K terms with minimum hash.
Random projections (RP): each bit is determined by the inner product between a random unit vector and the doc vector.

Future Work
Extracting parallel text from the web (Ture et al., NAACL’12)
Domain adaptation in MT: use CLIR/IR and Phase 1 to find in-domain parallel text
Language context in NLP: context-sensitive vs. token-based CLIR; document- vs. sentence-level MT
Representing phrases in CLIR: the ad-hoc approach does not improve results
Document vs. query translation in CLIR

Comparison of Variants
MT approach, flat vs. hierarchical: hierarchical > flat for grammar-based CLIR; hierarchical ~ flat for translation-based CLIR; interpolated > token-based with either (except flat-Arabic); interpolated > 1-best with either (except flat-French).
Heuristic for source tokens aligned to many target tokens: no clear winner (mostly one-to-many).
We therefore focus on hierarchical MT and the one-to-many heuristic.

Comparison of Models
Interpolated is significantly better than token-based in all three cases, and better than 1-best and 10-best for Arabic and Chinese.

Efficiency vs. Effectiveness

Experiments: finding good λ1 and λ2
Learn from different queries in the same collection? ✔
Learn from a different collection in a different language? ✗
Learn from a different collection in the same language? ?
Change in training resources? ?

CLIR from translation grammar (flat)
A phrase table, or “flat grammar” [Koehn et al., 2003], contains rules such as:
[maternal : maternité], 0.9
[in europe : en europe], 0.74
[maternity leave : congé de maternité], 0.74
[leave : congé], 0.17
[leave : laisser], …
[paternity leave : congé de paternité], 0.70
A flat derivation translates “maternal leave in Europe” into “congé de maternité en Europe”, and the token translation formula collects grammar-based probabilities (flat) from the rules used, e.g., [leave : congé], 0.69.

Query-level analysis:
10-best > Token, Grammar, 1-best: “NBA labor conflict” → “NBA travail” (plus “social” under Token/Grammar); “confl, conflit” (plus “contradict” under Token/Grammar).
Grammar > 10-best: “Centenary celebrations” → “centenaire” (plus “celebration” under Grammar).
Token > Grammar > 10-best: “Theft of ‘The Scream’” → “vol de ‘Scream’” vs. “vol du ‘Cri’”.

Also…
Analytical model to estimate algorithm recall
Error analysis of the bitext output
Comparison of bitext extraction approaches
Also…
Flat vs. hierarchical MT (cdec vs. Moses) for CLIR
Efficiency analysis of CLIR models