1 The Web as a Parallel Corpus
- Parallel corpora are useful
  - Training data for statistical MT
  - Lexical correspondences for cross-lingual IR
- Early work: Hansards
  - Canadian parliamentary proceedings
  - French/English only
- Still, most resources are in formal newspaper style only

2 Harvesting parallel text from the web
- STRAND: use similar structure to find likely translations
- Using similar content to find translations
- Applying both methods to the Internet Archive, dramatically increasing quantity

3 STRAND
- Structural Translation Recognition, Acquiring Natural Data
- Architecture
  - Location of possible translations
  - Generation of candidate translations
  - Filtering of candidates based on structure

4
- Search for language names in anchor text: (anchor:"English" OR anchor:"French")

5 Structural Filtering
- Linearize HTML and discard content
- Run through a transducer to produce:
  - [START element-label]
  - [END element-label]
  - [CHUNK length]
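The linearization step above can be sketched with Python's standard `html.parser` module. This is only an illustration, not the STRAND transducer itself: the class name `Linearizer`, the helper `linearize`, the `(kind, value)` tuple format, and the choice to measure chunk length in characters of whitespace-trimmed text are all assumptions.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Flatten HTML into structural tokens, keeping only the length of text runs."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # [START element-label]
        self.tokens.append(("START", tag))

    def handle_endtag(self, tag):
        # [END element-label]
        self.tokens.append(("END", tag))

    def handle_data(self, data):
        # [CHUNK length]; assumption: length = characters after trimming whitespace
        length = len(data.strip())
        if length:
            self.tokens.append(("CHUNK", length))


def linearize(html):
    """Return the token sequence for one HTML document (hypothetical helper)."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    for token in linearize("<p>Hello <b>world</b></p>"):
        print(token)
```

Because the text itself is discarded, two pages in different languages produce comparable sequences as long as their markup is similar.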

6
- Align the two linearized sequences using dynamic programming
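One way to picture the alignment step is a longest-common-subsequence alignment computed by dynamic programming. The slide does not say which scoring scheme is used, so the sketch below makes its own assumptions: tokens are `(kind, value)` tuples, markup tokens match only when kind and label agree, and any two CHUNK tokens may align regardless of length. The names `tokens_match` and `align` are hypothetical.

```python
def tokens_match(a, b):
    # Assumption: markup must agree on kind and label; any two chunks may align.
    if a[0] != b[0]:
        return False
    return a[0] == "CHUNK" or a[1] == b[1]


def align(xs, ys):
    """Align two token sequences; unmatched tokens are paired with None."""
    n, m = len(xs), len(ys)
    # lcs[i][j] = maximum number of matched tokens between xs[i:] and ys[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if tokens_match(xs[i], ys[j]):
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if tokens_match(xs[i], ys[j]):
            pairs.append((xs[i], ys[j]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            pairs.append((xs[i], None))
            i += 1
        else:
            pairs.append((None, ys[j]))
            j += 1
    pairs.extend((x, None) for x in xs[i:])
    pairs.extend((None, y) for y in ys[j:])
    return pairs


if __name__ == "__main__":
    left = [("START", "p"), ("CHUNK", 5), ("END", "p")]
    right = [("START", "p"), ("CHUNK", 7), ("START", "b"),
             ("CHUNK", 3), ("END", "b"), ("END", "p")]
    for pair in align(left, right):
        print(pair)
```

The table costs O(n·m) time and memory, which is acceptable for single pages but would need a more economical variant for very long documents.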

7 Scalar values
- dp: difference percentage, based on the number of structural items that have no match
- n: number of aligned non-markup chunks whose lengths differ
- r: correlation of the lengths of aligned chunks
- p: significance level of the correlation r
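To make the first three values concrete, here is a rough sketch that computes them from an alignment. The input format (a list of `(a, b)` slots with `None` marking an unmatched token), the function names, and the normalization of dp as the fraction of alignment slots without a partner are assumptions. The significance level p is left out; a library routine such as `scipy.stats.pearsonr` reports a p-value alongside r.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; None when it is undefined (too few points, no variance)."""
    if len(xs) < 2:
        return None
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def structural_features(pairs):
    """Return (dp, n, r) for an alignment given as (a, b) slots."""
    # dp: assumed here to be the share of slots where one side has no match
    unmatched = sum(1 for a, b in pairs if a is None or b is None)
    dp = unmatched / len(pairs) if pairs else 0.0

    # Lengths of aligned text chunks on both sides
    chunk_pairs = [(a[1], b[1]) for a, b in pairs
                   if a is not None and b is not None and a[0] == "CHUNK"]
    # n: aligned chunks whose lengths differ
    n = sum(1 for x, y in chunk_pairs if x != y)
    # r: correlation of aligned chunk lengths
    r = pearson([x for x, _ in chunk_pairs], [y for _, y in chunk_pairs])
    return dp, n, r


if __name__ == "__main__":
    example = [
        (("START", "p"), ("START", "p")),
        (("CHUNK", 10), ("CHUNK", 12)),
        (None, ("START", "b")),
        (("CHUNK", 20), ("CHUNK", 24)),
        (("CHUNK", 30), ("CHUNK", 30)),
        (("END", "p"), ("END", "p")),
    ]
    print(structural_features(example))
```

A low dp together with a high r suggests that the two pages share both markup and proportional text lengths, which is what a translation pair would be expected to show.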

8 Evaluation
- Human judgments on 326 English-French paired pages
- Using manually set thresholds on dp and n
  - 100% precision
  - 68.6% recall
- Similar results on English/Chinese and English/Spanish
- Typically throws out 1/3 of the data
- Using machine learning: recall 84%, precision 96%
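The precision and recall figures above can be folded into a single number with the balanced F-measure, F = 2PR / (P + R). The sketch below assumes the balanced (F1) variant; the slides do not state which weighting is used.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Manually set thresholds: 100% precision, 68.6% recall
    print(round(f_measure(1.0, 0.686), 3))
    # Machine-learned filter: 96% precision, 84% recall
    print(round(f_measure(0.96, 0.84), 3))
```

Under that assumption the manual thresholds give roughly 0.81 and the learned filter roughly 0.90. The first value agrees with the 0.81 quoted for STRAND (manual) on slide 12, though the slides do not say that both figures come from the same evaluation.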

9 Drawbacks of structural matching
- Not all translations have similar structures
- Not all texts use HTML markup

10 Content-based matching
- Seed: bilingual lexicon
- Link: a pair (x, y) with x in L1 and y in L2
- The probability that x is a translation of y is given by the bilingual lexicon
- Want the most probable link sequence that could account for a pair of texts
  - Product of the probabilities of the links
  - Best set of links found using Maximum Weighted Bipartite Matching
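A toy illustration of the matching idea: maximizing a product of link probabilities is the same as maximizing the sum of their logarithms, so the best one-to-one linking is a maximum-weight bipartite matching over log-probabilities. The sketch below finds that matching by exhaustive search, which is only practical for a handful of words; a real system would use a polynomial-time assignment algorithm instead. The function name `best_links`, the dictionary-of-pairs lexicon format, and the small `floor` probability given to unlisted pairs are assumptions.

```python
import math
from itertools import permutations


def best_links(words1, words2, lexicon, floor=1e-6):
    """Return (links, log_prob) for the most probable one-to-one linking.

    lexicon maps (x, y) pairs, x in L1 and y in L2, to translation probabilities.
    Pairs missing from the lexicon get the assumed `floor` probability and are
    dropped from the returned links.
    """
    # Enumerate assignments from the shorter side into the longer side.
    swap = len(words1) > len(words2)
    short, long_ = (words2, words1) if swap else (words1, words2)

    def prob(a, b):
        x, y = (b, a) if swap else (a, b)
        return lexicon.get((x, y), floor)

    best, best_score = [], -math.inf
    for perm in permutations(range(len(long_)), len(short)):
        score = sum(math.log(prob(short[i], long_[j])) for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best = [(short[i], long_[j]) for i, j in enumerate(perm)]

    if swap:
        best = [(b, a) for a, b in best]
    links = [link for link in best if link in lexicon]
    return links, best_score


if __name__ == "__main__":
    toy_lexicon = {("the", "le"): 0.5, ("cat", "chat"): 0.9,
                   ("cat", "le"): 0.05, ("black", "noir"): 0.8}
    print(best_links(["the", "black", "cat"], ["le", "chat", "noir"], toy_lexicon))
```

Word order plays no role here: "black cat" and "chat noir" link correctly even though the adjective and noun are swapped.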

11
- Cross-language similarity score: tsim
- Computed on the first 500 words of each document for efficiency
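The slide does not give the formula for tsim, so the following is a sketch under a stated assumption: tsim is taken to be the number of two-word links in the best matching divided by the total number of links, where a word left without a partner counts as a link to NULL. All lexicon pairs are treated as equally likely, which turns the problem into maximum-cardinality bipartite matching, solved here with a simple augmenting-path search. The function name, the set-of-pairs lexicon format, and the `limit` parameter are hypothetical.

```python
def tsim(doc1, doc2, lexicon, limit=500):
    """Assumed tsim: two-word links / all links, on the first `limit` words.

    doc1 and doc2 are lists of word tokens; lexicon is a set of (x, y) pairs
    with x from the first language and y from the second.
    """
    xs, ys = doc1[:limit], doc2[:limit]
    # For each word position in xs, the positions in ys it may be linked to.
    adjacency = [[j for j, y in enumerate(ys) if (x, y) in lexicon] for x in xs]
    match_of_y = [None] * len(ys)

    def try_link(i, seen):
        # Augmenting-path step: link xs[i], re-linking earlier words if needed.
        for j in adjacency[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_of_y[j] is None or try_link(match_of_y[j], seen):
                match_of_y[j] = i
                return True
        return False

    two_word_links = sum(1 for i in range(len(xs)) if try_link(i, set()))
    # Every unmatched word on either side counts as one link to NULL.
    total_links = len(xs) + len(ys) - two_word_links
    return two_word_links / total_links if total_links else 0.0


if __name__ == "__main__":
    toy_lexicon = {("cat", "chat"), ("black", "noir"), ("the", "le"), ("the", "la")}
    english = ["the", "black", "cat", "sleeps"]
    french = ["le", "chat", "noir", "dort"]
    print(tsim(english, french, toy_lexicon))
```

Truncating both documents to the first 500 words bounds the size of the matching problem, which is the efficiency point made above.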

12 Experiment
- Dictionary
  - English/French dictionary: 34,808 entries
  - Dictionary of English/French cognates: 35,513 pairs
  - Additional web pairs: 11,264 from the Bible
  - Final lexicon: 132,155 pairs
- Trained threshold for tsim on 32 pairs from the STRAND test set
- STRAND (manual): F-measure of 0.81
- tsim: F-measure of 0.88
- Combined model: F-measure of 0.977
