1 The Web as a Parallel Corpus
- Parallel corpora are useful
  - Training data for statistical MT
  - Lexical correspondences for cross-lingual IR
- Early work: Hansards
  - Canadian parliamentary proceedings
  - French/English only
- Most resources are still in formal newspaper style only

2 Harvesting parallel text from the web
- STRAND: use similar structure to find likely translations
- Use similar content to find translations
- Apply these methods to the Internet Archive, dramatically increasing quantity

3 STRAND
- Structural Translation Recognition for Acquiring Natural Data
- Architecture
  - Location of possible translations
  - Generation of candidate translations
  - Filtering of candidates based on structure

4
- Search for language names in anchor text (anchor: “English” OR anchor: “French”)
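
The slide describes a search-engine query on anchor text. As a rough local analogue, the sketch below scans a single HTML page for links whose anchor text names a language. The names `LANGUAGE_NAMES`, `AnchorLanguageFinder`, and `find_language_links` are hypothetical and not part of STRAND; this is only one way the idea could be illustrated.

```python
from html.parser import HTMLParser

# Assumption: the language names of interest, lowercased for matching.
LANGUAGE_NAMES = {"english", "french"}


class AnchorLanguageFinder(HTMLParser):
    """Collects (anchor text, href) for links whose text names a language."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if any(name in text.lower() for name in LANGUAGE_NAMES):
                self.hits.append((text, self._href))
            self._href = None


def find_language_links(html):
    finder = AnchorLanguageFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

A page that links to "French" from its English version (or the reverse) is a plausible parent of a translation pair, which is why anchor text is a cheap first signal.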

5 Structural Filtering
- Linearize HTML and discard content
- Run through transducer to produce:
  - [START element-label]
  - [END element-label]
  - [CHUNK length]
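
A minimal sketch of the linearization step, using Python's standard `html.parser`. It emits the three token types listed on the slide as tuples. Measuring chunk length in characters after stripping surrounding whitespace is an assumption made here, and `Linearizer` and `linearize` are hypothetical names.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Turns an HTML page into a flat list of structural tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("START", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("END", tag))

    def handle_data(self, data):
        # Discard the text itself; keep only its length (assumption: characters).
        length = len(data.strip())
        if length:
            self.tokens.append(("CHUNK", length))


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For example, `linearize("<p>Hello <b>world</b></p>")` yields a START for `p`, a chunk of length 5, a START/CHUNK/END group for `b`, and an END for `p`.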

6
- Align sequences using dynamic programming
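
The alignment can be sketched as a longest-common-subsequence style dynamic program over two token sequences of the form ("START", label), ("END", label), ("CHUNK", length). The matching rule below (markup tokens match on type and label; chunks match any chunk regardless of length) is an assumption, and the slide does not say which alignment algorithm STRAND itself uses.

```python
def tokens_match(a, b):
    """Markup must agree on type and label; any two chunks may align."""
    if a[0] != b[0]:
        return False
    return a[0] == "CHUNK" or a[1] == b[1]


def align(left, right):
    """Return aligned pairs; None marks a token with no counterpart."""
    n, m = len(left), len(right)
    # score[i][j] = max number of matched tokens between left[i:] and right[j:]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if tokens_match(left[i], right[j]):
                score[i][j] = score[i + 1][j + 1] + 1
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])

    # Trace back through the table to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if tokens_match(left[i], right[j]):
            pairs.append((left[i], right[j]))
            i, j = i + 1, j + 1
        elif score[i + 1][j] >= score[i][j + 1]:
            pairs.append((left[i], None))
            i += 1
        else:
            pairs.append((None, right[j]))
            j += 1
    pairs.extend((tok, None) for tok in left[i:])
    pairs.extend((None, tok) for tok in right[j:])
    return pairs
```

Tokens paired with `None` are the unmatched structural items; matched chunk pairs carry the two lengths that the later scores compare.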

7 Scalar values
- dp: difference in # structural items that have no match
- n: number of aligned non-markup chunks of different lengths
- r: correlation of chunk lengths
- p: significance level of the correlation
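
A sketch of how three of these values could be computed from an alignment given as a list of token pairs, where `None` marks an unmatched token. Reading dp as the percentage of alignment slots with no match is an assumption, since the slide wording is terse. The significance level p is left out because it needs a t-distribution, which the standard library does not provide. `structural_scores` and `pearson` are hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def structural_scores(pairs):
    """pairs: list of (left_token, right_token); None marks no match."""
    unmatched = sum(1 for a, b in pairs if a is None or b is None)
    chunk_lengths = [
        (a[1], b[1])
        for a, b in pairs
        if a is not None and b is not None and a[0] == "CHUNK"
    ]
    # Assumption: dp as a percentage of all alignment slots.
    dp = 100.0 * unmatched / len(pairs) if pairs else 0.0
    n = sum(1 for x, y in chunk_lengths if x != y)
    r = pearson([x for x, _ in chunk_lengths], [y for _, y in chunk_lengths])
    return {"dp": dp, "n": n, "r": r}
```

Intuitively, true translations should show a low dp and a high r, since the same markup wraps text chunks whose lengths grow and shrink together.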

8 Evaluation
- Human judgments on 326 English-French paired pages
- Using manually set thresholds on dp and n
  - 100% precision
  - 68.6% recall
- Similar results on English/Chinese and English/Spanish
- Typically throws out 1/3 of the data
- Using machine learning
  - Recall: 84%
  - Precision: 96%
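
To make the evaluation setup concrete, here is a sketch of a threshold filter on dp and n scored against human judgments. The threshold parameters `max_dp` and `max_n`, the direction of the comparison (smaller values are taken to indicate a translation pair), and the function names are all assumptions; the slide does not give the actual threshold values.

```python
def passes_thresholds(scores, max_dp, max_n):
    """Assumption: a pair is accepted when both dp and n are small enough."""
    return scores["dp"] <= max_dp and scores["n"] <= max_n


def precision_recall(predicted, gold):
    """predicted, gold: parallel lists of booleans (True = translation pair)."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Strict thresholds accept only very clean pairs, which is consistent with the pattern on the slide: precision stays high while recall drops because some true translations are discarded.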

9 Drawbacks of structural matching
- Not all translations have similar structures
- Not all texts use HTML markup

10 Content-based matching
- Seed: bilingual lexicon
- Link: a pair (x, y) where x is in L1 and y is in L2
- The probability that x is a translation of y is given by the bilingual lexicon
- Want the most probable link sequence that could account for a pair of texts
  - Product of the probabilities of the links
- Best set of links found using Maximum Weighted Bipartite Matching
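
The sketch below illustrates the objective on a toy example: choose the one-to-one set of links whose product of lexicon probabilities is largest. It uses brute force over permutations, which is exact but only feasible for a handful of words; it is not an efficient maximum weighted bipartite matching algorithm. The `null_prob` penalty for a word left unlinked is a hypothetical parameter, as are the function name and the lexicon format (a dict from word pairs to probabilities).

```python
import itertools
import math


def best_links(words_l1, words_l2, lexicon, null_prob=1e-4):
    """Exhaustively find the most probable one-to-one set of links.

    lexicon maps (x, y) to the probability that x translates as y.
    Pairings not in the lexicon count as unlinked and cost null_prob.
    """
    size = max(len(words_l1), len(words_l2))
    # Pad the shorter side with None so every word gets a slot.
    left = list(words_l1) + [None] * (size - len(words_l1))
    right = list(words_l2) + [None] * (size - len(words_l2))

    best_score, best = -math.inf, []
    for perm in itertools.permutations(right):
        score, links = 0.0, []
        for x, y in zip(left, perm):
            prob = lexicon.get((x, y))
            if prob is None:
                score += math.log(null_prob)
            else:
                # Summing logs is equivalent to multiplying probabilities.
                score += math.log(prob)
                links.append((x, y))
        if score > best_score:
            best_score, best = score, links
    return best, best_score
```

Working in log space turns the product of link probabilities into a sum of weights, which is what lets the problem be posed as a weighted matching in the first place.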

11
- Cross-language similarity score: tsim
- Computed on the first 500 words of a document for efficiency
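
The formula for tsim did not survive in this transcript. The sketch below assumes it is the fraction of links in the best matching that connect two words, with every unmatched word counted as a link to NULL. It also assumes all lexicon pairs are equally weighted, so a maximum cardinality matching (found with simple augmenting paths) stands in for the weighted one. Tokenizing by whitespace, the set-of-pairs lexicon format, and the function names are hypothetical choices.

```python
def max_matching(words_l1, words_l2, lexicon):
    """Size of a maximum one-to-one matching between lexicon-linked words."""
    # For each word on the left, the positions on the right it may link to.
    candidates = [
        [j for j, y in enumerate(words_l2) if (x, y) in lexicon]
        for x in words_l1
    ]
    match_of_right = {}  # position in words_l2 -> position in words_l1

    def try_link(i, seen):
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take j if it is free, or if its current partner can move.
            if j not in match_of_right or try_link(match_of_right[j], seen):
                match_of_right[j] = i
                return True
        return False

    return sum(1 for i in range(len(words_l1)) if try_link(i, set()))


def tsim(text_l1, text_l2, lexicon, max_words=500):
    """Assumed definition: two-word links / all links (including NULL links)."""
    words_l1 = text_l1.lower().split()[:max_words]
    words_l2 = text_l2.lower().split()[:max_words]
    linked = max_matching(words_l1, words_l2, lexicon)
    total = len(words_l1) + len(words_l2) - linked
    return linked / total if total else 0.0
```

Truncating to the first 500 words bounds the matching cost per document pair, which matters when many candidate pairs have to be scored.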

12 Experiment
- Dictionary
  - English/French dictionary: 34,808 entries
  - Dictionary of English/French cognates: 35,513 pairs
  - Additional web pairs: 11,264 from Bible
  - Final lexicon: 132,155 pairs
- Trained threshold for tsim on 32 pairs from STRAND test set
- STRAND (manual): F-measure of .81
- tsim: F-measure of .88
- Combined model: F-measure of .977

