Identifying Translations Philip Resnik, Noah Smith University of Maryland.


1 Identifying Translations Philip Resnik, Noah Smith University of Maryland

2 Reasons to identify translations
Locating parallel text on the Web
Filtering out poor-quality translations
Cross-language duplicate detection/caching

3 Identifying translations using structure
STRAND (Resnik, 1999)
Comparison        N     %      κ
J1, J2            267   0.98   0.95
J1, STRAND        273   0.88   0.70
J2, STRAND        315   0.88   0.69
J1 ∩ J2, STRAND   261   0.90   0.75

4 Related Work
Web mining for parallel text (Nie et al. 1999)
Sentence alignment (Fluhr et al. 2000)
Duplicate detection (e.g. Broder et al. 1997)

5 Translational Equivalence as a Function over Sets
Broder et al. (1997): document representation as a set of “shingles” S(D)
$r(D_1, D_2) = \frac{|S(D_1) \cap S(D_2)|}{|S(D_1) \cup S(D_2)|}$
Cross-language generalization: partial equality $e =_t f$ with confidence value $t(e,f)$, used to define $\cap_t$ and $\cup_t$
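
The monolingual resemblance measure on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-level tokenization, the shingle width `w=3`, and the function names `shingles` and `resemblance` are assumptions made for the example. It covers only exact-match set operations, not the cross-language generalization with t(e,f).

```python
def shingles(text, w=3):
    """Return the set of contiguous w-word shingles in text (S(D) on the slide)."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Assumption: a document shorter than w words yields one short shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(d1, d2, w=3):
    """Size of the shingle intersection divided by the size of the union."""
    s1, s2 = shingles(d1, w), shingles(d2, w)
    union = s1 | s2
    if not union:
        return 0.0  # both documents empty: define resemblance as 0
    return len(s1 & s2) / len(union)


if __name__ == "__main__":
    print(resemblance("a b c d", "a b c e"))
```

Here the two four-word documents share one of three distinct shingles, so the score is 1/3.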

6 Ways of computing equivalence
Bilingual dictionaries
  – t(e,f) = 1 if (e,f) present in dictionary, 0 otherwise
Translation model (Melamed 2000, model A)
  – t(e,f) = Pr(e,f)
String similarity for cognates
  – t(e,f) = Longest common substring ratio (LCSR) variant
  – Trained on non-zero entries in translation model
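
Two of these scoring functions are simple enough to sketch. The names `t_dictionary`, `lcs_length`, and `t_lcsr` are hypothetical. The string-similarity function below assumes the commonly used definition of LCSR (length of the longest common subsequence divided by the length of the longer string); it is an untrained baseline and does not reproduce the trained variant mentioned on the slide.

```python
def t_dictionary(e, f, dictionary):
    """Return 1.0 if the pair (e, f) is in the bilingual dictionary, else 0.0."""
    return 1.0 if (e, f) in dictionary else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def t_lcsr(e, f):
    """Untrained LCSR baseline: LCS length over the longer string's length."""
    longest = max(len(e), len(f))
    if longest == 0:
        return 0.0
    return lcs_length(e, f) / longest


if __name__ == "__main__":
    toy_dictionary = {("dog", "chien")}  # toy data for illustration only
    print(t_dictionary("dog", "chien", toy_dictionary))
    print(t_lcsr("government", "gouvernement"))
```

For the cognate pair in the usage example, all ten letters of "government" appear in order inside the twelve-letter "gouvernement", giving a score of 10/12.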

7 Evaluation task
Given segmented corpus C1 in L1, C2 in L2
  – Assume each segment has 0 or 1 translation equivalents
  – Match up the equivalents
Equivalent to maximum bipartite matching problem
  – Exhaustive solution available for small sets
  – Approximated using competitive linking (Melamed)
True equivalence pairs give precision/recall curve
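
The greedy approximation can be sketched as below, assuming competitive linking means: rank all candidate pairs by score, then repeatedly accept the best remaining pair whose two segments are both still unlinked. The function name `competitive_linking`, the score dictionary format, and the toy scores are assumptions for illustration.

```python
def competitive_linking(scores):
    """Greedy one-to-one matching.

    scores maps (segment_in_C1, segment_in_C2) -> similarity score.
    Returns accepted (segment_in_C1, segment_in_C2, score) links, best first.
    """
    used_left, used_right = set(), set()
    links = []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (left, right), score in ranked:
        if score <= 0:
            break  # remaining pairs carry no evidence of equivalence
        if left in used_left or right in used_right:
            continue  # each segment has 0 or 1 translation equivalents
        links.append((left, right, score))
        used_left.add(left)
        used_right.add(right)
    return links


if __name__ == "__main__":
    toy_scores = {
        ("e1", "f1"): 0.9,
        ("e1", "f2"): 0.8,
        ("e2", "f1"): 0.7,
        ("e2", "f2"): 0.1,
    }
    print(competitive_linking(toy_scores))
```

On the toy scores the greedy procedure links e1 with f1 and e2 with f2 (total 1.0), while the best possible matching pairs e1 with f2 and e2 with f1 (total 1.5), which shows why it is only an approximation. Sweeping a score threshold over the returned links and comparing against the true pairs is one way to trace out a precision/recall curve.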

8 Some results: sentence matching
Task corpora:
  – Chinese-English: Hong Kong Laws sentences (5622 training sentences, 191 test sentences)
  – Spanish-English: U.N. Parallel Corpus (4695 training sentences, 200 test sentences)
[Figure panels: English-Chinese, English-Spanish]

9 Some results: document matching
Task corpora:
  – 232 English-French Web documents

10 New directions
Exploiting the Internet Archive
  – 100-200 million pages (4 TB) on disk
  – Exhaustive URL matching within site
  – STRAND now adapted for disk-based access
Combining structure and content
  – Improving document-level matching
  – Selecting good chunks within documents

