Experiments on Processing Overlapping Parallel Corpora

1 Experiments on Processing Overlapping Parallel Corpora
Mark Fishel and Heiki-Jaan Kaalep, University of Tartu

2 Outline
Parallel corpora containing overlapping parts
A method for processing these
Some experiments on JRC-Acquis (Estonian, Latvian, English)

3 Overlapping parallel corpora
Hunglish and OPUS: Hu-En subtitles
Hunglish and JRC-Acquis: Hu-En legislation texts
Univ. of Tartu corpus and JRC-Acquis: Et-En legislation texts
JRC-Acquis Vanilla and HunAlign: legislation texts

4 Overlapping parallel corpora
Additional troubles for handling: source version differences, encoding differences, format differences
But also potential benefits: detect alignment errors, raise corpora quality, increase segmentation depth

5 ParAlign – the method
A method of finding and matching corresponding corpora parts
Enables: combining corpora, detecting potential error spots, increasing alignment depth, evaluating and improving alignment quality

6 Method based on finding corpora correspondence:

7 Aligning the corresponding language parts:

8 Aligning the corresponding language parts:
Edit distance over the corpora documents: comparing N to M sentences, matching weight = approx. sentence matching
Approximate sentence matching (modified edit distance):
  replacing the same letter in a different case: free
  inserting/replacing a number: infinitely costly
  replacing punctuation: cheap
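The approximate sentence matching rules on this slide can be sketched as a character-level edit distance with custom costs. This is only an illustration, not the ParAlign code: the function names, the value 0.1 for "cheap", the unit cost for ordinary edits, and the choice to make deleting a digit as costly as inserting one are all assumptions.

```python
import string

CHEAP = 0.1         # assumed value for "cheap"; the slide gives no number
INF = float("inf")  # "infinitely costly"


def sub_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b or a.lower() == b.lower():
        return 0.0   # same letter, possibly in a different case: free
    if a.isdigit() or b.isdigit():
        return INF   # numbers must not be replaced
    if a in string.punctuation and b in string.punctuation:
        return CHEAP  # punctuation replaced by other punctuation
    return 1.0       # assumed default cost for any other replacement


def indel_cost(c: str) -> float:
    """Cost of inserting or deleting c (deletion symmetry is an assumption)."""
    return INF if c.isdigit() else 1.0


def approx_distance(s: str, t: str) -> float:
    """Weighted edit distance between two sentences, row by row."""
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + indel_cost(ch))
    for a in s:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + indel_cost(a),       # delete a from s
                cur[j - 1] + indel_cost(b),    # insert b into s
                prev[j - 1] + sub_cost(a, b),  # replace a with b
            ))
        prev = cur
    return prev[-1]
```

With these assumed costs, two sentences that differ only in letter case get distance 0, while sentences that differ in a digit can never be matched, which fits legislation texts where article and paragraph numbers carry meaning.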

9 Aligning the language alignments:
Levenshtein distance
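A minimal sketch of this step, assuming each alignment is represented as a list of alignment units and each unit as a pair of sentence-index tuples (source side, target side). That representation and the function name are illustrative assumptions; the slide only names the distance.

```python
from typing import Hashable, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Plain Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # drop unit x
                cur[j - 1] + 1,          # add unit y
                prev[j - 1] + (x != y),  # keep if equal, else replace
            ))
        prev = cur
    return prev[-1]


# Hypothetical alignments of the same document by two aligners.
# Each unit is (source sentence indices, target sentence indices).
first = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
second = [((0,), (0,)), ((1,), (1,)), ((2,), ()), ((3,), (2,))]

distance = levenshtein(first, second)
```

Here the two alignments agree on the first and last units and differ in how sentences 1 and 2 are grouped, so `distance` is 2.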

10 ParAlign, the Implementation
Combine corpora, including the side with more sentences
Print out all mismatching parts (potential error spots)
Use one corpus as a guideline to proof the other one
Available at

11 Method Benefits:
Handles different segmentation levels (M to N al. unit relations)
Insensitive to minor input differences: encoding, typing errors

12 Experiment-1 Univ. of Tartu corpus and JRC-Acquis (English-Estonian)
Overlapping parts found by comparing the CELEX codes
Aim: generate joint corpus
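Finding the overlap by document identifier can be sketched as a set intersection. The dictionary layout (CELEX code mapped to document content), the function name, and the sample codes and texts below are illustrative assumptions, not data from either corpus.

```python
def overlapping_codes(corpus_a: dict, corpus_b: dict) -> list:
    """Return the CELEX codes present in both corpora, sorted."""
    return sorted(corpus_a.keys() & corpus_b.keys())


# Hypothetical miniature corpora keyed by CELEX code.
tartu = {"31995L0046": "doc text A", "32000R1234": "doc text B"}
acquis = {"31995L0046": "doc text A'", "32001D0567": "doc text C"}

shared = overlapping_codes(tartu, acquis)
```

Only the documents listed in `shared` would then be passed on to sentence-level comparison.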

13 Results
Joint corpus size: al. units

14 Segmentation differences

15 Experiment-2 JRC-Acquis
English-Estonian, English-Latvian, Estonian-Latvian
Aim: compare alignments produced by Vanilla and HunAlign
Almost 100% overlapping

16 Results
              En-Et             En-Lv             Et-Lv
              Hun      Van      Hun      Van      Hun      Van
Matching      83.5%    85.3%    83.8%    86.2%    98.0%    98.2%
Mismatching   15.9%    13.7%    15.5%    12.8%    0.1%     0.2%
Single        0.6%     1.0%     0.7%     [missing] 1.9%    1.6%

17 Future Work
Other corpora
Optimizing
Test on other domains

18 Summary
A method for parallel corpora combining/comparing/evaluating/… using overlapping parts
Implementation available
Joint En-Et corpus
Comparison results between HunAlign and Vanilla versions of the JRC-Acquis En-Et, En-Lv and Et-Lv parts

19 Thank You!

