1 Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation
Inguna Skadiņa, Andrejs Vasiļjevs, Raivis Skadiņš, Robert Gaizauskas, Dan Tufiş and Tatiana Gornostay

2 Challenge of Data Driven MT
- Rapid development of data-driven methods for MT
- Automated acquisition of linguistic knowledge extracted from huge parallel corpora provides an effective solution that minimizes time- and resource-consuming manual work
- Applicability of current data-driven methods directly depends on the availability of very large quantities of parallel corpus data
- Translation quality of current data-driven MT systems is very low for under-resourced languages and domains
3rd BUCC Malta

3 Problem of availability of linguistic resources
- Relevant for smaller or under-resourced languages
- Example of Latvian: few parallel corpora of reasonable size (e.g., JRC Acquis, EMEA)
- SMT trained on these corpora performs well on in-domain documents, but gives unacceptable results for other domains (en-lv: 43.4 BLEU in domain, 10.2 BLEU out of domain)
- Solution: comparable corpora are much more widely available than parallel translation data

4 Accurat project
- The Accurat mission is to significantly improve MT quality for under-resourced languages and narrow domains by researching novel ways in which comparable corpora can compensate for a shortage of linguistic resources
- ACCURAT methods will be:
  - Adjustable to new languages and domains
  - Language independent where possible
- 2.5 year project, started on January 1, 2010

5 Key objectives
- To create comparability metrics: to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora
- To develop, analyze and evaluate methods for automatic acquisition of comparable corpora from the Web
- To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT
- To measure improvements from applying acquired data against baseline results from SMT and RBMT systems
- To evaluate and validate the ACCURAT project results in practical applications

6 Use Cases
- Adjusting MT to narrow domain: automotive engineering, assistive technology and data processing domains
- Application for Web authoring: blog and social networking (Zemanta application)
- Using SMT in software localization: increasing efficiency in localization, integration with CAT tools

7 Language Coverage
- Focus on under-resourced languages: Latvian, Lithuanian, Estonian, Greek, Croatian, Romanian and Slovenian
- Major translation directions like English-Lithuanian, English-Croatian, German-Romanian
- Minor translation directions like Lithuanian-Romanian, Romanian-Greek and Latvian-Lithuanian

8 Work Plan
- WP1: To create comparability metrics: to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora (M3-M24)
- WP2: To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT (M3-M23)
- WP3: To develop, analyze and evaluate methods for automatic acquisition of a comparable corpus from the Web (M1-M22)
- WP4: To measure improvements from applying acquired data against results from baseline SMT and RBMT systems (M7-M26)

9 Work Plan
- WP5: To evaluate and validate the ACCURAT project results in three practical applications (M7-M30)
- WP6: To disseminate project results and to transfer the project knowledge, technologies, lessons learned and best practices to interested communities, and thus to ensure their worldwide impact and long-term sustainability (M1-M30)
- WP7: To coordinate the project and provide administrative and financial management (M1-M30)

10 Milestones
- Tools for collecting comparable corpora from the Web (M22)
- Multilingual comparable corpora (M22)
- Initial comparable corpora (M3)
- Criteria and metrics of comparability and parallelism (M24)
- Initial comparability metrics (M6)
- Alignment and extraction methods for comparable corpora (M20)
- Application of existing alignment methods (M6)
- Improved MT systems (M26)
- Adjusted MT systems in applications (M30)
- Baseline SMT systems (M9)

11 Key Results
- Comparability metrics developed and tools provided
- Comparable corpora for under-resourced languages collected and tools provided
- Methods and tools for multi-level alignment from comparable corpora developed
- Methods for using comparable corpora in both SMT and RBMT developed
- Proven application scenarios prepared
- Strong increase in MT quality for under-resourced languages and narrow domains

12 Initial comparable corpora (ICC)
- 1 million tokens for each under-resourced language
- domain corpus for en-de

Domain | Genre | Percent
International news | Newswires | 20%
Sports | Newswires | 10%
Admin | Legal | 10%
Travel | Advice | 10%
Software | Wikipedia | 15%
Software | User manuals | 15%
Medicine | For doctors | 10%
Medicine | For patients | 10%

13 Recommended proportions
- parallel: 10%
- strongly comparable (heavily edited translations or independent, but closely related texts reporting the same event or describing the same subject): 40%
- weakly comparable (e.g., texts within the same broader domain and genre, but varying in subdomains and specific genres; texts in the same narrow subject domain and genre, but describing different events): 50%
- length of each document should be between 500 and 3000 words

14 Initial comparable corpora: results

Domain | Genre | Planned | Collected
International news | Newswires | 20% | 14.73%
Sports | Newswires | 10% | 8.23%
Admin | Legal | 10% | 11%
Travel | Advice | 10% | 14.46%
Software | Wikipedia | 15% | 5.83%
Software | User manuals | 15% | 22.11%
Medicine | For doctors | 10% | 12.35%
Medicine | For patients | 10% | 11.30%

15 Initial comparable corpora: results

Comparability level | ET-EN | LV-EN | LT-EN | EL-EN | RO-EL | HR-EN | RO-EN | RO-DE | SL-EN
parallel | 9.48 | 11.82 | 46.17 | 13.33 | 32.62 | 39.51 | 6.94 | 8.52 | 40.17
strongly comparable | 51.06 | 37.51 | 21.83 | 20.47 | 30.96 | 9.44 | 17.07 | 32.67 | 27.98
weakly comparable | 39.46 | 50.67 | 32.00 | 66.20 | 36.42 | 51.05 | 76.00 | 58.81 | 31.85

16 Metadata
- Language
- Domain
- Genre
- Source
- Number of words
- IPR status
- Comparability level
Parallel and strongly comparable texts are also aligned at the document level.

17 CES (Corpus Encoding Standards)

18 Extension to CES-CCES

19 CES Alignment: Extension to CCES Alignment

20 Criteria of Comparability and Parallelism
- Lack of definite methods to determine the criteria of comparability
- Some attempts to measure the degree of comparability according to distribution of topics and publication dates of documents in comparable corpora to estimate the global comparability of the corpora (Saralegi et al., 2008)
- Some attempts to determine different kinds of document parallelism in comparable corpora, such as complete parallelism, noisy parallelism and complete non-parallelism
- Some attempts to define criteria of parallelism of similar documents in comparable corpora, such as similar number of sentences, sharing sufficiently many links (up to 30%), and monotony of links (up to 90% of links do not cross each other) (Munteanu, 2006)
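The link-based criteria in the last bullet can be illustrated with a small sketch. It is not the method of (Munteanu, 2006); it only assumes that sentence links are given as (source index, target index) pairs, and the names `link_share` and `non_crossing_fraction` are hypothetical helpers introduced here.

```python
from itertools import combinations

def link_share(links, n_src, n_tgt):
    """Fraction of sentences (both sides together) that take part in a link."""
    linked_src = {i for i, _ in links}
    linked_tgt = {j for _, j in links}
    return (len(linked_src) + len(linked_tgt)) / (n_src + n_tgt)

def non_crossing_fraction(links):
    """Fraction of links that do not cross any other link.

    Two links (i1, j1) and (i2, j2) cross when their source order and
    target order disagree, i.e. (i1 - i2) * (j1 - j2) < 0.
    """
    if not links:
        return 1.0
    crossing = set()
    for a, b in combinations(range(len(links)), 2):
        (i1, j1), (i2, j2) = links[a], links[b]
        if (i1 - i2) * (j1 - j2) < 0:
            crossing.update((a, b))
    return 1 - len(crossing) / len(links)

# Hypothetical document pair with 4 sentences on each side.
links = [(0, 0), (1, 2), (2, 1), (3, 3)]
share = link_share(links, n_src=4, n_tgt=4)
monotony = non_crossing_fraction(links)
```

A document pair would then be accepted or rejected by comparing these two values with thresholds; the thresholds themselves are left to the caller because the slide gives them only as upper bounds.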

21 Criteria of Comparability and Parallelism
To investigate criteria for comparability between corpora concentrating on different sets of features:
- Lexical features: measuring the degree of 'lexical overlap' between frequency lists derived from corpora
- Lexical sequence features: computing N-gram distances in terms of tokens
- Morpho-syntactic features: computing N-gram distances in terms of Part-of-Speech codes
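As a rough illustration of these three feature sets, the sketch below uses Jaccard overlap as the overlap and distance measure. That choice is an assumption made for the example, not something the slides specify, and it presumes both inputs are already in (or mapped to) the same vocabulary. The function names are hypothetical.

```python
from collections import Counter

def ngrams(seq, n):
    """All contiguous n-grams of a sequence (tokens or POS codes)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def top_k_overlap(tokens_a, tokens_b, k=500):
    """Jaccard overlap of the k most frequent words of two token lists."""
    top_a = {w for w, _ in Counter(tokens_a).most_common(k)}
    top_b = {w for w, _ in Counter(tokens_b).most_common(k)}
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

def ngram_distance(seq_a, seq_b, n=2):
    """1 - Jaccard overlap of the n-gram sets; 0.0 means identical sets."""
    a, b = set(ngrams(seq_a, n)), set(ngrams(seq_b, n))
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

# The same function serves token sequences and POS-code sequences.
pos_distance = ngram_distance(["DT", "NN", "VB"], ["DT", "NN", "IN"], n=2)
lexical_overlap = top_k_overlap(["a", "b", "a"], ["a", "c"])
```

Because `ngram_distance` takes any sequence, lexical sequence features and morpho-syntactic features differ only in what is passed in.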

22 First experiment
- Comparability of corpora is measured in terms of lexical features (Greek-English and German-English language pairs)
- The set-up is similar to (Kilgarriff, 2001):
  - For each corpus take the top 500 most frequent words
  - Relative frequency is used (the absolute frequency, or the word count, divided by the length of the corpus)
- Dictionaries are generated automatically by Giza++ from the parallel Europarl corpus
- We compare corpora pairwise using a standard Chi-Square distance measure:
  ChiSquare = Σ_{w ∈ {w1 ... w500}} (FrqObserved(w) - FrqExpected(w))^2 / FrqObserved(w)
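A minimal sketch of this set-up follows. It implements the distance exactly as written on the slide, dividing by the observed frequency (the textbook Pearson statistic divides by the expected one). The one-to-one `dictionary` argument is a simplifying assumption standing in for the Giza++ dictionaries, words missing from either list are simply skipped here, and all function names are hypothetical.

```python
from collections import Counter

def top_relative_frequencies(tokens, k=500):
    """Relative frequency (count / corpus length) of the k most frequent words."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).most_common(k)}

def map_frequencies(freqs, dictionary):
    """Map words to the other language; untranslated words are dropped.

    Frequencies of words sharing one translation are summed.
    """
    mapped = Counter()
    for word, freq in freqs.items():
        if word in dictionary:
            mapped[dictionary[word]] += freq
    return dict(mapped)

def chi_square(expected, observed):
    """Sum of (observed - expected)^2 / observed over words in both lists."""
    shared = expected.keys() & observed.keys()
    return sum((observed[w] - expected[w]) ** 2 / observed[w] for w in shared)

# Toy data: a German list mapped into English, compared with an English list.
german = {"haus": 0.5, "heim": 0.25, "xyz": 0.25}
mapped = map_frequencies(german, {"haus": "house", "heim": "house"})
distance = chi_square({"a": 0.5, "b": 0.25}, {"a": 0.25, "b": 0.25})
```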

23 First experiment
- Asymmetric method: relative frequencies in the corpus in language A are treated as expected values, and those mapped from the corpus in language B as observed. Then we swap corpora A and B and repeat the calculation.
- Asymmetry comes from words which are missing in one of the lists as compared to the other. Missing words have different relative frequencies that are added to the score, so the distance from A to B can be different from the distance from B to A.
- We use the minimum of these distances as the final score for the pair of corpora.
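The two-direction calculation can be sketched as follows. The handling of missing words is an assumption: a word absent from the observed list adds its own relative frequency to the score, which is one reading of the description above. Both frequency lists are assumed to be already mapped into a common language, and the function names are hypothetical.

```python
def directed_distance(expected, observed):
    """Distance with `expected` as reference and `observed` as the other list."""
    score = 0.0
    for word, f_exp in expected.items():
        f_obs = observed.get(word)
        if f_obs is None:
            # Assumption: a missing word contributes its relative frequency.
            score += f_exp
        else:
            score += (f_obs - f_exp) ** 2 / f_obs
    return score

def corpus_distance(freqs_a, freqs_b):
    """Final score: the smaller of the two directed distances."""
    return min(directed_distance(freqs_a, freqs_b),
               directed_distance(freqs_b, freqs_a))

freqs_a = {"x": 0.5, "y": 0.5}
freqs_b = {"x": 0.5, "z": 0.25}
a_to_b = directed_distance(freqs_a, freqs_b)
b_to_a = directed_distance(freqs_b, freqs_a)
final = corpus_distance(freqs_a, freqs_b)
```

In the toy data the two directions differ only because "y" and "z" are missing from the opposite list with different relative frequencies, which is the source of asymmetry the slide describes.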

24 Features
To extract the features which may be used to identify the comparability between documents
Language independent:
- Document length
- Date
- Character overlap
- Web features: URL of doc source, common links, links referring to each other, image links
- Other features …
Language dependent (requires translation):
- Lexical overlap
- Web features: anchor text, image alt tag
- Genre (?)
- Domain (?)
- Other features …

25 General Idea
[Diagram: features f1, f2, f3, …, fn are extracted from document pairs in the Initial Comparable Corpora, which are labelled with a comparability level (parallel, strongly comparable, weakly comparable, not comparable), and used to train a classifier. The classifier then predicts the comparability level of new document pairs, e.g. EL-EN.]
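The pipeline in the diagram (feature extraction, training on labelled pairs, prediction for new pairs) might be sketched as below. The two features and the 1-nearest-neighbour rule are stand-ins chosen for brevity; the slides do not say which classifier or feature definitions are used, and `features` and `predict` are hypothetical names.

```python
import math

def features(doc_a, doc_b):
    """Two language-independent features of a document pair."""
    len_a, len_b = len(doc_a.split()), len(doc_b.split())
    # Document length: ratio of the shorter to the longer, in [0, 1].
    length_ratio = min(len_a, len_b) / max(len_a, len_b, 1)
    # Character overlap: Jaccard overlap of the character sets.
    chars_a, chars_b = set(doc_a.lower()), set(doc_b.lower())
    union = chars_a | chars_b
    char_overlap = len(chars_a & chars_b) / len(union) if union else 0.0
    return (length_ratio, char_overlap)

def predict(train, pair):
    """1-nearest-neighbour prediction of the comparability level.

    `train` is a list of (feature_vector, level) examples, e.g. built
    from the labelled Initial Comparable Corpora.
    """
    x = features(*pair)
    _, level = min(train, key=lambda item: math.dist(item[0], x))
    return level

# Toy training set with hand-made feature vectors (illustrative only).
train = [((1.0, 1.0), "parallel"), ((0.1, 0.2), "not comparable")]
level = predict(train, ("a b c", "a b c"))
```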

26 Metrics of Comparability and Parallelism
- Using defined criteria for parallelism, we would like to develop formal automated metrics for determining the degree of comparability
- Lack of comparability metrics to evaluate corpus usability for different tasks, such as machine translation, information extraction, cross-language information retrieval
- Recent studies (Kilgarriff, 2001; Rayson and Garside, 2000) have added a quantitative dimension to the issue of comparability by studying objective measures for detecting how similar (or different) two corpora are in terms of their lexical content
- Further studies (Sharoff, 2007) investigated automatic ways for assessing the composition of web corpora in terms of domains and genres

27 Sentence Alignment on Parallel Texts
- Danielsson, Pernilla and Ridings, Daniel. Practical presentation of a vanilla aligner. Goteborgs universitet, 1997.
- Melamed, Dan. A Geometric Approach to Mapping Bitext Correspondence. University of Pennsylvania, 1996.
- Chen, Stanley F. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (Columbus, Ohio, 1993), Association for Computational Linguistics, Morristown, NJ, USA.
- State of the art: Moore, Robert C. Fast and Accurate Sentence Alignment of Bilingual Corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (Tiburon, California, 2002), Springer-Verlag, Heidelberg.
  - provisional alignment based on sentence lengths
  - IBM Model 1: estimate the Translation Equivalents (TE) table
  - generate one-to-one links based on sentence lengths and the TE table

28 Our Sentence Alignment on Parallel Texts
- Reification: a link in the alignment is treated as a context-independent structured object
- Using SVM (libsvm solution)
- Features:
  - translation equivalence
  - word length correlation (Pearson)
  - special characters occurrence similarity
  - word frequency ranks correlation
- Crossed links are allowed

29 Scenarios for Aligning Comparable Corpora
Based on previous experience, the literature and current constraints (time, manpower, computational resources), we envisaged three possible ways of tackling the alignment of comparable corpora in order to get useful results:
- QA techniques
- Clustering
- Windowing

30 Accurat partners

Partner | Country
Tilde (Coordinator) | Latvia
University of Sheffield | UK
University of Leeds | UK
Athena Research and Innovation Center in Information Communication and Knowledge Technologies (ILSP) | Greece
University of Zagreb, Faculty of Humanities and Social Sciences | Croatia
DFKI | Germany
Institute of Artificial Intelligence | Romania
Linguatec | Germany
Zemanta | Slovenia

31 The ACCURAT project has received funding from the EU 7th Framework Programme for Research and Technological Development under Grant Agreement N°
Project duration: January 2010 – June 2012
Contact information: Andrejs Vasiljevs, andrejs tilde.lv, Tilde, Vienibas gatve 75a, Riga LV1004, Latvia

