Approximate sentence matching and its applications in corpus-based research Rafał Jaworski INFuture2015, Zagreb, Croatia.


1 Approximate sentence matching and its applications in corpus-based research Rafał Jaworski INFuture2015, Zagreb, Croatia

2 1. Approximate sentence matching – what is that? 2. Some information about the Roman goddess of agreement. 3. Thoughts on translating an entire text corpus… manually. 4. Why the Attic Greek word συνεργία is worth remembering. Agenda

3 ASM is a technique for retrieving sentences similar to a given input sentence from a large text corpus. If we search for the sentence “the agreement was concluded on 11th of March 2012” in legal texts, we expect ASM to find the sentences: a) “the agreement was concluded on 25th of September 2014” b) “the contract was signed on 11th of March 2012” c) “the agreement was not concluded”. Which sentences count as similar depends on the similarity measure. Approximate sentence matching
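To make the role of the similarity measure concrete, here is a minimal sketch in Python. It is not the measure used in the talk: it assumes a simple token-level ratio from the standard library's `difflib.SequenceMatcher`, and the names `tokenize`, `similarity`, `find_similar` and the threshold of 0.5 are chosen only for illustration.

```python
from difflib import SequenceMatcher


def tokenize(sentence):
    # Assumption: lowercase, whitespace-separated tokens are good enough here.
    return sentence.lower().split()


def similarity(a, b):
    # Token-level ratio in [0, 1]; 1.0 means identical token sequences.
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def find_similar(query, corpus, threshold=0.5):
    # Score every corpus sentence and keep those at or above the threshold,
    # best match first.
    scored = [(similarity(query, s), s) for s in corpus]
    hits = [(score, s) for score, s in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


query = "the agreement was concluded on 11th of March 2012"
corpus = [
    "the agreement was concluded on 25th of September 2014",
    "the contract was signed on 11th of March 2012",
    "the agreement was not concluded",
    "the weather is nice today",
]
for score, sentence in find_similar(query, corpus):
    print(f"{score:.2f}  {sentence}")
```

With this particular measure the three sentences from the slide pass the threshold and the unrelated one does not. A different measure or threshold would rank them differently, which is exactly the point made above.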

4 ASM is primarily used as a Computer-Aided Translation mechanism. When a translator works on a sentence, he/she searches for similar sentences in a database of previously translated texts, the translation memory (TM). This technique is known to boost the efficiency of translation and to ensure consistency of the translations. The drawback: it can be used rather rarely (only ca. 5% of sentences can be found in the TM). ASM – translation memories

5 How to modify the classic TM searching so it can retrieve more valuable information? The goal Image found at: http://www.commercebees.com

6 How to convince translators to use new software instead of their favourite workbenches? The goal Image found at: http://www.commercebees.com

7 What it feels like… Image found at: http://memegenerator.net, depicting a character from the Lord of the Rings film series

8 What it feels like… Image found at: http://www.dailymail.co.uk

9 The Concordia translation memory searcher was developed. It combines classical TM search with concordance searching (finding a single word in context). It takes its name from the Roman goddess of agreement, as it helps to produce translations that “agree” with each other. Let’s not give up!

10 Concordia – example Translation memory I just think it is impossible. He is not sure if it is needed. I want you to repair the car already! I can not repair the lawn mower. It might be impossible to do that. It is impossible to repair the car. search:

11 All possible overlays are then scored. A good overlay covers as much of the input sentence as possible with as few fragments as possible. The translator is presented with translations of the longest fragments of the sentence he/she is working on. Productivity and usability experiments are under way! Concordia
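The idea of scoring overlays can be sketched as follows. This is an illustrative reconstruction, not Concordia's actual algorithm: the scoring formula (covered tokens minus a fixed penalty per fragment), the `penalty` value, and the function names are all assumptions. Treating "It is impossible to repair the car." as the search query is also an assumption about the example slide.

```python
def tokenize(sentence):
    # Assumption: lowercase tokens, trailing sentence punctuation dropped.
    return sentence.lower().rstrip(".!?").split()


def contains(tokens, fragment):
    # True if `fragment` occurs as a contiguous run inside `tokens`.
    n = len(fragment)
    return any(tokens[i:i + n] == fragment for i in range(len(tokens) - n + 1))


def best_overlay(query, memory, penalty=0.5):
    # Hypothetical score: number of covered tokens minus `penalty` per fragment,
    # so long fragments beat many short ones.
    q = tokenize(query)
    tm = [tokenize(s) for s in memory]
    # best[i] holds (score, fragments) for the query suffix starting at token i.
    best = [(0.0, []) for _ in range(len(q) + 1)]
    for i in range(len(q) - 1, -1, -1):
        candidate = best[i + 1]  # option: leave token i uncovered
        for j in range(i + 1, len(q) + 1):
            fragment = q[i:j]
            if not any(contains(t, fragment) for t in tm):
                break  # a longer fragment starting at i cannot match either
            score = len(fragment) - penalty + best[j][0]
            if score > candidate[0]:
                candidate = (score, [" ".join(fragment)] + best[j][1])
        best[i] = candidate
    return best[0]


memory = [
    "I just think it is impossible.",
    "He is not sure if it is needed.",
    "I want you to repair the car already!",
    "I can not repair the lawn mower.",
    "It might be impossible to do that.",
]
print(best_overlay("It is impossible to repair the car.", memory))
# (6.0, ['it is impossible', 'to repair the car'])
```

Under these assumptions the best overlay covers the whole query with two fragments, each taken from a different translation memory sentence, which is the behaviour described above.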

12 And now for something (completely) different… Let us assume we have a large collection of texts in just one language. We would like to build a TM (aka parallel corpus) by manually translating all our sentences. WHAT?! Producing TMs

13 It’s okay, we will not translate ALL the sentences! We will only choose the most representative ones and translate them. And how do we choose the most representative sentences of a monolingual corpus? Let’s make clever use of ASM, more precisely – the sentence similarity measure. Producing TMs
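One way a similarity measure could drive this selection is sketched below. The talk does not spell out the procedure, so this is a hypothetical greedy scheme: repeatedly pick the sentence that has the most similar neighbours, then drop those neighbours from the pool. The similarity function (a token-level `difflib` ratio), the threshold of 0.6, and all names are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Assumed measure: token-level ratio in [0, 1].
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def pick_representatives(corpus, threshold=0.6, limit=None):
    # Greedy selection: the sentence with the most neighbours is taken as the
    # representative of its group, and the whole group leaves the pool.
    remaining = list(corpus)
    chosen = []
    while remaining and (limit is None or len(chosen) < limit):
        def neighbours(sentence):
            return [t for t in remaining if similarity(sentence, t) >= threshold]

        best = max(remaining, key=lambda s: len(neighbours(s)))
        covered = neighbours(best)  # always includes `best` itself
        chosen.append(best)
        remaining = [s for s in remaining if s not in covered]
    return chosen


corpus = [
    "the agreement was concluded on 11th of March 2012",
    "the agreement was concluded on 25th of September 2014",
    "the contract was signed on 11th of March 2012",
    "payment is due within 30 days",
    "payment is due within 14 days",
]
for sentence in pick_representatives(corpus):
    print(sentence)
```

On this toy corpus only two sentences are selected for manual translation, one per group of similar sentences. The pairwise comparison is costly on a large corpus, so a real system would need an index or another way to limit the comparisons.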

14

15 This method proved effective in preparing high-quality specialized translation memories. Such TMs are much more beneficial for the translation process. They can also be used for other purposes, such as training statistical machine translators. Producing TMs

16 Now, what is so special about the word συνεργία? Transliterated it is: synergia – synergy, working together. Good NLP research requires synergy between linguists and computer scientists. Greek word Images found at: https://spectacledbookworm.wordpress.com/ and http://lemmino.deviantart.com

17 Linguists do not seem to know much about how computer software is created, or about which techniques are easy to implement and which are not. However, to be fair, computer scientists probably know even less about the translation process. Moreover, the two groups are motivated differently – translators are primarily focused on the quality of their translation. Computer scientists, on the other hand, are focused on the performance of their software. Synergy – problems

18 Ideally, linguists and computer scientists should spend about 1-2 hours a week working together. They should exchange concepts and educate each other in their fields. The computer scientist should translate a document under the supervision of the linguist. The translator should become familiar with the architecture of the system he/she uses for work. Ideas for new features in the software should result from their joint thinking process. Synergy – solutions

19 Only with this approach can one establish true synergy! Synergy – solutions Image found at: http://www.referenceforbusiness.com

20 Hvala lijepa! (Thank you very much!) INFuture2015, Zagreb, Croatia

