Approximate sentence matching and its applications in corpus-based research Rafał Jaworski INFuture2015, Zagreb, Croatia.

Presentation transcript:

Approximate sentence matching and its applications in corpus-based research Rafał Jaworski INFuture2015, Zagreb, Croatia

Agenda
1. Approximate sentence matching: what is it?
2. Some information about the Roman goddess of agreement.
3. Thoughts on translating an entire text corpus… manually.
4. Why the Attic Greek word συνεργία is worth remembering.

Approximate sentence matching
ASM is a technique for retrieving sentences similar to a given input sentence from a large text corpus. If we search law texts for the sentence "the agreement was concluded on 11th of March 2012", we expect ASM to find sentences such as:
a) "the agreement was concluded on 25th of September 2014"
b) "the contract was signed on 11th of March 2012"
c) "the agreement was not concluded"
Which sentences count as similar depends on the similarity measure.
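To make the role of the similarity measure concrete, here is a minimal sketch that ranks the three example sentences against the query. The talk does not say which measure is used, so the token-level ratio from Python's standard `difflib` module is an assumption chosen purely for illustration, as are the helper names `tokenize` and `similarity`.

```python
from difflib import SequenceMatcher


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def similarity(a, b):
    """Token-level similarity in [0, 1]: 2 * matched tokens / total tokens.

    This is an illustrative measure, not the one used in the talk.
    """
    matcher = SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


query = "the agreement was concluded on 11th of March 2012"
candidates = {
    "a": "the agreement was concluded on 25th of September 2014",
    "b": "the contract was signed on 11th of March 2012",
    "c": "the agreement was not concluded",
}

# Rank candidate labels from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda label: similarity(query, candidates[label]),
    reverse=True,
)

for label in ranked:
    print(label, round(similarity(query, candidates[label]), 2), candidates[label])
```

Under this particular measure, sentence (b) ranks above (a) because it shares more tokens in order with the query. A measure that weights legal terms more heavily than dates could reverse that order, which is exactly the dependence on the measure noted above.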

ASM – translation memories
ASM is primarily used as a Computer-Aided Translation mechanism. When working on a sentence, a translator searches for similar sentences in a base of previously translated texts, the translation memory (TM). This technique is known to boost the efficiency of translation and to ensure the consistency of translations. The drawback: it can rarely be applied (only ca. 5% can be found in the TM).
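A classic TM lookup can be sketched as a thresholded best-match search over stored (source, target) pairs. Everything below is hypothetical: the function names `token_ratio` and `tm_lookup`, the threshold values, and the placeholder target strings are assumptions for illustration, and the similarity measure is again the token-level ratio from `difflib`, not one confirmed by the talk.

```python
from difflib import SequenceMatcher


def token_ratio(a, b):
    """Illustrative token-level similarity in [0, 1]."""
    return SequenceMatcher(
        None, a.lower().split(), b.lower().split(), autojunk=False
    ).ratio()


def tm_lookup(sentence, tm, threshold=0.7):
    """Return (source, target, score) for the best TM entry whose score
    is at or above the threshold, or None when nothing qualifies."""
    best = None
    for source, target in tm:
        score = token_ratio(sentence, source)
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


# Hypothetical TM: the targets are placeholders, not real translations.
tm = [
    ("the agreement was concluded on 11th of March 2012", "<stored translation 1>"),
    ("the agreement was not concluded", "<stored translation 2>"),
]

sentence = "the agreement was concluded on 25th of September 2014"
print(tm_lookup(sentence, tm, threshold=0.6))  # a fuzzy match is returned
print(tm_lookup(sentence, tm, threshold=0.7))  # nothing passes the stricter threshold
```

The stricter the threshold, the fewer sentences get any suggestion at all, which is one way to read the drawback mentioned above.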

The goal
How can classic TM searching be modified so that it retrieves more valuable information?

The goal
How can translators be convinced to use new software instead of their favourite workbenches?

What it feels like…
[Image depicting a character from the Lord of the Rings film series]

What it feels like…

Let's not give up!
The Concordia translation memory searcher was developed. It combines classical TM search with concordance searching (finding a single word in context). It takes its name from the Roman goddess of agreement, as it helps to produce translations that "agree" with each other.

Concordia – example
Translation memory:
I just think it is impossible.
He is not sure if it is needed.
I want you to repair the car already!
I can not repair the lawn mower.
It might be impossible to do that.
It is impossible to repair the car.
search:

Concordia
All possible overlays are then scored. A good overlay covers as much of the input sentence as possible with as few fragments as possible. The translator is presented with translations of the longest fragments of the sentence he/she is working on. Productivity and usability experiments are under way!
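The idea of scoring overlays can be sketched as follows. This is not Concordia's actual algorithm or scoring formula, which the slides do not give. It assumes that an overlay is a set of non-overlapping fragments of the input sentence, each found verbatim in some TM sentence, and it uses a hypothetical score: the covered share of the input minus a small penalty per fragment. Treating "It is impossible to repair the car." as the input sentence is also an assumption about the slide's example. The exhaustive enumeration is only practical for short sentences.

```python
from itertools import combinations


def tokenize(sentence):
    """Lowercase, split on whitespace, strip basic punctuation."""
    return [word.strip(".,!?").lower() for word in sentence.split()]


def find_fragments(query_tokens, tm, min_len=2):
    """Return sorted (start, end) spans of the query (end exclusive) that
    occur as contiguous token runs in at least one TM sentence."""
    spans = set()
    for sentence in tm:
        t = tokenize(sentence)
        ngrams = {
            tuple(t[i:j])
            for i in range(len(t))
            for j in range(i + min_len, len(t) + 1)
        }
        for i in range(len(query_tokens)):
            for j in range(i + min_len, len(query_tokens) + 1):
                if tuple(query_tokens[i:j]) in ngrams:
                    spans.add((i, j))
    return sorted(spans)


def overlay_score(overlay, length, fragment_penalty=0.05):
    """Hypothetical score: coverage share minus a penalty per fragment."""
    covered = sum(end - start for start, end in overlay)
    return covered / length - fragment_penalty * len(overlay)


def best_overlay(query, tm):
    """Score every set of non-overlapping fragments and return the best."""
    q = tokenize(query)
    spans = find_fragments(q, tm)
    best, best_score = (), 0.0
    for k in range(1, len(spans) + 1):
        for combo in combinations(spans, k):
            # Spans are sorted, so checking neighbours is enough.
            if all(a[1] <= b[0] for a, b in zip(combo, combo[1:])):
                score = overlay_score(combo, len(q))
                if score > best_score:
                    best, best_score = combo, score
    return best, best_score


tm = [
    "I just think it is impossible.",
    "He is not sure if it is needed.",
    "I want you to repair the car already!",
    "I can not repair the lawn mower.",
    "It might be impossible to do that.",
]
query = "It is impossible to repair the car."

overlay, score = best_overlay(query, tm)
tokens = tokenize(query)
for start, end in overlay:
    print(" ".join(tokens[start:end]))
```

With these assumptions the winning overlay consists of two fragments, "it is impossible" and "to repair the car", which together cover the whole input. An overlay built from three shorter fragments covers the same tokens but scores lower because of the per-fragment penalty.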

Producing TMs
And now for something (completely) different… Let us assume we have a large collection of texts in just one language. We would like to build a TM (aka parallel corpus) by manually translating all our sentences. WHAT?!

Producing TMs
It's okay, we will not translate ALL the sentences! We will only choose the most representative ones and translate them. And how do we choose the most representative sentences of a monolingual corpus? Let's make clever use of ASM, or more precisely, of the sentence similarity measure.
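One way a similarity measure could drive the choice of representative sentences is a greedy coverage scheme: repeatedly pick the sentence that is similar to the largest number of not-yet-covered sentences. The talk does not describe its selection procedure, so this scheme, the Jaccard measure, the threshold, the function names, and the tiny corpus are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_representatives(sentences, threshold=0.5):
    """Greedily pick sentences that each 'cover' as many remaining
    sentences as possible, where cover means similarity >= threshold."""
    token_sets = [set(s.lower().split()) for s in sentences]
    remaining = list(range(len(sentences)))
    chosen = []
    while remaining:
        best_index, best_cover = remaining[0], []
        for i in remaining:
            cover = [
                j for j in remaining
                if jaccard(token_sets[i], token_sets[j]) >= threshold
            ]
            if len(cover) > len(best_cover):
                best_index, best_cover = i, cover
        chosen.append(sentences[best_index])
        # Drop the chosen sentence and everything it covers.
        covered = set(best_cover) | {best_index}
        remaining = [j for j in remaining if j not in covered]
    return chosen


# Hypothetical monolingual corpus.
corpus = [
    "the agreement was concluded on 11th of March 2012",
    "the agreement was concluded on 25th of September 2014",
    "the agreement was concluded on 3rd of May 2010",
    "the court dismissed the appeal",
    "the court dismissed the complaint",
]

for sentence in select_representatives(corpus):
    print(sentence)
```

In this toy corpus only two of the five sentences would be sent for manual translation, one per group of near-duplicates. The quadratic comparison loop is fine for a sketch but would need indexing or sampling on a real corpus.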

Producing TMs
This method proved effective in preparing high-quality specialized translation memories. Such TMs are much more beneficial for the translation process. They can also be used for other purposes, such as training statistical machine translation systems.

Greek word
Now, what is so special about the word συνεργία? Transliterated, it is "synergia": synergy, working together. Good NLP research requires synergy between linguists and computer scientists.

Synergy – problems
Linguists do not seem to know much about how computer software is created, or about which techniques are easy to implement and which are not. However, to be fair, computer scientists probably know even less about the translation process. Moreover, the two groups are motivated differently: translators are primarily focused on the quality of their translation, while computer scientists are focused on the performance of their software.

Synergy – solutions
Ideally, linguists and computer scientists should spend about 1-2 hours a week working together. They should exchange concepts and educate each other in their fields. The computer scientist should translate a document under the supervision of the linguist. The translator should become familiar with the architecture of the system he/she uses for work. Ideas for new features in the software should result from their joint thinking process.

Synergy – solutions
Only with this approach can one establish true synergy!

Hvala lijepa! (Thank you very much!) INFuture2015, Zagreb, Croatia