In two minds: How to teach translation students to learn from parallel corpora Tomaž Erjavec Department of Intelligent Systems Jožef Stefan Institute

Presentation transcript:

In two minds: How to teach translation students to learn from parallel corpora
Tomaž Erjavec, Department of Intelligent Systems, Jožef Stefan Institute
Špela Vintar, Department of Translation and Interpreting, University of Ljubljana

Overview
- The corpus and concordancer
- Using the resource to teach students

The IJS-ELAN parallel corpus
- EU MLIS project ELAN: IJS
- Slovene-English parallel texts
- 1 million words, 15 texts
- sentence aligned, tokenised
- TEI encoded
- freely available

Example TU
- 117. člen / Article 117
- Memory exhausted / zmanjkalo pomnilnika

Web concordance
- IMS CQP backend
- CGI Perl interface
- Apache server

Queries
- Vanilla queries: dog*, *dog
- Full regular expressions: "dog.*"
- Positional attributes: [num="dual"]
- Expressions over tokens
- Constraints on aligned segments
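The step from vanilla queries to full regular expressions can be illustrated with a short Python sketch. It is not the concordancer's own code (which the slides describe as a CGI Perl interface over IMS CQP); the function names and the sample tokens are assumptions made for illustration, and `*` is assumed to stand for any run of characters inside a single token.

```python
import re

def vanilla_to_regex(query):
    """Turn a wildcard query such as 'dog*' into a compiled regex.

    Assumption: '*' is the only wildcard; every other character is literal.
    """
    literal_parts = [re.escape(part) for part in query.split("*")]
    return re.compile(".*".join(literal_parts))

def match_tokens(query, tokens):
    """Return the tokens that match the query as a whole."""
    pattern = vanilla_to_regex(query)
    return [t for t in tokens if pattern.fullmatch(t)]

# Invented sample tokens, not taken from the corpus.
tokens = ["dog", "dogs", "bulldog", "dogma", "cat"]
print(match_tokens("dog*", tokens))   # ['dog', 'dogs', 'dogma']
print(match_tokens("*dog", tokens))   # ['dog', 'bulldog']
```

Matching the whole token rather than a substring keeps `dog*` from returning "bulldog", which is the behaviour students usually expect from a prefix query.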

Using the corpus in translator training: Developing corpus literacy
- what is a corpus?
- what's in the corpus?
- how to find things in the corpus?
- how to use the results?

Formulating corpus queries
- learning to formalize language
- wordform vs. lemma (Slovene!)
- using parallel search to filter out unwanted examples
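A minimal Python sketch of the two points above: searching by lemma rather than word form, and using a condition on the aligned English segment to filter out unwanted hits. The data structures, function names and the two sentence pairs are invented for illustration; the real corpus is queried through CQP, not through code like this.

```python
from dataclasses import dataclass

@dataclass
class Token:
    word: str    # the form as it appears in the text
    lemma: str   # the dictionary form

@dataclass
class AlignedPair:
    sl: list     # Slovene tokens
    en: list     # English tokens

def has_word(tokens, word):
    return any(t.word == word for t in tokens)

def has_lemma(tokens, lemma):
    return any(t.lemma == lemma for t in tokens)

def parallel_search(pairs, sl_lemma, en_lemma):
    """Keep pairs with sl_lemma on the Slovene side and en_lemma on the English side."""
    return [p for p in pairs
            if has_lemma(p.sl, sl_lemma) and has_lemma(p.en, en_lemma)]

# Two invented aligned pairs (hypothetical data, not from IJS-ELAN).
pairs = [
    AlignedPair(
        sl=[Token("odpravili", "odpraviti"), Token("smo", "biti"), Token("hrošča", "hrošč")],
        en=[Token("we", "we"), Token("fixed", "fix"), Token("the", "the"), Token("bug", "bug")],
    ),
    AlignedPair(
        sl=[Token("hrošč", "hrošč"), Token("leze", "lesti")],
        en=[Token("the", "the"), Token("beetle", "beetle"), Token("crawls", "crawl")],
    ),
]

# A word-form search misses the inflected form 'hrošča'; a lemma search finds both pairs.
print(len([p for p in pairs if has_word(p.sl, "hrošč")]))    # 1
print(len([p for p in pairs if has_lemma(p.sl, "hrošč")]))   # 2
# Requiring 'bug' on the English side filters out the insect sense.
print(len(parallel_search(pairs, "hrošč", "bug")))           # 1
```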

Evaluating the results
- critical eye: corpus translations may be false or bad
- before relying on quantitative data, consider corpus composition
- corpus != dictionary
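One simple way to "consider corpus composition" before trusting a count is to check how the hits are spread over the individual texts. The sketch below assumes each hit is labelled with the id of the text it came from; the function name and the hit list are invented for illustration.

```python
from collections import Counter

def share_per_text(hit_text_ids):
    """Return (text_id, share of all hits) pairs, largest share first."""
    counts = Counter(hit_text_ids)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(text_id, n / total) for text_id, n in counts.most_common()]

# Invented hit list: one text id per hit (hypothetical numbers).
hits = ["usta"] * 6 + ["spor"] * 2 + ["kuca", "orwl"]
for text_id, share in share_per_text(hits):
    print(f"{text_id}: {share:.0%}")
```

If a single text accounts for most of the hits, as in this toy list, the frequency says more about that text than about the language pair in general.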

Types of activities
- frontal presentations
- group work
- individual work - translating with the corpus
- seminar assignments

Things to observe
- translation (in)equivalence, terminological variety
- word-formation strategies
- pragmatic/cultural conventions of text types
- contrastive analysis
- other translation strategies

lokaln* samouprav*: ?

kuca:
  SL: z ustreznim razmerjem med državo in lokalno samoupravo, med središčem države in
  EN: A society with an appropriate relationship between the state and local government, between the national centre and individual regions.

parl:
  SL: obstajati. Specifične oblike lokalne samouprave so Slovenci poznali pod imenom župa,
  EN: Specific forms of local self-administration were known to Slovenes by the term župa, which meant one or more villages led by a župan.

ecmr:
  SL: reforme javne uprave, razvoj lokalne samouprave, pa tudi oceno kadrovskih potreb in
  EN: It is therefore an operative document which, apart from strategic goals, defines the areas of reforms, macro - and micro-economic policy measures, development of judicial system, public administration reform, development of local administration, as well as an estimate of the staff and financing requirements for realisation of those reforms.

ekol:
  SL: okolja33. V ta sklop sodi tudi raven lokalne samouprave s svojimi pristojnostmi na področju
  EN: This also includes the level of local self-government with its responsibilities in the area of environmental protection, which otherwise are dealt with in a special chapter.

*bug*
20 bugs
13 bug
9 debugging
8 debug
3 buggers
3 bug-free
2 buggy
2 Debugging
debuggers 1 debugger bug-fixes 1

*hrošč*
11 hroščev
6 hrošču
5 razhroščevanje
5 hroščih
4 hrošče
3 hrošč
2 razhroščevanja
2 razhroščevalnega
2 hrošči
2 Razhroščevanje
1 razhroščujejo
1 razhroščiti
1 razhroščevanju
1 razhroščevalniku
1 razhroščevalniki
1 razhroščevalnik
1 razhroščevalnih
1 razhroščevalne
1 hroščem
1 hroščati
1 hroščat
1 hrošča
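Frequency lists like these, which count every word form matching a wildcard query, can be approximated with a few lines of Python. This is a sketch under assumptions: the function name and the token list are invented, matching is case-sensitive (the lists above count "Debugging" and "debugging" separately), and `*` is treated as "any run of characters".

```python
import re
from collections import Counter

def wildcard_frequencies(tokens, query):
    """Count the word forms that match a '*' wildcard query, most frequent first."""
    pattern = re.compile(".*".join(re.escape(part) for part in query.split("*")))
    counts = Counter(t for t in tokens if pattern.fullmatch(t))
    return counts.most_common()

# Invented token stream (hypothetical, not corpus data).
tokens = ["bugs", "bug", "debugging", "bugs", "Debugging", "kernel", "bug", "bugs"]
for form, count in wildcard_frequencies(tokens, "*bug*"):
    print(count, form)
```

Listing forms rather than lemmas is what makes word-formation patterns visible: prefixed and derived forms (debug, razhroščevanje) show up next to the base word.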

Ways of translating deontic modality - shall

usta:
  EN: Within its own territory, Slovenia shall protect human rights and fundamental
  SL: Država na svojem ozemlju varuje človekove pravice in temeljne svoboščine.

usta:
  EN: 11 The official language of Slovenia shall be Slovenian. In those areas where
  SL: Uradni jezik v Sloveniji je slovenščina.

spor:
  EN: This schedule shall provide for a phasing-out
  SL: Ta razpored mora predvideti postopno opuščanje tako uvedenih carin, s katerim je treba začeti najkasneje dve leti po uvedbi dajatev, in sicer po enakih letnih stopnjah.

orwl:
  EN: " " Obviously we shall put it off as long as
  SL: " Nujno jo morava odložiti za tako dolgo, kot moreva. "

kuca:
  EN: a state which shall be fair to all,
  SL: Je pa v moči vseh državljank in državljanov, da si ustvarijo tako državo, ki bo pravična do vseh, ne glede na njihove poglede na svet, politično prepričanje ali narodno pripadnost.

kuca:
  EN: world. Thus we shall create harmony
  SL: Tako bomo ustvarjali ravnovesje v sebi, z drugimi in z okoljem.

A peek into the log file
- ~1,900 different queries since 1999
- L2 search: prevarication, forfeiture, runlevel, kernel
- lexical-gap words: bias, retrieve, prepoznavnost
- culturally bound words: potica, kozolec
- (multiword) terms: legira.* (alloy steel)