Phrase alignment of Estonian-German parallel treebanks. Heli Uibo and Krista Liin, University of Tartu; Martin Volk, Stockholm University.

Presentation transcript:

1 Phrase alignment of Estonian-German parallel treebanks Heli Uibo and Krista Liin, University of Tartu Martin Volk, Stockholm University

2 Aim and motivation
Aim: the alignment of phrases in two corpora that are translations of each other.
Motivation:
–Example-Based Machine Translation (EBMT)
–Cross-language and translation studies

3 Phrase alignment: example

4 Existing resource – The Sofie Parallel Treebank (password protected)
Nine European languages, including German and Estonian
Initiated by the Nordic Treebank Network
Chapters 1-2 of Jostein Gaarder’s novel “Sophie’s World”
Sentences aligned
Syntactic structure and functions annotated, but with different annotation schemes:
–German – TIGER ( stuttgart.de/projekte/TIGER/ )
–Estonian – VISL (

5 Automatic alignment of Estonian-German NPs
This is the first automatic alignment of Estonian-X parallel corpora below the sentence level.
We started with the automatic alignment of NPs, because
–noun phrases carry an important part of the sentence's meaning;
–NPs are the most frequent phrase type in these languages.

6 The NP alignment method
1. Find all noun phrases in the parallel sentences.
German: Sofie legte dann immer einen dicken Stapel Post auf den Küchentisch, ehe sie auf ihr Zimmer ging, um ihre Aufgaben zu machen.
Estonian: Tavaliselt pani ta paksu pataka posti köögilauale, enne kui läks üles oma tuppa koolitöid tegema.

7 The NP alignment method
2. Find all correspondences between the noun phrases.
German: Sofie legte dann immer einen dicken Stapel Post auf den Küchentisch, ehe sie auf ihr Zimmer ging, um ihre Aufgaben zu machen.
Estonian: Tavaliselt pani ta paksu pataka posti köögilauale, enne kui läks üles oma tuppa koolitöid tegema.
3. Remove overlapping correspondences.

8 The NP alignment method
To accomplish steps 2 and 3, we used online dictionaries (ET-EN and DE-EN) and annotation information:
2. To set the correspondences between Estonian and German NPs
–Translate all NP heads to English;
–Find the intersections of the translations;
–If a pair of NPs is related by translation, set a correspondence between them.
3. To remove overlapping correspondences
–Use proper names as milestones;
–Look at the locations of the NPs in the sentence.
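Steps 2 and 3 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the dictionaries `DE_EN` and `ET_EN` are tiny hypothetical stand-ins for the online dictionaries (keyed by surface forms from the example sentence rather than lemmas), the function names are invented, and the overlap filter uses only the "location in the sentence" heuristic; the proper-name milestones are left out.

```python
from itertools import product

# Hypothetical toy dictionaries (assumption): NP head -> set of English translations.
DE_EN = {
    "Stapel": {"pile", "stack"},
    "Küchentisch": {"kitchen table"},
    "Zimmer": {"room"},
    "Aufgaben": {"homework", "tasks"},
}
ET_EN = {
    "pataka": {"pile", "bundle"},
    "köögilauale": {"kitchen table"},
    "tuppa": {"room", "chamber"},
    "koolitöid": {"homework", "schoolwork"},
}


def candidate_pairs(de_heads, et_heads):
    """Step 2: pair NPs whose head translations share at least one English word."""
    pairs = []
    for (i, de_head), (j, et_head) in product(enumerate(de_heads), enumerate(et_heads)):
        shared = DE_EN.get(de_head, set()) & ET_EN.get(et_head, set())
        if shared:
            pairs.append((i, j))
    return pairs


def remove_overlaps(pairs, n_de, n_et):
    """Step 3 (simplified): keep a one-to-one subset, preferring pairs whose
    NPs sit at similar relative positions in their sentences."""
    def distance(pair):
        i, j = pair
        return abs(i / max(n_de - 1, 1) - j / max(n_et - 1, 1))

    chosen, used_de, used_et = [], set(), set()
    for i, j in sorted(pairs, key=distance):
        if i not in used_de and j not in used_et:
            chosen.append((i, j))
            used_de.add(i)
            used_et.add(j)
    return sorted(chosen)


# NP heads in sentence order (a hand-picked subset of the example sentence).
de_heads = ["Stapel", "Küchentisch", "Zimmer", "Aufgaben"]
et_heads = ["pataka", "köögilauale", "tuppa", "koolitöid"]

candidates = candidate_pairs(de_heads, et_heads)
alignment = remove_overlaps(candidates, len(de_heads), len(et_heads))
print(alignment)
```

The greedy filter is one simple way to resolve competing candidates; a real system would also need lemmatization before dictionary lookup, since Estonian heads appear in inflected forms.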

9 Results
53 sentence pairs
134 possible NP matches were found, of which 75 were selected.
Precision: 84%
Recall: 53%
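For readers unfamiliar with the two measures, the following minimal sketch shows how precision and recall are conventionally computed for a set of proposed alignments against a manually created gold standard. The function name and the toy alignment sets are hypothetical; they are not the data behind the figures above.

```python
def precision_recall(predicted, gold):
    """Precision = correct / proposed; recall = correct / gold-standard links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example (assumption): pairs are (German NP index, Estonian NP index).
predicted = {(0, 0), (1, 1), (2, 3)}
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

A method that proposes few but mostly correct links, as here, scores high on precision and lower on recall, which is the same pattern as in the reported results.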

10 Sources of errors
Different tree structures (the German trees are deeper).
Translation problems: we used English as an intermediary language to find German-Estonian word correspondences, because there is no free German-Estonian electronic dictionary.
An NP in one language may correspond to a different phrase type, or to part of an NP, in the other language.
A PP in German often corresponds to an NP in Estonian:
–Much of the grammatical information expressed by prepositions in German or English is expressed by grammatical cases in Estonian.

11 Alternative approach – statistical
An alternative to bilingual electronic dictionaries is the use of statistical word alignment methods.
This approach has been evaluated by Samuelsson (2004) for the phrase alignment of a German-Swedish parallel treebank.
We also intend to test it on a German-Estonian treebank, although we are aware that the structural differences between German and Estonian make automatic word alignment more difficult.
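To give a flavour of what statistical word alignment involves, here is a minimal sketch of IBM Model 1 trained with expectation-maximization. This is a swapped-in textbook technique chosen for illustration; the slides do not say which alignment model or tool was or would be used. The three sentence pairs are invented toy data, the function names are hypothetical, and the usual NULL source word is omitted for brevity.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=10):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) with EM."""
    tgt_vocab = {w for _, tgt in bitext for w in tgt}
    # Uniform initialization over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in bitext:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word.
    Returns (source index, target index) pairs."""
    return [
        (max(range(len(src)), key=lambda i: t[(w, src[i])]), j)
        for j, w in enumerate(tgt)
    ]


# Invented toy bitext (assumption): German source, Estonian target.
bitext = [
    (["Sofie", "liest"], ["Sofie", "loeb"]),
    (["Jorunn", "liest"], ["Jorunn", "loeb"]),
    (["Sofie", "schläft"], ["Sofie", "magab"]),
]

t = train_ibm1(bitext)
links = align(*bitext[0], t)
print(links)
```

Because the model learns only from co-occurrence, it needs no dictionary, but it also has no notion of morphology, which is one reason a case-rich language such as Estonian is harder to align at the word level.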

12 Treebank tools
Tools exist for monolingual treebanks:
–editors, e.g. Annotate
–treebank query tools (tgrep, TIGERSearch)
Special software tools are needed for building and using parallel treebanks.
We have developed an alignment viewer based on SVG (Scalable Vector Graphics).
Still to be implemented:
–alignment editor (currently being developed at Stockholm University)
–phrase alignment test tool

13 Alignment visualization: Index file in HTML
Tree overview
[0] EEDENI AED...   [0, 1]   [1] Der Garten Eden
[1] lõppude lõpuks pidi miski kunagi tekkima mittemillestki.   [1, 2]   [2]... schließlich und endlich mußte doch irgendwann irgend etwas aus null und nichts entstanden sein...
[3] Alguses tuli ta koos Jorunniga.   [3, 4]   [4] Das erste Stück war sie mit Jorunn zusammen gegangen.
[4] Nad olid rääkinud robotitest.   [4, 5]   [5] Sie hatten sich über Roboter unterhalten.
[5] Jorunn arvas, et inimaju on nagu keerukas elektronarvuti.   [5, 6]   [6] Jorunn hielt das menschliche Gehirn für einen komplizierten Computer.

14 After a click… an SVG picture

15 Conclusion and perspectives
Our first attempt to align the noun phrases in the Estonian-German parallel treebank led to satisfactory results.
The results could be improved if
–other phrase types were taken into account;
–a more accurate dictionary were used;
–the Estonian syntactic trees were deepened, bringing their annotation depth closer to that of the German trees.