Disambiguation of homographic adjective and adverb forms in Croatian texts Danijela Merkler*, Daša Berović*, Željko Agić** * Department of Linguistics.

Presentation transcript:

Disambiguation of homographic adjective and adverb forms in Croatian texts
Danijela Merkler*, Daša Berović*, Željko Agić**
* Department of Linguistics
** Department of Information Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
NooJ 2011 Dubrovnik

Talk overview
- project ACCURAT
- problem and corpora
- modeling local grammars and applying them
- statistical evaluation

ACCURAT
- FP7 project
- main goal: to develop methods and techniques to overcome one of the central problems of machine translation (MT), the lack of linguistic resources for under-resourced areas of machine translation
- key innovation: creation of a methodology and tools to measure, find and use comparable corpora to improve the quality of MT
- the ACCURAT project will contribute significantly not only to the theory of MT, but also to corpus linguistics, information extraction and natural language processing in general

Scientific objectives
- create comparability metrics: develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora
- establish research methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora
- disambiguation: an important process for POS and MSD tagging

Problem
- parallel and comparable resources are sparse for Croatian when paired with any of the languages included in the project, especially if the other language is under-resourced as well
- importance of high-quality annotation for existing Croatian language resources
  - building (factored) language models for MT
  - using text anchors in comparable resources
- MSD-tagging and lemmatization errors detected in existing Croatian language resources
  - e.g. Croatian National Corpus v2.5 (automatically lemmatized and MSD-tagged), manually annotated subcorpora, Croatian Dependency Treebank
- manual analysis of their annotation reveals regular patterns in these errors

Problem
- the neuter nominative singular forms of descriptive adjectives are identical to the forms of the adverbs derived from those adjectives by suffixation
- these adverbs are realized in context
- in most cases the adverb is derived from an adjective with an abstract meaning
- there are several types of word forms:
  - forms of adverbs and adjectives that occur with no semantic constraints: razdragano (gleeful), bahato (arrogant), ubrzano (rapidly), uzrujano (upset), umiljato (cuddly)
  - forms derived from verbs: drhtavo (shaking), laskavo (flattering), šepavo (lame)
  - forms with a dual meaning (concrete and abstract): mlako (lukewarm), šugavo (itchy), mračno (darkly), hladno (cold), gorko (bitter)
  - forms that denote spatial and temporal relations: rano (early), duboko (deeply), plitko (shallow), lijevo (left)

Corpora
- Croatia Weekly 100 kw newspaper corpus (newspaper published from 1998 to 2000, 118 issues)
  - covers different domains: politics, economy and finance, tourism, ecology, culture, art, sports
  - part of the Croatian side of the Croatian-English Parallel Corpus
  - manually lemmatized and MSD-tagged using the MULTEXT-East v3 morphosyntactic specifications
- Orwell's "1984" corpus, manually lemmatized and MSD-tagged using MULTEXT-East v4
  - languages: En, Ro, Sl, Cs, Et, Hu, Sr, Bg, Ru, Mk, Hr...
- encoded in TEI P4 (XML)

Corpora
- imported the corpora into NooJ
  - used the NooJ XML import feature
  - kept the MSD feature annotations for adjectives, adverbs, nouns and verbs
  - converted the annotations for these PoS from the MULTEXT-East format to the NooJ format for lexical resources
  - modified some feature annotations, e.g. the MTE verb types auxiliary and copulative became PG (auxiliary verb) in NooJ
- this preprocessing made it possible to design the rules without using Croatian resources for NooJ, i.e. skipping NooJ linguistic analysis
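The tag conversion described on this slide can be illustrated with a minimal sketch. It is not the authors' conversion script. It assumes that a MULTEXT-East MSD is a positional string whose first character is the category (N, V, A, R) and whose second character, for verbs, is the type, with `a` for auxiliary and `c` for copula. The function name `mte_to_nooj` and the output strings such as `V+PG` are hypothetical, chosen only to resemble NooJ's `+`-separated feature notation.

```python
from typing import Optional

# Categories whose annotations were kept: noun, verb, adjective, adverb.
KEPT_CATEGORIES = {"N", "V", "A", "R"}

# Assumption: MTE verb types 'a' (auxiliary) and 'c' (copula) both map to PG.
AUXILIARY_LIKE = {"a", "c"}


def mte_to_nooj(msd: str) -> Optional[str]:
    """Map an MTE-style MSD to a coarse NooJ-style tag, or None if dropped."""
    if not msd or msd[0] not in KEPT_CATEGORIES:
        return None  # annotation for other parts of speech is not kept
    if msd[0] == "V" and len(msd) > 1 and msd[1] in AUXILIARY_LIKE:
        return "V+PG"  # hypothetical rendering of "auxiliary verb"
    return msd[0]


if __name__ == "__main__":
    for tag in ["Va-p3s", "Vmip3s", "Ncmsn", "Sl"]:
        print(tag, "->", mte_to_nooj(tag))
```

Only the category and the verb type are inspected here; a fuller conversion would also carry over case, number and gender, which the sketch leaves out.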

Patterns
- we noticed several types of patterns in which adverbs that are homographic with adjectives occur
- they are defined by their contextual environment

1) Vpg + A* + V      Vpg + R* + V
2) Vpg + A + A*      Vpg + R + A*
3) A* + V            R* + V
4) A + A* + N        R + A* + N
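As a rough illustration of how such contextual patterns can drive retagging, here is a minimal sketch over a list of (word form, tag) pairs. It is not the NooJ local grammars from the talk. It assumes that each pattern pair reads as "tag sequence found, tag sequence after disambiguation", so the rule only records which position changes from A to R. The names `RULES` and `disambiguate` and the example phrases are hypothetical, and the sketch ignores the restriction to neuter nominative singular forms.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, coarse tag)

# (tag sequence as found, index of the tag rewritten from A to R)
RULES = [
    (("Vpg", "A", "V"), 1),  # 1) Vpg + A* + V  ->  Vpg + R* + V
    (("Vpg", "A", "A"), 1),  # 2) Vpg + A + A*  ->  Vpg + R + A*
    (("A", "A", "N"), 0),    # 4) A + A* + N    ->  R + A* + N
    (("A", "V"), 0),         # 3) A* + V        ->  R* + V
]


def disambiguate(tokens: List[Token]) -> List[Token]:
    """Return a copy of tokens with matched adjective tags changed to R."""
    tagged = list(tokens)
    for i in range(len(tagged)):
        for pattern, target in RULES:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) != pattern:
                continue
            form, _ = tagged[i + target]
            tagged[i + target] = (form, "R")
            break  # at most one rule fires per start position
    return tagged


if __name__ == "__main__":
    print(disambiguate([("je", "Vpg"), ("glasno", "A"), ("govorio", "V")]))
    print(disambiguate([("izrazito", "A"), ("hladno", "A"), ("vrijeme", "N")]))
```

The longer patterns are listed before the two-tag pattern so that a sequence starting with Vpg is handled by its own rule first.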

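Vpg + A* + V
(grammar figure not preserved in the transcript)

Vpg + A + A*
(grammar figure not preserved in the transcript)

A* + V
(grammar figure not preserved in the transcript)

A + A* + N
(grammar figure not preserved in the transcript)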

Statistics 1
- manually checked concordances
- errors frequently include the word sve, so we upgraded all grammars in order not to recognize sve

pattern         cw100    orwell   cw100 + orwell
Vpg + A* + V    64 %     62 %     63 %
Vpg + A + A*    100 %    (only this one value survives for the row)
A* + V          82 %     54 %     67 %
A + A* + N      69 %     75 %     70 %
total           77 %     61 %     70 %
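The effect of excluding sve can be approximated outside NooJ by filtering the concordance hits after matching. The sketch below is only that approximation: the talk changed the grammars themselves, and the function name `drop_excluded` and the sample hits are hypothetical.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word form, coarse tag)


def drop_excluded(matches: Iterable[List[Token]],
                  excluded: Iterable[str] = ("sve",)) -> List[List[Token]]:
    """Keep only the matches that contain none of the excluded word forms."""
    banned = {form.lower() for form in excluded}
    return [
        match for match in matches
        if not any(form.lower() in banned for form, _ in match)
    ]


if __name__ == "__main__":
    hits = [
        [("je", "Vpg"), ("glasno", "A"), ("govorio", "V")],
        [("Sve", "A"), ("počinje", "V")],
    ]
    print(drop_excluded(hits))
```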

Example of upgraded grammar

Statistics 2
- obtained results improved after we applied the new grammars
- significant difference between the newspaper and the literature corpus

pattern         cw100    orwell   cw100 + orwell
Vpg + A* + V    100 %    83 %     92 %
Vpg + A + A*    100 %    (only this one value survives for the row)
A* + V          87 %     63 %     74 %
A + A* + N      78 %     100 %    82 %
total           89 %     73 %     83 %

Future work
- the masculine nominative singular forms of relational adjectives are identical to the forms of the adverbs derived from those adjectives by suffixation (junački, pučki, bratski, životinjski)
- disambiguation of these forms also depends on the grammatical context in which they occur, so it can be done in a similar way
- applying the disambiguation rules to other Croatian language resources

Thank you for your attention. The research within the ACCURAT project leading to these results has received funding from the European Union Seventh Framework Programme (FP7/ ), grant agreement no.