Comparative Structures in Croatian: MWU Approach Kristina Kocijan, Sara Librenjak Department of Information and Communication Sciences University of Zagreb.

Presentation transcript:

Comparative Structures in Croatian: MWU Approach
Kristina Kocijan, Sara Librenjak
Department of Information and Communication Sciences, University of Zagreb
Europhras 2015, Malaga, Spain

Language of our work - Croatian
South Slavic language
High similarity to Bosnian, Serbian and Montenegrin
Latin alphabet
Properties:
o Highly inflected (7 cases)
o Syntactically flexible (almost any word order possible)
o Pronoun dropping
A challenge for computational processing

Computational approach to idioms
Comparative structures as a subtype of idiomatic structures
Two manners of computational language processing
o Statistical approach
o Rule-based approach
Idioms
o Highly specific part of language (i.e. replacing one word changes the whole meaning)
o Statistical approach would yield imprecise results
o Rule-based approach preferable, especially when dealing with inflected languages

Importance of idioms in computational processing of texts
Present in language, yet often ignored
Difficult to process – described only linguistically
Causing incomplete computational understanding of the language and imprecise translation
Lack of real data about their frequency
Why are they difficult to process?
o Because of their multi-word nature
o Because of their elusive semantic properties (meaning is not the sum of the words)
o Because of their cultural and historical nuances, which render them very difficult to translate without special preparation

Croatian phraseology and comparisons
Well described linguistically (Croatian Dictionary of Idioms with ~2500 entries)
o Lack of a systematic approach essential for text processing
o Sorted into categories for the purposes of this work
Comparative structures as one of the main categories of idioms
o Radi kao pčela (Working hard as a bee)
o Puši kao Turčin (Smokes like a pipe, lit. like a Turk)
o Brz poput strijele (Fast as an arrow)
Approximately 540 set comparative phrases in Croatian (Fink-Arsovski)

Comparisons in literature and beyond
Comparative structures (usporedbe or poredbe) mainly a feature of literary texts and newspapers
o Filaković (2008) assumes their presence in works of fiction by analyzing the works of Croatian writer I. B. Mažuranić
o Kovačević (2012) reports linguistic creativity in the use of comparative structures in newspaper articles
o Mance and Trtanj (2010) note the usage of modern slang variants of the comparisons
No statistical data about their real usage in various types of text

Goals of this work
To build a tool for automated processing of comparative idioms in Croatian texts
To be able to recognize them in any type of text as multi-word units
o Extract, describe and enumerate the structures
o Collect statistical data about their frequency in different styles of texts
o Serve as an example for similar work in other languages
o Be used as a tool in automated or semi-automated machine translation of Croatian to any language (given additional work)

NooJ – a tool for rule-based automated text processing
NooJ – a free-to-use linguistic development environment for various kinds of rule-based automated text and corpora processing
Morphological, syntactic and semantic processing with options for translation and transformation of sentences
Ready-made resources for about two dozen languages:
o Acadian, Arabic, Armenian, Belarusian, Bulgarian, Catalan, Croatian, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Polish, Portuguese, Russian, Serbian, Slovene, Spanish, Turkish, Vietnamese
Great tool for highly inflected languages

Methodology
1. Listing and categorizing the idioms
2. Definition and recognition of rules
3. Construction of training and testing corpora
4. Construction of grammars for processing texts
o Using NooJ as a platform
5. Testing phase
6. Calculation of results

Listing and categorizing the idioms
Based on the Croatian Dictionary of Idioms and idioms manually found in a Croatian corpus
For the purposes of the computational approach, we defined five major categories
a) Noun phrase with an attribute or apposition
b) Verbal phrase with a direct object
c) Verbal phrase with an optional direct object which can disrupt the syntactic structure
d) Comparative structure (A/V as N)
e) Fixed phrase which doesn't change in any syntactic environment

Definition and recognition of rules
312 different comparative constructions in our dictionary
o Recognized in any form, tense, case and word order
Divided into 5 subcategories according to their syntactic properties
1. Adjective AS Noun = 89
2. Noun AS Preposition = 9
3. AS a Noun/Adjective = 49
   1. AS a Noun (7)
   2. AS a PP fixed phrase (37)
   3. AS a N + PP (5)
4. Verb AS Noun =
5. AS IF Verb =
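The counts on this slide can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: it assumes the five subcategories partition the 312 dictionary entries, and it uses `None` for the two counts that the transcript does not state. The names `SUBCATEGORIES`, `AS_A_BREAKDOWN`, `known_total` and `unaccounted` are illustrative, not part of the original work.

```python
# Counts as given on the slide; None marks a count missing from the transcript.
TOTAL_CONSTRUCTIONS = 312

SUBCATEGORIES = {
    "Adjective AS Noun": 89,
    "Noun AS Preposition": 9,
    "AS a Noun/Adjective": 49,
    "Verb AS Noun": None,
    "AS IF Verb": None,
}

# Breakdown of the third subcategory, as listed on the slide.
AS_A_BREAKDOWN = {
    "AS a Noun": 7,
    "AS a PP fixed phrase": 37,
    "AS a N + PP": 5,
}


def known_total(counts):
    """Sum only the counts that are actually stated."""
    return sum(value for value in counts.values() if value is not None)


def unaccounted(total, counts):
    """Entries left over for the subcategories whose counts are missing."""
    return total - known_total(counts)


if __name__ == "__main__":
    print("breakdown of subcategory 3:", sum(AS_A_BREAKDOWN.values()))
    print("stated counts:", known_total(SUBCATEGORIES))
    print("left for the two unstated subcategories:",
          unaccounted(TOTAL_CONSTRUCTIONS, SUBCATEGORIES))
```

Under that partition assumption, the three stated counts leave the remainder to be shared between "Verb AS Noun" and "AS IF Verb"; the split between the two cannot be recovered from the transcript.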

Construction of training and testing corpora
First phase: training
o A smaller corpus of sentences exclusively containing the structures in question (comparative structures with the words "kao" or "poput")
Second phase: testing
o After the completion of the grammars (NooJ files for processing texts), results are tested on the bigger corpora
o Corpus 1: random texts of different styles from the Web corpus (2.2 million words)
o Corpus 2: literary texts by mostly Croatian authors (658 Kw)
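A first pass at gathering training sentences could look like the Python sketch below. It only pre-selects sentences that contain "kao" or "poput"; the naive sentence splitter and the function name `candidate_sentences` are assumptions made for illustration, and the slides do not say how the training corpus was actually assembled.

```python
import re

# The two comparison markers named on the slide.
MARKERS = re.compile(r"\b(?:kao|poput)\b", re.IGNORECASE)


def candidate_sentences(text):
    """Return sentences that contain at least one comparison marker.

    Sentences are split naively after '.', '!' or '?' followed by whitespace,
    which is an approximation and not a real sentence tokenizer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if MARKERS.search(sentence)]


if __name__ == "__main__":
    sample = "Radi kao pčela. Danas pada kiša. Brz je poput strijele!"
    for sentence in candidate_sentences(sample):
        print(sentence)
```

Containing a marker is a necessary condition only: "kao" also occurs in ordinary, non-idiomatic comparisons, so the selected sentences would still need checking against the dictionary of set phrases.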

Construction of grammars for processing texts
Grammar – a file constructed in the NooJ environment for syntactic processing of texts
Input, output, variables, nested grammars
Concordance with marked text as the output

Adjective AS Noun
Recognizes:
o Lijep kao slika (pretty as a picture)
o Pijan kao smuk (drunk as a sponge)
o Brz kao zec (fast as a bullet)
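The idea behind the "Adjective AS Noun" grammar can be approximated outside NooJ with regular expressions. The Python sketch below is not the authors' grammar: it replaces NooJ's inflectional dictionaries with crude stem matching, and the mini-lexicon `COMPARISONS` (stems chosen by hand from the examples above) is hypothetical.

```python
import re

# Hypothetical mini-lexicon of (adjective stem, noun stem) pairs taken from
# the slide's examples. Stem + "\w*" is a rough stand-in for real inflection.
COMPARISONS = [
    ("lijep", "slik"),  # lijep kao slika
    ("pijan", "smuk"),  # pijan kao smuk
    ("brz", "zec"),     # brz kao zec
]


def build_pattern(adj_stem, noun_stem):
    """Adjective stem + ending, then 'kao' or 'poput', then noun stem + ending."""
    return re.compile(
        rf"\b{adj_stem}\w*\s+(?:kao|poput)\s+{noun_stem}\w*",
        re.IGNORECASE,
    )


PATTERNS = [build_pattern(adj, noun) for adj, noun in COMPARISONS]


def find_comparisons(text):
    """Return each matched comparison as one multi-word unit.

    Results are grouped by lexicon entry, not by position in the text.
    """
    hits = []
    for pattern in PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


if __name__ == "__main__":
    print(find_comparisons("Bila je lijepa kao slika, a on brz poput zeca."))
```

Stem matching covers gender and case variation ("lijepa", "zeca") but also over-generates, for example on any other word that happens to start with "brz", and it does not handle free word order. A proper morphological dictionary, as used in NooJ, avoids both problems.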

Noun AS Preposition
Recognizes:
o Mrak kao u rogu (pitch dark)
AS a Noun
Recognizes:
o Kao drvena Marija (being stiff, unrelaxed)
o Poput guske u magli (without thinking)

Verb AS Noun
Recognizes:
o Ići kao po loju (go smoothly, lit. as if sliding over fat)
o Šutjeti kao grob (be silent as a grave)
AS IF Verb
Recognizes:
o Kao da je u zemlju propao (as if the Earth swallowed him)
o Kao da je pao s Marsa (clueless, as if he came from Mars)

Example of results Comparative structure

Evaluation

Corpus | Kilo-words (Kw) | Number of structures found | Precision | Recall | F-measure
Training corpus | | | 100% | 96% | 98%
Corpus 1 (web) | 2247 Kw | 22 | | |
Corpus 2 (books) | 658 Kw | 67 | | |
Average | | | | |
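The training-corpus F-measure is consistent with the usual balanced F1 definition, the harmonic mean of precision and recall. The slide does not say which variant was used, so F1 is an assumption here; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Training-corpus figures from the evaluation slide.
    score = f_measure(1.00, 0.96)
    print(f"F-measure: {score:.0%}")
```

With precision 100% and recall 96%, this gives about 0.98, which matches the 98% reported for the training corpus.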

Conclusions about comparison in Croatian
Number of comparative structures in different types of texts varies greatly
o General texts (web corpus) – 1 per every [...] words
o Literary texts (books by Croatian authors) – 1 per every 1000 words
Confirmed the hypothesis that such structures pertain mostly to the literary style
o 10 times more frequent in books and works of fiction
o Rare in other styles of writing due to the stylistic marking they bring to the text
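The relative frequency of comparative structures in the two test corpora can be recomputed from the corpus sizes and hit counts on the evaluation slide. The sketch assumes that Kw means thousands of running words and that the counts 22 and 67 are the totals for the whole corpora; the function name `words_per_structure` is illustrative.

```python
def words_per_structure(kilo_words, structures_found):
    """Average number of running words per recognized comparative structure."""
    return kilo_words * 1000 / structures_found


if __name__ == "__main__":
    # Corpus sizes (in Kw) and counts taken from the evaluation slide.
    web = words_per_structure(2247, 22)
    books = words_per_structure(658, 67)

    print(f"web corpus: 1 structure per {web:,.0f} words")
    print(f"books:      1 structure per {books:,.0f} words")
    print(f"books vs web: {web / books:.1f} times more frequent")
```

With these inputs the ratio comes out at a little over 10, in line with the statement that the structures are about 10 times more frequent in books.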

Thank you for your attention. Questions?