
Comparative Structures in Croatian: MWU Approach Kristina Kocijan, Sara Librenjak Department of Information and Communication Sciences University of Zagreb.


1 Comparative Structures in Croatian: MWU Approach
Kristina Kocijan, Sara Librenjak
Department of Information and Communication Sciences, University of Zagreb
krkocijan@ffzg.hr, sara.librenjak@gmail.com
Europhras 2015, Malaga, Spain, 2015-07-01

2 Language of our work - Croatian
South Slavic language
High similarity to Bosnian, Serbian and Montenegrin
Latin alphabet
Properties: highly inflected (7 cases), syntactically flexible (almost any word order possible), pronoun dropping
A challenge for computational processing
19.1.2016 2

3 Computational approach to idioms
Comparative structures as a subtype of idiomatic structures
Two manners of computational language processing
o Statistical approach
o Rule-based approach
Idioms
o Highly specific part of language (i.e. replacing one word changes the whole meaning)
o Statistical approach would yield imprecise results
o Rule-based approach preferable, especially when dealing with inflected languages

4 Importance of idioms in computational processing of texts
Present in language, yet often ignored
Difficult to process – described only linguistically
Causing incomplete computational understanding of the language and imprecise translation
Lack of real data about their frequency
Why are they difficult to process?
Because of their multi-word nature
Because of their elusive semantic properties (meaning is not the sum of the words)
Because of their cultural and historical nuances, which render them very difficult to translate without special preparation

5 Croatian phraseology and comparisons
Well described linguistically (Croatian Dictionary of Idioms with ~2500 entries)
o Lack of a systematic approach, which is essential for text processing
o Sorted into categories for the purposes of this work
Comparative structures as one of the main categories of idioms
o Radi kao pčela (Working hard as a bee)
o Puši kao Turčin (Smokes like a pipe, lit. like a Turk)
o Brz poput strijele (Fast as an arrow)
Approximately 540 set comparative phrases in Croatian (Fink-Arsovski)

6 Comparisons in literature and beyond
Comparative structures (usporedbe or poredbe) mainly a feature of literary texts and newspapers
o Filaković (2008) assumes their presence in works of fiction by analyzing the works of Croatian writer I. B. Mažuranić
o Kovačević (2012) reports linguistic creativity in the use of comparative structures in newspaper articles
o Mance and Trtanj (2010) note the usage of modern slang variants of the comparisons
No statistical data about their real usage in various types of text

7 Goals of this work
To build a tool for automated processing of the comparative idioms in Croatian texts
To be able to recognize them in any type of text as multi-word units
o Extract, describe and enumerate the structures
o Collect statistical data about their frequency in different styles of texts
o Serve as an example for similar work in other languages
o Be used as a tool in automated or semi-automated machine translation of Croatian to any language (given additional work)

8 NooJ – a tool for rule-based automated text processing
NooJ – free-to-use linguistic development environment for various kinds of rule-based automated text and corpora processing
http://nooj4nlp.net/
Morphological, syntactic and semantic processing with options for translation and transformation of sentences
Ready-made resources for about two dozen languages:
o Acadian, Arabic, Armenian, Belarusian, Bulgarian, Catalan, Croatian, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Polish, Portuguese, Russian, Serbian, Slovene, Spanish, Turkish, Vietnamese
Great tool for highly inflected languages

9 Methodology
1. Listing and categorizing the idioms
2. Definition and recognition of rules
3. Construction of training and testing corpora
4. Construction of grammars for processing texts
   o Using NooJ as a platform
5. Testing phase
6. Calculation of results

10 Listing and categorizing the idioms
Based on the Croatian Dictionary of Idioms and idioms found manually in a Croatian corpus
For the purposes of the computational approach, we defined five major categories
a) Noun phrase with an attribute or apposition
b) Verbal phrase with a direct object
c) Verbal phrase with an optional direct object which can disrupt the syntactic structure
d) Comparative structure (A/V as N)
e) Fixed phrase which doesn't change in any syntactic environment

11 Definition and recognition of rules
312 different comparative constructions in our dictionary
o Recognized in any form, tense, case and word order
Divided into 5 subcategories according to their syntactic properties
1. Adjective AS Noun = 89
2. Noun AS Preposition = 9
3. AS a Noun/Adjective = 49
   1. AS a Noun (7)
   2. AS a PP fixed phrase (37)
   3. AS a N + PP (5)
4. Verb AS Noun = 157
5. AS IF Verb = 8
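
As a quick consistency check on the counts on this slide, the five subcategory totals can be summed and compared with the 312 constructions reported for the dictionary. The Python sketch below only restates numbers from the slide; the variable names are illustrative.

```python
# Counts of comparative constructions per subcategory, copied from the slide.
subcategories = {
    "Adjective AS Noun": 89,
    "Noun AS Preposition": 9,
    "AS a Noun/Adjective": 49,
    "Verb AS Noun": 157,
    "AS IF Verb": 8,
}

# Breakdown of the third subcategory, also copied from the slide.
as_noun_adjective_parts = {
    "AS a Noun": 7,
    "AS a PP fixed phrase": 37,
    "AS a N + PP": 5,
}

total = sum(subcategories.values())
parts_total = sum(as_noun_adjective_parts.values())

print(total)        # 312, matching the dictionary size stated on the slide
print(parts_total)  # 49, matching the "AS a Noun/Adjective" count
```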

12 Construction of training and testing corpora
First phase: training
o A smaller corpus of sentences exclusively containing the structures in question (comparative structures with the words „kao” or „poput”)
Second phase: testing
o After the completion of the grammars (NooJ files for processing texts), results are tested on the bigger corpora
o Corpus 1: random texts from the Web corpus, in different styles of text (2.2-million-word corpus)
o Corpus 2: literary texts by mostly Croatian authors (658 Kw corpus)
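
The first phase can be pictured as a filter that keeps only sentences containing one of the two comparison markers. The Python sketch below is an illustration under stated assumptions, not the procedure used in this work: the sentence splitter is deliberately naive, and the function names `split_sentences` and `select_training_sentences` are hypothetical.

```python
import re

# The two comparison markers named on the slide.
MARKERS = ("kao", "poput")
# Match a marker only as a whole word, so "Kakao" does not count as "kao".
_MARKER_RE = re.compile(r"\b(?:" + "|".join(MARKERS) + r")\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_training_sentences(text):
    """Keep only the sentences that contain a comparison marker."""
    return [s for s in split_sentences(text) if _MARKER_RE.search(s)]


sample = "Radi kao pčela. Danas pada kiša. Brz je poput strijele! Kakao je sladak."
print(select_training_sentences(sample))
# ['Radi kao pčela.', 'Brz je poput strijele!']
```

A marker filter like this only narrows the candidates: a sentence with „kao” is not necessarily a set comparison, which is why the grammars are still needed.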

13 Construction of grammars for processing texts
Grammar – a file constructed in the NooJ environment, made for syntactic processing of texts
Input, output, variables, nested grammars
Concordance with marked texts as an output

14 Adjective AS Noun
Recognizes:
o Lijep kao slika (pretty as a picture)
o Pijan kao smuk (drunk as a sponge)
o Brz kao zec (fast as a bullet)
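
The grammar itself is a NooJ file and is not reproduced in this transcript. As a rough analogue only, the Python sketch below shows the general idea of matching an "Adjective AS Noun" pattern against a dictionary of set comparisons, regardless of the inflected form of the adjective. Everything in it is an assumption made for illustration: the tiny hand-listed `LEXICON`, the `KNOWN_COMPARISONS` set and the function `find_adjective_as_noun` are hypothetical and do not reflect NooJ's dictionaries or grammar formalism.

```python
# Hypothetical mini-lexicon: inflected form -> (lemma, part of speech).
LEXICON = {
    "lijep": ("lijep", "A"), "lijepa": ("lijep", "A"), "lijepo": ("lijep", "A"),
    "pijan": ("pijan", "A"), "pijana": ("pijan", "A"),
    "brz": ("brz", "A"), "brza": ("brz", "A"),
    "slika": ("slika", "N"), "smuk": ("smuk", "N"), "zec": ("zec", "N"),
}

# Set comparisons from the slide, stored as (adjective lemma, noun lemma).
KNOWN_COMPARISONS = {("lijep", "slika"), ("pijan", "smuk"), ("brz", "zec")}
MARKERS = {"kao", "poput"}


def find_adjective_as_noun(sentence):
    """Return every 'Adjective kao/poput Noun' span listed in KNOWN_COMPARISONS."""
    tokens = [t.strip(".,!?;:").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    hits = []
    for i in range(len(tokens) - 2):
        adj = LEXICON.get(tokens[i])
        noun = LEXICON.get(tokens[i + 2])
        if (
            adj is not None and noun is not None
            and adj[1] == "A" and noun[1] == "N"
            and tokens[i + 1] in MARKERS
            and (adj[0], noun[0]) in KNOWN_COMPARISONS
        ):
            hits.append(" ".join(tokens[i:i + 3]))
    return hits


print(find_adjective_as_noun("Bila je lijepa kao slika, a on pijan kao smuk."))
# ['lijepa kao slika', 'pijan kao smuk']
```

Matching on lemmas rather than surface forms is what lets "lijepa kao slika" count as the same comparison as "lijep kao slika". A fixed three-token window is a simplification, since the slides state that the real grammars also handle different word orders.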

15 Noun AS Preposition
Recognizes:
o Mrak kao u rogu (pitch dark)
AS a Noun
Recognizes:
o Kao drvena Marija (being stiff, unrelaxed)
o Poput guske u magli (without thinking)

16 Verb AS Noun
Recognizes:
o Ići kao po loju (go smoothly, slide like over the fat)
o Šutjeti kao grob (be silent as a grave)
AS IF Verb
Recognizes:
o Kao da je u zemlju propao (as if the Earth swallowed him)
o Kao da je pao s Marsa (clueless, as if he came from Mars)

17 Example of results: comparative structure

18 Evaluation
Columns: kilowords (Kw), number of structures found, precision, recall, F-measure
Training corpus: precision 100%, recall 96%, F-measure 98%
Corpus 1 (web): 2247 Kw, 22 structures found
Corpus 2 (books): 658 Kw, 67 structures found
Average
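
The training-corpus figures (precision 100%, recall 96%, F-measure 98%) are consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. The slide does not say which F-measure variant was used, so the formula in the Python sketch below is an assumption; `f_measure` is an illustrative name.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Training-corpus values from the slide, expressed as fractions.
f = f_measure(1.00, 0.96)
print(f"{f:.2%}")  # 97.96%, which rounds to the 98% on the slide
```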

19 Conclusions about comparison in Croatian
The number of comparative structures varies greatly between different types of texts
o General texts (web corpus) – 1 per every 10000 words
o Literary texts (books by Croatian authors) – 1 per every 1000 words
Confirmed the hypothesis that such structures belong mostly to the literary style
o 10 times more frequent in books and works of fiction
o Rare in other styles of writing due to the stylistic marking they bring to the text
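
The "10 times more frequent" statement can be checked against the evaluation slide. The Python sketch below assumes that the 22 and 67 given there are the numbers of structures found in the 2247 Kw web corpus and the 658 Kw book corpus; that reading of the evaluation table is an assumption. Only the ratio between the two corpora is computed, and `per_kiloword` is an illustrative name.

```python
def per_kiloword(structures_found, kilowords):
    """Relative frequency: structures found per 1000 words of corpus."""
    return structures_found / kilowords


# Assumed reading of the evaluation slide: (structures found, corpus size in Kw).
web = per_kiloword(22, 2247)
books = per_kiloword(67, 658)

ratio = books / web
print(round(ratio, 1))  # 10.4, i.e. roughly ten times more frequent in books
```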

20 Thank you for your attention. Questions?

