"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek Amebis, d.o.o., Kamnik, Slovenia Jožef Stefan Institute,

"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek Amebis, d.o.o., Kamnik, Slovenia Jožef Stefan Institute, Ljubljana, Slovenia

European Union & Slovene Ministry of Education and Sport The operation is partly financed by the European Union, the European Social Fund, and the Ministry of Education and Sport of the Republic of Slovenia. The operation is being carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013.

“Communication in Slovene” http://www.slovenscina.eu Leading partner: Amebis, d. o. o., Kamnik Duration: June 2008 - December 2013 Total value: 3.2 million Euro Project consortium: Amebis, d. o. o., Kamnik Jozef Stefan Institute University of Ljubljana Scientific Research Centre of the Slovenian Academy of Sciences and ArtsScientific Research Centre of the Slovenian Academy of Sciences and Arts Trojina, Institute for Applied Slovene Studies

Goals Natural Language Processing Tools and Resources Didactics Language description (and standardization) Language Data

Slovene Lexical Database

Timeline Number of lexical units: minimum 2,500 June-October 2008: preparation November 2008-June 2009: specifications June 2010 June 2011 June 2012

Legal aspects Creative Commons –Attribution –Share Alike –Noncommercial Availabitity –On-line (http://www.termania.net/)http://www.termania.net/ –Dataset (http://www.slovenscina.eu/)http://www.slovenscina.eu/ Owner: Ministry of Education and Sports Future: Slovene HLT Agency?

Past experience International (early): –GENELEX (1990-94) –LE PAROLE (1993-98) –SIMPLE (1998-2002) ----------------------------------------------- –ACQUILEX I, II (- 1995) –ILC- DELIS … Individual languages: elexico (DE), CLIPS (IT), CORNETTO (NL), ALFALEX (FR), STO (DK), ADESSE (SP), GRIAL (SP), CEGLEX (PL), SALDO (S), BLF (FR), PRALED (CZ),... elexicoCLIPSCORNETTO ALFALEX STOADESSEGRIALCEGLEXSALDOBLF Important for us: FrameNet, Corpus Pattern Analysis, DANTE, COBUILD

Basics corpus data analysis lexicogrammatical approach semantics and syntax are not separated valency – colligation – collocation meaning = meaning potential –is not stable (norms & exploitations) lumpers vs. splitters = splitters lexicography first, NLP second

semantic indicator semantic frame syntactic structure & pattern syntactic combination collocation extended collocation example phraseology Lexicogrammatical continuum

I. LEXICAL UNIT headword to squeeze part-of-speech verb VI. PHRASEOLOGY phraseological unit to squeeze a quart into a pint pot II. SENSE indicator 1. grip firmly 2. press out liquid frame If a PERSON squeezes an OBJECT, If a PERSON squeezes a LIQUID s|he presses it firmly, usually or a SOFT SUBSTANCE out of with his|her hands. an OBJECT, s|he gets the liquid or substance out by pressing the object. multi-word unit (only nouns and adjectives) IV. COLLOC'S collocation to squeeze (sb's) [hand, arm] to squeeze [the poison, the venom] out V. EXAMPLES example I squeezed her hand gratefully.She immediately squeezed the poison out and that probably saved her life. III. SYNTAX structure vb-obj vb-out-obj pattern sb squeezes sth sb squeezes sth out combination (to squeeze your eyes shut)

I. Lexical Unit link to the lexicon –morphosyntactic information –corpus frequency –pronunciation etc. additional grammatical information –can be inferred (un/countability etc.) –manual (part-of-speech subtypes etc.)

II. Semantic Level Semantic Indicators –simple EFL-like explanations or synonyms forming a sense menu –self-explanatory in relation to each other Semantic Frames –COBUILD / FrameNet / Corpus Pattern Analysis –combination of the systems

Semantic Indicators 1 padat déšť 1.1 o věcech 2 objevovat se ve velkém množství pršet sloveso

Semantic Frames identification of verb/semantic arguments –prototypical pattern – “the norm” (Hanks) –the headword in its syntactic environement identification of semantic types in particular syntactic positions the semantic scenario –a full-sentence definition making a link between the arguments and the situation (FN) typical for a particular sense

Semantic Frame když prší, padají kapky z mraků na zem 1 padat déšť když VĚCI nebo jejich SOUČÁSTI prší, padají jako kapky deště na zem 1.1 o věcech když KRITIKY nebo DOTAZY prší, znamená to, že že je jich hodně 2 objevovat se ve velkém množství

III. Syntactic Level semantic frame (between semantics and syntax) semantic arguments in capital letters (ID-ed) linked with collocates via syntax syntactic structures (formal) clause and phrase level (all POS; only for NLP) the number of syntactic structures is finite (SLB ~290) source: word sketches (Sketch Engine) syntactic patterns (verbalized) valency (only verbs; for lexicography and NLP) syntactic combinations more than basic patterns: "to squeeze your eyes shut"

Syntactic Structures NP/S+pršet ADV+pršet když KRITIKY nebo DOTAZY prší, znamená to, že že je jich hodně 2 objevovat se ve velkém množství

Syntactic Patterns NP/S+pršet –co prší –co prší na co/kogo když KRITIKY nebo DOTAZY prší, znamená to, že že je jich hodně 2 objevovat se ve velkém množství

IV. Collocation Level ● SEMANTIC FRAME: 1 když prší, padají kapky z mraků na zem 2 když KRITIKY nebo DOTAZY prší, znamená to, že že je jich hodně ● SYNTACTIC STRUCTURES AND PATTERNS: 1 NP/S+pršet LOKACE 2 NP/S+pršet co prší pršet na co/koho co prší pršet na čem co prší na co/kogo If a part of syntactic patterns are collocational, they are shown on the collocation level. ● COLLOCATIONS ■ [kapky, déšť] pršet ■ [kritika, dotazy] prší ■ pršet na [zem] ■ pršet na [hlavu]

V. Examples ● COLLOCATIONS ■ [kapky, déšť] pršet ■ [kritika, dotazy] prší ■ pršet na [zem] ■ pršet na [hlavu] ● EXAMPLES (TBL + GDEX) Dívám se z okna, jak prší déšť. Tato klenba zadržuje vodu, která pak skrze průduchy prší na zem. Nevýhodou přilby s otvory je, že při dešti Vám prší na hlavu. Na nakladatelství pršely dotazy, zda kniha vyjde i česky. Zdrcující kritika pršela na adresu vlády i na tiskové konferenci, kterou v úterý uspořádal Svaz obchodu a cestovního ruchu (SOCR).

reference general user school population Slovene as foreign language semantic info menus + frames collocations corpus examples natural language processing computer linguist FOR WHAT FOR WHOM WHAT semantic frames syntactic structures syntactic patterns syntactic structures syntactic patterns other grammatical info

Corpus Data & Authoring Tools FidaPLUS – Gigafida Sketch Engine: www.sketchengine.co.ukwww.sketchengine.co.uk –Slovene Sketch Grammar (LBS syn. structures) –Tick-box Lexicography –GDEX IDM Dictionary Production System –http://www.idm.fr/products/dictionary_writing_system/27/http://www.idm.fr/products/dictionary_writing_system/27/ –custom DTD

FidaPLUS (Gigafida) precursor: FIDA (1997-2000 – 100 million) 621 million tokens tagged-lemmatized (85% accuracy – rule-based tagger) taxonomy –text types –medium –linguistic proof-reading time span: 1990 – 2006 concordancers –http://www.fidaplus.net/http://www.fidaplus.net/ –http://www.sketchengine.co.uk/http://www.sketchengine.co.uk/

SLB sketch grammar + TBL to love

TBL – examples by GDEX

TBL – Entry Editor

GDEX – Good Dictionary Examples system for evaluation (ranking) of sentences with respect to their suitability to serve as dictionary examples sorting sentences so that good examples do not have to be searched for in hundreds of unusable sentences initially trained on English, it did not give good results for other languages

Evaluation

Authoring & search tools IDM Dictionary Production System –currently used by lexicographers iLex (http://www.emp.dk/)http://www.emp.dk/ –in the process of evaluation T-Lex (http://tshwanedje.com/)http://tshwanedje.com/ –evaluated, stand-by ABBYY (http://www.abbyy.com/lingvo_content/)http://www.abbyy.com/lingvo_content/ –in the process of evaluation Termania (http://www.termania.net/)http://www.termania.net/ –online search and vizualization tool

Corpora and web concordancers

Corpora Gigafida –corpus of written texts KRES –smaller and more carefully balanced corpus of written texts GOS (Govorjena slovenščina) –corpus of spoken Slovene Šolar –corpus of school essay transcriptions with teachers’ corrections

CONCORDANCERS Gigafida, KRES, Šolar –written –http://demo.gigafida.net/http://demo.gigafida.net/ GOS –spoken –http://www.korpus-gos.net/http://www.korpus-gos.net/

Gigafida new generation in the written corpus series –FIDA (2000), FidaPLUS (2006), Gigafida (2011) 1,148,350,213 tokens (1.15 billion) simplified taxonomy changed copyright status –10% can be used freely (downloadable as a data set) –no authentication for web access new annotation tools

Corpus annotation new statistical tagger: 92.17 % meta-tagger – a combination of the Amebis rule- based tagger and the new statistical tagger new lemmatizer: 98-99 % new parser under development: MSTParser training corpus: –500.000 words: manually verified POS tags –200.000 words (~11.300 sentences): manually verified dependency treebank with only 10 lables

Taxonomy

KRES & free corpus KRES (in development) –100 million words –online –balanced Free corpus (in development) –100 million words –10% of each corpus document –downloadable data set

Taxonomy KRES

GOS the first corpus of spoken Slovene –120 hours of speech –one million words criteria –demographic –speech type/situation –additional (language learning, 15%) transcription –pronunciation-based –standardized

Web concordancers Log analysis of FidaPLUS concordancer FidaPLUS web survey Analysis of existing corpus tools Analysis of popular web tools (Google etc.) Final goal –use in classroom and by general public –linguists can use existing tools (SkE, CWB, etc.)

Survey – findings Simple search – regularly used by 72% users Advanced search – rarely used (only 8% use it regularly) Lack of intuitiveness The manual is almost key to learning how to use a corpus tool “…if you are not using the interface for a while, you forget what the search commands are, and you don’t (want to) bother with looking into the manual” “…the interface should have a modern design, it should be more user-friendly, and its use should be clear and transparent”

Main design principles similarity to the well-known non-linguistic tools (e.g. Google) No registration Minimum navigation No redundant functions (less is more) Simplicity of searches Help and tips in pop-up windows Simple descriptions of functionality (no terminology )

The result two concordancers –written corpora: Gigafida, other w. corpora –spoken corpus: GOS only one meta-character: quotation marks extensive use of filters –multiple possible lemmas –use of capital letters –immediate access to meta-information

"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek Amebis, d.o.o., Kamnik, Slovenia Jožef Stefan Institute,

Similar presentations

Presentation on theme: ""Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek Amebis, d.o.o., Kamnik, Slovenia Jožef Stefan Institute,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

"Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek Amebis, d.o.o., Kamnik, Slovenia Jožef Stefan Institute,

Similar presentations

Presentation on theme: ""Communication in Slovene" with an emphasis on the Slovene lexical database and corpora Simon Krek Amebis, d.o.o., Kamnik, Slovenia Jožef Stefan Institute,"— Presentation transcript:

Similar presentations

About project

Feedback