
New Slovene corpora within the »Communication in Slovene« project
Nataša Logar Berginc (University of Ljubljana) and Simon Krek (Amebis, Kamnik)


1 New Slovene corpora within the »Communication in Slovene« project
Nataša Logar Berginc, University of Ljubljana, Faculty of Social Sciences, natasa.logar@fdv.uni-lj.si
Simon Krek, Amebis, Kamnik; Jozef Stefan Institute, simon.krek@guest.arnes.si

2 “Communication in Slovene”
Web site: http://www.slovenscina.eu
Leading partner: Amebis, d. o. o., Kamnik
Duration: June 2008 - December 2013
Total value: 3.2 million euros
Project consortium:
– Amebis, d. o. o., Kamnik
– Jozef Stefan Institute
– University of Ljubljana
– Scientific Research Centre of the Slovenian Academy of Sciences and Arts
– Trojina, Institute for Applied Slovene Studies

3 Language data
Three corpora of Slovene:
– a billion-word written corpus: GigaFIDA
– a 100-million-word balanced subcorpus: KRES
– a million-word corpus of spoken Slovene: GOS

4 Other activities
NLP tools & resources
– statistical tagger and parser
– training corpus (500,000 words)
– lexicon (100,000 lemmas)
Language learning
– integration of resources & tools in Slovene language teaching
– pedagogical corpus interface
– pedagogical corpus-based grammar
Language description
– lexical database (NLP & lexicography)
– manual of style

5 Goals

6 GigaFIDA
a billion-word written corpus
linguistic annotation
– lemmatized
– morpho-syntactically annotated
– partly syntactically annotated
format
– XML TEI P5 format
purpose
– data for the new Slovene lexical database, pedagogical grammar and manual of style
– freely available on the web
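As a rough illustration of what lemmatized, morpho-syntactically annotated text in a TEI P5 style can look like, the sketch below reads word tokens from a small XML fragment. The fragment is hypothetical: the slides do not show the actual GigaFIDA encoding, so the use of `<w>` elements with `lemma` and `ana` attributes, the sample sentence and the tag values are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sentence; the attribute layout and the tag values are
# illustrative only and are not taken from the corpus.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="jezik" ana="#Ncmsn">Jezik</w>
  <w lemma="biti" ana="#Va-r3s-n">je</w>
  <w lemma="živ" ana="#Agpmsnn">živ</w>
  <pc>.</pc>
</s>"""


def read_tokens(xml_text):
    """Return (word form, lemma, morpho-syntactic tag) for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # Drop the leading '#' that marks the tag as a pointer.
        tag = w.get("ana", "").lstrip("#")
        tokens.append((w.text, w.get("lemma"), tag))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, tag in tokens:
    print(form, lemma, tag)
```

Punctuation (`<pc>`) is skipped here on purpose, since only word tokens carry a lemma in this sketch.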

7 A bit of FIDA history
FIDA corpus
– 1997-2000
– 100 million words
– available for project partners (academic & industrial)
FidaPLUS corpus
– 2005-2006
– 620 million words
– publicly available in the web concordancer
– available for partners as a data set
– text type: fiction 3.5%, non-fiction 96.5% (90% newspapers and magazines)

8 KRES
a 100-million-word written subcorpus
criteria
– balanced (text types, production-reception etc.)
– text quality (processing & annotation)
– copyright issues: 10%
purpose
– downloadable as a data set
– freely available for research (BNC style)
– Creative Commons (Authorship, Non-Commercial)

9 New taxonomy
                            KRES   GigaFIDA
Print                       80     50 <> 90
  Books                     35     15 <> 35
    Fiction                 17     20 <> 50
    Non-fiction             18     30 <> 60
  Periodicals               40     20 <> 40
    Newspapers              20     30 <> 70
    Magazines               20     30 <> 70
  Other                     5      5 <> 10
Internet                    20     10 <> 50
  News sites                8      30 <> 70
  Corp. & govern. sites     12     30 <> 70
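The KRES column of the taxonomy can be checked for internal consistency with a few lines of arithmetic. The sketch below assumes that the figures are percentages and that the categories nest as Print (Books, Periodicals, Other) and Internet (News sites, Corp. & govern. sites); the slide does not state either point, and the nesting is inferred from the fact that the numbers add up.

```python
# KRES shares per text type as read from the slide (assumed to be percentages).
KRES = {
    "Print": 80, "Books": 35, "Fiction": 17, "Non-fiction": 18,
    "Periodicals": 40, "Newspapers": 20, "Magazines": 20, "Other": 5,
    "Internet": 20, "News sites": 8, "Corp. & govern. sites": 12,
}

# Assumed parent -> children nesting; not stated on the slide.
TREE = {
    "Print": ["Books", "Periodicals", "Other"],
    "Books": ["Fiction", "Non-fiction"],
    "Periodicals": ["Newspapers", "Magazines"],
    "Internet": ["News sites", "Corp. & govern. sites"],
}


def inconsistent_parents(shares, tree):
    """Return the parents whose share differs from the sum of their children."""
    return [
        parent
        for parent, children in tree.items()
        if shares[parent] != sum(shares[child] for child in children)
    ]


top_level_total = KRES["Print"] + KRES["Internet"]
print(inconsistent_parents(KRES, TREE), top_level_total)
```

Under these assumptions every parent equals the sum of its children and the two top-level categories cover the whole subcorpus.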

10 GOS
a million-word corpus of spoken Slovene
– 120 hours of speech
criteria
– demographic
– speech type/situation
– additional (language learning, 15%)
transcription
– pronunciation-based
– standardized

11 Demographic criteria
– sex: 50% M
– age: <34: 40%
– education: primary/secondary school: 70%
– region: SW: 35%, Ljubljana r.: 25%, NE: 25%, Maribor r.: 15%

12 Speech type/situation criteria
– public/non-public discourse: 60% : 40%
– media: face-to-face c.: 50%, telephone: 10%, radio: 20%, TV: 20%
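To get a feel for what such quotas mean in a corpus of a million words, the shares can be turned into approximate word targets. This is only a sketch: it assumes the percentages apply to word counts, whereas the slides do not say whether they refer to words, hours of speech or speakers.

```python
CORPUS_WORDS = 1_000_000  # "a million word corpus of spoken Slovene"

# Shares as listed on the slides, in percent.
MEDIA_SHARES = {"face to face": 50, "telephone": 10, "radio": 20, "TV": 20}
REGION_SHARES = {"SW": 35, "Ljubljana r.": 25, "NE": 25, "Maribor r.": 15}


def word_targets(shares, total_words=CORPUS_WORDS):
    """Convert percentage shares into word-count targets."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100%")
    return {name: total_words * pct // 100 for name, pct in shares.items()}


media_targets = word_targets(MEDIA_SHARES)
region_targets = word_targets(REGION_SHARES)
print(media_targets)
print(region_targets)
```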

13 Tools for linguistic annotation
Tokenization & segmentation
– new, more transparent rules
Lemmatizer & tagger
– rule-based (Amebis)
– statistical (JSI)
– metatagger (JSI)
Parser
– statistical (based on MSTParser)
Online services (beta)
– tagger: http://oznacevalnik.slovenscina.eu/
– parser: http://razclenjevalnik.slovenscina.eu/

14 March 2011
– Three publicly and freely available annotated corpora of modern Slovene, all texts copyright (+ gathering of new texts still in progress)
– New user-friendly interface (see Iztok Kosem's presentation)
– Freely available tools for linguistic annotation of Slovene (tagger, parser)
– … and not much further down the road: new, up-to-date language descriptions and manuals
See: www.slovenscina.eu

