SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University.


1 SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University of Ljubljana, Faculty of Economics International scientific conference «Corpus linguistics» Saint-Petersburg State University, June 25 – 27, 2013

2

3 SLOVENIA Population: 1,992,690 Ljubljana (capital) 260,000 Independence: 25 June 1991 (from Yugoslavia) Surface: 20,273 sq km Border countries: Austria, Croatia, Hungary, Italy Adriatic coastline: 46.6 km Highest point: Triglav 2,864 m

4 SLOVENIA (2) Language: Slovene (var.: Slovenian) Ethnic composition: Slovene 83.1%, Serb 2%, Croat 1.8%, Bosniak 1.1%, other or unspecified 12% Religions: Catholic 57.8%, Muslim 2.4%, Orthodox 2.3%, other or unspecified 28%, none 10.1% (2002 census) GDP - per capita: $28,700 (2012) Currency: EURO (introduced in 2007)

5 SLOVENE LANGUAGE Slovenski jezik, slovenščina Western South Slavic language approx. 2.4 million speakers (1.85 million as first language) 50 regional dialects (limited mutual understanding: "most diverse Slavic language") Latin alphabet with Č, Š, Ž Highly inflected language Particularities: dual (grammatical number)

6 SLOVENE TEXT CORPORA (Slovenski besedilni korpusi) More than 20 corpora available online REPRESENTATIVE (GENERAL) CORPORA SYNCHRONOUS CORPORA Nova Beseda – 240 million words, 2004 (approx. 10 years' coverage) GigaFida – 1.2 billion words, 1990-2011 SPECIALISED CORPORA – DSI, Jos, Evrokorpus, VAYNA... – EduKorp, Bibliotekarstvo

7 Slovene LIS Terminology Long professional tradition Linguistic shortage in the subject field – Lack of written technical texts – German language tradition – Later English influences – NO dictionaries in LIS terminology – Terminology Project 1987 – Important tangible results

8 Usables International Project – Multilingual Dictionaries of Library Terminology English-Slovene Dictionary of Library Terminology (Slovene) Dictionary of Library Terminology – Printed edition – Electronic edition (web, public access) Text Corpus – Korpus bibliotekarstva

9 Korpus bibliotekarstva Specialized corpus Library and Information Science & practice Synchronous Open public access Dedicated in-house software – PC data processing – Web-based usage – Rich experience (e.g. Dictionaries of the Slovene Academy of Sciences and Arts)

10 Texts Defined selection criteria Subject & Level Written texts Electronically published texts only – Born digital – Digitized & published – NO scanning for the corpus Technical limitations and barriers

11

12 Selected texts & Functions

13 Basic functions Simple/basic search – Single words & phrases – N-grams (N = 1 – 5) – Concordances – Global corpus – selected document segment(s) – Exact matching – Truncation (*) – Upper / lower case Knjižnica - knjižnica
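The basic search functions listed on this slide (n-grams, concordances, truncation with *, optional case folding) can be illustrated with a minimal Python sketch. It is not the corpus's dedicated in-house software: the function names, the context width, and the use of `fnmatchcase` for truncation are assumptions made for illustration only.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    # \w is Unicode-aware in Python 3, so č, š, ž stay inside words.
    return re.findall(r"\w+", text)


def ngrams(tokens, n):
    # All contiguous n-grams (the slide lists N = 1 to 5).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, pattern, width=3, ignore_case=True):
    # Keyword-in-context lines; '*' in the pattern acts as truncation.
    # Note: fnmatchcase also treats '?' and '[...]' as wildcards.
    pat = pattern.lower() if ignore_case else pattern
    hits = []
    for i, tok in enumerate(tokens):
        probe = tok.lower() if ignore_case else tok
        if fnmatchcase(probe, pat):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    sample = "Knjižnica ima katalog. Knjižnični katalog je v knjižnici."
    toks = tokenize(sample)
    print(ngrams(toks, 2))
    for left, key, right in concordance(toks, "knjižnic*"):
        print(f"{left:>30} [{key}] {right}")
```

With `ignore_case=False` the query distinguishes "Knjižnica" from "knjižnica", which mirrors the upper/lower case option on the slide.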

14 Basic functions (2) Advanced search – Frequency search=, Fr>1000 Fr>200 in be:kata* – Word length=, Do=15 Word masking* adjective + substantive * katalog knjižnični *
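The advanced search options can be approximated in the same spirit. The sketch below assumes one plausible reading of the slide's queries: a frequency threshold optionally combined with a truncated word ("Fr>200" with "kata*"), a fixed word length, and word masking in two-word phrases ("* katalog", "knjižnični *"). The actual query syntax of the corpus tool is not reproduced here, and all function names are hypothetical.

```python
from collections import Counter
from fnmatch import fnmatchcase


def frequent_words(tokens, min_freq, pattern="*"):
    # Words occurring more than min_freq times that also match the pattern.
    counts = Counter(t.lower() for t in tokens)
    return {w: c for w, c in counts.items()
            if c > min_freq and fnmatchcase(w, pattern)}


def words_of_length(tokens, length):
    # Distinct words with exactly the given number of characters.
    return sorted({t.lower() for t in tokens if len(t) == length})


def masked_bigrams(tokens, left="*", right="*"):
    # Two-word phrases where either position may be masked with '*'.
    lowered = [t.lower() for t in tokens]
    return Counter((a, b) for a, b in zip(lowered, lowered[1:])
                   if fnmatchcase(a, left) and fnmatchcase(b, right))


if __name__ == "__main__":
    toks = "knjižnični katalog vzajemni katalog knjižnični katalog kataloga".split()
    print(frequent_words(toks, 1, "kata*"))
    print(words_of_length(toks, 7))
    print(masked_bigrams(toks, right="katalog").most_common())
```

Masking the left slot, as in `masked_bigrams(toks, right="katalog")`, is a rough stand-in for the "adjective + substantive" search; a real implementation would presumably filter on part-of-speech tags rather than on word position alone.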

15 Hyperlinked list of texts & authors

16 Concordance list

17 Citation

18 Full-text access

19 Single word

20 Bigrams

21 Bigrams (2)

22 4-grams

23 Insight 625 texts 353 authors (single or co-authors) 3.66 million words Lemmatisation Part-of-speech tagging 28,808 distinct words Highest frequency: 172,031 (auxiliary verb "to be") Hapax legomena: 7,310
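Figures of this kind (number of distinct words, highest frequency, hapax legomena) come from a plain frequency count. A minimal sketch, assuming a pre-tokenized text and counting lower-cased surface forms; the corpus itself is lemmatised, so its counts would be taken over lemmas rather than raw forms. The function name and the returned dictionary layout are illustrative assumptions.

```python
from collections import Counter


def corpus_stats(tokens):
    # Frequency profile: size, vocabulary, most frequent word, hapax legomena.
    counts = Counter(t.lower() for t in tokens)
    top = counts.most_common(1)[0] if counts else None
    hapaxes = sorted(w for w, c in counts.items() if c == 1)
    return {
        "tokens": len(tokens),   # running words
        "types": len(counts),    # distinct words
        "top": top,              # (word, frequency) or None for empty input
        "hapax": hapaxes,        # words occurring exactly once
    }


if __name__ == "__main__":
    print(corpus_stats("je knjižnica je katalog je knjižnica".split()))
```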

24 Frequency distribution First 50

25 Zipf‘s Law vs. experience
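The chart from this slide is not preserved in the transcript. As a reminder of what such a comparison involves: in its classic form, Zipf's law predicts that the frequency of the word at rank r is roughly the top frequency divided by r. A minimal sketch under that assumption, with a hypothetical function name, lists observed and predicted frequencies side by side:

```python
from collections import Counter


def zipf_table(tokens, top=5):
    # Rows of (rank, word, observed frequency, Zipf prediction f1 / rank).
    counts = Counter(t.lower() for t in tokens)
    ranked = counts.most_common(top)
    if not ranked:
        return []
    f1 = ranked[0][1]  # frequency of the most frequent word
    return [(rank, word, freq, f1 / rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    toks = ["je"] * 6 + ["in"] * 3 + ["v"] * 2
    for rank, word, observed, predicted in zipf_table(toks):
        print(rank, word, observed, round(predicted, 1))
```

On a real corpus the observed column typically drifts away from the prediction at the very top and in the long tail, which is the sort of gap a "law vs. experience" comparison makes visible.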

26 Parts of speech; Verbs

27 Nouns; Adjectives

28 Accessibility Open Access CC License BLOG Bibliotekarska terminologija http://terminologija.blogspot.com

29 Problems & Challenges Choice & acquisition of texts „Analogue“ texts Copyright issues Technical barriers – PDF protected data – Special characters – Special text formatting – Typing errors – Genuine OCR errors

30 Problems & Challenges (2) Linguistic – Highly inflected language: data processing, search, analysis, part-of-speech tagging – Foreign language "contamination" General – Resources: human, financial

31 Plans Harvesting new texts – Recent / current born-digital publications – Recently digitized (e.g. "Knjižnica") – "Backlog": 120 graduate theses, 28 master's theses, 25 monographs & proceedings – Scientific analysis – Dictionary updating and supplementing

32 THANK YOU FOR YOUR ATTENTION! Check: http://terminologija.blogspot.com Contact: ivan.kanic@gmail.com http://www2.arnes.si/~ljnuk4/kanic.html

