New Slovene corpora within the »Communication in Slovene« project. Nataša Logar Berginc (University of Ljubljana, Faculty of Social Sciences) and Simon Krek (Amebis, Kamnik)

Presentation transcript:

New Slovene corpora within the »Communication in Slovene« project
Nataša Logar Berginc, University of Ljubljana, Faculty of Social Sciences
Simon Krek, Amebis, Kamnik; Jožef Stefan Institute

“Communication in Slovene”
Web site:
Leading partner: Amebis, d. o. o., Kamnik
Duration: June December 2013
Total value: 3.2 million euros
Project consortium:
– Amebis, d. o. o., Kamnik
– Jožef Stefan Institute
– University of Ljubljana
– Scientific Research Centre of the Slovenian Academy of Sciences and Arts
– Trojina, Institute for Applied Slovene Studies

Language data
Three corpora of Slovene:
– a billion-word written corpus: GigaFIDA
– a 100-million-word balanced subcorpus: KRES
– a million-word corpus of spoken Slovene: GOS

Other activities
NLP tools & resources
– statistical tagger and parser
– training corpus ( words)
– lexicon ( lemmas)
Language learning
– integration of resources & tools in Slovene language teaching
– pedagogical corpus interface
– pedagogical corpus-based grammar
Language description
– lexical database (NLP & lexicography)
– manual of style

Goals

GigaFIDA
a billion-word written corpus
linguistic annotation
– lemmatized
– morpho-syntactically annotated
– partly syntactically annotated
format
– XML TEI P5 format
purpose
– data for the new Slovene lexical database, pedagogical grammar and manual of style
– freely available on the web
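
To make the annotation layers concrete, here is a minimal sketch of reading lemmas and morphosyntactic tags from a TEI-style sentence. The sample sentence, the tag values and the helper `read_tokens` are illustrative assumptions, not taken from GigaFIDA; the slide only states that the corpus is lemmatized, morpho-syntactically annotated and encoded in XML TEI P5, so the actual element and attribute layout may differ.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace

# Hypothetical sample: one sentence with word-level lemma and tag attributes.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="pes" ana="#Ncmsn">Pes</w>
  <w lemma="lajati" ana="#Vmpr3s">laja</w>
  <pc>.</pc>
</s>"""


def read_tokens(xml_text):
    """Return (form, lemma, tag) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter(f"{{{TEI_NS}}}w"):
        # Assumed convention: the tag is stored as a "#"-prefixed pointer.
        tag = el.get("ana", "").lstrip("#")
        tokens.append((el.text, el.get("lemma"), tag))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, tag in tokens:
    print(form, lemma, tag)
```

Punctuation (`<pc>`) is skipped here on purpose, since it carries no lemma in this sample.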

A bit of FIDA history
FIDA corpus –
– 100 million words
– available to project partners (academic & industrial)
FidaPLUS corpus –
– 620 million words
– publicly available in the web concordancer
– available to partners as a data set
– text type: fiction 3.5%, non-fiction 96.5% (90% newspapers and magazines)

KRES
a 100-million-word written subcorpus
criteria
– balanced (text types, production-reception etc.)
– text quality (processing & annotation)
– copyright issues: 10%
purpose
– downloadable as a data set
– freely available for research (BNC style)
– Creative Commons (Attribution, Non-Commercial)

New taxonomy

                            KRES (%)   GigaFIDA
Print                          80      50 <> 90
  Books                        35      15 <> 35
    Fiction                    17      20 <> 50
    Non-fiction                18      30 <> 60
  Periodicals                  40      20 <> 40
    Newspapers                 20      30 <> 70
    Magazines                  20      30 <> 70
  Other                         5       5 <> 10
Internet                       20      10 <> 50
  News sites                    8      30 <> 70
  Corp. & govern. sites        12      30 <> 70
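
The KRES figures in this taxonomy can be checked for internal consistency: each category should equal the sum of its subcategories, and the top level should add up to 100. The sketch below does that check. The names `KRES_SHARES` and `inconsistent` are hypothetical, and reading the figures as percentages of the whole subcorpus is an assumption (one the sums appear to support).

```python
# Each entry maps a category to (share, subcategories); values from the slide.
KRES_SHARES = {
    "Print": (80, {
        "Books": (35, {"Fiction": (17, {}), "Non-fiction": (18, {})}),
        "Periodicals": (40, {"Newspapers": (20, {}), "Magazines": (20, {})}),
        "Other": (5, {}),
    }),
    "Internet": (20, {
        "News sites": (8, {}),
        "Corp. & govern. sites": (12, {}),
    }),
}


def inconsistent(tree, path=()):
    """Return paths of categories whose children do not sum to their share."""
    bad = []
    for name, (share, children) in tree.items():
        here = path + (name,)
        if children and sum(s for s, _ in children.values()) != share:
            bad.append(here)
        bad.extend(inconsistent(children, here))
    return bad


total = sum(share for share, _ in KRES_SHARES.values())
problems = inconsistent(KRES_SHARES)
print("top-level total:", total)
print("inconsistent categories:", problems)
```

The GigaFIDA column is left out because it gives ranges rather than single target values.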

GOS
a million-word corpus of spoken Slovene
– 120 hours of speech
criteria
– demographic
– speech type/situation
– additional (language learning, 15%)
transcription
– pronunciation-based
– standardized

Demographic criteria
– sex: 50% M
– age: <34: 40%
– education: primary/secondary school: 70%
– region: SW: 35%, Ljubljana r.: 25%, NE: 25%, Maribor r.: 15%

Speech type/situation criteria
– public/non-public discourse: 60% : 40%
– media: face to face c.: 50%, telephone: 10%, radio: 20%, TV: 20%
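
As a rough illustration of how such quotas translate into collection targets, the sketch below applies the speech type/situation shares to the 120 hours of speech planned for GOS. Applying the percentages to recording time rather than to word counts is an assumption, and `target_hours` is a hypothetical helper, not part of the project's tooling.

```python
TOTAL_HOURS = 120  # from the GOS slide

# Shares in percent, from the speech type/situation slide.
DISCOURSE = {"public": 60, "non-public": 40}
MEDIA = {"face to face": 50, "telephone": 10, "radio": 20, "TV": 20}


def target_hours(shares, total=TOTAL_HOURS):
    """Convert percentage shares into target hours of recordings."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100%")
    return {name: total * pct / 100 for name, pct in shares.items()}


discourse_hours = target_hours(DISCOURSE)
media_hours = target_hours(MEDIA)
print(discourse_hours)
print(media_hours)
```

The same helper could be applied to the demographic criteria once each of them is written out as a full set of shares; the slide gives only one value for sex, age and education.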

Tools for linguistic annotation
Tokenization & segmentation
– new, more transparent rules
Lemmatizer & tagger
– rule-based (Amebis)
– statistical (JSI)
– metatagger (JSI)
Parser
– statistical (based on MSTParser)
Online services (beta)
– tagger:
– parser:

March 2011
– Three publicly and freely available annotated corpora of modern Slovene, all texts copyright (+ gathering of new texts still in progress)
– New user-friendly interface (see Iztok Kosem's presentation)
– Freely available tools for linguistic annotation of Slovene (tagger, parser)
– … and not much further down the road: new, up-to-date language descriptions and manuals
See: