Overview of corpora and other language resources


1 Overview of corpora and other language resources
The Árni Magnússon Institute for Icelandic Studies
March 9th, 2015

Over the past ten years or so, we have been building various language resources for Icelandic, to be made available electronically. Much of this work has been done at, or in collaboration with, The Árni Magnússon Institute. We maintain a website (malfong.is) that lists all available language resources: not just our own, but every other resource we know about too.

I will introduce the corpora and language resources that have been built in recent years and made available by The Árni Magnússon Institute for Icelandic Studies, briefly describing each resource and its usage license. I may also mention resources made available by others.

Some of our data sets have quite severe restrictions on use, while others can be used for almost anything. Some licenses were written specifically for the data set in question, and some are CC licenses. Some of the data sets I talk about have only recently been released under open licenses, even though they were collected years earlier. That is what we want to do: make our data as open as possible, so that as many people as possible can and will use it.

2 Overview
Text Corpora
Speech Corpora
Language Descriptions and Dictionaries
Language Tools

3 Text Corpora
Proprietary license
Proprietary license
CC BY 3.0 license
Others: Icelandic Parsed Historical Corpus; Wortschatz

The Icelandic Frequency Dictionary (IFD) was the first big project in which a considerable amount of contemporary Icelandic text was collected, analyzed and reported on. When we started training taggers, i.e. for tagging the Tagged Icelandic Corpus, this was the obvious data to use, and about 70% of the original data has been available to everyone for a few years now. Not all of it: when the data was collected, mostly from books (up to 50% is fiction), the copyright holders only accepted its use for the purposes of that research, so we had to get them to accept another license when we published it electronically 20 years later. This license is quite complicated but essentially only allows use for research.

The Tagged Icelandic Corpus (MIM) is more balanced, so to speak. It includes texts from printed books (fiction and non-fiction), newspapers, periodicals, websites (blogs, educational, government, etc.), speeches made in Parliament, student essays, radio and TV scripts, lists, etc. Some of that data is freely available to everyone for any use, but most of it is not, and therefore the whole corpus has a license similar to that of the IFD.

A 1-million-word subset of MIM, called MIM-GOLD, was created. It was automatically tagged (using the IFD as training data) and the tags were then manually corrected. This work is in its final stages. The planned correction process was finished last year and we have estimated the tagging accuracy using this data. We found a few flaws that should not be too much of a hassle to fix, which will bring the accuracy of the tags to the same level as the IFD. The current version is available for download now. It has the same license as MIM: research use only.

The Saga Corpus is a corpus of Old Icelandic sagas, 41 texts in all. Most of the texts were published in this form between 1985 and 1991; they were normalized to Modern Icelandic spelling, and several inflectional endings were also changed to their Modern Icelandic forms. The corpus was tagged iteratively, first using a method developed for Modern Icelandic. The tagging accuracy was measured on random samples (88%, compared to 90.4% for IFD texts). Some texts were then selected for manual correction. They were added to the IFD data and a new model was created, finally reaching an accuracy of 92.7%. The Saga Corpus is distributed under a CC BY 3.0 license, which makes it pretty close to public domain.

The Icelandic Parsed Historical Corpus is a diachronic corpus with samples of written Icelandic from the 12th century to modern times. It contains 1 million words and is mostly comparable to the corpora of historical English developed at UPenn.

Wortschatz is a text corpus of more than 500 million running words, mostly from the National Library's web scraping archives. It was developed at the University of Leipzig.
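The accuracy figures quoted for the Saga Corpus and MIM-GOLD come from comparing automatically assigned tags with manually corrected ones on a sample of text. The talk does not describe the exact procedure, so the following is only a minimal sketch of how such an estimate could be computed. The function names, the data layout (each sentence as a pair of predicted and gold tag lists) and the tag strings are assumptions made for illustration, not the Institute's actual tooling or tagset.

```python
import random


def tagging_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def sample_accuracy(sentences, sample_size, seed=0):
    """Estimate accuracy on a random sample of sentences.

    `sentences` is a list of (predicted_tags, gold_tags) pairs, one pair
    per sentence. This layout is hypothetical, chosen for the sketch.
    """
    rng = random.Random(seed)
    sample = rng.sample(sentences, min(sample_size, len(sentences)))
    predicted = [tag for pred, _ in sample for tag in pred]
    gold = [tag for _, gold_tags in sample for tag in gold_tags]
    return tagging_accuracy(predicted, gold)


# Placeholder tags, not real Icelandic tagset labels.
example = [
    (["N", "V", "N"], ["N", "V", "N"]),
    (["N", "A"], ["N", "V"]),
]
estimate = sample_accuracy(example, sample_size=2)
```

Sampling whole sentences rather than single tokens keeps each checked tag in its context, which is usually what a human corrector needs; a token-level sample would be an equally valid design for the estimate itself.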

4 Speech Corpora
Parliament Speech Corpus: 20 hours of speech; CC BY 3.0
Hjal Corpus: collected in 2003 for speech recognition
Málrómur: currently 44 hours of clean speech; collected in cooperation with Google for speech recognition; CC BY 4.0
ISLEX Recordings: recordings of all the Icelandic words (48,500) in the ISLEX dictionary and roughly 700 phrases; CC BY-NC-ND 3.0
Others: Jensson Corpus, Thor Corpus, RÚV discussions

The Parliament Speech Corpus contains 20 hours of speech ( running words) in synchronized text and sound files. Recordings from with detailed transcriptions in text files. Information about the speakers (age, gender) is provided as well. This data is intended to reflect natural spoken Icelandic under formal conditions. The discussion periods were chosen because they primarily consist of unprepared speeches that are unlikely to have been written in advance and read aloud. The transcription and processing of the material were mostly carried out by students.

The Hjal Corpus was collected in 2003 for training a speech recognition system. It contains over sound files, each containing an utterance of one or more words recorded over the phone. Most of them contain only a single word (but there are also numbers, place names, etc.). It is hard to estimate the total duration because the sound files contain long silences before and after the utterances.

Málrómur is the most recent speech corpus. It was collected 3 years ago by Reykjavik University and The Árni Magnússon Institute in collaboration with Google. Google used this data as a basis for training their recognizer for Icelandic, but the recordings are open, that is, available to all. We recorded around 130 thousand utterances and include information on the speakers (age group, gender). We are in the process of cleaning the data, that is, cutting off long silences before and after the utterances and making sure the spoken text matches the prompts given. We have published 57 thousand utterances, in total around 44 hours of clean speech. This data was recorded using Samsung phones, but not over the phone line.

The ISLEX Recordings were made in a studio and read by one woman (50 years old).

The three corpora mentioned at the bottom of the slide contain in total between 6 and 7 hours of speech, with multiple speakers under good recording conditions.
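The Málrómur cleaning step mentioned above, cutting off long silences before and after each utterance, can be illustrated with a small sketch. The talk does not say how the trimming was actually done; the version below assumes a simple amplitude threshold over a list of integer samples, and the function name, the `threshold` value and the `padding` parameter are all hypothetical.

```python
def trim_silence(samples, threshold, padding=0):
    """Drop leading and trailing samples whose amplitude is at or below threshold.

    `samples` is a list of signed amplitudes. `padding` keeps that many
    extra samples on each side so the utterance is not cut too tightly.
    Pauses inside the utterance are left untouched.
    """
    voiced = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not voiced:
        return []  # the whole recording is below the threshold
    start = max(voiced[0] - padding, 0)
    end = min(voiced[-1] + 1 + padding, len(samples))
    return samples[start:end]


recording = [0, 0, 5, -7, 0, 3, 0, 0]
trimmed = trim_silence(recording, threshold=1)
padded = trim_silence(recording, threshold=1, padding=1)
```

A fixed threshold is the simplest possible choice and would likely need tuning for recordings made on phones in varying environments; an energy-based or model-based voice activity detector would be a more robust alternative.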

5 Language Descriptions and Dictionaries
Pronunciation dictionary: over phonetically written word forms; CC BY 3.0
BÍN – Database of Modern Icelandic Inflection: paradigms; proprietary license
The Icelandic Terminology Bank: 42 termbases; CC BY-SA 3.0
ISLEX – Icelandic-Scandinavian dictionary: words; CC BY-NC-ND 3.0
IceWordNet

The pronunciation dictionary was built as part of the Hjal project, discussed earlier. It is a list of phonetically transcribed words read by the participants in the Hjal project.

The Database of Modern Icelandic Inflection is a collection of paradigms. The project started in 2002 and the work is still ongoing. It currently contains more than paradigms. The data is available for download and can be used with certain restrictions; for example, the user is not allowed to distribute the database to others. This database has proved very useful and is used in a wide variety of projects, everything from web search and spellcheckers/grammar checkers to computer games.

The Icelandic Terminology Bank is a syndicate of termbases that have been collected by specialists in their fields. The Árni Magnússon Institute has provided the infrastructure for keeping records of the terms and publishing them online. The terminology bank contains around 60 searchable termbases, 42 of which can be downloaded and used under an open license. The termbases vary greatly in size and detail, with the smallest containing a few hundred terms and the biggest tens of thousands.

ISLEX is an online multilingual dictionary with modern Icelandic as the source language and Danish, Norwegian and Swedish as target languages; Faroese is being opened this month and Finnish is also being worked on. Online access to the dictionary is free of charge, and the data is available to researchers under a non-commercial, non-derivative license.

IceWordNet is based on Princeton WordNet. It consists of nearly 5000 Icelandic translations of the words from the Princeton core list, along with the Icelandic synonyms of those words.

6 Language Tools
Older tools: CombiTagger, IceNLP, Lemmald
New tools: Skrambi, Nefnir, Kvistur

A few tools have been developed for working with language resources. These include CombiTagger, Lemmald and IceNLP, for tagging, lemmatizing, tokenizing, parsing and recognizing named entities. These older tools could use some updating, as their accuracy is not always optimal, to say the least. Tomorrow we will hear about some of the new tools being developed. These include a spellchecker and a lemmatizer.

