International Conference “Corpus linguistics – 2013” St. Petersburg, June 25–27, 2013 Roland Mittmann, M.A. Institute of Empirical Linguistics.


1 International Conference “Corpus linguistics – 2013”, St. Petersburg, June 25–27, 2013
Roland Mittmann, M.A.
Institute of Empirical Linguistics, Goethe University, Frankfurt am Main, Germany
Old German and Old Lithuanian: The Creation of Two Deeply-Annotated Historical Text Corpora

2 Introduction
Aim: creation of deeply-annotated corpora of historical language stages
Approach: depending on
– existing resources from previous analyses
– qualities of the language itself
Comparison of approaches:
– Old German Reference Corpus (OG/OGRC)
– Old Lithuanian Reference Corpus (OL/OLRC)

3 Description of the corpora
Old German Reference Corpus (Referenzkorpus Altdeutsch)
all preserved texts from the oldest stages of German
– Old High German and Old Saxon (= Old Low German)
– ca. 750 – 1050 CE
– ca. 650,000 word tokens
cooperation of 3 German universities: 2008 – 2013
– Humboldt University (Berlin)
– Goethe University (Frankfurt am Main)
– Schiller University (Jena)
several subcorpora already searchable online

4 Description of the corpora OGRC:

5 Description of the corpora
Old Lithuanian Reference Corpus (Senosios lietuvių kalbos korpusas)
preserved texts from the oldest stage of Lithuanian
– ca – 1800 CE
– ca. 10,000,000 word tokens
pilot project covering 540,000 word tokens started in 2012
international cooperation
– Lithuanian Language Institute (LKI, Vilnius)
– Goethe University (Frankfurt am Main)
– University of Pisa, Italy
use of experience gained with the OGRC due to cooperation in Frankfurt

6 Description of the corpora
Qualities of the texts of both corpora
types of texts:
– religious and secular texts
– prose and poetry
– translated/adapted and independently composed texts
language:
– variation due to diachronic, diatopic and diastratic differences
foreign-language source texts and foreign-language words in the texts:
– annotation as similar as possible to OG/OL word tokens
– included in the aforementioned word token numbers
Old Lithuanian: balanced choice of texts for pilot project

7 The unequal starting points
Divergence from modern languages
OL considerably closer to Modern Lithuanian than OG to Modern (High or Low) German
– not only due to different age: invention of printing press in 15th century and spread of written texts
→ deceleration of transformation pace of European literary languages
→ moderate language development from OL to Modern Lithuanian (however, large differences in spelling, in OL many variants) vs. extensive mutations in vowel system between OG and Early Modern Times (e.g. reduction of unstressed vowels to schwa/zero)

8 The unequal starting points
→ Impacts on availability of resources
Old Lithuanian
– no historical dictionary of Lithuanian, no OL grammar (but OL dictionaries)
– dictionaries and grammars of Modern Lithuanian may be helpful
Old German
– specific dictionaries and grammars
– glossaries for every subcorpus: all attested inflected word forms, related to corresponding lemmata
→ OLRC: basis for compilation of OL grammar and glossary
→ OGRC: questioning and amending of existing works

9 The unequal starting points
Digital availability of the texts
OG: one printed edition per text digitized by TITUS project in Frankfurt
OL: 10 texts in pilot project
– 6 on TITUS
– 3 adopted from OL database of Lithuanian Language Institute (LKI)
– 1: edition being prepared
TITUS texts:
– structural annotation: e.g., chapters and lines for original document and edition
– information can be adopted directly, together with texts

10 The unequal starting points titus.uni-frankfurt.de

11 The unequal starting points
Referential text version
OGRC:
– digitized edition as main reference layer
– manual addition of original text forms and graphical peculiarities saved for later, only performed by way of example
OLRC:
– digitized edition extended by version of original manuscripts or prints
– detailed representation of amendments
→ digitization of original documents required

12 The courses of action: OGRC
Pre-annotation
digitization of glossaries for the subcorpora into XML format

13 The courses of action: OGRC
Pre-annotation
digitization of glossaries for the subcorpora into XML format
linking part-of-speech and morphological data of the word forms with the word tokens in the texts:
– extraction of data from glossary files
– enrichment with additional part-of-speech and morphological information manually extracted from grammars
most glossaries give attestations with locations in text → one-to-one attribution
aim of consistent spelling and consistent modern German translation → adaptation of glossary lemmata to standard dictionaries of Old High German and Old Saxon
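
The one-to-one attribution of glossary attestations to word tokens can be pictured with a minimal Python sketch. It is an illustration only, not the project's actual software: the class names, the location format ("chapter,line") and the tag strings are assumptions, and the real glossaries were XML files rather than in-memory objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    location: str                 # hypothetical "chapter,line" reference
    form: str                     # word form as printed in the edition
    lemma: Optional[str] = None
    pos: Optional[str] = None
    morph: Optional[str] = None


@dataclass
class GlossaryEntry:
    lemma: str
    pos: str                      # placeholder tag, not a DDDTS tag
    form: str                     # attested inflected form
    morph: str
    locations: List[str]          # places where the form is attested


def link_glossary(tokens: List[Token], glossary: List[GlossaryEntry]) -> List[Token]:
    """Copy lemma, POS and morphology onto tokens; return unresolved tokens."""
    index = {}
    for entry in glossary:
        for loc in entry.locations:
            index.setdefault((loc, entry.form.lower()), []).append(entry)

    unresolved = []
    for tok in tokens:
        candidates = index.get((tok.location, tok.form.lower()), [])
        if len(candidates) == 1:  # exactly one attestation: safe to attribute
            entry = candidates[0]
            tok.lemma, tok.pos, tok.morph = entry.lemma, entry.pos, entry.morph
        else:                     # none or several: leave for manual annotation
            unresolved.append(tok)
    return unresolved
```

Tokens without exactly one matching attestation are returned rather than guessed at, which mirrors the later manual annotation step.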

14 The courses of action: OGRC
Conversion and manual annotation
conversion into ELAN format
– software by Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
database structure
– with part-of-speech, morphological, lemma and structural pre-annotation
manual annotation:
– amendment of information
– resolution of ambiguities
– addition of simple syntactic annotation

15 The courses of action: OGRC
automated creation of standardized version of word tokens
– from lemmata plus part-of-speech and morphological data
– morphological knowledge of language stages encoded in a Perl program
– standard word forms used to detect annotation mistakes by automated comparison with word forms in text edition
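
The idea of generating a standard form from lemma plus morphology and comparing it with the edition's form can be sketched as follows. The project used a Perl program; this Python version is only an assumption-laden illustration. It covers a single paradigm (Old High German masculine a-stem nouns such as "tag"), uses made-up function names, and compares forms exactly, whereas a real check would have to tolerate spelling variation.

```python
# Endings of Old High German masculine a-stem nouns (e.g. "tag"),
# used here as the only paradigm of this toy generator.
A_STEM_ENDINGS = {
    ("sg", "nom"): "", ("sg", "gen"): "es", ("sg", "dat"): "e", ("sg", "acc"): "",
    ("pl", "nom"): "a", ("pl", "gen"): "o", ("pl", "dat"): "um", ("pl", "acc"): "a",
}


def standard_form(lemma, number, case):
    """Build the standardized word form from lemma and morphological data."""
    return lemma + A_STEM_ENDINGS[(number, case)]


def flag_suspicious(records):
    """Return (text form, expected form) pairs where annotation and text disagree.

    Each record is (text_form, lemma, number, case). A mismatch hints at a
    wrong annotation or at a spelling variant that needs a human look.
    """
    flagged = []
    for text_form, lemma, number, case in records:
        expected = standard_form(lemma, number, case)
        if text_form.lower() != expected:
            flagged.append((text_form, expected))
    return flagged
```

In this sketch a token "tage" annotated as genitive singular would be flagged, because the generated form is "tages"; the annotation (dative rather than genitive) is the likelier culprit.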

16 The courses of action: OLRC
Pre-annotation
no glossaries → annotation tool learning from manual annotation required
use of Toolbox (by SIL International, Dallas, Texas)
– applying expansible dictionaries
one dictionary with data of Lemuoklis
– morphological analyser, lemmatizer and tagger by the LKI
– enriched by semi-manually classified data from dictionaries on OL, Slavic loanwords in OL and Bible names
other dictionary with data of Lithuanian language dictionary
– retrieval of data on all lemmata in the corpus from its digital version
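
What "an annotation tool learning from manual annotation" means in practice can be shown with a small Python sketch of a dictionary that grows with every manual decision. This is not how Toolbox is implemented or configured; the class, the function names and the analysis tuples are hypothetical and serve only to illustrate the principle.

```python
class GrowingLexicon:
    """Maps word forms to known analyses and can be extended at any time."""

    def __init__(self, seed=()):
        self._analyses = {}
        for form, analysis in seed:   # e.g. data imported from an analyser
            self.add(form, analysis)

    def add(self, form, analysis):
        bucket = self._analyses.setdefault(form.lower(), [])
        if analysis not in bucket:
            bucket.append(analysis)

    def lookup(self, form):
        return list(self._analyses.get(form.lower(), []))


def annotate(forms, lexicon, ask_annotator):
    """Annotate automatically where the lexicon is unambiguous, else ask.

    ask_annotator(form, candidates) stands in for the human annotator;
    every answer is stored, so the same form is not asked about again
    unless it remains ambiguous.
    """
    result = []
    for form in forms:
        candidates = lexicon.lookup(form)
        if len(candidates) == 1:
            result.append((form, candidates[0]))
        else:
            choice = ask_annotator(form, candidates)
            lexicon.add(form, choice)
            result.append((form, choice))
    return result
```

With an empty lexicon, the first occurrence of a form goes to the annotator and later occurrences are resolved from the stored answer.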

17 The courses of action: OLRC Annotation in Toolbox (OLRC)

18 The courses of action: OLRC
lemmatization of word forms of OL texts: automatic where possible, otherwise manual
creation of standardized word forms by Lemuoklis from lemmata, part-of-speech and morphological annotation
Modern Lithuanian-English dictionary → lemma translation
conversion of word tokens into standardized spelling: Consistent Changes Program (SIL)
– mainly for older texts, specific rules for every single author needed
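
Author-specific spelling rules of the kind described above can be sketched in Python as ordered replacement lists. The sketch does not reproduce the syntax of SIL's Consistent Changes tables, and the rules and author keys are invented for illustration; real rule sets would be far larger and derived from each author's orthography.

```python
# Rules shared by all texts (illustrative): long s is rewritten as round s.
COMMON_RULES = [("ſ", "s")]

# Hypothetical per-author rules; the keys are placeholders, not real authors.
AUTHOR_RULES = {
    "author_a": [("cz", "č"), ("sz", "š")],
    "author_b": [("ż", "ž")],
}


def standardize(token, author):
    """Apply the common rules first, then the author's rules, in order."""
    rules = COMMON_RULES + AUTHOR_RULES.get(author, [])
    for old, new in rules:
        token = token.replace(old, new)
    return token
```

Rule order matters: because "ſ" is rewritten first, a spelling such as "ſzirdis" becomes "szirdis" and then, under the rules of `author_a`, "širdis".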

19 The courses of action: OLRC
Manual annotation and conversion
in Toolbox:
– joining of texts with Lemuoklis' data
– manual disambiguation
Toolbox: no chart structure, limited number of annotation layers → transfer of data into ELAN
– automated split-up of word forms into graphemes → annotation (also OGRC)
– e.g., addition of information on multiword expressions, quotations and glossing of words
→ conversion into image annotation tool ImAnTo (Frankfurt University)
– annotation of facsimiles of original documents
– selection of details of images and linking to annotations
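
The automated split-up of word forms into graphemes could look roughly like the Python sketch below. How the projects define a grapheme is not stated in the slides, so two assumptions are made here: a base letter together with its combining diacritics counts as one unit, and an optional list of digraphs (plain letters only) is treated as single units.

```python
import unicodedata


def split_graphemes(word, digraphs=()):
    """Split a word form into grapheme units.

    A unit is a base character (or one of the given digraphs) plus any
    combining marks that follow it. Units are returned in NFC form.
    """
    text = unicodedata.normalize("NFD", word)
    ordered = sorted(digraphs, key=len, reverse=True)  # longest match first
    units = []
    i = 0
    while i < len(text):
        match = next((d for d in ordered if text.startswith(d, i)), None)
        unit = match if match else text[i]
        i += len(unit)
        # attach following combining marks (accents, dots, tildes)
        while i < len(text) and unicodedata.combining(text[i]):
            unit += text[i]
            i += 1
        units.append(unicodedata.normalize("NFC", unit))
    return units
```

Each unit can then carry its own annotation, for example a link to the corresponding region of a facsimile.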

20 The courses of action: OLRC

21 The courses of action: Parallel processing
Tagsets and annotation schemes
part-of-speech and morphological annotation:
OGRC: Deutsch Diachron Digital Tagset (DDDTS)
– adaptation of TIGER Morphology Annotation Scheme for Modern German, based on Stuttgart-Tübingen Tagset (STTS)
– DDDTS used as basis for creation of tagset for OL
– distinguishing between lemma-specific and record-specific qualities of word tokens
language of word tokens according to ISO (goh, osx; olt; lat)

22 The courses of action: Parallel processing
The ANNIS database
transfer of subcorpora of both projects into ANNIS database (Potsdam University, Germany)
joining of texts with extensive metadata description
– developed by Middle High German and OGRC, adapted by OLRC
complex search patterns possible, more convenient search tool in preparation

23 The courses of action: Parallel processing Representation in the ANNIS database (OGRC)

24 Conclusion
Comparison of approaches for OL and OG
work on OLRC benefits from course of action applied for OGRC
– in spite of various aspects diverging initially
OLRC can use digitized data and tools for Modern Lithuanian
– inapplicable for OGRC
lack of glossaries for OLRC → additional adaptive annotation tool
special approaches required for objectives exceeding those of OGRC
– e.g. precise annotation of facsimiles of original documents
→ however, cooperation advantageous, more time for philological work

25 Thank you for your attention! Спасибо за внимание! Old German Reference Corpus:

