Types and Tokens Distribution in TITUS ("Distribution of Word Forms in the TITUS Corpus"). Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität.

1 Types and Tokens Distribution in TITUS
Распределение словоформ в корпусе TITUS (Distribution of Word Forms in the TITUS Corpus)
Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main
E-Mail: l.ahlborn@em.uni-frankfurt.de

2 Tokens and Types Distribution in TITUS: Outline
* TITUS Resource Data
* Peculiarities of TITUS texts
* Tokens and Types calculation in TITUS Resources
* Metadata for Tokens and Types distribution
Корпусная лингвистика 2013 (Corpus Linguistics 2013), 26.06.2013

3 TITUS Resource Data
TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien): http://titus.uni-frankfurt.de
A token is a concrete occurrence of a linguistic unit; a type bundles the tokens that belong together.
TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens.
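The token/type distinction can be made concrete with a few lines of Python. This is a minimal sketch, not the TITUS procedure: it assumes whitespace-separated word forms and case-sensitive comparison, and the function name `count_tokens_and_types` and the sample string are illustrative only.

```python
from collections import Counter


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (tokens, types) for whitespace-separated word forms."""
    forms = text.split()          # every occurrence is one token
    frequencies = Counter(forms)  # identical forms are bundled into one type
    return len(forms), len(frequencies)


# Illustrative sample, not taken from a TITUS text.
sample = "in principio erat verbum et verbum erat apud deum"
tokens, types = count_tokens_and_types(sample)
print(tokens, types)
```

Here "erat" and "verbum" each occur twice, so they contribute two tokens each but only one type each.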

4 TITUS Data
http://www.clarin.eu/node/1512
Added by J. Gippert, R. Mittmann

5 TITUS Search Engine
The TITUS Search Engine does not report the number of tokens in a given text, but the number of quotations of a word.

6 Peculiarities of TITUS texts: Gothic
The Biblia Gothica contains additional parallel passages in Latin and Greek.
Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm).

7 Peculiarities of TITUS texts: Old Church Slavonic
Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, which is the original form of the text, and in Cyrillic.
Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm).

8 Peculiarities of TITUS texts: Old Polish
Old Polish texts display side by side editions that were produced at different times.
Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm).

9 Peculiarities of TITUS texts: Ossetian
The Ossetian Nart epic is represented in Latin script and in extended Cyrillic.
Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm).

10 Peculiarities of TITUS texts: Russian-Low German
Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language varieties.

11 Peculiarities of TITUS texts: Old Prussian
The Old Prussian corpus contains at least 21 different languages or language varieties (among them Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).

12 Creation
A digitized source consists not only of words in the source language; it also contains various information that did not originally belong to the document: numbers, tags, punctuation marks, edition information, etc.
# strip an optional number and whitespace followed by the bracketed marker <\x86\x87\x84>
$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;
# strip an optional number followed by whitespace and a space
$zeile =~ s/\d*\s+ //g;
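The cleaning step described on this slide can be sketched in Python as follows. This is an illustration under stated assumptions rather than the script used for TITUS: the slide's own code is Perl, and the patterns here (angle-bracket tags, digit runs, punctuation) as well as the name `clean_line` are hypothetical choices that cover the categories the slide lists.

```python
import re

TAG = re.compile(r"<[^>]*>")    # markup such as <b> or </b> (assumed tag style)
NUMBER = re.compile(r"\d+")     # verse, line, or page numbers
PUNCT = re.compile(r"[^\w\s]")  # punctuation and other non-word symbols


def clean_line(line: str) -> str:
    """Remove tags, numbers, and punctuation, then normalize whitespace."""
    line = TAG.sub(" ", line)
    line = NUMBER.sub(" ", line)
    line = PUNCT.sub(" ", line)
    return " ".join(line.split())


# Illustrative input, not taken from a TITUS text.
print(clean_line("12 <b>In principio</b> erat verbum, et verbum ..."))
```

The `[^\w\s]` pattern also removes combining diacritics, so texts that encode accents as separate combining characters would need a more careful punctuation pattern.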

13 Examples: Gothic
Gothic Bible. Old Testament Fragments. Total: 1629 tokens and 893 types
Language   Tokens   Types
Gothic        420     240
Latin         572     325
Greek         627     319

14 Examples: Gothic
Gothic Bible. New Testament Books. Total: 170215 tokens and 28876 types
Language   Tokens   Types
Gothic      61167    9121
Latin       52648    9036
Greek       56400   10719
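The New Testament figures can be cross-checked with a short Python calculation. The sketch below sums the per-language counts and compares them with the stated totals; the type-token ratio it also prints is a common descriptive measure added here for illustration and does not appear on the slides. The dictionary layout and variable names are assumptions.

```python
# Per-language (tokens, types) for the Gothic Bible, New Testament Books,
# as given on the slide.
counts = {
    "Gothic": (61167, 9121),
    "Latin": (52648, 9036),
    "Greek": (56400, 10719),
}

total_tokens = sum(tokens for tokens, _ in counts.values())
total_types = sum(types for _, types in counts.values())
print(total_tokens, total_types)  # 170215 28876, matching the stated totals

# Type-token ratio per language (an added illustration, not from the slides).
for language, (tokens, types) in counts.items():
    print(f"{language}: {types / tokens:.3f}")
```

That the per-language type counts add up exactly to the total suggests that types were counted separately for each language, so a form shared by two languages would count once in each.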

15 Examples: Tönnies Fenne's Manual (17th century)
The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.

16 Examples: further application

17 Metadata
* DC – Dublin Core
* TEI – Text Encoding Initiative
* CEI – Corpus Encoding Initiative
* IMDI – ISLE Meta Data Initiative
* OLAC – Open Language Archives Community
* CMDI – Component MetaData Infrastructure

18 CMDI – Component MetaData Infrastructure
http://www.clarin.eu/cmdi

19 TITUS Metadata: HTML Format
TITUS Texts: Biblia gothica: Frame

20 New Metadata Set for TITUS
* Name: existing
* Author: new
* ProjectContactName: existing
* ProjectContactAddress: existing
* ProjectContactEmail: existing
* ProjectContactOrganisation: existing
* ProjectDescription: existing
* Resource.Language: new
* Resource.ResourceLink: existing
* Resource.Access.Availability: existing
* Resource.Access.Date: existing
* Resource.Access.Owner: existing
* Resource.Access.Publisher: existing
* Resource.Publication.Time.Original.Manuscript: new
* Resource.Publication.Time.Original.Facsimile: new
* Resource.Publication.Time.Original.Published: new
* Resource.Publication.Time.Electronic: existing
* Resource.Wordcount.General.Tokens: new (CLARIN)
* Resource.Wordcount.General.Types: new
* Resource.Wordcount.Language.Tokens: new
* Resource.Wordcount.Language.Types: new
* Resource.Metadata.Encoding: new

21 Metadata Example for TITUS – XML CMDI
16.6.2002
1629 Tokens, 893 Types
Tokens | Types by language:
Language 1_General: 10 Tokens | 9 Types
Language 2_Gothic: 420 Tokens | 240 Types
Language 4_Latin: 572 Tokens | 325 Types
Language 5_Greek: 627 Tokens | 319 Types
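The slide shows only the values of the XML record. The following Python sketch shows one way such word-count metadata might be serialized. The element and attribute names (`Wordcount`, `General`, `Language`, `name`, `Tokens`, `Types`) are hypothetical, loosely modeled on the field names of the metadata set; they are not the actual TITUS CMDI profile.

```python
import xml.etree.ElementTree as ET

# Counts from the slide: Gothic Bible, Old Testament Fragments.
general = (1629, 893)
by_language = {
    "1_General": (10, 9),
    "2_Gothic": (420, 240),
    "4_Latin": (572, 325),
    "5_Greek": (627, 319),
}


def build_wordcount(general, by_language):
    """Build a hypothetical <Wordcount> element from (tokens, types) pairs."""
    root = ET.Element("Wordcount")
    gen = ET.SubElement(root, "General")
    ET.SubElement(gen, "Tokens").text = str(general[0])
    ET.SubElement(gen, "Types").text = str(general[1])
    for name, (tokens, types) in by_language.items():
        lang = ET.SubElement(root, "Language", name=name)
        ET.SubElement(lang, "Tokens").text = str(tokens)
        ET.SubElement(lang, "Types").text = str(types)
    return root


wordcount = build_wordcount(general, by_language)
print(ET.tostring(wordcount, encoding="unicode"))
```

The four per-language rows add up to the general totals (1629 tokens, 893 types), so a record of this kind can be validated by summing its `Language` entries.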

22 Metadata for TITUS – Browser

23 Metadata for TITUS – Browser

24 Metadata for TITUS – Browser

25 Thank you for your attention!
Links
* ARBIL (metadata editor) http://tla.mpi.nl/tools/tla-tools/arbil/
* CLARIN http://www.clarin.eu
* CMDI http://www.clarin.eu/cmdi
* Dublin Core http://dublincore.org/documents/dcmi-terms/
* IMDI http://www.mpi.nl/IMDI/
* OLAC http://www.language-archives.org/
* TEI http://www.tei-c.org/index.xml
* TITUS http://titus.uni-frankfurt.de

26 Old Prussian Corpus
Tokens General: 17662 tokens
Types General: 8390 types

