Types and Tokens Distribution in TITUS (Distribution of Word Forms in the TITUS Corpus). Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main


Types and Tokens Distribution in TITUS
Distribution of Word Forms in the TITUS Corpus
Dr. Svetlana Ahlborn
Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main

Tokens and Types Distribution in TITUS. Corpus Linguistics (Корпусная лингвистика)
Outline
TITUS Resource Data
Peculiarities of TITUS texts
Tokens and Types calculation in TITUS Resources
Metadata for Tokens and Types distribution

TITUS Resource Data
TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien)
A token is a concrete occurrence of a linguistic unit; a type bundles together the tokens that are associated with one another.
TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens.
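The token/type distinction can be illustrated with a minimal Python sketch. It assumes that a token is any whitespace-separated word form and that identical word forms make up one type. The sample string and the function name `count_tokens_and_types` are illustrative only and are not taken from the TITUS data or tooling.

```python
from collections import Counter


def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are whitespace-separated word forms, and
    tokens with the same lower-cased form belong to one type.
    """
    # Every whitespace-separated word form is one token.
    tokens = text.lower().split()
    # Counter bundles identical word forms, so its keys are the types.
    types = Counter(tokens)
    return len(tokens), len(types)


# Illustrative sample: six running word forms, four distinct ones.
sample = "jah qaþ du im jah qaþ"
print(count_tokens_and_types(sample))  # prints (6, 4)
```

A real corpus would need a language-specific definition of "word form"; the whitespace rule here is only the simplest possible choice.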

TITUS Data
Added by J. Gippert, R. Mittmann

TITUS Search Engine
The TITUS search engine does not determine the number of tokens in a given text; it reports the number of quotations of the word.

Peculiarities of TITUS texts: Gothic
Biblia Gothica contains additional parallel passages in Latin and Greek.
Source: Biblia Gothica

Peculiarities of TITUS texts: Old Church Slavonic
Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, the original form of the text, and in the Cyrillic alphabet.
Source: Codex Marianus

Peculiarities of TITUS texts: Old Polish
Old Polish texts display side by side editions that were produced at different times.
Source: Kazania Świętokrzyskie ( kazania/kazan.htm)

Peculiarities of TITUS texts: Ossetian
The Ossetian Nart epic is represented in Latin script and in extended Cyrillic.
Source: Ossetian: Nart epic ( nart/nart.htm)

Peculiarities of TITUS texts: Russian-Low German
Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variants.

Peculiarities of TITUS texts: Old Prussian
The Old Prussian corpus consists of at least 21 different languages or language variants (Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).

Creation
A digitized source consists not only of words in the source language; it also contains information that did not originally belong to the document: numbers, tags, punctuation marks, edition information, etc.
# Remove optional digits, whitespace and the marker "<" + bytes 0x86 0x87 0x84 + ">" (all occurrences, case-insensitive)
$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;
# Remove optional digits followed by whitespace and a space (all occurrences)
$zeile =~ s/\d*\s+ //g;
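The kind of clean-up described on this slide can be sketched in Python (not the Perl of the fragment above, and not a port of it). The sketch assumes that tags are angle-bracketed, that every digit run is editorial numbering, and that every non-word character is punctuation. These are simplifying assumptions; the actual TITUS rules are text-specific and are not given in the source. The function name `clean_line` and the sample line are hypothetical.

```python
import re


def clean_line(line):
    """Strip material that is not part of the source text proper.

    Assumptions (illustrative only): tags look like <...>, all digit
    runs are verse/line/page numbers, and anything that is neither a
    word character nor whitespace is punctuation.
    """
    line = re.sub(r"<[^>]*>", " ", line)   # markup tags
    line = re.sub(r"\d+", " ", line)       # numbering
    line = re.sub(r"[^\w\s]", " ", line)   # punctuation marks
    return " ".join(line.split())          # collapse whitespace


print(clean_line("12 <b>jah</b> qaþ: du im."))  # prints: jah qaþ du im
```

Only after a step of this kind does counting tokens and types reflect the words of the source rather than the apparatus of the edition.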

Examples: Gothic
Gothic Bible, Old Testament fragments. Total: 1629 tokens and 893 types.
Table of tokens and types for Gothic, Latin and Greek (values not preserved).

Examples: Gothic
Gothic Bible, New Testament books. Totals of tokens and types, with a table for Gothic, Latin and Greek (values not preserved).

Examples: Tönnies Fenne's Manual (17th century)
The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.

Examples: further application

Metadata
DC – Dublin Core
TEI – Text Encoding Initiative
CEI – Corpus Encoding Initiative
IMDI – ISLE Meta Data Initiative
OLAC – Open Language Archives Community
CMDI – Component MetaData Infrastructure

CMDI – Component MetaData Infrastructure

TITUS Metadata: HTML Format
TITUS Texts: Biblia gothica: Frame

New Metadata Set for TITUS
* Name: existing
* Author: new
* ProjectContactName: existing
* ProjectContactAddress: existing
* ProjectContact: existing
* ProjectContactOranisation: existing
* ProjectDescription: existing
* Resource.Language: new
* Resource.ResourceLink: existing
* Resource.Access.Availability: existing
* Resource.Access.Date: existing
* Resource.Access.Owner: existing
* Resource.Access.Publisher: existing
* Resource.Publication.Time.Original.Manuscript: new
* Resource.Publication.Time.Original.Facsimile: new
* Resource.Publication.Time.Original.Published: new
* Resource.Publication.Time.Electronic: existing
* Resource.Wordcount.General.Tokens: new (CLARIN)
* Resource.Wordcount.General.Types: new
* Resource.Wordcount.Language.Tokens: new
* Resource.Wordcount.Language.Types: new
* Resource.Metadata.Encoding: new

Metadata Example for TITUS – XML CMDI
Overall: 893 types (token count not preserved)
Language     Tokens   Types
1_General    10       9
2_Gothic     420      240
4_Latin      572      325
5_Greek      627      319
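As a rough illustration of how such counts could be written out as XML, the following Python sketch builds a small record from the per-language figures on this slide. The element names (`Wordcount`, `General`, `Language`, `Tokens`, `Types`) are hypothetical. They are loosely modelled on the Resource.Wordcount field names of the new metadata set and are not the actual CMDI component schema, which the source does not show. Adding up types across languages is only valid when no word form is shared between them; here the per-language figures add up to 1629 tokens and 893 types, the totals reported for the Old Testament fragments of the Gothic Bible.

```python
import xml.etree.ElementTree as ET

# (tokens, types) per language, copied from the slide.
counts = {
    "1_General": (10, 9),
    "2_Gothic": (420, 240),
    "4_Latin": (572, 325),
    "5_Greek": (627, 319),
}


def build_wordcount(counts):
    """Build a hypothetical <Wordcount> element (not the real CMDI schema)."""
    root = ET.Element("Wordcount")
    # Overall figures: sums over all languages (assumes disjoint type sets).
    general = ET.SubElement(root, "General")
    ET.SubElement(general, "Tokens").text = str(sum(t for t, _ in counts.values()))
    ET.SubElement(general, "Types").text = str(sum(y for _, y in counts.values()))
    # One <Language> element per language with its own counts.
    for name, (tokens, types) in counts.items():
        lang = ET.SubElement(root, "Language", {"name": name})
        ET.SubElement(lang, "Tokens").text = str(tokens)
        ET.SubElement(lang, "Types").text = str(types)
    return root


root = build_wordcount(counts)
print(ET.tostring(root, encoding="unicode"))
```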

Metadata for TITUS – Browser


Thank you for your attention!
Links: ARBIL (metadata editor), CLARIN, CMDI, Dublin Core, IMDI, OLAT, TEI, TITUS

Old Prussian Corpus
Tokens, general: count not preserved
Types, general: 8390 types