Creating & Testing CLARIN Metadata Components A CLARIN-NL project Folkert de Vriend Meertens Institute, Amsterdam 18/05/2010.


Creating & Testing CLARIN Metadata Components A CLARIN-NL project Folkert de Vriend Meertens Institute, Amsterdam 18/05/2010 LREC, Malta

Project partners
- Daan Broeder
- Dieter Van Uytvanck
- Folkert de Vriend
- Laura van Eerten
- Griet Depoorter

Outline
1. What is CMDI?
2. What is the goal of our project?
3. How to go from a resource to harvestable metadata?
4. Findings of the project and future challenges

1) What is CMDI?
- The CLARIN MetaData Infrastructure (CMDI) is the infrastructure used for descriptive metadata in CLARIN (Common Language Resources and Technology Infrastructure).
- Descriptive metadata is used to characterize data resources and tools, and to facilitate their discovery and management in large (virtual) infrastructures and repositories.

Advantages of CMDI
Compared to other metadata infrastructures:
- Flexibility:
  - Researchers can decide what metadata fits their needs and use ready-made metadata components.
  - Researchers can also create new metadata components if they want.
- Complete infrastructure: software for metadata modeling, editing, harvesting and exploitation.
- Still compatible with existing frameworks: OLAC, IMDI, TEI.

Basic Component Metadata Modeling
Let's describe a sound recording. Components are added one by one and together form a metadata profile:
- Technical Metadata: Sample frequency, Format, Size, …
- Language: Name, Id, …
- Actor: Sex, Language, Age, Name, …
- Location: Continent, Country, Address, …
- Project: Name, Contact, …
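The component build-up on these slides can be sketched as a small data structure. This is only an illustration in Python, not the CMDI component specification format: the `Component` class, the `paths` helper and the profile name `SoundRecording` are hypothetical, and nesting the Language component inside Actor is an assumption based on "Language" appearing among the Actor items.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """Hypothetical stand-in for a reusable metadata component."""
    name: str
    elements: List[str] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """Return the slash-separated path of every element below this component."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        result = [f"{here}/{element}" for element in self.elements]
        for child in self.children:
            result.extend(child.paths(here))
        return result


# Components and elements as named on the slides.
language = Component("Language", ["Name", "Id"])
technical = Component("TechnicalMetadata", ["SampleFrequency", "Format", "Size"])
actor = Component("Actor", ["Sex", "Age", "Name"], children=[language])  # assumed nesting
location = Component("Location", ["Continent", "Country", "Address"])
project = Component("Project", ["Name", "Contact"])

# A profile is just a top-level component that combines the others.
profile = Component(
    "SoundRecording",
    children=[language, technical, actor, location, project],
)

for path in profile.paths():
    print(path)
```

Note that the same `language` object is used twice, once at the top level and once inside Actor, which is the kind of reuse the component approach is meant to allow.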

Main principles behind CMDI
- A component approach that is flexible and lets you design your own metadata profile.
- Semantics, however, need to be declared explicitly by making use of concepts that are stored in the ISOcat registry. This way interoperability can still be guaranteed.
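A rough sketch of why explicit concept links help: two profiles may name their elements differently, yet a harvester can still align them when both point to the same concept. Everything below is hypothetical, including the profile contents and the `concept:` identifiers, which are placeholders and not real ISOcat data category identifiers.

```python
from collections import defaultdict

# Two hypothetical profiles: element path -> linked concept (None = no link declared).
profile_a = {
    "Actor/BirthDate": "concept:birth-date",
    "Language/Name": "concept:language-name",
}
profile_b = {
    "Speaker/DateOfBirth": "concept:birth-date",
    "Lang/Label": "concept:language-name",
    "Speaker/Nickname": None,  # unlinked, so it cannot be aligned automatically
}


def group_by_concept(*profiles):
    """Group element paths from several profiles by the concept they link to."""
    groups = defaultdict(list)
    for index, profile in enumerate(profiles):
        for path, concept in profile.items():
            if concept is not None:
                groups[concept].append((index, path))
    return dict(groups)


aligned = group_by_concept(profile_a, profile_b)
for concept, members in aligned.items():
    print(concept, members)
```

Elements without a concept link, such as the made-up `Speaker/Nickname`, drop out of the alignment, which is the interoperability cost the slide warns about.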

2) What is the goal of our project?
Testing the CMDI principles by applying them to existing resources at MI and INL.

Resources at MI and INL used
- Lexical resources (with proper names, monolingual and bilingual lexica, historical and scientific dictionaries)
- Linguistic databases (with syntactical, morphological and phonological dialect variation)
- Ethnological databases (containing data about folktales, songs, probate inventories and pilgrimages)
- Corpora (spoken and written)
- Historical documents (bible texts)

3) Workflow from resource to harvestable metadata instance
From resource to harvestable metadata instance in three steps:
A. Resource analysis
B. Construction of XML metadata profiles for each granularity level present in the resource
C. Add metadata to instances
Supported by a very basic tool kit for creating schemas and instances.

Let's apply this workflow to one of the resources in the project
- Dynamic Syntactic Atlas of the Dutch dialects (DynaSAND)
- A linguistic database of speech and text that charts the syntactic variation at the clausal level in 267 dialects of Dutch spoken in the Netherlands, Belgium and North-West France.

A) Resource analysis
DynaSAND: Data, information, metadata? Granularity levels?

B) Profile construction
Construction of XML metadata profiles for each granularity level present in the resource:
- Use existing components

Existing components

B) Profile construction
Construction of XML metadata profiles for each granularity level present in the resource:
- Use existing components
- Introduce new components

New components

B) Profile construction
Construction of XML metadata profiles for each granularity level present in the resource:
- Use existing components
- Introduce new components
- Link concepts in new components to existing ISOcat concepts (ensuring semantic interoperability)

Link concepts in new components to existing ISOcat concepts

B) Profile construction
Construction of XML metadata profiles for each granularity level present in the resource:
- Use existing components
- Introduce new components
- Link concepts in new components to existing ISOcat concepts (ensuring semantic interoperability)
- Introduce new ISOcat concepts (ensuring semantic interoperability)
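The four options on this slide amount to a simple decision procedure, sketched below in Python. It is an illustration under assumptions, not the project's tooling: the function `plan_profile`, the registry contents, the concept list and the component `DialectSurvey` are all made up for the example.

```python
def plan_profile(needed, existing_components, existing_concepts):
    """Decide, per component and element, which profile-construction step applies.

    needed: mapping of component name -> list of element names the resource requires.
    existing_components: names of components already available for reuse.
    existing_concepts: names of concepts already defined in the concept registry.
    """
    plan = []
    for component, elements in needed.items():
        if component in existing_components:
            # Reused components already carry their concept links.
            plan.append(("use existing component", component))
            continue
        plan.append(("introduce new component", component))
        for element in elements:
            if element in existing_concepts:
                plan.append(("link to existing concept", element))
            else:
                plan.append(("introduce new concept", element))
    return plan


# Hypothetical inputs for illustration only.
needed = {
    "Actor": ["name", "age"],
    "DialectSurvey": ["language name", "elicitation method"],
}
existing_components = {"Actor", "Location", "Project"}
existing_concepts = {"language name", "age"}

plan = plan_profile(needed, existing_components, existing_concepts)
for step, subject in plan:
    print(f"{step}: {subject}")
```

The ordering of the checks reflects the slide: reuse is tried first, and a new concept is only proposed when neither an existing component nor an existing concept fits.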

Introduce new ISOcat concepts

Result 1: DynaSAND collection profile

Result 2: DynaSAND subcollection profile

C) Generate schemas and add metadata to instances
- B: Construction of XML metadata profiles for each granularity level present in the resource
- C: Add metadata to instances
Supported by a very basic tool kit for creating schemas and instances.

Instance for DynaSAND collection metadata
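The slide image is not preserved in this transcript. As a stand-in, here is a minimal Python sketch of what filling in an instance might look like. The element names (`CollectionName`, `DialectCount`, `Country`, `Description`) and the profile name are hypothetical and do not reproduce the actual DynaSAND profile or the CMDI instance envelope; only the values come from the resource description earlier in the talk.

```python
import xml.etree.ElementTree as ET


def build_instance(profile_name, values, required):
    """Serialize metadata values as a flat XML instance, refusing incomplete input."""
    missing = [name for name in required if not values.get(name)]
    if missing:
        raise ValueError(f"missing required metadata: {', '.join(missing)}")
    root = ET.Element(profile_name)
    for name, value in values.items():
        # A list value becomes one repeated element per item.
        items = value if isinstance(value, list) else [value]
        for item in items:
            ET.SubElement(root, name).text = str(item)
    return ET.tostring(root, encoding="unicode")


# Hypothetical element names; values taken from the DynaSAND description.
values = {
    "CollectionName": "Dynamic Syntactic Atlas of the Dutch dialects (DynaSAND)",
    "DialectCount": 267,
    "Country": ["Netherlands", "Belgium", "France"],
    "Description": "Syntactic variation at the clausal level in dialects of Dutch",
}

instance_xml = build_instance(
    "DynaSANDCollection",
    values,
    required=["CollectionName", "DialectCount"],
)
print(instance_xml)
```

In practice the instance would be validated against the schema generated from the profile; the required-field check here only hints at that step.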

Workflow from resource to harvestable metadata instance
A. Resource analysis
  - Data, information, metadata?
  - Granularity levels?
B. Construction of XML metadata profiles for each granularity level present in the resource
  - Use existing components
  - Introduce new components
  - Link concepts in new components to existing ISOcat concepts (ensuring semantic interoperability)
  - Introduce new ISOcat concepts (ensuring semantic interoperability)
C. Add metadata to instances
Supported by a very basic tool kit for creating schemas and instances. Result: from resource to harvestable metadata instance.

4) Most important findings of the project
- CMDI proved flexible enough for the resources selected at MI and INL:
  - Many existing components could be reused.
  - Where this was not possible, the framework indeed made it possible to create new components.
- This was the case for both IMDI and non-IMDI types of resources.
- A very general issue when making existing resources available through a metadata infrastructure (not CMDI-specific) is how to deal with the distinction between data, information and metadata, and with granularity levels. Advice: keep an end-user perspective (discovery and management).
- A document with best practices will be made available on the CLARIN.EU website.

Future challenges for CMDI
- Existing ISOcat concept definitions can be too specific or too broad ("birth year" versus "birth date", for instance). What if too many components and concepts are created and the semantics become too diffuse to be useful?
  - Will we need increasingly more standardization and "cleaning" effort from ISOcat in the future?
  - Will we need more ways of encouraging reuse of existing components and concepts?
  - Should we add success indicators? "This component is already being used by 1 million satisfied customers!"
  - Should we make more explicit what the benefits of reuse are? "All of these great tools can be used on your data too when you reuse components X and Y!"

Some links
- CLARIN-NL components:
- ISOcat data category registry:
- Tools for creating CMDI:
  - XML-toolkit:
  - Component registry and browser and Arbil metadata editor:

Thank you