Metadata & CMDI CLARIN Component Metadata Infrastructure Daan Broeder et al. Max-Planck Institute for Psycholinguistics CLARIN NL CMDI Metadata Tutorial.

Presentation transcript:

Metadata & CMDI CLARIN Component Metadata Infrastructure Daan Broeder et al. Max-Planck Institute for Psycholinguistics CLARIN NL CMDI Metadata Tutorial March 4, Utrecht

CLARIN metadata background
Since 2007 CLARIN has investigated and created solutions for:
– Common AAI infrastructure
– Single system of persistent identifiers (PIDs) for resources
– Common metadata domain: CMDI
– …
CMDI was/is developed by CLARIN partners: Austrian Academy, IDS, MPI for Psycholinguistics, Språkbanken (Univ. of Gothenburg)
National CLARIN projects (CLARIN-NL, -DE, -DK, -AT, -PL) have committed resources to work on CMDI
ISO TC37 standardization procedure underway
Mandatory for metadata exchange within the CLARIN infrastructure

Metadata in General
Data about data; more precisely, structured data about data:
– So not a prose description (although that can be a part)
– … but keyword/value type of data: Name = “myresource”, Title = “mybook”, Creator = “me”
Nomenclature:
– Set of such keys is a metadata set
– The keys are called metadata elements, attributes or descriptors
– Metadata set or schema (also a format specification)
Used for:
– Resource discovery / accessing
– Management
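As a small illustration of the keyword/value idea, the Python sketch below stores one metadata record as a dictionary, checks it against a metadata set, and performs the simplest possible form of resource discovery. The names `METADATA_SET`, `unknown_keys` and `find` are made up for this example and are not part of any CLARIN tool.

```python
# A metadata set: the keys a record may use (illustrative, taken from the slide).
METADATA_SET = {"Name", "Title", "Creator"}

# A metadata record: keyword/value pairs.
record = {"Name": "myresource", "Title": "mybook", "Creator": "me"}


def unknown_keys(rec, metadata_set):
    """Return the keys in rec that are not part of the metadata set, sorted."""
    return sorted(set(rec) - metadata_set)


def find(records, key, value):
    """Resource discovery in its simplest form: exact match on one element."""
    return [r for r in records if r.get(key) == value]
```

A record that only uses keys from the set gives an empty list from `unknown_keys`; a record with a key such as `Colour` would be reported.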

Metadata for Language Resources I
Resource types:
– Video, audio, pictures, annotations, primary texts, notes, grammars, lexica, …
Application:
– Resource discovery, management, resource processing, …
Different levels of description (granularity):
– complete corpora, e.g. the Brown Corpus
– sub-corpora or corpus components, e.g. all Flemish recordings in the Spoken Dutch Corpus
– (recording) sessions, e.g. the recording of a dialogue (sound file + transcript)
– individual resources, e.g. a text file

Metadata for Language Resources II
Metadata was/is often embedded in annotations:
– CHAT format
– TEI header
Advantage of splitting this:
– Independent formats allowing different combinations of metadata with annotations
– Keep different versions of metadata records for different metadata environments or frameworks
… but danger of inconsistencies
In some cases not all metadata can be factored out of the annotation

@Languages: eng, TEX Participant eng, Cristina
*TEX: hello my name is Laura.
*TEX: white, the television.
*TEX: tall.
*TEX: bicycle.
*TEX: very

LR Metadata Landscape before CLARIN
Fragmented landscape
Metadata sets, schemas & infrastructures in the LR domain:
– IMDI, OLAC/DCMI, TEI, …
Problems with current solutions:
– Inflexible: too many (IMDI) or too few (OLAC) metadata elements
– Limited interoperability, both semantic and functional
– Problematic (unfamiliar) terminology for some sub-communities
– Limited support for LT tool & service descriptions

Metadata Components
CLARIN chose a component approach: CMDI
– NOT a single new metadata schema
– but rather allow coexistence of many (community/researcher) defined and controlled schemas
– with explicit semantics for interoperability
How does this work?
– Components are bundles of related metadata elements that describe an aspect of the resource
– A complete description of a resource may require several components
– Components may use and contain other components
– Components should be designed for reusability

Metadata Components: let’s describe a speech recording
Each component bundles related elements:
– Technical Metadata: Sample frequency, Format, Size, …
– Language: Name, Id, …
– Actor: Sex, Language, Age, Name, …
– Location: Continent, Country, Address, …
– Project: Name, Contact, …
Together the components (component definition XML) form a metadata profile (profile definition XML). The profile yields the metadata schema (W3C XML Schema), and each metadata description is an XML file.

Recursive Component model
– Components can contain other components
– Enhances reusability
[Diagram components: Project, Actor, Location, Address, Actor, Language]
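The recursive model can be sketched in Python as a component that holds both elements and other components. This is only an illustration: the class `Component`, the function `element_paths` and the profile name `SpeechRecording` are hypothetical, and nesting Language inside Actor is an assumption based on the Actor slide listing a Language entry.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    elements: list = field(default_factory=list)    # element names
    components: list = field(default_factory=list)  # nested Component objects


def element_paths(component, prefix=""):
    """List every element as a slash-separated path, depth first."""
    here = f"{prefix}/{component.name}" if prefix else component.name
    paths = [f"{here}/{e}" for e in component.elements]
    for child in component.components:
        paths.extend(element_paths(child, here))
    return paths


# The Language component is defined once and reused in two places.
language = Component("Language", ["Name", "Id"])
location = Component("Location", ["Continent", "Country", "Address"])
actor = Component("Actor", ["Name", "Age", "Sex"], [language])
profile = Component("SpeechRecording", [], [actor, location, language])

for path in element_paths(profile):
    print(path)
```

Defining `language` once and referencing it from both `actor` and `profile` is the point of the design: a change to the shared component is picked up everywhere it is used.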

CMDI Component Reuse
Selecting metadata components from the registry: the user selects appropriate components to create a new metadata profile, or selects an existing profile.
[Diagram: component registry with components Location (Country, Coordinates), Actor (BirthDate, MotherTongue), Text (Language, Title), Tool (CreationDate, Type), Dance (Name, Type)]
At this moment existing profiles & components are recommendations:
– Profiles & components are created by researchers & metadata modelers
– Reuse is strongly encouraged but not yet enforced

Concept registries
Basically a list of concepts and their definitions, in which every concept has a unique identifier.
– Some have a complicated structure and are associated with elaborate (administrative) processes to determine the status and acceptance of concepts in the registry, e.g. ISO-DCR.
– Others are static, simple lists of concepts and descriptions, e.g. DCTERMS.

CMDI Explicit Semantics
Selecting metadata components from the registry: the user selects appropriate components to create a new metadata profile, or selects an existing profile.
Semantic interoperability is partly solved via references to the ISO DCR or another registry:
– ISOcat concept registry: BirthDate dcr:1000, Country dcr:1001, Language dcr:1002
– DCMI concept registry: Title dc:title
[Diagram: component registry with components Location (Country, Coordinates), Actor (BirthDate, MotherTongue), Text (Language, Title), Tool (CreationDate, Type), Dance (Name, Type), whose elements point to the concept registries]
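A rough Python sketch of how explicit semantics helps interoperability: two profiles use different element names, but elements that reference the same concept identifier can be matched. The identifiers are the placeholder values shown on the slide, not real ISOcat entries, and `PROFILE_B` with its element names is invented for the example.

```python
# Element name -> concept identifier, one mapping per profile.
PROFILE_A = {
    "BirthDate": "dcr:1000",
    "Country": "dcr:1001",
    "Language": "dcr:1002",
    "Title": "dc:title",
}
# Hypothetical second profile with different element names.
PROFILE_B = {
    "DateOfBirth": "dcr:1000",
    "Land": "dcr:1001",
    "Name": "dc:title",
}


def shared_concepts(a, b):
    """Map each concept used by both profiles to (element in a, element in b)."""
    by_concept_b = {concept: element for element, concept in b.items()}
    return {
        concept: (element, by_concept_b[concept])
        for element, concept in a.items()
        if concept in by_concept_b
    }


for concept, (elem_a, elem_b) in shared_concepts(PROFILE_A, PROFILE_B).items():
    print(f"{concept}: {elem_a} <-> {elem_b}")
```

A search tool could use such a mapping to query `BirthDate` and `DateOfBirth` together even though the element names differ.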

Relation Registry
User selects or creates a profile that specifies relations between concepts, e.g. dcr:1020 = dcr:1030, dcr:1020 ~ dcr:1030 or dcr:1020 > dcr:1030. Used by the user’s MD search.
[Diagram: component registry with components Recording (CreationDate, Type), Dance (Name, Type), Text 1 (Language, Title, Genre1) and Text 2 (Language, Title, Genre2); ISOcat with Genre 1 dcr:1020, Genre 2 dcr:1030, Language dcr:1002; Relation Registry]
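One way a search tool might use such relations is query expansion: a search on one concept is widened to the concepts related to it. The sketch below is an assumption-laden illustration. The slide does not define the relation symbols, so "=" is read here as "same as" and "~" as "almost the same as", both treated as symmetric; ">" is left out of the default because it presumably has a direction. The concept `dcr:1040` and the function `expand` are hypothetical.

```python
# (left concept, relation, right concept); dcr:1040 is invented for the example.
RELATIONS = [
    ("dcr:1020", "=", "dcr:1030"),
    ("dcr:1030", "~", "dcr:1040"),
]


def expand(concept, relations, accepted=("=",)):
    """Return concept plus all concepts reachable via the accepted relation types.

    Accepted relations are followed in both directions, so only pass
    relation types that are meant to be symmetric.
    """
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for left, rel, right in relations:
            if rel not in accepted:
                continue
            for src, dst in ((left, right), (right, left)):
                if src == current and dst not in found:
                    found.add(dst)
                    frontier.append(dst)
    return found


print(sorted(expand("dcr:1020", RELATIONS)))
print(sorted(expand("dcr:1020", RELATIONS, accepted=("=", "~"))))
```

Making the accepted relation types a parameter lets each user decide how loosely concepts may be equated, which matches the idea that users select or create their own relation sets.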

CMDI Record Structure
Some basic structure is needed in CMDI records:
– Header with administrative information
– Metadata components with descriptive metadata
– ResourceProxies with typed references to the described resources (Resource Proxy 1 … Resource Proxy n, each with a type and a link)
Type:
– Resource
– Landing Page
– Metadata
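The three-part structure can be sketched with Python's standard `xml.etree.ElementTree`. This is a simplified illustration, not a valid CMDI record: the element names are modeled on the structure described above, namespaces and other required parts of the real schema are left out, and the URLs and the function `build_record` are made up.

```python
import xml.etree.ElementTree as ET


def build_record(creator, proxies, components):
    """Build a simplified CMDI-like record.

    proxies:    list of (id, type, link) tuples
    components: dict of component name -> dict of element name -> value
    """
    root = ET.Element("CMD")

    # 1. Header with administrative information.
    header = ET.SubElement(root, "Header")
    ET.SubElement(header, "MdCreator").text = creator

    # 2. Resource proxies: typed references to the described resources.
    resources = ET.SubElement(root, "Resources")
    proxy_list = ET.SubElement(resources, "ResourceProxyList")
    for proxy_id, proxy_type, link in proxies:
        proxy = ET.SubElement(proxy_list, "ResourceProxy", id=proxy_id)
        ET.SubElement(proxy, "ResourceType").text = proxy_type
        ET.SubElement(proxy, "ResourceRef").text = link

    # 3. Components with the descriptive metadata.
    comps = ET.SubElement(root, "Components")
    for comp_name, elements in components.items():
        comp = ET.SubElement(comps, comp_name)
        for elem_name, value in elements.items():
            ET.SubElement(comp, elem_name).text = value
    return root


record = build_record(
    "me",
    [
        ("r1", "Resource", "https://example.org/recording.wav"),
        ("r2", "LandingPage", "https://example.org/recording"),
    ],
    {"Language": {"Name": "Dutch"}},
)
print(ET.tostring(record, encoding="unicode"))
```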

CMDI Collection Modeling
Hierarchy of sub-collections
[Diagram: a tree of MD (metadata record) and R (resource) nodes]

CMDI Philosophy
CMDI takes an archivist or “production first” viewpoint:
– Prioritize that the metadata can be of good quality: consistent, coherent, correctly linked to the concept registries
– The consumer side can be more “experimental” and diverse
– Many MD exploitation “stacks” or consumer applications can work in parallel on the same metadata

Metadata Actors & Entities
– Metadata users use metadata to find resources. Product: suitable resource
– Metadata creators create metadata to describe resources. Product: metadata description of a resource
– Metadata curators update metadata descriptions for maintenance. Product: metadata description of a resource
– Metadata modelers create metadata schemas and/or terminology. Product: metadata schema with explicit terminology
– Metadata repository: facility for managing metadata descriptions
– Metadata catalogue: software that allows users to search & browse in metadata

CMDI Metadata life-cycle
– Metadata modeler: creates a metadata schema from a selection of existing components, using the component registry & editor. Creation of new components is allowed if they have references to ISOcat.
– Metadata creator and metadata curator: metadata descriptions are created with a metadata editor and kept in a local metadata repository.
– Metadata harvesting by the OAI-PMH protocol: from the OAI-PMH data provider (local metadata repository) to the OAI-PMH service provider (joint metadata repository).
– Metadata user: performs search/browsing on the metadata catalogue, using the ISO DCR and other concept registries and the CLARIN relation registry (search & semantic mapping).
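The harvesting step can be illustrated by building an OAI-PMH `ListRecords` request URL in Python. Only the URL is constructed; no network call is made. The base URL is a placeholder, and the prefix value `cmdi` is an assumption: a data provider advertises the prefixes it actually supports through the `ListMetadataFormats` verb.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # When continuing a partial list, resumptionToken is the only
        # argument sent besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; a real harvester would read the provider's base URL
# from its configuration.
print(list_records_url("https://example.org/oai", "cmdi"))
```

A service provider would fetch this URL, store the returned records in the joint repository, and repeat the request with the resumption token from each response until none is returned.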

CMDI backward compatibility
There is a ‘huge’ installed base of metadata records available for harvesting: OLAC, IMDI, DC
The CMDI component registry was seeded with:
– IMDI profile
– DC/OLAC profile
– META-SHARE
Specialist IMDI profiles for sign language, bilingualism, ... were developed within some CLARIN NL projects
Communities used to these schemas can keep working with them
Others may need assistance to convert their metadata schema

Current CMDI status I
– ISO-DCR: ±1200 metadata concepts
– CMDI component registry: ±850 components, 150 profiles
Produced & inspired by:
– Deconstructing existing metadata schemas: IMDI, OLAC, TEI
– CLARIN NL call 1, 2, 3 projects
– CLARIN EU work

Current CMDI status II
CMDI production:
– ISOcat DCR
– Component registry & editor
– ARBIL metadata editor, and other options
CMDI exploitation:
– VLO (metadata catalogue) and other options (MIS)
– Relation Registry
– YAMS (Yet Another Metadata Search)
– Virtual Collection Registry

Thank you for your attention