Sharing Resources in CLARIN-NL Jan Odijk, Arjan van Hessen LRTS Workshop IJCNLP Chiang Mai, Thailand, 12 Nov 2011.

Overview
Context
Documentation
Visibility
Referability
Accessibility
Long Term Preservation
Interoperability
Conclusions

Context
CLARIN-NL
National project in the Netherlands
Budget: 9.01 million euro
Funding by NWO (National Roadmap Large Scale Infrastructures)
Coordinated by Utrecht University
24 partners (universities, royal academy institutes, independent institutes, libraries, etc.)

Context
Dutch national contribution to the Europe-wide CLARIN infrastructure
Prepared by the CLARIN preparatory project ( )
  – Also coordinated by Utrecht University
From Dec 2011 to be coordinated by the CLARIN-ERIC
  – ERIC: a legal entity at the European level specifically for research infrastructures

CLARIN infrastructure (NL)
A technical research infrastructure in which a humanities researcher who works with language-related resources
  – Can find all data relevant for the research
  – Can find all tools relevant for the research
  – Can apply the tools to the data without any technical background or ad-hoc adaptations
  – Can store data resulting from the research
  – Can store tools resulting from the research
via one portal

CLARIN infrastructure (NL)
This requires systematic sharing of resources (= data, tools, web services, …)
Systematic sharing requires
  – Documentation
  – Visibility
  – Referability
  – Accessibility
  – Long Term Preservation
  – Interoperability
of resources

CLARIN-NL subprojects
Resource curation projects
  – Curate an existing resource
Demonstrator projects
  – Curate an existing tool and supply a demonstration scenario
Number of subprojects: 21 (12-14 in 2012)
Data Curation Service
  – Offers the service of curating existing data
Where curation includes
  – Documentation, Visibility, Referability, Accessibility, Long Term Preservation, Interoperability

CLARIN-NL Centres
CLARIN infrastructure is virtual and distributed
  – CLARIN-Centres work together to implement the infrastructure
  – Each stores and makes available a part of the resources
  – Some also provide computational facilities
  – Centres must meet a list of requirements and be certified by CLARIN
Candidate CLARIN Centres in NL
  – Institute for Dutch Lexicology (INL)
  – Max Planck Institute for Psycholinguistics (MPI)
  – Meertens Institute (MI)
  – Huygens ING Institute (HI)
  – Data Archiving and Networked Services (DANS)

Infrastructure Implementation
Implementation of basic infrastructure functionality
  – Setting up authentication and authorization systems
  – Several registries (e.g. ISOCAT, RELCAT, Metadata Registry)
  – Various other infrastructure services
Search facilities
  – In resource descriptions ("metadata")
      – Centralized after metadata harvesting
  – In the data themselves
      – Via federated search
Using web services in workflow systems
  – Cooperation with Flanders
  – Based on work done in the STEVIN programme
  – (as a severe test for interoperability)

Documentation
Is always necessary, so hardly any additional effort
Partly in natural language
Partly formalized
  – Described under a particular formally identifiable attribute
  – With an explicit type for the value of the attribute
  – Possibly with further restrictions on the values (patterns, finite lists of values, constraints, etc.)
  – Represented formally and unambiguously
Any piece of documentation that can be formalized must be formalized, and must be put in the resource description (the metadata of the resource)
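
The notion of a formalized piece of documentation (a named attribute, an explicit value type, and optional restrictions) can be illustrated with a minimal Python sketch. The class `Attribute`, its fields, and the three example attributes are assumptions made for illustration; they are not part of CMDI or of any CLARIN tool.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Attribute:
    """One formalized piece of documentation (hypothetical model)."""
    name: str                           # formally identifiable attribute
    value_type: type                    # explicit type for the value
    allowed: Optional[Sequence] = None  # finite list of values, if any
    pattern: Optional[str] = None       # regular-expression restriction, if any

    def validate(self, value) -> bool:
        """Return True if the value has the right type and meets all restrictions."""
        if not isinstance(value, self.value_type):
            return False
        if self.allowed is not None and value not in self.allowed:
            return False
        if self.pattern is not None:
            # Patterns only make sense for string values.
            if not isinstance(value, str) or not re.fullmatch(self.pattern, value):
                return False
        return True


# Hypothetical attributes of a resource description.
LANGUAGE = Attribute("language", str, pattern=r"[a-z]{3}")
MODALITY = Attribute("modality", str, allowed=("spoken", "written"))
SPEAKERS = Attribute("numberOfSpeakers", int)
```

With these definitions, `LANGUAGE.validate("nld")` succeeds while `LANGUAGE.validate("Dutch")` fails, which is the kind of unambiguous check that free-text documentation cannot offer.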

Documentation
Resource Descriptions
  – Component-based MetaData Infrastructure (CMDI)
  – One can define resource profiles as collections of components (which can themselves contain components)
  – Many generally usable components are available
  – Resource profiles for most common resources are available
  – Component-based → flexibility
  – Danger of flexibility: diversity, no interoperability
  – Controlled by semantic interoperability (see below)
  – Not yet available but needed: profile(s) for tools
Supported by tools
  – Component and profile editors
  – Component and profile registries
  – Metadata editor
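
The idea that a profile is a collection of components, which can in turn contain components, can be sketched as a small recursive data structure. This is only an illustration in Python: real CMDI components and profiles are defined in XML, and the names `Component`, `SpokenCorpusProfile`, `Access` and `Contact` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A reusable metadata component (hypothetical model, not the CMDI schema)."""
    name: str
    elements: List[str] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """List every element reachable from this component as a slash-separated path."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        found = [f"{here}/{element}" for element in self.elements]
        for child in self.components:
            found.extend(child.paths(here))
        return found


# A profile is simply the top-level component; smaller components are reused inside it.
contact = Component("Contact", elements=["Name", "Email"])
access = Component("Access", elements=["Licence"], components=[contact])
corpus_profile = Component(
    "SpokenCorpusProfile",
    elements=["Title", "Language"],
    components=[access],
)
```

Because `contact` and `access` are ordinary values, the same component can be placed in several profiles, which is where both the flexibility and the risk of diversity come from.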

Visibility
Each resource and its resource description must be stored at a CLARIN-centre
CLARIN-centres make resource descriptions available for metadata harvesting (using OAI-PMH)
Via harvesting, the metadata become available in the CLARIN resource catalogue
  – Browsing via the Virtual Language Observatory (VLO) using faceted browsing
  – Search via a search interface (under development)
      – In the metadata and in the data
      – String search and structured search
      – Results, if desired, collected in a Virtual Collection
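
Metadata harvesting over OAI-PMH comes down to an HTTP request with a `verb` parameter and an XML response. The Python sketch below builds a `ListRecords` request URL and extracts record identifiers from a response. The endpoint `https://centre.example.org/oai`, the sample response and its identifiers are invented for illustration; `oai_dc` is used as the metadata prefix because the protocol requires every repository to support it, and a centre offering CMDI records would be harvested with whatever prefix it advertises for them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL of an OAI-PMH ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_identifiers(response_xml: str) -> list:
    """Extract the identifier of every record header in a ListRecords response."""
    root = ET.fromstring(response_xml)
    headers = root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    return [header.text for header in headers]


# Invented, heavily shortened response; a real one also carries the metadata itself.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:centre.example.org:resource-1</identifier></header></record>
    <record><header><identifier>oai:centre.example.org:resource-2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://centre.example.org/oai")
identifiers = record_identifiers(SAMPLE_RESPONSE)
```

A harvester would fetch `url`, parse each response this way, and follow resumption tokens until the centre reports no more records; those steps are left out to keep the sketch offline.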

Referability
By name or title is not sufficient
  – All the problems that natural language poses for communication:
      – Not always unique (ambiguity)
      – Language-specific: Corpus Gesproken Nederlands
          – Variants in other languages: Spoken Dutch Corpus
          – Limited knowledge of the foreign language → variants: Corpus Spoken Dutch, Dutch Spoken Corpus
      – Long, too redundant
          – Abbreviations/acronyms: CGN
      – Invites errors
          – Spoken Dutch Cropus, Spken Dutch Corpus
URLs
  – Still too long/redundant (unless one uses shortened URLs)
  – Unstable, volatile
Persistent Identifiers (PIDs) are needed

Referability
PIDs
Each CLARIN-Centre
  – Must assign a PID to each resource (and/or to subresources)
  – Must keep the PID resolution registry up to date
PID systems
  – Handle (preferred)
  – URN
  – Perhaps others (e.g. DOI)
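
Why a PID stays stable while a URL does not can be shown with a toy resolution registry: references use the PID, and only the registry entry changes when a resource moves. The class `PidRegistry`, the handle `12345/example-corpus` and the URLs are hypothetical. The only real-world detail assumed is that handles can be resolved through the public proxy at `hdl.handle.net`.

```python
class PidRegistry:
    """Toy PID resolution registry kept by a centre (hypothetical, in-memory only)."""

    def __init__(self):
        self._locations = {}

    def assign(self, pid: str, url: str) -> None:
        """Assign a new PID; a PID is never reassigned to another resource."""
        if pid in self._locations:
            raise ValueError(f"PID already assigned: {pid}")
        self._locations[pid] = url

    def move(self, pid: str, new_url: str) -> None:
        """Record a new location; the PID itself does not change."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the current location of the resource."""
        return self._locations[pid]


def handle_proxy_url(handle: str) -> str:
    """URL at which the public Handle proxy resolves a handle."""
    return "https://hdl.handle.net/" + handle


registry = PidRegistry()
registry.assign("12345/example-corpus", "https://old.example.org/corpora/example")
registry.move("12345/example-corpus", "https://new.example.org/data/example")
```

Citations that used `12345/example-corpus` still work after the move, whereas citations of the old URL would be broken.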

Accessibility
CLARIN infrastructure
  – Accessible at any time and from any place
IPR
  – CLARIN-NL promotes maximal open access of resources
  – Is working on plans to implement policies and functionality to properly handle IPR and ethical restrictions
Researchers' mindset
  – Many researchers in the humanities are hesitant or even unwilling to share their resources with others
  – How to resolve this? With a carrot and a stick
      – CLARIN must accommodate reasonable wishes
      – CLARIN must prove benefits for researchers who put their resources there
      – Funding agencies must oblige researchers to do so (partly already the case)

Long Term Preservation
Necessary to make sure the resources can be shared with future researchers (who may include the producer!)
Each CLARIN-Centre is obliged to ensure long term preservation
Usually outsources this to specialized centres
  – MI outsources to DANS
  – MPI outsources to an internal Max Planck Gesellschaft organisation

Interoperability
Interoperability of resources is the ability of resources to work together seamlessly
  – No manual ad-hoc adaptations
  – Adaptations occur automatically "behind the scenes"
Need for interoperability is high
  – Humanities researchers do not have the required technical background
Interoperability
  – Syntactic interoperability and semantic interoperability
Each subproject must try to achieve interoperability
  – Report any problems and make suggestions for adaptations
  – So that the resources are adapted to the infrastructure (in some cases) and vice versa (in other cases)
Not easy, but the only way to get further is to actually try this and learn from it

Syntactic Interoperability
The formats of data are selected from a limited set of (de facto) standards or best practices supported by CLARIN
Software tools and applications take input and yield output in these formats
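
Syntactic interoperability can be checked mechanically once formats come from a limited set and each tool declares what it accepts and produces. The Python sketch below tests whether a chain of tools can run on a given input. The format names, the `Tool` class and the two example tools are hypothetical and do not reflect the actual CLARIN list of supported standards.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical stand-in for the limited set of supported formats.
SUPPORTED_FORMATS = frozenset({"plain-text", "tokenised-xml", "pos-tagged-xml"})


@dataclass(frozen=True)
class Tool:
    name: str
    accepts: str   # input format
    produces: str  # output format


def chain_problems(data_format: str, tools: Sequence[Tool]) -> List[str]:
    """Return a list of reasons why the tools cannot be chained; empty means they can."""
    problems = []
    current = data_format
    for tool in tools:
        for fmt in (tool.accepts, tool.produces):
            if fmt not in SUPPORTED_FORMATS:
                problems.append(f"{tool.name}: format {fmt!r} is not in the supported set")
        if tool.accepts != current:
            problems.append(f"{tool.name}: expects {tool.accepts!r} but receives {current!r}")
        current = tool.produces
    return problems


tokeniser = Tool("tokeniser", accepts="plain-text", produces="tokenised-xml")
tagger = Tool("tagger", accepts="tokenised-xml", produces="pos-tagged-xml")
```

Here `chain_problems("plain-text", [tokeniser, tagger])` returns an empty list, while swapping the two tools yields two mismatch messages. A workflow system would need to run this kind of check before executing a chain.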

Semantic Interoperability
Focus on the semantics of Data Categories (DCs)
A privileged data category registry (DCR) is set up containing DCs:
  – Unique persistent identifiers for DCs (PIDs)
  – Their semantics
  – A definition
  – Examples
  – Lexicalizations in various languages
Each resource-specific DC is mapped to a DC from the privileged DCR
  → Every researcher can use his/her own DCs
  → Different DCs from different resources can be interpreted as identical in meaning, via the DC of the privileged DCR
In CLARIN-NL multiple (complementary) privileged DCRs are allowed. The primary one is ISOCAT
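
The mapping step can be sketched as a lookup table from resource-specific DCs to PIDs in the privileged registry: two local DCs count as identical in meaning when they map to the same registry DC. Everything in this Python sketch is hypothetical, including the resource names, the local tags and the `dcr:DC-…` identifiers, which do not correspond to real ISOCAT entries.

```python
from typing import Dict, Tuple

# (resource, local data category) -> PID of the DC in the privileged registry.
# All entries are invented for illustration.
DC_MAPPING: Dict[Tuple[str, str], str] = {
    ("ResourceA", "N"): "dcr:DC-0001",
    ("ResourceB", "noun"): "dcr:DC-0001",
    ("ResourceA", "WW"): "dcr:DC-0002",
    ("ResourceB", "verb"): "dcr:DC-0002",
}


def same_meaning(first: Tuple[str, str], second: Tuple[str, str]) -> bool:
    """True if both local DCs are mapped to the same DC of the privileged registry."""
    pid_first = DC_MAPPING.get(first)
    pid_second = DC_MAPPING.get(second)
    # Unmapped DCs cannot be compared, so they never count as identical.
    return pid_first is not None and pid_first == pid_second
```

For example, `same_meaning(("ResourceA", "N"), ("ResourceB", "noun"))` is true even though the two resources use different labels, so a search for nouns can cover both without either resource changing its own tag set.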

Semantic Interoperability
Achieving semantic interoperability is very hard
  – Many DCs are almost identical (principled/pragmatic/arbitrary reasons)
  – Some DCs in ISOCAT are not defined clearly
  – There are many similar DCs in ISOCAT
  – Relevant DCs are not easy to find in ISOCAT
Three actions taken
  – Held several workshops to discuss problems
  – Appointed a coordinator to deal with problems
  – Decided to implement the RELCAT registry to specify relations between DCs
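
A registry of relations between DCs can help with near-duplicates: once relations are recorded, DCs that are declared equivalent can be grouped automatically. The Python sketch below stores relations as triples and collects everything reachable from a DC through accepted relation types. The relation names `sameAs` and `almostSameAs`, the identifiers and the function `equivalents` are assumptions for illustration and are not taken from the RELCAT design.

```python
from collections import defaultdict
from typing import Iterable, Sequence, Set, Tuple

# (DC, relation type, DC); all identifiers and relation names are invented.
RELATIONS = [
    ("dcr:DC-0001", "sameAs", "dcr:DC-0107"),
    ("dcr:DC-0107", "sameAs", "dcr:DC-0311"),
    ("dcr:DC-0002", "almostSameAs", "dcr:DC-0042"),
]


def equivalents(
    dc: str,
    relations: Iterable[Tuple[str, str, str]],
    accepted: Sequence[str] = ("sameAs",),
) -> Set[str]:
    """Return dc plus every DC linked to it, directly or indirectly, by an accepted relation."""
    neighbours = defaultdict(set)
    for left, relation, right in relations:
        if relation in accepted:
            # Relations are treated as symmetric in this sketch.
            neighbours[left].add(right)
            neighbours[right].add(left)
    seen = {dc}
    todo = [dc]
    while todo:
        current = todo.pop()
        for candidate in neighbours[current]:
            if candidate not in seen:
                seen.add(candidate)
                todo.append(candidate)
    return seen
```

Passing `accepted=("sameAs", "almostSameAs")` widens the grouping, so each application can decide how strict it needs to be about "almost identical" DCs.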

Conclusions
CLARIN-NL requires systematic sharing of resources
Therefore requires researchers to work on
  – Documentation
  – Visibility
  – Referability
  – Accessibility
  – Long Term Preservation
  – Interoperability
of resources
For certain aspects this is relatively easy, but it must be done
For other aspects this is very hard, but it must be done so that we can learn
The approach described here may be a model for other countries working on the CLARIN infrastructure
It may be a model for other resource sharing facilities (e.g. META-SHARE)

Thanks for your attention!