Linguistics with CLARIN Storing resources in CLARIN Jan Odijk LOT Winterschool Amsterdam, 2015-01-16 1.

Presentation transcript:


Overview
Why store resources in CLARIN?
How to store resources in CLARIN
– What you must do
– What the CLARIN Centre must do

Overview
→ Why store resources in CLARIN?
How to store resources in CLARIN
– What you must do
– What the CLARIN Centre must do

Why?
You may benefit from it
– Existing tools in CLARIN: faster production, better quality, more features
– Search engines, analysis tools, visualisation tools
– Your data can be easily combined with other data in CLARIN
Others may benefit from it
– Many unexpected uses of your data
– Now or in the future

Why?
Openness in Science
– Research data are usually produced with public money
Integrity
– Recently there have been many scandals with faked data
Verifiability and Replicability of research results
– Essential for the proper conduct of science
– More and more journals are requiring it
Funding Agency requires it (data management plan)
DANS CLIP on data sharing (in Dutch)

Overview
Why store resources in CLARIN?
→ How to store resources in CLARIN
– What you must do
– What the CLARIN Centre must do

How?
Start early
– Preferably before you start creating data
Contact a CLARIN Type B centre
– They can help you
– Your resource must be stored at a CLARIN centre
Which CLARIN centre?
– Check the CLARIN-NL portal
– CLARIN-NL website: Centres page
– Brief summary

Overview
Why store resources in CLARIN?
How to store resources in CLARIN
→ What you must do
– What the CLARIN Centre must do

What you must do
Define what your data are / are going to be
Ensure legal / ethical compliance
– Permission from subjects to use the data for research
– Provisions for respecting privacy matters
Determine Metadata Contents
– Determine what information should be included in the metadata (resource description) of your resource
– Collect this information

What you must do
Determine the centre
Contact them and make arrangements
Determine CLARIN-recommended standard format(s) for your data
– Consult with the CLARIN Centre
– Ask the helpdesk for help
– E.g. LMF (Lexical Markup Framework) for lexicons, cf. Cornetto, DuELME, …

What you must do
Metadata must be in CMDI format
CMDI provides
– A model for metadata
– A format for metadata
– Tools to make metadata
CMDI Metadata are written in XML
CMDI does NOT prescribe the contents of the metadata
Introduction to CMDI

What you must do
CMDI Metadata use a metadata profile
Metadata profile: a combination of
– Metadata components
– Metadata elements
Metadata Component: a combination of
– Metadata components (optional)
– Metadata elements (optional)
Metadata Element
– XML element: name, value (of an explicit type), attribute-value pairs
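The nesting of profile, component and element described above can be sketched in Python. This is a minimal illustration only: the class names, the `type` attribute and the resulting XML layout are assumptions made for the example and do not reproduce the actual CMDI schema. The profile name `ExampleLexiconProfile` and the values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Element:
    """A metadata element: a name and a value of an explicit type."""
    name: str
    value: str
    value_type: str = "string"


@dataclass
class Component:
    """A component combines (optional) sub-components and (optional) elements."""
    name: str
    components: List["Component"] = field(default_factory=list)
    elements: List[Element] = field(default_factory=list)

    def to_xml(self) -> ET.Element:
        # Illustrative XML layout only, NOT the real CMDI serialisation.
        node = ET.Element(self.name)
        for el in self.elements:
            child = ET.SubElement(node, el.name, {"type": el.value_type})
            child.text = el.value
        for comp in self.components:
            node.append(comp.to_xml())
        return node


@dataclass
class Profile(Component):
    """A profile is the top-level combination of components and elements."""


# Hypothetical example values.
general_info = Component(
    name="GeneralInfo",
    elements=[Element("Title", "Example lexicon"), Element("Version", "1.0")],
)
profile = Profile(name="ExampleLexiconProfile", components=[general_info])
xml_text = ET.tostring(profile.to_xml(), encoding="unicode")
print(xml_text)
```

Because a component may itself contain components, one recursive `to_xml` method is enough to serialise a profile of any depth.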

What you must do
This provides high flexibility
– YOU can determine the metadata for your resource
– By defining your own profiles, components, elements
CMDI helps you with
– A profile and component editor [login required]
– A list of commonly used profiles and components
– A metadata editor: ARBIL

What you must do
Flexibility requires explicit semantics!
– The CLARIN infrastructure must 'know' what you mean by your metadata elements
– Otherwise it cannot offer faceted browsing in the VLO or the Meertens Metadata Search Engine

What you must do
Explicit semantics:
– Each element in the data and metadata must have a link to a CLARIN-recognized concept or data category registry
– The most prominent data category registry in CLARIN was ISOCAT
– Example Data Category in ISOCAT
– Example Link to ISOCAT in element definition
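The requirement that every element carries a link to a recognized concept can be checked mechanically. The sketch below assumes a simple mapping from element names to concept links. The element names and the `example.org` URLs are hypothetical placeholders, not real registry entries.

```python
from typing import Dict, List, Optional

# Hypothetical element definitions: element name -> concept link.
# None marks an element whose semantics have not been made explicit yet.
element_definitions: Dict[str, Optional[str]] = {
    "Title": "https://example.org/concept/title",
    "Version": "https://example.org/concept/version",
    "Dialect": None,
}


def elements_without_concept(definitions: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of all elements that lack a concept link."""
    return sorted(name for name, link in definitions.items() if not link)


missing = elements_without_concept(element_definitions)
print(missing)
```

An empty result would mean that every element in the profile has explicit semantics.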

What you must do
Explicit semantics (2):
– RELCAT alpha version
  For relations between data categories
– SCHEMACAT alpha version
  For describing schemas of resources
– Recently changed to CLARIN Concept Registry
  CCR Browser, an OpenSKOS instance hosted by Meertens Institute
– CLAVAS Vocabulary Service
  Interface to other data category registries and vocabularies
– ISO Language codes

What you must do
Attend dedicated tutorials on CMDI and ISOCAT/CCR
– Regularly organized in NL (each at least twice a year)
Usually the CLARIN Centre helps you create the CMDI metadata
Reuse existing profiles / components as much as possible
– It will help you get better metadata
– You do not have to reinvent the wheel

What you must do
CLARIN
– strongly recommends using certain components (e.g. GeneralInfo component) and
– may require inclusion of certain properties
Do not forget properties that are 'obvious to you', e.g.
– Language
  Period of the language
  Standard language or dialects
– Title
– Version

What you must do
Add properties that are important from a linguistic perspective:
– Which linguistic annotations does the resource contain?
– For which subdisciplines of linguistics is it most relevant?
– Does it involve text, audio, video, or databases?
For tools
– What are the linguistic properties of the tool's input and output (which annotations, which annotation schemata, pos-tags, formats, etc.)?

What you must do
Live version v. exchange/archive version
– E.g. lexicon in Lexical Markup Framework compatible XML text v. XML database with indexes for fast search
– Live version is ideally derived fully automatically from the exchange/archive version
In close cooperation with the CLARIN Centre
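Deriving the live version automatically from the archive version can be illustrated with a small sketch: an in-memory lookup index is built from an XML lexicon. The XML layout (`lexicon`, `entry`, `lemma`, `pos`) is a hypothetical toy format chosen for the example. It is not LMF, and a real live version would more likely be an indexed XML database.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# Hypothetical archive version: a plain XML lexicon (toy format, not LMF).
ARCHIVE_XML = """
<lexicon>
  <entry id="e1"><lemma>bank</lemma><pos>noun</pos></entry>
  <entry id="e2"><lemma>lopen</lemma><pos>verb</pos></entry>
  <entry id="e3"><lemma>bank</lemma><pos>noun</pos></entry>
</lexicon>
"""


def build_lemma_index(xml_text: str) -> Dict[str, List[str]]:
    """Derive a lemma -> entry ids index from the archive XML."""
    root = ET.fromstring(xml_text.strip())
    index: Dict[str, List[str]] = defaultdict(list)
    for entry in root.iter("entry"):
        lemma = entry.findtext("lemma")
        if lemma:
            index[lemma].append(entry.get("id"))
    return dict(index)


lemma_index = build_lemma_index(ARCHIVE_XML)
print(lemma_index)
```

Since the index is computed entirely from the archive file, it can be regenerated whenever the archive version changes, so the two versions cannot drift apart.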

What you must do
Software
– Desktop tool
– Desktop application
– Web service: SOAP / REST, use CLAM if possible
– Web application
There must be 'metadata' for software as well!
– Generic profile exists and is being refined
Consult with the CLARIN Centre

Overview
Why store resources in CLARIN?
How to store resources in CLARIN
– What you must do
→ What the CLARIN Centre must do

What the CLARIN Centre must do
Assist you with your tasks
Assign Persistent Identifiers (PIDs) to all data and metadata
– Handle system for assignment and resolution of PIDs
– Example
Make Metadata harvestable
– OAI-PMH protocol: Open Archives Initiative Protocol for Metadata Harvesting
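Both mechanisms named above come down to well-formed URLs, which the following sketch constructs without making any network request. The handle `00000/EXAMPLE-1` and the repository address `repository.example.org` are hypothetical, and the metadata prefix `cmdi` is an assumption, since the prefix a given centre exposes may differ. `verb=ListRecords` and `metadataPrefix` are standard OAI-PMH request parameters, and `hdl.handle.net` is the public Handle proxy.

```python
from urllib.parse import urlencode

HANDLE_RESOLVER = "https://hdl.handle.net/"


def handle_url(pid: str) -> str:
    """Turn 'hdl:PREFIX/SUFFIX' or 'PREFIX/SUFFIX' into a resolver URL."""
    if pid.startswith("hdl:"):
        pid = pid[len("hdl:"):]
    return HANDLE_RESOLVER + pid


def oai_list_records_url(base_url: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical PID and endpoint, for illustration only.
print(handle_url("hdl:00000/EXAMPLE-1"))
print(oai_list_records_url("https://repository.example.org/oai", "cmdi"))
```

A harvester would fetch the second URL and follow resumption tokens to collect all metadata records. That part is left out of the sketch.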

What the CLARIN Centre must do
Store the data in the centre's repository
– LAMUS (the Language Archive) and its documentation online or as PDF
– EASY (DANS) and its Help and Support Page
Make the data themselves available and accessible in the CLARIN infrastructure

What the CLARIN Centre must do
Provisions for legal / ethical restrictions
Long term preservation
– Data Seal of Approval
(Minimal) Maintenance

What about existing data?
Adapt them to meet the CLARIN requirements
– Data Curation
CLARIN-NL
– Has financed many data curation projects
– Has set up a Data Curation Service
CLARIAH (successor project)
– Will continue these curation activities

Thanks for your attention!

DO NOT ENTER HERE

CLARIN-NL B Centres
Meertens Institute: resources relevant for the study of
– cultural expressions and language variation within the Dutch language
Max Planck Institute for Psycholinguistics (The Language Archive): resources related to the study of
– psychological, social and biological foundations of language
Huygens Institute: resources related to the study of
– history and literature of the Netherlands
Institute for Dutch Lexicology (INL): resources
– relevant to the lexicological study of the Dutch language
Data Archiving and Networked Services (DANS)
– digital research data generally

CLARIN-NL D Centres
Koninklijke Bibliotheek (National Library):
– Digital books, articles, newspapers
– Includes DBNL (Digital Library for Dutch Literature)
– (will be available in the VLO soon)
Nederlands Instituut voor Beeld & Geluid (NIBG, Netherlands Institute for Sound and Vision)
– Audio-visual data (esp. TV and radio programmes)
– NIBG data via the VLO
Utrecht University Library (UBU)
– Digital books, articles
– UBU data via the VLO

CMDI Profile: Example

CMDI Component: Example

CMDI Element: Example

ISOCAT DC

LINK to ISOCAT DC

LOGIN
CLARIN attempts to maximize open and free access to resources
– with as few restrictions as possible
– no login unless it cannot be avoided
Sometimes a login (Authentication and Authorisation Infrastructure, AAI) is required, e.g.
– Because there are legal and/or ethical restrictions on the data
– To identify you and assign you your own workspace / data
– To enter with your own personal settings

LOGIN
CLARIN is a distributed infrastructure
– How can we avoid you having to log in again and again?
– How can we avoid you having to remember many user names and passwords?
– How can we avoid CLARIN having to securely store user names, passwords and possibly other privacy-sensitive information?

LOGIN
The answer: Shibboleth
– When you log in, you are directed to a login with your own institute
– You then log in with your institute's user name and password
– The institute server then confirms that you are a trusted person, and you can enter this part of the CLARIN infrastructure
  It does not pass on any sensitive information such as your user name or password
– If you now go to another part of the CLARIN infrastructure that requires login, it 'knows' that you are already logged in, so you do not have to do this again (Single Sign On, SSO)
