Roy Lowry British Oceanographic Data Centre.  Controlled Vocabularies - What and Why  Controlled Vocabularies - History  Controlled Vocabularies -

Presentation transcript:

Roy Lowry British Oceanographic Data Centre

 Controlled Vocabularies - What and Why  Controlled Vocabularies - History  Controlled Vocabularies - SeaDataNet  Controlled Vocabularies - Mappings  Controlled Vocabularies - Future

 Controlled Vocabulary (CV)  A collection of concepts that may legally populate a given field in a data or metadata model.  A concept is an instance of the real world entity modelled by that field - e.g. Instrument, parameter.  Concept Labelling  Machine readable label - code, URI (URN or URL)  Human readable labels - name, abbreviation, definition

 Why?  Alternative to CV is plain language text which is subject to:  Spelling errors - e.g. Macoma baltica for Macoma balthica  Entity abuse - e.g. Sea-Bird SBE9-11+ in a parameter field  CVs and concepts may be incorporated into knowledge management infrastructure and linked semantically to build ontologies.  Smart discovery  AI-driven data aggregation

 In the 1980s there was IODE GETADE who developed the GF3 code tables  Thorough content governance with concepts well defined and their scope carefully considered  Published by IODE as a book in five languages  Result is a beautiful piece of work that cannot be maintained.

  In the 1990s GETADE waned as funding was squeezed  Some vocabulary governance moved to individuals  Poor judgement on what new entries should be allowed led to vocabulary abuse (e.g. turning a 1:1 data model into 1:many by adding a list as a vocabulary concept)  Some vocabularies moved to local management  Like Galapagos finches they evolved into entities that were similar but significantly different  Unlike Galapagos finches many variants retained the same name!

 SEASEARCH - content governance delegated to individuals, but realisation that vocabulary management needed to be centralised with a master copy universally accessible 24/7.  SeaDataNet/NERC DataGrid - developed the NERC Vocabulary Server at BODC to deliver this.  Accessible vocabularies with clear entity definitions  Every concept given a URN that resolves into a URL that delivers an RDF XML document  Basis of the Semantic Web

  SeaDataNet makes extensive use of CVs in its metadata models and data formats  Each CV targets one or more fields in these models/formats  List of SeaDataNet CVs may be found at  Ignore the Mxx entries: they are hosted by SeaDataNet on behalf of MEDIN. That leaves just 64!  Common practice is to use the 3-character code in the 'Library' column as the CV name, e.g. P01, P02, L05, L22

  SeaDataNet Controlled Vocabularies may be accessed in one of five ways:  Human readable forms  Maris client library  Maris client thesaurus  BODC thesaurus (concept scheme), best viewed in Chrome  Machine readable forms  RESTful interface to CV or concept (RDF XML)  SOAP interface

 RESTful Syntax  Base is (returns an RDF XML catalogue of all 263 CVs in NVS)  To this we add the 3-byte vocabulary ID plus 'current' e.g. P03/current/ (returns all concepts in that CV in RDF XML)  To this we can add  'accepted' (returns all valid concepts in that CV in RDF XML)  'deprecated' (returns all deprecated concepts in that CV in RDF XML)  'all' (returns all concepts in that CV in RDF XML)  Concept code (returns concept document in RDF XML)
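
The syntax above can be sketched as a small URL builder. This is a minimal illustration, not the server's own client code: the base URL is missing from the slide text, so `BASE_URL` below is a hypothetical placeholder, and the function name `nvs_url` and the trailing slash after the final segment are assumptions.

```python
from typing import Optional

# Hypothetical placeholder: the real base URL is not given in the slide text.
BASE_URL = "https://vocab.example.org/collection/"


def nvs_url(vocab_id: Optional[str] = None, tail: Optional[str] = None,
            base_url: str = BASE_URL) -> str:
    """Build a URL for the catalogue, a vocabulary, or a single concept.

    tail is either 'accepted', 'deprecated', 'all', or a concept code.
    """
    base = base_url.rstrip("/") + "/"
    if vocab_id is None:
        return base                      # catalogue of all vocabularies
    if len(vocab_id) != 3:
        raise ValueError("vocabulary IDs are three characters, e.g. 'P03'")
    url = f"{base}{vocab_id}/current/"   # all concepts in that vocabulary
    if tail is None:
        return url
    return f"{url}{tail}/"               # filtered list or one concept


if __name__ == "__main__":
    print(nvs_url("P03"))
    print(nvs_url("P03", "deprecated"))
    print(nvs_url("P01", "CNDCZZ01"))
```

Passing the real base URL as `base_url` is all that should be needed to point the sketch at a live server.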

  RDF XML Concept Document (deprecated concept)  Electrical conductivity of the water body  WC_Cond  This is an obsolete term for this definition. Use CNDCZZ01 instead.  SDN:P01::PCONZZ  :48:35.0  deprecated  <owl:deprecated>true</owl:deprecated>
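
A client will usually want to test the deprecation flag before using a concept. The sketch below does this with Python's standard XML parser. `SAMPLE` is a hand-written stand-in, not a real server response: its layout, its URL, and the code `EXAMPLE1` are assumptions, and only the `owl:deprecated` element and the label text come from the slide.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "owl": "http://www.w3.org/2002/07/owl#",
}

# Hypothetical concept document; the real layout may differ.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <skos:Concept rdf:about="https://vocab.example.org/collection/P01/current/EXAMPLE1/">
    <skos:prefLabel>Electrical conductivity of the water body</skos:prefLabel>
    <skos:altLabel>WC_Cond</skos:altLabel>
    <owl:deprecated>true</owl:deprecated>
  </skos:Concept>
</rdf:RDF>"""


def is_deprecated(rdf_xml: str) -> bool:
    """Return True when the document carries <owl:deprecated>true</owl:deprecated>."""
    flag = ET.fromstring(rdf_xml).find(".//owl:deprecated", NS)
    return flag is not None and (flag.text or "").strip().lower() == "true"


def pref_label(rdf_xml: str) -> str:
    """Return the first skos:prefLabel, or an empty string if there is none."""
    root = ET.fromstring(rdf_xml)
    return root.findtext(".//skos:prefLabel", default="", namespaces=NS)


if __name__ == "__main__":
    print(pref_label(SAMPLE), "deprecated:", is_deprecated(SAMPLE))
```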

 Mapping strategy depends upon workflow order  What comes first - CDI record or Data file?  CDI record first  Parameters and instruments for CDI assigned by manually mapping local vocabularies to P02 and L05  Parameters and instruments for data file assigned by manually mapping local vocabularies to P01 and L22  Data file first  Parameters and instruments for data file assigned by manually mapping local vocabularies to P01 and L22  Parameters and instruments for CDI automatically obtained using P01/P02 and L05/L22 mappings in NVS

  Manual Mapping Techniques  Library Text Search  Input a string into the 'Free search' box and press 'Search'  Wildcard character is '%' for 1 or more characters  Search is case-insensitive  Wildcard automatically added to start and end of string  'Microzooplankton taxonomy-related biosurface area per unit volume of the water column' found by search for 'zooplankton'
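
The search rules above can be imitated locally to check a search string against a known label. This is a sketch of the matching rules only, not the search client's implementation. One assumption is labelled in the code: '%' is treated as matching zero or more characters, as in SQL LIKE, so that the automatically added leading wildcard still allows a hit at the very start of a label.

```python
import re


def like_to_regex(search: str) -> re.Pattern:
    """Translate a '%' wildcard search into a case-insensitive regex."""
    wrapped = f"%{search}%"              # wildcard added to start and end
    # Assumption: '%' matches zero or more characters (SQL LIKE behaviour).
    parts = [re.escape(part) for part in wrapped.split("%")]
    return re.compile(".*".join(parts), re.IGNORECASE | re.DOTALL)


def matches(search: str, label: str) -> bool:
    """Return True when the whole label satisfies the wrapped search string."""
    return like_to_regex(search).fullmatch(label) is not None


if __name__ == "__main__":
    label = ("Microzooplankton taxonomy-related biosurface area "
             "per unit volume of the water column")
    print(matches("zooplankton", label))
    print(matches("volume%zooplankton", label))
```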

  Manual Mapping Techniques  Library Text Search  Hardest vocabulary to map to is P01 because it's big (currently concepts)  Planned construction of search strings can help  P01 concept labels can be long and complex  BUT they are constructed using a semantic model so information is always presented in the order  What  Substance name then synonyms  What to where relationship  Where  How

 Manual Mapping Techniques  Consider a search for PCB183 in 'standard' fine sediment (<63um)  The following will fail to find anything  'PCB183 concentration'  '63um sediment' (63um%sediment gets false hits)  'sediment <63um%dry weight'  But this is right on the money  'con%PCB183%dry%sediment%<63'

  Manual Mapping Techniques  String has all components in the right order  What - con for Concentration  Substance synonym - PCB183  What to where relationship - dry for dry weight  Where - sediment%<63 for sediment <63um  The resulting hit  Concentration of 2,2',3,4,4',5',6-heptachlorobiphenyl {PCB183 CAS } per unit dry weight of sediment <63um
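
Building the search string from its components in semantic-model order can be sketched as follows. The helper names `build_search` and `matches` are hypothetical, and `matches` is a local imitation of the search rules that treats '%' as zero or more characters (an assumption). The label is copied from the slide as it stands, including the gap where the CAS number was.

```python
import re

LABEL = ("Concentration of 2,2',3,4,4',5',6-heptachlorobiphenyl "
         "{PCB183 CAS } per unit dry weight of sediment <63um")


def build_search(what: str, substance: str, relationship: str,
                 where: str, how: str = "") -> str:
    """Join the components with '%' in the order the semantic model uses."""
    parts = [what, substance, relationship, where, how]
    return "%".join(part for part in parts if part)


def matches(search: str, label: str) -> bool:
    """Case-insensitive match with '%' wildcards added at both ends."""
    parts = [re.escape(part) for part in f"%{search}%".split("%")]
    pattern = re.compile(".*".join(parts), re.IGNORECASE | re.DOTALL)
    return pattern.fullmatch(label) is not None


if __name__ == "__main__":
    good = build_search("con", "PCB183", "dry", "sediment%<63")
    print(good, matches(good, LABEL))
    for bad in ("PCB183 concentration", "63um sediment",
                "sediment <63um%dry weight"):
        print(bad, matches(bad, LABEL))
```

The three failing strings fail for the reason given on the slide: their fragments are either not adjacent in the label or appear in the wrong order.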

  Manual Mapping Techniques  Thesaurus Search  For parameters entry point is P08 (Disciplines), P03 (Agreed Parameter Groups) or P02 (Discovery Parameters)  Pressing a '+' in P08 opens up P03  Pressing a '+' in P03 opens up P02  Pressing a '+' in P02 opens up P01  Works well for finding P01 in cases where small numbers of P01 terms are mapped to each P02  In other cases the list may be too long for comfortable scanning and library string searching will work better

  Automated Mapping Technique  To automatically find the P02 code for a given P01 code  Obtain the RDF XML document for the P01 code  Look for <skos:broader rdf:resource including the URL for a P02 concept which in this example is:   Job Done - all that's needed is a bit of software to do the job programmatically  In BODC we store P01 codes in data and automatically convert to P02 to generate CDI. SHOULD EXCLUDE CO-ORDINATE VARIABLES
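
The "bit of software" could look something like the sketch below, which pulls P02 codes out of the `skos:broader` links in a P01 concept document. `SAMPLE` is a hand-written stand-in with made-up codes and a hypothetical host; the URL pattern '<vocabulary>/current/<code>/' follows the RESTful syntax described earlier, and the function name `broader_codes` is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical P01 concept document; codes and host are invented.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="https://vocab.example.org/collection/P01/current/AAAA0001/">
    <skos:broader rdf:resource="https://vocab.example.org/collection/P02/current/BBBB/"/>
    <skos:broader rdf:resource="https://vocab.example.org/collection/P03/current/C001/"/>
  </skos:Concept>
</rdf:RDF>"""


def broader_codes(rdf_xml: str, vocab_id: str = "P02") -> List[str]:
    """Return codes of skos:broader concepts that belong to vocab_id."""
    codes = []
    for element in ET.fromstring(rdf_xml).iter(f"{{{SKOS}}}broader"):
        url = element.get(f"{{{RDF}}}resource", "")
        segments = [segment for segment in url.split("/") if segment]
        # Expect '.../<vocab_id>/current/<code>/' at the end of the URL.
        if (len(segments) >= 3 and segments[-3] == vocab_id
                and segments[-2] == "current"):
            codes.append(segments[-1])
    return codes


if __name__ == "__main__":
    print(broader_codes(SAMPLE))
```

Filtering on the vocabulary ID matters because a concept may carry broader links into more than one vocabulary.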

 Parameter mappings  The biggest problem with mapping local parameter vocabularies to a standard vocabulary is understanding EXACTLY what is meant by the local term.  Consider the parameter 'Particulate Zinc'  This could mean:  The concentration of zinc per unit dry weight of the residue of a filtered sample  The concentration of zinc contained in the particles per unit volume of a body of water  The concentration of zinc contained in the particles per unit mass of a body of water  Each of these has a different P01 code.  Think carefully and ask many questions!

 P01 mappings based on semantic model  Expose elements of the semantic model  Concentration of  2,2',3,4,4',5',6-heptachlorobiphenyl {PCB183 CAS }  per unit dry weight of  sediment <63um  User selects combination that maps to their local parameter like one-armed bandit wheels  System returns appropriate P01 code or automatically generates a new P01 code
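
One way to picture the selection step is a lookup keyed on the chosen elements. This is only an illustrative sketch: the registry contents, the code `XXXX0001`, the element `substance A`, and the function names are all hypothetical, and returning `None` stands in for the point at which a new code would have to be generated.

```python
from typing import Dict, Optional, Tuple

# (what, substance, what-to-where relationship, where)
Elements = Tuple[str, str, str, str]

# Hypothetical registry; 'XXXX0001' is a made-up code, not a real P01 entry.
REGISTRY: Dict[Elements, str] = {
    ("Concentration of", "substance A",
     "per unit dry weight of", "sediment <63um"): "XXXX0001",
}


def compose_label(elements: Elements) -> str:
    """Join the selected elements into a label, in semantic-model order."""
    return " ".join(elements)


def lookup_code(elements: Elements,
                registry: Optional[Dict[Elements, str]] = None) -> Optional[str]:
    """Return the existing code, or None when a new one would be needed."""
    table = REGISTRY if registry is None else registry
    return table.get(elements)


if __name__ == "__main__":
    choice = ("Concentration of", "substance A",
              "per unit dry weight of", "sediment <63um")
    print(compose_label(choice), "->", lookup_code(choice))
```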

 P01 mappings based on semantic model  Creates the risk of 'Green Dog' syndrome  User is free to select any combination of elements  Some combinations may be valid, others are not  Consider lists of animals plus colours  GREEN + LIZARD - good choice  GREEN + DOG - not such a good choice  Consequently, quality control of user-selected semantic model combinations is essential  Places latency in new code assignment cycle but I am convinced this is worthwhile - others disagree
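
The quality-control step can be sketched as a check against combinations that a vocabulary manager has already approved, with anything else held for review. The approved set and the function name `review` are hypothetical; the slide does not say how the check is carried out in practice.

```python
from typing import Optional, Set, Tuple

# Hypothetical list of combinations already approved by a vocabulary manager.
APPROVED: Set[Tuple[str, str]] = {("GREEN", "LIZARD")}


def review(colour: str, animal: str,
           approved: Optional[Set[Tuple[str, str]]] = None) -> str:
    """Accept approved combinations; hold everything else for manual review."""
    table = APPROVED if approved is None else approved
    key = (colour.strip().upper(), animal.strip().upper())
    return "accept" if key in table else "refer for manual review"


if __name__ == "__main__":
    print(review("green", "lizard"))
    print(review("green", "dog"))
```

Holding unknown combinations rather than rejecting them outright reflects the latency trade-off described above: a new combination may be perfectly valid, but a person decides.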

 Semantic aggregation  Set up a vocabulary of aggregated parameter concepts - P35 for EMODNET chemistry  Map each P35 concept to P01 concepts that may be validly included in the aggregation  Aggregation software issues RESTful call to NERC Vocabulary Server for P35 concept  Software then parses returned RDF XML document to identify P01 concepts that may validly be aggregated  This functionality is currently being written into ODV
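
The parsing step can be sketched as below. The slide does not say which property links a P35 concept to its P01 concepts, so the use of `skos:narrower` is an assumption, as are the sample document, its codes, the hypothetical host, and both function names. This is not the ODV implementation.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical P35 concept document; codes and host are invented.
SAMPLE_P35 = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="https://vocab.example.org/collection/P35/current/AGG00001/">
    <skos:narrower rdf:resource="https://vocab.example.org/collection/P01/current/AAAA0001/"/>
    <skos:narrower rdf:resource="https://vocab.example.org/collection/P01/current/AAAA0002/"/>
  </skos:Concept>
</rdf:RDF>"""


def aggregatable_p01(rdf_xml: str) -> Set[str]:
    """Collect P01 codes linked from the P35 concept (assumed skos:narrower)."""
    codes = set()
    for element in ET.fromstring(rdf_xml).iter(f"{{{SKOS}}}narrower"):
        url = element.get(f"{{{RDF}}}resource", "")
        segments = [segment for segment in url.split("/") if segment]
        if len(segments) >= 3 and segments[-3] == "P01":
            codes.add(segments[-1])
    return codes


def select_series(series: Dict[str, List[float]],
                  allowed: Set[str]) -> Dict[str, List[float]]:
    """Keep only the data series whose P01 code may join the aggregation."""
    return {code: values for code, values in series.items() if code in allowed}


if __name__ == "__main__":
    allowed = aggregatable_p01(SAMPLE_P35)
    data = {"AAAA0001": [1.0, 2.0], "ZZZZ9999": [3.0]}
    print(select_series(data, allowed))
```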