Tony Rees Divisional Data Centre CSIRO Marine Research, Australia Metadata concepts, issues and experiences – lessons from 8 years.

Presentation transcript:

Tony Rees, Divisional Data Centre, CSIRO Marine Research, Australia
Metadata concepts, issues and experiences – lessons from 8 years of metadata management at CMR
For CSE Metadata Workshop, Canberra, May 2005

Overview
- Some definitions / concepts
- Who are the clients for metadata? (what is our target audience)
- How do people find metadata? (discovery / search mechanisms)
- The national metadata infrastructure context (ASDD etc.)
- Search methods – free text vs. structured searches, and the CMR (MarLIN) approach
- What metadata to collect?
- Space and time “footprints” in metadata records (storage and search implications)
- How do we populate the system...
- Selected implementation aspects (when actually building a system)

Metadata is …
- Structured, summary information regarding a dataset or similar resource
- Conforms to some standard, e.g. ANZLIC (for our region) or ISO 19115; can have agency-specific extensions
- Provides both descriptions of resources (cataloguing / documentation function) and, potentially, previews of / access points to the data
- Definition of “dataset” – in the eye of the beholder
  – a logical set of data sharing common attributes, e.g. data type, collection method, survey / experiment...
  – size of data “chunks” (granularity of the metadata) determined by agency practices and preferences
- Probably good to distinguish dataset-level metadata from item-level descriptions (keep in separate, tailored systems)

Some example metadata systems … GCMD (NASA)

Some example metadata systems (cont’d)… NERC Metadata Gateway (UK)

Some example metadata systems (cont’d) … Australian Spatial Data Directory (another gateway)

Some example metadata systems (cont’d) … MarLIN (CMR metadata system)

What are we trying to do here?
- Describe our data holdings – to the inside and outside world
- Bring together relevant dataset documentation (or pointers to it) in a single, www-accessible location
- Provide a good (i.e. tailored) set of search tools which suit our data holdings and “target” users
- Facilitate access to our data – on a self-serve basis (where possible) **
- Connect our entered information to the wider world for “discovery” purposes, e.g. to metadata gateways and internet search engines
- Re-use metadata as a “building block” in broader Divisional systems (capture once, use many times) **
(** = value adding)

Who are the clients for our metadata? (hopefully not...)

Who are the clients for our metadata? (hopefully yes...)

Who are the clients for our metadata?
- CSIRO researchers and their internal / external collaborators (e.g. for data discovery)
- Divisional management
- External parties – schools, public, scientific community, policy makers, consultants
- Ourselves – if an extensive data custodian (use for internal cataloguing / data access purposes)
- Recipients of CSIRO data – can supply metadata along with data products (also, may be a project deliverable)
- Future users (v. important) – “corporate memory”

How do people find metadata?
- Agency-level systems (own access points)
- Metadata gateways – e.g. ASDD (Australian Spatial Data Directory) for Australia, NERC metadata gateway for UK
- Future one-CSIRO system (??)
- Internet search engines, e.g. Google (if mechanism for crawling is enabled)
- Standalone metadata files (e.g. supplied with data)
NB: all have their place, e.g. agency-level systems may support richer or better targeted search facilities than those available via gateways.

National Metadata Infrastructure
[Diagram: ASDD (Australian Spatial Data Directory), the national cross-agency metadata gateway, sits above the agency metadata systems, which describe / point to each agency's data: CMR MarLIN (CMR data), DEH EDD (DEH data), BoM (BoM data), GA (GA data), etc. A “future agency system” is also shown.]
- Search via ASDD – search across multiple agencies, basic functionality
- Search via MarLIN – search only CMR holdings, but extra functionality (also view “CMR internal” records not visible to external users)

ASDD search – across multiple agency systems

(etc.)

Limitations of text-based searching...
- Basically a “hit and miss” method – no “browse” capability, or method to broaden / focus the search
- Relies on searcher and metadata creator using same words for same concepts (does not happen in practice, with free text entry across multiple systems)... e.g. “whales” vs. “cetaceans” vs. “marine mammals” vs. species scientific names (multiple wordings covering potentially the same concept)
- Also, converse applies – one word, multiple uses, e.g. shark (fish), shark cat (type of boat), Shark Bay (place)...
- Variant spellings also a problem (e.g. sea lion vs. sea-lion vs. sealion; fishery vs. fisheries; organization vs. organisation; Mt. vs. Mount...)
- Typographical errors may render a document invisible to a free text search (can be at either end, e.g. searcher or stored data)
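The spelling-variant and "one word, multiple uses" problems can be seen in a few lines of Python. This is only a sketch: the record titles are invented for illustration (they are not real MarLIN or ASDD records), and the naive substring matcher stands in for whatever free-text engine a real system would use.

```python
# Hypothetical metadata titles; none of these are real MarLIN records.
RECORDS = [
    "Sea lion counts, southern Australia",
    "Sea-lion pup survey",
    "Sealion tagging data",
    "Shark Bay seagrass mapping",
    "Shark cat vessel log",
]


def free_text_search(term, records):
    """Naive case-insensitive substring match (a stand-in for free-text search)."""
    term = term.lower()
    return [r for r in records if term in r.lower()]


# Only one of the three spellings is found, so two relevant records are missed.
sea_lion_hits = free_text_search("sea lion", RECORDS)

# A searcher interested in the fish gets a place and a boat instead.
shark_hits = free_text_search("shark", RECORDS)
```

The first search shows lost recall (relevant records missed), the second lost precision (irrelevant records returned).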

cf. – Advantages of picklists (“controlled vocabularies”)...
- Steers users to use “one concept, one descriptor” approach; no spelling variants / errors
- Can organise thematically / hierarchically, i.e. “shark” under zoology, “Shark Bay” under localities... (less confusion); also can have explicit relationships (broader / narrower, related categories, etc.)
- Supports structured information retrieval and browsing
- Good prompt for terms that the searcher (or content creator) may not otherwise think to enter
- Amenable to global updates (hold list item IDs in the record, actual values in a look-up table, change in one place only)
- Can be access point to more extensive stored additional information (e.g. via project, voyage, organisation, publication ID) – content creator picks a value from the list, system automatically adds the rest
- Main difficulties: getting agreement on list content; anticipating all user needs; loss of flexibility / fine detail of expression (i.e. still a need for free text as optional supplement). Also, list maintenance is an overhead.
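The "hold list item IDs in the record, actual values in a look-up table" idea might look roughly like this in Python. The IDs, terms and record names are all hypothetical, and a real system would keep these in database tables, not dictionaries.

```python
# Look-up table: list item ID -> displayed value (hypothetical IDs and terms).
KEYWORDS = {
    1: "sharks (zoology)",
    2: "Shark Bay (locality)",
    3: "marine mammals",
}

# Records hold only the list item IDs, never the keyword text itself.
RECORDS = {
    "rec-A": {"title": "Shark catch data", "keyword_ids": [1]},
    "rec-B": {"title": "Seagrass mapping", "keyword_ids": [2]},
    "rec-C": {"title": "Whale sightings", "keyword_ids": [3]},
}


def search_by_keyword(keyword_id):
    """Structured search: exact match on a picklist ID, so no spelling variants."""
    return sorted(rid for rid, rec in RECORDS.items()
                  if keyword_id in rec["keyword_ids"])


def display_keywords(record_id):
    """Resolve the stored IDs to their current wording at display time."""
    return [KEYWORDS[k] for k in RECORDS[record_id]["keyword_ids"]]


# Global update: reword the term in one place; no record needs editing.
KEYWORDS[3] = "marine mammals (cetaceans, seals)"
```

Because the searcher picks an ID from the list, "shark" the fish and "Shark Bay" the place can no longer be confused, and the reworded term shows up on every record that references it.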

e.g. MarLIN approach... (example: search by taxonomic group)

(etc.) NB: (1) this method (in principle) maximises both “recall” (getting records that you do want) and “precision” (not getting records that you don’t want) (2) fewer “0 records returned” messages (user cannot search on terms not actually used)
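For reference, "recall" and "precision" have standard definitions that can be written as a small Python function. The record IDs in the usage example are made up, and returning None for the undefined cases is a choice made for this sketch.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    recall    = relevant records retrieved / all relevant records
    precision = relevant records retrieved / all records retrieved
    Either value is None when its denominator is zero (undefined).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else None
    precision = len(hits) / len(retrieved) if retrieved else None
    return recall, precision


# Hypothetical search: 3 records returned, 4 records actually relevant.
recall, precision = recall_and_precision({"a", "b", "c"}, {"a", "b", "d", "e"})
```

Here two of the four relevant records were found (recall 0.5) and two of the three returned records were relevant (precision about 0.67).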

What metadata to collect? – 1 Core ANZLIC fields – title, abstract, space and time ranges, data quality, data contact point, ANZLIC search words... (c. 40 fields)

What metadata to collect? – 2
- Other fields of value to the agency, e.g...
  – project codes + associated info.
  – more specialised keywords or search terms
  – controlled defined regions list
  – links – data documentation, graphics links, data access
  – stored data volume, stored data location
  – references, contributors, acknowledgements (e.g. funding)...
- Some of the above correspond to elements in the ISO standard (c. 400 fields), some will be new
- Tension between simple metadata set (few elements, but easy to collect) and more extensive dataset information (more effort to collect, but increased future value and / or structured search options)

CMR Metadata search page (portion)... in order to be useful for structured searches, relevant information must be captured at metadata entry time, in a consistent way (e.g. via picklists and supporting tables).

Also need to consider space, time “footprints”, i.e. how to support these at search time.
Example for a CMR dataset (“Lira” catch dataset from 1973):

Storage of relevant temporal and spatial search info (default):
- Machine-readable temporal search: dataset time range (as start, end dates) vs. search time range (as start, end dates); overlap = “hit”
- Machine-readable spatial search: dataset bounding box (as start, end lat & lon) vs. search bounding box (as start, end lat & lon); overlap = “hit”
- Tend to not worry about temporal patchiness (maybe just add text comment in “completeness” field)
- Spatial patchiness (or irregular polygon shapes) can be a more serious problem – CMR solution on next slide
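The two "overlap = hit" tests can be sketched in Python as follows. The dataset extents and search values are invented for illustration (they are not the real “Lira” record), and the sketch assumes bounding boxes that do not cross the 180° meridian.

```python
from datetime import date


def ranges_overlap(start_a, end_a, start_b, end_b):
    """True if two closed ranges share at least one point."""
    return start_a <= end_b and start_b <= end_a


def time_hit(dataset, search):
    """Temporal search: dataset time range overlaps search time range."""
    return ranges_overlap(dataset["start"], dataset["end"],
                          search["start"], search["end"])


def bbox_hit(dataset, search):
    """Spatial search: boxes overlap when both lat and lon ranges overlap."""
    return (ranges_overlap(dataset["south"], dataset["north"],
                           search["south"], search["north"])
            and ranges_overlap(dataset["west"], dataset["east"],
                               search["west"], search["east"]))


# Hypothetical marine dataset whose bounding box spans most of Australia.
DATASET = {"start": date(1973, 1, 1), "end": date(1973, 12, 31),
           "south": -45.0, "north": -10.0, "west": 110.0, "east": 155.0}

# Search box around Alice Springs (about 23.7 S, 133.9 E), mid 1973.
ALICE_SPRINGS = {"start": date(1973, 6, 1), "end": date(1973, 6, 30),
                 "south": -24.0, "north": -23.0, "west": 133.0, "east": 134.0}

# Same place, but a time range after the dataset ends.
LATER = dict(ALICE_SPRINGS, start=date(1980, 1, 1), end=date(1980, 12, 31))
```

With these values `bbox_hit(DATASET, ALICE_SPRINGS)` is true even though a marine dataset holds no data in the desert, which is the kind of spatial "false positive" a single bounding box cannot avoid.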

Spatial footprints – improved method
CMR has implemented a grid squares-based system for improved spatial “footprint” representation and querying (without requirement for a full GIS back end):
- Dataset spatial extent – stored as list of squares intersected
- Search by grid square (or set of squares): in list = “hit”, not in list = “miss”
- We use 0.5° x 0.5° squares – same resolution as 1: mapsheet series (approx. 50 x 50 km)
- Global “c-squares” notation covers marine as well as land areas
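A minimal Python sketch of the grid-square idea follows. It uses a made-up (row, column) index for each 0.5° cell, not the actual c-squares notation, and the sample positions are hypothetical.

```python
import math

CELL = 0.5  # cell size in degrees, matching the 0.5 x 0.5 degree squares


def cell_id(lat, lon):
    """Hypothetical (row, col) index of the cell containing a position.

    This is NOT the real c-squares code, just a stand-in identifier.
    """
    return (math.floor(lat / CELL), math.floor(lon / CELL))


def footprint(positions):
    """Dataset spatial extent: the set of cells its positions fall in."""
    return {cell_id(lat, lon) for lat, lon in positions}


def is_hit(dataset_cells, search_cells):
    """Hit if any searched cell is in the dataset's stored list of cells."""
    return not dataset_cells.isdisjoint(search_cells)


# Hypothetical offshore sampling positions (lat, lon).
dataset_cells = footprint([(-43.2, 147.9), (-43.6, 148.3), (-42.8, 148.6)])

offshore_search = {cell_id(-43.4, 147.6)}      # shares a cell with the dataset
alice_springs_search = {cell_id(-23.7, 133.9)}  # inland, no shared cell
```

Because only the cells actually intersected are stored, a search at Alice Springs is a "miss", while a search in a sampled cell is a "hit"; the comparison is a plain set intersection and needs no GIS.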

Related functionality on Museum Victoria “Bioinformatics” site (search interface shown): Searcher can use this approach to define a non-rectangular region of interest (green highlighted cells). (NB: this uses a different [non-global] notation for the cells, however the basic principle is the same.)

Result for the relevant “Lira” CMR metadata record... Red squares (as square IDs) are what is actually stored; they can then be superimposed on any user-selected base map for display purposes. Now will not get “false positives”, e.g. from searching at Alice Springs.

Remainder is “standard” metadata (ANZLIC + CMR extensions), e.g... (etc.)

How do we populate the system (get people to describe their data)? Non-trivial problem
- Education – value of metadata, responsibility of data custodians to describe their data in designated system/s
- Prescriptive approach – build into project planning, sign-off, APA’s
- Facilitation – dedicated personnel assist scientists, knock on doors
- Making records on researchers’ behalf – resource intensive, also not ideal since person making the metadata does not have the best understanding of the data
- Incrementally – e.g. as data is migrated into corporate systems, require the metadata to go with it (robust linkage). NB: will probably always be “data islands” that this approach misses.

How far have we got...? Currently there are some 2,100 records in the MarLIN system (etc.)

How far have we got...? – cont’d
- 90-95% of “Data Centre” holdings described – after 8-year process! (<1000 records, mostly ships’ data, by voyage and data type)
- A few “data islands” have made concerted attempts to describe their data (e.g records each)
- Some major data acquisition exercises have generated records, mostly for third party data (generally not visible on extranet) – e.g. where metadata is a specified project deliverable along with the data (good!)
- Remainder is pretty patchy (maybe 10% compliance) – hope to kickstart with project-based “skeleton records”, also more rigid directives / follow-up from Divisional management

Project data template (example): (etc.)

What information model to use?
Ideal world (probably unattainable)... all information would be entered / maintained in one place only; updates would propagate automatically through the system; all resources would be electronic and seamlessly accessible.
[Diagram components: Stored data, Metadata system, Projects database, Persons database, Library pubs. list, Item-level catalogues, Ancillary information]

Best we can do for now...
[Diagram labels, in the order extracted:]
- Stored data
- Metadata system – main “datasets” table
- Item-level catalogues
- Ancillary information
- digital + non-digital
- MarLIN “persons” table
- digital + non-digital
- MarLIN “references” table? (or text descriptions)
- MarLIN “projects” table
- MarLIN “data” links (URLs) in table (also text descriptions)
- MarLIN “doc” links (URLs) in table (also text descriptions)
- MarLIN “doc” + “graphic” links (URLs) in table (also text descriptions)
- plus some other tables (not shown) for voyages, organisations, keywords...

Functionality / Processes to be supported (... list probably incomplete!)
- User interfaces – create, edit, search metadata records
- Administrator functions – user identities and privileges, “super-user”-level record modification, deletion, list maintenance
- Moderator function – approve / edit content to be published
- Security / authentication – who can access “internal” records (e.g. by specified IP domains or other mechanism)
- Access logging – including what search terms used, how many “hits”, etc. (plus applications to review user log and access stats)
- Application maintenance, tech. support, user training
- Automated connections to remote systems, plus on-demand import / export features (e.g. via XML)
- Ongoing development / modification to functionality or database structure – process, resources...
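As an illustration of the "import / export features (e.g. via XML)" item, a record could be serialised and read back with Python's standard library as sketched below. The element names and the sample record are invented for this example; they do not follow the ANZLIC or ISO 19115 schemas, and this is not MarLIN's actual export format.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat fields; a real record would follow a published schema.
FIELDS = ("title", "abstract", "start_date", "end_date")


def record_to_xml(record):
    """Export one metadata record (a dict) as an XML string."""
    root = ET.Element("metadataRecord")
    for field in FIELDS:
        ET.SubElement(root, field).text = record[field]
    keywords = ET.SubElement(root, "keywords")
    for kw in record["keywords"]:
        ET.SubElement(keywords, "keyword").text = kw
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text):
    """Import: rebuild the dict from the XML produced above."""
    root = ET.fromstring(xml_text)
    record = {field: root.findtext(field) for field in FIELDS}
    record["keywords"] = [k.text for k in root.iter("keyword")]
    return record


# Made-up sample record for a round trip.
SAMPLE = {
    "title": "Example catch dataset",
    "abstract": "Illustrative record only.",
    "start_date": "1973-01-01",
    "end_date": "1973-12-31",
    "keywords": ["fisheries", "catch data"],
}
exported = record_to_xml(SAMPLE)
```

A round trip through `xml_to_record(exported)` returns the original dictionary, which is the basic property an automated connection to a remote system would rely on.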

Metadata integration / remote calls (examples) Project work space (HTML page)

Metadata integration / remote calls (examples) Custom MarLIN search via web call (from different database)

Metadata integration / remote calls (examples) Re-use of MarLIN supporting tables content (in other contexts)

Concluding remarks
- Simple in theory, not so simple in practice, to design and implement a good system (especially in a research, rather than basic “products set” environment) – no “off the shelf” solution (or even key components) available
- Designing a system gives the opportunity to incorporate new / improved concepts (scope for innovation, design challenges)
- Should be benefits in sharing code, approaches, experiences across Divisions or other groups
- Populating the system is as important as building it!
- Connection to external gateways is not too hard, once system plus some publishable content exists
- CMR is a lonely trailblazer within CSIRO... still considered an example of “best practice” (a bit of a worry, seeing how far we still have to go)...

Thanks!
To visit MarLIN: go to >> Data Centre ( >> MarLIN (
MarLIN “Edit” interface – currently requires access privileges to visit (will look at online in tomorrow’s session).