Google Dataset Search Evaluation
CEOS WGISS-46, October 23, 2018
André Twele, Christian Strobl and Katrin Molch
Google Dataset Search – some quick facts
Launched on 5 September 2018
Think of it as a "Google Scholar for data"
Main aim: facilitating the discoverability of datasets from thousands of repositories across the web
Initial release mainly covers the environmental and social sciences, government data, and data from news organizations
Relies on dataset providers to embed structured (meta)data in their web sites, using schema.org Dataset markup or equivalent structures (W3C DCAT)
Formats: JSON-LD, RDFa 1.1 or Microdata syntax
https://toolbox.google.com/datasetsearch
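To illustrate the kind of markup Dataset Search relies on, the following minimal sketch builds a schema.org Dataset description as JSON-LD and wraps it in the script element a provider would embed in a dataset landing page. The dataset title is the one used for the test URL in this presentation; the description text, the landing-page URL and both helper function names are assumptions made for this example, not DLR's actual markup.

```python
import json


def dataset_jsonld(name, description, url, provider):
    """Return a schema.org Dataset description serialized as JSON-LD."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": {"@type": "Organization", "name": provider},
    }
    return json.dumps(record, indent=2, ensure_ascii=False)


def as_script_tag(jsonld):
    """Wrap a JSON-LD string in the script element used for embedding."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


example = dataset_jsonld(
    name="SRTM X-SAR - Digital Elevation Model (DEM) Tiles - Global",
    description="Placeholder abstract (assumption for this example).",
    url="https://example.org/datasets/srtm-x-sar-dem",  # hypothetical landing page
    provider="DLR Earth Observation Center",
)
print(as_script_tag(example))
```

The printed element would go into the HTML of the landing page; RDFa 1.1 or Microdata could carry the same properties instead.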
Evaluation – first impressions
Quality of search results varies, probably depending on the portal from which Google's search engine retrieved each record
For many search results, only a "Description" field (equivalent to gmd:abstract in ISO 19115/19139) is shown
Evaluation – first impressions
Only some search results include fields such as "Dataset published, created or updated", "Dataset provided by" and "Time period covered"
Information on spatial properties, geographical coverage, etc. is rarely shown
Dataset Search can detect whether a dataset is present in more than one repository
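Which optional fields a search result shows presumably depends on which properties the source record carries. A minimal sketch of such a completeness check is given below. The property names are standard schema.org terms, but the selection of properties, the function name and both sample records are assumptions for illustration and do not reflect how Google populates its result fields.

```python
# schema.org properties assumed (for this example) to feed the optional result fields.
OPTIONAL_PROPERTIES = (
    "datePublished",
    "dateModified",
    "creator",
    "temporalCoverage",
    "spatialCoverage",
)


def missing_properties(record, wanted=OPTIONAL_PROPERTIES):
    """Return the wanted properties that are absent or empty in a record."""
    return [prop for prop in wanted if not record.get(prop)]


# Hypothetical record that carries only a name and an abstract.
sparse = {
    "@type": "Dataset",
    "name": "Example collection",
    "description": "Only an abstract is available.",
}

# Hypothetical record with most optional properties filled in.
rich = dict(
    sparse,
    datePublished="2018-09-05",
    creator={"@type": "Organization", "name": "Example provider"},
    temporalCoverage="2017-01-01/2017-12-31",
    spatialCoverage="Global",
)

print(missing_properties(sparse))
print(missing_properties(rich))
```

Running such a check over harvested records would show which portals deliver only a description and which also deliver temporal or spatial information.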
Evaluation – DLR EOC Catalogue
DLR EOC Catalogue currently contains 184 entries (ISO 19115/19139)
Of a sample of 20 catalogue entries, 18 were discoverable through Google Dataset Search
However, DLR currently does not provide metadata that Google's search engine can process directly (schema.org/DCAT)
…so how did DLR's catalogue content make it into Google Dataset Search?
www.europeandataportal.eu, www.geoportal.rlp.de, www.geoportal.hessen.de, geo.spacebel.be, …
Evaluation – Example "path" of metadata from its origin
[Flow diagram; labels: Catalogue; Harvesting of ISO 19115/19139 records; Harvesting of DCAT-AP records; Indexing of schema.org / JSON-LD markup; Data Cleaning, Replica Identification, Scholar Linking, Knowledge Graph Reconciliation; (Enriched) Metadata Index; Ranked Results]
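One transformation step on such a path can be sketched as follows: reading the title and abstract from an ISO 19139 record and emitting a schema.org Dataset description. This is a minimal illustration, not the mapping used by any of the portals named in this presentation. The XML snippet, its content and the function names are assumptions; only the gmd and gco namespace URIs and element names come from ISO 19139.

```python
import xml.etree.ElementTree as ET

# ISO 19139 namespaces.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Hypothetical, heavily shortened ISO 19139 record.
ISO_SNIPPET = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:title><gco:CharacterString>Example elevation tiles</gco:CharacterString></gmd:title>
        </gmd:CI_Citation>
      </gmd:citation>
      <gmd:abstract><gco:CharacterString>Example abstract.</gco:CharacterString></gmd:abstract>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>"""


def text_at(root, path):
    """Return the stripped text of the first element matching path, or None."""
    node = root.find(path, NS)
    return node.text.strip() if node is not None and node.text else None


def iso_to_schema_org(xml_text):
    """Map the title and abstract of an ISO 19139 record to schema.org Dataset."""
    root = ET.fromstring(xml_text)
    ident = "gmd:identificationInfo/gmd:MD_DataIdentification/"
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": text_at(root, ident + "gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString"),
        "description": text_at(root, ident + "gmd:abstract/gco:CharacterString"),
    }
    # Elements without a mapping (e.g. geographic extent) are dropped here.
    return {key: value for key, value in record.items() if value is not None}


print(iso_to_schema_org(ISO_SNIPPET))
```

Anything the mapping does not handle explicitly never reaches the target record, which is one way properties can disappear along a chain of harvesting steps.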
Testing tool for validating URLs/code snippets for structured data
Test URL: https://www.europeandataportal.eu/data/en/dataset/f4d4079a-ada3-41d0-ba95-630ba232e147 ("SRTM X-SAR - Digital Elevation Model (DEM) Tiles - Global")
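A rough local approximation of one thing such a test checks, namely whether a page contains JSON-LD of type Dataset, can be sketched with the Python standard library. This is not Google's testing tool and performs no validation against Google's requirements. The class name, function name and sample page are assumptions for this example, and the sketch works on an HTML string rather than fetching the test URL.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of all application/ld+json script elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def find_datasets(html):
    """Return all top-level JSON-LD objects in the page whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if isinstance(b, dict) and b.get("@type") == "Dataset"]


# Hypothetical landing page with one embedded Dataset description.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example dataset", "description": "Example abstract."}
</script>
</head><body></body></html>"""

datasets = find_datasets(SAMPLE_PAGE)
print([d["name"] for d in datasets])
```

Markup supplied as RDFa or Microdata, or JSON-LD nested inside an @graph, would not be found by this sketch.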
Evaluation – Other European Catalogues
The selected ESA and EUMETSAT collections were all retrievable through Google Dataset Search
Most frequent metadata sources: geo.spacebel.be, fedeo.esa.int, cmr.earthdata.nasa.gov, data.nasa.gov, www.europeandataportal.eu
Ranking of search results is sometimes questionable, e.g. single value-added datasets are ranked higher than the original collection/dataset series
Level of detail of individual search results is quite heterogeneous
Conclusions
Completeness and quality of search results strongly depend on the structured metadata that dataset providers add to their sites
Google's adoption of open standards for describing structured data (schema.org, DCAT, JSON-LD) will encourage their wider use
Because metadata replicas exist in several catalogues indexed by Google, it is sometimes hard to find the original dataset in its home portal
Metadata properties/attributes tend to get "lost in translation" as a result of the schema transformations and portals through which they pass
Links / References
https://toolbox.google.com/datasetsearch (Dataset Search portal)
https://search.google.com/structured-data/testing-tool (Testing tool)
https://developers.google.com/search/docs/data-types/dataset (Google developer guide on dataset discovery)
https://productforums.google.com/forum/#!topic/webmasters/nPq4BW6iPIA (Google FAQ on structured data)
https://schema.org/Dataset (schema.org Dataset markup)
https://www.w3.org/TR/vocab-dcat/ (Data Catalog Vocabulary, DCAT)
Thank you very much for your attention!
André Twele
DLR Earth Observation Center
Andre.Twele@dlr.de