Data Sets, Vocabularies and Tools Pablo N. Mendes Freie Universität Berlin 1st year review Luxembourg, December 2011 11/02/11.


1 Data Sets, Vocabularies and Tools Pablo N. Mendes Freie Universität Berlin 1st year review Luxembourg, December 2011 11/02/11

2 Work Plan View: WP4
[Timeline chart, months 0 to 48. Partners shown: FUB, KIT, UPM]
Tasks:
- Task 4.1 Assembly and maintenance of the PlanetData data set catalogue
- Task 4.2 Community-driven creation and maintenance of vocabularies
- Task 4.3 Development of best practices for providing self-describing data
- Task 4.4 Assembly and maintenance of a catalogue of data provisioning tools
Deliverables:
- D4.1 Assembly and maintenance of the PlanetData data set catalogue
- D4.2 Best practices on how to provide self-describing data
- D4.3 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal
- D4.4 Data quality benchmark dataset
- D4.5 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal

3 Work Plan View: WP5
[Timeline chart, months 0 to 48. Partners shown: EPFL, KIT]
Tasks:
- Task 5.1 Assembly and maintenance of PlanetData technology catalogue
- Task 5.2 Development of best practices of large-scale data management infrastructures
Deliverables:
- D5.1 PlanetData data management tools catalogue and access portal
- D5.2 Best practices on how to deploy tools on large-scale infrastructures
- D5.3 PlanetData data management tools catalogue and access portal

4 Summary
WP4:
- Assembly and maintenance of the PlanetData data set, vocabularies and tools catalogue
- Community-driven creation and maintenance of vocabularies
- Development of best practices
WP5:
- Assembly and maintenance of the PlanetData technology catalogue
- Best practices for large-scale data management infrastructure

5 Tasks
- Task 4.1: Assembly and maintenance of the PlanetData data set catalogue (Leader: FUB) (M1 to M48)
- Task 4.2: Community-driven creation and maintenance of vocabularies (Leader: KIT) (M1 to M48)
- Task 4.3: Development of best practices for providing self-describing data (Leader: KIT) (M1 to M24)
- Task 4.4: Assembly and maintenance of a catalogue of data provisioning tools (Leader: UPM) (M1 to M48)

6 Deliverables in Year 1
- D4.1: Data Sets Catalog, Vocabularies Catalog
- D5.1: Data Management Tools Catalog

7 Data Sets Catalog
- Where to maintain the catalog?
- How to catalog?
- What to catalog?
- How to provide access for humans and machines?
- How to organize a community around the catalog?

8 Repository: TheDataHub.org
- Maintained by the Open Knowledge Foundation (OKF) and the world-wide open data community
- Widely used catalog
- As of Dec 1st 2011: 2418 datasets, 314 of them LOD
- Features of the portal: tagging, rating, feedback, discussions, groups

9 Cataloguing Process
PlanetData editor:
- Collected a list of new datasets → 49 new entries
- Updated existing entries (537 edits)
Crowdsourcing (data providers and third parties):
- Public call for action to mailing lists, OKFN blog
- Supported the community contributions
Quality assurance:
- Tools to support cataloguing (validator, auto-complete)
- Joint work with LATC

10 Catalog Metadata QuickRef
- What? package name, title, url; tag:lod; topic; shortname; format-*
- Who? author || maintainer; published by; producer; provenance metadata; license
- When? version; last updated
- Why? package description
- Where to find? example URI; downloads/dumps; SPARQL endpoint
- How much? triples; links:* (outlinks); namespace (inlinks); vocab mappings

11 How are datasets described?
- Catalog metadata
- Resources: example URIs, SPARQL endpoint, RDF dumps, sitemaps, VoID files
http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation

12 Cataloguing process overview

13 Catalog Entry Validator
- Checks levels of metadata completeness
- Step-by-step annotation instructions
- Already checks some quality indicators, e.g. availability, provenance, access methods
http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php
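The idea of "levels of metadata completeness" can be illustrated with a minimal sketch. The level names and field names below are hypothetical, loosely modelled on the Catalog Metadata QuickRef slide; the actual validator's levels and rules are not given in the slides and may differ.

```python
# Hypothetical completeness levels; the real validator's levels and
# field names are not specified in the slides.
LEVELS = [
    ("basic", ["name", "title", "url"]),
    ("provenance", ["author", "license"]),
    ("access", ["example_uri", "sparql_endpoint"]),
    ("statistics", ["triples", "links"]),
]


def completeness_level(entry):
    """Return (levels passed in order, missing fields of the first failed level)."""
    passed = 0
    for _, fields in LEVELS:
        # A field counts as missing if it is absent or empty.
        missing = [f for f in fields if not entry.get(f)]
        if missing:
            return passed, missing
        passed += 1
    return passed, []


# Illustrative catalog entry (made-up values).
entry = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "url": "http://example.org/",
    "author": "Jane Doe",
    "license": "cc-by",
    "example_uri": "http://example.org/resource/1",
}

level, missing = completeness_level(entry)
print(f"passed {level} level(s); next missing fields: {missing}")
```

Returning the missing fields of the first failed level mirrors the step-by-step annotation idea: the editor is told what to add next rather than shown every gap at once.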

14 CKAN Entry Validator (2)

15 Auto-completion scripts
For the entries that pass the validator, we can auto-complete metadata with information such as:
- Number of triples
- Links to other sources
- Vocabularies used
- Quality indicators
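As a rough illustration of how "number of triples" and "links to other sources" could be derived from a dump, here is a minimal sketch over N-Triples lines. It is not the project's actual script: the `summarize` function, the sample data and the rule "an out-link is an object IRI outside the dataset namespace" are assumptions made for this example.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches simple N-Triples lines whose subject, predicate and object are IRIs.
# Lines with literal or blank-node objects still count as triples,
# but are never counted as links.
IRI_TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def summarize(lines, namespace):
    """Count triples and out-links (grouped by target host) for one dataset."""
    triples = 0
    outlinks = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        triples += 1
        match = IRI_TRIPLE.match(line)
        if match:
            subj, _, obj = match.groups()
            # Naive rule: any object IRI outside the namespace is an out-link.
            # This also counts vocabulary class IRIs used with rdf:type.
            if subj.startswith(namespace) and not obj.startswith(namespace):
                outlinks[urlparse(obj).netloc] += 1
    return triples, outlinks


sample = [
    '<http://example.org/a> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/A> .',
    '<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#label> "A" .',
    '<http://example.org/a> <http://example.org/p> <http://example.org/b> .',
]

triples, outlinks = summarize(sample, "http://example.org/")
print(triples, dict(outlinks))
```

A production script would more likely use a full RDF parser and filter out vocabulary terms before counting links.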

16 Catalog Access Portal
For machines:
- CKAN API (continuously improved by OKFN)
- VoID descriptions for LOD group (will be continuously improved in cooperation with LATC)
For humans:
- LOD Cloud Diagram
- State of the LOD Report

17 Access for machines
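A sketch of what machine access through the CKAN API might look like. It assumes the `package_show` action of the current CKAN Action API (version 3); the portal at the time of the talk may have exposed an older REST API instead, and the base URL and dataset id used here are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_show_url(base_url, dataset_id):
    """Build a package_show request URL (CKAN Action API v3 is assumed)."""
    query = urlencode({"id": dataset_id})
    return f"{base_url.rstrip('/')}/api/3/action/package_show?{query}"


def parse_package(response_text):
    """Extract a few catalog fields from a package_show JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    pkg = payload["result"]
    return {
        "name": pkg.get("name"),
        "title": pkg.get("title"),
        "tags": [tag["name"] for tag in pkg.get("tags", [])],
        "resources": [res.get("url") for res in pkg.get("resources", [])],
    }


def fetch_package(base_url, dataset_id):
    """Fetch and parse one catalog entry over HTTP (needs network access)."""
    with urlopen(package_show_url(base_url, dataset_id)) as response:
        return parse_package(response.read().decode("utf-8"))


# Canned response so the parsing step can be tried offline (made-up values).
canned = json.dumps({
    "success": True,
    "result": {
        "name": "example-dataset",
        "title": "Example Dataset",
        "tags": [{"name": "lod"}],
        "resources": [{"url": "http://example.org/sparql"}],
    },
})
print(parse_package(canned))
```

Splitting URL construction and response parsing from the network call keeps the parsing logic testable without a live catalog.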

18 LOD Cloud Diagram

19 LOD Cloud Diagram (zoom in)

20 State of the LOD Cloud
Triples by domain, links by domain

Domain                   # of datasets   Triples          %         (Out-)Links    %
Media                    25              1,841,852,061    5.82 %    50,440,705     10.01 %
Geographic               31              6,145,532,484    19.43 %   35,812,328     7.11 %
Government               49              13,315,009,400   42.09 %   19,343,519     3.84 %
Publications             87              2,950,720,693    9.33 %    139,925,218    27.76 %
Cross-domain             41              4,184,635,715    13.23 %   63,183,065     12.54 %
Life sciences            41              3,036,336,004    9.60 %    191,844,090    38.06 %
User-generated content   20              134,127,413      0.42 %    3,449,143      0.68 %
Total                    295             31,634,213,770             503,998,829

http://www4.wiwiss.fu-berlin.de/lodcloud/state/
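The percentage columns are each domain's share of the stated totals. A quick check of that arithmetic, using the figures from the slide above (the `share` helper is only for this illustration):

```python
# (triples, out-links) per domain, copied from the slide above.
DOMAINS = {
    "Media": (1_841_852_061, 50_440_705),
    "Geographic": (6_145_532_484, 35_812_328),
    "Government": (13_315_009_400, 19_343_519),
    "Publications": (2_950_720_693, 139_925_218),
    "Cross-domain": (4_184_635_715, 63_183_065),
    "Life sciences": (3_036_336_004, 191_844_090),
    "User-generated content": (134_127_413, 3_449_143),
}

# Totals as stated on the slide; they are used as the denominators.
TOTAL_TRIPLES = 31_634_213_770
TOTAL_LINKS = 503_998_829


def share(part, total):
    """Percentage of the total, rounded to two decimals."""
    return round(100 * part / total, 2)


for domain, (triples, links) in DOMAINS.items():
    print(f"{domain:<24}{share(triples, TOTAL_TRIPLES):>7.2f} %"
          f"{share(links, TOTAL_LINKS):>8.2f} %")
```

For example, Government holds 13,315,009,400 of 31,634,213,770 triples, which is about 42.09 %, while contributing only about 3.84 % of the out-links.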

21 State of the LOD Cloud (2)
- SPARQL endpoint: 68.14 %
- RDF dumps: 39.66 %
- Provide provenance: 36.63 %
- Provide licensing: 17.84 %
- Vocabulary use: [figure]

22 Vocabularies Catalog
- Based on the BTC dataset (2.1 billion triples)
- Shows vocabulary usage in practice
- Executed on a 54-node Hadoop cluster
- Access portal: searchable, URI lookup, top usage statistics
- Hosted at http://vocab.cc
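The kind of usage statistics described here amounts to counting how often each property and class occurs in the data. The sketch below shows that counting on a single machine with a `Counter`; it is not the Hadoop job used for the catalog, and treating the object of `rdf:type` as the class being used is an assumption of this example.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"


def usage_counts(triples):
    """Count property usage and class usage over (subject, predicate, object) triples."""
    properties = Counter()
    classes = Counter()
    for _, predicate, obj in triples:
        properties[predicate] += 1
        if predicate == RDF_TYPE:
            classes[obj] += 1  # the object of rdf:type names the class used
    return properties, classes


# Made-up sample triples for illustration.
sample = [
    ("http://example.org/alice", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/bob", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/doc", RDF_TYPE, FOAF + "Document"),
    ("http://example.org/alice", FOAF + "name", "Alice"),
]

properties, classes = usage_counts(sample)
print("top property:", properties.most_common(1))
print("top class:", classes.most_common(1))
```

In a MapReduce setting the same logic would split into a map step emitting one (term, 1) pair per triple and a reduce step summing the pairs per term.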

23 Top Classes per Dataset

24 Top Properties per Dataset

25 Vocabularies Catalog
[Screenshots: vocab.cc search query results; vocab.cc URI lookup results]

26 Tools Catalog
- Initial focus on tools from the consortium
- Currently 15 tools
- Entry for Global Sensor Networks (GSN)
- Available from planet-data.eu

27 Tools Description
- Textual description
- What is it?
- Documentation
- Publications
- Requirements
- License
- Contact person/mailing list
- Organization
- Events
- Tags: Produce, Publish, Consume, Provisioning

28 Names of Tools in the Catalog
CumulusRDF, D2R, DBpedia Spotlight, GSN (Global Sensor Networks), Geometry2RDF, LDIF, LDSpider (Linked Data Spider), LarKC (Large Knowledge Collider), MonetDB, NOR2O, R2O&ODEMapster, OKKAM, Pubby, R2R, S2O, Silk

29 Tools Catalog
Related: LATC Tools Catalog
- 11 tools
- 5 tools in both, 10 new tools in PlanetData
Proposal for next year:
- Join catalogs at linkeddata.org
- Jointly maintain catalog until LATC finishes
- Build a community → people can add their own tools
- Afterwards PlanetData takes over and maintains the catalog for another 2 years

