PlanetData D2.7: Recommendations for contextual data publishing. Ioan Toma. © Copyright 2015 STI INNSBRUCK, www.sti-innsbruck.at

1 PlanetData D2.7 Recommendations for contextual data publishing
Ioan Toma

2 Attribution
Some slides from
– Steffen Stadtmueller, Tobias Käfer, Andreas Harth; PlanetData D2.7 Recommendations for contextual data publishing
– Thank you!

3 Agenda
Terminology and Motivation
Overview
Recommendations

4 What is data quality?
PlanetData D2.1 deliverable
Quality metrics: (1) accessibility, (2) interoperability and understandability, (3) timeliness, (4) openness, (5) verifiability, (6) consistency, (7) completeness, (8) conciseness, (9) structuredness, (10) relevancy, (11) validity, (12) reputation
The quality metrics also imply a set of best practices:
– interlinking of datasets
– provide provenance and licensing metadata
– use of widely deployed vocabularies
– use dereferenceable URIs for proprietary vocabulary terms
– mapping of proprietary vocabularies to other vocabularies
– provide dataset-level metadata
– refer to additional access methods
PlanetData D2.4 deliverable
General adherence to these best practices in the LOD Cloud

5 What is context?
PlanetData D2.3 deliverable
“information that can help data consumers to improve their services”
3 context dimensions:
– Spatial: information relating to (geo)spatial aspects of the entities described in the data (e.g., the shape of a product), or to geospatial aspects of the data itself (e.g., the location where a dataset was developed)
– Temporal: information with regard to time, dates or time periods
– Social: information about people and the relations between people

6 Motivation
Quality of data is an increasingly important issue in the Web of Data
The presence of context information (temporal, spatial, social) can by itself increase the quality of a dataset
Which recommendations should be followed for contextual data publishing?

7 Overview
Crawled 1.5 billion triples from the LOD cloud (BTC2014 – Billion Triple Challenge 2014 crawl)
Extraction of 3 data subsets from the BTC2014 snapshot according to context dimensions, i.e. (1) Spatial, (2) Temporal and (3) Social
Evaluation of data quality metrics (D2.1 + D2.4) for each of the datasets
Comparison with results of D2.4, i.e., metrics for a general LOD sample from the same time period
Deduction of recommendations from the results

8 BTC 2014 Snapshot
BTC Snapshot as of April 11, 2014
Crawl started February 20, 2014
Tool: adapted LDSpider
Document: all triples from the same location (URI)
PLD: Pay Level Domain, used to define the border of a dataset (host location)

Crawl               Metric: number of   Value
BTC 2014 Snapshot   Quads               1,533,623,743
                    Documents           14,233,739
                    PLDs                21,817
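
The two units defined on this slide, the document and the PLD, can be illustrated with a minimal sketch. It assumes quads are available as plain 4-tuples whose last element is the document URI; the function names, the example.org URIs and the "last two host labels" shortcut for the PLD are illustrative assumptions, not the procedure used for the crawl.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def naive_pld(uri):
    """Approximate the pay-level domain as the last two host labels.

    Assumption: this ignores multi-label public suffixes such as "co.uk";
    a real analysis would consult the Public Suffix List instead.
    """
    host = urlsplit(uri).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def group_by_document(quads):
    """Map each document URI (4th quad element) to the triples found there."""
    documents = defaultdict(list)
    for subject, predicate, obj, document in quads:
        documents[document].append((subject, predicate, obj))
    return dict(documents)


def documents_per_pld(documents):
    """Group document URIs by their (approximate) PLD, i.e. by dataset."""
    plds = defaultdict(set)
    for document in documents:
        plds[naive_pld(document)].add(document)
    return dict(plds)


# Hypothetical sample data, for illustration only.
SAMPLE_QUADS = [
    ("http://data.example.org/a#it", "http://purl.org/dc/terms/date",
     "2014-04-11", "http://data.example.org/a"),
    ("http://data.example.org/a#it", "http://www.w3.org/2000/01/rdf-schema#label",
     "A", "http://data.example.org/a"),
    ("http://www.example.org/b#it", "http://www.w3.org/2000/01/rdf-schema#label",
     "B", "http://www.example.org/b"),
]

docs = group_by_document(SAMPLE_QUADS)
plds = documents_per_pld(docs)
print(len(SAMPLE_QUADS), "quads,", len(docs), "documents,", len(plds), "PLDs")
```

In this sketch both sample documents fall under the same PLD, so they count as one dataset even though they are served from different hosts.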

9 Contextual Data Subsets
Approach:
– Identify context-indicating vocabulary terms
– Manual cleanup
– Extraction of all documents (and their triples) that contain an indicating term
Indicating terms:
– Temporal: properties used in conjunction with “temporal-typed” objects
  Temporal properties from http://purl.org/dc/terms/date, www.w3.org/2002/12/cal/ical#, http://swrc.ontoware.org/ontology#, etc.
– Spatial: classes used to type spatial entities (+ closure over rdfs:subClassOf and owl:equivalentClass)
  Classes from http://schema.org, http://www3.org/2003/01/geo/wg, http://purl.org/dc/terms, http://geovocab.org/spatial, www.opengis.net/ont/geospar, http://www.w3.org/ns/locn, etc.
– Social: classes used to type social entities (+ closure over rdfs:subClassOf and owl:equivalentClass)
  Classes from FOAF, schema.org, DBpedia, GoodRelations, etc.
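
The extraction approach above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tooling used for the deliverable: documents are assumed to be a dict from document URI to a list of triples, the function names are hypothetical, and `http://example.org/vocab#Harbour` is an invented subclass used only to show the closure step.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_closure(seed_classes, subclass_pairs):
    """Add every class that is a (transitive) subclass of a seed class.

    subclass_pairs holds (sub, super) tuples taken from rdfs:subClassOf
    statements; an owl:equivalentClass statement can be passed as two
    pairs, one per direction.
    """
    closure = set(seed_classes)
    changed = True
    while changed:
        changed = False
        for sub, sup in subclass_pairs:
            if sup in closure and sub not in closure:
                closure.add(sub)
                changed = True
    return closure


def extract_subset(documents, property_terms, class_terms):
    """Keep every document (with all its triples) that contains an indicating term."""
    def indicates(triple):
        _, predicate, obj = triple
        if predicate in property_terms:
            return True
        return predicate == RDF_TYPE and obj in class_terms

    return {doc: triples for doc, triples in documents.items()
            if any(indicates(t) for t in triples)}


# Hypothetical input, for illustration only.
spatial_classes = class_closure(
    {"http://schema.org/Place"},
    [("http://example.org/vocab#Harbour", "http://schema.org/Place")],
)
documents = {
    "http://example.org/doc/1": [
        ("http://example.org/doc/1#h", RDF_TYPE, "http://example.org/vocab#Harbour"),
    ],
    "http://example.org/doc/2": [
        ("http://example.org/doc/2#x", "http://www.w3.org/2000/01/rdf-schema#label", "x"),
    ],
}
spatial_subset = extract_subset(documents, property_terms=set(), class_terms=spatial_classes)
print(sorted(spatial_subset))
```

For the temporal subset the same function would be called with a set of indicating properties and an empty class set.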

10 Contextual Data Subsets

11 Recommendation
The general recommendation is to follow the best practices as detailed in PlanetData D2.1 (see slide 4)
Specific recommendations for providers of datasets with contextual information, as a guide on how to follow the best practices, are given on the following slides

12 Recommendation for Providing Links to Other Datasets
A considerable part of the datasets are not interlinked
Use owl:sameAs to establish identity links between contextual data items
– owl:sameAs is the most used predicate to establish an outlink
Promising datasets to find link targets (most used so far)
– General: DBpedia
– Spatial: GeoNames
– Social: DBLP (L3S), Semantic Web Dogfood
– Temporal: no identity links for the entity directly possible; use Freebase and DBpedia for the annotated entity
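
As a small illustration of such an identity link, the sketch below serialises one owl:sameAs statement as an N-Triples line. The helper name and the example.org URI are hypothetical; the DBpedia resource is used only as a plausible link target.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(local_uri, target_uri):
    """Serialise one identity link as an N-Triples line."""
    return f"<{local_uri}> <{OWL_SAME_AS}> <{target_uri}> ."


# Hypothetical local entity linked to a DBpedia resource.
link = same_as_triple(
    "http://example.org/place/innsbruck",
    "http://dbpedia.org/resource/Innsbruck",
)
print(link)
```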

13 Recommendation for Providing Provenance Data
Use widely deployed vocabularies to express provenance information, to ensure compatibility with existing tools and foster integration capabilities:
– Dublin Core (specifically for temporal information)
– MetaVocab (specifically for social information)
– Cert Ontology
For a thorough provisioning of provenance information we recommend W3C PROVenance Interchange
– Covers many aspects
– Most used for general purposes

14 Recommendation for Providing License Data
Most used predicates to indicate license information are
– cc:license
– dc:license
Employ the most used predicates, as agents are more likely to search for them
Use machine-readable, OKFN-conformant or Creative Commons licenses
– Already prevalent in contextual data subsets
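
A license statement with one of these predicates might look as follows. This is a sketch only: it assumes the cc: prefix stands for the Creative Commons namespace `http://creativecommons.org/ns#`, and both the helper name and the dataset URI are hypothetical.

```python
# Assumption: cc:license expands to this Creative Commons namespace URI.
CC_LICENSE = "http://creativecommons.org/ns#license"
CC_BY_4 = "http://creativecommons.org/licenses/by/4.0/"


def license_triple(dataset_uri, license_uri=CC_BY_4):
    """Serialise a machine-readable license statement as an N-Triples line."""
    return f"<{dataset_uri}> <{CC_LICENSE}> <{license_uri}> ."


# Hypothetical dataset URI, for illustration only.
statement = license_triple("http://example.org/dataset")
print(statement)
```

Pointing to the license by URI, rather than quoting its name in a literal, is what lets an agent compare it against a list of known licenses.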

15 Recommendation for Using Terms from Widely Deployed Vocabularies
The use of well-known vocabularies should always be preferred, to allow agents to interpret the data without mappings to other vocabularies
For temporal information Dublin Core is commonly employed (use in conjunction with ISO 8601 encoded literals)
For spatial information WGS84 for points, NeoGeo for polygons
For social information FOAF and CON
If the well-known vocabularies are insufficient, use vocab.cc and LODStats to find vocabularies in the long tail
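
The temporal and spatial recommendations can be sketched as below: a Dublin Core date with an ISO 8601 literal, and a WGS84 point. The helper names, the example.org subject and the coordinates are illustrative assumptions; typing the date literal as xsd:dateTime is one possible choice, not something the slides prescribe.

```python
from datetime import datetime, timezone

DCT_DATE = "http://purl.org/dc/terms/date"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
WGS84 = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def date_triple(subject, moment):
    """Attach an ISO 8601 timestamp to a subject using dcterms:date."""
    literal = moment.isoformat()  # ISO 8601, e.g. 2014-04-11T12:00:00+00:00
    return f'<{subject}> <{DCT_DATE}> "{literal}"^^<{XSD_DATETIME}> .'


def point_triples(subject, lat, lon):
    """Describe a point location with the WGS84 lat/long properties."""
    return [
        f'<{subject}> <{WGS84}lat> "{lat}" .',
        f'<{subject}> <{WGS84}long> "{lon}" .',
    ]


# Hypothetical entity; the coordinates are illustrative values only.
event = "http://example.org/event/1"
when = datetime(2014, 4, 11, 12, 0, tzinfo=timezone.utc)
lines = [date_triple(event, when)] + point_triples(event, 47.27, 11.39)
print("\n".join(lines))
```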

16 Recommendation for Dereferenceability of Proprietary Vocabulary Terms
Clear requirement from the Linked Data principles, which should always be adhered to when publishing linked data
Provide RDF descriptions of vocabulary terms via HTTP, independently of the presence of context information
Datasets with context information already follow this recommendation more often than the average on the LOD cloud
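
One way to probe whether a vocabulary term is dereferenceable is an HTTP GET with an RDF-oriented Accept header. The sketch below is an assumption-laden illustration: the function names are hypothetical, only two RDF media types are recognised, and this is not the measurement used in the deliverable.

```python
from urllib.request import Request, urlopen

RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"
RDF_TYPES = {"text/turtle", "application/rdf+xml"}


def rdf_request(term_uri):
    """Build a GET request for a term; the fragment is never sent to the server."""
    document_uri = term_uri.split("#", 1)[0]
    return Request(document_uri, headers={"Accept": RDF_ACCEPT})


def is_dereferenceable(term_uri, timeout=10):
    """Return True if the term URI answers with one of the RDF media types above.

    This performs a network call; any connection or HTTP error counts as
    "not dereferenceable" in this sketch.
    """
    try:
        with urlopen(rdf_request(term_uri), timeout=timeout) as response:
            content_type = response.headers.get_content_type()
    except OSError:
        return False
    return content_type in RDF_TYPES


# Hypothetical proprietary term, for illustration only.
request = rdf_request("http://example.org/vocab#Harbour")
print(request.full_url, request.get_header("Accept"))
```

A publisher could run such a check against every term of a proprietary vocabulary before release.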

17 Recommendations for Mapping of Proprietary Vocabularies to Others
Reuse of existing vocabularies should always be preferred over a proprietary vocabulary
If a proprietary vocabulary is necessary, map its terms to well-known and often used vocabularies to increase integration capabilities:
– General purpose: Dublin Core, SKOS
– Temporal: Dublin Core
– Spatial: WGS84
– Social: FOAF
Use subclass and subproperty relations to establish links
Note: if an equivalent class or equivalent property could be employed, the direct use of the target term should be preferred!
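
A mapping of this kind might be expressed as in the following sketch, which emits rdfs:subClassOf or rdfs:subPropertyOf statements as N-Triples. The proprietary terms under example.org and the helper name are invented for illustration.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def mapping_triple(proprietary_term, target_term, is_property=False):
    """Link a proprietary term to a well-known term via subclass or subproperty."""
    relation = RDFS + ("subPropertyOf" if is_property else "subClassOf")
    return f"<{proprietary_term}> <{relation}> <{target_term}> ."


# Hypothetical proprietary terms mapped to FOAF and Dublin Core.
mappings = [
    mapping_triple("http://example.org/vocab#Researcher",
                   "http://xmlns.com/foaf/0.1/Person"),
    mapping_triple("http://example.org/vocab#publishedOn",
                   "http://purl.org/dc/terms/date", is_property=True),
]
print("\n".join(mappings))
```

If `Researcher` meant exactly the same as `foaf:Person`, the note on the slide applies: use `foaf:Person` directly instead of publishing a mapping.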

18 Recommendation for Providing Dataset-level Metadata
Only very few datasets make use of VoID information, so automated agents should not rely on the presence of such information
Our analysis shows that the well-known URI mechanism is used more often
However, we recommend (additionally) providing a backlink to the VoID file, so agents can rely solely on link traversal to identify all relevant information
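
Both mechanisms can be sketched briefly: deriving the well-known VoID location for a host, and emitting a backlink from a document to its dataset description. The sketch assumes the `/.well-known/void` path and the void:inDataset property as the backlink; the function names and example.org URIs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"


def well_known_void(resource_uri):
    """Return the well-known VoID location on the host serving resource_uri."""
    parts = urlsplit(resource_uri)
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/void", "", ""))


def backlink_triple(document_uri, dataset_uri):
    """Link a document to its dataset so agents can reach the VoID description."""
    return f"<{document_uri}> <{VOID_IN_DATASET}> <{dataset_uri}> ."


# Hypothetical document and dataset URIs, for illustration only.
void_location = well_known_void("http://data.example.org/place/1#it")
backlink = backlink_triple(
    "http://data.example.org/place/1",
    "http://data.example.org/void.ttl#dataset",
)
print(void_location)
print(backlink)
```

With the backlink in place, an agent that lands on any document can follow one link to the dataset description instead of guessing its location.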

19 Recommendation for Referring to Additional Access Methods
Whether providing an additional access method makes sense depends mostly on the size and structure of the dataset, not on the presence of context information.
If an alternative access method is offered, we recommend providing a machine-readable link to the access method in a VoID description.
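
Such machine-readable links might be expressed with the VoID properties void:sparqlEndpoint and void:dataDump, as in this sketch. The helper name and all example.org URIs are hypothetical.

```python
VOID = "http://rdfs.org/ns/void#"


def access_method_triples(dataset_uri, sparql_endpoint=None, data_dumps=()):
    """Describe the additional access methods of a dataset as N-Triples lines."""
    triples = []
    if sparql_endpoint:
        triples.append(f"<{dataset_uri}> <{VOID}sparqlEndpoint> <{sparql_endpoint}> .")
    for dump in data_dumps:
        triples.append(f"<{dataset_uri}> <{VOID}dataDump> <{dump}> .")
    return triples


# Hypothetical dataset with a SPARQL endpoint and one dump file.
access = access_method_triples(
    "http://example.org/void.ttl#dataset",
    sparql_endpoint="http://example.org/sparql",
    data_dumps=["http://example.org/dumps/all.nt.gz"],
)
print("\n".join(access))
```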

