A Semantic Web View on Concepts and their Alignments
Antoine Isaac, Vrije Universiteit Amsterdam, Europeana
Concepts in Context, Köln, July 19th, 2010
Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information using standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
Tim Berners-Lee, A way to publish Semantic Web data
A web of data
Publish and re-use data via the web, building innovative applications over former data silos
Principle #4 is crucial to this vision: Include links to other URIs, so that they can discover more things.
SKOS, Knowledge Organization Systems and Linked Data
SKOS allows representing (simple) KOS data as RDF
  animals
    NT cats
  cats
    UF domestic cats
    RT wildcats
    BT animals
    SN used only for domestic cats
  domestic cats
    USE cats
  wildcats
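The thesaurus entries on this slide (animals, cats, wildcats) can be sketched as SKOS triples. The following is a minimal illustration in plain Python, not a reference conversion: the `ex:` prefix, the concept identifiers, and the English language tag are assumptions made for the example. The code-to-property table follows the usual SKOS reading of NT, BT, RT, UF and SN.

```python
# Assumption: the "ex:" prefix, the identifiers and the "@en" tag are made up
# for illustration; only the thesaurus entries come from the slide.
THESAURUS = {
    "animals": {"NT": ["cats"]},
    "cats": {
        "UF": ["domestic cats"],
        "RT": ["wildcats"],
        "BT": ["animals"],
        "SN": ["used only for domestic cats"],
    },
    "wildcats": {},
}

# Thesaurus relation codes and the SKOS properties usually used for them.
CODE_TO_SKOS = {
    "NT": "skos:narrower",
    "BT": "skos:broader",
    "RT": "skos:related",
    "UF": "skos:altLabel",   # non-preferred term becomes a label, not a concept
    "SN": "skos:scopeNote",
}
LITERAL_CODES = {"UF", "SN"}  # these codes point to text, not to other concepts


def concept_id(term):
    """Build a (hypothetical) compact URI for a preferred term."""
    return "ex:" + term.replace(" ", "_")


def to_triples(thesaurus):
    """Turn thesaurus entries into (subject, predicate, object) triples."""
    triples = []
    for term, relations in thesaurus.items():
        subject = concept_id(term)
        triples.append((subject, "rdf:type", "skos:Concept"))
        triples.append((subject, "skos:prefLabel", f'"{term}"@en'))
        for code, values in relations.items():
            for value in values:
                if code in LITERAL_CODES:
                    obj = f'"{value}"@en'
                else:
                    obj = concept_id(value)
                triples.append((subject, CODE_TO_SKOS[code], obj))
    return triples


triples = to_triples(THESAURUS)
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

In this sketch the entry "domestic cats USE cats" does not produce a concept of its own: it is the inverse of "cats UF domestic cats" and is carried by the `skos:altLabel` on the cats concept.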
SKOS, KOSs and LD
SKOS allows bridging across KOSs from different contexts
Some landmark KOS LD implementations
Many libraries – not a surprise!
  Swedish National Library’s Libris catalogue and thesaurus
  Library of Congress’ vocabularies, including LCSH
  DNB’s Gemeinsame Normdatei (incl. SWD subject headings)
  Documentation at BnF’s RAMEAU subject headings
  OCLC’s DDC classification and VIAF
  STW economy thesaurus
  National Library of Hungary’s catalogue and thesauri (example)
Other fields
  Wikipedia categories through DBpedia
  New York Times subject headings
  IVOA astronomy vocabularies
  GEMET environmental thesaurus
  UMTHES
  Agrovoc
  Linked Life Data
  TaxonConcept
  UK Public sector vocabularies
KOS Alignments?
Many of them are linked to some other resource
  LCSH, SWD and RAMEAU interlinked through MACS mappings
  GND linked to DBpedia and VIAF
  Libris linked to LCSH
  Agrovoc to CAT, NAL, SWD, GEMET
  NYT to Freebase, DBpedia, GeoNames
  DBpedia links are overwhelming: Hungary, STW, TaxonConcept, GND…
Is that enough? Are these links any good?
[Cyganiak, Jentzsch] Sparse linkage: the LD cloud
[Guéret, 2010] Sparse linkage: another view
Linked Data Issues
Mike Uschold’s “semantic elephants”
  Proliferation of URIs, Managing Coreference
  Versioning and URIs
  Overloading owl:sameAs
What kind of links?
Coreference links are the most used (and needed)
  owl:sameAs
  skos:exactMatch
  skos:closeMatch
  rdfs:seeAlso
  umbel:isLike
Overloading owl:sameAs
Formally, two URIs linked by owl:sameAs are inferred to have the same properties
  ex:a name “Antoine Isaac”.
  ex:b owl:sameAs ex:a.
  Implies
  ex:b name “Antoine Isaac”.
Many owl:sameAs statements are asserted between resources that are only very similar [Halpin 2009]
  The same resource but in different contexts, a reference…
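The inference on this slide can be reproduced with a few lines of plain Python. This is a deliberately simplified sketch, not an OWL reasoner: it only copies statements whose subject is one of the linked URIs (object-position substitution is left out), and the `ex:name` property is an assumed stand-in for the unprefixed `name` on the slide.

```python
SAME_AS = "owl:sameAs"


def apply_same_as(triples):
    """Copy every property of a resource to the resources it is owl:sameAs.

    Simplification: only subject positions are rewritten, and owl:sameAs
    statements themselves are not copied.
    """
    facts = set(triples)

    # owl:sameAs is symmetric, so record each link in both directions.
    pairs = set()
    for s, p, o in facts:
        if p == SAME_AS:
            pairs.add((s, o))
            pairs.add((o, s))

    # Repeat until no new statement is derived (handles chains a = b = c).
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            for s, p, o in list(facts):
                if s == a and p != SAME_AS and (b, p, o) not in facts:
                    facts.add((b, p, o))
                    changed = True
    return facts


asserted = [
    ("ex:a", "ex:name", "Antoine Isaac"),
    ("ex:b", SAME_AS, "ex:a"),
]
inferred = apply_same_as(asserted) - set(asserted)
print(inferred)
```

The same mechanism copies every other statement too, including management metadata attached to one of the two URIs, which is why asserting owl:sameAs between merely similar resources is risky.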
Case study: New York Times
10K concepts (places, descriptors, persons, organizations)
Manually or automatically mapped by NYT staff to DBpedia, Freebase, GeoNames
Linking LD cloud to NYT articles!
  Makes it easy to mix NYT content with other content
Started with quite messy modeling
  dcterms:rightsHolder The New York Times Company. owl:sameAs
Clearer KOS alignments (1)
What is being aligned?
  Concepts, documents, real-world entities “out there” (persons, places…)
  In principle owl:sameAs should not be applied across disjoint categories
  But even for one category there can be issues
    Two KOS concepts representing the same notion but with different management metadata attached (skos:changeNote)
Clearer KOS alignments (2)
How is it aligned?
Distinguish:
  exact co-reference
  conceptual similarity, including equivalence
  classification
Making clearer distinctions between conceptual links
  skos:narrowMatch, skos:broadMatch, skos:relatedMatch
Minimize ontological commitment for KOS data consumers
  skos:exactMatch: concepts can be used interchangeably across a wide range of information retrieval applications. skos:exactMatch is a transitive property
  skos:closeMatch: In order to avoid the possibility of "compound errors" when combining mappings across more than two concept schemes, skos:closeMatch is not declared to be a transitive property
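The practical difference between the two properties shows up when mappings are chained across three concept schemes. The sketch below, in plain Python, treats both properties as symmetric (as SKOS declares them) but computes a transitive closure for skos:exactMatch only. All concept identifiers are hypothetical and chosen purely for illustration; the fact that skos:exactMatch is also a sub-property of skos:closeMatch is not modelled here.

```python
def infer_matches(mappings):
    """Return (exact_pairs, close_pairs) entailed by a set of mapping triples.

    Both properties are symmetric; only skos:exactMatch is transitive.
    """
    exact = {(a, b) for a, p, b in mappings if p == "skos:exactMatch"}
    close = {(a, b) for a, p, b in mappings if p == "skos:closeMatch"}

    # Symmetry applies to both properties.
    exact |= {(b, a) for a, b in exact}
    close |= {(b, a) for a, b in close}

    # Transitive closure, for skos:exactMatch only.
    changed = True
    while changed:
        changed = False
        for a, b in list(exact):
            for c, d in list(exact):
                if b == c and a != d and (a, d) not in exact:
                    exact.add((a, d))
                    changed = True
    return exact, close


# Hypothetical mappings across three schemes each.
mappings = {
    ("lcsh:cats", "skos:exactMatch", "swd:katzen"),
    ("swd:katzen", "skos:exactMatch", "rameau:chats"),
    ("a:river_bank", "skos:closeMatch", "b:bank"),
    ("b:bank", "skos:closeMatch", "c:financial_institution"),
}
exact, close = infer_matches(mappings)

print(("lcsh:cats", "rameau:chats") in exact)                 # chained
print(("a:river_bank", "c:financial_institution") in close)   # not chained
```

The second pair of mappings shows the "compound error" that non-transitivity guards against: two plausible close matches would otherwise combine into a clearly wrong one.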
Case study: New York Times (2)
Data quality has considerably improved
  Factual data is at the concept itself, management data is at the resource representing the data source (context)
  rdf:type skos:Concept ; skos:prefLabel “Park Slope (NYC)” ; geo:lat “ ” ; owl:sameAs
  dcterms:rightsHolder “The New York Times Company” ; foaf:primaryTopic
Still, for resources linked with owl:sameAs, statements representing different modeling choices can be merged
  the DBpedia resource might not be a skos:Concept, or use a different latitude format
Clearer KOS alignments (3)
What is the alignment for?
  SKOS mapping properties use the notion of validity within one application context
  Application context for mapping has been investigated in thesaurus interoperability studies
Application of alignments matters:
  STITCH application scenarios for Cultural Heritage: book re-indexing, thesaurus merging, query reformulation…
  The same alignment performs differently for different scenarios [Isaac 2008, Wang 2009]
Application-specific alignment evaluation Example: OAEI 2007 campaign, 3 matching tools evaluated for thesaurus merging & book re-indexing
Application-specific alignments
Why? Take 2 thesauri at the National Library of the Netherlands: GTT and Brinkman
  For thesaurus merging, gtt:excavation should be aligned to brinkman:excavation
  For book re-indexing, gtt:excavation should be aligned to brinkman:archeology_netherlands
Requires a finer-grained representation of the context in which the alignment is produced
  Who created it?
  Manual vs. automatic?
  Which alignment strategy or tool?
  Is there a degree of confidence?
Case study: New York Times (3)
Using nyt:mapping_strategy property with nyt:manual or nyt:automatic:
  nyt:mapping_strategy
Problem: it applies to the context file for the concept, not to the statement itself:
  owl:sameAs
Using simple binary properties (skos:exactMatch…) between aligned resources does not allow for much flexibility
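One way to attach the strategy to the statement itself is to treat each mapping as an object in its own right, carrying its own context. The sketch below is a hypothetical data structure in plain Python, not the NYT model or any existing alignment format. The field names, the creator and tool values, the confidence score and the choice of skos:relatedMatch for the re-indexing mapping are all assumptions; only the GTT/Brinkman concepts and the two scenarios come from the earlier slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    """A mapping statement together with the context it was produced in."""
    source: str
    target: str
    relation: str
    scenario: str                      # application the mapping is meant for
    method: str                        # "manual" or "automatic"
    creator: str
    tool: Optional[str] = None         # only meaningful for automatic mappings
    confidence: Optional[float] = None


def select(mappings, scenario, min_confidence=0.0):
    """Keep mappings for one scenario; unscored mappings are always kept."""
    return [
        m for m in mappings
        if m.scenario == scenario
        and (m.confidence is None or m.confidence >= min_confidence)
    ]


def to_simple_triples(mappings):
    """Drop the context and keep only plain binary statements."""
    return [(m.source, m.relation, m.target) for m in mappings]


# Illustrative values only (creator, tool, confidence, relation are assumed).
mappings = [
    Mapping("gtt:excavation", "brinkman:excavation", "skos:exactMatch",
            scenario="thesaurus merging", method="manual",
            creator="librarian"),
    Mapping("gtt:excavation", "brinkman:archeology_netherlands",
            "skos:relatedMatch", scenario="book re-indexing",
            method="automatic", creator="matching tool",
            tool="hypothetical-matcher", confidence=0.6),
]

for m in select(mappings, scenario="book re-indexing"):
    print(m.source, m.relation, m.target, m.method, m.confidence)
```

A consumer with precise needs can filter on the context fields, while `to_simple_triples` shows that the plain binary form can still be derived from the richer one.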
Ontology Matching community practices
Community investigating the ontology and vocabulary matching issues
  Ontology Alignment Evaluation Initiative
Matching tools produce some metadata
Metadata repositories store and manage them
  – BioPortal
  – CATCH vocabulary and alignment repository
  …
Consensus: richer alignment metadata is needed
From a simple representation
to a more complete one
Can LD accommodate complex representations?
The strength of the LD vision lies in the relative simplicity of a standard representation
  LD provides a simple way to publish data and follow one’s nose to connected data
  Serendipity!
Reification and metadata on links are not really compatible with it
  Higher barrier for data publication and consumption
Peaceful co-existence
Applications with a narrow scope that require precise data can afford
  Selecting alignments they consume
  Exploiting finer-grained representations
  Creating finer-grained representations
Simple data for applications that are simple and/or exploiting a wide range of datasets
  Simple mash-up applications robust to (limited) approximation
  Web-scale applications
    Large-scale document retrieval, Concept discovery
Does it need to be perfect anyway?
Do we really want to throw away crucial URI co-reference data?
  has 35,187,488 URIs in 11,285,263 bundles
  Extensive linking to DBpedia is useful, even with a type of link that is not used in the theoretically correct way
    Cf. BBC content and data mash-ups
Issues with mixed quality are being tackled
  – as a “service to provide you with help finding URIs”, keeping track of data sources
  – Representation and exchange of provenance info is under active investigation
Peaceful co-existence (2)
If you have a complex representation, don’t be pedantic: publish simpler data, too!
Articulation between LD (to discover links) and alignment repositories is needed
  Technically feasible, best practices have to be identified
Conclusions
(Almost) any alignment is better than none
  This is a web of data, without links there’s almost no value
There is already great linking happening!
More involvement from this community would certainly help!
  Alignments themselves & Theoretical foundations
Thanks!
Possible participation channels:
  Linked Open Data community and mailing list
  Library Linked Data W3C incubator group and community list
References
[Halpin 2009] Harry Halpin, Pat Hayes. When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web. LDOW 2009
[Isaac 2008] Antoine Isaac, Henk Matthezing, Lourens van der Meij, Stefan Schlobach, Shenghui Wang, Claus Zinn. Putting ontology alignment in context: usage scenarios, deployment and evaluation in a library case. ESWC 2008
[Wang 2009] Shenghui Wang, Antoine Isaac, Balthasar Schopman, Stefan Schlobach, Lourens van der Meij. Matching multi-lingual subject vocabularies. ECDL 2009