2013-05-17 - Utrecht Matej Ďurčo, ICLTT, Vienna Controlled Vocabularies and SMC4LRT Semantic Mapping in CMDI.

1 2013-05-17 - Utrecht Matej Ďurčo, ICLTT, Vienna Controlled Vocabularies and SMC4LRT Semantic Mapping in CMDI

2 Context
Activities:
● CLARIN taskforce within SCCTC, building on CLAVAS (Vocabulary Alignment Service for CLARIN)
● DARIAH joint taskforce of VCC1/Task 5 (Data federation and interoperability) and VCC3/Task 3 (Reference Data Registries), with external partners. Goal: establish a service providing controlled vocabularies and reference data for the DARIAH (and CLARIN) community.
● SMC (Semantic Mapping Component), a module in the CMD infrastructure. Goal: "semantic search", i.e. enhance search in the heterogeneous data collection (of CMDI)
  a) by exploiting the shared data categories (SMC on schema level)
  b) by expressing the data in RDF (SMC on instance level)

3 Context II - CLARIN-AT
CCV (CLARIN Center Vienna)
● CenterProfile CMD record: expected ready by 2013-06
● Infrastructure services:
  - CLARIN Metadata Repository
  - SMC (Semantic Mapping Component)
  - SMC-Browser
  - Controlled Vocabularies
● Engagement in CLARIN + DARIAH task forces

4 Old vision: conceptualization sketch from 2009

5 Potential usages for CV
● Metadata generation, curation
● Data enrichment / annotation
● Data analysis
● Search (query expansion, autocomplete, facets, etc.)
● Needed for CMD2RDF:
  - provide identifiers for entities
  - provide equivalencies between concepts/entities from different vocabularies (concept schemes), like the equivalencies in Wikipedia (page for Johann Wolfgang Goethe): GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065
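
The Goethe identifiers on this slide form one equivalence set across vocabularies. A minimal sketch of how such equivalencies could be stored and queried, assuming a plain in-memory dictionary; the function names and data layout are illustrative and not part of CLAVAS or CMD2RDF:

```python
# Identifiers for one entity, as listed on the slide (Wikipedia page for Goethe).
GOETHE = {"GND": "118540238", "LCCN": "n79003362", "NDL": "00441109", "VIAF": "24602065"}

def build_equivalence_index(entities):
    """Map every (vocabulary, id) pair to the full id set of its entity."""
    index = {}
    for ids in entities:
        for vocab, ident in ids.items():
            index[(vocab, ident)] = ids
    return index

def translate(index, vocab, ident, target_vocab):
    """Return the equivalent id in target_vocab, or None if it is unknown."""
    ids = index.get((vocab, ident))
    return ids.get(target_vocab) if ids else None

index = build_equivalence_index([GOETHE])
print(translate(index, "GND", "118540238", "VIAF"))
```

In an RDF setting the same information would more likely be expressed as links between concept IRIs; the dictionary only illustrates the lookup.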

6 Related Activities
● DARIAH Schema Registry + Crosswalk Registry
● LT-World @DFKI: full-blown ontology with People, Projects, Organisations, Events, LR; integration would have to happen at another level (RDF/LOD)
● CoNE - Control of Named Entities @MPDL/eSciDoc
● EATS - Entity Authority Tool Set @New Zealand Electronic Text Centre (NZETC)
● TextGrid
● FRBR - Functional Requirements for Bibliographic Records
● RDA - Resource Description and Access
● Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)

7 Candidate Vocabularies
● Data Categories / Concepts - ISOcat
● Languages - ISO-639
● Countries - country codes
● Persons - GND, VIAF, dbpedia?
● Organizations - GND, VIAF, dbpedia?
● Subject headings (Schlagwörter) - GND, LCSH
● Resource Typology -
● Tagsets!? (with mappings between tags)
Abbreviations:
AAT - Art & Architecture Thesaurus
GND - Gemeinsame Normdatei (DNB)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File

8 ISOcat and CLAVAS
● Export closed + simple DCs (perhaps even better to select manually)
● Third-party applications use:
  - ISOcat for the explain() function
  - CLAVAS for value (/entity) lists

9 Informed query input
● Information about the available data categories, and the values for those categories, can be used as the basis for a complex query-input widget with context-sensitive autocomplete.
● However, this serves only as a fallback to autocomplete based on the actual data.

10 CMD → RDF
Semantic Mapping on instance level: express MD records in RDF (for LOD) => bind also values in MD fields to concepts
Modelling aspects:
● CMD Specification
● Data Categories
● CMD instances:
  - Identifier, Provenance, Hierarchy
  - Components, Elements
  - Values, Literal Values, Mapping to entities
● Vocabularies => CLAVAS
● Ontological Relations
Used namespaces (prefix names): rdf:, rdfs:, xsd:, owl:, skos:, isocat:, dcr:, cmd:, cmd_spec:?, dce:, dcterms:, oa:, ore:, cr:
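
A minimal sketch of expressing one flat metadata record as RDF triples and binding a field value to an entity. The `cmd:` namespace IRI, the record layout, and both function names are assumptions made for illustration; the slide does not give the actual IRIs or the CMD2RDF model.

```python
# Hypothetical namespace IRI; the real cmd: IRI is not given on the slide.
CMD = "http://example.org/cmd/"

def record_to_triples(record_id, fields, entity_links=None):
    """Turn one flat metadata record into (subject, predicate, object) triples.

    fields: dict mapping a field path to its literal value
    entity_links: optional dict mapping a field path to an entity IRI
    """
    entity_links = entity_links or {}
    subject = f"<{CMD}record/{record_id}>"
    triples = []
    for path, value in fields.items():
        predicate = f"<{CMD}{path}>"
        # Escape backslashes and quotes so the literal stays valid N-Triples.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, predicate, f'"{escaped}"'))
        if path in entity_links:
            # Besides the literal, link the field to the matched entity.
            triples.append((subject, predicate, f"<{entity_links[path]}>"))
    return triples

def to_ntriples(triples):
    """Serialize triples in N-Triples style, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

example = record_to_triples(
    "r1",
    {"Actor.Name": "Johann Wolfgang Goethe"},
    {"Actor.Name": "http://viaf.org/viaf/24602065"},
)
print(to_ntriples(example))
```

Reusing the same predicate for the literal and the entity link is a simplification; a real model would probably distinguish the two.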

11 Approach - Individuals/Instance Level
One step when (pre)processing incoming new MD sets:
1. Express MD records as RDF triples
2. Identify potential target domain ontologies/vocabularies
3. Create inverted index
4. Define lookup function
5. Enrich dataset with new facts
6. Property values of metadata records are linked to individuals of domain ontologies
lookup(category, string-value) → label → entity
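
Steps 3 to 5 can be sketched as follows. This is an illustration under stated assumptions, not the SMC implementation: the vocabulary layout, the `ex:` entity identifiers, and the whitespace/case normalization are all hypothetical.

```python
from collections import defaultdict

def normalize(label):
    """Lower-case and collapse whitespace so trivially different strings match."""
    return " ".join(label.lower().split())

def build_inverted_index(vocabulary):
    """vocabulary: iterable of (category, entity_id, labels) tuples."""
    index = defaultdict(set)
    for category, entity_id, labels in vocabulary:
        for label in labels:
            index[(category, normalize(label))].add(entity_id)
    return index

def lookup(index, category, string_value):
    """lookup(category, string-value) -> label -> entity ids (sorted list)."""
    return sorted(index.get((category, normalize(string_value)), ()))

def enrich(records, index):
    """Yield (record_id, category, entity_id) for every value that matches."""
    for record_id, values in records:
        for category, value in values.items():
            for entity_id in lookup(index, category, value):
                yield (record_id, category, entity_id)

# Hypothetical vocabulary entries and record.
vocab = [
    ("Organisation", "ex:org1", ["Example Institute", "EI"]),
    ("Person", "ex:p1", ["Example Person"]),
]
index = build_inverted_index(vocab)
records = [("rec1", {"Organisation": "EI", "Person": "Nobody"})]
print(list(enrich(records, index)))
```

Exact matching after normalization is the simplest possible strategy; ambiguous labels return several entity ids, which the caller has to resolve.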

12 Candidate Categories/Properties
● ResourceType, Format, AnnotationLevelType → map to: isocat DataCategories (Profiles: Metadata, Morphosyntax, ...)
● Genre, Topic, Subject → map to: taxonomies, library classification systems (LCSH, DDC, Dornseiff, ...)
● Project, Institution, Person, Publisher: open controlled vocabularies (real entities) → map to: CLAVAS organisations, LT-World (perhaps others: LCCN, DBPedia?)

13 Next Steps
● Install current OpenSKOS at CCV (CLARIN Center Vienna)
● Synchronize 3 current datasets via OAI-PMH with sister instance at Meertens, also to test the synchronization process (and implications)
● CMD2RDF
● "Special groups vocabularies" in CLARIN-AT:
  - Plant names
  - Instruments
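
For the OAI-PMH synchronization, an incremental harvest boils down to `ListRecords` requests with a `from` date. A minimal sketch that only builds the request URL; the base URL is a placeholder, and `oai_dc` is used because OAI-PMH requires every repository to support it, not because it is known to be the prefix used between the two OpenSKOS instances.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    verb, metadataPrefix, from and set are standard OAI-PMH arguments;
    base_url is whatever endpoint the repository exposes.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec    # restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"

# Placeholder endpoint, not a real service.
url = list_records_url("https://example.org/oai-pmh", "oai_dc", from_date="2013-05-01")
print(url)
```

A full harvester would also follow the `resumptionToken` returned with incomplete result lists.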

14 Appendix: Explanations of SMC and CMDI

15 Semantic Mapping (schema level) - concept
Metadata fields in (completely) different profiles can be bound to the same data categories (ConceptLinks). Use this linkage when searching in the data, i.e. allow the user to search
a) "in the data category"
b) in a MD field, but also in all related fields from other profiles
Multiple mapping levels:
1. just mapping based on the ConceptLink, resolvable via ComponentRegistry: different elements pointing to the same DatCat
2. use equivalence relations between DatCats from Relation Registry
3. use equivalence relations also between Container DatCats
4. use also other relations in Relation Registry (subClassOf, almostSameAs, …)
5. apply selected (user-defined) relation sets from Relation Registry
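
Mapping levels 1 and 2 can be sketched as a lookup over ConceptLinks, optionally widened by equivalence relations. The field paths and data category identifiers are taken from the slides; the equivalence between `isocat:DC-2482` and `dct:language` is a hypothetical relation added for illustration, as are the function and variable names.

```python
# Field path -> data category (ConceptLink), as resolvable via the Component Registry.
concept_links = {
    "Session.Resources.WrittenResource.LanguageId": "isocat:DC-2482",
    "ToolService.Documentation.DocumentationLanguages.Language.LanguageName": "isocat:DC-2484",
    "OLAC-DcmiTerms.language": "dct:language",
}

# Hypothetical equivalence relation, standing in for a Relation Registry set.
equivalences = {"isocat:DC-2482": {"dct:language"}}

def expand_datcat(datcat, concept_links, equivalences, level=1):
    """Return the sorted field paths reachable from a data category.

    level 1: only fields whose ConceptLink is the data category itself
    level 2: additionally fields bound to equivalent data categories
    """
    datcats = {datcat}
    if level >= 2:
        datcats |= equivalences.get(datcat, set())
    return sorted(path for path, dc in concept_links.items() if dc in datcats)

print(expand_datcat("isocat:DC-2482", concept_links, equivalences, level=1))
print(expand_datcat("isocat:DC-2482", concept_links, equivalences, level=2))
```

Levels 3 to 5 would extend `datcats` in the same way, using container data categories, further relation types, or a user-selected relation set.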

16 CMDI linking
● Components and elements in CMD profiles are bound to data categories
● The CMD records reference their profiles
● In Relation Registry, data categories are related to each other in separate (possibly overlapping/contradicting) relation sets

17 Semantic Mapping Component
● Separate CMDI module
● Relies on information from ComponentRegistry, DCR, RelationRegistry
● Is used by Metadata Repository / Service / Browser
Task:
- resolution: dcrIndex ↔ cmdIndex
  dcrIndex :: (abstract) data category defined in DCR
  cmdIndex :: path to a field in a MDRecord (different from …
- query expansion: CQL(datcat) → CQL(cmdIndex[])
- query translation: e.g. CQL → XPath
Input => Output:
dcrIndex isocat.DC-2545 (= isocat.resourceTitle) => cmdIndex[] [BamdesCommonFields.resourceTitle, imdi-corpus.Corpus.Title, …]
cmdIndex Actor.Role => dcrIndex isocat:DC-2559 (participantRole)
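
The query expansion CQL(datcat) → CQL(cmdIndex[]) can be illustrated with simple string rewriting of a single search clause. This is a sketch only: it builds the expanded clause from a lookup table and does not parse CQL; the table contents come from the example on this slide, and the function name is hypothetical.

```python
# dcrIndex -> cmdIndex[] resolution table, using the example from the slide.
DCR_TO_CMD = {
    "isocat.DC-2545": [
        "BamdesCommonFields.resourceTitle",
        "imdi-corpus.Corpus.Title",
    ],
}

def expand_cql(index, relation, term, mapping):
    """Rewrite one clause `index relation "term"` into an OR over cmd indexes.

    If the index is not a known data category, the clause is returned unchanged.
    """
    quoted = '"' + term.replace('"', '\\"') + '"'
    targets = mapping.get(index)
    if not targets:
        return f"{index} {relation} {quoted}"
    clauses = [f"{cmd_index} {relation} {quoted}" for cmd_index in targets]
    return "(" + " OR ".join(clauses) + ")"

print(expand_cql("isocat.DC-2545", "=", "Faust", DCR_TO_CMD))
```

The reverse direction (cmdIndex → dcrIndex) is a plain dictionary lookup over the inverted table.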

18 Examples of DCR use in CMD metadata
resourceName isocat:DC-2544
- CorpusProfile.Corpus.Metadata.Name
- CorpusProfile.Corpus.SourceList.Source.Name
- collection.GeneralInfo.Name
- Session.Name
- imdi-corpus.Corpus.Name
- ToolService.GeneralInfo.Name
- GTRP.Collection.GeneralInfo.Name
- DIDDD.Collection.GeneralInfo.Name
- Soundbites.Collection.GeneralInfo.Name
- DynaSAND.Collection.GeneralInfo.Name
BUT: CMD Element "Name" …
CMD Element name    |distinct Elems|    |distinct DatCats|
Name                40                  11
Type                16                  8
Title               14                  6
Language            10                  6
ID                  11                  5
format              10                  5
identifier          6                   5
Description         31                  4
Code                8                   4
date                12                  4
publisher           9                   4
source              10                  4
subject             6                   4
Creator             6                   3
Address             5                   3
Organisation        3                   3
Availability        6                   3
datatype            8                   3
contributor         4                   3
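
Counts of this kind can be derived by grouping element paths by their last segment. A minimal sketch, assuming a list of (element path, data category) bindings as input and counting distinct paths as "distinct elements"; the exact counting rule behind the table is not stated on the slide, and the function name is illustrative.

```python
from collections import defaultdict

def element_stats(bindings):
    """Group (path, datcat) pairs by element name (last path segment).

    Returns {element_name: (number of distinct paths, number of distinct datcats)}.
    """
    paths = defaultdict(set)
    datcats = defaultdict(set)
    for path, datcat in bindings:
        name = path.rsplit(".", 1)[-1]
        paths[name].add(path)
        datcats[name].add(datcat)
    return {name: (len(paths[name]), len(datcats[name])) for name in paths}

# A few bindings taken from the slides.
bindings = [
    ("CorpusProfile.Corpus.Metadata.Name", "isocat:DC-2544"),
    ("Session.Name", "isocat:DC-2544"),
    ("imdi-corpus.Corpus.Name", "isocat:DC-2544"),
    ("OLAC-DcmiTerms.language", "dct:language"),
]
print(element_stats(bindings))
```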

19 Examples of DCR use in CMD metadata II
languageID isocat:DC-2482
- LrtInventoryResource.LrtCommon.Languages.ISO639.iso-639-3-code
- Session.MDGroup.Content.Content_Languages.Content_Language.Id
- Session.MDGroup.Actors.Actor.Actor_Languages.Actor_Language.Id
- Session.Resources.WrittenResource.LanguageId
- ToolService.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
- ToolService.Tool.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
- GTRP.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
- DIDDD.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
- DynaSAND.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
languageName isocat:DC-2484
- ToolService.Documentation.DocumentationLanguages.Language.LanguageName
- ToolService.Tool.Documentation.DocumentationLanguages.Language.LanguageName
- GTRP.Collection.DocumentationLanguages.Language.LanguageName
- DIDDD.Collection.DocumentationLanguages.Language.LanguageName
- DynaSAND.Collection.DocumentationLanguages.Language.LanguageName
dct:language
- OLAC-DcmiTerms.language
metadataLanguage isocat:DC-2543
- CorpusProfile.Corpus.Metadata
dominantLanguage isocat:DC-2468
- Session.MDGroup.Content.Content_Languages.Content_Language.Dominant
sourceLanguage isocat:DC-2494
- Session.MDGroup.Content.Content_Languages.Content_Language.SourceLanguage
targetLanguage isocat:DC-2499
- Session.MDGroup.Content.Content_Languages.Content_Language.TargetLanguage
implementationLanguage isocat:DC-3798
- ToolService.Tool.Implementation.implementationLanguage

20 DCR usage in Component Registry
Datcats in CompReg: 288
- ISOcat: 164
- dc-elems: 15
- dc-terms: 55
- private ISOcat DatCats (?): 54
Elements with Datcats: 82.38%
Components with Datcats: 67
Data Categories Sets: 827
- isocat (Metadata Profile #5): 712
- dublincore elements: 16
- dublincore terms: 99
Component Registry
- CMD-Profiles: 53
- standalone Components: 235 *)
- overall Components: 298
- distinct Elements: 893
- all Elements: 3,030
- all paths (profile/comp/elem): 4,565
Components structure as of 2012-05

21 SMC Browser
Explore the Component Metadata Framework:
● Profile specifications from Component Registry visualized as interactive graphs
● Statistics (about reuse of Components)
TODO:
● feed with statistics of the instance data
● add relations from RELcat
● add operations on graphs (intersection, difference, …)

22 SMC Browser: Explore the Component Metadata Framework
