Presentation on theme: "Terminology Markup Framework and TBX-SKOS Interoperability"— Presentation transcript:
1 Terminology Markup Framework and TBX-SKOS Interoperability
Gerhard Budin, University of Vienna
Chair, ISO/TC 37/SC 2
3rd Ecoterm Group Meeting, FAO, Rome, 17/18 May 2006
2 A Brief History: Problems and Solutions
Strong diversity of lexico-terminological resources
Data models, data structures + data semantics
Diversity of semantic, linguistic/cultural complexity and semantic depth/richness
Diversity of user groups and their requirements
Sheer quantity of resources
Data interchange between organizations (within and across domains) as well as (distributed) data integration: early needs calling for immediate solutions
History of data modeling
History of interchange standards
History of semantic interoperability management
4 Developing the Terminology Markup Framework in order to cope with this complexity and diversity
Based on empirical studies and practical, user-driven requirements analysis
Markup/representation/modeling: XML, XMLS, RDF, UML
Open standards strategy (ISO TC 37)
ISO Data categories: meta-model element + semantics registry (RDF)
ISO Terminology Markup Framework (TMF): meta-model architecture and specifications (UML)
ISO – Terminology Markup Language (XML)
Instance for language industry: TBX Termbase Exchange Format (XML)
Instance for lexicography/publishing: LexML ISO 1951
Lexical Markup Framework (LMF) (UML)
ISO 704 and ISO 1087 (foundational level)
ISO (workflow and collaborative issues)
Alignment with ISO 11179, W3C, OASIS, etc.
5 Introduction to TBX
TBX® stands for TermBase eXchange
TBX is a Terminological Markup Framework (TMF) markup language
TMF is an ISO standard (16642)
TBX is consistent with ISO (MARTIF)
TBX is maintained by OSCAR (www.lisa.org)
The TBX specification is free
6 Who Should Care about TBX?
If you don't care about terminological consistency, then you have no reason to care about TBX
If you only need a simple bilingual list of terms (source term and target term) with no additional information, then you don't need TBX; just use a two-column spreadsheet for your list
7 On the other hand…
If you do care about terminological consistency and you maintain one or more terminology databases (termbases), then you should be interested in TBX, unless you want your termbase to be locked into the terminology management software you are currently using.
Portability of complex terminological data is the key benefit of TBX
8 What does TBX look like?
A TBX file is an XML document
A TBX file consists of:
  A header that describes the file
  A set of entries, one per concept in the termbase
  For each concept, a set of terms, grouped by language, that designate the concept
A terminological concept entry (termEntry)
  Can be multilingual
  Can be monolingual
9 Example of a TBX file
<?xml version='1.0'?> [+ ref to DTD/schema]
<martif type='TBX' xml:lang='en'>
  <martifHeader> [global info] </martifHeader>
  <text>
    <body> [concept entries] </body>
  </text>
</martif>
11 TBX Body
<body>
  <termEntry id='C171'> [concept: a dollop of cream] </termEntry>
  <termEntry id='C180'> [concept: frog legs] </termEntry>
</body>
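The header/body/termEntry layout described on the preceding slides can be read with any XML parser. The following is a minimal sketch in Python, not part of the original slides: the entry IDs echo the slide, while the terms, the header content, and the helper name `list_terms` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal TBX document. IDs follow the slide; the terms and
# the header content are illustrative assumptions, not real termbase data.
TBX_SAMPLE = """<?xml version='1.0'?>
<martif type='TBX' xml:lang='en'>
  <martifHeader>
    <fileDesc><sourceDesc><p>illustrative sample</p></sourceDesc></fileDesc>
  </martifHeader>
  <text>
    <body>
      <termEntry id='C171'>
        <langSet xml:lang='en'><tig><term>dollop of cream</term></tig></langSet>
      </termEntry>
      <termEntry id='C180'>
        <langSet xml:lang='en'><tig><term>frog legs</term></tig></langSet>
        <langSet xml:lang='fr'><tig><term>cuisses de grenouille</term></tig></langSet>
      </termEntry>
    </body>
  </text>
</martif>"""

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def list_terms(tbx_text):
    """Return {entry id: {language: [terms]}} for every termEntry."""
    root = ET.fromstring(tbx_text)
    result = {}
    for entry in root.iter("termEntry"):
        by_lang = {}
        for lang_set in entry.findall("langSet"):
            lang = lang_set.get(XML_LANG)
            # Terms may sit in <tig> or <ntig>/<termGrp>; iter() finds both.
            by_lang[lang] = [term.text for term in lang_set.iter("term")]
        result[entry.get("id")] = by_lang
    return result


terms_by_entry = list_terms(TBX_SAMPLE)
```

Each concept entry ends up as one dictionary key, with its terms grouped by language, which mirrors the "one entry per concept" organization of a termbase.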
12 TBX and Other Standards
(1) TBX and ISO 16642 (TMF)
(2) TBX and ISO 12620 (Data Categories)
(3) TBX and SKOS
13 1: TBX and ISO 16642
TBX is a TML (Terminological Markup Language) of TMF (ISO 16642) (see Annex B)
TBX maps to the TMF meta-model
  A TBX file is a TDC (terminological data collection)
  martifHeader provides GI (global information)
  termEntry: TE (terminological entry)
  langSet: LS (language section)
  tig/ntig: TS (term section)
A TMF DCS (Data Category Selection) in TBX is in XCS (eXtensible Constraint Specification) format
TBX uses ISO for its XML style
14 TMF Metamodel
[Figure: nested levels of the TMF metamodel]
Terminological Data Collection (TDC)
  Global Information (GI)
  Complementary Information (CI)
  Terminological (Concept) Entry/Entries (TE)
    Language Section(s) (LS)
      Term Section(s) (TS)
        Term Component Section(s) (TCS)
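The correspondence between TBX elements and TMF metamodel levels can be written down as a small lookup table. The sketch below is an illustration rather than anything from the slides: the table covers only the pairs named in the talk (CI and TCS are left out because no TBX element is given for them), and the names `TMF_LEVEL`, `tmf_outline`, and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# TBX element -> TMF level, limited to the pairs named in the talk.
TMF_LEVEL = {
    "martif": "TDC",       # the file as a terminological data collection
    "martifHeader": "GI",  # global information
    "termEntry": "TE",     # terminological entry
    "langSet": "LS",       # language section
    "tig": "TS",           # term section
    "ntig": "TS",          # term section (nested variant)
}

# Hypothetical one-entry document used only to exercise the function.
SAMPLE = (
    "<martif type='TBX'><martifHeader/><text><body>"
    "<termEntry id='C1'><langSet><tig><term>scribe</term></tig></langSet>"
    "</termEntry></body></text></martif>"
)


def tmf_outline(tbx_text):
    """List (depth, TBX tag, TMF level) for every element that has a level."""
    outline = []

    def walk(node, depth):
        level = TMF_LEVEL.get(node.tag)
        if level is not None:
            outline.append((depth, node.tag, level))
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(tbx_text), 0)
    return outline


outline = tmf_outline(SAMPLE)
```

Elements such as `text` and `body` are skipped because they are TBX packaging with no TMF level of their own in this table.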
15 TMF and lexical resources
In general, a terminological resource is organized into concept entries, each of which includes one or more terms designating a particular concept
In general, a lexical resource is organized into lexical entries, each of which includes one or more senses of a particular lexical item (a word or phrase)
A concept entry containing multiple terms can be split into multiple lexical entries, one per term, and multiple lexical entries associated with the same concept can be combined into one concept entry
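The split/combine relationship between concept entries and lexical entries can be sketched with plain dictionaries. This is an illustrative model only: the field names `lemma` and `senses` and both function names are assumptions, not part of TMF or LMF.

```python
from collections import defaultdict


def split_concept_entry(concept_id, terms):
    """One concept entry -> one lexical entry per term."""
    return [{"lemma": term, "senses": [concept_id]} for term in terms]


def combine_lexical_entries(lexical_entries):
    """Lexical entries -> concept entries, keyed by concept id."""
    concepts = defaultdict(list)
    for entry in lexical_entries:
        for concept_id in entry["senses"]:
            # Keep each term once per concept, in first-seen order.
            if entry["lemma"] not in concepts[concept_id]:
                concepts[concept_id].append(entry["lemma"])
    return dict(concepts)


lexical = split_concept_entry("C1", ["scribe", "copyist"])
round_trip = combine_lexical_entries(lexical)
```

In this toy model the round trip is lossless; in real resources, term-level and sense-level data categories would also have to be carried across.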
16 2: TBX and ISO 12620
All data categories in the default TBX DCS are taken from ISO 12620
17 3: TBX and SKOS
A typical concept entry will contain a subject field to specify the domain of the concept.
However, the subject field is typically some kind of hierarchy that is flattened into a string within TBX
SKOS makes it possible to represent the subject field hierarchy as a hierarchy and then create a link within TBX
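One way to picture the step from a flattened subject field to a SKOS hierarchy is to split the string and emit one concept per level, linked by skos:broader. The sketch below makes several assumptions that the slides do not state: the "/" separator, the example.org base URI, the representation of triples as plain tuples, and the function name `subject_field_to_skos`.

```python
def subject_field_to_skos(subject_field, separator="/",
                          base="http://example.org/subject/"):
    """Turn 'a / b / c' into SKOS-style triples plus the URI of the leaf.

    The separator and the base URI are assumptions for illustration.
    """
    labels = [part.strip() for part in subject_field.split(separator)
              if part.strip()]
    triples = []
    path = []
    parent_uri = None
    for label in labels:
        path.append(label.replace(" ", "_"))
        uri = base + "/".join(path)
        triples.append((uri, "rdf:type", "skos:Concept"))
        triples.append((uri, "skos:prefLabel", label))
        if parent_uri is not None:
            # Each level points up to the level it was nested under.
            triples.append((uri, "skos:broader", parent_uri))
        parent_uri = uri
    return triples, parent_uri


triples, leaf_uri = subject_field_to_skos("food / dairy / cream")
```

The returned leaf URI is what a TBX entry could point to in place of the flattened string, so that the hierarchy lives in SKOS and the termbase holds only the link.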
18 Simple Knowledge Organization System (SKOS)
“SKOS is an area of work developing specifications and standards to support the use of knowledge organisation systems (KOS) such as thesauri, classification schemes, subject heading lists, taxonomies, other types of controlled vocabulary, and perhaps also terminologies and glossaries, within the framework of the Semantic Web.” - (Accessed on 3/17/06)
21 GEvTerm Initiative
The food-related examples used earlier were taken from FooNaVar, a project of the GEvTerm Initiative.
The GEvTerm Initiative is a terminological database that has committed to being fully TBX- and SKOS-compliant
Another example of TBX in use is...
22 C: Multilingual Thesaurus for Medieval Studies (MLTMS)
“Imagine the ability to search across web-resources using your native modern European language and find appropriate primary and secondary sources in Latin, French, Italian, German, Spanish, English, etc., based upon the meaning rather than the form of the search term. Imagine having a tool that would enable you to search for a concept and be able to construct the forms it has taken historically as well as the ability to link outward for both evidence and argument. Imagine a tool that would enable you to study the slippage of concept which is beyond naming. Imagine having a tool that can deconstruct ontological orders asking for different kinds of readings.” - http://www.mith2.umd.edu/thes/ (Accessed on 3/17/06)
23 Why did MLTMS use TBX?
Integration of terminological data from multiple sources
Querying multiple termbases through a single user interface by passing data through a common intermediate format on a batch or dynamic basis
Placing data on an FTP site for download by interested parties
Peer review by colleagues of tentative entries
- (Accessed on 3/17/06)
24 MLTMS Sample
<termEntry id='eid-VocCod-211.01'>
  <descrip type='subjectField'>personnel</descrip>
  <descrip type='definition'>personne qui accomplit un travail copie ou d'écriture</descrip>
  <langSet xml:lang='fr'>
    <ntig><termGrp>
      <term id='tid-voccod fr1'>copiste</term>
      <termNote type='termType'>entryTerm</termNote>
    </termGrp></ntig>
    <ntig><termGrp>
      <term id='tid-voccod fr3'>écrivain</term>
      <termNote type='termType'>synonym</termNote>
    </termGrp></ntig>
  </langSet>
  <langSet xml:lang='en'>
    <ntig><termGrp>
      <term id='tid-voccod en1'>scribe</term>
    </termGrp></ntig>
  </langSet>
</termEntry>
30 OWL, RDF & XML Schema used to specify XMDR as UML used for 11179 Edition 2 Metamodel
[Figure: diagram relating the 11179 metamodel to its encodings. Labels: 11179 relational schema; metadata; OWL XMDR ontology & annotations; types & cardinalities; XMDR XML Schema; Trang; XMDR's Relax NG schema; triples: binary labeled relationships; RDF spec; XML Schema language spec; XML objects. Design questions noted: What things go in own files? Which property direction stored? Sequential ordering of properties.]
31 XMDR Prototype Example: dual-purpose RDF/XML file: DEALL.1.5394.1.xml
<DataElement rdf:about="" xml:base="http://xmdr.lbl.gov/xmdr/data/DEALL xml">
  <container rdf:resource="http://oaspub.epa.gov/edr"/>
  <identifier rdf:parseType="Resource">
    <string rdf:datatype="&xsd;string">5394</string>
  </identifier>
  <version rdf:datatype="&xsd;string">1</version>
  <administrationRecord rdf:parseType="Resource">
    <registrationStatus rdf:datatype="&xsd;string">Standard</registrationStatus>
    <administrativeStatus rdf:datatype="&xsd;string">Final</administrativeStatus>
    <creationDate rdf:datatype="&xsd;date"> </creationDate>
  </administrationRecord>
  <designation rdf:parseType="Resource">
    <context rdf:resource="CXT-Legacy.xml"/>
    <sign xml:lang="en">Country Name</sign>
  </designation>
  <designation rdf:parseType="Resource">
    <context rdf:resource="CXT-Long Abbreviation.xml"/>
    <context rdf:resource="CXT-Medium Abbreviation.xml"/>
    <context rdf:resource="CXT-Short Abbreviation.xml"/>
    <sign xml:lang="en">Mail Cntry Nm</sign>
  </designation>
  <designation rdf:parseType="Resource">
    <context rdf:resource="CXT-Registry.xml"/>
    <context rdf:resource="CXT-Standard.xml"/>
    <sign xml:lang="en">Mailing Address Country Name</sign>
  </designation>
  <definition rdf:parseType="Resource">
    <context rdf:resource="CXT-Legacy.xml"/>
    <context rdf:resource="CXT-Long Abbreviation.xml"/>
    <context rdf:resource="CXT-Medium Abbreviation.xml"/>
    <context rdf:resource="CXT-Short Abbreviation.xml"/>
    <text xml:lang="en">The name of the country where the addressee is located.</text>
  </definition>
  <type rdf:resource="RCDIS xml"/>
  <domain rdf:resource="VDALL xml"/>
  <meaning rdf:resource="DCDIS xml"/>
  <example rdf:datatype="&xsd;string">United States</example>
</DataElement>
32 XMDR XML schema provides a number of important benefits…
Schema specifies what is required as well as what is legal
Divides metadata into files conforming to XML schema
Normalizes data (à la relational “one fact in one place”)
Facilitates XSLT transformations by reducing degrees of freedom to a canonical encoding within the RDF standard
Relax NG used to create and check XMDR-it schema
RNG validator enforces many OWL ontology constraints
Trang automatically translates into XML schema syntax
33 From texts and terminologies to ontologies
Using the Risk scenario
Termbase
Export XML
Domain Models – meta-models -> patterns
Text corpus
Term extraction – comparative testing ProTerm, MultiTerm Extract, MultiCorpora
Aligning with termbase
Ontology import -> editor
50 TBX-SKOS interoperability
Differences
  XML vs. RDF
  Inherent flexibility + "open" data modeling for a large variety of resources vs. traditional thesaurus data model as a default for a KOS (diff. scopes)
  TBX has documented use cases and mapping tools -> language industry standard
  Different semantics + vocabularies (12620 vs. thesaurus standard)
Commonalities
  Conceptual approach
  W3C
  Vocabulary mapping (RDF)
51 TMF Metamodel
[Figure: nested levels of the TMF metamodel]
Terminological Data Collection (TDC)
  Global Information (GI)
  Complementary Information (CI)
  Terminological (Concept) Entry/Entries (TE)
    Language Section(s) (LS)
      Term Section(s) (TS)
        Term Component Section(s) (TCS)
54 Term-Level Information
[Figure: term-level part of the TMF metamodel. Labels: Language Section(s) (LS) (TS × n); Term Section(s) (TS); Term; Administrative DatCats; Concept-Related DatCats; Notes; Term-related DatCats (TRD); Transaction; Definition; Note; Context; Date; Transfer-comment; Responsibility; Source ID]
55 SKOS Vocabulary
SKOS Core is a model for expressing the structure and content of concept schemes (thesauri, classification schemes, subject heading lists, taxonomies, terminologies, glossaries and other types of controlled vocabulary).
The SKOS Core Vocabulary is an application of the Resource Description Framework (RDF) that can be used to express a concept scheme as an RDF graph. Using RDF allows data to be linked to and/or merged with other RDF data by semantic web applications.
59 Mapping TBX/12620 DatCats to SKOS Vocabulary
TBX data categories (data element concepts in the sense of ISO/IEC) contain instantiations of information that are expressed in SKOS using SKOS core vocabulary.
Interoperability (a cross-walk between the two standards) depends on mapping between the two systems
60 Data Collections
collection: We do not have this, although collections can be implied in some cases by the use of the coordinateConceptGeneric or possibly subordinateConceptGeneric markers.
collectableProperty: We do not have this; in SKOS one can assign rules to collections, which makes this useful as an ontology-like property.
orderedCollection: Not available in our set, although many of our conceptual domains are structured as ordered lists.
  They are ordered by virtue of proximity, but we don't have a mechanism for enforcing order within the metadata structure.
61 Collections, cont.
memberList: An RDF list containing the members of an ordered collection
  We aren't sure why this is necessary; why not just use ordered collection?
  We are assuming the collection by itself embodies an unordered list.
member: Definition: member of a list
  If indicated at all, this is embodied in TBX as
  1) a simple data category listed as a member of a conceptual domain
  2) a coordinate concept or subordinate concept associated with a broader concept or topTerm
62 Concept & Concept Schemes
concept: Embodied in TMF/TBX as the entire / termEntry /.
conceptScheme: A concept system; represented via links and notation systems
properties: Defined links and relations
  TMF/TBX: no open class of properties or edges that can be freely defined
  Many pre-defined sets of property relations between individual data element types and between attributes and the members of their conceptual domains.
63 Scheme Identification
inScheme: We have pointers to Classification Schemes, but our pointers for thesauri and hierarchical relations do not include a pointer to the name or identifier of a specific scheme.
  This is a lacuna for us and needs to be added.
64 Subject (Domain) Identification
isPrimarySubjectOf: / subject field level 1 /
  Definition: the primary subject of a resource
  12620 allows for 9 levels of granularity and TBX for 3 in defining the granularity of subject references within a scheme
isSubjectOf: / subject field level 2 /; primarySubject [subject field + a restrictive constraint; 2nd highest level of granularity]
subject: / subject field level 3 /; / subject fields 3-9 /
subjectIndicator: public subject indicator located using a URI
  Missing in TBX / 12620
65 Labels (Terms, ConceptNames)
Missing: label / term /
prefLabel (preferredLabel): / term termType=preferred term /; / descriptor /
prefSymbol: / term termType=preferred term termType=symbol /
altLabel: / term termType --> admitted term /
altSymbol: / term termType=admitted term termType=symbol /
hiddenLabel: Generally achieved using a security code reference or an authorization code
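The label correspondences on this slide amount to a small decision rule. The sketch below follows the slide's own notation ("preferred term", "admitted term", "symbol") rather than any particular TBX serialization; the function name `skos_label_property` and the use of plain strings for the SKOS property names are assumptions made for illustration.

```python
def skos_label_property(term_types):
    """Pick the SKOS label property for a term, given its termType values.

    Values are written as they appear on the slide. Returns None when the
    slide gives no correspondence for the combination.
    """
    types = set(term_types)
    is_symbol = "symbol" in types
    if "preferred term" in types:
        return "skos:prefSymbol" if is_symbol else "skos:prefLabel"
    if "admitted term" in types:
        return "skos:altSymbol" if is_symbol else "skos:altLabel"
    return None


preferred = skos_label_property(["preferred term"])
admitted_symbol = skos_label_property(["admitted term", "symbol"])
unmapped = skos_label_property(["deprecated term"])
```

hiddenLabel is left out because the slide ties it to security or authorization codes, not to a termType value.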
66 Hierarchical Relations
hasTopConcept: / topTerm /
  hasTopConcept points to URI which contains the top concept; we could choose to use this methodology.
  topConcept has been deprecated as a vocabulary item.
broader: / broader term / (as a pointer to a thesaurus descriptor)
  / superordinate term generic / (terminological concept system)
narrower (hasNarrower): / narrower term / (as a pointer to a thesaurus descriptor)
  / narrower concept generic / (terminological concept system)
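A cross-walk for the broader/narrower pairs listed here can be sketched as a lookup table plus inverse generation, since skos:broader and skos:narrower are inverses of each other. The category names are taken from the slide; the tuple representation, the entry IDs, and the names `TBX_TO_SKOS_RELATION` and `crosswalk` are illustrative assumptions. hasTopConcept is omitted because it relates a scheme to a concept rather than two concepts.

```python
# Data category names as written on the slide -> SKOS property.
TBX_TO_SKOS_RELATION = {
    "broader term": "skos:broader",
    "superordinate term generic": "skos:broader",
    "narrower term": "skos:narrower",
    "narrower concept generic": "skos:narrower",
}

INVERSE = {"skos:broader": "skos:narrower", "skos:narrower": "skos:broader"}


def crosswalk(links):
    """Map (source id, TBX category, target id) links to SKOS-style triples.

    Each mapped link also yields its inverse, so the hierarchy can be walked
    in both directions. Unknown categories are skipped.
    """
    triples = set()
    for source, category, target in links:
        prop = TBX_TO_SKOS_RELATION.get(category)
        if prop is None:
            continue
        triples.add((source, prop, target))
        triples.add((target, INVERSE[prop], source))
    return triples


# Hypothetical link: entry C180 points to a broader entry C100.
result = crosswalk([
    ("C180", "broader term", "C100"),
    ("C180", "related term", "C171"),  # not hierarchical, so skipped
])
```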
67 General Relations: related, semanticRelation
related: / related term / (thesaurus pointer)
  / related concept / (terminological concept system)
semanticRelation: Missing example in the Vocabulary document
  How does a semantic relation differ (if it does) from other conceptual relations?
68 Concept Description
definition: / definition /
example: / example /
69 Notes: changeNote, editorialNote, historyNote, privateNote, publicNote
changeNote: / admin type=modification note /
  The relation between Note and "change" is determined by the position of the note embedded in an <adminGroup> of type=modification.
  [A note about a modification to a concept, not to an entry.]
editorialNote: / adminNote /
  A note concerning the administration of a KOS resource
historyNote: / termProvenance /
privateNote: / note / + authorization levels
publicNote: / note / + authorization levels
scopeNote
70 Thank you for your attention
Acknowledgements:
Slides 5-28 together with Alan Melby, Sue Ellen Wright
Slides Bruce Bargmeyer
Slide 35 WordNet
Slides diff. sources, 43: ThesShow Legat/Stallbaumer
44: GEMET, 45: Bandholtz, 46/47: Gangemi, 48: Wright, Miles/SKOS, together with Wright/Melby
Gerhard Budin