Semantic Problems of Thesaurus Mapping

Martin Doerr
Foundation for Research and Technology - Hellas, Institute of Computer Science
ICS-FORTH, April 10, 2002

Thesaurus Interoperability

Objectives: Global access to heterogeneous information sources

Contextual problems of information sources:
- Different providers
- Different objectives
- Overlapping topics/themes

Where do we need thesauri?
- Enhancing full-text retrieval, query formulation aids
- Querying structured data & metadata with controlled vocabularies
- Classification systems for information organization

The Problem

I ask for Cactus - you know Cholla...
- I: chaffinch - you: Fringilla coelebs
- I: dolls, Hopi - you: kachina
- I: Champs Elysees - you: France
- I: Greece, Acropolis - you: restaurant Acropolis
- I: Architecture (studies) - you: Architecture (buildings)

Thesauri differ
- in language: natural, scientific or by convention
- in subject: coverage, completeness and detail
- in version: state of development

Thesaurus Transition

Why do we need mapping?

Thesaurus mapping is central for:
- Thesaurus merging
- Thesaurus correlation / interlinking
- Thesaurus federation

Mapping can be concept-based:
- Terms are identified with the set of objects they correctly classify
- Broader terms are regarded as classifying supersets
- Correct mapping is defined through equivalent query results
- Depends on term use rather than comprehension of a term
- Mapping logic should conform with the query paradigm (Z39.50?)
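The concept-based view can be illustrated with a small sketch. It assumes a toy collection in which the extension of each term is just a Python set of object identifiers; the identifiers, the target term names and the helpers `is_broader` and `is_correct_mapping` are hypothetical and not taken from the slides.

```python
# Toy extensions: each term is identified with the set of objects it classifies.
# All object ids and target term names are illustrative assumptions.
source_index = {
    "cactus": {"obj1", "obj2", "obj3"},
    "cholla": {"obj2", "obj3"},
}
target_index = {
    "Cactaceae": {"obj1", "obj2", "obj3"},
    "Cylindropuntia": {"obj2", "obj3"},
}

def is_broader(broader: set, narrower: set) -> bool:
    """A broader term classifies a superset of the narrower term's objects."""
    return broader >= narrower

def is_correct_mapping(source_term: str, target_term: str) -> bool:
    """A mapping counts as correct here if both terms give the same query result."""
    return source_index[source_term] == target_index[target_term]
```

In this reading, whether a mapping is correct depends only on what the terms retrieve, not on how their labels are understood.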

Two approaches – Three communities

Automatic mapping:
- Based on parallel indices / similar documents
- Statistical & neural network methods
- Cheap and with optimal coverage
- Missing intellectual insight
- Cannot separate if terms express different aspects or if terms are used for different aspects. (May confuse mapping of concepts with concept co-occurrence in the document sample)
- Limited precision

Two approaches – Three communities

Intellectual mapping:
- Manual, based on expert knowledge about terms
- Can be supported by Description Logics (Ontologies)
- Expensive, but with high precision
- Insight in structure and long-term stability

Proposition: The intellectual structures are complex. Their investigation is helpful for better intellectual and refined statistical mapping methods.

Translation and Mapping

[Figure: English Vocabulary, French Vocabulary, English Thesaurus and French Thesaurus. Labels: linguistic translation as lead-in (twice); interthesaurus relations for query expansion (thesaurus transition); Interlingua for agreements (+/-); AND.]

Logics of Mapping for Z39.50

Interthesaurus relations (ISO 5964):
- partial equivalence. Must become:
    - broader equivalence (is subset of)
    - narrower equivalence (is superset of)
- exact equivalence (same set as)
- inexact equivalence (overlaps with): good for full-text retrieval (FTR) only
- single to multiple equivalence. Must become:
    - exact equivalence to a BOOLEAN combination of target terms: AND (intersection), OR (union), NOT (complement)
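Read as set relations, these equivalence types can be told apart mechanically. The following is a minimal sketch, assuming term extensions are available as Python sets; the function name `classify_equivalence` and the example sets are hypothetical.

```python
def classify_equivalence(source: set, target: set) -> str:
    """Name the set relation of a source term's extension to a target's."""
    if source == target:
        return "exact equivalence"      # same set as
    if source < target:
        return "broader equivalence"    # source is a subset of the target
    if source > target:
        return "narrower equivalence"   # source is a superset of the target
    if source & target:
        return "inexact equivalence"    # overlaps with
    return "no equivalence"

# A single-to-multiple equivalence becomes an exact equivalence to a
# Boolean combination of target terms (here: OR, i.e. set union).
b = {1, 2}
c = {3, 4}
a = {1, 2, 3, 4}
relation_to_union = classify_equivalence(a, b | c)
```

With these example sets, `a` is a narrower equivalence of neither `b` nor `c` alone in the exact sense, but is exactly equivalent to `b | c`.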

Boolean AND-Combinations

[Figure: term A in exact equivalence with the Boolean compound "B AND C"; B and C shown with BT relations.]

The Boolean compound "B AND C":
- Uses instances of both B and C
- Combines properties of B and C
- Is NT of B, C and BT of their common narrower terms.
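The position of an AND-compound in the hierarchy follows from set intersection. A small sketch under assumed data: the object ids and the term "wooden practice swords" are invented for illustration, and the helper names are hypothetical.

```python
# Toy extensions; object ids and "wooden practice swords" are assumptions.
extensions = {
    "swords": {"s1", "s2", "s3"},
    "wooden objects": {"s2", "s3", "t1"},
    "wooden practice swords": {"s3"},
    "foils (swords)": {"s1"},
}

def and_compound(b: str, c: str) -> set:
    """Extension of the Boolean compound 'b AND c' (intersection)."""
    return extensions[b] & extensions[c]

def common_narrower_terms(b: str, c: str) -> list:
    """Terms other than b and c whose extension lies inside both b and c."""
    return sorted(
        t for t, ext in extensions.items()
        if t not in (b, c) and ext <= extensions[b] and ext <= extensions[c]
    )

compound = and_compound("swords", "wooden objects")
shared = common_narrower_terms("swords", "wooden objects")
```

The compound's extension is contained in both operands (so it is an NT of each), and every common narrower term is contained in the compound (so the compound is their BT).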

Issues of Mapping Logics for Z39.50

How to use Boolean expressions inversely:
- Calculation of inferences
- Boolean combinations to a post-coordinated thesaurus: How to index the existence of an incoming link?

Mappings must be complete:
- Should guarantee recall over non-equivalent terms: preservation of precision or recall should be selectable
- Should avoid redundancies, need consistency control!
- Should avoid combinatorial explosion: Need cascading Thes A => Thes B => Thes C
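Cascading can be sketched as composition of inclusion relations. This is only an illustration under the set-based reading of broader/narrower equivalence; the data layout (term mapped to a pair of target term and relation), the function names and the example terms are assumptions.

```python
from typing import Optional

EXACT, BROADER, NARROWER = "exact", "broader", "narrower"

def compose(first: str, second: str) -> Optional[str]:
    """Relation A->C implied by A->B and B->C, or None if nothing follows."""
    if first == EXACT:
        return second
    if second == EXACT:
        return first
    if first == second:
        return first   # subset of a subset, or superset of a superset
    return None        # e.g. broader then narrower: no inclusion is implied

def cascade(a_to_b: dict, b_to_c: dict) -> dict:
    """Derive Thes A => Thes C mappings from A => B and B => C."""
    a_to_c = {}
    for a_term, (b_term, rel_ab) in a_to_b.items():
        if b_term in b_to_c:
            c_term, rel_bc = b_to_c[b_term]
            rel_ac = compose(rel_ab, rel_bc)
            if rel_ac is not None:
                a_to_c[a_term] = (c_term, rel_ac)
    return a_to_c

a_to_b = {
    "cholla": ("cactus", BROADER),
    "chaffinch": ("Fringilla coelebs", EXACT),
    "dolls": ("toys", BROADER),
}
b_to_c = {
    "cactus": ("plants", BROADER),
    "Fringilla coelebs": ("finches", BROADER),
    "toys": ("wooden toys", NARROWER),
}
a_to_c = cascade(a_to_b, b_to_c)
```

The "dolls" entry is dropped because a broader step followed by a narrower step implies no inclusion; such cases would still need a direct mapping or intellectual review.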

Approximation by Inclusion

[Figure: terms A, B and C with BT relations; labels: broader equivalence, narrower equivalences.]
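Approximation by inclusion, with a selectable preference for recall or precision, might look as follows. This is a sketch assuming set-valued extensions; the function `approximate`, its `preserve` parameter and the example terms are hypothetical.

```python
def approximate(source: set, targets: dict, preserve: str) -> list:
    """Approximate a source term that has no exact equivalent.

    preserve="recall":    smallest target that is a superset (broader equivalence).
    preserve="precision": all non-empty targets that are subsets (narrower equivalences).
    """
    if preserve == "recall":
        supersets = [t for t, ext in targets.items() if ext >= source]
        return sorted(supersets, key=lambda t: len(targets[t]))[:1]
    if preserve == "precision":
        return sorted(t for t, ext in targets.items() if ext and ext <= source)
    raise ValueError("preserve must be 'recall' or 'precision'")

source = {1, 2, 3, 4}
targets = {
    "broad": {1, 2, 3, 4, 5, 6},
    "very broad": {1, 2, 3, 4, 5, 6, 7, 8},
    "narrow1": {1, 2},
    "narrow2": {3},
    "overlapping": {4, 7},
}
for_recall = approximate(source, targets, "recall")
for_precision = approximate(source, targets, "precision")
```

Querying with the broader term returns every object of the source term plus some extra ones; querying with the union of the narrower terms returns only objects of the source term but may miss some (here, object 4).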

Obstacles to Thesaurus Transition

Unclear coverage & incompatible organisation.
- Special vocabularies often contain general terms, contract upper levels. No global abstraction levels.
- Missing or contradictory NT/BT relations.
- Loose NT semantics (like part-whole, see-also etc.).
- Arbitrariness of monohierarchies. E.g. a hierarchy of colorants, like red organic dye: organize it by composition, production method or origin? By color? By physical property or function?

Obstacles to Thesaurus Transition

Term semantics.
- Post-coordination should make use of Description Logics (DL):
    - Combinations from disjoint facets: factories + grinding.
    - Unclear rules for allowed combinations.
    - How to attach and index synonyms in a post-coordinated hierarchy.
- Use-induced incompatibility: e.g. subject/object: bridge - bridge construction.
- Complementary polysemy (Pustejovsky): context-induced shifts of meaning (door, architecture etc.) cause context-related differences in hierarchy.

Complementary Polysemy and Minor Facets

Minor facets provide explicit context criteria:
- E.g. MDA archeological thesaurus:
    - armour by construction: scale armour
    - armour by form: cuirass
    - armour by function: parade armour
- Are these criteria idiosyncratic?
- How do they relate to each other?
- How do they relate to compound term formation?

Minor Facets in the AAT

The object facet (1998 edition) contains:
- About 1640 facet indicators,
- About 600 with explicit criteria (by form etc.)
- Using 150 (!) criteria

Preliminary frequency analysis of criteria:
- Form: 35%, function: 30%, placement: 15%, construction: 15%, social context: 5%…

Hypothesis:
- Minor facet criteria can be systematically generalized
- Minor facet criteria are different kinds of NT relations

Narrower Terms for three Facets

[Figure: hierarchy with two kinds of links, term specialization and criteria assignment. Term labels: objects; weapons; cutting and thrusting weapons; swords; sword-like objects; fencing swords; foils (swords); wooden swords. Criteria labels: fighting and hunting; cutting and thrusting; sword-like; fencing; wooden.]

Explicit facet criteria for objects

[Figure labels: Facet; Hierarchy of object forms; Hierarchy of construction features; Hierarchy of functions and social roles; Hierarchy of compound terms with embedded characteristic terms; Descriptive aspects / description elements.]

Summary of Semantic Problems

We could identify four semantic problems (statistical methods are not sensitive to semantic problems):
- Logics of query term expansion between compatible hierarchies
- Theory of concept formation by compound terms, linguistic and semantic. KR should collaborate with experienced thesaurus editors.
- Understanding of context-dependency of term hierarchies:
    - understanding of the role of complementary polysemy
    - differences between subject and object classification
- Meaning of terms versus meaning of term used for a document

What To Do

Research: Deeper understanding.
- Investigation of polyhierarchies, polysemy and BT/NT semantics.
- Theory of concept formation by compound terms, from linguistics and logic.
- Use of ontologies as top-level thesauri, to provide:
    - highest levels (like physical objects, actors, events)
    - roles for concept formation (e.g. using, made for, made in)
    - transition between single terms and terms in multiple fields (e.g. type: sword, material: wood versus wooden sword)
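The transition between a single compound term and terms in multiple fields can be sketched with a lookup table. Everything here is an illustrative assumption: the table `COMPOUND_TERMS`, the field names and the helpers `to_fields` and `matches` are hypothetical, and a real system would derive the decomposition from roles in an ontology rather than from a hand-written dictionary.

```python
# Hypothetical decomposition table: compound term -> fielded description.
COMPOUND_TERMS = {
    "wooden sword": {"type": "sword", "material": "wood"},
    "fencing sword": {"type": "sword", "function": "fencing"},
}

def to_fields(term: str) -> dict:
    """Expand a compound term into fields; a simple term is treated as a type."""
    return dict(COMPOUND_TERMS.get(term, {"type": term}))

def matches(record: dict, term: str) -> bool:
    """A fielded record matches a term if it agrees on every field the term implies."""
    return all(record.get(field) == value
               for field, value in to_fields(term).items())

record = {"type": "sword", "material": "wood"}
```

With this table, a record described in separate fields can be found through the compound term "wooden sword" as well as through the simple term "sword".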

What To Do

Protocols: enabling dynamic thesaurus transition
- Metadata for description of the logic of a thesaurus: BT/NT semantics, organization principles, lead-ins
- Recall/precision control in thesaurus transition
- DL-based post-coordination rules. Explicit use of roles.

Practice: Analysis of semantic heterogeneity
- Comparing thesauri with respect to logic of construction and intended use.
- Understanding semantics of automatic mappings, integration of intellectual and automatic methods.

