GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains Patrice Lopez and Laurent Romary INRIA & HUB – IDSL patrice_lopez@hotmail.com.


1 GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains
Patrice Lopez and Laurent Romary INRIA & HUB – IDSL

2 Overview
GRISP (Generic Research Insight in Scientific and technical Publications)
  Multiple scientific and technical fields
  Multilingual (en, fr, de)
  Built from the compilation of open resources
Sound conceptual model
Mapping across a variety of domains
  Use of structural constraints
  Machine learning techniques for controlling the fusion process
Our sources: MeSH, UMLS, Specialist Lexicon, Gene Ontology, ChEBI, WordNet, WOLF, SUMO, IPC, Wikipedia
Result: several million terms, concepts, semantic relations and definitions

3 Why are we doing all this?
Terminology is the main vehicle by which technical and scientific units of knowledge are represented and conveyed (30-80%; Ahmad, 1996)
Application to a large collection of multilingual and multi-domain patent documents
Two underlying considerations:
  Cost of manually maintained terminological resources
    Cf. Biosis, IATE, TermScience (Khayari et al., 2006)
  Modeling the heterogeneity of resources
    A lot of available resources online, based on heterogeneous organizational principles
Underlying vision: integrating knowledge engineering into current state-of-the-art information retrieval and classification systems

4 Merging terminological resources
Related to the fusion of ontologies
  Ontologies are usually relatively small in size
  Semi-automatic methods: McGuinness et al., 2000
  Fully automatic methods
    Madhavan et al., 2001: exploit structural and linguistic matching
    Doan et al., 2001: machine learning techniques (concepts and properties)
    Gal et al., 2005: fuzzy logic methods
Existing work on merging classification systems
  Wang et al., 2008: merging of subject headers in digital libraries
Automatic merging techniques for heterogeneous terminologies have not yet been investigated
  Much richer linguistic content
  No formal organization of concepts
  Do not model facts or assertions

5 A quick reminder
Terminological resources
  Approximation of lexical semantics in specialized fields
  Based on a concept-to-term (onomasiological) model
  Naturally multilingual (term grouping according to languages)
Existing standards
  ISO 704: editorial principles for building up a terminological resource
  ISO 16642: abstract model for representing terminological databases (Romary, 2001)
  ISO 30042: a concrete XML syntax (TBX)
Note: terminology standards do not standardize terminologies!

6 Target terminological model
Multiple languages
Multiple terms
  Variants, abbreviations, inflections
Multiple descriptions
  E.g. multiple definitions, complementing each other
  Additional information: illustrations, formulae, etc.
Basic conceptual relations
Local metadata
  Provides management information attached to the various terminological description levels (e.g. origin, validation level, register)
  Allows the creation of views (e.g. all MeSH entries; cf. Khayari et al., 2006)
And yes, ISO (TMF) can do all this!
Main issue: identifying the relevant data category in the various source terminologies

7 Merging terminologies, merging models
[Diagram: several source models (TMF model 1, TMF model 2, ...) brought together into one target model]

8 TMF in a nutshell
Terminological Data Collection (TDC): metadata (sources, revisions)
  Terminological Entry: ontological relations, definition
    Language Section: dialectal information, definition
      Term Section: grammatical information, register, …, definition
+ any kind of local metadata (origin, certainty, accessibility)
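The nesting described on this slide can be sketched as a small set of Python dataclasses. This is only an illustrative reading of the TMF levels, not the GRISP implementation: the class names, the `add_term` and `view` helpers, and the sample entry are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermSection:
    term: str
    # Local metadata attached at term level, e.g. origin or register.
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class LanguageSection:
    lang: str
    terms: List[TermSection] = field(default_factory=list)
    definitions: List[str] = field(default_factory=list)


@dataclass
class TerminologicalEntry:
    concept_id: str
    domains: List[str] = field(default_factory=list)
    languages: Dict[str, LanguageSection] = field(default_factory=dict)

    def add_term(self, lang: str, term: str, **metadata: str) -> None:
        """Attach a term to the language section for `lang`, creating it if needed."""
        section = self.languages.setdefault(lang, LanguageSection(lang))
        section.terms.append(TermSection(term, dict(metadata)))


@dataclass
class TerminologicalDataCollection:
    sources: List[str] = field(default_factory=list)
    entries: List[TerminologicalEntry] = field(default_factory=list)

    def view(self, origin: str) -> List[TerminologicalEntry]:
        """Entries with at least one term whose 'origin' metadata equals `origin`."""
        return [
            e for e in self.entries
            if any(t.metadata.get("origin") == origin
                   for ls in e.languages.values() for t in ls.terms)
        ]


# Hypothetical sample entry, invented for illustration only.
entry = TerminologicalEntry("c1", domains=["medicine"])
entry.add_term("en", "heart attack", origin="Wikipedia")
entry.add_term("en", "myocardial infarction", origin="MeSH")
entry.add_term("fr", "infarctus du myocarde", origin="MeSH")
tdc = TerminologicalDataCollection(sources=["MeSH", "Wikipedia"], entries=[entry])
```

Keeping the origin as local metadata on each term is what makes a view such as "all MeSH entries" a simple filter over the merged collection.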

9 Merging terminologies, merging models
[Diagram: data category mapping from the source models (TMF model 1, TMF model 2, ...) to the target model, illustrated with the /definition/ data category]

10 Identifying domains
Theoretical background
  Non-ambiguity of a term within a domain
  E.g. 129 domains in MeSH
GRISP
  Set of 76 reference domains (see table 1)
  Scientific and technical domains of WordNet Domains (Magnini and Cavaglià, 2000)
  Organised as a hierarchy
  Manual mapping from resource-specific domains to our reference set
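A manual mapping onto a hierarchical reference set can be sketched with two lookup tables. The domain names, the hierarchy excerpt and the mapping entries below are hypothetical placeholders, not the actual 76 GRISP reference domains.

```python
from typing import List, Optional

# Hypothetical excerpt of a reference domain hierarchy: child -> parent.
PARENT = {
    "cardiology": "medicine",
    "medicine": "applied_science",
    "applied_science": None,
}

# Hypothetical manual mapping: (source, source-specific domain) -> reference domain.
DOMAIN_MAP = {
    ("MeSH", "Cardiovascular Diseases"): "cardiology",
    ("IPC", "A61"): "medicine",
}


def to_reference_domain(source: str, source_domain: str) -> Optional[str]:
    """Return the reference domain for a source-specific domain, or None if unmapped."""
    return DOMAIN_MAP.get((source, source_domain))


def ancestors(domain: str) -> List[str]:
    """Walk up the hierarchy and list the broader reference domains."""
    chain = []
    parent = PARENT.get(domain)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain
```

With a hierarchy available, two concepts mapped to `cardiology` and `medicine` can still be compared by looking at their shared ancestors.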

11 Merging concepts
Identification of common concepts across terminological sources: core principles
Baseline: same term + same domain = same concept
  Difficulties: conflicting domain mappings, high polysemy of term variants and incorrectly positioned concepts (e.g. Wikipedia)
    Wrongly merged concepts
    Loss of precision in concept descriptions
Revised: same preferred term + same domain = same concept
Source conformance rule: concepts kept separate in a given source cannot be merged later (by transitivity)
  Not applied to WordNet, IPC and Wikipedia
Smoothing down the rules: using machine learning techniques
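The two rules and the source conformance constraint can be sketched as a greedy clustering pass. This is a minimal illustration under stated assumptions, not the GRISP merger: the `Concept` record, the greedy strategy and the four toy concepts are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Sources exempt from the source conformance rule, as listed on the slide.
EXEMPT_SOURCES = {"WordNet", "IPC", "Wikipedia"}


@dataclass(frozen=True)
class Concept:
    cid: str
    source: str
    preferred: str
    terms: FrozenSet[str]    # all terms, preferred term included
    domains: FrozenSet[str]  # reference domains


def rule1(a: Concept, b: Concept) -> bool:
    """Baseline: same term + same domain = same concept."""
    return bool(a.terms & b.terms) and bool(a.domains & b.domains)


def rule2(a: Concept, b: Concept) -> bool:
    """Revised: same preferred term + same domain = same concept."""
    return a.preferred == b.preferred and bool(a.domains & b.domains)


def conflicts(group_a: List[Concept], group_b: List[Concept]) -> bool:
    """True if merging would join two concepts that a non-exempt source keeps apart."""
    sources_a = {x.source for x in group_a} - EXEMPT_SOURCES
    sources_b = {x.source for x in group_b} - EXEMPT_SOURCES
    return bool(sources_a & sources_b)


def merge(concepts: List[Concept], rule: Callable[[Concept, Concept], bool]) -> List[List[Concept]]:
    """Greedy pass: each concept absorbs every matching cluster that does not conflict."""
    clusters: List[List[Concept]] = []
    for concept in concepts:
        group = [concept]
        remaining = []
        for cluster in clusters:
            if any(rule(concept, m) for m in cluster) and not conflicts(group, cluster):
                group.extend(cluster)
            else:
                remaining.append(cluster)
        remaining.append(group)
        clusters = remaining
    return clusters


def make(cid, source, preferred, variants):
    return Concept(cid, source, preferred,
                   frozenset(variants) | {preferred}, frozenset({"medicine"}))


# Toy concepts, invented for illustration; "mi" is a deliberately polysemous variant.
CONCEPTS = [
    make("m1", "MeSH", "myocardial infarction", {"heart attack"}),
    make("w1", "Wikipedia", "myocardial infarction", {"heart attack", "mi"}),
    make("m2", "MeSH", "myocardial ischemia", {"mi"}),
    make("u1", "UMLS", "mitral insufficiency", {"mi"}),
]


def ids(clusters):
    return sorted(sorted(x.cid for x in cluster) for cluster in clusters)
```

On this toy input, `ids(merge(CONCEPTS, rule1))` groups `u1` with `m1` and `w1` because of the shared variant "mi" (a wrongly merged concept), while the conformance check keeps the two MeSH concepts apart. `ids(merge(CONCEPTS, rule2))` keeps `u1` and `m2` separate and merges only `m1` with `w1`.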

12 Concept merging as a machine learning process
[Diagram: concepts from a concept pool are turned into features, which feed a merging decision]
SVM (Support Vector Machine) and MLP (Multi-Layer Perceptron) binary classification models

13 Training process
Training features
  (f1-2) Sources (e.g. S1=“MeSH”, S2=“Wikipedia”)
  (f3) Number of common domains between the two concepts
  (f4) Number of same source-specific categorizations
  (f5) Boolean indicating if both preferred terms are identical
  (f6) Boolean indicating if both preferred terms are identical after stemming
  (f7) Ratio of identical terms given all terms
  (f8) Similarity measure of the definition texts, after stemming and based on negative KL divergence
  (f9) Number of domains of the merged concept
  (f10) Number of words of the longest common terms
Training data
  Wikipedia-MeSH mapping
  Pascal database (INIST)
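Most of these features can be computed from two concept records with a few set operations. The sketch below is a hedged reading of the feature list, not the authors' code: the plural-stripping `crude_stem` stands in for a real stemmer, f7 is interpreted as shared terms over all distinct terms, f8 uses additively smoothed unigram distributions, f4 is omitted, and the two sample records are invented.

```python
import math
from collections import Counter


def crude_stem(word: str) -> str:
    # Stand-in for a real stemmer (assumption): only strips a plural "s".
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stem_text(text: str):
    return [crude_stem(w) for w in text.lower().split()]


def neg_kl(text_a: str, text_b: str, eps: float = 1e-6) -> float:
    """-KL(P_a || P_b) over stemmed unigrams; eps is an assumed smoothing constant."""
    counts_a, counts_b = Counter(stem_text(text_a)), Counter(stem_text(text_b))
    vocab = set(counts_a) | set(counts_b)
    if not vocab:
        return 0.0
    total_a = sum(counts_a.values()) + eps * len(vocab)
    total_b = sum(counts_b.values()) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (counts_a[w] + eps) / total_a
        q = (counts_b[w] + eps) / total_b
        kl += p * math.log(p / q)
    return -kl


def pair_features(a: dict, b: dict) -> dict:
    terms_a, terms_b = set(a["terms"]), set(b["terms"])
    common = terms_a & terms_b
    domains_a, domains_b = set(a["domains"]), set(b["domains"])
    return {
        "f1_source": a["source"],
        "f2_source": b["source"],
        "f3_common_domains": len(domains_a & domains_b),
        "f5_same_preferred": a["preferred"] == b["preferred"],
        "f6_same_preferred_stemmed": stem_text(a["preferred"]) == stem_text(b["preferred"]),
        "f7_term_ratio": len(common) / (len(terms_a | terms_b) or 1),
        "f8_neg_kl": neg_kl(a["definition"], b["definition"]),
        "f9_merged_domains": len(domains_a | domains_b),
        "f10_longest_common_term": max((len(t.split()) for t in common), default=0),
    }


# Invented sample records, for illustration only.
A = {"source": "MeSH", "preferred": "heart diseases",
     "terms": ["heart diseases", "cardiac diseases"],
     "domains": ["medicine"],
     "definition": "pathological conditions involving the heart"}
B = {"source": "Wikipedia", "preferred": "heart disease",
     "terms": ["heart disease", "cardiac diseases", "cardiopathy"],
     "domains": ["medicine", "biology"],
     "definition": "any disorder that affects the heart"}

features = pair_features(A, B)
```

In this example the preferred terms differ as raw strings (f5 is false) but agree after stemming (f6 is true), which is the kind of near match the stemmed feature is meant to capture.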

14 Result overview
Merger         Concepts    Terms       Sem. Rel.
Aggregation    1,503,818   3,140,726   970,864
Merg. Rule 1   1,457,538   3,157,179   1,022,303
Merg. Rule 2   1,476,508   3,114,711   971,218
SVM            1,450,688   3,195,118   1,088,446
MLP            1,451,710   3,192,325   1,081,955
Overall content:
  596,865 definitions
  1,321,988 source-specific categorizations of concepts
  20,000 acronyms
  14,268 chemical formulas and 12,375 chemical structure identifiers
Observations:
  Small number of actual merges (cf. product names, chemical and medical entities)
  Merging relevant for frequently used concepts

15 Evaluation
[Table: coverage (cov.) and accuracy (acc.) per merger (Merging Rule 1, Merging Rule 2, SVM, MLP) on Wiki/MeSH and PASCAL; values not preserved except Merging Rule 1 cov. 0.6464]
Random subset of 10% of the merging examples extracted from Wikipedia/MeSH mappings and from the PASCAL terminology
Merging Rule 2 produces almost perfect merging but with a very low coverage
Rule 1 extends the coverage at the price of a relatively high rate of merging errors
Machine learning approaches further extend the coverage while maintaining a high precision
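The slide does not define coverage and accuracy, so the following sketch only shows one plausible reading: coverage as the share of reference mappings for which the merger proposes any merge, and accuracy as the share of proposed merges that match the reference. The function name and the toy mappings are assumptions.

```python
from typing import Dict, Optional, Tuple


def evaluate(predicted: Dict[str, Optional[str]], gold: Dict[str, str]) -> Tuple[float, float]:
    """Return (coverage, accuracy) under the assumed definitions described above."""
    # Keep only reference concepts for which the merger actually proposed a merge.
    decided = {k: v for k, v in predicted.items() if v is not None and k in gold}
    coverage = len(decided) / len(gold) if gold else 0.0
    correct = sum(1 for k, v in decided.items() if gold[k] == v)
    accuracy = correct / len(decided) if decided else 0.0
    return coverage, accuracy


# Toy reference mapping and merger output, invented for illustration.
GOLD = {"w1": "m1", "w2": "m2", "w3": "m3", "w4": "m4"}
PREDICTED = {"w1": "m1", "w2": "m9", "w3": "m3", "w4": None}

coverage, accuracy = evaluate(PREDICTED, GOLD)
```

Under this reading, a strict rule that rarely fires gets high accuracy with low coverage, which matches the behaviour reported for Merging Rule 2.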

16 GRISP browser: radial engine

18 Application: Patatras
PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS)
Context: CLEF-IP competition
  Prior art search task (EPO documents)
  1.9 million documents in English, French and German (more than 3 billion words)
  Ranked first for all subtasks of the evaluation track among 14 participants (Roda et al., 2009)
Conceptual indexing of the CLEF-IP corpus
  Development of a term annotator based on GRISP
  Term variant matching after POS tagging + lemmatization
  Concept disambiguation based on IPC classes of the documents
  1.1 million different terms identified
  176 million annotations
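The annotation steps named here (variant matching on normalised tokens, then disambiguation by the document's IPC classes) can be sketched as a longest-match lookup. Everything specific below is an assumption made for the example: the plural-stripping `lemma` stands in for POS tagging and lemmatization, the concept identifiers and index entries are invented, and falling back to the first candidate when no IPC class matches is a design choice of this sketch.

```python
from typing import List, Set, Tuple


def lemma(word: str) -> str:
    # Stand-in for real lemmatization (assumption): lowercase and strip a plural "s".
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def normalise(text: str) -> Tuple[str, ...]:
    return tuple(lemma(w) for w in text.split())


# Hypothetical term index: normalised term -> candidate (concept id, IPC classes).
INDEX = {
    normalise("cell"): [("bio:cell", {"C12"}), ("elec:cell", {"H01"})],
    normalise("fuel cell"): [("elec:fuel_cell", {"H01"})],
}
MAX_LEN = max(len(key) for key in INDEX)


def annotate(text: str, doc_ipc: Set[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, concept id) spans, preferring the longest matching term."""
    tokens = normalise(text)
    annotations = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidates = INDEX.get(tokens[i:i + n])
            if candidates:
                # Prefer a candidate sharing an IPC class with the document.
                chosen = next((cid for cid, ipc in candidates if ipc & doc_ipc),
                              candidates[0][0])
                annotations.append((i, i + n, chosen))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return annotations
```

For instance, `annotate("Fuel cells and living cells", {"C12"})` yields one span for "fuel cell" and resolves the final "cells" to the biological concept, while passing `{"H01"}` resolves it to the electrical one.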

19 Results: Patatras
Significant accuracy improvements for CLEF-IP
Combination of word-based and concept-based ranked results with a regression model
Based on 10,000 queries
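The slide does not say which regression model is used, so the sketch below assumes the simplest case: a linear combination of the two retrieval scores, with fixed weights standing in for coefficients that would be fitted on training queries. The weights, scores and document identifiers are invented.

```python
from typing import Dict, List


def combine(word_scores: Dict[str, float], concept_scores: Dict[str, float],
            w_word: float = 0.6, w_concept: float = 0.4, bias: float = 0.0) -> List[str]:
    """Rank documents by an assumed linear model over word-based and concept-based scores."""
    docs = set(word_scores) | set(concept_scores)
    combined = {
        d: bias + w_word * word_scores.get(d, 0.0) + w_concept * concept_scores.get(d, 0.0)
        for d in docs
    }
    # Highest combined score first; ties broken by document id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))


# Invented scores for four hypothetical documents.
WORD = {"EP1": 0.9, "EP2": 0.4, "EP3": 0.1}
CONCEPT = {"EP2": 0.9, "EP3": 0.8, "EP4": 0.5}

ranking = combine(WORD, CONCEPT)
```

Here `EP2` overtakes `EP1` once the concept-based evidence is added, which illustrates how a document ranked second by words alone can move up in the merged list.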

20 Epilogue
Online tool
  Contact:
Free resource
  Based on the freely available subset of resources
Constant evolution
  Maintenance according to evolution of our sources
  Addition of further sources

