Published by Francis Porter. Modified over 8 years ago.
1
www.landc.be
Indexing Medical Documents using related ontologies: towards a strategy for automatic quality assurance
Dr. W. Ceusters, CTO, Language and Computing nv
2
Some questions
–How well can TeSSI work with existing terminologies?
–How much better (if at all) is LinkBase for the same purpose?
–Is there a benefit in combining LinkBase with external terminologies, and what would be the best strategy?
–How can we use semantic indexing of documents to automatically find possible mistakes in LinkBase?
–How can we use this strategy to find the best set-up for individual customers, while minimizing the manual effort of ontology alignment?
3
To get some answers...
–... we indexed one text using 5 different, though related, ontology set-ups;
–... we tried to define triggers that can automatically point to potential mistakes;
–... we studied the retrieved ontology-entities pointed to by the triggers;
–... we are comparing the different options (but are not ready yet).
4
Presentation overview
Content
–Short introduction to the company
–TeSSI: Terminology Supported Semantic Indexing
–Set-up of the experiment
–Definition of triggers
–Where do individual retrieved index elements, pointed to by the triggers, come from?
–Some very preliminary conclusions
Objectives:
–inform audience on L&C’s state of the art
–focus on the impact of the ontology
5
L&C
6
Business of Language & Computing
(Tries to) solve the problem of unstructured text data management by empowering computers with an understanding of text.
Applications that L&C develops using this technology include:
–Terminology and Ontology Management systems
–Indexing documents based on the meaning of the text
–Search and retrieval solutions that outperform other retrieval engines
–Extracting information out of free text documents
–Automated clinical coding of clinical free text towards ICD, SNOMED, etc.
–Knowledge Management, Semantic Web and others
7
[Figure labels: Anthem, Multi-Tale, Dome, GIU, Select, C-Care, Liquid, Mobidev, Homey, Poirot, Inface, SCOP, ?; R/D ratio]
8
L&C’s integrated approach
–Data structure and function library for language understanding
–Medical and linguistic knowledge required for language understanding
–NLU enabling tools for knowledge supported data-entry and -retrieval
9
Ontology as the cornerstone
[Diagram labels: Formal Domain Ontology; Lexicon and Grammar, Language A; Lexicon and Grammar, Language B; Cassandra Linguistic Ontology; MEDDRA, ICD, SNOMED, ICPC, others...; Proprietary Terminologies]
10
Author related QA
11
BFO/MedO and LinkBase
BFO/MedO “validates”
12
Trilateral bootstrapping
[Diagram labels: Document Collection; LinkFactory; Alignment; Core Ontology; Source Ontology; Automatic pre-Ontology Building; NLU assisted refinement; Application Generation]
13
Seamless integration of production, maintenance and research
[Diagram labels: Production; Maintenance; Research; corrected document; gold standard corpus; text to analyse; client document; possibly missing information; WWW; “WebAgent”; “GapFinder”; new term; various “beans”; term classification proposal; TermModeller; various “beans”; tagged text; Medico-Linguistic Ontology; TeSSI,...; relevance ranking; uncorrected tagged document]
14
TeSSI ®
15
TeSSI ®: Terminology Supported Semantic Indexing
Based on LinkBase ®:
–formal ontologies dealing with time, mereology, partonomy,... (Smith, Varzi, Cohn,...)
–domain ontology structured according to the way languages are influenced by semantics (Bateman)
–linking towards multiple 3rd party terminologies, classification systems, ontologies,...
–multi-lingual
Combines in-document statistics with spreading activation enforcement in LinkBase ®
Implemented as a server
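How in-document statistics and spreading activation might be combined can be illustrated with a toy sketch. The slides do not describe TeSSI's actual algorithm, so everything here is an assumption made for illustration: the function name `rank_concepts`, the `decay` and `rounds` parameters, and the tiny link graph in the usage note.

```python
from collections import defaultdict


def rank_concepts(term_counts, links, decay=0.5, rounds=2):
    """Rank concepts by relevance (illustrative sketch, not TeSSI's method).

    term_counts: concept -> number of mentions in the document.
    links:       concept -> list of directly linked concepts in the ontology.
    """
    total = sum(term_counts.values())
    if total == 0:
        return []

    # Step 1: in-document statistics give the initial activation.
    activation = defaultdict(float)
    for concept, count in term_counts.items():
        activation[concept] = count / total

    # Step 2: spread a decayed share of activation along ontology links.
    frontier = dict(activation)
    for _ in range(rounds):
        spread = defaultdict(float)
        for concept, value in frontier.items():
            neighbours = links.get(concept, [])
            for neighbour in neighbours:
                spread[neighbour] += decay * value / len(neighbours)
        for concept, value in spread.items():
            activation[concept] += value
        frontier = spread

    # Highest activation first; ties broken alphabetically.
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))
```

With hypothetical counts `{"meningitis": 3, "fever": 1}` and links `meningitis -> disorder of meninges -> disorder`, a concept that is never mentioned ("disorder of meninges") still ends up ranked above a rarely mentioned one ("fever"), which is the effect spreading activation is meant to produce.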
16
Architectural Overview
[Diagram labels: TeSSI Server; Index]
17
The TeSSI-server
–Through Web-browser
–By mail
18
[Screenshot annotations:]
–Syntax-based semantic tagging
–Sentence/clause identification wrt modality or negation
–Unrecognised words or deactivated stop-words
–Unknown term resolution
19
Unknown term resolution in TeSSI
Simpler, much faster, but less powerful than LinkFactory’s TermModelling algorithm
Rewrites word patterns that are candidates for multiword terms into terms known in LinkBase
–typical patterns: inflectional variants; some NP-PP-NP and ADJ-NP patterns
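The kind of pattern rewriting described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not L&C's implementation: the lexicon `KNOWN_TERMS`, the concept labels, the `ADJ_TO_NOUN` table and the crude plural stripping are all hypothetical stand-ins for what LinkBase would provide.

```python
import re

# Hypothetical mini-lexicon: known term -> concept label (not real LinkBase data).
KNOWN_TERMS = {
    "lung embolism": "C_LUNG_EMBOLISM",
    "pulmonary infarction": "C_PULMONARY_INFARCTION",
}

# Hypothetical adjective -> noun normalisation table.
ADJ_TO_NOUN = {"pulmonary": "lung"}


def singularize(word):
    """Very crude inflection handling: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def candidate_rewrites(phrase):
    """Yield rewrites of a candidate multiword term, simplest first."""
    words = [singularize(w) for w in phrase.lower().split()]
    normalised = " ".join(words)
    yield normalised  # inflectional variant only

    # NP-PP-NP pattern: "embolism of the lung" -> "lung embolism"
    match = re.fullmatch(r"(.+?) (?:of|in) (?:the )?(.+)", normalised)
    if match:
        yield f"{match.group(2)} {match.group(1)}"

    # ADJ-NP pattern: "pulmonary embolism" -> "lung embolism"
    if len(words) == 2 and words[0] in ADJ_TO_NOUN:
        yield f"{ADJ_TO_NOUN[words[0]]} {words[1]}"


def resolve(phrase):
    """Return (known term, concept label) for the first rewrite found, else None."""
    for candidate in candidate_rewrites(phrase):
        if candidate in KNOWN_TERMS:
            return candidate, KNOWN_TERMS[candidate]
    return None
```

Because the rewrite only succeeds when the target term exists in the lexicon, the result depends directly on which ontology is loaded, which is the point the next slide makes.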
20
Impact of the ontology on term-rewriting
[Screenshots: LinkBase only; SNOMED-CT only]
21
The TermModelling algorithm
[Diagram labels: pulmonary embolism ??; pulmonary; pulmonaire; embolism; embolie; infarction pulmonaire; infarctus du poumon; C1; lung; poumon; C2; lung embolism; embolie pulmonaire; pulmonary infarction; C3]
when more ontological information available
23
Domain-entity identification
24
Meta-entity coding
25
Simple meta-entity coding
26
Correctly resolved ambiguity
27
Semantic info on proving and showing
28
Resolved ambiguity with coding
29
Relevance ranking
30
LB outcome
31
TeSSI works extremely well, but it is not perfect
Some problem areas
32
NP recognition failures
33
Clause analysis could be better
34
Coding problems
Better code not found BECAUSE OF good NP resolution
Best SNOMED CT-code (CT-only set-up)
35
No CT-code attached
36
Unresolved ambiguity with correct coding
37
Unresolved ambiguity with unresolved coding
38
Ischaemia versus Ischemia
39
Summary of issues
TeSSI very well accepted by our customers:
–“best semantic indexing ever seen”
–not necessarily “best buy” (price issue, complexity)
Ontology (terminology) is the main driving force
Very advanced NLU algorithms are still too slow for processing large amounts of documents
The underlying ontology changes on a daily basis
40
Introduction of new domain-entities
I) Candidate concept is explicitly represented in an external ontology/terminology towards which a mapping must be maintained.
II) Polysemous terms for which not all possible meanings are represented yet.
III) Reification of a newly introduced relationship expressed by terms in at least one language or necessary for the representation of other concepts.
IV) Candidate concept can be expressed by one word or token in at least one language.
V) Term found for which no concept exists yet.
41
Current highlights
Ship TeSSI with the minimal amount of information required to do the best job:
–LinkBase-extractions
–Adding third party information
Rationale:
–advanced processing less time consuming
–less expensive
Automatically comparing results of different extractions
42
Set-up of the experiment
43
Five related “ontologies”
–LB: Pure LinkBase (with SNOMED-CT coding)
–CT: Pure SNOMED-CT
–LBCT-L: LinkBase + SNOMED-CT “loose” (with SNOMED-CT coding)
–LBCT-A: LinkBase + SNOMED-CT “all” (with SNOMED-CT coding)
–UMLS: Pure UMLS (January 2003 version)
45
LBCT-“loose”
[Diagram; link labels: IS_A, SCT FINDING SITE, CCC]
Core-ontology entities: MENINGES STRUCTURE; MENINGITIS; UNSPECIFIED MENINGITIS; DISORDER OF MENINGES
SNOMED-CT : 192678004 : MENINGITIS OF UNSPECIFIED CAUSE (DISORDER)
SNOMED-CT : 192681009 : UNSPECIFIED MENINGITIS (DISORDER)
SNOMED-CT : 1231004 : MENINGES STRUCTURE (BODY STRUCTURE)
SNOMED-CT : 7180009 : MENINGITIS (DISORDER)
SNOMED-CT : 15758002 : DISORDER OF MENINGES (DISORDER)
46
LBCT-“all”
[Diagram; link labels: IS_A, SCT FINDING SITE, CCC]
Core-ontology entities: MENINGES STRUCTURE; MENINGITIS; UNSPECIFIED MENINGITIS; DISORDER OF MENINGES
SNOMED-CT : 192678004 : MENINGITIS OF UNSPECIFIED CAUSE (DISORDER)
SNOMED-CT : 192681009 : UNSPECIFIED MENINGITIS (DISORDER)
SNOMED-CT : 1231004 : MENINGES STRUCTURE (BODY STRUCTURE)
SNOMED-CT : 7180009 : MENINGITIS (DISORDER)
SNOMED-CT : 15758002 : DISORDER OF MENINGES (DISORDER)
47
LBCT-L versus LBCT-A
Main difference:
–the L-version adds (virtually) to the core-ontology CT-specific information that is relevant to the loose concepts only (for which nothing is known in the core-ontology)
–the A-version adds the complete CT-structure unless contradicted by the core-ontology.
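The two merge strategies can be contrasted in a small sketch. It rests on several assumptions that the slides do not confirm: ontologies are reduced to IS_A parent sets, "contradicted by the core-ontology" is taken to mean that the core already places the proposed child above the proposed parent, and all function names are hypothetical.

```python
def ancestors(onto, concept):
    """All concepts reachable from `concept` by following IS_A links upwards."""
    seen, stack = set(), list(onto.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(onto.get(parent, ()))
    return seen


def contradicts(core, child, parent):
    """Assumed notion of contradiction: core already says `parent` IS_A `child`."""
    return child in ancestors(core, parent)


def merge_loose(core, ct):
    """L-version: add CT information only for concepts the core knows nothing about."""
    merged = {c: set(parents) for c, parents in core.items()}
    for concept, parents in ct.items():
        if concept not in core:
            merged[concept] = set(parents)
    return merged


def merge_all(core, ct):
    """A-version: add every CT link unless the core contradicts it."""
    merged = {c: set(parents) for c, parents in core.items()}
    for concept, parents in ct.items():
        for parent in parents:
            if not contradicts(core, concept, parent):
                merged.setdefault(concept, set()).add(parent)
    return merged
```

In this reading, the L-version leaves every concept already present in the core untouched, while the A-version can enrich core concepts with extra CT parents. That difference would explain why the two set-ups retrieve different codes for the same text.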
49
Some counts
[Table legend: Z: from sentence; P: from terminology; N: not disambiguated]
–The denser the ontology structure, the more P-type retrieval
–LinkBase-based systems recognise the most terms
–LinkBase-based systems are best aware of term ambiguity
50
The UMLS-problem
51
Differential analysis principles for SNOMED-coded groups
Look for “unexpected results” only:
–manual expert-“scanning”: what codes seem strange having read the report?
–automatically:
  –CT-codes attached to various core-entities with different distributions
  –CT-codes found in one specific group (LB, CT,..., based on selected domain-entity!) only
    –if too many, select on statistical significance: > 2 STD from population mean relevance
  –CT-codes found in at least 2 groups, with further selection on the basis of:
    –> 2 STD from population mean relevance within one group
    –strongly differing population mean relevance over various groups
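Two of the automatic triggers listed above can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' code: the function names are hypothetical, results are assumed to be available as a mapping from set-up name to `{CT-code: relevance}`, and the population standard deviation is used because the slide speaks of the "population mean".

```python
from statistics import mean, pstdev


def only_in_one_group(results):
    """Codes retrieved in exactly one set-up.

    results: set-up name -> {CT-code: relevance}.
    Returns {CT-code: name of the single set-up that found it}.
    """
    triggered = {}
    for code in set().union(*results.values()):
        found_in = [setup for setup, codes in results.items() if code in codes]
        if len(found_in) == 1:
            triggered[code] = found_in[0]
    return triggered


def relevance_outliers(relevances, k=2.0):
    """Codes whose relevance lies more than k STD from the mean within one set-up."""
    values = list(relevances.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return set()
    return {code for code, r in relevances.items() if abs(r - mu) > k * sd}
```

A usage note on the threshold: with a population standard deviation, a single value among n can deviate by at most sqrt(n - 1) STD, so the "> 2 STD" trigger can only fire in set-ups that return at least six codes.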
52
Some strange results by manual verification
[Table annotations: STD from mean relevance in set-up x; found in LB / CT / LBCT-L / LBCT-A; vector difference measure between 2 set-ups; retrieval type and possible semantic ambiguity]
53
Strange distribution triggered by different core-entities found only in LB setup
54
Found in CT only, but not elsewhere
55
CT-codes found in 2 of the 4 set-ups
[Table annotations: STD from mean relevance in set-up x; relevance percentage in set-up x; relevance order in set-up x; vector difference measure between 2 set-ups]
56
Searching for triggers
57
Unexpected cases in CT/LBCT-A
58
All statistically derived triggerings
59
Some numerical results
statistically retrieved triggers dominate with involvement of CT
60
Origin of strange results
61
homonym not recognised in individual terminologies
62
not recognised at all in SNOMED-CT only set-up
Wrong coding in LinkBase
67
Conclusions
Most important sources of errors:
–lexicon incompleteness
–wrong ontological foundations of external terminologies
–incomplete mapping from LinkBase to external terminologies
We have good indications that a combination of triggers is a good indicator of mistakes.