
Indexing Medical Documents using related ontologies: towards a strategy for automatic quality assurance. Dr. W. Ceusters, CTO, Language and Computing (www.landc.be).


1 www.landc.be Indexing Medical Documents using related ontologies: towards a strategy for automatic quality assurance Dr. W. Ceusters CTO Language and Computing nv

2 www.landc.be Some questions How well can TeSSI work with existing terminologies? How much better (if at all) is LinkBase for the same purpose? Is there a benefit in combining LinkBase with external terminologies, and what would be the best strategy? How can we use semantic indexing of documents to automatically find possible mistakes in LinkBase? How can we use this strategy to find the best set-up for individual customers, while minimizing the manual effort of ontology alignment?

3 www.landc.be To get some answers... ... we indexed one text using 5 different, though related, ontology set-ups; ... we tried to define triggers that can automatically point to potential mistakes; ... we studied the retrieved ontology-entities pointed to by the triggers; ... we are comparing the different options (but are not ready yet)

4 www.landc.be Presentation overview Content –Short introduction to the company –TeSSI: Terminology Supported Semantic Indexing –Set-up of the experiment –Definition of triggers –Where do individual retrieved index elements, pointed to by the triggers, come from ? –Some very preliminary conclusions Objectives: –inform audience on L&C’s state of the art –focus on the impact of the ontology

5 www.landc.be L&C

6 www.landc.be Business of Language & Computing (Tries to) solve the problem of unstructured text data management by empowering computers with an understanding of text. Applications that L&C develops using this technology include: Terminology and Ontology Management systems Indexing documents based on the meaning of the text Search and retrieval solutions that outperform other retrieval engines Extracting information out of free text documents Automated clinical coding of clinical free text towards ICD, SNOMED, etc. Knowledge Management, Semantic Web and others

7 www.landc.be Anthem Multi-Tale Dome GIU Select C-Care Liquid Mobidev R/D ratio Homey Poirot Inface SCOP ?

8 www.landc.be L&C’s integrated approach Data structure and function library for language understanding Medical and linguistic knowledge required for language understanding NLU enabling tools for knowledge supported data-entry and -retrieval

9 www.landc.be Ontology as the cornerstone Formal Domain Ontology Lexicon Grammar Language A Lexicon Grammar Language B Cassandra Linguistic Ontology MEDDRA ICD SNOMED ICPC Others... Proprietary Terminologies

10 www.landc.be Author related QA

11 www.landc.be BFO/MedO and LinkBase BFO/MedO “validates”

12 www.landc.be Trilateral bootstrapping Document Collection LinkFactory Alignment Core Ontology Source Ontology Automatic pre-Ontology Building NLU assisted refinement Application Generation

13 www.landc.be Production Maintenance Research Seamless integration of production, maintenance and research corrected document gold standard corpus text to analyse client document possibly missing information WWW “WebAgent” “GapFinder” new term various “beans” term classification proposal TermModeller various “beans” tagged text Medico-Linguistic Ontology TeSSI,... relevance ranking uncorrected tagged document

14 www.landc.be TeSSI ®

15 www.landc.be TeSSI ®: Terminology Supported Semantic Indexing Based on LinkBase ®: –formal ontologies dealing with time, mereology, partonomy,... (Smith, Varzi, Cohn,...) –domain ontology structured according to the way languages are influenced by semantics (Bateman) –linking towards multiple 3rd party terminologies, classification systems, ontologies,... –multi-lingual Combines in-document statistics with spreading activation enforcement in LinkBase ® Implemented as a server
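
The combination of in-document statistics with spreading activation can be illustrated with a minimal sketch. This is not TeSSI's actual algorithm, which the slides do not specify. It assumes term frequencies as seed weights, a toy graph with hypothetical links, and an arbitrary decay factor; the names `spread_activation`, `graph`, `seeds`, `decay` and `steps` are all illustrative.

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate seed weights along outgoing links.

    graph: dict mapping a node to a list of neighbour nodes (assumed structure).
    seeds: dict mapping a node to its initial weight, e.g. an in-document frequency.
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, weight in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            # Each neighbour receives an equal, decayed share of the weight.
            share = weight * decay / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, weight in next_frontier.items():
            activation[node] += weight
        frontier = next_frontier
    return dict(activation)

# Toy ontology fragment; the links are hypothetical, not taken from LinkBase.
graph = {
    "meningitis": ["disorder of meninges", "meninges structure"],
    "disorder of meninges": ["meninges structure"],
    "headache": ["pain"],
}
# Assumed in-document statistics: how often each term occurs in the text.
seeds = {"meningitis": 2.0, "headache": 1.0}

scores = spread_activation(graph, seeds)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy run, "meninges structure" never occurs in the text but still receives activation through two paths, which is the kind of effect a relevance ranking based on the ontology can exploit.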

16 www.landc.be Architectural Overview TeSSI Server Index

17 www.landc.be The TeSSI-server Through Web-browser By mail

18 www.landc.be Syntax-based semantic tagging Sentence/clause identification wrt modality or negation Unrecognised words or deactivated stop-words Unknown term resolution

19 www.landc.be Unknown term resolution in TeSSI Simpler and much faster, but less powerful than LinkFactory’s TermModelling algorithm. Rewrites word patterns that are candidates for multiword terms into terms known in LinkBase –typical patterns: inflectional variants, some NP-PP-NP and ADJ-NP patterns
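
A rough sketch of this kind of pattern rewriting, under stated assumptions: `KNOWN_TERMS` is a hypothetical stand-in for the terms known in LinkBase, `singularise` is a deliberately crude substitute for real inflection handling, and only one NP-PP-NP shape ("X of (the) Y") is covered. None of these names come from TeSSI itself.

```python
import re

# Hypothetical stand-in for the set of terms known in the ontology.
KNOWN_TERMS = {"lung embolism", "pulmonary infarction"}

def singularise(word):
    # Crude inflection handling (assumption): strip a plural "s", keep "ss" endings.
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word

def candidates(phrase):
    """Yield rewritings of a phrase, from least to most transformed."""
    text = phrase.lower()
    words = text.split()
    yield " ".join(words)
    yield " ".join(singularise(w) for w in words)
    # NP-PP-NP pattern "head of (the) modifier" rewritten as "modifier head".
    match = re.fullmatch(r"(\w+) of (?:the )?(\w+)", text)
    if match:
        head, modifier = match.groups()
        yield f"{singularise(modifier)} {singularise(head)}"

def resolve(phrase):
    """Return the first candidate that is a known term, or None."""
    for candidate in candidates(phrase):
        if candidate in KNOWN_TERMS:
            return candidate
    return None

resolved_pp = resolve("embolism of the lung")
resolved_plural = resolve("Pulmonary infarctions")
unresolved = resolve("fever")
```

Which phrases resolve depends entirely on the term set, which is one way to read the next slide's comparison of LinkBase-only and SNOMED-CT-only rewriting.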

20 www.landc.be Impact of the ontology on term-rewriting LinkBase only SNOMED-CT only

21 www.landc.be The TermModelling algorithm pulmonaryembolism ?? pulmonary pulmonaire embolism embolie infarction pulmonaire infarctus du poumon C1 lung poumon C2 lung embolism embolie pulmonaire pulmonary infarction C3 when more ontological information available

23 www.landc.be Domain-entity identification

24 www.landc.be Meta-entity coding

25 www.landc.be Simple meta-entity coding

26 www.landc.be Correctly resolved ambiguity

27 www.landc.be Semantic info on proving and showing

28 www.landc.be Resolved ambiguity with coding

29 www.landc.be Relevance ranking

30 www.landc.be LB outcome

31 www.landc.be TeSSI works extremely well, but it is not perfect Some problem areas

32 www.landc.be NP recognition failures

33 www.landc.be Clause analysis could be better

34 www.landc.be Coding problems Better code not found BECAUSE OF good NP resolution Best SNOMED CT-code (CT-only set-up)

35 www.landc.be No CT-code attached

36 www.landc.be Unresolved ambiguity with correct coding

37 www.landc.be Unresolved ambiguity with unresolved coding

38 www.landc.be Ischaemia versus Ischemia

39 www.landc.be Summary of issues TeSSI very well accepted by our customers: –“best semantic indexing ever seen” –not necessarily “best buy” (price issue, complexity) Ontology (terminology) is the main driving force Very advanced NLU algorithms are still too slow for processing large amounts of documents The underlying ontology changes on a daily basis

40 www.landc.be Introduction of new domain-entities I) Candidate concept is explicitly represented in an external ontology/terminology towards which a mapping must be maintained. II) Polysemous terms for which all possible meanings are not yet represented. III) Reification of a newly introduced relationship expressed by terms in at least one language or necessary for the representation of other concepts. IV) Candidate concept can be expressed by one word or token in at least one language. V) Term found for which no concept exists yet.

41 www.landc.be Current highlights Ship TeSSI with the minimal amount of information required to do the best job: –LinkBase-extractions –Adding third party information Rationale: –advanced processing less time consuming –less expensive Automatically comparing results of different extractions

42 www.landc.be Set-up of the experiment

43 www.landc.be Five related “ontologies” Pure LinkBase (with SNOMED-CT coding) –(LB) Pure SNOMED-CT –(CT) LinkBase + SNOMED-CT “loose” (with SNOMED-CT coding) –(LBCT-L) LinkBase + SNOMED-CT “all” (with SNOMED-CT coding) –(LBCT-A) Pure UMLS (January 2003 version) –(UMLS)

45 www.landc.be SCT FINDING SITE IS_A SNOMED-CT : 192678004 : MENINGITIS OF UNSPECIFIED CAUSE (DISORDER) SNOMED-CT : 192681009 : UNSPECIFIED MENINGITIS (DISORDER) SNOMED-CT : 1231004 : MENINGES STRUCTURE (BODY STRUCTURE) IS_A SCT FINDING SITE MENINGES STRUCTURE SNOMED-CT : 7180009 : MENINGITIS (DISORDER) IS_A MENINGITIS UNSPECIFIED MENINGITIS CCC DISORDER OF MENINGES SNOMED-CT : 15758002 : DISORDER OF MENINGES (DISORDER) IS_A CCC LBCT-”loose”

46 www.landc.be SNOMED-CT : 192678004 : MENINGITIS OF UNSPECIFIED CAUSE (DISORDER) SNOMED-CT : 192681009 : UNSPECIFIED MENINGITIS (DISORDER) SNOMED-CT : 1231004 : MENINGES STRUCTURE (BODY STRUCTURE) IS_A SCT FINDING SITE MENINGES STRUCTURE SNOMED-CT : 7180009 : MENINGITIS (DISORDER) IS_A MENINGITIS UNSPECIFIED MENINGITIS CCC DISORDER OF MENINGES SNOMED-CT : 15758002 : DISORDER OF MENINGES (DISORDER) IS_A CCC SCT FINDING SITE IS_A LBCT-”all”

47 www.landc.be LBCT-L versus LBCT-A Main difference: –the L-version adds (virtually) to the core-ontology CT-specific information that is relevant to the loose concepts only (for which nothing is known in the core-ontology) –the A-version adds the complete CT-structure unless contradicted by the core-ontology.
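
The difference between the two merge strategies can be sketched as follows. This is an illustration under assumptions, not L&C's implementation: ontologies are reduced to a mapping from a concept to its set of parents, a "loose" concept is taken to be one absent from the core mapping, and "contradicted by the core-ontology" is modelled as an explicit, hypothetical set of rejected (concept, parent) pairs, since the slides do not say how contradictions are detected.

```python
def merge_loose(core, ct):
    """LBCT-L style: add CT parents only for concepts unknown to the core."""
    merged = {concept: set(parents) for concept, parents in core.items()}
    for concept, parents in ct.items():
        if concept not in core:  # "loose": nothing known in the core ontology
            merged[concept] = set(parents)
    return merged

def merge_all(core, ct, contradicted):
    """LBCT-A style: add every CT link unless it is listed as contradicted."""
    merged = {concept: set(parents) for concept, parents in core.items()}
    for concept, parents in ct.items():
        kept = {p for p in parents if (concept, p) not in contradicted}
        merged.setdefault(concept, set()).update(kept)
    return merged

# Toy data: the links are illustrative placeholders, not real SNOMED-CT content.
core = {"meningitis": {"disorder of meninges"}}
ct = {
    "meningitis": {"ct-only parent"},
    "unspecified meningitis": {"meningitis"},
}

loose = merge_loose(core, ct)
full = merge_all(core, ct, contradicted=set())
filtered = merge_all(core, ct, contradicted={("meningitis", "ct-only parent")})
```

In the "loose" result the core concept keeps only its core parent, while the "all" result also inherits the CT parent unless that link is rejected.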

49 www.landc.be Some counts Z: from sentence P: from terminology N: not disambiguated The denser the ontology structure, the more P-type retrieval. LinkBase-based systems recognise the most terms. LinkBase-based systems are best aware of term ambiguity.

50 www.landc.be The UMLS-problem

51 www.landc.be Differential analysis principles for SNOMED-coded groups Look for “unexpected results” only: –manual expert “scanning”: which codes seem strange after reading the report? –automatically: CT-codes attached to various core-entities with different distributions; CT-codes found in one specific group only (LB, CT, ..., based on the selected domain-entity!), and if there are too many, select on statistical significance: > 2 STD from population mean relevance; CT-codes found in at least 2 groups, with further selection on the basis of: –> 2 STD from population mean relevance within one group –strongly differing population mean relevance over various groups
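
Two of the automatic triggers above can be sketched in a few lines. The sketch assumes that each set-up yields a mapping from CT-code to a relevance score, uses the population standard deviation, and runs on made-up codes ("c1" to "c9") and scores; the function names are illustrative and the data are not results from the experiment.

```python
from statistics import mean, pstdev

def found_in_one_setup_only(relevance):
    """For each set-up, return the codes that no other set-up retrieved.

    relevance: dict mapping set-up name -> dict mapping code -> relevance score.
    """
    result = {}
    for setup, codes in relevance.items():
        others = set()
        for other, other_codes in relevance.items():
            if other != setup:
                others.update(other_codes)
        result[setup] = set(codes) - others
    return result

def outliers(codes, threshold=2.0):
    """Codes whose relevance lies more than `threshold` STD from the mean."""
    values = list(codes.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {c for c, v in codes.items() if abs(v - mu) > threshold * sigma}

# Made-up relevance scores for two set-ups (hypothetical codes and values).
relevance = {
    "LB": {"c1": 1.0, "c2": 1.0, "c3": 1.0, "c4": 1.0,
           "c5": 1.0, "c6": 1.0, "c7": 1.0, "c8": 10.0},
    "CT": {"c1": 1.0, "c9": 3.0},
}

unique_codes = found_in_one_setup_only(relevance)
lb_outliers = outliers(relevance["LB"])
```

With the population standard deviation, a single value among n can be at most sqrt(n - 1) deviations from the mean, so the 2 STD trigger can only fire when a set-up retrieves at least six codes.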

52 www.landc.be Some strange results by manual verification STD from mean relevance in set-up x found in LB / CT / LBCT-L / LBCT-A vector difference measure between 2 set-ups retrieval type and possible semantic ambiguity

53 www.landc.be Strange distribution triggered by different core-entities found only in LB setup

54 www.landc.be Found in CT only, but not elsewhere

55 www.landc.be CT-codes found in 2 of the 4 set-ups STD from mean relevance in set-up x relevance percentage in set-up x relevance order in set-up x vector difference measure between 2 set-ups
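
The slides do not define the vector difference measure. As one possible reading, offered as an assumption only, each set-up can be treated as a vector of relevance scores indexed by CT-code and two set-ups compared by Euclidean distance, with a code missing from a set-up counted as relevance 0. The function name and the toy codes are illustrative.

```python
import math

def vector_difference(a, b):
    """Euclidean distance between two code -> relevance mappings.

    A code absent from one mapping is treated as having relevance 0.0
    (an assumption; the original measure is not specified).
    """
    codes = sorted(set(a) | set(b))
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in codes))

# Toy relevance scores for two set-ups (hypothetical codes and values).
setup_x = {"c1": 3.0}
setup_y = {"c2": 4.0}
distance = vector_difference(setup_x, setup_y)
```

A cosine-based measure would ignore overall relevance magnitude, so the choice between the two depends on whether scores are comparable across set-ups.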

56 www.landc.be Searching for triggers

57 www.landc.be Unexpected cases in CT/LBCT-A

58 www.landc.be All statistically derived triggerings

59 www.landc.be Some numerical results statistically retrieved triggers dominate with involvement of CT

60 www.landc.be Origin of strange results

61 www.landc.be homonym not recognised in individual terminologies

62 www.landc.be not recognised at all in SNOMED-CT only set-up Wrong coding in LinkBase

67 www.landc.be Conclusions Most important sources of errors: –lexicon incompleteness –wrong ontological foundations of external terminologies –incomplete mapping from LinkBase to external terminologies We have good indications that a combination of triggers is a good indicator of mistakes.

