Www.landc.be Indexing Medical Documents using related ontologies: towards a strategy for automatic quality assurance Dr. W. Ceusters CTO Language and Computing.

Presentation transcript:

Indexing Medical Documents using related ontologies: towards a strategy for automatic quality assurance Dr. W. Ceusters CTO Language and Computing nv

Some questions: How well can TeSSI work with existing terminologies? How much better (if at all) is LinkBase for the same purpose? Is there a benefit in combining LinkBase with external terminologies, and what would be the best strategy? How can we use semantic indexing of documents to automatically find possible mistakes in LinkBase? How can we use this strategy to find the best set-up for individual customers while minimizing the manual effort of ontology alignment?

To get some answers: we indexed one text using 5 different, though related, ontology set-ups; we tried to define triggers that can automatically point to potential mistakes; we studied the retrieved ontology-entities pointed to by the triggers; we are comparing the different options (this comparison is not finished yet).

Presentation overview Content –Short introduction to the company –TeSSI: Terminology Supported Semantic Indexing –Set-up of the experiment –Definition of triggers –Where do individual retrieved index elements, pointed to by the triggers, come from ? –Some very preliminary conclusions Objectives: –inform audience on L&C’s state of the art –focus on the impact of the ontology

L&C

Business of Language & Computing: (tries to) solve the problem of unstructured text data management by empowering computers with an understanding of text. Applications that L&C develops using this technology include: terminology and ontology management systems; indexing documents based on the meaning of the text; search and retrieval solutions that outperform other retrieval engines; extracting information out of free text documents; automated clinical coding of clinical free text towards ICD, SNOMED, etc.; knowledge management, Semantic Web and others.

[Chart: R/D ratio; project labels: Anthem, Multi-Tale, Dome, GIU, Select, C-Care, Liquid, Mobidev, Homey, Poirot, Inface, SCOP, ?]

L&C’s integrated approach: data structure and function library for language understanding; medical and linguistic knowledge required for language understanding; NLU-enabling tools for knowledge-supported data entry and retrieval.

Ontology as the cornerstone [diagram labels: Formal Domain Ontology; Lexicon, Grammar (Language A); Lexicon, Grammar (Language B); Cassandra Linguistic Ontology; MEDDRA, ICD, SNOMED, ICPC, others...; Proprietary Terminologies]

Author related QA

BFO/MedO and LinkBase BFO/MedO “validates”

Trilateral bootstrapping [diagram labels: Document Collection; LinkFactory; Alignment; Core Ontology; Source Ontology; Automatic pre-Ontology Building; NLU-assisted refinement; Application Generation]

Seamless integration of production, maintenance and research [diagram labels: Production; Maintenance; Research; corrected document; gold standard corpus; text to analyse; client document; possibly missing information; WWW; “WebAgent”; “GapFinder”; new term; various “beans”; term classification proposal; TermModeller; tagged text; Medico-Linguistic Ontology; TeSSI, ...; relevance ranking; uncorrected tagged document]

TeSSI ®

TeSSI ®: Terminology Supported Semantic Indexing. Based on LinkBase ®: – formal ontologies dealing with time, mereology, partonomy, ... (Smith, Varzi, Cohn, ...) – domain ontology structured according to the way languages are influenced by semantics (Bateman) – linking towards multiple 3rd party terminologies, classification systems, ontologies, ... – multi-lingual. Combines in-document statistics with spreading activation enforcement in LinkBase ®. Implemented as a server.
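
As an illustration of combining in-document statistics with spreading activation, here is a minimal sketch. It is not TeSSI's implementation: the function name, the toy graph, the `decay` factor and the number of rounds are all assumptions made for the example.

```python
from collections import defaultdict

def spread_activation(seed_scores, graph, decay=0.5, rounds=2):
    """Propagate in-document scores over ontology links.

    seed_scores: {concept: score from in-document statistics}
    graph: {concept: [linked concepts]} (hypothetical toy ontology)
    decay: fraction of a node's activation passed on per round (assumed)
    """
    activation = defaultdict(float, seed_scores)
    frontier = dict(seed_scores)
    for _ in range(rounds):
        next_frontier = defaultdict(float)
        for concept, score in frontier.items():
            neighbours = graph.get(concept, [])
            if not neighbours:
                continue
            # Split the decayed activation evenly over the linked concepts.
            share = decay * score / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for concept, extra in next_frontier.items():
            activation[concept] += extra
        frontier = next_frontier
    return dict(activation)

# Toy data, not taken from LinkBase.
toy_graph = {
    "meningitis": ["disorder of meninges"],
    "disorder of meninges": ["meninges structure"],
    "headache": [],
}
seed = {"meningitis": 2.0, "headache": 1.0}
ranked = sorted(spread_activation(seed, toy_graph).items(), key=lambda kv: -kv[1])
```

In this sketch a concept that is never mentioned in the text ("meninges structure") still receives a score because it is linked to concepts that are mentioned.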

Architectural Overview TeSSI Server Index

The TeSSI-server Through Web-browser By mail

Syntax-based semantic tagging: sentence/clause identification with respect to modality or negation; unrecognised words or deactivated stop-words; unknown term resolution.

Unknown term resolution in TeSSI: simpler and much faster, but less powerful, than LinkFactory’s TermModelling algorithm. Rewrites word patterns that are candidates for multiword terms into terms known in LinkBase – typical patterns: inflectional variants; some NP-PP-NP and ADJ-NP patterns.
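
The rewriting idea can be sketched as follows. This is only an illustration under stated assumptions: `KNOWN_TERMS` is a hypothetical stand-in for the terms known in LinkBase, the plural stripping is deliberately naive, and only one NP-PP-NP pattern ("X of (the) Y") is handled.

```python
import re

# Hypothetical stand-in for the set of terms known in the ontology.
KNOWN_TERMS = {"lung embolism", "heart disease"}

def singular(word):
    """Very naive inflection handling: strip a trailing plural 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def candidates(phrase):
    """Yield rewritten forms of a candidate multiword term."""
    normalised = " ".join(singular(w) for w in phrase.lower().split())
    yield normalised
    # NP-PP-NP pattern: "X of (the) Y" is rewritten to "Y X".
    match = re.fullmatch(r"(.+?) of (?:the )?(.+)", normalised)
    if match:
        yield f"{match.group(2)} {match.group(1)}"

def rewrite(phrase, known=KNOWN_TERMS):
    """Return the first rewritten form that is a known term, else None."""
    for candidate in candidates(phrase):
        if candidate in known:
            return candidate
    return None
```

For example, `rewrite("embolisms of the lung")` and `rewrite("lung embolisms")` both resolve to the known term `"lung embolism"`, while a phrase with no known rewrite returns `None`.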

Impact of the ontology on term-rewriting: LinkBase only; SNOMED-CT only

The TermModelling algorithm [diagram labels: pulmonaryembolism ??; pulmonary / pulmonaire; embolism / embolie; infarction pulmonaire / infarctus du poumon; C1; lung / poumon; C2; lung embolism / embolie pulmonaire; pulmonary infarction; C3; caption: when more ontological information available]

Domain-entity identification

Meta-entity coding

Simple meta-entity coding

Correctly resolved ambiguity

Semantic info on proving and showing

Resolved ambiguity with coding

Relevance ranking

LB outcome

TeSSI works extremely well, but it is not perfect. Some problem areas:

NP recognition failures

Clause analysis could be better

Coding problems: better code not found BECAUSE OF good NP resolution; best SNOMED CT-code (CT-only set-up)

No CT-code attached

Unresolved ambiguity with correct coding

Unresolved ambiguity with unresolved coding

Ischaemia versus Ischemia

Summary of issues: TeSSI very well accepted by our customers: – “best semantic indexing ever seen” – not necessarily “best buy” (price issue, complexity). Ontology (terminology) is the main driving force. Very advanced NLU algorithms are still too slow for processing large amounts of documents. The underlying ontology changes on a daily basis.

Introduction of new domain-entities: I) Candidate concept is explicitly represented in an external ontology/terminology towards which a mapping must be maintained. II) Polysemous terms for which not all possible meanings are represented yet. III) Reification of a newly introduced relationship expressed by terms in at least one language or necessary for the representation of other concepts. IV) Candidate concept can be expressed by one word or token in at least one language. V) Term found for which no concept exists yet.

Current highlights: Ship TeSSI with the minimal amount of information required to do the best job: – LinkBase extractions – adding third-party information. Rationale: – advanced processing less time-consuming – less expensive. Automatically comparing results of different extractions.

Set-up of the experiment

Five related “ontologies”: Pure LinkBase (with SNOMED-CT coding) – (LB); Pure SNOMED-CT – (CT); LinkBase + SNOMED-CT “loose” (with SNOMED-CT coding) – (LBCT-L); LinkBase + SNOMED-CT “all” (with SNOMED-CT coding) – (LBCT-A); Pure UMLS (January 2003 version) – (UMLS)

LBCT-“loose” [diagram labels: SNOMED-CT : : MENINGITIS OF UNSPECIFIED CAUSE (DISORDER); SNOMED-CT : : UNSPECIFIED MENINGITIS (DISORDER); SNOMED-CT : : MENINGES STRUCTURE (BODY STRUCTURE); SNOMED-CT : : MENINGITIS (DISORDER); SNOMED-CT : : DISORDER OF MENINGES (DISORDER); core entities MENINGES STRUCTURE, MENINGITIS, UNSPECIFIED MENINGITIS, DISORDER OF MENINGES; link types IS_A, SCT FINDING SITE, CCC]

LBCT-“all” [diagram labels: SNOMED-CT : : MENINGITIS OF UNSPECIFIED CAUSE (DISORDER); SNOMED-CT : : UNSPECIFIED MENINGITIS (DISORDER); SNOMED-CT : : MENINGES STRUCTURE (BODY STRUCTURE); SNOMED-CT : : MENINGITIS (DISORDER); SNOMED-CT : : DISORDER OF MENINGES (DISORDER); core entities MENINGES STRUCTURE, MENINGITIS, UNSPECIFIED MENINGITIS, DISORDER OF MENINGES; link types IS_A, SCT FINDING SITE, CCC]

LBCT-L versus LBCT-A. Main difference: – the L-version (virtually) adds to the core-ontology only the CT-specific information that is relevant to the loose concepts (those for which nothing is known in the core-ontology) – the A-version adds the complete CT-structure unless contradicted by the core-ontology.
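
The difference between the two set-ups can be sketched on a toy IS_A structure. Everything below is an assumption made for illustration: the dictionary representation, the function names, the toy data (which deliberately contains one wrong CT link), and in particular the contradiction test, which the slides do not define.

```python
def merge_loose(core, ct):
    """LBCT-L style: take CT parents only for concepts unknown to the core."""
    merged = {concept: set(parents) for concept, parents in core.items()}
    for concept, parents in ct.items():
        if concept not in core:
            merged[concept] = set(parents)
    return merged

def reverse_link(core, child, parent):
    """Assumed contradiction test: the core already states parent IS_A child."""
    return child in core.get(parent, set())

def merge_all(core, ct, contradicts=reverse_link):
    """LBCT-A style: take every CT link unless the core contradicts it."""
    merged = {concept: set(parents) for concept, parents in core.items()}
    for concept, parents in ct.items():
        for parent in parents:
            if not contradicts(core, concept, parent):
                merged.setdefault(concept, set()).add(parent)
    return merged

# Toy IS_A data (concept -> parents); not real LinkBase or SNOMED-CT content.
core = {
    "meningitis": {"disorder of meninges"},
    "disorder of meninges": {"disorder"},
}
ct = {
    "meningitis": {"infectious disease"},
    "unspecified meningitis": {"meningitis"},
    "disorder of meninges": {"meningitis"},  # deliberately wrong link
}
loose = merge_loose(core, ct)
everything = merge_all(core, ct)
```

With this toy data the loose merge only adds "unspecified meningitis", while the full merge also adds the extra parent of "meningitis" and rejects the deliberately wrong link.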

Some counts. Retrieval types: Z: from sentence; P: from terminology; N: not disambiguated. The denser the ontology structure, the more P-type retrieval. LinkBase-based systems recognise the most terms. LinkBase-based systems are best aware of term ambiguity.

The UMLS-problem

Differential analysis principles for SNOMED-coded groups. Look for “unexpected results” only: – manual expert “scanning”: which codes seem strange after reading the report? – automatically: CT-codes attached to various core-entities with different distributions; CT-codes found in one specific group only (LB, CT, ..., based on the selected domain-entity!), and if there are too many, select on statistical significance (> 2 STD from population mean relevance); CT-codes found in at least 2 groups, with further selection on the basis of > 2 STD from population mean relevance within one group, or strongly differing population mean relevance over the various groups.
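
Two of the automatic triggers can be sketched as follows. This is a hedged illustration, not the procedure actually used in the experiment: the function names, the data layout (`{set-up: {code: relevance}}`), the placeholder codes and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def only_in_one_setup(results):
    """Return {code: set-up} for codes retrieved in exactly one set-up."""
    seen = {}
    for setup, codes in results.items():
        for code in codes:
            seen.setdefault(code, []).append(setup)
    return {code: setups[0] for code, setups in seen.items() if len(setups) == 1}

def outliers(relevances, k=2.0):
    """Return {code: z-score} for codes more than k STD from the mean relevance."""
    values = list(relevances.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}
    return {
        code: (rel - mu) / sigma
        for code, rel in relevances.items()
        if abs(rel - mu) > k * sigma
    }

# Placeholder codes and relevance scores, invented for the example.
results = {
    "LB": {"c1": 1.0, "c2": 1.0, "c3": 1.0, "c4": 1.0, "c5": 1.0, "c6": 10.0},
    "CT": {"c1": 1.0, "c7": 2.0},
}
single_setup_codes = only_in_one_setup(results)
lb_outliers = outliers(results["LB"])
```

Here "c7" is flagged because only the CT set-up retrieves it, and "c6" is flagged because its relevance lies more than 2 STD above the mean within the LB set-up.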

Some strange results by manual verification [table columns: STD from mean relevance in set-up x; found in LB / CT / LBCT-L / LBCT-A; vector difference measure between 2 set-ups; retrieval type and possible semantic ambiguity]

Strange distribution triggered by different core-entities found only in LB setup

Found in CT only, but not elsewhere

CT-codes found in 2 of the 4 set-ups [table columns: STD from mean relevance in set-up x; relevance percentage in set-up x; relevance order in set-up x; vector difference measure between 2 set-ups]

Searching for triggers

Unexpected cases in CT/LBCT-A

All statistically derived triggerings

Some numerical results: statistically retrieved triggers dominate when CT is involved

Origin of strange results

homonym not recognised in individual terminologies

Not recognised at all in the SNOMED-CT-only set-up; wrong coding in LinkBase

Conclusions. Most important sources of errors: – lexicon incompleteness – wrong ontological foundations of external terminologies – incomplete mapping from LinkBase to external terminologies. We have good indications that a combination of triggers is a good indicator of mistakes.