Codifying Semantic Information in Medical Questions Using Lexical Sources Paul E. Pancoast Arthur B. Smith Chi-Ren Shyu.

Presentation transcript:

Codifying Semantic Information in Medical Questions Using Lexical Sources Paul E. Pancoast Arthur B. Smith Chi-Ren Shyu

Research Purpose
- To find a method for classifying medical questions that are asked by clinicians
- Hypothesis: simply indexing by keywords isn’t enough to distinguish questions with different meanings but similar wording, or to group questions with similar meanings but different words.

Definitions
- Semantic Information: the meaning of the words
- Syntactic Information: the parts of speech of the words (word type, sentence part)
- Medical Questions: questions asked by a clinician
- Lexical Sources: sources of words and vocabularies
- UMLS: Unified Medical Language System

UMLS
- Ambitious project of the National Library of Medicine, begun in 1986
- Helps researchers retrieve and integrate electronic biomedical information from a variety of sources
- Links over 100 controlled vocabularies
- Assigns unique identifiers to medical concepts and strings
- Maps the hierarchical relationships between the medical concepts

Why Bother? (To classify medical questions?)
- Clinicians have questions when treating patients
- Researchers have gathered collections of these questions
- No good method exists to classify the questions
- How many times has a particular question been asked?
- Which questions should receive priority for evidence-based answers?

Examples
- What is the best way to treat acute pharyngitis?
- How should I approach a patient with a sore throat?
- What should I do with a patient with diabetes and insulin resistance?
- What should I do with a patient with diabetes who is resistant to taking insulin?

Methods: Source Questions
- American researcher: observed clinicians at work
- British researchers: questions sent in by clinicians, answered by researchers
- Australian researchers: questions sent in by clinicians, answered by researchers
- 4083 total questions

Methods: Source Vocabulary
- MRCON: a table from the Metathesaurus
  - Lists the medical concepts by unique identifier (CUI) and each string associated with a concept
  - Unique (string => 1 concept)
  - Ambiguous (string => 2+ concepts)
    - COLD: ambient temperature, viral respiratory infection, chronic obstructive lung disease
  - 2,247,454 strings associated with concepts
- Non-medical lexicon: from Roget’s Thesaurus
  - Query objects (why, when, how), identifiers (I, you, he), modifiers (soon, frequently)
  - 749 terms in this lexicon
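
The unique/ambiguous distinction above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the rows and the concept identifiers are hypothetical placeholders (not real UMLS CUIs), and `split_by_ambiguity` is a name introduced here.

```python
from collections import defaultdict

# Hypothetical (string, concept-id) rows in the spirit of MRCON.
# The identifiers are placeholders, not real UMLS CUIs.
ROWS = [
    ("cold", "CUI_AMBIENT_TEMPERATURE"),
    ("cold", "CUI_VIRAL_RESPIRATORY_INFECTION"),
    ("cold", "CUI_CHRONIC_OBSTRUCTIVE_LUNG_DISEASE"),
    ("pharyngitis", "CUI_PHARYNGITIS"),
    ("diabetes", "CUI_DIABETES"),
]

def split_by_ambiguity(rows):
    """Group strings by how many distinct concepts they point to."""
    concepts = defaultdict(set)
    for string, cui in rows:
        concepts[string.lower()].add(cui)
    unique = {s for s, cuis in concepts.items() if len(cuis) == 1}
    ambiguous = {s for s, cuis in concepts.items() if len(cuis) > 1}
    return unique, ambiguous

unique, ambiguous = split_by_ambiguity(ROWS)
print("unique:", sorted(unique))
print("ambiguous:", sorted(ambiguous))
```

A string lands in the ambiguous set as soon as it is associated with two or more concepts, which is the case for COLD in the slide's example.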

String Matching
- Parsing program (written in C)
- Separates individual questions into 3-word, 2-word, 1-word windows
- Matches the window against MRCON and our lexicon
- Generates a report of:
  - Total number of words parsed
  - Number of matches from unique, ambiguous, non-medical lists
  - Strings that didn’t match any of the lists
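
The windowing step can be illustrated with a short Python sketch (the original parser was written in C and is not shown in the slides). Several things here are assumptions: the longest-window-first strategy, the tokenizer, the tiny lexicons, and the name `match_question` are all introduced for illustration only.

```python
import re
from collections import Counter

def match_question(question, lexicons, max_window=3):
    """Try 3-, then 2-, then 1-word windows at each position (assumed strategy).

    lexicons maps a list name (e.g. "unique") to a set of lowercase strings.
    Returns (total words, matched-word counts per list, unmatched words).
    """
    words = re.findall(r"[a-z0-9'-]+", question.lower())
    counts = Counter()
    unmatched = []
    i = 0
    while i < len(words):
        for size in range(min(max_window, len(words) - i), 0, -1):
            window = " ".join(words[i:i + size])
            label = next((name for name, terms in lexicons.items()
                          if window in terms), None)
            if label is not None:
                counts[label] += size   # count every word in the matched window
                i += size
                break
        else:
            # No window starting at this word matched any list.
            unmatched.append(words[i])
            i += 1
    return len(words), counts, unmatched

# Hypothetical miniature lexicons standing in for MRCON and the non-medical list.
LEXICONS = {
    "unique": {"acute pharyngitis"},
    "ambiguous": {"cold"},
    "non_medical": {"what", "is", "the", "best", "to"},
}

total, counts, unmatched = match_question(
    "What is the best way to treat acute pharyngitis?", LEXICONS)
print(total, dict(counts), unmatched)
```

Counting matched words rather than matched strings mirrors the slide's distinction between hits and words, since some strings contain more than one word.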

Results
- String: individual word or words that matched
- Hits: how often the string was found
- Words: total number of matching words (some strings have more than one word in them)

Source | Strings | Hits | Words | % match
MRCON Unique | 4,534 | 24,844 | 30, | %
MRCON Ambiguous | 574 | 9,256 | 9, | %
Non-medical | 208 | 16,768 | 17, | %
Unmatched | 2,321 | 13, | %

Results
- 100 strings occurred 7850 times, or 57.6% of the total matches
- 712 strings => 3+ hits, 85% of all hits
- Our focus was on strings that didn’t match one of the source vocabularies
  - 19.1% didn’t match
  - Hypothesis that additional terms not found in MRCON will be important for indexing

Results
- Unmatched words: 2+ occurrences
- Table columns: Unique words, Total Number, Percent
- Table rows: Verb, Noun, Preposition, Adj/Adv/Conj, Mix*, Pronoun, Integer (values not preserved)
- * Can be more than one word type, depending on the context. Attacks, step, process all can be nouns or verbs.

Discussion
- MRCON: selected because of low rate of ambiguous string-CUI combinations
  - 89% unique string matches
  - 11% ambiguous string matches
- Other tables have greater word coverage, but have more ambiguity for each of the words
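
The 89% / 11% split appears consistent with the string counts in the results table. The check below assumes that 4,534 and 574 are the numbers of distinct matched strings for the MRCON unique and ambiguous lists; that reading of the flattened table is an assumption, not something the slides state outright.

```python
# Assumed to be distinct matched strings from the results table.
unique_strings = 4534
ambiguous_strings = 574

total_mrcon_strings = unique_strings + ambiguous_strings
unique_share = 100 * unique_strings / total_mrcon_strings
ambiguous_share = 100 * ambiguous_strings / total_mrcon_strings

print(f"{unique_share:.1f}% unique, {ambiguous_share:.1f}% ambiguous")
```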

Discussion
- Our word-matching results were similar to those of other researchers
- Cimino matched 43% of words with Meta-1 (we had 56% MRCON matches)
  - Computers & Biomedical Research. Aug 1992;25(4):
- Hersh matched 60% of words to medical terminology & names dictionary (we had 79% combined lexicon matches)
  - Proceedings/AMIA Annual Fall Symposium. p

Discussion
- Stop words: commonly removed by most normalization tools. Prepositions, conjunctions, pronouns
- Provide valuable contextual information.
  - Blood FOR an HIV-positive patient
  - Blood FROM an HIV-positive patient
  - Aspirin AND warfarin
  - Aspirin OR warfarin
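
A small Python sketch shows how stop-word removal can erase exactly the distinctions listed above. The stop-word set and the helper `strip_stop_words` are illustrative assumptions, not the list used by any particular normalization tool.

```python
# Illustrative stop-word list (an assumption, not a specific tool's list).
STOP_WORDS = {"a", "an", "the", "for", "from", "and", "or"}

def strip_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

blood_for = strip_stop_words("Blood for an HIV-positive patient")
blood_from = strip_stop_words("Blood from an HIV-positive patient")
both_drugs = strip_stop_words("Aspirin and warfarin")
either_drug = strip_stop_words("Aspirin or warfarin")

# After stripping, each pair becomes indistinguishable.
print(blood_for == blood_from)
print(both_drugs == either_drug)
```

Once the prepositions and conjunctions are gone, the two blood questions reduce to the same keyword list, as do the two drug phrases, even though their clinical meanings differ.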

Discussion
- Integers
  - 186 distinct integers or integer-word combinations
  - Occurred 647 times
- Additional modification of concepts
  - Hyperkalemia: 5.3 mEq/L & 8.7 mEq/L
  - Both are hyperkalemia, but the evaluation and management are markedly different

Discussion
- Verbs: largest category of unmatched words
- Include action and relation concepts
- Non-medical lexicon contained some
  - Treats, attends, increases, lessens, reduce, follows, starts, can, should, is, equal, improve
- Verb tense changes the meaning of a question
  - In a patient TAKING antibiotics
  - In a patient who TOOK antibiotics

Discussion
- Verbs may be conceptually related to medical concepts
  - Diagnose => Diagnosis
  - Treat => Treatment
  - Evaluate => Evaluation
  - Prescribe => Prescription
- In these cases the verb (relationship) is not equivalent to the noun (concept)

Summary
- We developed an application to
  - Parse individual words from collections of medical questions
  - Match the words (phrases) with lexical sources, codified by the UMLS
- Our results were better than those of previous investigators (for percentage of matched words)
- We still have some work to do.

Related Experiments
- We attempted to cluster questions by sequences of semantic types
- Initial attempts mostly clustered common phrases such as “How should I” and “What is the”
- We may repeat this method after discarding ‘stop phrases’
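
One way to see why common openers dominate is to count n-grams over type sequences. The sketch below is a guess at the general idea, not the authors' clustering method: the type labels, the tagged questions, and `type_ngrams` are all hypothetical.

```python
from collections import Counter

def type_ngrams(sequence, n=3):
    """Return every run of n consecutive type labels in a tagged question."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

# Hypothetical type sequences; the labels are placeholders, not UMLS semantic types.
TAGGED_QUESTIONS = [
    ["query", "modal", "identifier", "verb", "disease"],   # "How should I treat X"
    ["query", "modal", "identifier", "verb", "finding"],   # "How should I approach Y"
    ["query", "verb", "article", "modifier", "treatment"], # "What is the best Z"
]

counts = Counter(gram for seq in TAGGED_QUESTIONS for gram in type_ngrams(seq))
for gram, count in counts.most_common(3):
    print(count, gram)
```

In this toy data the most frequent sequences come from the shared question opener rather than from the medical content, which matches the slide's observation and motivates discarding 'stop phrases' first.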

Future Work
- Family Practice Inquiries Network (FPIN) has 200 questions that have associated MeSH terms manually assigned by librarians.
- We will look at these question-term groups for clustering purposes (with the hypothesis that they will not make distinct clusters).

Future Work
- I will work with researchers at NLM to:
  - Apply MetaMap to medical questions
  - Extract triplets (Medical Concept-Allowable Relation-Medical Concept) from questions, e.g. Drug-treats-Disease
  - Insert the triplets into a vector-space model and look for clusters
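
A minimal sketch of the vector-space idea, assuming each question is represented as a bag of triplets and compared by cosine similarity. The triplets below are invented for illustration (they are not MetaMap output), and `vectorize` and `cosine` are names introduced here.

```python
import math
from collections import Counter

def vectorize(triplets):
    """Represent a question as counts over (concept, relation, concept) triplets."""
    return Counter(triplets)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical triplets, written by hand for this example.
q1 = vectorize([("antibiotic", "treats", "pharyngitis")])
q2 = vectorize([("antibiotic", "treats", "pharyngitis"),
                ("pharyngitis", "causes", "sore throat")])
q3 = vectorize([("insulin", "treats", "diabetes")])

print(round(cosine(q1, q2), 3))  # share one triplet
print(cosine(q1, q3))            # share none
```

Questions sharing a triplet score above zero and unrelated questions score zero, so a clustering step could group questions by these similarities.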

Thank-you!! ???