A Knowledge-based Approach to Retrieve Scenario Specific Free-text in a Medical Digital Library Wesley W. Chu Computer Science Dept, UCLA firstname.lastname@example.org
2 NIH Program Project Grant A 5-year, $10M joint interdisciplinary project between Medical School and CS faculty. Project 1: teleradiology infrastructure. Project 2: neuroradiology workstation. Project 3: multimedia information architecture. Project 4: natural language processing for medical reports. Project 5: medical digital library.
3 Project 5 Personnel Graduate students: Victor Z. Liu, Wenlei Mao, Qinghua Zou. Consultants: Hooshang Kangarloo, M.D.; Denise Aberle, M.D. Project leader: Wesley W. Chu.
4 Data in a Medical Digital Library Structured data (patient lab data, demographic data, …): CoBase. Images (X-rays, MRI, CT scans): KMeD. Free text: patient reports, teaching files, literature, news articles.
5 System Overview Document sources: patient reports, medical literature, teaching materials, news articles. Inputs to the Medical Digital Library (MDL): an ad-hoc query, or a patient report for content correlation. Output: query results.
6 A Sample Patient Report … Tissue Source: LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE) … FINAL DIAGNOSIS: - LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION): - LUNG CANCER, SMALL CELL, STAGE II. …
7 Scenario Specific Retrieval Patient report: … Tissue Source: LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE) … FINAL DIAGNOSIS: - LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION): - LUNG CANCER, SMALL CELL, STAGE II. … Scenario "How to treat the disease": retrieve treatment-related articles. Scenario "How to diagnose the disease": retrieve diagnosis-related articles.
8 Challenge I: Indexing Extracting domain-specific key concepts in the free text for indexing. Free text: "Lung cancer, small cell, stage II". Concept term in the knowledge source: "stage II small cell lung cancer". Conventional methods use NLP: not scalable; cannot adapt to various forms of word permutation.
9 Challenge II: Terms used in the query are too general Expanding the general terms in the query to the specific terms that are used in the document. Original query: lung cancer, diagnosis options. Document: … the effectiveness of chest x-ray and bronchography on patients with lung cancer … The general term "diagnosis options" does not appear in the document, so the original query does not match it well. Expanded query: lung cancer, chest x-ray, bronchography, … (matches the document).
10 Challenge III: Mismatch between terms used in the query and the documents Example query: … lung cancer, … Documents that may be relevant but do not contain the query term: Document 1: … lung carcinoma … Document 2: … lung neoplasm … Document 3: anti-cancer drug combinations …
11 Challenge I: Indexing Challenge II: Terms in the query are too general Challenge III: Mismatch between terms in the query and the documents
12 IndexFinder: Extracting domain-specific key concepts Technique: Permute words from the text to generate concept candidates. Use the knowledge base to select the valid candidates. Problem: Valid candidates may be irrelevant to specific domain indexing.
13 Eliminating irrelevant concepts Syntactic filter: Limit permutation of words within a sentence. Semantic filter: Use the semantic type (e.g. body part, disease, treatment, diagnosis) to filter out irrelevant concepts Use ISA relationship to filter out general concepts and yield specific concepts.
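The two semantic filters on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SEMANTIC_TYPE` and `ISA_PARENT` tables are small hypothetical stand-ins for knowledge-base lookups, and the function names are not taken from IndexFinder itself.

```python
# Hypothetical semantic types for a few candidate concepts (not real UMLS lookups).
SEMANTIC_TYPE = {
    "lung cancer": "Disease or Syndrome",
    "small cell lung cancer": "Disease or Syndrome",
    "fine needle aspiration": "Diagnostic Procedure",
    "left": "Spatial Concept",
}

# Hypothetical ISA links, stored as child -> parent.
ISA_PARENT = {"small cell lung cancer": "lung cancer"}


def ancestors(concept):
    """Return every concept reachable by following ISA links upward."""
    seen = set()
    while concept in ISA_PARENT and ISA_PARENT[concept] not in seen:
        concept = ISA_PARENT[concept]
        seen.add(concept)
    return seen


def filter_concepts(candidates, wanted_types):
    """Keep candidates of a wanted semantic type, then drop general concepts."""
    # Semantic-type filter: discard concepts whose type is not of interest.
    typed = [c for c in candidates if SEMANTIC_TYPE.get(c) in wanted_types]
    # ISA filter: a concept is "general" if a more specific candidate descends from it.
    general = set()
    for c in typed:
        general |= ancestors(c)
    return [c for c in typed if c not in general]
```

With the toy tables above, filtering the candidates "lung cancer", "small cell lung cancer", "left" and "fine needle aspiration" for diseases and diagnostic procedures keeps only the specific disease and the procedure.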
14 IndexFinder Performance Two orders of magnitude faster than conventional approaches: no NLP; the knowledge base (UMLS) and index files reside in main memory; time complexity is linear in the number of distinct words in the text. Preliminary evaluation: IndexFinder generates 4% more concepts than conventional approaches (using a single noun phrase); all concepts are relevant.
15 Challenge I: Indexing Challenge II: Terms in the query are too general Challenge III: Mismatch between terms in the query and the documents
16 Query Expansion (QE) Queries in the following form benefit from expansion: a key concept (e.g. lung cancer) + a general term (e.g. diagnosis options). Expansion yields: the key concept (e.g. lung cancer) + specific supporting concepts (e.g. chest x-ray, bronchography).
17 Traditional QE Appends all terms that statistically co-occur with the key terms in the query Not semantically focused Original Query: lung cancer, diagnosis options expansion Expanded Query: lung cancer, radiotherapy, chemotherapy, antineoplastic agents, survival rate
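A minimal sketch of statistical co-occurrence expansion, to make the lack of semantic focus concrete. The tiny corpus, the function name `cooccurrence_expand`, and the choice of raw document-level counts are all assumptions made for illustration; real systems typically use weighted co-occurrence statistics.

```python
from collections import Counter

# Hypothetical corpus: each document is reduced to the terms it mentions.
DOCS = [
    ("lung cancer", "chemotherapy", "survival rate"),
    ("lung cancer", "chemotherapy", "radiotherapy"),
    ("lung cancer", "chest x-ray"),
    ("asthma", "chest x-ray"),
]


def cooccurrence_expand(query_terms, docs, k):
    """Append the k terms that most often share a document with any query term."""
    counts = Counter()
    for doc in docs:
        if any(q in doc for q in query_terms):
            counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [term for term, _ in counts.most_common(k)]
```

On this toy corpus, expanding "lung cancer" pulls in "chemotherapy" first, because it co-occurs most often, regardless of whether the user asked about diagnosis or treatment.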
18 Knowledge-based QE Knowledge source: UMLS (by the NLM). Metathesaurus (concepts): the key concept, e.g. "lung cancer", and specific supporting concepts, e.g. "chest x-ray", linked by the relation "diagnoses". Semantic Network (Semantic Types, each a class of concepts that belong to that type): Disease or Syndrome, Diagnostic Procedure, Sign or Symptom, Pharmacologic Substance, Body Parts, Injury or Poisoning. At the Semantic Type level, Diagnostic Procedure "diagnoses" Disease or Syndrome.
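One way to read this slide is that a scenario selects a relation in the Semantic Network, and only concepts whose semantic type stands in that relation to the key concept's type are appended. The sketch below follows that reading. The tables are small hypothetical excerpts rather than UMLS lookups, the scenario-to-relation mapping is an assumption, and the list of related concepts is assumed to come from some earlier step.

```python
# Hypothetical concept -> semantic type table.
SEMANTIC_TYPE = {
    "lung cancer": "Disease or Syndrome",
    "chest x-ray": "Diagnostic Procedure",
    "bronchography": "Diagnostic Procedure",
    "chemotherapy": "Therapeutic or Preventive Procedure",
    "survival rate": "Quantitative Concept",
}

# Hypothetical excerpt of semantic-network relations: (source type, relation, target type).
RELATIONS = {
    ("Diagnostic Procedure", "diagnoses", "Disease or Syndrome"),
    ("Therapeutic or Preventive Procedure", "treats", "Disease or Syndrome"),
}

# Assumed mapping from the general query term to a relation name.
SCENARIO_RELATION = {"diagnosis options": "diagnoses", "treatment options": "treats"}


def expand(key_concept, scenario, related_concepts):
    """Replace the general scenario term with concepts that fit the scenario's relation."""
    relation = SCENARIO_RELATION[scenario]
    key_type = SEMANTIC_TYPE[key_concept]
    specific = [
        c for c in related_concepts
        if (SEMANTIC_TYPE.get(c), relation, key_type) in RELATIONS
    ]
    return [key_concept] + specific
```

For the query "lung cancer, diagnosis options", only the diagnostic procedures survive; "chemotherapy" and "survival rate" are dropped because their types do not "diagnose" a disease.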
19 Challenge I: Indexing Challenge II: Terms in the query are too general Challenge III: Mismatch between terms in the query and the documents
26 Conclusion A knowledge-based (UMLS) approach provides scenario-specific medical free-text retrieval. IndexFinder: uses word permutation together with syntactic and semantic filtering to extract domain-specific key concepts from the free text for indexing. Knowledge-based query expansion: transforms general terms in the query into the scenario-specific terms used in the documents, giving the query a higher probability of matching the relevant documents. Phrase-based indexing: moves document indexing to a phrase paradigm (a concept and its word stems) to improve retrieval effectiveness.
27 Acknowledgement This research is supported in part by NIC/NIH Grant#4442511-33780
28 Indexing of free text The problem: extract key terms from free text and represent them as standard concept terms (e.g. UMLS concepts). Clinical text: "Prostate, right (biopsy) - fibromuscular and glandular hyperplasia". Concepts >> concept types: C0194804: biopsy prostate >> T060: Diagnostic Procedure; C0033577: prostate hyperplasia >> T046: Pathologic Function; C0035621: right >> T080: Qualitative Concept; C0259776: hyperplasia fibromuscular >> T046: Pathologic Function; C0334000: hyperplasia glandular >> T046: Pathologic Function.
29 Extracting domain-specific key concepts Conventional approach Use NLP to discover noun phrases. Map each noun phrase into concepts. Problems A concept that is contained in a noun phrase will not be discovered. Difficult to scale to large text.
30 Generate concept candidates from free text Sort the concept terms (phrases) in the knowledge base (UMLS) by their length and assign each phrase a unique ID. Create an inverted index for the words used in the phrases; each word has a list of phrase IDs. To generate concept candidates: Remove duplicate words from the text. Using the list of phrase IDs for each word, count the occurrences of each phrase ID. The phrases whose ID counts equal their phrase lengths are the concept candidates.
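The steps on this slide can be sketched as follows. This is a minimal illustration, not the IndexFinder implementation: the phrase list is a hypothetical stand-in for the knowledge base, phrase length is taken as the number of distinct words, and the simple lowercase tokenizer is an assumption.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the knowledge-base phrases (not real UMLS content).
PHRASES = [
    "stage ii small cell lung cancer",
    "lung cancer",
    "lung nodule",
    "small cell lung cancer",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(phrases):
    """Sort phrases by length, assign IDs, and build word -> phrase-ID lists."""
    by_id = {}
    inverted = defaultdict(set)
    ordered = sorted(phrases, key=lambda p: len(tokenize(p)))
    for pid, phrase in enumerate(ordered):
        words = set(tokenize(phrase))
        by_id[pid] = (phrase, len(words))
        for word in words:
            inverted[word].add(pid)
    return by_id, inverted


def concept_candidates(text, by_id, inverted):
    """Return phrases whose words all occur in the text, in any order."""
    hits = Counter()
    for word in set(tokenize(text)):      # duplicate words are removed first
        for pid in inverted.get(word, ()):
            hits[pid] += 1                # count occurrences of each phrase ID
    # A phrase is a candidate when its count equals its length.
    return sorted(by_id[pid][0] for pid, n in hits.items() if n == by_id[pid][1])
```

Because matching is by word set rather than word order, the text "Lung cancer, small cell, stage II" yields "stage ii small cell lung cancer" as a candidate even though the words are permuted, and the work done is proportional to the number of distinct words in the text.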
31 Demo http://fargo.cs.ucla.edu/umls/search.aspx Test text: Technically successful left lower lobe nodule biopsy. Preliminary localization CT images again demonstrate a left lower lobe nodule adjacent to the posterior segmental bronchus. CT scans obtained during biopsy demonstrate the coaxial cannula adjacent to the proximal aspect of the nodule. Surrounding pulmonary parenchymal hemorrhage as a result of the biopsy is also noted. There may be a tiny left apical air collection in the pleural space lateral to the apical bulla. Formal cytologic evaluation of the withdrawn specimen is pending at this time, although abnormal appearing "spindle" cells were identified during on-site cytopathologic evaluation of specimen adequacy.
34 Traditional QE Statistical-based Any terms that statistically co-occur with the original query terms are appended Not semantically focused May expand terms irrelevant to the “treatment” of “lung cancer” e.g. “survival,” “survival rate,” …
35 Document Retrieval Find free-text documents to answer queries like: “Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation.” “Cerebral edema secondary to infection, diagnosis and treatment.”
36 Vector Space Model (VSM) Words as terms. [Figure: document vector d and query vector q plotted on the term axes "Leukocytosis" and "Hyperthermia".]
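A short sketch of the word-based vector space model: each text becomes a vector of word counts, and a document is scored by the cosine of the angle between its vector and the query vector. Raw term frequency is an assumption made here for brevity (the slides do not state the weighting scheme), and the function names are illustrative.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse vector of word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, the query "hyperthermia leukocytosis" scores above zero against a document containing both words, but scores exactly zero against a document that only says "fever", since word-level terms share nothing between them.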
37 Stem-based VSM Morphological variants bear similar content E.g., “edema” and “edemas” Use stemmer to extract stems Lovins stemmer and Porter stemmer Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”… Baseline of comparison
38 Shortcomings of Stem-based VSM Inability to capture multi-word concepts 1. “Increased intracranial pressure” Inability to utilize the relations between concepts: 2. Synonyms: “hyperthermia” and “fever” 3. IS-A relation: “hyperthermia” and “body temperature elevation”
39 Concept-based VSM Uses concepts in knowledge base (KB) as terms KB: Metathesaurus in UMLS Captures multi-word concepts Captures synonyms Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… CUIs: (C0015967), (C0023518), (C0151740)…
40 Shortcomings of Concept-based VSM Concepts may be related: E.g. “hyperthermia” and “body temperature elevation” are not identical but related concepts Need to quantify conceptual relations Knowledge bases are often incomplete, which reduces the retrieval effectiveness
41 Shortcomings of Concept-based VSM (cont’d) Concepts may be related: the conceptual similarity measure, s(c_i, c_j), quantifies relations between concepts. Knowledge bases are often incomplete, which reduces the retrieval effectiveness.
42 Incompleteness of the Knowledge Bases Missing concepts in the KB, e.g., “Infiltrative small bowel process”: (), (C0021852), (). Missing links between related concepts, e.g., (cerebral edema) and (cerebral lesion). In general, concept-based VSM cannot outperform stem-based VSM.
43 To Compare Retrieval Effectiveness The test set: OHSUMED 106 queries, 14K documents Expert relevance judgment: R or N Retrieval effectiveness: Recall – the percentage of relevant documents retrieved so far Precision – the percentage of retrieved documents that are relevant
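The two measures can be computed at any cutoff in a ranked result list. The sketch below is a generic illustration; the function name and the toy document IDs are assumptions, not part of the OHSUMED evaluation itself.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall after the top k documents of a ranked list."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant documents retrieved so far.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Sweeping k from 1 to the length of the ranked list gives the points of a precision-recall curve, which is a common way to compare two retrieval methods on the same query set.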
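44 Evaluation of Phrase-based Document Similarity The similarity between a query phrase p_q and a document phrase p_d has two contributions: one due to the conceptual similarity s(c_i, c_j) between concepts in p_q and p_d, and one due to the stem overlap in p_q and p_d.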