Free-text Medical Document Retrieval via Phrase-based Vector Space Model. Wenlei Mao, MS and Wesley W. Chu, PhD, Computer Science Department, University of California, Los Angeles
11/9-13/2002 AMIA. Outline: Vector space model (VSM) in document retrieval; Stem-based VSM; Concept-based VSM; Conceptual similarity; Phrase-based VSM; Retrieval effectiveness comparison; Conclusion
Document Retrieval: Find free-text documents to answer queries like “Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation. Cerebral edema secondary to infection, diagnosis and treatment.”
Vector Space Model (VSM): Words as terms. [Figure: document vector d and query vector q drawn against the term axes “Leukocytosis” and “Hyperthermia”.]
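The vector picture on this slide can be sketched in a few lines of Python. The slide does not state which similarity measure is used, so the cosine of the angle between d and q is an assumption here, as are the raw term-frequency weights and the function name `cosine_similarity`.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two raw term-frequency vectors.

    Assumption: cosine similarity with unweighted term counts; the slide
    only shows d and q as vectors over word axes.
    """
    q = Counter(query_terms)
    d = Counter(doc_terms)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


query = ["hyperthermia", "leukocytosis"]
doc = ["hyperthermia", "edema"]
print(cosine_similarity(query, doc))  # roughly 0.5: one shared term out of two
```

A document that shares no word with the query scores 0 under this sketch, which is the limitation the later slides address.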
Stem-based VSM: Morphological variants bear similar content, e.g., “edema” and “edemas”. Use a stemmer to extract stems: Lovins stemmer and Porter stemmer. Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”… Stem-based VSM is the baseline of comparison.
Shortcomings of Stem-based VSM: 1. Inability to capture multi-word concepts, e.g., “increased intracranial pressure”. Inability to utilize the relations between concepts: 2. Synonyms: “hyperthermia” and “fever”. 3. IS-A relation: “hyperthermia” and “body temperature elevation”.
Concept-based VSM: Uses concepts in a knowledge base (KB) as terms. KB: Metathesaurus in UMLS. Captures multi-word concepts. Captures synonyms. Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… CUIs: (C ), (C ), (C )…
Shortcomings of Concept-based VSM: Concepts may be related, e.g., “hyperthermia” and “body temperature elevation” are not identical but related concepts, so conceptual relations need to be quantified. Knowledge bases are often incomplete, which reduces retrieval effectiveness.
Conceptual Similarity Evaluation. [Figure: hypernym hierarchy with nodes c1, c2, c3, c4; labels: Body temperature elevation, Hyperthermia, Disease, Animal disease.] Node distance: d(c3,c4) = 1. Descendant count: D(c3) = 2, D(c4) = 0.
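The two quantities named on this slide, node distance and descendant count, can be computed as in the sketch below. The arrangement of c1 to c4 in the figure is not recoverable from the text, so the `CHILDREN` map is a hypothetical hierarchy chosen only to reproduce the three values the slide reports; the function names are likewise illustrative.

```python
from collections import deque

# Hypothetical hierarchy (parent -> children), chosen so that
# d(c3, c4) = 1, D(c3) = 2 and D(c4) = 0 as stated on the slide.
CHILDREN = {"c1": ["c3"], "c3": ["c2", "c4"], "c2": [], "c4": []}


def descendant_count(node):
    """Number of nodes strictly below `node` in the hierarchy."""
    return sum(1 + descendant_count(child) for child in CHILDREN[node])


def node_distance(a, b):
    """Number of links on the shortest path between a and b.

    Links are treated as undirected, so siblings are two links apart.
    """
    neighbors = {node: set(kids) for node, kids in CHILDREN.items()}
    for parent, kids in CHILDREN.items():
        for kid in kids:
            neighbors[kid].add(parent)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


print(node_distance("c3", "c4"), descendant_count("c3"), descendant_count("c4"))
```

How these two quantities are combined into s(c_i, c_j) is not shown in the surviving slide text, so the sketch stops at computing them.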
Deriving Conceptual Similarity From Hypernym Hierarchy. [Figure: hypernym hierarchy with nodes c1, c2, c3, c4; labels: Body temperature elevation, Hyperthermia, Disease, Animal disease.]
Shortcomings of Concept-based VSM: Concepts may be related: the conceptual similarity measure, s(c_i, c_j), quantifies relations between concepts. Knowledge bases are often incomplete, which reduces retrieval effectiveness.
Incompleteness of the Knowledge Bases: Missing concepts in KB, e.g., “Infiltrative small bowel process”: (), (C ), (). Missing links between related concepts, e.g., (cerebral edema) and (cerebral lesion). In general, concept-based VSM cannot outperform stem-based VSM.
Phrase-based Indexing Examples: “Infiltrative small bowel process”: [(); “infiltr”] [(C ); “smal”, “bowel”] [(); “proces”]. Query: “Cerebral edema”: [(C ); “cerebr”, “edem”]. Document: “Cerebral lesion”: [(C ); “cerebr”, “lesion”]. Query: “Hyperthermia, leukocytosis, increased intracranial pressure…” Phrases: [(C ); “hypertherm”] [(C ); “leukocytos”] [(C ); “increas”, “intracran”, “pressur”]…
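A phrase in these examples pairs an optional concept with its word stems. One possible in-memory representation is sketched below. The class name `Phrase`, its field names, and the identifier `CUI_SMALL_BOWEL` are hypothetical; the actual CUI values are elided in the slide text and are not filled in here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Phrase:
    """One indexing term: a concept (if the KB has one) plus its stems."""
    concept_id: Optional[str]  # None when the KB has no matching concept
    stems: Tuple[str, ...]


# "Infiltrative small bowel process": only "small bowel" maps to a concept.
# "CUI_SMALL_BOWEL" is a placeholder, not a real UMLS identifier.
phrases = [
    Phrase(None, ("infiltr",)),
    Phrase("CUI_SMALL_BOWEL", ("smal", "bowel")),
    Phrase(None, ("proces",)),
]

# Stems stay available even where the concept is missing.
all_stems = [stem for phrase in phrases for stem in phrase.stems]
print(all_stems)
```

Keeping the stems alongside the concept is what lets the words missing from the KB ("infiltr", "proces") still take part in matching.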
Evaluate Phrase-based Document Similarity: The similarity has two contributions: one due to the conceptual similarity s(c_i, c_j) between concepts in p_q and p_d, and one due to the stem overlap in p_q and p_d.
To Compare Retrieval Effectiveness: The test set: OHSUMED, 106 queries, 14K documents. Expert relevance judgment: R or N. Retrieval effectiveness: Recall: the percentage of relevant documents retrieved so far. Precision: the percentage of retrieved documents that are relevant.
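The two measures defined on this slide can be computed at any cutoff in a ranked result list, as in the minimal sketch below. The function name, the cutoff parameter `k`, and the document identifiers are illustrative and not taken from OHSUMED.

```python
def precision_recall(ranked_doc_ids, relevant_doc_ids, k):
    """Precision and recall after the top-k retrieved documents."""
    retrieved = ranked_doc_ids[:k]
    relevant = set(relevant_doc_ids)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents retrieved so far.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ranked = ["d3", "d7", "d1", "d9"]   # system output, best first
relevant = ["d3", "d1", "d5"]       # documents judged R by the experts
for k in range(1, len(ranked) + 1):
    print(k, precision_recall(ranked, relevant, k))
```

Sweeping `k` over the ranking gives the precision-recall points on which two retrieval models can be compared.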
Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS). [Figure: retrieval effectiveness plot; surviving annotations: “16%”, “100 queries”, “vs.”, “5%”, “50 queries”.]
Stem and Concept Similarity Contribution Weights: f_c: similarity contribution weight for concepts. f_s: similarity contribution weight for stems.
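One plausible reading of how the two weights enter a phrase-to-phrase similarity is sketched below. It follows the max rule on the model-comparison slide of this deck, max(s(c1,c2), stem overlap in p1 and p2), and scales each side by its weight. The placement of the weights, the Jaccard definition of stem overlap, and the value 0.8 are all assumptions, not the authors' stated formula.

```python
def stem_overlap(stems_a, stems_b):
    """Jaccard overlap of two stem sets (an assumed overlap measure)."""
    a, b = set(stems_a), set(stems_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def phrase_similarity(stems_a, stems_b, concept_sim, f_s=1.0, f_c=1.0):
    """Larger of the weighted concept and stem contributions.

    concept_sim is s(c_i, c_j); pass 0.0 when the KB has no link
    between the two concepts or a concept is missing.
    """
    return max(f_c * concept_sim, f_s * stem_overlap(stems_a, stems_b))


edema = ("cerebr", "edem")
lesion = ("cerebr", "lesion")
# KB link missing: the shared stem "cerebr" still yields a nonzero score.
print(phrase_similarity(edema, lesion, concept_sim=0.0))
# KB link present (0.8 is a made-up value): the concept side dominates.
print(phrase_similarity(edema, lesion, concept_sim=0.8))
```

Setting `f_c` to 0 reduces this sketch to stem matching only, and setting `f_s` to 0 reduces it to concept matching only.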
Sensitivity of Retrieval Effectiveness to f_s and f_c. [Figure: plot with labels “Stems”, “Concepts”, and “Optimal region”.]
Computational Complexity Using Phrase-based VSM: Data reorganization: build separate indexes on stems and concepts; for each concept c_i, keep a list of related concepts c_j and the conceptual similarity s(c_i, c_j). The time complexities of document similarity calculation are of the same order of magnitude for stem-based VSM and phrase-based VSM.
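The data reorganization described on this slide can be sketched as two inverted indexes plus a per-concept list of related concepts. All names below (`build_indexes`, `candidate_docs`, `RELATED`, the concept identifiers, and the value 0.5) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build separate inverted indexes on stems and on concepts.

    docs maps doc_id -> list of (concept_id or None, [stems]) phrases.
    """
    stem_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, doc_phrases in docs.items():
        for concept_id, stems in doc_phrases:
            if concept_id is not None:
                concept_index[concept_id].add(doc_id)
            for stem in stems:
                stem_index[stem].add(doc_id)
    return stem_index, concept_index


# For each concept c_i: the related concepts c_j with s(c_i, c_j).
RELATED = {"C_EDEMA": [("C_LESION", 0.5)]}  # placeholder ids and value


def candidate_docs(query_phrases, stem_index, concept_index, related):
    """Documents sharing a stem, a concept, or a related concept."""
    candidates = set()
    for concept_id, stems in query_phrases:
        for stem in stems:
            candidates |= stem_index.get(stem, set())
        if concept_id is not None:
            candidates |= concept_index.get(concept_id, set())
            for other, _similarity in related.get(concept_id, []):
                candidates |= concept_index.get(other, set())
    return candidates


docs = {
    "d1": [("C_LESION", ["cerebr", "lesion"])],
    "d2": [(None, ["bowel"])],
}
stem_index, concept_index = build_indexes(docs)
print(candidate_docs([("C_EDEMA", ["cerebr", "edem"])], stem_index, concept_index, RELATED))
```

Because related concepts are looked up from a precomputed list, each query concept adds only a bounded number of extra index lookups, which is consistent with the slide's claim that the cost stays in the same order of magnitude.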
Conclusion: A new document indexing paradigm based on phrases is proposed. Phrases (a concept and its word stems) are used as terms. Document similarity is derived from both the stem and the concept contributions. Conceptual similarity quantifies the concept relations and improves retrieval effectiveness. Stems remedy the incomplete coverage of the knowledge base (missing concepts and missing links between related concepts). Experimental results reveal a significant retrieval effectiveness improvement of the phrase-based VSM over the stem-based VSM.
Acknowledgement: This research is supported in part by NIC/NIH Grant#
Model Comparison. [Figure: similarity under each model, shown for stems s1 and s2, concepts c1 and c2, and phrases p1 and p2, in the cases where the concepts are unrelated and where they are related.] Phrase-based, concepts unrelated: stem overlap in p1 and p2. Phrase-based, concepts related: max(s(c1,c2), stem overlap in p1 and p2).