
1 Free-text Medical Document Retrieval via Phrase-based Vector Space Model Wenlei Mao, MS and Wesley W. Chu, PhD wenlei@cs.ucla.edu and wwc@cs.ucla.edu Computer Science Department University of California, Los Angeles

2 11/9-13/2002 AMIA 2002 Outline: Vector space model (VSM) in document retrieval; Stem-based VSM; Concept-based VSM; Conceptual similarity; Phrase-based VSM; Retrieval effectiveness comparison; Conclusion

3 Document Retrieval: Find free-text documents to answer queries like, “Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation. Cerebral edema secondary to infection, diagnosis and treatment.”

4 Vector Space Model (VSM): Words as terms. [Figure: document vector d and query vector q drawn on the term axes “Leukocytosis” and “Hyperthermia”.]
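
The slide only shows the two vectors, so the following is a minimal sketch of how a word-as-term VSM is commonly scored, assuming raw term counts and cosine similarity. The helper names `term_vector` and `cosine` are hypothetical, and the weighting used in the talk is not stated in the transcript.

```python
import math
from collections import Counter


def term_vector(text):
    """Lower-case the text, split on non-letters, and count words as terms."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return Counter(cleaned.split())


def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Assumed toy query and documents, not taken from OHSUMED.
query = term_vector("hyperthermia leukocytosis")
doc_a = term_vector("Leukocytosis with hyperthermia was noted")
doc_b = term_vector("Patient presented with fever")

score_a = cosine(query, doc_a)  # shares both query words
score_b = cosine(query, doc_b)  # shares no word, although "fever" is a synonym
```

The second document scores zero even though it mentions a synonym of a query term, which is the kind of miss the later slides address.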

5 Stem-based VSM: Morphological variants bear similar content, e.g., “edema” and “edemas”. Use a stemmer to extract stems (Lovins stemmer and Porter stemmer). Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”… Serves as the baseline of comparison.

6 Shortcomings of Stem-based VSM: Inability to capture multi-word concepts, e.g., (1) “Increased intracranial pressure”. Inability to utilize the relations between concepts, e.g., (2) synonyms: “hyperthermia” and “fever”; (3) IS-A relation: “hyperthermia” and “body temperature elevation”.

7 Concept-based VSM: Uses concepts in a knowledge base (KB) as terms. KB: the Metathesaurus in UMLS. Captures multi-word concepts. Captures synonyms. Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… CUIs: (C0015967), (C0023518), (C0151740)…
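
How the talk's system maps text to CUIs is not described in the transcript. As a rough illustration only, the sketch below assumes a greedy longest-match lookup against a tiny hand-made dictionary. `TOY_KB` and `map_to_concepts` are hypothetical names; the three CUIs are the ones shown on the slide, and the "fever" entry is added to mirror the synonym example from slide 6.

```python
# Toy stand-in for the UMLS Metathesaurus: word tuples -> CUI.
# Only the three CUIs from the slide are used; the "fever" row is an assumption
# illustrating that synonyms share one concept.
TOY_KB = {
    ("hyperthermia",): "C0015967",
    ("fever",): "C0015967",
    ("leukocytosis",): "C0023518",
    ("increased", "intracranial", "pressure"): "C0151740",
}


def map_to_concepts(text, kb=TOY_KB):
    """Greedy longest-match lookup; words with no KB entry are skipped."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    words = [w for w in words if w]
    max_len = max(len(key) for key in kb)
    cuis = []
    i = 0
    while i < len(words):
        # Try the longest window first so multi-word concepts win.
        for n in range(min(max_len, len(words) - i), 0, -1):
            cui = kb.get(tuple(words[i:i + n]))
            if cui is not None:
                cuis.append(cui)
                i += n
                break
        else:
            i += 1  # no concept starts at this word
    return cuis


query_cuis = map_to_concepts(
    "Hyperthermia, leukocytosis, increased intracranial pressure"
)
synonym_cuis = map_to_concepts("fever and leukocytosis")
```

Because "fever" and "hyperthermia" resolve to the same CUI, a concept-based index treats them as the same term, which a stem-based index cannot do.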

8 Shortcomings of Concept-based VSM: Concepts may be related, e.g., “hyperthermia” and “body temperature elevation” are not identical but related concepts, so conceptual relations need to be quantified. Knowledge bases are often incomplete, which reduces retrieval effectiveness.

9 Conceptual Similarity Evaluation: [Figure: hypernym hierarchy with nodes c1, c2, c3, c4 and the labels “Body temperature elevation”, “Hyperthermia”, “Disease”, “Animal disease”.] Node distance: d(c3,c4)=1. Descendant count: D(c3)=2, D(c4)=0.

10 Deriving Conceptual Similarity From Hypernym Hierarchy: [Figure: hierarchy with nodes c1, c2, c3, c4 and the labels “Body temperature elevation”, “Hyperthermia”, “Disease”, “Animal disease”.]
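
The similarity formula itself did not survive in this transcript, so the sketch below does not attempt it. It only computes the two quantities named on slide 9, node distance d and descendant count D, over a hypothetical hierarchy. The tree shape (including the extra node c5) is an assumption chosen so that the slide's values d(c3,c4)=1, D(c3)=2, D(c4)=0 are reproduced; it is not the figure from the talk.

```python
# Hypothetical hypernym hierarchy as child -> parent links (assumed shape):
#   c1 is the root, c2 and c3 are its children, c4 and c5 are children of c3.
PARENT = {"c2": "c1", "c3": "c1", "c4": "c3", "c5": "c3"}


def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def node_distance(a, b):
    """Number of edges on the path from a to b via their lowest common ancestor."""
    up_a = ancestors(a)
    steps_from_b = {n: i for i, n in enumerate(ancestors(b))}
    for steps_from_a, n in enumerate(up_a):
        if n in steps_from_b:
            return steps_from_a + steps_from_b[n]
    raise ValueError("nodes are not in the same hierarchy")


def descendant_count(node):
    """Number of nodes strictly below the given node."""
    children = [c for c, p in PARENT.items() if p == node]
    return len(children) + sum(descendant_count(c) for c in children)
```

A similarity measure built on these inputs would presumably score close nodes with few descendants as more similar, but that reading is an inference and the exact function should be taken from the original paper.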

11 Shortcomings of Concept-based VSM: Concepts may be related: the conceptual similarity measure, s(c_i, c_j), quantifies relations between concepts. Knowledge bases are often incomplete, which reduces the retrieval effectiveness.

12 Incompleteness of the Knowledge Bases: Missing concepts in the KB, e.g., “Infiltrative small bowel process” maps to (), (C0021852), (). Missing links between related concepts, e.g., (cerebral edema) and (cerebral lesion). In general, concept-based VSM cannot outperform stem-based VSM.

13 Phrase-based Indexing Examples: “Infiltrative small bowel process”: [(); “infiltr”] [(C0021852); “smal”, “bowel”] [(); “proces”]. Query “Cerebral edema”: [(C0699725); “cerebr”, “edem”]. Document “Cerebral lesion”: [(C0221505); “cerebr”, “lesion”]. Query “Hyperthermia, leukocytosis, increased intracranial pressure…”: phrases [(C0015967); “hypertherm”] [(C0023518); “leukocytos”] [(C0151740); “increas”, “intracran”, “pressur”]…

14 Evaluate Phrase-based Document Similarity: Two contributions: one due to the conceptual similarity s(c_i, c_j) between concepts in p_q and p_d, and one due to the stem overlap in p_q and p_d.
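
The document-level formula is missing from this transcript, so the sketch below covers only the phrase-pair rule that the Model Comparison slide at the end of the deck states in words: stem overlap when the concepts are unrelated, and max(s(c1,c2), stem overlap) when they are related. Several details are assumptions: stem overlap is measured here as Jaccard overlap, identical CUIs are given similarity 1.0, the f_s and f_c weights are left out, and the 0.6 similarity value is invented for illustration. A phrase is written as a `(cui_or_None, stems)` pair, following the bracket notation of slide 13.

```python
def stem_overlap(stems_a, stems_b):
    """Jaccard overlap of two stem sets (an assumed overlap measure)."""
    a, b = set(stems_a), set(stems_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def phrase_similarity(p, q, concept_sim):
    """Similarity of two phrases, each a (cui_or_None, stems) pair.

    concept_sim maps frozenset({cui_1, cui_2}) to s(c1, c2); pairs that are
    absent are treated as unrelated (similarity 0).
    """
    (cui_p, stems_p), (cui_q, stems_q) = p, q
    overlap = stem_overlap(stems_p, stems_q)
    if cui_p is None or cui_q is None:
        return overlap  # missing concept: fall back to stems only
    if cui_p == cui_q:
        s = 1.0  # assumption: a concept is fully similar to itself
    else:
        s = concept_sim.get(frozenset((cui_p, cui_q)), 0.0)
    return max(s, overlap)


# Phrases from slide 13.
edema = ("C0699725", ["cerebr", "edem"])
lesion = ("C0221505", ["cerebr", "lesion"])

# Missing link in the KB: only the shared stem "cerebr" contributes.
sim_missing_link = phrase_similarity(edema, lesion, {})

# Hypothetical KB link with an invented similarity value of 0.6.
linked = {frozenset(("C0699725", "C0221505")): 0.6}
sim_with_link = phrase_similarity(edema, lesion, linked)
```

The first call shows the role stems play when the knowledge base lacks a link: the pair still gets a non-zero score from the shared stem.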

15 To Compare Retrieval Effectiveness: Test set: OHSUMED, 106 queries, 14K documents. Expert relevance judgment: R or N. Retrieval effectiveness measures: Recall, the percentage of all relevant documents that have been retrieved so far; Precision, the percentage of retrieved documents that are relevant.
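
The two measures can be computed at any cutoff in a ranked result list. The following is a small sketch with made-up document ids; `precision_recall_at` is a hypothetical helper name, and the ranking and relevance set are not OHSUMED data.

```python
def precision_recall_at(ranked_ids, relevant_ids, k):
    """Precision and recall after the top-k documents have been retrieved."""
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Invented ranking and expert judgments ("R" documents only).
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

p_at_1, r_at_1 = precision_recall_at(ranked, relevant, 1)
p_at_3, r_at_3 = precision_recall_at(ranked, relevant, 3)
p_at_5, r_at_5 = precision_recall_at(ranked, relevant, 5)
```

Walking k down the ranking and recording precision at each recall level gives the points of a precision-recall curve, which is the usual way two retrieval models are compared.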

16 Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS): 16% 100 queries vs. 5% 50 queries

17 Stem and Concept Similarity Contribution Weights: f_c: similarity contribution weight for concepts. f_s: similarity contribution weight for stems.

18 Sensitivity of Retrieval Effectiveness to f_s and f_c: [Figure: plot labeled “Stems” and “Concepts”, with an optimal region marked.]

19 Computation Complexity Using Phrase-based VSM: Data reorganization: build separate indexes on stems and concepts; for each concept c_i, keep a list of related concepts c_j together with the conceptual similarity s(c_i, c_j). The time complexities of the document similarity calculation are of the same order of magnitude for the stem-based VSM and the phrase-based VSM (the complexity expressions are not preserved in this transcript).
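
One way to picture this data reorganization is as two inverted indexes plus a per-concept list of related concepts. The sketch below is an assumption about the layout, not the authors' implementation: `build_indexes` and `candidate_docs` are hypothetical names, `"C_BTE"` is a placeholder id for “body temperature elevation” (its real CUI is not given in the slides), and the 0.8 similarity value is invented.

```python
from collections import defaultdict


def build_indexes(docs):
    """docs: {doc_id: [(cui_or_None, [stems]), ...]} -> (stem_index, concept_index)."""
    stem_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, phrases in docs.items():
        for cui, stems in phrases:
            if cui is not None:
                concept_index[cui].add(doc_id)
            for stem in stems:
                stem_index[stem].add(doc_id)
    return stem_index, concept_index


def candidate_docs(query_phrases, stem_index, concept_index, related):
    """Documents sharing a stem, a concept, or a related concept with the query."""
    found = set()
    for cui, stems in query_phrases:
        for stem in stems:
            found |= stem_index.get(stem, set())
        if cui is not None:
            found |= concept_index.get(cui, set())
            # related[cui] is the stored list of (c_j, s(c_i, c_j)) pairs.
            for other_cui, _similarity in related.get(cui, []):
                found |= concept_index.get(other_cui, set())
    return found


# Toy collection; "C_BTE" is a placeholder id, not a real CUI.
docs = {
    "d1": [("C_BTE", ["bod", "temperatur", "elevat"])],
    "d2": [("C0023518", ["leukocytos"])],
    "d3": [(None, ["hypertherm"])],
}
stem_index, concept_index = build_indexes(docs)
query = [("C0015967", ["hypertherm"])]
related = {"C0015967": [("C_BTE", 0.8)]}  # invented similarity value

with_links = candidate_docs(query, stem_index, concept_index, related)
without_links = candidate_docs(query, stem_index, concept_index, {})
```

Because the related-concept list is precomputed, each query concept costs only a few extra index lookups, which is consistent with the slide's claim that the cost stays in the same order of magnitude.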

20 Conclusion: A new document indexing paradigm based on phrases is proposed. Phrases (a concept and its word stems) are used as terms. Document similarity is derived from both the stem and the concept contributions. Conceptual similarity quantifies the concept relations and improves retrieval effectiveness. Stems remedy the incomplete coverage of the knowledge base (missing concepts and missing links between related concepts). Experimental results reveal a significant retrieval effectiveness improvement of the phrase-based VSM over the stem-based VSM.

21 Acknowledgement: This research is supported in part by NIC/NIH Grant #4442511-33780.

22 Model Comparison: [Figure: stems s1, s2; concepts c1, c2; phrases p1, p2.] Phrase-based model, concepts unrelated: similarity is the stem overlap in p1 and p2. Phrase-based model, concepts related: similarity is max(s(c1,c2), stem overlap in p1 and p2).

