A System for Finding Biological Entities that Satisfy Certain Conditions from Texts. Wei Zhou, Clement Yu (University of Illinois at Chicago), Weiyi Meng.

1 A System for Finding Biological Entities that Satisfy Certain Conditions from Texts. Wei Zhou and Clement Yu, University of Illinois at Chicago; Weiyi Meng, SUNY at Binghamton.

2 Outline: Problem statement; Techniques and methods; Experimental results; Discussion and conclusion. (CIKM 2008, by Clement Yu from UIC)

3 Problem statement: Given a complex biological question, output relevant passages (or excerpts) where the answer can be found.

4 An Example. A sample question: What [GENES] are involved in insect segmentation? A sample relevant passage: "In all insect species examined, neural expression of hb is conserved, suggesting that a neural function is ancestral. However, as the expression of the eve and ftz genes during segmentation is not conserved between grasshopper and Drosophila, and these genes lie below gap genes such as hb in the Drosophila segmentation hierarchy, it was unclear whether the role of hb in AP patterning would be conserved in more basal insects." Target: GENES. Qualification concepts: 1) insect, 2) segmentation. [hb, ftz, and eve are targets found in the passage.]

5 Techniques and methods: Identify concepts in queries and texts; Use of domain knowledge; Related concepts (query expansion); Gene symbol disambiguation; Conceptual IR models.

6 Identify concepts in queries and texts. In queries: PubMed automatic term mapping. In texts: window size, i.e., a concept is matched when all of its component words appear within a certain window size. An example: "...Women who are postmenopausal and who have never used hormone replacement therapy have a higher risk of colon, but not rectal, cancer than do women who..." [Query concept: colon cancer]
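The window rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `concept_in_window`, the tokenizer, and the window sizes are assumptions, since the slide does not state the actual window size used.

```python
import re


def concept_in_window(text, concept, window=10):
    """Return True if every word of `concept` occurs inside some run of
    `window` consecutive tokens of `text` (word order is ignored).

    The tokenization and the default window size are illustrative assumptions.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    needed = set(re.findall(r"[a-z0-9]+", concept.lower()))
    if not needed:
        return False
    for start in range(len(tokens)):
        # Slide a fixed-size window over the token sequence.
        if needed <= set(tokens[start:start + window]):
            return True
    return False


passage = ("Women who are postmenopausal and who have never used hormone "
           "replacement therapy have a higher risk of colon, but not rectal, "
           "cancer than do women who")

print(concept_in_window(passage, "colon cancer", window=5))
```

With a window of 5 tokens, "colon ... cancer" in the example passage is matched even though the two words are not adjacent; with a window of 3 it would not be.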

7 Use of domain knowledge. Gene/protein species control (rule-based): if a query is asking for genes/proteins related to a specific species, then genes/proteins related to other species are considered irrelevant. Example query: What [GENES] are involved axon guidance in C.elegans? An irrelevant passage because of a different species: "We describe DPTP52F, which is probably the last remaining RPTP encoded in the Drosophila genome. Ptp52F mutations cause specific CNS and motor axon guidance phenotypes, and exhibit genetic interactions with mutations in the other Rptp genes". [Ptp52F is not a relevant target because the passage is about Drosophila, not C.elegans.]
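A rough sketch of the species-control rule is given below. The alias table `SPECIES_ALIASES`, the function names, and the decision to keep passages that mention no species at all are assumptions made for illustration; the slide only states the rule itself.

```python
# Hypothetical alias table; a real system would draw species names from a
# curated resource rather than this hand-written dictionary.
SPECIES_ALIASES = {
    "c. elegans": ["c. elegans", "c.elegans", "caenorhabditis elegans"],
    "drosophila": ["drosophila", "d. melanogaster"],
}


def mentioned_species(text):
    """Return the set of known species whose aliases occur in `text`."""
    lowered = text.lower()
    return {
        name
        for name, aliases in SPECIES_ALIASES.items()
        if any(alias in lowered for alias in aliases)
    }


def passes_species_control(query, passage):
    """Reject a passage only when the query names a species and the passage
    names only other species. Passages with no species mention are kept
    (an assumption; the slide does not cover this case)."""
    wanted = mentioned_species(query)
    found = mentioned_species(passage)
    if not wanted or not found:
        return True
    return bool(wanted & found)


query = "What [GENES] are involved axon guidance in C.elegans?"
passage = ("We describe DPTP52F, which is probably the last remaining RPTP "
           "encoded in the Drosophila genome.")
print(passes_species_control(query, passage))
```

For the query and passage from the slide, the check returns False because the query names C. elegans while the passage is about Drosophila.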

8 Use of domain knowledge. Compilation of instances from thesauruses: retrieve concepts from UMLS and genes from Entrez Gene, and map them to the TREC entity types. An example: [Target type]: TUMOR TYPES. [Dictionary]: UMLS Metathesaurus. [Instances]: Lung Cancer; T-cell lymphoma; Pheochromocytoma.

9 Related concepts: Synonyms; Hyponyms (one-level only); Hypernyms (one-level only); Lexical variants; Related abbreviations.

10 Related concepts: lexical variants. Type 1: Automatically generate lexical variants using manually created heuristics, e.g., PLA2 → PLA 2, PLAII, and PLA II. Note: PLA2 is Phospholipase A2.
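The PLA2 example suggests heuristics such as inserting a space at the letter/digit boundary and swapping Arabic digits for Roman numerals. The sketch below covers only those two heuristics, which are inferred from the single example on the slide; the authors' full rule set is not given, and `type1_variants` and `ROMAN` are hypothetical names.

```python
import re

# Arabic-to-Roman mapping for small numbers only (an illustrative assumption).
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}


def type1_variants(term):
    """Generate variants for terms of the form <letters><digit>, e.g. PLA2.

    Returns an empty list for terms that do not match that shape.
    """
    match = re.fullmatch(r"([A-Za-z]+)([1-5])", term)
    if match is None:
        return []
    head, digit = match.groups()
    roman = ROMAN[digit]
    return [f"{head} {digit}", f"{head}{roman}", f"{head} {roman}"]


print(type1_variants("PLA2"))
```

For "PLA2" this yields the three variants listed on the slide: "PLA 2", "PLAII", and "PLA II".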

11 Related concepts: lexical variants. Type 2: Retrieve additional lexical variants from a term database of MEDLINE, e.g., PLA2 → PL-A2. Note: PLA2 is Phospholipase A2.

12 Related concepts: lexical variants. Type 3: Retrieve additional lexical variants by recognizing equivalent long-forms of an abbreviation. Six subtypes of Type 3:
Type 3.1: Identical after stemming. Example: APC: "antigen presenting cell" ≈ "antigen presented cell"
Type 3.2: Different by a small edit distance. Example: HPV: "Human papillomavirus" ≈ "Human papillomaviral"
Type 3.3: Identical after normalization. Example: NFkb: "Nuclear factor kappa beta" ≈ "Nuclear factor kb"
Type 3.4: Different ordering. Example: Abeta: "amyloid beta protein" ≈ "beta amyloid protein"
Type 3.5: Extra words. Example: ACD: "cerebral amyloid angiopathies" ≈ "cerebral beta amyloid angiopathies"
Type 3.6: Internal abbreviations. Example: APC: "ag presenting cell" ≈ "antigen presenting cell"
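Three of the six subtypes (3.2, 3.4, and 3.5) can be checked with plain string operations, as sketched below. The thresholds `max_edits` and `max_extra`, the function names, and the order in which the checks are applied are assumptions; stemming, normalization, and internal abbreviations (3.1, 3.3, 3.6) would need extra resources and are left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_subsequence(short, long):
    """True if the tokens of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(token in remaining for token in short)


def type3_relation(a, b, max_edits=2, max_extra=1):
    """Label how two long-forms are related, or return None if unrelated."""
    a, b = a.strip().lower(), b.strip().lower()
    tokens_a, tokens_b = a.split(), b.split()
    if a == b:
        return "identical"
    if sorted(tokens_a) == sorted(tokens_b):
        return "3.4 different ordering"
    short, long = sorted((tokens_a, tokens_b), key=len)
    if 0 < len(long) - len(short) <= max_extra and is_subsequence(short, long):
        return "3.5 extra words"
    if edit_distance(a, b) <= max_edits:
        return "3.2 small edit distance"
    return None


print(type3_relation("amyloid beta protein", "beta amyloid protein"))
```

Two long-forms that receive a non-None label could then be treated as variants of the same abbreviation definition.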

13 Related concepts: related abbreviations. Abbreviations whose definitions (or long-forms) contain the query concept. For example, some related abbreviations for the concept "lung cancer" are: SCLC (small cell lung cancer); LCSS (lung cancer symptom scale); NSCLC (non-small cell lung cancer).
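Finding related abbreviations amounts to a containment test on long-forms. A minimal sketch follows, assuming an abbreviation dictionary is already available; the `ABBREVIATIONS` table here holds only examples from these slides, and `related_abbreviations` is a hypothetical name.

```python
import re

# Toy abbreviation dictionary built from examples in the slides.
ABBREVIATIONS = {
    "SCLC": "small cell lung cancer",
    "LCSS": "lung cancer symptom scale",
    "NSCLC": "non-small cell lung cancer",
    "APC": "antigen presenting cell",
}


def related_abbreviations(concept, abbreviations):
    """Return abbreviations whose long-form contains `concept` as a phrase."""
    pattern = re.compile(r"\b" + re.escape(concept.lower()) + r"\b")
    return sorted(
        abbr
        for abbr, long_form in abbreviations.items()
        if pattern.search(long_form.lower())
    )


print(related_abbreviations("lung cancer", ABBREVIATIONS))
```

The word-boundary anchors keep a concept such as "cell" from matching inside a longer word.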

14 Gene symbol disambiguation. Three simple rules are defined to disambiguate gene symbols from:
Abbreviations of non-gene meanings (Rules 1 and 2). Example: "Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD), we have identified a mandatory role of CD4 T cells as the functional source of CD154 in the initiation of T1DM." [NOD is a gene symbol, but it has a non-gene meaning here because it has a non-gene definition, "non-obese diabetic".]
Common English words (Rule 3). Example: "The Kit gene, which codes for the KIT ligand (KITL) receptor or stem cell factor, was one of the genes identified in this study." ["Kit" is a common English word, but it has a gene meaning here because of the adjacent word "gene".]
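The slide gives examples rather than the exact rule definitions, so the following is only a loose approximation of the two ideas: a symbol defined in parentheses after a matching non-gene long-form is treated as a non-gene abbreviation, and a common English word is treated as a gene only when the word "gene" is adjacent. The initial-letter matching and the cue set `GENE_CUES` are assumptions.

```python
import re

GENE_CUES = {"gene", "genes"}  # assumed cue words, taken from the Kit example


def has_non_gene_definition(symbol, sentence):
    """True if `symbol` is introduced as '(SYMBOL)' right after a long-form
    whose word initials spell the symbol and which contains no gene cue."""
    match = re.search(r"\(\s*" + re.escape(symbol) + r"\s*\)", sentence)
    if match is None:
        return False
    words = re.findall(r"[A-Za-z0-9]+", sentence[:match.start()])
    long_form = words[-len(symbol):]
    initials = "".join(word[0] for word in long_form)
    has_cue = bool(GENE_CUES & {word.lower() for word in long_form})
    return initials.lower() == symbol.lower() and not has_cue


def common_word_is_gene(symbol, sentence):
    """True if some occurrence of `symbol` has a gene cue word next to it."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    for i, token in enumerate(tokens):
        if token.lower() == symbol.lower():
            neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
            if GENE_CUES & {n.lower() for n in neighbours}:
                return True
    return False


nod_sentence = ("Here, utilizing non-obese diabetic (NOD) mice deficient for "
                "CD154 (CD154-KO/NOD), we have identified a mandatory role of "
                "CD4 T cells.")
kit_sentence = "The Kit gene, which codes for the KIT ligand (KITL) receptor."

print(has_non_gene_definition("NOD", nod_sentence))
print(common_word_is_gene("Kit", kit_sentence))
```

Both checks are deliberately shallow string tests; they mirror the two examples on the slide and would need broader cue lists and definition detection in practice.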

15 Conceptual IR models. Model 1: differentiate target instances. Model 2: equally weight target instances.

16 Conceptual IR Models – Model 1

17 Conceptual IR Models – Model 2

18 Experimental results: Data sets and evaluation metrics; Impact of different techniques and methods; Comparison with best reported results.

19 Data sets and evaluation metrics. Query collection: 36 questions collected from biologists in 2007. Document collection: 162,259 Highwire full-text documents in HTML format. Performance metrics: Passage MAP; Aspect MAP; Document MAP.
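All three metrics are variants of mean average precision (MAP). The sketch below shows only the standard document-level form; the passage-level and aspect-level variants have their own definitions that are not reproduced here. The function names and the toy judgments are assumptions, and ranked lists are assumed to contain no duplicates.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(qid, []), rel)
              for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data: query ids and document ids are made up for illustration.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d8"}}
print(mean_average_precision(runs, qrels))
```

Relevant documents that are never retrieved still count in the denominator, so a run is penalized for missing them.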

20 Impact of different techniques and methods

21 Impact of different techniques and methods

22 Comparison with best reported results: The improvement of our result over the best reported results is significant (22% for automatic and 16.7% for non-automatic in passage retrieval).

23 Summary: Studied five different levels of related concepts for query expansion and examined their impacts on retrieval effectiveness. Achieved significant improvement over the best reported results. Compared two conceptual IR models in retrieval effectiveness. Evaluated a simple method for gene symbol disambiguation.

24 Conclusions 1: Incorporating domain-specific knowledge through query expansion using multiple semantic relations significantly improved retrieval effectiveness.

25 Conclusions 2: The biggest improvement comes from the lexical variants. This result also indicates that biologists are likely to use different variants of the same concept according to their own writing preferences, and these variants might not be collected in the existing biomedical thesauruses.

26 Future work: Improve the quality of target instances retrieved from different resources; Improve the gene symbol disambiguation method; Handle pronouns; More evaluations on other gold standards.

27 Questions? Thanks.

