BioQA - A question answering system for the biomedical domain


BioQA - A question answering system for the biomedical domain Luis Tari

Question Answering (QA)
What is QA?
  “QA is an interactive human computer process that encompasses understanding a user information need, typically expressed in a natural language query; retrieving relevant documents, data, or knowledge from selected sources; extracting, qualifying and prioritizing available answers from these sources; and presenting and explaining responses in an effective manner.” Cited from “New Directions in Question Answering”
Why QA?
  One of the ultimate goals in AI (human-level AI, Turing’s test, …)
  A move beyond keyword query, finding what we really want to know

QA
How is QA different from a search engine? Check out www.brainboost.com
QA | Search Engine
Queries in natural language (questions) | Queries based on keywords
Present answers to users | Users find the answers from retrieved results
Some natural language processing is used to determine answers | Mostly keywords and ranking to retrieve results

Text Retrieval Conference (TREC)
An annual activity of information retrieval (IR) research sponsored by the National Institute of Standards and Technology (NIST).
TREC is organized into “tracks” of common interest. Research groups work on a common source of data and a common set of queries or tasks.
The goal is to allow comparisons across systems and approaches in a research-oriented, collegial manner.

TREC Genomics Track
TREC Genomics Track focuses on the retrieval of information from biomedical literature.
Ad-hoc retrieval on a set of 4.5 million articles, 25% of which have no abstracts.
50 topics (queries) organized into 5 templates

TREC Genomics Templates
Find articles describing standard methods or protocols for doing some sort of experiment or procedure.
Find articles describing the role of a gene involved in a given disease.
Find articles describing the role of a gene in a specific biological process.
Find articles describing interactions (e.g., promote, suppress, inhibit, etc.) between two or more genes in the function of an organ or in a disease.
Find articles describing one or more mutations of a given gene and its biological impact.

BioQA
A QA system for the biomedical domain
Many genomics information resources are available
  Entrez Gene, PubMed, UniProt, Gene Ontology, UMLS, many many more…
BioQA utilizes some of the genomics resources, whereas a generic QA system does not
Keyword search is not enough
  Consider the following examples

Example 1 Suppose as a biologist, I want to know the role of the gene interferon beta in the disease multiple sclerosis. Query to PubMed: “interferon beta” AND “multiple sclerosis” Oops… interferon beta IS also the name of a treatment. I’m not a medical doctor so I don’t really care…. bad example…

Example 2 Query: “interferon beta” AND “multiple sclerosis” good example Hmm… this is more like what I am looking for….

Objectives of BioQA
Phase 1: Retrieve relevant articles with respect to the specific needs of user’s questions
Phase 2: Extract and present answers to the users
Phase 3: Answer questions that require simple reasoning

BioQA Prototype Offline Subsystem

BioQA Prototype Online Subsystem

Main Components of BioQA Phase 1
Question Processing and Query Formation
Entity Recognition
Indexing
Pronoun Resolution
Extraction
Ranking

Question Processing and Query Formation
Process questions so that keywords are extracted to form queries for retrieval
Incorporate synonyms for the keywords
Consider the question: “What is the role of PRNP in mad cow disease?”
First idea
  Get all the nouns from the question
  But we do not want a query that includes “role”
Second idea
  Identify all the entities from the question and treat them as keywords
  But what if we are unable to identify some of the entities?
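The second idea, combined with synonym expansion, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `SYNONYMS` dictionary is hand-made for this example (a real system would draw on resources such as Entrez Gene or MeSH), and the function names `find_entities` and `build_query` are hypothetical, not part of BioQA.

```python
import re

# Hypothetical, hand-made synonym dictionary for illustration only;
# a real system would load entries from Entrez Gene, MeSH, UMLS, etc.
SYNONYMS = {
    "prnp": ["PRNP", "prion protein"],
    "mad cow disease": [
        "mad cow disease",
        "bovine spongiform encephalopathy",
        "BSE",
    ],
}


def find_entities(question, dictionary):
    """Return the dictionary keys that occur in the question, in question order."""
    text = question.lower()
    found = []
    for key in dictionary:
        # Whole-word match so that short symbols do not match inside other words.
        if re.search(r"\b" + re.escape(key) + r"\b", text):
            found.append(key)
    return sorted(found, key=text.index)


def build_query(question, dictionary=SYNONYMS):
    """Build a boolean query: synonyms are OR-ed, entities are AND-ed."""
    groups = []
    for entity in find_entities(question, dictionary):
        terms = " OR ".join('"%s"' % term for term in dictionary[entity])
        groups.append("(" + terms + ")")
    return " AND ".join(groups)


print(build_query("What is the role of PRNP in mad cow disease?"))
```

Because only dictionary entries become keywords, the noun “role” never enters the query. The weakness noted on the slide remains: an entity missing from the dictionary is silently dropped.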

Question Processing and Query Formation
Third idea – making use of dependency grammar (Link Grammar)
keyword(N2) :- noun(N1), noun(N2), Mp(N1,X), J(X,N2).
In the following example, N1= “role” and N2= “X” in the question

    +-----------------Xp-----------------+
    |            +--------MVp--------+   |
    |            +---Ost---+         |   |
    +---Ws--+Ss*w+   +--Ds-+-Mp-+J+  +J+ |
    |       |    |   |     |    | |  | | |
LEFT-WALL what is.v the role.n of X in Y ?
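A minimal sketch of how the keyword rule could be evaluated over the links of this parse. The link triples are transcribed by hand from the diagram, the noun set is assumed rather than produced by a tagger, and `keywords` is a hypothetical helper name; the rule's middle variable is called `prep` here to avoid confusion with the entity placeholder X.

```python
# Links of the example parse as (label, left_word, right_word) triples,
# transcribed by hand from the Link Grammar diagram.
LINKS = [
    ("Xp", "LEFT-WALL", "?"),
    ("Ws", "LEFT-WALL", "what"),
    ("Ss*w", "what", "is.v"),
    ("Ost", "is.v", "role.n"),
    ("MVp", "is.v", "in"),
    ("Ds", "the", "role.n"),
    ("Mp", "role.n", "of"),
    ("J", "of", "X"),
    ("J", "in", "Y"),
]

# Assumed noun tagging for this example; not computed here.
NOUNS = {"role.n", "X", "Y"}


def keywords(links, nouns):
    """Apply keyword(N2) :- noun(N1), noun(N2), Mp(N1, P), J(P, N2)."""
    result = []
    for label1, n1, prep in links:
        if label1 != "Mp" or n1 not in nouns:
            continue
        # Follow the J link that starts at the preposition reached via Mp.
        for label2, left, n2 in links:
            if label2 == "J" and left == prep and n2 in nouns:
                result.append(n2)
    return result


print(keywords(LINKS, NOUNS))
```

On this parse the rule selects only X, the noun attached to “role” through “of”. Y hangs off the verb through an MVp link, so a further rule would be needed to pick it up.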

Entity Recognition
To recognize gene symbols, disease names
Lots of resources on
  gene symbols: Entrez Gene, HUGO, …
  disease names: MeSH, UMLS, …
Why is Entity Recognition still an issue?
  “CDC28” can be written as “Cdc28”, “Cdc28p”, “cdc-28”
  “hairy” is a gene name
  “GSS” is a synonym of “PRNP”, but “GSS” itself is also a gene which is unrelated to “PRNP”!
Two tasks
  Recognize gene names given a biomedical article
  Generate gene symbol synonyms and variants given a gene symbol in a query
  Notice that “Cdc28”, “Cdc28p”, “cdc-28” are not synonyms of “CDC28”, so they are not listed in dictionaries such as Entrez Gene
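The second task, generating variants of a gene symbol, might look roughly like the sketch below. The heuristics (case changes, an optional hyphen before the digits, and a trailing “p” for the protein product) are assumptions chosen to reproduce the “CDC28” example; they are not the rules BioQA actually uses, and `gene_symbol_variants` is a hypothetical name.

```python
import re


def gene_symbol_variants(symbol):
    """Generate simple orthographic variants of a letters+digits gene symbol."""
    match = re.fullmatch(r"([A-Za-z]+)-?(\d+)", symbol)
    if match is None:
        # Symbols that do not fit the letters+digits shape are returned unchanged.
        return {symbol}
    letters, digits = match.groups()
    variants = set()
    for stem in (letters.upper(), letters.lower(), letters.capitalize()):
        for separator in ("", "-"):
            variants.add(stem + separator + digits)
    # Protein-product suffix "p", as in the example "Cdc28p".
    variants.add(letters.capitalize() + digits + "p")
    return variants


print(sorted(gene_symbol_variants("CDC28")))
```

Variants of this kind widen recall at the cost of precision, so in practice they would need filtering against known ambiguous symbols such as “GSS”.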

Entity Recognition
Various approaches:
  Machine learning techniques to recognize names on the basis of their characteristic features
  Dictionary-based methods with generation of variants
  Dictionary-based + Part-of-Speech methods
  Rule-based methods
Some of the best Entity Taggers:
  ABNER
  GAPSCORE

Anaphora Resolution
Pronominal Anaphors
  Resolving third-person pronouns and reflexive pronouns
  Example: “BRCA1 interacts with Smad2. It also interacts with Smad3.”
Sortal Anaphors
  “In this report, we show that virus infection of cells results in a dramatic hyperacetylation of histones H3 and H4 that is localized to the IFN-beta promoter. … Thus, coactivator-mediated localized hyperacetylation of histones may play a crucial role in inducible gene expression.” [PMID: 10024886]
  Which histones?
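To make the pronominal case concrete, here is a deliberately naive sketch: a sentence-initial “It” is replaced by the first known entity of the preceding sentence, used as a stand-in for its subject. This heuristic and the helper name `resolve_it` are assumptions for illustration; real anaphora resolution needs syntactic and semantic constraints.

```python
import re


def resolve_it(text, entities):
    """Replace sentence-initial "It" with the first entity of the previous sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    resolved = []
    antecedent = None
    for sentence in sentences:
        if antecedent and sentence.startswith("It "):
            sentence = antecedent + sentence[2:]
        # The earliest-mentioned entity is taken as a crude proxy for the subject.
        mentioned = [(sentence.find(e), e) for e in entities if e in sentence]
        if mentioned:
            antecedent = min(mentioned)[1]
        resolved.append(sentence)
    return " ".join(resolved)


print(resolve_it(
    "BRCA1 interacts with Smad2. It also interacts with Smad3.",
    ["BRCA1", "Smad2", "Smad3"],
))
```

Sortal anaphors such as “histones” or “the receptor” are not handled by this sketch, since they require knowing which earlier entities belong to the named class.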

Anaphora Resolution “Ethanol was found to inhibit the function of this chimeric receptor in a manner similar to that of nACh alpha 7 receptors. Because the inhibition transfers with the amino-terminal domain of the receptor, the observations suggest that the amino-terminal domain of the receptor is involved in the inhibition.” [PMID: 8863848]

Extraction
To extract knowledge from text
  Knowledge such as protein-protein interactions, gene-disease relations, …
  Can be used in presenting answers
Extracting protein-protein interactions
  “Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation.” [PMID: 15920482]
Should extract the following interactions from the above text:
  Cdc28 binds Clb2
  Swe1 is phosphorylated by Clb2-Cdc28 complex
  Cdc5 is involved in Swe1 phosphorylation.

Extraction Extraction of other relations “… Furthermore, PACT colocalized with viral replication complex in the infected cells. Thus the observed effect of PACT is novel and PACT is involved in the regulation of viral replication …” [PMID: 11401490] Should extract the following relations from the above text: PACT colocalized with viral replication complex in the infected cells PACT is involved in the regulation of viral replication

Extraction
Two main directions towards extraction:
Cooccurrence
  Identify entities that co-occur within abstracts
  Frequency-based scoring scheme to rank the extracted relationships
NLP
  Combine the analysis of syntax and semantics
  Using extraction rules that are implemented manually or learned automatically from annotated corpus
However,
  Cooccurrences sometimes do not actually mean correct relations
  Cannot infer directional relationships from cooccurrences
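The co-occurrence direction can be sketched as a simple count of entity pairs that appear in the same abstract. The three one-sentence “abstracts” below are made-up toy inputs, entity matching is plain substring search, and `cooccurrence_scores` is a hypothetical name; a real scoring scheme would normalize by how often each entity occurs overall.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(abstracts, entities):
    """Count, for each unordered entity pair, the abstracts mentioning both."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Naive substring matching stands in for real entity recognition.
        present = sorted(e for e in entities if e.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts.most_common()


# Toy inputs for illustration only, not real abstracts.
toy_abstracts = [
    "Clb2-bound Cdc28 phosphorylated Swe1.",
    "Cdc5 promotes Swe1 degradation.",
    "Cdc28 binds Clb2.",
]
for pair, score in cooccurrence_scores(toy_abstracts, ["Cdc28", "Clb2", "Swe1", "Cdc5"]):
    print(pair, score)
```

The pairs are unordered, which makes the limitation above visible: the counts say that Cdc28 and Swe1 appear together, but not which one phosphorylates the other.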

Hard Lessons learned from TREC
Synonyms from gene dictionaries are NOT enough
  Generating gene symbol variants is essential
One query is not enough to do the job
  Generating query variants, which are slight variations of the original query.
  For instance, the query “inhibitory synaptic transmission” can have the variants “synaptic transmission” and “inhibitory transmission”.
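One simple way to produce such query variants is to drop one word at a time from a multi-word query. This leave-one-out scheme is an assumption made for illustration (the slides do not say how BioQA generates variants), and `query_variants` is a hypothetical name.

```python
def query_variants(query):
    """Return the variants obtained by dropping one word from the query."""
    words = query.split()
    if len(words) < 3:
        # Dropping a word from a one- or two-word query leaves too little.
        return []
    variants = []
    for i in range(len(words)):
        variants.append(" ".join(words[:i] + words[i + 1:]))
    return variants


print(query_variants("inhibitory synaptic transmission"))
```

Besides the two variants named above, this also yields “inhibitory synaptic”, which drops the head noun; a practical system would probably filter such variants or give them a lower weight.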

more…
Abstracts related to a gene family can be relevant as well
  If we want to know about the gene COPII, we may want to know about COP and COPI as well
Abstracts can merely mention an entity as an example
  e.g. [PMID 10232877]: GSTM1 is mentioned as related to breast cancer only as an example, but the article is about GSTM1 and alcoholism.

Future Components Structural Feedback Answer Presentation Semantics of Words Simple Reasoning using Domain Knowledge

Structural Feedback
Problem: Can we use the underlying “structures” among the relevant articles to improve the retrieval process? [IBM]
Goal: To learn the “structures” of abstracts that are identified as relevant.
Idea: Learn the structure of articles (such as common words, MeSH terms)
  identified to be relevant by domain experts
  identified to be relevant by users

Answer Presentation
To present answers to users in a precise and concise manner
Current Status: relevant “answers” are presented to the users in the form of abstracts
Problem: Not concise enough for users
Ideas:
  Retrieve small passage of text, based on proximity of keywords [LCC02] and simple cosine similarity between sentences [Singapore05].
  Extraction using NLP
  Use text summarization techniques to present answers [PSB06].
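The cosine-similarity idea can be illustrated with a small bag-of-words sketch that ranks the sentences of an abstract against the question. It uses raw term counts without stop-word removal or weighting, which is a simplification and not the method of the cited papers; the function names and the two example sentences are made up for this illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(question, sentences):
    """Order sentences by decreasing similarity to the question."""
    q = bag_of_words(question)
    return sorted(sentences, key=lambda s: cosine(q, bag_of_words(s)), reverse=True)


# Made-up sentences for illustration.
candidates = [
    "Samples were stored at low temperature.",
    "IDE plays a role in the degradation of amyloid beta.",
]
print(rank_sentences("What is the role of IDE in Alzheimer's disease?", candidates)[0])
```

Returning the top-ranked sentence instead of the whole abstract addresses the conciseness problem, though common words such as “the” and “of” inflate the scores unless stop words are removed.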

Semantics of Words
WordNet – a resource that provides synonyms of words in different senses; relations between words
Question: “What is the role of IDE in Alzheimer’s Disease?”
Abstract (PMID:12161276): “… IDE plays in the degradation and clearance of human amyloid beta from microglial cells and neurons …”
Semantic relation between “role” and “play” [from WordNet]:
  role: function, purpose, role, use
  play: is_a(play_use)
So we can say “role”, “play”, “use” are related.
Answer: The role of IDE is in the degradation and clearance of human amyloid beta from microglial cells and neurons.

Simple Reasoning using Domain Knowledge (Example 1)
Question: “Does IDE play a role in Alzheimer’s Disease (AD)?”
Retrieved Abstract (PMID:12161276): “… The insulin degrading enzyme (IDE) is an attractive candidate gene since previous studies have identified a possible role that IDE plays in the degradation and clearance of human amyloid beta from microglial cells and neurons …”
Domain knowledge: AD is a nervous system disease. Neurons are related to the nervous system.
Answer: Yes, IDE plays a role in AD because AD is a nervous system disease and IDE plays a role in the degradation and clearance of human amyloid beta from microglial cells and neurons.

Simple Reasoning using Domain Knowledge (Example 2)
Question: Is MMS2 involved in cancer?
Domain Knowledge about MMS2
  MMS2 is known to be involved in biological processes such as cell proliferation and the ubiquitin cycle, based on the Gene Ontology.
  Cell Proliferation – cell growth
  Ubiquitin cycle – regulating proteins' half-lives
  Either way, that is deviating from the normal half-life, and that is not a good thing.

Simple Reasoning using Domain Knowledge (Example 2 cont.) Domain Knowledge about cancer Abnormal growth of tissues Sometimes in cancer, we find that the ubiquitin cycle is deregulated, leading to certain proteins having extra long or extra short half-lives. Answer: Yes. Since MMS2 is involved in regulating cell proliferation and ubiquitin cycle, MMS2 is possibly involved in cancer. Challenges: How to represent such knowledge Where to get such domain knowledge
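One very simple way to represent this kind of knowledge is as sets of biological processes per gene and per disease, with a gene flagged as possibly involved in a disease when the sets overlap. The two dictionaries below are hand-written paraphrases of the slides, not data pulled from the Gene Ontology, and `possibly_involved` is a hypothetical name; this is a sketch of the reasoning pattern, not a proposal for how BioQA should store knowledge.

```python
# Hand-written facts paraphrasing the slides; not loaded from the Gene Ontology.
GENE_PROCESSES = {
    "MMS2": {"cell proliferation", "ubiquitin cycle"},
}

# Assumed encoding: processes whose deregulation is associated with the disease.
DISEASE_PROCESSES = {
    "cancer": {"cell proliferation", "ubiquitin cycle"},
}


def possibly_involved(gene, disease):
    """Return the shared processes that support a possible gene-disease link."""
    gene_side = GENE_PROCESSES.get(gene, set())
    disease_side = DISEASE_PROCESSES.get(disease, set())
    return sorted(gene_side & disease_side)


evidence = possibly_involved("MMS2", "cancer")
if evidence:
    print("Possibly involved, via: " + ", ".join(evidence))
else:
    print("No supporting domain knowledge found.")
```

An empty result means only that no supporting knowledge was found, not that the gene is uninvolved, which is why the answer on the slide is hedged as “possibly involved”.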

Potential Projects
Learning
  Structural Feedback
  Rules for describing keywords in questions
Answer Presentation
  Passage retrieval, extraction
Extraction
  gene-disease, gene-biological process relations
Sortal Resolution
Semantics of Words

References
Literature mining for the biologist: from information retrieval to biological discovery. Lars Juhl Jensen, Jasmin Saric and Peer Bork. Nature Reviews Genetics 7, 119-129 (February 2006).
Anaphora Resolution
  Anaphora Resolution in Biomedical Literature. Jose Castano, Jason Zhang, James Pustejovsky.
Extraction of Gene-Disease Relations
  Association of genes to genetically inherited diseases using data mining. Perez-Iratxeta C, Bork P, Andrade MA. Nature Genetics 31, 316-319 (2002).
  G2D: A Tool for Mining Genes Associated to Disease. Perez-Iratxeta C, Wjst M, Bork P, Andrade MA. BMC Genetics 6, 45 (2005).
  Extraction of Gene-Disease Relations from Medline Using Domain Dictionaries and Machine Learning. Hong-Woo Chun, Yoshimasa Tsuruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata, Teruyoshi Hishiki, and Jun'ichi Tsujii. PSB 2006.
Structural Feedback
  [IBM] Rie Kubota Ando, Mark Dredze, Tong Zhang. TREC 2005 Genomics Track Experiments at IBM Watson.

References
Answer Presentation
  [LCC02] Dan I. Moldovan, Mihai Surdeanu: On the Role of Information Retrieval and Information Extraction in Question Answering Systems. SCIE 2002: 129-147.
  [Singapore05] Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan and Tat-Seng Chua. Question Answering Passage Retrieval Using Dependency Relations. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development of Information Retrieval (SIGIR 2005), Salvador, Brazil, August 15-19, 2005.
  [PSB06] Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter. Finding GeneRIFs via Gene Ontology Annotations. To appear in PSB 2006.
WordNet Resources
  [WordNetSim] Pedersen, Patwardhan, and Michelizzi. WordNet::Similarity - Measuring the Relatedness of Concepts. Appears in the Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), July 25-29, 2004, San Jose, CA (Intelligent Systems Demonstration).
  [SenseRelate] Michelizzi. Semantic Relatedness Applied to All Words Sense Disambiguation. Master of Science Thesis, Department of Computer Science, University of Minnesota, Duluth, July, 2005.