University of Texas at Austin Machine Learning Group Integrating Co-occurrence Statistics with IE for Robust Retrieval of Protein Interactions from Medline.


University of Texas at Austin Machine Learning Group
Integrating Co-occurrence Statistics with IE for Robust Retrieval of Protein Interactions from Medline
Razvan C. Bunescu, Raymond J. Mooney
Machine Learning Group, Department of Computer Sciences, University of Texas at Austin
Arun K. Ramani, Edward M. Marcotte
Institute for Cellular and Molecular Biology, Center for Computational Biology and Bioinformatics, University of Texas at Austin

2. Introduction
Two orthogonal approaches to mining binary relations from a collection of documents:
– Information Extraction: relation extraction from individual sentences, followed by aggregation of the results over the entire collection.
– Co-occurrence Statistics: compute (co-)occurrence counts over the entire corpus, then use statistical tests to detect whether co-occurrence is due to chance.
Aim: combine the two approaches into an integrated extraction model.

3. Outline
Introduction.
Two approaches to relation extraction:
– Information Extraction.
– Co-occurrence Statistics.
Integrated Model.
Evaluation Corpus.
Experimental Results.
Future Work & Conclusion.

4. Information Extraction
Most IE systems detect relations only between entities mentioned in the same sentence.
The existence and type of the relationship are inferred from lexico-semantic cues in the sentence context.
Given a pair of entities, corpus-level results are assembled by combining the confidence scores that the IE system associates with each occurrence.
Example: "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."

5. Relation Extraction using a Subsequence Kernel
Subsequences of words and POS tags are used as implicit features.
Assumes the entities have already been annotated.
An exponential penalty factor is used to downweight longer word gaps.
A generalization of the extraction system from [Blaschke et al., 2001].
The system is trained to output a normalized confidence value for each extraction.
Example pattern: interaction of (3) PROT (3) with PROT [Bunescu et al., 2005].
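The full subsequence kernel is defined in [Bunescu et al., 2005]; as a rough intuition for the exponential gap penalty only, here is a much-simplified sketch. The helper `gap_weighted_match` is hypothetical (not the actual kernel): it scores a sentence against a single word pattern such as "interaction of PROT with PROT", discounting the score by a factor lam for every extra word inside the matched span.

```python
def gap_weighted_match(tokens, pattern, lam=0.5):
    """Score a sentence against one word pattern such as
    ['interaction', 'of', 'PROT', 'with', 'PROT']: return
    lam ** (number of extra gap words) if the pattern occurs as an
    in-order subsequence (greedy leftmost match), else 0.0."""
    i, first, last = 0, None, None
    for j, tok in enumerate(tokens):
        if i < len(pattern) and tok == pattern[i]:
            if first is None:
                first = j
            last = j
            i += 1
    if i < len(pattern):
        return 0.0          # pattern words not all present, in order
    gaps = (last - first + 1) - len(pattern)
    return lam ** gaps      # exponential penalty for longer word gaps
```

A perfect contiguous match scores 1.0, while each intervening word halves the score (with lam = 0.5), mirroring how the kernel downweights sparser matches.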

6. Aggregating Corpus-Level Results
[Diagram: sentences S1, …, Sn are fed to the IE system, which outputs one confidence per sentence; the confidences are then combined by an aggregation step.]

7. Aggregation Operators
Maximum. Noisy-OR. Average. AND.
[The operator formulas appear only as images in the original slide.]
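The slide shows the four operator formulas only as images; the sketch below uses the standard definitions of these operators (an assumption, since the exact formulas are not recoverable from the transcript) applied to the per-sentence confidences c_i.

```python
from math import prod

def aggregate(confidences, op="max"):
    """Combine per-sentence extraction confidences c_i in [0, 1]
    into one corpus-level confidence (standard definitions; the
    original slide shows these operators only as images)."""
    c = list(confidences)
    if op == "max":       # strongest single mention wins
        return max(c)
    if op == "noisy-or":  # 1 - prod(1 - c_i): any mention may suffice
        return 1.0 - prod(1.0 - x for x in c)
    if op == "avg":       # arithmetic mean of the confidences
        return sum(c) / len(c)
    if op == "and":       # prod(c_i): all mentions must support it
        return prod(c)
    raise ValueError(op)
```

For two mentions with confidence 0.5 each, Noisy-OR yields 0.75 and AND yields 0.25, illustrating how the choice of operator trades recall against precision.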

8. Outline (next: Co-occurrence Statistics).

9. Co-occurrence Statistics
Compute (co-)occurrence counts for the two entities over the entire corpus.
Based on these counts, detect whether the co-occurrence of the two entities is due to chance or to an underlying relationship.
Various statistical measures can be used:
– Pointwise Mutual Information (PMI)
– Chi-square Test (χ²)
– Log-Likelihood Ratio (LLR)

10. Pointwise Mutual Information
N: the total number of protein pairs co-occurring in the same sentence in the entire corpus.
P(p1, p2) ≈ n12/N: the probability that p1 and p2 co-occur in the same sentence.
P(p1, p) ≈ n1/N: the probability that p1 co-occurs with any protein in the same sentence.
P(p2, p) ≈ n2/N: the probability that p2 co-occurs with any protein in the same sentence.
The higher the sPMI(p1, p2) value, the less likely it is that p1 and p2 co-occurred by chance, i.e. they may be interacting.
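The sPMI formula itself appears as an image in the original slide. Assuming it is the standard PMI ratio built from the probability estimates defined above, it reduces to a simple count ratio:

```python
def spmi(n12, n1, n2, N):
    """Sentence-level PMI for a protein pair (sketch; the slide's
    exact formula is an image, so this assumes the standard ratio):
        sPMI(p1, p2) = P(p1, p2) / (P(p1, p) * P(p2, p))
                     = (n12 / N) / ((n1 / N) * (n2 / N))
                     = N * n12 / (n1 * n2)
    """
    return (N * n12) / (n1 * n2)
```

PMI is often reported as the logarithm of this ratio; since log is monotonic, the ranking of protein pairs is the same either way.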

11. Outline (next: Integrated Model).

12. Integrated Model
[Local] The sentence-level Relation Extraction (SSK) uses information that is local to one occurrence of a pair of entities (p1, p2).
[Global] The corpus-level Co-occurrence Statistics (PMI) are based on counting all occurrences of a pair of entities (p1, p2).
[Local & Global] A more reliable extraction performance is achieved by combining the two orthogonal approaches into an integrated model.

13. Integrated Model
Rewrite sPMI so that the co-occurrence count appears as an explicit sum over the pair's sentence-level occurrences (formula shown as an image in the original slide).
Instead of counting 1 for each co-occurrence, use the confidence output by the IE system, which yields a weighted PMI.
Any aggregation operator can be used in place of the sum.
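The rewritten formula is an image in the original slide; the sketch below assumes the natural reading, in which the raw co-occurrence count n12 is replaced by the sum of the IE confidences over the pair's sentence-level occurrences.

```python
def weighted_spmi(pair_confidences, n1, n2, N):
    """Weighted sPMI sketch (assumption: the slide's image formula
    swaps the raw count n12 for the summed IE confidences).
    pair_confidences: one IE confidence in [0, 1] per sentence in
    which the pair (p1, p2) co-occurs."""
    weighted_n12 = sum(pair_confidences)   # instead of counting 1 each
    return (N * weighted_n12) / (n1 * n2)
```

With all confidences equal to 1 this reduces to the plain sPMI, so the integrated model degrades gracefully to pure co-occurrence statistics when the IE system is maximally confident.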

14. Outline (next: Evaluation Corpus).

15. Evaluation Corpus
An evaluation corpus needs to provide two types of information:
– The complete list of interactions mentioned in the corpus.
– Annotations of protein mentions, together with their gene identifiers.
The corpus was compiled from the HPRD and NCBI databases:
– Every interaction is linked to a set of Medline articles that report the corresponding experiment.
– An interaction is specified as a tuple containing the LocusLink (EntrezGene) identifiers of the proteins involved and the PubMed identifiers of the corresponding Medline articles.

16. Evaluation Corpus (cont'd)
[Diagram: an HPRD interaction record (XML) links participant genes from NCBI (XML), e.g. filamin C, gamma (synonyms ABPA, ABPL, gamma filamin, filamin 2) and myozenin 1 (synonym FATZ), to the Medline abstracts (NCBI) that report the interaction.]
Example sentence from a linked abstract: "We found that this protein binds to three other Z-disc proteins; therefore we have named it FATZ, gamma-filamin, alpha-actinin and telethonin binding protein of the Z-disc."

17. Gene Name Annotation and Normalization
NCBI provides a comprehensive dictionary of human genes, where each gene is specified by its unique identifier and qualified with:
– an official name,
– a description,
– a list of synonyms,
– a list of protein names.
All these names (including the description) are considered as referring to the same entity.
A dictionary-based annotation is used, similar to [Cohen, 2005].

18. Gene Name Annotation and Normalization
Each name is reduced to a normal form by:
1) replacing dashes with spaces;
2) introducing spaces between letters and digits;
3) replacing Greek letters with their Latin counterparts;
4) substituting Roman numerals with Arabic numerals;
5) decapitalizing the first word (if capitalized).
The names are further tokenized and checked against a dictionary of 100K English nouns.
Names associated with more than one gene identifier (i.e. ambiguous names) are ignored.
The final gene name dictionary is implemented as a trie-like structure.
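The five normalization steps can be sketched as below. The Greek and Roman mapping tables are illustrative fragments only (the full tables are not shown on the slide), and "decapitalizing the first word" is read here as lowercasing it unless it is an all-caps symbol such as a gene acronym.

```python
import re

# Illustrative fragments of the mapping tables; the full tables
# used in the paper are not recoverable from the slide.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4"}

def normalize_gene_name(name):
    """Sketch of the five normalization steps listed on the slide."""
    s = name.replace("-", " ")                          # 1) dashes -> spaces
    s = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", s)          # 2) split letter|digit
    s = re.sub(r"(?<=\d)(?=[A-Za-z])", " ", s)          #    and digit|letter
    tokens = s.split()
    tokens = [GREEK.get(t.lower(), t) for t in tokens]  # 3) Greek -> Latin
    tokens = [ROMAN.get(t.lower(), t) for t in tokens]  # 4) Roman -> Arabic
    if tokens and tokens[0][:1].isupper() and not tokens[0].isupper():
        tokens[0] = tokens[0].lower()                   # 5) decapitalize
    return " ".join(tokens)
```

For example, "Cyclin-D1" normalizes to "cyclin D 1", so surface variants like "cyclin D1" and "Cyclin-D1" collapse to the same dictionary key.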

19. Outline (next: Experimental Results).

20. Experimental Results
Four methods are compared on the task of interaction extraction:
– Information Extraction: [SSK.Max] relation extraction with the subsequence kernel (SSK), followed by aggregation of corpus-level results using Max.
– Co-occurrence Statistics: [PMI] Pointwise Mutual Information; [HG] the hypergeometric distribution method from [Ramani et al., 2005].
– Integrated Model: [PMI.SSK.Max] the combined model of PMI and SSK.
Precision vs. Recall graphs are drawn by ranking the extractions and keeping only the top N interactions, as N varies.
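The evaluation procedure described above can be sketched as follows: rank all extracted pairs by score and compute one (precision, recall) point per cutoff N against a gold set of known interactions.

```python
def pr_curve(ranked_pairs, gold):
    """Precision/recall points for the top-N extractions, N = 1..len
    (sketch of the evaluation described on the slide).
    ranked_pairs: protein pairs sorted by descending score.
    gold: set of true interacting pairs."""
    points, tp = [], 0
    for n, pair in enumerate(ranked_pairs, start=1):
        if pair in gold:
            tp += 1
        points.append((tp / n, tp / len(gold)))  # (precision, recall)
    return points
```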

21. Experimental Results
[Figure: Precision vs. Recall curves for the compared methods; not recoverable from the transcript.]

22. Future Work
Derive an evaluation corpus from a potentially more accurate database (Reactome).
Investigate combining IE with other statistical tests (LLR, χ²).
Design an IE method that is trained to do corpus-level extraction (as opposed to sentence-level extraction).

23. Conclusion
Introduced an integrated model that combines two orthogonal approaches to corpus-level relation extraction:
– Information Extraction (SSK).
– Co-occurrence Statistics (PMI).
Derived an evaluation corpus from the HPRD and NCBI databases.
Experimental results show a more consistent performance across the precision-recall curve.

24. Aggregating Corpus-Level Results
Two entities p1 and p2 are mentioned in a corpus C of n sentences, C = {S1, …, Sn}.
The IE system outputs a confidence value for each of the n occurrences.
The corpus-level confidence value is computed by applying an aggregation operator to the n sentence-level confidences.
[The formulas appear only as images in the original slide.]

25. Experimental Results
[PMI] and [HG] are compared on the task of extracting interactions from the entire Medline, using the shared protein function benchmark from [Ramani et al., 2005]:
– Calculate the extent to which interaction partners share functional annotations, as specified in the KEGG and GO databases.
– Use a Log-Likelihood Ratio (LLR) scoring scheme to rank the interactions (formula shown as an image in the original slide).
The scores associated with the HPRD, BIND and Reactome databases are also plotted.

26. Experimental Results
[Figure: shared-function (LLR) benchmark scores; not recoverable from the transcript.]