Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System Mark A. Greenwood Mark Stevenson Yikun Guo Henk Harkema Angus Roberts.


Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System
Mark A. Greenwood, Mark Stevenson, Yikun Guo, Henk Harkema, Angus Roberts
Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK

August 7th 2005, LLL05: Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System

Outline of Talk
- The Challenge
- Extraction Patterns
- Acquiring and Using Extraction Patterns
- Challenge Evaluation
- Analysis
- Conclusions and Future Work

The Challenge

The challenge is to extract genic interactions from biomedical texts, such as MedLine abstracts.
- A genic interaction involves genes and proteins.
- The interactions are directional, but there is no guarantee that genes and proteins always fill the same slot.

Example: “GerE stimulates cotD transcription and inhibits cotA transcription in vitro by sigma K RNA polymerase, as expected from in vivo studies, and, unexpectedly, profoundly inhibits in vitro transcription of the gene (sigK) that encode sigma K.”
- 6 genes and proteins are mentioned.
- Five pairs interact: GerE → cotD, GerE → cotA, sigma K → cotA, GerE → sigK and sigK → sigma K.


Extraction Patterns

We represent extraction patterns as paths in a dependency tree.
- Dependency trees represent text by linking each word in a sentence with the words that directly modify it.
- For example, the noun phrase “the brown dog” is represented by two dependency relations:
  [Figure: dependency tree for “the brown dog” with two relations, labelled det and adj]
- In these experiments we used MINIPAR (Lin, 1999) to generate the dependency trees from which the extraction patterns were taken.

Extraction Patterns

Given the dependency tree representing the phrase “…AGENT represses the transcription of TARGET…”, we extract chain-shaped paths as extraction patterns:
  verb[v/repress](subj[n/AGENT])
  verb[v/repress](obj[n/transcription](of[n/TARGET]))
  verb[v/repress](obj[n/transcription]+subj[n/AGENT])
  verb[v/repress](obj[n/transcription](of[n/TARGET])+subj[n/AGENT])

Extraction Patterns

The nodes in the dependency trees can be either:
- Lexical items (i.e. words)
- Semantic categories such as gene, protein, agent, target, etc.

Lexical items are represented in lower case. Semantic categories are capitalised.

For example, in the pattern
  verb[v/transcribe](subj[n/GENE]+obj[n/PROTEIN])
transcribe is a lexical item and GENE and PROTEIN are semantic categories.
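The lower-case versus capitalised convention can be checked mechanically. The following is a minimal sketch, not the authors' code: the regular expression and the function name `classify_fillers` are illustrative, and it assumes every semantic category is written fully in upper case, as in the examples on the slides.

```python
import re

# One node looks like "subj[n/GENE]": element name, part-of-speech tag, filler.
NODE = re.compile(r"(\w+)\[(\w+)/([^\]]+)\]")

def classify_fillers(pattern):
    """Return (element, filler, kind) for every node in a pattern string."""
    triples = []
    for element, _pos, filler in NODE.findall(pattern):
        # Assumption: semantic categories are all upper case, lexical items are not.
        kind = "semantic" if filler.isupper() else "lexical"
        triples.append((element, filler, kind))
    return triples

nodes = classify_fillers("verb[v/transcribe](subj[n/GENE]+obj[n/PROTEIN])")
for element, filler, kind in nodes:
    print(f"{element:<5} {filler:<11} {kind}")
```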


Learning Extraction Patterns

Iterative Learning Algorithm
1. Begin with a set of seed patterns which are known to be good extraction patterns.
2. Compare every other pattern with the ones known to be good.
3. Choose the highest scoring of these and add them to the set of good patterns.
4. Stop if enough patterns have been learned, else repeat from step 2.

[Figure: diagram with the labels Seeds, Candidates, Rank and Patterns]
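The four steps can be sketched as a short loop. This is an illustration under stated assumptions rather than the system described in the talk: the slides do not say how a candidate is scored against the accepted set, so the sketch uses the mean similarity to all accepted patterns, and `toy_similarity` (word overlap) merely stands in for the real similarity measure. All function and parameter names are hypothetical.

```python
def learn_patterns(seeds, candidates, similarity, per_round=1, max_patterns=10):
    """Iteratively grow a set of accepted patterns, starting from the seeds."""
    accepted = list(seeds)                                   # step 1
    remaining = [c for c in candidates if c not in accepted]
    while remaining and len(accepted) < max_patterns:        # step 4
        # Step 2: score each candidate against the accepted patterns
        # (assumption: mean similarity to the accepted set).
        ranked = sorted(
            remaining,
            key=lambda c: sum(similarity(c, a) for a in accepted) / len(accepted),
            reverse=True,
        )
        # Step 3: accept the highest-scoring candidates.
        chosen = ranked[:per_round]
        accepted.extend(chosen)
        remaining = [c for c in remaining if c not in chosen]
    return accepted

def toy_similarity(a, b):
    """Stand-in similarity: word overlap between two space-separated strings."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)

learned = learn_patterns(
    seeds=["repress transcription"],
    candidates=["inhibit transcription", "walk dog", "repress expression gene"],
    similarity=toy_similarity,
    max_patterns=2,
)
print(learned)
```

Passing the similarity function in as a parameter keeps the loop independent of how patterns are represented.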

Pattern Similarity

We determine the similarity between two patterns using a vector space model inspired by that commonly used in IR.
- Each pattern can be represented by a set of pattern element-filler pairs.
- The set of pattern element-filler pairs in a corpus forms the basis for a vector space, where a pattern's value in each dimension is 1 if the pattern contains the pair and 0 otherwise.

The similarity of two patterns a and b can then be computed as:

  sim(a, b) = (a W b^T) / (|a| |b|)

This is the cosine measure augmented with a matrix W which lists the similarity between each pattern element-filler pair.
- The similarity between pattern element-filler pairs is computed using a WordNet similarity measure proposed by Banerjee and Pedersen (2002), referred to as Adapted Lesk.

Pattern Similarity

Extraction Patterns
  a. verb[v/block](subj[n/protein])
  b. verb[v/repress](subj[n/enzyme])
  c. verb[v/promote](subj[n/protein])

Matrix Labels
  1. subj_protein
  2. subj_enzyme
  3. verb_block
  4. verb_repress
  5. verb_promote

Similarity Values
  sim(a, b) = [value not preserved]
  sim(a, c) = 0.55
  sim(b, c) = [value not preserved]

Similarity Matrix
  [Table not preserved]
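A small sketch of the W-weighted cosine, (a W b^T) / (|a| |b|), applied to patterns a and c may make the example easier to follow. The slide's actual matrix is not available here, so the W below is hypothetical: it is the identity matrix plus a single off-diagonal entry of 0.1 between verb_block and verb_promote, a value picked only because it reproduces the reported sim(a, c) = 0.55. All other off-diagonal entries are left at 0 as placeholders, and the helper names are illustrative.

```python
import math
import re

# Captures the element name and the filler of each node, e.g. ("verb", "block").
NODE = re.compile(r"(\w+)\[\w+/([^\]]+)\]")

def pairs(pattern):
    """Element-filler pairs of a pattern, e.g. {'verb_block', 'subj_protein'}."""
    return {f"{element}_{filler}" for element, filler in NODE.findall(pattern)}

def similarity(a, b, labels, W):
    """W-weighted cosine over binary vectors indexed by `labels`."""
    va = [1.0 if label in pairs(a) else 0.0 for label in labels]
    vb = [1.0 if label in pairs(b) else 0.0 for label in labels]
    n = len(labels)
    numerator = sum(va[i] * W[i][j] * vb[j] for i in range(n) for j in range(n))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return numerator / norm if norm else 0.0

labels = ["subj_protein", "subj_enzyme", "verb_block", "verb_repress", "verb_promote"]

# Hypothetical W: identity plus one illustrative entry (NOT the slide's matrix).
W = [[1.0 if i == j else 0.0 for j in range(len(labels))] for i in range(len(labels))]
i, j = labels.index("verb_block"), labels.index("verb_promote")
W[i][j] = W[j][i] = 0.1

a = "verb[v/block](subj[n/protein])"
c = "verb[v/promote](subj[n/protein])"
sim_ac = similarity(a, c, labels, W)
print(round(sim_ac, 2))
```

With W set to the identity matrix the function reduces to the ordinary cosine, so the off-diagonal entries are what let patterns with different but related fillers score above zero.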

Acquiring Patterns

We use this approach to learn patterns containing a known agent or target from the training data.
- The texts are pre-processed to include AGENT and TARGET as semantic class labels.
- We restricted certain terms (e.g. repress) so that only certain domain-specific senses in WordNet were used for similarity calculations.
- At each iteration of the algorithm we accepted up to 4 new patterns whose scores were within 0.95 of the best pattern being accepted.
- The algorithm was allowed to run until no more patterns could be acquired.
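The acceptance rule can be sketched as a small selection function. One assumption is made loudly here: "within 0.95 of the best pattern" is read as "score at least 0.95 times the best score", which the slides do not spell out. The function name, parameter names and toy scores are all hypothetical.

```python
def select_patterns(scores, max_new=4, ratio=0.95):
    """Pick up to `max_new` candidates whose score is >= `ratio` * best score."""
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best = ranked[0][1]
    # Assumption: "within 0.95 of the best" means a multiplicative threshold.
    return [name for name, score in ranked[:max_new] if score >= ratio * best]

# Toy scores, invented for illustration only.
toy_scores = {"p1": 0.80, "p2": 0.78, "p3": 0.77, "p4": 0.70, "p5": 0.40}
chosen = select_patterns(toy_scores)
print(chosen)
```

In this toy run the threshold is 0.95 × 0.80 = 0.76, so p4 and p5 are rejected even though fewer than four patterns have been accepted.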

Seed Patterns

We used the following seed patterns in all experiments:
  verb[v/transcribe](by[n/AGENT]+obj[n/TARGET])
  verb[v/be](of[n/AGENT]+s[n/expression](of[n/TARGET]))
  verb[v/inhibit](obj[n/activity](nn[n/TARGET])+subj[n/AGENT])
  verb[v/bind](mod[r/specifically](to[n/TARGET])+subj[n/AGENT])
  verb[v/block](obj[n/capacity](of[n/TARGET])+subj[n/AGENT])
  verb[v/regulate](obj[n/expression](nn[n/TARGET])+subj[n/AGENT])
  verb[v/require](obj[n/AGENT]+subj[n/gene](nn[n/TARGET]))
  verb[v/repress](obj[n/transcription](of[n/TARGET])+subj[n/AGENT])

Extracting Relations

Text from which we wish to extract relations is processed to produce extraction patterns in the same way as before. Any pattern which matches an acquired pattern is used to extract information.
- The acquired patterns match with AGENT and TARGET matching anything.
- Not all patterns contain both an AGENT and a TARGET, so post-processing links partial relations together.

So, for example:
- The pattern verb[v/stimulates](subj[n/AGENT]+obj[n/TARGET])
- matches against verb[v/stimulates](subj[n/GerE]+obj[n/cotD])
- resulting in the interaction GerE → cotD
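The wildcard behaviour of AGENT and TARGET can be illustrated at the string level. This is only a sketch: it assumes patterns are compared as strings with elements in the same order, whereas the system described above presumably matches dependency structures. The helper names `pattern_to_regex` and `extract` are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Compile an acquired pattern so that AGENT and TARGET match any filler."""
    escaped = re.escape(pattern)
    # Replace the two slot names with named groups that stop at the closing "]".
    escaped = escaped.replace("AGENT", r"(?P<agent>[^\]]+)")
    escaped = escaped.replace("TARGET", r"(?P<target>[^\]]+)")
    return re.compile(escaped)

def extract(pattern, instance):
    """Return (agent, target) if the instance matches the pattern, else None."""
    match = pattern_to_regex(pattern).fullmatch(instance)
    if match is None:
        return None
    groups = match.groupdict()
    # A pattern may contain only one of the two slots; the other is then None.
    return groups.get("agent"), groups.get("target")

acquired = "verb[v/stimulates](subj[n/AGENT]+obj[n/TARGET])"
result = extract(acquired, "verb[v/stimulates](subj[n/GerE]+obj[n/cotD])")
print(result)
```

A pattern with only one slot yields a pair with `None` in the other position, which is the kind of partial relation the post-processing step would need to link up.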


Challenge Evaluation

We submitted three runs for evaluation:
- Baseline: A simple baseline system which pairs all dictionary elements in a sentence with each other in both orders.
- Basic: A system trained on the basic data set without coreference as provided for the LLL-05 challenge.
- Expanded: A system trained on the basic data set augmented with 78 automatically acquired weakly labelled (Craven & Kumlien, 1999) MedLine sentences.

The basic and expanded systems differ only in the training data used to acquire the extraction patterns.
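The baseline run is simple enough to sketch in a few lines. The following is an illustration, not the submitted system: it assumes the dictionary entries found in a sentence are already available as a list of strings, and it drops repeated mentions before pairing, which is a choice made here rather than something the slides state.

```python
from itertools import permutations

def baseline_interactions(entities):
    """All ordered (agent, target) pairs of distinct entries in one sentence."""
    unique = list(dict.fromkeys(entities))  # de-duplicate, keep first-seen order
    return list(permutations(unique, 2))

pairs_found = baseline_interactions(["GerE", "cotD", "cotA"])
for agent, target in pairs_found:
    print(f"{agent} -> {target}")
```

For n distinct entries this proposes n × (n − 1) interactions, which helps explain why the baseline trades precision for recall.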

Challenge Evaluation

System     Precision        Recall          F-measure
Baseline   10.6% (53/500)   98.1% (53/54)   19.1%
Basic      22.2% (6/27)     11.1% (6/54)    14.8%
Expanded   21.6% (8/37)     14.8% (8/54)    17.5%

The baseline system did not achieve 100% recall because some constructs, such as “…A activates or represses B…”, require two interactions between A and B to be recognised.

Both of our approaches have low recall but a precision twice that of the baseline system.

While the performance is low, it seems that supplying extra training data improves the performance of our approach.
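The figures in the table can be recomputed from the counts shown in brackets. The sketch below assumes the F-measure is the balanced harmonic mean of precision and recall, which the slides do not state explicitly; the function name `prf` is illustrative.

```python
def prf(correct, predicted, gold):
    """Precision, recall and balanced F-measure from raw counts."""
    precision = correct / predicted
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# (correct, predicted, gold) taken from the bracketed counts in the table.
runs = {"Baseline": (53, 500, 54), "Basic": (6, 27, 54), "Expanded": (8, 37, 54)}
for name, counts in runs.items():
    p, r, f = prf(*counts)
    print(f"{name:<9} P={p:.1%} R={r:.1%} F={f:.1%}")
```

Under that assumption the Baseline and Basic rows reproduce exactly, while the Expanded F-measure comes out at about 17.6% rather than the 17.5% shown, a difference that is probably down to rounding.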


Analysis

If we examine the algorithm at each iteration instead of just the final result, we can see that:
- The seed patterns are unable to extract a single interaction, i.e. the initial F-measure is zero.
- As the seeds do not extract relations, the performance of the system is solely due to the acquired patterns.
- The algorithm is fairly resilient to the acquisition of bad patterns, i.e. with few exceptions, the F-measure steadily increases.

Analysis

Previously our implementation had been used only to perform sentence filtering (Stevenson & Greenwood, 2005), i.e. determining if a given sentence contains an interaction or not.

Using the acquired patterns to perform sentence filtering results in an F-measure of 47.5%. Given the small amount of training data (181 sentences) this looks promising.
- Nédellec et al. (2001) reported an F-measure of 80% over similar data but using 900 training sentences.


Conclusions & Future Work

We used a pattern representation based on dependency trees and an iterative algorithm to learn representative patterns.
- The seed patterns were not well suited to the task, and future work will include experimenting with different seed sets.
- The small amount of training data seems to hinder our approach (adding 78 extra sentences saw a 2.7 percentage point increase in F-measure).

The similarity measure we adopted seems well suited to this task, where similar meaning can be conveyed in different ways.

Other issues for future work:
- We used MINIPAR to produce the dependency trees. We intend to try other dependency parsers to see if they are more suited to biomedical texts.
- We intend to continue our work on sentence filtering as this would provide a useful first step in any extraction system.

Any Questions? Copies of these slides can be found at:

Bibliography

- Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Fourth International Conference on Computational Linguistics and Intelligent Text Processing (CICLING-02), 2002.
- Mark Craven and Johan Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999.
- Dekang Lin. MINIPAR: a minimalist parser. Maryland Linguistics Colloquium, University of Maryland, College Park, 1999.
- Claire Nédellec, Mohamed Ould Abdel Vetah and Philippe Bessières. Sentence Filtering for Information Extraction in Genomics, a Classification Problem. In Proceedings of the Conference on Practical Knowledge Discovery in Databases (PKDD'2001), 2001.
- Mark Stevenson and Mark A. Greenwood. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 2005.