Learning to Extract Genic Interactions Using Gleaner. LLL05 Workshop, 7 August 2005, ICML 2005, Bonn, Germany. Mark Goadrich, Louis Oliphant and Jude Shavlik.

Presentation transcript:

1/24 Learning to Extract Genic Interactions Using Gleaner LLL05 Workshop, 7 August 2005 ICML 2005, Bonn, Germany Mark Goadrich, Louis Oliphant and Jude Shavlik Department of Computer Sciences University of Wisconsin – Madison USA

2/24 Learning Language in Logic
Biomedical Information Extraction Challenge
  Two tasks: with and without co-reference
  80 sentences for training
  40 sentences for testing
Our approach: Gleaner (ILP '04)
  Fast ensemble ILP algorithm
  Focused on recall and precision evaluation

3/24 A Sample Positive Example
Given: Medical journal abstracts tagged with genic interaction relations
Do: Construct a system to extract genic interaction phrases from unseen text
Example: ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.

4/24 What is a Negative Example?
All unlabeled word pairings?
  Wastes time with irrelevant words
We know the testset will include a dictionary
  Use only unlabeled pairings of words in the dictionary
  106 positive, 414 negative without co-reference
  59 positive, 261 negative with co-reference
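The dictionary-restricted negative examples described on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `make_examples`, the use of ordered (agent, target) pairs, the dictionary entries, and the labelled pair are all hypothetical and are not taken from the LLL data.

```python
from itertools import permutations

def make_examples(sentence_words, dictionary, labelled_positives):
    """Split ordered pairs of dictionary words into positives and negatives.

    Assumption: an example is an ordered (agent, target) pair, and any
    dictionary pairing that is not labelled positive counts as negative.
    """
    candidates = [w for w in sentence_words if w in dictionary]
    positives, negatives = [], []
    for agent, target in permutations(candidates, 2):
        if (agent, target) in labelled_positives:
            positives.append((agent, target))
        else:
            negatives.append((agent, target))
    return positives, negatives

# Hypothetical inputs built from the sample sentence on slide 3.
sentence = "ykuD was transcribed by SigK RNA polymerase from T4 of sporulation".split()
dictionary = {"ykuD", "SigK"}               # assumed dictionary entries
labelled_positives = {("SigK", "ykuD")}     # assumed (agent, target) label

positives, negatives = make_examples(sentence, dictionary, labelled_positives)
```

Words outside the dictionary never appear in a pair, which is what keeps the number of negatives small compared with pairing all words.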

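5/24 Tagging and Parsing
[Diagram: parse of the sample sentence]
  Part-of-speech tags shown: verb, noun, verb, prep, noun
  Phrase structure shown: sentence; noun phrase … verb phrase, prep phrase, noun phrase
  Sentence: ykuD was transcribed by SigK RNA …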

6/24 Some Additional Predicates
High-scoring words in agent phrases
  depend, bind, protein, …
High-scoring words in target phrases
  gene, promote, product
High-scoring words BETWEEN agent & target
  negative, regulate, transcribe, …
Medical Subject Headings (MeSH)
  canonized method for indexing biomedical articles
  in_mesh(RNA), in_mesh(gene)

7/24 Even More Predicates
Lexical Predicates
  internal_caps(Word)
  alphanumeric(Word)
Look-ahead Phrase Predicates
  few_POS_in_phrase(Phrase, POS)
  phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
  phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
  agent_before_target(ExampleID)
  word_pair_in_between_target_phrases(ExampleID, W1, W2)

8/24 Enriched Data From Committee
Link Parser (CMU) creates parse tree
Root lemma of each word (not used)
27 Syntactic Information Predicates
  complement_of_N_N(Word, Word)
  modifier_ADV_V(Word, Word)
  object_V_Passive_N(Word, Word)

9/24 Gleaner
Definition of Gleaner
  One who gathers grain left behind by reapers
Key Ideas of Gleaner
  Use Aleph as underlying ILP clause engine
  Keep wide range of clauses usually discarded
  Create separate theories for different recall ranges

10/24 Aleph - Background
Seed Example
  A positive example that our clause must cover
Bottom Clause
  All predicates which are true about the seed example
[Diagram: seed, agent_target(A,T,S)]

11/24 Aleph - Learning
Aleph learns theories of clauses (Srinivasan, v4, 2003)
  Pick positive seed example, find bottom clause
  Use heuristic search to find best clause
  Pick new seed from uncovered positives and repeat until threshold of positives covered
Theory produces one recall-precision point
Learning complete theories is time-consuming
Can produce ranking with ensembles
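The covering loop summarized on this slide can be illustrated with a small sketch. It is not Aleph itself: clauses are modelled as plain Python predicates, and the bottom-clause construction plus heuristic search are replaced by picking the highest-precision clause from a fixed, hypothetical `clause_pool`. All names below are assumptions made for illustration.

```python
def precision(clause, positives, negatives):
    """Fraction of the examples matched by `clause` that are positive."""
    tp = sum(1 for x in positives if clause(x))
    fp = sum(1 for x in negatives if clause(x))
    return tp / (tp + fp) if tp + fp else 0.0

def learn_theory(positives, negatives, clause_pool, coverage_threshold=1.0):
    """Covering loop: seed, best clause, drop covered positives, repeat."""
    theory = []
    remaining = list(positives)   # positives not yet covered or tried as seed
    covered = 0
    while remaining and covered < coverage_threshold * len(positives):
        seed = remaining.pop(0)
        # Stand-in for "find bottom clause, then heuristic search":
        # only clauses that cover the seed are eligible.
        usable = [c for c in clause_pool if c(seed)]
        if not usable:
            continue              # no clause covers this seed; try another
        best = max(usable, key=lambda c: precision(c, positives, negatives))
        theory.append(best)
        covered += 1 + sum(1 for x in remaining if best(x))
        remaining = [x for x in remaining if not best(x)]
    return theory

def predict(theory, example):
    """A theory calls an example positive if any of its clauses matches."""
    return any(clause(example) for clause in theory)

# Toy data: examples are integers, clauses are simple tests on them.
def even(x): return x % 2 == 0
def big_odd(x): return x > 5 and x % 2 == 1

toy_positives, toy_negatives = [2, 4, 7, 9], [1, 3, 10]
theory = learn_theory(toy_positives, toy_negatives, [even, big_odd])
```

Because the learned theory yields a single yes/no decision per example, it corresponds to one recall-precision point, which is the limitation the slide points out.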

12/24 Gleaner - Background
Rapid Random Restart (Zelezny et al., ILP 2002)
  Stochastic selection of initial clause
  Time-limited local heuristic search
  Randomly choose new initial clause and repeat
[Diagram: seed, initial 1, initial 2]
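A rough sketch of the random-restart idea on this slide follows. It makes several simplifying assumptions that are not in the slides: a clause is a set of literals drawn from the bottom clause, an example is the set of literals true for it, the time limit is replaced by a fixed step count, and F1 is used as the heuristic score.

```python
import random

def covers(clause, example):
    """A clause (set of literals) covers an example containing all of them."""
    return clause <= example

def f1(clause, positives, negatives):
    """Assumed heuristic: harmonic mean of precision and recall."""
    tp = sum(covers(clause, e) for e in positives)
    fp = sum(covers(clause, e) for e in negatives)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / len(positives)
    return 2 * p * r / (p + r)

def rapid_random_restart(bottom, positives, negatives,
                         restarts=10, steps=20, rng=None):
    rng = rng or random.Random(0)
    literals = sorted(bottom)
    best = frozenset()
    best_score = f1(best, positives, negatives)
    for _ in range(restarts):
        # Stochastic selection of an initial clause.
        current = frozenset(l for l in literals if rng.random() < 0.5)
        score = f1(current, positives, negatives)
        # Bounded local search: add or drop one literal at a time.
        for _ in range(steps):
            neighbour = current ^ {rng.choice(literals)}
            n_score = f1(neighbour, positives, negatives)
            if n_score >= score:
                current, score = neighbour, n_score
        if score > best_score:
            best, best_score = current, score
    return best, best_score

# Toy data: literals are single letters; the first positive acts as the seed.
toy_pos = [frozenset("abc"), frozenset("ab")]
toy_neg = [frozenset("bc"), frozenset("c")]
bottom_clause = frozenset("abc")
best_clause, best_score = rapid_random_restart(bottom_clause, toy_pos, toy_neg)
```

Each restart is independent, so the search can abandon a poor starting clause cheaply instead of spending its whole budget around one initial choice.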

13/24 Gleaner - Learning
  Create B bins
  Generate clauses
  Record best per bin
  Repeat for K seeds
[Figure: recall bins; axes: Recall, Precision]
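The binning step on this slide might look roughly like the sketch below. The slide does not say how "best" is measured within a bin; scoring by precision times recall is an assumption here, as are the function names and the toy clauses.

```python
def recall_precision(clause, positives, negatives):
    tp = sum(1 for e in positives if clause(e))
    fp = sum(1 for e in negatives if clause(e))
    recall = tp / len(positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def glean(clauses_per_seed, positives, negatives, num_bins):
    """For each seed, keep the best clause seen in each recall bin.

    clauses_per_seed: one iterable of candidate clauses per seed (K in total).
    Returns K lists of length num_bins; empty bins hold None.
    """
    all_bins = []
    for clauses in clauses_per_seed:
        bins = [None] * num_bins            # each entry: (score, clause)
        for clause in clauses:
            r, p = recall_precision(clause, positives, negatives)
            b = min(int(r * num_bins), num_bins - 1)   # recall 1.0 goes in last bin
            score = p * r                   # assumed scoring function
            if bins[b] is None or score > bins[b][0]:
                bins[b] = (score, clause)
        all_bins.append([entry[1] if entry else None for entry in bins])
    return all_bins

# Toy data: integers as examples, threshold tests as clauses.
def eq1(x): return x == 1
def lt3(x): return x < 3
def lt5(x): return x < 5
def lt7(x): return x < 7

bin_pos, bin_neg = [1, 2, 3, 4], [5, 6]
gleaned = glean([[lt3, lt5, lt7, eq1]], bin_pos, bin_neg, num_bins=2)
```

Keeping one clause per bin, rather than one clause overall, is what retains the wide range of clauses that a standard search would discard.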

14/24 Gleaner - Combining
Combine K clauses per bin
  If at least L of K clauses match, call example positive
How to choose L?
  L=1 then high recall, low precision
  L=K then low recall, high precision
We want a collection of high-precision theories spanning the space of recall levels
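The "at least L of K" rule can be sketched as a simple vote count, with a sweep over L showing the trade-off the slide describes. The function names and toy clauses are illustrative assumptions; precision is reported as 1.0 when nothing is predicted positive, which is a convention chosen for this sketch.

```python
def classify(clauses, example, L):
    """Call the example positive if at least L of the K clauses match it."""
    return sum(1 for c in clauses if c(example)) >= L

def sweep_thresholds(clauses, positives, negatives):
    """Return (L, recall, precision) for every L from 1 to K."""
    points = []
    for L in range(1, len(clauses) + 1):
        tp = sum(classify(clauses, e, L) for e in positives)
        fp = sum(classify(clauses, e, L) for e in negatives)
        recall = tp / len(positives)
        precision = tp / (tp + fp) if tp + fp else 1.0   # assumed convention
        points.append((L, recall, precision))
    return points

# Toy bin with K = 3 clauses over integer examples.
def under3(x): return x < 3
def under5(x): return x < 5
def under7(x): return x < 7

vote_pos, vote_neg = [1, 2, 3, 4], [5, 6]
points = sweep_thresholds([under3, under5, under7], vote_pos, vote_neg)
```

On this toy data, raising L from 1 to K moves from full recall with some false positives toward fewer but cleaner predictions, matching the slide's two extremes.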

15/24 Gleaner - Overlap
Take topmost curve of overlapping theories
[Figure: overlapping curves; axes: Recall, Precision]
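One simple reading of "take the topmost curve" is to pool the points from all bins and keep the highest precision seen at each recall value. The sketch below does only that; it does not interpolate between points, and the function name and sample curves are hypothetical.

```python
def topmost_curve(curves):
    """Merge several recall-precision curves into their upper envelope.

    curves: iterable of lists of (recall, precision) points.
    Returns points sorted by recall, keeping the best precision per recall.
    """
    best = {}
    for curve in curves:
        for recall, precision in curve:
            if precision > best.get(recall, -1.0):
                best[recall] = precision
    return sorted(best.items())

# Two hypothetical per-bin curves that overlap at recall 0.5.
curve_a = [(0.2, 0.9), (0.5, 0.6)]
curve_b = [(0.5, 0.7), (0.8, 0.4)]
merged = topmost_curve([curve_a, curve_b])
```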

16/24 Gleaner - Practical Use
  Generate curve
  User selects recall bin (example shown: Recall = 0.50, Precision = 0.70)
  Return classifications with L of K confidence
[Figure: curve; axes: Recall, Precision]

17/24 Sample Extraction Clause
  agent_target(Agent, Target, Sentence) :-
      several_phrases_in_sentence(Sentence),
      some_wordPOS_in_sentence(Sentence, novelword),
      n(Agent),
      alphabetic(Agent),
      word_parent(Agent, F),
      phrase_contains_internal_cap_word(F, noun, _),
      few_POS_in_phrase(F, novelword),
      in_between_target_phrases(Agent, Target, _),
      n(Target)
[...] Recall, 0.93 Precision on the without-co-reference training set

18/24 Sample Extraction Clause
  agent_target(Agent, Target, Sentence) :-
      avg_length_sentence(Sentence),
      n(Agent),
      word_previous(Target, _),
      in_between_target_phrases(Agent, Target, _)
[...] Recall, 0.49 Precision on the without-co-reference training set

19/24 Experimental Methodology
Used the other trainset as the tuneset in both cases
Testset unlabeled, but dictionary provided
  Included sentences with no positives
  936 total testset examples generated
Parameter Settings
  Gleaner (20 recall bins): seeds = 100, clauses = 25,000
  Aleph (0.75 minimum accuracy): nodes = {1K, 25K}

20/24 LLL Without Co-reference Results
[Figure: results plot; legend: Gleaner Basic, Gleaner Enriched, Aleph Basic 1K]

21/24 LLL With Co-reference Results
[Figure: results plot; legend: Gleaner Basic, Gleaner Enriched, Aleph Basic 1K]

22/24 We Need More Datasets
LLL Challenge task is small
  Would prefer to do cross-validation
  Need labels for testset
Our ILP'04 dataset is open to the community
  ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
Biomedical information-extraction tasks
  Genetic Disorder (Ray and Craven 2001)
  Genia
  BioCreAtIvE

23/24 Conclusions
Contributions
  Develop large amount of background knowledge
  Exploit normally discarded clauses
  Visually present precision and recall trade-off
Proposed Work
  Achieve gains in high-recall areas
  Reduce overfitting when using enriched data
  Increase diversity of learned clauses

24/24 Acknowledgements
USA DARPA Grant F
USA Air Force Grant F
USA NLM Grant 5T15LM
USA NLM Grant 1R01LM
UW Condor Group
David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis