Learning to Extract Genic Interactions Using Gleaner. LLL05 Workshop, 7 August 2005, ICML 2005, Bonn, Germany. Mark Goadrich, Louis Oliphant and Jude Shavlik.


1 1/24 Learning to Extract Genic Interactions Using Gleaner
  LLL05 Workshop, 7 August 2005, ICML 2005, Bonn, Germany
  Mark Goadrich, Louis Oliphant and Jude Shavlik
  Department of Computer Sciences, University of Wisconsin – Madison, USA

2 2/24 Learning Language in Logic
  Biomedical Information Extraction Challenge
  Two tasks: with and without co-reference
  80 sentences for training
  40 sentences for testing
  Our approach: Gleaner (ILP '04)
    Fast ensemble ILP algorithm
    Focused on recall and precision evaluation

3 3/24 A Sample Positive Example
  Given: Medical journal abstracts tagged with genic interaction relations
  Do: Construct a system to extract genic interaction phrases from unseen text
  Example sentence: ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.

4 4/24 What is a Negative Example?
  All unlabeled word pairings?
    Wastes time with irrelevant words
  We know the testset will include a dictionary
    Use only unlabeled pairings of words in dictionary
  106 positive, 414 negative without co-reference
  59 positive, 261 negative with co-reference
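
The dictionary-restricted pairing described on this slide could be sketched roughly as follows. This is only an illustration: the function name `candidate_pairs`, the use of ordered (agent, target) pairs, the two dictionary entries, and the choice of SigK as the agent are all assumptions, not details taken from the slides.

```python
from itertools import permutations

def candidate_pairs(sentence_words, dictionary, positives):
    """Split ordered (agent, target) pairs of dictionary words into positives and negatives."""
    # Only words found in the provided dictionary can be agents or targets.
    entities = [w for w in sentence_words if w in dictionary]
    pos, neg = [], []
    # dict.fromkeys removes repeated words while keeping sentence order.
    for agent, target in permutations(dict.fromkeys(entities), 2):
        if (agent, target) in positives:
            pos.append((agent, target))
        else:
            # Any unlabeled pairing of dictionary words is treated as a negative.
            neg.append((agent, target))
    return pos, neg

words = "ykuD was transcribed by SigK RNA polymerase from T4 of sporulation".split()
dictionary = {"ykuD", "SigK"}      # hypothetical dictionary entries
positives = {("SigK", "ykuD")}     # assumed labeling: SigK is the agent
pos, neg = candidate_pairs(words, dictionary, positives)
```

Restricting candidates to dictionary words keeps the negative set small, which is the motivation the slide gives for not pairing every word with every other word.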

5 5/24 Tagging and Parsing
  [Parse-tree figure: the sentence "ykuD was transcribed by SigK RNA …" with word-level tags (noun, verb, prep) and phrase-level tags (noun phrase, verb phrase, prep phrase)]

6 6/24 Some Additional Predicates
  High-scoring words in agent phrases: depend, bind, protein, …
  High-scoring words in target phrases: gene, promote, product
  High-scoring words BETWEEN agent & target: negative, regulate, transcribe, …
  Medical Subject Headings (MeSH)
    Canonical method for indexing biomedical articles
    in_mesh(RNA), in_mesh(gene)

7 7/24 Even More Predicates
  Lexical Predicates
    internal_caps(Word)
    alphanumeric(Word)
  Look-ahead Phrase Predicates
    few_POS_in_phrase(Phrase, POS)
    phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
    phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
  Relative Location of Phrases
    agent_before_target(ExampleID)
    word_pair_in_between_target_phrases(ExampleID, W1, W2)
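
As a rough illustration of what the two lexical predicates might test, here is a small sketch. The slides give only the predicate names, so both definitions below are guesses at their meaning rather than the authors' actual implementations.

```python
def internal_caps(word):
    # Assumed meaning: an uppercase letter appears after the first character,
    # as in gene and protein names such as "ykuD" or "SigK".
    return any(c.isupper() for c in word[1:])

def alphanumeric(word):
    # Assumed meaning: the word mixes letters and digits, as in "T4".
    has_letter = any(c.isalpha() for c in word)
    has_digit = any(c.isdigit() for c in word)
    return has_letter and has_digit
```

Features like these are cheap to compute per word and give the clause learner a handle on the unusual orthography of gene and protein names.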

8 8/24 Enriched Data From Committee
  Link Parser (CMU) creates parse tree
  Root lemma of each word (not used)
  27 Syntactic Information Predicates
    complement_of_N_N(Word, Word)
    modifier_ADV_V(Word, Word)
    object_V_Passive_N(Word, Word)

9 9/24 Gleaner
  Definition of Gleaner
    One who gathers grain left behind by reapers
  Key Ideas of Gleaner
    Use Aleph as underlying ILP clause engine
    Keep wide range of clauses usually discarded
    Create separate theories for different recall ranges

10 10/24 Aleph - Background
  Seed Example
    A positive example that our clause must cover
  Bottom Clause
    All predicates which are true about seed example
  [Figure labels: seed, agent_target(A,T,S)]

11 11/24 Aleph - Learning
  Aleph learns theories of clauses (Srinivasan, v4, 2003)
    Pick positive seed example, find bottom clause
    Use heuristic search to find best clause
    Pick new seed from uncovered positives and repeat until threshold of positives covered
  Theory produces one recall-precision point
  Learning complete theories is time-consuming
  Can produce ranking with ensembles
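
The seed-and-cover loop on this slide can be sketched generically as below. This is not Aleph's code: `find_best_clause` is a hypothetical callback standing in for bottom-clause construction plus heuristic search, and `min_covered` is an assumed name for the coverage threshold.

```python
def covering_loop(positives, find_best_clause, min_covered=1.0):
    """Greedy covering: learn clauses until enough positives are covered.

    find_best_clause(seed) is a hypothetical callback that returns
    (clause, covered_positives) for the given seed example.
    """
    uncovered = list(positives)
    theory = []
    target = min_covered * len(positives)
    while uncovered and len(positives) - len(uncovered) < target:
        seed = uncovered[0]                    # pick a seed from the uncovered positives
        clause, covered = find_best_clause(seed)
        covered = set(covered) | {seed}        # the clause must cover its own seed
        theory.append(clause)
        uncovered = [p for p in uncovered if p not in covered]
    return theory

# Toy usage: each "clause" covers its seed and the next example.
theory = covering_loop([1, 2, 3, 4], lambda s: ("clause_%d" % s, {s, s + 1}))
half = covering_loop([1, 2, 3, 4], lambda s: ("clause_%d" % s, {s, s + 1}), min_covered=0.5)
```

The loop returns a single theory, which is why one run yields only one recall-precision point.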

12 12/24 Gleaner - Background
  Rapid Random Restart (Zelezny et al., ILP 2002)
    Stochastic selection of initial clause
    Time-limited local heuristic search
    Randomly choose new initial clause and repeat
  [Figure labels: seed, initial 1, initial 2]
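
A minimal sketch of the restart strategy listed above, under stated assumptions: clauses are abstracted to arbitrary objects, the callbacks `random_clause`, `neighbors`, and `score` are hypothetical, and a step budget (`max_steps`) stands in for the time limit so that the example is deterministic.

```python
import random

def rapid_random_restart(random_clause, neighbors, score, restarts, max_steps, rng):
    """Repeated local search from random starting points; returns the best result seen."""
    best = None
    for _ in range(restarts):
        current = random_clause(rng)            # stochastic selection of the initial clause
        for _ in range(max_steps):              # step budget stands in for a time limit
            options = neighbors(current)
            if not options:
                break
            candidate = max(options, key=score)
            if score(candidate) <= score(current):
                break                           # local optimum reached
            current = candidate
        if best is None or score(current) > score(best):
            best = current
    return best

# Toy search space: integers 0..20, with the best "clause" at 7.
best = rapid_random_restart(
    random_clause=lambda r: r.randint(0, 20),
    neighbors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 7) ** 2,
    restarts=5,
    max_steps=50,
    rng=random.Random(0),
)
```

Bounding each local search and restarting from a fresh random point trades exhaustive search for many short, diverse searches, which suits Gleaner's goal of collecting a wide range of clauses.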

13 13/24 Gleaner - Learning
  Create B Bins
  Generate Clauses
  Record Best per Bin
  Repeat for K seeds
  [Figure: precision vs. recall plot]
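
The four steps on this slide could be sketched as follows. The callbacks `generate_clauses` and `evaluate` are hypothetical, and scoring each clause by precision times recall within its bin is an assumption made for the example; the slide says only "Record Best per Bin".

```python
def gleaner_learn(seeds, generate_clauses, evaluate, num_bins):
    """For each seed, keep the best clause found in each of num_bins recall bins."""
    all_bins = []
    for seed in seeds:
        best = [None] * num_bins                       # one slot per recall bin
        for clause in generate_clauses(seed):
            precision, recall = evaluate(clause)
            b = min(int(recall * num_bins), num_bins - 1)
            score = precision * recall                 # assumed per-bin heuristic
            if best[b] is None or score > best[b][0]:
                best[b] = (score, clause)
        all_bins.append([entry[1] if entry else None for entry in best])
    return all_bins

# Toy usage: three clauses with made-up (precision, recall) values, two bins.
stats = {"a": (0.9, 0.10), "b": (0.8, 0.15), "c": (0.5, 0.70)}
bins = gleaner_learn(["seed1"], lambda seed: ["a", "b", "c"], stats.get, num_bins=2)
```

Clauses that a standard covering search would discard (for example, high-recall but lower-precision ones) survive here because they only compete within their own recall bin.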

14 14/24 Gleaner - Combining
  Combine K clauses per bin
    If at least L of K clauses match, call example positive
  How to choose L?
    L=1 then high recall, low precision
    L=K then low recall, high precision
  We want a collection of high precision theories spanning space of recall levels
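
The "at least L of K" rule and its effect on precision and recall can be illustrated with a toy example. The clauses here are plain Python predicates over integers and the labels are invented; only the voting rule itself comes from the slide.

```python
def l_of_k_positive(example, clauses, threshold):
    # Call the example positive if at least `threshold` (L) of the K clauses match.
    votes = sum(1 for clause in clauses if clause(example))
    return votes >= threshold

def precision_recall(examples, labels, clauses, threshold):
    predicted = [l_of_k_positive(x, clauses, threshold) for x in examples]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: K = 3 clauses of increasing strictness; positives are x >= 5.
clauses = [lambda x: x > 2, lambda x: x > 4, lambda x: x > 6]
examples = list(range(10))
labels = [x >= 5 for x in examples]
p_low, r_low = precision_recall(examples, labels, clauses, threshold=1)    # L = 1
p_high, r_high = precision_recall(examples, labels, clauses, threshold=3)  # L = K
```

In this toy setup L = 1 recovers every positive at lower precision, while L = K admits only examples that all clauses agree on, matching the trade-off listed on the slide.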

15 15/24 Gleaner - Overlap
  Take topmost curve of overlapping theories
  [Figure: precision vs. recall plot]
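
One plausible reading of "topmost curve" is the set of recall-precision points that no other point beats on both axes. The sketch below implements that reading; it is an interpretation, and the function name `topmost_curve` and the sample points are invented for the example.

```python
def topmost_curve(curves):
    """Merge several lists of (recall, precision) points and keep the upper envelope."""
    points = sorted({p for curve in curves for p in curve},
                    key=lambda p: (-p[0], -p[1]))      # highest recall first
    frontier, best_precision = [], -1.0
    for recall, precision in points:
        # Keep a point only if it has higher precision than every point
        # with equal or greater recall seen so far.
        if precision > best_precision:
            frontier.append((recall, precision))
            best_precision = precision
    return sorted(frontier)

curve_a = [(0.2, 0.9), (0.5, 0.6), (0.8, 0.3)]
curve_b = [(0.3, 0.8), (0.5, 0.7), (0.7, 0.2)]
top = topmost_curve([curve_a, curve_b])
```

Each surviving point still corresponds to a specific bin and L value, so a user can trace a point on the merged curve back to the theory that produced it.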

16 16/24 Gleaner - Practical Use
  Generate Curve
  User Selects Recall Bin
  Return Classifications With L of K Confidence
  [Figure: precision vs. recall plot, with the point Recall = 0.50, Precision = 0.70 marked]

17 17/24 Sample Extraction Clause
  agent_target(Agent, Target, Sentence) :-
      several_phrases_in_sentence(Sentence),
      some_wordPOS_in_sentence(Sentence, novelword),
      n(Agent),
      alphabetic(Agent),
      word_parent(Agent, F),
      phrase_contains_internal_cap_word(F, noun, _),
      few_POS_in_phrase(F, novelword),
      in_between_target_phrases(Agent, Target, _),
      n(Target).
  0.14 Recall, 0.93 Precision on the without co-reference training set

18 18/24 Sample Extraction Clause
  agent_target(Agent, Target, Sentence) :-
      avg_length_sentence(Sentence),
      n(Agent),
      word_previous(Target, _),
      in_between_target_phrases(Agent, Target, _).
  0.76 Recall, 0.49 Precision on the without co-reference training set

19 19/24 Experimental Methodology
  Used other trainset for tuneset in both cases
  Testset unlabeled, but dictionary provided
    Included sentences with no positives
    936 total testset examples generated
  Parameter Settings
    Gleaner (20 recall bins)
      seeds = 100
      clauses = 25,000
    Aleph (0.75 minimum accuracy)
      nodes = {1K, 25K}

20 20/24 LLL Without Co-reference Results
  [Figure legend: Gleaner Basic, Gleaner Enriched, Aleph Basic 1K]

21 21/24 LLL With Co-reference Results
  [Figure legend: Gleaner Basic, Gleaner Enriched, Aleph Basic 1K]

22 22/24 We Need More Datasets
  LLL Challenge task is small
    Would prefer to do cross-validation
    Need labels for testset
  Our ILP'04 dataset open to community
    ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
  Biomedical information-extraction tasks
    Genetic Disorder (Ray and Craven 2001)
    Genia
    BioCreAtiVe

23 23/24 Conclusions
  Contributions
    Develop large amount of background knowledge
    Exploit normally discarded clauses
    Visually present precision and recall trade-off
  Proposed Work
    Achieve gains in High-Recall areas
    Reduce overfitting when using enriched data
    Increase diversity of learned clauses

24 24/24 Acknowledgements
  USA DARPA Grant F30602-01-2-0571
  USA Air Force Grant F30602-01-2-0571
  USA NLM Grant 5T15LM007359-02
  USA NLM Grant 1R01LM07050-01
  UW Condor Group
  David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jessie Davis

