1
Learning Ensembles of First-Order Clauses That Optimize Precision-Recall Curves
Mark Goadrich Computer Sciences Department University of Wisconsin - Madison Ph. D. Defense August 13th, 2007
2
Biomedical Information Extraction
[Figure: text converted into a structured database; image courtesy of SEER Cancer Training Site]
3
Biomedical Information Extraction
4
Biomedical Information Extraction
NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
5
Outline
- Biomedical Information Extraction
- Inductive Logic Programming
- Gleaner
- Extensions to Gleaner: GleanerSRL, Negative Salt, F-Measure Search, Clause Weighting (time permitting)
6
Inductive Logic Programming
Machine Learning
- Classify data into categories
- Divide data into train and test sets
- Generate hypotheses on the train set, then measure performance on the test set
In ILP, data are
- Objects: person, block, molecule, word, phrase, …
- Relations between them: grandfather, has_bond, is_member, …
7
Seeing Text as Relational Objects
[Figure: predicates at each level of text structure — Word: alphanumeric(…), internal_caps(…), verb(…); Phrase: noun_phrase(…), phrase_child(…, …), phrase_parent(…, …); Sentence: long_sentence(…)]
8
Protein Localization Clause
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
9
ILP Background Seed Example Bottom Clause
Seed Example: a positive example that our clause must cover
Bottom Clause: all predicates that are true about the seed example
Sample rule at level 1: prot_loc(P,L,S) :- alphanumeric(P)
Sample rule at level 2: prot_loc(P,L,S) :- alphanumeric(P), leading_cap(L)
10
Clause Evaluation: Prediction vs. Actual
Predictions fall into four cases: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Focus on positive examples:
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 Score = 2PR / (P + R)
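The three metrics can be computed directly from confusion-matrix counts; a minimal sketch in Python (the TP/FP/FN values in the usage lines are made up for illustration):

```python
def precision(tp, fp):
    # Fraction of positive predictions that are correct
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Fraction of actual positives that are recovered
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r) if p + r else 0.0

p = precision(30, 10)   # 30 / 40 = 0.75
r = recall(30, 30)      # 30 / 60 = 0.5
print(f1(p, r))         # 0.6
```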
11
Protein Localization Clause
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
[Slide reports this clause's Recall (0.15), Precision, and F1 Score]
12
Aleph (Srinivasan '03) learns theories of clauses:
- Pick a positive seed example
- Use heuristic search to find the best clause
- Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered
Sequential learning is time-consuming. Can we reduce time with ensembles, and also increase quality?
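Aleph's outer loop (sequential covering) can be sketched as below; `learn_clause` and `coverage` are hypothetical stand-ins for the heuristic clause search and clause evaluation, not Aleph's actual API, and each learned clause is assumed to cover at least its seed:

```python
import random

def sequential_covering(positives, learn_clause, coverage, threshold=0.95):
    """Cover positives one seed at a time until `threshold` of them
    are covered; each iteration runs one heuristic clause search."""
    uncovered = set(positives)
    theory = []
    target = (1 - threshold) * len(positives)  # allowed leftover positives
    while len(uncovered) > target:
        seed = random.choice(sorted(uncovered))  # pick a positive seed
        clause = learn_clause(seed)              # heuristic search for best clause
        theory.append(clause)
        uncovered -= coverage(clause)            # drop newly covered positives
    return theory
```

With a toy setup where each "clause" covers exactly its seed, `sequential_covering([1, 2, 3, 4], lambda s: s, lambda c: {c})` learns one clause per positive example.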
13
Outline
- Biomedical Information Extraction
- Inductive Logic Programming
- Gleaner
- Extensions to Gleaner: GleanerSRL, Negative Salt, F-Measure Search, Clause Weighting
14
Gleaner (Goadrich et al. ‘04, ‘06)
Definition of gleaner: one who gathers grain left behind by reapers
Key ideas of Gleaner:
- Use Aleph as the underlying ILP clause engine
- Search clause space with Rapid Random Restart
- Keep the wide range of clauses usually discarded
- Create separate theories for diverse recall levels
15
Gleaner - Learning
[Figure: precision-recall space divided into B recall bins; generate clauses and record the best clause per bin]
16
Gleaner - Learning
[Figure: the per-bin search is repeated for seeds 1 through K]
17
Gleaner - Ensemble
[Figure: the clauses from bin 5 are applied to each example (ex1, ex2, ex3, …, ex598–ex601); each example is annotated with the number of matching clauses, e.g. 47, 55, 12]
18
Gleaner - Ensemble
[Table: examples sorted by score, with the precision and recall of thresholding at that score; e.g. pos3: prot_loc(…) score 55, precision 1.00, recall 0.05 … neg15: prot_loc(…) score 16, precision 0.12, recall 0.90]
19
Gleaner - Overlap
For each bin, take the topmost precision-recall curve.
[Figure: overlapping per-bin PR curves]
20
How to Use Gleaner (Version 1)
1. Generate tuneset curve
2. User selects recall bin
3. Return testset classifications ordered by their score
Example: Recall = 0.50, Precision = 0.70
21
Gleaner Algorithm
1. Divide precision-recall space into B bins
2. For K positive seed examples:
   - Perform RRR search with the precision × recall heuristic
   - Save the best clause found in each bin b
3. For each bin b:
   - Combine the clauses in b to form theory_b
   - Find the L-of-K threshold for theory_b which performs best in bin b on the tuneset
4. Evaluate the thresholded theories on the testset
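The per-bin bookkeeping and the L-of-K vote can be sketched like this (bin count, scoring, and data structures are illustrative assumptions, not the thesis implementation):

```python
def update_bins(bins, clause, recall, score, B=20):
    """Keep only the highest-scoring clause seen so far in each of
    B recall bins (a sketch of Gleaner's per-bin bookkeeping)."""
    b = min(int(recall * B), B - 1)   # map recall in [0,1] to a bin index
    if bins.get(b) is None or score > bins[b][1]:
        bins[b] = (clause, score)
    return bins

def l_of_k(example_matches, L):
    """An example is classified positive if at least L of the K clauses
    in a bin's theory match it."""
    return sum(example_matches) >= L
```

For instance, a clause with recall 0.42 lands in bin 8 of 20, and it is replaced only when a later clause in that bin scores higher.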
22
Aleph Ensembles (Dutra et al ‘02)
Compare to ensembles of theories. Ensemble algorithm:
- Use K different initial seeds
- Learn K theories, each containing C rules
- Rank examples by the number of theories that classify them positive
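Ranking by ensemble votes can be sketched as follows, with theories as stand-in boolean predicates (an assumption for illustration):

```python
def rank_by_votes(examples, theories):
    """Rank examples by how many of the K theories classify them
    positive (most votes first)."""
    votes = {ex: sum(t(ex) for t in theories) for ex in examples}
    return sorted(examples, key=lambda ex: votes[ex], reverse=True)

# Two toy "theories": example 7 gets 2 votes, 3 gets 1, -1 gets 0
theories = [lambda x: x > 0, lambda x: x > 5]
print(rank_by_votes([3, 7, -1], theories))  # [7, 3, -1]
```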
23
YPD Protein Localization
Hand-labeled dataset (Ray & Craven '01)
- 7,245 sentences from 871 abstracts
- Examples are phrase-phrase combinations: 1,810 positive & 279,154 negative
1.6 GB of background knowledge
- Structural, statistical, lexical and ontological
- In total, 200+ distinct background predicates
Performed five-fold cross-validation
24
Evaluation Metrics Area Under Precision-Recall Curve (AUC-PR)
- All curves standardized to cover the full recall range
- Averaged AUC-PR over 5 folds
Number of clauses considered: a rough estimate of time
25
PR Curves - 100,000 Clauses
26
Protein Localization Results
27
Other Relational Datasets
Genetic Disorder (Ray & Craven '01): 233 positive & 103,959 negative
Protein Interaction (Bunescu et al. '04): 799 positive & 76,678 negative
Advisor (Richardson and Domingos '04): students, professors, courses, papers, etc.; 113 positive & 2,711 negative
28
Genetic Disorder Results
29
Protein Interaction Results
30
Advisor Results
31
Gleaner Summary
Gleaner makes use of clauses that are not the highest-scoring ones, for improved speed and quality.
Issues with Gleaner:
- Output is a PR curve, not a probability
- Redundant clauses across seeds
- L-of-K clause combination
32
Outline
- Biomedical Information Extraction
- Inductive Logic Programming
- Gleaner
- Extensions to Gleaner: GleanerSRL, Negative Salt, F-Measure Search, Clause Weighting
33
Estimating Probabilities - SRL
Given highly skewed relational datasets, produce accurate probability estimates. Gleaner only produces PR curves.
34
GleanerSRL Algorithm (Goadrich ‘07)
GleanerSRL Algorithm
1. Divide precision-recall space into B bins
2. For K positive seed examples: perform RRR search with the precision × recall heuristic, saving the best clause found in each bin b
3. For each bin b, create propositional feature vectors from the bin's clauses
4. Learn scores with an SVM or other propositional learning algorithm
5. Calibrate scores into probabilities
6. Evaluate probabilities with cross entropy
35
GleanerSRL Algorithm
36
Learning with Gleaner
[Figure: create B bins, generate clauses, record the best clause per bin; repeat for K seeds]
37
Creating Feature Vectors
[Figure: the clauses from bin 5 converted into per-example feature vectors — K Boolean features (one per clause) plus a binned count feature, e.g. ex1: prot_loc(…)]
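Building the feature vectors can be sketched as below; the clauses are stand-in predicates, and the `bin_counts` variant mirrors the binned count feature (all names here are assumptions):

```python
def feature_vector(example, clauses, bin_counts=False):
    """Boolean features: one per saved clause, 1 if the clause
    matches the example, else 0."""
    matches = [1 if c(example) else 0 for c in clauses]
    if bin_counts:
        return [sum(matches)]   # binned variant: count of matching clauses
    return matches

# Two toy "clauses" standing in for learned first-order clauses
clauses = [lambda x: x % 2 == 0, lambda x: x > 3]
print(feature_vector(6, clauses))                  # [1, 1]
print(feature_vector(6, clauses, bin_counts=True)) # [2]
```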
38
Learning Scores via SVM
39
Calibrating Probabilities
Use isotonic regression (Zadrozny & Elkan '03) to transform SVM scores into probabilities.
[Figure: examples sorted by SVM score, with class labels pooled into probability steps 0.00, 0.50, 0.66, 1.00]
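Isotonic regression can be fit with the pool-adjacent-violators (PAV) algorithm; a minimal sketch, assuming the examples are already sorted by SVM score and `ys` are their 0/1 class labels:

```python
def isotonic_fit(ys):
    """Pool Adjacent Violators: fit a non-decreasing sequence of
    probabilities to ys, minimizing squared error."""
    blocks = []                     # each block: [label sum, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge while the previous block's mean >= the new block's mean
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)     # expand each block to its mean
    return out

print(isotonic_fit([0, 1, 0, 1, 1]))  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

Each example's calibrated probability is the mean label of the block it falls into, so the output is guaranteed non-decreasing in the SVM score.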
40
GleanerSRL Results for Advisor
(Davis et al. 05) (Davis et al. 07)
41
Outline
- Biomedical Information Extraction
- Inductive Logic Programming
- Gleaner
- Extensions to Gleaner: GleanerSRL, Negative Salt, F-Measure Search, Clause Weighting
42
Diversity of Gleaner Clauses
43
Negative Salt
Seed Example: a positive example that our clause must cover
Salt Example: a negative example that our clause should avoid
44
Gleaner Algorithm (with Negative Salt)
1. Divide precision-recall space into B bins
2. For K positive seed examples:
   - Select a negative salt example
   - Perform RRR search with the salt-avoiding heuristic
   - Save the best clause found in each bin b
3. For each bin b:
   - Combine the clauses in b to form theory_b
   - Find the L-of-K threshold for theory_b which performs best in bin b on the tuneset
4. Evaluate the thresholded theories on the testset
45
Diversity of Negative Salt
46
Effect of Salt on theory_b Choice
47
Negative Salt AUC-PR
48
Outline
- Biomedical Information Extraction
- Inductive Logic Programming
- Gleaner
- Extensions to Gleaner: GleanerSRL, Negative Salt, F-Measure Search, Clause Weighting
49
Gleaner Algorithm (with F-Measure Search)
1. Divide precision-recall space into B bins
2. For K positive seed examples:
   - Perform RRR search with the F-Measure heuristic (replacing precision × recall)
   - Save the best clause found in each bin b
3. For each bin b:
   - Combine the clauses in b to form theory_b
   - Find the L-of-K threshold for theory_b which performs best in bin b on the tuneset
4. Evaluate the thresholded theories on the testset
50
RRR Search Heuristic
The heuristic function directs the RRR search. Direction can be provided through the F_β measure, F_β = (1 + β²)·P·R / (β²·P + R): low values of β encourage precision, and high values of β encourage recall.
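The F-measure trade-off can be written as F_β = (1 + β²)·P·R / (β²·P + R); a small sketch showing how β shifts the heuristic between the extremes used on the next slides (F0.01, F1, F100):

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall:
    beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.2
print(f_beta(p, r, 1))      # 0.32 (balanced F1)
print(f_beta(p, r, 0.01))   # ~0.80, tracks precision
print(f_beta(p, r, 100))    # ~0.20, tracks recall
```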
51
F0.01 Measure Search
52
F1 Measure Search
53
F100 Measure Search
54
F Measure AUC-PR Results
Genetic Disorder Protein Localization
55
Weighting Clauses: alter the L-of-K combination in Gleaner
Within a single theory:
- Cumulative weighting schemes are successful
- Precision weighting is the highest-scoring scheme
Within Gleaner:
- Precision weighting beats equal-weighted voting and Naïve Bayes
- Significant results on the genetic-disorder dataset
56
Weighting Clauses
Schemes for combining an example's matching clauses (e.g. the clauses from bin 5):
- Cumulative: ∑(precision of each matching clause), ∑(recall of each matching clause), or ∑(F1 measure of each matching clause)
- Ranked list: max(precision of each matching clause)
- Weighted vote: ave(precision of each matching clause)
- Naïve Bayes and TAN: learn a probability for each example
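The non-probabilistic schemes can be sketched as scoring functions over the tuneset precisions of an example's matching clauses (function and scheme names here are my own, for illustration):

```python
def score_example(matching, scheme="cumulative_precision"):
    """Score an example from the precisions of its matching clauses.
    `matching` is a list of per-clause tuneset precisions."""
    if not matching:
        return 0.0
    if scheme == "cumulative_precision":
        return sum(matching)                  # cumulative: sum of precisions
    if scheme == "ranked_list":
        return max(matching)                  # best single matching clause
    if scheme == "weighted_vote":
        return sum(matching) / len(matching)  # average precision
    raise ValueError(scheme)

# An example matched by two clauses with precisions 0.5 and 0.7
print(score_example([0.5, 0.7]))                   # 1.2
print(score_example([0.5, 0.7], "ranked_list"))    # 0.7
print(score_example([0.5, 0.7], "weighted_vote"))  # 0.6
```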
57
Dominance Results
[Table: statistically significant dominance of weighting scheme i over scheme j]
Precision weighting is never dominated; Naïve Bayes is competitive with the cumulative schemes.
58
Weighting Gleaner Results
59
Conclusions and Future Work
Gleaner is a flexible and fast ensemble algorithm for highly skewed ILP datasets.
Other work:
- Proper interpolation of PR space (Goadrich et al. '04, '06)
- Relationship of PR and ROC curves (Davis and Goadrich '06)
Future work:
- Explore Gleaner on propositional datasets
- Learn a heuristic function for diversity (Oliphant and Shavlik '07)
60
Acknowledgements USA DARPA Grant F30602-01-2-0571
USA Air Force Grant F
USA NLM Grant 5T15LM
USA NLM Grant 1R01LM
UW Condor Group
Jude Shavlik, Louis Oliphant, David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Patricia Brennan, AnHai Doan, Jesse Davis, Frank DiMaio, Ameet Soni, Irene Ong, Laura Goadrich, all 6th Floor MSCers