
Learning Ensembles of First-Order Clauses That Optimize Precision-Recall Curves
Mark Goadrich, Computer Sciences Department, University of Wisconsin - Madison
Ph.D. Defense, August 13th, 2007

Biomedical Information Extraction
[Figure: unstructured biomedical text converted into a structured database; image courtesy of SEER Cancer Training Site]

Biomedical Information Extraction

NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.

Outline
Biomedical Information Extraction
Inductive Logic Programming
Gleaner
Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting (time permitting)

Inductive Logic Programming
Machine Learning
– Classify data into categories
– Divide data into train and test sets
– Generate hypotheses on the train set and then measure performance on the test set
In ILP, data are Objects …
– person, block, molecule, word, phrase, …
… and Relations between them
– grandfather, has_bond, is_member, …

Seeing Text as Relational Objects
[Figure: Word, Phrase and Sentence objects with predicates such as alphanumeric(…), internal_caps(…), verb(…), noun_phrase(…), long_sentence(…), and relations phrase_child(…, …), phrase_parent(…, …)]

Protein Localization Clause
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).

ILP Background
Seed Example
– A positive example that our clause must cover
Bottom Clause
– All predicates which are true about the seed example
[Figure: clause refinement starting from the seed]
prot_loc(P,L,S)
prot_loc(P,L,S) :- alphanumeric(P)
prot_loc(P,L,S) :- alphanumeric(P), leading_cap(L)

Clause Evaluation
Prediction vs Actual
– Positive or Negative, True or False: TP, FP, TN, FN
Focus on positive examples
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 Score = 2PR / (P + R)
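
As a quick illustration of these definitions (not part of the original slides), here is a minimal Python sketch; the confusion-matrix counts passed in at the end are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the slide's three metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for one clause: 120 true positives, 400 false positives, 115 false negatives
print(precision_recall_f1(tp=120, fp=400, fn=115))
```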

Protein Localization Clause
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
Recall = 0.51, Precision = 0.23, F1 Score ≈ 0.32

Aleph (Srinivasan ‘03)
Aleph learns theories of clauses
– Pick a positive seed example
– Use heuristic search to find the best clause
– Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered
Sequential learning is time-consuming
Can we reduce time with ensembles?
And also increase quality?
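
For readers unfamiliar with this sequential covering loop, here is a schematic Python sketch under stated assumptions: learn_best_clause and the clause's covers method are hypothetical placeholders, not Aleph's actual interface.

```python
def sequential_covering(positives, negatives, learn_best_clause, coverage=0.95):
    """Schematic Aleph-style loop: pick a seed, learn a clause, remove covered positives."""
    theory, uncovered = [], set(positives)
    stop_at = (1 - coverage) * len(positives)          # threshold of positives left uncovered
    while len(uncovered) > stop_at:
        seed = next(iter(uncovered))                   # a positive seed example
        clause = learn_best_clause(seed, uncovered, negatives)  # heuristic search from the seed
        covered = {ex for ex in uncovered if clause.covers(ex)}
        if not covered:
            break                                      # no progress, stop early
        theory.append(clause)
        uncovered -= covered
    return theory
```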

Outline
Biomedical Information Extraction
Inductive Logic Programming
Gleaner
Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Gleaner (Goadrich et al. ‘04, ‘06)
Definition of Gleaner
– One who gathers grain left behind by reapers
Key Ideas of Gleaner
– Use Aleph as underlying ILP clause engine
– Search clause space with Rapid Random Restart
– Keep wide range of clauses usually discarded
– Create separate theories for diverse recall

Gleaner - Learning
[Figure: precision-recall space divided into B recall bins]
– Create B Bins
– Generate Clauses
– Record Best per Bin

Gleaner - Learning
[Figure: the search is repeated from Seed 1, Seed 2, Seed 3, …, Seed K, recording the best clause per recall bin for each seed]

Gleaner - Ensemble
[Figure: examples ex1: prot_loc(…), ex2: prot_loc(…), …, ex601: prot_loc(…), each labeled Pos or Neg, are scored by how many of the clauses from bin 5 match them; e.g. ex2: prot_loc(…) scores 47]

Gleaner - Ensemble
[Figure: examples (pos3, neg28, pos2, neg4, …, neg475, pos9, neg15) ranked by score, with precision and recall computed at each score threshold to place points in PR space]

Gleaner - Overlap
For each bin, take the topmost curve
[Figure: overlapping per-bin curves in Recall-Precision space]

How to Use Gleaner (Version 1)
– Generate Tuneset Curve
– User Selects Recall Bin
– Return Testset Classifications Ordered By Their Score
[Figure: tuneset PR curve; at the selected bin, Recall = 0.50 and Precision = 0.70]

Gleaner Algorithm
Divide space into B bins
For K positive seed examples
– Perform RRR search with precision x recall heuristic
– Save best clause found in each bin b
For each bin b
– Combine clauses in b to form theory b
– Find the L-of-K threshold for the theory which performs best in bin b on the tuneset
Evaluate thresholded theories on testset
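
A condensed Python sketch of this loop follows (my own paraphrase, not the released Gleaner code); rrr_search, recall_bin, and the clause attributes precision, recall, and covers are assumed helpers.

```python
def gleaner_learning(seeds, num_bins, rrr_search, recall_bin):
    """Keep, for every recall bin, the best clause (by precision x recall) found from each seed."""
    best = [[None] * num_bins for _ in seeds]          # best[k][b]: best clause for seed k, bin b
    for k, seed in enumerate(seeds):
        for clause in rrr_search(seed):                # Rapid Random Restart search from this seed
            b = recall_bin(clause, num_bins)
            score = clause.precision * clause.recall
            if best[k][b] is None or score > best[k][b].precision * best[k][b].recall:
                best[k][b] = clause
    return best

def bin_score(example, bin_clauses):
    """An example's score for one bin: how many of the K kept clauses match it
    (this count is what the L-of-K threshold is applied to)."""
    return sum(1 for c in bin_clauses if c is not None and c.covers(example))
```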

Aleph Ensembles (Dutra et al. ‘02)
Compare to ensembles of theories
Ensemble Algorithm
– Use K different initial seeds
– Learn K theories containing C rules
– Rank examples by the number of theories that cover them

YPD Protein Localization
Hand-labeled dataset (Ray & Craven ’01)
– 7,245 sentences from 871 abstracts
– Examples are phrase-phrase combinations
  – 1,810 positive & 279,154 negative
1.6 GB of background knowledge
– Structural, Statistical, Lexical and Ontological
– In total, 200+ distinct background predicates
Performed five-fold cross-validation

Evaluation Metrics
Area Under Precision-Recall Curve (AUC-PR)
– All curves standardized to cover full recall range
– Averaged AUC-PR over 5 folds
Number of clauses considered
– Rough estimate of time
[Figure: PR curve with the area underneath it shaded]
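
As a rough illustration (not from the talk), per-fold averaging of AUC-PR could look like the sketch below; it uses scikit-learn's average precision as a stand-in, whereas the results in the talk use the PR interpolation of Davis and Goadrich (2006).

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_auc_pr(fold_labels, fold_scores):
    """Average an AUC-PR estimate over cross-validation folds (labels are 0/1, scores real-valued)."""
    return float(np.mean([average_precision_score(y, s)
                          for y, s in zip(fold_labels, fold_scores)]))
```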

PR Curves - 100,000 Clauses

Protein Localization Results

Other Relational Datasets
Genetic Disorder (Ray & Craven ’01)
– 233 positive & 103,959 negative
Protein Interaction (Bunescu et al. ‘04)
– 799 positive & 76,678 negative
Advisor (Richardson and Domingos ‘04)
– Students, Professors, Courses, Papers, etc.
– 113 positive & 2,711 negative

Genetic Disorder Results

Protein Interaction Results

Advisor Results

Gleaner Summary
Gleaner makes use of clauses that are not the highest-scoring ones, for improved speed and quality
Issues with Gleaner
– Output is a PR curve, not a probability
– Redundant clauses across seeds
– L-of-K clause combination

Outline
Biomedical Information Extraction
Inductive Logic Programming
Gleaner
Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Estimating Probabilities - SRL
– Given highly skewed relational datasets
– Produce accurate probability estimates
– Gleaner only produces PR curves

GleanerSRL Algorithm (Goadrich ‘07)
Divide space into B bins
For K positive seed examples
– Perform RRR search with precision x recall heuristic
– Save best clause found in each bin b
For each bin b
– Combine clauses in b to form theory b
Create propositional feature-vectors
Learn scores with SVM or other propositional learning algorithms
Calibrate scores into probabilities
Evaluate probabilities with Cross Entropy
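
The last step scores the calibrated probabilities with cross entropy; a standard definition is sketched here for concreteness (this is not GleanerSRL's own evaluation code).

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```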

GleanerSRL Algorithm

Learning with Gleaner
[Figure: precision-recall space divided into recall bins]
– Create B Bins
– Generate Clauses
– Record Best per Bin
– Repeat for K seeds

Creating Feature Vectors
[Figure: for example ex1: prot_loc(…), the clauses from bin 5 yield one binned count feature (here 12) plus K Boolean clause-match features, with the Pos/Neg labels kept for training]
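
My reading of that encoding, as a hedged sketch (the exact feature layout in GleanerSRL may differ); clauses_per_bin and the covers method are assumed inputs.

```python
def feature_vector(example, clauses_per_bin):
    """Per recall bin: K Boolean clause-match features plus the count of matching clauses."""
    features = []
    for clauses in clauses_per_bin:                    # one group of K clauses per bin
        matches = [int(c.covers(example)) for c in clauses]
        features.extend(matches)                       # K Boolean features
        features.append(sum(matches))                  # count of matching clauses for this bin
    return features
```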

Learning Scores via SVM

Calibrating Probabilities
Use Isotonic Regression (Zadrozny & Elkan ‘03) to transform SVM scores into probabilities
[Figure: examples sorted by SVM score, with class labels and the calibrated probability assigned to each score range]
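
A minimal sketch of that calibration step, using scikit-learn's isotonic regression as a stand-in for the implementation used in the talk; tuneset scores and 0/1 labels are assumed to be available.

```python
from sklearn.isotonic import IsotonicRegression

def calibrate_scores(tune_scores, tune_labels, test_scores):
    """Fit a monotone map from SVM score to P(positive) on the tuneset, then apply it to the testset."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(tune_scores, tune_labels)
    return iso.predict(test_scores)
```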

GleanerSRL Results for Advisor (Davis et al. 05) (Davis et al. 07)

Outline
Biomedical Information Extraction
Inductive Logic Programming
Gleaner
Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Diversity of Gleaner Clauses

Negative Salt
Seed Example
– A positive example that our clause must cover
Salt Example
– A negative example that our clause should avoid
[Figure: clause search space anchored at the seed prot_loc(P,L,S) and steered away from the salt example]

Gleaner Algorithm (with Negative Salt)
Divide space into B bins
For K positive seed examples
– Select Negative Salt example
– Perform RRR search with salt-avoiding heuristic
– Save best clause found in each bin b
For each bin b
– Combine clauses in b to form theory b
– Find the L-of-K threshold for the theory which performs best in bin b on the tuneset
Evaluate thresholded theories on testset
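
One plausible form of the salt-avoiding heuristic, sketched only for intuition (the thesis may define it differently); the penalty factor and the clause interface are assumptions.

```python
def salt_avoiding_heuristic(clause, salt_example, penalty=0.5):
    """Score a clause by precision x recall, discounted if it also covers the negative salt example."""
    score = clause.precision * clause.recall
    return score * penalty if clause.covers(salt_example) else score
```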

Diversity of Negative Salt

Effect of Salt on Theory m Choice

Negative Salt AUC-PR

Outline
Biomedical Information Extraction
Inductive Logic Programming
Gleaner
Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Gleaner Algorithm (F-Measure Search)
Divide space into B bins
For K positive seed examples
– Perform RRR search with F-Measure heuristic (in place of the precision x recall heuristic)
– Save best clause found in each bin b
For each bin b
– Combine clauses in b to form theory b
– Find the L-of-K threshold for the theory which performs best in bin b on the tuneset
Evaluate thresholded theories on testset

RRR Search Heuristic
– Heuristic function directs RRR search
– Can provide direction through the F Measure
– Low values of β encourage Precision
– High values of β encourage Recall
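
For reference, the standard Fβ formula behind the F0.01, F1, and F100 searches on the following slides is Fβ = (1 + β²)PR / (β²P + R); a small sketch follows (my own illustration, not code from the talk).

```python
def f_beta(precision, recall, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); small beta favors precision, large beta favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# e.g. f_beta(0.9, 0.2, 0.01) is close to the precision, while f_beta(0.9, 0.2, 100) is close to the recall
```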

F 0.01 Measure Search

F 1 Measure Search

F 100 Measure Search

F Measure AUC-PR Results
[Figure: AUC-PR results for the Genetic Disorder and Protein Localization datasets]

Weighting Clauses
Alter the L-of-K combination in Gleaner
Within a Single Theory
– Cumulative weighting schemes are successful
– Precision is the highest-scoring scheme
Within Gleaner
– Precision beats Equal Weighted and Naïve Bayes
– Significant results on the genetic-disorder dataset

Weighting Clauses
[Figure: example ex1: prot_loc(…) matched against the clauses from a bin, each clause carrying a weight (W1, W3, W4, …)]
Cumulative
– ∑(precision of each matching clause)
– ∑(recall of each matching clause)
– ∑(F1 measure of each matching clause)
Naïve Bayes and TAN
– learn a probability for each example
Ranked List
– max(precision of each matching clause)
Weighted Vote
– ave(precision of each matching clause)
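
Two of these combination schemes, sketched in Python for concreteness (hedged paraphrases of the slide, with an assumed clause interface carrying a tuneset precision attribute and a covers method).

```python
def cumulative_precision(example, clauses):
    """Cumulative weighting: sum the tuneset precision of every clause matching the example."""
    return sum(c.precision for c in clauses if c.covers(example))

def ranked_list(example, clauses):
    """Ranked-list weighting: the maximum precision among the matching clauses."""
    matched = [c.precision for c in clauses if c.covers(example)]
    return max(matched) if matched else 0.0
```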

Dominance Results
– Statistically significant dominance in (i, j)
– Precision is never dominated
– Naïve Bayes competitive with cumulative

Weighting Gleaner Results

Conclusions and Future Work
Gleaner is a flexible and fast ensemble algorithm for highly skewed ILP datasets
Other Work
– Proper interpolation of PR Space (Goadrich et al. ‘04, ‘06)
– Relationship of PR and ROC Curves (Davis and Goadrich ‘06)
Future Work
– Explore Gleaner on propositional datasets
– Learn heuristic function for diversity (Oliphant and Shavlik ‘07)

Acknowledgements
– USA DARPA Grant F
– USA Air Force Grant F
– USA NLM Grant 5T15LM
– USA NLM Grant 1R01LM
– UW Condor Group
– Jude Shavlik, Louis Oliphant, David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Patricia Brennan, AnHai Doan, Jesse Davis, Frank DiMaio, Ameet Soni, Irene Ong, Laura Goadrich, all 6th Floor MSCers