
1 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by a grant from the National Institutes of Health (GM066387). Assessing the Performance of Macromolecular Sequence Classifiers Cornelia Caragea (cornelia@cs.iastate.edu) Iowa State University Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar October 15, 2007

2 Background and Motivation
• Machine learning methods offer some of the most cost-effective approaches to building predictive models
• One problem: multiple approaches
• Needed: a way to compare the effectiveness of different predictive classifiers
• Difficulty: different data selection and evaluation procedures

3 Outline
• Macromolecular Sequence Classification
• Performance Evaluation
  • Window-Based Cross-Validation
  • Sequence-Based Cross-Validation
• Experiments
• Conclusions

4 Macromolecular Sequence Classification
• Predict a label for each element in a given sequence
• Example: identify post-translational modification residues
[Figure: peptide chain drawn from the N-terminus (H3N+) to the C-terminus (COO-), with individual residues marked "Glycosylated?" and "Phosphorylated?"]

5 Macromolecular Sequence Classification
• Example: identify RNA-binding residues (1T0K_B)
Sequence: SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA
Class:    0000000000000000111110010000000000000001100100000000000000000000010000000001111100000000000000000

6 Macromolecular Sequence Classification
[Diagram: All Data is split into Training Data and Test Data; the Learning System turns the Training Data into the Resulting Classifier; validation on the Test Data gives the performance on the test set]

7 Macromolecular Sequence Classification
• Sliding Window Approach:
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class:    1111110011111110011111001011111100000001111101000000
Each window is paired with the class label of its target residue (the center of the window):
VKKFGGEVVKAGNIL,0
KKFGGEVVKAGNILV,0
KFGGEVVKAGNILVR,1
FGGEVVKAGNILVRQ,1
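
The sliding window step above can be sketched in a few lines of Python. This is only an illustration: the function name `extract_windows` is hypothetical, the window size of 15 is read off the example windows on the slide, and skipping residues too close to either end of the sequence is an assumption, since the slides do not say how the termini are handled.

```python
from typing import List, Tuple


def extract_windows(sequence: str, labels: str, size: int = 15) -> List[Tuple[str, int]]:
    """Return (window, label) pairs; the label is that of the window's center residue."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if size % 2 == 0:
        raise ValueError("window size must be odd so that a center residue exists")
    half = size // 2
    windows = []
    # Assumption: residues that cannot be the center of a full window are skipped.
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        windows.append((window, int(labels[center])))
    return windows


# Sequence and class string taken from the slide.
sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
labels = "1111110011111110011111001011111100000001111101000000"
windows = extract_windows(sequence, labels)

# The four windows listed on the slide start at residue V (0-based position 8).
for window, label in windows[8:12]:
    print(f"{window},{label}")
```

Under these assumptions the loop prints the same four window/label pairs that appear on the slide.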


9 Performance Evaluation
K-Fold Cross-Validation:
[Diagram: the data is partitioned into subsets S1, S2, ..., Sk-1, Sk; classifier C is learned on k-1 of the subsets and evaluated on the remaining one; repeat k times]
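
As a rough sketch of the k-fold loop, the following Python code partitions a list of examples into k subsets, learns on k-1 of them, evaluates on the held-out one, and averages the k scores. The names `k_fold_cv`, `learn_majority`, and `accuracy` are hypothetical, the majority-class learner is only a stand-in for a real classifier, and the folds are formed by simple striding (a real experiment would normally shuffle first).

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]  # (input, class label)


def k_fold_cv(data: Sequence[Example], k: int,
              learn: Callable[[List[Example]], object],
              evaluate: Callable[[object, List[Example]], float]) -> float:
    """Average the evaluation score over k train/test splits."""
    if not 2 <= k <= len(data):
        raise ValueError("k must be between 2 and the number of examples")
    # Partition the data into k disjoint subsets S1..Sk.
    folds = [list(data[i::k]) for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # Learn on the other k-1 subsets, evaluate on the held-out subset.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classifier = learn(train)
        scores.append(evaluate(classifier, test))
    return mean(scores)


def learn_majority(train: List[Example]) -> int:
    """Stand-in learner: always predict the most frequent training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)


def accuracy(classifier: object, test: List[Example]) -> float:
    """Fraction of test examples whose label equals the constant prediction."""
    return sum(1 for _, y in test if y == classifier) / len(test)


# Hypothetical toy data: every fifth example is positive.
data = [(f"w{i}", 1 if i % 5 == 0 else 0) for i in range(20)]
score = k_fold_cv(data, k=4, learn=learn_majority, evaluate=accuracy)
print(score)
```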

10 Window-Based Cross-Validation
Procedure:
• Extract windows from all sequences in the dataset
• Partition the set of windows into k disjoint subsets
• Perform standard cross-validation
[Diagram: k-fold cross-validation in which the subsets S1, ..., Sk contain windows; learn classifier C, evaluate classifier C, repeat k times]
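
A minimal sketch of the window-based partitioning, assuming windows are pooled across all sequences and shuffled before being dealt into k subsets. The function name `window_based_folds`, the toy sequences, and the window size of 3 are all hypothetical and chosen for brevity; each window keeps its sequence id only so that its origin can be inspected afterwards.

```python
import random
from typing import Dict, List, Tuple

Window = Tuple[str, str, int]  # (sequence id, window, label of center residue)


def window_based_folds(dataset: Dict[str, Tuple[str, str]], k: int,
                       size: int = 15, seed: int = 0) -> List[List[Window]]:
    half = size // 2
    pool: List[Window] = []
    # Step 1: extract windows from all sequences in the dataset.
    for seq_id, (sequence, labels) in dataset.items():
        for c in range(half, len(sequence) - half):
            pool.append((seq_id, sequence[c - half:c + half + 1], int(labels[c])))
    # Step 2: partition the pooled windows into k disjoint subsets.
    random.Random(seed).shuffle(pool)
    # Step 3 (standard cross-validation over these subsets) is left to the caller.
    return [pool[i::k] for i in range(k)]


# Hypothetical toy data: sequence id -> (sequence, class string).
toy = {"seqA": ("MKLTQES", "0010010"), "seqB": ("SRLLPSL", "1000010")}
folds = window_based_folds(toy, k=2, size=3)
for i, fold in enumerate(folds):
    print(i, sorted({seq_id for seq_id, _, _ in fold}))
```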

11 Sequence-Based Cross-Validation
Procedure:
• Partition the set of sequences into k disjoint subsets
• Extract windows from sequences in each subset
• Perform standard cross-validation
[Diagram: k-fold cross-validation in which the subsets S1, ..., Sk contain sequences; learn classifier C, evaluate classifier C, repeat k times]
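
The sequence-based procedure can be sketched the same way, with the order of the first two steps swapped: sequences are assigned to subsets first, and windows are extracted only afterwards. The function name `sequence_based_folds`, the toy sequences, and the window size of 3 are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Window = Tuple[str, str, int]  # (sequence id, window, label of center residue)


def sequence_based_folds(dataset: Dict[str, Tuple[str, str]], k: int,
                         size: int = 15, seed: int = 0) -> List[List[Window]]:
    half = size // 2
    # Step 1: partition the set of sequences into k disjoint subsets.
    seq_ids = sorted(dataset)
    random.Random(seed).shuffle(seq_ids)
    folds: List[List[Window]] = []
    for i in range(k):
        fold: List[Window] = []
        # Step 2: extract windows only from the sequences assigned to this subset.
        for seq_id in seq_ids[i::k]:
            sequence, labels = dataset[seq_id]
            for c in range(half, len(sequence) - half):
                fold.append((seq_id, sequence[c - half:c + half + 1], int(labels[c])))
        folds.append(fold)
    # Step 3 (standard cross-validation over these subsets) is left to the caller.
    return folds


# Hypothetical toy data: sequence id -> (sequence, class string).
toy = {
    "seqA": ("MKLTQES", "0010010"),
    "seqB": ("SRLLPSL", "1000010"),
    "seqC": ("TILCFLS", "0000001"),
    "seqD": ("QEIDGAK", "0100000"),
}
folds = sequence_based_folds(toy, k=2, size=3)
ids_per_fold = [{seq_id for seq_id, _, _ in fold} for fold in folds]
print(ids_per_fold)
```

By construction, every sequence id appears in exactly one subset, so all windows from one sequence stay together.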

12 Window-Based vs. Sequence-Based Cross-Validation
• Window-Based Cross-Validation:
  • Train and test sets are likely to contain some windows that originate from the same sequence.
  • This violates the independence assumption between train and test sets.
• Sequence-Based Cross-Validation:
  • Windows belonging to the same sequence end up in the same set.
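
One way to make the difference concrete is to count, for each test fold, the sequences that also contribute windows to the corresponding training folds. The sketch below is illustrative only: `shared_sequences` and the two hand-written fold assignments are hypothetical. Note that with a window that slides one residue at a time, neighbouring windows from the same sequence differ in only one position at each end, which is why sharing a sequence between train and test matters.

```python
from typing import List, Set, Tuple

TaggedWindow = Tuple[str, str]  # (sequence id, window)


def shared_sequences(folds: List[List[TaggedWindow]]) -> List[Set[str]]:
    """For each test fold, return the sequence ids that also occur in its training folds."""
    shared = []
    for i, test in enumerate(folds):
        test_ids = {seq_id for seq_id, _ in test}
        train_ids = {seq_id
                     for j, fold in enumerate(folds) if j != i
                     for seq_id, _ in fold}
        shared.append(test_ids & train_ids)
    return shared


# Hypothetical fold assignments for the same four windows.
window_based = [
    [("seqA", "MKL"), ("seqB", "RLL")],
    [("seqA", "KLT"), ("seqB", "SRL")],
]
sequence_based = [
    [("seqA", "MKL"), ("seqA", "KLT")],
    [("seqB", "SRL"), ("seqB", "RLL")],
]

print(shared_sequences(window_based))    # both sequences occur on both sides
print(shared_sequences(sequence_based))  # no sequence is shared
```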

13 Machine Learning Classifiers
• Support Vector Machine:
  • 0/1 String Kernel
  • Example:
    x = VKKFGGEVVKAGNIL
    y = KKFGGEVVKAGNILV
    I[x_i = y_i] = 010010010000000
• Naïve Bayes:
  • Identity Window:
    VKKFGGEVVKAGNIL
    x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
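
The two window representations can be sketched as follows. The slide only shows the per-position indicator I[x_i = y_i]; taking the kernel value to be the sum of those indicators is an assumption, and the function names `match_indicator`, `zero_one_kernel`, and `identity_features` are hypothetical.

```python
from typing import List


def match_indicator(x: str, y: str) -> str:
    """Per-position indicator I[x_i = y_i], written as a 0/1 string."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return "".join("1" if a == b else "0" for a, b in zip(x, y))


def zero_one_kernel(x: str, y: str) -> int:
    """Assumed kernel value: the number of positions at which x and y agree."""
    return match_indicator(x, y).count("1")


def identity_features(window: str) -> List[str]:
    """Identity window: one categorical feature per residue position."""
    return list(window)


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"
print(match_indicator(x, y))  # 010010010000000, as on the slide
print(zero_one_kernel(x, y))  # 3 matching positions
print(",".join(identity_features(x)))
```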

14 Datasets
• O-GlycBase dataset:
  • contains experimentally verified glycosylation sites
  • http://www.cbs.dtu.dk/databases/OGLYCBASE/
• RNA-Protein Interface dataset, RB147:
  • consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank
  • http://bindr.gdcb.iastate.edu/RNABindR/
• Protein-Protein Interface dataset:
  • consists of protein-binding protein sequences

15 Datasets
Number of positive and negative instances used in our experiments

Dataset          Number of Sequences  Number of + Instances  Number of - Instances
O-GlycBase       216                  2168                   12147
RNA-Protein      147                  4336                   27988
Protein-Protein  42                   2350                   9204


17 Experimental Design
Questions:
• How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation?
• How do the results vary when we vary the size of the dataset?

18 Results
Receiver Operating Characteristic (ROC) curves for Window-Based and Sequence-Based 10-Fold Cross-Validation using SVM
[Figure: ROC curves for the O-GlycBase dataset]

19 Results
[Figure: AUC and CC on a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface]
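
For reference, the two reported measures could be computed along the following lines. This is a sketch under assumptions: the slides do not spell out what CC stands for, and it is taken here to be the Matthews correlation coefficient; AUC is computed as the fraction of positive/negative pairs ranked correctly (ties count one half). The function names and the toy scores are hypothetical and are not results from the experiments.

```python
from math import sqrt
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def matthews_cc(predicted: Sequence[int], labels: Sequence[int]) -> float:
    """Matthews correlation coefficient from the confusion matrix (assumed meaning of CC)."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
predicted = [1 if s >= 0.5 else 0 for s in scores]
print(auc(scores, labels))
print(matthews_cc(predicted, labels))
```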


21 Conclusions
• We compared two variants of k-fold cross-validation: window-based and sequence-based.
• The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV.
• Sequence-Based CV provides more realistic estimates of performance, because in practice predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.

22 Jivko Sinapov, Drena Dobbs, Vasant Honavar

