Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory
Research supported in part by a grant from the National Institutes of Health (GM066387).

Assessing the Performance of Macromolecular Sequence Classifiers
Cornelia Caragea, Iowa State University
Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar
October 15, 2007

Background and Motivation
- Machine Learning methods offer some of the most cost-effective approaches to building predictive models
- One problem, multiple approaches
- Needed: comparing the effectiveness of different predictive classifiers
- Difficulty: different data selection and evaluation procedures

Outline
- Macromolecular Sequence Classification
- Performance Evaluation
- Window-Based Cross-Validation
- Sequence-Based Cross-Validation
- Experiments
- Conclusions

Macromolecular Sequence Classification
- Predict a label for each element in a given sequence
- Example: Identify post-translational modification residues
Figure: protein sequence H3N+ MKLITILCFLSRLLPSLTQESSQEID COO-, with individual residues marked "Glycosylated?" and "Phosphorylated?"

Macromolecular Sequence Classification
- Example: Identify RNA-binding residues
1T0K_B
SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA

Macromolecular Sequence Classification
Diagram: All Data is split into Training Data and Test Data. The Learning System is trained on the Training Data, and the Resulting Classifier is validated on the Test Data, giving the performance on the test set.

Macromolecular Sequence Classification
Sliding Window Approach:
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class:
Windows, each followed by the class label of its target residue:
VKKFGGEVVKAGNIL,0
KKFGGEVVKAGNILV,0
KFGGEVVKAGNILVR,1
FGGEVVKAGNILVRQ,1
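The sliding window approach can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `extract_windows` is hypothetical, the target residue is assumed to sit at the center of a 15-residue window (which matches the four windows shown on the slide), residues too close to either end to have a full window are simply skipped, and every label other than the four shown on the slide is a placeholder.

```python
def extract_windows(sequence, labels, size=15):
    """Return (window, label) pairs, one per residue that has a full window.

    The label attached to each window is the label of its central residue.
    Residues closer than size // 2 to either end are skipped (an assumption;
    the slides do not say how sequence ends are handled).
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so that a central residue exists")
    if len(sequence) != len(labels):
        raise ValueError("need exactly one label per residue")
    half = size // 2
    pairs = []
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        pairs.append((window, labels[center]))
    return pairs


sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
# Placeholder labels: only positions 15-18 (0-based) follow the slide (0, 0, 1, 1).
labels = [0] * len(sequence)
labels[17] = 1
labels[18] = 1

windows = extract_windows(sequence, labels)
for window, label in windows[8:12]:
    print(f"{window},{label}")
```

With these inputs, the four printed lines correspond to the four windows listed on the slide.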

Performance Evaluation
K-Fold Cross-Validation:
Diagram: the data is partitioned into k subsets S_1, S_2, ..., S_{k-1}, S_k. Learn classifier C on k-1 of the subsets, evaluate classifier C on the remaining subset, and repeat k times.
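As a rough sketch of the k-fold procedure, the generator below shuffles the items, deals them into k folds, and yields one (train, test) pair per fold. The name `k_fold_splits` and the fixed shuffling seed are illustrative assumptions; what counts as an "item" (a window or a whole sequence) is deliberately left open.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs, one per fold; every item is tested exactly once."""
    items = list(items)
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # the subsets S_1 ... S_k
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


splits = list(k_fold_splits(range(10), k=5))
for train, test in splits:
    print(sorted(test), "held out;", len(train), "items used for training")
```

A classifier would be learned on each `train` list and evaluated on the matching `test` list, with the k evaluation results then combined.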

Window-Based Cross-Validation
Procedure:
1. Extract windows from all sequences in the dataset
2. Partition the set of windows into k disjoint subsets
3. Perform standard cross-validation
Diagram: the windows are partitioned into subsets S_1, S_2, ..., S_{k-1}, S_k. Learn classifier C, evaluate classifier C, repeat k times.
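The window-based procedure might look like the following sketch. The helper names, the toy sequences, their labels, and the window size of 3 are all made up for illustration (the slides use 15-residue windows). Each window keeps the ID of the sequence it came from so that the composition of each fold can be inspected.

```python
import random


def windows_of(seq_id, sequence, labels, size=3):
    """Return (seq_id, window, label) triples; the label is that of the central residue."""
    half = size // 2
    return [(seq_id, sequence[c - half:c + half + 1], labels[c])
            for c in range(half, len(sequence) - half)]


def window_based_folds(dataset, k, size=3, seed=0):
    """Pool the windows of ALL sequences first, then partition the pool into k folds."""
    pool = [w for sid, (seq, lab) in sorted(dataset.items())
            for w in windows_of(sid, seq, lab, size)]
    random.Random(seed).shuffle(pool)
    return [pool[i::k] for i in range(k)]


# Hypothetical toy dataset: seq_id -> (sequence, per-residue labels).
toy = {
    "seqA": ("MKLITILCFL", [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]),
    "seqB": ("SRLLPSLTQE", [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]),
}
folds = window_based_folds(toy, k=2)
for i, fold in enumerate(folds):
    print("fold", i, "contains windows from", sorted({sid for sid, _, _ in fold}))
```

Because the partition is made over pooled windows, nothing prevents windows of one sequence from landing in several folds.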

Sequence-Based Cross-Validation
Procedure:
1. Partition the set of sequences into k disjoint subsets
2. Extract windows from sequences in each subset
3. Perform standard cross-validation
Diagram: the sequences are partitioned into subsets S_1, S_2, ..., S_{k-1}, S_k. Learn classifier C, evaluate classifier C, repeat k times.
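A corresponding sketch of the sequence-based procedure is given below, again with hypothetical helper names, toy sequences, placeholder labels, and a window size of 3 chosen only to keep the example small. The key difference is the order of operations: sequence IDs are partitioned first, and windows are extracted afterwards within each fold.

```python
import random


def windows_of(seq_id, sequence, labels, size=3):
    """Return (seq_id, window, label) triples; the label is that of the central residue."""
    half = size // 2
    return [(seq_id, sequence[c - half:c + half + 1], labels[c])
            for c in range(half, len(sequence) - half)]


def sequence_based_folds(dataset, k, size=3, seed=0):
    """Partition the SEQUENCES into k folds, then extract windows inside each fold."""
    ids = sorted(dataset)
    random.Random(seed).shuffle(ids)
    id_folds = [ids[i::k] for i in range(k)]
    return [[w for sid in id_fold for w in windows_of(sid, *dataset[sid], size)]
            for id_fold in id_folds]


# Hypothetical toy dataset: seq_id -> (sequence, per-residue labels).
toy = {
    "seqA": ("MKLITILCFL", [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]),
    "seqB": ("SRLLPSLTQE", [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]),
    "seqC": ("SINQKLALVI", [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]),
    "seqD": ("DSNPKYLGVK", [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]),
}
folds = sequence_based_folds(toy, k=2)
fold_ids = [{sid for sid, _, _ in fold} for fold in folds]
print(fold_ids)
```

By construction, no sequence contributes windows to more than one fold, so the test fold always consists of sequences the classifier has not seen during training.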

Window-Based vs. Sequence-Based Cross-Validation
Window-Based Cross-Validation:
- Train and test sets are likely to contain some windows that originate from the same sequence.
- This violates the independence assumption between train and test sets.
Sequence-Based Cross-Validation:
- Windows belonging to the same sequence end up in the same set.
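One way to see why windows from the same sequence are not independent is to count how many sequence positions two nearby windows share. The short sketch below is only an illustration; the function name `overlap_length` is hypothetical, and the two example windows are adjacent windows taken from the sliding-window slide.

```python
def overlap_length(size, shift):
    """Number of sequence positions shared by two windows of the same size
    whose centers are `shift` residues apart on the same sequence."""
    return max(0, size - abs(shift))


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"  # the next window along the same sequence

# Adjacent windows are shifted copies of each other: all but one residue is shared.
print(x[1:] == y[:-1])
print(overlap_length(size=15, shift=1))
```

If `x` lands in the training set and `y` in the test set, the classifier is tested on an example that is almost a copy of one it was trained on.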

Machine Learning Classifiers
- Support Vector Machine: 0/1 String Kernel
  Example: x = VKKFGGEVVKAGNIL, y = KKFGGEVVKAGNILV
  I[x_i = y_i] = 1 if x_i = y_i, and 0 otherwise
- Naïve Bayes: Identity Window
  Example: the window VKKFGGEVVKAGNIL is represented as x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
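The two representations can be sketched as follows. This assumes the 0/1 string kernel is the sum of the position-wise indicators I[x_i = y_i] over the window, which is an interpretation of the slide rather than something it states in full. The function names are hypothetical.

```python
def zero_one_kernel(x, y):
    """Count the positions at which two equal-length windows carry the same residue."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return sum(1 for a, b in zip(x, y) if a == b)


def identity_window(window):
    """Represent a window as one categorical feature per position (for Naive Bayes)."""
    return list(window)


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"
print(zero_one_kernel(x, y))   # matches at positions 1, 4 and 7 (0-based): 3
print(zero_one_kernel(x, x))   # a window matches itself everywhere: 15
print(identity_window(x))
```

Note that although `x` and `y` are shifted copies of each other, the position-wise comparison finds only three matches, because the kernel compares residues at the same index.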

Datasets
- O-GlycBase dataset: contains experimentally verified glycosylation sites
- RNA-Protein Interface dataset, RB147: consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank
- Protein-Protein Interface dataset: consists of protein-binding protein sequences

Datasets
Number of positive and negative instances used in our experiments
Dataset | Number of Sequences | Number of + Instances | Number of - Instances
O-GlycBase | | |
RNA-Protein | | |
Protein-Protein | | |

Experimental Design
Questions:
- How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation?
- How do the results vary when we vary the size of the dataset?

Results
Figure: Receiver Operating Characteristic (ROC) curves for Window-Based and Sequence-Based 10-Fold Cross-Validation using SVM on O-GlycBase

Results
Figure panels: a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface. Measures shown: AUC, CC.
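For readers unfamiliar with AUC (area under the ROC curve), one standard way to compute it from classifier scores is as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses that pairwise definition. It is not necessarily how the reported values were computed, and the scores and labels are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented scores: one positive (0.3) is ranked below one negative (0.8).
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 3 of 4 pairs ordered correctly: 0.75
```

An AUC of 1.0 means every positive is scored above every negative, while 0.5 corresponds to a random ranking.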

Conclusions
- We compared two variants of k-fold cross-validation: window-based and sequence-based.
- The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV.
- Sequence-Based CV provides more realistic estimates of performance, because in practice predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.

Jivko Sinapov, Drena Dobbs, Vasant Honavar