Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by a grant from the National.

Presentation transcript:

Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory
Research supported in part by a grant from the National Institutes of Health (GM066387).

Assessing the Performance of Macromolecular Sequence Classifiers
Cornelia Caragea, Iowa State University
Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar
October 15, 2007

Background and Motivation
- Machine learning methods offer some of the most cost-effective approaches to building predictive models
- One problem: multiple approaches
- Needed: comparing the effectiveness of different predictive classifiers
- Difficulty: different data selection and evaluation procedures

Outline
- Macromolecular Sequence Classification
- Performance Evaluation
  - Window-Based Cross-Validation
  - Sequence-Based Cross-Validation
- Experiments
- Conclusions

Macromolecular Sequence Classification
- Predict a label for each element in a given sequence
- Example: identify post-translational modification residues
[Figure: a protein sequence drawn from the N-terminus (H3N+) to the C-terminus (COO-), with individual residues marked "Glycosylated?" and "Phosphorylated?"]

Macromolecular Sequence Classification
- Example: identify RNA-binding residues
1T0K_B
SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGD SDILTTLA

Macromolecular Sequence Classification
[Diagram: All Data is split into Training Data and Test Data. The Learning System uses the training data to produce the Resulting Classifier, which is validated on the test data to give the performance on the test set.]

Macromolecular Sequence Classification
- Sliding Window Approach:
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class: one class label per residue
Windows, each paired with the class label of its target residue:
VKKFGGEVVKAGNIL,0
KKFGGEVVKAGNILV,0
KFGGEVVKAGNILVR,1
FGGEVVKAGNILVRQ,1
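The sliding-window step above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `extract_windows`, the choice of the central residue as the target residue, and the decision to skip residues too close to either end of the sequence are all assumptions. The labels in the usage lines are placeholders, since the per-residue labels from the slide are not available here.

```python
def extract_windows(sequence, labels, window_size=15):
    """Return (window, label) pairs, one per residue that has a full window.

    Assumption: the target residue is the central residue of the window,
    and residues too close to either end of the sequence are skipped.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so a central residue exists")
    if len(sequence) != len(labels):
        raise ValueError("need exactly one label per residue")
    half = window_size // 2
    windows = []
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        windows.append((window, labels[center]))
    return windows


# Sequence taken from the slide; the labels are hypothetical placeholders.
sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
labels = [0] * len(sequence)
windows = extract_windows(sequence, labels)
print(windows[8][0])  # the window starting at the ninth residue
```

Each residue with enough context on both sides yields exactly one training instance, so neighboring windows from the same sequence overlap in all but one position.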

Outline
- Macromolecular Sequence Classification
- Performance Evaluation
  - Window-Based Cross-Validation
  - Sequence-Based Cross-Validation
- Experiments
- Conclusions

Performance Evaluation
K-Fold Cross-Validation:
[Diagram: the data are partitioned into k subsets S_1, S_2, ..., S_{k-1}, S_k. Learn classifier C on k-1 subsets, evaluate classifier C on the remaining subset, and repeat k times.]

Window-Based Cross-Validation
Procedure:
- Extract windows from all sequences in the dataset
- Partition the set of windows into k disjoint subsets
- Perform standard cross-validation
[Diagram: k-fold cross-validation over subsets S_1, S_2, ..., S_{k-1}, S_k of windows. Learn classifier C, evaluate classifier C, repeat k times.]
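The window-based procedure above can be sketched as follows. This is an illustrative sketch under assumptions: the helper names `window_based_folds` and `cross_validation_splits`, the random shuffle, and the round-robin assignment to folds are choices made for this example; the slides do not say how the windows were partitioned.

```python
import random


def window_based_folds(windows, k, seed=0):
    """Pool all windows, shuffle them, and deal them into k disjoint folds."""
    shuffled = list(windows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validation_splits(folds):
    """Yield (train, test) pairs: each fold is the test set exactly once."""
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Hypothetical pooled windows: (window, label) pairs from any sequences.
pooled = [("VKKFGGEVVKAGNIL", 0), ("KKFGGEVVKAGNILV", 0),
          ("KFGGEVVKAGNILVR", 1), ("FGGEVVKAGNILVRQ", 1)]
for train, test in cross_validation_splits(window_based_folds(pooled, k=2)):
    print(len(train), len(test))
```

Because the partition ignores which sequence a window came from, overlapping windows of one sequence can land on both sides of a split.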

Sequence-Based Cross-Validation
Procedure:
- Partition the set of sequences into k disjoint subsets
- Extract windows from sequences in each subset
- Perform standard cross-validation
[Diagram: k-fold cross-validation over subsets S_1, S_2, ..., S_{k-1}, S_k of sequences. Learn classifier C, evaluate classifier C, repeat k times.]
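The sequence-based procedure above can be sketched as follows. Again this is only a sketch under assumptions: the name `sequence_based_folds`, the `(sequence_id, sequence, labels)` input format, the central-residue windowing, and the shuffled round-robin assignment of sequences to folds are all hypothetical choices, not details given in the slides.

```python
import random


def sequence_based_folds(dataset, k, window_size=15, seed=0):
    """Partition sequences into k folds, then extract windows inside each fold.

    dataset: list of (sequence_id, sequence, labels) triples (assumed format).
    Returns k lists of (sequence_id, window, label) triples.
    """
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    sequence_folds = [shuffled[i::k] for i in range(k)]

    half = window_size // 2
    window_folds = []
    for fold in sequence_folds:
        fold_windows = []
        for seq_id, sequence, labels in fold:
            # Assumption: the target residue is the center of each window.
            for center in range(half, len(sequence) - half):
                window = sequence[center - half:center + half + 1]
                fold_windows.append((seq_id, window, labels[center]))
        window_folds.append(fold_windows)
    return window_folds


# Tiny hypothetical dataset with 3-residue windows for readability.
toy = [("s1", "ACDEFG", [0, 1, 0, 0, 1, 0]),
       ("s2", "GHIKL", [0, 0, 1, 0, 0])]
for fold in sequence_based_folds(toy, k=2, window_size=3):
    print(sorted({seq_id for seq_id, _, _ in fold}))
```

Since the split happens before window extraction, every window of a given sequence falls in the same fold by construction.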

Window-Based vs. Sequence-Based Cross-Validation
- Window-Based Cross-Validation:
  - Train and test sets are likely to contain some windows that originate from the same sequence.
  - This violates the independence assumption between train and test sets.
- Sequence-Based Cross-Validation:
  - Windows belonging to the same sequence end up in the same set.
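One way to make the independence violation concrete is to measure how many test windows come from a sequence that also contributes windows to the training set. The helper below is a hypothetical diagnostic written for this transcript, assuming each window is stored as a `(sequence_id, window, label)` triple; it is not part of the original study.

```python
def shared_sequence_fraction(train, test):
    """Fraction of test windows whose source sequence also appears in train.

    train, test: lists of (sequence_id, window, label) triples (assumed format).
    A value of 0.0 means no sequence is split across the two sets.
    """
    if not test:
        raise ValueError("test set must not be empty")
    train_ids = {seq_id for seq_id, _, _ in train}
    shared = sum(1 for seq_id, _, _ in test if seq_id in train_ids)
    return shared / len(test)


# Hypothetical split in which sequence "a" contributes to both sets.
train = [("a", "AAA", 0), ("b", "BBB", 1)]
test = [("a", "AAB", 0), ("c", "CCC", 1)]
print(shared_sequence_fraction(train, test))
```

Under sequence-based cross-validation this fraction is zero for every split; under window-based cross-validation it is typically well above zero.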

Machine Learning Classifiers
- Support Vector Machine:
  - 0/1 String Kernel
  - Example: x = VKKFGGEVVKAGNIL, y = KKFGGEVVKAGNILV
  - I[x_i = y_i] = 1 if x_i = y_i, and 0 otherwise
- Naïve Bayes:
  - Identity Window: the window VKKFGGEVVKAGNIL is represented as x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
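The slide's kernel formula did not survive in the transcript. The sketch below assumes that the 0/1 string kernel sums the indicator I[x_i = y_i] over window positions, which is a common definition but is an assumption here; the function name `zero_one_string_kernel` is likewise hypothetical.

```python
def zero_one_string_kernel(x, y):
    """Count the positions at which two equal-length windows agree.

    Assumed definition: K(x, y) = sum over i of I[x_i == y_i].
    """
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return sum(1 for xi, yi in zip(x, y) if xi == yi)


# The two example windows from the slide.
x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"
print(zero_one_string_kernel(x, y))  # 3 matching positions
```

Under this definition the kernel is position-sensitive: shifting a window by one residue, as in the example, leaves only the few positions where adjacent residues happen to repeat.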

Datasets
- O-GlycBase dataset:
  - contains experimentally verified glycosylation sites
- RNA-Protein Interface dataset, RB147:
  - consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank
- Protein-Protein Interface dataset:
  - consists of protein-binding protein sequences

Datasets
Number of positive and negative instances used in our experiments
[Table: columns are Dataset, Number of Sequences, Number of + Instances, Number of - Instances; rows are O-GlycBase, RNA-Protein, Protein-Protein. The cell values are not present in the transcript.]

Outline
- Macromolecular Sequence Classification
- Performance Evaluation
  - Window-Based Cross-Validation
  - Sequence-Based Cross-Validation
- Experiments
- Conclusions

Experimental Design
Questions:
- How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation?
- How do the results vary when we vary the size of the dataset?

Results
[Figure: Receiver Operating Characteristic (ROC) curves for Window-Based and Sequence-Based 10-Fold Cross-Validation using SVM on O-GlycBase]

Results
[Figure: AUC and CC, in three panels: a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface]
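The two measures named in the results can be computed as sketched below. The slides do not define them, so this sketch assumes that AUC is the area under the ROC curve, computed here through its pairwise-ranking interpretation, and that CC is the Matthews correlation coefficient of binary predictions. Both function names are hypothetical.

```python
import math


def auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive
    instance is scored above a random negative one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def correlation_coefficient(predictions, labels):
    """Matthews correlation coefficient for 0/1 predictions (assumed CC)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical scores and labels for four residues.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
print(correlation_coefficient([1, 0, 1, 0], [1, 0, 1, 0]))
```

AUC is threshold-free, while CC depends on the threshold used to turn scores into 0/1 predictions, which is one reason to report both.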

Outline
- Macromolecular Sequence Classification
- Performance Evaluation
  - Window-Based Cross-Validation
  - Sequence-Based Cross-Validation
- Experiments
- Conclusions

Conclusions
- Compared two variants of k-fold cross-validation: window-based and sequence-based k-fold cross-validation.
- The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV.
- Sequence-Based CV provides more realistic estimates of performance, because predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.

Jivko Sinapov, Drena Dobbs, Vasant Honavar