
Frequent-Subsequence-Based Prediction of Outer Membrane Proteins
R. She, F. Chen, K. Wang, M. Ester (School of Computing Science)
J. L. Gardy, F. S. L. Brinkman (Dept. of Mol. Biology & Biochemistry)
Simon Fraser University, BC, Canada

2 1. Problem Introduction
Gram-negative bacteria
- Medically important disease-causing bacteria
- 5 sub-cellular localizations (2 layers of membranes):
  1. Cytoplasmic
  2. Inner Membrane
  3. Periplasmic
  4. Outer Membrane
  5. Extra-cellular

3 Outer Membrane Proteins
- Predicting outer membrane proteins (OMPs) of Gram-negative bacteria
- OMPs are attached to the "outer membrane" of the Gram-negative bacterial cell
- Particularly useful as drug targets

4 Outer Membrane Protein (Cont.)
Structure:
- β-strands, forming a central barrel shape
- Inner turns: shorter stretches
- Outer loops: longer stretches
[Figure labels: outer membrane, extracellular side, periplasmic side, outer loop, inner turn, β-strand]

5 Challenges
- Identifying OMPs from sequence information alone
- Discriminative sequence patterns of OMPs would be helpful

6 Challenges (Cont.)
- Favor precision over recall: studying a targeted drug in the lab takes lengthy time and laborious effort
Confusion matrix:
                        Actual OMP   Actual non-OMP   Subtotal
Classified as OMP       TP           FP               A
Classified as non-OMP   FN           TN               B
Subtotal                C            D                E
Overall accuracy = (TP + TN) / E
Precision = TP / A
Recall = TP / C
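
The three measures defined on this slide can be computed directly from the four confusion-matrix cells. The following is a minimal Python sketch; the function name and the counts in the example are made up for illustration and are not results from the presentation.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn      # E on the slide
    predicted_omp = tp + fp        # A: everything classified as OMP
    actual_omp = tp + fn           # C: everything that really is an OMP
    accuracy = (tp + tn) / total
    # Guard against empty rows/columns so the sketch never divides by zero.
    precision = tp / predicted_omp if predicted_omp else 0.0
    recall = tp / actual_omp if actual_omp else 0.0
    return accuracy, precision, recall


# Hypothetical counts, chosen only to illustrate the arithmetic.
acc, prec, rec = confusion_metrics(tp=40, fp=10, fn=60, tn=890)
print(acc, prec, rec)
```

With these made-up counts the majority class dominates accuracy: most true OMPs are missed, yet accuracy stays high because the non-OMP class is large.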

7 2. Dataset
OMP sequence dataset
- Excellent quality (
- Protein sequences (strings over an alphabet of 20 letters), e.g. MNQIHK…
- Two classes with imbalanced distributions
[Table: columns are Data, Number of sequences, Percentage of each class, Minimum length, Maximum length, Average length; rows are OMP, Non-OMP, Total; cell values are not preserved in this transcript]

8 Evaluation
- Majority of data is non-OMP; overall accuracy is determined mainly by non-OMP prediction
- Precision is our main concern (≥ 90%)
- Recall should be maintained at a reasonable level (≥ 50%)

9 3. Related Work
Existing sub-cellular localization predictors
- Inner membrane proteins have α-helix structures; prediction is highly accurate
- Prediction of cytoplasmic, periplasmic and extracellular proteins: neural networks, covariant discriminant algorithm, Markov chain models, support vector machines (highest accuracy: 91%)
- Do not apply to OMPs

10 Existing work on OMP prediction
- Neural networks, hydrophobicity analysis, combination of methods (homology analysis, amino acid abundance)
Current state of the art: Hidden Markov Models by Martelli et al. [1]
- Use HMM to model OMPs according to their 3D structures
- Training set is small (12 proteins with known 3D structures)
- Overall accuracy: 89%; Recall: 84%; Precision: 46%

11 4. Algorithms
Motivations
- Frequent subsequence mining is helpful
  - Frequent subsequence: consecutive amino acids that occur frequently in OMPs
- OMP characteristics
  - Common structure in OMPs
  - Different regions have different characteristic sequence residues
  - Model local similarities by frequent subsequences and highly variable regions by wild cards (*X*X*…)
=> Association-rule-based classification

12 Algorithm 1: Rule-Based Classification
- Mine frequent subsequences X (consecutive amino acids) only from the OMP class (support(X) ≥ MinSup)
- Remove trivial similarities by restricting the minimum length (MinLgh) of frequent subsequences
- Find frequent patterns (*X*X*…)
- Build classifier using frequent pattern rules (*X*X*… → OMP)
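
The mining and matching steps on this slide can be sketched in a few lines of Python. This is an illustrative brute-force version, not the authors' implementation: the names `frequent_subsequences` and `matches_pattern`, the `max_len` cap, treating support as a fraction of OMP sequences, and reading *X*X*… as "the parts occur in order without overlapping" are all assumptions made for the example.

```python
from collections import Counter


def frequent_subsequences(omp_seqs, min_sup, min_len, max_len):
    """Contiguous substrings found in at least a min_sup fraction of omp_seqs.

    min_len plays the role of MinLgh; max_len is an assumed cap that keeps
    this brute-force enumeration small.
    """
    counts = Counter()
    for seq in omp_seqs:
        seen = set()
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        counts.update(seen)  # each sequence contributes at most once per substring
    threshold = min_sup * len(omp_seqs)
    return {sub for sub, count in counts.items() if count >= threshold}


def matches_pattern(seq, parts):
    """True if the parts occur in seq in the given order (pattern *X1*X2*...)."""
    pos = 0
    for part in parts:
        idx = seq.find(part, pos)
        if idx < 0:
            return False
        pos = idx + len(part)  # the next part must start after this one ends
    return True


# Toy sequences, invented for illustration only.
toy_omps = ["MKAGLTY", "AGLWWTYQ", "PPAGLTY", "MNQIHK"]
print(frequent_subsequences(toy_omps, min_sup=0.75, min_len=3, max_len=4))
print(matches_pattern("AGLWWTYQ", ["AGL", "TY"]))
```

A rule such as *AGL*TY* → OMP would then fire on any sequence for which `matches_pattern` returns True; how rules are ranked and selected for the classifier is not specified on the slide and is left out here.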

13 Algorithm 1: Refined
- The previous classifier performs well in precision but poorly in recall
- A second-level classifier is built on top of the existing classifier
- New training data: cases covered by the default rule in the first classifier
- Apply the same pattern-mining and classifier-building process
- A future case is first matched against the 1st classifier; if it is classified as OMP, we accept it; otherwise the 2nd classifier is used
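
The two-level prediction step described on this slide amounts to a simple cascade. Below is a minimal Python sketch under stated assumptions: each classifier is modelled as a function returning "OMP" or "non-OMP", and `make_rule_classifier` is a hypothetical stand-in that reduces each rule to a single substring test rather than a full *X*X*… pattern.

```python
def make_rule_classifier(patterns):
    """Toy rule classifier: OMP if any pattern occurs, else the default rule."""
    def classify(seq):
        return "OMP" if any(p in seq for p in patterns) else "non-OMP"
    return classify


def cascade_predict(seq, first_level, second_level):
    """Accept an OMP call from the first classifier; otherwise ask the second."""
    if first_level(seq) == "OMP":
        return "OMP"
    return second_level(seq)


# Hypothetical patterns; in the refined method the second classifier would be
# trained on the cases that fell through to the first classifier's default rule.
first = make_rule_classifier(["AGLT"])
second = make_rule_classifier(["WTY"])

for s in ["MKAGLTY", "AGLWWTYQ", "MNQIHK"]:
    print(s, cascade_predict(s, first, second))
```

Because the second classifier only sees cases the first one rejected, it can add recall without changing any of the first classifier's OMP predictions.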

14 Algorithm 2: SVM-based Classification
Support Vector Machines (SVM) [5]
- Excellent performer in previous biological sequence classification
- Data needs to be transformed for SVM to be used (sequences => vectors)
- Frequent subsequences of OMPs are used as features
- Protein sequences are mapped into binary vectors
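
The sequence-to-vector transformation can be illustrated as follows. This is a sketch of one plausible reading of the slide, in which component i is 1 when the i-th frequent subsequence occurs anywhere in the protein; the function name and the toy features are assumptions, and the SVM training step itself is not shown.

```python
def to_binary_vector(seq, features):
    """Map a protein sequence to a 0/1 vector, one component per feature.

    features is an ordered list of frequent subsequences; the order must be
    the same for every sequence so that components line up across vectors.
    """
    return [1 if feature in seq else 0 for feature in features]


# Hypothetical feature list and sequences, for illustration only.
features = ["AGL", "TYQ", "MK"]
for s in ["MKAGLTY", "AGLWWTYQ", "MNQIHK"]:
    print(s, to_binary_vector(s, features))
```

The resulting vectors, paired with OMP / non-OMP labels, are the kind of input a standard SVM package expects.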

15 5. Empirical Studies
5 classification methods
- Single-level Rule-Based Classification (SRB)
- Refined Rule-Based Classification (RRB)
- SVM-based Classification (SVM-light [6])
- Martelli's HMM
- See5 (latest version of C4.5)
5-fold cross validation (same folding for all algorithms)
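
Using the same folding for all algorithms means the fold assignment is computed once and reused, so that differences in scores come from the classifiers and not from the split. A minimal Python sketch, assuming a plain seeded random partition (the slide does not say whether the folds were stratified by class); `make_folds`, `cross_validate`, and the `train_and_test` callback are hypothetical names.

```python
import random


def make_folds(n_items, k=5, seed=0):
    """Partition indices 0..n_items-1 into k folds, reproducibly."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, labels, train_and_test, folds):
    """Run train_and_test once per fold and collect its return values."""
    results = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(items)) if i not in held]
        results.append(train_and_test(
            [items[i] for i in train], [labels[i] for i in train],
            [items[i] for i in held_out], [labels[i] for i in held_out],
        ))
    return results


# The same `folds` object would be passed to every method being compared.
folds = make_folds(10, k=5, seed=0)
sizes = cross_validate(list("ABCDEFGHIJ"), [0] * 10,
                       lambda tr_x, tr_y, te_x, te_y: len(te_x), folds)
print(folds, sizes)
```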

16 Summary of Classifier Comparison
- SVM outperforms all methods
- RRB is the 2nd best performer
- Both SVM and RRB outperform HMM
- Improvement from SRB to RRB shows that refinement works

17 Other Biological Benefits (Rule-Based Classifiers)
- Sequential rules (obtained by SRB/RRB) lead to biological insights
  - Mapped to both β-strands and periplasmic turn regions
  - Assist in developing 3D models for proteins
- Identification of primary drug target regions
  - Conserved sequences in the surface-exposed regions are ideal targets for new diagnostics and drugs

18 6. Conclusions and Future Work
Contributions
- Provided excellent predictors for OMP prediction
- Obtained interpretable sequential patterns for further biological benefits
- Proposed the use of frequent subsequences for SVM feature extraction
- Demonstrated the usefulness of data mining techniques in biological sequence analysis

19 Future Work
- Include more features of sequences, e.g., secondary structure, additional properties of proteins, barrel and turn sizes, polarity of amino acids, etc.
- Explore ways to extract symbolic information from SVMs

20 References
1. Martelli P., Fariselli P., Krogh A. and Casadio R., A sequence-profile-based HMM for predicting and discriminating β barrel membrane proteins, Bioinformatics, 18(1) 2002, S46-S53.
2. Wang J., Chirn G., Marr T., Shapiro B., Shasha D. and Zhang K., Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results, SIGMOD-94, Minnesota, USA.
3. Wang K., Zhou S. and He Y., Growing Decision Tree on Support-less Association Rules, KDD'00, Boston, MA, USA.
4. Quinlan J., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
5. Vapnik V., The Nature of Statistical Learning Theory, Springer-Verlag, New York.
6. Joachims T., Learning to Classify Text Using Support Vector Machines. Dissertation, Kluwer, software downloadable at
7. Rulequest Research, Information on See5/C5.0, at http:// info.html