Study of Protein Prediction Related Problems. Ph.D. candidate: Le-Yi WEI. 2013.10.16

Presentation transcript:

Contents
1. Background
2. Methods
3. Experiments

Background

Definition of protein
A protein is a sequence built from 20 different amino acids: A, C, D, …, V, W, Y.
>Example
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQ
EFFPKFKGLTTADELKKSADVRWHAERIINAVDDAVASMDDTEKMS
MKLRNLSGKHAKSFQVDPEYFKVLAAVIADTVAAGDAGFEKLMSMI

Protein prediction related problems
• Protein structural class prediction
• Protein fold prediction
• Multi-functional enzyme prediction
• Protein remote homology detection
• Protein subcellular localization prediction
• Other protein-related problems, etc.

Common points
Treat the protein-related problems as classification tasks.
The framework of a classification task: query protein sequence → data presentation → classification algorithms → predicted results
Two major components: data presentation (feature extraction) and classification algorithms.

Methods

Feature extraction methods
• Primary sequence based, e.g. physicochemical features, n-gram, functional domain, PSSM profile (auto-covariance), etc.
• Secondary structure based, e.g. secondary structure sequence based and probability matrix based
• Sequence-structure based, e.g. triple sequence-structure features

Primary-sequence based: n-gram model
Given a query protein sequence, compute the n-gram statistics and obtain the feature vector. [Formulas not preserved in the transcript.]
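The slide's n-gram formulas are not preserved, so the following is only a minimal sketch of one common way to build n-gram features: count every overlapping window of length n and normalize by the number of windows. The function name `ngram_features`, the alphabetical residue order, and the normalization are assumptions for illustration, not the author's exact definition.

```python
from collections import Counter
from itertools import product

# Assumed residue order: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_features(sequence, n=2):
    """Return normalized n-gram frequencies in a fixed order (20**n values).

    Windows containing non-standard letters still count toward the total,
    but have no slot in the output vector.
    """
    windows = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    return [counts[k] / total if total else 0.0 for k in keys]


features = ngram_features("PIVDTGSVAPLSAAEKTKIR", n=2)
print(len(features))  # 400 features for n = 2
```

With n = 1 the vector reduces to the plain residue frequencies of the sequence; larger n captures short-range order at the cost of 20**n dimensions.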

Primary-sequence based: functional domain
[Figure: a query protein sequence is searched with PSI-BLAST against a functional protein database (database sequences 1 to n) to obtain a feature vector.]

Primary-sequence based: evolutionary information
[Figure: PSI-BLAST searches the query against a protein database to produce a Position-Specific Scoring Matrix (PSSM).]

Primary-sequence based: AAC features
Compute the AAC features and obtain 20-D features. [Formulas not preserved in the transcript.]
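Since the AAC formula is missing from the transcript, here is a hedged sketch of one plausible reading: the 20-D vector is the average of each PSSM column over all sequence positions. That reading, the function name `pssm_aac`, and the list-of-rows PSSM layout are assumptions; the slides may instead mean amino acid composition counted directly from the sequence.

```python
def pssm_aac(pssm):
    """Average each PSSM column over all sequence positions.

    `pssm` is assumed to be a list of L rows, each holding 20 scores
    (one per amino acid). The result has one value per column.
    """
    if not pssm:
        raise ValueError("PSSM must have at least one row")
    length = len(pssm)
    n_cols = len(pssm[0])
    return [sum(row[j] for row in pssm) / length for j in range(n_cols)]


# Toy 3 x 20 matrix (made-up scores, not real PSI-BLAST output).
toy_pssm = [[float(i + j) for j in range(20)] for i in range(3)]
print(pssm_aac(toy_pssm)[:3])  # [1.0, 2.0, 3.0]
```

Whatever the sequence length L, the output stays 20-dimensional, which is what makes it usable as a fixed-size classifier input.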

Primary-sequence based: auto-covariance (AC) transformation
Compute the AC transformation and obtain 20*g-D features. [Formulas not preserved in the transcript.]
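The AC formula itself is not preserved, so the sketch below uses the commonly cited form of the auto-covariance transformation on a PSSM: for each column j and each lag from 1 to g, average the product of mean-centered scores at positions i and i + lag. The function name `auto_covariance`, the parameter name `max_lag` (standing in for g), and the PSSM layout are assumptions.

```python
def auto_covariance(pssm, max_lag):
    """Return 20 * max_lag auto-covariance features from an L x 20 PSSM.

    For column j and lag g, the feature is
        sum_{i=0}^{L-g-1} (P[i][j] - mean_j) * (P[i+g][j] - mean_j) / (L - g).
    """
    length = len(pssm)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    n_cols = len(pssm[0])
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]

    features = []
    for j in range(n_cols):
        for lag in range(1, max_lag + 1):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 5 x 20 matrix (made-up scores); g = 2 gives 20 * 2 = 40 features.
toy_pssm = [[float(i + j) for j in range(20)] for i in range(5)]
print(len(auto_covariance(toy_pssm, max_lag=2)))  # 40
```

The lag g must be smaller than the shortest sequence in the dataset, otherwise no position pairs exist for the largest lag.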

Primary-sequence based: consensus sequence
[Figure: PSSM profile, frequency profile, and consensus sequence; a query sequence is shown together with its consensus sequence.]
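How the consensus sequence is derived is not spelled out in the transcript. A minimal sketch, assuming the usual convention that each position takes the residue with the highest frequency in the frequency profile, could look like this. The function name `consensus_sequence` and the alphabetical column order are assumptions; real PSI-BLAST output uses a different column order.

```python
# Assumed column order of the frequency profile (alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def consensus_sequence(frequency_profile):
    """Pick, for every position, the residue with the highest frequency.

    `frequency_profile` is assumed to be a list of L rows with 20 values each.
    Ties are resolved in favor of the first residue in AMINO_ACIDS.
    """
    consensus = []
    for row in frequency_profile:
        best = max(range(len(row)), key=lambda j: row[j])
        consensus.append(AMINO_ACIDS[best])
    return "".join(consensus)


# Toy profile with one dominant residue per position (made-up values).
row_a = [0.9] + [0.1 / 19] * 19          # mostly A
row_y = [0.1 / 19] * 19 + [0.9]          # mostly Y
print(consensus_sequence([row_a, row_y, row_a]))  # AYA
```

The consensus sequence has the same length as the query, so sequence-based features such as n-grams can be applied to it unchanged.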

Secondary structure based: secondary structure sequence
An example of a query protein sequence: SLFEQLGGQAAVQAVTAQFYANIQAD
Predicted secondary structure sequence (from PSIPRED): CCHEHEEEEECCCCHHHHHHEEEEECC
The predicted sequence has three states: C (coil), H (helix), E (strand).

Secondary structure based: structure state confidence matrix
[Figure: an example of a structure state confidence matrix, showing a query protein sequence, its predicted structure sequence, and the predicted confidence.]

Secondary structure based: global structural features
Compute and obtain the global features from the structure state confidence matrix. [Formulas not preserved in the transcript.]

Secondary structure based: local structural features
Compute and obtain the local features from the structure state confidence matrix. [Formulas not preserved in the transcript.]

Sequence-structure based
[Figure: the framework of the triple sequence-structure feature extraction method.]

Classification algorithms
• Commonly used classification algorithms, e.g. Support Vector Machine (SVM), Random Forest (RF), SMO, Naive Bayes, etc.
• Ensemble classification algorithms, e.g. Majority Vote, Average Probability, Selective Ensemble, etc.
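To make the two simplest ensemble rules concrete, here is a minimal sketch of majority vote and average probability over the outputs of several base classifiers. The function names, the class labels, and the dict-based probability format are hypothetical and not taken from the slides.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base classifiers.

    `predictions` holds one label per base classifier. On a tie, the label
    that reached the top count first in the list wins.
    """
    return Counter(predictions).most_common(1)[0][0]


def average_probability(probabilities):
    """Average per-class probabilities across classifiers and pick the best.

    `probabilities` holds one dict per base classifier, mapping each
    class label to that classifier's predicted probability.
    """
    labels = probabilities[0].keys()
    n = len(probabilities)
    averaged = {
        label: sum(p[label] for p in probabilities) / n for label in labels
    }
    return max(averaged, key=averaged.get)


# Hypothetical outputs of three base classifiers for one query protein.
print(majority_vote(["all-alpha", "all-beta", "all-alpha"]))  # all-alpha
print(average_probability([
    {"all-alpha": 0.6, "all-beta": 0.4},
    {"all-alpha": 0.1, "all-beta": 0.9},
]))  # all-beta
```

Majority vote only needs hard labels, while average probability lets a confident classifier outweigh several uncertain ones, as the second example shows.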

Experiments

The framework of RF_PSCP
[Figure: the framework of RF_PSCP.]
Web server site: [URL not preserved in the transcript]

Protein structural class prediction: datasets
• Three benchmark datasets
• Three updated large-scale datasets
[Table: datasets and their sequence similarity; values not preserved in the transcript.]

Results: comparison with existing methods on three benchmark datasets

Results: tests of the proposed method on three updated large-scale datasets

Results: comparison of different combinations of feature subsets on three benchmark datasets

Results: optimization of the random forest classifier

Q&A