Prediction of RNA Binding Protein Using Machine Learning Technique

Presentation transcript:

Prediction of RNA Binding Protein Using Machine Learning Technique
Avdesh Mishra, Reecha Khanal, Md Tamjidul Hoque
Date: 4/07/2018

Overview:
- Importance of RNA-Binding Proteins (RBPs)
- Dataset Collection
- Feature Extraction
- Feature Encoding Techniques
- Feature Ranking
- Machine Learning
- Results
- Conclusions

Why RNA Binding Protein Prediction?
RNA-binding proteins play important roles in many biological functions:
- mRNA stability
- Stress response
- Cell cycle
- Tumor differentiation
- Apoptosis
- Gene regulation at post-transcriptional levels

Dataset Preparation: Collection of validation dataset
- PISCES: sequence identity 25%; X-ray resolution 3 Å; sequence length 50 – 10,000 amino acids; 14389 non-RBP protein chains
- UniProt Database: 68084 RBPs

CD-HIT:
- Input: 14389 non-RBP protein chains and 68084 RBPs
- Sequence identity cutoff: >= 25%
- Protein length: 50 – 10,000 amino acids
- Output: 7077 NonRBPs and 2770 RBPs

Previously established dataset: 2780 RBPs and 7077 NRBPs.
We prepared a balanced dataset by taking a subset of the previously established dataset:
- Proteins with non-standard amino acids removed
- Redundancy removed
Final balanced dataset consists of: 1700 RBPs and 1700 NonRBPs.
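The filtering and balancing steps can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the 1700-protein subsets were drawn, so random sampling with a fixed seed is an assumption here, and the function names are hypothetical.

```python
import random

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_residues(sequence):
    """Return True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_AMINO_ACIDS


def balance_dataset(rbps, non_rbps, per_class, seed=0):
    """Drop sequences with non-standard residues, then sample equal-sized classes.

    Random sampling is an assumption; the slides only state the final sizes.
    """
    rng = random.Random(seed)
    clean_rbps = [s for s in rbps if has_only_standard_residues(s)]
    clean_non_rbps = [s for s in non_rbps if has_only_standard_residues(s)]
    if min(len(clean_rbps), len(clean_non_rbps)) < per_class:
        raise ValueError("not enough sequences left after filtering")
    return rng.sample(clean_rbps, per_class), rng.sample(clean_non_rbps, per_class)
```

Calling `balance_dataset(rbps, non_rbps, 1700)` on the filtered collections would give class sizes matching the slide.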

Feature Set for RNA Binding Protein Prediction
- Hydrophobicity: the property by which molecules repel water molecules.
- Polarity: separation of electric charge, giving a molecule or chemical group an electric dipole or multipole moment.
- Polarizability: a measure of how easily an electron cloud is distorted by an electric field.
- Van der Waals Volume: the volume occupied by an individual atom or molecule.
- SA (Solvent Accessibility): a measure of the surface area accessible to a solvent.
- SS (Secondary Structure): the predicted secondary structure of each amino acid residue.
- PSSM (Position Specific Scoring Matrix): evolutionary information obtained from sequence alignment, computed using PSI-BLAST.

Sequence and Feature Vector Encoding
Feature Set for RNA Binding Protein Prediction (2518 features):
- Hydrophobicity, Polarity, Polarizability, Van der Waals Volume (21 each): the 20 amino acids are divided into three groups and then C-T-D is applied.
- SA, Solvent Accessibility (13): probabilities of two residue states (buried, exposed) predicted using ACCPro; then C-T-D is applied.
- SS, Secondary Structure Probabilities (21): probabilities of three secondary structure states (helix, beta, and coil) predicted using SSPro; then C-T-D is applied.
- PSSM (2400): features extracted by PSSM distance transformation (1900 PSSM-DDT + 100 PSSM-SDT + 400 PSSM-EDT).
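The total of 2518 can be checked with a little arithmetic. The sketch below assumes the usual C-T-D layout (one composition value per group, one transition value per pair of groups, and five distribution points per group); the slides give only the per-block totals, so the helper name and the breakdown are illustrative.

```python
def ctd_size(num_groups, distribution_points=5):
    """Length of a C-T-D vector for a sequence labelled with `num_groups` groups."""
    composition = num_groups                          # one value per group
    transition = num_groups * (num_groups - 1) // 2   # one value per pair of groups
    distribution = num_groups * distribution_points   # five positions per group
    return composition + transition + distribution


physicochemical = 4 * ctd_size(3)      # four properties, three groups each
solvent_accessibility = ctd_size(2)    # buried / exposed
secondary_structure = ctd_size(3)      # helix / beta / coil
pssm = 1900 + 100 + 400                # PSSM-DDT + PSSM-SDT + PSSM-EDT

total_features = physicochemical + solvent_accessibility + secondary_structure + pssm
print(total_features)  # 2518
```

Under this assumption, three groups give 21 values and two groups give 13, which agrees with the per-feature counts on the slide.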

C-T-D (Composition, Transition, and Distribution)
- Composition: the fraction of the sequence made up of a particular group of amino acids.
- Transition: how often the amino acid group changes from one group to another as we move linearly through the sequence.
- Distribution: how one amino acid group is distributed throughout the protein sequence.

C-T-D (Composition, Transition, and Distribution): amino acid groups per property

Hydrophobicity
  Group 1: Polar (R, K, E, D, Q, N)
  Group 2: Neutral (G, A, S, T, P, H, Y)
  Group 3: Hydrophobic (C, V, L, I, M, F, W)

Normalized van der Waals Volume
  Group 1: 0 – 2.78 (G, A, S, C, T, P, D)
  Group 2: 2.95 – 4.0 (N, V, E, Q, I, L)
  Group 3: 4.43 – 8.08 (M, H, K, F, R, Y, W)

Polarity
  Group 1: 4.92 – 6.2 (L, I, F, W, C, M, V, Y)
  Group 2: 8.0 – 9.2 (P, A, T, G, S)
  Group 3: 10.4 – 13.0 (H, Q, R, K, N, E, D)

Polarizability
  Group 1: 0 – 0.108 (G, A, S, D, T)
  Group 2: 0.128 – 0.186 (C, P, N, V, E, Q, I, L)
  Group 3: 0.219 – 0.409 (K, M, H, F, R, Y, W)
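Before C-T-D is applied, each residue is replaced by the label of its group. A minimal sketch of that encoding step follows; the dictionary and function names are hypothetical. Glycine (G) is placed in the middle polarity group, as in the commonly used C-T-D grouping, so that every property covers all 20 amino acids.

```python
# Group membership per property, keyed by group label "1", "2", "3".
HYDROPHOBICITY = {"1": "RKEDQN", "2": "GASTPHY", "3": "CVLIMFW"}
VAN_DER_WAALS_VOLUME = {"1": "GASCTPD", "2": "NVEQIL", "3": "MHKFRYW"}
# Assumption: G belongs to polarity group 2, as in the standard C-T-D grouping.
POLARITY = {"1": "LIFWCMVY", "2": "PATGS", "3": "HQRKNED"}
POLARIZABILITY = {"1": "GASDT", "2": "CPNVEQIL", "3": "KMHFRYW"}


def build_lookup(groups):
    """Map each amino acid letter to the label of the group that contains it."""
    return {aa: label for label, members in groups.items() for aa in members}


def encode(sequence, groups):
    """Replace every residue by its group label, e.g. 'RGC' becomes '123'."""
    lookup = build_lookup(groups)
    return "".join(lookup[aa] for aa in sequence.upper())
```

For example, `encode("RGC", HYDROPHOBICITY)` yields `"123"`, because R is polar, G is neutral, and C is hydrophobic.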

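Example: A E AAA E A EE AAAAA E A EEE AA EE A EEE AA E (30 residues)
- Number of A's (n1) = 16; number of E's (n2) = 14
- Composition for n1 = 16/30 * 100; composition for n2 = 14/30 * 100
- Transition = 15/29 * 100, since there are 15 transitions from A to E or E to A among the 29 adjacent pairs
- Distribution for A: the first, 25%, 50%, 75%, and 100% of the A's fall at positions 1, 5, 12, 20, and 29, each divided by the sequence length (for example, 1/30 * 100 for the first A)
Applying this approach to three amino acid groups gives a 21-dimensional vector (3 composition, 3 transition, and 15 distribution values).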
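The three descriptors can be computed directly from the example sequence. The function below is a minimal sketch rather than the authors' implementation: its name is hypothetical, all values are expressed as percentages, and the rule for picking the 25%, 50%, and 75% residues (truncating the count) is an assumption.

```python
from itertools import combinations


def ctd(labels, groups):
    """Composition, transition, and distribution (all in percent) of a labelled sequence."""
    n = len(labels)
    composition = {g: labels.count(g) / n * 100 for g in groups}

    # Transition: share of adjacent pairs that switch between two given groups.
    transition = {}
    for a, b in combinations(groups, 2):
        changes = sum(1 for x, y in zip(labels, labels[1:]) if {x, y} == {a, b})
        transition[a + b] = changes / (n - 1) * 100

    # Distribution: positions of the first, 25%, 50%, 75%, and 100% occurrence.
    distribution = {}
    for g in groups:
        positions = [i for i, x in enumerate(labels, start=1) if x == g]
        if not positions:
            distribution[g] = [0.0] * 5
            continue
        count = len(positions)
        ranks = [1] + [max(1, int(count * f)) for f in (0.25, 0.5, 0.75, 1.0)]
        distribution[g] = [positions[r - 1] / n * 100 for r in ranks]
    return composition, transition, distribution


example = "A E AAA E A EE AAAAA E A EEE AA EE A EEE AA E".replace(" ", "")
composition, transition, distribution = ctd(example, "AE")
```

With two groups the result has 2 + 1 + 10 = 13 values; passing a sequence labelled with three groups gives the 21 values used for each physicochemical property.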

PSSM-DT (Position Specific Scoring Matrix - Distance Transformation): PSSM-DDT
PSSM-DDT measures the occurrence probabilities of pairs of different amino acids separated by a distance d in a protein, computed from the PSSM profile.
- d: distance between the two amino acids of a pair
- i1, i2: the two different amino acids of a pair
- L: length of the protein sequence

PSSM-DT (Position Specific Scoring Matrix - Distance Transformation): PSSM-SDT
PSSM-SDT measures the occurrence probabilities of pairs of the same amino acid separated by a distance d in a protein, computed from the PSSM profile.
- d: distance between the two amino acids of a pair
- i: the individual amino acid
- L: length of the protein sequence

PSSM-DT (Position Specific Scoring Matrix - Distance Transformation): PSSM-EDT
PSSM-EDT measures the non-co-occurrence probability for two amino acids separated by a certain distance d in a protein, computed from the PSSM profile.
- d: distance between the two amino acids of a pair
- Ax, Ay: the two amino acids of a pair
- L: length of the protein sequence
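The formulas on these slides did not survive in the transcript. As a hedged illustration of the same-residue and different-residue transformations, the sketch below uses the commonly published PSSM-DT form, in which the scores of two columns d positions apart are multiplied and averaged over the L - d available positions. That form, the function names, and the maximum distance of 5 (inferred from 100 = 20 x 5 and 1900 = 380 x 5) are assumptions, not details confirmed by the slides. PSSM-EDT is not covered.

```python
def pssm_ddt(pssm, i1, i2, d):
    """Average product of column i1 at position j and column i2 at position j + d.

    `pssm` is a list of L rows, one per residue, each holding one score per amino acid.
    """
    length = len(pssm)
    if not 0 < d < length:
        raise ValueError("d must satisfy 0 < d < L")
    total = sum(pssm[j][i1] * pssm[j + d][i2] for j in range(length - d))
    return total / (length - d)


def pssm_sdt(pssm, i, d):
    """Same-residue case: both positions use the same PSSM column i."""
    return pssm_ddt(pssm, i, i, d)


def pssm_dt_features(pssm, max_distance):
    """Return (SDT features, DDT features) for d = 1 .. max_distance."""
    columns = range(len(pssm[0]))
    distances = range(1, max_distance + 1)
    sdt = [pssm_sdt(pssm, i, d) for i in columns for d in distances]
    ddt = [pssm_ddt(pssm, i1, i2, d)
           for i1 in columns for i2 in columns if i1 != i2
           for d in distances]
    return sdt, ddt
```

With a 20-column PSSM and `max_distance=5`, this returns 100 SDT and 1900 DDT values, which matches the counts given for the feature set. In practice the raw PSSM scores are usually normalized before the transformation.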

Feature Ranking: minimum redundancy maximum relevance (mRMR) feature selection technique.
Final training set of 1001 features obtained.
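The idea behind mRMR is to pick features greedily, each time choosing the one that is most relevant to the class label and least redundant with the features already chosen. The sketch below illustrates that greedy loop only. It substitutes absolute Pearson correlation for the mutual information that mRMR normally uses, and the function names are hypothetical, so it should not be read as the procedure used in the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mrmr_select(columns, target, k):
    """Greedily select k feature indices by relevance minus mean redundancy.

    Correlation stands in for mutual information here (a simplification).
    """
    relevance = [abs(pearson(col, target)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(i):
        if not selected:
            return relevance[i]
        redundancy = sum(abs(pearson(columns[i], columns[j])) for j in selected)
        return relevance[i] - redundancy / len(selected)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Given one feature column per list, `mrmr_select(columns, labels, 1001)` would return the indices of the retained features in the order they were chosen.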

Machine Learning Approach:
- Training features for base classifiers: X = {f1, f2, f3, …, f2518}
- Base classifiers: KNN Classifier, GBC, LOGREG
- Training features for meta-classifier: X = {P_KNN-bind, P_KNN-non-bind, P_GBC-bind, P_GBC-non-bind, P_LOGREG-bind, P_LOGREG-non-bind, f1, f2, f3, …, f2518}
- Meta-classifier: SVM Classifier
- Output: StackRBPPrediction
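The input to the meta-classifier is the original feature vector with the base classifiers' class probabilities placed in front of it. The sketch below shows only that layout. How the base-classifier probabilities are produced (for example, on held-out folds) is not stated in the slides, and the function name is hypothetical.

```python
def build_meta_features(base_probabilities, features):
    """Prepend (P_bind, P_non_bind) of each base classifier to the feature vector.

    `base_probabilities` is a list of (p_bind, p_non_bind) pairs, assumed to be
    ordered KNN, GBC, LOGREG as on the slide.
    """
    meta = []
    for p_bind, p_non_bind in base_probabilities:
        meta.extend([p_bind, p_non_bind])
    meta.extend(features)
    return meta


# One protein: three base classifiers plus 2518 original features.
probabilities = [(0.8, 0.2), (0.7, 0.3), (0.9, 0.1)]
original_features = [0.0] * 2518
meta_vector = build_meta_features(probabilities, original_features)
print(len(meta_vector))  # 2524
```

Each protein therefore reaches the SVM as a vector of 6 + 2518 = 2524 values.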

Results: Accuracy, sensitivity, and specificity were calculated using 10-fold cross-validation.
Model: Prediction of RNA-Binding Protein using Stacking
Overall accuracy (ACC): 91.24%
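For reference, the three metrics follow from the confusion-matrix counts: accuracy is the share of correct predictions, sensitivity is the share of true RBPs that are recovered, and specificity is the share of non-RBPs that are correctly rejected. A minimal sketch, with a hypothetical function name and with RBP encoded as label 1:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against a class that never appears in y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity
```

In a 10-fold cross-validation these values are computed on the held-out fold each time and then combined across the ten folds.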

Comparing different machine learning approaches (accuracy):
- Logistic Regression (LOGREG): 88.52%
- Support Vector Machine (SVM): 90.53%
- Stacking: 91.24%

Fig: Comparison with a recently proposed similar predictor (RBPPred), by accuracy:
- RBPPred: 67.82%
- Our Predictor: 91.25%

Thank you for your attention.