Prediction of RNA Binding Protein Using Machine Learning Technique


1 Prediction of RNA Binding Protein Using Machine Learning Technique
Avdesh Mishra, Reecha Khanal, Md Tamjidul Hoque
Date: 4/07/2018

2 Overview
Importance of RNA-Binding Proteins (RBPs)
Dataset Collection
Feature Extraction
Feature Encoding Techniques
Feature Ranking
Machine Learning
Results
Conclusions

3 Why RNA Binding Protein Prediction?
RNA-binding proteins play important roles in many biological functions: mRNA stability, stress response, cell cycle, tumor differentiation, apoptosis, and gene regulation at the post-transcriptional level.

4 Dataset Preparation
Collection of validation dataset:
Sources: PISCES, UniProt Database
Sequence identity: 25%
X-ray resolution: 3 Å
Sequence length: 50 – 10,000 amino acids
68084 RBPs
14389 non-RBP protein chains

5 CD-HIT
Input: 14389 protein chains, 68084 RBPs
Sequence identity cutoff: >= 25%
Protein length: 50 – 10,000 amino acids
Output: 7077 non-RBPs, 2770 RBPs

6 Final balanced dataset
Previously established dataset: 2780 RBPs, 7077 non-RBPs
We prepared a balanced dataset by taking a subset of the previously established dataset.
Proteins with non-standard amino acids were removed.
Redundancy was removed.
Final balanced dataset consists of: 1700 RBPs, 1700 non-RBPs

7 Feature Set for RNA-Binding Protein Prediction
Hydrophobicity: the property by which molecules repel water molecules.
Polarity: separation of electric charge that gives a molecule or chemical group an electric dipole or multipole moment.
Polarizability: a measure of how easily an electron cloud is distorted by an electric field.
Van der Waals volume: the volume occupied by an individual atom or molecule.
SA (solvent accessibility): the surface area accessible to a solvent.
SS (secondary structure): the predicted structural state of each amino acid residue.
PSSM (position-specific scoring matrix): evolutionary information obtained from sequence alignments computed using PSI-BLAST.

8 Sequence and Feature Vector Encoding
Feature set for RNA-binding protein prediction (2518 features):
Hydrophobicity, polarity, polarizability, van der Waals volume (21 each): the 20 amino acids are divided into three groups, then C-T-D is applied.
Solvent accessibility (13): probabilities of two states (buried, exposed) predicted using ACCPro, then C-T-D is applied.
Secondary structure (21): probabilities of three states (helix, beta, coil) predicted using SSPro, then C-T-D is applied.
PSSM (2400): features extracted by PSSM distance transformation (1900 PSSM-DDT + 100 PSSM-SDT + 400 PSSM-EDT).
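The feature counts above can be checked with a little arithmetic. The sketch below assumes that C-T-D contributes one composition value per group, one transition value per unordered pair of groups, and five distribution values per group (first, 25%, 50%, 75%, 100%); this breakdown is inferred from the worked example in the slides, and the helper name `ctd_size` is hypothetical.

```python
from math import comb

def ctd_size(groups: int) -> int:
    """Number of C-T-D values for a property with the given number of groups.

    Assumed breakdown: composition (one per group) + transition (one per
    unordered pair of groups) + distribution (five positions per group).
    """
    return groups + comb(groups, 2) + 5 * groups

physicochemical = 4 * ctd_size(3)       # hydrophobicity, polarity, polarizability, vdW volume
solvent_accessibility = ctd_size(2)     # buried / exposed
secondary_structure = ctd_size(3)       # helix / beta / coil
pssm = 1900 + 100 + 400                 # PSSM-DDT + PSSM-SDT + PSSM-EDT

total = physicochemical + solvent_accessibility + secondary_structure + pssm
print(ctd_size(3), ctd_size(2), total)
```

Under these assumptions the two-state and three-state encodings give 13 and 21 values respectively, and the total matches the 2518 features quoted on the slide.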

9 C-T-D (Composition, Transition, and Distribution)
Composition: the proportion of each amino acid group in the sequence.
Transition: how often the group changes from one to another as we move linearly through the sequence.
Distribution: how each amino acid group is distributed along the protein sequence.

10 C-T-D (Composition, Transition, and Distribution)
Property | Group 1 | Group 2 | Group 3
Hydrophobicity | Polar: R, K, E, D, Q, N | Neutral: G, A, S, T, P, H, Y | Hydrophobic: C, V, L, I, M, F, W
Normalized van der Waals volume | 0 – 2.78: G, A, S, C, T, P, D | 2.95 – 4.0: N, V, E, Q, I, L | 4.43 – 8.08: M, H, K, F, R, Y, W
Polarity | 4.92 – 6.2: L, I, F, W, C, M, V, Y | 8.0 – 9.2: P, A, T, G, S | 10.4 – 13.0: H, Q, R, K, N, E, D
Polarizability | 0 – 0.108: G, A, S, D, T | 0.128 – 0.186: C, P, N, V, E, Q, I, L | 0.219 – 0.409: K, M, H, F, R, Y, W
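As a minimal sketch of how this grouping table might be used, the snippet below maps each residue of a sequence to its group number for a chosen property. The names `GROUPS` and `encode` are hypothetical, and glycine (G) is placed in polarity group 2, which is an assumption based on the usual placement in this grouping scheme.

```python
# Three-group partitions of the 20 standard amino acids, one tuple per property.
# Assumption: G belongs to polarity group 2 (the usual placement in this scheme).
GROUPS = {
    "hydrophobicity": ("RKEDQN", "GASTPHY", "CVLIMFW"),
    "vdw_volume":     ("GASCTPD", "NVEQIL", "MHKFRYW"),
    "polarity":       ("LIFWCMVY", "PATGS", "HQRKNED"),
    "polarizability": ("GASDT", "CPNVEQIL", "KMHFRYW"),
}

def encode(sequence: str, prop: str) -> str:
    """Replace every residue by the number (1, 2 or 3) of its group."""
    lookup = {
        residue: str(index + 1)
        for index, members in enumerate(GROUPS[prop])
        for residue in members
    }
    return "".join(lookup[residue] for residue in sequence)

print(encode("MKRGLVA", "hydrophobicity"))
```

The resulting string of group labels is the input on which composition, transition, and distribution are then computed.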

11 Example: A E AAA E A EE AAAAA E A EEE AA EE A EEE AA E
Number of A's (n1) = 16 | Number of E's (n2) = 14
Composition for n1 = 16/30
Composition for n2 = 14/30
Transition: (15/29 × 100), since there are 15 transitions from A to E or E to A
Distribution for A (sequence position by which each fraction of the A's has occurred):
first → 1 (1/30 × 100)
25% → 5
50% → 12
75% → 20
100% → 29
Applied to a property with three groups, this approach yields a 21-dimensional vector (3 composition + 3 transition + 15 distribution values).
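The worked example can be reproduced with a short function. This is a sketch under stated assumptions, not the authors' implementation: the function name `ctd` is hypothetical, values are returned as fractions rather than percentages, and the k-th occurrence for each distribution point is taken as the floor of the fraction times the group count (with the first occurrence used for the 0% point).

```python
def ctd(encoded: str, groups: str):
    """Composition, transition and distribution of group labels in a sequence.

    `encoded` is a string of group labels; `groups` lists the labels in order.
    """
    n = len(encoded)
    composition = [encoded.count(g) / n for g in groups]

    # One transition value per unordered pair of groups, over n - 1 adjacent pairs.
    pairs = [(a, b) for i, a in enumerate(groups) for b in groups[i + 1:]]
    transition = [
        sum(1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {a, b}) / (n - 1)
        for a, b in pairs
    ]

    # Position (1-based, divided by n) of the first, 25%, 50%, 75%, 100% occurrence.
    distribution = []
    for g in groups:
        positions = [i + 1 for i, label in enumerate(encoded) if label == g]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            k = max(1, int(fraction * len(positions)))  # assumed rounding: floor
            distribution.append(positions[k - 1] / n)
    return composition, transition, distribution

sequence = "A E AAA E A EE AAAAA E A EEE AA EE A EEE AA E".replace(" ", "")
composition, transition, distribution = ctd(sequence, "AE")
print(len(sequence), composition, transition, distribution[:5])
```

For this two-group sequence the function returns 2 + 1 + 10 = 13 values; with three groups it would return 21.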

12 PSSM-DT (Position-Specific Scoring Matrix Distance Transformation)
PSSM-DDT: measures the occurrence probabilities of pairs of different amino acids separated by a distance d in a protein, computed from the PSSM profile.
d: distance between the two amino acids
i1, i2: the two different amino acids
L: length of the protein sequence

13 PSSM-DT (Position-Specific Scoring Matrix Distance Transformation)
PSSM-SDT: measures the occurrence probabilities of pairs of the same amino acid separated by a distance d in a protein, computed from the PSSM profile.
d: distance between the two amino acids
i: the individual amino acid
L: length of the protein sequence
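The formulas for these transformations are not reproduced in the transcript. The sketch below assumes a commonly used form of PSSM-DT, in which the scores of two columns are multiplied at positions j and j + d and averaged over the L − d available positions. The function names are hypothetical, and the maximum distance of 5 is inferred from the feature counts (20 × 5 = 100 for SDT, 380 × 5 = 1900 for DDT) rather than stated in the slides.

```python
import random

def pssm_sdt(pssm, max_d):
    """Same-residue distance transformation: one value per (d, column)."""
    length, n_cols = len(pssm), len(pssm[0])
    features = []
    for d in range(1, max_d + 1):
        for i in range(n_cols):
            total = sum(pssm[j][i] * pssm[j + d][i] for j in range(length - d))
            features.append(total / (length - d))
    return features

def pssm_ddt(pssm, max_d):
    """Different-residue distance transformation: one value per (d, i1, i2), i1 != i2."""
    length, n_cols = len(pssm), len(pssm[0])
    features = []
    for d in range(1, max_d + 1):
        for i1 in range(n_cols):
            for i2 in range(n_cols):
                if i1 == i2:
                    continue
                total = sum(pssm[j][i1] * pssm[j + d][i2] for j in range(length - d))
                features.append(total / (length - d))
    return features

# Hypothetical PSSM: 12 residues by 20 amino-acid columns of random scores.
random.seed(0)
toy_pssm = [[random.random() for _ in range(20)] for _ in range(12)]
print(len(pssm_sdt(toy_pssm, 5)), len(pssm_ddt(toy_pssm, 5)))
```

In practice the PSSM would come from PSI-BLAST and is often normalized before the transformation; that step is omitted here.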

14 PSSM-DT (Position-Specific Scoring Matrix Distance Transformation)
PSSM-EDT: measures the non-co-occurrence probability of two amino acids separated by a certain distance d in a protein, computed from the PSSM profile.
d: distance between the two amino acids
Ax, Ay: the two amino acids
L: length of the protein sequence

15 Feature Ranking
Minimum redundancy maximum relevance (mRMR) feature selection technique.
Final training set of 1001 features obtained.

16 Machine Learning Approach
Training features for base classifiers: X = {f1, f2, f3, …, f2518}
Base classifiers: KNN, GBC, LOGREG
Training features for meta-classifier: X = {P_KNN-bind, P_KNN-non-bind, P_GBC-bind, P_GBC-non-bind, P_LOGREG-bind, P_LOGREG-non-bind, f1, f2, f3, …, f2518}
Meta-classifier: SVM → StackRBPPrediction
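To make the stacking input concrete, the sketch below builds one meta-classifier row by placing the six base-classifier probabilities in front of the original feature vector. The function name `meta_features` and the probability values are hypothetical; training the base classifiers and the SVM is not shown.

```python
BASE_CLASSIFIERS = ("KNN", "GBC", "LOGREG")

def meta_features(base_probabilities, features):
    """Concatenate (P_bind, P_non-bind) of each base classifier with the features.

    `base_probabilities` maps a classifier name to its (p_bind, p_non_bind) pair.
    """
    row = []
    for name in BASE_CLASSIFIERS:
        p_bind, p_non_bind = base_probabilities[name]
        row.extend([p_bind, p_non_bind])
    return row + list(features)

# Hypothetical outputs of the three base classifiers for one protein.
probabilities = {"KNN": (0.8, 0.2), "GBC": (0.7, 0.3), "LOGREG": (0.6, 0.4)}
original = [0.0] * 2518            # placeholder for f1 ... f2518
row = meta_features(probabilities, original)
print(len(row))
```

With 2518 original features, each meta-classifier row has 2518 + 6 = 2524 entries.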

17 Results
Accuracy, sensitivity, and specificity were calculated using 10-fold cross-validation.
Model: prediction of RNA-binding proteins using stacking
Overall accuracy (ACC): 91.24%
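For reference, the three reported measures follow from the confusion-matrix counts, with RBPs as the positive class. The sketch below uses the standard definitions; the function name and the counts in the usage line are hypothetical and are not the results from the slides.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # fraction of true RBPs predicted as RBPs
    specificity = tn / (tn + fp)   # fraction of true non-RBPs predicted as non-RBPs
    return accuracy, sensitivity, specificity

# Hypothetical counts for illustration only.
accuracy, sensitivity, specificity = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(accuracy, sensitivity, specificity)
```

In 10-fold cross-validation these counts would be accumulated over the ten held-out folds before the measures are computed, or the measures averaged across folds.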

18 Comparing Different Machine Learning Approaches
Model | Accuracy
Logistic Regression (LOGREG) | 88.52%
Support Vector Machine (SVM) | 90.53%
Stacking | 91.24%

19 Fig: Comparison with a recently proposed similar predictor (RBPPred)
Model | Accuracy
RBPPred | 67.82%
Our Predictor | 91.25%

20 Thank you for your attention.

