Support Vector Machine (SVM)

A Machine Learning based Approach for Disulfide Bond Prediction

Avdesh Mishra, Md Tamjidul Hoque {amishra2,
Department of Computer Science, University of New Orleans, New Orleans, LA, USA

Motivation

Accurate prediction of disulfide bonds can help improve the accuracy of ab initio protein structure prediction (aiPSP), since they impose geometrical constraints on the protein backbone, which greatly reduces the search space.

We are motivated to apply the results from disulfide bond prediction to improve the accuracy of our existing ab initio protein structure prediction method, called 3DIGARS-PSP.

Results

Table 2: Name and definition of the performance measures.

Performance Measure        Definition
Recall/Sensitivity (%)     $\frac{TP}{TP+FN}$
Specificity (%)            $\frac{TN}{TN+FP}$
False Positive Rate        $\frac{FP}{FP+TN}$
False Negative Rate        $\frac{FN}{FN+TP}$
Precision (%)              $\frac{TP}{TP+FP}$
F-measure                  $\frac{2TP}{2TP+FP+FN}$
MCC                        $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$
Accuracy Balanced (%)      $\frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$
Accuracy Overall (%)       $\frac{TP+TN}{TP+FP+TN+FN}$

Table 3: Performance of individual cysteine bonding prediction obtained by the SVM based machine learning method.

Performance Measure        Support Vector Machine (SVM)
Recall/Sensitivity (%)     93.15
Specificity (%)            60.70
False Positive Rate        0.393
False Negative Rate        0.068
Precision (%)              83.9
F-measure                  0.883
MCC                        0.587
Accuracy Balanced (%)      76.93
Accuracy Overall (%)       83.0

Introduction

Disulfide bonds are covalent bonds formed during post-translational modification by the oxidation of a pair of cysteines.
These bonds between cysteines are one of the major forces responsible for stabilizing protein conformations and for post-translational modification, and they play an important role in ab initio protein structure prediction (aiPSP) and protein folding.

In this study, we established a machine learning based method for disulfide bond prediction using a support vector machine (SVM). For effective training, various useful features are extracted:
- Conservation profile
- Solvent accessibility
- Torsion angle flexibility
- Disorder probability
- Sequential distance between cysteines, etc.

The process of disulfide bond prediction is carried out in two stages:
- First, individual cysteines are predicted as either bonding or non-bonding.
- Second, the cysteine pairs are predicted as either bonding or non-bonding. This stage includes the results from individual cysteine bonding as a feature.

The comparison of our method with the state-of-the-art methods shows that the proposed method attains higher prediction accuracy.

Figure 1: The best window size of 33, obtained for individual cysteine prediction through 10-fold cross-validation over the dataset of 2303 proteins consisting of cysteine residues.

Table 4: Comparison of the performance of SVM on balanced and imbalanced dataset.

Performance Measure        SVM Balanced Set    SVM Imbalanced Set
Recall/Sensitivity (%)     88.76               49.52
Specificity (%)            73.93               97.47
False Positive Rate        0.2607              0.0253
False Negative Rate        0.1124              0.5048
Precision (%)              77.29               79.63
F-measure                  0.8263              0.6106
MCC                        0.6338              0.5745
Accuracy Balanced (%)      81.34               73.49
Accuracy Overall (%)                           89.47

Methods

Training Data Sets

We collected a dataset of protein sequences containing disulfide bonds established previously by Shen et al. This dataset was filtered to remove inconsistencies. Furthermore, a dataset of 4120 FASTA sequences containing at least one disulfide bond was collected from the UniProt database.
The FASTA sequences from the two sources mentioned above were combined, and only the sequences with < 25% sequence similarity were selected as the final set for this study. The final dataset consisted of 2303 non-redundant proteins. Next, we created two different datasets:
- A set consisting of a balanced number of bonding and non-bonding cysteines
- A set consisting of bonding and non-bonding cysteines in a ratio of 1:5

Feature Construction

The residues of the primary protein sequence are encoded by 59 features, grouped into residue, chemical, conservation, structural, flexibility and energy profiles. For individual cysteine bond prediction, these 59 features are used. For cysteine pair prediction, we used a total of 61 features: the 59 features used for individual cysteine prediction and 2 additional features, the sequence distance between cysteines and the individual cysteine bonding probability. Next, feature windowing is applied to include the features of neighboring residues. After feature windowing, the absolute values of the sum and difference of the features are used to train the machine learning method.

Machine Learning Method – Support Vector Machine (SVM)

SVM is a machine learning method that classifies by maximizing the margin of the separating hyperplane between two classes and penalizes the instances on the wrong side of the decision boundary using a cost parameter, C. SVM supports several kernel functions, among which we used the radial basis function (RBF) kernel. The RBF kernel has a "gamma" parameter, which is inversely related to the kernel width (gamma = 1/(2σ^2) for a Gaussian of standard deviation σ) and controls how quickly the similarity between two points decays with their distance. The RBF kernel parameter "gamma" and the cost parameter C are optimized to achieve the best accuracy using a grid search approach.

Figure 2: The best window size of 1, obtained for cysteine pair prediction on the balanced dataset, excluding individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of cysteine pairs.
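The feature windowing, the pair encoding by absolute sum and difference, and the RBF similarity described under Feature Construction and Machine Learning Method can be sketched in a few lines. This is only an illustrative sketch: the function names, the zero padding at sequence ends, and the reading that the sum and difference are taken between the windowed vectors of the two cysteines in a pair are assumptions, not details given on the poster.

```python
import math
from typing import List, Sequence


def window_features(per_residue: Sequence[Sequence[float]],
                    center: int, window: int) -> List[float]:
    """Concatenate the feature vectors of `window` residues centred on `center`.

    Positions outside the sequence are zero-padded (an assumption; the
    poster does not say how sequence ends are handled).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    n_features = len(per_residue[0])
    out: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(per_residue):
            out.extend(per_residue[pos])
        else:
            out.extend([0.0] * n_features)
    return out


def pair_features(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Encode a cysteine pair as |a + b| followed by |a - b|, element-wise.

    Both parts are symmetric, so the encoding does not depend on the
    order in which the two cysteines are given.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    sums = [abs(x + y) for x, y in zip(a, b)]
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return sums + diffs


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF similarity exp(-gamma * ||x - y||^2) between two points."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy sequence of three residues with two features each (hypothetical values).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
left = window_features(toy, center=0, window=3)
right = window_features(toy, center=2, window=3)
pair = pair_features(left, right)
```

With a window size of 1, `window_features` returns only the residue's own feature vector. A larger gamma makes `rbf_kernel` fall off faster with distance, which is why gamma and C are usually tuned together.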
Figure 3: The best window size of 5, obtained for cysteine pair prediction on the imbalanced dataset, excluding individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of cysteine pairs.

Table 5: Comparison of the performance of SVM on balanced and imbalanced dataset.

Performance Measure        SVM Balanced Set    SVM Imbalanced Set
Recall/Sensitivity (%)     88.67               53.01
Specificity (%)            80.11               97.48
False Positive Rate        0.1989              0.0252
False Negative Rate        0.1135              0.4699
Precision (%)              81.67               80.81
F-measure                  0.8503              0.6402
MCC                        0.6903              0.603
Accuracy Balanced (%)      84.39               75.25
Accuracy Overall (%)                           90.07

Feature profiles extracted from the protein sequence (⋯GSMYQLQFINLVYDT⋯):
- Residue Profile: amino acid type and terminal indicator (2 features)
- Chemical Profile: polarity score, secondary structure score, molecular volume score, codon diversity score and electrostatic charge score (5 features)
- Conservation Profile: PSSM scores, monogram and bigram (41 features)
- Structural Profile: secondary structure probability and accessible surface area (7 features)
- Flexibility Profile: phi angle fluctuation, psi angle fluctuation and disorder probability (3 features)
- Energy Profile: position specific estimated energy score (1 feature)
- Distance Profile: sequential distance between cysteines

Figure 4: The best window size of 1, obtained for cysteine pair prediction on the balanced dataset, including individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of cysteine pairs.

Figure 5: The best window size of 5, obtained for cysteine pair prediction on the imbalanced dataset, including individual CYS probability features, through 10-fold cross-validation over the dataset of 2303 proteins consisting of cysteine pairs.

Comparative Study Based on Features

Table 1: Performance of the existing nearest neighbor algorithm (NNA) depending on the features employed to train the model.

Performance Measure                 Features Used in NNA    Features Proposed in This Study
Sensitivity                         45.24                   58.08
Specificity                         87.34                   89.90
Balanced Accuracy                   66.29                   73.99
Overall Accuracy (% Improvement)    79.38                   83.88 (5.68%)

The accuracies presented in Table 1 are obtained using the jackknife validation approach on the dataset established previously by Shen et al., after filtering the samples. Obsolete proteins as well as samples that did not contain cysteine residues were discarded from further consideration.

Comparative Study Based on ML-Methods

Figure 6: Comparison of the proposed SVM based method with the existing NNA based method in terms of sensitivity, specificity, balanced accuracy and overall accuracy. The proposed method attains an overall accuracy of 90.07%, which is 13.48% better than the NNA based method.

Conclusions

We propose an accurate predictor that incorporates novel structural, flexibility and energy features and utilizes an optimized machine learning method, SVM. The improved predictor can be utilized to:
- Annotate sequences whose structures are unknown
- Further aid experimental studies of disulfide bonds and structure determination
- Improve the prediction accuracy of ab initio protein structure prediction
- Improve the accuracy of fold recognition

Altogether, the proposed predictor achieves an overall improvement of 13.48% in comparison to the state-of-the-art approaches.

Discussions

- Prediction of disulfide bonds plays a crucial role in ab initio protein structure prediction and protein folding.
- Improved prediction of disulfide bonds can be useful in improving the accuracy of ab initio protein structure prediction, since the bonds impose geometrical constraints on the protein backbone and thus can help greatly reduce the search space.
- We propose disulfide bond prediction from protein sequence.
- We introduce several novel features: structural profile, flexibility profile, energy profile, etc.
- We carried out optimization of the C and "gamma" parameters of SVM for improved accuracy.
- Two-stage prediction (individual cysteine bonding prediction followed by cysteine pair bonding prediction) helped improve the accuracy of cysteine pair prediction by using the individual cysteine bonding prediction probabilities as features.
- Our motivation is to apply the results from disulfide bond prediction to improve the accuracy of our existing ab initio protein structure prediction method, called 3DIGARS-PSP.

Acknowledgements

We gratefully acknowledge the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF ( )-RD-B-07.

References

Niu, S., Huang, T., Feng, K.Y., He, Z., Cui, W. Inter- and intra-chain disulfide bond prediction based on optimal feature selection. Protein Pept Lett. 2013; 20: 324–35.
Mis, A., Hoque, T. Next Generation Evolutionary Sampling and Energy Function Guided ab initio Protein Structure Prediction, Biophysical Journal, DOI:
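The performance measures listed in Table 2 can be computed directly from confusion-matrix counts, as in the sketch below. It takes balanced accuracy as the mean of sensitivity and specificity and overall accuracy as the fraction of correct predictions, which agrees with the values reported in Tables 1 and 3 to 5. The function names and the example counts are hypothetical, and `overall_from_rates` assumes the stated 1:5 class ratio holds exactly in the evaluated set.

```python
import math
from typing import Dict


def performance_measures(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the Table 2 measures from confusion-matrix counts.

    Values are returned as fractions (multiply by 100 for percentages).
    No guard against empty classes: every denominator must be non-zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "balanced_accuracy": 0.5 * (sensitivity + specificity),
        "overall_accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def overall_from_rates(sensitivity: float, specificity: float,
                       negatives_per_positive: float) -> float:
    """Overall accuracy implied by sensitivity and specificity when there are
    `negatives_per_positive` negative samples for every positive sample."""
    return ((sensitivity + negatives_per_positive * specificity)
            / (1.0 + negatives_per_positive))


# Hypothetical counts, for illustration only (not taken from the poster).
m = performance_measures(tp=90, tn=60, fp=40, fn=10)

# Reported imbalanced-set rates (in %) combined with the 1:5 class ratio.
table5_overall = overall_from_rates(53.01, 97.48, 5)
table4_overall = overall_from_rates(49.52, 97.47, 5)
```

On a 1:5 set the overall accuracy is dominated by specificity: the reported imbalanced-set rates of Table 5 give approximately 90.07%, even though sensitivity is only 53.01%. This is why balanced accuracy and MCC are reported alongside overall accuracy.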
