
1 A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction Li Lihong (Anna Lee), Computer Science, Apr. 22

2 Content
1. Abstract
2. Introduction
3. Materials and Methods
4. Dealing with Class Imbalance: A New Supervised Over-Sampling Method
5. Experimental Results and Analysis
6. Conclusion

3 Abstract
Protein-nucleotide interactions are ubiquitous and useful for both protein function annotation and drug design.
Identifying interaction residues solely from protein sequences is a typical imbalanced learning problem, yet little attention has been paid to the negative impact of class imbalance.
We propose a new supervised over-sampling algorithm and implement a predictor, called TargetSOS.
Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS.

4 Introduction
Why?
1. Nucleotides play critical roles in various metabolic processes.
2. Predicting binding residues is of significant importance for protein function analysis and drug design.
Early efforts
1. Motif-based methods dominated this field.
2. Challenges: such methods characterize protein-nucleotide interactions within a relatively narrow range (usually only for a single nucleotide type) and require tertiary protein structure as input.
Machine-learning-based methods (great success)
1. ATPint demonstrated the feasibility of predicting … solely from protein sequence information.
2. NsitePred extended prediction to multiple nucleotides based on larger training datasets.

5 Imbalanced learning problem
1. The number of negative samples is significantly larger than that of positive samples.
2. No existing methods have considered this serious class imbalance phenomenon.
Solutions
Sample-rescaling-based methods, learning-based methods, active learning, kernel learning, and hybrid methods.
Supervised over-sampling technique in this study
1. Sample rescaling is a basic strategy that balances the sizes of different classes by changing the number and distribution of samples within them.
2. Over-sampling differs from the under-sampling technique.
3. Representative algorithms: ROS, SMOTE, ADASYN, and the proposed SOS.
4. New predictor: TargetSOS.

6 Materials and Methods
Benchmark Datasets
Feature Representation and Classifier
A. Extract Feature Vector from the Position-Specific Scoring Matrix
B. Extract Feature Vector from the Predicted Protein Secondary Structure
C. Support Vector Machine

7 Benchmark Datasets
Two benchmark datasets were chosen to evaluate the efficacy of the proposed SOS algorithm and of the implemented predictor.
ATP168: 168 non-redundant, ATP-interacting protein sequences, with 3,104 ATP-binding residues and 59,226 non-binding residues.
NUC5: a multiple-nucleotide-interacting dataset consisting of 227, 321, 140, 56, and 105 protein sequences that interact with five types of nucleotides, i.e., ATP, ADP, AMP, GTP, and GDP, respectively.
In both datasets, the maximal pairwise sequence identity is less than 40%.
Table 1 summarizes the detailed compositions of the two benchmark datasets:

8 (Table 1: detailed compositions of the two benchmark datasets)

9 Feature Representation and Classifier
The position-specific scoring matrix (PSSM) and the predicted protein secondary structure (PSS), both of which have been demonstrated to be especially useful for protein-nucleotide binding residue prediction, are used to extract discriminative feature vectors. A support vector machine (SVM) is used as the classifier for constructing the prediction model.

10 A. Extract Feature Vector from the Position-Specific Scoring Matrix
PSSM is widely used in bioinformatics. In this study, we obtain the PSSM of a query protein sequence by running PSI-BLAST against the Swiss-Prot database for three iterations with an E-value cutoff of 0.001. Each score x contained in the PSSM is normalized using the logistic function. Based on the normalized PSSM, the feature vector, denoted Logistic PSSM, for each residue in the protein sequence is extracted by applying a sliding-window technique with a window size of 17. The dimensionality of the Logistic PSSM feature vector of a residue is therefore 17 × 20 = 340-D.
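The two steps above (logistic normalization of the raw scores, then a fixed-size window around each residue) can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the zero-padding at the sequence ends are my assumptions:

```python
import numpy as np

def logistic_normalize(pssm):
    """Squash each raw PSSM score x into (0, 1) with the logistic function 1/(1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-pssm))

def window_features(matrix, index, win=17):
    """Flatten a win-row window centered on residue `index`.
    Rows that fall outside the sequence are zero-padded."""
    L, d = matrix.shape
    half = win // 2
    padded = np.zeros((L + 2 * half, d))
    padded[half:half + L] = matrix
    return padded[index:index + win].ravel()
```

For an L × 20 normalized PSSM and `win=17`, each residue yields a 17 × 20 = 340-D vector, matching the dimensionality stated above.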

11 B. Extract Feature Vector from the Predicted Protein Secondary Structure
PSIPRED predicts the probability of each residue in a query protein sequence belonging to each of three secondary structure classes, i.e., coil, helix, and strand. We obtain the predicted secondary structure by running PSIPRED against the query sequence; the result is an L × 3 probability matrix, where L is the length of the protein sequence. Similar to the Logistic PSSM feature extraction, we extract a 17 × 3 = 51-D feature vector, denoted PSS, for each residue by applying a sliding window of size 17.
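Assuming the per-residue PSSM and PSS windows are concatenated into one input vector for the classifier (the transcript does not state this explicitly; 340-D + 51-D = 391-D is my inference), the combined extraction could look like:

```python
import numpy as np

def residue_feature(pssm_norm, pss_prob, index, win=17):
    """Concatenate the windowed Logistic PSSM (17*20 = 340-D) and the windowed
    secondary-structure probabilities (17*3 = 51-D) for one residue (391-D total)."""
    def window(matrix):
        L, d = matrix.shape
        half = win // 2
        padded = np.zeros((L + 2 * half, d))  # zero-pad the sequence ends
        padded[half:half + L] = matrix
        return padded[index:index + win].ravel()
    return np.concatenate([window(pssm_norm), window(pss_prob)])
```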

12 C. Support Vector Machine
We use SVM as the base learning model to evaluate the efficacy of the proposed SOS algorithm. Let {(x_i, y_i)} be the set of training samples, where +1 and −1 are the labels of the positive and negative classes, respectively. In linearly separable cases, SVM constructs a hyperplane that separates the samples of the two classes with a maximum margin. The optimal separating hyperplane (OSH) is constructed by finding a weight vector w and a bias b that minimize ||w||²/2 while satisfying the following conditions:
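The equations on this slide did not survive the transcript; the standard hard-margin SVM formulation they presumably showed is:

```latex
\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{s.t.} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,
\qquad i = 1,\dots,N.
```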

13 To allow for mislabeled examples, we use a soft-margin technique. For each training sample, a corresponding slack variable ξ_i, i = 1, 2, …, N, is introduced. Accordingly, the relaxed separation constraint is given as:
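The relaxed constraint and the corresponding soft-margin objective, in their standard form (reconstructed, since the slide's formulas are missing from the transcript; C is the penalty parameter):

```latex
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\ i = 1,\dots,N,
```

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^{2}
+ C \sum_{i=1}^{N} \xi_i .
```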

14 Then, the OSH can be solved by minimizing the soft-margin objective. Furthermore, to address non-linearly separable cases, the "kernel substitution" technique is introduced as follows: first, the input vector x_i ∈ R^d is mapped into a higher-dimensional Hilbert space, H, by a non-linear kernel function, K(x_i, x_j); then, the OSH in the mapped space H is solved using a procedure similar to that for the linear case, and the decision function is given by:
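The decision function of a kernel SVM in its standard form, where the α_i are the Lagrange multipliers obtained from the dual problem (reconstructed; the slide's formula is missing from the transcript):

```latex
f(\mathbf{x}) = \operatorname{sgn}\!\left(
\sum_{i=1}^{N} \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b
\right).
```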

15 Dealing with Class Imbalance: A New Supervised Over-Sampling Method
A. Random Over-sampling
B. Synthetic Minority Over-sampling Technique
C. Adaptive Synthetic Sampling
D. Proposed Supervised Over-sampling

16 A. Random Over-sampling
In the ROS technique, the minority set S_min is augmented by replicating randomly selected samples within the set. ROS is easy to perform, but the resulting models tend to over-fit. To mitigate this problem, SMOTE and ADASYN were introduced.
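ROS is simple enough to sketch in a few lines (a minimal illustration; the function name and the fixed target size are my choices):

```python
import random

def random_over_sample(minority, target_size, seed=0):
    """ROS: grow the minority set to `target_size` by duplicating randomly
    chosen minority samples. Exact duplicates are what make ROS prone to
    over-fitting."""
    rng = random.Random(seed)
    augmented = list(minority)
    while len(augmented) < target_size:
        augmented.append(rng.choice(minority))
    return augmented
```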

17 B. Synthetic Minority Over-sampling Technique
For each sample x_i in S_min, let S_i^K be the set of the K nearest neighbors of x_i in S_min under the Euclidean distance metric. To synthesize a new sample, an element of S_i^K, denoted x̂_i, is selected; the feature-vector difference between x̂_i and x_i is multiplied by a random number δ drawn from [0, 1]; and the resulting vector is added to x_i: x_new = x_i + δ · (x̂_i − x_i).
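A compact sketch of the SMOTE synthesis rule above (illustrative only; a brute-force neighbor search, which would be replaced by a k-d tree or a library implementation in practice):

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """SMOTE: for a randomly chosen minority sample x_i, pick one of its
    k nearest minority neighbors x_hat and synthesize
    x_new = x_i + delta * (x_hat - x_i), with delta ~ U(0, 1)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    # pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        j = rng.choice(neighbors[i])         # one of the k nearest neighbors
        delta = rng.random()
        synthetic.append(X[i] + delta * (X[j] - X[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two existing minority points, SMOTE interpolates rather than duplicates, which is what distinguishes it from ROS.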

18 C. Adaptive Synthetic Sampling
SMOTE creates the same number of synthetic samples for each original minority sample without considering the neighboring majority samples, which increases the occurrence of overlapping between classes. In view of this limitation, ADASYN was introduced.
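The key ADASYN idea is a per-sample weight: minority samples with more majority neighbors (i.e., near the class boundary) receive more synthetic neighbors. A sketch of the weight computation, under the usual definition (function name and fallback for the all-zero case are mine):

```python
import numpy as np

def adasyn_weights(minority, majority, k=5):
    """ADASYN weighting: for each minority sample, take the fraction of
    majority samples among its k nearest neighbors in the whole data set,
    then normalize the fractions to sum to 1. The number of synthetic
    samples for sample i is then round(G * w_i) for a total budget G."""
    X_min = np.asarray(minority, dtype=float)
    X_all = np.vstack([minority, majority])
    n_min = len(X_min)
    is_majority = np.arange(len(X_all)) >= n_min
    ratios = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                        # exclude the sample itself
        nn = np.argsort(d)[:k]
        ratios.append(is_majority[nn].mean())
    r = np.array(ratios)
    # if no minority sample has majority neighbors, fall back to uniform
    return r / r.sum() if r.sum() > 0 else np.full(n_min, 1.0 / n_min)
```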

19

20 D. Proposed Supervised Over-sampling.

21

22

23 Experimental Results and Analysis
Evaluation Indexes
Supervised Over-Sampling Helps to Enhance Prediction Performance
Comparisons with Other Over-Sampling Methods
Comparisons with Existing Predictors
A. Cross-Validation Test
B. Independent Validation Test

24 Evaluation Indexes
Let TP, FP, TN, and FN be the abbreviations for true positive, false positive, true negative, and false negative, respectively. Then, Sensitivity (Sen), Specificity (Spe), Accuracy (Acc), and the Matthews correlation coefficient (MCC) can be defined as follows:

25 Supervised Over-Sampling Helps to Enhance Prediction Performance

26 Figure 1. ROC curves of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation. (a) ROC curves for ATP168; (b) ROC curves for ATP227.

27 Comparisons with Other Over-Sampling Methods

28 Comparisons with Existing Predictors
A. Cross-Validation Test
B. Independent Validation Test

29 A. Cross-Validation Test.

30

31 B. Independent Validation Test.

32 Conclusion In this study, a new SOS algorithm that balances the samples of different classes by synthesizing additional samples for the minority class through a supervised process is proposed to address imbalanced learning problems. We applied the proposed SOS algorithm to protein-nucleotide binding residue prediction and implemented a web server, called TargetSOS. Cross-validation tests and independent validation tests on two benchmark datasets demonstrate that the proposed SOS algorithm helps to improve the performance of protein-nucleotide binding residue prediction. The findings of this study enrich the understanding of class-imbalance learning and are sufficiently flexible to be applied to other bioinformatics problems in which class imbalance exists, such as protein functional residue prediction and disulfide bond prediction.

33 Thank you for your attention!

