Typically, classifiers are trained based on local features of each site in the training set of protein sequences. Thus no global sequence information is used.

Using Global Sequence Similarity Improves Biological Site-Specific Classifiers

Jivko Sinapov, Cornelia Caragea, Drena Dobbs and Vasant Honavar

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

Typically, classifiers are trained on local features of each site in the training set of protein sequences. Thus no global sequence information is used when making predictions on the sites of the sequence to be annotated. In this work we seek to improve such classifiers by taking into account the global sequence similarity between the test sequence and the sequences in the training set.

Biological Motivation:

Many problems in bioinformatics involve predicting a class label for each element of a protein sequence. Examples include:
- Prediction of RNA- and DNA-binding protein residues
- Prediction of post-translational modification sites
- Prediction of secondary structure elements in sequences

[Figure: a protein sequence drawn from the N-terminus (H3N+) to the C-terminus (COO-), with residues marked "O-Glycosylated?" or "Non O-Glycosylated?". Panels: a) O-Glycosylation, b) Protein-RNA interaction sites, c) Protein-Protein interface sites.]

Example Problems:

1. Prediction of O-linked glycosylation sites
2. Prediction of RNA-binding protein residues
3. Prediction of protein-protein interface residues

Feature Representation:

A window of 21 amino acids centered on the target residue. (The example windows below are 15 residues long.)

Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Label:    1111110011111110011111001011111100000001111101000000

Data points used for training and testing a classifier (target residue window, class label):

...
VKKFGGEVVKAGNIL,-1
KKFGGEVVKAGNILV,-1
KFGGEVVKAGNILVR,+1
FGGEVVKAGNILVRQ,+1
...

Datasets:

Dataset          Number of Sequences  Number of + Instances  Number of - Instances
O-GlycBase       216                  2168                   12147
Protein-RNA      147                  4336                   27988
Protein-Protein  42                   2350                   9204

Naïve Bayes (NB):

Let x_test = {f_1, f_2, …, f_n} be an n-dimensional test data point.
- Apply Bayes rule: [equation not preserved]
- Independence assumption: [equation not preserved]
- Assign the class that maximizes: [equation not preserved]

Hierarchical Mixture of Naïve Bayes Experts (HME-NB):

Let S_1, S_2, …, S_N be a dataset of protein sequences.

1. Compute an N by N pairwise similarity matrix using global alignment scores with the BLOSUM62 substitution matrix.
2. Using a spectral clustering algorithm, recursively partition the set of training sequences to obtain a hierarchical clustering of the sequences.
   [Figure: hierarchical clustering tree; node sizes shown: 147, 25, 122, 94, 28, 49, 45, 26, 23.]
3. Use the structure of the hierarchical partitioning to learn a Hierarchical Mixture of Experts model such that:
   - A Naïve Bayes model is trained at each leaf node of the hierarchical partitioning. [Symbols for the leaf nodes and their parameters not preserved.]
   - For the input features x_test of some residue in a sequence, each leaf node computes the class probability for x_test. [Equation not preserved.]
   - Each non-leaf node combines the predictions from its children. [Equation not preserved.]

Results:

1. Performed 10-fold sequence-based cross-validation.
2. Compared Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB).

             O-Glycosylation   Protein-RNA interactions   Protein-Protein interface
Measure      NB      HME-NB    NB      HME-NB             NB      HME-NB
MCC          0.57    0.58      0.32    0.37               0.08    0.25
Sensitivity  0.61    0.65      0.24    0.31               0.06    0.18
Specificity  0.65    0.63      0.65    0.66               0.38    0.60
AUC          0.88    0.91      0.74    0.76               0.62    0.72

Accuracy: only five of the six values survive in this transcript (0.89, 0.83, 0.84, 0.79, 0.81), so they are not assigned to columns here.

[Figure: a qualitative comparison of Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB) on the task of predicting protein-protein interface sites of the Anionic trypsin-2 precursor of Rattus norvegicus (shown in spheres) interfaced with the Ecotin precursor of E. coli (in green). Each residue of the Anionic trypsin-2 precursor is colored according to whether the prediction is a True Positive (red), True Negative (gray), False Positive (blue) or False Negative (yellow).]

For both methods, the False Positive Rate (FPR) is fixed at 0.1. HME-NB achieves a higher True Positive Rate (TPR) of 0.88, compared with 0.56 for NB, at the same FPR.

Conclusion:

We developed a classifier that improves the labeling of biological sequence data.

Acknowledgements:

This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs.
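The window-based data points listed under Feature Representation can be reproduced with a short sketch like the one below. It is an illustration, not the authors' code: the function name `extract_windows`, the `-` padding character, and the choice to pad the sequence ends at all are assumptions, since the poster does not say how residues near the termini are handled. The window length is a parameter; 15 matches the example rows, while the text describes 21.

```python
# Minimal sketch: turn a labeled protein sequence into (window, class) pairs.
# ASSUMPTIONS (not stated on the poster): ends are padded with "-", and a
# label character "1" maps to class +1 while anything else maps to -1.

SEQUENCE = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
LABELS = "1111110011111110011111001011111100000001111101000000"


def extract_windows(sequence, labels, window=15, pad="-"):
    """Return one (window_string, class_label) pair per residue."""
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centered on a residue")
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    half = window // 2
    # Pad both ends so that every residue has a full-length window.
    padded = pad * half + sequence + pad * half
    rows = []
    for i, label in enumerate(labels):
        # padded[i : i + window] is centered on sequence[i].
        rows.append((padded[i:i + window], 1 if label == "1" else -1))
    return rows


rows = extract_windows(SEQUENCE, LABELS, window=15)
for window_text, cls in rows[15:19]:
    print(f"{window_text},{cls:+d}")
```

The four printed rows correspond to the residues at 0-based positions 15 to 18, which are the windows shown in the transcript.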
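The three Naïve Bayes steps named on the poster (Bayes rule, independence assumption, class assignment) match the standard textbook derivation sketched below. The poster's own equations did not survive in this transcript, so the class symbol c and the class set C are assumed notation, not necessarily the authors'.

```latex
\begin{align}
  % Bayes rule for a test point x_{test} = \{f_1, \dots, f_n\}
  P(c \mid x_{test}) &= \frac{P(x_{test} \mid c)\, P(c)}{P(x_{test})} \\
  % Independence assumption: features are conditionally independent given the class
  P(x_{test} \mid c) &= \prod_{i=1}^{n} P(f_i \mid c) \\
  % Decision rule: P(x_{test}) does not depend on c, so it can be dropped
  \hat{c} &= \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
\end{align}
```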
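The threshold-based measures reported under Results (accuracy, MCC, sensitivity, specificity) can be computed from per-residue predictions as in the sketch below. The poster does not define its measures, so this uses the common textbook definitions; in particular, specificity here is TN / (TN + FP), which may differ from the definition the authors used. The function names are hypothetical, and the toy labels in the usage lines are made up for illustration, not taken from the poster's data.

```python
import math

# Minimal sketch of textbook performance measures for +1 / -1 site labels.
# ASSUMPTION: these definitions may not match the ones used on the poster.


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels in {+1, -1}."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth not in (1, -1) or pred not in (1, -1):
            raise ValueError("labels must be +1 or -1")
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == -1 and pred == 1:
            fp += 1
        elif truth == -1 and pred == -1:
            tn += 1
        else:  # truth == 1 and pred == -1
            fn += 1
    return tp, fp, tn, fn


def site_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and MCC as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # 1 - false positive rate
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy usage with made-up labels (not data from the poster).
metrics = site_metrics([1, 1, -1, -1], [1, -1, -1, -1])
print(metrics)
```

MCC is worth reporting alongside accuracy for these datasets because negative instances far outnumber positive ones, so a classifier that predicts mostly negatives can still score a high accuracy.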
