A Naive Bayesian Classifier To Assign Protein Sequences to Protein Subfamilies


The development of high-throughput sequencing technologies has generated a huge quantity of genomic data. Today, a single protein family may contain hundreds of proteins, and the family can often be divided into functional subfamilies. Knowing the subfamily of a sequence can give hints about its function, phylogeny or structure. The aim of this work is to develop a naive Bayesian classifier that assigns a new protein sequence to its subfamily. Bayesian classifiers have been used to predict protein-protein interactions, structural conformation and drug resistance, and to annotate proteomes in public databases (1). Here we propose a Bayesian classifier that uses a distance matrix based on percent identities. This new approach requires a strategy for converting distances to coordinates, which involves solving a least-squares minimization problem.

The method is applied to a learning set and a test set and involves the following steps:

Sequence set: an alignment of k protein subfamilies.

Computation of the percent identities between all sequence pairs. s_ij is the percent identity between sequences i and j, where n is the number of identical residues and l is the length of the shorter of the two sequences.

Computation of the percent identities with each subfamily.

Conversion to distances between sequences and subfamilies.

Conversion to subfamily coordinates: the multidimensional scaling method (2) is used to obtain the coordinates y.

Conversion to sequence coordinates: the Newton-Raphson algorithm is used to search for the coordinates x_i of each sequence i. The starting points of Newton-Raphson are the subfamily coordinates; the best solution is then kept.

Compute classifications.
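The percent identity defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: it reads the definitions as s_ij = 100 · n / l, treats `-` as the gap character, and measures sequence length without gaps. The function names (`percent_identity`, `identity_matrix`, `to_distance`) are hypothetical, and the conversion d = 1 - s/100 is a placeholder, since the poster's own conversion formula is not preserved in the transcript.

```python
def percent_identity(seq_i, seq_j, gap="-"):
    """Percent identity between two rows of the same multiple alignment.

    ASSUMPTION: s_ij = 100 * n / l, with n the number of aligned columns
    holding the same residue and l the ungapped length of the shorter sequence.
    """
    if len(seq_i) != len(seq_j):
        raise ValueError("aligned sequences must have the same length")
    # n: columns where both rows carry the same residue (a gap never counts)
    n = sum(1 for a, b in zip(seq_i, seq_j) if a == b and a != gap)
    # l: length of the shorter sequence, counted without gap characters
    len_i = sum(1 for a in seq_i if a != gap)
    len_j = sum(1 for b in seq_j if b != gap)
    l = min(len_i, len_j)
    if l == 0:
        raise ValueError("a sequence contains only gaps")
    return 100.0 * n / l


def identity_matrix(alignment):
    """Symmetric matrix of percent identities between all sequence pairs."""
    return [[percent_identity(a, b) for b in alignment] for a in alignment]


def to_distance(s):
    """ASSUMPTION: d = 1 - s/100 (placeholder, not the poster's formula)."""
    return 1.0 - s / 100.0


if __name__ == "__main__":
    # Toy aligned sequences, invented for illustration only
    toy_alignment = ["MKT-AIL", "MKTGAVL", "M-TGA-L"]
    for row in identity_matrix(toy_alignment):
        print([round(value, 1) for value in row])
```

Because l is the length of the shorter sequence, a short sequence that matches a longer one at every residue it has scores 100% even when the longer sequence carries extra residues. This fits the later remark that percent identity is a global parameter that ignores insertions and deletions.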
Algorithm for Assigning New Sequences to Subfamilies Using a Multiple Alignment

The function minimized by the Newton-Raphson search is: [equation not preserved in the transcript]

The similarities S'_ij are converted to distances D'_ij with: [equation not preserved in the transcript]

Sequence i is assigned to the subfamily j that maximizes: [equation not preserved in the transcript], with f_j the density function of a multivariate normal distribution: [equation not preserved in the transcript]

Test case: the ARP Families

Actin-related proteins (ARPs) are very important for cytoskeleton activities (intracellular locomotion, cellular division) and nuclear functions (chromatin modulation, regulation of transcription and DNA repair). For studies of ARP families, a high-quality Multiple Alignment of Complete Sequences has been built (available at http://bips.u-strasbg.fr/ARPAnno/ARPMACS.html). This alignment is accessible through the ARPAnno web server http://bips.u-strasbg.fr/ARPAnno/ (3), which uses it to classify and annotate newly sequenced actin-like proteins.

ARP alignment representation

References:

1- D. Szafron, P. Lu, R. Greiner, D.S. Wishart, B. Poulin, R. Eisner, Z. Lu, J. Anvik, C. Macdonell, A. Fyshe and D. Meeuwis (2004) Proteome Analyst: custom predictions in a web-based tool for high-throughput proteome annotations. Nucleic Acids Research vol 32, W365-W371.

2- K. V. Mardia, J. T. Kent, J. M. Bibby (1980) Multivariate Analysis (Probability and Mathematical Statistics). Academic Press.

3- J. Muller, Y. Oma, L. Vallard, E. Friederich, O. Poch and B. Winsor (2005) Sequence and Comparative Genomic Analysis of Actin-related Proteins. Molecular Biology of the Cell vol 16, 5736-5748.

4- J.D. Thompson, J.C. Thierry and O. Poch (2003) Rascal: rapid scanning and correction of multiple sequence alignments. Bioinformatics vol 19, 1155-1161.

Conclusion and Perspectives

We have shown that it is possible to predict the subfamily of a sequence using a multiple alignment of subfamilies after the conversion of percent identities to coordinates.
However, percent identity is a global parameter, whereas local parameters (insertions/deletions, specific conserved residues...) are often discriminant between subfamilies. As a consequence, the presented method should involve other descriptors of multiple alignments. In particular, the “blocks” of the Rascal program (4) could be introduced. These “blocks” are locally conserved regions inside a multiple alignment. Another improvement could be a refinement of the optimization method through the introduction of simulated annealing, genetic algorithms, etc.

David Kieffer *§, Nicolas Wicker §, Olivier Poch §
contact: dkieffer@igbmc.u-strasbg.fr
* Genclis, 15 rue du bois de la Champelle, 54500 Vandoeuvre les Nancy
§ IGBMC, Laboratoire de Bioinformatique et Génomique Intégratives, 1 rue Laurent Fries, 67404 Illkirch (France)

For each sequence i, its mean percent identity with each subfamily j is computed using the formula: [equation not preserved in the transcript]

Conversion to distances between subfamilies: the similarity matrix between the k subfamilies is converted to a distance matrix.

Subfamilies: Actin, ARP1, ARP2, ARP3, ARP4, ARP5, ARP6, ARP7, ARP8, ARP9, ARP10.

Our naive Bayesian classifier has been tested on this ARP subfamily alignment. One third of the sequences of each subfamily were randomly selected for the test set and two thirds for the learning set. The results of this test are shown in a histogram [not preserved in the transcript]. More than 98% of all 273 tested sequences are classified correctly (last column of the histogram).

Legend of the ARP alignment representation: the human actin reference sequence is in green. Amino acid insertions are in red and deletions in blue. Discriminating residues are marked with black dots, and “blocks” with red boxes highlighted in yellow. This representation shows the potential importance of “blocks” of local conservation for discriminating subfamilies.
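The assignment rule and the learning/test split described above can be sketched as follows. This is a minimal illustration and not the authors' implementation, because the poster's density and decision formulas are not preserved in the transcript. It assumes that sequence coordinates are already available, that each subfamily density f_j is a normal distribution with a diagonal covariance (the usual "naive" simplification), and that the prior of a subfamily is its share of the learning set. All function names and the toy coordinates are hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Randomly send about one third of each subfamily to the test set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    learn, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        learn.extend(indices[n_test:])
    return sorted(learn), sorted(test)


def fit(coords, labels, min_var=1e-6):
    """Estimate a prior, a mean and per-dimension variances per subfamily.

    ASSUMPTION: diagonal covariance, variances floored at min_var.
    """
    groups = defaultdict(list)
    for point, label in zip(coords, labels):
        groups[label].append(point)
    model = {}
    for label, points in groups.items():
        dim = len(points[0])
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        var = [
            max(sum((p[d] - mean[d]) ** 2 for p in points) / len(points), min_var)
            for d in range(dim)
        ]
        model[label] = (len(points) / len(coords), mean, var)
    return model


def log_score(point, prior, mean, var):
    """log(prior) + log of the diagonal normal density at the point."""
    score = math.log(prior)
    for x, m, v in zip(point, mean, var):
        score += -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
    return score


def classify(model, point):
    """Assign the point to the subfamily with the highest score."""
    return max(model, key=lambda label: log_score(point, *model[label]))


if __name__ == "__main__":
    # Toy 2-D coordinates, invented for illustration only
    coords = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05),
              (1.0, 1.0), (0.9, 1.1), (1.1, 0.95)]
    labels = ["ARP1", "ARP1", "ARP1", "ARP2", "ARP2", "ARP2"]
    model = fit(coords, labels)
    print(classify(model, (0.05, 0.0)), classify(model, (1.0, 1.05)))
```

Working with log scores avoids numerical underflow when densities are very small, and the variance floor keeps the density finite for a subfamily whose learning sequences share the same coordinate.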

